# Introduction

San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. From Sunset to SOMA, and Marina to Excelsior, this project analyzes 12 years of crime reports from across all of San Francisco's neighborhoods to create a model that predicts the category of crime that occurred, given time and location.

# Project Overview

Crime is a social phenomenon as old as societies themselves. A crime-free society is unlikely ever to exist, since it would require everyone in that society to think and act in the same way, but societies always look for ways to minimize and prevent crime. In modern United States history, crime rates increased after World War II, peaking between the 1970s and the early 1990s. Violent crime nearly quadrupled between 1960 and its peak in 1991, and property crime more than doubled over the same period. Since the 1990s, however, crime in the United States has declined steadily. Until recently, crime prevention was studied mainly through behavioral and social methods, but recent developments in data analysis allow a more quantitative approach to the subject. We will explore a dataset of nearly 12 years of crime reports from all of San Francisco's neighborhoods and build a model that predicts the category of crime that occurred, given the time and location.

# Problem Statement

To examine the specific problem, we will apply a full Data Science life cycle composed of the following steps:

1. Data Wrangling, to audit the quality of the data and perform the actions needed to clean the dataset.
2. Data Exploration, to understand the variables and build intuition about the data.
3. Feature Engineering, to create additional variables from the existing ones.
4. Data Normalization and Data Transformation, to prepare the dataset for the learning algorithms (if needed).
5. Training / Testing data creation, to evaluate the performance of our models and fine-tune their hyperparameters.
6. Model selection and evaluation. This is the final goal: a model that predicts the probability of each type of crime based on the location and the date.

# Data Exploration

The dataset is tabular and includes chronological, geographical, and text data. It contains incidents derived from the SFPD Crime Incident Reporting system.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

In [None]:
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
#init_notebook_mode(connected=True)
import cufflinks as cf
import plotly.offline as pyo
cf.go_offline()
pyo.init_notebook_mode()
#print(__version__)

In [None]:
df_train= pd.read_csv('../input/sf-crime/train.csv.zip', parse_dates=['Dates'])
df_test=pd.read_csv('../input/sf-crime/test.csv.zip')

In [None]:
# min()/max() give the date range directly; in pandas 2.x describe() no longer reports 'first'/'last'
print('First date: ', str(df_train.Dates.min()))
print('Last date: ', str(df_train.Dates.max()))

In [None]:
df_train.head()

In [None]:
df_test

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
df_train['Category'].describe()

In [None]:
df_train.dtypes

Check for duplicate rows, drop them, and confirm that none remain.

In [None]:
df_train.duplicated().sum()

In [None]:
df_train.drop_duplicates(keep='first', inplace=True)

In [None]:
df_train.duplicated().sum()
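
To illustrate what the duplicate check and removal do, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are invented for illustration and are not taken from the SFPD data.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact duplicates.
toy = pd.DataFrame({
    "Category": ["ASSAULT", "ASSAULT", "ROBBERY"],
    "PdDistrict": ["MISSION", "MISSION", "PARK"],
})

# duplicated() flags every row that repeats an earlier row.
n_dupes = int(toy.duplicated().sum())

# keep="first" retains the first occurrence and drops the repeats.
deduped = toy.drop_duplicates(keep="first")

print(n_dupes, len(deduped), list(deduped.index))
```

Note that the surviving rows keep their original index labels, so the de-duplicated frame has gaps in its index unless `reset_index(drop=True)` is called afterwards.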

In [None]:
df_train['Category'].value_counts()

In [None]:
df_train['Descript'].value_counts()

In [None]:
df_train['Category'].isnull().sum()

In [None]:
df_train['Category'].nunique()

In [None]:
df_train['Descript'].nunique()

In [None]:
df_train['PdDistrict'].value_counts()

Count plots of the crime categories and police districts. The seaborn versions are commented out, since plotly provides better visualizations.

In [None]:
#plt.figure(figsize=(14,10))
#sns.countplot(x='PdDistrict', data=df_train, palette='viridis')

In [None]:
df_train['PdDistrict'].value_counts().iplot(kind='bar', colors='red')

In [None]:
#plt.figure(figsize=(20,10))
#ax=sns.countplot(x='Category', data=df_train, palette='viridis')
#ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')

In [None]:
df_train['Category'].value_counts().iplot(kind='bar', colors='darkblue' ,title='SAN FRAN CRIME')

In [None]:
df_train['PdDistrict'].value_counts().iplot(kind='bar',colors='Black')

In [None]:
df_train['Category'].value_counts().iplot(kind='box', colors='darkblue' ,title='SAN FRAN CRIME')

In [None]:
df_train['PdDistrict'].value_counts().iplot(kind='box', colors='darkblue' ,title='SAN FRAN CRIME')

Next we explore specific columns, starting with the Dates column in both the train and test sets. First we check its type in each.

In [None]:
type(df_train['Dates'][0])

In [None]:
type(df_test['Dates'][0])

We don't want that: the Dates column in the test set is a string, not a datetime field, so we convert it with the pandas `pd.to_datetime` method.

In [None]:
# Parse the test-set timestamps; infer_datetime_format is deprecated in pandas 2.x and no longer needed
df_test['Dates']=pd.to_datetime(df_test['Dates'])


In [None]:
type(df_test['Dates'][0])
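
As a small sketch of the conversion, the snippet below parses two made-up timestamp strings (hypothetical values in the same `YYYY-MM-DD HH:MM:SS` layout) and compares the element types before and after.

```python
import pandas as pd

# Hypothetical timestamps stored as plain strings.
raw = pd.Series(["2015-05-10 23:59:00", "2015-05-10 23:51:00"])

# pd.to_datetime converts the strings into pandas Timestamp values.
parsed = pd.to_datetime(raw)

print(type(raw[0]).__name__)     # element type before parsing
print(type(parsed[0]).__name__)  # element type after parsing
print(parsed[0].hour)            # datetime attributes are now available
```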

In [None]:
#df_train['Dates']

In [None]:
#df_test['Dates']

We use .apply() to create three new columns, Hour, Year and Month, from the Dates column, so that the model can work with time at a finer grain.

In [None]:
df_train['Hour']=df_train['Dates'].apply(lambda time: time.hour)
# Derive the test-set hour from the test set's own Dates column, not from df_train
df_test['Hour']=df_test['Dates'].apply(lambda time: time.hour)

In [None]:
df_train['Year']=df_train['Dates'].apply(lambda time: time.year)
# Derive the test-set year from the test set's own Dates column, not from df_train
df_test['Year']=df_test['Dates'].apply(lambda time: time.year)

In [None]:
df_train['Month']=df_train['Dates'].apply(lambda time: time.month)
df_test['Month']=df_test['Dates'].apply(lambda time: time.month)
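
The same columns can also be derived without a Python-level lambda. The sketch below uses the pandas `.dt` accessor, a vectorized alternative to `.apply()`, on a made-up two-row frame. The `toy` frame and the `MonthName` column are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical frame with a parsed datetime column.
toy = pd.DataFrame({
    "Dates": pd.to_datetime(["2003-01-06 00:01:00", "2015-05-13 23:53:00"])
})

# The .dt accessor exposes datetime components for the whole column at once.
toy["Hour"] = toy["Dates"].dt.hour
toy["Year"] = toy["Dates"].dt.year
toy["Month"] = toy["Dates"].dt.month
toy["MonthName"] = toy["Dates"].dt.month_name()  # English month names by default

print(toy)
```

Because each frame derives its features from its own `Dates` column, the result has no index-alignment gaps and needs no filling afterwards.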

In [None]:
df_train['Month'].value_counts()

In [None]:
Month_dict= {1:'January',2:'February',3:'March',4:'April',5:'May',6:'June',7:'July',8:'August',9:'September',10:'October',11:'November',12:'December'}
df_train['Month']=df_train['Month'].map(Month_dict)
df_train['Month'].unique()

In [None]:
# value_counts() gives total incidents per month, not an average
df_train['Month'].value_counts().iplot(kind='bar', color='darkblue', title='Crimes per month')

In [None]:
df_train['Hour'].value_counts().iplot(kind='bar', title='Crimes by Hour', color='Darkred')

In [None]:
df_train['DayOfWeek'].value_counts()

In [None]:
df_train['DayOfWeek'].value_counts().iplot(kind='line',color='black', title='Crimes per day')

In [None]:
df_train.head()

In [None]:
plt.figure(figsize=(18,12))
sns.countplot(x='Year',hue='Category',data=df_train)
plt.legend(loc=10,bbox_to_anchor=(1.1, 0.5))

In [None]:
plt.figure(figsize=(18,12))
sns.countplot(x='Year',hue='PdDistrict',data=df_train)
plt.legend(loc=10,bbox_to_anchor=(1.1, 0.5))

In [None]:
# Top 13 crime categories by number of incidents
top_categories=df_train['Category'].value_counts().head(13)
top_categories

In [None]:
monthyear= df_train.groupby(by=['Month', 'Year']).count()['Category'].unstack()

In [None]:
monthyear

In [None]:
# Select January by label: groupby sorts the month names alphabetically, so row 0 is April
monthyear.loc['January'].iplot(kind='line', title='January over the years')

In [None]:
monthyear=monthyear.reindex(["January","February","March","April","May","June","July","August","September","October","November","December"])

In [None]:
monthyear
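
The month-by-year table relies on two steps: `groupby(...).count()` followed by `unstack()` to pivot the years into columns, and `reindex` to restore calendar order. A minimal sketch on made-up data (the `toy` frame below is hypothetical) shows why the reindex is needed.

```python
import pandas as pd

# Hypothetical incidents: three in January, two in April, across two years.
toy = pd.DataFrame({
    "Month": ["January", "April", "January", "April", "January"],
    "Year": [2003, 2003, 2004, 2004, 2004],
    "Category": ["ASSAULT", "ROBBERY", "ASSAULT", "ASSAULT", "ROBBERY"],
})

# Count incidents per (Month, Year), then pivot Year into columns.
counts = toy.groupby(by=["Month", "Year"]).count()["Category"].unstack()
print(list(counts.index))  # month names come back in alphabetical order

# reindex puts the rows back into calendar order.
ordered = counts.reindex(["January", "April"])
print(ordered)
```

Selecting rows by label, for example `ordered.loc["January"]`, avoids depending on whichever order the index happens to be in.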

In [None]:
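# Select the year by label so the plot matches its title (position 11 is the 12th year column, not 2004)
monthyear[2004].iplot(kind='line', title='2004')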

In [None]:
monthyear.sum()

In [None]:
monthyear.sum().iplot(kind='line')

In [None]:
mycorr=monthyear.corr()

In [None]:
plt.figure(figsize=(18,12))
sns.heatmap(mycorr, annot=True)

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(monthyear)

In [None]:
df_train=df_train.drop(['Dates','Descript'], axis=1)

In [None]:
df_train.head()

In [None]:
df_train=df_train.drop(['Resolution'], axis=1)

In [None]:
df_train.head()

In [None]:
df_test=df_test.drop(['Id', 'Dates'], axis=1)

In [None]:
df_test.head()

In [None]:
df_train.head()

In [None]:
#Month_dict= {1:'January',2:'February',3:'March',4:'April',5:'May',6:'June',7:'July',8:'August',9:'September',10:'October',11:'November',12:'December'}
#df_test['Month']=df_test['Month'].map(Month_dict)

In [None]:
df_test.head()

In [None]:
type(df_test['Hour'][0])

In [None]:
type(df_train['Hour'][0])

In [None]:
df_test['Hour']=df_test['Hour'].fillna(18)

In [None]:
df_test['Hour']=df_test['Hour'].astype(int)

In [None]:
df_test['Year']=df_test['Year'].fillna(2014)

In [None]:
df_test['Year']=df_test['Year'].astype(int)

In [None]:
df_test.head()

In [None]:
df_train.head()

In [None]:
Month_num= {'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,'July':7,'August':8,'September':9,'October':10,'November':11,'December':12}
df_train['Month']=df_train['Month'].map(Month_num)
df_train['Month'].unique()

In [None]:
pd_map={'NORTHERN':100,'PARK':200,'INGLESIDE':300,'BAYVIEW':400,'RICHMOND':500,'CENTRAL':600,'TARAVAL':700,'TENDERLOIN':800,'MISSION':900,'SOUTHERN':1000}
df_train['PdDistrict']=df_train['PdDistrict'].map(pd_map)
df_train['PdDistrict'].unique()

In [None]:
df_test['PdDistrict']=df_test['PdDistrict'].map(pd_map)
df_test['PdDistrict'].unique()

In [None]:
day_map={'Sunday':1,'Monday':2,'Tuesday':3,'Wednesday':4,'Thursday':5,'Friday':6,'Saturday':7}
df_train['DayOfWeek']=df_train['DayOfWeek'].map(day_map)
df_test['DayOfWeek']=df_test['DayOfWeek'].map(day_map)

In [None]:
df_test['DayOfWeek'].unique()
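
One thing worth checking after a dictionary-based encoding is whether every value was covered, because `Series.map` returns NaN for any value missing from the dictionary. The sketch below uses the same `day_map` on a made-up series that contains one deliberately invalid entry.

```python
import pandas as pd

day_map = {'Sunday': 1, 'Monday': 2, 'Tuesday': 3, 'Wednesday': 4,
           'Thursday': 5, 'Friday': 6, 'Saturday': 7}

# Hypothetical input; "Funday" is deliberately not a key of day_map.
days = pd.Series(["Monday", "Sunday", "Funday"])

encoded = days.map(day_map)

# Any value without a dictionary entry becomes NaN, so count those explicitly.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```

Running the same `isna().sum()` check on the mapped PdDistrict and DayOfWeek columns is a cheap way to confirm that no category was silently lost before training.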

In [None]:
y=df_train['Category']

In [None]:
df_test=df_test.drop('Address', axis=1)

In [None]:
X=df_train.drop(['Category','Address'],axis=1)

In [None]:
X.head()

In [None]:
df_test.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
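
The split above can be sketched on synthetic data to show what `test_size` and `random_state` control. The `stratify` argument used here is an optional addition, not something the notebook does; it keeps class proportions similar in both parts, which may help when some crime categories are rare. All array names below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, two balanced classes.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# Same test_size and random_state as the notebook, plus optional stratification.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=101, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(np.bincount(y_te))  # class counts in the held-out part
```

Fixing `random_state` makes the split repeatable, so scores from different runs are comparable.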

In [None]:
from sklearn.ensemble import RandomForestClassifier
from time import time

In [None]:
rfc = RandomForestClassifier(n_estimators=15)  # forest of 15 trees
print('Random Forest...')
start = time()
rfc.fit(X_train, y_train)
end = time()
print('Trained model in {:.3f} seconds...'.format(end - start))

In [None]:
# Score on the held-out split; accuracy on the training data overstates performance
rfc.score(X_test, y_test)

In [None]:
# Pass the test features in the same column order the model was trained on
predictions=rfc.predict(df_test[X.columns])

In [None]:
predictions
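
The problem statement asks for the probability of each crime type, whereas `predict` returns only the single most likely category. A minimal sketch of the probabilistic variant follows, assuming synthetic features and three example category labels. Scoring with `log_loss` is an assumption made here for illustration; the notebook does not name a metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

# Hypothetical data: 300 rows, 4 numeric features, 3 category labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = rng.choice(["ASSAULT", "LARCENY/THEFT", "VANDALISM"], size=300)

model = RandomForestClassifier(n_estimators=15, random_state=0)
model.fit(X_toy[:200], y_toy[:200])

# One row per sample, one column per class, in the order of model.classes_.
proba = model.predict_proba(X_toy[200:])

# Multiclass log loss on the held-out rows (lower is better).
loss = log_loss(y_toy[200:], proba, labels=model.classes_)

print(model.classes_)
print(proba.shape, loss)
```

The columns of `proba` follow `model.classes_`, so that attribute is what maps each probability back to a crime category.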