# Crime Prediction

The nature of the dataset, particularly the number of different crimes and the unbalanced nature of the dataset, makes it difficult to predict what crime will predict and when. We can, however, repurpose the Crimes DataFrame by spliting the dataset into two distinct sets. 

## Import the Required Libraries

Import the librarie that are required to execute this Notebook.

In [162]:
# Import Pandas
import pandas as pd

# Import Numpy
import numpy as np

# All the SciKit Learn Libraries Required
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score

from random import randint

## Import the DataFrames

Import the saved dataframes and create the Features DataFrame

In [163]:
# Import the Pickle of the Crimes DataFrame
df_crimes = pd.read_pickle('./capstone_pickles/crimes.pkl')
df_crimes.drop('index', inplace=True, axis=1)

# Import the Pickle of the Top Venues DataFrame
df_topvnues = pd.read_pickle('./capstone_pickles/top_venues.pkl')

# Import the Pickle of the Restaurants DataFrame
df_rest = pd.read_pickle('./capstone_pickles/restaurants.pkl')

# Start by copying the Latitude and Longitude to the new DataFrame
df_features = df_crimes[['latitude', 'longitude']]

# Next and One Hot Encoding of the hour, day and month variables
df_features = df_features.join(pd.get_dummies(df_crimes.hour, prefix='hour'))
df_features = df_features.join(pd.get_dummies(df_crimes.day_name))
df_features = df_features.join(pd.get_dummies(df_crimes.month_name))

# Finally add the ward column, copied from the original Primary Description column
df_features['ward'] = df_crimes[['ward']]
df_features['crimes'] = df_crimes[['primary_description']]

### Fake Crime Data

Next we'll generate the fake crime data. The crimes will be equally divided between a crimes happened `0` and no crime happened `1`. The Random Forest model will be trainined again on the data from October 2017 to Augest 2018 and tested against September 2018 to predict the acccuracy of the model.

A new test dataset will then be created for each location in the Top Venues DataFrame and for each Restaurant associated with each of the top Venues. A random visit Date, in September 2018, and time will be associated with each row and then a prediction will be made whether a crime would be committed at each location and date or not.

In [164]:
# Assign Random 
df_features['random_crimes'] = np.random.randint(0, 2, df_features.shape[0])

In [165]:
df_features.head()

Unnamed: 0,latitude,longitude,hour_0,hour_1,hour_2,hour_3,hour_4,hour_5,hour_6,hour_7,...,July,June,March,May,November,October,September,ward,crimes,random_crimes
0,41.897895,-87.760744,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,37.0,NARCOTICS,1
1,41.798635,-87.604823,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,5.0,CRIMINAL DAMAGE,1
2,41.780946,-87.621995,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,20.0,THEFT,1
3,41.965404,-87.736202,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,39.0,THEFT,0
4,41.850673,-87.735597,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,22.0,ARSON,1


In [166]:
feature_cols = df_features.columns.tolist()

### Create the Test Datasets

In [167]:
# Create the Train Dataset
X_Train = df_features.copy()
X_Train.drop('crimes', axis=1, inplace=True)
X_Train.drop('random_crimes', axis=1, inplace=True)
X_Train.drop('ward', axis=1, inplace=True)

# Normalise df_features
X_Train = preprocessing.StandardScaler().fit(X_Train).transform(X_Train)

y_Train = df_features.random_crimes.values

### Recreate the Random Forest Model

In [168]:
Forest_model_final = RandomForestClassifier(n_estimators = 22, max_features = 'sqrt').fit(X_Train, y_Train)

### Build the DataFrame of Potential Venue visits

In [169]:
# Create temporary dataframes of just latitude and longitude and name
#    from the Top Venues and Restaurant DataFrames
df_top = df_topvnues[['name', 'latitude', 'longitude']]
df_res = df_rest[['name', 'latitude', 'longitude']]

#Join the two dataframes
df_final = pd.concat([df_top, df_res])

# Drop duplicate entries
df_final.drop_duplicates(keep=False, inplace=True)
df_final.shape

(222, 3)

In [170]:
# Add a randon Date / Time to visit each locatio

# Year will always be 2018
year = 2018

# Empty list to hold the dates
dates = []

# Generate a random date for each entry in the dataframe
for i in range(0, df_final.shape[0]):
    month = randint(1, 12)
    day = randint(1, 28)
    hour = randint(0, 23)
    minute = randint(0, 59)
    date = '{:02d}-{:02d}-{:02d} {:02d}:{:02d}:00'.format(month,
                                      day,
                                      year,
                                      hour,
                                      minute)
    dates.append(date)

In [171]:
# We now have a date for each
se = pd.Series(dates)

# Then add the values to the DataFrame:
df_final['date'] = se.values

# Convert the date to a proper DateTime Object
df_final['date'] =  pd.to_datetime(df_final['date'], format='%m-%d-%Y %H:%M:%S')

In [172]:
# Add new columns to the dataframe to allow hourly, daily & monthly analysis
df_final['hour'] = df_final['date'].dt.hour
df_final['day_name'] = df_final['date'].dt.day_name()
df_final['day'] = df_final['date'].dt.dayofweek
df_final['month_name'] = df_final['date'].dt.month_name()
df_final['month'] = df_final['date'].dt.month
df_final['year'] = df_final['date'].dt.year
df_final['year_month'] = df_final['date'].dt.to_period('M')

In [173]:
df_final.reset_index(inplace=True)
df_final.drop('index', inplace=True, axis=1)
df_final.head()

Unnamed: 0,name,latitude,longitude,date,hour,day_name,day,month_name,month,year,year_month
0,Millennium Park,41.882699,-87.623644,2018-03-21 07:27:00,7,Wednesday,2,March,3,2018,2018-03
1,Chicago Lakefront Trail,41.967053,-87.646909,2018-08-11 04:53:00,4,Saturday,5,August,8,2018,2018-08
2,The Art Institute of Chicago,41.879665,-87.62363,2018-07-14 06:25:00,6,Saturday,5,July,7,2018,2018-07
3,The Chicago Theatre,41.885578,-87.627286,2018-04-04 11:48:00,11,Wednesday,2,April,4,2018,2018-04
4,Symphony Center (Chicago Symphony Orchestra),41.879275,-87.62468,2018-09-03 23:11:00,23,Monday,0,September,9,2018,2018-09


# Data Preparation for Modelling

Before we start modelling we need to prepare the data frame to include only mumerical data and by removing unneeded columns.

Rather than removing colums a new `df_features` DataFrame will be created with just the required columns. This `df_features` DataFrame will then be processed to remove Categorical Data Types and replace them with One Hot encoding. Finally the Dependant Variables will be Normalised and Principal Component Analysis will be used to reduce the dimensionality of the DataFrame.

In [174]:
# Start by copying the Latitude and Longitude to the new DataFrame
df_features_final = df_final[['latitude', 'longitude']]

# Next and One Hot Encoding of the hour, day and month variables
df_features_final = df_features_final.join(pd.get_dummies(df_final.hour, prefix='hour'))
df_features_final = df_features_final.join(pd.get_dummies(df_final.day_name))
df_features_final = df_features_final.join(pd.get_dummies(df_final.month_name))

In [175]:
df_features_final.shape

(222, 45)

In [176]:
yhat = Forest_model_final.predict(df_features_final)

In [178]:
yhat

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0])

In [179]:
df_final['prediction'] = yhat.tolist()

In [180]:
df_final

Unnamed: 0,name,latitude,longitude,date,hour,day_name,day,month_name,month,year,year_month,prediction
0,Millennium Park,41.882699,-87.623644,2018-03-21 07:27:00,7,Wednesday,2,March,3,2018,2018-03,0
1,Chicago Lakefront Trail,41.967053,-87.646909,2018-08-11 04:53:00,4,Saturday,5,August,8,2018,2018-08,0
2,The Art Institute of Chicago,41.879665,-87.623630,2018-07-14 06:25:00,6,Saturday,5,July,7,2018,2018-07,0
3,The Chicago Theatre,41.885578,-87.627286,2018-04-04 11:48:00,11,Wednesday,2,April,4,2018,2018-04,0
4,Symphony Center (Chicago Symphony Orchestra),41.879275,-87.624680,2018-09-03 23:11:00,23,Monday,0,September,9,2018,2018-09,0
5,Grant Park,41.873407,-87.620747,2018-05-06 17:56:00,17,Sunday,6,May,5,2018,2018-05,0
6,Chicago Riverwalk,41.887280,-87.627217,2018-01-04 04:03:00,4,Thursday,3,January,1,2018,2018-01,0
7,Garfield Park Conservatory,41.886259,-87.717177,2018-11-12 02:20:00,2,Monday,0,November,11,2018,2018-11,0
8,Music Box Theatre,41.949798,-87.663938,2018-12-22 15:27:00,15,Saturday,5,December,12,2018,2018-12,0
9,Nature Boardwalk,41.918102,-87.633283,2018-04-02 05:27:00,5,Monday,0,April,4,2018,2018-04,0
