## Problem Description:

Crime has a major impact on the social, economic, physical, and moral aspects of life. It has been present since the beginning of recorded history, in different forms and volumes across different parts of society. Many organizations work to address it, but their processes still rely largely on descriptive analytics. Our approach aims to make this process faster and more accurate by combining descriptive and predictive analytics. To that end, we use machine learning to explore how it can support the processes that social and government organizations currently use, by predicting crimes across different times and location types.

## Data to be used:

Data is the key to any analytics use case, and it is even more valuable for prediction because it drives the whole life cycle and the quality of the predictions. To predict crimes, we need historical data: records of past occurrences and their related attributes.
We used Kaggle as the source of data.

- This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to 2017, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at 312.745.6071 or RDAnalysis@chicagopolice.org

Further details about the dataset are given here: https://www.kaggle.com/currie32/crimes-in-chicago/home

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')
from IPython.display import display

from watson_machine_learning_client import WatsonMachineLearningAPIClient

In [2]:
# The code was removed by Watson Studio for sharing.

| Unnamed: 0.1 | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 10508693 | HZ250496 | 05/03/2016 11:40:00 PM | 013XX S SAWYER AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | ... | 24.0 | 29.0 | 08B | 1154907.0 | 1893681.0 | 2016 | 05/10/2016 03:56:50 PM | 41.864073 | -87.706819 | (41.864073157, -87.706818608) |


## Methodology:

The aim is to understand crimes across different location types and crime types over time. We followed a basic machine learning pipeline with the following steps:

- Data cleaning and exploration:
    - We used a dataset of crime records (2005-2017). We first cleaned columns that contained garbage values, then explored the columns and their distribution over time using visualizations and tables. The EDA shows that crime in Chicago has for the most part decreased steadily year over year, with consistent upticks during the summer (seasonality). Another time-series trend is that more crimes occur in the latter half of the week than at the beginning. The majority of crimes occur on the South and West sides.
- Data ETL and feature engineering:
    - With the data distributions and some basic insights in hand, we loaded the data into our notebook environment (Watson Studio) and set up cleanup steps for the features. Based on the exploration phase, we then extracted the features that contribute most to the target class (location type). We also encoded our target class (Location Description Number) into only 4 classes (APARTMENT/RESIDENCE, OPEN SPACES, CLOSED SPACES, OTHER) to give a clearer perspective. Throughout these steps, we analyzed the data in relation to the date and time of each crime. After feature engineering, the feature data is saved to cloud object storage as part of the ETL process.
- Model training:
    - We used a neural network built with the Keras framework on the features selected in the previous step. The model is a Keras sequential model with three layers using the ReLU activation function and an output layer using softmax activation.
- Model evaluation:
    - Because the task is multiclass classification, we used the standard categorical_crossentropy loss during training and confusion-matrix metrics (accuracy, precision, recall, f1-score) to validate model performance.
- Model deployment:
    - We used IBM Watson Machine Learning to generate the model, save it in cloud object storage, and deploy it in the cloud as a web service, which produces an API endpoint that any application can access. Basic steps to follow:
    - Set up Watson Machine Learning as a service
    - Get the associated credentials
    - Generate the model using the wml_client API
    - Get the scoring URL and use it for predictions
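
The model described above (three ReLU layers followed by a softmax output, trained with categorical cross-entropy) can be illustrated with a small NumPy forward pass. This is only a sketch under assumptions: the hidden-layer width of 16, the random untrained weights, and the helper names `forward` and `categorical_crossentropy` are illustrative and are not taken from the trained Keras model. The 6 inputs and 4 outputs match the feature vector and the location classes used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 6   # hour, day of week, crime type code, community area, business hour, business day
N_CLASSES = 4    # OTHER, APARTMENT/RESIDENCE, OPEN SPACES, CLOSED SPACES
HIDDEN = 16      # assumed width; the source does not state the layer sizes

# Random (untrained) weights for three hidden layers and one output layer
sizes = [N_FEATURES, HIDDEN, HIDDEN, HIDDEN, N_CLASSES]
weights = [rng.normal(0.0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    """Return class probabilities for a batch of feature rows."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)
    return softmax(h @ weights[-1] + biases[-1])

def categorical_crossentropy(y_true_onehot, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the true class."""
    return float(-np.mean(np.sum(y_true_onehot * np.log(y_prob + eps), axis=1)))

x = np.array([[14.0, 0.0, 0.0, 6.0, 1.0, 1.0]])  # the sample record scored in this notebook
probs = forward(x)
y_true = np.eye(N_CLASSES)[[2]]                   # one-hot vector for class 2 (open/public space)
loss = categorical_crossentropy(y_true, probs)
print(probs.shape, probs.sum(), loss)
```

In Keras the same structure would typically be written as a `Sequential` model of `Dense` layers compiled with `loss='categorical_crossentropy'`; the NumPy version only makes the shapes and the loss explicit.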

In [3]:
cdata17.Date = pd.to_datetime(cdata17.Date, format='%m/%d/%Y %I:%M:%S %p')

In [4]:
features = pd.DataFrame(columns = ["Date", "Primary Type", "Location Description", "Community Area", "Beat", "District", "Ward"])
feat_value = cdata17[["Date", "Primary Type", "Location Description", "Community Area", "Beat", "District", "Ward"]]

In [5]:
feat_value.index = pd.DatetimeIndex(feat_value.Date)

In [54]:
## random test data

In [55]:
crimes_2017 = feat_value.loc['2017']
test_record= crimes_2017.sample(n=1)

In [56]:
test_record

| Date (index) | Date | Primary Type | Location Description | Community Area | Beat | District | Ward |
|---|---|---|---|---|---|---|---|
| 2017-01-09 14:00:00 | 2017-01-09 14:00:00 | THEFT | PARKING LOT/GARAGE(NON.RESID.) | 6.0 | 1924 | 19.0 | 44.0 |


In [57]:
primary_type = preprocessing.LabelEncoder()

In [58]:
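# Caution: an encoder fitted on a single record always returns 0; for scoring, reuse the encoder fitted on the training data.
test_record.loc[:, 'Primary Type in number'] = primary_type.fit_transform(test_record['Primary Type'])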
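
A label encoder assigns integer codes from the sorted set of classes it was fitted on, so an encoder fitted on a single record maps any crime type to 0. The sketch below contrasts that with an encoder fitted once on a fuller list of crime types and then reused for a new record. The crime types in `training_types` are a made-up sample, not the categories seen by the trained model.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed stand-in for the 'Primary Type' column of the training data
training_types = ['THEFT', 'BATTERY', 'ASSAULT', 'THEFT']

# Fit once on the training categories; classes_ are stored in sorted order
encoder = LabelEncoder().fit(training_types)
class_codes = {name: int(code) for code, name in enumerate(encoder.classes_)}

# Reuse the fitted encoder for a new record
reused_code = int(encoder.transform(['THEFT'])[0])

# Refitting on the single record discards the training mapping
refit_code = int(LabelEncoder().fit_transform(['THEFT'])[0])

print(class_codes)               # mapping learned from the training sample
print(reused_code, refit_code)   # code from the reused encoder vs. the refitted one
```

Persisting the fitted encoder alongside the model is one way to keep the codes used at scoring time identical to those used in training.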

In [59]:
feat_onehotcode = test_record['Primary Type'].astype(str).str.get_dummies()

In [60]:
# Time-of-day features
test_record['hour'] = test_record.index.hour
test_record['min'] = test_record.index.minute
# Business hours: 9:00 through 17:59 (between() is inclusive on both ends)
test_record['Business Hour'] = np.where(test_record['hour'].between(9, 17), 1, 0)
# Day-of-week features (Monday=0 ... Sunday=6); Monday to Friday are business days
test_record['Day of Week'] = test_record.index.dayofweek
test_record['Business Day'] = np.where(test_record['Day of Week'].between(0, 4), 1, 0)
# Day of the month
test_record['Day of Month'] = test_record.index.day

In [61]:
#set labels for crime spaces
location = test_record['Location Description']

In [62]:
# Each pattern is a regex alternation of keywords. np.select applies the first matching
# rule, so earlier rules take precedence; missing descriptions fall through to 0 (OTHER).
residential = location.str.contains('RESIDEN|APARTMENT', na=False)
open_space = location.str.contains(
    'STREET|ALLEY|SIDEWALK|LOT|PARK|STATION|PUBLIC|PLATFORM', na=False)
closed_space = location.str.contains(
    'STORE|RESTAURANT|SCHOOL|BUILDING|BAR|OFFICE|BUS|BANK|HOTEL|TRAIN|VEHICLE', na=False)
test_record.loc[:, 'Location Description Number'] = np.select(
    [residential, open_space, closed_space], [1, 2, 3], default=0)
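
Because the rules are checked in order, a description that matches keywords from more than one group gets the class of the first group that matches. The plain-Python sketch below restates the same keyword rules for single strings so that the precedence can be checked on a few descriptions. The helper name `location_class` and the sample descriptions other than the one from this notebook are illustrative assumptions.

```python
import re

# Keyword groups in the order they are checked; the first match wins
RULES = [
    (1, re.compile(r'RESIDEN|APARTMENT')),
    (2, re.compile(r'STREET|ALLEY|SIDEWALK|LOT|PARK|STATION|PUBLIC|PLATFORM')),
    (3, re.compile(r'STORE|RESTAURANT|SCHOOL|BUILDING|BAR|OFFICE|BUS|BANK|HOTEL|TRAIN|VEHICLE')),
]

def location_class(description):
    """Return 1, 2 or 3 for the first matching keyword group, else 0 (OTHER)."""
    for label, pattern in RULES:
        if pattern.search(description):
            return label
    return 0

samples = [
    'PARKING LOT/GARAGE(NON.RESID.)',  # 'RESID.' does not contain 'RESIDEN'; 'LOT' matches
    'APARTMENT',
    'RESTAURANT',
    'SCHOOL, PUBLIC, BUILDING',        # 'PUBLIC' is checked before 'SCHOOL'
    'AIRCRAFT',                        # no keyword matches
]
for s in samples:
    print(s, '->', location_class(s))
```

Substring matching is broad by design here, so it is worth checking the resulting class counts against the raw descriptions when the keyword lists change.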

In [63]:
test_record

| Date (index) | Date | Primary Type | Location Description | Community Area | Beat | District | Ward | Primary Type in number | hour | min | Business Hour | Day of Week | Business Day | Day of Month | Location Description Number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-01-09 14:00:00 | 2017-01-09 14:00:00 | THEFT | PARKING LOT/GARAGE(NON.RESID.) | 6.0 | 1924 | 19.0 | 44.0 | 0 | 14 | 0 | 1 | 0 | 1 | 9 | 2 |


In [64]:
selected_features = test_record[["Date", "Primary Type in number",
                                "Community Area", "hour", "Day of Week", "Business Hour", "Business Day",
                                "Day of Month", "Beat", "District", "Ward"]]
selected_features = pd.concat([feat_onehotcode, selected_features], axis=1)
selected_features




| Date (index) | THEFT | Date | Primary Type in number | Community Area | hour | Day of Week | Business Hour | Business Day | Day of Month | Beat | District | Ward |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-01-09 14:00:00 | 1 | 2017-01-09 14:00:00 | 0 | 6.0 | 14 | 0 | 1 | 1 | 9 | 1924 | 19.0 | 44.0 |


In [65]:
features_test = selected_features[["hour", "Day of Week", "Primary Type in number", "Community Area", "Business Hour",
                             "Business Day"]]
features_test

| Date (index) | hour | Day of Week | Primary Type in number | Community Area | Business Hour | Business Day |
|---|---|---|---|---|---|---|
| 2017-01-09 14:00:00 | 14 | 0 | 0 | 6.0 | 1 | 1 |


## Scoring

In [66]:
# The code was removed by Watson Studio for sharing.

In [67]:
scoring_url='https://us-south.ml.cloud.ibm.com/v3/wml_instances/5bf45d81-7c64-4339-84e9-afb366ef2db8/deployments/24d56b17-ddfd-43da-8445-52c27b33b714/online'

In [68]:
import json
# Round-trip through JSON to turn the feature matrix into plain Python lists
a = json.dumps(np.array(features_test.values).tolist())
b = json.loads(a)

In [70]:
#0: OTHER
#1: APARTMENT, RESIDENCE
#2: OPEN/PUBLIC SPACES (STREET, ALLEY, SIDEWALK, LOT, PARK, STATION, PUBLIC, PLATFORM)
#3: CLOSED/CORPORATE PLACES (STORE, RESTAURANT, SCHOOL, BUILDING, BAR, OFFICE, BUS, BANK, HOTEL, TRAIN, VEHICLE)


In [71]:
import json
scoring_data = {'values':b}
print (scoring_data)
predictions = wml_client.deployments.score(scoring_url, scoring_data)
print("Scoring result: " + str(predictions))

{'values': [[14.0, 0.0, 0.0, 6.0, 1.0, 1.0]]}
Scoring result: {'fields': ['prediction', 'prediction_classes', 'probability'], 'values': [[[1.4448665751842782e-05, 0.28973039984703064, 0.4399290680885315, 0.2703261077404022], 2, [1.4448665751842782e-05, 0.28973039984703064, 0.4399290680885315, 0.2703261077404022]]]}
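
The scoring response lists, for each input row, the class probabilities, the predicted class index, and the probabilities again. A short sketch of turning that response into a readable label follows. The names in `CLASS_LABELS` come from the class comments in this notebook, while the helper name `describe_prediction` is an assumption made for illustration.

```python
# Response copied from the scoring output above
predictions = {
    'fields': ['prediction', 'prediction_classes', 'probability'],
    'values': [[
        [1.4448665751842782e-05, 0.28973039984703064, 0.4399290680885315, 0.2703261077404022],
        2,
        [1.4448665751842782e-05, 0.28973039984703064, 0.4399290680885315, 0.2703261077404022],
    ]],
}

CLASS_LABELS = {
    0: 'OTHER',
    1: 'APARTMENT/RESIDENCE',
    2: 'OPEN/PUBLIC SPACES',
    3: 'CLOSED/CORPORATE PLACES',
}

def describe_prediction(response, row=0):
    """Return (label, probability) for one scored row."""
    record = dict(zip(response['fields'], response['values'][row]))
    class_index = record['prediction_classes']
    return CLASS_LABELS[class_index], record['probability'][class_index]

label, probability = describe_prediction(predictions)
print(f'{label} ({probability:.1%})')
```

For the sample record, the predicted class 2 agrees with the Location Description Number computed from its actual location (PARKING LOT/GARAGE).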


The project has considerable scope for further exploration, for example crime rates or specific crime types at specific locations.