The objectives of this notebook are to:  
- Analyze the most severe crimes in Dallas.  
- Find the locations where such crimes happen more frequently than in other neighborhoods. 
- Examine the relationship between crimes and month, year, and location type. 
- Visualize the distribution of crime using maps.  
- Create models for classifying crime data.  
- Evaluate the accuracy of the models developed.  
- Analyze and discuss the results. 

<h1> Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#load_dataset">Load the Dallas Crime data</a></li>
        <li><a href="#analyze">Analyze the Dallas Crime data</a></li>
        <li><a href="#map">Visualize Crime on Map</a></li>
        <li><a href="#modeling1">Modeling XGBoost</a></li>
        <li><a href="#modeling2">Modeling (SVM with Scikit-learn)</a></li>
        <li><a href="#evaluation">Evaluation</a></li>
    </ol>
</div>
<br>
<hr>

In [None]:
#Import libraries used in the notebook

import pandas as pd
import numpy as np
import re
import folium
from IPython.display import Image
from IPython.display import FileLink, FileLinks

#from geopy.geocoders import Nominatim

import seaborn as sns


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt 
import holoviews as hv
import hvplot.pandas
hv.extension ('bokeh' , 'matplotlib')

#modules for modelling & accuracy 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report, confusion_matrix
import itertools


from xgboost import XGBClassifier
#!conda install -c conda-forge/label/cf201901 xgboost -y #if xgboost error is found uncomment this line 

%matplotlib inline

<h2 id="load_dataset">Load the Dallas Crime data</h2>  
This section describes the data source and its properties. Dallas crime data is publicly available at the following link:

https://www.dallasopendata.com/Public-Safety/Police-Incident-location-and-pertinent-information/v3r6-776m

Please download the CSV file and place it in the folder from which this notebook is run. The file name should be **"Police_Incidents.csv"**.

In [None]:
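#To download, uncomment the line below (the URL is quoted because it contains '?'; -O sets the output file name)
#!wget -O Police_Incidents.csv "https://www.dallasopendata.com/api/views/qv6i-rri7/rows.csv?accessType=DOWNLOAD"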

In [None]:
#df_read_input=pd.read_csv("Police_Incidents_References.csv") #This file is being used for reference only 

df_read_input=pd.read_csv("Police_Incidents.csv")



The CSV file includes a header row, so the column names are read automatically. Let's look at the number of rows and columns.

In [None]:
#Number of rows 

df_read_input.shape[0]

In [None]:
#Number of Columns

df_read_input.shape[1]

A quick look at the dataframe.

In [None]:
df_read_input.head()

As the rows above show, the crime data is not yet in a suitable condition for analysis. It contains many null values, some columns are not needed for the current scope of study, and several data types are not compatible with statistical analysis. 

The source data has more than fifty thousand rows and around one hundred columns. Let us explore the data type of each column:

In [None]:
pd.set_option('display.max_rows', 120)
df_read_input.dtypes

The data types are mostly object, except for a few floats such as year of occurrence and age. We will change the data types of the required fields once we have selected them.  
Now let us check the number of nulls in each column.

In [None]:
df_read_input.isna().sum()

As this is a location-based analysis, we will remove the rows in which the location is not provided.

In [None]:
#drop rows with null values in the location or incident type columns

df_read_input.dropna(axis=0,  inplace=True,subset=['Location1','Type of Incident'])
#df_read_input =df_read_input.iloc[:10000,:].copy()

In [None]:
#Check how many rows remain after dropping the null rows

df_read_input.shape

Next we will create buckets for the types of incident reported. This makes it easier to identify each crime's type together with its rank and severity.

In [None]:
#Few example type of incidents are listed below 

df_read_input['Type of Incident'].head()

We create a function that returns the bucket to which each incident belongs. We also use this opportunity to assign a rank and a severity to each incident type. 

In [None]:
#function for incident type bucket 
def func_incident_type(data):
    type1,  rank,severity = func_incident_values(data)
    return type1

#function for rank of crime
def func_rank(data):
    type1,  rank,severity = func_incident_values(data)
    return rank

#function for severity of crime 
def func_severity(data):
    type1,  rank,severity = func_incident_values(data)
    return severity


In [None]:
#Declare the incident buckets, its rank and its severity 

Incident_bucket1 = 'ACCIDENT / RANDOM / ? / OTHER / NO OFFENSE' # Incident bucket
r1=4 #rank of crime 
s1=19 #severity of crime 
Incident_bucket2 = 'THEFT RELATED'
r2=1
s2=4
Incident_bucket3 = 'FRAUD / ORGANIZED CRIME / FALSE INFO'
r3=3
s3=14
Incident_bucket4 = 'TRAFFIC RELATED'
r4=2
s4=9
Incident_bucket5 = 'CRIMINAL MISHIEF / DAMAGE OF PROPERTY ARSON'
r5=3
s5=12
Incident_bucket6 = 'HARASSMENT / KIDNAPPING HOSTAGE'
r6=3
s6=11
Incident_bucket7 = 'ASSAULT RELATED'
r7=2
s7=3
Incident_bucket8 = 'EVADING ARREST RELATED'
r8=4
s8=17
Incident_bucket9 = 'THREAT / SECURITY BREACH / ALARM INCIDENT'
r9=2
s9=8
Incident_bucket10 = 'DEADLY CONDUCT'
r10=1
s10=2
Incident_bucket11 = 'MURDER RELATED'
r11=1
s11=1
Incident_bucket12 = 'GUN / WEAPON RELATED'
r12=1
s12=5
Incident_bucket13 = 'ANIMAL / LITTERING RELATED'
r13=4
s13=16
Incident_bucket14 = 'DRUG RELATED'
r14=2
s14=6
Incident_bucket15 = 'MANSLAUGHTER'
r15=2
s15=7
Incident_bucket16 = 'INTOXICATION / ALCOHOL RELATED'
r16=3
s16=13
Incident_bucket17 = 'CUSTODY / COURT / BOND / INTERFERENCE / WARRANT RELATED'
r17=4
s17=18


#Main function which processes data to return Incident bucket, Rank and Severity in Order

def func_incident_values(data):   
    
    #As per the reference provided, match known crime keywords and assign a bucket to each
    if ('THEFT' in data) or ('BMV' in data) or ('BURGLARY' in data) or ('ROBBERY' in data):
        return Incident_bucket2,r2,s2
    elif ('FRAUD' in data) or ('ORGANIZED CRIME' in data)or ('FALSE INFO' in data) :
        return Incident_bucket3,r3,s3 
    elif ('TRAFFIC' in data):
        return Incident_bucket4,r4,s4
    elif ('CRIMINAL MISHIEF' in data) or ('CRIMINAL MISCHIEF' in data) or ('DAMAGE OF PROPERTY ARSON' in data) or ('CRIM MISCHIEF' in data ):
        return Incident_bucket5,r5,s5
    elif ('HARASSMENT ' in data) or ('KIDNAPPING' in data)or ('HOSTAGE' in data) or ('STALKING' in data):
        return Incident_bucket6,r6,s6
    elif ('ASSAULT' in data) :
        return Incident_bucket7,r7,s7
    elif ('EVADING ARREST' in data) :
        return Incident_bucket8,r8,s8
    elif ('THREAT' in data) or ('SECURITY BREACH' in data)or ('ALARM INCIDENT' in data):
        return Incident_bucket9,r9,s9
    elif ('DEADLY CONDUCT' in data) :
        return Incident_bucket10,r10,s10
    elif ('MURDER' in data) :
        return Incident_bucket11,r11,s11   
    elif ('GUN' in data) or ('WEAPON' in data):
        return Incident_bucket12,r12,s12    
    elif ('ANIMAL' in data) or ('LITTERING' in data):
        return Incident_bucket13,r13,s13 
    elif ('DRUG' in data) :
        return Incident_bucket14,r14,s14
    elif ('MANSLAUGHTER' in data):
        return Incident_bucket15,r15,s15
    elif ('INTOXICATION' in data) or ('ALCOHOL' in data):
        return Incident_bucket16,r16,s16
    elif ('CUSTODY' in data) or ('COURT' in data)or ('BOND' in data) or ('INTERFERENCE' in data) or ('WARRANT' in data):
        return Incident_bucket17,r17,s17
    else :
        #rest of the crimes are listed in this category 
        return Incident_bucket1,r1,s1
            

Now that the functions are defined, let us call each one for every row to store the desired results in dataframe columns. 

In [None]:
#Create new column for Type_of_Incident_Bucket
df_read_input['Type_of_Incident_Bucket'] = df_read_input['Type of Incident'].apply(lambda x:func_incident_type(x))

#Use our created function to rank each crime row 
df_read_input['rank']= df_read_input['Type of Incident'].apply(lambda x:func_rank(x))

#Assign severity to each crime using function func_severity
df_read_input['severity']=df_read_input['Type of Incident'].apply(lambda x:func_severity(x))
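
The three `apply` calls above each re-run the classification on every row. As a minimal sketch of a single-pass alternative, the example below classifies each row once and expands the resulting tuples into three columns. The small `classify` function and the sample incident strings are hypothetical stand-ins for `func_incident_values` and the real column, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for func_incident_values: returns (bucket, rank, severity).
def classify(text):
    if 'THEFT' in text or 'BURGLARY' in text:
        return 'THEFT RELATED', 1, 4
    if 'ASSAULT' in text:
        return 'ASSAULT RELATED', 2, 3
    if 'MURDER' in text:
        return 'MURDER RELATED', 1, 1
    return 'OTHER', 4, 19

# Made-up sample rows, for illustration only.
df = pd.DataFrame({'Type of Incident': [
    'THEFT OF PROP', 'ASSAULT -BODILY INJURY', 'MURDER', 'LOST PROPERTY']})

# Classify each row exactly once, then expand the tuples into named columns.
values = df['Type of Incident'].map(classify)
expanded = pd.DataFrame(values.tolist(), index=df.index,
                        columns=['Type_of_Incident_Bucket', 'rank', 'severity'])
df = df.join(expanded)
print(df)
```

On a dataframe with tens of thousands of rows this does the string matching once per row instead of three times.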

Let us analyze the average severity and the number of crimes using the dataframe.

In [None]:
df_read_input['severity'].describe()

This summary shows the mean severity of crime in Dallas. Please note that the median crime severity is very different from its mean.  

As this study focuses on severe crimes, let us keep only the rows for the most severe crimes. This also reduces the large number of imported rows to a more manageable size. 

In [None]:
top_crimes_count = 3

#Keep only rows within the top severity levels (1 is the most severe)
df_read_input= df_read_input[df_read_input['severity']<=top_crimes_count ]
df_read_input.shape

As the shape above shows, the number of rows has been reduced considerably.  

Location type bucket creation: this gives better insight into what kinds of locations are targeted in such crimes. 

In [None]:
#Function to get location types as per reference 

def func_loc_type(data):
    if ( 'Government Facility' in data): return 'Government Facility'
    elif ( 'Residential Area' in data): return 'Residential Area'
    elif ( 'Parking Lot' in data)or (( 'Parking' in data)): return 'Parking Lot'
    elif ( 'Street' in data): return 'Street'
    elif ( 'Business Office' in data): return 'Business Office'
    elif ( 'Retail Store' in data) or ( 'Retail' in data) or ( 'Store' in data): return 'Retail Store'
    elif ( 'Apartment Complex' in data) or ( 'Apartment' in data) or ( 'Complex' in data): return 'Apartment Complex'
    elif ( 'Hotel' in data): return 'Hotel'
    elif ( 'Commercial Property' in data) or ( 'Commercial' in data): return 'Commercial Property'
    elif ( 'ATM / Bank' in data) or ( 'ATM' in data) or ( 'Bank' in data): return 'ATM / Bank'
    elif ( 'Pharmacy' in data): return 'Pharmacy'
    elif ( 'Entertainment/Sports Venue' in data)or ( 'Entertainment' in data) or ( 'Sports' in data): return 'Entertainment/Sports Venue'
    elif ( 'Park' in data): return 'Park'
    elif ( 'Cyberspace' in data): return 'Cyberspace'
    elif ( 'Financial Institution' in data)or ( 'Financial' in data): return 'Financial Institution'
    elif ( 'Restaurant' in data) or ( 'Restraunt' in data): return 'Restaurant'
    elif ( 'Construction' in data) or ( 'Manufacturing' in data): return 'Construction / Manufacturing Site'
    elif ( 'Bar' in data): return 'Bar'
    elif ( 'School' in data): return 'School'
    elif ( 'Agricultural Area' in data)or ( 'Agricultural' in data): return 'Agricultural Area'
    elif ( 'Corrections Facility' in data): return 'Corrections Facility'
    elif ( 'Storage Facility' in data): return 'Storage Facility'
    elif ( 'Hospital' in data): return 'Hospital'
    elif ( 'Airport' in data): return 'Airport'
    elif ( 'Religious Building' in data) or ( 'Religious' in data): return 'Religious Building'
    elif ( 'Gas Station' in data) or ( 'Gas' in data): return 'Gas Station'
    elif ( 'City Property' in data): return 'City Property'
    else  : return 'Other'


Applying the function to create types of location. 

In [None]:
#Type_of_Location 
df_read_input['Type_of_Location']=df_read_input['Type  Location'].apply(lambda x:func_loc_type(str(x)))

Let us extract the coordinates from the location field. 

In [None]:
#Location contains coordinates in brackets so we will split by bracket and then split further based on comma to get latitude and longitude

df_read_input[['Location2','coordinates2']]=df_read_input['Location1'].str.split('(',expand=True)
df_read_input[['latitude','longitude']]=df_read_input['coordinates2'].str.split(',',expand=True)
df_read_input[['longitude','temp']]=df_read_input['longitude'].str.split(')',expand=True)
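
The chained `split` calls above rely on each value containing exactly one opening bracket. As a hedged alternative, the sketch below pulls both numbers out with a single regular expression via `Series.str.extract`. The sample strings are made up, and the assumed layout (an address followed by `(latitude, longitude)`) is inferred from the comment in the cell above rather than verified against the real file.

```python
import pandas as pd

# Made-up sample values; the real 'Location1' layout is an assumption.
location = pd.Series([
    '1400 S LAMAR ST\nDALLAS, TX 75215\n(32.767, -96.795)',
    '500 MAIN ST (SUITE 2)\n(32.780, -96.800)',
    'UNKNOWN ADDRESS',
])

# Capture two signed decimal numbers inside one pair of parentheses.
pattern = r'\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)'
coords = location.str.extract(pattern).astype(float)
coords.columns = ['latitude', 'longitude']
print(coords)
```

Rows without a coordinate pair come back as `NaN` rather than raising, and extra brackets in the address (second sample) do not break the extraction.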

In [None]:
#Check whether the values were extracted successfully

df_read_input[['Location1','latitude','longitude']].head()

Extract the hour from the time of occurrence for analysis. 

In [None]:
df_read_input[['Hour','Minute']]=df_read_input['Time1 of Occurrence'].str.split(':',expand=True)

#Check if extraction worked

df_read_input[['Time1 of Occurrence','Hour']].head()
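
Splitting on `:` leaves the hour as a string. A small sketch of another option, assuming the times follow an `HH:MM` layout (the sample values are made up), is to parse them with `pd.to_datetime` and read the hour as an integer directly:

```python
import pandas as pd

# Made-up times in the assumed "HH:MM" layout.
times = pd.Series(['18:30', '07:05', '00:00'])

# Parse with an explicit format, then read the hour component as an integer.
hours = pd.to_datetime(times, format='%H:%M').dt.hour
print(hours.tolist())  # [18, 7, 0]
```

Passing `errors='coerce'` to `pd.to_datetime` would turn malformed times into missing values instead of raising an error.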


<h2 id="analyze">Analyze the Dallas Crime data</h2> 
In this section we will analyze the factors that may relate to crime severity, using the following columns:

'latitude','longitude','Type_of_Incident_Bucket','rank','severity','Zip Code','Type_of_Location','Month1 of Occurence','Hour','Division','Council District','Offense Status'

In [None]:
df_incidents=df_read_input[['latitude','longitude','Type_of_Incident_Bucket','rank','severity','Zip Code','Type_of_Location','Month1 of Occurence','Hour','Division','Council District','Offense Status']].copy()

#Drop null rows and create a new column count, this column will help us in counting rows in dataframe

df_incidents.dropna(inplace=True)
df_incidents['count']=1
df_incidents.head()

Again we will inspect the data types of the columns above. This is required because the data types affect our analysis. 

In [None]:
df_incidents.dtypes

Next we will convert these columns to float and int as shown below. 

In [None]:
# Use bracket notation for assignment: `df_incidents.rank = ...` would overwrite
# the DataFrame.rank method instead of updating the 'rank' column.
df_incidents['latitude']=df_incidents['latitude'].astype(float)
df_incidents['longitude']=df_incidents['longitude'].astype(float)
df_incidents['rank']=df_incidents['rank'].astype(float)
df_incidents['severity']=df_incidents['severity'].astype(int)
df_incidents['Zip Code']=df_incidents['Zip Code'].astype(int)
df_incidents['Hour']=df_incidents['Hour'].astype(int)

First we will look at the number of crimes at each severity level.

In [None]:

df_incidents[['severity','count']].groupby('severity').count()



This shows that the most severe crime has a much lower count than the second and third most severe crimes. Next we will analyze the hourly data to look for any trend. 

In [None]:
import os

image_path="webapp/static/images/"
os.makedirs(image_path, exist_ok=True)  # savefig fails if the folder does not exist

df_hourly=df_incidents.groupby('Hour').count()

df_hourly['count'].plot(kind='bar', figsize=(12, 6))
plt.ylabel("Count of Crimes")
plt.xlabel("Hour")
plt.title("Time Vs Crime ")

plt.legend()
#plt.show()
plt.savefig(image_path+'df_hourly.png')
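
Grouping by hour omits any hour in which no crime was recorded, so the bar chart can silently skip positions on the x-axis. A minimal sketch of one way to guarantee all 24 hours appear, using made-up sample hours:

```python
import pandas as pd

# Made-up sample of incident hours, for illustration only.
hours = pd.Series([18, 18, 23, 0, 18, 7])

# Count incidents per hour and fill hours without incidents with zero.
hourly_counts = hours.value_counts().reindex(range(24), fill_value=0)

print(hourly_counts[hourly_counts > 0])
print("Busiest hour:", hourly_counts.idxmax())
```

The resulting series can be plotted with `hourly_counts.plot(kind='bar')` in the same way as the grouped dataframe above.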


This graph shows that the maximum number of crimes occurs at hour 18 (6 PM). The hours between 1 and 7 have the fewest crimes. Next, let us visualize the location types where these crimes occur.

In [None]:
df_Type_of_Location=df_incidents.groupby('Type_of_Location').count()

df_Type_of_Location=df_Type_of_Location.sort_values('count')
df_Type_of_Location['count'].plot(kind='barh', figsize=(12, 6))
plt.xlabel("Count of Crimes")
plt.ylabel("Locations")
plt.title("Locations Vs Crime ")

plt.legend()
#plt.show()
plt.savefig(image_path+'df_Type_of_Location.png')


This graph reveals that streets, apartment complexes and parking lots are the location types where severe crimes occur most often. Next we will look at the Dallas divisions where major crimes occur.

In [None]:
df_division=df_incidents.groupby('Division').count()

df_division=df_division.sort_values('count')
df_division['count'].plot(kind='barh', figsize=(10, 5))
plt.xlabel("Count of Crimes")
plt.ylabel("Divisions")
plt.title("Divisions Vs Crime ")

plt.legend()
#plt.show()
plt.savefig(image_path+'df_division.png')

As the graph shows, the Southeast division has the most crimes, while North Central is comparatively safer in terms of severe crimes. Now we will analyze the council-wise crime distribution. 

In [None]:
df_council=df_incidents.groupby('Council District').count()

df_council=df_council.sort_values('count')
df_council['count'].plot(kind='bar', figsize=(12, 6))
plt.ylabel("Count of Crimes")
plt.xlabel("Council District")
plt.title("Council District Vs Crime ")

plt.legend()
#plt.show()
plt.savefig(image_path+'df_council.png')

As per the graph above, Council District D7 has the most severe crimes. Next we will analyze the months in which most crimes happen. 

In [None]:
df_month=df_incidents.groupby('Month1 of Occurence').count()
df_month=df_month.sort_values('count')
df_month['count'].plot(kind='bar', figsize=(12, 6))
plt.ylabel("Count of Crimes")
plt.xlabel("Month of Occurence")
plt.title("Month of Occurence Vs Crime ")

plt.legend()
#plt.show()
plt.savefig(image_path+'df_month.png')

The bars are almost equal in height and the differences are not clear from the graph, so let us look at the monthly crime counts in Dallas.

In [None]:
df_month['count']

This shows that the maximum number of crimes was committed in July, while February has the fewest. 

Next, we analyze the 'Offense Status'.

In [None]:
df_offense=df_incidents.groupby(['Month1 of Occurence', 'Offense Status'])['count'].count()

by_month = df_offense.hvplot.bar('Month1 of Occurence', groupby='Offense Status', width=700, dynamic=False)
#by_month.save('df_offense.png')  
hv.save(by_month,'df_offense.png')
by_month

Here we analyze the crimes by offense status and look at their trend over the months of the year. The number of crimes cleared by arrest is lowest in January and February. The highest number of crimes cleared by exceptional arrest is in July. 
June, August and December have the fewest closed cases, while May has the most open cases.

In [None]:
def nansum(a):
    return np.nan if np.isnan(a).all() else np.nansum(a)

heatmap= df_incidents.hvplot.heatmap('Council District', 'Offense Status', 'count',reduce_function=nansum,
                       flip_yaxis=True, xaxis=True, logz=True,  height= 300, width =900)
hv.save(heatmap,'df_council_heatmap.png')
heatmap

This heatmap shows the offense status by Council District. District D7 has the highest number of suspended cases, followed by D4, D8, D2 and D6. There is no significant variation in the other offense statuses across the districts. 
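
The numbers behind a heatmap like this can also be read as a table. The sketch below builds the same kind of district-by-status summary with `pd.pivot_table`; the rows and the status labels are made up for illustration and do not come from the Dallas data.

```python
import pandas as pd

# Made-up rows; the status labels here are illustrative only.
df = pd.DataFrame({
    'Council District': ['D7', 'D7', 'D4', 'D4', 'D2'],
    'Offense Status': ['Suspended', 'Suspended', 'Suspended', 'Open', 'Open'],
    'count': [1, 1, 1, 1, 1],
})

# One row per status, one column per district, cells hold the summed counts.
table = pd.pivot_table(df, index='Offense Status', columns='Council District',
                       values='count', aggfunc='sum', fill_value=0)
print(table)
```

Reading exact counts from the table is a useful check on colour-based impressions, especially with a logarithmic colour scale.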

<h2 id="map">Visualize Crime on Map</h2> 

In this section we will use folium maps to locate crime in Dallas. This gives a better understanding of which locations are more prone to severe crimes. We will start by defining the coordinates on which the Dallas maps are centered.

In [None]:
# Dallas latitude and longitude values
latitude = 32.77
longitude = -96.70


html_path = "webapp/templates/"

Since the crime data is very large, plotting it all at once would take considerable resources. Therefore we will plot a map for each crime severity, starting with severity level one. 

In [None]:
df_severity_one =df_incidents[df_incidents.severity ==1]
df_severity_one.shape 

In [None]:
from folium import plugins

# create a clean map of Dallas
dallas_map = folium.Map(location = [latitude, longitude], zoom_start = 11)

# instantiate a marker cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(dallas_map)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng, label, in zip(df_severity_one.latitude, df_severity_one.longitude, df_severity_one['Type_of_Incident_Bucket']):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=str(label),
    ).add_to(incidents)
    

# save the map as an HTML file
dallas_map.save(outfile= html_path+'dallas_severity1_crime_map.html')

The map above shows severity level one crime locations in Dallas. Next we will show severity level two crimes on a map. 

In [None]:
Image("severity1.jpg") 

Click on the link below to see the interactive map.

In [None]:

FileLink(html_path+'dallas_severity1_crime_map.html')

In [None]:
df_severity_two =df_incidents[df_incidents.severity ==2]
df_severity_two.shape 

In [None]:
from folium import plugins

# start again with a clean map of Dallas
dallas_map = folium.Map(location = [latitude, longitude], zoom_start = 10)

# instantiate a marker cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(dallas_map)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng, label, in zip(df_severity_two.latitude, df_severity_two.longitude, df_severity_two['Type_of_Incident_Bucket']):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=str(label),
    ).add_to(incidents)
    

# save the map as an HTML file
dallas_map.save(outfile=html_path+'dallas_severity2_crime_map.html')

In [None]:
Image("severity2.jpg") 

Click on the link below to see the interactive map.

In [None]:
FileLink(html_path+'dallas_severity2_crime_map.html')

In [None]:
df_severity_three =df_incidents[df_incidents.severity ==3]
df_severity_three.shape 

In [None]:
# from folium import plugins

# start again with a clean map of Dallas
dallas_map = folium.Map(location = [latitude, longitude], zoom_start = 11)

# instantiate a marker cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(dallas_map)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng, label, in zip(df_severity_three.latitude, df_severity_three.longitude, df_severity_three['Type_of_Incident_Bucket']):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=str(label),
    ).add_to(incidents)
    

# save the map as an HTML file
dallas_map.save(outfile=html_path+'dallas_severity3_crime_map.html')

In [None]:
Image("severity3.jpg")

Click on the link below to see the interactive map.

In [None]:
FileLink(html_path+'dallas_severity3_crime_map.html')

As one can see, folium does not perform well once the number of crimes goes beyond about three thousand, so we will use another method to visualize this data. First, the data will be grouped by Council District.
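
Besides grouping by district, another way to reduce the number of points is to aggregate incidents into a coarse coordinate grid and plot one marker per cell. The sketch below illustrates the idea with made-up coordinates; the cell size of 0.01 degrees is a hypothetical choice, not a value used in this notebook.

```python
import pandas as pd

# Made-up coordinates near Dallas, for illustration only.
points = pd.DataFrame({
    'latitude':  [32.7712, 32.7748, 32.7791, 32.8123],
    'longitude': [-96.7961, -96.8004, -96.7999, -96.7512],
})

cell = 0.01  # hypothetical grid size in degrees

# Integer grid indices avoid grouping on inexact floating-point keys.
grid = points.assign(
    lat_idx=(points['latitude'] / cell).round().astype(int),
    lng_idx=(points['longitude'] / cell).round().astype(int),
)
cells = grid.groupby(['lat_idx', 'lng_idx']).size().reset_index(name='count')

# Convert the indices back to the coordinates of each cell centre.
cells['latitude'] = cells['lat_idx'] * cell
cells['longitude'] = cells['lng_idx'] * cell
print(cells)
```

Each row of `cells` could then be drawn as a single marker whose popup shows the count, which keeps the number of map objects small.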

In [None]:
#Group by Council District

df_zip_grouped= df_incidents[['Council District','count']]
df_zip_grouped=df_zip_grouped.groupby('Council District').sum()

df_zip_grouped.head()

# Choropleth Maps <a id="choropleth"></a>

A `Choropleth` map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per-capita income. The choropleth map provides an easy way to visualize how a measurement varies across a geographic area, or to show the level of variability within a region. Dallas crime is plotted below on a choropleth map. 
To create a `Choropleth` map with folium, we supply the following main parameters:

1. geo_data, which is the GeoJSON file.
2. data, which is the dataframe containing the data.
3. columns, which represents the columns in the dataframe that will be used to create the `Choropleth` map.
4. key_on, which is the key or variable in the GeoJSON file that contains the name of the variable of interest. To determine that, you will need to open the GeoJSON file using any text editor and note the name of the key that contains the names of the council districts, since the districts are our variable of interest. In this case, **dist_name** is the key in the GeoJSON file that contains the district names. 

In [None]:
#Choropleth Dallas Map

#Use below open link to download Dallas GeoJSON file 
#https://www.dallasopendata.com/Geography-Boundaries/Adopted-Council-Districts/6dcw-hhpj
dallas_geo =r'Dallas.geojson' # geojson file

# create a numpy array of length 6 with linear spacing from the minimum total crime to the maximum total crime
threshold_scale = np.linspace(df_zip_grouped['count'].min(),
                              df_zip_grouped['count'].max(),
                              6, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1 # make sure that the last value of the list is greater than the maximum crime


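# create a base map of Dallas with the default tiles
# ('Mapbox Bright' is no longer available in recent folium releases)
dallas_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# folium.Choropleth is used in place of the older Map.choropleth method;
# `bins` takes the threshold list computed above
folium.Choropleth(
    geo_data=dallas_geo,
    data=df_zip_grouped.reset_index(),  # expose 'Council District' as a regular column
    columns=['Council District', 'count'],
    key_on='feature.properties.dist_name',
    bins=threshold_scale,
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Crime Count'
).add_to(dallas_map)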
dallas_map.save(outfile=html_path+'dallas_crime_choropleth.html')
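
To make the threshold computation in the cell above concrete, here is a small worked example with made-up district totals. It follows the same steps: six evenly spaced integer thresholds from the minimum to the maximum, with the last one raised by one so that the maximum falls inside the final bin.

```python
import numpy as np

# Made-up district totals, for illustration only.
counts = np.array([120, 450, 980])

# Six evenly spaced integer thresholds between the minimum and the maximum.
threshold_scale = np.linspace(counts.min(), counts.max(), 6, dtype=int).tolist()

# Raise the last threshold so the maximum count is strictly inside the last bin.
threshold_scale[-1] = threshold_scale[-1] + 1
print(threshold_scale)  # [120, 292, 464, 636, 808, 981]
```

With these thresholds, a district with 450 crimes falls in the second bin (292 to 464).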


This gives a better visualization of severe crimes in Dallas by location. The dark red districts have high crime counts and the lighter districts have low crime counts. 

In [None]:
Image("chloropleth_dallas.jpg")

Click on the link below to see the interactive map.

In [None]:
FileLink(html_path+'dallas_crime_choropleth.html')

<h2 id="modeling1">Modeling XGBoost</h2>

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

To start modeling, feature selection will be performed, followed by normalization to eliminate differences of scale.  

In [None]:
#df_incidents.head()


X=df_incidents[['rank','Zip Code','Type_of_Location','Month1 of Occurence','Hour']]
X=pd.get_dummies(data=X, columns=['Type_of_Location','Month1 of Occurence'])
X=np.asarray(X)
X=StandardScaler().fit(X).transform(X)
X[:2]

Since location and month are stored as strings in our data, we convert them into numerical values using pandas' `get_dummies`. The aim here is to classify each record into the correct severity level, so the output is set to the severity vector. 
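
As a small illustration of what one-hot encoding followed by standardization produces, the sketch below applies the same two steps to a made-up four-row dataframe. The values are invented for the example; only the column names mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'Hour': [1, 18, 23, 12],
    'Type_of_Location': ['Street', 'Hotel', 'Street', 'Park'],
})

# Each distinct location becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=['Type_of_Location'])
print(list(encoded.columns))

# Standardize every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(encoded.astype(float))
print(scaled.round(2))
```

One string column with three distinct values turns into three indicator columns, so the number of features grows with the number of categories.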

In [None]:
Y = np.asarray(df_incidents['severity'])
Y

The data is split into training and testing sets using scikit-learn's `train_test_split` function.

In [None]:
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
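
Because the severity classes are very unequal in size, a purely random split can leave few examples of the rarest class in one of the sets. A hedged sketch of one option, not used in the original notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions similar in both sets. The labels below are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 10 of class 1, 40 of class 2, 50 of class 3.
y = np.array([1] * 10 + [2] * 40 + [3] * 50)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7, stratify=y)

labels, counts = np.unique(y_test, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```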

In [None]:
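# First XGBoost model

# fit model on training data
# (recent XGBoost releases expect class labels starting at 0, so severity 1..3 is shifted to 0..2)
model = XGBClassifier()
model.fit(X_train, y_train - 1)
# make predictions for test data and shift the labels back to 1..3
y_pred = model.predict(X_test) + 1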

y_pred


<h2 id="modeling2">Modeling (SVM with Scikit-learn)</h2>

The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:

    1.Linear
    2.Polynomial
    3.Radial basis function (RBF)
    4.Sigmoid
Each of these functions has its characteristics, its pros and cons, and its equation, but as there's no easy way of knowing which function performs best with any given dataset, we usually choose different functions in turn and compare the results. Let's just use the default, RBF (Radial Basis Function).

In [None]:
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 

After being fitted, the model can then be used to predict new values:

In [None]:
yhat = clf.predict(X_test)
yhat [0:5]
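
The idea of trying each kernel in turn and comparing the results can be sketched as follows. This example uses a synthetic dataset from `make_classification` rather than the crime data, so the scores it prints say nothing about which kernel suits the Dallas data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class data, for illustration only.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7)

# Fit one SVM per kernel and record its accuracy on the held-out set.
scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)

for kernel, score in scores.items():
    print(f"{kernel}: {score:.2f}")
```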

<h2 id="evaluation">Evaluation</h2>

Until now we have cleaned the data and created models trained on it. Having enough training data is an essential part of data modeling. Now we will evaluate whether the models work as expected on unseen data, referred to as testing data.  
First we will **evaluate the XGBoost model's** output.


In [None]:
# evaluate predictions
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score: %.2f%%" % (accuracy * 100.0))

The accuracy score is the fraction of samples that are classified correctly, which for the XGBoost model is high.
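
With imbalanced classes, a high accuracy is easier to interpret next to a simple baseline. The sketch below, using made-up labels, computes accuracy by hand and compares it with the score obtained by always predicting the most frequent class.

```python
import numpy as np

# Made-up true and predicted severity labels, for illustration only.
y_true = np.array([3, 3, 3, 3, 2, 2, 2, 1, 3, 3])
y_pred = np.array([3, 3, 3, 3, 2, 2, 2, 2, 3, 3])

# Accuracy is the fraction of positions where prediction and truth agree.
accuracy = np.mean(y_true == y_pred)

# Baseline: always predict the most frequent class.
labels, counts = np.unique(y_true, return_counts=True)
baseline = counts.max() / len(y_true)

print(f"accuracy = {accuracy:.2f}, majority baseline = {baseline:.2f}")
```

Here the only severity 1 sample is misclassified, yet accuracy is still 0.90, which is why the per-class report that follows is needed.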

In [None]:
print (classification_report(y_test, y_pred))



In pattern recognition and information retrieval with binary classification, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance.
  
The f1-score gives you the harmonic mean of precision and recall. The scores corresponding to every class will tell you the accuracy of the classifier in classifying the data points in that particular class compared to all other classes. 
The support is the number of samples of the true response that lie in that class.
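
To make these definitions concrete for the multi-class case, the following sketch computes per-class precision, recall, f1-score and support from a confusion matrix. The matrix is made up for illustration and is not the output of either model in this notebook.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted
# classes, both in the order [severity 1, severity 2, severity 3].
cm = np.array([[ 2,  8,  0],
               [ 1, 38,  1],
               [ 0,  0, 50]])

tp = np.diag(cm)                 # correctly classified samples per class
precision = tp / cm.sum(axis=0)  # divide by everything predicted as the class
recall = tp / cm.sum(axis=1)     # divide by everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)         # number of true samples per class

for k in range(3):
    print(f"severity {k + 1}: precision={precision[k]:.2f} "
          f"recall={recall[k]:.2f} f1={f1[k]:.2f} support={support[k]}")
```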

Severity 1 crimes are very few, and the model is not able to predict this class precisely, while the scores are much higher for severity 2. Crimes with severity 3 are all correctly predicted by the XGBoost model. Note that `rank` is one of the input features and, among the three severity levels kept here, only severity 3 has rank 2, so this class is separable from `rank` alone. 

The report above looks promising, so let us also visualize the results with a confusion matrix.


In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred, labels=[1,2,3])
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Severity 1','Severity 2','Severity 3'],normalize= False,  title='Confusion matrix XGBoost Model')
plt.savefig('Confusion matrix XGBoost Model.png')

The confusion matrix is a visual representation of how well the XGBoost model predicts which severity class each crime belongs to. Severity level 3 performs best compared with severity levels 2 and 1.  



Below we will **evaluate SVM model's output**. 

In [None]:
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print("Accuracy Score: %.2f%%" % (accuracy * 100.0))

For the SVM model the accuracy is almost the same as for the XGBoost model. Let us explore which severity level performs better:

In [None]:
print (classification_report(y_test, yhat))

The classification report for the SVM model shows that precision and recall are both 1 for severity level 3 crimes, indicating that the model predicts this class with excellent accuracy. For severity 2 the scores are lower than for severity 3, while severity 1 crimes are not predicted correctly by the SVM model either.

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,2,3])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Severity 1','Severity 2','Severity 3'],normalize= False,  title='Confusion matrix SVM Model')
plt.savefig('Confusion matrix SVM Model.png')

It is easy to see that the SVM model is very good at predicting severity 3 crimes. This comes from the fact that most crimes in our chosen data set are of severity level 3, so the model has plenty of examples to train on. For severity level 2 most of the classifications are correct, but for severity level 1 the SVM model is not able to classify crimes correctly. Since these crimes are few in number, the overall accuracy of the model is still high.  
The diagonal represents the samples correctly classified by the SVM model, and a darker blue indicates a larger number of records in a cell. The matrix also shows how many severity 1 crimes are labeled as severity 2 (131) and as severity 3 (4). Similarly, only 7 severity 2 crimes are labeled as severity 3, while all severity 3 crimes are predicted correctly. 

#### The End