# Predicting Crime:

## Introduction-

In this post, we will be trying to predict the category of a crime based on various statistics provided by the Police Department. 

This task provides excellent practice for a number of technical areas of data science.  

First, it gives us a large dataset to work with: the file contains roughly 6,000,000 observations. Second, it gives us a classification task on which we can test a number of different algorithms. Finally, it includes the location of each crime, so we can also experiment with some of Python's plotting libraries.  

For starters, the data can be downloaded [here](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2).  

## Data Transformation-

The first task of any data science project is to become familiar with the data through exploration. We must also manipulate and transform it into a useful format that we can then develop insight from.  

Before we can feed the data into our machine learning algorithms, we must clean it with the following steps:

1. Remove spaces from the headers
2. Convert all headers to lower-case
3. Parse and format the dates
4. Merge specific columns

Let's dive right in, shall we?

In [2]:
# Import the required libraries
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split as split
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [3]:
# Set directory to appropriate folder
path = "/path/to/files/chicago_crime"

os.chdir(path)

# Load the data
data = pd.read_csv("chicago_crime.csv")


The next step is somewhat controversial. A common rule of thumb in the data science community is that if records with missing values account for less than 5% of the total, they can be deleted outright; otherwise, the missing values should be imputed.

For the sake of convenience, we are going to delete all rows with missing data rather than imputing their values. The dataset is large enough that we will still have plenty of records for training and testing.
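
Before dropping rows, it can help to measure how many would be lost. The following is a minimal sketch on a made-up toy frame; the `toy` DataFrame and its values are assumptions for illustration, not the Chicago data. The same two expressions could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crime data; the values are invented.
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "ASSAULT", "THEFT"],
    "Ward": [12.0, np.nan, 3.0, 7.0, 44.0],
    "Latitude": [41.88, 41.75, np.nan, 41.80, 41.95],
})

# Share of missing values in each column
missing_per_column = toy.isna().mean()

# Share of rows with at least one missing value (the rows dropna() would remove)
missing_row_share = toy.isna().any(axis=1).mean()

print(missing_per_column)
print(f"Rows lost by dropna(): {missing_row_share:.0%}")
```

Comparing `missing_row_share` against the 5% rule of thumb shows whether dropping is a safe shortcut or whether imputation deserves a closer look.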

In [None]:
# Remove missing information
data1 = data.dropna()

In [None]:
# Fix the headers

# Remove spaces (regex=True is required for the pattern to be treated as a regex)
data1.columns = data1.columns.str.replace(r'\s+', '_', regex=True)

# Remove upper case
data1.columns = data1.columns.str.lower()

# Fix date and time
data1['date'] = pd.to_datetime(data1['date'], format = '%m/%d/%Y %I:%M:%S %p')

# Drop unimportant columns
data2 = data1.drop(columns = ['date', 'id', 'case_number', 'block', 'updated_on', 'location'])

In [None]:
# Convert categorical variables
data2['iucr'] = pd.factorize(data2.iucr)[0]
data2['primary_type'] = pd.factorize(data2.primary_type)[0]
data2['arrest'] = pd.factorize(data2.arrest)[0]
data2['description'] = pd.factorize(data2.description)[0]
data2['location_description'] = pd.factorize(data2.location_description)[0]
data2['fbi_code'] = pd.factorize(data2.fbi_code)[0]

## Machine Learning-

Now that we have cleaned up our data and added some formatting to it, we are ready to get started with actually implementing some models and making predictions. 

For the sake of simplicity, we will first work on trying to predict whether or not an arrest was made in each situation.

### Predicting Arrest:

#### Logistic Regression

Since the arrest category is a binary variable, it is a prime candidate for logistic regression.

However, before we begin, let's take a quick look at the actual breakdown of the arrest category:

In [None]:
# Table
data2['arrest'].value_counts()

In [None]:
# Visual
sns.countplot(x = 'arrest', data = data2)
plt.show()

From this information, we can already see that the majority of incidents do not end in an arrest. Overall, however, the distribution is still balanced enough that we shouldn't have to worry about a class imbalance issue.
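
To put a number on that impression, the class shares can be computed directly. This is a small sketch with made-up labels; the `toy` frame is hypothetical, and on the real data the equivalent call would be `data2['arrest'].value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical arrest labels (0 = no arrest, 1 = arrest); invented for illustration.
toy = pd.DataFrame({"arrest": [0, 0, 0, 1, 0, 1, 0, 0]})

# normalize=True returns proportions instead of raw counts
proportions = toy["arrest"].value_counts(normalize=True)

# Share of the most common class; also the accuracy of always guessing that class
majority_share = proportions.max()

print(proportions)
print(f"Majority-class baseline accuracy: {majority_share:.0%}")
```

The majority share doubles as a baseline: a classifier is only useful if its accuracy clearly exceeds what always predicting the most common class would achieve.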

One of the assumptions made by the logistic regression model is that the independent variables are not highly correlated with one another. Let's check whether this holds.

In [None]:
# Correlation
sns.heatmap(data2.corr())
plt.show()

This heatmap shows that all of the variables except those relating to geography are relatively uncorrelated. For this example, we will accept the correlation; otherwise, we would have to remove some variables using a feature selection method.
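
A heatmap is easy to read but hard to act on. One possible way to list the offending pairs explicitly is sketched below on synthetic data. The `toy` frame, its column values, and the 0.8 cut-off are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: y_coordinate is built as a near-linear function of latitude,
# mimicking the redundancy between geographic columns.
rng = np.random.default_rng(0)
n = 200
latitude = rng.normal(41.85, 0.1, n)
toy = pd.DataFrame({
    "latitude": latitude,
    "y_coordinate": latitude * 1000 + rng.normal(0, 1, n),
    "iucr": rng.integers(0, 50, n),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) -> correlation and keep the strongly correlated pairs
pairs = upper.stack()
high = pairs[pairs > 0.8]  # hypothetical threshold

print(high)
```

Each pair reported this way is a candidate for dropping one of its two columns before fitting the model.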

The model also needs the categorical variables in a form it can interpret. Normally, we would do this by creating a dummy variable for each individual factor level. The problem with this approach is that it would lead to too many new columns and hence the "curse of dimensionality". 

Instead, we took a different approach in the cleaning step above and encoded each factor as integer codes with `pd.factorize`. A quick check of the column types confirms that every column is now numeric.
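
The trade-off between the two encodings can be seen on a tiny example. This sketch uses invented values for two of the columns (the `toy` frame is hypothetical) and compares the number of columns each approach produces.

```python
import pandas as pd

# Hypothetical categorical columns; the values are invented for illustration.
toy = pd.DataFrame({
    "location_description": ["STREET", "APARTMENT", "STREET", "SIDEWALK", "RESIDENCE"],
    "fbi_code": ["06", "08B", "06", "04A", "26"],
})

# Dummy (one-hot) encoding: one new column per distinct level
one_hot = pd.get_dummies(toy)

# Integer encoding: each level is replaced by a code, so the column count is unchanged
codes = pd.DataFrame({col: pd.factorize(toy[col])[0] for col in toy.columns})

print("One-hot columns:", one_hot.shape[1])
print("Integer-coded columns:", codes.shape[1])
print(codes)
```

Integer codes keep the table narrow, but they impose an arbitrary ordering on the levels, and a linear model such as logistic regression will treat that ordering as a numeric magnitude. That cost is worth keeping in mind when interpreting the results.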

In [1]:
# Check column types
data2.dtypes


In [None]:
# Separate the independent and dependent variables
labels = data2['arrest']

features = data2.loc[:, data2.columns != 'arrest']

In [None]:
# Create training and testing sets
train_features, test_features, train_labels, test_labels = split(features, labels, test_size = .25, random_state = 100)

Now that we have selected our variables, encoded them appropriately, and split them into training and testing sets, we are ready to implement the model.

In [None]:
# Create the model
logreg = LogisticRegression()
logreg.fit(train_features, train_labels)

In [None]:
# Make predictions
test_pred = logreg.predict(test_features)

In [None]:
# Check the performance metrics
logreg.score(test_features, test_labels)
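
Accuracy alone can hide how the model treats each class. As a hedged sketch, the `metrics` module imported earlier can break the score down further. The label and prediction arrays below are invented stand-ins for `test_labels` and `test_pred`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions (0 = no arrest, 1 = arrest)
toy_labels = np.array([0, 0, 0, 0, 1, 1, 0, 1])
toy_pred = np.array([0, 0, 0, 1, 1, 0, 0, 0])

# Overall share of correct predictions
accuracy = metrics.accuracy_score(toy_labels, toy_pred)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(toy_labels, toy_pred)

# Of the actual arrests, how many did the model find?
recall = metrics.recall_score(toy_labels, toy_pred)

print(f"Accuracy: {accuracy:.3f}")
print(cm)
print(f"Recall for the arrest class: {recall:.3f}")
```

In this toy case the accuracy looks moderate while recall for the arrest class is much lower, which is the kind of gap a single accuracy figure would not reveal.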

## Conclusion- 

So, our final accuracy on the test set came out to 72%. This isn't bad considering the obstacles we faced, as well as the simplified procedure. 

To improve the accuracy of this model, there are several additional steps we could take, such as:

* Feature selection
* Cross-validation
* Ensemble modeling
* and more...

We will save these steps for a later date.
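
As a pointer in that direction, cross-validation could look roughly like the sketch below. It runs on synthetic data from scikit-learn's `make_classification` rather than the crime data, and the sample size, fold count, and `max_iter` value are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for the arrest data
X, y = make_classification(n_samples=500, n_features=8, random_state=100)

# 5-fold cross-validation: fit on four folds, score on the held-out fold, repeat
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a sense of how much a single train/test split, like the one used in this post, might over- or understate performance.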