# W207 FINAL Project - San Francisco Crime Classification

Project Team: Shih Yu Chang, Sriram Rao, Jingjing Rong, Frank Xie

Objective: Predict the category of crimes that occurred in the city by the bay

Background: From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.
This notebook is a baseline model that assigns a single fixed label to every incident in the dev dataset.

Dataset: The dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training and test sets alternate by week: weeks 1, 3, 5, 7, ... belong to the test set and weeks 2, 4, 6, 8, ... belong to the training set.

Data fields:
    Dates - timestamp of the crime incident
    Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
    Descript - detailed description of the crime incident (only in train.csv)
    DayOfWeek - the day of the week
    PdDistrict - name of the Police Department District
    Resolution - how the crime incident was resolved (only in train.csv)
    Address - the approximate street address of the crime incident 
    X - Longitude
    Y - Latitude

Evaluation: Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class).
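
As an illustration of the metric described above, here is a minimal standard-library sketch of multi-class log loss. The function name `multiclass_log_loss`, the toy class list, and the clipping constant `eps=1e-15` are assumptions made for this example, not taken from the competition code. Clipping keeps the loss finite when a predicted probability is exactly 0 or 1.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, classes, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    predicted_probs is a list of rows; each row holds one probability per
    entry of `classes`, in the same order.
    """
    index_of = {c: i for i, c in enumerate(classes)}
    total = 0.0
    for label, row in zip(true_labels, predicted_probs):
        p = row[index_of[label]]
        # Clip so that a probability of exactly 0 or 1 gives a finite loss (assumed eps).
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p)
    return total / len(true_labels)

# Toy data (hypothetical, three classes only)
classes = ["ASSAULT", "LARCENY/THEFT", "WARRANTS"]
y_true = ["LARCENY/THEFT", "WARRANTS"]
probs = [
    [0.2, 0.7, 0.1],   # fairly confident and correct
    [0.0, 1.0, 0.0],   # fully confident and wrong
]
loss = multiclass_log_loss(y_true, probs, classes)
```

Under this clipping, a fully confident wrong prediction costs -ln(1e-15), about 34.5, which is why hard 0/1 predictions are penalized so heavily by this metric.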

Submission Format: You must submit a csv file with the incident id, all candidate class names, and a probability for each class. The order of the rows does not matter.
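
A rough sketch of writing a file in that shape, using only the standard library. The header name `Id`, the helper `write_submission`, the six-decimal formatting, and the three toy classes are assumptions for illustration; the exact header expected by the competition should be checked against its sample submission.

```python
import csv
import io

def write_submission(out, ids, classes, prob_rows):
    """Write one header row, then one row of probabilities per incident."""
    writer = csv.writer(out, lineterminator="\n")
    # Assumed layout: id column first, then one column per class name.
    writer.writerow(["Id"] + list(classes))
    for incident_id, row in zip(ids, prob_rows):
        writer.writerow([incident_id] + [f"{p:.6f}" for p in row])

# Toy example written to an in-memory buffer instead of a file
classes = ["ASSAULT", "LARCENY/THEFT", "WARRANTS"]
buffer = io.StringIO()
write_submission(buffer, [0, 1], classes, [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])
submission_text = buffer.getvalue()
```

Passing an open file handle instead of the `StringIO` buffer would write the same rows to disk.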

Source: https://www.kaggle.com/c/sf-crime

README: This is the baseline submission of the project, with the basic processing pipeline and a baseline prediction model. As a baseline, we assume that every crime belongs to the most common category, theft. The baseline solution has a log loss score of 32.89184.

In [1]:
%matplotlib inline

import numpy as np
import math
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler,FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedShuffleSplit  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn.ensemble import RandomForestClassifier
from matplotlib.colors import LogNorm
from sklearn.naive_bayes import GaussianNB
from copy import deepcopy
import pygeohash as pgh         # Module to encode latitude and longitude as a geohash string

In [10]:
# This cell is to load and explore the training data (train.csv file) provided by the Kaggle website

train_data = pd.read_csv("train.csv")
print ("Training Data shape ", train_data.shape)
print("\nPrinting a few rows", train_data.head(6))

print("\nList of Unique Crime Categories ", train_data.Category.unique())
print("\nNumber of unique latitudes", len(train_data.Y.unique()))
print("\nNumber of unique longitudes", len(train_data.X.unique()))

Training Data shape  (878049, 9)

Printing a few rows                  Dates        Category                        Descript  \
0  2015-05-13 23:53:00        WARRANTS                  WARRANT ARREST   
1  2015-05-13 23:53:00  OTHER OFFENSES        TRAFFIC VIOLATION ARREST   
2  2015-05-13 23:33:00  OTHER OFFENSES        TRAFFIC VIOLATION ARREST   
3  2015-05-13 23:30:00   LARCENY/THEFT    GRAND THEFT FROM LOCKED AUTO   
4  2015-05-13 23:30:00   LARCENY/THEFT    GRAND THEFT FROM LOCKED AUTO   
5  2015-05-13 23:30:00   LARCENY/THEFT  GRAND THEFT FROM UNLOCKED AUTO   

   DayOfWeek PdDistrict      Resolution                    Address  \
0  Wednesday   NORTHERN  ARREST, BOOKED         OAK ST / LAGUNA ST   
1  Wednesday   NORTHERN  ARREST, BOOKED         OAK ST / LAGUNA ST   
2  Wednesday   NORTHERN  ARREST, BOOKED  VANNESS AV / GREENWICH ST   
3  Wednesday   NORTHERN            NONE   1500 Block of LOMBARD ST   
4  Wednesday       PARK            NONE  100 Block of BRODERICK ST   
5  Wedn

In [11]:
######### Data Pre-Processing Section #########
# Remove 67 data points with incorrect latitude and longitude (latitude outside the San Francisco range)
# Extract Year, Month, Hour from the timestamp provided in the dataset
# Encode the geographical location, provided as two values (latitude and longitude), as a single geohash string

def geoencode(x):
    return pgh.encode(x['Latitude'], x['Longitude'], precision=7)


# Remove 67 outlier data points that have wrong Latitude and Longitude
train_data = train_data[abs(train_data["Y"])<38]

# Drop columns that add no value to our analysis
train_data = train_data.drop(['Resolution', 'Descript'], axis=1)

#Extract Year, Month and Hour from the Dates field
train_data['Dates'] = pd.to_datetime(train_data['Dates'])
train_data['Year'], train_data['Month']  = train_data['Dates'].dt.year, train_data['Dates'].dt.month 
train_data['Hour'] = train_data['Dates'].dt.hour

#Rename the X, Y columns as longitude and latitude respectively
train_data['Latitude'] = train_data['Y']
train_data['Longitude'] = train_data['X']
train_data = train_data.drop(['X', 'Y'], axis=1)

# There are 34243 X 34243 values for lat and long. This may be too granular for the analysis
# Therefore we geo-encode the location as a single geohash string
# With an encoding precision of 7, the 34243 X 34243 values get encoded as a matrix with 5000 values 
# Area of San Francisco is about 47 square miles - this translates to each square mile sliced into roughly a 10X10 matrix
train_data['GeoCode'] = train_data.apply(geoencode, axis=1)
#train_data.to_csv("sfcrimedata.csv")   
train_labels = train_data['Category']
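
To make the geo-encoding step above more concrete, here is a minimal pure-Python sketch of the standard geohash algorithm. It is an illustration only, not the pygeohash implementation; the function name `geohash_encode` and the sample coordinates are assumptions. The result is a base-32 string: each character carries 5 bits, and the bits alternate between halving the longitude range and halving the latitude range.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(latitude, longitude, precision=7):
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # the first bit always refines longitude
    while len(chars) < precision:
        rng, value = (lon_range, longitude) if use_longitude else (lat_range, latitude)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1   # upper half: emit 1 and raise the lower bound
            rng[0] = mid
        else:
            bits = bits << 1         # lower half: emit 0 and lower the upper bound
            rng[1] = mid
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:           # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

# Approximate coordinates of central San Francisco (illustrative values)
sf_code = geohash_encode(37.7749, -122.4194, precision=7)
```

Nearby points share a common prefix, so truncating the string coarsens the grid. This is why the one-hot columns built from `GeoCode` all start with the same few characters.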

In [12]:
# There are 800K+ rows of training data. This will bog down the initial analysis.
# Create a smaller baseline training and dev data set for quick analysis
# Also split the Training data and labels

#TO DO: shuffle the data up
baseline_train_data_count = 50000
btrain_data = train_data[:(baseline_train_data_count*2)]
btrain_datacopy = btrain_data
btrain_labels = btrain_data['Category']

# Get Dummy values of geocode and convert as columns
geocode = pd.get_dummies(btrain_data.GeoCode)
btrain_data = pd.concat([geocode], axis=1)
#btrain_data['Year'] = btrain_datacopy['Year']
btrain_data['Month'] = btrain_datacopy['Month']
btrain_data['Hour'] = btrain_datacopy['Hour']

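# Note: slicing from baseline_train_data_count+1 skips the row at position baseline_train_data_count,
# so the dev set holds baseline_train_data_count-1 rows
bdev_data = btrain_data[(baseline_train_data_count+1):]
bdev_labels = btrain_labels[(baseline_train_data_count+1):]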
btrain_data = btrain_data[:baseline_train_data_count]
btrain_labels = btrain_labels[:baseline_train_data_count]

Unnamed: 0,9q8ys3x,9q8ys6w,9q8ys7x,9q8ys9e,9q8ys9s,9q8ysb8,9q8yscb,9q8yscu,9q8ysf3,9q8ysfd,...,9q8znd7,9q8znd8,9q8znd9,9q8zndb,9q8zndd,9q8zndh,9q8zndj,9q8zndk,9q8zpe0,9q8zpkc
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
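
The cell above takes the first rows of the file in their original order, and the TO DO notes that the data should be shuffled first. One possible way to do that is to shuffle row positions and then slice, sketched here with the standard library only. The helper name `shuffled_split`, the seed, and the toy sizes are assumptions for illustration.

```python
import random

def shuffled_split(n_rows, n_train, n_dev, seed=0):
    """Return two disjoint lists of row positions drawn in random order."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    train_idx = indices[:n_train]
    dev_idx = indices[n_train:n_train + n_dev]
    return train_idx, dev_idx

# Toy sizes; in the notebook these would be the row count and baseline_train_data_count
train_idx, dev_idx = shuffled_split(n_rows=1000, n_train=100, n_dev=100)
```

With pandas, the resulting positions could be applied as `train_data.iloc[train_idx]` and `train_data.iloc[dev_idx]`. Because the first rows printed earlier are all from the most recent date, an unshuffled slice may cover only a narrow time window.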


In [4]:
# TO DO: Plot frequency histograms to understand the relationship between crime count and each of the following:
# crime type, day of the week, police district, encoded geo-location, year, month and hour
# UPDATE: This was completed in Tableau. 
#Visualization available in a PDF document - "W207 - SF Crime Prediction - Data Visualization Dashboard.pdf"



In [16]:
# Assign a baseline label to the dev dataset and compute its log loss
list1 = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
#list1 = [0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# Repeat the one-hot row once per dev example (np.tile avoids growing the array with np.append in a loop)
baselinepredict = np.tile(np.array(list1, float), (len(bdev_labels), 1))

#print(baselinepredict.shape)
print("Baseline log loss ",log_loss(bdev_labels, baselinepredict))

Baseline log loss  33.8694015209
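
The baseline row above hard-codes the position of the predicted class. As a hedged alternative, the row could be derived from the training labels, as sketched below with the standard library. The helper names `majority_class_row` and `class_prior_row` and the toy labels are hypothetical. The second helper is a different technique, predicting class frequencies rather than a single class, included only for comparison.

```python
from collections import Counter

def majority_class_row(train_labels):
    """One-hot row that puts all probability on the most frequent label."""
    classes = sorted(set(train_labels))            # sorted order, as scikit-learn orders classes
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    row = [1.0 if c == most_common_label else 0.0 for c in classes]
    return classes, row

def class_prior_row(train_labels):
    """Row of empirical class frequencies (a softer baseline than one-hot)."""
    counts = Counter(train_labels)
    classes = sorted(counts)
    total = len(train_labels)
    return classes, [counts[c] / total for c in classes]

# Toy labels (hypothetical)
labels = ["WARRANTS", "LARCENY/THEFT", "OTHER OFFENSES", "LARCENY/THEFT"]
classes, one_hot = majority_class_row(labels)
_, prior = class_prior_row(labels)
```

Deriving the index this way keeps the row consistent with whichever categories are actually present in the slice, and the frequency row avoids the large penalty that log loss assigns to confident wrong predictions.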


In [20]:
# Since we have too many features, start the analysis with Random Forest
Num_Trees = 400
rf_model = RandomForestClassifier(n_estimators=Num_Trees, max_depth=50, n_jobs=4)
rf_model.fit(btrain_data, btrain_labels)
predicted = np.array(rf_model.predict_proba(bdev_data))
#print('predicted', predicted)
print( 'Accuracy for Random Forest model with %d trees is: %0.3f' %(Num_Trees, rf_model.score(bdev_data, bdev_labels)))
#log_loss(bdev_labels, predicted)
#print(rf_model.classes_)

Accuracy for Random Forest model with 400 trees is: 0.250
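
The log loss call in the cell above is commented out. One thing that may need care before enabling it is that the probability columns follow the classes seen during training, which can differ from the classes present in the dev labels. The sketch below pads missing classes with zero probability. The helper name `align_probabilities` and the toy values are assumptions, and this is not part of the original pipeline.

```python
def align_probabilities(prob_rows, model_classes, all_classes):
    """Reorder and pad probability rows so the columns follow `all_classes`."""
    position = {c: i for i, c in enumerate(model_classes)}
    aligned = []
    for row in prob_rows:
        # Classes the model never saw get probability 0.0
        aligned.append([row[position[c]] if c in position else 0.0 for c in all_classes])
    return aligned

# Toy example: the model saw two classes, the full label set has three
model_classes = ["ASSAULT", "WARRANTS"]
all_classes = ["ASSAULT", "LARCENY/THEFT", "WARRANTS"]
aligned = align_probabilities([[0.4, 0.6]], model_classes, all_classes)
```

In scikit-learn, `log_loss` also accepts a `labels` argument that names the column order explicitly, which may be simpler when the model's `classes_` already cover every dev label.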


In [None]:
#Gaussian NaiveBayes
#clf = GaussianNB()
#clf.fit(btrain_data, btrain_labels)
#prediction = clf.predict(bdev_data)
#predicted = np.array(clf.predict(bdev_data))
#accuracy=clf.score(bdev_data, bdev_labels)
#print('Accuracy for GaussianNB is: %0.3f' %(accuracy))
