# San Francisco Crime Classification
*Predict the category of crimes that occurred in the city by the bay*

[Kaggle Project Link](https://www.kaggle.com/c/sf-crime)

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The most up-voted scripts from this competition will receive official Kaggle swag as prizes.

In [192]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm
import seaborn as sns
from IPython.display import display
#import tools as t
from pygeocoder import Geocoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import metrics
import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
%matplotlib inline

# Load Data

In [193]:
train = pd.read_csv('train.csv.zip', parse_dates=['Dates'])
test = pd.read_csv('test.csv.zip', parse_dates=['Dates'])
#print(train[:10])
#print(train['Resolution'])
train.head(10)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122,38
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122,38
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122,38
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122,38
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122,38
5,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM UNLOCKED AUTO,Wednesday,INGLESIDE,NONE,0 Block of TEDDY AV,-122,38
6,2015-05-13 23:30:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,INGLESIDE,NONE,AVALON AV / PERU AV,-122,38
7,2015-05-13 23:30:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,BAYVIEW,NONE,KIRKWOOD AV / DONAHUE ST,-122,38
8,2015-05-13 23:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,RICHMOND,NONE,600 Block of 47TH AV,-123,38
9,2015-05-13 23:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,CENTRAL,NONE,JEFFERSON ST / LEAVENWORTH ST,-122,38


##### Other potential derived features:
- Time of Day: 1-6 (group into 4 hour intervals)
- Month of Crime: 1-12
- Convert longitude and latitude to zip code/address
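
As a rough sketch of the first two ideas above, the snippet below derives a 4-hour time-of-day bin (numbered 1-6) and the month from a `Dates` column. The three-row frame and the column names `TimeOfDay4h` and `Month` are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the 'Dates' column of train.csv.
sample = pd.DataFrame({
    "Dates": pd.to_datetime([
        "2015-05-13 23:53:00",
        "2015-05-13 03:10:00",
        "2003-01-06 00:01:00",
    ])
})

# Time of day as 4-hour blocks numbered 1-6 (hours 0-3 -> 1, ..., hours 20-23 -> 6).
sample["TimeOfDay4h"] = sample["Dates"].dt.hour // 4 + 1

# Month of the crime, 1-12.
sample["Month"] = sample["Dates"].dt.month

print(sample)
```

Integer division keeps the bins equal-width; the `+ 1` only shifts the labels so they start at 1 as in the list above.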


# Data Exploration
## Observations:
- Descript tends to repeat keywords that already appear in Category. It could be reduced to fewer categories by splitting it into words and keeping the top keywords
- Address can be cleaned, maybe to street level
- PdDistrict has only 10 unique values, which is good
- DayOfWeek looks accurate
- Resolution: most values are NONE (526,790 of 878,049 rows); could be cleaned, maybe to fewer categories,
  e.g. not prosecuted / juvenile / arrest / located / prosecuted / psychopathic / unfounded.
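
One possible way to reduce Address to street level is to strip the leading block number, as sketched below. The helper name `street_level` and the regular expression are assumptions based only on the address formats visible in the sample rows ("1500 Block of LOMBARD ST" and "OAK ST / LAGUNA ST"); other formats in the full dataset may need extra handling.

```python
import re

# Matches the "<number> Block of " prefix seen in addresses such as "1500 Block of LOMBARD ST".
_BLOCK_PREFIX = re.compile(r"^\d+\s+Block\s+of\s+", flags=re.IGNORECASE)

def street_level(address):
    """Return the street name(s) for an address, dropping any block-number prefix."""
    return _BLOCK_PREFIX.sub("", address.strip())

print(street_level("1500 Block of LOMBARD ST"))
print(street_level("0 Block of TEDDY AV"))
print(street_level("OAK ST / LAGUNA ST"))  # intersections have no prefix to strip
```

With pandas this could be applied as `train['Address'].map(street_level)`.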

In [194]:
train.describe(include='all')

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
count,878049,878049,878049,878049,878049,878049,878049,878049.0,878049.0
unique,389257,39,879,7,10,17,23228,,
top,2011-01-01 00:01:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Friday,SOUTHERN,NONE,800 Block of BRYANT ST,,
freq,185,174900,60022,133734,157182,526790,26533,,
first,2003-01-06 00:01:00,,,,,,,,
last,2015-05-13 23:53:00,,,,,,,,
mean,,,,,,,,-122.0,38.0
std,,,,,,,,0.0,0.0
min,,,,,,,,-123.0,38.0
25%,,,,,,,,-122.0,38.0


In [195]:
train['Category'].value_counts()

LARCENY/THEFT                  174900
OTHER OFFENSES                 126182
NON-CRIMINAL                    92304
ASSAULT                         76876
DRUG/NARCOTIC                   53971
VEHICLE THEFT                   53781
VANDALISM                       44725
WARRANTS                        42214
BURGLARY                        36755
SUSPICIOUS OCC                  31414
MISSING PERSON                  25989
ROBBERY                         23000
FRAUD                           16679
FORGERY/COUNTERFEITING          10609
SECONDARY CODES                  9985
WEAPON LAWS                      8555
PROSTITUTION                     7484
TRESPASS                         7326
STOLEN PROPERTY                  4540
SEX OFFENSES FORCIBLE            4388
DISORDERLY CONDUCT               4320
DRUNKENNESS                      4280
RECOVERED VEHICLE                3138
KIDNAPPING                       2341
DRIVING UNDER THE INFLUENCE      2268
RUNAWAY                          1946
LIQUOR LAWS 

In [196]:
train['PdDistrict'].value_counts()

SOUTHERN      157182
MISSION       119908
NORTHERN      105296
BAYVIEW        89431
CENTRAL        85460
TENDERLOIN     81809
INGLESIDE      78845
TARAVAL        65596
PARK           49313
RICHMOND       45209
Name: PdDistrict, dtype: int64

# Data Transformation

### Create "TimeOfDay", "DayOfMonth", and "Year" Variables
- TimeOfDay groups the hour into 3-hour blocks to generalize better, so it has 8 unique values, 0-7
- DayOfMonth groups the day into 2-day blocks by truncating day/2, so it ranges from 0 to 15 (day 1 maps to 0, days 30-31 map to 15)
- Year is the calendar year of the report


In [207]:
train['TimeOfDay'] = (train['Dates'].dt.hour/3).astype(int)
test['TimeOfDay'] = (test['Dates'].dt.hour/3).astype(int)
train['DayOfMonth'] = (train['Dates'].dt.day/2).astype(int)
test['DayOfMonth'] = (test['Dates'].dt.day/2).astype(int)
train['Year'] = (train['Dates'].dt.year).astype(int)
test['Year'] = (test['Dates'].dt.year).astype(int)


In [208]:
train.tail(5)


Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,TimeOfDay,DayOfMonth,Year
878044,2003-01-06 00:15:00,ROBBERY,ROBBERY ON THE STREET WITH A GUN,Monday,TARAVAL,NONE,FARALLONES ST / CAPITOL AV,-122,38,0,3,2003
878045,2003-01-06 00:01:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Monday,INGLESIDE,NONE,600 Block of EDNA ST,-122,38,0,3,2003
878046,2003-01-06 00:01:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Monday,SOUTHERN,NONE,5TH ST / FOLSOM ST,-122,38,0,3,2003
878047,2003-01-06 00:01:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM OF VEHICLES",Monday,SOUTHERN,NONE,TOWNSEND ST / 2ND ST,-122,38,0,3,2003
878048,2003-01-06 00:01:00,FORGERY/COUNTERFEITING,"CHECKS, FORGERY (FELONY)",Monday,BAYVIEW,NONE,1800 Block of NEWCOMB AV,-122,38,0,3,2003


In [209]:
print(train.shape)
print(test.shape)

(878049, 12)
(884262, 10)


### Create "Route" Variable

In [211]:
# Reverse-geocode each (latitude, longitude) pair to its street name: one API request per row
Route = []
for lat, lon in zip(train['Y'], train['X']):
    Route.append(Geocoder.reverse_geocode(lat, lon).route)

GeocoderError: Error OVER_QUERY_LIMIT
Query: https://maps.google.com/maps/api/geocode/json?latlng=37.783004%2C-122.412414&sensor=false&bounds=&region=&language=
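
The `OVER_QUERY_LIMIT` error means the geocoding service refused further requests. One way to cut the request count is to look up each distinct coordinate pair only once and reuse the result, as sketched below. The helper `geocode_unique` and the stand-in `fake_lookup` are hypothetical; a real run would pass a function wrapping the geocoder call, and deduplication alone may still not fit within the service's quota.

```python
def geocode_unique(coords, lookup):
    """Call lookup(lat, lon) once per distinct pair and map the results back to every row."""
    cache = {}
    routes = []
    for lat, lon in coords:
        key = (lat, lon)
        if key not in cache:
            cache[key] = lookup(lat, lon)
        routes.append(cache[key])
    return routes

# Stand-in for a real reverse-geocoding call; it only records how often it is invoked.
calls = []

def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return "street@{:.4f},{:.4f}".format(lat, lon)

coords = [(37.7830, -122.4124), (37.7830, -122.4124), (37.8000, -122.4300)]
routes = geocode_unique(coords, fake_lookup)
print(len(routes), "rows resolved with", len(calls), "lookups")
```

Since the dataset has 23,228 unique addresses across 878,049 rows, repeated locations are common, so a cache like this should remove a large share of the requests.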

# Training
### Create 3 datasets:
- X_train, y_train for training
- X_test, y_test for validating
- X_mini_train, y_mini_train for light-weight training

In [221]:
# Use 70% of the train dataset for training, the remaining 30% for validating
train_features = np.array(train[['DayOfWeek', 'TimeOfDay', 'DayOfMonth', 'Year', 'PdDistrict']])
train_labels = np.array(train['Category'])
X_train, X_test, y_train, y_test = train_test_split(train_features,
                                                    train_labels,
                                                    test_size=0.30,
                                                    random_state=42)

# Use first 1000 rows for light-weight training
X_mini_train= X_train[:1000]
y_mini_train = y_train[:1000]

In [222]:
# Check datashapes for each of the above datasets
print ('train_features shape is', train_features.shape)
print ('train_labels shape is', train_labels.shape)
print ('X_train shape is', X_train.shape)
print ('X_test shape is', X_test.shape)
print ('y_train shape is', y_train.shape)
print ('y_test shape is', y_test.shape)
print ('X_mini_train shape is', X_mini_train.shape)
print ('y_mini_train shape is', y_mini_train.shape)


train_features shape is (878049, 5)
train_labels shape is (878049,)
X_train shape is (614634, 5)
X_test shape is (263415, 5)
y_train shape is (614634,)
y_test shape is (263415,)
X_mini_train shape is (1000, 5)
y_mini_train shape is (1000,)


In [223]:
X_mini_train

array([['Friday', 7, 13, 2005, 'SOUTHERN'],
       ['Saturday', 7, 3, 2004, 'CENTRAL'],
       ['Thursday', 5, 5, 2009, 'MISSION'],
       ..., 
       ['Thursday', 7, 3, 2005, 'TARAVAL'],
       ['Wednesday', 0, 3, 2010, 'INGLESIDE'],
       ['Monday', 4, 0, 2013, 'INGLESIDE']], dtype=object)

### Training K-Nearest Neighbors

In [224]:
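# data preprocessing: encode the categorical columns DayOfWeek (0) and PdDistrict (4) as integers.
# Fit each encoder on the training split only, then reuse the same mapping on the validation split.
# X_mini_train is a view of X_train, so it is encoded in place as well.
for col in (0, 4):
    le = preprocessing.LabelEncoder()
    X_train[:, col] = le.fit_transform(X_train[:, col]).astype('str')
    X_test[:, col] = le.transform(X_test[:, col]).astype('str')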



In [225]:
random.seed(100)
knn = KNeighborsClassifier(39)
# fit model using training set
knn.fit(X_mini_train, y_mini_train)
accuracy = knn.score(X_test, y_test)
print (accuracy)

0.185900575138


### Random Forests

In [226]:
np.random.seed(100)  # scikit-learn draws from NumPy's global RNG when random_state is None
rfc = RandomForestClassifier()
rfc.fit(X_mini_train, y_mini_train)
rfc.score(X_test, y_test)

0.13819638213465443

### Decision Tree

In [227]:
random.seed(100)
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
clf.fit(X_mini_train, y_mini_train)
clf.score(X_test, y_test)

0.11392669362033293

#### Accuracy is low for all three models, each fit on only the 1,000-row mini training set, so we'll combine levels next.
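
To put these scores in context, the short calculation below compares them with a baseline that always predicts the most frequent category. It uses only counts and accuracies already printed in this notebook; the baseline is computed over the full training set, so the share in the 30% validation split may differ slightly.

```python
# Counts taken from the describe() and value_counts() output earlier in this notebook.
total_rows = 878049
most_common_count = 174900  # LARCENY/THEFT

majority_baseline = most_common_count / total_rows

# Accuracies reported above for the models fit on the 1000-row mini training set.
reported = {
    "knn": 0.185900575138,
    "random_forest": 0.13819638213465443,
    "decision_tree": 0.11392669362033293,
}

print("majority baseline:", round(majority_baseline, 4))
for name, acc in reported.items():
    print(name, "beats baseline:", acc > majority_baseline)
```

The baseline comes out at roughly 0.199, so none of the three models beats always guessing LARCENY/THEFT.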