# Final Project 2: San Francisco Crime Classification

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The top most up-voted scripts from this competition will receive official Kaggle swag as prizes. 

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

#SK-learn libraries for shuffling
from sklearn.utils import shuffle

Load the data, stripping out metadata so that we learn classifiers that only use textual features. By default, newsgroups data is split into train and test sets. We further split the test so we have a dev set. Note that we specify 4 categories to use for this project. If you remove the categories argument from the fetch function, you'll get all 20 categories.

In [36]:
#uploading test data
test_data = pd.read_csv('test.csv')
test_data.head()

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [51]:
int(len(train_data_1)*0.8)

702439

In [52]:
#uploading train data
train_data_df = pd.read_csv('train.csv')
#shuffle data
train_data_df = shuffle(train_data_df)
train_data_df = train_data_df.reset_index()

#setting train data 
# train_data_1 = train_data_df[['Dates','DayOfWeek','PdDistrict','Address','X','Y']]
train_data_1 = train_data_df[['X','Y']]
dev_data = train_data_1[int(len(train_data_1)*0.8):]
train_data = train_data_1[:int(len(train_data_1)*0.8)]


#setting train labels
train_labels_1 = train_data_df[['Category']]
dev_labels = train_labels_1[int(len(train_data_1)*0.8):]
train_labels = train_labels_1[:int(len(train_data_1)*0.8)]



In [53]:
print len(train_data)
print len(dev_data)
print len(train_labels)
print len(dev_labels)

702439
175610
702439
175610


In [54]:
train_data.head()

Unnamed: 0,X,Y
0,-122.386571,37.750326
1,-122.45605,37.713194
2,-122.429789,37.766652
3,-122.459049,37.739631
4,-122.403405,37.775421


In [55]:
dev_labels.head()

Unnamed: 0,Category
702439,PROSTITUTION
702440,OTHER OFFENSES
702441,OTHER OFFENSES
702442,ASSAULT
702443,ROBBERY


In [56]:
dev_data.head()

Unnamed: 0,X,Y
702439,-122.416721,37.757168
702440,-122.437744,37.760779
702441,-122.398919,37.796954
702442,-122.435637,37.768169
702443,-122.434895,37.769114


We are going to make a couple of adjustments in order for the models to run:
* convert the Dates from string to date time
* trunc the date to the day
* transfor the Day of the Week in a number
* Remove the District and Address components

In [57]:
#copying the df in another to start the treatment
train_data_co = train_data.copy()
train_labels_co = train_labels.copy()
dev_data_co = dev_data.copy()
dev_labels_co = dev_labels.copy()

Running for this first baseline model a kNN, Multinomial and a Logistic Regression

In [58]:
print "For KNN"
#fitting a knn and understanding what is the optimal f1_score
# we will run a for with the following numbers 1,2,3,4,5,7,9,11,13
k_values = [1,2,3,4,5,7,9,11,13,14,15,16,17,18,19,20,50,100,200]
# #creating a variable to receive the values of the report
f1_ar = []
ac_ar = []
for ik in xrange(0,len(k_values)):
    #adding to X the mini train data (using the modified train data from exercise 2.a)
    X_3knn = train_data_co
    #adding to y the mini train Label
    y = train_labels_co

    #we wil store in the variable neib the result of the function KNeighborsClassifier. 
    #the number of neighbors comes from the input of the user and it is passed using the variable k_values
    neib = KNeighborsClassifier(n_neighbors=k_values[ik])
    #fitting
    neib.fit(X_3knn, y)

    #printing the accuracy per k_value
    preds = neib.predict(dev_data_co)
    correct, total = 0, 0
    for pred, label in zip(preds, dev_labels_co):
        if pred == label: correct += 1
        total += 1
    f1 = metrics.f1_score(dev_labels_co, preds, average='weighted')
    string = 'k-values:%3d accuracy: %3.2f f1-score: %3.2f' %(k_values[ik],1.0*correct/total,f1)
    print string
    
    #returning the classification report 
    f1_ar.append(f1)
    ac_ar.append(1.0*correct/total)

best = 0
for i3 in xrange(0,len(f1_ar)):
    if f1_ar[i3] > best:
        best = f1_ar[i3]
        guery3 = k_values[i3]
        ac = ac_ar[i3]

print "The best f1-score is: %3.2f" %(best)
print "This value is found when the K is: " + str(guery3)
print "The accuracy found is : %3.2f" %(ac)
print ""



For KNN




k-values:  1 accuracy: 0.00 f1-score: 0.18
k-values:  2 accuracy: 0.00 f1-score: 0.17
k-values:  3 accuracy: 0.00 f1-score: 0.18
k-values:  4 accuracy: 0.00 f1-score: 0.20
k-values:  5 accuracy: 0.00 f1-score: 0.20
k-values:  7 accuracy: 0.00 f1-score: 0.21
k-values:  9 accuracy: 0.00 f1-score: 0.22
k-values: 11 accuracy: 0.00 f1-score: 0.22
k-values: 13 accuracy: 0.00 f1-score: 0.22
k-values: 14 accuracy: 0.00 f1-score: 0.22
k-values: 15 accuracy: 0.00 f1-score: 0.22
k-values: 16 accuracy: 0.00 f1-score: 0.22
k-values: 17 accuracy: 0.00 f1-score: 0.22
k-values: 18 accuracy: 0.00 f1-score: 0.22
k-values: 19 accuracy: 0.00 f1-score: 0.22
k-values: 20 accuracy: 0.00 f1-score: 0.22
k-values: 50 accuracy: 0.00 f1-score: 0.22
k-values:100 accuracy: 0.00 f1-score: 0.22
k-values:200 accuracy: 0.00 f1-score: 0.21
The best f1-score is: 0.22
This value is found when the K is: 50
The accuracy found is : 0.00



In [69]:
#making the interaction for the winning k
#adding to X the mini train data (using the modified train data from exercise 2.a)
X_3knn = train_data_co
#adding to y the mini train Label
y = train_labels_co

#we wil store in the variable neib the result of the function KNeighborsClassifier. 
#the number of neighbors comes from the input of the user and it is passed using the variable k_values
neib = KNeighborsClassifier(n_neighbors=guery3)
#fitting
neib.fit(X_3knn, y)

#printing the accuracy per k_value
preds = neib.predict(dev_data_co)
correct, total = 0, 0
for pred, label in zip(preds, dev_labels_co):
    if pred == label: correct += 1
    total += 1
f1 = metrics.f1_score(dev_labels_co, preds, average='weighted')
string = 'k-values:%3d accuracy: %3.2f f1-score: %3.2f' %(k_values[ik],1.0*correct/total,f1)
print string

  # This is added back by InteractiveShellApp.init_path()


k-values:200 accuracy: 0.00 f1-score: 0.22


In [66]:
#setting test data
#uploading test data
test_data_df = pd.read_csv('test.csv')

# train_data_1 = train_data_df[['Dates','DayOfWeek','PdDistrict','Address','X','Y']]
test_data_1 = test_data_df[['X','Y']]

In [68]:
neib.predict_proba(test_data_1)

array([[0.   , 0.15 , 0.   , ..., 0.115, 0.015, 0.025],
       [0.   , 0.1  , 0.   , ..., 0.055, 0.105, 0.015],
       [0.   , 0.035, 0.   , ..., 0.165, 0.   , 0.   ],
       ...,
       [0.   , 0.13 , 0.   , ..., 0.12 , 0.045, 0.005],
       [0.005, 0.15 , 0.   , ..., 0.035, 0.03 , 0.06 ],
       [0.   , 0.065, 0.   , ..., 0.025, 0.065, 0.01 ]])