# The State-of-the-Art Script for San Francisco Crime Classification

This is the deliverable notebook for [Kaggle](https://www.kaggle.com)'s [San Francisco Crime Classification](https://www.kaggle.com/c/sf-crime) competition. All team members participating in the competition can apply their experiments to this notebook. The submission output should always preserve the best score achieved so far. When you finish applying a new experiment to this notebook, submit your code for review through a [pull request](https://help.github.com/articles/using-pull-requests/).


### Overview
**Model**
  * BernoulliNB. All hyperparameters are left at their defaults.

**Features**
  * X, Y
      * Adding the X*Y feature: 2.589879
      * Adding the X^2 and Y^2 features: 2.589881 -> 2.59008

  * Dates (converted to numerical columns)
    * Convert the Dates column to numerical columns named Dates-Year, Dates-Month, Dates-Day, Dates-Hour and Dates-Minute.
    * Set Dates-Minute to zero if the value is equal to 30.
    * Use only Dates-Hour and Dates-Minute.


### Result
  * 5-fold Cross Validation = **2.589878**
  * Leaderboard = **2.59008**
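
Both numbers are multi-class logarithmic losses (the Score cell requests a log-loss scorer), so lower is better. As a rough illustration of how such a loss is computed, here is a sketch in plain Python. The function name `multiclass_log_loss` and the toy probabilities are assumptions made up for this example and are not part of the notebook.

```python
import math

def multiclass_log_loss(true_labels, predicted, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted):
        # Clip the probability so that log(0) can never occur.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(true_labels)

# Toy example with three classes; each row of probabilities sums to 1.
labels = ["ASSAULT", "WARRANTS"]
probabilities = [
    {"ASSAULT": 0.5, "WARRANTS": 0.25, "VANDALISM": 0.25},
    {"ASSAULT": 0.5, "WARRANTS": 0.25, "VANDALISM": 0.25},
]
loss = multiclass_log_loss(labels, probabilities)
print(round(loss, 4))
```

A model that puts more probability on the true category gets a lower loss; a confident wrong prediction is penalised heavily.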

In [1]:
import numpy as np
import pandas as pd

## Load Data

In [2]:
train = pd.read_csv("train.csv/train.csv")
train.head(3)

,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414


In [3]:
test = pd.read_csv("test.csv/test.csv")
test.head(3)

,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212


## Feature Extraction

### Merge the train and test dataframes

In [4]:
seperator = train.shape[0]

# Build a unique, prefixed index so train and test rows stay distinguishable after merging.
train["combi-index"] = ["train-{0}".format(index) for index in train.index]
test["combi-index"] = ["test-{0}".format(index) for index in test.index]

combi = pd.concat([train, test])
combi = combi.set_index("combi-index")

combi.head(3)

combi-index,Address,Category,Dates,DayOfWeek,Descript,Id,PdDistrict,Resolution,X,Y
train-0,OAK ST / LAGUNA ST,WARRANTS,2015-05-13 23:53:00,Wednesday,WARRANT ARREST,,NORTHERN,"ARREST, BOOKED",-122.425892,37.774599
train-1,OAK ST / LAGUNA ST,OTHER OFFENSES,2015-05-13 23:53:00,Wednesday,TRAFFIC VIOLATION ARREST,,NORTHERN,"ARREST, BOOKED",-122.425892,37.774599
train-2,VANNESS AV / GREENWICH ST,OTHER OFFENSES,2015-05-13 23:33:00,Wednesday,TRAFFIC VIOLATION ARREST,,NORTHERN,"ARREST, BOOKED",-122.424363,37.800414


### Convert the 'Dates' column to numerical columns


In [5]:
from datetime import datetime

total_count = combi.shape[0]
count = 0

dates_data = []

for index, row in combi["Dates"].items():
    count = count + 1

    if count % 100000 == 0:
        print("processing... {0}/{1}".format(count, total_count))

    date = datetime.strptime(row, "%Y-%m-%d %H:%M:%S")

    dates_data.append({
        "combi-index": index,
        "Dates-Year": date.year,
        "Dates-Month": date.month,
        "Dates-Day": date.day,
        "Dates-Hour": date.hour,
        "Dates-Minute": date.minute,
        "Dates-Second": date.second,
    })
    
dates_dataframe = pd.DataFrame.from_dict(dates_data)
dates_dataframe = dates_dataframe.set_index("combi-index")

dates_columns = ["Dates-Year", "Dates-Month", "Dates-Day", "Dates-Hour", "Dates-Minute", "Dates-Second"]
dates_dataframe = dates_dataframe[dates_columns]

# Every "Dates-Second" value is zero, so the column carries no information and can be dropped.
second_list = dates_dataframe["Dates-Second"].unique()
print("list of seconds = {0}".format(second_list))

dates_dataframe = dates_dataframe.drop("Dates-Second", axis=1)

combi = pd.concat([combi, dates_dataframe], axis=1)

combi.head(3)

processing... 100000/1762311
processing... 200000/1762311
processing... 300000/1762311
processing... 400000/1762311
processing... 500000/1762311
processing... 600000/1762311
processing... 700000/1762311
processing... 800000/1762311
processing... 900000/1762311
processing... 1000000/1762311
processing... 1100000/1762311
processing... 1200000/1762311
processing... 1300000/1762311
processing... 1400000/1762311
processing... 1500000/1762311
processing... 1600000/1762311
processing... 1700000/1762311
list of seconds = [0]


combi-index,Address,Category,Dates,DayOfWeek,Descript,Id,PdDistrict,Resolution,X,Y,Dates-Year,Dates-Month,Dates-Day,Dates-Hour,Dates-Minute
train-0,OAK ST / LAGUNA ST,WARRANTS,2015-05-13 23:53:00,Wednesday,WARRANT ARREST,,NORTHERN,"ARREST, BOOKED",-122.425892,37.774599,2015,5,13,23,53
train-1,OAK ST / LAGUNA ST,OTHER OFFENSES,2015-05-13 23:53:00,Wednesday,TRAFFIC VIOLATION ARREST,,NORTHERN,"ARREST, BOOKED",-122.425892,37.774599,2015,5,13,23,53
train-2,VANNESS AV / GREENWICH ST,OTHER OFFENSES,2015-05-13 23:33:00,Wednesday,TRAFFIC VIOLATION ARREST,,NORTHERN,"ARREST, BOOKED",-122.424363,37.800414,2015,5,13,23,33


### Set **Dates-Minute** to 0 when the value is 30

In [6]:
combi.loc[combi["Dates-Minute"] == 30, "Dates-Minute"] = 0

# Sanity check: no row should still have a minute value of 30.
print("The number of rows where Dates-Minute equals 30 = {0}".format(combi[combi["Dates-Minute"] == 30].shape[0]))

The number of rows where Dates-Minute equals 30 = 0
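
The two steps above (splitting `Dates` into numeric columns and zeroing the half-hour minutes) could also be written without an explicit Python loop. The following is a minimal sketch, assuming pandas' `to_datetime` and the `.dt` accessor; the three-row `sample` frame is made up for illustration and stands in for `combi`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real `combi` dataframe.
sample = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00", "2015-05-13 23:30:00", "2015-05-10 00:01:00"],
})

# Parse the whole column at once instead of calling strptime row by row.
parsed = pd.to_datetime(sample["Dates"], format="%Y-%m-%d %H:%M:%S")

sample["Dates-Hour"] = parsed.dt.hour
sample["Dates-Minute"] = parsed.dt.minute

# Same rule as the notebook: a minute value of 30 is mapped to 0.
sample.loc[sample["Dates-Minute"] == 30, "Dates-Minute"] = 0

print(sample[["Dates-Hour", "Dates-Minute"]])
```

On a frame with well over a million rows, the vectorized form usually runs much faster than the loop, though the loop's progress output is lost.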


### Split back into the train and test dataframes

In [7]:
train = combi[:seperator]
train = train.drop("Id", axis=1)
train.head(3)

combi-index,Address,Category,Dates,DayOfWeek,Descript,PdDistrict,Resolution,X,Y,Dates-Year,Dates-Month,Dates-Day,Dates-Hour,Dates-Minute
train-0,OAK ST / LAGUNA ST,WARRANTS,2015-05-13 23:53:00,Wednesday,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",-122.425892,37.774599,2015,5,13,23,53
train-1,OAK ST / LAGUNA ST,OTHER OFFENSES,2015-05-13 23:53:00,Wednesday,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",-122.425892,37.774599,2015,5,13,23,53
train-2,VANNESS AV / GREENWICH ST,OTHER OFFENSES,2015-05-13 23:33:00,Wednesday,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",-122.424363,37.800414,2015,5,13,23,33


### Add the X*Y, X^2 and Y^2 features

In [8]:
train["XY"] = train["X"] * train["Y"]
train["X^2"] = train["X"] * train["X"]
train["Y^2"] = train["Y"] * train["Y"]

train.head()

combi-index,Address,Category,Dates,DayOfWeek,Descript,PdDistrict,Resolution,X,Y,Dates-Year,Dates-Month,Dates-Day,Dates-Hour,Dates-Minute,XY,X^2,Y^2
train-0,OAK ST / LAGUNA ST,WARRANTS,2015-05-13 23:53:00,Wednesday,WARRANT ARREST,NORTHERN,"ARREST, BOOKED",-122.425892,37.774599,2015,5,13,23,53,-4624.588916,14988.098952,1426.920299
train-1,OAK ST / LAGUNA ST,OTHER OFFENSES,2015-05-13 23:53:00,Wednesday,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",-122.425892,37.774599,2015,5,13,23,53,-4624.588916,14988.098952,1426.920299
train-2,VANNESS AV / GREENWICH ST,OTHER OFFENSES,2015-05-13 23:33:00,Wednesday,TRAFFIC VIOLATION ARREST,NORTHERN,"ARREST, BOOKED",-122.424363,37.800414,2015,5,13,23,33,-4627.691645,14987.724661,1428.871323
train-3,1500 Block of LOMBARD ST,LARCENY/THEFT,2015-05-13 23:30:00,Wednesday,GRAND THEFT FROM LOCKED AUTO,NORTHERN,NONE,-122.426995,37.800873,2015,5,13,23,0,-4627.847257,14988.369185,1428.905972
train-4,100 Block of BRODERICK ST,LARCENY/THEFT,2015-05-13 23:30:00,Wednesday,GRAND THEFT FROM LOCKED AUTO,PARK,NONE,-122.438738,37.771541,2015,5,13,23,0,-4624.699819,14991.244471,1426.689323


In [9]:
test = combi[seperator:]
test = test.drop(["Category", "Descript", "Resolution"], axis=1)
test["Id"] = test["Id"].astype('int32')

test = test.set_index(["Id"])
test.head(3)

Id,Address,Dates,DayOfWeek,PdDistrict,X,Y,Dates-Year,Dates-Month,Dates-Day,Dates-Hour,Dates-Minute
0,2000 Block of THOMAS AV,2015-05-10 23:59:00,Sunday,BAYVIEW,-122.399588,37.735051,2015,5,10,23,59
1,3RD ST / REVERE AV,2015-05-10 23:51:00,Sunday,BAYVIEW,-122.391523,37.732432,2015,5,10,23,51
2,2000 Block of GOUGH ST,2015-05-10 23:50:00,Sunday,NORTHERN,-122.426002,37.792212,2015,5,10,23,50


In [10]:
test["XY"] = test["X"] * test["Y"]
test["X^2"] = test["X"] * test["X"]
test["Y^2"] = test["Y"] * test["Y"]

test.head(3)

Id,Address,Dates,DayOfWeek,PdDistrict,X,Y,Dates-Year,Dates-Month,Dates-Day,Dates-Hour,Dates-Minute,XY,X^2,Y^2
0,2000 Block of THOMAS AV,2015-05-10 23:59:00,Sunday,BAYVIEW,-122.399588,37.735051,2015,5,10,23,59,-4618.754686,14981.65907,1423.934075
1,3RD ST / REVERE AV,2015-05-10 23:51:00,Sunday,BAYVIEW,-122.391523,37.732432,2015,5,10,23,51,-4618.129862,14979.684876,1423.736454
2,2000 Block of GOUGH ST,2015-05-10 23:50:00,Sunday,NORTHERN,-122.426002,37.792212,2015,5,10,23,50,-4626.749474,14988.125955,1428.251321
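
The train and test cells above repeat the same three assignments. One way to keep them in sync is a small shared helper, sketched below. `add_polynomial_features` is a hypothetical name and the toy coordinates are made up; neither appears in the notebook.

```python
import pandas as pd

def add_polynomial_features(frame):
    """Return a copy of `frame` with XY, X^2 and Y^2 columns appended."""
    result = frame.copy()
    result["XY"] = result["X"] * result["Y"]
    result["X^2"] = result["X"] ** 2
    result["Y^2"] = result["Y"] ** 2
    return result

# Toy coordinates, chosen only so the products are easy to check by hand.
toy = pd.DataFrame({"X": [-2.0, -1.5], "Y": [3.0, 4.0]})
toy_features = add_polynomial_features(toy)
print(toy_features)
```

With such a helper, the two cells would reduce to `train = add_polynomial_features(train)` and `test = add_polynomial_features(test)`. Returning a copy leaves the input frame untouched.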


## Score

In [11]:
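# cross_val_score lives in sklearn.model_selection; the old sklearn.cross_validation module was removed.
from sklearn.model_selection import cross_val_score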

feature_names = ["X", "Y", "XY", "X^2", "Y^2"] + ["Dates-Hour", "Dates-Minute"]
label_name = "Category"

train_X = train[feature_names]
test_X = test[feature_names]

train_y = train[label_name]

train_X.head(3)

combi-index,X,Y,XY,X^2,Y^2,Dates-Hour,Dates-Minute
train-0,-122.425892,37.774599,-4624.588916,14988.098952,1426.920299,23,53
train-1,-122.425892,37.774599,-4624.588916,14988.098952,1426.920299,23,53
train-2,-122.424363,37.800414,-4627.691645,14987.724661,1428.871323,23,33


In [12]:
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
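# scikit-learn reports log loss as a negated score (higher is better), hence the sign flip below.
score = cross_val_score(model, train_X, train_y, scoring='neg_log_loss', cv=5).mean()
print("BernoulliNB = {0:.6f}".format(-1.0 * score))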

BernoulliNB = 2.589881


## Prediction

In [13]:
model = BernoulliNB()
model.fit(train_X, train_y)

prediction = model.predict_proba(test_X)
prediction[0:100]

array([[ 0.00216754,  0.10015113,  0.0002371 , ...,  0.0233816 ,
         0.07588424,  0.01436819],
       [ 0.00216754,  0.10015113,  0.0002371 , ...,  0.0233816 ,
         0.07588424,  0.01436819],
       [ 0.00216754,  0.10015113,  0.0002371 , ...,  0.0233816 ,
         0.07588424,  0.01436819],
       ..., 
       [ 0.00216754,  0.10015113,  0.0002371 , ...,  0.0233816 ,
         0.07588424,  0.01436819],
       [ 0.00216754,  0.10015113,  0.0002371 , ...,  0.0233816 ,
         0.07588424,  0.01436819],
       [ 0.00216754,  0.10015113,  0.0002371 , ...,  0.0233816 ,
         0.07588424,  0.01436819]])
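
Every row printed above is identical. One plausible explanation, assuming scikit-learn's documented default of `binarize=0.0` for `BernoulliNB`, is that each feature is thresholded at zero before fitting, so rows that differ only in magnitude collapse to the same binary vector. The sketch below mimics that thresholding in plain Python on two rows copied from the `train_X` preview; `binarize_row` is a hypothetical helper, not a scikit-learn function.

```python
FEATURES = ["X", "Y", "XY", "X^2", "Y^2", "Dates-Hour", "Dates-Minute"]

def binarize_row(row, threshold=0.0):
    """Map values above the threshold to 1 and all other values to 0."""
    return [1 if value > threshold else 0 for value in row]

# Rows train-0 and train-2 from the preview of train_X.
row_a = [-122.425892, 37.774599, -4624.588916, 14988.098952, 1426.920299, 23, 53]
row_b = [-122.424363, 37.800414, -4627.691645, 14987.724661, 1428.871323, 23, 33]

bits_a = binarize_row(row_a)
bits_b = binarize_row(row_b)

print(dict(zip(FEATURES, bits_a)))
print(bits_a == bits_b)
```

Under this assumption, only rows whose hour or minute is 0 would binarize differently, which may be why adding `XY`, `X^2` and `Y^2` barely moves the score.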

## Submission

In [26]:
sample = pd.read_csv("sampleSubmission.csv/sampleSubmission.csv", index_col="Id")
sample.head(3)

Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [27]:
submission = pd.DataFrame(prediction, index=sample.index)
submission.columns = sample.columns
submission.head()

Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0.002168,0.100151,0.000237,0.000361,0.026685,0.00692,0.003925,0.099106,0.007447,0.000844,...,9.8e-05,0.00709,0.000694,0.038187,5e-06,0.01142,0.037389,0.023382,0.075884,0.014368
1,0.002168,0.100151,0.000237,0.000361,0.026685,0.00692,0.003925,0.099106,0.007447,0.000844,...,9.8e-05,0.00709,0.000694,0.038187,5e-06,0.01142,0.037389,0.023382,0.075884,0.014368
2,0.002168,0.100151,0.000237,0.000361,0.026685,0.00692,0.003925,0.099106,0.007447,0.000844,...,9.8e-05,0.00709,0.000694,0.038187,5e-06,0.01142,0.037389,0.023382,0.075884,0.014368
3,0.002168,0.100151,0.000237,0.000361,0.026685,0.00692,0.003925,0.099106,0.007447,0.000844,...,9.8e-05,0.00709,0.000694,0.038187,5e-06,0.01142,0.037389,0.023382,0.075884,0.014368
4,0.002168,0.100151,0.000237,0.000361,0.026685,0.00692,0.003925,0.099106,0.007447,0.000844,...,9.8e-05,0.00709,0.000694,0.038187,5e-06,0.01142,0.037389,0.023382,0.075884,0.014368


In [32]:
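from datetime import datetime

# Timestamp and description are only used by the optional naming scheme commented out below.
current_time = datetime.now().strftime("%Y.%m.%d %H:%M:%S")
description = "Use the Dates column"

# filename = "{0} {1}.csv".format(current_time, description)

# Write the submission under a fixed name; the Id index becomes the first column.
submission.to_csv("poly feature sub")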
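
The commented-out `filename` line hints at naming each submission by timestamp and description. A minimal sketch of that idea follows. Replacing the colons in the timestamp is an assumption added here, because colons are not allowed in file names on some systems; `build_filename` is a hypothetical helper and the date passed to it is arbitrary.

```python
from datetime import datetime

def build_filename(description, now=None):
    """Build a submission file name from a timestamp and a description."""
    now = now or datetime.now()
    # Hyphens instead of colons keep the name valid on more file systems.
    timestamp = now.strftime("%Y.%m.%d %H-%M-%S")
    return "{0} {1}.csv".format(timestamp, description)

# Fixed, arbitrary timestamp so the result is reproducible.
example = build_filename("poly feature sub", now=datetime(2016, 1, 31, 9, 5, 0))
print(example)
```

The result could then be passed to `submission.to_csv(...)` in place of the fixed name.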