- Dates - timestamp of the crime incident
- Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- Descript - detailed description of the crime incident (only in train.csv)
- DayOfWeek - the day of the week
- PdDistrict - name of the Police Department District
- Resolution - how the crime incident was resolved (only in train.csv)
- Address - the approximate street address of the crime incident 
- X - Longitude
- Y - Latitude

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import log_loss
import lightgbm as lgb
import catboost

In [2]:
X_train = pd.read_csv("data/train.csv")
X_test = pd.read_csv("data/test.csv")

In [3]:
X_train[:5]

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
print(X_train.duplicated().sum())
X_train.drop_duplicates(inplace=True)
assert X_train.duplicated().sum() == 0

2323
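
As a small illustration of the de-duplication step above, the sketch below uses a hypothetical three-row frame (not taken from `train.csv`). `duplicated()` flags rows that repeat an earlier row exactly, so its sum is the number of rows that `drop_duplicates()` removes.

```python
import pandas as pd

# Hypothetical toy frame: the second row repeats the first exactly.
demo = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00"] * 3,
    "Category": ["WARRANTS", "WARRANTS", "OTHER OFFENSES"],
})

# Number of rows that are identical to an earlier row.
n_dup = demo.duplicated().sum()

# Drop the repeats; reset_index is optional and only tidies the row labels.
deduped = demo.drop_duplicates().reset_index(drop=True)
print(n_dup, len(deduped))
```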


In [5]:
X_test[:5]

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [6]:
y_train = X_train['Category']
X_train_description = X_train['Descript']
X_train_resolution = X_train['Resolution']
X_train.drop(["Category", "Descript", "Resolution"], axis=1, inplace=True)

In [7]:
test_ID = X_test["Id"]
X_test.drop("Id", axis=1, inplace=True)

In [8]:
X_train[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,Wednesday,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,Wednesday,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,Wednesday,PARK,100 Block of BRODERICK ST,-122.438738,37.771541


In [9]:
X_test[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [10]:
X_train.shape

(875726, 6)

In [11]:
X_test.shape

(884262, 6)

In [12]:
y_train.value_counts()

LARCENY/THEFT                  174320
OTHER OFFENSES                 125960
NON-CRIMINAL                    91915
ASSAULT                         76815
DRUG/NARCOTIC                   53919
VEHICLE THEFT                   53706
VANDALISM                       44581
WARRANTS                        42145
BURGLARY                        36600
SUSPICIOUS OCC                  31394
MISSING PERSON                  25669
ROBBERY                         22988
FRAUD                           16637
FORGERY/COUNTERFEITING          10592
SECONDARY CODES                  9979
WEAPON LAWS                      8550
PROSTITUTION                     7446
TRESPASS                         7318
STOLEN PROPERTY                  4537
SEX OFFENSES FORCIBLE            4380
DISORDERLY CONDUCT               4313
DRUNKENNESS                      4277
RECOVERED VEHICLE                3132
KIDNAPPING                       2340
DRIVING UNDER THE INFLUENCE      2268
LIQUOR LAWS                      1899
RUNAWAY     

In [13]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
print(le.classes_)

['ARSON' 'ASSAULT' 'BAD CHECKS' 'BRIBERY' 'BURGLARY' 'DISORDERLY CONDUCT'
 'DRIVING UNDER THE INFLUENCE' 'DRUG/NARCOTIC' 'DRUNKENNESS'
 'EMBEZZLEMENT' 'EXTORTION' 'FAMILY OFFENSES' 'FORGERY/COUNTERFEITING'
 'FRAUD' 'GAMBLING' 'KIDNAPPING' 'LARCENY/THEFT' 'LIQUOR LAWS' 'LOITERING'
 'MISSING PERSON' 'NON-CRIMINAL' 'OTHER OFFENSES'
 'PORNOGRAPHY/OBSCENE MAT' 'PROSTITUTION' 'RECOVERED VEHICLE' 'ROBBERY'
 'RUNAWAY' 'SECONDARY CODES' 'SEX OFFENSES FORCIBLE'
 'SEX OFFENSES NON FORCIBLE' 'STOLEN PROPERTY' 'SUICIDE' 'SUSPICIOUS OCC'
 'TREA' 'TRESPASS' 'VANDALISM' 'VEHICLE THEFT' 'WARRANTS' 'WEAPON LAWS']
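
The sorted order of `le.classes_` matters because each integer label is an index into that array, and the submission columns are later named from the same array. A minimal sketch on a hypothetical subset of the categories:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical subset of the real category names.
labels = np.array(["WARRANTS", "ASSAULT", "LARCENY/THEFT", "ASSAULT"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ is sorted; each code is the position of the label in classes_.
print(le_demo.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings.
restored = le_demo.inverse_transform(codes)
```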


In [14]:
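# Stack train and test so the feature engineering below is applied to both identically;
# num_train is kept to split them apart again afterwards.
num_train = X_train.shape[0]
all_data = pd.concat((X_train, X_test), ignore_index=True)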

In [15]:
date = pd.to_datetime(all_data['Dates'])
all_data['year'] = date.dt.year
all_data['month'] = date.dt.month
all_data['day'] = date.dt.day
all_data['hour'] = date.dt.hour
all_data['minute'] = date.dt.minute
# all_data['second'] = date.dt.second  # all zero
all_data["n_days"] = (date - date.min()).dt.days  # whole days since the earliest timestamp
all_data.drop("Dates", axis=1, inplace=True)
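
To make the date features concrete, here is a sketch on two hypothetical timestamps in the same format as the `Dates` column. The frame name `demo` is illustrative only.

```python
import pandas as pd

# Two hypothetical timestamps in the same format as the Dates column.
demo = pd.DataFrame({"Dates": ["2015-05-10 23:59:00", "2015-05-13 23:53:00"]})
date_demo = pd.to_datetime(demo["Dates"])

demo["year"] = date_demo.dt.year
demo["hour"] = date_demo.dt.hour
demo["minute"] = date_demo.dt.minute

# Whole days elapsed since the earliest timestamp in the frame.
# The gap here is 2 days 23 h 54 min, so the second row gets 2.
demo["n_days"] = (date_demo - date_demo.min()).dt.days
print(demo)
```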

In [16]:
all_data["DayOfWeek"].value_counts()

Friday       268074
Wednesday    259228
Saturday     253507
Tuesday      251543
Thursday     251298
Monday       243529
Sunday       232809
Name: DayOfWeek, dtype: int64

In [17]:
all_data["PdDistrict"].value_counts()

SOUTHERN      313984
MISSION       240172
NORTHERN      212122
BAYVIEW       178689
CENTRAL       171397
TENDERLOIN    163389
INGLESIDE     158806
TARAVAL       132017
PARK           99360
RICHMOND       90052
Name: PdDistrict, dtype: int64

In [18]:
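# Flag block-level addresses ("1500 Block of LOMBARD ST") as opposed to
# intersections ("OAK ST / LAGUNA ST"); the raw address text is then dropped.
all_data['block'] = all_data["Address"].str.contains("block", case=False)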
all_data.drop("Address", axis=1, inplace=True)

In [19]:
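# Sum and difference of the coordinates (projections onto the two diagonals).
all_data["X+Y"] = all_data["X"] + all_data["Y"]
all_data["X-Y"] = all_data["X"] - all_data["Y"]
# XY1..XY4: squared distance to each corner of the coordinate bounding box.
all_data["XY1"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2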
all_data["XY2"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY3"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
all_data["XY4"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
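
`XY1` through `XY4` are squared Euclidean distances from each point to one corner of the bounding box spanned by the minimum and maximum of `X` and `Y`. The sketch below reproduces the same formulas on a hypothetical unit square, where the values are easy to check by hand.

```python
import pandas as pd

# Hypothetical points: the four corners of a unit square plus its centre.
pts = pd.DataFrame({"X": [0.0, 1.0, 0.0, 1.0, 0.5],
                    "Y": [0.0, 0.0, 1.0, 1.0, 0.5]})

x_min, x_max = pts["X"].min(), pts["X"].max()
y_min, y_max = pts["Y"].min(), pts["Y"].max()

# Squared distance to each corner of the bounding box.
pts["XY1"] = (pts["X"] - x_min) ** 2 + (pts["Y"] - y_min) ** 2  # corner (x_min, y_min)
pts["XY2"] = (x_max - pts["X"]) ** 2 + (pts["Y"] - y_min) ** 2  # corner (x_max, y_min)
pts["XY3"] = (pts["X"] - x_min) ** 2 + (y_max - pts["Y"]) ** 2  # corner (x_min, y_max)
pts["XY4"] = (x_max - pts["X"]) ** 2 + (y_max - pts["Y"]) ** 2  # corner (x_max, y_max)
print(pts)
```

Each corner point has distance 0 to itself and 2 to the opposite corner, and the centre is 0.5 from all four.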

In [20]:
categorical_features = ["DayOfWeek", "PdDistrict", "block"]
ct = ColumnTransformer(transformers=[("categorical_features", OrdinalEncoder(), categorical_features)],
                       remainder="passthrough")
all_data = ct.fit_transform(all_data)

In [21]:
X_train = all_data[:num_train]
X_test = all_data[num_train:]
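
The CatBoost cells in this notebook pass `cat_features=np.arange(len(categorical_features))`, which relies on the column order that `ColumnTransformer` produces: columns handled by a named transformer come first, and `remainder="passthrough"` columns are appended after them. A minimal sketch with a hypothetical two-column frame, where the categorical column is deliberately placed last in the input:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame; the categorical column is the LAST input column.
demo = pd.DataFrame({"X": [-122.4, -122.5, -122.4],
                     "PdDistrict": ["PARK", "BAYVIEW", "PARK"]})

ct_demo = ColumnTransformer(
    transformers=[("cat", OrdinalEncoder(), ["PdDistrict"])],
    remainder="passthrough")
out = ct_demo.fit_transform(demo)

# Column 0 is the encoded PdDistrict (BAYVIEW -> 0, PARK -> 1);
# column 1 is the untouched X value.
print(out)
```

The result is a plain NumPy array, so the original column names are gone and positions are the only way to refer to the categorical columns afterwards.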

In [22]:
def cross_val_score_prob(clf, X, y):
    scores = []
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_index, test_index in cv.split(X, y):
        est = clone(clf)
        est.fit(X[train_index], y[train_index])
        prob = est.predict_proba(X[test_index])
        scores.append(log_loss(y[test_index], prob))
    return scores
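
A usage sketch for `cross_val_score_prob`, run on synthetic data with a small scikit-learn model rather than CatBoost so that it finishes in seconds. The function body is repeated so the block runs on its own; `make_classification`, `LogisticRegression`, and the `*_demo` names are stand-ins chosen for illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def cross_val_score_prob(clf, X, y):
    """Return the multiclass log loss of clf on each of 3 stratified folds."""
    scores = []
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_index, test_index in cv.split(X, y):
        est = clone(clf)  # fresh, unfitted copy for every fold
        est.fit(X[train_index], y[train_index])
        prob = est.predict_proba(X[test_index])
        scores.append(log_loss(y[test_index], prob))
    return scores

# Synthetic 3-class problem as a stand-in for the real features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=0)
scores = cross_val_score_prob(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(np.mean(scores), np.std(scores))
```

Stratified folds keep every class present in both the training and held-out part of each split, which `log_loss` needs when it is called without an explicit `labels` argument.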

In [23]:
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, random_state=0, stratify=y_train)
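# The encoded categorical columns are the first len(categorical_features) columns
# of the ColumnTransformer output, so they are referenced by position.
clf = catboost.CatBoostClassifier(n_estimators=5000, learning_rate=0.05,
                                  cat_features=np.arange(len(categorical_features)),
                                  random_seed=0, task_type="GPU", verbose=50)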
clf.fit(X_train_1, y_train_1, eval_set=(X_train_2, y_train_2))

0:	learn: 3.4709277	test: 3.4703963	best: 3.4703963 (0)	total: 226ms	remaining: 18m 48s
50:	learn: 2.4329643	test: 2.4334968	best: 2.4334968 (50)	total: 11s	remaining: 17m 49s
100:	learn: 2.3721599	test: 2.3743691	best: 2.3743691 (100)	total: 21.6s	remaining: 17m 27s
150:	learn: 2.3486693	test: 2.3529308	best: 2.3529308 (150)	total: 32.1s	remaining: 17m 11s
200:	learn: 2.3321422	test: 2.3384142	best: 2.3384142 (200)	total: 42.7s	remaining: 16m 58s
250:	learn: 2.3193347	test: 2.3277720	best: 2.3277720 (250)	total: 53.3s	remaining: 16m 48s
300:	learn: 2.3081048	test: 2.3185488	best: 2.3185488 (300)	total: 1m 4s	remaining: 16m 39s
350:	learn: 2.2987179	test: 2.3113592	best: 2.3113592 (350)	total: 1m 14s	remaining: 16m 30s
400:	learn: 2.2906905	test: 2.3053063	best: 2.3053063 (400)	total: 1m 25s	remaining: 16m 20s
450:	learn: 2.2832250	test: 2.3001076	best: 2.3001076 (450)	total: 1m 36s	remaining: 16m 11s
500:	learn: 2.2760367	test: 2.2950292	best: 2.2950292 (500)	total: 1m 47s	remaining: 

4400:	learn: 2.0682282	test: 2.2365904	best: 2.2365766 (4331)	total: 15m 49s	remaining: 2m 9s
4450:	learn: 2.0662727	test: 2.2364910	best: 2.2364887 (4436)	total: 16m	remaining: 1m 58s
4500:	learn: 2.0642231	test: 2.2364516	best: 2.2364516 (4500)	total: 16m 11s	remaining: 1m 47s
4550:	learn: 2.0623308	test: 2.2363244	best: 2.2363217 (4549)	total: 16m 22s	remaining: 1m 36s
4600:	learn: 2.0604979	test: 2.2361522	best: 2.2361522 (4600)	total: 16m 32s	remaining: 1m 26s
4650:	learn: 2.0584654	test: 2.2360048	best: 2.2360019 (4648)	total: 16m 43s	remaining: 1m 15s
4700:	learn: 2.0565801	test: 2.2358750	best: 2.2358749 (4699)	total: 16m 54s	remaining: 1m 4s
4750:	learn: 2.0547495	test: 2.2358445	best: 2.2358302 (4740)	total: 17m 5s	remaining: 53.8s
4800:	learn: 2.0529496	test: 2.2357334	best: 2.2357300 (4798)	total: 17m 16s	remaining: 43s
4850:	learn: 2.0511561	test: 2.2356628	best: 2.2356442 (4840)	total: 17m 27s	remaining: 32.2s
4900:	learn: 2.0491671	test: 2.2355223	best: 2.2355209 (4896)	

<catboost.core.CatBoostClassifier at 0x7f9756e1ec88>

In [24]:
# Fresh, unfitted model with the same settings, to be trained on the full training set.
clf = catboost.CatBoostClassifier(n_estimators=5000, learning_rate=0.05,
                                  cat_features=np.arange(len(categorical_features)),
                                  random_seed=0, task_type="GPU", verbose=50)

In [25]:
# scores = cross_val_score_prob(clf, X_train, y_train)
# print(np.mean(scores), np.std(scores))

In [26]:
clf.fit(X_train, y_train)
prob = clf.predict_proba(X_test)

0:	learn: 3.4702610	total: 294ms	remaining: 24m 31s
50:	learn: 2.4327384	total: 14.6s	remaining: 23m 36s
100:	learn: 2.3704925	total: 28.7s	remaining: 23m 10s
150:	learn: 2.3468701	total: 42.7s	remaining: 22m 51s
200:	learn: 2.3310041	total: 56.7s	remaining: 22m 33s
250:	learn: 2.3189392	total: 1m 10s	remaining: 22m 17s
300:	learn: 2.3084334	total: 1m 24s	remaining: 22m 1s
350:	learn: 2.2996686	total: 1m 38s	remaining: 21m 46s
400:	learn: 2.2916028	total: 1m 52s	remaining: 21m 33s
450:	learn: 2.2846580	total: 2m 6s	remaining: 21m 18s
500:	learn: 2.2785305	total: 2m 20s	remaining: 21m 4s
550:	learn: 2.2730487	total: 2m 34s	remaining: 20m 49s
600:	learn: 2.2677665	total: 2m 48s	remaining: 20m 35s
650:	learn: 2.2629113	total: 3m 2s	remaining: 20m 21s
700:	learn: 2.2583990	total: 3m 16s	remaining: 20m 6s
750:	learn: 2.2540455	total: 3m 30s	remaining: 19m 52s
800:	learn: 2.2501834	total: 3m 44s	remaining: 19m 38s
850:	learn: 2.2463076	total: 3m 58s	remaining: 19m 24s
900:	learn: 2.2428118	t

In [27]:
%%time
submission = pd.DataFrame(np.c_[test_ID, prob], columns=["Id"] + list(le.classes_))
submission["Id"] = submission["Id"].astype(int)
os.makedirs("submission", exist_ok=True)  # to_csv does not create missing directories
submission.to_csv("submission/v1.gz", compression="gzip", index=False)

CPU times: user 2min 30s, sys: 0 ns, total: 2min 30s
Wall time: 2min 31s
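
Before uploading, it can help to sanity-check the submission layout: one `Id` column followed by one probability column per class, with each row summing to 1. The sketch below uses hypothetical stand-ins for `test_ID`, `prob`, and `le.classes_`. It also shows one possible alternative construction that inserts `Id` as its own column, so the ids keep their integer dtype without a later cast.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_ID, prob, and le.classes_.
ids = np.array([0, 1, 2])
prob_demo = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])
classes = ["ASSAULT", "LARCENY/THEFT", "WARRANTS"]

# Build the probability columns first, then prepend Id as an integer column.
sub = pd.DataFrame(prob_demo, columns=classes)
sub.insert(0, "Id", ids)

# Layout checks: expected shape, column order, and rows that sum to 1.
assert sub.shape == (len(ids), 1 + len(classes))
assert list(sub.columns) == ["Id"] + classes
assert np.allclose(sub[classes].sum(axis=1), 1.0)
print(sub)
```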
