- Dates - timestamp of the crime incident
- Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- Descript - detailed description of the crime incident (only in train.csv)
- DayOfWeek - the day of the week
- PdDistrict - name of the Police Department District
- Resolution - how the crime incident was resolved (only in train.csv)
- Address - the approximate street address of the crime incident 
- X - Longitude
- Y - Latitude

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import log_loss
import lightgbm as lgb
import catboost

In [2]:
X_train = pd.read_csv("data/train.csv")
X_test = pd.read_csv("data/test.csv")

In [3]:
X_train[:5]

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
print(X_train.duplicated().sum())
X_train.drop_duplicates(inplace=True)
assert X_train.duplicated().sum() == 0

2323


In [5]:
X_test[:5]

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [6]:
y_train = X_train['Category']
X_train_description = X_train['Descript']
X_train_resolution = X_train['Resolution']
X_train.drop(["Category", "Descript", "Resolution"], axis=1, inplace=True)

In [7]:
test_ID = X_test["Id"]
X_test.drop("Id", axis=1, inplace=True)

In [8]:
X_train[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,Wednesday,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,Wednesday,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,Wednesday,PARK,100 Block of BRODERICK ST,-122.438738,37.771541


In [9]:
X_test[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [10]:
X_train.shape

(875726, 6)

In [11]:
X_test.shape

(884262, 6)

In [12]:
y_train.value_counts()

LARCENY/THEFT                  174320
OTHER OFFENSES                 125960
NON-CRIMINAL                    91915
ASSAULT                         76815
DRUG/NARCOTIC                   53919
VEHICLE THEFT                   53706
VANDALISM                       44581
WARRANTS                        42145
BURGLARY                        36600
SUSPICIOUS OCC                  31394
MISSING PERSON                  25669
ROBBERY                         22988
FRAUD                           16637
FORGERY/COUNTERFEITING          10592
SECONDARY CODES                  9979
WEAPON LAWS                      8550
PROSTITUTION                     7446
TRESPASS                         7318
STOLEN PROPERTY                  4537
SEX OFFENSES FORCIBLE            4380
DISORDERLY CONDUCT               4313
DRUNKENNESS                      4277
RECOVERED VEHICLE                3132
KIDNAPPING                       2340
DRIVING UNDER THE INFLUENCE      2268
LIQUOR LAWS                      1899
RUNAWAY     

In [13]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
print(le.classes_)

['ARSON' 'ASSAULT' 'BAD CHECKS' 'BRIBERY' 'BURGLARY' 'DISORDERLY CONDUCT'
 'DRIVING UNDER THE INFLUENCE' 'DRUG/NARCOTIC' 'DRUNKENNESS'
 'EMBEZZLEMENT' 'EXTORTION' 'FAMILY OFFENSES' 'FORGERY/COUNTERFEITING'
 'FRAUD' 'GAMBLING' 'KIDNAPPING' 'LARCENY/THEFT' 'LIQUOR LAWS' 'LOITERING'
 'MISSING PERSON' 'NON-CRIMINAL' 'OTHER OFFENSES'
 'PORNOGRAPHY/OBSCENE MAT' 'PROSTITUTION' 'RECOVERED VEHICLE' 'ROBBERY'
 'RUNAWAY' 'SECONDARY CODES' 'SEX OFFENSES FORCIBLE'
 'SEX OFFENSES NON FORCIBLE' 'STOLEN PROPERTY' 'SUICIDE' 'SUSPICIOUS OCC'
 'TREA' 'TRESPASS' 'VANDALISM' 'VEHICLE THEFT' 'WARRANTS' 'WEAPON LAWS']


In [14]:
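# Stack train and test so the same feature engineering and encoding apply to both;
# num_train marks where to split them apart again.
num_train = X_train.shape[0]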
all_data = pd.concat((X_train, X_test), ignore_index=True)

In [15]:
date = pd.to_datetime(all_data['Dates'])
all_data['year'] = date.dt.year
all_data['month'] = date.dt.month
all_data['day'] = date.dt.day
all_data['hour'] = date.dt.hour
all_data['minute'] = date.dt.minute
# all_data['second'] = date.dt.second  # all zero
all_data["n_days"] = (date - date.min()).dt.days  # whole days since the earliest timestamp
all_data.drop("Dates", axis=1, inplace=True)
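
The n_days feature counts whole 24-hour periods elapsed since the earliest timestamp, not calendar-day boundaries crossed. A minimal sketch on three made-up timestamps (the `toy` frame is hypothetical, not part of the dataset) shows the difference. It uses the `.dt.days` accessor, which gives the same result as reading `.days` from each timedelta.

```python
import pandas as pd

# Toy timestamps (hypothetical, chosen to straddle midnight).
toy = pd.DataFrame({"Dates": ["2015-05-10 23:59:00",
                              "2015-05-13 23:30:00",
                              "2015-05-11 00:01:00"]})
date = pd.to_datetime(toy["Dates"])

toy["hour"] = date.dt.hour
# Whole days elapsed since the earliest timestamp; partial days are truncated.
toy["n_days"] = (date - date.min()).dt.days

print(toy[["hour", "n_days"]])
```

The third row falls on the next calendar day but only two minutes after the minimum, so its n_days is 0.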

In [16]:
all_data["DayOfWeek"].value_counts()

Friday       268074
Wednesday    259228
Saturday     253507
Tuesday      251543
Thursday     251298
Monday       243529
Sunday       232809
Name: DayOfWeek, dtype: int64

In [17]:
all_data["PdDistrict"].value_counts()

SOUTHERN      313984
MISSION       240172
NORTHERN      212122
BAYVIEW       178689
CENTRAL       171397
TENDERLOIN    163389
INGLESIDE     158806
TARAVAL       132017
PARK           99360
RICHMOND       90052
Name: PdDistrict, dtype: int64

In [18]:
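# Flag block-level addresses ("1500 Block of LOMBARD ST") versus intersections ("OAK ST / LAGUNA ST")
all_data['block'] = all_data["Address"].str.contains("block", case=False)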
all_data.drop("Address", axis=1, inplace=True)
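
As a quick check of what the block flag captures, here is a small sketch using two addresses from the rows displayed earlier. The `addr` series is built only for illustration.

```python
import pandas as pd

# Two addresses taken from the rows shown earlier in this notebook.
addr = pd.Series(["OAK ST / LAGUNA ST", "1500 Block of LOMBARD ST"])

# True for block-level addresses, False for street intersections.
is_block = addr.str.contains("block", case=False)
print(is_block.tolist())  # [False, True]
```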

In [19]:
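# Diagonal axes (X+Y, X-Y) plus squared distances to the four corners of the bounding box (XY1..XY4)
all_data["X+Y"] = all_data["X"] + all_data["Y"]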
all_data["X-Y"] = all_data["X"] - all_data["Y"]
all_data["XY1"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY2"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY3"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
all_data["XY4"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
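
The XY1 to XY4 columns are squared Euclidean distances from each point to the four corners of the coordinate bounding box, and X+Y / X-Y are the coordinates along the two diagonals (up to a constant factor). Presumably these help a tree model, which otherwise splits only on axis-aligned thresholds. A minimal sketch on toy coordinates (the `pts` frame and its values are hypothetical, not real longitudes or latitudes):

```python
import pandas as pd

# Toy coordinates (hypothetical values for illustration only).
pts = pd.DataFrame({"X": [0.0, 3.0, 1.0], "Y": [0.0, 4.0, 2.0]})

x_min, x_max = pts["X"].min(), pts["X"].max()
y_min, y_max = pts["Y"].min(), pts["Y"].max()

# Squared distance to the (x_min, y_min) corner, as in XY1.
pts["XY1"] = (pts["X"] - x_min) ** 2 + (pts["Y"] - y_min) ** 2
# Squared distance to the (x_max, y_max) corner, as in XY4.
pts["XY4"] = (x_max - pts["X"]) ** 2 + (y_max - pts["Y"]) ** 2

print(pts)
```

Because the corners come from the observed min and max, any extreme coordinate in the data moves the reference points for every row.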

In [20]:
categorical_features = ["DayOfWeek", "PdDistrict", "year", "month", "day", "hour", "minute", "block"]
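# Ordinal-encode the categorical columns; they come first in the output array,
# followed by the untouched numeric columns (remainder="passthrough").
ct = ColumnTransformer(transformers=[("categorical_features", OrdinalEncoder(), categorical_features)],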
                       remainder="passthrough")
all_data = ct.fit_transform(all_data)

In [21]:
X_train = all_data[:num_train]
X_test = all_data[num_train:]

In [22]:
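def cross_val_score_prod(clf, X, y):
    """Return the per-fold log loss of `clf` under 3-fold stratified CV."""
    scores = []
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_index, test_index in cv.split(X, y):
        est = clone(clf)  # fresh, unfitted copy for each fold
        # The ColumnTransformer puts the encoded categorical columns first,
        # so their indices are 0..len(categorical_features)-1.
        est.fit(X[train_index], y[train_index],
                cat_features=np.arange(len(categorical_features)))
        prob = est.predict_proba(X[test_index])
        scores.append(log_loss(y[test_index], prob))
    return scores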

In [23]:
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, random_state=0, stratify=y_train)
clf = catboost.CatBoostClassifier(n_estimators=5000, learning_rate=0.05, random_seed=0, task_type="GPU", verbose=50)
clf.fit(X_train_1, y_train_1, eval_set=(X_train_2, y_train_2))

0:	learn: 3.4715295	test: 3.4712451	best: 3.4712451 (0)	total: 81.5ms	remaining: 6m 47s
50:	learn: 2.4366165	test: 2.4375908	best: 2.4375908 (50)	total: 3.86s	remaining: 6m 14s
100:	learn: 2.3741110	test: 2.3767373	best: 2.3767373 (100)	total: 7.54s	remaining: 6m 5s
150:	learn: 2.3483280	test: 2.3528554	best: 2.3528554 (150)	total: 11.2s	remaining: 5m 59s
200:	learn: 2.3310937	test: 2.3377530	best: 2.3377530 (200)	total: 14.9s	remaining: 5m 55s
250:	learn: 2.3180034	test: 2.3264521	best: 2.3264521 (250)	total: 18.5s	remaining: 5m 50s
300:	learn: 2.3074672	test: 2.3176698	best: 2.3176698 (300)	total: 22.2s	remaining: 5m 46s
350:	learn: 2.2983276	test: 2.3103608	best: 2.3103608 (350)	total: 25.8s	remaining: 5m 42s
400:	learn: 2.2903115	test: 2.3043886	best: 2.3043886 (400)	total: 29.5s	remaining: 5m 38s
450:	learn: 2.2830372	test: 2.2992941	best: 2.2992941 (450)	total: 33.2s	remaining: 5m 34s
500:	learn: 2.2762769	test: 2.2946217	best: 2.2946217 (500)	total: 36.9s	remaining: 5m 31s
550:	

4450:	learn: 2.0537505	test: 2.2356875	best: 2.2356875 (4450)	total: 5m 39s	remaining: 41.9s
4500:	learn: 2.0518298	test: 2.2355839	best: 2.2355771 (4494)	total: 5m 43s	remaining: 38.1s
4550:	learn: 2.0499255	test: 2.2356055	best: 2.2355653 (4529)	total: 5m 47s	remaining: 34.3s
4600:	learn: 2.0480109	test: 2.2355203	best: 2.2355190 (4599)	total: 5m 51s	remaining: 30.5s
4650:	learn: 2.0460291	test: 2.2354240	best: 2.2354240 (4650)	total: 5m 55s	remaining: 26.7s
4700:	learn: 2.0440627	test: 2.2353616	best: 2.2353616 (4700)	total: 5m 59s	remaining: 22.9s
4750:	learn: 2.0420697	test: 2.2352661	best: 2.2352530 (4736)	total: 6m 3s	remaining: 19.1s
4800:	learn: 2.0401334	test: 2.2351912	best: 2.2351912 (4800)	total: 6m 7s	remaining: 15.2s
4850:	learn: 2.0382473	test: 2.2350858	best: 2.2350780 (4848)	total: 6m 11s	remaining: 11.4s
4900:	learn: 2.0363723	test: 2.2350333	best: 2.2350333 (4900)	total: 6m 15s	remaining: 7.59s
4950:	learn: 2.0345266	test: 2.2349474	best: 2.2349474 (4950)	total: 6m 

<catboost.core.CatBoostClassifier at 0x7fcee1e04da0>

In [24]:
clf = catboost.CatBoostClassifier(n_estimators=5000, learning_rate=0.05, random_seed=0, task_type="GPU", verbose=50)

In [25]:
# scores = cross_val_score_prod(clf, X_train, y_train)
# print(np.mean(scores), np.std(scores))
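
For reference when reading the scores, multiclass log loss is the mean negative log of the probability assigned to the true class. The sketch below is a hand-rolled version for illustration only; the helper name `multiclass_log_loss` and the clipping constant are assumptions, and scikit-learn's `log_loss` may clip differently. It computes the baseline for a uniform prediction over the 39 categories listed by `le.classes_`.

```python
import math

def multiclass_log_loss(y_true, prob):
    """Mean negative log of the probability assigned to the true class."""
    eps = 1e-15  # assumed clipping value to avoid log(0)
    total = 0.0
    for label, row in zip(y_true, prob):
        p = min(max(row[label], eps), 1.0)
        total -= math.log(p)
    return total / len(y_true)

n_classes = 39  # number of entries in le.classes_ printed above
# Every row predicts each class with equal probability.
uniform = [[1.0 / n_classes] * n_classes for _ in range(4)]
baseline = multiclass_log_loss([0, 5, 16, 38], uniform)
print(round(baseline, 4))  # ln(39), about 3.6636
```

Any model worth submitting should score below this baseline; the held-out loss in the training log above ends near 2.235.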

In [26]:
clf.fit(X_train, y_train)
prob = clf.predict_proba(X_test)

0:	learn: 3.4710514	total: 104ms	remaining: 8m 39s
50:	learn: 2.4351041	total: 4.93s	remaining: 7m 58s
100:	learn: 2.3735662	total: 9.65s	remaining: 7m 47s
150:	learn: 2.3498073	total: 14.4s	remaining: 7m 41s
200:	learn: 2.3325451	total: 19.1s	remaining: 7m 37s
250:	learn: 2.3202509	total: 23.9s	remaining: 7m 32s
300:	learn: 2.3098051	total: 28.7s	remaining: 7m 28s
350:	learn: 2.3000980	total: 33.6s	remaining: 7m 24s
400:	learn: 2.2922514	total: 38.4s	remaining: 7m 20s
450:	learn: 2.2851027	total: 43.2s	remaining: 7m 15s
500:	learn: 2.2788887	total: 48s	remaining: 7m 10s
550:	learn: 2.2730053	total: 52.7s	remaining: 7m 5s
600:	learn: 2.2678372	total: 57.6s	remaining: 7m 1s
650:	learn: 2.2628399	total: 1m 2s	remaining: 6m 57s
700:	learn: 2.2581450	total: 1m 7s	remaining: 6m 52s
750:	learn: 2.2535406	total: 1m 12s	remaining: 6m 47s
800:	learn: 2.2494025	total: 1m 16s	remaining: 6m 42s
850:	learn: 2.2454679	total: 1m 21s	remaining: 6m 37s
900:	learn: 2.2417394	total: 1m 26s	remaining: 6m 

In [27]:
%%time
submission = pd.DataFrame(np.c_[test_ID, prob], columns=["Id"] + list(le.classes_))
submission["Id"] = submission["Id"].astype(int)
submission.to_csv("submission/v5.gz", compression="gzip", index=False)

CPU times: user 2min 32s, sys: 109 ms, total: 2min 32s
Wall time: 2min 32s
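
The submission frame has one Id column followed by one probability column per class, in the order of `le.classes_`. A small sketch of the same construction on stand-in data (the `ids`, `toy_prob` and `classes` values are hypothetical placeholders for `test_ID`, `prob` and `le.classes_`):

```python
import numpy as np
import pandas as pd

ids = np.array([0, 1])                         # stands in for test_ID
toy_prob = np.array([[0.7, 0.3], [0.2, 0.8]])  # stands in for clf.predict_proba(X_test)
classes = ["ARSON", "ASSAULT"]                 # stands in for le.classes_

# np.c_ stacks the id column next to the probability matrix.
sub = pd.DataFrame(np.c_[ids, toy_prob], columns=["Id"] + classes)
# The stacked array is float, so cast Id back to int.
sub["Id"] = sub["Id"].astype(int)

print(sub)
```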
