- Dates - timestamp of the crime incident
- Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- Descript - detailed description of the crime incident (only in train.csv)
- DayOfWeek - the day of the week
- PdDistrict - name of the Police Department District
- Resolution - how the crime incident was resolved (only in train.csv)
- Address - the approximate street address of the crime incident 
- X - Longitude
- Y - Latitude

In [1]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import log_loss
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
import catboost
import gensim

In [2]:
X_train = pd.read_csv("data/train.csv")
X_test = pd.read_csv("data/test.csv")

In [3]:
X_train[:5]

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
print(X_train.duplicated().sum())
X_train.drop_duplicates(inplace=True)
assert X_train.duplicated().sum() == 0

2323


In [5]:
X_test[:5]

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [6]:
y_train = X_train['Category']
X_train_description = X_train['Descript']
X_train_resolution = X_train['Resolution']
X_train.drop(["Category", "Descript", "Resolution"], axis=1, inplace=True)

In [7]:
test_ID = X_test["Id"]
X_test.drop("Id", axis=1, inplace=True)

In [8]:
X_train[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,Wednesday,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,Wednesday,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,Wednesday,PARK,100 Block of BRODERICK ST,-122.438738,37.771541


In [9]:
X_test[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [10]:
X_train.shape

(875726, 6)

In [11]:
X_test.shape

(884262, 6)

In [12]:
y_train.value_counts()

LARCENY/THEFT                  174320
OTHER OFFENSES                 125960
NON-CRIMINAL                    91915
ASSAULT                         76815
DRUG/NARCOTIC                   53919
VEHICLE THEFT                   53706
VANDALISM                       44581
WARRANTS                        42145
BURGLARY                        36600
SUSPICIOUS OCC                  31394
MISSING PERSON                  25669
ROBBERY                         22988
FRAUD                           16637
FORGERY/COUNTERFEITING          10592
SECONDARY CODES                  9979
WEAPON LAWS                      8550
PROSTITUTION                     7446
TRESPASS                         7318
STOLEN PROPERTY                  4537
SEX OFFENSES FORCIBLE            4380
DISORDERLY CONDUCT               4313
DRUNKENNESS                      4277
RECOVERED VEHICLE                3132
KIDNAPPING                       2340
DRIVING UNDER THE INFLUENCE      2268
LIQUOR LAWS                      1899
RUNAWAY     

In [13]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
print(le.classes_)

['ARSON' 'ASSAULT' 'BAD CHECKS' 'BRIBERY' 'BURGLARY' 'DISORDERLY CONDUCT'
 'DRIVING UNDER THE INFLUENCE' 'DRUG/NARCOTIC' 'DRUNKENNESS'
 'EMBEZZLEMENT' 'EXTORTION' 'FAMILY OFFENSES' 'FORGERY/COUNTERFEITING'
 'FRAUD' 'GAMBLING' 'KIDNAPPING' 'LARCENY/THEFT' 'LIQUOR LAWS' 'LOITERING'
 'MISSING PERSON' 'NON-CRIMINAL' 'OTHER OFFENSES'
 'PORNOGRAPHY/OBSCENE MAT' 'PROSTITUTION' 'RECOVERED VEHICLE' 'ROBBERY'
 'RUNAWAY' 'SECONDARY CODES' 'SEX OFFENSES FORCIBLE'
 'SEX OFFENSES NON FORCIBLE' 'STOLEN PROPERTY' 'SUICIDE' 'SUSPICIOUS OCC'
 'TREA' 'TRESPASS' 'VANDALISM' 'VEHICLE THEFT' 'WARRANTS' 'WEAPON LAWS']


In [14]:
num_train = X_train.shape[0]
all_data = pd.concat((X_train, X_test), ignore_index=True)

In [15]:
date = pd.to_datetime(all_data['Dates'])
all_data['year'] = date.dt.year
all_data['month'] = date.dt.month
all_data['day'] = date.dt.day
all_data['hour'] = date.dt.hour
all_data['minute'] = date.dt.minute
all_data['special_time'] = all_data['minute'].isin([0, 30]).astype(int)
# all_data['second'] = date.dt.second  # all zero
all_data["n_days"] = (date - date.min()).dt.days  # whole days since the earliest timestamp
all_data.drop("Dates", axis=1, inplace=True)
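
The date features built in this cell can be checked on a handful of timestamps. The following is a minimal sketch, assuming a toy `Dates` column invented for illustration (the frame `toy` is hypothetical and its rows are not taken from the dataset).

```python
import pandas as pd

# Toy timestamps, made up for illustration only.
toy = pd.DataFrame({"Dates": ["2015-05-13 23:53:00",
                              "2015-05-13 23:30:00",
                              "2015-05-10 12:00:00"]})
date = pd.to_datetime(toy["Dates"])

toy["hour"] = date.dt.hour
toy["minute"] = date.dt.minute
# Flag timestamps recorded exactly on the hour or on the half hour.
toy["special_time"] = toy["minute"].isin([0, 30]).astype(int)
# Whole days elapsed since the earliest timestamp in the series.
toy["n_days"] = (date - date.min()).dt.days

print(toy[["hour", "minute", "special_time", "n_days"]])
```

`n_days` gives the model a monotonic time index next to the cyclic parts (month, day, hour), while `special_time` marks rounded timestamps.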

In [16]:
all_data["DayOfWeek"].value_counts()

Friday       268074
Wednesday    259228
Saturday     253507
Tuesday      251543
Thursday     251298
Monday       243529
Sunday       232809
Name: DayOfWeek, dtype: int64

In [17]:
all_data["PdDistrict"].value_counts()

SOUTHERN      313984
MISSION       240172
NORTHERN      212122
BAYVIEW       178689
CENTRAL       171397
TENDERLOIN    163389
INGLESIDE     158806
TARAVAL       132017
PARK           99360
RICHMOND       90052
Name: PdDistrict, dtype: int64

In [18]:
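# Tokenize each address on spaces and train Word2Vec on the token lists.
sentences = []
for s in all_data["Address"]:
    sentences.append(s.split(" "))
address_model = gensim.models.Word2Vec(sentences, min_count=1)
# Encode each address as the mean of its token vectors (100 is Word2Vec's default vector size).
encoded_address = np.zeros((all_data.shape[0], 100))
for i in range(len(sentences)):
    for j in range(len(sentences[i])):
        encoded_address[i] += address_model.wv[sentences[i][j]]
    encoded_address[i] /= len(sentences[i])  # divide row i (not row j) by its token count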
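
Mean pooling of token vectors, the idea behind `encoded_address`, can be sketched without gensim. This is a minimal illustration, assuming a hypothetical dictionary `toy_vectors` of 3-dimensional embeddings in place of `address_model.wv`; the helper `mean_pool` is not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional token embeddings standing in for address_model.wv.
toy_vectors = {
    "OAK": np.array([1.0, 0.0, 0.0]),
    "ST": np.array([0.0, 1.0, 0.0]),
    "/": np.array([0.0, 0.0, 1.0]),
    "LAGUNA": np.array([1.0, 1.0, 1.0]),
}

def mean_pool(tokens, vectors, dim):
    """Average the vectors of all tokens; return zeros for an empty list."""
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)

addresses = ["OAK ST / LAGUNA ST", "OAK ST"]
# One row per address, each divided by that address's own token count.
encoded = np.vstack([mean_pool(a.split(" "), toy_vectors, 3) for a in addresses])
print(encoded)
```

Each row is normalized by the length of its own token list, so long intersection names and short block addresses end up on the same scale.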

In [19]:
all_data['block'] = all_data["Address"].str.contains("block", case=False)
all_data.drop("Address", axis=1, inplace=True)

In [20]:
print(all_data["X"].min(), all_data["X"].max())
print(all_data["Y"].min(), all_data["Y"].max())

-122.51364206429 -120.5
37.7078790224135 90.0


In [21]:
X_median = all_data[all_data["X"] < -120.5]["X"].median()
Y_median = all_data[all_data["Y"] < 90]["Y"].median()
all_data.loc[all_data["X"] >= -120.5, "X"] = X_median
all_data.loc[all_data["Y"] >= 90, "Y"] = Y_median
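
The replacement of out-of-range coordinates can be illustrated on a toy array. This is a sketch under assumptions: the longitudes below are invented, and only the last one mimics the maximum of -120.5 printed above.

```python
import numpy as np

# Toy longitudes; the last value mimics the out-of-range maximum printed above.
x = np.array([-122.43, -122.40, -122.41, -120.5])

valid = x < -120.5                      # rows with a plausible longitude
x_median = np.median(x[valid])          # median over the valid rows only
x_clean = np.where(valid, x, x_median)  # overwrite the out-of-range rows
print(x_clean)
```

Computing the median over the valid rows only keeps the placeholder values from shifting the fill value.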

In [22]:
print(all_data["X"].min(), all_data["X"].max())
print(all_data["Y"].min(), all_data["Y"].max())

-122.51364206429 -122.364750704393
37.7078790224135 37.82062083807021


In [23]:
all_data["X+Y"] = all_data["X"] + all_data["Y"]
all_data["X-Y"] = all_data["X"] - all_data["Y"]
all_data["XY30_1"] = all_data["X"] * np.cos(np.pi / 6) + all_data["Y"] * np.sin(np.pi / 6)
all_data["XY30_2"] = all_data["Y"] * np.cos(np.pi / 6) - all_data["X"] * np.sin(np.pi / 6)
all_data["XY60_1"] = all_data["X"] * np.cos(np.pi / 3) + all_data["Y"] * np.sin(np.pi / 3)
all_data["XY60_2"] = all_data["Y"] * np.cos(np.pi / 3) - all_data["X"] * np.sin(np.pi / 3)
all_data["XY1"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY2"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY3"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
all_data["XY4"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
all_data["XY5"] = (all_data["X"] - X_median) ** 2 + (all_data["Y"] - Y_median) ** 2
pca = PCA(n_components=2).fit(all_data[["X", "Y"]])
XYt = pca.transform(all_data[["X", "Y"]])
all_data["XYpca1"] = XYt[:, 0]
all_data["XYpca2"] = XYt[:, 1]
# n_components selected by aic/bic
clf = GaussianMixture(n_components=150, covariance_type="diag",
                      random_state=0).fit(all_data[["X", "Y"]])
all_data["XYcluster"] = clf.predict(all_data[["X", "Y"]])
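
The `XY30_*` and `XY60_*` columns are the coordinates of each point in axes rotated by 30 and 60 degrees. Tree models split on axis-aligned thresholds, so rotated copies presumably let them cut the map along diagonal directions. A minimal sketch, assuming toy points and a hypothetical helper `rotate_axes` that repeats the formulas from the cell:

```python
import numpy as np

def rotate_axes(x, y, theta):
    """Coordinates of (x, y) in axes rotated counterclockwise by theta radians."""
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = y * np.cos(theta) - x * np.sin(theta)
    return x_rot, y_rot

# Toy points, not taken from the dataset.
x = np.array([1.0, 0.0, 3.0])
y = np.array([0.0, 1.0, 4.0])

x30, y30 = rotate_axes(x, y, np.pi / 6)
# A rotation preserves each point's distance from the origin.
print(np.hypot(x, y), np.hypot(x30, y30))
```

The two printed arrays agree, which confirms that the transformation only changes the direction of the axes and not the geometry of the points.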

In [24]:
categorical_features = ["DayOfWeek", "PdDistrict", "block", "special_time", "XYcluster"]
ct = ColumnTransformer(transformers=[("categorical_features", OrdinalEncoder(), categorical_features)],
                       remainder="passthrough")
all_data = ct.fit_transform(all_data)

In [25]:
all_data = np.hstack((all_data, encoded_address))

In [26]:
X_train = all_data[:num_train]
X_test = all_data[num_train:]

In [27]:
def cross_val_score_prod(clf, X, y):
    """Return the log loss of clf on each fold of a stratified 3-fold split."""
    scores = []
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_index, test_index in cv.split(X, y):
        est = clone(clf)  # fit a fresh, unfitted copy on every fold
        est.fit(X[train_index], y[train_index])
        prob = est.predict_proba(X[test_index])
        scores.append(log_loss(y[test_index], prob))
    return scores
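
The metric used for scoring, multiclass log loss, is the mean negative log-probability assigned to the true class. The sketch below is a hand-written illustration, not scikit-learn's implementation; the function name `multiclass_log_loss` and the `eps` clipping constant are assumptions made for the example.

```python
import numpy as np

def multiclass_log_loss(y_true, prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    prob = np.clip(prob, eps, 1.0)                  # avoid log(0)
    prob = prob / prob.sum(axis=1, keepdims=True)   # renormalize each row
    return -np.mean(np.log(prob[np.arange(len(y_true)), y_true]))

n_classes = 39  # number of entries in le.classes_ above
y_true = np.array([0, 5, 38])
uniform = np.full((3, n_classes), 1.0 / n_classes)
print(multiclass_log_loss(y_true, uniform))  # equals ln(39)
```

A uniform guess over the 39 categories scores ln(39), roughly 3.66, which is a useful reference point when reading loss values.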

In [28]:
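# Hold out a stratified 25% of the rows (train_test_split's default) to monitor the validation loss.
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(X_train, y_train, random_state=0, stratify=y_train)
# ColumnTransformer places the ordinal-encoded columns first, so the categorical features are the leading column indices.
clf = catboost.CatBoostClassifier(n_estimators=5000, learning_rate=0.05,
                                  cat_features=np.arange(len(categorical_features)),
                                  random_seed=0, task_type="GPU", devices="0", verbose=50)
clf.fit(X_train_1, y_train_1, eval_set=(X_train_2, y_train_2))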

0:	learn: 3.4591855	test: 3.4581954	best: 3.4581954 (0)	total: 305ms	remaining: 25m 24s
50:	learn: 2.4115088	test: 2.4123301	best: 2.4123301 (50)	total: 13.9s	remaining: 22m 30s
100:	learn: 2.3443610	test: 2.3479929	best: 2.3479929 (100)	total: 27s	remaining: 21m 49s
150:	learn: 2.3172307	test: 2.3240027	best: 2.3240027 (150)	total: 40s	remaining: 21m 23s
200:	learn: 2.2987508	test: 2.3090604	best: 2.3090604 (200)	total: 53.5s	remaining: 21m 17s
250:	learn: 2.2838801	test: 2.2976955	best: 2.2976955 (250)	total: 1m 7s	remaining: 21m 11s
300:	learn: 2.2712466	test: 2.2882375	best: 2.2882375 (300)	total: 1m 20s	remaining: 21m 2s
350:	learn: 2.2604814	test: 2.2808161	best: 2.2808161 (350)	total: 1m 34s	remaining: 20m 51s
400:	learn: 2.2510209	test: 2.2747082	best: 2.2747082 (400)	total: 1m 48s	remaining: 20m 41s
450:	learn: 2.2422915	test: 2.2692978	best: 2.2692978 (450)	total: 2m 1s	remaining: 20m 29s
500:	learn: 2.2340224	test: 2.2642526	best: 2.2642526 (500)	total: 2m 15s	remaining: 20m

4350:	learn: 1.9862048	test: 2.2071882	best: 2.2071882 (4349)	total: 19m 53s	remaining: 2m 57s
4400:	learn: 1.9840306	test: 2.2071545	best: 2.2071519 (4398)	total: 20m 6s	remaining: 2m 44s
4450:	learn: 1.9817428	test: 2.2070754	best: 2.2070754 (4450)	total: 20m 19s	remaining: 2m 30s
4500:	learn: 1.9794570	test: 2.2070175	best: 2.2070175 (4500)	total: 20m 33s	remaining: 2m 16s
4550:	learn: 1.9771930	test: 2.2069845	best: 2.2069794 (4536)	total: 20m 47s	remaining: 2m 3s
4600:	learn: 1.9749528	test: 2.2068902	best: 2.2068793 (4597)	total: 21m	remaining: 1m 49s
4650:	learn: 1.9726537	test: 2.2068676	best: 2.2068625 (4647)	total: 21m 14s	remaining: 1m 35s
4700:	learn: 1.9705012	test: 2.2067704	best: 2.2067704 (4700)	total: 21m 27s	remaining: 1m 21s
4750:	learn: 1.9682775	test: 2.2067030	best: 2.2067030 (4750)	total: 21m 41s	remaining: 1m 8s
4800:	learn: 1.9660337	test: 2.2066441	best: 2.2066441 (4800)	total: 21m 54s	remaining: 54.5s
4850:	learn: 1.9637725	test: 2.2065955	best: 2.2065955 (48

<catboost.core.CatBoostClassifier at 0x7fc925bb3208>

In [29]:
clf = catboost.CatBoostClassifier(n_estimators=5000, learning_rate=0.05,
                                  cat_features=np.arange(len(categorical_features)),
                                  random_seed=0, task_type="GPU", devices="0", verbose=50)

In [30]:
# scores = cross_val_score_prod(clf, X_train, y_train)
# print(np.mean(scores), np.std(scores))

In [31]:
clf.fit(X_train, y_train)
prob = clf.predict_proba(X_test)

0:	learn: 3.4586714	total: 393ms	remaining: 32m 43s
50:	learn: 2.4113510	total: 18.3s	remaining: 29m 33s
100:	learn: 2.3439746	total: 36.1s	remaining: 29m 9s
150:	learn: 2.3178548	total: 53.8s	remaining: 28m 49s
200:	learn: 2.2994341	total: 1m 11s	remaining: 28m 24s
250:	learn: 2.2857706	total: 1m 28s	remaining: 27m 58s
300:	learn: 2.2737738	total: 1m 46s	remaining: 27m 40s
350:	learn: 2.2633217	total: 2m 3s	remaining: 27m 22s
400:	learn: 2.2543457	total: 2m 21s	remaining: 27m 2s
450:	learn: 2.2461650	total: 2m 38s	remaining: 26m 43s
500:	learn: 2.2388652	total: 2m 56s	remaining: 26m 23s
550:	learn: 2.2318572	total: 3m 13s	remaining: 26m 5s
600:	learn: 2.2256222	total: 3m 31s	remaining: 25m 47s
650:	learn: 2.2198316	total: 3m 48s	remaining: 25m 28s
700:	learn: 2.2141869	total: 4m 6s	remaining: 25m 10s
750:	learn: 2.2087120	total: 4m 24s	remaining: 24m 54s
800:	learn: 2.2036333	total: 4m 41s	remaining: 24m 37s
850:	learn: 2.1992010	total: 4m 59s	remaining: 24m 18s
900:	learn: 2.1947394	

In [32]:
%%time
submission = pd.DataFrame(np.c_[test_ID, prob], columns=["Id"] + list(le.classes_))
submission["Id"] = submission["Id"].astype(int)
submission.to_csv("submission/v3.gz", compression="gzip", index=False)

CPU times: user 2min 30s, sys: 0 ns, total: 2min 30s
Wall time: 2min 30s
