- Dates - timestamp of the crime incident
- Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- Descript - detailed description of the crime incident (only in train.csv)
- DayOfWeek - the day of the week
- PdDistrict - name of the Police Department District
- Resolution - how the crime incident was resolved (only in train.csv)
- Address - the approximate street address of the crime incident 
- X - Longitude
- Y - Latitude

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import log_loss
import lightgbm as lgb
import catboost

In [2]:
X_train = pd.read_csv("data/train.csv")
X_test = pd.read_csv("data/test.csv")

In [3]:
X_train[:5]

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
print(X_train.duplicated().sum())
X_train.drop_duplicates(inplace=True)
assert X_train.duplicated().sum() == 0

2323


In [5]:
X_test[:5]

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [6]:
y_train = X_train['Category']
X_train_description = X_train['Descript']
X_train_resolution = X_train['Resolution']
X_train.drop(["Category", "Descript", "Resolution"], axis=1, inplace=True)

In [7]:
test_ID = X_test["Id"]
X_test.drop("Id", axis=1, inplace=True)

In [8]:
X_train[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,Wednesday,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,Wednesday,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,Wednesday,PARK,100 Block of BRODERICK ST,-122.438738,37.771541


In [9]:
X_test[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [10]:
X_train.shape

(875726, 6)

In [11]:
X_test.shape

(884262, 6)

In [12]:
y_train.value_counts()

LARCENY/THEFT                  174320
OTHER OFFENSES                 125960
NON-CRIMINAL                    91915
ASSAULT                         76815
DRUG/NARCOTIC                   53919
VEHICLE THEFT                   53706
VANDALISM                       44581
WARRANTS                        42145
BURGLARY                        36600
SUSPICIOUS OCC                  31394
MISSING PERSON                  25669
ROBBERY                         22988
FRAUD                           16637
FORGERY/COUNTERFEITING          10592
SECONDARY CODES                  9979
WEAPON LAWS                      8550
PROSTITUTION                     7446
TRESPASS                         7318
STOLEN PROPERTY                  4537
SEX OFFENSES FORCIBLE            4380
DISORDERLY CONDUCT               4313
DRUNKENNESS                      4277
RECOVERED VEHICLE                3132
KIDNAPPING                       2340
DRIVING UNDER THE INFLUENCE      2268
LIQUOR LAWS                      1899
RUNAWAY     

In [13]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
print(le.classes_)

['ARSON' 'ASSAULT' 'BAD CHECKS' 'BRIBERY' 'BURGLARY' 'DISORDERLY CONDUCT'
 'DRIVING UNDER THE INFLUENCE' 'DRUG/NARCOTIC' 'DRUNKENNESS'
 'EMBEZZLEMENT' 'EXTORTION' 'FAMILY OFFENSES' 'FORGERY/COUNTERFEITING'
 'FRAUD' 'GAMBLING' 'KIDNAPPING' 'LARCENY/THEFT' 'LIQUOR LAWS' 'LOITERING'
 'MISSING PERSON' 'NON-CRIMINAL' 'OTHER OFFENSES'
 'PORNOGRAPHY/OBSCENE MAT' 'PROSTITUTION' 'RECOVERED VEHICLE' 'ROBBERY'
 'RUNAWAY' 'SECONDARY CODES' 'SEX OFFENSES FORCIBLE'
 'SEX OFFENSES NON FORCIBLE' 'STOLEN PROPERTY' 'SUICIDE' 'SUSPICIOUS OCC'
 'TREA' 'TRESPASS' 'VANDALISM' 'VEHICLE THEFT' 'WARRANTS' 'WEAPON LAWS']
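
`LabelEncoder` sorts the class names, so each integer code is an index into `classes_`. The sketch below shows this on a few labels taken from the category list above; `le_demo` and the toy array are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels taken from the category list above.
labels = np.array(["WARRANTS", "ASSAULT", "LARCENY/THEFT", "ASSAULT"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ is sorted, and each code is an index into it.
print(le_demo.classes_)
print(codes)

# inverse_transform maps the codes back to the original strings.
restored = le_demo.inverse_transform(codes)
print(restored)
```

The same ordering is what lets the probability columns be named from `le.classes_` when the submission is built.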


In [14]:
num_train = X_train.shape[0]
all_data = pd.concat((X_train, X_test), ignore_index=True)

In [15]:
date = pd.to_datetime(all_data['Dates'])
all_data['year'] = date.dt.year
all_data['month'] = date.dt.month
all_data['day'] = date.dt.day
all_data['hour'] = date.dt.hour
all_data['minute'] = date.dt.minute
# all_data['second'] = date.dt.second  # all zero
all_data["n_days"] = (date - date.min()).dt.days  # whole days since the earliest timestamp
all_data.drop("Dates", axis=1, inplace=True)
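
A small sketch of the date features on three timestamps from the previews above. `n_days` counts whole days since the earliest timestamp, so partial days are truncated rather than rounded. The names `dates` and `demo` are illustrative only.

```python
import pandas as pd

# Three timestamps in the same format as the Dates column.
dates = pd.to_datetime(pd.Series([
    "2015-05-10 23:45:00",
    "2015-05-13 23:30:00",
    "2015-05-13 23:53:00",
]))

demo = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "hour": dates.dt.hour,
    "minute": dates.dt.minute,
    # Whole days elapsed since the earliest timestamp.
    "n_days": (dates - dates.min()).dt.days,
})
print(demo)
```

The second row is 2 days and 23 h 45 min after the first, which becomes `n_days == 2`.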

In [16]:
all_data["DayOfWeek"].value_counts()

Friday       268074
Wednesday    259228
Saturday     253507
Tuesday      251543
Thursday     251298
Monday       243529
Sunday       232809
Name: DayOfWeek, dtype: int64

In [17]:
all_data["PdDistrict"].value_counts()

SOUTHERN      313984
MISSION       240172
NORTHERN      212122
BAYVIEW       178689
CENTRAL       171397
TENDERLOIN    163389
INGLESIDE     158806
TARAVAL       132017
PARK           99360
RICHMOND       90052
Name: PdDistrict, dtype: int64

In [18]:
all_data['block'] = all_data["Address"].str.contains("block", case=False)
all_data.drop("Address", axis=1, inplace=True)
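
The `block` flag separates the two address forms visible in the previews: "N Block of STREET" and "STREET / STREET" intersections. A minimal sketch on addresses copied from the previews above (`addr` and `is_block` are illustrative names):

```python
import pandas as pd

# Addresses copied from the previews above.
addr = pd.Series([
    "OAK ST / LAGUNA ST",
    "1500 Block of LOMBARD ST",
    "3RD ST / REVERE AV",
])

# Case-insensitive substring match, as in the cell above.
is_block = addr.str.contains("block", case=False)
print(is_block)
```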

In [19]:
all_data["X+Y"] = all_data["X"] + all_data["Y"]
all_data["X-Y"] = all_data["X"] - all_data["Y"]
all_data["XY1"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY2"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY3"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
all_data["XY4"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
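
`XY1` to `XY4` are the squared Euclidean distances (in coordinate units) from each point to the four corners of the bounding box of all coordinates. The sketch below reproduces them on a toy frame; the coordinate values are hypothetical and chosen so the results are easy to check by hand.

```python
import pandas as pd

# Toy coordinates (hypothetical values, not from the dataset).
pts = pd.DataFrame({"X": [0.0, 3.0, 1.0], "Y": [0.0, 4.0, 2.0]})

x_min, x_max = pts["X"].min(), pts["X"].max()
y_min, y_max = pts["Y"].min(), pts["Y"].max()

# One bounding-box corner per feature, matching the formulas above.
corners = {
    "XY1": (x_min, y_min),
    "XY2": (x_max, y_min),
    "XY3": (x_min, y_max),
    "XY4": (x_max, y_max),
}
for name, (cx, cy) in corners.items():
    # Squared distance from every point to this corner.
    pts[name] = (pts["X"] - cx) ** 2 + (pts["Y"] - cy) ** 2

print(pts)
```

Because the corners come from the minimum and maximum of the data, a single extreme coordinate shifts all four features.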

In [20]:
categorical_features = ["DayOfWeek", "PdDistrict", "block"]
ct = ColumnTransformer(transformers=[("categorical_features", OrdinalEncoder(), categorical_features)],
                       remainder="passthrough")
all_data = ct.fit_transform(all_data)

In [21]:
X_train = all_data[:num_train]
X_test = all_data[num_train:]
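
`ColumnTransformer` puts the outputs of its listed transformers first and appends the `remainder="passthrough"` columns after them, which is why the categorical columns end up at indices `0 .. len(categorical_features) - 1`. A minimal sketch on a toy frame (`df`, `ct_demo`, and the coordinate values are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with the categorical column deliberately placed last.
df = pd.DataFrame({
    "X": [-122.43, -122.40, -122.44],
    "Y": [37.77, 37.74, 37.72],
    "PdDistrict": ["NORTHERN", "BAYVIEW", "INGLESIDE"],
})

ct_demo = ColumnTransformer(
    transformers=[("cat", OrdinalEncoder(), ["PdDistrict"])],
    remainder="passthrough",
)
out = ct_demo.fit_transform(df)

# The encoded column comes first, then the passthrough columns in order.
print(out)
```

The result is a plain NumPy array, so column names are lost and downstream code has to rely on this positional order.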

In [22]:
def cross_val_score_prod(clf, X, y):
    """Return the out-of-fold multiclass log loss of clf for each of 3 stratified folds."""
    scores = []
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_index, test_index in cv.split(X, y):
        est = clone(clf)  # fit a fresh, unfitted copy on every fold
        est.fit(X[train_index], y[train_index])
        prob = est.predict_proba(X[test_index])
        scores.append(log_loss(y[test_index], prob))
    return scores
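
A usage sketch for this kind of fold-wise scoring. It is self-contained, so it redefines the helper under the hypothetical name `cv_log_loss`; the iris data and the random forest stand in for the crime data and the real model. Passing `labels` to `log_loss` is an addition not present in the cell above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold


def cv_log_loss(clf, X, y, n_splits=3):
    """Out-of-fold multiclass log loss, one score per fold."""
    labels = np.unique(y)
    scores = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_index, test_index in cv.split(X, y):
        est = clone(clf)
        est.fit(X[train_index], y[train_index])
        prob = est.predict_proba(X[test_index])
        # labels tells log_loss the full class set even when
        # y[test_index] happens to miss a rare class.
        scores.append(log_loss(y[test_index], prob, labels=labels))
    return scores


X_demo, y_demo = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cv_log_loss(model, X_demo, y_demo)
print(scores, np.mean(scores))
```

Log loss is the metric to watch here because the submission consists of per-class probabilities rather than hard labels.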

In [26]:
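# prob = np.zeros((X_test.shape[0], len(le.classes_)))
n_models = 10
os.makedirs("model", exist_ok=True)  # make sure the output directory exists before saving
for random_seed in range(n_models):
    # The ColumnTransformer above places the encoded categorical columns first,
    # so their indices are 0 .. len(categorical_features) - 1.
    clf = catboost.CatBoostClassifier(n_estimators=5000, learning_rate=0.05,
                                      cat_features=np.arange(len(categorical_features)),
                                      random_seed=random_seed, task_type="GPU", verbose=100)
    clf.fit(X_train, y_train)
    clf.save_model("model/model" + str(random_seed))
    # prob += clf.predict_proba(X_test)
    print()
# prob /= n_models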

0:	learn: 3.4702612	total: 296ms	remaining: 24m 38s
100:	learn: 2.3704923	total: 28.4s	remaining: 22m 56s
200:	learn: 2.3310044	total: 56.4s	remaining: 22m 26s
300:	learn: 2.3084329	total: 1m 24s	remaining: 22m
400:	learn: 2.2916022	total: 1m 52s	remaining: 21m 31s
500:	learn: 2.2785303	total: 2m 20s	remaining: 21m 4s
600:	learn: 2.2677665	total: 2m 48s	remaining: 20m 36s
700:	learn: 2.2583985	total: 3m 17s	remaining: 20m 8s
800:	learn: 2.2501830	total: 3m 45s	remaining: 19m 39s
900:	learn: 2.2428121	total: 4m 13s	remaining: 19m 11s
1000:	learn: 2.2359935	total: 4m 41s	remaining: 18m 43s
1100:	learn: 2.2297602	total: 5m 9s	remaining: 18m 16s
1200:	learn: 2.2237117	total: 5m 37s	remaining: 17m 48s
1300:	learn: 2.2184388	total: 6m 5s	remaining: 17m 19s
1400:	learn: 2.2129250	total: 6m 33s	remaining: 16m 51s
1500:	learn: 2.2078204	total: 7m 2s	remaining: 16m 23s
1600:	learn: 2.2031068	total: 7m 30s	remaining: 15m 56s
1700:	learn: 2.1984050	total: 7m 58s	remaining: 15m 28s
1800:	learn: 2.1

4600:	learn: 2.0899040	total: 21m 38s	remaining: 1m 52s
4700:	learn: 2.0868222	total: 22m 7s	remaining: 1m 24s
4800:	learn: 2.0836319	total: 22m 35s	remaining: 56.2s
4900:	learn: 2.0804415	total: 23m 3s	remaining: 28s
4999:	learn: 2.0773178	total: 23m 32s	remaining: 0us

0:	learn: 3.4703021	total: 293ms	remaining: 24m 26s
100:	learn: 2.3727960	total: 28.7s	remaining: 23m 14s
200:	learn: 2.3316604	total: 56.8s	remaining: 22m 37s
300:	learn: 2.3092466	total: 1m 25s	remaining: 22m 6s
400:	learn: 2.2922668	total: 1m 53s	remaining: 21m 39s
500:	learn: 2.2792469	total: 2m 21s	remaining: 21m 9s
600:	learn: 2.2688858	total: 2m 49s	remaining: 20m 40s
700:	learn: 2.2594510	total: 3m 17s	remaining: 20m 12s
800:	learn: 2.2512247	total: 3m 46s	remaining: 19m 44s
900:	learn: 2.2431541	total: 4m 14s	remaining: 19m 16s
1000:	learn: 2.2363188	total: 4m 42s	remaining: 18m 48s
1100:	learn: 2.2300613	total: 5m 10s	remaining: 18m 20s
1200:	learn: 2.2243651	total: 5m 38s	remaining: 17m 51s
1300:	learn: 2.21

4100:	learn: 2.1069643	total: 19m 19s	remaining: 4m 14s
4200:	learn: 2.1034971	total: 19m 47s	remaining: 3m 45s
4300:	learn: 2.1002037	total: 20m 16s	remaining: 3m 17s
4400:	learn: 2.0969376	total: 20m 44s	remaining: 2m 49s
4500:	learn: 2.0936654	total: 21m 12s	remaining: 2m 21s
4600:	learn: 2.0904418	total: 21m 41s	remaining: 1m 52s
4700:	learn: 2.0871794	total: 22m 9s	remaining: 1m 24s
4800:	learn: 2.0840543	total: 22m 38s	remaining: 56.3s
4900:	learn: 2.0810599	total: 23m 6s	remaining: 28s
4999:	learn: 2.0779996	total: 23m 34s	remaining: 0us

0:	learn: 3.4705453	total: 295ms	remaining: 24m 33s
100:	learn: 2.3722115	total: 28.6s	remaining: 23m 9s
200:	learn: 2.3324494	total: 57s	remaining: 22m 41s
300:	learn: 2.3087838	total: 1m 25s	remaining: 22m 10s
400:	learn: 2.2920226	total: 1m 53s	remaining: 21m 41s
500:	learn: 2.2791649	total: 2m 21s	remaining: 21m 11s
600:	learn: 2.2684661	total: 2m 49s	remaining: 20m 43s
700:	learn: 2.2587365	total: 3m 18s	remaining: 20m 15s
800:	learn: 2.25

3700:	learn: 2.1187071	total: 17m 24s	remaining: 6m 6s
3800:	learn: 2.1151462	total: 17m 52s	remaining: 5m 38s
3900:	learn: 2.1116074	total: 18m 21s	remaining: 5m 10s
4000:	learn: 2.1082124	total: 18m 49s	remaining: 4m 42s
4100:	learn: 2.1046903	total: 19m 17s	remaining: 4m 13s
4200:	learn: 2.1015165	total: 19m 46s	remaining: 3m 45s
4300:	learn: 2.0982159	total: 20m 14s	remaining: 3m 17s
4400:	learn: 2.0949382	total: 20m 43s	remaining: 2m 49s
4500:	learn: 2.0916200	total: 21m 11s	remaining: 2m 20s
4600:	learn: 2.0884806	total: 21m 39s	remaining: 1m 52s
4700:	learn: 2.0853114	total: 22m 8s	remaining: 1m 24s
4800:	learn: 2.0821617	total: 22m 36s	remaining: 56.2s
4900:	learn: 2.0789979	total: 23m 4s	remaining: 28s
4999:	learn: 2.0759183	total: 23m 32s	remaining: 0us

0:	learn: 3.4697371	total: 298ms	remaining: 24m 47s
100:	learn: 2.3722536	total: 28.9s	remaining: 23m 20s
200:	learn: 2.3321226	total: 57.1s	remaining: 22m 42s
300:	learn: 2.3082836	total: 1m 25s	remaining: 22m 9s
400:	learn:

In [27]:
prob = np.zeros((X_test.shape[0], len(le.classes_)))
n_models = 10
for random_seed in range(n_models):
    clf = catboost.CatBoostClassifier()
    clf.load_model("model/model" + str(random_seed))
    prob += clf.predict_proba(X_test)
prob /= n_models
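
The loop above averages the predicted probabilities of models that differ only in their random seed. The sketch below shows the same pattern in memory; a small random forest on the iris data stands in for the saved CatBoost models, so the model, data, and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)
n_classes = len(np.unique(y_demo))

n_models = 5
prob = np.zeros((X_demo.shape[0], n_classes))
for random_seed in range(n_models):
    # Same model and data every time; only the seed changes.
    clf = RandomForestClassifier(n_estimators=20, random_state=random_seed)
    clf.fit(X_demo, y_demo)
    prob += clf.predict_proba(X_demo)
prob /= n_models

# Averaging rows that each sum to 1 gives rows that still sum to 1.
print(prob.sum(axis=1)[:5])
```

Because each model's rows sum to 1, the averaged rows do too, so no renormalization is needed before writing the submission.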

In [28]:
%%time
submission = pd.DataFrame(np.c_[test_ID, prob], columns=["Id"] + list(le.classes_))
submission["Id"] = submission["Id"].astype(int)
submission.to_csv("submission/v2.gz", compression="gzip", index=False)

CPU times: user 2min 45s, sys: 0 ns, total: 2min 45s
Wall time: 2min 45s
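
`np.c_[test_ID, prob]` stacks the integer IDs with the float probabilities, so the `Id` column becomes float and has to be cast back. An alternative sketch, assuming the same column layout, builds the frame from the probabilities and inserts `Id` afterwards. The class names, IDs, probabilities, and temporary path here are toy values.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for le.classes_, test_ID, and prob.
classes = ["ARSON", "ASSAULT", "BAD CHECKS"]
ids = np.array([0, 1])
prob = np.array([[0.1, 0.7, 0.2], [0.3, 0.3, 0.4]])

# Probabilities first, then Id inserted as column 0 with its integer dtype intact.
submission = pd.DataFrame(prob, columns=classes)
submission.insert(0, "Id", ids)

# Write a gzip-compressed CSV to a temporary directory and read it back.
path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
submission.to_csv(path, compression="gzip", index=False)
round_trip = pd.read_csv(path)
print(round_trip)
```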
