- Dates - timestamp of the crime incident
- Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- Descript - detailed description of the crime incident (only in train.csv)
- DayOfWeek - the day of the week
- PdDistrict - name of the Police Department District
- Resolution - how the crime incident was resolved (only in train.csv)
- Address - the approximate street address of the crime incident 
- X - Longitude
- Y - Latitude

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import log_loss
import lightgbm as lgb
import catboost

In [2]:
X_train = pd.read_csv("data/train.csv")
X_test = pd.read_csv("data/test.csv")

In [3]:
X_train[:5]

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
print(X_train.duplicated().sum())
X_train.drop_duplicates(inplace=True)
assert X_train.duplicated().sum() == 0

2323


In [5]:
X_test[:5]

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [6]:
y_train = X_train['Category']
X_train_description = X_train['Descript']
X_train_resolution = X_train['Resolution']
X_train.drop(["Category", "Descript", "Resolution"], axis=1, inplace=True)

In [7]:
test_ID = X_test["Id"]
X_test.drop("Id", axis=1, inplace=True)

In [8]:
X_train[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,Wednesday,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,Wednesday,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,Wednesday,PARK,100 Block of BRODERICK ST,-122.438738,37.771541


In [9]:
X_test[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [10]:
X_train.shape

(875726, 6)

In [11]:
X_test.shape

(884262, 6)

In [12]:
y_train.value_counts()

LARCENY/THEFT                  174320
OTHER OFFENSES                 125960
NON-CRIMINAL                    91915
ASSAULT                         76815
DRUG/NARCOTIC                   53919
VEHICLE THEFT                   53706
VANDALISM                       44581
WARRANTS                        42145
BURGLARY                        36600
SUSPICIOUS OCC                  31394
MISSING PERSON                  25669
ROBBERY                         22988
FRAUD                           16637
FORGERY/COUNTERFEITING          10592
SECONDARY CODES                  9979
WEAPON LAWS                      8550
PROSTITUTION                     7446
TRESPASS                         7318
STOLEN PROPERTY                  4537
SEX OFFENSES FORCIBLE            4380
DISORDERLY CONDUCT               4313
DRUNKENNESS                      4277
RECOVERED VEHICLE                3132
KIDNAPPING                       2340
DRIVING UNDER THE INFLUENCE      2268
LIQUOR LAWS                      1899
RUNAWAY     

In [13]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
print(le.classes_)

['ARSON' 'ASSAULT' 'BAD CHECKS' 'BRIBERY' 'BURGLARY' 'DISORDERLY CONDUCT'
 'DRIVING UNDER THE INFLUENCE' 'DRUG/NARCOTIC' 'DRUNKENNESS'
 'EMBEZZLEMENT' 'EXTORTION' 'FAMILY OFFENSES' 'FORGERY/COUNTERFEITING'
 'FRAUD' 'GAMBLING' 'KIDNAPPING' 'LARCENY/THEFT' 'LIQUOR LAWS' 'LOITERING'
 'MISSING PERSON' 'NON-CRIMINAL' 'OTHER OFFENSES'
 'PORNOGRAPHY/OBSCENE MAT' 'PROSTITUTION' 'RECOVERED VEHICLE' 'ROBBERY'
 'RUNAWAY' 'SECONDARY CODES' 'SEX OFFENSES FORCIBLE'
 'SEX OFFENSES NON FORCIBLE' 'STOLEN PROPERTY' 'SUICIDE' 'SUSPICIOUS OCC'
 'TREA' 'TRESPASS' 'VANDALISM' 'VEHICLE THEFT' 'WARRANTS' 'WEAPON LAWS']
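
The encoder sorts class names alphabetically, and the resulting integer codes index into `le.classes_`. The sketch below illustrates this on a hypothetical three-category sample (`toy_labels`, `toy_le` and `toy_codes` are illustrative names, not part of the notebook).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-sample of category names, in arbitrary order.
toy_labels = ["WARRANTS", "ASSAULT", "LARCENY/THEFT", "ASSAULT"]

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_labels)

# classes_ is sorted, so code k refers to toy_le.classes_[k].
print(list(toy_le.classes_))
print(toy_codes.tolist())

# inverse_transform maps integer codes back to the original names.
toy_roundtrip = toy_le.inverse_transform(toy_codes)
print(list(toy_roundtrip))
```

Because of this ordering, the columns of any `predict_proba` output line up with `le.classes_`, which is what the submission cell relies on.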


In [14]:
num_train = X_train.shape[0]
all_data = pd.concat((X_train, X_test), ignore_index=True)

In [15]:
date = pd.to_datetime(all_data['Dates'])
all_data['year'] = date.dt.year
all_data['month'] = date.dt.month
all_data['day'] = date.dt.day
all_data['hour'] = date.dt.hour
all_data['minute'] = date.dt.minute
# all_data['second'] = date.dt.second  # all zero
all_data["n_days"] = (date - date.min()).dt.days  # whole days since the earliest timestamp
all_data.drop("Dates", axis=1, inplace=True)
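
A minimal sketch of the same date decomposition on three hypothetical timestamps (the `toy` frame is illustrative only). It shows that `n_days` counts whole 24-hour periods elapsed since the earliest timestamp, so two events on the same calendar date can in principle differ by one if they straddle that time-of-day boundary.

```python
import pandas as pd

# Hypothetical timestamps in the same format as the Dates column.
toy = pd.DataFrame({"Dates": ["2015-05-10 23:59:00",
                              "2015-05-13 23:53:00",
                              "2015-05-13 00:10:00"]})

toy_date = pd.to_datetime(toy["Dates"])
toy["hour"] = toy_date.dt.hour
toy["minute"] = toy_date.dt.minute

# Timedelta.days is the floor of the elapsed time in days,
# measured from the earliest timestamp rather than from midnight.
toy["n_days"] = (toy_date - toy_date.min()).dt.days
print(toy)
```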

In [16]:
all_data["DayOfWeek"].value_counts()

Friday       268074
Wednesday    259228
Saturday     253507
Tuesday      251543
Thursday     251298
Monday       243529
Sunday       232809
Name: DayOfWeek, dtype: int64

In [17]:
all_data["PdDistrict"].value_counts()

SOUTHERN      313984
MISSION       240172
NORTHERN      212122
BAYVIEW       178689
CENTRAL       171397
TENDERLOIN    163389
INGLESIDE     158806
TARAVAL       132017
PARK           99360
RICHMOND       90052
Name: PdDistrict, dtype: int64

In [18]:
all_data['block'] = all_data["Address"].str.contains("block", case=False)
all_data.drop("Address", axis=1, inplace=True)

In [19]:
all_data["X+Y"] = all_data["X"] + all_data["Y"]
all_data["X-Y"] = all_data["X"] - all_data["Y"]
all_data["XY1"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY2"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY3"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
all_data["XY4"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
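
Read geometrically, `XY1` to `XY4` are the squared Euclidean distances from each point to the four corners of the bounding box of the data. The sketch below reproduces them on hypothetical coordinates (the `toy` frame and the `corners` dict are illustrative names), assuming that interpretation is the intended one.

```python
import pandas as pd

# Hypothetical coordinates; the bounding box here is [0, 3] x [0, 4].
toy = pd.DataFrame({"X": [0.0, 3.0, 1.0], "Y": [0.0, 4.0, 2.0]})

x_min, x_max = toy["X"].min(), toy["X"].max()
y_min, y_max = toy["Y"].min(), toy["Y"].max()

# Corner order matches XY1..XY4 in the cell above.
corners = {
    "XY1": (x_min, y_min),
    "XY2": (x_max, y_min),
    "XY3": (x_min, y_max),
    "XY4": (x_max, y_max),
}
for name, (cx, cy) in corners.items():
    toy[name] = (toy["X"] - cx) ** 2 + (toy["Y"] - cy) ** 2
print(toy)
```

Since the corners come from the observed minima and maxima, any outlier coordinates in the data would shift them, which may be worth checking before relying on these features.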

In [20]:
categorical_features = ["DayOfWeek", "PdDistrict", "year", "month", "day", "hour", "minute", "block"]
ct = ColumnTransformer(transformers=[("categorical_features", OrdinalEncoder(), categorical_features)],
                       remainder="passthrough")
all_data = ct.fit_transform(all_data)

In [21]:
X_train = all_data[:num_train]
X_test = all_data[num_train:]

In [22]:
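def cross_val_score_prod(clf, X, y):
    """Return the per-fold multiclass log loss from stratified 3-fold CV.

    Assumes clf is a CatBoost estimator, since fit() is passed cat_features.
    """
    scores = []
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_index, test_index in cv.split(X, y):
        est = clone(clf)  # fresh, unfitted copy for each fold
        # The encoded categorical columns come first in the ColumnTransformer output.
        est.fit(X[train_index], y[train_index],
                cat_features=np.arange(len(categorical_features)))
        prob = est.predict_proba(X[test_index])
        scores.append(log_loss(y[test_index], prob))
    return scores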
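
The score collected in each fold is the multiclass log loss, i.e. the mean negative log of the probability assigned to the true class. A small sketch on hypothetical predictions (`toy_y` and `toy_prob` are made up for illustration) compares a manual computation with `sklearn.metrics.log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted class probabilities (rows sum to 1).
toy_y = np.array([0, 1, 2])
toy_prob = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.2, 0.2, 0.6]])

# Manual version: pick the probability of the true class in each row.
manual = -np.mean(np.log(toy_prob[np.arange(len(toy_y)), toy_y]))

# Passing labels explicitly keeps the column order unambiguous even if
# a validation fold happens to miss one of the rare classes.
sk = log_loss(toy_y, toy_prob, labels=[0, 1, 2])
print(manual, sk)
```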

In [23]:
prob = np.zeros((X_test.shape[0], len(le.classes_)))
n_models = 10
for random_seed in range(n_models):
    clf = catboost.CatBoostClassifier(n_estimators=5000, learning_rate=0.05,
                                      random_seed=random_seed, task_type="GPU", verbose=100)
    clf.fit(X_train, y_train)
    prob += clf.predict_proba(X_test)
    print()
prob /= n_models

0:	learn: 3.4710514	total: 183ms	remaining: 15m 16s
100:	learn: 2.3735663	total: 10s	remaining: 8m 5s
200:	learn: 2.3325449	total: 19.3s	remaining: 7m 40s
300:	learn: 2.3098052	total: 28.5s	remaining: 7m 25s
400:	learn: 2.2922515	total: 37.8s	remaining: 7m 13s
500:	learn: 2.2788889	total: 47.1s	remaining: 7m 3s
600:	learn: 2.2678371	total: 56.6s	remaining: 6m 54s
700:	learn: 2.2581452	total: 1m 6s	remaining: 6m 45s
800:	learn: 2.2494024	total: 1m 15s	remaining: 6m 35s
900:	learn: 2.2417399	total: 1m 24s	remaining: 6m 26s
1000:	learn: 2.2342195	total: 1m 34s	remaining: 6m 17s
1100:	learn: 2.2273391	total: 1m 43s	remaining: 6m 8s
1200:	learn: 2.2211622	total: 1m 53s	remaining: 5m 58s
1300:	learn: 2.2150962	total: 2m 3s	remaining: 5m 49s
1400:	learn: 2.2092671	total: 2m 12s	remaining: 5m 40s
1500:	learn: 2.2035013	total: 2m 22s	remaining: 5m 31s
1600:	learn: 2.1982745	total: 2m 31s	remaining: 5m 21s
1700:	learn: 2.1932484	total: 2m 41s	remaining: 5m 12s
1800:	learn: 2.1882901	total: 2m 50

4999:	learn: 2.0665348	total: 7m 58s	remaining: 0us

0:	learn: 3.4720007	total: 102ms	remaining: 8m 29s
100:	learn: 2.3716645	total: 9.57s	remaining: 7m 44s
200:	learn: 2.3321795	total: 18.9s	remaining: 7m 31s
300:	learn: 2.3083483	total: 28.3s	remaining: 7m 22s
400:	learn: 2.2913112	total: 37.9s	remaining: 7m 14s
500:	learn: 2.2782868	total: 47.4s	remaining: 7m 5s
600:	learn: 2.2675968	total: 57s	remaining: 6m 56s
700:	learn: 2.2578628	total: 1m 6s	remaining: 6m 47s
800:	learn: 2.2492562	total: 1m 16s	remaining: 6m 38s
900:	learn: 2.2411512	total: 1m 25s	remaining: 6m 29s
1000:	learn: 2.2341828	total: 1m 35s	remaining: 6m 20s
1100:	learn: 2.2274400	total: 1m 44s	remaining: 6m 11s
1200:	learn: 2.2210547	total: 1m 54s	remaining: 6m 1s
1300:	learn: 2.2149632	total: 2m 3s	remaining: 5m 52s
1400:	learn: 2.2092622	total: 2m 13s	remaining: 5m 42s
1500:	learn: 2.2037245	total: 2m 23s	remaining: 5m 33s
1600:	learn: 2.1983333	total: 2m 32s	remaining: 5m 23s
1700:	learn: 2.1928344	total: 2m 42s	

4900:	learn: 2.0700656	total: 7m 45s	remaining: 9.4s
4999:	learn: 2.0669763	total: 7m 54s	remaining: 0us

0:	learn: 3.4718611	total: 103ms	remaining: 8m 34s
100:	learn: 2.3711726	total: 9.58s	remaining: 7m 44s
200:	learn: 2.3310951	total: 18.9s	remaining: 7m 31s
300:	learn: 2.3085504	total: 28.2s	remaining: 7m 20s
400:	learn: 2.2918991	total: 37.7s	remaining: 7m 12s
500:	learn: 2.2787657	total: 47.2s	remaining: 7m 3s
600:	learn: 2.2675222	total: 56.7s	remaining: 6m 55s
700:	learn: 2.2581314	total: 1m 6s	remaining: 6m 45s
800:	learn: 2.2494292	total: 1m 15s	remaining: 6m 36s
900:	learn: 2.2412895	total: 1m 25s	remaining: 6m 27s
1000:	learn: 2.2339308	total: 1m 34s	remaining: 6m 18s
1100:	learn: 2.2274737	total: 1m 44s	remaining: 6m 8s
1200:	learn: 2.2210786	total: 1m 53s	remaining: 5m 59s
1300:	learn: 2.2151264	total: 2m 3s	remaining: 5m 49s
1400:	learn: 2.2093407	total: 2m 12s	remaining: 5m 40s
1500:	learn: 2.2036843	total: 2m 22s	remaining: 5m 31s
1600:	learn: 2.1983025	total: 2m 31s	

4800:	learn: 2.0727104	total: 7m 35s	remaining: 18.9s
4900:	learn: 2.0697027	total: 7m 44s	remaining: 9.39s
4999:	learn: 2.0666484	total: 7m 54s	remaining: 0us

0:	learn: 3.4706532	total: 106ms	remaining: 8m 49s
100:	learn: 2.3729055	total: 9.55s	remaining: 7m 43s
200:	learn: 2.3316457	total: 18.9s	remaining: 7m 30s
300:	learn: 2.3079598	total: 28.2s	remaining: 7m 20s
400:	learn: 2.2915979	total: 37.7s	remaining: 7m 12s
500:	learn: 2.2784778	total: 47.2s	remaining: 7m 3s
600:	learn: 2.2674354	total: 56.7s	remaining: 6m 54s
700:	learn: 2.2578686	total: 1m 6s	remaining: 6m 45s
800:	learn: 2.2494111	total: 1m 15s	remaining: 6m 36s
900:	learn: 2.2415303	total: 1m 25s	remaining: 6m 27s
1000:	learn: 2.2340644	total: 1m 34s	remaining: 6m 17s
1100:	learn: 2.2275853	total: 1m 44s	remaining: 6m 8s
1200:	learn: 2.2212304	total: 1m 53s	remaining: 5m 59s
1300:	learn: 2.2152548	total: 2m 3s	remaining: 5m 49s
1400:	learn: 2.2094779	total: 2m 12s	remaining: 5m 40s
1500:	learn: 2.2039344	total: 2m 21s	
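
The loop above is a seed-averaging ensemble: the same model is refit with different random seeds and the predicted probabilities are averaged. A minimal NumPy-only sketch of that averaging step follows, with random probability matrices standing in for `predict_proba` output (all names and sizes here are hypothetical).

```python
import numpy as np

toy_rows, toy_classes, toy_models = 4, 3, 5
avg_prob = np.zeros((toy_rows, toy_classes))

for seed in range(toy_models):
    # Stand-in for clf.predict_proba(X_test): random rows normalized to sum to 1.
    raw = np.random.default_rng(seed).random((toy_rows, toy_classes))
    avg_prob += raw / raw.sum(axis=1, keepdims=True)

avg_prob /= toy_models

# An average of valid probability rows is itself a valid probability row.
print(avg_prob.sum(axis=1))
```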

In [24]:
%%time
submission = pd.DataFrame(np.c_[test_ID, prob], columns=["Id"] + list(le.classes_))
submission["Id"] = submission["Id"].astype(int)
submission.to_csv("submission/v2.gz", compression="gzip", index=False)

CPU times: user 2min 29s, sys: 19.9 ms, total: 2min 29s
Wall time: 2min 29s
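
Stacking the ids and the probabilities with `np.c_` yields a single float array, which is why `Id` is cast back to `int` afterwards. One possible alternative, sketched on hypothetical data (`toy_ids`, `toy_classes` and `toy_prob` are illustrative), builds the frame from the probabilities and inserts the id column separately so it keeps its integer dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical ids, class names and averaged probabilities.
toy_ids = pd.Series([0, 1, 2], name="Id")
toy_classes = ["ARSON", "ASSAULT"]
toy_prob = np.array([[0.25, 0.75],
                     [0.50, 0.50],
                     [0.90, 0.10]])

# One probability column per class, in the encoder's class order.
toy_sub = pd.DataFrame(toy_prob, columns=toy_classes)

# Insert Id as the first column without routing it through a float array.
toy_sub.insert(0, "Id", toy_ids.to_numpy())
print(toy_sub)
```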
