- Dates - timestamp of the crime incident
- Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- Descript - detailed description of the crime incident (only in train.csv)
- DayOfWeek - the day of the week
- PdDistrict - name of the Police Department District
- Resolution - how the crime incident was resolved (only in train.csv)
- Address - the approximate street address of the crime incident 
- X - Longitude
- Y - Latitude

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
import catboost
import gensim

In [2]:
X_train = pd.read_csv("data/train.csv")
X_test = pd.read_csv("data/test.csv")

In [3]:
X_train[:5]

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
print(X_train.duplicated().sum())
X_train.drop_duplicates(inplace=True)
assert X_train.duplicated().sum() == 0

2323


In [5]:
X_test[:5]

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [6]:
y_train = X_train['Category']
X_train_description = X_train['Descript']
X_train_resolution = X_train['Resolution']
X_train.drop(["Category", "Descript", "Resolution"], axis=1, inplace=True)

In [7]:
test_ID = X_test["Id"]
X_test.drop("Id", axis=1, inplace=True)

In [8]:
X_train[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,Wednesday,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,Wednesday,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,Wednesday,PARK,100 Block of BRODERICK ST,-122.438738,37.771541


In [9]:
X_test[:5]

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051
1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432
2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212
3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412
4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412


In [10]:
X_train.shape

(875726, 6)

In [11]:
X_test.shape

(884262, 6)

In [12]:
y_train.value_counts()

LARCENY/THEFT                  174320
OTHER OFFENSES                 125960
NON-CRIMINAL                    91915
ASSAULT                         76815
DRUG/NARCOTIC                   53919
VEHICLE THEFT                   53706
VANDALISM                       44581
WARRANTS                        42145
BURGLARY                        36600
SUSPICIOUS OCC                  31394
MISSING PERSON                  25669
ROBBERY                         22988
FRAUD                           16637
FORGERY/COUNTERFEITING          10592
SECONDARY CODES                  9979
WEAPON LAWS                      8550
PROSTITUTION                     7446
TRESPASS                         7318
STOLEN PROPERTY                  4537
SEX OFFENSES FORCIBLE            4380
DISORDERLY CONDUCT               4313
DRUNKENNESS                      4277
RECOVERED VEHICLE                3132
KIDNAPPING                       2340
DRIVING UNDER THE INFLUENCE      2268
LIQUOR LAWS                      1899
RUNAWAY     

In [13]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
print(le.classes_)

['ARSON' 'ASSAULT' 'BAD CHECKS' 'BRIBERY' 'BURGLARY' 'DISORDERLY CONDUCT'
 'DRIVING UNDER THE INFLUENCE' 'DRUG/NARCOTIC' 'DRUNKENNESS'
 'EMBEZZLEMENT' 'EXTORTION' 'FAMILY OFFENSES' 'FORGERY/COUNTERFEITING'
 'FRAUD' 'GAMBLING' 'KIDNAPPING' 'LARCENY/THEFT' 'LIQUOR LAWS' 'LOITERING'
 'MISSING PERSON' 'NON-CRIMINAL' 'OTHER OFFENSES'
 'PORNOGRAPHY/OBSCENE MAT' 'PROSTITUTION' 'RECOVERED VEHICLE' 'ROBBERY'
 'RUNAWAY' 'SECONDARY CODES' 'SEX OFFENSES FORCIBLE'
 'SEX OFFENSES NON FORCIBLE' 'STOLEN PROPERTY' 'SUICIDE' 'SUSPICIOUS OCC'
 'TREA' 'TRESPASS' 'VANDALISM' 'VEHICLE THEFT' 'WARRANTS' 'WEAPON LAWS']


In [14]:
num_train = X_train.shape[0]
all_data = pd.concat((X_train, X_test), ignore_index=True)

In [15]:
date = pd.to_datetime(all_data['Dates'])
all_data['year'] = date.dt.year
all_data['month'] = date.dt.month
all_data['day'] = date.dt.day
all_data['hour'] = date.dt.hour
all_data['minute'] = date.dt.minute
all_data['special_time'] = all_data['minute'].isin([0, 30]).astype(int)
# all_data['second'] = date.dt.second  # all zero
all_data["n_days"] = (date - date.min()).dt.days  # whole days since the earliest record
all_data.drop("Dates", axis=1, inplace=True)

In [16]:
all_data["DayOfWeek"].value_counts()

Friday       268074
Wednesday    259228
Saturday     253507
Tuesday      251543
Thursday     251298
Monday       243529
Sunday       232809
Name: DayOfWeek, dtype: int64

In [17]:
all_data["PdDistrict"].value_counts()

SOUTHERN      313984
MISSION       240172
NORTHERN      212122
BAYVIEW       178689
CENTRAL       171397
TENDERLOIN    163389
INGLESIDE     158806
TARAVAL       132017
PARK           99360
RICHMOND       90052
Name: PdDistrict, dtype: int64

In [18]:
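# Tokenize each address on spaces and train Word2Vec on the tokens
sentences = [s.split(" ") for s in all_data["Address"]]
address_model = gensim.models.Word2Vec(sentences, min_count=1)
# Encode each address as the mean of its token vectors
encoded_address = np.zeros((all_data.shape[0], address_model.wv.vector_size))
for i, tokens in enumerate(sentences):
    for token in tokens:
        encoded_address[i] += address_model.wv[token]
    # Divide row i (the original indexed row j here, which left most rows as unnormalized sums)
    encoded_address[i] /= len(tokens)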
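
The address encoding in this cell amounts to mean pooling: each address becomes the average of its token vectors. A minimal sketch of that idea follows, assuming a toy lookup table `toy_vectors` and a helper `mean_pool` (both hypothetical, standing in for `address_model.wv`) so that it runs without gensim:

```python
import numpy as np

# Hypothetical 3-dimensional token vectors standing in for address_model.wv
toy_vectors = {
    "OAK": np.array([1.0, 0.0, 0.0]),
    "ST": np.array([0.0, 1.0, 0.0]),
    "/": np.array([0.0, 0.0, 1.0]),
    "LAGUNA": np.array([1.0, 1.0, 1.0]),
}

def mean_pool(tokens, vectors, dim):
    """Average the vectors of all tokens; return zeros for an empty token list."""
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)

addresses = ["OAK ST / LAGUNA ST", "OAK ST"]
# One row per address, one column per embedding dimension
encoded = np.vstack([mean_pool(a.split(" "), toy_vectors, 3) for a in addresses])
print(encoded)
```

Dividing by the token count keeps long intersection addresses and short block addresses on a comparable scale.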

In [19]:
all_data['block'] = all_data["Address"].str.contains("block", case=False)
all_data.drop("Address", axis=1, inplace=True)

In [20]:
print(all_data["X"].min(), all_data["X"].max())
print(all_data["Y"].min(), all_data["Y"].max())

-122.51364206429 -120.5
37.7078790224135 90.0


In [21]:
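# The maxima above (X = -120.5, Y = 90.0) lie far outside San Francisco,
# so treat them as invalid and replace them with the median of the valid values
X_median = all_data[all_data["X"] < -120.5]["X"].median()
Y_median = all_data[all_data["Y"] < 90]["Y"].median()
all_data.loc[all_data["X"] >= -120.5, "X"] = X_median
all_data.loc[all_data["Y"] >= 90, "Y"] = Y_median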
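
The same replacement can be tried on a toy frame to see its effect. This is only a sketch: the frame `toy` and its values are made up for illustration, and the thresholds are the ones used in the cell above.

```python
import pandas as pd

# Toy coordinates (made up); the last row uses the out-of-range values seen in the data
toy = pd.DataFrame({
    "X": [-122.42, -122.40, -122.44, -120.5],
    "Y": [37.77, 37.73, 37.80, 90.0],
})

# Flag rows whose coordinates fall outside the plausible range
bad_x = toy["X"] >= -120.5
bad_y = toy["Y"] >= 90

# Replace flagged values with the median of the remaining valid ones
toy.loc[bad_x, "X"] = toy.loc[~bad_x, "X"].median()
toy.loc[bad_y, "Y"] = toy.loc[~bad_y, "Y"].median()
print(toy)
```

Computing the medians only from valid rows keeps the invalid values from shifting the fill value.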

In [22]:
print(all_data["X"].min(), all_data["X"].max())
print(all_data["Y"].min(), all_data["Y"].max())

-122.51364206429 -122.364750704393
37.7078790224135 37.82062083807021


In [23]:
all_data["X+Y"] = all_data["X"] + all_data["Y"]
all_data["X-Y"] = all_data["X"] - all_data["Y"]
all_data["XY30_1"] = all_data["X"] * np.cos(np.pi / 6) + all_data["Y"] * np.sin(np.pi / 6)
all_data["XY30_2"] = all_data["Y"] * np.cos(np.pi / 6) - all_data["X"] * np.sin(np.pi / 6)
all_data["XY60_1"] = all_data["X"] * np.cos(np.pi / 3) + all_data["Y"] * np.sin(np.pi / 3)
all_data["XY60_2"] = all_data["Y"] * np.cos(np.pi / 3) - all_data["X"] * np.sin(np.pi / 3)
all_data["XY1"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY2"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"] - all_data["Y"].min()) ** 2
all_data["XY3"] = (all_data["X"] - all_data["X"].min()) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
all_data["XY4"] = (all_data["X"].max() - all_data["X"]) ** 2 + (all_data["Y"].max() - all_data["Y"]) ** 2
all_data["XY5"] = (all_data["X"] - X_median) ** 2 + (all_data["Y"] - Y_median) ** 2
pca = PCA(n_components=2).fit(all_data[["X", "Y"]])
XYt = pca.transform(all_data[["X", "Y"]])
all_data["XYpca1"] = XYt[:, 0]
all_data["XYpca2"] = XYt[:, 1]
# n_components=150 was selected by comparing AIC/BIC
clf = GaussianMixture(n_components=150, covariance_type="diag",
                      random_state=0).fit(all_data[["X", "Y"]])
all_data["XYcluster"] = clf.predict(all_data[["X", "Y"]])
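
The comment in this cell says the number of mixture components was chosen by AIC/BIC, but the search itself is not shown. One way such a search might look is sketched below on synthetic data; the blobs, the candidate grid, and the variable names are assumptions for illustration, not the notebook's actual procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the (X, Y) coordinates: three well-separated blobs
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in centers])

candidates = [1, 2, 3, 4, 5]  # hypothetical search grid
bic_scores = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(points)
    bic_scores.append(gmm.bic(points))  # lower BIC is better

# Pick the candidate with the lowest BIC
best_k = candidates[int(np.argmin(bic_scores))]
print(best_k, bic_scores)
```

`gmm.aic(points)` can be swapped in for `gmm.bic(points)`; BIC penalizes extra components more heavily, so it tends to pick smaller models.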

In [24]:
categorical_features = ["DayOfWeek", "PdDistrict", "block", "special_time", "XYcluster"]
ct = ColumnTransformer(transformers=[("categorical_features", OrdinalEncoder(), categorical_features)],
                       remainder="passthrough")
all_data = ct.fit_transform(all_data)

In [25]:
all_data = np.hstack((all_data, encoded_address))

In [26]:
X_train = all_data[:num_train]
X_test = all_data[num_train:]

In [27]:
prob = np.zeros((X_test.shape[0], len(le.classes_)))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(X_train, y_train):
    clf = catboost.CatBoostClassifier(n_estimators=10000, learning_rate=0.05,
                                      cat_features=np.arange(len(categorical_features)),
                                      early_stopping_rounds=100, random_seed=0, task_type="GPU",
                                      devices="0", verbose=50)
    clf.fit(X_train[train_index], y_train[train_index],
            eval_set=(X_train[test_index], y_train[test_index]))
    prob += clf.predict_proba(X_test)
prob /= skf.get_n_splits()  # average the test predictions over the folds

0:	learn: 3.4561447	test: 3.4553012	best: 3.4553012 (0)	total: 326ms	remaining: 54m 21s
50:	learn: 2.4108387	test: 2.4136224	best: 2.4136224 (50)	total: 15.6s	remaining: 50m 40s
100:	learn: 2.3457966	test: 2.3523762	best: 2.3523762 (100)	total: 29.9s	remaining: 48m 46s
150:	learn: 2.3187689	test: 2.3289203	best: 2.3289203 (150)	total: 44.1s	remaining: 47m 56s
200:	learn: 2.2995096	test: 2.3132356	best: 2.3132356 (200)	total: 58.6s	remaining: 47m 35s
250:	learn: 2.2849689	test: 2.3019091	best: 2.3019091 (250)	total: 1m 12s	remaining: 47m 12s
300:	learn: 2.2725286	test: 2.2927919	best: 2.2927919 (300)	total: 1m 27s	remaining: 46m 59s
350:	learn: 2.2614397	test: 2.2846853	best: 2.2846853 (350)	total: 1m 41s	remaining: 46m 40s
400:	learn: 2.2517570	test: 2.2781398	best: 2.2781398 (400)	total: 1m 56s	remaining: 46m 26s
450:	learn: 2.2427275	test: 2.2721585	best: 2.2721585 (450)	total: 2m 10s	remaining: 46m 12s
500:	learn: 2.2351359	test: 2.2675095	best: 2.2675095 (500)	total: 2m 25s	remaini

4350:	learn: 1.9979051	test: 2.2098679	best: 2.2098679 (4350)	total: 20m 57s	remaining: 27m 12s
4400:	learn: 1.9957627	test: 2.2097636	best: 2.2097572 (4398)	total: 21m 11s	remaining: 26m 57s
4450:	learn: 1.9936428	test: 2.2096653	best: 2.2096644 (4447)	total: 21m 26s	remaining: 26m 43s
4500:	learn: 1.9915911	test: 2.2095664	best: 2.2095659 (4497)	total: 21m 40s	remaining: 26m 28s
4550:	learn: 1.9894444	test: 2.2094910	best: 2.2094910 (4550)	total: 21m 54s	remaining: 26m 14s
4600:	learn: 1.9872094	test: 2.2094059	best: 2.2094059 (4600)	total: 22m 9s	remaining: 26m
4650:	learn: 1.9851252	test: 2.2093532	best: 2.2093532 (4650)	total: 22m 24s	remaining: 25m 45s
4700:	learn: 1.9829433	test: 2.2092640	best: 2.2092614 (4687)	total: 22m 38s	remaining: 25m 31s
4750:	learn: 1.9807645	test: 2.2092399	best: 2.2092333 (4744)	total: 22m 53s	remaining: 25m 17s
4800:	learn: 1.9785222	test: 2.2091561	best: 2.2091547 (4799)	total: 23m 7s	remaining: 25m 2s
4850:	learn: 1.9763675	test: 2.2090994	best: 2.

3000:	learn: 2.0580949	test: 2.2172665	best: 2.2172665 (3000)	total: 14m 29s	remaining: 33m 48s
3050:	learn: 2.0556759	test: 2.2170233	best: 2.2170233 (3050)	total: 14m 44s	remaining: 33m 34s
3100:	learn: 2.0531927	test: 2.2167840	best: 2.2167840 (3100)	total: 14m 58s	remaining: 33m 19s
3150:	learn: 2.0505925	test: 2.2164904	best: 2.2164904 (3150)	total: 15m 13s	remaining: 33m 5s
3200:	learn: 2.0481361	test: 2.2161731	best: 2.2161731 (3200)	total: 15m 27s	remaining: 32m 50s
3250:	learn: 2.0456332	test: 2.2159694	best: 2.2159667 (3247)	total: 15m 42s	remaining: 32m 36s
3300:	learn: 2.0431416	test: 2.2157717	best: 2.2157671 (3297)	total: 15m 56s	remaining: 32m 21s
3350:	learn: 2.0406951	test: 2.2155351	best: 2.2155341 (3349)	total: 16m 11s	remaining: 32m 6s
3400:	learn: 2.0382838	test: 2.2153471	best: 2.2153426 (3398)	total: 16m 25s	remaining: 31m 52s
3450:	learn: 2.0359609	test: 2.2151321	best: 2.2151321 (3450)	total: 16m 39s	remaining: 31m 37s
3500:	learn: 2.0336837	test: 2.2149043	bes

1750:	learn: 2.1280807	test: 2.2247793	best: 2.2247793 (1750)	total: 8m 26s	remaining: 39m 48s
1800:	learn: 2.1251156	test: 2.2241411	best: 2.2241411 (1800)	total: 8m 41s	remaining: 39m 33s
1850:	learn: 2.1220810	test: 2.2234948	best: 2.2234948 (1850)	total: 8m 55s	remaining: 39m 18s
1900:	learn: 2.1192334	test: 2.2229756	best: 2.2229756 (1900)	total: 9m 10s	remaining: 39m 3s
1950:	learn: 2.1162760	test: 2.2223272	best: 2.2223272 (1950)	total: 9m 24s	remaining: 38m 48s
2000:	learn: 2.1134931	test: 2.2218487	best: 2.2218487 (2000)	total: 9m 38s	remaining: 38m 33s
2050:	learn: 2.1104430	test: 2.2212545	best: 2.2212545 (2050)	total: 9m 53s	remaining: 38m 19s
2100:	learn: 2.1075914	test: 2.2207162	best: 2.2207162 (2100)	total: 10m 7s	remaining: 38m 4s
2150:	learn: 2.1047738	test: 2.2201376	best: 2.2201376 (2150)	total: 10m 22s	remaining: 37m 50s
2200:	learn: 2.1018756	test: 2.2196580	best: 2.2196580 (2200)	total: 10m 36s	remaining: 37m 36s
2250:	learn: 2.0991600	test: 2.2192011	best: 2.219

150:	learn: 2.3188511	test: 2.3257162	best: 2.3257162 (150)	total: 44.5s	remaining: 48m 21s
200:	learn: 2.2998364	test: 2.3100088	best: 2.3100088 (200)	total: 58.9s	remaining: 47m 51s
250:	learn: 2.2858224	test: 2.2992176	best: 2.2992176 (250)	total: 1m 13s	remaining: 47m 29s
300:	learn: 2.2730833	test: 2.2896676	best: 2.2896676 (300)	total: 1m 27s	remaining: 47m 12s
350:	learn: 2.2622565	test: 2.2820481	best: 2.2820481 (350)	total: 1m 42s	remaining: 46m 58s
400:	learn: 2.2527948	test: 2.2755277	best: 2.2755277 (400)	total: 1m 56s	remaining: 46m 40s
450:	learn: 2.2442702	test: 2.2700042	best: 2.2700042 (450)	total: 2m 11s	remaining: 46m 24s
500:	learn: 2.2359597	test: 2.2647849	best: 2.2647849 (500)	total: 2m 26s	remaining: 46m 13s
550:	learn: 2.2288822	test: 2.2606057	best: 2.2606057 (550)	total: 2m 40s	remaining: 45m 57s
600:	learn: 2.2224813	test: 2.2569824	best: 2.2569824 (600)	total: 2m 55s	remaining: 45m 40s
650:	learn: 2.2162840	test: 2.2534924	best: 2.2534924 (650)	total: 3m 9s

4500:	learn: 1.9914854	test: 2.2054071	best: 2.2054071 (4500)	total: 21m 43s	remaining: 26m 33s
4550:	learn: 1.9893821	test: 2.2053511	best: 2.2053511 (4550)	total: 21m 58s	remaining: 26m 18s
4600:	learn: 1.9872471	test: 2.2052483	best: 2.2052385 (4593)	total: 22m 13s	remaining: 26m 4s
4650:	learn: 1.9850229	test: 2.2051168	best: 2.2051168 (4650)	total: 22m 27s	remaining: 25m 49s
4700:	learn: 1.9828250	test: 2.2050358	best: 2.2050358 (4700)	total: 22m 42s	remaining: 25m 35s
4750:	learn: 1.9807781	test: 2.2049489	best: 2.2049484 (4749)	total: 22m 56s	remaining: 25m 20s
4800:	learn: 1.9786610	test: 2.2048547	best: 2.2048520 (4799)	total: 23m 10s	remaining: 25m 6s
4850:	learn: 1.9764587	test: 2.2047862	best: 2.2047862 (4850)	total: 23m 25s	remaining: 24m 51s
4900:	learn: 1.9743929	test: 2.2047712	best: 2.2047673 (4890)	total: 23m 39s	remaining: 24m 37s
4950:	learn: 1.9721882	test: 2.2047594	best: 2.2047346 (4932)	total: 23m 54s	remaining: 24m 22s
5000:	learn: 1.9700773	test: 2.2046950	bes

3200:	learn: 2.0484165	test: 2.2087330	best: 2.2087330 (3200)	total: 15m 9s	remaining: 32m 11s
3250:	learn: 2.0460501	test: 2.2084516	best: 2.2084516 (3250)	total: 15m 23s	remaining: 31m 56s
3300:	learn: 2.0435599	test: 2.2082075	best: 2.2082075 (3300)	total: 15m 37s	remaining: 31m 42s
3350:	learn: 2.0412206	test: 2.2079839	best: 2.2079834 (3348)	total: 15m 51s	remaining: 31m 27s
3400:	learn: 2.0388310	test: 2.2077703	best: 2.2077703 (3400)	total: 16m 5s	remaining: 31m 13s
3450:	learn: 2.0366402	test: 2.2074884	best: 2.2074884 (3450)	total: 16m 19s	remaining: 30m 59s
3500:	learn: 2.0344298	test: 2.2073426	best: 2.2073426 (3500)	total: 16m 33s	remaining: 30m 44s
3550:	learn: 2.0320652	test: 2.2071500	best: 2.2071500 (3550)	total: 16m 47s	remaining: 30m 30s
3600:	learn: 2.0296870	test: 2.2069677	best: 2.2069677 (3600)	total: 17m 2s	remaining: 30m 16s
3650:	learn: 2.0274739	test: 2.2068128	best: 2.2068128 (3650)	total: 17m 16s	remaining: 30m 3s
3700:	learn: 2.0252201	test: 2.2066427	best:
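
The training loop above averages test-set probabilities across folds but reports only the per-fold CatBoost logs. A minimal sketch of the same pattern that also collects out-of-fold predictions for a single overall log loss is shown below. It assumes synthetic data and uses `LogisticRegression` as a cheap stand-in for CatBoost; the names `oof`, `avg`, and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class problem standing in for the crime data
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_new = X[:20]  # stand-in for the test rows

n_classes = len(np.unique(y))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((X.shape[0], n_classes))      # out-of-fold predictions
avg = np.zeros((X_new.shape[0], n_classes))  # fold-averaged test predictions
for fit_idx, val_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[fit_idx], y[fit_idx])
    # Each training row is predicted once, by the model that did not see it
    oof[val_idx] = model.predict_proba(X[val_idx])
    avg += model.predict_proba(X_new) / skf.get_n_splits()

print("OOF log loss:", log_loss(y, oof))
```

Because every row of `oof` comes from a model that never trained on that row, its log loss gives one validation figure for the whole ensemble rather than five separate ones.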

In [28]:
%%time
submission = pd.DataFrame(np.c_[test_ID, prob], columns=["Id"] + list(le.classes_))
submission["Id"] = submission["Id"].astype(int)
submission.to_csv("submission/v1-1.gz", compression="gzip", index=False)

CPU times: user 2min 28s, sys: 272 ms, total: 2min 28s
Wall time: 2min 29s


In [27]:
prob = np.zeros((X_test.shape[0], len(le.classes_)))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_index, test_index in skf.split(X_train, y_train):
    clf = catboost.CatBoostClassifier(n_estimators=10000, learning_rate=0.05,
                                      cat_features=np.arange(len(categorical_features)),
                                      early_stopping_rounds=100, random_seed=1, task_type="GPU",
                                      devices="0", verbose=50)
    clf.fit(X_train[train_index], y_train[train_index],
            eval_set=(X_train[test_index], y_train[test_index]))
    prob += clf.predict_proba(X_test)
prob /= skf.get_n_splits()  # average the test predictions over the folds

0:	learn: 3.4550124	test: 3.4544966	best: 3.4544966 (0)	total: 373ms	remaining: 1h 2m 5s
50:	learn: 2.4098463	test: 2.4116950	best: 2.4116950 (50)	total: 14.9s	remaining: 48m 28s
100:	learn: 2.3450458	test: 2.3508708	best: 2.3508708 (100)	total: 28.7s	remaining: 46m 53s
150:	learn: 2.3180195	test: 2.3278865	best: 2.3278865 (150)	total: 42.5s	remaining: 46m 9s
200:	learn: 2.2985936	test: 2.3122144	best: 2.3122144 (200)	total: 56.4s	remaining: 45m 49s
250:	learn: 2.2844093	test: 2.3014572	best: 2.3014572 (250)	total: 1m 10s	remaining: 45m 28s
300:	learn: 2.2720167	test: 2.2923466	best: 2.2923466 (300)	total: 1m 24s	remaining: 45m 19s
350:	learn: 2.2615086	test: 2.2850230	best: 2.2850230 (350)	total: 1m 38s	remaining: 45m 7s
400:	learn: 2.2519090	test: 2.2785143	best: 2.2785143 (400)	total: 1m 52s	remaining: 44m 56s
450:	learn: 2.2431493	test: 2.2728379	best: 2.2728379 (450)	total: 2m 6s	remaining: 44m 45s
500:	learn: 2.2352198	test: 2.2681261	best: 2.2681261 (500)	total: 2m 21s	remaining

4350:	learn: 1.9967268	test: 2.2093502	best: 2.2093440 (4347)	total: 20m 29s	remaining: 26m 36s
4400:	learn: 1.9944593	test: 2.2092699	best: 2.2092699 (4400)	total: 20m 44s	remaining: 26m 22s
4450:	learn: 1.9922420	test: 2.2091875	best: 2.2091861 (4449)	total: 20m 58s	remaining: 26m 8s
4500:	learn: 1.9899667	test: 2.2090892	best: 2.2090783 (4495)	total: 21m 12s	remaining: 25m 54s
4550:	learn: 1.9877699	test: 2.2090134	best: 2.2090098 (4549)	total: 21m 26s	remaining: 25m 40s
4600:	learn: 1.9856730	test: 2.2089176	best: 2.2089156 (4597)	total: 21m 40s	remaining: 25m 26s
4650:	learn: 1.9835258	test: 2.2088473	best: 2.2088407 (4628)	total: 21m 54s	remaining: 25m 12s
4700:	learn: 1.9813356	test: 2.2087729	best: 2.2087729 (4700)	total: 22m 9s	remaining: 24m 58s
4750:	learn: 1.9793660	test: 2.2087226	best: 2.2087226 (4750)	total: 22m 23s	remaining: 24m 43s
4800:	learn: 1.9772229	test: 2.2086094	best: 2.2086094 (4800)	total: 22m 37s	remaining: 24m 29s
4850:	learn: 1.9751770	test: 2.2085454	bes

3000:	learn: 2.0574940	test: 2.2164740	best: 2.2164740 (3000)	total: 14m 11s	remaining: 33m 6s
3050:	learn: 2.0550085	test: 2.2162484	best: 2.2162484 (3050)	total: 14m 25s	remaining: 32m 51s
3100:	learn: 2.0526380	test: 2.2159913	best: 2.2159913 (3100)	total: 14m 39s	remaining: 32m 37s
3150:	learn: 2.0502449	test: 2.2157482	best: 2.2157482 (3150)	total: 14m 53s	remaining: 32m 22s
3200:	learn: 2.0478212	test: 2.2155824	best: 2.2155776 (3199)	total: 15m 7s	remaining: 32m 8s
3250:	learn: 2.0453318	test: 2.2153639	best: 2.2153631 (3249)	total: 15m 22s	remaining: 31m 54s
3300:	learn: 2.0428710	test: 2.2151257	best: 2.2151257 (3300)	total: 15m 36s	remaining: 31m 40s
3350:	learn: 2.0403514	test: 2.2149394	best: 2.2149394 (3350)	total: 15m 50s	remaining: 31m 25s
3400:	learn: 2.0379612	test: 2.2147319	best: 2.2147319 (3400)	total: 16m 4s	remaining: 31m 11s
3450:	learn: 2.0355472	test: 2.2145447	best: 2.2145258 (3443)	total: 16m 18s	remaining: 30m 57s
3500:	learn: 2.0331100	test: 2.2143228	best:

2000:	learn: 2.1119544	test: 2.2168293	best: 2.2168293 (2000)	total: 9m 28s	remaining: 37m 52s
2050:	learn: 2.1091423	test: 2.2163074	best: 2.2163065 (2049)	total: 9m 42s	remaining: 37m 37s
2100:	learn: 2.1060522	test: 2.2157270	best: 2.2157270 (2100)	total: 9m 56s	remaining: 37m 23s
2150:	learn: 2.1032771	test: 2.2151414	best: 2.2151414 (2150)	total: 10m 10s	remaining: 37m 8s
2200:	learn: 2.1004133	test: 2.2146896	best: 2.2146896 (2200)	total: 10m 24s	remaining: 36m 54s
2250:	learn: 2.0975667	test: 2.2142323	best: 2.2142323 (2250)	total: 10m 39s	remaining: 36m 39s
2300:	learn: 2.0947210	test: 2.2136697	best: 2.2136697 (2300)	total: 10m 53s	remaining: 36m 25s
2350:	learn: 2.0919153	test: 2.2132781	best: 2.2132781 (2350)	total: 11m 7s	remaining: 36m 11s
2400:	learn: 2.0892689	test: 2.2128670	best: 2.2128670 (2400)	total: 11m 21s	remaining: 35m 57s
2450:	learn: 2.0865701	test: 2.2124581	best: 2.2124581 (2450)	total: 11m 35s	remaining: 35m 43s
2500:	learn: 2.0840634	test: 2.2121050	best: 

450:	learn: 2.2437542	test: 2.2713856	best: 2.2713856 (450)	total: 2m 8s	remaining: 45m 12s
500:	learn: 2.2359052	test: 2.2667149	best: 2.2667149 (500)	total: 2m 22s	remaining: 44m 58s
550:	learn: 2.2287423	test: 2.2624698	best: 2.2624698 (550)	total: 2m 36s	remaining: 44m 45s
600:	learn: 2.2223334	test: 2.2587909	best: 2.2587909 (600)	total: 2m 50s	remaining: 44m 26s
650:	learn: 2.2162847	test: 2.2555840	best: 2.2555840 (650)	total: 3m 4s	remaining: 44m 12s
700:	learn: 2.2099984	test: 2.2524784	best: 2.2524784 (700)	total: 3m 19s	remaining: 44m 1s
750:	learn: 2.2041391	test: 2.2495564	best: 2.2495564 (750)	total: 3m 33s	remaining: 43m 48s
800:	learn: 2.1987941	test: 2.2470768	best: 2.2470768 (800)	total: 3m 47s	remaining: 43m 34s
850:	learn: 2.1938254	test: 2.2448971	best: 2.2448971 (850)	total: 4m 1s	remaining: 43m 19s
900:	learn: 2.1891265	test: 2.2428663	best: 2.2428663 (900)	total: 4m 15s	remaining: 43m 4s
950:	learn: 2.1847464	test: 2.2411202	best: 2.2411202 (950)	total: 4m 30s	r

4800:	learn: 1.9759065	test: 2.2067991	best: 2.2067991 (4800)	total: 22m 38s	remaining: 24m 31s
4850:	learn: 1.9738778	test: 2.2067366	best: 2.2067320 (4844)	total: 22m 52s	remaining: 24m 17s
4900:	learn: 1.9717469	test: 2.2066338	best: 2.2066338 (4900)	total: 23m 6s	remaining: 24m 2s
4950:	learn: 1.9696742	test: 2.2066346	best: 2.2066078 (4916)	total: 23m 21s	remaining: 23m 48s
5000:	learn: 1.9673599	test: 2.2065523	best: 2.2065494 (4999)	total: 23m 35s	remaining: 23m 34s
5050:	learn: 1.9652770	test: 2.2065136	best: 2.2065038 (5042)	total: 23m 49s	remaining: 23m 20s
5100:	learn: 1.9630619	test: 2.2064113	best: 2.2064113 (5100)	total: 24m 3s	remaining: 23m 6s
5150:	learn: 1.9608568	test: 2.2063082	best: 2.2063082 (5150)	total: 24m 18s	remaining: 22m 52s
5200:	learn: 1.9588894	test: 2.2062798	best: 2.2062727 (5162)	total: 24m 32s	remaining: 22m 38s
5250:	learn: 1.9568325	test: 2.2062190	best: 2.2062081 (5244)	total: 24m 46s	remaining: 22m 24s
5300:	learn: 1.9548006	test: 2.2061612	best:

2850:	learn: 2.0648919	test: 2.2137774	best: 2.2137774 (2850)	total: 13m 27s	remaining: 33m 45s
2900:	learn: 2.0623962	test: 2.2135005	best: 2.2135005 (2900)	total: 13m 41s	remaining: 33m 31s
2950:	learn: 2.0598194	test: 2.2132714	best: 2.2132714 (2950)	total: 13m 56s	remaining: 33m 16s
3000:	learn: 2.0574241	test: 2.2130587	best: 2.2130587 (3000)	total: 14m 10s	remaining: 33m 2s
3050:	learn: 2.0550547	test: 2.2128139	best: 2.2128139 (3050)	total: 14m 24s	remaining: 32m 48s
3100:	learn: 2.0524694	test: 2.2125996	best: 2.2125996 (3100)	total: 14m 38s	remaining: 32m 34s
3150:	learn: 2.0498970	test: 2.2123485	best: 2.2123485 (3150)	total: 14m 52s	remaining: 32m 20s
3200:	learn: 2.0471993	test: 2.2120785	best: 2.2120771 (3199)	total: 15m 6s	remaining: 32m 6s
3250:	learn: 2.0447474	test: 2.2118685	best: 2.2118685 (3250)	total: 15m 21s	remaining: 31m 52s
3300:	learn: 2.0423422	test: 2.2116421	best: 2.2116421 (3300)	total: 15m 35s	remaining: 31m 37s
3350:	learn: 2.0401965	test: 2.2114924	best

In [28]:
%%time
submission = pd.DataFrame(np.c_[test_ID, prob], columns=["Id"] + list(le.classes_))
submission["Id"] = submission["Id"].astype(int)
submission.to_csv("submission/v1-2.gz", compression="gzip", index=False)

CPU times: user 2min 27s, sys: 5.33 ms, total: 2min 27s
Wall time: 2min 27s


In [None]:
df1 = pd.read_csv("submission/v1-1.gz", compression="gzip")
df2 = pd.read_csv("submission/v1-2.gz", compression="gzip")
for col in df1.columns[1:]:
    df1[col] = (df1[col] + df2[col]) / 2
df1.to_csv("submission/v1.gz", compression="gzip", index=False)
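
The blend above relies on both files listing rows in the same order. A slightly more defensive variant is sketched here on toy frames: it matches rows on `Id` and checks that the averaged probabilities still sum to 1. The frames and the class columns `A` and `B` are made up for illustration.

```python
import pandas as pd

# Toy submissions with hypothetical class columns "A" and "B"
sub1 = pd.DataFrame({"Id": [0, 1], "A": [0.8, 0.3], "B": [0.2, 0.7]})
sub2 = pd.DataFrame({"Id": [1, 0], "A": [0.5, 0.6], "B": [0.5, 0.4]})  # different row order

# Index by Id so rows are matched by key rather than by position
frames = [s.set_index("Id").sort_index() for s in (sub1, sub2)]
blend = sum(frames) / len(frames)

# Averaged probabilities should still sum to 1 in every row
assert ((blend.sum(axis=1) - 1).abs() < 1e-9).all()
blend = blend.reset_index()
print(blend)
```

The list `frames` can hold any number of submissions, so the same lines extend to more than two seeds.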