Objective

The objective of this contest is to predict the probability that a given driver accepts an offer.

Data

Data consists of the following features:

* offer_gk – unique offer identifier (INT)
* weekday_key – day of week number (Sunday = 0, Monday = 1, etc.) (INT)
* hour_key – hour of day representing an hour part of datetime (value from 0 to 23) (INT)
* driver_gk – unique driver identifier (INT)
* order_gk – unique order identifier (INT). Order may have multiple offers
* driver_latitude – latitude of the driver at the time of receiving the offer (FLOAT)
* driver_longitude – longitude of the driver at the time of receiving the offer (FLOAT)
* origin_order_latitude – latitude of the order start location at the time of receiving the offer (FLOAT)
* origin_order_longitude – longitude of the order start location at the time of receiving the offer (FLOAT)
* distance_km – estimated distance from origin to destination in kilometres (FLOAT). Value -1 means that the destination is not set
* duration_min – estimated duration from origin to destination in minutes (FLOAT). Value -1 means that the destination is not set
* offer_class_group – class of the order, e.g. Economy, Business, XL (VARCHAR)
* ride_type_desc – private or business order attribute (VARCHAR)
* driver_response – driver choice of whether to accept the offer or not (VARCHAR) 

The variable to be predicted is “driver_response”.

Files:
* CAX_TestData_McK.csv
* CAX_TrainingData_McK.csv
* McK_SubmissionFormat.csv

In [1]:
!head -n2 CAX_TrainingData_McK.csv

offer_gk,weekday_key,hour_key,driver_gk,order_gk,driver_latitude,driver_longitude,origin_order_latitude,origin_order_longitude,distance_km,duration_min,offer_class_group,ride_type_desc,driver_response
1105373,5,20,6080,174182,55.818842,37.334562,55.814567,37.35501,-1,-1,Economy,private,0


In [2]:
!head -n2 McK_SubmissionFormat.csv

offer_gk,driver_response
152446,


In [7]:
#imports
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pickle
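try:
    from geopy.distance import vincenty  # removed in geopy 2.0
except ImportError:
    from geopy.distance import geodesic as vincenty  # closest replacement in newer geopy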
from sklearn.preprocessing import StandardScaler

In [4]:
types = {'offer_gk': 'uint32',
         'weekday_key' :'uint8',
         'hour_key' : 'uint8',
         'driver_gk' : 'uint32',
         'order_gk' : 'uint32',
         'driver_latitude' :'float32',
         'driver_longitude' :'float32',
         'origin_order_latitude' : 'float32',
         'origin_order_longitude' :'float32',
         'distance_km' : 'float32',
         'duration_min' :'float32',
         'offer_class_group' :'object',
         'ride_type_desc' :'object',
         'driver_response' :'object'}

In [5]:
train = pd.read_csv("CAX_TrainingData_McK.csv", dtype=types)

In [6]:
test = pd.read_csv("CAX_TestData_McK.csv", dtype=types)

In [7]:
train['is_test'] = False
test['is_test'] = True

In [8]:
all_data = pd.concat([train, test])

Both train and test contain a number of distance/duration values equal to 0; we set them to -1 (destination not set).

In [9]:
all_data.loc[all_data.duration_min == 0, 'duration_min'] = -1
all_data.loc[all_data.distance_km == 0, 'distance_km'] = -1

In [10]:
all_data=all_data.reset_index(drop=True)

Cluster the coordinates

In [11]:
all_coords = np.vstack([all_data[['driver_latitude','driver_longitude']], all_data[['origin_order_latitude', 'origin_order_longitude']]])
all_coords = all_coords.astype(np.float32)
all_coords=list(zip(all_coords[:,0], all_coords[:,1]))

In [11]:
# cl = KMeans(n_clusters=42)
# clusters = cl.fit_predict(all_coords)
# pickle.dump(clusters, open('clusters.pkl', 'wb'))

In [12]:
clusters = pickle.load(open('clusters.pkl', 'rb'))

In [16]:
# all_coords stacks driver points first, then order points, so the labels split in half
half_len = clusters.shape[0] // 2

In [26]:
driver_clusters = clusters[:half_len]
order_clusters = clusters[half_len:]

In [25]:
all_data['driver_cluster'] = driver_clusters

In [27]:
all_data['order_cluster'] = order_clusters
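
The stack-then-split logic used for the clusters can be illustrated on toy data. This is a minimal sketch, not the original run: the six coordinates and the settings `n_clusters=2`, `n_init=10` and `random_state=0` are made up for illustration (the notebook itself used 42 clusters on the full data).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy coordinates (lat, lon): three drivers and three orders
driver_xy = np.array([[55.80, 37.30], [55.81, 37.31], [55.60, 37.70]])
order_xy = np.array([[55.80, 37.31], [55.61, 37.71], [55.60, 37.69]])

# Stack drivers on top of orders so both share one set of cluster ids
all_xy = np.vstack([driver_xy, order_xy])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(all_xy)

# The first half of the labels belongs to drivers, the second half to orders
half = len(labels) // 2
driver_labels, order_labels = labels[:half], labels[half:]
print(driver_labels, order_labels)
```

Because drivers and orders are clustered together, a driver and an order that sit in the same area get the same cluster id, which is what makes the two columns comparable.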

In [28]:
all_data.head()

Unnamed: 0,offer_gk,weekday_key,hour_key,driver_gk,order_gk,driver_latitude,driver_longitude,origin_order_latitude,origin_order_longitude,distance_km,duration_min,offer_class_group,ride_type_desc,driver_response,is_test,km_per_min,distance_log,driver_cluster,order_cluster
0,1105373,5,20,6080,174182,55.818844,37.33456,55.814568,37.355011,-1.0,-1.0,Economy,private,0,False,1.0,0.0,12,12
1,759733,5,14,6080,358774,55.805344,37.515022,55.819328,37.466396,18.802,25.216999,Standard,private,1,False,0.745608,2.933963,28,19
2,416977,6,14,6080,866260,55.813976,37.347687,55.814827,37.354073,6.747,9.8,Economy,private,0,False,0.688469,1.909098,12,12
3,889660,2,6,6080,163522,55.745922,37.421749,55.743469,37.431129,-1.0,-1.0,Economy,private,1,False,1.0,0.0,7,7
4,1120055,4,16,6080,506710,55.803577,37.521603,55.812557,37.527409,12.383,19.25,Economy,private,1,False,0.643273,2.516325,28,28


* offer_gk - not used in the model
* weekday_key - harmonic encoding, OHE
* hour_key - harmonic encoding, OHE
* hour + weekday - OHE, hash trick
* time_of_day - OHE
* driver_gk - hash trick
* order_gk - not used
* driver_latitude - scaling
* driver_longitude - scaling
* Driver cluster - hash trick, OHE
* origin_order_latitude - scaling
* origin_order_longitude - scaling
* Order cluster - hash trick, OHE
* Straight-line distance - scaling
* distance_km - real-valued - take the logarithm
* duration_min - real-valued - take the logarithm
* km_per_min - distance_km/duration_min
* offer_class_group - OHE
* ride_type_desc - OHE
* driver_response - this is the label

In [8]:
columns_to_exclude=['order_gk',  
                    'weekday_key', 'hour_key', 'driver_gk', 'ride_type_desc', 'driver_cluster', 'order_cluster',
                    'time_of_day', 'distance_km', 'duration_min', 'offer_class_group']

In [5]:
def make_harmonic_features(value, period=24):
    # Map a cyclic value (hour, weekday) to a point on the unit circle
    angle = value * 2 * np.pi / period
    return np.cos(angle), np.sin(angle)
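
A short sketch of why the harmonic encoding is useful for cyclic values: on the unit circle, hour 23 is as close to hour 0 as hour 1 is, which the raw integers do not express. The helper names `harmonic` and `circle_distance` are illustrative and not part of the notebook.

```python
import math

def harmonic(value, period):
    """Map a cyclic value onto the unit circle as (cos, sin)."""
    angle = 2 * math.pi * value / period
    return math.cos(angle), math.sin(angle)

def circle_distance(a, b, period):
    """Euclidean distance between two harmonically encoded values."""
    (ca, sa), (cb, sb) = harmonic(a, period), harmonic(b, period)
    return math.hypot(ca - cb, sa - sb)

d_wrap = circle_distance(23, 0, period=24)  # across midnight
d_next = circle_distance(0, 1, period=24)   # neighbouring hours
d_far = circle_distance(0, 12, period=24)   # opposite sides of the day
print(d_wrap, d_next, d_far)
```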

In [6]:
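HASH_SPACE = 2**20
def hash_trick(feature, value):
    # Note: built-in hash() of a str is salted per interpreter run unless
    # PYTHONHASHSEED is fixed, so buckets are only stable within one session
    return hash(feature+'='+str(value)) % HASH_SPACE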
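
Python's built-in `hash()` on strings is randomised per process unless `PYTHONHASHSEED` is set, so bucket ids can differ between notebook sessions. If reproducible buckets are needed, one possible alternative is to derive the bucket from a cryptographic digest. The function name `stable_hash_trick` and the choice of SHA-256 are assumptions made for this sketch, not what the notebook used.

```python
import hashlib

HASH_SPACE = 2**20

def stable_hash_trick(feature, value, hash_space=HASH_SPACE):
    """Map 'feature=value' to a bucket id that is the same in every run."""
    key = f"{feature}={value}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer
    return int.from_bytes(digest[:8], "big") % hash_space

bucket = stable_hash_trick("driver_gk", 6080)
print(bucket)
```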

In [7]:
def OHE(data, prefix):
    return pd.get_dummies(data, prefix=prefix)
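
For reference, a small sketch of what `pd.get_dummies` produces with a prefix. The four sample values are made up; the resulting column names follow the same `prefix_value` pattern as the `off_*` columns listed later in the notebook.

```python
import pandas as pd

# Hypothetical sample of the offer_class_group column
offer_class = pd.Series(["Economy", "Standard", "Economy", "XL"])

# One indicator column per distinct value, named '<prefix>_<value>'
encoded = pd.get_dummies(offer_class, prefix="off")
print(encoded)
```

Each row has exactly one active indicator, so the number of new columns equals the number of distinct categories.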

In [8]:
def distance(x):
    # x = (driver_lat, driver_lon, order_lat, order_lon) in degrees; returns kilometres
    return vincenty((x[0], x[1]), (x[2], x[3])).kilometers
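
If geopy is not available, a dependency-free approximation of the straight-line distance is the haversine formula. This is a swapped-in technique, not the notebook's method: it assumes a spherical Earth with a mean radius of 6371 km, so results differ slightly from an ellipsoidal calculation. The coordinates below are the driver and order positions from the first training row.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Driver and order origin from the first row of the training file
d = haversine_km(55.818842, 37.334562, 55.814567, 37.35501)
print(round(d, 3))
```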

In [9]:
def time_of_day(x):
    # Buckets by hour: 3-7 -> 0, 8-13 -> 1, 14-17 -> 2, 18-23 and 0-2 -> 3
    if 2 < x <= 7:
        return 0
    elif 7 < x <= 13:
        return 1
    elif 13 < x <= 17:
        return 2
    else:
        return 3
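
To make the bucket boundaries explicit, the sketch below re-states the same function and tabulates it for all 24 hours. Note that the last bucket wraps around midnight: it covers hours 18 to 23 as well as 0 to 2.

```python
def time_of_day(hour):
    """Same bucketing as in the notebook, written with chained comparisons."""
    if 2 < hour <= 7:
        return 0
    elif 7 < hour <= 13:
        return 1
    elif 13 < hour <= 17:
        return 2
    return 3

# Hour -> bucket lookup for the whole day
buckets = {hour: time_of_day(hour) for hour in range(24)}
for bucket in range(4):
    print(bucket, [h for h, b in buckets.items() if b == bucket])
```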

In [53]:
all_data['cos_hr'] = all_data.hour_key.apply(lambda x: make_harmonic_features(x, period=24)[0])
all_data['sin_hr'] = all_data.hour_key.apply(lambda x: make_harmonic_features(x, period=24)[1])

In [55]:
all_data['cos_day'] = all_data.weekday_key.apply(lambda x: make_harmonic_features(x, period=7)[0])
all_data['sin_day'] = all_data.weekday_key.apply(lambda x: make_harmonic_features(x, period=7)[1])

In [62]:
hour_ohe = OHE(all_data.hour_key, 'hrs')
day_ohe = OHE(all_data.weekday_key, 'wds')

In [65]:
all_data = pd.concat([all_data, hour_ohe], axis=1)
all_data = pd.concat([all_data, day_ohe], axis=1)

In [9]:
all_data = pickle.load(open('all_data,pkl', 'rb'))

In [10]:
all_data['driver_hash'] = all_data.driver_gk.apply(lambda x: hash_trick('driver_gk', x))

In [13]:
pickle.dump(all_data, open('all_data,pkl', 'wb'))

In [14]:
# StandardScaler expects a 2-D input, so pass a one-column frame and flatten the result
all_data['driver_hash'] = StandardScaler().fit_transform(all_data[['driver_hash']]).ravel()



In [22]:
# Cast first: the key columns are uint8 and weekday*100 + hour (up to 623) does not fit in 8 bits
all_data['day_hour'] = all_data['weekday_key'].astype('int32') * 100 + all_data['hour_key']

In [23]:
all_data['day_hour'] = all_data['day_hour'].apply(lambda x: hash_trick('day_hour', x))

In [24]:
all_data['day_hour'] = StandardScaler().fit_transform(all_data[['day_hour']]).ravel()
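
One thing to watch when combining weekday and hour into a single key: both columns were loaded as `uint8`, and array arithmetic on `uint8` wraps around at 256 without raising an error. The sketch below uses made-up values to show the effect and a cast that avoids it.

```python
import numpy as np

# Hypothetical sample values stored as uint8, like the loaded key columns
weekday = np.array([0, 5, 6], dtype=np.uint8)
hour = np.array([3, 15, 23], dtype=np.uint8)

# Pure uint8 arithmetic silently wraps modulo 256
wrapped = weekday * np.uint8(100) + hour

# Casting to a wider integer first keeps every (weekday, hour) pair distinct
safe = weekday.astype(np.int32) * 100 + hour

print(wrapped, safe)
```

In the wrapped version, Sunday 03:00 and Friday 15:00 end up with the same key, which would make the combined feature ambiguous.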



In [33]:
all_data['time_of_day'] = all_data['hour_key'].apply(time_of_day)

In [34]:
tod_ohe=OHE(all_data['time_of_day'], 'tod')

In [35]:
all_data = pd.concat([all_data, tod_ohe], axis=1)

In [38]:
all_data['driver_latitude'] = StandardScaler().fit_transform(all_data[['driver_latitude']]).ravel()



In [41]:
all_data['driver_longitude'] = StandardScaler().fit_transform(all_data[['driver_longitude']]).ravel()



In [47]:
all_data['origin_order_latitude'] = StandardScaler().fit_transform(all_data[['origin_order_latitude']]).ravel()



In [49]:
all_data['origin_order_longitude'] = StandardScaler().fit_transform(all_data[['origin_order_longitude']]).ravel()
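
For reference, standard scaling subtracts the column mean and divides by the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` computes by default. A minimal NumPy sketch on made-up latitude values:

```python
import numpy as np

# Hypothetical latitude sample
lat = np.array([55.74, 55.80, 55.81, 55.82])

# z-score: zero mean, unit variance (np.std defaults to ddof=0)
scaled = (lat - lat.mean()) / lat.std()
print(scaled)
```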



In [54]:
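all_data['km_per_min'] = 0.0
# Only rows with a known destination (both values positive) get a speed; the rest stay at 0
has_dest = (all_data.distance_km > 0) & (all_data.duration_min > 0)
all_data.loc[has_dest, 'km_per_min'] = all_data.loc[has_dest, 'distance_km'] / all_data.loc[has_dest, 'duration_min']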

In [55]:
all_data['distance_log'] = 0.0
all_data.loc[all_data.distance_km > 0, 'distance_log'] = np.log(all_data.loc[all_data.distance_km > 0, 'distance_km'])

In [57]:
all_data['duration_log'] = 0.0
all_data.loc[all_data.duration_min > 0, 'duration_log'] = np.log(all_data.loc[all_data.duration_min > 0, 'duration_min'])
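
The same "log only where the value is positive, otherwise 0" rule can be written as a single masked NumPy call. This is an equivalent sketch rather than the notebook's code; the sample contains the -1 sentinel, a value of 1 km, and the 18.802 km distance from the second training row.

```python
import numpy as np

# -1 marks "destination not set"
distance_km = np.array([-1.0, 1.0, 18.802])

# Start from zeros and fill in the logarithm only where the input is positive
distance_log = np.zeros_like(distance_km)
np.log(distance_km, out=distance_log, where=distance_km > 0)
print(distance_log)
```

Keeping the sentinel rows at 0 means they coincide with a real distance of exactly 1 km, so a model cannot tell the two apart from this column alone.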

In [62]:
offer_ohe = OHE(all_data.offer_class_group, 'off')
ride_ohe = OHE(all_data.ride_type_desc, 'rd')

In [63]:
all_data = pd.concat([all_data, offer_ohe], axis = 1)
all_data = pd.concat([all_data, ride_ohe], axis = 1)

In [66]:
drcl_ohe = OHE(all_data.driver_cluster, 'dr_cl')
ofcl_ohe = OHE(all_data.order_cluster, 'of_cl')

In [67]:
all_data = pd.concat([all_data, drcl_ohe], axis = 1)
all_data = pd.concat([all_data, ofcl_ohe], axis = 1)

In [3]:
all_data = pickle.load(open('all_data,pkl', 'rb'))

In [12]:
all_data['drcl_hash'] = all_data.driver_cluster.apply(lambda x: hash_trick('drcl_hash', x))

In [13]:
all_data['orcl_hash'] = all_data.order_cluster.apply(lambda x: hash_trick('orcl_hash', x))

In [14]:
all_data['drcl_hash'] = StandardScaler().fit_transform(all_data[['drcl_hash']]).ravel()



In [15]:
all_data['orcl_hash'] = StandardScaler().fit_transform(all_data[['orcl_hash']]).ravel()



In [23]:
# distance() expects raw coordinates in degrees, so this must run on unscaled latitude/longitude
all_data['dist_to_order'] = all_data[['driver_latitude','driver_longitude',
                                      'origin_order_latitude','origin_order_longitude']].apply(distance, axis=1)

In [25]:
all_data['dist_to_order'] = StandardScaler().fit_transform(all_data[['dist_to_order']]).ravel()



In [27]:
pickle.dump(all_data, open('all_data.pkl', 'wb'))

In [5]:
all_data=pickle.load(open('all_data.pkl', 'rb'))

In [9]:
model_data = all_data.drop(columns_to_exclude, axis=1)

In [37]:
for i in model_data.columns:
    print(i)

offer_gk
driver_latitude
driver_longitude
origin_order_latitude
origin_order_longitude
driver_response
is_test
km_per_min
distance_log
cos_hr
sin_hr
cos_day
sin_day
hrs_0
hrs_1
hrs_2
hrs_3
hrs_4
hrs_5
hrs_6
hrs_7
hrs_8
hrs_9
hrs_10
hrs_11
hrs_12
hrs_13
hrs_14
hrs_15
hrs_16
hrs_17
hrs_18
hrs_19
hrs_20
hrs_21
hrs_22
hrs_23
wds_0
wds_1
wds_2
wds_3
wds_4
wds_5
wds_6
driver_hash
day_hour
tod_0
tod_1
tod_2
tod_3
duration_log
off_Delivery
off_Economy
off_Kids
off_Premium
off_Standard
off_Test
off_VIP
off_VIP+
off_XL
rd_SMB
rd_affiliate
rd_business
rd_private
dr_cl_0
dr_cl_1
dr_cl_2
dr_cl_3
dr_cl_4
dr_cl_5
dr_cl_6
dr_cl_7
dr_cl_8
dr_cl_9
dr_cl_10
dr_cl_11
dr_cl_12
dr_cl_13
dr_cl_14
dr_cl_15
dr_cl_16
dr_cl_17
dr_cl_18
dr_cl_19
dr_cl_20
dr_cl_21
dr_cl_22
dr_cl_23
dr_cl_24
dr_cl_25
dr_cl_26
dr_cl_27
dr_cl_28
dr_cl_29
dr_cl_30
dr_cl_31
dr_cl_32
dr_cl_33
dr_cl_34
dr_cl_35
dr_cl_36
dr_cl_37
dr_cl_38
dr_cl_39
dr_cl_40
dr_cl_41
of_cl_0
of_cl_1
of_cl_2
of_cl_3
of_cl_4
of_cl_5
of_cl_6
of_cl_7
of_cl_8
of

In [38]:
model_data.head()

Unnamed: 0,offer_gk,driver_latitude,driver_longitude,origin_order_latitude,origin_order_longitude,driver_response,is_test,km_per_min,distance_log,cos_hr,...,of_cl_35,of_cl_36,of_cl_37,of_cl_38,of_cl_39,of_cl_40,of_cl_41,drcl_hash,orcl_hash,dist_to_order
0,1105373,0.077573,-0.052165,0.092665,-0.378348,0,False,0.0,0.0,0.5,...,0,0,0,0,0,0,0,-1.490162,1.29136,0.029488
1,759733,0.073391,0.029971,0.098564,-0.179782,1,False,0.745608,2.933963,-0.8660254,...,0,0,0,0,0,0,0,-1.151166,-1.015491,-0.034
2,416977,0.076065,-0.046191,0.092986,-0.380021,0,False,0.688469,1.909098,-0.8660254,...,0,0,0,0,0,0,0,-1.490162,1.29136,0.033742
3,889660,0.054983,-0.012482,0.004566,-0.242652,1,False,0.0,0.0,6.123234000000001e-17,...,0,0,0,0,0,0,0,-1.668547,0.494102,-0.020609
4,1120055,0.072844,0.032966,0.090174,-0.071017,1,False,0.643273,2.516325,-0.5,...,0,0,0,0,0,0,0,-1.151166,0.924953,-0.092286


In [34]:
model_data.describe()

Unnamed: 0,driver_latitude,driver_longitude,origin_order_latitude,origin_order_longitude,km_per_min,distance_log,cos_hr,sin_hr,cos_day,sin_day,...,of_cl_35,of_cl_36,of_cl_37,of_cl_38,of_cl_39,of_cl_40,of_cl_41,drcl_hash,orcl_hash,dist_to_order
count,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,...,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0,1130370.0
mean,-3.065783e-06,-1.609235e-06,-6.01524e-06,1.721488e-05,0.4055505,1.683078,-0.01435815,-0.2618066,-0.01738679,-0.06340841,...,0.01632917,0.01403169,0.07442253,0.02130895,0.01067969,0.01057618,0.02706017,1.562048e-14,8.621087e-14,5.573289e-16
std,0.9982629,0.998432,0.9986981,0.9985257,0.3309943,1.403453,0.7116064,0.6518192,0.696389,0.7146469,...,0.1267381,0.1176215,0.2624574,0.1444123,0.1027893,0.1022953,0.1622589,1.0,1.0,1.0
min,-17.52356,-17.49989,-70.30705,-68.75334,-0.004,-6.907755,-1.0,-1.0,-0.9009689,-0.9749279,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.668547,-1.486172,-0.1502116
25%,0.04256257,0.01543512,-0.04264559,-0.1494255,0.0,0.0,-0.7071068,-0.8660254,-0.9009689,-0.7818315,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.8056124,-0.9046575,-0.1127604
50%,0.05810551,0.06016053,0.02085956,0.0246719,0.423786,1.83737,-1.83697e-16,-0.5,-0.2225209,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03881184,-0.0496248,-0.07493595
75%,0.07224222,0.08620231,0.0770712,0.1281453,0.6468972,2.851342,0.7071068,0.258819,0.6234898,0.7818315,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5931276,0.924953,-0.001994323
max,0.2474403,0.7001401,0.8404282,2.53297,1.629633,9.120169,1.0,1.0,1.0,0.9749279,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.863345,1.69106,45.53295


In [10]:
X_train = model_data[~model_data.is_test].drop(['is_test', 'driver_response'], axis = 1)
X_test  = model_data[model_data.is_test].drop(['is_test', 'driver_response'], axis = 1)

In [11]:
# .copy() so that the later dtype change does not write into a view of model_data
y_train = model_data[model_data.is_test == False][['offer_gk','driver_response']].copy()

In [15]:
y_train['driver_response'] = y_train.driver_response.astype('int8')
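
The overall pattern of the notebook (tag train and test, concatenate, engineer features once, split back by the tag) can be summarised on toy frames. All values below are made up, and the toy test frame simply has no label column, which is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical miniature train/test frames
train = pd.DataFrame({"offer_gk": [1, 2], "driver_response": ["0", "1"]})
test = pd.DataFrame({"offer_gk": [3]})

# Tag the origin of each row, then build one frame for shared feature engineering
train["is_test"] = False
test["is_test"] = True
both = pd.concat([train, test], ignore_index=True)

# ... shared feature engineering would go here ...

# Split back using the tag and separate the label from the features
X_train = both[~both.is_test].drop(columns=["is_test", "driver_response"])
X_test = both[both.is_test].drop(columns=["is_test", "driver_response"])
y_train = both.loc[~both.is_test, "driver_response"].astype("int8")
print(X_train.shape, X_test.shape, y_train.tolist())
```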

In [12]:
print(X_train.shape, X_test.shape, y_train.shape)

(892557, 149) (237813, 149) (892557, 2)


In [13]:
pickle.dump(X_train, open('x_train.pkl', 'wb'))
pickle.dump(X_test, open('x_test.pkl', 'wb'))

In [17]:
pickle.dump(y_train, open('y_train.pkl', 'wb'))