## Loading the data

Competition page: http://algo.tpai.qq.com/home/information/index.html

The figure below describes all the files.
![](http://static.zybuluo.com/zhuanxu/evhuwbyqmrvfsvuypsbfrry6/image_1cak7ik2ituh32giqnb65br09.png)

In [1]:
import zipfile
import numpy as np
import pandas as pd

In [2]:
data_root = "./pre"
# load data
dfTrain = pd.read_csv("%s/train.csv"%data_root)
dfTest = pd.read_csv("%s/test.csv"%data_root)
dfAd = pd.read_csv("%s/ad.csv"%data_root)

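The training data file (train.csv) has one training sample per line, with comma-separated fields in this order: "label, clickTime, conversionTime, creativeID, userID, positionID, connectionType, telecomsOperator".

When label=0, the conversionTime field is an empty string.

Note: a field value of 0 or an empty string means unknown. (The site set ID (sitesetID) is the exception: a value of 0 there does not mean unknown, it denotes a specific site set.)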
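
The rule that label=0 goes with an empty conversionTime can be checked directly. The sketch below is a toy illustration: it uses two rows copied from the outputs in this notebook rather than the real file, and the names `csv_text`, `toy`, and `mismatch` are made up for the example. By default pandas reads an empty field as NaN, which turns the column into floats; that is why converted rows display values such as 181031.0.

```python
import io
import pandas as pd

# Two rows in the train.csv layout (only the first three fields are kept here).
csv_text = """label,clickTime,conversionTime
0,170000,
1,170001,181031
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Empty fields become NaN, so the whole column is stored as float64.
print(toy["conversionTime"].dtype)

# Rows where "label == 0" and "conversionTime is missing" disagree.
mismatch = toy[(toy["label"] == 0) != toy["conversionTime"].isna()]
print(len(mismatch))
```

On the real data the same comparison could be run on `dfTrain`; a non-empty `mismatch` would point at rows that break the documented rule.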


In [3]:
dfTrain.head()

,label,clickTime,conversionTime,creativeID,userID,positionID,connectionType,telecomsOperator
0,0,170000,,3089,2798058,293,1,1
1,0,170000,,1259,463234,6161,1,2
2,0,170000,,4465,1857485,7434,4,1
3,0,170000,,1004,2038823,977,1,1
4,0,170000,,1887,2015141,3688,1,1


In [4]:
dfTrain[dfTrain['label']==1].head()

,label,clickTime,conversionTime,creativeID,userID,positionID,connectionType,telecomsOperator
147,1,170001,181031.0,2137,703736,2579,1,1
149,1,170001,170009.0,3981,2030308,2579,2,1
190,1,170001,170010.0,3584,936876,3322,2,3
194,1,170001,181027.0,2137,2619571,2579,1,2
250,1,170001,181031.0,2137,1411484,2579,1,2


Next, let's look at the range of clickTime values.

In [6]:
dfTrain['clickTime'].min(),dfTrain['clickTime'].max()

(170000, 302359)

clickTime, conversionTime, and installTime all use the format DDHHMM, where DD is the day number, HH the hour, and MM the minute.
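
To make the DDHHMM layout concrete, here is a minimal sketch that splits such a value into its three parts. The helper name `split_ddhhmm` is an assumption for illustration and does not appear in the competition material; the two inputs are the minimum and maximum clickTime shown above.

```python
def split_ddhhmm(t):
    """Split a timestamp in DDHHMM form into (day, hour, minute)."""
    day, rest = divmod(int(t), 10000)  # DD is everything above the last four digits
    hour, minute = divmod(rest, 100)   # HH and MM take two digits each
    return day, hour, minute


print(split_ddhhmm(170000))  # earliest clickTime in the training data
print(split_ddhhmm(302359))  # latest clickTime in the training data
```

The helper also accepts float values such as 181031.0, but a missing conversionTime (NaN) has to be filtered out first, because `int()` cannot convert it.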

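Since online evaluation is no longer available, we first count the number of samples per day and then hold out the data from day 30 as the validation (val) set.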

In [20]:
dfTrain['day'] = dfTrain['clickTime'].apply(lambda t : int(t/10000))

In [26]:
a = dfTrain.groupby('day').label.agg(['mean','size'])

In [27]:
a

day,mean,size
17,0.02534,294553
18,0.025633,159991
19,0.031548,104158
20,0.024489,206462
21,0.023075,308596
22,0.022616,325921
23,0.026374,288433
24,0.02588,285242
25,0.027643,266833
26,0.025975,297825


In [30]:
train = dfTrain[dfTrain['day']<30].reset_index()
val   = dfTrain[dfTrain['day'] == 30].reset_index()

In [31]:
dfAd.head()

,creativeID,adID,camgaignID,advertiserID,appID,appPlatform
0,4079,2318,147,80,14,2
1,4565,3593,632,3,465,1
2,3170,1593,205,54,389,1
3,6566,2390,205,54,389,1
4,5187,411,564,3,465,1


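## Ad baseline
We provide two baselines, a statistical one and LR. The purely statistical version comes first: we compute the conversion rate of each appID directly and use it as the predicted value.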

In [32]:
# process data
trainAll = pd.merge(train, dfAd, on="creativeID")
valAll = pd.merge(val, dfAd, on="creativeID")
y_train = trainAll["label"].values

In [33]:
key = "appID"
dfCvr = trainAll.groupby(key).apply(lambda df: np.mean(df["label"])).reset_index()

In [35]:
dfCvr.head()

,appID,0
0,14,0.002603
1,25,0.006042
2,68,0.00047
3,75,0.0
4,83,0.106286


In [36]:
dfCvr.columns = ['appID','prior_cvr'] # prior: conversion rate per appID

In [37]:
dfVal = pd.merge(valAll, dfCvr, how="left", on=key)

In [39]:
dfVal[['label','prior_cvr']].head()

,label,prior_cvr
0,0,0.020345
1,0,0.020345
2,0,0.020345
3,0,0.020345
4,0,0.020345


In [44]:
# These two appIDs never appear in train, so their prior_cvr is NaN after the left merge; we simply fill them with the overall mean conversion rate.
dfVal[dfVal['prior_cvr'].isna()]['appID'].unique()

array([391, 442])

In [46]:
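# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to dfVal.
dfVal["prior_cvr"] = dfVal["prior_cvr"].fillna(trainAll["label"].mean())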
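
The whole statistical baseline (per-key conversion rate, left merge, fallback for unseen keys) can be seen on a toy example. This is a sketch only: the frames `train_toy` and `val_toy` and their values are invented for illustration and are not taken from the competition files.

```python
import pandas as pd

# Toy data (hypothetical values).
train_toy = pd.DataFrame({"appID": [1, 1, 2, 2], "label": [0, 1, 0, 0]})
val_toy = pd.DataFrame({"appID": [1, 2, 3]})  # appID 3 never occurs in train_toy

# Per-key conversion rate, learned from the training rows only.
cvr = train_toy.groupby("appID")["label"].mean().rename("prior_cvr").reset_index()

# A left merge keeps every validation row; keys unseen in training get NaN.
scored = val_toy.merge(cvr, how="left", on="appID")
unseen = scored.loc[scored["prior_cvr"].isna(), "appID"].unique()

# Fall back to the global training conversion rate for the unseen keys.
global_cvr = train_toy["label"].mean()
scored["prior_cvr"] = scored["prior_cvr"].fillna(global_cvr)

print(unseen)
print(scored)
```

Filling with the global rate is the simplest possible fallback; it treats an unseen app as an average app.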

Compute the log loss on dfVal.

In [47]:
from sklearn.metrics import log_loss

In [48]:
# This gives a baseline log loss of about 0.09
log_loss(dfVal['label'].values, dfVal['prior_cvr'].values)

0.09055954426098009
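
For reference, binary log loss is the mean of -(y·log(p) + (1-y)·log(1-p)) over all samples. The sketch below spells that out in plain Python; the function name `binary_log_loss` and the clipping constant `eps` are choices made for this example and are not necessarily what `sklearn.metrics.log_loss` uses internally.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Toy check: a constant prediction of 0.025 for 39 negatives and 1 positive.
labels = [0] * 39 + [1]
print(binary_log_loss(labels, [0.025] * 40))
```

Because positives are rare in this data, even a constant prediction near the base rate already gives a small log loss, so baselines should be compared against each other rather than judged by their absolute value.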

## LR baseline

In [49]:
trainAll.head()

,index,label,clickTime,conversionTime,creativeID,userID,positionID,connectionType,telecomsOperator,day,adID,camgaignID,advertiserID,appID,appPlatform
0,0,0,170000,,3089,2798058,293,1,1,17,1321,83,10,434,1
1,162,0,170001,,3089,195578,3659,0,2,17,1321,83,10,434,1
2,2237,0,170014,,3089,1462213,3659,0,3,17,1321,83,10,434,1
3,4809,0,170030,,3089,1985880,5581,1,1,17,1321,83,10,434,1
4,7128,0,170047,,3089,2152167,5581,1,1,17,1321,83,10,434,1


Our second baseline model is logistic regression (LR), so every categorical feature needs to be encoded.

In [50]:
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

In [51]:
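# feature engineering/encoding
# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at transform time (needed on recent scikit-learn).
enc = OneHotEncoder(handle_unknown="ignore")
# userID is left out here because its cardinality is too high.
feats = ["creativeID", "adID", "camgaignID", "advertiserID", "appID", "appPlatform"]
for i,feat in enumerate(feats):
    x_train = enc.fit_transform(trainAll[feat].values.reshape(-1, 1))
    # Note: a value in val that never appeared in train is encoded as all zeros.
    x_val = enc.transform(valAll[feat].values.reshape(-1, 1))
    if i == 0:
        X_train, X_val = x_train, x_val
    else:
        X_train, X_val = sparse.hstack((X_train, x_train)), sparse.hstack((X_val, x_val))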
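
The remark that values unseen in train end up as all zeros depends on the encoder being told to ignore unknown categories. A minimal sketch on toy values, assuming a scikit-learn version where `OneHotEncoder` accepts `handle_unknown="ignore"`; the names `enc_toy` and `encoded` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the categories seen in "train".
enc_toy = OneHotEncoder(handle_unknown="ignore")
enc_toy.fit(np.array([[1], [2], [3]]))

# Transform one known value (2) and one value never seen during fit (9).
encoded = enc_toy.transform(np.array([[2], [9]])).toarray()
print(encoded)
```

The row for the unseen value contains no active column, so for that feature the model falls back on the intercept and the other features. With the default `handle_unknown="error"`, the same `transform` call raises an error instead.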

In [52]:
# model training
lr = LogisticRegression()
lr.fit(X_train, y_train)
proba_val = lr.predict_proba(X_val)[:,1]

In [56]:
# This gives a baseline log loss of about 0.087
log_loss(dfVal['label'].values, proba_val)

0.08700116469412644

In [58]:
import joblib as jl  # sklearn.externals.joblib was removed from recent scikit-learn releases

In [59]:
jl.dump(trainAll,"%s/trainAll.pkl"%data_root)
jl.dump(valAll,"%s/valAll.pkl"%data_root)

['./pre/valAll.pkl']