# Tianchi O2O


## Initial Setup



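**AUC and ROC:**

Definitions of ROC and AUC

ROC stands for Receiver Operating Characteristic. The area under the ROC curve is the AUC (Area Under the Curve). AUC measures how well a machine-learning algorithm performs (generalizes) on a binary classification problem.

Key concepts needed to compute ROC

First, a few concepts commonly used in binary classification: True Positive, False Positive, True Negative, and False Negative. They are distinguished by the combination of the true class and the predicted class.

Suppose we have a batch of test samples that belong to only two classes, positive and negative. The algorithm's predictions are shown in the figure below (left half: predicted positive; right half: predicted negative), while the truly positive samples are in the upper half and the truly negative samples are in the lower half.

    Predicted positive: written P (Positive)
    Predicted negative: written N (Negative)
    Prediction matches the true class: written T (True)
    Prediction differs from the true class: written F (False)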
    
![%E5%9B%BE%E7%89%87.png](attachment:%E5%9B%BE%E7%89%87.png)
    


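    TP: predicted P (positive), true class is positive
    FP: predicted P, true class is negative
    TN: predicted N, true class is negative
    FN: predicted N, true class is positive

The total number of truly positive samples is TP+FN. TPR is the True Positive Rate: TPR = TP/(TP+FN).   
Likewise, the total number of truly negative samples is FP+TN. FPR is the False Positive Rate: FPR = FP/(FP+TN).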
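
The two rates defined above can be computed directly from the four counts. The following is a minimal sketch on made-up toy labels; the helper name `tpr_fpr` and the data are assumptions for illustration and are not part of the competition code.

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary labels where 1 = positive and 0 = negative."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, truly positive
    fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive, truly negative
    tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted negative, truly negative
    fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted negative, truly positive
    return tp / (tp + fn), fp / (fp + tn)

# Toy data, invented for this example
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr, fpr)  # 0.75 0.25
```

Each classification threshold gives one (FPR, TPR) point; sweeping the threshold traces out the ROC curve.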



In [1]:
# import libraries necessary for this project
import os, sys, pickle  # os: file-system helpers; sys: runtime environment; pickle: serialize Python objects to and from bytes
 
import numpy as np
import pandas as pd
 
from datetime import date  # datetime.date: class representing a calendar date
 
from sklearn.model_selection import KFold, train_test_split, StratifiedKFold, cross_val_score, GridSearchCV  # model-selection utilities
from sklearn.pipeline import Pipeline  # chains transformers and a final estimator into one composite estimator
from sklearn.linear_model import SGDClassifier, LogisticRegression  # linear classifier trained with SGD; logistic regression
from sklearn.preprocessing import StandardScaler  # standardizes features by removing the mean and scaling to unit variance
from sklearn.metrics import log_loss, roc_auc_score, auc, roc_curve  # evaluation metrics
from sklearn.preprocessing import MinMaxScaler  # rescales each feature to a given range
 
# display for this notebook
# A trailing comment on a magic line is passed to the magic as arguments
# (it caused a UsageError for %matplotlib inline), so the comments sit on their own lines.
# %matplotlib inline (IPython/Jupyter only) renders plots inside the notebook.
%matplotlib inline
# The 'retina' format avoids blurry matplotlib figures on high-DPI (Retina) screens.
%config InlineBackend.figure_format = 'retina'


### Loading the Data and a Quick Look

In [2]:
dfoff = pd.read_csv('data/ccf_offline_stage1_train.csv',keep_default_na=False)
dfon = pd.read_csv('data/ccf_online_stage1_train.csv',keep_default_na=False)
dftest = pd.read_csv('data/ccf_offline_stage1_test_revised.csv',keep_default_na=False)

# keep_default_na=False stops pandas from treating its default NaN markers (which include 'null') as missing,
# so missing fields stay as the literal string 'null'. If na_values is also given, only those values are parsed as NaN.
 
dfoff.head(5)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date
0,1439408,2632,,,0,,20160217.0
1,1439408,4663,11002.0,150:20,1,20160528.0,
2,1439408,2632,8591.0,20:1,0,20160217.0,
3,1439408,2632,1078.0,20:1,0,20160319.0,
4,1439408,2632,8591.0,20:1,0,20160613.0,


Field table for users' offline purchases and coupon receipts
![%E5%9B%BE%E7%89%87.png](attachment:%E5%9B%BE%E7%89%87.png)

Fields for users' online clicks, purchases, and coupons
![%E5%9B%BE%E7%89%87.png](attachment:%E5%9B%BE%E7%89%87.png)

In [3]:
dfon.head(10)

Unnamed: 0,User_id,Merchant_id,Action,Coupon_id,Discount_rate,Date_received,Date
0,13740231,18907,2,100017492.0,500:50,20160513.0,
1,13740231,34805,1,,,,20160321.0
2,14336199,18907,0,,,,20160618.0
3,14336199,18907,0,,,,20160618.0
4,14336199,18907,0,,,,20160618.0
5,14336199,18907,0,,,,20160618.0
6,14336199,18907,0,,,,20160618.0
7,14336199,18907,0,,,,20160618.0
8,14336199,18907,0,,,,20160618.0
9,14336199,18907,0,,,,20160618.0


Test set fields and user-related information
![%E5%9B%BE%E7%89%87.png](attachment:%E5%9B%BE%E7%89%87.png)

In [4]:
dftest.head(5)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received
0,4129537,450,9983,30:5,1.0,20160712
1,6949378,1300,3429,30:5,,20160706
2,2166529,7113,6928,200:20,5.0,20160727
3,2166529,7113,1808,100:10,5.0,20160727
4,6172162,7605,6500,30:1,2.0,20160708


In [5]:
# Summary statistics for the offline data
#
print('Coupon received, purchase made: %d' % dfoff[(dfoff['Date_received'] != 'null') & (dfoff['Date'] != 'null')].shape[0])
print('Coupon received, no purchase: %d' % dfoff[(dfoff['Date_received'] != 'null') & (dfoff['Date'] == 'null')].shape[0])
print('No coupon, purchase made: %d' % dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] != 'null')].shape[0])
print('No coupon, no purchase: %d' % dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] == 'null')].shape[0])

Coupon received, purchase made: 75382
Coupon received, no purchase: 977900
No coupon, purchase made: 701602
No coupon, no purchase: 0


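Far more records show a coupon received without a purchase than a coupon received with a purchase, so the goal is to get coupons used as often as possible.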

In [6]:
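# NaN never compares equal to anything (not even itself), so this is always False.
# Missing dates here are the string 'null'; use pd.isna() to test for real NaN values.
dfoff['Date_received'][0] == np.nan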

False
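
The `False` above does not by itself show that the value is present, because equality against NaN is false for every input. The sketch below, on a small made-up Series, contrasts three checks; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values: a 'null' marker string, a real date string, and a true NaN
s = pd.Series(['null', '20160528', np.nan], dtype=object)

eq_nan = (s == np.nan).tolist()     # equality with NaN is False everywhere
is_na = s.isna().tolist()           # detects only the real NaN
is_null = (s == 'null').tolist()    # detects the 'null' marker string

print(eq_nan)   # [False, False, False]
print(is_na)    # [False, False, True]
print(is_null)  # [True, False, False]
```

Since the files were read with `keep_default_na=False`, the comparison against `'null'` is the one that matches how missing values are stored in this notebook.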

### Feature Extraction

In [7]:
dfoff.head(1000)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date
0,1439408,2632,,,0,,20160217
1,1439408,4663,11002,150:20,1,20160528,
2,1439408,2632,8591,20:1,0,20160217,
3,1439408,2632,1078,20:1,0,20160319,
4,1439408,2632,8591,20:1,0,20160613,
5,1439408,2632,,,0,,20160516
6,1439408,2632,8591,20:1,0,20160516,20160613
7,1832624,3381,7610,200:20,0,20160429,
8,2029232,3381,11951,200:20,1,20160129,
9,2029232,450,1532,30:5,0,20160530,


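#### Discount

The discount rate comes in 3 forms:

- 'null' means no discount

- a value in [0,1] is a discount rate

- x:y means "spend x, get y off"

**Unify the values of this field into a single representation**, that is, normalize them.  
**How it is handled:**

- discount type: getDiscountType()

- discount rate: convertRate()

- spending threshold (x): getDiscountMan()

- amount off (y): getDiscountJian()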

In [8]:
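# Discount feature helpers
# The raw field takes three forms: 'null', "n:m", or a decimal in [0,1]; determine which one it is
def getDiscountType(row):
    # input: one raw Discount_rate value
    # output: 1 for "x:y", 0 for a plain rate, 'null' when there is no discount
    if row == 'null':
        return 'null' 
    elif ':' in row:
        return 1
    else:
        return 0

# Convert the discount to a rate
def convertRate(row):
    """Convert discount to rate"""
    if row == 'null':
        return 1.0  # full price
    elif ':' in row:
        rows = row.split(':')  # split on ':'
        return 1.0 - float(rows[1])/float(rows[0])
    else:
        return float(row)  # already a rate
    
# Extract the two parts of an "x:y" discount
def getDiscountMan(row):
    # spending threshold x, or 0 if the value is not "x:y"
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0

def getDiscountJian(row):
    # amount off y, or 0 if the value is not "x:y"
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0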
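
The row-by-row helpers above can also be written with vectorized pandas string operations. The following is a sketch of that alternative, not the notebook's method; the function name `parse_discount` and the sample values are assumptions, and it produces only the three numeric features (rate, threshold, amount off).

```python
import pandas as pd

def parse_discount(col):
    """Vectorized sketch: derive rate / threshold / amount-off from raw discount strings."""
    col = col.astype(str)
    parts = col.str.extract(r'^(\d+):(\d+)$').astype(float)  # NaN where the value is not "x:y"
    man, jian = parts[0], parts[1]
    rate = 1.0 - jian / man                        # rate for "x:y" values
    plain = pd.to_numeric(col, errors='coerce')    # rate for plain decimals, NaN otherwise
    rate = rate.fillna(plain).fillna(1.0)          # 'null' means full price
    return pd.DataFrame({
        'discount_rate': rate,
        'discount_man': man.fillna(0).astype(int),
        'discount_jian': jian.fillna(0).astype(int),
    })

# Sample values, invented to cover the three forms
feats = parse_discount(pd.Series(['null', '150:20', '0.95', '20:1']))
print(feats)
```

On the sample input the thresholds come out as 0, 150, 0, 20, matching what `getDiscountMan` returns for the same strings.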

In [9]:
def processData(df):
    
    # Turn the raw Discount_rate column into four basic features
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    # apply runs a function over every value of a DataFrame column
    
    print(df['discount_rate'].unique())  # print the distinct discount rates that occur
    
    return df

In [10]:
# Transform both the training set and the test set
dfoff = processData(dfoff)
dftest = processData(dftest)

[1.         0.86666667 0.95       0.9        0.83333333 0.8
 0.5        0.85       0.75       0.66666667 0.93333333 0.7
 0.6        0.96666667 0.98       0.99       0.975      0.33333333
 0.2        0.4       ]
[0.83333333 0.9        0.96666667 0.8        0.95       0.75
 0.98       0.5        0.86666667 0.6        0.66666667 0.7
 0.85       0.33333333 0.94       0.93333333 0.975      0.99      ]


In [11]:
dfoff.head(5)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,discount_jian
0,1439408,2632,,,0,,20160217.0,,1.0,0,0
1,1439408,4663,11002.0,150:20,1,20160528.0,,1.0,0.866667,150,20
2,1439408,2632,8591.0,20:1,0,20160217.0,,1.0,0.95,20,1
3,1439408,2632,1078.0,20:1,0,20160319.0,,1.0,0.95,20,1
4,1439408,2632,8591.0,20:1,0,20160613.0,,1.0,0.95,20,1


In [12]:
dftest.head(5)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,discount_type,discount_rate,discount_man,discount_jian
0,4129537,450,9983,30:5,1.0,20160712,1,0.833333,30,5
1,6949378,1300,3429,30:5,,20160706,1,0.833333,30,5
2,2166529,7113,6928,200:20,5.0,20160727,1,0.9,200,20
3,2166529,7113,1808,100:10,5.0,20160727,1,0.9,100,10
4,6172162,7605,6500,30:1,2.0,20160708,1,0.966667,30,1


#### Distance

In [13]:
print('Distance values:',dfoff['Distance'].unique())  # distinct distance values that occur

Distance values: ['0' '1' 'null' '2' '10' '4' '7' '9' '3' '5' '6' '8']


In [14]:
# Convert to int
dfoff['distance'] = dfoff['Distance'].replace('null', -1).astype(int)  # map 'null' to -1

In [15]:
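# Note: this prints the original 'Distance' column, which is unchanged;
# the converted values are in the new lowercase 'distance' column.
print('Distance values:',dfoff['Distance'].unique())

Distance values: ['0' '1' 'null' '2' '10' '4' '7' '9' '3' '5' '6' '8']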


In [16]:
# Apply the same conversion to the test set
dftest['distance'] = dftest['Distance'].replace('null', -1).astype(int)

In [17]:
print(dftest['distance'].unique())

[ 1 -1  5  2  0 10  3  6  7  4  9  8]


In [18]:
dfoff.head(5)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,discount_jian,distance
0,1439408,2632,,,0,,20160217.0,,1.0,0,0,0
1,1439408,4663,11002.0,150:20,1,20160528.0,,1.0,0.866667,150,20,1
2,1439408,2632,8591.0,20:1,0,20160217.0,,1.0,0.95,20,1,0
3,1439408,2632,1078.0,20:1,0,20160319.0,,1.0,0.95,20,1,0
4,1439408,2632,8591.0,20:1,0,20160613.0,,1.0,0.95,20,1,0


In [19]:
dftest.head(5)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,discount_type,discount_rate,discount_man,discount_jian,distance
0,4129537,450,9983,30:5,1.0,20160712,1,0.833333,30,5,1
1,6949378,1300,3429,30:5,,20160706,1,0.833333,30,5,-1
2,2166529,7113,6928,200:20,5.0,20160727,1,0.9,200,20,5
3,2166529,7113,1808,100:10,5.0,20160727,1,0.9,100,10,5
4,6172162,7605,6500,30:1,2.0,20160708,1,0.966667,30,1,2


#### Coupon Receipt Date (Date_received)

In [20]:
date_received = dfoff['Date_received'].unique()

In [21]:
date_received

array(['null', '20160528', '20160217', '20160319', '20160613', '20160516',
       '20160429', '20160129', '20160530', '20160519', '20160606',
       '20160207', '20160421', '20160130', '20160412', '20160518',
       '20160327', '20160127', '20160215', '20160524', '20160523',
       '20160515', '20160521', '20160114', '20160321', '20160426',
       '20160409', '20160326', '20160322', '20160131', '20160125',
       '20160602', '20160128', '20160605', '20160607', '20160324',
       '20160601', '20160126', '20160124', '20160123', '20160201',
       '20160522', '20160203', '20160417', '20160415', '20160202',
       '20160206', '20160218', '20160611', '20160329', '20160510',
       '20160302', '20160526', '20160318', '20160205', '20160411',
       '20160520', '20160527', '20160317', '20160213', '20160505',
       '20160402', '20160211', '20160405', '20160408', '20160323',
       '20160204', '20160112', '20160430', '20160525', '20160609',
       '20160403', '20160325', '20160413', '20160210',

**Features derived from the coupon receipt date:**

- weekday : {null, 1, 2, 3, 4, 5, 6, 7} (day of the week)

- weekday_type : {1, 0} (1 for Saturday and Sunday, 0 otherwise, i.e. weekend flag)

- weekday_1 : {1, 0, 0, 0, 0, 0, 0}

- weekday_2 : {0, 1, 0, 0, 0, 0, 0}

- weekday_3 : {0, 0, 1, 0, 0, 0, 0}

- weekday_4 : {0, 0, 0, 1, 0, 0, 0}

- weekday_5 : {0, 0, 0, 0, 1, 0, 0}

- weekday_6 : {0, 0, 0, 0, 0, 1, 0}

- weekday_7 : {0, 0, 0, 0, 0, 0, 1}



In [22]:
def getWeekday(row):
    if row == 'null':
        return row  # no date: return 'null'
    else:
        # build a datetime.date from year, month, day; weekday() is 0 for Monday, so add 1 to get 1-7
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1

In [23]:
dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)  # cast to str so the date can be sliced; adds the weekday feature
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)

In [24]:
dfoff.head(5)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,discount_jian,distance,weekday
0,1439408,2632,,,0,,20160217.0,,1.0,0,0,0,
1,1439408,4663,11002.0,150:20,1,20160528.0,,1.0,0.866667,150,20,1,6.0
2,1439408,2632,8591.0,20:1,0,20160217.0,,1.0,0.95,20,1,0,3.0
3,1439408,2632,1078.0,20:1,0,20160319.0,,1.0,0.95,20,1,0,6.0
4,1439408,2632,8591.0,20:1,0,20160613.0,,1.0,0.95,20,1,0,1.0


In [25]:
# weekday_type: 1 for Saturday and Sunday, 0 otherwise
dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x: 1 if x in [6,7] else 0)
dftest['weekday_type'] = dftest['weekday'].apply(lambda x: 1 if x in [6,7] else 0)
# A lambda expression is an anonymous function, handy when a function is needed but not worth naming.
# lambda x: 1 if x in [6,7] else 0 takes x and returns 1 if x is 6 or 7, otherwise 0 ('null' also maps to 0).

In [26]:
# One-hot encode Monday through Sunday
# change weekday to one-hot encoding 
weekdaycols = ['weekday_' + str(i) for i in range(1,8)]  # names of the new columns
#print(weekdaycols)

tmpdf = pd.get_dummies(dfoff['weekday'].replace('null', np.nan))  # one indicator column per weekday value
# get_dummies performs the one-hot encoding; rows with NaN get all zeros
tmpdf.columns = weekdaycols  # rename the columns to weekday_1 ... weekday_7
dfoff[weekdaycols] = tmpdf  # add the one-hot columns to dfoff

tmpdf = pd.get_dummies(dftest['weekday'].replace('null', np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf
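
Renaming the `get_dummies` columns to `weekdaycols` relies on all seven weekdays appearing in the data. The sketch below shows one way to always get seven columns by declaring the categories up front; it is an alternative under that assumption, and the sample values are made up.

```python
import pandas as pd

# Made-up weekday values in the form getWeekday produces ('null' or 1-7)
weekday = pd.Series(['null', 6, 3, 6, 1]).astype(str)

# Declare all seven categories; values outside them (here 'null') become NaN
categories = [str(i) for i in range(1, 8)]
cat = pd.Categorical(weekday, categories=categories)

# One column per declared category, even for weekdays absent from the data
onehot = pd.get_dummies(pd.Series(cat), prefix='weekday').astype(int)
print(onehot)
```

The `'null'` row ends up with zeros in every column, which matches how the cell above treats missing dates.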

In [27]:
dfoff

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,distance,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7
0,1439408,2632,,,0,,20160217,,1.000000,0,...,0,,0,0,0,0,0,0,0,0
1,1439408,4663,11002,150:20,1,20160528,,1,0.866667,150,...,1,6,1,0,0,0,0,0,1,0
2,1439408,2632,8591,20:1,0,20160217,,1,0.950000,20,...,0,3,0,0,0,1,0,0,0,0
3,1439408,2632,1078,20:1,0,20160319,,1,0.950000,20,...,0,6,1,0,0,0,0,0,1,0
4,1439408,2632,8591,20:1,0,20160613,,1,0.950000,20,...,0,1,0,1,0,0,0,0,0,0
5,1439408,2632,,,0,,20160516,,1.000000,0,...,0,,0,0,0,0,0,0,0,0
6,1439408,2632,8591,20:1,0,20160516,20160613,1,0.950000,20,...,0,1,0,1,0,0,0,0,0,0
7,1832624,3381,7610,200:20,0,20160429,,1,0.900000,200,...,0,5,0,0,0,0,0,1,0,0
8,2029232,3381,11951,200:20,1,20160129,,1,0.900000,200,...,1,5,0,0,0,0,0,1,0,0
9,2029232,450,1532,30:5,0,20160530,,1,0.833333,30,...,0,1,0,1,0,0,0,0,0,0


**Features** (the feature set after the transformations)
- discount_rate

- discount_type

- discount_man

- discount_jian

- distance

- weekday

- weekday_type

- weekday_1

- weekday_2

- weekday_3

- weekday_4

- weekday_5

- weekday_6

- weekday_7

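### Labeling
There are three cases:

- Date_received == 'null': no coupon was received; these rows are not considered, y = -1

- (Date_received != 'null') & (Date != 'null') & (Date - Date_received <= 15): a coupon was received and used within 15 days, a positive sample, y = 1

- (Date_received != 'null') & ((Date == 'null') | (Date - Date_received > 15)): a coupon was received but not used within 15 days, a negative sample, y = 0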

In [28]:
dfoff.head(100)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,distance,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7
0,1439408,2632,,,0,,20160217,,1.000000,0,...,0,,0,0,0,0,0,0,0,0
1,1439408,4663,11002,150:20,1,20160528,,1,0.866667,150,...,1,6,1,0,0,0,0,0,1,0
2,1439408,2632,8591,20:1,0,20160217,,1,0.950000,20,...,0,3,0,0,0,1,0,0,0,0
3,1439408,2632,1078,20:1,0,20160319,,1,0.950000,20,...,0,6,1,0,0,0,0,0,1,0
4,1439408,2632,8591,20:1,0,20160613,,1,0.950000,20,...,0,1,0,1,0,0,0,0,0,0
5,1439408,2632,,,0,,20160516,,1.000000,0,...,0,,0,0,0,0,0,0,0,0
6,1439408,2632,8591,20:1,0,20160516,20160613,1,0.950000,20,...,0,1,0,1,0,0,0,0,0,0
7,1832624,3381,7610,200:20,0,20160429,,1,0.900000,200,...,0,5,0,0,0,0,0,1,0,0
8,2029232,3381,11951,200:20,1,20160129,,1,0.900000,200,...,1,5,0,0,0,0,0,1,0,0
9,2029232,450,1532,30:5,0,20160530,,1,0.833333,30,...,0,1,0,1,0,0,0,0,0,0


In [29]:
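def label(row):
    # input: one row of the DataFrame
    if row['Date_received'] == 'null':
        # no receipt date means no coupon; these rows are not considered
        return -1
    if row['Date'] != 'null':  # a coupon was received and a purchase was made
        td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
        # parse both values as year-month-day dates
        # and subtract them to get the elapsed time
        if td <= pd.Timedelta(15, 'D'):
            return 1
    return 0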

In [30]:
# Assign a label to every row
dfoff['label'] = dfoff.apply(label, axis=1)  # axis=1 passes one row at a time
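
Applying `label` row by row parses two dates per row, which can be slow on a table of this size. The following is a sketch of a vectorized equivalent of the three labeling rules; the function name `label_vectorized` and the toy rows are assumptions for illustration.

```python
import pandas as pd

def label_vectorized(df):
    """Vectorized sketch of the labeling rules: -1 no coupon, 1 used within 15 days, 0 otherwise."""
    # errors='coerce' turns the 'null' marker into NaT
    received = pd.to_datetime(df['Date_received'], format='%Y%m%d', errors='coerce')
    used = pd.to_datetime(df['Date'], format='%Y%m%d', errors='coerce')
    gap = used - received                         # NaT when either date is missing
    labels = pd.Series(0, index=df.index)
    labels[gap <= pd.Timedelta(days=15)] = 1      # comparisons with NaT are False
    labels[received.isna()] = -1
    return labels

# Toy rows, invented to cover each case
toy = pd.DataFrame({
    'Date_received': ['null', '20160528', '20160516', '20160101'],
    'Date':          ['20160217', 'null', '20160613', '20160110'],
})
toy_labels = label_vectorized(toy).tolist()
print(toy_labels)  # [-1, 0, 0, 1]
```

The third toy row mirrors the case seen in the data above: a coupon received on 20160516 and used on 20160613 falls outside the 15-day window and is labeled 0.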

In [31]:
dfoff.head(100)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7,label
0,1439408,2632,,,0,,20160217,,1.000000,0,...,,0,0,0,0,0,0,0,0,-1
1,1439408,4663,11002,150:20,1,20160528,,1,0.866667,150,...,6,1,0,0,0,0,0,1,0,0
2,1439408,2632,8591,20:1,0,20160217,,1,0.950000,20,...,3,0,0,0,1,0,0,0,0,0
3,1439408,2632,1078,20:1,0,20160319,,1,0.950000,20,...,6,1,0,0,0,0,0,1,0,0
4,1439408,2632,8591,20:1,0,20160613,,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
5,1439408,2632,,,0,,20160516,,1.000000,0,...,,0,0,0,0,0,0,0,0,-1
6,1439408,2632,8591,20:1,0,20160516,20160613,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
7,1832624,3381,7610,200:20,0,20160429,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
8,2029232,3381,11951,200:20,1,20160129,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
9,2029232,450,1532,30:5,0,20160530,,1,0.833333,30,...,1,0,1,0,0,0,0,0,0,0


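## Model Building and Training

### Building a Linear Model with SGDClassifier

- Use the 14 features extracted above.

- Training set: 20160101-20160515; validation set: 20160516-20160615.

- Use the linear model SGDClassifier.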

In [32]:
dfoff

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7,label
0,1439408,2632,,,0,,20160217,,1.000000,0,...,,0,0,0,0,0,0,0,0,-1
1,1439408,4663,11002,150:20,1,20160528,,1,0.866667,150,...,6,1,0,0,0,0,0,1,0,0
2,1439408,2632,8591,20:1,0,20160217,,1,0.950000,20,...,3,0,0,0,1,0,0,0,0,0
3,1439408,2632,1078,20:1,0,20160319,,1,0.950000,20,...,6,1,0,0,0,0,0,1,0,0
4,1439408,2632,8591,20:1,0,20160613,,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
5,1439408,2632,,,0,,20160516,,1.000000,0,...,,0,0,0,0,0,0,0,0,-1
6,1439408,2632,8591,20:1,0,20160516,20160613,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
7,1832624,3381,7610,200:20,0,20160429,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
8,2029232,3381,11951,200:20,1,20160129,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
9,2029232,450,1532,30:5,0,20160530,,1,0.833333,30,...,1,0,1,0,0,0,0,0,0,0


In [33]:
dftest

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,discount_type,discount_rate,discount_man,discount_jian,distance,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7
0,4129537,450,9983,30:5,1,20160712,1,0.833333,30,5,1,2,0,0,1,0,0,0,0,0
1,6949378,1300,3429,30:5,,20160706,1,0.833333,30,5,-1,3,0,0,0,1,0,0,0,0
2,2166529,7113,6928,200:20,5,20160727,1,0.900000,200,20,5,3,0,0,0,1,0,0,0,0
3,2166529,7113,1808,100:10,5,20160727,1,0.900000,100,10,5,3,0,0,0,1,0,0,0,0
4,6172162,7605,6500,30:1,2,20160708,1,0.966667,30,1,2,5,0,0,0,0,0,1,0,0
5,4005121,450,9983,30:5,0,20160706,1,0.833333,30,5,0,3,0,0,0,1,0,0,0,0
6,4347394,450,9983,30:5,0,20160716,1,0.833333,30,5,0,6,1,0,0,0,0,0,1,0
7,3094273,760,13602,30:5,1,20160727,1,0.833333,30,5,1,3,0,0,0,1,0,0,0,0
8,5139970,450,9983,30:5,10,20160729,1,0.833333,30,5,10,5,0,0,0,0,0,1,0,0
9,3237121,760,13602,30:5,1,20160703,1,0.833333,30,5,1,7,1,0,0,0,0,0,0,1


### Splitting into Training and Validation Sets

In [34]:
df = dfoff[dfoff['label'] != -1].copy()  # drop rows labeled -1; they carry no information for training

In [35]:
df

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7,label
1,1439408,4663,11002,150:20,1,20160528,,1,0.866667,150,...,6,1,0,0,0,0,0,1,0,0
2,1439408,2632,8591,20:1,0,20160217,,1,0.950000,20,...,3,0,0,0,1,0,0,0,0,0
3,1439408,2632,1078,20:1,0,20160319,,1,0.950000,20,...,6,1,0,0,0,0,0,1,0,0
4,1439408,2632,8591,20:1,0,20160613,,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
6,1439408,2632,8591,20:1,0,20160516,20160613,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
7,1832624,3381,7610,200:20,0,20160429,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
8,2029232,3381,11951,200:20,1,20160129,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
9,2029232,450,1532,30:5,0,20160530,,1,0.833333,30,...,1,0,1,0,0,0,0,0,0,0
10,2029232,6459,12737,20:1,0,20160519,,1,0.950000,20,...,4,0,0,0,0,1,0,0,0,0
13,2747744,6901,1097,50:10,,20160606,,1,0.800000,50,...,1,0,1,0,0,0,0,0,0,0


In [36]:
train = df[(df['Date_received'] < '20160516')].copy()  # coupons received before 20160516 form the training set
valid = df[(df['Date_received'] >= '20160516') & (df['Date_received'] <= '20160615')].copy()  # coupons received from 20160516 to 20160615 are held out for validation

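**Deep and shallow copies:**  
Assignment only binds another name to the same object. When you create an object and assign it to another variable, Python does not copy the object; it copies the reference. With b = alist, where alist is [1, 2, 3, ['a', 'b']], any change made through one name is visible through the other.

A shallow copy, c = copy.copy(alist), creates a new outer list but does not copy the nested objects. Appending to alist leaves c untouched, but a change to the shared inner list ['a', 'b'] shows up in c as well.

A deep copy, copy.deepcopy(alist), also copies the nested objects, so no later change to the original affects it: it stays [1, 2, 3, ['a', 'b']].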
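
The three behaviours can be checked with the same list used in the explanation. This is a small standalone illustration; the variable names follow the text above.

```python
import copy

alist = [1, 2, 3, ['a', 'b']]
b = alist                   # plain assignment: b is the same object as alist
c = copy.copy(alist)        # shallow copy: new outer list, shared inner list
d = copy.deepcopy(alist)    # deep copy: the inner list is copied too

alist.append(5)             # changes the outer list
alist[3].append('c')        # changes the nested list

print(b)  # [1, 2, 3, ['a', 'b', 'c'], 5]
print(c)  # [1, 2, 3, ['a', 'b', 'c']]
print(d)  # [1, 2, 3, ['a', 'b']]
```

For DataFrames, `.copy()` makes a deep copy by default (`deep=True`), which is why `train` and `valid` can be modified without touching `df`.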

In [37]:
train

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7,label
2,1439408,2632,8591,20:1,0,20160217,,1,0.950000,20,...,3,0,0,0,1,0,0,0,0,0
3,1439408,2632,1078,20:1,0,20160319,,1,0.950000,20,...,6,1,0,0,0,0,0,1,0,0
7,1832624,3381,7610,200:20,0,20160429,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
8,2029232,3381,11951,200:20,1,20160129,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
16,2223968,3381,9776,10:5,2,20160129,,1,0.500000,10,...,5,0,0,0,0,0,1,0,0,0
17,73611,2099,12034,100:10,,20160207,,1,0.900000,100,...,7,1,0,0,0,0,0,0,1,0
18,163606,1569,5054,200:30,10,20160421,,1,0.850000,200,...,4,0,0,0,0,1,0,0,0,0
19,3273056,4833,7802,200:20,10,20160130,,1,0.900000,200,...,6,1,0,0,0,0,0,1,0,0
20,94107,3381,7610,200:20,2,20160412,,1,0.900000,200,...,2,0,0,1,0,0,0,0,0,0
23,253750,8390,7531,20:5,0,20160327,,1,0.750000,20,...,7,1,0,0,0,0,0,0,1,0


In [38]:
valid

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7,label
1,1439408,4663,11002,150:20,1,20160528,,1,0.866667,150,...,6,1,0,0,0,0,0,1,0,0
4,1439408,2632,8591,20:1,0,20160613,,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
6,1439408,2632,8591,20:1,0,20160516,20160613,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
9,2029232,450,1532,30:5,0,20160530,,1,0.833333,30,...,1,0,1,0,0,0,0,0,0,0
10,2029232,6459,12737,20:1,0,20160519,,1,0.950000,20,...,4,0,0,0,0,1,0,0,0,0
13,2747744,6901,1097,50:10,,20160606,,1,0.800000,50,...,1,0,1,0,0,0,0,0,0,0
15,196342,1579,10698,20:1,1,20160606,,1,0.950000,20,...,1,0,1,0,0,0,0,0,0,0
22,253750,6901,2366,30:5,0,20160518,,1,0.833333,30,...,3,0,0,0,1,0,0,0,0,0
24,343660,4663,11002,150:20,,20160528,,1,0.866667,150,...,6,1,0,0,0,0,0,1,0,0
31,1113008,3621,2705,20:5,0,20160524,,1,0.750000,20,...,2,0,0,1,0,0,0,0,0,0


In [39]:
print('Train Set: \n', train['label'].value_counts())
print('Valid Set: \n', valid['label'].value_counts())
# Count the samples of each class in the training and validation sets

Train Set: 
 0    759172
1     41524
Name: label, dtype: int64
Valid Set: 
 0    229715
1     22871
Name: label, dtype: int64


In [40]:
# Number of features
original_feature = ['discount_rate','discount_type','discount_man', 'discount_jian','distance', 'weekday', 'weekday_type'] + weekdaycols
print('Number of features:',len(original_feature))
print(original_feature)

Number of features: 14
['discount_rate', 'discount_type', 'discount_man', 'discount_jian', 'distance', 'weekday', 'weekday_type', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']


### Building the Model

In [41]:
# Training data
data = train
data 

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,weekday,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7,label
2,1439408,2632,8591,20:1,0,20160217,,1,0.950000,20,...,3,0,0,0,1,0,0,0,0,0
3,1439408,2632,1078,20:1,0,20160319,,1,0.950000,20,...,6,1,0,0,0,0,0,1,0,0
7,1832624,3381,7610,200:20,0,20160429,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
8,2029232,3381,11951,200:20,1,20160129,,1,0.900000,200,...,5,0,0,0,0,0,1,0,0,0
16,2223968,3381,9776,10:5,2,20160129,,1,0.500000,10,...,5,0,0,0,0,0,1,0,0,0
17,73611,2099,12034,100:10,,20160207,,1,0.900000,100,...,7,1,0,0,0,0,0,0,1,0
18,163606,1569,5054,200:30,10,20160421,,1,0.850000,200,...,4,0,0,0,0,1,0,0,0,0
19,3273056,4833,7802,200:20,10,20160130,,1,0.900000,200,...,6,1,0,0,0,0,0,1,0,0
20,94107,3381,7610,200:20,2,20160412,,1,0.900000,200,...,2,0,0,1,0,0,0,0,0,0
23,253750,8390,7531,20:5,0,20160327,,1,0.750000,20,...,7,1,0,0,0,0,0,0,1,0


In [42]:
# Feature column names
predictors=original_feature
original_feature 

['discount_rate',
 'discount_type',
 'discount_man',
 'discount_jian',
 'distance',
 'weekday',
 'weekday_type',
 'weekday_1',
 'weekday_2',
 'weekday_3',
 'weekday_4',
 'weekday_5',
 'weekday_6',
 'weekday_7']

**A lambda is an expression that defines an anonymous function**

In [43]:
# Example
testLambda = lambda x: x + 1  # a function object: takes x, returns x + 1
testLambda(1)  # call it with 1; evaluates to 2

In [44]:
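# Classifier factory
# Linear classifiers (SVM, logistic regression, etc.) trained with SGD.
# The estimator fits a regularized linear model by stochastic gradient descent: the gradient of the loss is estimated
# one sample at a time and the model is updated along the way with a decreasing strength schedule (the learning rate).
# SGD supports minibatch (online/out-of-core) learning.
# SGDClassifier comes from scikit-learn; the lambda fixes the default parameters used for this classifier.
classifier = lambda: SGDClassifier(
    # scikit-learn's built-in SGD classifier
    loss='log',  # loss function: logistic regression (named 'log_loss' in scikit-learn 1.1 and later)
    penalty='elasticnet',  # elastic-net penalty, a combination of L1 and L2
    fit_intercept=True,  # fit an intercept term (the default)
    max_iter=100,  # maximum number of passes over the training data
    shuffle=True,  # reshuffle the training data after each epoch
    n_jobs=1,  # number of CPUs to use
    class_weight=None)  # default: every class has the same weight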

In [45]:
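# A pipeline lets the same fitted steps and parameters be reused on new data (such as the test set);
# it wraps every step into one managed workflow. (from sklearn.pipeline)
# Pipeline chains several estimators into one, reflecting the fixed sequence of steps in data processing,
# for example feature selection -> normalization -> classification.
# This brings two main benefits:

#     a single call to fit or predict trains or applies every step in the pipeline;
#     grid search can be used to select the parameters of all steps together.

# steps is a list of (name, transform) tuples, applied in the order given; the last object is the estimator.
model = Pipeline(steps=[
    ('ss', StandardScaler()),  # feature standardization (from sklearn.preprocessing)
    ('en', classifier())  # the SGD-trained classifier defined above
])
# Workflow: standardize the features, then train the classifier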

In [46]:
parameters = {
    'en__alpha': [ 0.001, 0.01, 0.1],
    'en__l1_ratio': [ 0.001, 0.01, 0.1]
}  # search grid: alpha is the regularization strength, l1_ratio is the L1 share of the elastic-net penalty

In [47]:
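# StratifiedKFold is used like KFold, but it samples by stratum so that every training and test fold
# keeps the same class proportions as the full dataset.
# from sklearn.model_selection 
# Stratified K-fold cross-validation
folder = StratifiedKFold(n_splits=3, shuffle=True)
# Split into three folds with approximately equal class proportions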
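
The effect of stratification is easiest to see on a small imbalanced label vector. This is a minimal sketch on made-up data (90 negatives, 10 positives, 5 folds chosen so the counts divide evenly); it is not the notebook's configuration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))   # features are irrelevant to how the folds are drawn

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(positives_per_fold)  # [2, 2, 2, 2, 2]
```

Every test fold receives the same share of positives, which matters here because only about 5% of the training labels are positive.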

In [48]:
# Grid search for the best hyperparameters (from sklearn.model_selection)
grid_search = GridSearchCV(
    model,  # an estimator implementing the scikit-learn interface; it must provide a score method, or scoring must be passed

    parameters,  # dict mapping parameter names (strings) to lists of values to try, or a list of such dicts, in which case the grid spanned by each dict is explored
    cv=folder,  # cross-validation splitter (the StratifiedKFold object created above)
    n_jobs=-1,  # -1 means using all processors; by default scikit-learn uses a single process
    verbose=1)  # controls the verbosity: the higher, the more messages

# Fit the grid search on the training data
grid_search = grid_search.fit(data[predictors], 
                              data['label'])

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:  2.5min finished


In [75]:
model = grid_search

### Testing

In [52]:
# Predict on the validation set
y_valid_pred =model.predict_proba(valid[predictors])

In [54]:
# Add the predicted probability to the DataFrame
valid1 = valid.copy()
valid1['pred_prob'] = y_valid_pred[:, 1]
valid1.head(5)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,discount_type,discount_rate,discount_man,...,weekday_type,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7,label,pred_prob
1,1439408,4663,11002,150:20,1,20160528,,1,0.866667,150,...,1,0,0,0,0,0,1,0,0,0.019522
4,1439408,2632,8591,20:1,0,20160613,,1,0.95,20,...,0,1,0,0,0,0,0,0,0,0.101212
6,1439408,2632,8591,20:1,0,20160516,20160613.0,1,0.95,20,...,0,1,0,0,0,0,0,0,0,0.101212
9,2029232,450,1532,30:5,0,20160530,,1,0.833333,30,...,0,1,0,0,0,0,0,0,0,0.09691
10,2029232,6459,12737,20:1,0,20160519,,1,0.95,20,...,0,0,0,0,1,0,0,0,0,0.13286


**Evaluating the result with AUC**

In [57]:
vg = valid1.groupby(['Coupon_id'])  # split the DataFrame into groups by Coupon_id

In [62]:
aucs = []
for i in vg:
    # each item is a (coupon id, rows for that coupon) pair
    tmpdf = i[1]  # the DataFrame for this coupon
    if len(tmpdf['label'].unique()) != 2:  # unique() lists the distinct labels
        # only one class present: skip, because AUC cannot be computed
        continue
    fpr, tpr, thresholds = roc_curve(tmpdf['label'], tmpdf['pred_prob'], pos_label=1)  # sklearn returns fpr, tpr and the thresholds
    # inputs: true labels and predicted scores; pos_label is the label treated as positive, all others are negative
    aucs.append(auc(fpr, tpr))  # area under this coupon's ROC curve, stored in the list
    # from sklearn.metrics import auc, roc_curve
print(np.average(aucs))  # the mean of the per-coupon AUCs is the evaluation score

0.5323444017937871
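
The per-coupon averaging above can be wrapped in a reusable function. The sketch below uses `roc_auc_score`, which gives the same area as `roc_curve` followed by `auc`; the function name `mean_group_auc`, its default column names, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def mean_group_auc(df, group_col='Coupon_id', label_col='label', score_col='pred_prob'):
    """Average the AUC over groups, skipping groups that contain only one class."""
    aucs = []
    for _, group in df.groupby(group_col):
        if group[label_col].nunique() != 2:
            continue  # AUC is undefined when only one class is present
        aucs.append(roc_auc_score(group[label_col], group[score_col]))
    return float(np.mean(aucs)) if aucs else float('nan')

# Toy data, invented for this example: coupon 'c' has one class only and is skipped
toy = pd.DataFrame({
    'Coupon_id': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'label':     [0, 0, 1, 1, 0, 1, 0, 0],
    'pred_prob': [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.5, 0.6],
})
toy_score = mean_group_auc(toy)
print(toy_score)  # 0.375, the mean of 0.75 (coupon a) and 0.0 (coupon b)
```

Because each coupon counts equally regardless of how many rows it has, this score can differ noticeably from a single AUC computed over all validation rows.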


In [73]:
 tpr

array([0.07142857, 0.07142857, 0.21428571, 0.28571429, 0.35714286,
       0.42857143, 0.42857143, 0.5       , 0.5       , 0.57142857,
       0.64285714, 0.64285714, 0.71428571, 0.85714286, 0.85714286,
       0.85714286, 0.92857143, 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        ])

### Generating Predictions for the Test Set

In [76]:
y_test_pred = model.predict_proba(dftest[predictors])
dftest1 = dftest[['User_id','Coupon_id','Date_received']].copy()
dftest1['Probability'] = y_test_pred[:,1]

In [78]:
dftest1  # the prediction results required for submission

Unnamed: 0,User_id,Coupon_id,Date_received,Probability
0,4129537,9983,20160712,0.105270
1,6949378,3429,20160706,0.153750
2,2166529,6928,20160727,0.005533
3,2166529,1808,20160727,0.018750
4,6172162,6500,20160708,0.063657
5,4005121,9983,20160706,0.126577
6,4347394,9983,20160716,0.115114
7,3094273,13602,20160727,0.103618
8,5139970,9983,20160729,0.011736
9,3237121,13602,20160703,0.073244


In [79]:
dftest1.to_csv('submit1.csv', index=False, header=False)
# index=False and header=False write the file without the index or column names
dftest1.head(5)

Unnamed: 0,User_id,Coupon_id,Date_received,Probability
0,4129537,9983,20160712,0.10527
1,6949378,3429,20160706,0.15375
2,2166529,6928,20160727,0.005533
3,2166529,1808,20160727,0.01875
4,6172162,6500,20160708,0.063657


### Saving and Loading the Model

In [80]:
if not os.path.isfile('1_model.pkl'):
    with open('1_model.pkl', 'wb') as f:
        pickle.dump(model, f)
else:
    with open('1_model.pkl', 'rb') as f:
        model = pickle.load(f)



In [81]:
 model

GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=True),
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('en', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', max_iter=100, n_iter=None,
       n_jobs=1, penalty='elasticnet', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'en__alpha': [0.001, 0.01, 0.1], 'en__l1_ratio': [0.001, 0.01, 0.1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)