## 任务2 - 特征工程（2天）

任务2：对数据特征进行衍生和进行特征挑选。时间：2天

  * 特征衍生

  * 特征挑选：分别用IV值和随机森林等进行特征选择

  * ……以及你能想到特征工程处理

### 特征衍生
  在实际业务中，通常我们只拥有几个到几十个不等的基础变量，而多数变量没有实际含义，不适合直接建模，如用户地址（多种属性值的分类变量）、用户日消费金额（弱数值变量）。而此类变量在做一定的变换或者组合后，往往具有较强的信息价值，对数据敏感性和机器学习实战经验能起到一定的帮助作用。所以我们需要对基础特征做一些衍生类的工作，也就是业内常说的如何生成万维数据。

  特征衍生也叫特征构建，是指从原始数据中构建新的特征，也属于特征选择的一种手段。特征构建工作并不完全依赖于技术，它要求我们具备相关领域丰富的知识或者实践经验，基于业务，花时间去观察和分析原始数据，思考问题的潜在形式和数据结构，从原始数据中找出一些具有物理意义的特征。

  找到可以拓展的基础特征后，便可用如下几种方式衍生特征：

特征扩展
合成特征
特征组合
特征交叉

In [109]:
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

# 载入数据
with open('final.pkl', 'rb') as f:
    final_data = pickle.load(f)
# data = pd.read_csv("./data.csv",encoding='gbk')
# y=data.status
# 划分训练集测试集
X, y = final_data[final_data.columns.drop("status")], final_data['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)


In [110]:
# 性能评估
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # 预测
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
    # 准确率
    print "'[准确率]': end = ''"
    print "训练集：', '%.4f', end = ' '" % accuracy_score(y_train, y_train_pred)
    print "'测试集：', '%.4f')" % accuracy_score(y_test, y_test_pred)
    
    # auc取值：用roc_auc_score或auc
    print "('[auc值]', end = ' ')"
    print "('训练集：', '%.4f', end = ' ')"%roc_auc_score(y_train, y_train_proba)
    print "'测试集：', '%.4f')"%roc_auc_score(y_test, y_test_proba)

## IV值进行特征选择[参考链接](https://blog.csdn.net/iModel/article/details/79420437)

在机器学习的二分类问题中，IV值（Information Value）主要用来对输入变量进行编码和预测能力评估。特征变量IV值的大小即表示该变量预测能力的强弱。IV 值的取值范围是 \[[0, 正无穷) ，如果当前分组中只包含响应客户或者未响应客户时，IV = 正无穷。量化指标含义如下：< 0.02useless for prediction、0.02 to 0.1Weak predictor、0.1 to 0.3Medium predictor、0.3 to 0.5Strong predictor 、>0.5 Suspicious or too good to be true。

- WOE的全称是“weight of evidence”，即证据权重。直观上讲，WOE是对原始变量的一种编码形式，要对一个变量进行WOE编码，首先需要把这个变量进行分组处理，即分箱或者离散化，常用离散化的方法有等宽分组，等高分组，或者利用决策树来分组。

- IV衡量的是某一个变量的信息量，从公式来看的话，相当于是自变量WOE值的一个加权求和，其值的大小决定了自变量对于目标变量的影响程度，对于分组 i ，其对应的IV值参考下图，其中n是分组个数，注意，在变量的任何分组中，不应该出现响应数为0或非响应数位0的情况，当变量的一个分组的响应数位0时，对应的woe就为负无穷，此时IV值为正无穷。

- WOE和IV值的区别WOE 和 IV 都能表达某个分组对目标变量的预测能力。但实际中，我们通常选择 IV 而不是 WOE 的和来衡量变量预测的能力，这是为什么呢？首先，因为我们在衡量一个变量的预测能力时，我们所使用的指标值不应该是负数。从这意义上来说，IV 比 WOE 多乘以前面那个因子，就保证了它不会是负数；然后，乘以(Pyi−Pni)这个因子，体现出了变量当前分组中个体的数量占整体的比例，从而很好考虑了这个分组中样本占整体的比例，比例越低，这个分组对变量整体预测能力的贡献越低。相反，如果直接用 WOE 的绝对值加和，会因为该分组出现次数偏少的影响而得到一个很高的指标。


In [111]:
import math
import numpy as np
from scipy import stats
from sklearn.utils.multiclass import type_of_target

def woe(X, y, event=1):  
    res_woe = []
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) 连续特征离散化
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) 计算该特征的woe和iv
        # woe_dict, iv = woe_single_x(x, y, feature, event)
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
        res_woe.append(woe_dict) 
        
    return iv_dict
        
def discrete(x):
    # 使用5等分离散化特征
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        x1 = x[np.where((x >= point1) & (x <= point2))]
        mask = np.in1d(x, x1)
        res[mask] = i + 1    # 将[i, i+1]块内的值标记成i+1
    return res

def woe_single_x(x, y, feature,event = 1):
    # event代表预测正例的标签
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total
    
    iv = 0
    woe_dict = {}
    for x1 in set(x):    # 遍历各个块
        y1 = y.reindex(np.where(x == x1)[0])
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total    
        rate_non_event = non_event_count / non_event_total
        
        if rate_event == 0:
            rate_event = 0.0001
            # woei = -20
        if rate_non_event == 0:
            rate_non_event = 0.0001
            # woei = 20
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv

In [112]:
import warnings
warnings.filterwarnings("ignore")

iv_dict = woe(X_train, y_train)


In [113]:
iv = sorted(iv_dict.items(), key = lambda x:x[1],reverse = True)
iv

[(u'reg_preference_for_trad_\u4e8c\u7ebf\u57ce\u5e02', 9.209419337938984),
 (u'reg_preference_for_trad_\u5176\u4ed6\u57ce\u5e02', 9.209419337938984),
 (u'is_high_user', 9.209419337938984),
 (u'railway_consume_count_last_12_month', 9.209419337938984),
 (u'reg_preference_for_trad_\u5883\u5916', 9.209419337938984),
 (u'jewelry_consume_count_last_6_month', 9.209419337938984),
 (u'loans_long_time', 0.0),
 (u'low_volume_percent', 0.0),
 (u'avg_price_last_12_month', 0.0),
 (u'avg_price_top_last_12_valid_month', 0.0),
 (u'latest_six_month_loan', 0.0),
 (u'consume_mini_time_last_1_month', 0.0),
 (u'latest_one_month_fail', 0.0),
 (u'consfin_product_count', 0.0),
 (u'trans_fail_top_count_enum_last_1_month', 0.0),
 (u'trans_day_last_12_month', 0.0),
 (u'middle_volume_percent', 0.0),
 (u'max_cumulative_consume_later_1_month', 0.0),
 (u'regional_mobility', 0.0),
 (u'consfin_credit_limit', 0.0),
 (u'trans_days_interval_filter', 0.0),
 (u'query_org_count', 0.0),
 (u'loans_score', 0.0),
 (u'loans_overd

## 随机森林挑选特征

In [104]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 观察默认参数的性能
rf0 = RandomForestClassifier(oob_score=True, random_state=2018)
rf0.fit(X_train, y_train)
print u'袋外分数：', rf0.oob_score_
model_metrics(rf0, X_train, X_test, y_train, y_test)
rf0

袋外分数： 0.7424105801021942
'[准确率]': end = ''
训练集：', '0.9850', end = ' '
'测试集：', '0.7744')
('[auc值]', end = ' ')
('训练集：', '0.9992', end = ' ')
'测试集：', '0.7193')


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=True, random_state=2018, verbose=0, warm_start=False)

In [105]:
# 网格法调参
param_test = {'n_estimators':range(20,200,20)}
gsearch = GridSearchCV(estimator = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50, 
                                                          min_samples_leaf=20, max_features = 9,random_state=2018), 
                       param_grid = param_test, scoring='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

({'n_estimators': 100}, 0.7939985894823949)

In [106]:
rf = RandomForestClassifier(n_estimators=100, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features = 9,oob_score=True, random_state=2018)
rf.fit(X_train, y_train)
print u'袋外分数：', rf.oob_score_
model_metrics(rf, X_train, X_test, y_train, y_test)

袋外分数： 0.7902013826269912
'[准确率]': end = ''
训练集：', '0.8182', end = ' '
'测试集：', '0.7821')
('[auc值]', end = ' ')
('训练集：', '0.8999', end = ' ')
'测试集：', '0.7728')


In [107]:
rf_dict = {}
feature_s = rf.feature_importances_
feature_n = X.columns
for i in range(0,len(feature_s)):
    rf_dict[feature_n[i]] = feature_s[i]
rf_dict

{u'abs': 0.006719667941365493,
 u'apply_credibility': 0.005117155285768405,
 u'apply_score': 0.06308268368322414,
 u'avg_consume_less_12_valid_month': 0.0015488547285952615,
 u'avg_price_last_12_month': 0.010072827889432335,
 u'avg_price_top_last_12_valid_month': 0.004634251845524828,
 u'consfin_avg_limit': 0.012167948971900702,
 u'consfin_credibility': 0.005578808638673587,
 u'consfin_credit_limit': 0.00894937059798237,
 u'consfin_max_limit': 0.0076301228627914455,
 u'consfin_org_count_behavior': 0.002491680778681817,
 u'consfin_org_count_current': 0.0041933816829697965,
 u'consfin_product_count': 0.004515194457619264,
 u'consume_mini_time_last_1_month': 0.006984134834602973,
 u'consume_top_time_last_1_month': 0.009705384040834536,
 u'consume_top_time_last_6_month': 0.006534534454032899,
 u'cross_consume_count_last_1_month': 0.00031588336700211044,
 u'first_transaction_day': 0.009583052272597613,
 u'historical_trans_amount': 0.010902055704639096,
 u'historical_trans_day': 0.0088745098

In [108]:
rf_sort = sorted(rf_dict.items(), key = lambda x:x[1],reverse = True)
rf_sort

[(u'trans_fail_top_count_enum_last_1_month', 0.12296704348417509),
 (u'history_fail_fee', 0.1131460948020786),
 (u'loans_score', 0.07299907548161173),
 (u'apply_score', 0.06308268368322414),
 (u'latest_one_month_fail', 0.0500495827779603),
 (u'loans_overdue_count', 0.04349707521896435),
 (u'trans_fail_top_count_enum_last_6_month', 0.030280854961790782),
 (u'trans_day_last_12_month', 0.02237300997807261),
 (u'trans_fail_top_count_enum_last_12_month', 0.021850806333816367),
 (u'max_cumulative_consume_later_1_month', 0.01970361477796049),
 (u'latest_one_month_suc', 0.018943873520427777),
 (u'latest_query_day', 0.017620250591160826),
 (u'rank_trad_1_month', 0.016248946710905557),
 (u'trans_top_time_last_1_month', 0.014624721554774239),
 (u'consfin_avg_limit', 0.012167948971900702),
 (u'historical_trans_amount', 0.010902055704639096),
 (u'trans_amount_3_month', 0.010883166394533912),
 (u'history_suc_fee', 0.01083947927223657),
 (u'trans_activity_day', 0.0100941566527476),
 (u'avg_price_last