# Step 4: Prediction

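## 4.1 Predicting on the Test Set
We group the customers in the test set using the same procedure as before, then use the trained Voting model to predict each customer's group.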

In [1]:
# Import the required packages
from lib.feature_engineer import *
import pandas as pd
from joblib import load 
from sklearn.metrics import accuracy_score

In [2]:
# Load the test data
test_set = pd.read_csv('./temp/test_set.csv')

In [3]:
# Label the test data (assign each customer to a group)
df_test = feature_engineer(test_set, 5)

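| Group | 6 | 1 | 9 | 8 | 0 | 4 | 5 | 2 | 10 | 3 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of customers | 1014 | 300 | 275 | 223 | 223 | 218 | 145 | 124 | 10 | 4 | 3 |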


In [44]:
# Columns 1-6 hold the features; column 7 holds the group label
X_test = df_test.iloc[:, 1:7]
Y_test = df_test.iloc[:, 7].values

In [45]:
# Load the trained voting model
clf = load("./temp/votingC_model.joblib")

In [46]:
# Predict with the model
Y_pred = clf.predict(X_test)

In [48]:
# Print the accuracy on the test set
print("Accuracy: {:.2f} %".format(100 * accuracy_score(Y_test, Y_pred)))

Accuracy: 83.77 %


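On a test set the model has never seen, we obtained a prediction accuracy of 83.77%. This means that, based on a customer's first order alone, we can predict the correct group more than 80% of the time. Next we look at what this metric means in practice.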
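The accuracy above is the share of test customers whose predicted group equals their actual group. Since the groups differ greatly in size, it can help to look at the same share per group. The sketch below does this in plain Python; `y_true` and `y_pred` are hypothetical toy labels, not the notebook's `Y_test` and `Y_pred`, and both function names are illustrative.

```python
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def per_group_accuracy(y_true, y_pred):
    """For each true group, the share of its members predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {g: hits[g] / n for g, n in totals.items()}

# Hypothetical toy labels, for illustration only
y_true = [6, 6, 6, 6, 1, 1, 5, 3]
y_pred = [6, 6, 6, 1, 1, 1, 5, 5]

print("Accuracy: {:.2f} %".format(100 * overall_accuracy(y_true, y_pred)))
for group, acc in sorted(per_group_accuracy(y_true, y_pred).items()):
    print("Group {}: {:.2f} %".format(group, 100 * acc))
```

A high overall accuracy can coexist with a low score on a small group, which matters here because the groups targeted in 4.2 are among the smaller ones.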

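## 4.2 Evaluating the Business Impact
In Step 2 we summarized the characteristics of each group. Group 3 has the lowest average number of orders (1.16), followed by group 5 (1.9); these two groups are the main targets for improvement. Next come the groups with an average below 2.5: groups 1, 4, and 8. According to the predictions on the test set, 182 of the 2,539 customers are unlikely to repurchase: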

In [64]:
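target_group_no = [3]
print("Customers extremely unlikely to repurchase: {}".format(sum([i in target_group_no for i in Y_pred])))
target_group_no = [5]
print("Customers very unlikely to repurchase: {}".format(sum([i in target_group_no for i in Y_pred])))
target_group_no = [1,4,8]
print("Customers who may repurchase but should be monitored: {}".format(sum([i in target_group_no for i in Y_pred])))

Customers extremely unlikely to repurchase: 4
Customers very unlikely to repurchase: 178
Customers who may repurchase but should be monitored: 199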
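The three counts above can also be obtained in a single pass over the predictions. This is a minimal sketch, assuming the predictions are a sequence of integer group ids. `y_pred` is a hypothetical toy list, and the segment names and the `count_segments` helper are illustrative, not part of the notebook.

```python
from collections import Counter

# Segments follow the grouping in the text: group 3, group 5, and groups 1/4/8
segments = {
    "extremely unlikely to repurchase": [3],
    "very unlikely to repurchase": [5],
    "may repurchase, worth monitoring": [1, 4, 8],
}

def count_segments(y_pred, segments):
    """Count how many predictions fall into each segment's groups."""
    counts = Counter(y_pred)  # predicted group id -> number of customers
    return {name: sum(counts[g] for g in groups)
            for name, groups in segments.items()}

# Hypothetical toy predictions, for illustration only
y_pred = [3, 5, 5, 1, 4, 8, 8, 6, 6, 0]
for name, n in count_segments(y_pred, segments).items():
    print("{}: {}".format(name, n))
```

With a NumPy array of predictions, passing `Y_pred.tolist()` keeps the keys as plain Python integers.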


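### 4.2.1 Sales Uplift
Suppose we run targeted promotions, for example emailing these customers coupons for a repeat purchase, and raise their repurchase rate by 10%, 30%, 50%, or 70%. The resulting gains would be as follows: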

In [100]:
import numpy as np

395422.22000000003

In [130]:
group_mean = [224.38,263.55,233.74,249.21,248.46,328.93,1138.73,4810.17,463.07,362.168,2564.43]
group_id = [4,9,1,8,0,6,5,3,2,7,10]
group_improve = [3,5,1,4,8]
rate = [0.1, 0.3, 0.5, 0.7]

In [85]:
group_info = pd.DataFrame(group_mean, group_id).rename(columns={0:"mean"})

In [97]:
group_info.loc[:, "num"] = ([sum([k == i for k in Y_pred]) for i in group_info.index])

In [127]:
group_arr = group_info.values
improve_arr = group_info.loc[group_improve].values

In [129]:
total = np.round(sum(group_arr[:,0]*group_arr[:,1]), 2)
improve_total = np.round(sum(improve_arr[:,0]*improve_arr[:,1]), 2)

In [136]:
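for r in rate:
    print("For customers with a low repurchase index, raising the conversion rate by {}% increases sales by {} GBP, a growth rate of {}%"
          .format(np.round(r*100,0), np.round(improve_total*r,2), np.round(improve_total*r/total*100,2)))

For customers with a low repurchase index, raising the conversion rate by 10.0% increases sales by 39542.22 GBP, a growth rate of 4.35%
For customers with a low repurchase index, raising the conversion rate by 30.0% increases sales by 118626.67 GBP, a growth rate of 13.04%
For customers with a low repurchase index, raising the conversion rate by 50.0% increases sales by 197711.11 GBP, a growth rate of 21.74%
For customers with a low repurchase index, raising the conversion rate by 70.0% increases sales by 276795.55 GBP, a growth rate of 30.43%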
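The figures above come from two ratios: the added sales are `improve_total * rate`, and the growth rate is that amount divided by `total`. In other words, the notebook's formula treats each rate as a proportional increase in the targeted groups' sales. The sketch below restates the calculation as a function. The function name and the input values are hypothetical toy numbers, not the notebook's totals.

```python
def sales_uplift(improve_total, total, rate):
    """Return (added sales, growth in percent of total sales) for one rate."""
    added = improve_total * rate      # extra sales from the targeted groups
    growth_pct = added / total * 100  # relative to sales across all groups
    return round(added, 2), round(growth_pct, 2)

# Hypothetical toy totals, for illustration only
improve_total = 1000.0  # sales attributed to the targeted groups
total = 4000.0          # sales across all groups

for rate in [0.1, 0.3, 0.5, 0.7]:
    added, growth_pct = sales_uplift(improve_total, total, rate)
    print("rate {:.0%}: +{} in sales, {}% growth".format(rate, added, growth_pct))
```

Because both outputs are linear in `rate`, the 70% scenario is seven times the 10% scenario, which matches the pattern in the printed results.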


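### 4.2.2 Promotion Cost Savings
Assuming the discount equals 30% of sales, we compare the promotion cost of targeting every customer with the cost of targeting only the groups identified above: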

In [143]:
group_price = [483.25,743.11,535.3,594.42,671.47,1023.07,2204.41,5403.55,7027.12,33514.7,41911.2]
group_id = [4,9,1,8,0,6,5,3,2,7,10]
group_improve = [3,5,1,4,8]
group_info = pd.DataFrame(group_price, group_id).rename(columns={0:"price"})
group_info.loc[:, "num"] = ([sum([k == i for k in Y_pred]) for i in group_info.index])
group_arr = group_info.values
improve_arr = group_info.loc[group_improve].values

In [144]:
total = np.round(sum(group_arr[:,0]*group_arr[:,1]), 2)
improve_total = np.round(sum(improve_arr[:,0]*improve_arr[:,1]), 2)

In [145]:
before_promotion = total * 0.3
after_promotion = improve_total * 0.3

In [150]:
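print("Without targeted promotion, the total promotion cost is {} GBP".format(np.round(before_promotion,2)))
print("With targeted promotion, the total promotion cost is {} GBP, a reduction of {}%".format(
    np.round(after_promotion,2),
    np.round((before_promotion - after_promotion)/before_promotion *100,2)))

Without targeted promotion, the total promotion cost is 792984.96 GBP
With targeted promotion, the total promotion cost is 243627.1 GBP, a reduction of 69.28%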
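Because the same 30% discount is applied in both scenarios, the relative saving depends only on the ratio of `improve_total` to `total`, not on the discount rate itself. The short sketch below reproduces the arithmetic from the two cost figures printed above; the function name `cost_reduction_pct` is illustrative.

```python
def cost_reduction_pct(cost_before, cost_after):
    """Percentage by which the promotion cost falls."""
    return (cost_before - cost_after) / cost_before * 100

# Promotion costs printed above (GBP), computed with a 30% discount
before_promotion = 792984.96
after_promotion = 243627.1
reduction = cost_reduction_pct(before_promotion, after_promotion)
print(round(reduction, 2))

# Rescaling both costs to a 10% discount leaves the percentage unchanged
reduction_10 = cost_reduction_pct(before_promotion / 0.3 * 0.1,
                                  after_promotion / 0.3 * 0.1)
print(round(reduction_10, 2))
```

The absolute saving in GBP still scales with the discount rate; only the percentage is independent of it.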
