## Starbucks Promotion A/B Test
<br>

<img src="https://opj.ca/wp-content/uploads/2018/02/New-Starbucks-Logo-1200x969.jpg" width="200" height="200">
<br>
<br>
 
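#### Background

This exercise uses a dataset that Starbucks originally gave to job candidates as an interview assignment. The data contains 120,000 data points, split 2:1 into a training file and a test file. It simulates an experiment that tested an advertising promotion to see whether it would bring more customers to purchase a specific product priced at $10. Since it costs the company $0.15 to send out each promotion, the promotion is best limited to the people most likely to respond to it. Each data point has one column indicating whether the individual was sent a promotion for the product, and one column indicating whether that individual eventually purchased the product. Each individual also has seven additional features, labeled V1-V7.

#### Optimization Strategy

Your task is to use the training data to understand which patterns in V1-V7 indicate that a promotion should be sent to a user. Specifically, your goal is to maximize the following two metrics:

* **Incremental Response Rate (IRR)** 

IRR measures how many more customers purchased the product with the promotion than without it. Mathematically, it is the ratio of the number of purchasers in the promotion group to the total number of customers in the promotion group (_treatment_) minus the ratio of the number of purchasers in the non-promotion group to the total number of customers in the non-promotion group (_control_).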

$$ IRR = \frac{purch_{treat}}{cust_{treat}} - \frac{purch_{ctrl}}{cust_{ctrl}} $$


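* **Net Incremental Revenue (NIR)**

NIR measures how much revenue is gained (or lost) by sending out the promotion. Mathematically, it is 10 times the number of purchasers who received the promotion, minus 0.15 times the number of promotions sent out, minus 10 times the number of purchasers who did not receive the promotion.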

$$ NIR = (10\cdot purch_{treat} - 0.15 \cdot cust_{treat}) - 10 \cdot purch_{ctrl}$$
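
The two formulas above can be sketched in a few lines of pandas. This is a minimal illustration, not the scoring code shipped with the exercise: the function name `irr_nir`, its `price` and `cost` parameters, and the toy frame are assumptions made for this example. It only assumes a `Promotion` column holding `'Yes'`/`'No'` and a 0/1 `purchase` column, as in the training data, and it does not guard against an empty group.

```python
import pandas as pd

def irr_nir(df, price=10.0, cost=0.15):
    """Return (IRR, NIR) for a frame with 'Promotion' ('Yes'/'No') and 'purchase' (0/1)."""
    treat = df[df["Promotion"] == "Yes"]   # customers who were sent the promotion
    ctrl = df[df["Promotion"] == "No"]     # customers who were not
    purch_treat = treat["purchase"].sum()
    purch_ctrl = ctrl["purchase"].sum()
    # IRR: purchase rate in treatment minus purchase rate in control
    irr = purch_treat / len(treat) - purch_ctrl / len(ctrl)
    # NIR: treatment revenue minus promotion cost, minus control revenue
    nir = (price * purch_treat - cost * len(treat)) - price * purch_ctrl
    return irr, nir

# Toy data (hypothetical): 4 promoted customers with 2 purchases,
# 4 non-promoted customers with 1 purchase.
toy = pd.DataFrame({
    "Promotion": ["Yes"] * 4 + ["No"] * 4,
    "purchase": [1, 1, 0, 0, 1, 0, 0, 0],
})
irr, nir = irr_nir(toy)
print(irr, nir)
```

For the toy frame, IRR is 2/4 - 1/4 = 0.25 and NIR is (20 - 0.60) - 10 = 9.40.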

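For a full description of the dataset Starbucks provided to candidates, see the [instructions available here](https://drive.google.com/open?id=18klca9Sef1Rs6q8DW4l7o349r8B70qXM).

The training data is loaded below. Explore the data and different optimization strategies.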

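#### How To Test Your Strategy?

Once you have an optimization strategy, complete the `promotion_strategy` function, which is passed to the `test_results` function.  
From past data, we know there are four possible outcomes:

Table of actual promotion vs. predicted promotion customers:  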

<table>
<tr><th></th><th colspan = '2'>Actual</th></tr>
<tr><th>Predicted</th><th>Yes</th><th>No</th></tr>
<tr><th>Yes</th><td>I</td><td>II</td></tr>
<tr><th>No</th><td>III</td><td>IV</td></tr>
</table>

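The metrics are compared only for the individuals we predict should receive the promotion, that is, quadrants I and II. Since the first set of customers to receive the promotion (in the training set) received it at random, quadrants I and II should contain approximately the same number of participants.  

Comparing quadrant I with quadrant II then indicates how well the promotion strategy is likely to work in the future. 

Start by reading in the data below. See how each variable, or combination of variables, together with the promotion affects the purchase rate. Once you have a strategy for who should receive a promotion, test it against the test dataset used in the final `test_results` function.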

In [4]:
# load in packages
from itertools import combinations

from test_results import test_results, score
import numpy as np
import pandas as pd
import scipy as sp
from imblearn.over_sampling import SMOTE
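import sklearn as sk
import sklearn.metrics          # load the submodule so sk.metrics resolves
import sklearn.model_selection  # load the submodule so sk.model_selection resolves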
import xgboost as xgb
# from xgboost import plot_importance

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded.
Likely causes:
  * OpenMP runtime is not installed (vcomp140.dll or libgomp-1.dll for Windows, libomp.dylib for Mac OSX, libgomp.so for Linux and other UNIX-like OSes). Mac OSX users: Run `brew install libomp` to install OpenMP runtime.
  * You are running 32-bit Python on a 64-bit OS
Error message(s): ['dlopen(/Users/tom/opt/anaconda3/lib/python3.7/site-packages/xgboost/lib/libxgboost.dylib, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib\n  Referenced from: /Users/tom/opt/anaconda3/lib/python3.7/site-packages/xgboost/lib/libxgboost.dylib\n  Reason: image not found']


### 1. Data Exploration

In [2]:
# load in the data
train_data = pd.read_csv('./data/training.csv')
train_data.head()

Unnamed: 0,ID,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7
0,1,No,0,2,30.443518,-1.165083,1,1,3,2
1,3,No,0,3,32.15935,-0.645617,2,3,2,2
2,4,No,0,2,30.431659,0.133583,1,1,4,2
3,5,No,0,0,26.588914,-0.212728,2,1,4,2
4,8,Yes,0,3,28.044332,-0.385883,1,1,2,2


In [None]:
# check for data info
train_data.info()

In [None]:
# check dataset
train_data.describe()

In [None]:
# check for null value
train_data.isnull().sum()

In [None]:
# load in the data
test_data = pd.read_csv('./data/Test.csv')
test_data.head()

In [None]:
test_data.info()

In [None]:
test_data.describe()

In [None]:
# check for null value
test_data.isnull().mean()

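The data is clean, and `train_data` and `test_data` are provided at a 2:1 ratio of data points. The training data can be explored visually; to prevent data leakage, `test_data` is used only in the testing stage.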

In [None]:
# check A/B groups 1:1 
sns.countplot(data=train_data, x='Promotion');

Customers are approximately equally divided into the treatment and control groups.
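
The visual check above can be complemented with a numeric one. The following is a rough sketch, assuming SciPy is available; the helper name `allocation_check` and the toy counts are made up for illustration. It runs a chi-square goodness-of-fit test of the observed group sizes against an even 1:1 split, so a large p-value is consistent with balanced allocation.

```python
import pandas as pd
from scipy.stats import chisquare

def allocation_check(promotion):
    """Chi-square test of the observed group counts against an even split."""
    counts = promotion.value_counts()
    # With no expected frequencies given, chisquare assumes equal counts per group
    stat, p_value = chisquare(counts.to_numpy())
    return counts, stat, p_value

# Toy assignment column (hypothetical): 510 promoted, 490 not promoted
toy_promotion = pd.Series(["Yes"] * 510 + ["No"] * 490)
counts, stat, p_value = allocation_check(toy_promotion)
print(counts.to_dict(), round(stat, 3), round(p_value, 3))
```

On the notebook's data the call would presumably be `allocation_check(train_data['Promotion'])`.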

In [None]:
# check purchase rate in train_data
train_data['purchase'].mean()

In [None]:
purch_palette = [sns.color_palette()[3], sns.color_palette()[2]]
sns.countplot(data=train_data, x='purchase', palette=purch_palette);

In [None]:
features = train_data.columns[3:].tolist()
features_df = train_data[features]
features_df.describe()

In [None]:
features_df.hist(figsize=(10,10));

'V2' and 'V3' are continuous, so they are treated as numeric (quantitative) data, while 'V1', 'V4', 'V5', 'V6', and 'V7' behave like categorical data.
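
One way to back up this reading of the histograms is to count distinct values per feature. The sketch below is only a heuristic: the helper name `summarize_feature_types` and the `max_levels=10` threshold are assumptions chosen for illustration, and the toy values are invented.

```python
import pandas as pd

def summarize_feature_types(df, max_levels=10):
    """Label a column 'categorical' when it has at most max_levels distinct values."""
    n_unique = df.nunique()
    kind = n_unique.apply(
        lambda n: "categorical" if n <= max_levels else "quantitative"
    )
    return pd.DataFrame({"n_unique": n_unique, "kind": kind})

# Toy frame (hypothetical values): V1 takes a few integer levels, V2 is continuous
toy = pd.DataFrame({
    "V1": [2, 3, 2, 0, 3, 1, 2, 0, 1, 3, 2, 1],
    "V2": [30.44, 32.16, 30.43, 26.59, 28.04, 31.2,
           29.7, 33.1, 27.8, 30.9, 28.6, 31.7],
})
summary = summarize_feature_types(toy)
print(summary)
```

Applied to the notebook's data, this would be `summarize_feature_types(features_df)`.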

### 2. Building the Model

In [None]:
dummied_df = pd.get_dummies(train_data, columns = ['V1', 'V4', 'V5', 'V6', 'V7'])
dummied_df.head()

In [None]:
dummied_df.info()

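The goal of the model is to uncover the information hidden in the data and find the consumers who are likely to make a purchase after receiving the promotion, so that the promotion becomes more effective, revenue increases, and promotion cost drops. A label `y` is added to the data: it is set to 1 when a consumer received the promotion and made a purchase, and to 0 otherwise.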

In [None]:
dummied_df['y'] = np.where((dummied_df['purchase'] == 1)&(dummied_df['Promotion'] == 'Yes'), 1, 0)
dummied_df.head()

In [None]:
dummied_df[dummied_df['y'] == 1]

In [None]:
train_set, valid_set = sk.model_selection.train_test_split(dummied_df, test_size=0.2,random_state=42)

In [None]:
features = dummied_df.columns[3:-1]
X_train = train_set[features]
Y_train = train_set['y']

X_valid = valid_set[features]
Y_valid = valid_set['y']

In [None]:
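# `ratio` and `fit_sample` are the older names; imbalanced-learn 0.4+ uses
# `sampling_strategy` and `fit_resample`
sm = SMOTE(random_state=42, sampling_strategy=1.0)
X_train_upsamp, Y_train_upsamp = sm.fit_resample(X_train, Y_train)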
X_train_upsamp = pd.DataFrame(X_train_upsamp, columns=features)
Y_train_upsamp = pd.Series(Y_train_upsamp)

In [None]:
# Train an xgboost model
eval_set = [(X_train_upsamp, Y_train_upsamp), (X_valid, Y_valid)]
model = xgb.XGBClassifier(learning_rate=0.1, max_depth=7, min_child_weight=5,
                          objective='binary:logistic', seed=42, gamma=0.1,
                          silent=True)
model.fit(X_train_upsamp, Y_train_upsamp, eval_set=eval_set, eval_metric="auc",
          verbose=True, early_stopping_rounds=30)

In [None]:
fig, ax = plt.subplots(figsize=(10, 18));
xgb.plot_importance(model, ax=ax);

In [None]:
valid_pred = model.predict(X_valid, ntree_limit=model.best_ntree_limit)
cm = sk.metrics.confusion_matrix
cm(Y_valid, valid_pred)

In [None]:
valid_pred = model.predict(X_valid, ntree_limit=model.best_ntree_limit)
cm = sk.metrics.confusion_matrix(Y_valid, valid_pred)

fig, ax= plt.subplots(figsize=(10,10))
sns.heatmap(cm, annot=True, fmt='g', ax=ax);  # annot=True writes the count in each cell

# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['No Purchase', 'Made Purchase']); 
ax.yaxis.set_ticklabels(['No Purchase', 'Made Purchase']);

In [None]:
def promotion_strategy(df):
    '''
    INPUT 
    df - a dataframe with *only* the columns V1 - V7 (same as train_data)

    OUTPUT
    promotion - np.array with the values
                'Yes' or 'No' related to whether or not an 
                individual should receive a promotion 
                should be the length of df.shape[0]
                
    Ex:
    INPUT: df
    
    V1   V2    V3     V4   V5   V6   V7
    2    30    -1.1   1    1    3    2
    3    32    -0.6   2    3    2    2
    2    30    0.13   1    1    4    2
    
    OUTPUT: promotion
    
    array(['Yes', 'Yes', 'No'])
    indicating the first two users would receive the promotion and 
    the last should not.
    '''
    # The model was trained on one-hot encoded features, so encode the raw
    # V1-V7 columns the same way and align them with the training columns.
    X = pd.get_dummies(df, columns=['V1', 'V4', 'V5', 'V6', 'V7'])
    X = X.reindex(columns=features, fill_value=0)

    preds = model.predict(X, ntree_limit=model.best_ntree_limit)

    # Map the predicted label (1 = purchase after promotion) to 'Yes' / 'No'
    promotion = np.where(preds == 1, 'Yes', 'No')
    return promotion
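
The validation cell that follows calls `valid_results`, which is neither defined nor imported in the cells shown here. A minimal sketch of what such a helper might look like is given below. It is an assumption, not the author's implementation: the signature, the `FEATURES` list, the `price` and `cost` parameters, and the toy rule are all made up for illustration. It applies the strategy to the raw V1-V7 columns, keeps only the rows the strategy selects, and computes IRR and NIR on those rows.

```python
import numpy as np
import pandas as pd

FEATURES = ["V1", "V2", "V3", "V4", "V5", "V6", "V7"]

def valid_results(strategy, df, price=10.0, cost=0.15):
    """Score a strategy on the rows it selects.

    df must hold the raw V1-V7 columns plus 'Promotion' and 'purchase'.
    """
    promos = np.asarray(strategy(df[FEATURES]))
    selected = df[promos == "Yes"]          # rows the strategy would promote
    treat = selected[selected["Promotion"] == "Yes"]
    ctrl = selected[selected["Promotion"] == "No"]
    if len(treat) == 0 or len(ctrl) == 0:
        raise ValueError("strategy selected no treatment or no control rows")
    irr = treat["purchase"].mean() - ctrl["purchase"].mean()
    nir = (price * treat["purchase"].sum() - cost * len(treat)
           - price * ctrl["purchase"].sum())
    return irr, nir

# Toy data and a toy rule (both hypothetical): promote whenever V1 == 1
toy = pd.DataFrame({
    "Promotion": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "purchase": [1, 0, 1, 1, 0, 0],
    "V1": [1, 1, 1, 1, 0, 0],
})
for col in FEATURES[1:]:
    toy[col] = 0  # placeholder values; the toy rule only reads V1

def toy_strategy(features):
    return np.where(features["V1"] == 1, "Yes", "No")

irr, nir = valid_results(toy_strategy, toy)
print(irr, nir)
```

Because `valid_set` above is split from the one-hot encoded frame, the matching raw rows would be `train_data.loc[valid_set.index]` if this sketch were used with the notebook's data.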

In [None]:
# test irr and nir on our validation set
valid_results(promotion_strategy, valid_set)

In [None]:
# This will test your results, and provide you back some information 
# on how well your promotion_strategy will work in practice

test_results(promotion_strategy)