# Marketing Analytics EDA & XGBOOST Regression
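**Data processing**

- Null values and outliers
- Added fields for age, total spending, and each product category's share of spending

**Data visualization**

- Rough analysis, based on the plots, of the distribution of country, education, and marital status and their relationship with purchasing power
- Rough analysis, based on the plots, of the relationship between product categories and purchase channels and campaign responses

**XGBOOST regression**
- Pipeline encoding and modeling
- RandomSearch hyperparameter tuning
- Cross-validation
- R2-Score = 99.44%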

## Field descriptions
- **ID:** Customer's unique identifier
- **Year_Birth:** Customer's birth year
- **Education:** Customer's education level
- **Marital_Status:** Customer's marital status
- **Income:** Customer's yearly household income
- **Kidhome:** Number of children in customer's household
- **Teenhome:** Number of teenagers in customer's household
- **Dt_Customer:** Date of customer's enrollment with the company
- **Recency:** Number of days since customer's last purchase
- **MntWines:** Amount spent on wine in the last 2 years
- **MntFruits:** Amount spent on fruits in the last 2 years
- **MntMeatProducts:** Amount spent on meat in the last 2 years
- **MntFishProducts:** Amount spent on fish in the last 2 years
- **MntSweetProducts:** Amount spent on sweets in the last 2 years
- **MntGoldProds:** Amount spent on gold in the last 2 years
- **NumDealsPurchases:** Number of purchases made with a discount
- **NumWebPurchases:** Number of purchases made through the company's web site
- **NumCatalogPurchases:** Number of purchases made using a catalogue
- **NumStorePurchases:** Number of purchases made directly in stores
- **NumWebVisitsMonth:** Number of visits to company's web site in the last month
- **AcceptedCmp3:** 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- **AcceptedCmp4:** 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- **AcceptedCmp5:** 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- **AcceptedCmp1:** 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- **AcceptedCmp2:** 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- **Response:** 1 if customer accepted the offer in the last campaign, 0 otherwise
- **Complain:** 1 if customer complained in the last 2 years, 0 otherwise
- **Country:** Customer's location

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_palette('pastel')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Data cleaning

In [None]:
pd.set_option('display.max_columns',None)
pd.set_option('display.width',None)
df = pd.read_csv('/kaggle/input/marketing-data/marketing_data.csv')
df.head()

In [None]:
df.info()

The Income column contains 24 null values, and the column names are irregular (they contain stray spaces).

In [None]:
df.columns = df.columns.str.replace(' ','')
df.columns

In [None]:
df.loc[df.Income.isna()]

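The table above shows that the 24 null values in Income are randomly distributed. They make up about 1% of the sample, so these rows can simply be dropped.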
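As a minimal sketch of the check behind that decision, the snippet below counts the nulls and their share of rows before dropping them. The `toy` frame and its values are hypothetical stand-ins, not the marketing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the marketing data.
toy = pd.DataFrame({
    "ID": range(1, 11),
    "Income": [52000.0, np.nan, 61000.0, 48000.0, 75000.0,
               39000.0, 58000.0, 67000.0, 44000.0, 70000.0],
})

n_missing = toy["Income"].isna().sum()        # number of null incomes
share_missing = toy["Income"].isna().mean()   # fraction of rows affected

# Drop only the rows whose Income is null.
cleaned = toy.dropna(subset=["Income"])
print(n_missing, f"{share_missing:.1%}", len(cleaned))
```

Passing `subset=` limits the drop to rows with a null Income; the notebook's plain `dropna()` drops rows with a null in any column, which is equivalent only if Income is the sole column with nulls.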

In [None]:
df=df.dropna()
df.duplicated().sum()

In [None]:
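# regex=True makes '[$,]' a character class; since pandas 2.0 str.replace treats the pattern literally by default
df['Income'] = df['Income'].str.replace('[$,]','',regex=True).astype('float')
# infer_datetime_format is deprecated since pandas 2.0, where format inference is the default
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'],errors='raise')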

To make the analysis easier, add the 'Age', 'MntTotal', 'Year_Enroll', and 'Dependents' columns.

In [None]:
df['MntTotal'] = df.loc[:,'MntWines':'MntGoldProds'].sum(axis=1)
df['Age'] = 2021 - df['Year_Birth']
df['Year_Enroll'] = 2021 - df['Dt_Customer'].dt.year
df['Dependents'] = df['Kidhome'] + df['Teenhome']

In [None]:
df.head(10)

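Plot a histogram and a box plot for per-category spending, per-channel purchase counts, customer age, and income, and print the descriptive statistics for each.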

In [None]:
df_to_plot=df.loc[:,'MntWines':'NumWebVisitsMonth'].join(df.loc[:,'MntTotal':'Dependents']).join(df['Income'])
for col in df_to_plot.columns:
    fig,(ax0,ax1) = plt.subplots(1,2)
    fig.set_size_inches(16,4)
    sns.histplot(df_to_plot[col],ax=ax0,kde=True)
    sns.boxplot(x=df_to_plot[col],ax=ax1)  # pass x= explicitly; the first positional argument is `data` in recent seaborn
    ax0.axvline(x=np.mean(df_to_plot[col]),label='Avg',linestyle='--')
    ax0.legend()
    plt.show()
    print(df_to_plot[col].describe())

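- The spending data for every product category contains many outliers.
- The 'Age' column contains two outliers above 120, which is clearly impossible, so they are removed as anomalies.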
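The points the box plots draw as outliers lie more than 1.5 times the interquartile range beyond the quartiles (seaborn's default whisker rule). A minimal sketch of the same rule in code, using hypothetical ages rather than the real column:

```python
import pandas as pd

# Hypothetical ages; the last two mimic the implausible values seen in the data.
ages = pd.Series([25, 31, 38, 44, 47, 52, 58, 63, 121, 128], name="Age")

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Same 1.5 * IQR fences that the box plot whiskers use by default.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

The notebook uses a fixed domain cutoff (`Age <= 100`) instead, which is a reasonable choice when the impossible values are known in advance.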

In [None]:
df = df[df.Age <=100]
df.shape

Create three new tables.

In [None]:
purchase = df.loc[:,'NumDealsPurchases':'NumWebVisitsMonth']
products = df.loc[:,'MntWines':'MntGoldProds']
campaign = df.loc[:,'AcceptedCmp3':'AcceptedCmp2']

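Because the data are not continuously distributed, the Spearman correlation coefficient is the better choice for the correlation heatmap.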
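Spearman's coefficient is Pearson's coefficient computed on ranks, so it measures monotonic rather than linear association. A small sketch with hypothetical data (`y = x**3`) shows the difference:

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman sees a perfect monotonic relationship; Pearson is below 1.
print(round(pearson, 3), round(spearman, 3))
```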

In [None]:
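plt.figure(figsize=(16,12))
# compute the Spearman matrix once on numeric columns (numeric_only needs pandas >= 1.5) and mask from the same matrix
corr = df.corr(method='spearman',numeric_only=True)
sns.heatmap(corr,annot=True,mask=(corr**2<0.25))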

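The heatmap above suggests, roughly:
- Income correlates with spending and with the number of catalog purchases at coefficients between 0.79 and 0.85, a fairly strong positive correlation; income is somewhat negatively correlated with the number of web visits.
- The number of children at home correlates with total spending at -0.62 and with offline purchases at -0.6, which suggests that customers with children at home may buy less offline and less overall.
- The relationship between a single product category and purchase channel or campaign response should be analyzed with the influence of other variables removed; this is done later.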

## Data visualization

In [None]:
sns.catplot(data=df,x='Education',col='Country',col_wrap=2,kind='count',hue='Marital_Status',legend=True,height=5,aspect=1.4)

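The plots show:
- Most customers come from Spain (SP), followed by Canada (CA) and Saudi Arabia (SA); the US, Australia (AUS), Germany (GER), and India (IND) have a small number of customers; Montenegro (ME) has only 3.
- Most customers hold a bachelor's, doctoral, or master's degree (in descending order of count).
- Married, single, and in-a-relationship customers make up the bulk of the customer base, along with a small number of divorced and widowed customers.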

In [None]:
sns.catplot(data=df,x='Education',y='MntTotal',estimator=sum
            ,col='Country',col_wrap=2,kind='point',hue='Marital_Status',ci=None,legend=True,height=5,aspect=1.4)

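Total spending:
- Customers with a bachelor's degree have the highest total spending; PhD holders spend slightly more than master's degree holders.
- Married customers have the highest total spending; customers in a relationship spend slightly more than single customers.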

In [None]:
sns.catplot(data=df,x='Education',y='MntTotal'
            ,col='Country',col_wrap=2,kind='point',hue='Marital_Status',ci=None,legend=True,height=5,aspect=1.4)

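Average spending per customer:
- Widowed customers have relatively high average spending in every country, especially in Spain (the main market).
- In the Spanish market, customers with a master's degree have slightly higher average spending than customers with other education levels; among married customers, PhD holders have the strongest purchasing power.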
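The two point plots differ only in the estimator: `estimator=sum` gives group totals, while the default gives group means. A minimal sketch of why the rankings can differ, using a hypothetical five-row table:

```python
import pandas as pd

# Hypothetical customers: a large group with modest spending, small groups with high spending.
toy = pd.DataFrame({
    "Education": ["Graduation", "Graduation", "Graduation", "PhD", "Master"],
    "MntTotal": [100, 200, 300, 500, 400],
})

summary = toy.groupby("Education")["MntTotal"].agg(
    total="sum", average="mean", customers="count"
)
print(summary)
```

Here the largest group leads on total spending only because it has more customers, while a smaller group leads on the average.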

In [None]:
fig,(ax0,ax1) = plt.subplots(1,2)
fig.set_size_inches(16,12)
# DataFrame.sum() gives per-column totals; np.sum on a DataFrame is deprecated for this in recent pandas
ax0.pie(campaign.sum(),autopct='%1.1f%%',labels=['Campaign 3','Campaign 4','Campaign 5','Campaign 1','Campaign 2'])
ax0.set_title('Campaign accepted rate')
ax1.pie(products.sum(),autopct='%1.1f%%',labels=['Wines','Fruits','Meat','Fish','Sweet','Gold'])
ax1.set_title('Product amount rate')

- Campaign 2 has an extremely low acceptance rate.
- Sales are dominated by wine (50.3%) and meat (27.5%).
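Note that a pie of column sums shows each campaign's share of all acceptances, not the fraction of customers who accepted it. A short sketch of both quantities, using hypothetical 0/1 flags:

```python
import pandas as pd

# Hypothetical acceptance flags for five customers.
toy = pd.DataFrame({
    "AcceptedCmp1": [1, 0, 0, 0, 0],
    "AcceptedCmp2": [0, 0, 0, 0, 0],
    "AcceptedCmp3": [1, 1, 1, 0, 0],
})

# Fraction of customers who accepted each campaign (mean of a 0/1 column).
acceptance_rate = toy.mean()

# Each campaign's share of all acceptances (what the pie chart displays).
share_of_acceptances = toy.sum() / toy.sum().sum()
print(acceptance_rate, share_of_acceptances, sep="\n")
```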

In [None]:
countries = ['SP','CA','US','AUS','GER','IND','SA','ME']
campaign_labels = ['Campaign 3','Campaign 4','Campaign 5','Campaign 1','Campaign 2']
fig,ax = plt.subplots(2,4)
fig.set_size_inches(16,12)
# one pie per country, filled left to right, top to bottom
for a,country in zip(ax.flat,countries):
    accepted = df.query("Country==@country").loc[:,'AcceptedCmp3':'AcceptedCmp2'].sum()
    a.pie(accepted,autopct='%1.1f%%',labels=campaign_labels)
    a.set_title(country)

- The response to campaign 2 is very low; in the US and Australia nobody accepted it at all.

In [None]:
countries = ['SP','CA','US','AUS','GER','IND','SA','ME']
product_labels = ['Wines','Fruits','Meat','Fish','Sweet','Gold']
fig,ax = plt.subplots(2,4)
fig.set_size_inches(16,12)
# one pie per country, filled left to right, top to bottom
for a,country in zip(ax.flat,countries):
    spent = df.query("Country==@country").loc[:,'MntWines':'MntGoldProds'].sum()
    a.pie(spent,autopct='%1.1f%%',labels=product_labels)
    a.set_title(country)

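Combining each country's campaign acceptance with its product-category shares:
- India and the US have a slightly higher meat share than the other countries, at 30.5% and 29.9% respectively.
- Spain has the highest wine share, at 51%; Australia and Germany have a higher gold share than the other countries, at 8.3% and 7.9% respectively.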

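Because the table covers several product categories, each category's spending is normalized to its share of the customer's total spending before examining how a single category relates to purchase channels. This reduces the influence of the other variables to some extent.
- ***This is a customer-level table that aggregates everything each customer bought over two years; accurate, reliable conclusions would require analyzing order-level data.***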
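The normalization described here is a row-wise division of each category column by the row total. A compact sketch on a hypothetical two-customer table:

```python
import pandas as pd

# Hypothetical spending for two customers across three categories.
toy = pd.DataFrame({
    "MntWines": [50, 10],
    "MntMeatProducts": [30, 10],
    "MntFruits": [20, 0],
})

# Divide every column by the row total so each row sums to 1.
shares = toy.div(toy.sum(axis=1), axis=0)
print(shares)
```

A customer whose total is 0 would produce NaN shares, so such rows would need separate handling if they occur.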

In [None]:
df['WinesRate']=df['MntWines']/df['MntTotal']
df['FruitsRate']=df['MntFruits']/df['MntTotal']
df['MeatRate']=df['MntMeatProducts']/df['MntTotal']
df['FishRate']=df['MntFishProducts']/df['MntTotal']
df['SweetRate']=df['MntSweetProducts']/df['MntTotal']
df['GoldRate']=df['MntGoldProds']/df['MntTotal']
df.head()

In [None]:
sns.pairplot(data=df,x_vars=['WinesRate','FruitsRate','MeatRate','FishRate','SweetRate','GoldRate']
            ,y_vars=['NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases'],kind='reg'
            ,plot_kws=dict(x_jitter=0.1,y_jitter=0.1,scatter_kws=dict(s=10,alpha=0.2),line_kws=dict(color='orange')))

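- Customers with a higher wine share make more purchases in stores, on the website, and with discounts.
- Customers with a high meat share tend to prefer in-store and catalog purchases.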

In [None]:
sns.pairplot(data=df,x_vars=['WinesRate','FruitsRate','MeatRate','FishRate','SweetRate','GoldRate']
            ,y_vars=['AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5','Complain'],kind='reg'
            ,plot_kws=dict(x_jitter=0.1,y_jitter=0.1,logistic=True,ci=None
                           ,scatter_kws=dict(s=10,alpha=0.3)
                           ,line_kws=dict(color='orange')))

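- Customers with a higher wine share are more likely to accept campaign 4.
- Customers with a higher meat share are more likely to accept campaigns 1 and 5.
- Customers with a higher gold share are more likely to accept campaign 3.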

## XGBOOST regression

In [None]:
# preprocessing

df['Education'] = df['Education'].astype('category')
df['Marital_Status'] = df['Marital_Status'].astype('category')
df['AcceptedCmp1'] = df['AcceptedCmp1'].astype('category')
df['AcceptedCmp2'] = df['AcceptedCmp2'].astype('category')
df['AcceptedCmp3'] = df['AcceptedCmp3'].astype('category')
df['AcceptedCmp4'] = df['AcceptedCmp4'].astype('category')
df['AcceptedCmp5'] = df['AcceptedCmp5'].astype('category')
df['Response'] = df['Response'].astype('category')
df['Complain'] = df['Complain'].astype('category')
df['Country'] = df['Country'].astype('category')
df.info()

In [None]:
# feature & target split

X = df.loc[:,'Education':'Income'].join(df.loc[:,'Recency':'Country']).join(df.loc[:,'Age':'Dependents'])
y = df['MntTotal']
X.head()

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

# convert X to dict for DictVectorizer
X_dict = X.to_dict('records')
X_train, X_test, y_train, y_test = train_test_split(X_dict, y, test_size=0.2, random_state=2)

# encode categorical variables
steps = Pipeline([
                    ('encoder', DictVectorizer(sparse=False)),
                    ('xgbreg', xgb.XGBRegressor(seed=2))
                ])

# XGBOOST regressor hyperparameter tuning by Random Search
params = {
            'xgbreg__n_estimators' : np.arange(10,200,5),
            'xgbreg__learning_rate' : np.arange(0.1,1,0.02),
            'xgbreg__max_depth' : np.arange(3,20,1),
            'xgbreg__colsample_bytree' : np.arange(0.2,1,0.05)
         }

randomized_mse = RandomizedSearchCV(steps,cv=10,param_distributions=params,n_iter=2
                                    ,scoring='neg_mean_squared_error',verbose=1,random_state=2)

# fit model and print best hyperparameters
randomized_mse.fit(X_train, y_train)
print("Best estimator found: ", randomized_mse.best_estimator_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
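The cell above follows a common pattern: an encoder and a regressor in one pipeline, randomized search over `step__param` names, and RMSE recovered from the negated MSE score. The sketch below shows the same pattern on hypothetical synthetic data, with scikit-learn's `GradientBoostingRegressor` swapped in for `XGBRegressor` so it runs without xgboost; the data, step names, and parameter ranges are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical data: spending grows with income, plus noise.
rng = np.random.default_rng(2)
n = 120
income = rng.uniform(20_000, 90_000, size=n)
country = rng.choice(["SP", "CA", "US"], size=n)
y = 0.01 * income + rng.normal(0, 20, size=n)
records = [{"Income": float(i), "Country": str(c)} for i, c in zip(income, country)]

# DictVectorizer one-hot encodes string values and passes numbers through.
pipe = Pipeline([
    ("encoder", DictVectorizer(sparse=False)),
    ("reg", GradientBoostingRegressor(random_state=2)),
])

# Parameters of a pipeline step are addressed as "<step>__<param>".
param_distributions = {
    "reg__n_estimators": np.arange(20, 100, 20),
    "reg__learning_rate": [0.05, 0.1, 0.2],
    "reg__max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                            n_iter=3, cv=3, scoring="neg_mean_squared_error",
                            random_state=2)
search.fit(records, y)

# best_score_ is the mean negated MSE across folds, so negate before the root.
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmse)
```

With `n_iter=2`, as in the notebook, only two parameter combinations are ever tried, so a larger `n_iter` would explore the grid more thoroughly at the cost of runtime.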

In [None]:
print('MntTotal Mean: ',y.mean())
print('MntTotal Median: ',y.median())

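Relative to the mean and median above, the error is roughly 8% to 15%. That is a mediocre result, but it may be caused by outliers, so check the R2-Score as well.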
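A relative error of this kind can be computed by dividing the RMSE by the mean or the median of the target. A minimal sketch with hypothetical values (not the notebook's results):

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([100.0, 400.0, 900.0, 1500.0])
y_hat = np.array([130.0, 380.0, 950.0, 1440.0])

rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# RMSE expressed as a fraction of the typical target size.
rel_to_mean = rmse / y_true.mean()
rel_to_median = rmse / np.median(y_true)
print(f"{rmse:.2f} {rel_to_mean:.1%} {rel_to_median:.1%}")
```

For a right-skewed target the median is smaller than the mean, so the same RMSE looks worse relative to the median.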

In [None]:
# fine-tuned model

regpipeline = Pipeline([
                            ('encoder', DictVectorizer(sparse=False)),
                            ('xgbreg', xgb.XGBRegressor(base_score=0.5, booster='gbtree',
                                                      colsample_bylevel=1, colsample_bynode=1,
                                                      colsample_bytree=0.7999999999999998, gamma=0,
                                                      gpu_id=-1, importance_type='gain',
                                                      interaction_constraints='',
                                                      learning_rate=0.16000000000000003,
                                                      max_delta_step=0, max_depth=19,
                                                      min_child_weight=1, missing=np.nan,
                                                      monotone_constraints='()', n_estimators=80,
                                                      n_jobs=8, num_parallel_tree=1, random_state=2,
                                                      reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                                                      seed=2, subsample=1, tree_method='exact',
                                                      validate_parameters=1, verbosity=None))
                        ])

regpipeline.fit(X_train, y_train)
R2_score = regpipeline.score(X_test, y_test)
print('XGBOOST Model R2-Score: {:.2%}'.format(R2_score))

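The R2-Score reaches 99.44%. Note that X spans 'Recency':'Country' and therefore includes the six Mnt* columns whose sum is MntTotal, which likely explains much of this score. Next, look at how well the regression curve fits.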
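For reference, the score reported by `Pipeline.score` for a regressor is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. A small sketch of the formula on hypothetical values:

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([100.0, 400.0, 900.0, 1500.0])
y_hat = np.array([130.0, 380.0, 950.0, 1440.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean

r2 = 1.0 - ss_res / ss_tot
print(f"{r2:.2%}")
```

Because $SS_{tot}$ grows with the spread of the target, a wide-ranging target can yield a high R2 even when the absolute errors are not small.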

In [None]:
# plot Test - Predict comparison curve

y_pred = regpipeline.predict(X_test)
plt.plot(range(len(y_test)),sorted(y_test),c='black',label='Test')
plt.plot(range(len(y_pred)),sorted(y_pred),c='red',label='Predict')
plt.legend()
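One caveat when reading this plot: `y_test` and `y_pred` are sorted independently, so the curves compare the two distributions rather than each prediction with its own true value. The hypothetical example below shows curves that coincide exactly even though every paired prediction is wrong:

```python
import numpy as np

# Hypothetical values: each prediction belongs to a different row's true value.
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([300.0, 100.0, 200.0])

# Sorted independently, the two curves are identical.
curves_match = np.array_equal(np.sort(y_true), np.sort(y_hat))

# Paired row by row, the errors are large.
mean_abs_error = np.mean(np.abs(y_true - y_hat))
print(curves_match, mean_abs_error)
```

A scatter of `y_pred` against `y_test`, or sorting both arrays by the order of `y_test`, would keep the pairs aligned.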