# Member Repurchase Forecast
## Exploratory Data Analysis (EDA)

Using historical data from one group of members, predict which members in another group will repurchase products.
User data in the retail industry, including personal information and transaction records, can inform operations and marketing strategies.

## Initial setup

In [None]:
# Load the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os

pd.set_option('display.max_columns', None)
# raw holds the original data; processed holds the processed data
# (both currently point to the same directory)
raw_data_path = '../input/member-repurchase-forecast/'
processed_data_path = '../input/member-repurchase-forecast/'

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

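## 1. Data
### Data description  
Repurchase_train_baseline.csv - the original training set, containing member IDs and their corresponding features. It also includes the prediction target (target), which indicates whether the member repurchased.  
Repurchase_test_baseline.csv - the original test set.  
transactions.csv - member transaction records, including member ID, merchant ID, transaction amount, item, and other fields.  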
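Since the transaction records are keyed by member ID, one common way to relate them to the member-level tables is to aggregate them per member and join the result. The sketch below uses toy data; the column names `member_id`, `merchant_id`, and `amount` are hypothetical stand-ins, not the actual column names in the files.

```python
import pandas as pd

# Toy stand-ins for the member table and the transaction log.
# Column names (member_id, merchant_id, amount) are hypothetical.
members = pd.DataFrame({'member_id': ['a', 'b', 'c'], 'target': [1, 0, 0]})
toy_transactions = pd.DataFrame({
    'member_id': ['a', 'a', 'b'],
    'merchant_id': ['m1', 'm2', 'm1'],
    'amount': [10.0, 30.0, 5.0],
})

# One row per member: number of purchases, total spend, and mean spend.
member_stats = (
    toy_transactions.groupby('member_id')['amount']
    .agg(n_purchases='count', total_amount='sum', mean_amount='mean')
    .reset_index()
)

# A left join keeps members with no transactions; their stats become NaN.
merged = members.merge(member_stats, on='member_id', how='left')
print(merged)
```

Member `c` has no transactions in the toy data, so its aggregated columns are missing after the join and would need to be filled or handled before modeling.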

## 1.1 Load the data

In [None]:
df_train = pd.read_csv(os.path.join(raw_data_path, 'train.csv'), parse_dates=['first_active_month'])
df_test = pd.read_csv(os.path.join(raw_data_path, 'test.csv'), parse_dates=['first_active_month'])
transactions = pd.read_csv(os.path.join(raw_data_path, 'transactions.csv'))
transactions['purchase_date'] = pd.to_datetime(transactions['purchase_date'])

## 1.2 Data overview

### Training set

In [None]:
# Row and column counts, plus the first 5 samples
print('Train', df_train.shape)
df_train.head(5)

In [None]:
# Basic info
df_train.info()

In [None]:
# Descriptive statistics
df_train.describe()
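`info()` and `describe()` hint at missing values only indirectly, through non-null counts. A minimal sketch of an explicit missing-value summary follows; the toy frame and its values are made up and merely stand in for `df_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up.
toy = pd.DataFrame({
    'feature_1': [1, 2, np.nan, 4],
    'feature_4': [0.5, np.nan, np.nan, 2.0],
    'target': [0, 1, 0, 1],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'pct_missing': toy.isna().mean() * 100,
}).sort_values('n_missing', ascending=False)
print(missing)
```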

### Test set

In [None]:
#print('Test', df_test.shape)
#df_test.head()

# Basic info
#df_test.info()

# Descriptive statistics
#df_test.describe()

### Member transaction records

In [None]:
#print('Transactions', transactions.shape)
#transactions.head()

# Basic info
#transactions.info()

# Descriptive statistics
#transactions.describe()

In [None]:
transactions.columns

## 2. Data visualization & exploratory analysis

## 2.1 Variable distribution - bar chart

In [None]:
# Bar chart
col = 'target'
df_train.groupby([col])[col].count().plot(kind='bar', rot=0, figsize=[10, 6])

In [None]:
# Alternatively, use the approach below
df_train[col].value_counts().plot(kind = 'bar', rot = 0, figsize = [10,6])
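The two approaches yield the same counts but may order the bars differently: `groupby(...).count()` sorts by class label, while `value_counts()` sorts by frequency. The toy series below (made-up values) illustrates this, and also shows `normalize=True`, which returns class shares and is often handier for judging class imbalance.

```python
import pandas as pd

# Toy target column; the values are made up.
toy_target = pd.Series([1, 1, 0, 1, 1, 0, 1], name='target')

# groupby(...).count() orders the result by class label.
by_label = toy_target.groupby(toy_target).count()

# value_counts() orders the result by frequency, most common first.
by_freq = toy_target.value_counts()

# normalize=True returns class shares instead of raw counts.
shares = toy_target.value_counts(normalize=True)

print(by_label, by_freq, shares, sep='\n\n')
```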

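## 2.2 Numeric variable distribution - histogram

In [None]:
# Histogram with a KDE overlay
# (sns.distplot is deprecated; sns.histplot requires seaborn >= 0.11)
col = 'feature_4'
plt.figure(figsize=[10, 6])
sns.histplot(df_train[col], kde=True, stat='density')

col = 'feature_5'
plt.figure(figsize=[10, 6])
sns.histplot(df_train[col], kde=True, stat='density')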

## 2.3 Categorical variable distribution - bar chart

In [None]:
# Bar chart
col = 'feature_1'
df_train.groupby([col])[col].count().plot(kind='bar', rot=0, figsize=[10, 6])
plt.show()

col = 'feature_2'
df_train.groupby([col])[col].count().plot(kind='bar', rot=0, figsize=[10, 6])
plt.show()

col = 'feature_3'
df_train.groupby([col])[col].count().plot(kind='bar', rot=0, figsize=[10, 6])
plt.show()

## 2.4 Datetime variable distribution - bar chart
Most members registered in 2017, and registrations show an upward trend.

In [None]:
# Bar chart
col = 'first_active_month'
df_train.groupby([col])[col].count().plot(kind='bar', figsize=[20, 6], title='First active month count in train set')
plt.show()

col = 'first_active_month'
df_test.groupby([col])[col].count().plot(kind='bar', figsize=[20, 6], title='First active month count in test set')
plt.show()
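An observation about registration timing, such as the one at the top of this section, can be checked numerically as well as read off the bars. The sketch below computes the share of members per registration year on made-up dates; with the real data, `toy['first_active_month']` would be replaced by `df_train['first_active_month']`.

```python
import pandas as pd

# Toy first_active_month values; the dates are made up.
toy = pd.DataFrame({'first_active_month': pd.to_datetime(
    ['2016-11-01', '2017-03-01', '2017-03-01', '2017-09-01', '2018-01-01'])})

# Share of members per registration year, in chronological order.
year_share = (
    toy['first_active_month'].dt.year
    .value_counts(normalize=True)
    .sort_index()
)
print(year_share)
```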

## Multivariate

## 2.5 Distribution of the target variable across the categories of each categorical variable - bar chart

In [None]:
# Bar chart - target vs. feature_1, feature_2, feature_3
col = 'feature_1'
df_train.groupby([col])['target'].sum().plot(kind='bar', rot=0, figsize=[10, 6], title = 'target vs %s' %col)
plt.show()

col = 'feature_2'
df_train.groupby([col])['target'].mean().plot(kind='bar', rot=0, figsize=[10, 6])
plt.show()

col = 'feature_3'
df_train.groupby([col])['target'].sum().plot(kind='bar', rot=0, figsize=[10, 6])
plt.show()
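The cell above mixes `sum()` and `mean()`, and the two answer different questions: the sum grows with group size, while the mean is the repurchase rate within each group. A minimal sketch on made-up data, assuming `target` is coded 0/1, puts the group size, the sum, and the mean side by side.

```python
import pandas as pd

# Toy data (made up): category 1 is large, category 2 is small.
toy = pd.DataFrame({
    'feature_1': [1, 1, 1, 1, 1, 1, 2, 2],
    'target':    [1, 1, 0, 0, 0, 0, 1, 0],
})

summary = toy.groupby('feature_1')['target'].agg(
    n_members='count',       # group size
    n_repurchase='sum',      # number of repurchasers, assuming target is 0/1
    repurchase_rate='mean',  # share of repurchasers within the group
)
print(summary)
```

In this toy example category 1 has more repurchasers in absolute terms, yet category 2 has the higher repurchase rate, so a bar chart of sums and a bar chart of means would rank the categories differently.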

## 2.6 Distribution of numeric variables across target classes - grouped box plot

In [None]:
# Grouped box plot - target vs. feature_4, feature_5
col = 'feature_4'
plt.figure(figsize=[10, 6])
sns.boxplot(x='target', y=col, data=df_train)

col = 'feature_5'
plt.figure(figsize=[10, 6])
sns.boxplot(x='target', y=col, data=df_train)
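A box plot draws the quartiles of each group, so the same comparison can be tabulated with `groupby(...).describe()`. The sketch below uses made-up values in place of `df_train`.

```python
import pandas as pd

# Toy data; the values are made up.
toy = pd.DataFrame({
    'target':    [0, 0, 0, 1, 1, 1],
    'feature_4': [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Summary statistics per target class, including the quartiles
# that determine the box in a box plot.
stats = toy.groupby('target')['feature_4'].describe()
print(stats[['25%', '50%', '75%']])
```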

## 2.7 Linear correlation between variables - correlation heatmap

In [None]:
# Correlation heatmap
df_corr = df_train.loc[:, ['target', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']]
plt.figure(figsize=[10, 10])
sns.heatmap(df_corr.corr(), cmap='Blues', square=True)
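`DataFrame.corr()` computes the Pearson coefficient by default, which measures linear association only. The sketch below, on made-up data, contrasts it with the rank-based Spearman coefficient for a relationship that is monotonic but not linear. Note also that if `feature_1` to `feature_3` are category codes, as their treatment in section 2.3 suggests, a Pearson coefficient on those codes may not be meaningful.

```python
import pandas as pd

# Toy data: y increases monotonically with x, but not linearly.
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
toy['y'] = toy['x'] ** 3

pearson = toy.corr(method='pearson')    # linear association (the default)
spearman = toy.corr(method='spearman')  # rank-based, monotonic association

print(pearson.loc['x', 'y'], spearman.loc['x', 'y'])
```

Passing `annot=True` to `sns.heatmap` prints the coefficients inside the cells, which makes the heatmap easier to read alongside a table like this.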