# Note :<br>

- Major update (18-Jan-2021): <br>
    1) Added content to Section 02, "significant factors related to the number of store purchases." <br>
    2) Added highlight colors to dataframes for better visibility. <br>
    3) Added an explanation of KNN imputation.
- I'll keep updating small details that I missed. Once I think this notebook is complete, I will delete this note. <br>
- If I made any mistakes or you have any advice, please do not hesitate to leave a comment below. <br>
- Any comments are welcome :)
___

**Section 01: Exploratory Data Analysis**

- Are there any **null values or outliers**? How will you wrangle/handle them? <br>
-> There are null values and outliers in the 'Income' column. The null values are imputed with sklearn's KNNImputer.

- Are there any variables that warrant **transformations**? <br>
-> We can transform year columns into age columns.

- Are there any useful variables that you can engineer with the given data? <br>
-> Total_product_purchases, the conversion rate of campaign acceptance, and several others shown inside the notebook.

- Do you notice any **patterns or anomalies** in the data? Can you plot them? <br>
-> There are anomalies in customers' ages and an extremely high value in the income column. The product purchase, income, and channel purchase columns are also highly right-skewed.



**Section 02: Statistical Analysis**

- What factors are significantly related to the **number of store purchases?** <br>
-> MntWines

- Does **US** fare significantly better than the Rest of the World in terms of **total purchases?** <br>
-> *No, we can't conclude like that*

- Your supervisor insists that people who buy gold are more conservative. 
Therefore, people who spent an **above average amount on gold** in the last 2 years would have **more in store purchases**. Justify or refute this statement using an appropriate statistical test <br>
-> *Yes*
- Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do **"Married PhD candidates"** have a significant relation with **amount spent on fish**? What **other factors** are significantly related to amount spent on fish? <br>
-> *Not at all*

- Is there a significant relationship between **geographical region** and success of a **campaign**? <br>
-> *Yes, many campaigns are most successful in Spain*


**Section 03: Data Visualization**

This notebook will not explicitly have this section. However, throughout this notebook, I did quite a lot of visualization to answer the questions of all the sections.

- Which marketing **campaign** is most successful? <br>
-> *Campaign 4 has the greatest number of acceptances while campaign 2 has the least.*
- What does the average customer look like for this company? <br>
-> *Middle-aged, Spanish, has a degree, has a family, mostly purchases wine*
- Which products are performing best? <br>
-> *wine*
- Which channels are underperforming? <br>
-> *Deal(Purchase by discount deal)*


___

# Prepare the data

We're going to use these libraries.

Note: make sure the seaborn version is 0.11.0 or later, since `sns.histplot` was introduced in that release.

In [None]:
!pip install seaborn --upgrade

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import warnings
import re
import os
sns.set()
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
warnings.filterwarnings("ignore")

sns.__version__

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Let's import our data.

In [None]:
data = pd.read_csv('/kaggle/input/marketing-data/marketing_data.csv', index_col='ID', parse_dates=['Dt_Customer'])

data = data.rename(columns={
    'Dt_Customer':'Enrollment date',
    'Recency':'Days since last purchase',
    ' Income ':'Income'})

data['ID'] = data.index

data.head()

In [None]:
print('Number of columns :',data.shape[1])
print('Number of records :',data.shape[0])

# Data cleaning

Change the "Income" column format from "\\$84,835.00" (string) to 84835.00 (float) <br>
by replacing the comma (",") and "$" with an empty string ("").

In [None]:
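# Another way (display only); regex=True is needed because pandas >= 2.0 treats the pattern literally by default
data['Income'].str.replace(r'[$,]', '', regex=True)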

In [None]:
def extract(x):
    # Keep missing values as NaN; pd.isna is safer than an identity check against np.nan
    if pd.isna(x): return np.nan
    return float(re.sub(r'[$,]', "", str(x)))

data['Income'] = data['Income'].apply(extract)
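
The same cleaning step can also be written without a Python-level `apply`. The sketch below is only an illustration on a made-up three-row Series (the values and the name `raw_income` are assumptions, not taken from the dataset); it strips the symbols with a regex and lets `pd.to_numeric` do the conversion.

```python
import pandas as pd

# Toy stand-in for the raw "Income" column (values are made up for illustration)
raw_income = pd.Series(["$84,835.00 ", "$57,091.00 ", None])

# Remove "$" and "," with a regex, trim whitespace, then convert to float;
# missing entries stay NaN
clean_income = pd.to_numeric(
    raw_income.str.replace(r"[$,]", "", regex=True).str.strip()
)
print(clean_income.tolist())
```

Either approach should give the same column; the vectorized form is mostly a matter of taste on a dataset of this size.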

Create a table for each section of the data

In [None]:
# Store customer's information
Customers = data.loc[:,:'Days since last purchase'].join(data[['Country']])  

# Store product's information
Products = data.loc[:,'MntWines':'MntGoldProds']     

# Store Purchases' information
Purchases = data.loc[:,'NumDealsPurchases':'NumWebVisitsMonth']    

# Store campaign's information
Campaigns = data.loc[:,'AcceptedCmp3':'AcceptedCmp2']     
Misc = data.loc[:,['Response','Complain']]

Create a table for each type of data

In [None]:
category = data.select_dtypes(include='object')
numeric = data.select_dtypes(exclude='object')

---

# Section 01: Exploratory Data Analysis
- Are there any null values or outliers? How will you wrangle/handle them?
- Are there any variables that warrant transformations?
- Are there any useful variables that you can engineer with the given data?
- Do you notice any patterns or anomalies in the data? Can you plot them?

I'll do my best to answer the questions above. This first section is separated into 3 parts: <br>
    **1) Null values** : In this part, I'll try to find all the missing values and impute them. <br>
    **2) Outliers** : In this part, I'll plot all the distribution of numerical data in order to detect some outliers and try to make sense of them.<br> 
    **3) Analysis** in each feature : In this part, I'll try to seek some insights into the data by creating perceptive pivot tables, graphs, and other visualizations. 




## 1) Null values

In [None]:
pd.DataFrame(data.isnull().sum(), columns=['#Null values']).T

There are 24 detectable null values in the "Income" column. However, we still need to check for other missing values, since missing values are sometimes encoded as placeholders such as "Unknown" for categorical data or -1 for numerical data. <br>


In [None]:
for f in category.columns:
    print(category[f].value_counts())
    print('***********************************')

Inspecting the value counts of each categorical column -> there are no more missing values.
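
For larger tables, reading every `value_counts()` output by eye gets tedious. A minimal sketch of an automated check is shown below; the token list `PLACEHOLDERS`, the helper `count_placeholders`, and the toy frame are all hypothetical and would need adapting to the data at hand.

```python
import pandas as pd

# Hypothetical list of tokens that often stand in for missing values
PLACEHOLDERS = {"unknown", "n/a", "na", "none", "?", ""}

def count_placeholders(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder-like strings in every non-numeric column of `frame`."""
    text = frame.select_dtypes(exclude="number")
    normalized = text.apply(lambda col: col.astype(str).str.strip().str.lower())
    return normalized.isin(PLACEHOLDERS).sum()

# Made-up example data
toy = pd.DataFrame({
    "Education": ["PhD", "Unknown", "Master"],
    "Country": ["SP", "US", "n/a"],
    "Income": [1.0, 2.0, 3.0],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

A non-zero count only flags a column for a closer look; whether a token really means "missing" still has to be judged by hand.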

In [None]:
df = numeric.describe()

def custom_style(row):
    
    color = 'white'
    if row.name == 'min' or row.name == 'max':
        color = 'darkkhaki'

    return ['background-color: %s' % color]*len(row.values)

df.style.apply(custom_style, axis=1)

Inspecting each numerical column -> there are no suspicious placeholder values such as -1.


### We can conclude that there are missing values only in the "Income" column

In [None]:
'''
from sklearn.impute import KNNImputer

numeric_before_impute = numeric.drop(['Year_Birth','Enrollment date','ID'], axis=1).copy()

imputer = KNNImputer(missing_values=np.nan)
numeric_imputed = imputer.fit_transform(numeric_before_impute)

numeric_imputed = pd.DataFrame(numeric_imputed, 
                       index=numeric_before_impute.index, 
                       columns=numeric_before_impute.columns).join(numeric[['Year_Birth','Enrollment date','ID']])

'''

## 2) Outliers
Create a report() function to describe and visualize the numerical data.

In [None]:
def report(feature):
    fig, ax = plt.subplots(1,2)
    fig.set_size_inches(16,4)
    fig.suptitle(feature, fontsize=16)
    sns.histplot(data=numeric, x=feature, kde=True, ax=ax[0])
    sns.boxplot(data=numeric, x=feature, ax=ax[1])
    plt.show()

    print(numeric[feature].describe())

I'll create new columns named "Age" (derived from "Year_Birth") and "Enroll_at_age" (derived from "Enrollment date" and "Year_Birth").

In [None]:
from datetime import date

Age = date.today().year-numeric['Year_Birth']
numeric.insert(1, 'Age', Age,)

Enroll_at_age = numeric['Enrollment date'].dt.year - numeric['Year_Birth']
numeric.insert(6, 'Enroll_at_age', Enroll_at_age)

Then, print the distribution report for numerical columns.

In [None]:
for col in numeric.columns:
    if col in ['Year_Birth','Enrollment date']: continue
    if col == 'AcceptedCmp3' : break
    report(col)

After we get some sense of distribution of each numerical column, next, we'll analyze them. <br>
___


**-> Age** <br>
There are 3 people with ages 128, 122, and 121 whose "Enroll_at_age" values are 121, 114, and 113 respectively, which is practically impossible.

In [None]:
numeric[numeric['Age']>80]

According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_the_verified_oldest_people), no verified person has been alive at these ages in the 21st century. <br>
We can conclude that there were some mistakes in these records. So, I'll set their "Year_Birth", "Age", and "Enroll_at_age" to null.

In [None]:
temp = numeric[numeric['Age']>120].index
numeric.loc[temp, ["Year_Birth", "Age", "Enroll_at_age"]] = np.nan

**-> Income**

In [None]:
numeric[numeric['Income']>160000]

There is one person, with ID 9432, who has a very high income, which is not impossible. Moreover, considering the age and enrollment date, the record seems OK. So, I decided to leave this record unchanged.

**Amount of product and number of purchases features** are roughly exponentially distributed (heavily right-skewed), which is not unusual. <br>
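
The boxplots above flag points using the usual 1.5 × IQR whisker rule. As a rough sketch of how the same rule could be applied numerically, the helper below (`iqr_outliers` is a hypothetical name, and the income values are made up) returns the entries that fall outside the fences.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries outside [Q1 - k*IQR, Q3 + k*IQR] (the boxplot whisker rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Made-up incomes with one extreme value
toy_income = pd.Series([30000, 42000, 51000, 58000, 67000, 75000, 666666])
flagged = iqr_outliers(toy_income)
print(flagged)
```

On heavily right-skewed columns this rule will flag many legitimate large values, so it is better read as a screening aid than as a reason to delete records.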

## 3) Analysis
After looking into missing values and outliers, we analyze all of the features to get some insight into them.

Considering **Enrollment date**, we can see the total number of enrollments in each year in the following bar graph.

In [None]:
plt.bar(height = numeric['Enrollment date'].dt.year.value_counts()[[2012,2013,2014]], x=['2012','2013','2014'])
plt.title('Number of enrollment in each year')
plt.show()

However, the 2012 data starts in 2012/08 and the 2014 data ends in 2014/07. Thus, it makes sense that 2013 has the greatest number of enrollments. The graph below shows the rolling average (over 10 enrollment dates) of the number of enrollments over time.

In [None]:
df = pd.pivot_table(numeric, values='ID', index='Enrollment date', aggfunc='count')
df['count'] = df['ID'].rolling(10).mean()
df['Year'] = df.index.year.astype('category')

fig, ax = plt.subplots()
fig.set_size_inches(20,6)

sns.lineplot(data=df, x='Enrollment date', y='count', ax=ax, hue='Year')
ax.set_ylabel('Average enrollments')
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y - %m'))
plt.xticks(rotation=40)
plt.show()

## - Income <br>
**Mean and Median of Income** in **each country**

In [None]:
# This time, we need to exclude the outlier in the "Income" column before calculating any statistics.
numeric_analysis = numeric[numeric['Income']!=666666]

df = pd.pivot_table(numeric_analysis.join(category[['Country']]), 
                     values='Income', 
                     index='Country', 
                     aggfunc={'Income':['mean','median']})
df.plot(kind='bar')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.ylabel('Income')
plt.show()

We'll go deeper by looking into the relationship between **age** and **income** in **each country**. <br>
We noticed that the "ME" country has only 3 observations. We won't drop them. Instead, we'll keep that caveat in mind while doing the analysis.

In [None]:
temp = numeric_analysis[['Income','Age']].join(category[['Country']])
sns.lmplot(data=temp, y='Income', x='Age', col='Country', col_wrap=4, line_kws={'color': 'darkorange'}, scatter_kws={'color':'teal'})
plt.ylim(0,200000)
plt.show()

Next, **Income** in each **education level**.

In [None]:
df = pd.pivot_table(numeric_analysis.join(category[['Education']]), 
               values='Income', 
               index='Education', 
               aggfunc={'Income':['count','mean','median']})

df[['mean','median']].plot(kind='bar')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.ylabel('Income')
plt.xlabel('')
plt.title('Income in each education level')
plt.show()

df

In [None]:
del numeric_analysis

## - Purchase

I'll create a new column, **"Total products amount"**, defined as the total amount of products purchased by each customer.

In [None]:
numeric['Total products amount'] = np.sum(Products, axis=1)

We'll look into the **proportion** of the number of **purchases in each channel** ('Deal', 'Web', 'Catalog', 'Store') to see the performance of each channel. <br>
We see that 39% of all purchases are made in store, 27.5% on the web, 17.9% by catalog, and 15.6% through deals. From the data, we can conclude that more than half of all purchases were made in store or on the website.

In [None]:
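# Select the four purchase channels by name (excludes NumWebVisitsMonth), so re-running the cell later gives the same result
total_purchase_each = np.sum(Purchases.loc[:, 'NumDealsPurchases':'NumStorePurchases'], axis=0)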

percent_purchase_each = total_purchase_each/np.sum(total_purchase_each)*100

def plot_pie_chart(labels, sizes: pd.Series, title):
    fig, ax = plt.subplots()
    fig.suptitle(title, fontsize=16)
    fig.set_size_inches(5,5)
    ax.pie(sizes, labels=labels, autopct='%1.1f%%',
            shadow=True, startangle=90)
    ax.axis('equal')  
    plt.show()
    print(sizes.sort_values(ascending=False))
    

plot_pie_chart(labels=['Deal','Web','Catalog','Store'], sizes=percent_purchase_each, title='Purchases in each channel')

Next, we look into the total number of **purchases** in each **country**. <br>
We see that most of the purchases come from Spain.

In [None]:
if 'Total purchase' not in Purchases.columns:
    # Note: this row sum also includes NumWebVisitsMonth (web visits, not purchases)
    Purchases['Total purchase'] = np.sum(Purchases, axis=1)

Purchase_category = Purchases.join(category)

Purchase_country_summary = pd.pivot_table(Purchase_category, 
                                          values='Total purchase', 
                                          index='Country', 
                                          aggfunc={'Total purchase':['sum']})

Purchase_country_summary.plot(kind='bar')
plt.title('Total purchases in each country')
plt.show()

## - Product

We want to know which product is the most popular, so we'll look at the **overall proportion of products purchased**. <br>
We see that wine accounts for 50.2% of everything purchased by all customers, and meat comes second (27.6%).

In [None]:
sum_each_product = np.sum(Products, axis=0)

plot_pie_chart(sizes=sum_each_product/np.sum(sum_each_product)*100, 
               labels=['Wine','Fruit','Meat','Fish','Sweet','Gold'],
              title='Overall %Products')

Moreover, we want to know the average purchase behavior of each customer, so we'll look at the **average proportion of each product purchased by one ID**. <br>
Out of 100% of each person's product purchases, we see that on average 45.8% is wine, 24.95% meat, 12% gold, 7% fish, 5% sweets, and 4.9% fruit.

In [None]:
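if 'Total' not in Products.columns:
    Products['Total'] = np.sum(Products, axis=1)

# Share (%) of each product in every customer's total; the vectorized division
# replaces the positional lookup x[-1], which newer pandas versions deprecate
Each_ID_Products = Products.div(Products['Total'], axis=0)*100

Avg_Each_ID_Products = np.mean(Each_ID_Products, axis=0)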

In [None]:
plot_pie_chart(sizes = Avg_Each_ID_Products[:-1], 
               labels=['Wine','Fruit','Meat','Fish','Sweet','Gold'],
               title='Average Product for each ID')
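
The two pie charts answer different questions, which is why their percentages differ. A toy sketch with made-up numbers (two customers, two products) shows the distinction between pooling all spending first and averaging per-customer shares.

```python
import pandas as pd

# Toy spending table: one big wine buyer, one small fruit buyer (made-up numbers)
spend = pd.DataFrame({"Wine": [900, 10], "Fruit": [100, 90]})

# Overall share: pool everyone's spending, then take proportions
overall_share = spend.sum() / spend.sum().sum() * 100

# Average per-customer share: take proportions per row, then average them
per_customer_share = spend.div(spend.sum(axis=1), axis=0) * 100
average_share = per_customer_share.mean()

print(overall_share.round(1))
print(average_share.round(1))
```

Here wine makes up about 82.7% of pooled spending but only 50% of the average customer's basket, because the overall share is dominated by the heavy spender.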

## - Campaign <br>
We want to know the performance of each campaign we conducted. Below, the bar graph and pie chart show the number of acceptances and the success rate of each campaign. <br>
We see that campaign 2 might have some problems, because it was accepted far less often, while the other campaigns were accepted at similar rates.

In [None]:
each_campaign = np.sum(Campaigns,axis=0)

CR_each_canpaign = each_campaign/len(Campaigns)*100

In [None]:
fig, ax = plt.subplots(1,2)
fig.set_size_inches(15,4)

cam_color = ['steelblue','peru','olivedrab','teal','sienna']

ax[0].barh(y=['Campaign 3','Campaign 4','Campaign 5','Campaign 1','Campaign 2'], 
           width=each_campaign.values, color=cam_color)
ax[0].set_xlabel('# success')

ax[1].pie(x=CR_each_canpaign, labels=['Campaign 3','Campaign 4','Campaign 5','Campaign 1','Campaign 2'],
         autopct='%1.1f%%', shadow=True, startangle=90, colors=cam_color)
ax[1].axis('equal')
ax[1].set_xlabel('Overall each campaign\'s success rate')

plt.show()

We want to know whether 'Age' has a noticeable effect on campaign acceptance. <br>
Although there's no obvious trend, we can still see that campaign 3 tends to be accepted by younger customers than campaign 4, while the other campaigns are accepted fairly uniformly across ages.

In [None]:
fig, ax = plt.subplots(1,len(Campaigns.columns), sharey=True)
i=0
fig.set_size_inches(20,6)

for campaign in Campaigns.columns:
    sns.histplot(data=numeric[numeric[campaign]==1], x = 'Age', ax=ax[i])
    ax[i].set_ylim(0,40)
    ax[i].set_title(campaign)    
    i+=1
    
plt.show()

Next, we'll see whether acceptance differs between countries. The table below shows the **overall conversion rate** in **each country**. <br>

The overall conversion rate is **around 30%**. <br>
We see that *'ME'* has the best conversion rate, but there are only 3 observations in this country, so *it's not significant*. <br>
Other than 'ME', campaigns in **'SP' and 'CA'** get the **best conversion rate (32.4%)**. <br>
The worst rate is in 'AUS', at 21.8%.

In [None]:
Campaigns_category = Campaigns.join(category)
Campaigns_category['Total accept'] = np.sum(Campaigns, axis=1)

summary_country = pd.pivot_table(Campaigns_category, 
                                   values='Total accept', 
                                   index='Country', 
                                   aggfunc={'Total accept':['sum','count']})

summary_country['CR'] = summary_country['sum']/summary_country['count']
summary_country.rename(columns={'count':'#customers', 'sum':'Total accept'}).style.background_gradient(sns.light_palette('khaki', as_cmap=True), 
                                                                                                       subset=pd.IndexSlice[:, ['CR']])
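
Note that 'CR' above divides the total number of accepted offers by the number of customers, so a customer who accepted two campaigns counts twice. The sketch below, on a made-up five-row table (the column names `offers_per_customer` and `share_accepting_any` are hypothetical), contrasts that figure with the share of customers who accepted at least one campaign.

```python
import pandas as pd

# Made-up acceptance flags for five customers in two countries
toy = pd.DataFrame({
    "Country": ["SP", "SP", "SP", "CA", "CA"],
    "AcceptedCmp1": [1, 0, 0, 0, 0],
    "AcceptedCmp2": [1, 0, 0, 1, 0],
})
campaign_cols = ["AcceptedCmp1", "AcceptedCmp2"]
accepted = toy[campaign_cols].sum(axis=1)

summary = pd.DataFrame({
    # Same definition as 'CR': accepted offers divided by number of customers
    "offers_per_customer": accepted.groupby(toy["Country"]).mean(),
    # Alternative: fraction of customers who accepted at least one offer
    "share_accepting_any": (accepted > 0).groupby(toy["Country"]).mean(),
})
print(summary)
```

Both numbers are reasonable summaries; the second one is the closer match to the everyday meaning of "conversion rate".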

Next, we'll see the **average performance** of **each campaign** in **each country**. <br>
Keep in mind that *\"ME\" has only 3 observations*.

In [None]:
# Average only the numeric campaign columns; newer pandas raises an error when asked for the mean of text columns
temp = Campaigns_category.groupby(by='Country').mean(numeric_only=True)
temp.drop(['ME'], axis=0, inplace=True)  # since "ME" has only 3 observations, I decided to drop it in this table.
temp.style.background_gradient(sns.light_palette('green', as_cmap=True))

In each column, the darkest green shade one is the country that performed the best in each campaign. <br>
We see that:
- In campaign 3, "GER", "IND", "SP", "US" perform quite well.
- In campaign 4, "SP", "IND", "GER", "CA" perform quite well.
- In campaign 5, "AUS", "CA", "SP" perform quite well.
- In campaign 1, "SP", "CA" perform quite well.
- Campaign 2 did not perform well in any country.
- For the overall average performance, "CA", "GER", "SP" are the best.

# Section 02: Statistical Analysis

- What **factors** are significantly related to the **number of store purchases?** <br>
I'll answer this question by trying some feature selection methods, including L1 regularization, the ANOVA F-test, and recursive feature elimination.

**First**, we'll see the *linear* effect of numerical variables on "number of store purchases" by using statistical models such as regression and the F-test. <br>
But before that, we need to impute the missing values. <br>

I'll impute these null values with KNN imputation. KNN imputation fills in missing data by using a model to predict the missing values. A range of different models can be used, although a simple k-nearest neighbor (KNN) model has proven effective in experiments. Using a KNN model to predict or fill missing values is referred to as “Nearest Neighbor Imputation” or “KNN imputation.” <br>

[This link](https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/) provides a great introduction to KNN imputation. Make sure you check it out!
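
To make the idea concrete, here is a small sketch of `KNNImputer` on a made-up 4×3 array rather than on the notebook's data; `n_neighbors=2` is chosen only to keep the arithmetic easy to follow (the notebook itself uses the default setting).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows (distance is computed on the features both rows have)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The missing value in the first row becomes 4.0 (the mean of 3 and 5 from its two nearest rows), and the one in the third row becomes 5.5 (the mean of 3 and 8).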

In [None]:
# Building analysis dataset (exclude the extreme income outlier)
numeric_analysis = numeric[numeric['Income']!=666666]

# Let's clear redundant features
X_numeric = numeric_analysis.drop(['NumStorePurchases','Enrollment date','Year_Birth','ID','Total products amount'], axis=1)

# Focus on number of store purchases
y = numeric_analysis[['NumStorePurchases']]

In [None]:
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.feature_selection import f_regression, f_classif, chi2, RFE, VarianceThreshold, SelectPercentile
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.impute import KNNImputer

# 1) Preprocess
# Impute null values
Imputer = KNNImputer(missing_values=np.nan)

# Standardize numeric columns
Scaler = StandardScaler()

numeric_pipe = Pipeline([("Impute", Imputer),
                         ("Scale", Scaler)])

X_numeric_preprocessed = numeric_pipe.fit_transform(X_numeric)

# 2) Building the models
lasso = Lasso(alpha=0.01).fit(X_numeric_preprocessed, y)
ridge = Ridge().fit(X_numeric_preprocessed, y)
F, p = f_regression(X_numeric_preprocessed, y)  # Statistically check how each predictor & target are linearly correlated

The table below summarizes the models we have built. The LASSO coefficients and F-values indicate the importance of each variable. <br>
I also included each variable's variance divided by its mean (the "Variance" column) to show how spread out its values are.

In [None]:
index = [
    ['Features','Variance','Lasso','Ridge','F-test','F-test'],
    ['','','Coef','Coef','F-value','p-value']
]

StorePurchases_effect_num = pd.DataFrame(list(zip(X_numeric.columns,
                                          X_numeric.var()/np.nanmean(X_numeric, axis=0),
                                          lasso.coef_, 
                                          ridge.coef_[0], 
                                          F, p))
                                          ,columns=index).set_index('Features')

StorePurchases_effect_num.sort_values(by=('F-test','p-value')).style.background_gradient(sns.light_palette('khaki', as_cmap=True), 
                                                                                                       subset=pd.IndexSlice[:, [('Lasso','Coef'),
                                                                                                                                ('Ridge','Coef'),
                                                                                                                                ('F-test','F-value')]])

For simplicity, we plot the **coefficient of LASSO** and **F-value of F-test** in the bar graphs below.

In [None]:
plt.figure(figsize=(7,7))
StorePurchases_effect_num['Lasso']['Coef'].sort_values().plot(kind='barh')
plt.xlabel('Lasso coef')
plt.title('LASSO analysis')
plt.show()

Considering the Lasso coefficients, we see that
- The features with a positive effect on the number of store purchases are, in decreasing order, *'MntWines', 'NumDealsPurchases', 'NumWebPurchases', 'MntFruits', 'MntFishProducts'*.
- The features with a negative effect on the number of store purchases are, in decreasing order, *NumWebVisitsMonth, Kidhome, Response*.

In [None]:
plt.figure(figsize=(7,7))
StorePurchases_effect_num['F-test']['F-value'].sort_values().plot(kind='barh')
plt.xlabel('F-value')
plt.title('F-test')
plt.show()

Apart from the statistical methods (LASSO and F-test) that measure the linear effect, we'll also perform Recursive Feature Elimination, which is a feature selection method based on ML.

In [None]:
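from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()

# n_features_to_select=1 (an int) removes features one at a time, so ranking_ orders all of them;
# the float 1.0 means "keep 100% of the features" in scikit-learn >= 0.24 and would rank every feature 1
rfe = RFE(estimator = rf, n_features_to_select = 1, verbose=1).fit(X_numeric_preprocessed, y.values.ravel())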

In [None]:
pd.Series(dict(zip(X_numeric.columns, rfe.ranking_)), name='Rank').to_frame().sort_values(by='Rank')

We can conclude from many methods we performed before that **'MntWines' is the most significant factor related to the number of store purchases.**

Once we see the effect of our numerical variables on the number of store purchases, we also have to see the effect of categorical variables.<br>

In [None]:
category_NumStore = numeric[['NumStorePurchases']].join(category)
category_NumStore.head()

In [None]:
sns.catplot(kind='box', data=category_NumStore.query("Country!='ME'"), col='Country', x='Education', y='NumStorePurchases',  col_wrap=4)
plt.show()

We can see that the **'Basic' education level** seems to have **the lowest number of store purchases in all countries**.

On the other hand, there is no obvious trend in marital status in each country, as shown in the plot below. However, from the plot below, we might conclude that **widows in the US** are **less likely to purchase in-store**, while those in AUS, IND, and GER tend to purchase in-store.

In [None]:
sns.catplot(kind='bar', data=category_NumStore.query("(Country!='ME') and (Marital_Status not in ['YOLO','Alone','Absurd'])"), 
            col='Country', x='Marital_Status', y='NumStorePurchases',  col_wrap=4)
plt.show()

Lastly, we have to look at the number of samples in each education level, marital status, and country to see whether the sample sizes are unbalanced.

In [None]:
fig, ax = plt.subplots(1,3, sharey=True)
fig.set_size_inches(15,5)

for i,col in enumerate(['Education', 'Marital_Status', 'Country']):
    sns.countplot(data=data, x=col,  ax=ax[i])  # plot each categorical column, not 'Education' three times

for ax_i in ax:
    ax_i.xaxis.set_label_coords(0.5, 1.05)
    plt.setp( ax_i.xaxis.get_majorticklabels(), rotation=40 )

plt.show()

We can't confidently say that the **'Basic' education level** has **the lowest number of store purchases in all countries**, because we don't have enough data. These visualizations only give a rough indication; with more data, we could be more confident.

- Does **US** fare significantly better than the Rest of the World in terms of **total purchases?** <br>

I'll do a **Z-test** to check whether the mean of total purchases in the US is significantly higher than in each other country, stating the null and alternative hypotheses at a significance level of 0.05 as follows.<br>
For each other country $c$: <br>
$H_0 : \mu_{us} \leq \mu_{c}$ <br>
$H_a : \mu_{us} > \mu_{c}$ <br>
$\alpha = 0.05$
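
For reference, the statistic computed in the code cells of this section can be written out as follows, where $\bar{x}$, $s$, and $n$ denote the sample mean, sample standard deviation, and sample size, and $\Phi$ is the standard normal CDF. These symbols are introduced here for illustration only. Note that this form treats the other country's mean $\bar{x}_{c}$ as a fixed reference value and uses only the US standard error.

$$
z = \frac{\bar{x}_{us} - \bar{x}_{c}}{s_{us} / \sqrt{n_{us}}}, \qquad p = 1 - \Phi(z)
$$

$H_0$ is rejected for country $c$ when $p < \alpha$.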

Let's see the distribution of purchases in each country and then <br>
a table that summarizes the important statistics of purchases in each country.

In [None]:
temp = Purchases.join(data[['Country','ID']])

fig, ax = plt.subplots(2,4)
fig.set_size_inches(20,7)
i=0

import itertools 
axes = list(itertools.chain(ax[0],ax[1]))

for c in temp.Country.unique():
    sns.histplot(data=temp[temp['Country'] == c], x='Total purchase', label=c, ax=axes[i], bins=15)
    axes[i].legend()
    i+=1
plt.show()

In [None]:
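# Reuse the 'Total purchase' column built earlier; summing every column of Purchases again would count it twice
summary_purchase_country = Purchases['Total purchase'].to_frame('Total purchases').join(category[['Country']]).groupby(by='Country').agg(['count','sum','mean','median','std'])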

summary_purchase_country

In [None]:
from scipy.stats import norm

US_mean = summary_purchase_country.loc['US', ('Total purchases','mean')]
US_std = summary_purchase_country.loc['US', ('Total purchases','std')]
US_n = summary_purchase_country.loc['US', ('Total purchases','count')]

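for country in summary_purchase_country.index:
    if country == 'US': continue  # skip the self-comparison without ending the loop early
    other_mean = summary_purchase_country.loc[country, ('Total purchases','mean')]
    # One-sample z statistic: the other country's mean is treated as a fixed reference value
    x = (US_mean-other_mean)/(US_std/US_n**0.5)
    print('mean "US" > mean "'+country.upper()+'" p-value = '+str(norm.sf(x)))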
    

If we ignore "ME" because it has too few observations, we still cannot conclude that the US fares significantly better than the Rest of the World in terms of total purchases, since the p-value for $\mu_{us}>\mu_{ca}$ is 0.0554, which is not less than 0.05.
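
The test above uses only the US standard error. A common alternative is a two-sample z-test that also accounts for the sampling variability of the other country's mean. The sketch below is an illustration on simulated counts, not a re-analysis of the dataset; the function name `two_sample_z_greater` and the Poisson parameters are assumptions.

```python
import math
import numpy as np

def two_sample_z_greater(a, b):
    """One-sided p-value for H_a: mean(a) > mean(b), using both samples' variances."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Standard error of the difference in means (unequal variances allowed)
    se = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / se
    # Survival function of the standard normal: P(Z > z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Simulated purchase counts (made up for illustration)
rng = np.random.default_rng(0)
us = rng.poisson(21, size=100)
other = rng.poisson(20, size=300)
z, p = two_sample_z_greater(us, other)
print(z, p)
```

Because the denominator includes both variances, this version is generally more conservative than the one-sample form when the comparison group is small.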

- people who spent an above average amount on gold in the last 2 years would have more in store purchases.

From the histogram and boxplot below, we can conclude that people who spent an **above-average amount on gold** have more **in-store purchases**.

In [None]:
Gold_avg = np.mean(Products['MntGoldProds'])

mask = Products['MntGoldProds']>=Gold_avg

Above_gold = Products[mask].index
Below_gold = Products[~mask].index

fig,ax = plt.subplots(1,2)
fig.set_size_inches(15,5)
sns.histplot(ax=ax[0], data=Purchases[mask], x='NumStorePurchases', kde=True, label='Purchase Gold Above Avg.', color='indigo',element='step')
sns.histplot(ax=ax[0], data=Purchases[~mask], x='NumStorePurchases', kde=True, label='Below', color='darkorange', element='step')

temp = Purchases.join(mask)
temp['MntGoldProds'] = temp['MntGoldProds'].replace({True:'Gold above avg.', False:'Gold below avg.'})
sns.boxplot(ax=ax[1], data=temp, y='NumStorePurchases', x='MntGoldProds', palette={'Gold above avg.':'indigo','Gold below avg.':'darkorange'})

ax[0].legend()
plt.show()
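
The question asks for a statistical test, and the plots alone do not provide one. Since the purchase counts are skewed, one option is a one-sided permutation test on the difference in means, which makes no normality assumption. The sketch below uses made-up counts; the function name, the number of permutations, and the seed are all illustrative choices rather than part of the original analysis.

```python
import numpy as np

def permutation_test_greater(a, b, n_permutations=5000, seed=0):
    """One-sided permutation p-value for H_a: mean(a) > mean(b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels and recompute the difference in means
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        hits += int(diff >= observed)
    # Add-one correction keeps the estimated p-value strictly positive
    return (hits + 1) / (n_permutations + 1)

above = [8, 9, 10, 11, 12, 9, 10]   # made-up store purchases, gold above average
below = [3, 4, 5, 4, 3, 5, 4]       # made-up store purchases, gold below average
p_value = permutation_test_greater(above, below)
print(p_value)
```

In the notebook, the two groups would be `Purchases.loc[mask, 'NumStorePurchases']` and `Purchases.loc[~mask, 'NumStorePurchases']`; a p-value below 0.05 would support the supervisor's statement.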

- do **"Married PhD candidates"** have a significant relation with amount spent on **fish**? What other factors are significantly related to amount spent on fish?

In [None]:
PhD_Married = (category['Education']=='PhD') & (category['Marital_Status']=='Married')

numeric.loc[PhD_Married,['MntFishProducts']].describe().join(numeric.loc[~PhD_Married,['MntFishProducts']].describe(),
                                                            lsuffix='_PhD&Married',
                                                            rsuffix='_Not')

From the summary table above, "Married PhD candidates" do not appear to have a significant relation with the amount spent on fish. <br>

So, what is the key factor related to amount spent on fish? Let's first see the correlation between other numerical features and 'MntFishProducts'.

In [None]:
Customers.drop(['Year_Birth','Enrollment date'], axis=1)\
.join([Purchases, Products, Campaigns])\
.corr()[['MntFishProducts']]\
.style.background_gradient(sns.light_palette('green', as_cmap=True))

In the correlation table above, the darker the shade, the stronger the variable's correlation with 'MntFishProducts'.

Moreover, we need to see if there are some categorical variables that can be related to 'MntFishProducts'.

In [None]:
fig, ax = plt.subplots(1,3, sharey=True)
fig.set_size_inches(20,7)
sns.boxplot(data=data[data['Country']!='ME'], 
            x='Country', y='MntFishProducts',
            ax=ax[0])

sns.boxplot(data=data.query('Marital_Status not in  ["YOLO","Alone","Absurd"]'), 
            x='Marital_Status', y='MntFishProducts',
            ax=ax[1])

sns.boxplot(data=data, x='Education', y='MntFishProducts',ax=ax[2])

fig.tight_layout()
fig.suptitle('Relationship between each categorical variable\nand amount spent on Fish', fontsize=18, y=1.07)
plt.show()

From the **"Relationship between each categorical variable and amount spent on Fish"** graphs we see that:
- In the 'Country' variable, we see no difference between countries in 'MntFishProducts'.
- In the 'Marital_Status' variable, widows tend to have higher 'MntFishProducts'.
- In the 'Education' variable, 'Graduation' and '2n Cycle' tend to have higher 'MntFishProducts'.

Now that we know 'Marital_Status' and 'Education' have some relation to 'MntFishProducts', we'll go deeper to find more specific insights.

First, see 'Marital_Status' in each country.

In [None]:
sns.catplot(kind='box', data = data.query('(Marital_Status not in ["YOLO","Absurd","Alone"]) and (Country != "ME")'), 
            x ='Marital_Status', 
            y ='MntFishProducts', 
            col='Country', col_wrap=4)
plt.suptitle('In each country, \nAmount spent on Fish in each marital status', fontsize=18, y=1.07)

plt.show()

From the **"In each country, Amount spent on Fish in each marital status"** graphs we see that:
- **Widows** from **"GER" and "IND"** tend to spend considerably more on fish than others, while the **Divorced** from those countries show the opposite trend.

And then, see 'Education' in each country.

In [None]:
sns.catplot(kind='box', data = data.query('(Marital_Status not in ["YOLO","Absurd","Alone"]) and (Country != "ME")'), 
            x ='Education', 
            y ='MntFishProducts', 
            col='Country', col_wrap=4)
plt.suptitle('In each country, \nAmount spent on Fish in each education level', fontsize=18, y=1.07)

plt.show()

From the **"In each country, Amount spent on Fish in each education level"** graphs we see that:

- Only 'Graduation' and '2n Cycle' customers in 'SP', 'US', and 'SA' spend more on fish than others; this does not hold in the other countries.