# Introduction:

This notebook is a response to the dataset task 'Business Analysis with EDA & Statistics'. The task details are as follows:

> You're a marketing analyst and you've been told by the Chief Marketing Officer that recent marketing campaigns have not been as effective as they were expected to be. You need to analyze the data set to understand this problem and propose data-driven solutions.

This notebook will contain the following sections:
* Section 01: Exploratory Data Analysis
* Section 02: Statistical Analysis
* Section 03: Data Visualization
* Section 04: CMO Recommendations


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
from datetime import datetime, timedelta, date
import statsmodels.formula.api as smf 
import scipy
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
mkt = pd.read_csv("/kaggle/input/marketing-data/marketing_data.csv")

# Section 01: Exploratory Data Analysis

In this section, we will answer the following questions:
1. Are there any null values or outliers? How will you wrangle/handle them?
2. Are there any variables that warrant transformations?
3. Are there any useful variables that you can engineer with the given data?
4. Do you notice any patterns or anomalies in the data? Can you plot them?

**1. Are there any null values or outliers? How will you wrangle/handle them?**

We start by checking for missing data in the dataset

In [None]:
mkt.isna().sum()

There are 24 customers that are missing 'Income' data, and none of the other columns contain any missing data.

We will plot all the variables before wrangling the missing data and outliers, to ensure a holistic approach.

Before that, we must transform the variables "Income" and "Dt_Customer", because their raw formats (currency strings and date strings) are not suitable for statistical analysis. The following transformations will be done:
1. Income: We rename the column to remove leading & trailing spaces. We will also reformat the data values to remove dollar sign and commas
2. Dt_Customer: We convert the values to represent 'days since joining' by subtracting the date joined from today's date.

In [None]:
# Income
mkt = mkt.rename(columns={' Income ':'Income'})
mkt['Income'] = mkt['Income'].replace({r'\$': '', ',': ''}, regex=True)  # raw string avoids an invalid escape sequence
mkt['Income'] = pd.to_numeric(mkt['Income'])

# Dt_Customer
today = datetime.now()
mkt['Dt_Customer'] = pd.to_datetime(mkt['Dt_Customer'])
mkt['Dt_Customer'] = (today - mkt['Dt_Customer']).dt.days
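
The two transformations can be checked on a few made-up rows. The sketch below is illustrative only: the sample values, the `DaysSinceJoining` column name and the fixed `reference_date` are assumptions, chosen so that the result does not change from day to day as it does with `datetime.now()`.

```python
import pandas as pd

# Toy rows that mimic the raw formatting (values are made up for illustration)
sample = pd.DataFrame({
    ' Income ': ['$84,835.00 ', '$57,091.00 ', None],
    'Dt_Customer': ['6/16/14', '6/15/14', '5/13/14'],
})

# Strip the padded column name, then remove '$' and ',' before converting
sample = sample.rename(columns={' Income ': 'Income'})
sample['Income'] = pd.to_numeric(
    sample['Income'].str.replace(r'[$,]', '', regex=True).str.strip()
)

# Hypothetical fixed reference date (an assumption, not the notebook's choice)
reference_date = pd.Timestamp('2021-01-01')
joined = pd.to_datetime(sample['Dt_Customer'], format='%m/%d/%y')
sample['DaysSinceJoining'] = (reference_date - joined).dt.days

print(sample)
```

Fixing the reference date keeps the 'days since joining' values stable between runs, which may matter if results are compared across notebook versions.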

Then, we will plot each variable to visualize the distribution of the data and identify any outliers or imbalanced classes.

We will plot boxplots for quantitative variables and barplots (of value counts) for qualitative, categorical and binary (yes/no) variables

In [None]:
f, axs = plt.subplots(7,4,figsize=(15,30))
mkt['Year_Birth'].plot(kind='box', ax=axs[0,0])
axs[0,0].title.set_text('Year_Birth')
mkt['Education'].value_counts().plot(kind='bar',ax=axs[0,1])
axs[0,1].title.set_text('Education')
mkt['Marital_Status'].value_counts().plot(kind='bar',ax=axs[0,2])
axs[0,2].title.set_text('Marital_Status')
mkt['Income'].plot(kind='box', ax=axs[0,3])
axs[0,3].title.set_text('Income')
mkt['Kidhome'].value_counts().plot(kind='bar',ax=axs[1,0])
axs[1,0].title.set_text('Kidhome')
mkt['Teenhome'].value_counts().plot(kind='bar',ax=axs[1,1])
axs[1,1].title.set_text('Teenhome')
mkt['Dt_Customer'].plot(kind='box', ax=axs[1,2])
axs[1,2].title.set_text('Dt_Customer')
mkt['Recency'].plot(kind='box', ax=axs[1,3])
axs[1,3].title.set_text('Recency')
mkt['MntWines'].plot(kind='box', ax=axs[2,0])
axs[2,0].title.set_text('MntWines')
mkt['MntFruits'].plot(kind='box', ax=axs[2,1])
axs[2,1].title.set_text('MntFruits')
mkt['MntMeatProducts'].plot(kind='box', ax=axs[2,2])
axs[2,2].title.set_text('MntMeatProducts')
mkt['MntFishProducts'].plot(kind='box', ax=axs[2,3])
axs[2,3].title.set_text('MntFishProducts')
mkt['MntSweetProducts'].plot(kind='box', ax=axs[3,0])
axs[3,0].title.set_text('MntSweetProducts')
mkt['MntGoldProds'].plot(kind='box', ax=axs[3,1])
axs[3,1].title.set_text('MntGoldProds')
mkt['NumDealsPurchases'].plot(kind='box', ax=axs[3,2])
axs[3,2].title.set_text('NumDealsPurchases')
mkt['NumWebPurchases'].plot(kind='box', ax=axs[3,3])
axs[3,3].title.set_text('NumWebPurchases')
mkt['NumCatalogPurchases'].plot(kind='box', ax=axs[4,0])
axs[4,0].title.set_text('NumCatalogPurchases')
mkt['NumStorePurchases'].plot(kind='box', ax=axs[4,1])
axs[4,1].title.set_text('NumStorePurchases')
mkt['NumWebVisitsMonth'].plot(kind='box', ax=axs[4,2])
axs[4,2].title.set_text('NumWebVisitsMonth')
mkt['AcceptedCmp3'].value_counts().plot(kind='bar',ax=axs[4,3])
axs[4,3].title.set_text('AcceptedCmp3')
mkt['AcceptedCmp4'].value_counts().plot(kind='bar',ax=axs[5,0])
axs[5,0].title.set_text('AcceptedCmp4')
mkt['AcceptedCmp5'].value_counts().plot(kind='bar',ax=axs[5,1])
axs[5,1].title.set_text('AcceptedCmp5')
mkt['AcceptedCmp1'].value_counts().plot(kind='bar',ax=axs[5,2])
axs[5,2].title.set_text('AcceptedCmp1')
mkt['AcceptedCmp2'].value_counts().plot(kind='bar',ax=axs[5,3])
axs[5,3].title.set_text('AcceptedCmp2')
mkt['Response'].value_counts().plot(kind='bar',ax=axs[6,0])
axs[6,0].title.set_text('Response')
mkt['Complain'].value_counts().plot(kind='bar',ax=axs[6,1])
axs[6,1].title.set_text('Complain')
mkt['Country'].value_counts().plot(kind='bar',ax=axs[6,2])
axs[6,2].title.set_text('Country')

f.delaxes(axs[6][3])
f.tight_layout()
plt.show()

From the variable plots, we identify the following regarding the variables:
* Year_Birth: Some significant outliers are present at the lower end. These datapoints are likely erroneous, as these people would be well over 100 years old
* Marital_Status: There are 3 categories 'Alone', 'YOLO' and 'Absurd' which are not valid marital statuses, each with very few customers
* Income: There are some significant outliers on the upper bound, with one outlier > 600,000
* Kidhome & Teenhome: Very few customers have 2 kids or 2 teenagers at home, so the classes are very imbalanced. It may make more sense to convert these into binary variables indicating the presence or absence of a kid / teen
* Amount Spent on Categories: The outliers for these categories are all acceptable & valid values of purchases, and shall be kept as is
* Number of Touchpoint Visits: The outliers for these categories are all acceptable & valid values, and shall be kept as is
* Campaign Responses: It can be observed that campaigns 1, 3, 4 and 5 performed similarly, whereas campaign 2 performed poorly. The latest campaign (under variable 'Response') had the best performance
* Complain: Very few customers have filed any complaints
* Country: Most of the customers are from Spain, and there is a very small group of customers from Mexico

With this information, we can handle the missing data discovered above.

Since there is also a far outlier for 'Income' of > 600,000, we will impute the missing data as well as this outlier with the median value of 'Income', which is not affected by outliers in the dataset

In [None]:
mkt.loc[mkt.Income.isna(),'Income'] = mkt.Income.median()
mkt.loc[mkt.Income > 600000,'Income'] = mkt.Income.median()
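
The reason for choosing the median over the mean can be seen on a handful of made-up incomes. This is a minimal sketch on toy values, not the notebook's data; it also computes the median once, before any value is overwritten.

```python
import numpy as np
import pandas as pd

# Made-up incomes: one missing value and one extreme outlier
income = pd.Series([30000, 45000, 52000, 61000, 666666, np.nan])

# Compute both statistics before overwriting anything (NaN is skipped by default)
median_income = income.median()
mean_income = income.mean()

# Replace the missing value and the outlier with the median
cleaned = income.copy()
cleaned[cleaned.isna() | (cleaned > 600000)] = median_income

print(median_income, round(mean_income))
```

The single outlier pulls the mean far above every typical value, while the median stays at the middle observation.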

Next, we will handle the outliers discussed in the previous section by either rectifying, removing or retaining them. We will handle the following outliers:

* Year_Birth: Significant outliers at the lower end to be imputed with the mean (calculated excluding these values)
* Marital_Status: Customer data in the 3 categories 'Alone', 'YOLO' and 'Absurd' to be removed
* Kidhome & Teenhome: New features to be created, converting these variables into binary variables


In [None]:
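# Marital_Status (copy so that later assignments modify an independent dataframe)
mkt = mkt[~mkt.Marital_Status.isin(['Alone','YOLO','Absurd'])].copy()

# Year_Birth: replace implausible years with the mean of the remaining values
mkt.loc[mkt.Year_Birth < 1920,'Year_Birth'] = round(mkt[mkt.Year_Birth >= 1920].Year_Birth.mean())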

# Kidhome
mkt['HasKid'] = np.where(mkt['Kidhome'] > 0, 1, 0)

# Teenhome
mkt['HasTeen'] = np.where(mkt['Teenhome'] > 0, 1, 0)

We will plot the variables again to visualize the changes that have been made. We will also plot the 2 new variables 'HasKid' and 'HasTeen'

In [None]:
f, axs = plt.subplots(8,4,figsize=(15,30))
mkt['Year_Birth'].plot(kind='box', ax=axs[0,0])
axs[0,0].title.set_text('Year_Birth')
mkt['Education'].value_counts().plot(kind='bar',ax=axs[0,1])
axs[0,1].title.set_text('Education')
mkt['Marital_Status'].value_counts().plot(kind='bar',ax=axs[0,2])
axs[0,2].title.set_text('Marital_Status')
mkt['Income'].plot(kind='box', ax=axs[0,3])
axs[0,3].title.set_text('Income')
mkt['Kidhome'].value_counts().plot(kind='bar',ax=axs[1,0])
axs[1,0].title.set_text('Kidhome')
mkt['Teenhome'].value_counts().plot(kind='bar',ax=axs[1,1])
axs[1,1].title.set_text('Teenhome')
mkt['Dt_Customer'].plot(kind='box', ax=axs[1,2])
axs[1,2].title.set_text('Dt_Customer')
mkt['Recency'].plot(kind='box', ax=axs[1,3])
axs[1,3].title.set_text('Recency')
mkt['MntWines'].plot(kind='box', ax=axs[2,0])
axs[2,0].title.set_text('MntWines')
mkt['MntFruits'].plot(kind='box', ax=axs[2,1])
axs[2,1].title.set_text('MntFruits')
mkt['MntMeatProducts'].plot(kind='box', ax=axs[2,2])
axs[2,2].title.set_text('MntMeatProducts')
mkt['MntFishProducts'].plot(kind='box', ax=axs[2,3])
axs[2,3].title.set_text('MntFishProducts')
mkt['MntSweetProducts'].plot(kind='box', ax=axs[3,0])
axs[3,0].title.set_text('MntSweetProducts')
mkt['MntGoldProds'].plot(kind='box', ax=axs[3,1])
axs[3,1].title.set_text('MntGoldProds')
mkt['NumDealsPurchases'].plot(kind='box', ax=axs[3,2])
axs[3,2].title.set_text('NumDealsPurchases')
mkt['NumWebPurchases'].plot(kind='box', ax=axs[3,3])
axs[3,3].title.set_text('NumWebPurchases')
mkt['NumCatalogPurchases'].plot(kind='box', ax=axs[4,0])
axs[4,0].title.set_text('NumCatalogPurchases')
mkt['NumStorePurchases'].plot(kind='box', ax=axs[4,1])
axs[4,1].title.set_text('NumStorePurchases')
mkt['NumWebVisitsMonth'].plot(kind='box', ax=axs[4,2])
axs[4,2].title.set_text('NumWebVisitsMonth')
mkt['AcceptedCmp3'].value_counts().plot(kind='bar',ax=axs[4,3])
axs[4,3].title.set_text('AcceptedCmp3')
mkt['AcceptedCmp4'].value_counts().plot(kind='bar',ax=axs[5,0])
axs[5,0].title.set_text('AcceptedCmp4')
mkt['AcceptedCmp5'].value_counts().plot(kind='bar',ax=axs[5,1])
axs[5,1].title.set_text('AcceptedCmp5')
mkt['AcceptedCmp1'].value_counts().plot(kind='bar',ax=axs[5,2])
axs[5,2].title.set_text('AcceptedCmp1')
mkt['AcceptedCmp2'].value_counts().plot(kind='bar',ax=axs[5,3])
axs[5,3].title.set_text('AcceptedCmp2')
mkt['Response'].value_counts().plot(kind='bar',ax=axs[6,0])
axs[6,0].title.set_text('Response')
mkt['Complain'].value_counts().plot(kind='bar',ax=axs[6,1])
axs[6,1].title.set_text('Complain')
mkt['Country'].value_counts().plot(kind='bar',ax=axs[6,2])
axs[6,2].title.set_text('Country')
mkt['HasKid'].value_counts().plot(kind='bar',ax=axs[6,3])
axs[6,3].title.set_text('HasKid')
mkt['HasTeen'].value_counts().plot(kind='bar',ax=axs[7,0])
axs[7,0].title.set_text('HasTeen')

f.delaxes(axs[7][1])
f.delaxes(axs[7][2])
f.delaxes(axs[7][3])
f.tight_layout()
plt.show()

From this, it can be seen that we have successfully removed the extreme outliers and some of the sparse classes mentioned above.

**2. Are there any variables that warrant transformations?**

Transformations have already been done above for columns 'Dt_Customer' and 'Income' to ensure they are usable for statistical analysis. The following transformations were done:

* Income: Column renamed to remove leading & trailing spaces. We also reformatted the data values to remove dollar sign and commas
* Dt_Customer: Values converted to represent 'days since joining' by subtracting the date joined from today's date.

**3. Are there any useful variables that you can engineer with the given data?**

In the previous section, we have already engineered new features 'HasKid' and 'HasTeen'

Additionally, we will perform the following feature engineering steps to prepare the data for statistical analysis:
1. Create dummy variables for each qualitative variable so that they can be used in statistical modelling.
2. Perform min-max scaling on all quantitative variables so that each ranges from 0 to 1

The scaled variables will be stored in a separate dataframe, so that the original values remain easily accessible

In [None]:
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
mkt_dummy = pd.get_dummies(mkt, columns=['Education', 'Marital_Status', 'Kidhome', 'Teenhome', 
                'Country'], drop_first=True, prefix=['Education', 'Marital_Status', 
                'Kidhome', 'Teenhome', 'Country'], prefix_sep='_', dtype=int)

We then scale the variables and store the values in a new dataframe 'mkt_scale'

In [None]:
def norm_func(i):
    # Min-max scaling: maps each column to the range [0, 1]
    x = (i - i.min()) / (i.max() - i.min())
    return x

mkt_scale = norm_func(mkt_dummy)
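
Min-max scaling computes (x - min) / (max - min) per column. The sketch below shows the same idea on a toy dataframe and adds one guard that is an assumption on my part, not something the notebook does: a constant column would otherwise produce 0/0. The helper name `min_max_scale` and the toy columns are hypothetical.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every numeric column to [0, 1]; constant columns become 0."""
    numeric = df.select_dtypes(include='number').astype(float)
    span = numeric.max() - numeric.min()
    # Avoid 0/0 for constant columns by dividing by 1 instead
    span = span.replace(0, 1)
    return (numeric - numeric.min()) / span

# Toy data (made up); 'Flag' is constant on purpose
toy = pd.DataFrame({'Income': [20000, 50000, 80000],
                    'Recency': [0, 30, 99],
                    'Flag': [1, 1, 1]})
scaled = min_max_scale(toy)
print(scaled)
```

Because every scaled column shares the same 0 to 1 range, regression coefficients fitted on the scaled data can be compared with each other more directly.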

**4. Do you notice any patterns or anomalies in the data? Can you plot them?**

There are some anomalies in the data that were pointed out while handling outliers, including:
* Year_Birth: Some significant outliers are present on the lower spectrum. These datapoints are likely false as these people would be well over 100 years old
* Marital_Status: There are 3 categories 'Alone', 'YOLO' and 'Absurd' which are not valid marital statuses, with very low number of customers in each
* Country: There are only 3 entries for customers from Mexico, and this may skew results related to this variable

The plots have already been included in the outlier analysis section, and will not be presented again here


# Section 02: Statistical Analysis

In this section, we perform statistical analysis to answer the following questions. For each question, we will provide a brief explanation of the findings, suitable for non-technical users:

1. What factors are significantly related to the number of store purchases?
2. Does US fare significantly better than the Rest of the World in terms of total purchases?
3. Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test
4. Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)
5. Is there a significant relationship between geographical region and success of a campaign?

**1. What factors are significantly related to the number of store purchases?**

For this, we will build a multiple linear regression model to predict the number of store purchases. The model is fit on the min-max scaled variables, so the coefficients are comparable across predictors. Based on the coefficients of the model, we will be able to plot & determine the factors most strongly related (positively and negatively) to the number of store purchases.

In [None]:
ml1 = smf.ols('NumStorePurchases ~ Year_Birth+Income+Dt_Customer+Recency+MntWines+MntFruits+MntMeatProducts+\
              MntFishProducts+MntSweetProducts+MntGoldProds+NumDealsPurchases+NumWebPurchases+NumCatalogPurchases\
              +NumWebVisitsMonth+AcceptedCmp3+AcceptedCmp4+AcceptedCmp5+AcceptedCmp1+AcceptedCmp2+Response+\
              Complain+Education_Basic+Education_Graduation+Education_Master+Education_PhD+\
              Marital_Status_Married+Marital_Status_Single+Marital_Status_Together+Marital_Status_Widow+\
              Kidhome_1+Kidhome_2+Teenhome_1+Teenhome_2+Country_CA+Country_GER+Country_IND+Country_ME+Country_SA\
              +Country_SP+Country_US', data=mkt_scale).fit()

ml1.summary()

We will plot the parameter coefficients to visualize the effect of each parameter on the variable 'NumStorePurchases'

In [None]:
params = ml1.params.sort_values(ascending=False)
plt.figure(figsize=(15, 4)) 
plt.bar(params.index, params)
plt.xticks(rotation=90)
plt.title('Variable Scores')
plt.xlabel('Variable')
plt.ylabel('Coefficient')
plt.show()

We will arbitrarily use -0.1 and 0.1 as the cutoff for coefficients, in order to narrow down the important factors

In [None]:
# Series.append was removed in pandas 2.0, so pd.concat is used instead
param_narrow = pd.concat([params[params>0.1], params[params<-0.1]])
param_narrow = param_narrow[~(param_narrow.index=='Intercept')]
plt.figure(figsize=(8, 4)) 
plt.bar(param_narrow.index, param_narrow)
plt.xticks(rotation=90)
plt.title('Variable Scores')
plt.xlabel('Variable')
plt.ylabel('Coefficient')
plt.show()

From the chart, we conclude that the 7 factors above are most significantly related to number of store purchases. Among the 7 factors, 
* Positive Effect: 'MntWines', 'NumWebPurchases', 'NumDealsPurchases', 'Income' and 'MntFruits' have a positive effect on the number of store purchases in decreasing order. This means that as the values of these variables increase, the number of store purchases also increases, with 'MntWines' having the largest effect on number of store purchases
* Negative Effect: On the other hand, 'NumCatalogPurchases' and 'NumWebVisitsMonth' have a negative effect on the number of store purchases in decreasing order. This means that as the values of these variables increase, the number of store purchases decreases, with 'NumWebVisitsMonth' having the largest negative effect on number of store purchases

Overall, this suggests that the selection of wines and fruits, as well as attractive deals, draws customers to make store purchases. Customers with higher income also tend to make more store purchases.

On the other hand, customers who tend to make catalog purchases are less likely to make purchases in the store.

However, there are 2 contrasting factors, 'NumWebPurchases' and 'NumWebVisitsMonth' which are both website related but have opposing effects on the number of store purchases. This is a factor that should be further studied by the CMO and marketing team, and it is worth noting that 'NumWebVisitsMonth' only captures the past month's data, whereas 'NumWebPurchases' and 'NumStorePurchases' both seemingly capture the lifetime data of the customer. 

Thus, one hypothesis is that the store has recently been promoting its website through advertisements and social media, driving customers to switch from store purchases to website purchases. Customers who have historically made many website purchases, on the other hand, are loyal customers who tend to make in-store purchases as well.
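
A large coefficient is not the same as a statistically significant one, so the p-values of the fitted model are a useful complementary check. The sketch below runs on synthetic data (the variables `x1`, `x2` and `y` are made up); with the notebook's model, the same `params` and `pvalues` attributes would be read from the fitted results object.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
# y depends strongly on x1 only; x2 is pure noise by construction
toy['y'] = 3 * toy['x1'] + rng.normal(size=n)

fit = smf.ols('y ~ x1 + x2', data=toy).fit()

# Put effect size and statistical evidence side by side
table = pd.DataFrame({'coef': fit.params, 'p_value': fit.pvalues})
table = table.drop(index='Intercept')

# Keep only the predictors whose p-value is below the 0.05 level
significant = table[table['p_value'] < 0.05]
print(significant)
```

Filtering on the p-value first and ranking by coefficient afterwards avoids treating a large but noisy coefficient as an important factor.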

**2. Does US fare significantly better than the Rest of the World in terms of total purchases?**

For this, we will tabulate the total purchases of each country. Then, we will visualize this information for US as compared to the rest of the world.

If necessary, we will perform hypothesis testing to determine if it is statistically significant that US fares better than the rest of the world in terms of total purchases

In [None]:
total_purchase = mkt.loc[:,['NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','Country']]
total_purchase = total_purchase.groupby('Country').sum().sum(axis=1)

total_purchase = total_purchase.sort_values(ascending=False)
plt.figure(figsize=(8, 4)) 
plt.bar(total_purchase.index, total_purchase)
plt.bar('US',total_purchase['US'])
plt.xticks(rotation=90)
plt.title('Total Purchases Across Countries')
plt.xlabel('Country')
plt.ylabel('Total Purchases')
plt.show()

From this, we can clearly see that the US is second to last in terms of total purchases, only ahead of Mexico. Thus, the US definitely does not fare better than the rest of the world in terms of total purchases.

Let's check whether the US fares better in terms of average purchases per customer

In [None]:
# Count customers per country; dividing two Series aligns them by country label
number_cust = mkt['Country'].value_counts()

ave_purchase = (total_purchase / number_cust).sort_values(ascending=False)

plt.figure(figsize=(8, 4)) 
plt.bar(ave_purchase.index, ave_purchase)
plt.bar('US',ave_purchase['US'])
plt.xticks(rotation=90)
plt.title('Average Purchases Across Countries')
plt.xlabel('Country')
plt.ylabel('Average Purchases')
plt.show()


From this, we can also see that the US does not fare better than the rest of the world: it is well behind Mexico in terms of average purchases per customer, and very close to all the other countries.
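
When computing per-customer averages, it is safest to divide totals by counts that are aligned on the country label rather than on position. A minimal sketch on made-up rows (the values and the two-column subset are assumptions for illustration):

```python
import pandas as pd

# Toy customers (made up): three from SP, two from US, one from ME
toy = pd.DataFrame({
    'Country': ['SP', 'SP', 'SP', 'US', 'US', 'ME'],
    'NumWebPurchases': [4, 2, 6, 3, 5, 9],
    'NumStorePurchases': [6, 4, 8, 5, 3, 11],
})

purchase_cols = ['NumWebPurchases', 'NumStorePurchases']
per_customer = toy[purchase_cols].sum(axis=1)

# Totals and customer counts share the country labels, so division aligns by label
totals = per_customer.groupby(toy['Country']).sum()
counts = toy['Country'].value_counts()
average = (totals / counts).sort_values(ascending=False)
print(average)
```

The same result can be obtained in one step with `per_customer.groupby(toy['Country']).mean()`. A country with a single customer, like 'ME' here, shows how a very small group can dominate an average.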

Next, we check whether the US fares better in terms of total purchase amounts. For this, we will total the values in the 6 columns 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts' and 'MntGoldProds'.

In [None]:
total_value = mkt.loc[:,['MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','Country']]
total_value = total_value.groupby('Country').sum().sum(axis=1)
total_value = total_value.sort_values(ascending=False)

plt.figure(figsize=(8, 4)) 
plt.bar(total_value.index, total_value)
plt.bar('US',total_value['US'])
plt.xticks(rotation=90)
plt.title('Total Purchase Value Across Countries')
plt.xlabel('Country')
plt.ylabel('Total Purchase Value')
plt.show()

Similar to total purchases, the US is also second to last in terms of total purchase value, only ahead of Mexico. Thus, the US definitely does not fare better than the rest of the world in terms of total purchase value.

Let's check whether the US fares better in terms of average purchase value per customer

In [None]:
# Divide by customer counts indexed by country, so the Series align by label
ave_value = (total_value / mkt['Country'].value_counts()).sort_values(ascending=False)

plt.figure(figsize=(8, 4)) 
plt.bar(ave_value.index, ave_value)
plt.bar('US',ave_value['US'])
plt.xticks(rotation=90)
plt.title('Average Purchase Value Across Countries')
plt.xlabel('Country')
plt.ylabel('Average Purchase Value')
plt.show()

Again, we see that the US does not fare better than the rest of the world: it is well behind Mexico in terms of average purchase value per customer, and very close to all the other countries.

Thus, on every measure examined, the US does not fare better than the rest of the world in terms of purchases.
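
If a formal test of the US against the rest of the world is wanted, one option is a one-sided Mann-Whitney U test on per-customer purchase counts. The choice of test is an assumption on my part (it does not require normally distributed counts), and the data below are synthetic, so the printed result says nothing about the real dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic per-customer purchase counts (not the real data)
us = rng.poisson(lam=15, size=100)
rest = rng.poisson(lam=15, size=2000)

# H0: US customers do not tend to purchase more than the rest
# Ha: US customers tend to purchase more than the rest
stat, p_value = stats.mannwhitneyu(us, rest, alternative='greater')

alpha = 0.05
decision = 'reject H0' if p_value < alpha else 'fail to reject H0'
print(f'U = {stat:.1f}, p = {p_value:.3f}: {decision}')
```

With the real data, `us` and `rest` would be the per-customer totals for US customers and for all other customers.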

**3. Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test**

For this, we will first tabulate the data (number of in store purchases) for 2 populations, those with above average spend on gold, and those with average or lower spending on gold.

Then, we will perform a t-test for population means to determine if there is statistical evidence to support the claim that people who spend an above average amount on gold also have more in store purchases



We first set up the hypothesis test:

Ho: Store purchases of people who spend more on gold <= Store purchases of people who spend less on gold 

Ha: Store purchases of people who spend more on gold > Store purchases of people who spend less on gold

Next, we check if the 2 samples have equal variances

In [None]:
gold_purchase = mkt.loc[:,['MntGoldProds','NumStorePurchases']]
gold_purchase['AboveAvg'] = np.where(gold_purchase['MntGoldProds'] > gold_purchase['MntGoldProds'].mean(),'Above','Below')

# Preparing data for gold buyers & non gold buyers
gold_buyers = gold_purchase[gold_purchase.AboveAvg=='Above'].loc[:,'NumStorePurchases']
non_buyers = gold_purchase[gold_purchase.AboveAvg=='Below'].loc[:,'NumStorePurchases']

print(scipy.stats.levene(gold_buyers,non_buyers))


Based on the results, the p-value is very small, so we reject the null hypothesis that the variances are equal.

Thus, we will perform a two-sample t-test for unequal variances (Welch's t-test)

In [None]:
from scipy.stats import ttest_ind

def t_test(x, y, alternative='equal'):
    # Welch's t-test; ttest_ind returns the two-sided p-value
    _, double_p = ttest_ind(x, y, equal_var=False)
    if alternative == 'equal':
        pval = double_p
    elif alternative == 'greater':
        # Halve the two-sided p-value when the sample means point in the tested direction
        if np.mean(x) > np.mean(y):
            pval = double_p/2.
        else:
            pval = 1.0 - double_p/2.
    elif alternative == 'less':
        if np.mean(x) < np.mean(y):
            pval = double_p/2.
        else:
            pval = 1.0 - double_p/2.
    else:
        raise ValueError("alternative must be 'equal', 'greater' or 'less'")
    return pval

print("At 0.05 significance level, \n")

# Two tailed T-Test
p = t_test(gold_buyers,non_buyers,alternative='equal')
if p > 0.05:
    print("For 2-tailed test:\nWith P-Val of", p, "we fail to reject the null hypothesis. There is no evidence that the population means differ\n")
else:
    print("For 2-tailed test:\nWith P-Val of", p, "null hypothesis is rejected. Population means are not equal\n")

# One tailed T-Test
p2 = t_test(gold_buyers,non_buyers,alternative='greater')
if p2 > 0.05:
    print("For 1-tailed test:\nWith P-Val of", p2, "we fail to reject the null hypothesis. \nNo evidence that store purchases of people who spend more on gold > Store purchases of people who spend less")
else:
    print("For 1-tailed test:\nWith P-Val of", p2, "null hypothesis is rejected. \nStore purchases of people who spend more on gold > Store purchases of people who spend less")


Based on the hypothesis test, we can conclude that people who spend an above average amount on gold make more store purchases, on average, than people who spend less on gold. Thus, the supervisor's claim about store purchases is supported by the data.
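
As an alternative to halving the two-sided p-value by hand, `scipy.stats.ttest_ind` accepts an `alternative` argument (available in SciPy 1.6 and later, to my knowledge). The sketch below uses synthetic groups, so the numbers are illustrative only; the group names are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Synthetic store-purchase counts for the two groups (illustrative only)
above_avg_gold = rng.normal(loc=7.0, scale=3.0, size=300)
below_avg_gold = rng.normal(loc=5.0, scale=2.5, size=700)

# Welch's t-test, two-sided
_, p_two_sided = ttest_ind(above_avg_gold, below_avg_gold, equal_var=False)

# Welch's t-test, one-sided (Ha: mean of the first group is greater)
_, p_greater = ttest_ind(above_avg_gold, below_avg_gold, equal_var=False,
                         alternative='greater')
print(p_two_sided, p_greater)
```

When the first sample mean is the larger one, the one-sided p-value equals half of the two-sided p-value, which is the rule the manual helper applies.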

**4. Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)**

For this, we will build a multiple linear regression model to predict the amount spent on fish products. We will be using the min-max scaled variables, so each variable ranges from 0 to 1. Based on the coefficients of the model, we will be able to plot & determine the factors most strongly related (positively and negatively) to the amount spent on fish products, and identify whether being a married PhD candidate is a strong factor.

In [None]:
ml2 = smf.ols('MntFishProducts ~ Year_Birth+Income+Dt_Customer+Recency+MntWines+MntFruits+MntMeatProducts+\
              MntSweetProducts+MntGoldProds+NumDealsPurchases+NumWebPurchases+NumCatalogPurchases+\
              NumWebVisitsMonth+NumStorePurchases+AcceptedCmp3+AcceptedCmp4+AcceptedCmp5+AcceptedCmp1+\
              AcceptedCmp2+Response+Complain+Education_Basic+Education_Graduation+Education_Master+Education_PhD\
              +Marital_Status_Married+Marital_Status_Single+Marital_Status_Together+Marital_Status_Widow+\
              HasKid+HasTeen+Country_CA+Country_GER+Country_IND+Country_ME+Country_SA\
              +Country_SP+Country_US+Marital_Status_Married*Education_PhD', data=mkt_scale).fit()

ml2.summary()

We will plot the parameter coefficients to visualize the effect of each parameter on the variable 'MntFishProducts'

In [None]:
params = ml2.params.sort_values(ascending=False)
plt.figure(figsize=(15, 4)) 
plt.bar(params.index, params)
plt.xticks(rotation=90)
plt.title('Variable Scores')
plt.xlabel('Variable')
plt.ylabel('Coefficient')
plt.show()

We will arbitrarily use -0.1 and 0.1 as the cutoff for coefficients, in order to narrow down the important factors. We will also plot the coefficient of the interaction term 'Marital_Status_Married:Education_PhD' for comparison

In [None]:
# Series.append was removed in pandas 2.0, so pd.concat is used instead
param_narrow = pd.concat([params[params>0.1], params[params<-0.1]])
param_narrow = param_narrow[~(param_narrow.index=='Intercept')]
plt.figure(figsize=(8, 4)) 
plt.bar(param_narrow.index, param_narrow)
# Plot the interaction term as a separate bar for comparison
plt.bar('Marital_Status_Married:Education_PhD', params['Marital_Status_Married:Education_PhD'])
plt.xticks(rotation=90)
plt.title('Variable Scores')
plt.xlabel('Variable')
plt.ylabel('Coefficient')
plt.show()

Based on the model, the factors that are significantly related to the amount spent on fish are: 'MntSweetProducts', 'MntFruits', 'MntMeatProducts', 'NumCatalogPurchases', 'MntGoldProds' and 'Country_ME'.

This means that customers who tend to spend on other products will also spend on fish products. Interestingly, customers in Mexico also tend to buy more fish.

In comparison, the interaction term for married PhD customers has a very low coefficient. Thus, it does not appear to have a strong relationship with the amount spent on fish.
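
For readers unfamiliar with interaction terms in formulas: writing `a * b` expands to `a + b + a:b`, where `a:b` is the product of the two variables. The sketch below demonstrates this on synthetic data with a built-in interaction effect; the column names and effect sizes are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 400
toy = pd.DataFrame({
    'Married': rng.integers(0, 2, size=n),
    'PhD': rng.integers(0, 2, size=n),
})
# Synthetic spend with a built-in interaction effect of 5 (an assumption)
toy['Fish'] = (10 + 2 * toy['Married'] + 1 * toy['PhD']
               + 5 * toy['Married'] * toy['PhD'] + rng.normal(size=n))

fit = smf.ols('Fish ~ Married * PhD', data=toy).fit()

# 'Married * PhD' expands to Married + PhD + Married:PhD
print(fit.params.index.tolist())
print(fit.pvalues['Married:PhD'])
```

The p-value of the `Married:PhD` term indicates whether the combined group differs from what the two main effects alone would predict.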

**5. Is there a significant relationship between geographical region and success of a campaign?**

For this, we will fit a random forest classifier to the data to predict the response to each campaign (6 different models will be fit). The algorithm assigns each feature an importance score, based on how much the feature reduces entropy across the trees.

We will plot the feature importance scores, and then the feature importance scores for the country variables will be studied to determine if there is a significant relationship to campaign success.

Note that logistic regression was tried, but the model did not converge for many of the campaigns, thus we are taking this approach instead.

In [None]:
# Target variable "Response", which is also the latest campaign run

X = mkt_scale.drop(['ID','HasKid','HasTeen','Response'],axis=1)
y = mkt_scale.Response
# define the model
model = RandomForestClassifier(n_estimators=100, criterion="entropy")
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
plt.figure(figsize=(15, 4)) 
plt.bar(X.columns[:-7], importance[:-7])
plt.bar(X.columns[-7:], importance[-7:])
plt.xticks(rotation=90)
plt.title('Feature Importance for Latest Campaign "Response"')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

In [None]:
# Target variable "AcceptedCmp5", which is the fifth campaign run

X = mkt_scale.drop(['ID','HasKid','HasTeen','AcceptedCmp5'],axis=1)
y = mkt_scale.AcceptedCmp5
# define the model
model = RandomForestClassifier(n_estimators=100, criterion="entropy")
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
plt.figure(figsize=(15, 4)) 
plt.bar(X.columns[:-7], importance[:-7])
plt.bar(X.columns[-7:], importance[-7:])
plt.xticks(rotation=90)
plt.title('Feature Importance for Campaign 5')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

In [None]:
# Target variable "AcceptedCmp4", which is the fourth campaign run

X = mkt_scale.drop(['ID','HasKid','HasTeen','AcceptedCmp4'],axis=1)
y = mkt_scale.AcceptedCmp4
# define the model
model = RandomForestClassifier(n_estimators=100, criterion="entropy")
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
plt.figure(figsize=(15, 4)) 
plt.bar(X.columns[:-7], importance[:-7])
plt.bar(X.columns[-7:], importance[-7:])
plt.xticks(rotation=90)
plt.title('Feature Importance for Campaign 4')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

In [None]:
# Target variable "AcceptedCmp3", which is the third campaign run

X = mkt_scale.drop(['ID','HasKid','HasTeen','AcceptedCmp3'],axis=1)
y = mkt_scale.AcceptedCmp3
# define the model
model = RandomForestClassifier(n_estimators=100, criterion="entropy")
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
plt.figure(figsize=(15, 4)) 
plt.bar(X.columns[:-7], importance[:-7])
plt.bar(X.columns[-7:], importance[-7:])
plt.xticks(rotation=90)
plt.title('Feature Importance for Campaign 3')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

In [None]:
# Target variable "AcceptedCmp2", which is the second campaign run

X = mkt_scale.drop(['ID','HasKid','HasTeen','AcceptedCmp2'],axis=1)
y = mkt_scale.AcceptedCmp2
# define the model
model = RandomForestClassifier(n_estimators=100, criterion="entropy")
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
plt.figure(figsize=(15, 4)) 
plt.bar(X.columns[:-7], importance[:-7])
plt.bar(X.columns[-7:], importance[-7:])
plt.xticks(rotation=90)
plt.title('Feature Importance for Campaign 2')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

In [None]:
# Target variable "AcceptedCmp1", which is the first campaign run

X = mkt_scale.drop(['ID','HasKid','HasTeen','AcceptedCmp1'],axis=1)
y = mkt_scale.AcceptedCmp1
# define the model
model = RandomForestClassifier(n_estimators=100, criterion="entropy")
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
plt.figure(figsize=(15, 4)) 
plt.bar(X.columns[:-7], importance[:-7])
plt.bar(X.columns[-7:], importance[-7:])
plt.xticks(rotation=90)
plt.title('Feature Importance for Campaign 1')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

Based on the feature importance plots for each of the campaigns, the "Country" variables (in orange) consistently have very small importance scores, meaning that they contribute less to predicting campaign success than the other features.

We can further verify this by plotting the campaign acceptance rate across different countries.

In [None]:
campaign = mkt.loc[:,['Response','AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5','Country']]
campaign = campaign.groupby('Country').mean()

f, axs = plt.subplots(3,2,figsize=(15,8))
campaign['Response'].plot(kind='bar', ax=axs[0,0])
axs[0,0].title.set_text('Acceptance % of Latest Campaign')
campaign['AcceptedCmp5'].plot(kind='bar', ax=axs[0,1])
axs[0,1].title.set_text('Acceptance % of Campaign 5')
campaign['AcceptedCmp4'].plot(kind='bar', ax=axs[1,0])
axs[1,0].title.set_text('Acceptance % of Campaign 4')
campaign['AcceptedCmp3'].plot(kind='bar', ax=axs[1,1])
axs[1,1].title.set_text('Acceptance % of Campaign 3')
campaign['AcceptedCmp2'].plot(kind='bar', ax=axs[2,0])
axs[2,0].title.set_text('Acceptance % of Campaign 2')
campaign['AcceptedCmp1'].plot(kind='bar', ax=axs[2,1])
axs[2,1].title.set_text('Acceptance % of Campaign 1')

f.tight_layout()
plt.show()

From the plots above, we can see that the acceptance rate of each campaign is quite low and rather uniform across the various countries. This is consistent with our conclusion that "Country" is not an important feature for predicting campaign success.
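
One way to check the "rather uniform" impression more formally is a chi-square test of independence between country and acceptance. The sketch below is a minimal standard-library illustration of the statistic; the counts in the 2x2 table are hypothetical and not taken from the dataset, and `chi_square_statistic` is a helper name introduced here.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if country and acceptance were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are two countries, columns are [accepted, not accepted]
observed = [[30, 70],
            [10, 90]]
stat = chi_square_statistic(observed)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom
CRITICAL_VALUE_DF1 = 3.841
print(f"chi-square = {stat:.2f}, reject independence at 5%: {stat > CRITICAL_VALUE_DF1}")
```

On the real data, the table could be built with `pd.crosstab(mkt['Country'], mkt['Response'])`, and `scipy.stats.chi2_contingency` would also return a p-value for tables with more than two rows.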

Note that the dataset contains only 3 customers from Mexico, so its acceptance rate appears high: if just 1 customer accepts the campaign, the acceptance rate is already 33%.
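
To show how little a rate computed from 3 customers tells us, the sketch below computes a Wilson score interval for a proportion. This is an illustrative addition rather than part of the original analysis; `wilson_interval` is a helper name introduced here, and `z=1.96` corresponds to roughly 95% confidence.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 1 acceptance out of 3 customers, as in the Mexico example
low, high = wilson_interval(successes=1, n=3)
print(f"1 acceptance out of 3 customers: {low:.0%} to {high:.0%}")
```

The interval spans most of the range between 0 and 1, so an observed rate of 33% from 3 customers is compatible with both a very low and a very high underlying acceptance rate.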

# Section 03: Data Visualization
In this section, we will present data visualizations to answer the following questions:

1. Which marketing campaign is most successful?
2. What does the average customer look like for this company?
3. Which products are performing best?
4. Which channels are underperforming?

**1. Which marketing campaign is most successful?**

For this, we will plot the take-up of all campaigns.

In [None]:
campaign_takeup = mkt.loc[:,['Response','AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5']]

campaign_takeup = campaign_takeup.melt()
campaign_takeup = pd.crosstab(campaign_takeup["variable"], campaign_takeup["value"]).sort_values(0)

cols = list(campaign_takeup.columns)
a, b = cols.index(0), cols.index(1)
cols[b], cols[a] = cols[a], cols[b]
campaign_takeup = campaign_takeup[cols]

campaign_takeup.columns = "Yes","No"
campaign_takeup.plot.bar(stacked=True)
plt.title('Acceptance of Marketing Campaigns')
plt.xlabel('Campaign')
plt.ylabel('Acceptance')
plt.legend(title='Response',loc='upper right')
plt.show()

The graph above displays the acceptance of the various campaigns, ordered from most to least accepted.

Based on the graph, we can conclude that the most recent campaign is the most successful one.
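
Because each campaign column is a 0/1 flag, the take-up rate can also be read directly as the column mean. The sketch below illustrates this on a small hypothetical table (the five rows are made up for illustration; only the column names come from the dataset).

```python
import pandas as pd

# Hypothetical 0/1 acceptance flags for five customers (not the real data)
demo = pd.DataFrame({
    "Response":     [1, 0, 1, 0, 0],
    "AcceptedCmp1": [0, 0, 1, 0, 0],
    "AcceptedCmp2": [0, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of customers who accepted
takeup_rate = demo.mean().sort_values(ascending=False)
print(takeup_rate)
```

Applied to the campaign columns of `mkt`, the same two lines would give the take-up rates in descending order without the melt and crosstab steps.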

**2. What does the average customer look like for this company?**

We will average each quantitative variable and take the modal category of each categorical variable to obtain the average customer for this company.

In [None]:
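# average the numeric columns only; the categorical columns are summarized by their mode below
average_cust = mkt.drop('ID',axis=1).mean(numeric_only=True)
# object dtype lets the averaged Dt_Customer value be replaced by a date string
average_cust = pd.DataFrame(average_cust).astype(object)
average_cust.loc['Dt_Customer',:] = str(datetime.today() - timedelta(days = average_cust.loc['Dt_Customer',0]))
modal = mkt.drop('ID',axis=1).mode().transpose().loc[['Country','Education','Marital_Status']]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
average_cust = pd.concat([average_cust, modal])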
print(average_cust)
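
The same "mean for numeric columns, mode for categorical columns" idea can be sketched on a small standalone table. The rows below are hypothetical and only illustrate the technique; `select_dtypes` is used here to split the columns by type instead of naming them by hand.

```python
import pandas as pd

# Hypothetical mini version of the customer table (values are illustrative only)
demo = pd.DataFrame({
    "Income":         [40000.0, 60000.0, 50000.0],
    "Kidhome":        [0, 1, 1],
    "Country":        ["SP", "SP", "US"],
    "Marital_Status": ["Married", "Single", "Married"],
})

# Mean of every numeric column
numeric_part = demo.select_dtypes(include="number").mean()
# Most frequent value of every non-numeric column (first row of mode() if there are ties)
categorical_part = demo.select_dtypes(exclude="number").mode().iloc[0]

profile = pd.concat([numeric_part, categorical_part])
print(profile)
```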

On average, a customer looks like this, divided into several categories:

Demographic:
* Born between 1968 and 1969
* Income: $51,959
* 0.44 kids and 0.51 teens at home, for an average of 1 dependent at home
* Married
* Graduated (likely high school)
* From Spain

Loyalty:
* Became a customer on 10-07-2013
* Last made a purchase 49 days ago

Expenditure in last 2 years:
* Wine: 304.03
* Fruits: 26.30
* Meat: 167.11
* Fish: 37.45
* Sweet: 27.11
* Gold: 43.90

Channels:
* Deals Purchases: 2.32
* Web Purchases: 4.08
* Catalog Purchases: 2.66
* Store Purchases: 5.79
* Number of Web Visits in past month: 5.32

Interactions:
* Complaints: 0.01
* Accepted latest campaign: 0.15
* Accepted Campaign 1: 0.064
* Accepted Campaign 2: 0.013
* Accepted Campaign 3: 0.073
* Accepted Campaign 4: 0.075
* Accepted Campaign 5: 0.073

**3. Which products are performing best?**

We will plot a bar chart of the products, based on the amount spent on each in the last 2 years.

In [None]:
products = mkt.loc[:,['MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds']]
products = products.sum().sort_values(ascending=False)

plt.bar(products.index,products)
plt.xticks(rotation=90)
plt.title('Product Performance in last 2 years')
plt.xlabel('Product Category')
plt.ylabel('Amount Spent')
plt.show()

Based on the chart, we can see that wine performed the best, with the highest amount spent, followed by meat products. On the other hand, Gold, Fish, Sweet and Fruits products are less popular, with similar amounts spent on each.
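
To put the chart into proportions, the sketch below converts the average spend per customer (the figures listed in the customer profile in the previous question) into each category's share of total spend. It is a small arithmetic illustration, not a new analysis.

```python
# Average spend per customer over the last 2 years, as listed in the customer profile
avg_spend = {
    "Wine": 304.03, "Fruits": 26.30, "Meat": 167.11,
    "Fish": 37.45, "Sweet": 27.11, "Gold": 43.90,
}

total = sum(avg_spend.values())
share = {product: amount / total for product, amount in avg_spend.items()}

# Print the categories from largest to smallest share
for product, fraction in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{product:<7}{fraction:6.1%}")
```

On these figures, wine accounts for about half of the average customer's spend, and wine and meat together for roughly three quarters.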

**4. Which channels are underperforming?**

We will plot a bar chart of the channels, based on the number of purchases made through each in the last 2 years. The number of web visits is not included, as that data is only available for the past month.

In [None]:
channels = mkt.loc[:,['NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases']]
channels = channels.sum().sort_values(ascending=False)

plt.bar(channels.index,channels)
plt.xticks(rotation=90)
plt.title('Channel Performance in last 2 years')
plt.xlabel('Channel')
plt.ylabel('Number of Purchases')
plt.show()

Based on the chart, we can see that most customers preferred purchasing in physical stores, which account for the highest number of purchases. This is followed by the website, catalog, and deals. Deals is the most underperforming channel.

# Section 04: CMO Recommendations

Based on the findings in the sections above, we provide the following data-driven recommendations:

1. Regional Market Recommendations:
    * Mexico appears to be an interested market with a willingness to pay: it has the highest average purchase quantities and purchase values (based on only 3 customers in the dataset), despite the company's low market penetration in the country.

2. Channel Recommendations:
    * The CMO should revisit the website strategy. As shown in the exploratory data analysis, there is considerable interest in the website, with a high number of visits in the last month. However, this interest is not being converted into sales: the average number of website purchases over the past 2 years (4.08) is lower than the average number of visits in a single month (5.32).

3. Product Recommendations:
    * There is a large appetite for the company's wine and meat products, so the company can try to diversify and bring in more premium products in these categories to further drive sales. Additionally, some customers have very high incomes (> 100k) and may be interested in these products.

4. Campaign Recommendations:
    * Future campaigns should try to emulate the features of the latest campaign run by the company, as it performed very well, with the highest take-up rate of about 15% compared to the average campaign take-up rate of 7%.
    * Since there is no strong relationship between geographic region and campaign success, future campaigns can be piloted at a smaller scale in one or two countries to find out what works and what doesn't, before rolling out to other countries. This will help save costs on failed campaigns and increase agility in campaign rollouts.
    
5. Customer Loyalty Recommendations:
    * The CMO should consider a loyalty program or rewards for members to increase stickiness and purchase frequency. Engagement among members is rather low: many have been members for over a year, yet median recency (days since last purchase) is approximately 50 days. Given that the store carries a large variety of fresh products (Fruits, Fish and Meat), weekly or biweekly customer visits would be preferable.
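
The website recommendation above rests on a simple comparison, which can be made explicit with the per-customer averages from the customer profile. The sketch below assumes that the purchase counts cover a 24-month window and that last month's visits are representative of a typical month; both are assumptions made here for illustration.

```python
# Per-customer averages taken from the customer profile earlier in this notebook
web_visits_per_month = 5.32       # web visits in the past month
web_purchases_two_years = 4.08    # web purchases in the last 2 years

MONTHS_IN_PERIOD = 24  # assumption: purchase counts cover a 24-month window

web_purchases_per_month = web_purchases_two_years / MONTHS_IN_PERIOD
# Assumes last month's visit count is typical of every month in the period
visits_per_purchase = web_visits_per_month / web_purchases_per_month

print(f"web purchases per month: {web_purchases_per_month:.2f}")
print(f"web visits per web purchase: {visits_per_purchase:.1f}")
```

Under these assumptions, the average customer visits the website roughly 30 times for every web purchase, which is the conversion gap the channel recommendation refers to.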
