<img src = "https://ally-marketing.com/wp-content/uploads/2016/07/marketing.jpg">

## Introduction
__Task Detail__


You're a marketing analyst and you've been told by the Chief Marketing Officer that recent marketing campaigns have not been as effective as they were expected to be. You need to analyze the data set to understand this problem and propose data-driven solutions.

***
__Section 01: Exploratory Data Analysis__
***

 -	Are there any null values or outliers? How will you wrangle/handle them?
 -	Are there any variables that warrant transformations?
 -	Are there any useful variables that you can engineer with the given data?
 -	Do you notice any patterns or anomalies in the data? Can you plot them?

***
__Section 02: Statistical Analysis__
***

 -	What factors are significantly related to the number of store purchases?
 -	Does US fare significantly better than the Rest of the World in terms of total purchases?
 -	Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test
 -	Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)
 -	Is there a significant relationship between geographical region and success of a campaign?
 
 
 ***
 __Section 03: Data Visualization__
 ***
 - Which marketing campaign is most successful?
 - What does the average customer look like for this company?
 - Which products are performing best?
 - Which channels are underperforming?


***
__Section 04: CMO Recommendations__
***


Bring together everything from Sections 01 to 03 and provide data-driven recommendations/suggestions to your CMO.



## Data Loading and Cleaning

 - Load the relevant libraries to help with our task
 - Import the data
 - Clean up the data

In [None]:
# the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

pd.set_option('display.max_columns', None)
%matplotlib inline
sns.set_style('white')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# load the dataset and print info
df = pd.read_csv('../input/marketing-data/marketing_data.csv')
print(df.info())
df.head(10)

Clean up the Income column name and convert Income from an object to a float.

In [None]:
# Remove any spaces found in the column names
df.columns = df.columns.str.replace(' ', '')

# Remove the dollar sign and the comma (regex=False so '$' is treated as a literal character)
df['Income'] = df['Income'].str.replace('$', '', regex=False)
df['Income'] = df['Income'].str.replace(',', '', regex=False).astype('float')

## Section 01: Exploratory Data Analysis

### **Are there any null values or outliers? How will you wrangle/handle them?**

In [None]:
#Check to see if there are any null values
df.isnull().sum()

In [None]:
# Fill the Null values with median values
df["Income"] = df["Income"].fillna(df["Income"].median())

In [None]:
# Keep only the numeric columns that can contain outliers (drop ID and the binary flags)
df_check = df.drop(columns=['ID', 'AcceptedCmp3', 'AcceptedCmp4', 
                            'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Response', 
                            'Complain']).select_dtypes(include=np.number)

In [None]:
fig, ax = plt.subplots(4, 4, figsize = (15, 16))
ax = ax.flatten()

for i, c in enumerate(df_check):
    sns.boxplot(x = df_check[c], ax = ax[i])
plt.suptitle('Outlier Analysis using BoxPlots', fontsize = 25)
fig.tight_layout()

From the boxplots we see that Year_Birth and Income have a few outliers. We will correct Year_Birth but leave Income as is, because high earners genuinely exist.
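
The boxplots flag points that lie more than 1.5 times the interquartile range beyond the quartiles. As a rough numeric companion to the plots, the sketch below counts such points per column. The helper name `iqr_outlier_counts` and the toy values are hypothetical and are not taken from the marketing dataset.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column thresholds with the DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy data (hypothetical): one implausible birth year, no extreme Recency values
toy = pd.DataFrame({
    "Year_Birth": [1893, 1965, 1970, 1975, 1980, 1985, 1990],
    "Recency": [10, 20, 30, 40, 50, 60, 70],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

In the notebook, the same helper could be applied to `df_check` to list the columns with the most flagged values.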

In [None]:
plt.figure(figsize=(8,4))
sns.histplot(df['Year_Birth'])
plt.title('Year Birth Distribution', size=16)
plt.ylabel('Count');

In [None]:
# Replace implausible birth years (1900 or earlier) with the median Year_Birth
median_year = df['Year_Birth'].median()
df['Year_Birth'] = df['Year_Birth'].apply(lambda x: median_year if x <= 1900 else x)

#Change from float to integer
df['Year_Birth'] = df['Year_Birth'].astype(int)

### **Are there any variables that warrant transformations?**

In [None]:
#We will change the Year_Birth to age by using the datetime library.
year = datetime.datetime.today().year
df['Age'] = year - df['Year_Birth']

In [None]:
sns.histplot(data = df['Age'], kde = True)
plt.title('Distribution Age of Customer')

Instead of categorizing customers by the full date they signed up with the company, we can simplify it to just the year.

In [None]:
df['Year_Customer'] = pd.DatetimeIndex(df["Dt_Customer"]).year

### **Are there any useful variables that you can engineer with the given data?**

In [None]:
# Kidhome and Teenhome can be combined to create Dependents
df["Dependents"] = df["Kidhome"] + df["Teenhome"]

# total amount spent
mnt_cols = [col for col in df.columns if 'Mnt' in col]
df['TotalMnt'] = df[mnt_cols].sum(axis=1)

# Total Purchases
purchases_cols = [col for col in df.columns if 'Purchases' in col]
df['TotalPurchases'] = df[purchases_cols].sum(axis=1)

# Total Campaigns Accepted
campaigns_cols = [col for col in df.columns if 'Cmp' in col] + ['Response'] 
df['TotalCampaignsAcc'] = df[campaigns_cols].sum(axis=1)

In [None]:
# We will drop the columns that we transformed
df = df.drop(["Year_Birth"], axis = 1)

df = df.drop(["Kidhome", "Teenhome"], axis = 1)

df = df.drop(["Dt_Customer"], axis = 1)

### **Do you notice any patterns or anomalies in the data? Can you plot them?**

In [None]:
# Create categorical groups for Age and Income to plot anomalies.

df_group = df.copy()
age_bins = [20, 30, 40, 50, 60, 120]
age_labels = ["20s", "30s", "40s", "50s", "60+"]
df_group["AgeGroup"] = pd.cut(df_group.Age, 
                              age_bins, 
                              labels = age_labels, 
                              include_lowest = True)

#Creating Income Group 
# https://www.thebalance.com/definition-of-middle-class-income-4126870
income_bins = [0, 19999, 39999, 59999, 89999, 2000000]
income_labels = ["Very Low", 
                 "Low", 
                 "Middle", 
                 "Middle High",
                 "High"]
df_group["IncomeGroup"] = pd.cut(df_group.Income, 
                                 income_bins, 
                                 labels = income_labels, 
                                 include_lowest = True)

In [None]:
# Drop Income and Age from this DataFrame
df_group = df_group.drop(["Income", "Age"], axis = 1)

In [None]:
f,ax = plt.subplots(2, 2, figsize = (14, 10))
sns.barplot(x = 'IncomeGroup', y = 'TotalMnt', data = df_group,ax = ax[0][0] );
ax[0][0].set_title('Income Group to Total Spent', fontweight ="bold") 
ax[0][0].set_xlabel('')
sns.barplot(x = 'Dependents', y = 'TotalMnt', data = df_group, ax = ax[0][1]);
ax[0][1].set_title('Dependents to Total Spent', fontweight ="bold") 
ax[0][1].set_xlabel('')
sns.barplot(x = 'Education', y = 'TotalMnt', data = df_group, ax = ax[1][0]);
ax[1][0].set_title('Education Level to Total Spent', fontweight ="bold") 
ax[1][0].set_xlabel('')
sns.barplot(x = 'Education', y = 'Income', data = df, ax = ax[1][1]);
ax[1][1].set_title('Income to Education', fontweight ="bold") 
ax[1][1].set_xlabel('');

The data show that high-income customers spend the most overall. In addition, customers without dependents spend more than those with one or more dependents. The bottom two bar charts suggest that higher-income individuals spend the most and that customers with a PhD typically earn the most.

## Section 02: Statistical Analysis

### **What factors are significantly related to the number of store purchases?**

In [None]:
corr = df.drop(columns='ID').select_dtypes(include = np.number).corr(method = 'kendall')

fig, ax = plt.subplots( 1, 1, figsize = (16, 16), dpi = 200)
sns.heatmap( corr, ax = ax, annot = True, fmt = '.2f', square = True)
plt.show()

Based on the correlation matrix, the factors most strongly correlated with the number of store purchases are:

 - TotalPurchases
 - TotalMnt
 - NumCatalogPurchases
 - MntWines
 - MntMeatProducts
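
The heatmap shows the strength of each correlation but not whether it is statistically significant. One possible way to check this is to compute Kendall's tau together with its p-value for each feature against `NumStorePurchases`, using `scipy.stats.kendalltau`. The helper name `kendall_vs_target` and the toy values below are hypothetical.

```python
import pandas as pd
from scipy.stats import kendalltau

def kendall_vs_target(frame, target):
    """Return Kendall's tau and its p-value for every numeric column against `target`."""
    numeric = frame.select_dtypes(include="number")
    rows = []
    for col in numeric.columns.drop(target):
        tau, p = kendalltau(numeric[col], numeric[target])
        rows.append({"feature": col, "tau": tau, "p_value": p})
    return pd.DataFrame(rows).sort_values("tau", ascending=False).reset_index(drop=True)

# Toy data (hypothetical values, not from the marketing dataset)
toy = pd.DataFrame({
    "NumStorePurchases": [2, 3, 5, 6, 8, 9, 11, 12],
    "MntWines": [10, 40, 90, 150, 300, 420, 600, 800],  # rises with the target
    "NumWebVisitsMonth": [9, 8, 7, 6, 5, 4, 3, 2],      # falls with the target
})
result = kendall_vs_target(toy, "NumStorePurchases")
print(result)
```

Applied to the notebook's `df` (without `ID`), features with a small p-value and a large absolute tau would be the ones to report.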

### **Does US fare significantly better than the Rest of the World in terms of total purchases?**

In [None]:
plt.figure(figsize = (4, 4))
df.groupby('Country')['TotalPurchases'].sum().sort_values(ascending = False).plot(kind = 'bar')
plt.title('Total Number of Purchases by Country', size = 16)
plt.ylabel('Number of Purchases')
plt.xlabel(' ');

The answer is no. By total number of purchases, the US ranks second to last, while Spain ranks first. These are summed totals, so they also reflect how many customers each country has.
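
The bar chart compares summed totals, so it does not by itself say whether a difference is significant. A minimal sketch of a per-customer comparison follows, using Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`). It assumes the US is encoded as `'US'` in the `Country` column. The helper name `us_vs_rest` and the toy values are hypothetical.

```python
import pandas as pd
from scipy.stats import ttest_ind

def us_vs_rest(frame, country_col="Country", value_col="TotalPurchases"):
    """Welch's t-test on per-customer purchases: US customers vs. everyone else."""
    is_us = frame[country_col] == "US"  # assumption: the US is coded as 'US'
    us = frame.loc[is_us, value_col]
    rest = frame.loc[~is_us, value_col]
    stat, p = ttest_ind(us, rest, equal_var=False)
    return us.mean(), rest.mean(), p

# Toy data (hypothetical values, not from the marketing dataset)
toy = pd.DataFrame({
    "Country": ["US", "US", "US", "SP", "SP", "SP", "CA", "CA"],
    "TotalPurchases": [12, 15, 14, 13, 16, 15, 14, 12],
})
us_mean, rest_mean, p_value = us_vs_rest(toy)
print(us_mean, rest_mean, p_value)
```

The US would fare significantly better only if its mean is higher and the p-value falls below the chosen threshold (0.05 elsewhere in this notebook).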

### **Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test**

In [None]:
from scipy.stats import pearsonr

gold = df['MntGoldProds']
stpurch = df['NumStorePurchases']

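stat, p = pearsonr(gold, stpurch)
print(stat, p)
# The claim holds only if the correlation is both positive and statistically significant
if stat > 0 and p < 0.05:
    print('Agree with the Supervisor')
else:
    print('Disagree with the Supervisor')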
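
The supervisor's statement is about customers who spent an above-average amount on gold, so a two-group comparison is another reasonable way to test it. The sketch below splits customers at the mean of `MntGoldProds` and runs a one-sided Mann-Whitney U test on `NumStorePurchases`. This is an alternative to the Pearson test above, not the method used in this notebook. The helper name `gold_vs_store` and the toy values are hypothetical.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def gold_vs_store(frame):
    """Compare store purchases of above-average gold spenders against the rest."""
    above = frame["MntGoldProds"] > frame["MntGoldProds"].mean()
    high = frame.loc[above, "NumStorePurchases"]
    low = frame.loc[~above, "NumStorePurchases"]
    # One-sided alternative: above-average gold spenders make MORE store purchases
    stat, p = mannwhitneyu(high, low, alternative="greater")
    return high.mean(), low.mean(), p

# Toy data (hypothetical values, not from the marketing dataset)
toy = pd.DataFrame({
    "MntGoldProds":      [5, 8, 10, 12, 15, 60, 80, 90, 120, 150],
    "NumStorePurchases": [2, 3, 3, 4, 5, 7, 8, 9, 10, 12],
})
high_mean, low_mean, p_value = gold_vs_store(toy)
print(high_mean, low_mean, p_value)
```

A rank-based test is used here because spending and purchase counts are typically skewed, as the boxplots in Section 01 suggest.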

### **Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)**

In [None]:
# Create a new dataframe to see who is Married with a PhD.
df2 = df.copy()
df2['Married_PhD'] = df2['Marital_Status'] + df2['Education']
df2['Married_PhD'] = df2['Married_PhD'].apply(lambda x: 1 if x == 'MarriedPhD' else 0)

plt.figure(figsize=(3,4))
sns.boxplot(x = 'Married_PhD', y = 'MntFishProducts', data = df2)

In [None]:
# Calculate the T-test for the means of two independent samples of scores
from scipy.stats import ttest_ind

pval = ttest_ind(df2[df2['Married_PhD'] == 1]['MntFishProducts'], 
                 df2[df2['Married_PhD'] == 0]['MntFishProducts']).pvalue
print(pval)

In [None]:
df2 = df2.drop(columns = ['ID', 'Education', 'Marital_Status', 'AcceptedCmp1', 'AcceptedCmp2', 
                          'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response', 
                          'Complain', 'Year_Customer'])

In [None]:
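# Country is still a string column, so restrict the correlation to numeric columns
corr2 = df2.select_dtypes(include = np.number).corr(method = 'kendall')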
fig, ax = plt.subplots( 1, 1, figsize = (16, 16), dpi = 200)
sns.heatmap(corr2, ax = ax, annot = True, fmt = '.2f', square = True)
plt.show()

From the heatmap, we can see that the factors most strongly correlated with the amount spent on fish are:
 - MntWines
 - MntMeatProducts
 - MntSweetProducts
 - TotalMnt
 - NumCatalogPurchases
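
The question hints at interaction effects. As a minimal sketch, the model below regresses fish spending on a Married indicator, a PhD indicator, and their product, so the last coefficient captures what being both married and a PhD adds beyond the two separate effects. It uses plain least squares via `np.linalg.lstsq` and reports coefficients only, with no significance test. The helper name `fit_interaction` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_interaction(frame, target="MntFishProducts"):
    """Least-squares fit of target ~ married + phd + married:phd."""
    married = (frame["Marital_Status"] == "Married").astype(float)
    phd = (frame["Education"] == "PhD").astype(float)
    # Design matrix: intercept, two main effects, and their interaction term
    X = np.column_stack([np.ones(len(frame)), married, phd, married * phd])
    y = frame[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept", "married", "phd", "married_x_phd"], coef))

# Toy data (hypothetical values, not from the marketing dataset)
toy = pd.DataFrame({
    "Marital_Status": ["Single", "Single", "Married", "Married"] * 2,
    "Education": ["Master", "PhD", "Master", "PhD"] * 2,
    "MntFishProducts": [40, 30, 45, 20] * 2,
})
coefs = fit_interaction(toy)
print(coefs)
```

In this toy example the interaction coefficient is negative, meaning married PhDs spend less on fish than the two main effects alone would predict.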

### **Is there a significant relationship between geographical region and success of a campaign?**

In [None]:
df_campaign = df[['Country', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 
                  'AcceptedCmp4', 'AcceptedCmp5', 'Response']]

In [None]:
campaign_titles = {'AcceptedCmp1': 'Campaign 1', 'AcceptedCmp2': 'Campaign 2',
                   'AcceptedCmp3': 'Campaign 3', 'AcceptedCmp4': 'Campaign 4',
                   'AcceptedCmp5': 'Campaign 5', 'Response': 'Recent Campaign'}

# One bar plot of the acceptance rate by country for each campaign
f, ax = plt.subplots(2, 3, figsize = (15, 8), sharey = True)
for axis, (col, title) in zip(ax.flatten(), campaign_titles.items()):
    sns.barplot(x = 'Country', y = col, data = df_campaign, ax = axis)
    axis.set_title(f'{title} Acceptance by Country', fontweight = "bold")
    axis.set_ylabel('')
    axis.set_xlabel('')
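
The bar plots show acceptance rates by country but do not test for significance. One common way to do so is a chi-square test of independence on the Country by acceptance contingency table, using `pd.crosstab` and `scipy.stats.chi2_contingency`. The helper name `country_campaign_test` and the toy values below are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def country_campaign_test(frame, campaign_col="Response"):
    """Chi-square test of independence between Country and a campaign flag."""
    table = pd.crosstab(frame["Country"], frame[campaign_col])
    chi2, p, dof, expected = chi2_contingency(table)
    return table, chi2, p, dof

# Toy data (hypothetical values). With counts this small the chi-square
# approximation is rough; the real dataset has far more rows per country.
toy = pd.DataFrame({
    "Country": ["SP"] * 6 + ["US"] * 6 + ["CA"] * 6,
    "Response": [1, 0, 0, 0, 0, 0,
                 1, 1, 0, 0, 0, 0,
                 0, 0, 0, 0, 1, 0],
})
table, chi2, p_value, dof = country_campaign_test(toy)
print(table)
print(chi2, p_value, dof)
```

Running the helper once per campaign column on `df_campaign` would give one p-value per campaign. A small p-value would indicate that acceptance depends on country.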

## Section 03: Data Visualization

### **Which marketing campaign is most successful?**

In [None]:
df_success = pd.DataFrame(df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4',
                              'AcceptedCmp5', 'Response']].mean()*100, 
                          columns = ['Percent']).reset_index()

# plot
plt.figure(figsize = (8,4))
sns.barplot(x='index', y='Percent', data = df_success)
plt.xlabel(' ')
plt.ylabel(' ')
plt.title('Success Rate (%) of Marketing Campaign', size = 14);

From the chart, the most recent marketing campaign (Response) had the most success, while Campaign 2 had the least.

### **What does the average customer look like for this company?**

In [None]:
f, ax = plt.subplots(1, 3, figsize = (15,4), sharey = True)
sns.countplot(x = df['Country'], ax = ax[0])
ax[0].set_title('Customer Count by Country', fontweight ="bold")
ax[0].set_xlabel('')
ax[0].set_ylabel('');
sns.countplot(x = df['Marital_Status'], ax = ax[1])
ax[1].set_title('Customer Count by Marital Status', fontweight ="bold") 
ax[1].tick_params(labelrotation=45)
ax[1].set_xlabel('')
ax[1].set_ylabel('');
sns.countplot(x = df['Education'], ax = ax[2])
ax[2].set_title('Customer Count by Education', fontweight ="bold") 
ax[2].set_xlabel('')
ax[2].set_ylabel('');

In [None]:
cust_info = df[['Age', 'Income', 'Dependents', 'Recency']]
round(cust_info.mean(), 1)

Average Customer:
 - Age: 52
 - Income: $52,238
 - Dependents: 1
 - Country: Spain
 - Marital Status: Married
 - Education: Graduation
 - Recency: 49
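
The numeric entries in this profile come from column means, and the categorical entries are the tallest bars in the count plots. A small sketch of how both could be computed together follows, using the mean for numeric columns and the mode for categorical ones. The helper name `typical_customer` and the toy values are hypothetical.

```python
import pandas as pd

def typical_customer(frame, numeric_cols, categorical_cols):
    """Mean of each numeric column plus the most frequent value of each categorical one."""
    profile = frame[numeric_cols].mean().round(1).to_dict()
    for col in categorical_cols:
        profile[col] = frame[col].mode().iloc[0]
    return profile

# Toy data (hypothetical values, not from the marketing dataset)
toy = pd.DataFrame({
    "Age": [40, 50, 60, 58],
    "Dependents": [0, 1, 2, 1],
    "Country": ["SP", "SP", "US", "SP"],
    "Marital_Status": ["Married", "Single", "Married", "Married"],
})
profile = typical_customer(toy, ["Age", "Dependents"], ["Country", "Marital_Status"])
print(profile)
```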

### **Which products are performing best?**

In [None]:
product = pd.DataFrame(round(df[['MntFruits', 'MntSweetProducts', 'MntFishProducts', 'MntGoldProds',
                                'MntMeatProducts', 'MntWines']].mean(), 1), 
                       columns = ['Average']).sort_values(by = 'Average').reset_index()

plt.figure(figsize = (8,4))
plt.xticks(rotation = 30)
sns.barplot(x = 'index', y = 'Average', data = product)
plt.xlabel('Product', weight = 'bold', size = 13)
plt.ylabel('Average Amount Spent', weight = 'bold', size = 13);

The best performing product is Wine followed by Meat Products.

### **Which channels are underperforming?**

In [None]:
channels = pd.DataFrame(round(df[['NumDealsPurchases', 'NumCatalogPurchases', 'NumWebPurchases', 
                                  'NumStorePurchases','TotalCampaignsAcc']].mean(), 1), 
                        columns = ['Average']).sort_values(by = 'Average').reset_index()
plt.figure(figsize = (8,4))
plt.xticks(rotation = 30)
sns.barplot(x = 'index', y = 'Average', data = channels)
plt.xlabel('Channels', weight = 'bold', size = 13)
plt.ylabel('Average', weight = 'bold', size = 13)

Deals and Marketing Campaigns are the channels that are struggling.

## Section 04: CMO Recommendations

### In addition to the recommendations, a Tableau dashboard is included for the CMO.

In [None]:
# organize the Columns
columns = ['ID', 'AgeGroup', 'Education', 'Marital_Status', 'IncomeGroup',
           'Dependents', 'Country','Year_Customer', 
           'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
           'MntSweetProducts','MntGoldProds', 'TotalMnt',
           'NumDealsPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 
           'TotalPurchases',  'NumWebPurchases', 'NumWebVisitsMonth',
           'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
           'AcceptedCmp2', 'TotalCampaignsAcc', 'Response', 'Complain', 'Recency']

# Save the new columns to df_group
df_group = df_group[columns]

In [None]:
# Save dataframe to csv. The line below has been commented out so we don't save the csv file on Kaggle.
# df_group.to_csv(r'marketing_data_altered.csv', index=False)

After exporting the csv file, the data was loaded into a Tableau dashboard.

In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1614570325977' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ma&#47;MarketData_16145702817340&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='MarketData_16145702817340&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ma&#47;MarketData_16145702817340&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1614570325977');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='1500px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='1500px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='1127px';}                     var scriptElement = 
document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

### The average customer shopping is:
 - Age: 52
 - Income: $52,238
 - Dependents: 1
 - Country: Spain
 - Marital Status: Married
 - Education: Graduation

Tailor your marketing to target this audience and build on the success of the recent marketing campaign.