# Business Analysis with Exploratory Data Analysis & Statistics

You're a marketing analyst and you've been told by the Chief Marketing Officer that recent marketing campaigns have not been as effective as they were expected to be. You need to analyze the data set to understand this problem and propose data-driven solutions.

This project will include and answer the following questions:

#### Section 01: Exploratory Data Analysis
* Are there any null values or outliers? How will you wrangle/handle them?
* Are there any variables that warrant transformations?
* Are there any useful variables that you can engineer with the given data?
* Do you notice any patterns or anomalies in the data? Can you plot them?

#### Section 02: Data Visualization

* Which marketing campaign is most successful?
* What does the average customer look like for this company?
* Which products are performing best?
* Which channels are underperforming?


#### Section 03: Statistical Analysis
* What factors are significantly related to the number of store purchases?
* Does the US fare significantly better than the rest of the world in terms of total purchases?
* Fish contains omega-3 fatty acids, which are good for the brain. Accordingly, is being a "Married PhD candidate" significantly related to the amount spent on fish?

#### Section 04: Conclusions and Recommendations






### Exploratory Data Analysis

In [None]:
#Import the libraries that we will be using
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import statsmodels.api as sm 
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/marketing-data/marketing_data.csv')

In [None]:
df.head()

Data Cleaning:


In [None]:
#Remove spaces from the header 
df.columns = df.columns.str.strip()
df.columns.tolist()

In [None]:
#Check for null values
df.isnull().sum()
#We can see that Income has 24 null values so we drop them 
df = df[df['Income'].notna()]
df.isnull().sum()

In [None]:
#Get a statistical description of our data
df.describe()

In [None]:
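#Convert Income to a numeric column
#First remove the '$' symbol (regex=False so '$' is treated as a literal character, not a regex anchor)
df['Income'] = df['Income'].str.replace('$', '', regex=False)
#Then remove the ',' thousands separator and convert to float
df['Income'] = df['Income'].str.replace(',', '', regex=False).astype('float')
#df.head()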
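
Whether `str.replace` reads `'$'` as a literal character or as a regular-expression anchor depends on the pandas version and on the `regex` argument. The sketch below passes `regex=False` explicitly so the behaviour does not depend on the default. The two sample values are made up and only assumed to resemble the raw Income strings.

```python
import pandas as pd

# Made-up values, assumed to look like the raw Income strings (illustrative only)
raw = pd.Series(["$84,835.00 ", "$57,091.00 "])

# regex=False treats '$' and ',' as literal characters rather than regex syntax
clean = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .str.strip()          # drop stray whitespace before converting
       .astype(float)
)
print(clean.tolist())
```

Chaining the steps keeps the cleaning in a single expression, which makes it easier to rerun on a fresh copy of the raw column.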

In [None]:
df['Total_Children'] = df.Kidhome + df.Teenhome

In [None]:
df['Age'] = 2014 - df.Year_Birth  #The data was collected in 2014 
bins = [18, 26, 40, 56, 70]
labels = ['18-26', '26-40', '40-56', '56-70']
df['Age_group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False, include_lowest = True)
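
With `right=False` each bin is closed on the left and open on the right, so the last bin is [56, 70) and any age of 70 or above receives no label (`NaN`). The sketch below shows this on a few made-up ages and compares it with a hypothetical variant that adds an open-ended `70+` bin. The extra bin and its label are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 25, 26, 55, 69, 70, 74])  # made-up ages

# Same bins as in the cell above: [18, 26), [26, 40), [40, 56), [56, 70)
bins = [18, 26, 40, 56, 70]
labels = ['18-26', '26-40', '40-56', '56-70']
original = pd.cut(ages, bins=bins, labels=labels, right=False)

# Hypothetical variant: an open-ended last bin so ages of 70 and above keep a label
bins_open = bins + [np.inf]
labels_open = labels + ['70+']
extended = pd.cut(ages, bins=bins_open, labels=labels_open, right=False)

print(pd.DataFrame({'age': ages, 'original': original, 'extended': extended}))
```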

In [None]:
#Create new columns: total amount spent across all product categories and total number of purchases across all channels
df['Total_Products'] = df.MntWines + df.MntFruits + df.MntMeatProducts + df.MntFishProducts + df.MntGoldProds + df.MntSweetProducts
df['Total_Purchases'] = (df['NumDealsPurchases'] + df['NumWebPurchases'] + df['NumStorePurchases'] +
                            df['NumCatalogPurchases'])

df.head()

In [None]:
#Create a new column with the total campaign acceptance
df['Total_CampAccepted'] = df.AcceptedCmp1 + df.AcceptedCmp2 + df.AcceptedCmp3 + df.AcceptedCmp4 + df.AcceptedCmp5


Plot the Income distribution and check for outliers: 

In [None]:
plt.figure(figsize=(8,5))
sns.set_theme()
#Pass the column as x explicitly; the axis shows income values, not counts
sns.boxplot(x=df.Income)
plt.xlabel('Income')

The Income column contains some outliers, so we remove the rows where Income is $200,000 or more.

In [None]:
df = df[df.Income < 200000]
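
The $200,000 cut-off above is a fixed threshold chosen by eye. A common data-driven alternative is the 1.5 × IQR rule, which is also the default whisker limit of a box plot. The sketch below applies it to made-up incomes; it is an alternative technique shown under stated assumptions, not the rule used in this notebook.

```python
import pandas as pd

# Made-up incomes with one extreme value (illustrative only)
income = pd.Series([32000, 41000, 48000, 52000, 58000, 67000, 75000, 666666])

q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # points above this are drawn as outliers by a default box plot

kept = income[income <= upper_fence]
print(f"upper fence: {upper_fence:.0f}, rows kept: {len(kept)} of {len(income)}")
```

A fixed threshold is easier to explain to stakeholders, while the IQR rule adapts to the data; either choice is worth stating explicitly in the report.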


Plot the Age distribution and check for outliers:

In [None]:
sns.set_theme()
#Pass the column as x explicitly; the axis shows ages, not counts
sns.boxplot(x=df.Age, palette = 'husl')
plt.xlabel('Age')

There are also some outliers with ages of 100 or more, so we remove them too.

In [None]:
df = df[df.Age < 100]

Plot the distribution of total purchases:

In [None]:
plt.figure(figsize=(8,5))
sns.set_theme()
#histplot replaces the deprecated distplot and draws a histogram of counts by default
sns.histplot(x=df.Total_Purchases)
plt.ylabel('Count')

### Data Visualization

Visualize the relationship between Income and Total_Products (total amount spent), colored by the amount spent on gold products:

In [None]:
plt.figure(figsize = (11,5))
sns.scatterplot(data=df, x='Income', y='Total_Products', hue = 'MntGoldProds')


Visualize the relationship between Income and Total_Products (total amount spent), colored by the amount spent on fish products:

In [None]:
plt.figure(figsize = (11,5))
sns.scatterplot(data=df, x='Income', y='Total_Products', hue = 'MntFishProducts')



In [None]:
#ads=df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response', 'Country']]
#ads.apply(pd.value_counts).plot(kind='bar', 
#                                     title='Accepted Campaigns', plot='Country')

In [None]:
plt.figure(figsize=(7,6))
sns.set_theme(style="whitegrid")
ax = sns.countplot(y= 'Marital_Status', data = df, orient = 'h', palette= 'husl')
plt.title('Marital Status')

In [None]:
plt.figure(figsize=(4,3))
sns.countplot(x='Age_group', data = df, palette= 'husl')


In [None]:
#plt.figure(figsize=(7,6))
sns.countplot(x= 'Total_Children', data = df, palette= 'husl')
plt.title('Number of Children')
plt.xlabel('Number of Children')

In [None]:
sns.countplot(x= 'Education', data = df, palette= 'husl')
plt.title('Education')

In [None]:
df.Recency.mean()

What does the average customer look like for this company?


* Education level: Graduation
* 1 child
* 40-56 years old
* Married
* Household income of about $50,000 on average
* Last purchase made about 49 days ago (mean Recency)

##### Which products are performing best?

In [None]:
f, ax = plt.subplots(2,3, figsize = (15,8))

#Pass each column as x explicitly so every panel is a horizontal box plot of the amount spent
x1 = sns.boxplot(x=df.MntWines, ax=ax[0, 0])
x2 = sns.boxplot(x=df.MntFruits, ax=ax[0, 1])
x3 = sns.boxplot(x=df.MntMeatProducts, ax=ax[0, 2])
x4 = sns.boxplot(x=df.MntFishProducts, ax=ax[1, 0])
x5 = sns.boxplot(x=df.MntGoldProds, ax=ax[1, 1])
x6 = sns.boxplot(x=df.MntSweetProducts, ax=ax[1, 2])

x1.set_title('Wines Sold Distribution')
x2.set_title('Fruits Sold Distribution')
x3.set_title('Meat Products Sold Distribution')
x4.set_title('Fish Products Sold Distribution')
x5.set_title('Gold Products Sold Distribution')
x6.set_title('Sweet Products Sold Distribution')
plt.tight_layout()


In [None]:
plt.figure(figsize = (8, 6))
#Total amount spent per product category
product = df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntGoldProds', 'MntSweetProducts']].agg(['sum'])

sns.barplot(x = product.T.index, y = product.T['sum'], palette= 'husl')
plt.gca().set_xticklabels(['Wines', 'Fruits', 'Meat', 'Fish', 'Gold', 'Sweet'])
plt.xlabel('Products')
plt.ylabel('Amount spent')
plt.show()

In [None]:
#Average amount spent per customer in each product category
print('Wine ' + str(df.MntWines.mean()))
print('Fruit ' + str(df.MntFruits.mean()))
print('Meat ' + str(df.MntMeatProducts.mean()))
print('Fish ' + str(df.MntFishProducts.mean()))
print('Gold ' + str(df.MntGoldProds.mean()))
print('Sweets ' + str(df.MntSweetProducts.mean()))

On average, a customer spent about:
* 305 dollars on wine,
* 26 dollars on fruits,
* 167 dollars on meat,
* 37 dollars on fish,
* 43 dollars on gold and
* 27 dollars on sweets

##### Which channels are performing better?

Stores have the highest number of purchases.

In [None]:
plt.figure(figsize = (8, 6))
#Total number of purchases per channel
purchase_source = df[['NumWebPurchases', 'NumStorePurchases', 'NumCatalogPurchases', 'NumDealsPurchases']].agg(['sum'])

sns.barplot(x = purchase_source.T.index, y = purchase_source.T['sum'], palette = 'husl')
plt.gca().set_xticklabels(['Web', 'Store', 'Catalog', 'Deals'])
plt.xlabel('Purchase Source')
plt.ylabel('Purchases')
plt.show()

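#### Which country has the most purchases per customer?

Mexico has the highest average number of purchases per customer. Note that `sns.barplot` plots the mean of `Total_Purchases` for each country (with a confidence interval), not the country's total.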

In [None]:
graph = sns.barplot(data = df, x='Country', y='Total_Purchases', palette= 'husl')
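
`sns.barplot` aggregates `y` with the mean by default, so a per-country bar chart of `Total_Purchases` shows the average per customer rather than the country total. The two rankings can differ when a country has only a few customers. A minimal sketch on made-up data (the country codes and values are hypothetical):

```python
import pandas as pd

# Made-up data: country "B" has one customer with many purchases
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'B'],
    'Total_Purchases': [10, 12, 14, 16, 20],
})

# Customer count, mean per customer and total per country side by side
summary = toy.groupby('Country')['Total_Purchases'].agg(['count', 'mean', 'sum'])
print(summary)
```

Here "B" leads on the mean while "A" leads on the sum, which is why it helps to report the customer count next to either figure.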


#### Which country has the most customers?

As can be seen from the plot, Spain has more customers than any other country.

In [None]:
sns.countplot(data = df, x='Country')

### Statistical Analysis

##### Which marketing campaign is most successful?


In [None]:
f, ax = plt.subplots(2,3, figsize = (15,8))

x1 = sns.barplot(data = df, x='Country', y='AcceptedCmp1', palette= 'husl', ax=ax[0, 0])
x2 = sns.barplot(data = df, x='Country', y='AcceptedCmp2', palette= 'husl', ax=ax[0, 1])
x3 = sns.barplot(data = df, x='Country', y='AcceptedCmp3', palette= 'husl', ax=ax[0,2])
x4 = sns.barplot(data = df, x='Country', y='AcceptedCmp4', palette= 'husl', ax=ax[1,0])
x5 = sns.barplot(data = df, x='Country', y='AcceptedCmp5', palette= 'husl', ax=ax[1,1])
x6 = sns.barplot(data = df, x='Country', y='Response', palette= 'husl', ax=ax[1,2])

x1.set_title('Cmp1')
x2.set_title('Cmp2')
x3.set_title('Cmp3')
x4.set_title('Cmp4')
x5.set_title('Cmp5')
x6.set_title('Response')
plt.tight_layout()

In [None]:
campaign = pd.DataFrame(df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']].agg(['mean']))

sns.barplot(x = campaign.T.index, y = campaign.T['mean'], palette = 'husl')
plt.xticks(rotation = 45)


Let's calculate each of the values for a better comparison:

In [None]:
 
#Acceptance rate per campaign, computed from the data instead of hard-coded counts
campaign_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']
for col in campaign_cols:
    print(df[col].value_counts())
    #Each column is 0/1, so its mean is the share of customers who accepted
    print(col + ' has ' + str(df[col].mean() * 100) + '% acceptance')

In the next step we build a correlation matrix to measure the linear relationship between the numeric variables:

In [None]:
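#Keep only numeric columns; recent pandas versions raise an error if text columns are passed to corr()
corr = df.select_dtypes(include='number').corr()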
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(13, 10))
cmap = sns.diverging_palette(250, 20, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot = True, annot_kws={'size':7})


##### Strongest relationships that we can notice from the correlation matrix
How the number of children relates to purchases:
* The number of deal purchases has a **positive** correlation of **0.44** with the _total_ number of children at home.
* The number of store purchases has a **negative** correlation of **-0.5** with the number of kids at home.
* Sales of meat products have a **negative** correlation of **-0.5** with the _total_ number of children at home.
* The number of catalog purchases has a **negative** correlation of **-0.5** with the number of kids at home.
* Total products have a **negative** correlation of **-0.5** with the _total_ number of children at home.

Correlations of Income with the purchase channels:
* Income - Deal Purchases **-0.11**
* Income - Web Purchases **0.46**
* Income - Catalog Purchases **0.7**
* Income - Store Purchases **0.63**
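
Reading values off a large heatmap is error-prone, so the strongest correlations with one variable can also be listed directly. The sketch below ranks correlations with `Income` by absolute value. It runs on synthetic data with made-up relationships, so the printed numbers are not the ones reported above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
income = rng.normal(50_000, 15_000, n)

# Synthetic stand-in for df: one column built to track income, one built to be unrelated
toy = pd.DataFrame({
    'Income': income,
    'NumCatalogPurchases': income / 10_000 + rng.normal(0, 1, n),
    'NumDealsPurchases': rng.normal(2, 1, n),
})

corr = toy.corr()
# Drop the self-correlation, then sort by absolute value so strong negatives rank high too
ranked = corr['Income'].drop('Income').sort_values(key=abs, ascending=False)
print(ranked.round(2))
```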





In the next step let's check the amount spent on fish by "Married PhD candidates" compared to other groups.  


In [None]:
df5 = df.groupby(['Education', 'Marital_Status']).MntFishProducts.sum().sort_values(ascending=False).reset_index()
print(df5)
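
"Married PhD" customers are not the group with the highest amount spent on fish; "Graduation - Married" customers are. Note that this compares group totals, which also depend on how many customers each group contains, so it does not by itself show whether the relationship is statistically significant.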

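
Group totals depend on group size, so one way to check significance is to compare the mean fish spend per customer of married PhD customers with everyone else. The sketch below uses a permutation test, chosen here only because it needs nothing beyond numpy. The data frame is a synthetic stand-in with random values, so its result says nothing about the real customers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for df with the three columns used here (random values)
toy = pd.DataFrame({
    'Education': rng.choice(['PhD', 'Graduation', 'Master'], size=400),
    'Marital_Status': rng.choice(['Married', 'Single', 'Together'], size=400),
    'MntFishProducts': rng.exponential(scale=40, size=400).round(),
})

is_group = ((toy['Education'] == 'PhD') & (toy['Marital_Status'] == 'Married')).to_numpy()
fish = toy['MntFishProducts'].to_numpy()

# Observed difference in mean spend per customer (group minus everyone else)
observed = fish[is_group].mean() - fish[~is_group].mean()

# Shuffle the group labels many times to see how large a difference arises by chance
n_perm = 5000
diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(is_group)
    diffs[i] = fish[shuffled].mean() - fish[~shuffled].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_value = (np.abs(diffs) >= abs(observed)).mean()
print(f"difference in means: {observed:.2f}, two-sided p-value: {p_value:.3f}")
```

On the real data, `toy` would be replaced by `df`; a small p-value would suggest the difference in mean spend is unlikely to be due to chance alone.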

Next, we perform a linear regression to identify variables that significantly affect the number of store purchases.
We choose variables that we think will be good predictors of the dependent variable, based on the correlation matrix plotted above. The features that have the highest correlation with the number of store purchases are: Kidhome, Total_Products, MntWines, NumCatalogPurchases, NumWebPurchases, Total_Purchases.

In [None]:
#X - our independent variables (predictors)
#Y - our dependent variable (target)
dfr = df
X = dfr[['Kidhome', 'Total_Products', 'MntWines', 'NumCatalogPurchases', 'NumWebPurchases', 'Total_Purchases']]
Y = dfr['NumStorePurchases']

In [None]:
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
model.summary()

* The model reports a high R-squared of 0.961. Because it is fitted without an intercept (no constant column was added with `sm.add_constant`), statsmodels reports the uncentered R-squared, which is typically higher than the usual centered value, so the 96.1% figure should be read with caution.
* When the number of kids at home increases by 1, the number of store purchases decreases by 0.678, holding the other predictors constant. This is plausible because it may be difficult to shop in stores with children.
* When the number of catalogue purchases increases by 1, the number of store purchases decreases by 0.7515.
* When the number of web purchases increases by 1, the number of store purchases decreases by 0.7232.
* When the number of total purchases increases by 1, the number of store purchases increases by 0.6713. This is expected because total purchases are the sum of store purchases and the other channels, so this predictor already contains the target and inflates the fit.
* Total_Products and MntWines are not significant.
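
For a model fitted without a constant column, statsmodels reports the uncentered R-squared, which compares the residuals with the raw sum of squares of the target rather than with its variance around the mean. The numpy sketch below, on synthetic data, shows how far the two versions can diverge when the target has a mean well above zero. In statsmodels the intercept would be added with `sm.add_constant(X)`; the numbers here are illustrative and unrelated to the marketing data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(1, 5, n)
y = 10 + 0.5 * x + rng.normal(0, 2, n)   # large mean, weak dependence on x

# Fit without an intercept and compute the uncentered R-squared
X = x.reshape(-1, 1)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2_uncentered = 1 - (resid @ resid) / (y @ y)

# Fit with an intercept column and compute the usual centered R-squared
Xc = np.column_stack([np.ones(n), x])
beta_c, *_ = np.linalg.lstsq(Xc, y, rcond=None)
resid_c = y - Xc @ beta_c
y_dev = y - y.mean()
r2_centered = 1 - (resid_c @ resid_c) / (y_dev @ y_dev)

print(f"uncentered R2 (no intercept): {r2_uncentered:.3f}")
print(f"centered R2 (with intercept): {r2_centered:.3f}")
```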

### Conclusions and Recommendations

The country with the highest campaign acceptance rate is Mexico, in the most recent marketing campaign (Response). Management might want to follow the same approach in the other campaigns too.

The marketing campaign with the lowest acceptance is Campaign 2, so its approach is best avoided in the future.

We recommend reconsidering how parents are represented in the target population, because the number of children at home has a relatively strong correlation with the number of store purchases, sales of meat products, the number of catalogue purchases and total products.

The best-performing products are wines, followed by meat, so the next campaigns should focus on the products that are selling less.

The best-performing channels are stores, followed by web, while catalogue and deal purchases are underperforming.

Income has a correlation of 0.7 with catalogue purchases, so the marketing team should take this into account when advertising.




##### Sources and Acknowledgement:
This data set was provided to students for their final project in order to test their statistical analysis skills as part of an MSc in Business Analytics.
Thank you Dr. Omar Romero-Hernandez for providing this data set for your students.
You can find it <a href="https://www.kaggle.com/jackdaoud/marketing-data" target="_blank">here.</a>