# Exploratory Data Analysis of Marketing Data
## Goal
You're a marketing analyst, and the Chief Marketing Officer has told you that recent marketing campaigns have not been as effective as expected. You need to analyze the data set to understand this problem and propose data-driven solutions...

In [None]:
# Import the libraries used throughout the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_visual_analysis import VisualAnalysis
from PIL import Image

In [None]:
# Importing data using pandas library
data = pd.read_csv('marketing_data.csv')

In [None]:
# Show the first 5 rows of the dataset
data.head()

In [None]:
data.columns.values

In [None]:
# Total rows and columns in our dataset
data.shape

In [None]:
# Descriptive statistics
data.describe()

In [None]:
# Data type of the columns
data.info()

# Section 01: Exploratory Data Analysis

In [None]:
# Checking null values
data.isnull().sum()

### Are there any null values?
Yes. The 'Income' column contains null values.


In [None]:
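# Convert the Income column from string to float.
# regex=False treats ',' and '$' as literal characters ('$' is an end-of-string anchor in a regex).
data[' Income '] = data[' Income '].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)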

In [None]:
#Checking the income column after the modification
data[' Income ']

In [None]:
# Making boxplot to find out the outliers in Income columns 
plt.figure(figsize = (10,15))
sns.boxplot(y = data[' Income '])
plt.ylim(0, 350000)

### Are there any outliers?
Yes. The 'Income' column contains outliers: its mean differs noticeably from its median, and the box plot above shows points well beyond the whiskers.

In [None]:
figure = data[' Income '].hist(bins=25)

In [None]:
IQR = data[' Income '].quantile(0.75) - data[' Income '].quantile(0.25)

In [None]:
# IQR rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are treated as outliers
lower_bound = data[' Income '].quantile(0.25) - (IQR*1.5)
upper_bound = data[' Income '].quantile(0.75) + (IQR*1.5)

In [None]:
print(upper_bound)
print(lower_bound)

In [None]:
data[' Income '].describe()

### Handling null values and Outliers 

In [None]:
# Dropping the columns which will be of no use
data1 = data.drop(['ID', 'Dt_Customer'], axis=1)

Fill the null values in the 'Income' column with the median value, because the box plot shows many outliers and the median is less sensitive to them than the mean.
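The choice of the median can be illustrated with a small sketch. The incomes below are made up for illustration and are not taken from `marketing_data.csv`; they only show how a single extreme value pulls the mean much further than the median.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (not from marketing_data.csv): one extreme value and one missing value.
income = pd.Series([30000, 42000, 51000, 58000, 666666, np.nan])

mean_income = income.mean()      # pulled upward by the extreme value
median_income = income.median()  # middle of the observed values

# Fill the missing entry with the median, as the notebook does for ' Income '.
filled = income.fillna(median_income)

print(f"mean = {mean_income:.1f}, median = {median_income:.1f}")
print(filled.tolist())
```

Here the mean lands far above four of the five observed incomes, while the median stays in the middle of them, which is why it is the safer fill value when outliers are present.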

In [None]:
# Substituting null values
data1[' Income '] = data1[' Income '].fillna(data1[' Income '].median())

In [None]:
# Again checking the total null values
data1.isnull().sum()

In [None]:
data1.head()

In [None]:
# Replace Income values above the outlier threshold (118350.5) with the column median
data1.loc[data1[' Income ']>118350.5, ' Income '] = data1[' Income '].median()

### Looking for patterns and anomalies in the data using the sweetviz library

In [None]:
import sweetviz as sv

In [None]:
report = sv.analyze(data1)

In [None]:
report.show_html('visualization.html')

In [None]:
VisualAnalysis(data1)

In [None]:
# Heatmap of the correlations between the columns
plt.figure(figsize=(20,10))
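# Restrict to numeric columns; corr() raises an error on string columns in recent pandas versions
sns.heatmap(data1.select_dtypes('number').corr(), annot=True)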
plt.show()

In [None]:
data.columns.values

# Section 02: Statistical Analysis


## 1. What factors are significantly related to the number of store purchases?

Factors related to the number of store purchases:
- Income
- Kidhome
- MntWines
- MntFruits 
- MntMeatProducts
- MntFishProducts
- MntSweetProducts
- MntGoldProds
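
A list like this can be derived by ranking each numeric column by its correlation with `NumStorePurchases`. The sketch below uses simulated data as a stand-in for `data1`; the helper `rank_correlations` and the simulated values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data1; the values are simulated, not real customer data.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(50000, 15000, n)
demo = pd.DataFrame({
    "Income": income,
    "Kidhome": rng.integers(0, 3, n),
    "MntWines": income * 0.006 + rng.normal(0, 50, n),
    "NumStorePurchases": income * 0.0001 + rng.normal(0, 1, n),
    "Education": rng.choice(["Graduation", "PhD"], n),  # non-numeric, excluded below
})

def rank_correlations(df, target):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranked = rank_correlations(demo, "NumStorePurchases")
print(ranked.round(2))
```

A ranking of this kind shows only the strength of linear association. Judging significance would still need a test, for example the p-values of a regression on `NumStorePurchases`.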

## 2. Does US fare significantly better than the Rest of the World in terms of total purchases? 

The US does not fare better than the rest of the world in terms of total purchases.
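
One way to check a conclusion like this is a permutation test on the difference in mean total purchases. The sketch below is illustrative only: the `Country` column, the `'US'` country code, the definition of total purchases as the sum of web, catalog, and store purchases, and the sample values are all assumptions, since the notebook does not show them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 'Country' and the 'US' code are assumptions about the data set.
demo = pd.DataFrame({
    "Country": ["US", "US", "US", "SP", "SP", "CA", "CA", "SP"],
    "NumWebPurchases": [4, 6, 5, 4, 3, 5, 6, 4],
    "NumCatalogPurchases": [2, 3, 1, 2, 2, 3, 1, 2],
    "NumStorePurchases": [5, 7, 6, 6, 5, 7, 6, 5],
})

# Assumed definition of total purchases: web + catalog + store.
purchase_cols = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]
demo["TotalPurchases"] = demo[purchase_cols].sum(axis=1)

def permutation_pvalue(values, is_group, n_perm=5000, seed=0):
    """One-sided p-value for mean(group) > mean(rest), obtained by shuffling group labels."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_group = np.asarray(is_group, dtype=bool)
    observed = values[is_group].mean() - values[~is_group].mean()
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(is_group)
        diff = values[shuffled].mean() - values[~shuffled].mean()
        hits += int(diff >= observed)
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

observed, p_value = permutation_pvalue(demo["TotalPurchases"], demo["Country"] == "US")
print(f"US minus rest: {observed:.2f} purchases, one-sided p = {p_value:.3f}")
```

A large p-value would be consistent with the statement above. A small one would suggest that the US does fare better.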

## 3. Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test.

In the correlation matrix, the Pearson correlations between 'MntGoldProds' and the number of purchases are all below 0.5. On this basis, spending more on gold is not associated with a clear increase in the number of store purchases.
So the supervisor's statement is not supported.
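
The supervisor's claim can also be put to a direct two-group test. As a minimal sketch, assuming made-up values in place of the real data, the code below splits customers at the mean of `MntGoldProds` and compares their `NumStorePurchases` with Welch's t-test from `scipy.stats`. The choice of Welch's t-test is an assumption for illustration; a rank-based test such as Mann-Whitney U would be a reasonable alternative for skewed counts.

```python
import pandas as pd
from scipy import stats

# Hypothetical values, not taken from marketing_data.csv.
demo = pd.DataFrame({
    "MntGoldProds": [5, 12, 80, 150, 20, 95, 8, 60, 200, 15],
    "NumStorePurchases": [3, 4, 8, 10, 5, 9, 2, 7, 12, 4],
})

# Split customers into above-average and other gold spenders.
above_avg = demo["MntGoldProds"] > demo["MntGoldProds"].mean()
high = demo.loc[above_avg, "NumStorePurchases"]
low = demo.loc[~above_avg, "NumStorePurchases"]

# Welch's t-test (unequal variances), one-sided: do high spenders make more store purchases?
t_stat, p_value = stats.ttest_ind(high, low, equal_var=False, alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4f}")
```

On the real data, a p-value below the chosen significance level would support the supervisor's statement, and a larger one would not.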

## 4. Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)

Factors related to the amount spent on fish:
- Income
- MntFruits
- MntMeatProducts
- MntSweetProducts
- NumCatalogPurchases
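
The "Married PhD" part of the question calls for an interaction variable. A minimal sketch follows, assuming the data set has `Education` and `Marital_Status` columns with the values `'PhD'` and `'Married'`. Those names and the sample rows are assumptions, since the notebook does not display them.

```python
import pandas as pd

# Hypothetical rows; column names and category labels are assumptions.
demo = pd.DataFrame({
    "Education": ["PhD", "PhD", "Graduation", "Master", "PhD", "Graduation"],
    "Marital_Status": ["Married", "Single", "Married", "Married", "Married", "Single"],
    "MntFishProducts": [10, 40, 55, 30, 6, 70],
})

# Interaction indicator: 1 only when both conditions hold.
demo["Married_PhD"] = (
    (demo["Education"] == "PhD") & (demo["Marital_Status"] == "Married")
).astype(int)

# Compare average fish spending between the two groups.
group_means = demo.groupby("Married_PhD")["MntFishProducts"].mean()
print(group_means)
```

The indicator can then be correlated with `MntFishProducts` or added to a regression alongside the other factors to test whether the interaction is significant.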

## 5. Is there a significant relationship between geographical region and the success of a campaign?
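
A chi-square test of independence is one common way to approach this question. The sketch below is illustrative only: it assumes a `Country` column as the region variable, uses `Response` as the campaign outcome, and runs on made-up rows rather than the real data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical rows; 'Country' and its codes are assumptions about the data set.
demo = pd.DataFrame({
    "Country": ["US"] * 6 + ["SP"] * 6 + ["CA"] * 6,
    "Response": [1, 0, 0, 0, 1, 0,
                 0, 0, 1, 0, 0, 0,
                 1, 1, 0, 0, 0, 1],
})

# Contingency table: rows are regions, columns are rejected (0) / accepted (1).
table = pd.crosstab(demo["Country"], demo["Response"])

# Test whether acceptance is independent of region.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```

With a sample this small the expected counts are too low for the chi-square approximation to be reliable; the full data set would give a more meaningful result. The same test can be repeated for each `AcceptedCmp` column.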

# Section 03: Data Visualization

## 1. Which marketing campaign is most successful? 

In [None]:
# calculate success rate (percent accepted)
cam_success = pd.DataFrame(data[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']].mean()*100, 
                           columns=['Percent']).reset_index()

In [None]:
# plot
sns.barplot(x='Percent', y='index', data=cam_success.sort_values('Percent'), palette='Blues')
plt.xlabel('Accepted (%)')
plt.ylabel('Campaign')
plt.title('Marketing campaign success rate', size=16);

The last marketing campaign is the most successful.

## 2. What does the average customer look like for this company? 

In [None]:
data.describe()

Average customer for the company:
- Year_Birth = 1970
- Income = 51381
- Kidhome = 0
- Teenhome = 0
- Recency = 49

## 3. Which products are performing best?

In [None]:
data.columns.values

In [None]:
spending = pd.DataFrame(round(data[['MntWines',
       'MntFruits', 'MntMeatProducts', 'MntFishProducts',
       'MntSweetProducts', 'MntGoldProds']].mean(), 1), columns=['Average']).sort_values(by='Average').reset_index()

# plot
ax = sns.barplot(x='Average', y='index', data=spending, palette='Blues')
plt.ylabel('Amount spent on...')

## add text labels for each bar's value
for p,q in zip(ax.patches, spending['Average']):
    ax.text(x=q+40,
            y=p.get_y()+0.5,
            s=q,
            ha="center") ;

The average customer spent...
- $25-50 on Fruits, Sweets, Fish, or Gold products
- Over $160 on Meat products
- Over $300 on Wines
- Over $600 total

Products performing best:
- Wines
- Followed by meats
## 4. Which channels are underperforming?

In [None]:
channels = pd.DataFrame(round(data[['NumDealsPurchases',
       'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
       'NumWebVisitsMonth']].mean(), 1), columns=['Average']).sort_values(by='Average').reset_index()

# plot
ax = sns.barplot(x='Average', y='index', data=channels, palette='Blues')
plt.ylabel('Number of...')

## add text labels for each bar's value
for p,q in zip(ax.patches, channels['Average']):
    ax.text(x=q+0.8,
            y=p.get_y()+0.5,
            s=q, 
            ha="center") ;

Channels: The average customer...
- Accepted less than 1 advertising campaign
- Made 2 deals purchases, 2 catalog purchases, 4 web purchases, and 5 store purchases
- Averaged 14 total purchases
- Visited the website 5 times

Underperforming channels:
- Advertising campaigns
- Followed by deals and catalog
