### Import libraries and read the data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("../input/marketing-data/marketing_data.csv")

In [None]:
df.head()

## **Column Details:**
1. **ID:** Customer's Unique Identifier
2. **Year_Birth:** Customer's Birth Year
3. **Education:** Customer's education level
4. **Marital_Status:** Customer's marital status
5. **Income:** Customer's yearly household income
6. **Kidhome:** Number of children in customer's household
7. **Teenhome:** Number of teenagers in customer's household
8. **Dt_Customer:** Date of customer's enrollment with the company
9. **Recency:** Number of days since customer's last purchase
10. **MntWines:** Amount spent on wine in the last 2 years
11. **MntFruits:** Amount spent on fruits in the last 2 years
12. **MntMeatProducts:** Amount spent on meat in the last 2 years
13. **MntFishProducts:** Amount spent on fish in the last 2 years
14. **MntSweetProducts:** Amount spent on sweets in the last 2 years
15. **MntGoldProds:** Amount spent on gold in the last 2 years
16. **NumDealsPurchases:** Number of purchases made with a discount
17. **NumWebPurchases:** Number of purchases made through the company's web site
18. **NumCatalogPurchases:** Number of purchases made using a catalogue
19. **NumStorePurchases:** Number of purchases made directly in stores
20. **NumWebVisitsMonth:** Number of visits to company's web site in the last month
21. **AcceptedCmp1:** 1 if customer accepted the offer in the 1st campaign, 0 otherwise (Target variable)
22. **AcceptedCmp2:** 1 if customer accepted the offer in the 2nd campaign, 0 otherwise (Target variable)
23. **AcceptedCmp3:** 1 if customer accepted the offer in the 3rd campaign, 0 otherwise (Target variable)
24. **AcceptedCmp4:** 1 if customer accepted the offer in the 4th campaign, 0 otherwise (Target variable)
25. **AcceptedCmp5:** 1 if customer accepted the offer in the 5th campaign, 0 otherwise (Target variable)
26. **Response:** 1 if customer accepted the offer in the last campaign, 0 otherwise (Target variable)
27. **Complain:** 1 if customer complained in the last 2 years, 0 otherwise
28. **Country:** Customer's location


## **Data Wrangling**

In [None]:
df.shape

We have 2240 rows in the dataset. 

In [None]:
df.head()

In [None]:
df.info()  #pandas

1. The column name " Income " has leading and trailing spaces, which will cause problems in further analysis, so we'll rename it.

2. Two columns have the wrong datatype. 
We need to convert the "**Income**" column to **float64** (a float rather than an integer type, because the column contains missing values) so that it can be used in calculations, and convert "**Dt_Customer**" to datetime.

In [None]:
df.rename(columns={' Income ':'Income'},inplace=True)

In [None]:
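df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format='%m/%d/%y')
# regex=False: "$" must be treated as a literal character, not as the regex end-of-string anchor
df["Income"] = df["Income"].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
df["Income"] = df["Income"].astype(float)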
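As a quick sanity check of the cleaning step, here is a minimal sketch on a toy Series (the values are invented for illustration, not taken from the dataset). It shows why treating `$` as a literal matters: in a regular expression `$` is the end-of-string anchor, so a regex-based replace of `"$"` would leave the dollar sign in place.

```python
import pandas as pd

# Toy currency strings (illustrative only); None stands in for a missing value
raw = pd.Series(["$84,835.00 ", "$57,091.00 ", None])

cleaned = pd.to_numeric(
    raw.str.replace("$", "", regex=False)   # remove the literal dollar sign
       .str.replace(",", "", regex=False)   # remove thousands separators
       .str.strip()                         # remove stray whitespace
)
print(cleaned)
```

Missing entries stay missing (`NaN`), which is why the column ends up as a float type.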

In [None]:
df.head()

In [None]:
df.nunique() 

We have data of 2240 unique Customers. No customer ID is repeated in the data. 

### Remove IDs

In [None]:
df.drop(['ID'],axis=1,inplace =True)

### Check duplicates if any

In [None]:
duplicate = df[df.duplicated(subset=None,keep='first')] 
  
print("Duplicate Rows :") 
  
# Print the resultant Dataframe 
duplicate.shape 
#duplicate

In [None]:
df= df.drop_duplicates() 

In [None]:
df.shape

no duplicates present in the data

### Handle missing values

In [None]:
df.isnull().sum()

The only missing values are 24 entries in the "**Income**" column.
So, we first check the skewness of the column.
If the data is symmetrical, we use the mean to impute the missing values; otherwise we use the median.

To check the skewness, let us plot a boxplot and a histogram. 
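Alongside the plots, skewness can also be checked numerically with `Series.skew()`. The sketch below uses synthetic log-normal values in place of the real `Income` column, and the 0.5 cutoff is only an assumed rule of thumb, not something prescribed by this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values, for illustration only
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.5, size=1000))

skewness = income.skew()   # close to 0 for symmetric data, positive for a right tail
print(f"skewness = {skewness:.2f}")

# Assumed rule of thumb: treat |skewness| > 0.5 as "not symmetric" and use the median
fill_value = income.median() if abs(skewness) > 0.5 else income.mean()
print(f"imputation value = {fill_value:.0f}")
```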

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.boxplot(data = df["Income"])
plt.subplot(1,2,2)
sns.histplot(df["Income"])

We can see that the distribution is right-skewed. It has many outliers on the right, so the mean is not a good imputation value because it is sensitive to outliers. We use the median instead.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df["Income"] = df["Income"].fillna(df["Income"].median())

### Divide the dataframe into 3 sub dataframes: categorical string, categorical numerical, Numerical

In [None]:
# split the df --> three sub dfs
# 1. categorical --> numerical
# 2. categorical --> string --> chi-square test
# 3. numerical --> box plot, histogram, scatter

In [None]:
# np.object was removed from NumPy; the builtin `object` is the replacement
df_cat = df.loc[:, df.dtypes == object]

In [None]:
df_cat.head()

In [None]:
df_cat.shape

In [None]:
cat_num = ['Kidhome', 'Teenhome', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Response', 'Complain']

df_cat_num = df[cat_num]




In [None]:
df_cat_num['Kidhome'].value_counts()

In [None]:
df_cat['Marital_Status'].value_counts()

In [None]:
num = ['Year_Birth','Income','Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth']
df_num = df[num]

In [None]:
df_num.head()

## Analysis of numerical continuous variables

In [None]:
df_num.describe()

## Univariate Analysis

### Outlier removal

Let us now check if there are any outliers present in the dataset. 

In [None]:
df_num = df_num.drop(['Dt_Customer'],axis=1)

In [None]:
df_num.columns


In [None]:
df_num.shape

In [None]:
sns.boxplot(data = df_num['MntWines'])

#### No need to remove these outliers, as the amount spent has no natural upper limit

In [None]:
sns.boxplot(data = df_num['MntSweetProducts'])

#### No need to remove these outliers, as the amount spent has no natural upper limit

In [None]:
sns.boxplot(data = df_num['MntGoldProds'])

#### No need to remove these outliers, as the amount spent has no natural upper limit

In [None]:
sns.boxplot(data = df_num['MntFishProducts'])

#### No need to remove these outliers, as the amount spent has no natural upper limit

In [None]:
sns.boxplot(data = df_num['MntFruits'])

#### No need to remove these outliers, as the amount spent has no natural upper limit

In [None]:
sns.boxplot(data = df_num['MntMeatProducts'])

#### No need to remove these outliers, as the amount spent has no natural upper limit

In [None]:
sns.boxplot(data = df_num['NumStorePurchases'])

In [None]:
sns.boxplot(data = df_num['NumWebPurchases'])

In [None]:
sns.boxplot(data = df_num['NumWebVisitsMonth'])

In [None]:
sns.boxplot(data = df_num['NumCatalogPurchases'])

In [None]:
sns.boxplot(data = df_num['NumDealsPurchases'])

In [None]:
sns.boxplot(data = df_num['Income'])

In [None]:
sns.boxplot(data = df_num['Year_Birth'])

#### We need to remove these outliers, as a birth year before 1900 is not plausible for a customer

In [None]:
sns.boxplot(data = df_num['Recency'])

In [None]:
Q1 = df_num['Year_Birth'].quantile(0.25)
Q3 = df_num['Year_Birth'].quantile(0.75)
IQR = Q3 - Q1
print(Q1 )
print(Q3 )

In [None]:
df_num.shape

In [None]:
df_num = df_num[~((df_num['Year_Birth'] < (Q1 - 1.5 * IQR)) |(df_num['Year_Birth'] > (Q3 + 1.5 * IQR)))]
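The 1.5 × IQR rule used above can be wrapped in a small helper so the same fences can be reused for other columns. This is a sketch only: `iqr_bounds` is a hypothetical helper name and the birth years are toy values.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy birth years with two implausible values (illustrative only)
years = pd.Series([1893, 1899, 1955, 1960, 1968, 1970, 1975, 1980, 1985, 1992])

lower, upper = iqr_bounds(years)
kept = years[years.between(lower, upper)]   # keep only values inside the fences
print(lower, upper, len(kept))
```

Note that the notebook applies this filter to `df_num` only, so `df` still contains the rows with implausible birth years.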

In [None]:
df_num.shape

In [None]:
df_num['Year_Birth'].shape

## Histogram

In [None]:
# DataFrame.hist creates its own figure, so no separate plt.figure() call is needed
df_num.hist(figsize = (15,20))
plt.tight_layout()

In [None]:
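# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df["Income"], kde=True)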

**Insight:** The store has a wide range of customers: some have a yearly income as high as about $700k, while most have less than $100k.

The majority of customers have a low yearly income, and only a few have an income above $100k.
This suggests that the store mainly serves lower-income customers rather than a wealthy or luxury clientele.

These extreme values would cause problems in further analysis, so we reduce their influence.
We use a log transformation for this: it does not remove the outliers, but it compresses the long right tail.


In [None]:
df["Income"] = np.log(df["Income"])

In [None]:
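# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df["Income"], kde=True)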
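To quantify what the log transformation does, one can compare the skewness before and after. The sketch below uses synthetic log-normal values rather than the real `Income` column. `np.log` requires strictly positive inputs; if zeros were possible, `np.log1p` would be a common alternative.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values, for illustration only
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.6, size=2000))

log_income = np.log(income)   # compresses the long right tail

print(f"skew before: {income.skew():.2f}")
print(f"skew after:  {log_income.skew():.2f}")
```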

## Bivariate analysis

### Correlation

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(df_num.corr(),annot=True)

1. **Income** has a high positive correlation with the **"NumPurchases"** columns and the **"Mnt"** columns. This represents the high-income cluster: people with a high income spend more and purchase more frequently. **Income** has a high negative correlation with **"NumWebVisitsMonth"**, suggesting that customers with a high income do not visit the website often.

2. **"Amount Spent on Wines"** has a high positive correlation with **"NumCatalogPurchases"** and **"NumStorePurchases"**. Similarly, **"Amount Spent on Meat products"** has a very high positive correlation with **"NumCatalogPurchases"**, suggesting that people generally buy wine and meat products through catalogues. 

3. **"NumWebVisitsMonth"** shows no correlation with **"NumWebPurchases"**. Instead, it shows a mild correlation with **"NumDealsPurchases"**, which suggests that deals are an effective way of stimulating purchases on the website. 
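Reading the strongest relationships off a large heatmap is error-prone, so it can help to list the correlated pairs in order of strength. The sketch below builds a toy DataFrame whose columns are constructed to correlate (all values are synthetic) and ranks each pair once by absolute correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
income = rng.normal(50_000, 15_000, n)

# Synthetic columns, for illustration only
toy = pd.DataFrame({
    "Income": income,
    "MntWines": 0.01 * income + rng.normal(0, 50, n),                 # built to rise with income
    "NumWebVisitsMonth": 10 - income / 10_000 + rng.normal(0, 1, n),  # built to fall with income
    "Recency": rng.integers(0, 100, n),                               # unrelated noise
})

corr = toy.corr()
# Keep only the upper triangle so that each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

On the real data, replacing `toy` with `df_num` would give the same kind of ranked list for the columns shown in the heatmap.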


In [None]:
sns.pairplot(df_num)

### Analysis of each  categorical column

###  Education

In [None]:
sns.countplot(x=df["Education"])

"2n Cycle" (second cycle) corresponds to graduate or master's level studies. Third cycle corresponds to doctoral or PhD level studies. 
This naming is commonly used in European education systems.

**Insight:** Most customers have completed their Graduation, and only a few have studied further after that.

In [None]:
sns.barplot(x=df["Education"],y=df["Income"])

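This is a barplot with "Education" on the x-axis and "Income" on the y-axis. Note that "Income" was log-transformed above, so the bars show the mean of the log income.

**Insight:** Customers with a PhD have the highest average income compared to other customers.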

###  Marital Status

In [None]:
sns.countplot(x=df["Marital_Status"])

Married customers form the largest group for this store. 

Input from the client can help you understand and clean this kind of data better. For example, you could merge the YOLO, Alone and Single categories together. 

Domain knowledge is important for decisions like this. 

### Country

In [None]:
sns.countplot(x=df["Country"])

1. Spain has maximum customers.
2. Mean birth year for all countries is approximately the same.
3. Average Income of customers of all countries is approximately the same.
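Points 2 and 3 cannot be read from a count plot; a grouped summary is one way to check them. The sketch below runs on toy data (all values invented for illustration); on the real data, the same `groupby` call could be applied to `df`.

```python
import pandas as pd

# Toy customer records (illustrative only)
toy = pd.DataFrame({
    "Country": ["SP", "SP", "SP", "CA", "CA", "US"],
    "Year_Birth": [1970, 1965, 1980, 1972, 1968, 1975],
    "Income": [52000.0, 48000.0, 61000.0, 50000.0, 55000.0, 53000.0],
})

# One row per country: number of customers, mean birth year, mean income
summary = toy.groupby("Country").agg(
    customers=("Year_Birth", "size"),
    mean_birth_year=("Year_Birth", "mean"),
    mean_income=("Income", "mean"),
)
print(summary)
```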


### Additional Features

###  Products


In [None]:
Products = [col for col in df.columns if 'Mnt' in col]
Products_total = []
# Total amount spent on each product category (loop over the columns instead of a hard-coded range)
for col in Products:
  total = df[col].sum()
  print("{} = ${}".format(col, total))
  Products_total.append(total)

In [None]:
sns.barplot(x=Products_total, y=Products)

This clearly shows that the largest amount is spent on wine, making it the most popular product among customers. The next most popular category is meat products.

### Purchases

In [None]:
Purchases = ['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
Purchases_total = []
# Total number of purchases per channel (loop over the columns instead of a hard-coded range)
for col in Purchases:
  total = df[col].sum()
  print("{} = {}".format(col, total))
  Purchases_total.append(total)

In [None]:
Purchases

In [None]:
sns.barplot(x=Purchases_total, y=Purchases)

This shows that most purchases are made in stores, followed by purchases through the website. 



## Categorical numerical variables

###  Complain

In [None]:
sns.countplot(x=df["Complain"])

Very few customers have made a complaint; the majority had none. The company can therefore focus on the customers who did complain and resolve their issues. 


 

### AcceptedCmp1 to AcceptedCmp5 and Response

In [None]:
sns.countplot(x=df["AcceptedCmp1"])

In [None]:
sns.countplot(x=df["AcceptedCmp2"])

In [None]:
sns.countplot(x=df["AcceptedCmp3"])

In [None]:
sns.countplot(x=df["AcceptedCmp4"])

In [None]:
sns.countplot(x=df["AcceptedCmp5"])

In [None]:
sns.countplot(x=df["Response"])
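Comparing six separate count plots by eye is difficult. Because the campaign columns are 0/1 flags, their column means are the acceptance rates, which can be ranked directly. The sketch below uses toy flags (invented for illustration, so the ranking it prints says nothing about the real campaigns).

```python
import pandas as pd

campaign_cols = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3",
                 "AcceptedCmp4", "AcceptedCmp5", "Response"]

# Toy 0/1 acceptance flags for six customers (illustrative only)
toy = pd.DataFrame({
    "AcceptedCmp1": [1, 0, 0, 0, 0, 0],
    "AcceptedCmp2": [0, 0, 0, 0, 0, 0],
    "AcceptedCmp3": [1, 1, 0, 0, 0, 0],
    "AcceptedCmp4": [0, 1, 1, 0, 0, 0],
    "AcceptedCmp5": [0, 0, 0, 1, 0, 0],
    "Response":     [1, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of customers who accepted
rates = toy[campaign_cols].mean().sort_values(ascending=False)
print(rates)
```

Running `df[campaign_cols].mean()` on the real data would give the rates needed to say which campaign performed best and which worst.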

### Kidhome + teenhome

In [None]:
df2 = df.copy()
df2["AmountSpent"] = df[Products].sum(axis=1)
df2["PurchasesMade"] = df[Purchases].sum(axis=1)

In [None]:

df2["Dependents"] = df2["Kidhome"] + df2["Teenhome"]

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.boxplot(y=df2["AmountSpent"],x=df2["Dependents"])
plt.subplot(1,2,2)
sns.boxplot(y=df2["PurchasesMade"],x=df2["Dependents"])
plt.tight_layout()

This shows that customers with more dependents spend less than customers with fewer dependents. 
Customers with more dependents also make fewer purchases.
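The pattern in the boxplots can be backed up with a per-group statistic, for example the median spend for each number of dependents. A minimal sketch on toy data (values invented for illustration):

```python
import pandas as pd

# Toy households (illustrative only)
toy = pd.DataFrame({
    "Kidhome":     [0, 0, 1, 1, 2, 0],
    "Teenhome":    [0, 0, 0, 1, 1, 1],
    "AmountSpent": [1200, 900, 400, 150, 60, 500],
})
toy["Dependents"] = toy["Kidhome"] + toy["Teenhome"]

# Median is used because spending is skewed and the median is robust to outliers
median_spend = toy.groupby("Dependents")["AmountSpent"].median()
print(median_spend)
```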

### Feature Transformation

In [None]:
# pd.datetime was removed from pandas; pd.Timestamp.today() gives the current date
df2["Age"] = pd.Timestamp.today().year - df["Year_Birth"]

In [None]:
df2["Age_category"] = df2['Age'].apply(lambda x: 'Senior Citizen' if x >= 60 else 'Adult' if x > 25 else 'Youth')

In [None]:
sns.countplot(x=df2["Age_category"])

So most of the customers are in the 25 to 60 age group. 
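The same three age bands can also be expressed with `pd.cut`, which keeps the bin edges in one place. This is an alternative sketch, not the method used above; it assumes ages are whole numbers, so the right-closed bin edge 59 reproduces the `x >= 60` rule for seniors. The ages are toy values.

```python
import numpy as np
import pandas as pd

# Toy ages, including the boundary values 25, 59 and 60 (illustrative only)
ages = pd.Series([22, 25, 26, 45, 59, 60, 78])

# Bins are right-closed: (-inf, 25], (25, 59], (59, inf)
age_category = pd.cut(
    ages,
    bins=[-np.inf, 25, 59, np.inf],
    labels=["Youth", "Adult", "Senior Citizen"],
)
print(age_category.value_counts())
```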

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.boxplot(y=df2["AmountSpent"],x=df2["Age_category"])
plt.subplot(1,2,2)
sns.boxplot(y=df2["PurchasesMade"],x=df2["Age_category"])
plt.tight_layout()

Surprisingly, senior citizens make more purchases and spend more than adults. 

## Statistical Analysis

### What factors are significantly related to the number of store purchases?

In [None]:
df_num.corrwith(df_num.NumStorePurchases).sort_values()

As we can see, the number of store purchases is highly correlated with the amount spent on wine. This suggests that wine may be bought mostly in stores.

Insight: NumStorePurchases decreases as NumWebVisitsMonth increases. NumStorePurchases also increases with the amount spent on wine and with NumCatalogPurchases.

### Does Spain fare differently from the rest of the world in terms of total purchases?

In [None]:
df["AmountSpent"] = df[Products].sum(axis=1)
df["PurchasesMade"] = df[Purchases].sum(axis=1)

In [None]:
df.head()

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
df.groupby('Country')["PurchasesMade"].sum().sort_values().plot(kind='bar')
plt.title("PurchasesMade")
plt.subplot(1,2,2)
df.groupby('Country')["AmountSpent"].sum().sort_values().plot(kind='bar')
plt.title("AmountSpent")
plt.tight_layout()

The visualisation suggests that Spain leads in both total amount spent and total purchases made. However, totals also depend on how many customers each country has, so for a better analysis we will perform a statistical test on the averages. 

## Hypothesis test

Null Hypothesis: Spain's average number of purchases per customer is equal to that of the rest of the world. (Spain_avg = Rest_avg)

Alternative Hypothesis: Spain's average number of purchases per customer differs from that of the rest of the world.
(Spain_avg is not equal to Rest_avg)

In [None]:
avg_pur = df.groupby('Country')["PurchasesMade"].mean().sort_values()
avg_pur = pd.DataFrame(avg_pur).reset_index()
avg_pur

In [None]:
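# Compute both averages from the customer-level data, matching the samples used in the t-test
# (averaging the per-country means would weight every country equally, regardless of size)
spain_avg = df[df['Country']=='SP']['PurchasesMade'].mean()
rest_avg = df[df['Country']!='SP']['PurchasesMade'].mean()
print("Spain's average no of purchases = {}".format(spain_avg) )
print("Rest of the world's average no of purchases = {}".format(rest_avg) )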

In [None]:
from scipy.stats import ttest_ind
pval = ttest_ind(df[df['Country']=='SP']['PurchasesMade'], df[df['Country']!='SP']['PurchasesMade']).pvalue
print("t-test p-value: ", round(pval, 3))

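The p-value is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence that Spain's average number of purchases differs from that of the rest of the world.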
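By default `ttest_ind` assumes that both groups have equal variances. When the two groups differ in size and spread, Welch's t-test (`equal_var=False`) is a common, more cautious variant. The sketch below is an alternative to the test above, run on synthetic samples; `alpha` and the sample parameters are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic "PurchasesMade" samples with equal means but different spreads (illustrative only)
rng = np.random.default_rng(7)
spain = rng.normal(loc=15.0, scale=7.0, size=1000)
rest = rng.normal(loc=15.0, scale=8.0, size=1200)

# Welch's t-test: does not assume equal variances
result = ttest_ind(spain, rest, equal_var=False)

alpha = 0.05
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f} -> {decision}")
```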

## Chi Square test

## Is there a significant relationship between geographical region and success of the 4th campaign?

We will use a chi-square test to test this as both are categorical variables.

Null Hypothesis: There is no significant relationship between geographical region and success of the 4th campaign.
    
Alternative Hypothesis: There is a significant relationship between geographical region and success of the 4th campaign.

In [None]:
crosstab = pd.crosstab(df["Country"],df['AcceptedCmp4'])
crosstab

We can simply pass the crosstab to `chi2_contingency()` to conduct a chi-square test of independence. It returns the chi-square statistic first, followed by the p-value, then the degrees of freedom, and lastly the expected frequencies as an array. 

In [None]:
import scipy.stats as stats
chi_sq,p ,dof ,expected = stats.chi2_contingency(crosstab)

In [None]:
print("P-value for chi-square test is = {}".format(p))

Since the p-value is much greater than alpha=0.05, we fail to reject the null hypothesis. So we can conclude that there is no evidence of a significant relationship between geographical region and success of the 4th campaign.
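The chi-square approximation is usually considered reliable only when the expected frequencies are not too small (a common rule of thumb is at least 5 per cell). The `expected` array returned by `chi2_contingency` makes this easy to check. The sketch below uses a toy table in which every country has the same 10% acceptance rate; the counts are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: 10% acceptance in every country (illustrative only)
toy = pd.DataFrame({
    "Country": ["SP"] * 60 + ["CA"] * 30 + ["US"] * 10,
    "AcceptedCmp4": [1] * 6 + [0] * 54 + [1] * 3 + [0] * 27 + [1] * 1 + [0] * 9,
})

crosstab = pd.crosstab(toy["Country"], toy["AcceptedCmp4"])
chi_sq, p, dof, expected = chi2_contingency(crosstab)

# Count cells whose expected frequency falls below the rule-of-thumb threshold of 5
small_cells = int((expected < 5).sum())
print(f"chi2 = {chi_sq:.3f}, p = {p:.3f}, dof = {dof}, cells with expected < 5: {small_cells}")
```

If many cells fall below the threshold, merging small countries into one group before testing is a common remedy.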

## Conclusion

 #### Store visit or web visit, which brings more customers? : Store visit
 #### Which campaign performed best and which one worst? : cmp4 performed best and cmp2 worst
 #### Which age group purchased the most?: Senior citizens
 #### Which country has maximum customers? Can we focus on that itself?: Spain, but focusing only on Spain can't bring in more customers.
 #### Which product was sold the most in all campaigns?: Wine
 #### Is there any contribution of a particular country to make the particular campaign successful?: No
 #### Who purchased the most: People with no dependents
