# Section 01: Exploratory Data Analysis
1. [Are there any null values or outliers? How will you wrangle/handle them?](#1)
1. Are there any variables that warrant transformations?
1. [Are there any useful variables that you can engineer with the given data?](#2)
1. [Do you notice any patterns or anomalies in the data? Can you plot them?](#3)


# Section 02: Statistical Analysis
Please run statistical tests in the form of regressions to answer these questions & propose data-driven action recommendations to your CMO. Make sure to interpret your results with non-statistical jargon so your CMO can understand your findings.

1. [What factors are significantly related to the number of store purchases?](#4)
2. [Does US fare significantly better than the Rest of the World in terms of total purchases?](#5)
3. [Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test](#6)
4. [Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)](#7)
5. [Is there a significant relationship between geographical region and success of a campaign?](#8)

In [None]:
# linear algebra
import numpy as np 

# data processing
import pandas as pd

# data visualization(for EDA)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
sns.set(color_codes=True)

#ignore warnings
import warnings
warnings.filterwarnings('ignore')
import datetime


# Importing sklearn methods
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn import svm
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# import labelencoder
from sklearn.preprocessing import LabelEncoder

# the spearman's correlation between two variables
from scipy.stats import spearmanr



In [None]:
df=pd.read_csv('../input/marketing-data/marketing_data.csv')
df.shape

**Cleaning Data**
1. Convert the Income column to a numeric type.
1. Dt_Customer is stored as a string, so convert it to a date type.


In [None]:
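df.rename({' Income ':'Income'}, axis=1, inplace=True)
# Strip '$' and ',' as literal text (regex=False), then cast to float
df['Income'] = df['Income'].str.replace('$','', regex=False).str.replace(',','', regex=False).astype(float)
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])  # string -> datetime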


<a id="1"></a> <br>

### **1. Are there any null values or outliers? How will you wrangle/handle them?**

In [None]:
df.head(3)

In [None]:
#null values
sns.heatmap(df.isnull(),yticklabels=False,cmap='YlOrRd');

As the plot above shows, the Income column contains null values.

In [None]:
# Income has 24 null values, so drop those rows
df = df[df['Income'].notna()]
# List any columns that still contain nulls
df.columns[df.isnull().any()].tolist()
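
A heatmap shows where the nulls are but not how many there are. As a minimal sketch on a hypothetical toy frame (the values below are made up for illustration, not taken from the marketing data), the per-column counts can be read directly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the marketing data
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, np.nan],
    "Kidhome": [0, 1, 0, 1],
})

# Count missing values per column and keep only the columns that have any
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)

# Drop the rows with a missing Income, mirroring the step above
toy_clean = toy.dropna(subset=["Income"])
print(len(toy_clean))
```

Dropping is reasonable when the missing share is small; imputing with the median would be an alternative if more rows were affected.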

### Outliers & Anomalies
The box plots below show that several features contain outliers, but the extreme values in Income and Year_Birth look like data entry errors.

In [None]:
# Keep the numeric columns, excluding the ID and the binary flag columns
df_num = df.drop(columns=['ID', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response', 'Complain']).select_dtypes(include = ['float64', 'int64'])

df_num.plot(subplots=True, layout=(4,4), kind='box', figsize=(16,18), patch_artist=True,color="Green" )
plt.subplots_adjust(wspace=0.5);

### Numeric Data Distribution

In [None]:
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8,color="Red");


## Handling Outliers

In [None]:
from scipy import stats
# Q-Q plot of Income against a normal distribution
stats.probplot(df['Income'], plot=plt);

The Q-Q plot above shows one customer with an income above 600,000, which is an anomaly. Since it is a single record, I'll simply delete it.

In [None]:
stats.probplot(df['Year_Birth'], plot=plt);

Finding:
The Year_Birth column contains three anomalous values, so we'll simply drop those records.

In [None]:
df.Marital_Status.value_counts()

We can see that Marital_Status contains unusual categories (Alone, Absurd, YOLO) that together account for only seven records. Therefore, we will simply exclude these records from our data.

In [None]:
df = df[~df['Marital_Status'].isin(['Absurd', 'Alone', 'YOLO'])]

In [None]:
df = df[df['Year_Birth'] > 1910].reset_index(drop=True)
df = df[df['Income'] < 600000].reset_index(drop=True)
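
The thresholds above were read off the plots by eye. A more repeatable option is the 1.5 × IQR rule that box plots use for their whiskers. The sketch below is an illustration only: the helper name `iqr_outliers` and the toy incomes are hypothetical and not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical incomes with one extreme value
income = pd.Series([30000, 42000, 51000, 58000, 64000, 72000, 666666], name="Income")
mask = iqr_outliers(income)
print(income[mask])
```

The rule flags candidates; whether a flagged value is an error or a genuine high earner still needs a judgment call.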

<a id="2"></a> <br>
### **Are there any useful variables that you can engineer with the given data?**

## Feature Engineering
From the given features we can derive several useful variables:

1. The total number of children at home is the sum of "Kidhome" and "Teenhome".
1. The total amount spent is the sum of all features containing the keyword "Mnt".
1. The total number of purchases is the sum of all features containing the keyword "Purchases".
1. The total number of campaigns accepted is the sum of all features containing the keyword "Cmp", plus "Response".
1. From Dt_Customer we can derive the year the customer enrolled.
1. From Year_Birth we can derive Age.

In [None]:
#Total kids
df['Totalkids'] = df['Kidhome'] + df['Teenhome']

# Year the customer enrolled
df['YearCustomer'] = pd.DatetimeIndex(df['Dt_Customer']).year


# total amount spent
mnt_cols = [col for col in df.columns if 'Mnt' in col]
df['TotalMnt'] = df[mnt_cols].sum(axis=1)

# Total Purchases
purchases_cols = [col for col in df.columns if 'Purchases' in col]
df['TotalPurchases'] = df[purchases_cols].sum(axis=1)

# Total Campaigns Accepted
campaigns_cols = [col for col in df.columns if 'Cmp' in col] + ['Response'] 
df['TotalCampaignsAcc'] = df[campaigns_cols].sum(axis=1)

# Age (measured against the current year, so it changes when the notebook is re-run)
year=datetime.datetime.today().year
df['Age']=year-df['Year_Birth']

# Age group
bins= [18,39,59,90]
labels = ['Adult','Middle Age Adult','Senior Adult']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
df['AgeGroup'] = df['AgeGroup'].astype('object')
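
Because Age above is measured against today's year, the age groups shift every time the notebook is re-run. A minimal sketch of one alternative, assuming the latest enrollment year is a sensible reference point and that Dt_Customer uses a month/day/two-digit-year format (both are assumptions, as are the toy rows and the upper bin edge of 120):

```python
import pandas as pd

# Hypothetical toy rows; the date format is an assumption about the raw file
toy = pd.DataFrame({
    "Year_Birth": [1970, 1985, 1950],
    "Dt_Customer": ["6/16/14", "5/13/13", "11/19/12"],
})

# Parse the enrollment dates with an explicit format
enrolled = pd.to_datetime(toy["Dt_Customer"], format="%m/%d/%y")

# Use the most recent enrollment year as a fixed reference instead of today
reference_year = enrolled.dt.year.max()
toy["Age"] = reference_year - toy["Year_Birth"]

# Same group boundaries as above, with a wider last bin so no age falls outside
bins = [18, 39, 59, 120]
labels = ["Adult", "Middle Age Adult", "Senior Adult"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)
print(toy[["Age", "AgeGroup"]])
```

With a fixed reference year the findings stay reproducible, whichever year the notebook is executed in.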

<a id="3"></a> <br>
# Do you notice any patterns or anomalies in the data? Can you plot them?

Findings
1. Almost 50% of customers have a graduate-level education, and very few have only a primary level of education.
1. Married customers outnumber widowed and divorced customers.
1. A remarkably high share of customers is in Spain, while the shares in the United States and Montenegro are very small.
1. Customers aged 39 to 59 make up a much larger share than the other age groups.

In [None]:

f,ax=plt.subplots(2,2,figsize=(20,15))


df['Education'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0][0],shadow=True,legend=True)
ax[0][0].set_title('Level of Education',fontweight ="bold") 
ax[0][0].set_ylabel('')
df['Marital_Status'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0][1],shadow=True,legend=True)
ax[0][1].set_title('Marital status',fontweight ="bold") 
ax[0][1].set_ylabel('')
df['Country'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1][0],shadow=True,legend=True)
ax[1][0].set_title('Countries',fontweight ="bold") 
ax[1][0].set_ylabel('')
df['AgeGroup'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1][1],shadow=True,legend=True)
ax[1][1].set_title('Age Group',fontweight ="bold") 
ax[1][1].set_ylabel('');



# Total spending
Findings
* Customers with PhDs spend more on average than the other education groups.
* Average spending of divorced, single, and married customers is roughly equal, while widowed customers spend slightly more.
* Customers without children spend more than customers with children.
* Customers in Montenegro spend noticeably more on average than customers in other countries.

In [None]:
f,ax=plt.subplots(2,2,figsize=(16,8))
sns.barplot(x='Education', y='TotalMnt', data=df,ax=ax[0][0]);
ax[0][0].set_title(' Education vs Total spending',fontweight ="bold") 
ax[0][0].set_xlabel('')
sns.barplot(x='Marital_Status', y='TotalMnt', data=df,ax=ax[0][1]);
ax[0][1].set_title('Marital status vs Total spending',fontweight ="bold") 
ax[0][1].set_xlabel('')
sns.barplot(x='Totalkids', y='TotalMnt', data=df,ax=ax[1][0]);
ax[1][0].set_title('Number of kids vs Total spending',fontweight ="bold") 
ax[1][0].set_xlabel('')
sns.barplot(x='Country', y='TotalMnt', data=df,ax=ax[1][1]);
ax[1][1].set_title('Country vs Total spending',fontweight ="bold")
ax[1][1].set_xlabel('');

### **Number of purchases through each channel**
The plot shows that most purchases are made in store.

In [None]:
channels = ['NumWebPurchases', 'NumCatalogPurchases',  'NumStorePurchases']
data = df[channels].sum()
plt.figure(figsize=(10,5))
plt.title('Number of purchases through each channel')
x=sns.barplot(x=channels,y=data.values,palette='Set2')
x.set_xticklabels(channels, size=12)
plt.tight_layout();


### Total amount spent on each product

In [None]:
col_products = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

data = df[col_products].sum()
plt.figure(figsize=(15,5))
plt.title('Total amount spent on each product',fontweight ="bold")
x=sns.barplot(x=col_products,y=data.values,palette='Set2')
x.set_xticklabels(col_products, size=15)
plt.tight_layout()



In [None]:
plt.figure(figsize=(8,5))
plt.title('Income by age group',fontweight ="bold")
x=sns.barplot(data=df,x='AgeGroup',y='Income',palette='Set2')
plt.tight_layout()


### Purchases vs age group

In [None]:


Purchases = ['NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases']
dataset = df.groupby('AgeGroup')[Purchases].mean()

score_label = np.arange(0, 10, 1)
Adult_mean  = list(dataset.T['Adult'])
Middleage_mean  = list(dataset.T['Middle Age Adult'])
SeniorAdult_mean  = list(dataset.T['Senior Adult'])

# set width of bar
barWidth = 0.35

fig, ax = plt.subplots(figsize=(19,8))

# Set position of bar on X axis
r1 = np.arange(0,len(Purchases)*2,2)
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]


# Make the plot

Adult = ax.bar(r1, Adult_mean, width=barWidth, label='Adult')
Middleage = ax.bar(r2,Middleage_mean, width=barWidth, label='Middle Age Adult')
SeniorAdult= ax.bar(r3, SeniorAdult_mean,width=barWidth, label='Senior Adult')


# inserting x axis label (centered under the middle bar of each group)
ax.set_xticks([r + barWidth for r in r1])
ax.set_xticklabels(Purchases)

# inserting y axis label
ax.set_yticks(score_label)
ax.set_yticklabels(score_label)

# inserting legend
ax.legend()

plt.title('Purchases vs age group')


plt.show()

### **Amount spent per product vs age group**

In [None]:
Products = ['MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds']
dataset = df.groupby('AgeGroup')[Products].mean()

score_label = np.arange(0, 500, 50)
Adult_mean  = list(dataset.T['Adult'])
Middleage_mean  = list(dataset.T['Middle Age Adult'])
SeniorAdult_mean  = list(dataset.T['Senior Adult'])
# set width of bar
barWidth = 0.35

fig, ax = plt.subplots(figsize=(19,8))

# Set position of bar on X axis
r1 = np.arange(0,len(Products)*2,2)
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]


# Make the plot

Adult = ax.bar(r1, Adult_mean, width=barWidth, label='Adult')
Middleage = ax.bar(r2,Middleage_mean, width=barWidth, label='Middle Age Adult')
SeniorAdult= ax.bar(r3, SeniorAdult_mean,width=barWidth, label='Senior Adult')


# inserting x axis label (centered under the middle bar of each group)
ax.set_xticks([r + barWidth for r in r1])
ax.set_xticklabels(Products)

# inserting y axis label
ax.set_yticks(score_label)
ax.set_yticklabels(score_label)

# inserting legend
ax.legend()

plt.title('Amount spent per product vs age group')
plt.show()

### Total purchases vs Income

In [None]:
fig, ax = plt.subplots(figsize=(18,8))
sns.scatterplot(data=df,x='Income', y='TotalPurchases',ax=ax,hue='AgeGroup',style="AgeGroup",palette='dark')

### Income versus the amount spent on each product


In [None]:
f,ax=plt.subplots(3,2,figsize=(18,17))

sns.scatterplot(data=df, x='Income', y='MntWines', hue='AgeGroup',style="AgeGroup",ax=ax[0][0],palette="dark")
ax[0][0].set_title('Income vs Amount of wines purchase')
sns.scatterplot(data=df, x='Income', y='MntFruits', hue='AgeGroup',style="AgeGroup",ax=ax[0][1],palette="bright")
ax[0][1].set_title('Income vs Amount of Fruits purchase')
sns.scatterplot(data=df, x='Income', y='MntMeatProducts', hue='AgeGroup',style="AgeGroup",ax=ax[1][0],palette="bright")
ax[1][0].set_title('Income vs Amount of Meat purchase')
sns.scatterplot(data=df, x='Income', y='MntSweetProducts', hue='AgeGroup',style="AgeGroup",ax=ax[1][1],palette="bright")
ax[1][1].set_title('Income vs Amount of Sweet purchase')
sns.scatterplot(data=df, x='Income', y='MntGoldProds', hue='AgeGroup',style="AgeGroup",ax=ax[2][0], palette="bright")
ax[2][0].set_title('Income vs Amount of Gold purchase')
sns.scatterplot(data=df, x='Income', y='MntFishProducts', hue='AgeGroup',style="AgeGroup",ax=ax[2][1], palette="bright")
ax[2][1].set_title('Income vs Amount of Fish purchase')


<a id="4"></a> <br>
# Section 02: Statistical Analysis
### What factors are significantly related to the number of store purchases?
Let us plot a correlation heatmap to see how the numeric variables correlate with store purchases.

In [None]:
df_num = df.drop(columns=['ID']).select_dtypes(include = ['float64', 'int64'])


plt.figure(figsize=(25,14))


# Use the builtin bool; the np.bool alias is not available in every NumPy version
mask = np.triu(np.ones_like(df_num.corr(), dtype=bool))
heatmap = sns.heatmap(df_num.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=16);


### Correlation with NumStorePurchases
Let us check the correlation of numerical variables with NumStorePurchases.

In [None]:
corr_matrix = df_num.corr()
plot_data = corr_matrix["NumStorePurchases"].sort_values(ascending=True)
plt.figure(figsize=(12,6))
plot_data.plot.bar()
plt.title("Correlations with NumStorePurchases")
plt.show()

The chart shows how each numeric column correlates with 'NumStorePurchases'. Columns with a clear correlation (strongly positive or strongly negative) are useful for the prediction model, while those with a correlation near zero will have little effect on 'NumStorePurchases', so we can drop a few of them.

In [None]:

# Dropping weakly correlated features
Data=df.drop(columns=['Response', 'Complain','Recency','Teenhome'])
# Dropping uninformative features (from Data, so the drop above is kept)
Data=Data.drop(columns=['ID','Dt_Customer'])

Let us now look at how 'NumStorePurchases' varies across the categories of the categorical variables.

In [None]:
few_cat_variables = ['Education','Marital_Status','Country' ,'AgeGroup']

for col in few_cat_variables:
    sns.boxplot(x=col, y='NumStorePurchases', data=df)
    plt.show()

Now we convert the categorical variables to numeric codes with LabelEncoder so that the regression models can use them.

In [None]:
# Categorical boolean mask
categorical_feature_mask = Data.dtypes==object 
# filter categorical columns using mask and turn it into a list
categorical_cols =Data.columns[categorical_feature_mask].tolist()


# instantiate labelencoder object
le = LabelEncoder()
# apply le on categorical feature columns
Data[categorical_cols] =Data[categorical_cols].apply(lambda col: le.fit_transform(col))
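
LabelEncoder assigns integer codes in alphabetical order, which linear models read as a ranking even for nominal columns such as Country. One-hot encoding is a common alternative. This is a hedged sketch on hypothetical toy rows, not the encoding used in the models below:

```python
import pandas as pd

# Hypothetical toy rows with one nominal column
toy = pd.DataFrame({
    "Country": ["SP", "US", "ME", "SP"],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
})

# One indicator column per category; no artificial ordering is introduced
encoded = pd.get_dummies(toy, columns=["Country"], dtype=int)
print(encoded.columns.tolist())
```

For linear models, passing `drop_first=True` to `pd.get_dummies` removes one indicator per variable and avoids perfectly collinear columns.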

In [None]:
plt.figure(figsize = (7, 5))
# histplot replaces the deprecated distplot
sns.histplot(df['NumStorePurchases'], kde=True, stat='density', color='k')
plt.title('NumStorePurchases distribution');

# Regression Models Analysis & Prediction 
Now we take the cleaned data and fit several regression models from scikit-learn. We compare the models by their mean squared error on the test set.

In [None]:
# Separate the target. TotalPurchases is also dropped because it already
# includes NumStorePurchases and would leak the target into the features.
X = Data.drop(columns=['NumStorePurchases', 'TotalPurchases'])
y = Data['NumStorePurchases']

#Train, test split
x_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# print train and test set sizes
print("Train data points: ", len(x_train))
print("test data points: ", X_test.shape[0])

In [None]:
class Models(object):
    

    
    # Initialization 
    def __init__(self, x_train, X_test, y_train, y_test):
        # changing input as dataframe to list
        self.x_train = [x_train.iloc[i].tolist() for i in range(len(x_train))]
        self.X_test = [X_test.iloc[i].tolist() for i in range(len(X_test))]
        self.y_train = y_train.tolist()
        self.y_test = y_test.tolist()
    
    
    @staticmethod
    def print_info(mse):
        print("Mean Squared Error: ", mse)
        
        
    # Linear Regression 
    def linear_regression(self, x_train, X_test,  y_train, y_test):
        reg = linear_model.LinearRegression()
        reg.fit(self.x_train, self.y_train)
        y_pred_list = reg.predict(self.X_test)
        mse = mean_squared_error(self.y_test, y_pred_list)
        print("\nLinear Regression Model")
        self.print_info(mse)
        return  mse
        

    def random_forest(self, x_train, X_test,  y_train, y_test):
        rfr = RandomForestRegressor(n_estimators=8, max_depth=8, random_state=0, verbose=0)
     
        rfr.fit(self.x_train, self.y_train)
        y_pred_list = rfr.predict(self.X_test)
        mse = mean_squared_error(self.y_test, y_pred_list)
        print("\nRandom Forest Regressor")
        self.print_info(mse)
        return  mse
            
    # Lasso method 
    def lasso(self, x_train, X_test,  y_train, y_test):
        reg = linear_model.Lasso(alpha = 0.1)
        reg.fit(self.x_train, self.y_train)
        y_pred_list = reg.predict(self.X_test)
        mse = mean_squared_error(self.y_test, y_pred_list)

        print("\nLasso Regression Model")
        self.print_info(mse)
        return  mse
    
    # Gradient Boosting Regressor
    def GBR(self, x_train, X_test,  y_train, y_test):
        # 'squared_error' is the current name of the least-squares loss (formerly 'ls')
        gbr = GradientBoostingRegressor(n_estimators=175, learning_rate=0.08, max_depth=3, random_state=0, loss='squared_error')
        gbr.fit(self.x_train, self.y_train)
        mse = mean_squared_error(self.y_test, gbr.predict(self.X_test))
        print('\nGradient Boosting Regressor')
        self.print_info(mse)
        return  mse


In [None]:
from types import FunctionType


methods = [x for x, y in Models.__dict__.items() if type(y) == FunctionType]
methods.remove('__init__')
# Now calling the all regression methods
mse_list = []
for model in methods:
    reg = Models(x_train, X_test, y_train, y_test)
    mse = getattr(reg, model)(x_train, X_test, y_train, y_test)

    mse_list.append(mse)


In [None]:
# Plot mean squared error for each model

plt.plot(mse_list, c='b')
plt.title('Comparison of Algorithms')
plt.ylabel('Mean Squared Error')
x = np.arange(len(methods))
plt.scatter(x, mse_list, c='r', marker="s")
plt.xticks(x, methods)
plt.show()

## Significant features

In [None]:
import eli5
from eli5.sklearn import PermutationImportance
reg = linear_model.LinearRegression().fit(x_train, y_train)
perm = PermutationImportance(reg, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist(), top=5)
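
If eli5 is not available in your environment, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`. The sketch below runs it on synthetic data (two hypothetical features, only the first of which drives the target) purely to illustrate the call. It is not a rerun of the analysis above.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: the first feature drives the target, the second is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)

# Shuffle each feature in turn and measure how much the score drops
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)
for name, mean, std in zip(["informative", "noise"], result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Permutation importance ranks features by predictive contribution; it does not give p-values, so a statement about statistical significance would still need a regression summary from a statistics package.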

<a id="5"></a> <br>
### **Does US fare significantly better than the Rest of the World in terms of total purchases?**
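* From the graph, we can see that the US ranks second to last in total purchases, so it does not fare better than the rest of the world. Note that the chart compares summed purchases, which largely reflect how many customers each country has.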

In [None]:
plt.figure(figsize=(12,6))
explode = (0, 0.1, 0.2, 0.3, 0.4, 0, 0.5, 0.6)
df.groupby('Country')['TotalPurchases'].sum().sort_values(ascending=False).plot(kind='pie',autopct = '%1.1f%%',explode = explode)
plt.title('Total purchases in each country');
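
A pie chart of summed purchases cannot show whether a difference is statistically significant. One way to test it is a one-sided Welch t-test on purchases per customer, US versus everyone else. The sketch below uses hypothetical toy numbers, so its p-value says nothing about the real data:

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: total purchases per customer
toy = pd.DataFrame({
    "Country": ["US", "US", "US", "US", "SP", "SP", "CA", "CA", "ME", "ME"],
    "TotalPurchases": [12, 18, 9, 22, 14, 20, 11, 17, 25, 8],
})

us = toy.loc[toy["Country"] == "US", "TotalPurchases"]
rest = toy.loc[toy["Country"] != "US", "TotalPurchases"]

# H0: the US mean is not greater than the rest-of-world mean
# Welch's test (equal_var=False) does not assume equal variances
t_stat, p_value = stats.ttest_ind(us, rest, equal_var=False, alternative="greater")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A p-value above 0.05 would mean we cannot claim that the US fares significantly better.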

<a id="6"></a> <br>
### **Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test**

In [None]:
# Flag customers whose gold spending is at or above the overall mean
gold_mean = df['MntGoldProds'].mean()
df['gold_amount'] = np.where(df['MntGoldProds'] >= gold_mean, 'Aboveavg', 'Belowavg')

From the boxplot we can see that customers who spent a below-average amount on gold made fewer in-store purchases, which is consistent with the supervisor's statement.


In [None]:
sns.boxplot(x='gold_amount',y='NumStorePurchases',data=df);
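
The question asks for a statistical test, and the boxplot alone is only a visual check. Since purchase counts are discrete and skewed, a one-sided Mann-Whitney U test is one reasonable choice (my assumption, not the only valid test). A minimal sketch on hypothetical toy data:

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: gold spending and in-store purchases per customer
toy = pd.DataFrame({
    "MntGoldProds": [5, 10, 12, 8, 15, 7, 9, 11, 90, 120, 80, 150, 95, 110, 130, 100],
    "NumStorePurchases": [2, 3, 4, 5, 3, 4, 2, 5, 8, 9, 10, 11, 12, 9, 10, 11],
})

# Split customers at the mean gold spending
above = toy["MntGoldProds"] > toy["MntGoldProds"].mean()
store_above = toy.loc[above, "NumStorePurchases"]
store_below = toy.loc[~above, "NumStorePurchases"]

# H1: above-average gold spenders tend to make more in-store purchases
u_stat, p_value = stats.mannwhitneyu(store_above, store_below, alternative="greater")
print(f"U = {u_stat}, p = {p_value:.5f}")
```

On the real data, a p-value below 0.05 would support the supervisor's claim about store purchases, though not the reasoning about customers being conservative.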

<a id="7"></a> <br>
### **Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)**
* From the graph, married PhD customers do not appear to have a notable relationship with the amount spent on fish.

In [None]:
plt.figure(figsize=(12,6))
df.groupby(["Education",'Marital_Status'])['MntFishProducts'].sum().sort_values(ascending=False).plot(kind='bar')
plt.title('Amount of fish purchase by people');

Based on the correlation chart, the factors most strongly related to the amount spent on fish are 'TotalMnt', 'MntSweetProducts', 'MntFruits', 'MntMeatProducts', 'NumCatalogPurchases', 'MntGoldProds' and 'Country_ME' (the Montenegro indicator, which exists only when Country is one-hot encoded).

In [None]:
corr_matrix = Data.corr()
plot_data = corr_matrix["MntFishProducts"].sort_values(ascending=True)
plt.figure(figsize=(12,6))
plot_data.plot.bar()
plt.title("Correlations with the MntFishProducts")
plt.show()
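
The hint in the question points to an interaction effect: "Married PhD" is the product of a married indicator and a PhD indicator. The sketch below fits such a model by least squares on noise-free toy data built from hypothetical coefficients (50, 5, -10, 20), just to show how the interaction column is built and read:

```python
import numpy as np
import pandas as pd

# Hypothetical toy customers covering all four Married/PhD combinations
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Married", "Single", "Single",
                       "Married", "Single", "Married", "Single"],
    "Education": ["PhD", "Master", "PhD", "Master",
                  "PhD", "PhD", "Master", "Master"],
})
married = (toy["Marital_Status"] == "Married").astype(float)
phd = (toy["Education"] == "PhD").astype(float)

# Noise-free toy target built from made-up coefficients
toy["MntFishProducts"] = 50 + 5 * married - 10 * phd + 20 * married * phd

# Design matrix: intercept, two main effects, and their interaction
X_int = np.column_stack([np.ones(len(toy)), married, phd, married * phd])
coef, *_ = np.linalg.lstsq(X_int, toy["MntFishProducts"].to_numpy(), rcond=None)
print(dict(zip(["intercept", "married", "phd", "married:phd"], coef.round(3))))
```

The `married:phd` coefficient is the extra spending of married PhDs beyond the two main effects. Judging whether it differs significantly from zero needs standard errors, which a package such as statsmodels reports.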

<a id="8"></a> <br>
### **Is there a significant relationship between geographical region and success of a campaign?**

From the analysis we can see that the success of a campaign has no relation to country.

In [None]:
campaign_cols=['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2']
alpha = 0.05

for col in campaign_cols:
    # Use the label-encoded Country from Data so both inputs are numeric
    coef, p = spearmanr(Data[col], Data['Country'])
    print('Correlation coefficient: %.3f' % coef)
    # interpret the significance
    if p > alpha:
        print('{0} and {1} are uncorrelated (fail to reject H0) p={2:.3f}\n'.format(col, 'Country', p))
    else:
        print('{0} and {1} are correlated (reject H0) p={2:.3f}\n'.format(col, 'Country', p))
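
Spearman's rank correlation assumes ordered values, and the integer codes for Country are arbitrary. For two categorical variables, a chi-square test of independence on the contingency table is a common alternative. A hedged sketch on hypothetical toy data (real use needs reasonably large expected counts in each cell):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: campaign acceptance by country
toy = pd.DataFrame({
    "Country": ["SP"] * 6 + ["US"] * 6 + ["CA"] * 6,
    "AcceptedCmp1": [1, 0, 0, 0, 1, 0,
                     0, 1, 0, 0, 0, 1,
                     0, 0, 1, 1, 0, 0],
})

# Rows are countries, columns are acceptance (0/1)
table = pd.crosstab(toy["Country"], toy["AcceptedCmp1"])

# H0: acceptance is independent of country
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
```

Each toy country has the same acceptance rate, so the test finds no association. On the real data the loop over the five campaign columns would call `chi2_contingency` once per campaign.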