# Project Description.

<b>Problem Statement :</b>
A response model can provide a significant boost to the efficiency of a marketing campaign by increasing responses or reducing expenses. The objective is to predict who will respond to an offer for a product or service.

<b>Objective :</b>
* We are required to model the data relating the various customer attributes to their responses to the marketing campaigns, and to report the key drivers of those responses.
* The predictive model will be used to predict the response of a new customer to a marketing campaign, enabling targeted marketing that should lead to a better response ratio for the campaigns and hence cut unnecessary costs.  

# Importing libraries
To begin, let's import the libraries we will need for this project.
(If we miss a library here, we can always import it later in the project.)

In [None]:
import sys

# library for vectorized numerical operations
import numpy as np 
# library for data analysis and manipulation
import pandas as pd 
# uncomment the options below so that pandas does not truncate the output when we want to see all of it
# pd.set_option('display.max_columns', 100)
# pd.set_option('display.max_rows', 1000)

# for visualisations
import seaborn as sns

%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt

# Loading the data.

In [None]:
data = pd.read_csv('../input/arketing-campaign/marketing_campaign.csv', sep = ';')
print('The dimension of our data is :',data.shape)

In [None]:
data.head()

In [None]:
# let's have a look at the features and their corresponding data types
data.dtypes

# Exploratory data analysis and data cleaning.

### Feature : ' ID '

In [None]:
data['ID'].value_counts().index.sort_values(ascending=True)

We can see that the feature ID holds the customer IDs, but the IDs are not consecutive: the data does not cover every customer, which is why the number of rows is not in line with the range of the customer IDs.  
An identifier like this says nothing about a customer's behaviour, so we will drop the feature, as it adds no value to our analysis.

In [None]:
data.drop('ID', axis=1, inplace=True)

### Feature : ' Year_Birth '

In [None]:
data['Year_Birth'].value_counts().index.sort_values(ascending=True)

From the output above we can observe that the feature 'Year_Birth' holds the customers' years of birth. There are gaps in this feature: the data does not contain every year from 1893 to 1996.

### Feature : ' Education '

In [None]:
data['Education'].value_counts().index.sort_values(ascending=True)

We can see that there are several classes in this feature, two of which mean the same thing but are represented differently: '2n Cycle' and 'Master'. So we will replace all occurrences of the class '2n Cycle' with 'Master', for a better value representation.

In [None]:
data['Education'] = data["Education"].replace('2n Cycle', "Master")

### Feature : ' Marital_Status '

In [None]:
data['Marital_Status'].value_counts()

This feature has 8 classes in our data. For a better value representation, we will apply the following transformations to some of the classes :
* 'Together' > replaced by 'Live_in',
* 'Alone', 'YOLO' and 'Absurd' > replaced by 'Single'.

In [None]:
data['Marital_Status'] = data['Marital_Status'].replace('Together', 'Live_in')
data['Marital_Status'] = data['Marital_Status'].replace(['YOLO', 'Alone', 'Absurd'], 'Single')

### Feature : ' Dt_Customer '

In [None]:
data['Dt_Customer'].value_counts().index.sort_values(ascending=True)

The feature Dt_Customer represents the date of each customer's enrolment with the company. The values are stored as strings (type str). To improve the value representation of this feature, and to enable meaningful feature engineering with it, we will convert the values to the datetime data type.

In [None]:
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format="%Y-%m-%d")

In [None]:
data['Dt_Customer'].value_counts().index.sort_values(ascending=True)

As we can see, the data type has been converted to datetime64[ns] without affecting the data itself.

# Descriptive analysis.

In [None]:
data.describe()

As we can see from above, the numerical features are on very different scales, which tells us that we will need to scale them before using them for modelling.

In [None]:
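# restricting to the numeric columns, since newer pandas versions raise an error on non-numeric columns here
data.select_dtypes(include='number').corr()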

In [None]:
# Visualization of the correlation between the numeric features.
corr_matrix = data.select_dtypes(include='number').corr()
sns.heatmap(corr_matrix)

We can see from above that some rows and columns of the heat map are blank, i.e. the correlations are null (excluding the diagonal values). This happens for features with 0 variance: the correlation coefficient divides by the standard deviation of each feature, so it is undefined for a feature that holds a single constant value.  
These constant features, along with the features with imbalanced classes, will be handled in a later stage.
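
The effect can be reproduced on a toy frame. This is only a sketch: the column names and values below are made up for illustration and are not taken from our dataset.

```python
import pandas as pd

# toy frame: 'const' mimics a feature that holds a single value
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "const": [3.0, 3.0, 3.0, 3.0],
})

toy_corr = toy.corr()
# the standard deviation of 'const' is 0, so its correlations are undefined (NaN)
print(toy_corr)

# the columns that would show up blank in a heat map
zero_variance_cols = toy.columns[toy.std() == 0].tolist()
print(zero_variance_cols)
```

Listing the zero-variance columns like this is a quick way to confirm which features are behind the blank cells.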

# Handling missing-values.

In [None]:
# Let's check for missing values in our data
data.isnull().sum()

In [None]:
# Let's visualize the fragmentation of the data feature-wise due to the presence of the missing values.
import missingno as msno
msno.matrix(data)

We can see that the data is of fairly high quality, as very few missing values are present.  
That said, we do see 24 missing values in the feature 'Income', and there are two candidate strategies for imputing them.  
For example, we can use a summary statistic of the feature (such as its mode) if we assume that the data is highly representative of the larger population, which means that any new data point will be in line with the distribution of the data we have.  
Otherwise, we can use **sklearn.impute.KNNImputer**, which uses the K-nearest-neighbors algorithm to find the rows most similar to the one with the missing value, and imputes the mean of their values.  
Because we cannot be sure how representative our data is of the larger population, we will be using KNNImputer in this case.
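
To see what KNNImputer does with and without helper features, here is a minimal sketch on made-up numbers (the arrays are hypothetical, not our data). As far as we understand the scikit-learn implementation, when no distance to any donor row can be computed, the imputer falls back to the mean of the column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# hypothetical toy data: column 0 is a helper feature, column 1 has one missing value
two_cols = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, np.nan],
                     [10.0, 100.0]])

# with a helper feature, the 2 nearest rows (helper values 2 and 1) donate their values
filled_two = KNNImputer(n_neighbors=2).fit_transform(two_cols)
print(filled_two[2, 1])

# with the target column alone, there is nothing to measure similarity on,
# so the missing value is filled with the mean of the observed values
one_col = two_cols[:, [1]]
filled_one = KNNImputer(n_neighbors=2).fit_transform(one_col)
print(filled_one[2, 0])
```

So the neighbours only matter when other informative columns are passed to the imputer together with the column being filled.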

In [None]:
# let's check the range of the feature Income
data['Income'].max() - data['Income'].min()

In [None]:
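from sklearn.impute import KNNImputer
# note: only 'Income' is passed in, so there are no other features to measure similarity on;
# in that case the imputer effectively fills in the mean of the column
data['Income'] = KNNImputer(n_neighbors=4).fit_transform(data[['Income']]).ravel()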

In [None]:
data['Income'].isnull().sum()

# Outliers detection and removal.

Removing the following features :
* categorical feature with only one class
* feature with only a singular value

In [None]:
# filtering the features with only one class or singular value
for col in data.columns :
    if len(data[col].value_counts()) == 1 :
        print(data[col].value_counts())

As we can see, two features in our dataset have constant values, hence we drop them.

In [None]:
data.drop(['Z_CostContact', 'Z_Revenue'], axis=1, inplace=True)
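
The same clean-up can be written without hard-coding the column names. A small sketch on a toy frame (the column 'Z_flag' and all values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Z_flag": [3, 3, 3, 3],          # constant, carries no information
    "Recency": [10, 25, 3, 40],
    "Complain": [0, 0, 1, 0],
})

# nunique() counts the distinct values per column
constant_cols = toy.columns[toy.nunique() == 1].tolist()
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols, toy_reduced.columns.tolist())
```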

In [None]:
data.columns

Let's explore the features with binary categorical classes to check for imbalanced data.

In [None]:
# filtering the features with binary classes to check for presence of imbalanced data 
for col in data.columns :
    if len(data[col].value_counts()) == 2 :
        print(data[col].value_counts())

As we can see from above, the features 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Complain', and 'Response' are binary categorical features with imbalanced classes.  
To prevent the minority classes of these features from being falsely detected as outliers, we will exclude them from the outlier detection and removal step.  
And to prevent the classification algorithms from simply predicting the majority class, we will balance the frequency of the binary classes later in the process.
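
A quick way to put a number on the imbalance is to look at the class shares. The sketch below uses a made-up binary series, not our actual 'Response' column.

```python
import pandas as pd

# hypothetical binary target: 17 rejections and 3 acceptances
response = pd.Series([0] * 17 + [1] * 3, name="Response")

# normalize=True returns the share of each class instead of the raw count
shares = response.value_counts(normalize=True)
imbalance_ratio = shares.max() / shares.min()
print(shares)
print(round(imbalance_ratio, 2))
```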

Mean and standard deviation of the features before removing outliers, to help us analyse the effect of the outlier removal on our data.

In [None]:
# selecting the rows by label rather than by position, since the row order of describe() can vary
outliers_effect = data.describe().loc[['mean', 'std']].rename(index={'mean':'initial_mean', 'std':'initial_SD'})
outliers_effect = outliers_effect.T
outliers_effect

Outlier detection using Local Outlier Factor.

In [None]:
# filtering the features to apply outlier detection and removal step. 
col_to_exclude = []
'''
Filtering the features with binary classes to exclude from outlier detection and removal step.
The reason we can use a simple filter like this is,
due to the fact that we know this filter screens out all the desired features,
as already demonstrated above.
'''
for col in data.columns :
    if len(data[col].value_counts()) == 2 :    
        col_to_exclude.append(col)
col_to_include = (data.select_dtypes(include=['float64', 'int64'])).drop(col_to_exclude, axis=1).columns
col_to_include

Let's figure out the optimum number of neighbors for the LocalOutlierFactor, for our dataset.

In [None]:
from sklearn.neighbors import LocalOutlierFactor
# we sweep over a range of odd numbers of neighbors; LOF compares local densities
# rather than taking a vote, so odd values are a convenience and not a requirement
neighbors_LOF = np.arange(1, 40, 2)
num_outliers = []
for n in neighbors_LOF :
    outlier_detector = LocalOutlierFactor(n_neighbors = n)
    outliers = pd.Series(outlier_detector.fit_predict(data[col_to_include]))
    # fit_predict labels outliers as -1; counting this way also works when none are found
    num_outliers.append((outliers == -1).sum())
    
sns.barplot(x = neighbors_LOF, y = num_outliers)

As we can see from above, the number of outliers detected by the LOF algorithm decreases as we increase the number of neighbors, which is in line with our expectations.  
From the plot, 21 nearest neighbors looks like the optimum number of neighbors for detecting outliers in this dataset, because using more than 21 nearest neighbors does not bring any significant change in the results.  

In [None]:
final_outlier_detector = LocalOutlierFactor(n_neighbors = 21)
outliers = pd.Series(final_outlier_detector.fit_predict(data[col_to_include]))
print('The number of outliers detected by LOF is >', (outliers == -1).sum())
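
To illustrate how the detector labels points, here is a minimal sketch on synthetic data with one planted outlier. The data is randomly generated for illustration and has nothing to do with our customers.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 100 points around the origin, plus one point planted far away from them
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
planted = np.array([[10.0, 10.0]])
points = np.vstack([cluster, planted])

# fit_predict returns 1 for inliers and -1 for outliers
labels = LocalOutlierFactor(n_neighbors=21).fit_predict(points)
n_flagged = int((labels == -1).sum())
print(n_flagged, labels[-1])
```

The planted point sits in a much sparser region than its neighbors, which is exactly what the local outlier factor measures.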

In [None]:
outliers_indices = [ x for x in range(len(outliers)) if outliers[x] == -1]
data.drop(data.index[outliers_indices], inplace = True)
print('The dimension of the data after removing outliers is', data.shape)

We can see the effect of removing the outliers from our data in terms of the mean and SD of the features before and after removal of the outliers.

In [None]:
# selecting the rows by label rather than by position
summary_after = data.describe()
outliers_effect["mean_after_outlier_removal"] = summary_after.loc['mean']
outliers_effect["SD_after_outlier_removal"] = summary_after.loc['std']
outliers_effect

# Feature engineering.

Creating the feature 'Customer_age' from the feature 'Year_Birth'.

In [None]:
# calculating the age as of the year 2020
data['Customer_age'] = 2020 - data['Year_Birth']
# dropping the 'Year_Birth' feature from the data as it is now redundant
data.drop('Year_Birth', axis=1, inplace=True)
data['Customer_age'].head()

Further exploring the feature 'Customer_age'.

In [None]:
data['Customer_age'].value_counts().index.sort_values(ascending = True)

We can see that the age of the customers in our dataset ranges from 25 to 80 years. That gives us an age gap of 55 years between our youngest and oldest targeted customer.  
In order to improve the signal-to-noise ratio in our data, we will discretize this feature into 11 equal-width bins, where each bin represents a range of about 5 years. 

In [None]:
data['Customer_age'] = pd.cut(data['Customer_age'], bins=11, labels=False, include_lowest=True)
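
As a sanity check of the binning, the sketch below applies the same call to a handful of made-up ages spanning 25 to 80 and also returns the bin edges.

```python
import numpy as np
import pandas as pd

# hypothetical ages covering the same range as our feature
ages = pd.Series([25, 29, 31, 52, 79, 80])

# retbins=True additionally returns the bin edges that pd.cut computed
codes, edges = pd.cut(ages, bins=11, labels=False, include_lowest=True, retbins=True)
print(codes.tolist())
print(np.round(edges, 2))
```

pandas widens the first bin slightly so that the minimum value is included, which is why only the remaining bins are exactly 5 years wide.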

The feature 'Kidhome' represents the number of small children in customer’s household.  
And the feature 'Teenhome' represents the number of teenagers in customer’s household.  
Creating the feature 'n_kids' from the above two features, and transforming the two features into percentages of the new feature 'n_kids'.

In [None]:
data['n_kids'] = data['Kidhome'] + data['Teenhome'] 
data['Kidhome'] = (data['Kidhome']/data['n_kids'])*100
data['Teenhome'] = (data['Teenhome']/data['n_kids'])*100

In [None]:
data[['n_kids', 'Kidhome', 'Teenhome']].head()

We can see that there are null values in the transformed features 'Kidhome' and 'Teenhome'. They occur in the rows where 'n_kids' is 0, because the transformation then divides 0 by 0, which yields NaN.  
Hence we will fill the NaN values in the features 'Kidhome' and 'Teenhome' with 0, which is also in line with the true data.  

In [None]:
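# assigning the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not propagate to the DataFrame
data['Kidhome'] = data['Kidhome'].fillna(0)
data['Teenhome'] = data['Teenhome'].fillna(0)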
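
The whole transformation can be traced on a toy household table. This is a sketch with made-up values; only the column names mirror ours.

```python
import pandas as pd

toy = pd.DataFrame({"Kidhome": [0, 1, 1, 2], "Teenhome": [0, 0, 1, 1]})
toy["n_kids"] = toy["Kidhome"] + toy["Teenhome"]

# the first row has no children at all, so the share is 0/0, which pandas returns as NaN
pct_kids_raw = toy["Kidhome"] / toy["n_kids"] * 100

# filling the undefined share with 0 and rounding to whole percentages
pct_kids = pct_kids_raw.fillna(0).round(0)
print(pct_kids.tolist())
```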

In [None]:
# renaming the feature 'Kidhome' and 'Teenhome'
data = data.rename(columns= {'Kidhome':'percent_kids', 'Teenhome':'percent_teenagers'})

In [None]:
data['percent_kids'].value_counts()

In [None]:
data['percent_teenagers'].value_counts()

In [None]:
# rounding the values in percent_kids and percent_teenagers to 0 decimals.
data['percent_kids'] = data['percent_kids'].round(0)
data['percent_teenagers'] = data['percent_teenagers'].round(0)

The feature 'Dt_Customer' represents the date of the customer’s enrolment with the company.  
Creating a new feature 'Days_with_company' representing the number of days the customer has been associated with the company, calculated from the customer's registration date with the company.

In [None]:
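# datetime is also used further below
from datetime import datetime

# a single vectorized subtraction covers every row, so no loop is needed;
# note that the result depends on the date on which the notebook is run
data['Days_with_company'] = pd.Timestamp.today().normalize() - data['Dt_Customer']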

In [None]:
print(data['Days_with_company'].dtype)
data['Days_with_company'].head()

We can see that the values of the feature 'Days_with_company' are represented as time deltas (timedelta64[ns]). We will convert them to integer numbers of days: doing so does not result in any information loss, and it gives a better value representation for the purposes of predictive modelling.

In [None]:
data['Days_with_company'] = pd.to_timedelta(data['Days_with_company']).dt.days
data['Days_with_company'].dtype
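
A minimal sketch of the same conversion with a fixed reference date, which keeps the result reproducible. The reference date and the enrolment dates are assumptions made up for the example.

```python
import pandas as pd

reference_date = pd.Timestamp("2020-01-01")   # hypothetical fixed reference date
enrolled = pd.to_datetime(pd.Series(["2019-12-31", "2019-01-01"]))

# subtracting a datetime column from a timestamp gives a timedelta column
tenure = reference_date - enrolled
# .dt.days extracts the whole number of days as integers
tenure_days = tenure.dt.days
print(tenure.dtype, tenure_days.tolist())
```

With today's date as the reference, the numbers shift by one for every day that passes between runs.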

In [None]:
# dropping the feature 'Dt_Customer' as it is not essential for our analysis
data.drop('Dt_Customer', axis=1, inplace=True)
data['Days_with_company'].head()

Further exploring the new feature 'Days_with_company'.

In [None]:
data['Days_with_company'].value_counts().index.sort_values(ascending=True)

We can see that the number of days the customers have been associated with the company ranges from 2167 days to 2864 days, that is a range of 697 days.  
In order to improve the signal-to-noise ratio in the data we can further discretize the feature based on its 4 quartiles.  
This would in turn also maintain balanced classes in the feature.  
`pd.qcut` numbers the bins in ascending order of the values, so bin 0 holds the customers with the fewest days with the company : 
* bin 0 > current customer
* bin 1 > new customer
* bin 2 > old customer
* bin 3 > legacy customer

In [None]:
data['Days_with_company'] = pd.qcut(data['Days_with_company'], q=4, labels=False, precision=0)
data['Days_with_company'].value_counts()
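
To check how the integer codes are assigned, here is a small sketch on made-up tenures.

```python
import pandas as pd

# hypothetical numbers of days with the company, already sorted for readability
days = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# labels=False returns the integer bin codes instead of interval labels
quartile = pd.qcut(days, q=4, labels=False)
print(quartile.tolist())
print(quartile.value_counts().sort_index().tolist())
```

The codes rise with the values, and each quartile receives the same number of rows.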

The feature 'Recency' represents the number of days since the customer's last purchase.  
Creating a new feature 'last_purchase_day_type', which will represent the type of day of the last purchase, i.e. 'Weekday' or 'Weekend'. The purchase date is reconstructed by counting back from the date on which the notebook is run, which assumes that 'Recency' is measured relative to that date.

In [None]:
from datetime import timedelta

days_list = []
for val in data['Recency'].values :
     days_list.append((datetime.today().date() - timedelta(days=int(val))).strftime('%A'))
data['last_purchase_day_type'] = days_list
data['last_purchase_day_type'].head()

In [None]:
data['last_purchase_day_type'] = data['last_purchase_day_type'].apply(lambda x: 'Weekend' if x in ['Saturday', 'Sunday'] else 'Weekday')

In [None]:
data['last_purchase_day_type'].head()
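
The two steps above can also be done in one vectorized pass. The sketch below uses a fixed, hypothetical reference date (a Monday) instead of today's date, and made-up recency values.

```python
import numpy as np
import pandas as pd

reference_date = pd.Timestamp("2020-01-06")   # hypothetical reference date; a Monday
recency = pd.Series([0, 1, 2, 3])

# counting back 'recency' days from the reference date
last_purchase = reference_date - pd.to_timedelta(recency, unit="D")
# dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend
day_type = pd.Series(np.where(last_purchase.dt.dayofweek >= 5, "Weekend", "Weekday"))
print(day_type.tolist())
```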

The features 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds' represent the amounts spent on wine, fruits, meat, fish, sweet and gold products, respectively, in the last 2 years.  
Creating a new feature 'Tot_amnt_spent', representing the total amount spent by a customer in the last 2 years.  
And transforming the above-mentioned original features to represent the percentage of the total amount that was spent on each of those products in the last 2 years.  

In [None]:
data['Tot_amnt_spent'] = (data['MntWines']+data['MntFruits']+data['MntMeatProducts']+data['MntFishProducts']+data['MntSweetProducts']+data['MntGoldProds'])

In [None]:
# converting each amount into its percentage of the total amount spent
mnt_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
for col in mnt_cols :
    data[col] = round((data[col]/data['Tot_amnt_spent'])*100, 2)
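
A compact sketch of the same idea on a toy spending table (made-up amounts, only three of the product columns), with a check that every row adds up to 100%.

```python
import pandas as pd

spend = pd.DataFrame({"MntWines": [60, 0], "MntFruits": [30, 5], "MntGoldProds": [10, 15]})
total = spend.sum(axis=1)

# div(..., axis=0) divides every row by that row's own total
share = spend.div(total, axis=0).mul(100).round(2)
print(share)
print(share.sum(axis=1).tolist())
```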

In [None]:
data['Income'].value_counts().index.sort_values(ascending=True)

As we can see from above, the income of the customers in our dataset ranges from 3502 (minimum) to 162397 (maximum).  
In order to improve the signal-to-noise ratio of our data we will discretize the feature into 5 bins according to its quantiles, where each bin represents the following :  
* bin 0 > represents low income
* bin 1 > represents below average income
* bin 2 > represents average income
* bin 3 > represents above average income
* bin 4 > represents high income

Using the quantiles to discretize the feature also ensures that the bins have roughly the same count, hence maintaining a balanced frequency of the discrete classes in the feature.

In [None]:
data['Income'] = pd.qcut(data['Income'], q=5, labels=False, precision=0)
data['Income'].value_counts()

# Data analysis utilising visualizations.

Visualization of Education and Income with Response of the customers.

In [None]:
sns.catplot(x='Response', hue='Income', col='Education', data=data, kind='count')

Observations : 
* Customers with an education level of 'Graduation' show the highest rejection levels for the last campaign, and 'Income' has no significant effect on their response.
* Customers with an educational background of 'PhD' or 'Master' show approximately similar levels of rejection of the last campaign. 
* Among the customers with a 'PhD', the lowest rejection levels are shown by the customers with a low income; the rejection levels grow gradually with income, peak at above-average income, and then decrease slightly for customers with a high income.
* For the customers with a 'Master', the level of income has no significant effect on the level of rejection.
* For the customers with a 'Basic' educational background, the rejection levels are relatively lower than for customers with other educational backgrounds; within this class of customers, however, the ones with a low income show a significantly higher level of rejection of the last campaign.

Visualization of Marital_Status and n_kids with Response.

In [None]:
sns.catplot(x='Response', hue='n_kids', col='Marital_Status', data=data, kind='count')

Observations : 
* We can clearly see that the majority of the rejection responses to the last campaign come from customers who are married, followed by customers who are in a live-in relationship and customers who are single.
* We can also see a clear pattern: customers with 1 kid, irrespective of their marital status, show the relatively highest rejection responses across the population.

Visualization of Customer_age, Days_with_company and Response.

In [None]:
sns.catplot(x='Response', hue='Customer_age', col='Days_with_company', data=data, kind='count')

Observations :
* Customers in 'Days_with_company' bins 0 and 1 show a similar, relatively higher level of rejection responses.
* Customers in bins 2 and 3 show a similar level of rejection responses that is relatively lower, but still high in absolute terms.
* A common pattern among all kinds of customers is that the highest levels of rejection responses are shown by the customers in the age bins 3 to 7.


Visualization of percent_kids, percent_teenagers with Response.

In [None]:
sns.catplot(x='Response', hue='percent_kids', col='percent_teenagers', data=data, kind='count')

Observations : 
* We can see high levels of rejection responses from customers :
    * having 0 teenagers or kids
    * having only kids and no teenagers
    * having equally kids and teenagers
    * having only teenagers
* In contrast, customers with about 70% kids and 30% teenagers show lower rejection responses to the last campaign.
* To summarize the above observations, customers with every combination of kids and teenagers in their family (except the roughly 70-30 kids-to-teenagers split) show similarly high rejection responses.

Visualization of NumDealsPurchases, NumWebPurchases with Response.

In [None]:
sns.catplot(x='Response', hue='NumDealsPurchases', col='NumWebPurchases', data=data, kind='count', legend_out = True, col_wrap=4)

We can observe from above that there is a high level of rejection responses from : 
* customers who made 1 to 4 purchases from the company website,
* and customers who made 1 to 4 purchases with discount deals.

Visualization of NumCatalogPurchases, NumStorePurchases with Response.

In [None]:
sns.catplot(x='Response', hue='NumCatalogPurchases', col='NumStorePurchases', data=data, kind='count', legend_out=True, col_wrap=5)

We can see from above that there is a high level of rejection responses from customers who made 2 to 4 purchases directly in the stores; among them, the customers who made 1 to 3 purchases from the catalog show relatively higher rejection responses.

Visualization of NumWebVisitsMonth, last_purchase_day_type with Response.

In [None]:
sns.catplot(x='Response', hue='NumWebVisitsMonth', col='last_purchase_day_type', data=data, kind='count')

We can see from above that the customers whose last purchase fell on a weekday and who visited the website 1 to 8 times a month show the relatively highest rejection responses; the same applies to weekends, but the magnitude of the rejection responses is relatively lower than on weekdays.

Visualization of AcceptedCmp1, Complain with Response.

In [None]:
sns.catplot(x='Response', hue='AcceptedCmp1', col='Complain', data=data, kind='count')

We can observe from above that, customers who do not complain are showing a higher level of rejection responses.

Visualization of AcceptedCmp2, Complain with Response.

In [None]:
sns.catplot(x='Response', hue='AcceptedCmp2', col='Complain', data=data, kind='count')

We can observe from above that, customers who do not complain are showing a higher level of rejection responses.

Visualization of AcceptedCmp3, Complain with Response.

In [None]:
sns.catplot(x='Response', hue='AcceptedCmp3', col='Complain', data=data, kind='count')

We can observe from above that, customers who do not complain are showing a higher level of rejection responses.

Visualization of AcceptedCmp4, Complain with Response.

In [None]:
sns.catplot(x='Response', hue='AcceptedCmp4', col='Complain', data=data, kind='count')

We can observe from above that customers who do not complain show a higher level of rejection responses; they also show acceptance responses, which are fewer than their rejections but more numerous than those of the customers who complained.

Visualization of AcceptedCmp5, Complain with Response.

In [None]:
sns.catplot(x='Response', hue='AcceptedCmp5', col='Complain', data=data, kind='count')

We can observe from above that customers who do not complain show a higher level of rejection responses; they also show acceptance responses, which are fewer than their rejections but more numerous than those of the customers who complained.

# Feature encoding.

In [None]:
# one hot encoding the nominal categorical feature 'last_purchase_day_type'
enc_feature_df = pd.get_dummies(data['last_purchase_day_type'],prefix='last_purchase_day_type', prefix_sep='_')
data = pd.concat([enc_feature_df,data], axis=1)
data.drop('last_purchase_day_type', axis=1, inplace=True)

# label encoding
# importing the required libraries
from sklearn.preprocessing import LabelEncoder
# LabelEncoder expects a 1-D input, so the columns are passed in directly
data['Education'] = LabelEncoder().fit_transform(data['Education'])
data['Marital_Status'] = LabelEncoder().fit_transform(data['Marital_Status'])
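
LabelEncoder numbers the classes in sorted order of their labels. For 'Education' that order happens to coincide with a plausible ranking of the levels, but only by coincidence. A sketch of making the ranking explicit follows; the order in `assumed_order` is our own assumption, and the values are made up.

```python
import pandas as pd

education = pd.Series(["PhD", "Basic", "Master", "Graduation", "Basic"])

# assumed ranking of the education levels, from lowest to highest
assumed_order = ["Basic", "Graduation", "Master", "PhD"]
ordered = pd.Categorical(education, categories=assumed_order, ordered=True)
education_codes = pd.Series(ordered.codes)

# for comparison: codes assigned purely by the sorted order of the labels
alphabetical_codes = education.astype("category").cat.codes
print(education_codes.tolist(), alphabetical_codes.tolist())
```

For a feature without a natural ranking, such as 'Marital_Status', integer codes impose an arbitrary order, which distance-based and linear models may pick up on.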

# Making the feature and target space.

In [None]:
# making our x and y data
x_data = data.drop('Response', axis=1)
y_data = data['Response']

# Importing the required libraries, for the purposes of modelling.

In [None]:
import imblearn 
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline

from sklearn.linear_model import (LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (RandomizedSearchCV, train_test_split, cross_val_score)

from sklearn.metrics import accuracy_score

print("Finished importing the libraries.")

# Models objects and their parameter grid.

In [None]:
# models as per the sequence in the parameter grid 
model_objects = [LogisticRegression(),
                 LogisticRegression(),
                 LogisticRegression(),
                 PassiveAggressiveClassifier(),
                 RidgeClassifier(),
                 KNeighborsClassifier(),
                 SVC(),
                 DecisionTreeClassifier(),
                 RandomForestClassifier()]

# hyper-parameter dictionary for the tuning of the models
parameter_grid = {'LR_l1' : {'model__penalty' : ['l1'],
                              'model__C' : [0.001, 0.01, 0.1, 1, 10, 100],
                              'model__random_state' : [42],
                              'model__solver' : ['liblinear', 'saga'],
                              'model__max_iter' : [100000]
                          },
				
                  'LR_l2' : {'model__penalty' : ['l2'],
                              'model__C' : [0.001, 0.01, 0.1, 1, 10, 100],
                              'model__random_state' : [42],
                              'model__solver' : ['newton-cg', 'lbfgs', 'sag', 'saga'],
                              'model__max_iter' : [100000]
                          },

                  'LR_ElNet' : {'model__penalty' : ['elasticnet'],
                                'model__l1_ratio' : [0.3, 0.5, 0.7],
                                'model__C' : [0.001, 0.01, 0.1, 1, 10, 100],
                                'model__random_state' : [42],
                                'model__solver' : ['saga'],
                                'model__max_iter' : [100000]
                              },

                  'Pass_Agg_clif' : {'model__C' : [0.001, 0.01, 0.1, 1, 10, 100],
                                      'model__random_state' : [42],
                                      'model__loss' : ['hinge', 'squared_hinge'],
                                      'model__class_weight' : ['balanced', None]
                                  },
                  
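                  'Ridge_clif' : {'model__alpha' : [500.0, 50.0, 5.0, 0.5, 0.05, 0.005],
                                  # booleans rather than strings: a non-empty string such as 'False' is truthy
                                  'model__fit_intercept' : [True, False],
                                  # 'normalize' is left out: recent scikit-learn versions no longer accept it,
                                  # and the pipeline already standardises the features
                                  'model__class_weight' : ['balanced', None],
                                  'model__solver' : ['svd', 'cholesky', 'lsqr', 'sparse_cg']
                              },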
                  
                  'KN_classif' : {'model__n_neighbors' : [1,3,5,7,9],
                                  'model__p' : [1,2,5]                     
                              },
                  
                  'SVC' : {'model__C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                           'model__gamma' : ['scale', 'auto'],                     
                      },
                  
                  'DT_clif' : {'model__criterion': ['gini','entropy'],
                                'model__max_features': ['sqrt','log2',None],
                                'model__min_samples_leaf': [1,2,5,10],
                                'model__min_samples_split' : [2,5,10,15,100],
                                'model__max_depth': [5,8,15,25,30,None]
                          },
                  
                  'RF_clif' : {'model__n_estimators' : [120,300,500,800,1200],
                               'model__max_features': ['sqrt','log2',None],
                                'model__min_samples_leaf': [1,2,5,10],
                                'model__min_samples_split' : [2,5,10,15,100],
                                'model__max_depth': [5,8,15,25,30,None]                      
                          }
              }

# Splitting the data for the purpose of hyper-parameter optimisation and model selection.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.3, random_state = 42)
x_optimization, x_validation, y_optimization, y_validation = train_test_split(x_train, y_train, test_size = 0.3, random_state = 42)

print("Finished splitting the data.")
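
With an imbalanced target, a plain random split can leave the subsets with different class ratios. One option, not used above, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 100 rows, 10% of them in the positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

# stratify=y keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(y_tr.mean(), y_te.mean())
```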

# Hyper-parameter optimization.

In [None]:

# initiating an empty list for storing the optimized models
hyper_parameter_optimized_models = []


# resampling our optimization dataset, in order to prevent our models from overfitting on the majority class of the target feature in our optimization set.
# For this purpose we will be using SMOTENC, which requires us to give the column indices of the categorical features.
num_features = ['Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'Tot_amnt_spent']
catg_features = x_train.drop(num_features, axis=1).columns.tolist()
catg_idx_list = []
for feature in catg_features:
  catg_idx_list.append(x_data.columns.get_loc(feature))

# making the resampling and standardising objects
over_sampler = SMOTENC(categorical_features = catg_idx_list, random_state=42)
scaler = StandardScaler()

# initiating the random search
for grid, model in zip(parameter_grid.values(), model_objects) :
  # the resampler and the scaler live inside the pipeline, so they are re-fitted on the training part of every CV fold
  classif_model = Pipeline([('resampler', over_sampler), ('scaler', scaler), ('model', model)])
  optimizer = RandomizedSearchCV(estimator = classif_model,
                                 param_distributions = grid,
                                 random_state = 42,
                                 cv = 3,
                                 error_score = -1,
                                 verbose = 10,
                                 n_jobs = -1,
                                 )
  optimizer.fit(x_optimization, y_optimization.values.ravel())
  # appending the best estimator to a list
  hyper_parameter_optimized_models.append(optimizer.best_estimator_)

print('Hyper parameter tuning is finished.')
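
The `model__` prefix in the parameter grid routes each hyper-parameter to the pipeline step named 'model'. A stripped-down sketch of the same pattern follows. It uses the plain scikit-learn Pipeline on synthetic data and leaves out the resampling step, so it illustrates the mechanics only and does not reproduce our search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for our data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline([("scaler", StandardScaler()),
                         ("model", LogisticRegression(max_iter=1000))])
# "<step name>__<parameter name>" addresses a parameter of one pipeline step
toy_grid = {"model__C": [0.01, 0.1, 1, 10]}

toy_search = RandomizedSearchCV(toy_pipeline, param_distributions=toy_grid,
                                n_iter=3, cv=3, random_state=42)
toy_search.fit(X_toy, y_toy)
print(toy_search.best_params_)
```

Because the scaler is a step of the pipeline, it is fitted on the training part of each fold only, which keeps information from the held-out part out of the preprocessing.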

# Model Selection.

In [None]:
# initiating an empty list to store the validation scores of the optimized models
optimized_model_validation_scores = []

for optimized_model in hyper_parameter_optimized_models :
  # each optimized model is already a full pipeline (resampler, scaler, model),
  # so it is scored directly instead of being wrapped in a second pipeline
  model_validation_scores = cross_val_score(optimized_model, x_validation, y_validation.values.ravel(), cv=3, n_jobs = -1)
  optimized_model_validation_scores.append(np.mean(model_validation_scores))

# making a dictionary to store the results of the hyper-parameter optimization and the model selection process.
results_dict = {'optimized_model':hyper_parameter_optimized_models,
                'validation_score':optimized_model_validation_scores
                }

optimized_model_results = pd.DataFrame(results_dict)
# # saving the results of the hyper-parameter optimization and model_selection in a csv file
# optimized_model_results.to_csv('/content/drive/My Drive/data_for_HPO&MS/Marketing_response/model_optimizaion_report.csv')
print('Model selection is finished')

# Best performing hyper-parameter optimised model.

In [None]:
# selecting the best model by its index for the final predictions
best_model_idx = optimized_model_results['validation_score'].idxmax(axis=0)
best_model = optimized_model_results.iloc[best_model_idx,0]

print('The best model to our finding is ', best_model)

# Defining the best model from the above findings, which will be further used for making the final predictions.

In [None]:
# selecting the classifier algorithm from the pipeline of the best model found.
final_model = best_model[2]
final_model

# Final Prediction.

In [None]:
# we are utilizing the whole training dataset for training the final model before making predictions on the test set.
# resampling our training dataset, in order to prevent our model from overfitting on the majority class of the target feature in our training set
x_train_resampled, y_train_resampled = over_sampler.fit_resample(x_train, y_train)
# making sure the resampled data are DataFrames with the original column names
y_train_resampled = pd.DataFrame(y_train_resampled)
x_train_resampled = pd.DataFrame(x_train_resampled, columns = x_train.columns)

# scaling our features in the training dataset
scaler = StandardScaler().fit(x_train_resampled)
x_train_scaled = scaler.transform(x_train_resampled)
x_test_scaled = scaler.transform(x_test)

# re-fitting our best optimized model on the whole training set
final_model.fit(x_train_scaled, y_train_resampled.values.ravel())
out_of_sample_predictions = final_model.predict(x_test_scaled)

final_score = accuracy_score(y_test, out_of_sample_predictions)

print('The final out-of-sample accuracy of our best optimized model is', round(final_score*100, 1), '%')
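
Since the target is imbalanced, it helps to compare the accuracy above against a majority-class baseline. The sketch below uses a made-up target with 10% positives to show how a model that never predicts an acceptance can still score a high accuracy.

```python
import numpy as np

# hypothetical test target: 90 rejections and 10 acceptances
y_true = np.array([0] * 90 + [1] * 10)
# a "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

accuracy = (y_majority == y_true).mean()
# recall on the positive class: the share of true acceptances that were found
recall_pos = (y_majority[y_true == 1] == 1).mean()
print(accuracy, recall_pos)
```

Reporting the recall or precision of the positive class next to the accuracy gives a fuller picture of how well the responders are identified.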