## Welcome to CS677 Project 

### Dataset : https://archive.ics.uci.edu/ml/datasets/bank+marketing 

<b> The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). </b>

Input variables:
### bank client data:
1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                   "blue-collar","self-employed","retired","technician","services") 

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric) 

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

### related to the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular") 

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)


### other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):
17 - y - has the client subscribed to a term deposit? (binary: "yes","no")
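The `pdays` description above uses -1 as a sentinel for "not previously contacted", so treating the column as a plain number mixes real day counts with a flag. A minimal sketch of one way to separate the two, using made-up toy rows; the `previously_contacted` column name is a hypothetical helper, not part of the dataset:

```python
import pandas as pd

# Toy rows standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "pdays": [-1, 92, -1, 180],
    "previous": [0, 2, 0, 1],
})

# Hypothetical helper column: True when the client was reached in an earlier campaign.
toy["previously_contacted"] = toy["pdays"] != -1

# Mask the sentinel with NaN so summary statistics only cover real day counts.
days_since_contact = toy["pdays"].where(toy["pdays"] != -1)

print(toy)
print(days_since_contact.mean())  # mean over the two real values only
```

Whether to keep the raw `pdays` column, the flag, or both is a modelling choice; tree-based models can often cope with the sentinel directly, while distance-based models such as kNN are more sensitive to it.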

#### The dataset was collected during marketing campaigns run by the bank and has both numerical and categorical columns. The goal is to build a model that helps the marketing team identify potential customers who are more likely to invest in new financial products, e.g. a term deposit offered by the bank.

#### Project Outcome Tasks:
1.	Exploratory data analysis (EDA).
2.	Analysis of feature data types and data distributions.
3.	Feature selection.
4.	Data cleanup and preparation.
5.	Use of the pandas-profiling module.
6.	Data visualization using the matplotlib.pyplot and seaborn packages.
7.	Building the prediction models.
8.	Evaluating accuracy, precision, and recall for each class across models.
9.	Selecting the best model by accuracy, precision, and recall.
10.	Saving the model to a pickle file and exposing the best model through Flask/Streamlit APIs.


###  <span style="color:blue">  <b> EXPLORATORY DATA ANALYSIS WORK </b></span>

In [None]:
import pandas as pd
bank_df = pd.read_csv(r'bank-full.csv',sep=';',header=0)
bank_df.columns


In [None]:
# import the packages required for exploratory data analysis (EDA)
import pandas as pd  # to store tabular data
import numpy as np  # to do some math
import matplotlib.pyplot as plt  # a popular data visualization tool
import seaborn as sns  # another popular data visualization tool
%matplotlib inline  
plt.style.use('fivethirtyeight')  # a popular data visualization theme

In [None]:
bank_df.head()

In [None]:
bank_df.shape

In [None]:
bank_df.info()

In [None]:
bank_df.describe()

In [None]:
bank_df.isnull().sum()

In [None]:
bank_df.nunique()

<b> Subscription to the term deposit (the outcome) appears to be driven by a few prominent features. Let's see how they affect the outcome: </b>
 1. housing loan
 2. personal loan
 3. previous campaign outcome
 4. balance
 5. education
 6. marital status

In [None]:
# alphabetically sorted job names; the position in this list is used as a numeric code when plotting
df_job=bank_df['job'].value_counts().to_frame().reset_index()
joborder=sorted(df_job.iloc[:,0])
printjobnames=joborder
joborder

In [None]:
for col in bank_df.columns[0:len(bank_df.columns)-1]:
    
    if col=="job":
        print(printjobnames)
        bank_df[col+'new']=bank_df[col].map(lambda x: joborder.index(x)+1)
        plt.hist(bank_df[bank_df['y']=='no'][col+'new'], 30, color='g',alpha=0.5, label='Not Subscribed to TD')
        plt.hist(bank_df[bank_df['y']=='yes'][col+'new'], 30, color='y',alpha=0.5, label='Subscribed to TD')
        bank_df.drop(columns=['jobnew'],inplace=True)
       
    else:
        plt.hist(bank_df[bank_df['y']=='no'][col], 30, color='g',alpha=0.5, label='Not Subscribed to TD')
        plt.hist(bank_df[bank_df['y']=='yes'][col], 30, color='y',alpha=0.5, label='Subscribed to TD')

    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()

The histograms above show how the values of each feature relate to term deposit subscriptions. The following can be inferred:

1. Younger clients are more interested in term deposits. The number of customers subscribing to a term deposit gradually falls after a threshold age of about 40.

2. Clients in management, technician, admin., and blue-collar jobs account for significantly more term deposits than other professions.
3. Divorced clients are unlikely to subscribe to term deposits.
4. Clients with credit in default do not subscribe to term deposits.
5. Clients without a housing or personal loan subscribe to term deposits more often.
6. In a direct marketing campaign, cellular contact is a better way to convince customers than telephone or unknown contact types.
7. December shows no customers enrolling in term deposits.
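The inferences above are read from histograms of raw counts, which also reflect how many clients fall into each category. A small sketch of how the subscription rate per category could be checked instead, here on made-up toy rows rather than the real `bank_df`:

```python
import pandas as pd

# Toy data for illustration only; these are not rows from bank-full.csv.
toy = pd.DataFrame({
    "contact": ["cellular", "cellular", "telephone", "unknown", "cellular", "unknown"],
    "y": ["yes", "no", "no", "no", "yes", "no"],
})

# Share of "yes" and "no" within each contact type (each row sums to 1).
rate_by_contact = pd.crosstab(toy["contact"], toy["y"], normalize="index")
print(rate_by_contact)
```

On the real data the same call with `bank_df["contact"]` and `bank_df["y"]` would give the per-category rates, which makes categories of very different sizes comparable.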

In [None]:
bank_df

In [None]:
sum_df=bank_df.groupby('y').agg({'y':'count'})
sum_df


<b> About 8% of users have subscribed to term deposits. </b>
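With a positive class this small, accuracy alone can look good even for a model that never predicts a subscription. The sketch below illustrates the point on synthetic labels (92 "no" and 8 "yes", chosen only to mimic the rough split quoted above); it uses scikit-learn's `DummyClassifier` as a majority-class baseline:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 92 "no" (0) and 8 "yes" (1); not taken from the real dataset.
y_toy = np.array([0] * 92 + [1] * 8)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

print("accuracy:", accuracy_score(y_toy, y_hat))
print("recall for 'yes':", recall_score(y_toy, y_hat))
```

This is why the models later in the notebook are compared on TPR and TNR as well as accuracy.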

In [None]:
marital_df=bank_df.groupby(['marital','y']).agg({'balance':'sum','y':'count'})
marital_df



Bivariate analysis:
1. Duration and y (response outcome).
2. Education, balance, and y.
3. Marital status, balance, and y.
4. Job and y.

In [None]:
bank_df['duration_status'] = np.select([(bank_df['duration']< bank_df['duration'].mean())], ["Below Average"],default="Above Average")
duration_deposit=pd.crosstab(bank_df['duration_status'], bank_df['y']).apply(lambda r: round(r/r.sum(), 2) * 100, axis=1)

duration_deposit.plot(kind='bar',stacked=False,cmap='viridis')
plt.show()

In [None]:
bank_df.drop(columns=['duration_status'],inplace=True)

In [None]:
for i in bank_df["education"].value_counts().index:
    for j in bank_df["y"].value_counts().index:
        education_balance_amt = bank_df[(bank_df["education"] == i) & (bank_df["y"] == j)]["balance"].sum()
        percentage = round(education_balance_amt*100 / bank_df["balance"].sum(),3)
        print(f"Education level '{i}' and Deposit status '{j}' amount: {education_balance_amt}, percentage: {percentage}")


<mark> Clients with secondary and tertiary education have high balances. </mark>

In [None]:
for i in bank_df["marital"].value_counts().index:
    for j in bank_df["y"].value_counts().index:
        marital_balance_amt = bank_df[(bank_df["marital"] == i) & (bank_df["y"] == j)]["balance"].sum()
        percentage = round(marital_balance_amt*100 / bank_df["balance"].sum(),3)
        print(f"Marital level '{i}' and Deposit status '{j}' amount: {marital_balance_amt}, percentage: {percentage}")

<b> Insights on the marital feature:
    Single individuals have a higher rate of subscribing to term deposits.
    Among individuals who subscribed to term deposits, the divorced category holds the lowest total balance.
  
</b>
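The nested loops used for the education and marital breakdowns can also be written as a single `groupby`. A minimal sketch on made-up toy rows; the output column names (`total`, `mean_balance`, `n_clients`, `pct_of_all_balance`) are illustrative choices, not names used elsewhere in this notebook:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "marital": ["single", "single", "married", "married", "divorced"],
    "y": ["yes", "no", "yes", "no", "yes"],
    "balance": [1000, 500, 2000, 1500, 0],
})

# One row per (marital, y) pair with total balance, mean balance and group size.
summary = (
    toy.groupby(["marital", "y"])["balance"]
       .agg(total="sum", mean_balance="mean", n_clients="count")
)

# Share of the overall balance held by each group, in percent.
summary["pct_of_all_balance"] = (100 * summary["total"] / toy["balance"].sum()).round(3)
print(summary)
```

Reporting the mean and the group size next to the sum helps separate "this group holds a lot of money" from "this group simply has many clients".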



In [None]:
sns.pairplot(bank_df)
plt.show()

## Understanding the categorical variables: job, marital, education, default, housing, loan, contact, poutcome, month

In [None]:
# duration is numeric, so it is left out of the categorical count plots
categorical_columns=['job','marital','education','default','housing','loan','contact','poutcome','month']
for col in categorical_columns:
    plt.figure(figsize=(9,16))
    # pass the column by keyword; recent seaborn versions treat the first positional argument as data
    sns.countplot(x=col,hue="y",data=bank_df)
    #sns.barplot(bank_df[col].value_counts(),bank_df[col].value_counts().index,hue="y",data=bank_df)
    plt.title(col)
    plt.tight_layout(pad=0.5)


In [None]:
sns.pairplot(bank_df,hue="y",palette="husl")
plt.show()

### Pandas Profiling

1. About 1.8% of clients have credit in default.
2. More than 50% of clients have a housing loan.
3. Most clients have a secondary education.
4. About 16% of clients have a personal loan.

In [None]:
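# note: pandas-profiling has been renamed to ydata-profiling; on newer installs use `from ydata_profiling import ProfileReport`
from pandas_profiling import ProfileReport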
#profile = ProfileReport(bank_df)
#profile

### Correlation

In [None]:
model_bank_df=bank_df.copy()
from sklearn.preprocessing import LabelEncoder  

le = LabelEncoder()
model_bank_df['y']=le.fit_transform(model_bank_df['y'])

model_bank_df.tail()


In [None]:
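# numeric_only=True skips the string columns, which recent pandas versions refuse to correlate
model_bank_df.corr(numeric_only=True)['y']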

In [None]:
sns.heatmap(model_bank_df.corr(numeric_only=True))

In [None]:
bank_df.shape

In [None]:
bank_df.isnull().sum()

In [None]:
bank_df['y'].value_counts(normalize=True)


###  <span style="color:blue">  <b> FEATURE CONSTRUCTION </b></span>


In [None]:
bank_df

#####  Data cleanup:

#### The poutcome and contact columns contain "unknown" values. Rows with an unknown value in either column are not informative, so we remove them.
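The cleanup step can be sketched with a boolean mask, which also makes it easy to report how many rows are removed. This is a minimal illustration on toy rows, not the notebook's own implementation:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "poutcome": ["unknown", "failure", "success", "other"],
    "contact": ["cellular", "unknown", "telephone", "cellular"],
})

# True for every row where either column holds the placeholder "unknown".
has_unknown = toy[["poutcome", "contact"]].eq("unknown").any(axis=1)

# Keep the remaining rows and renumber the index.
cleaned = toy.loc[~has_unknown].reset_index(drop=True)
print(f"dropped {has_unknown.sum()} of {len(toy)} rows")
print(cleaned)
```

Filtering into a new frame instead of dropping in place keeps the original data available for the exploratory cells above.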



In [None]:
lst_index=bank_df[(bank_df['poutcome']=='unknown') | (bank_df['contact']=='unknown')].index
#37029 records.

bank_df.drop(lst_index,inplace=True)
bank_df.shape

In [None]:
from sklearn.base import TransformerMixin

class CustomEncoder(TransformerMixin):
    def __init__(self, col, ordering=None):
        self.ordering = ordering
        self.col = col
        
    def transform(self, df):
        X = df.copy()
        X[self.col] = X[self.col].map(lambda x: self.ordering.index(x))
        return X
    
    def fit(self, *_):
        return self


In [None]:
from sklearn.base import TransformerMixin

class CustomDummifier(TransformerMixin):
    def __init__(self, cols=None):
        self.cols = cols
        
    def transform(self, X):
        return pd.get_dummies(X, columns=self.cols)
    
    def fit(self, *_):
        return self


In [None]:
bank_df.head(3)

In [None]:
dict_forencoder=dict()
dict_forencoder['default']=['no', 'yes']
dict_forencoder['housing']=['no', 'yes']
dict_forencoder['loan']=['no', 'yes']
dict_forencoder['y']=['no','yes']
dict_forencoder['contact']=['telephone','cellular']

#cd=CustomDummifier(cols=['poutcome'])
#bank_featured=cd.fit_transform(bank_df)

bank_featured=bank_df
cd=CustomDummifier(cols=['poutcome','month','job','marital','education'])
bank_featured=cd.fit_transform(bank_featured)


for colname in ['default','housing','loan','y','contact']:
    ce=CustomEncoder(col=colname,ordering=dict_forencoder[colname])
    bank_featured=ce.fit_transform(bank_featured)



bank_featured.head(10)

In [None]:
bank_featured.shape

In [None]:
bank_featured.corr()['y']

In [None]:
bank_featured.columns

In [None]:
bank_featured.dtypes

In [None]:
bank_featured.head(5)

###  <span style="color:blue">  <b> FEATURE SELECTION </b></span>


In [None]:
# Create our feature matrix
#bank_featured.drop(columns=['job','marital','education'],inplace=True)

X = bank_featured.drop('y', axis=1)

# create our response variable
y = bank_featured['y']

In [None]:
X

In [None]:
y

In [None]:
from sklearn.base import TransformerMixin, BaseEstimator



class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_to_keep=[], threshold=None):
        # store the response series
        self.response = response
        # store the threshold that we wish to keep
        self.threshold = threshold
        # initialize a variable that will eventually
        # hold the names of the features that we wish to keep
        self.cols_to_keep = cols_to_keep
        
    def transform(self, X):
        # the transform method simply selects the appropriate
        # columns from the original dataset
        return X[self.cols_to_keep]
        
    def fit(self, X, *_):
        # create a new dataframe that holds both features and response
        df = pd.concat([X, self.response], axis=1)
        # store names of columns that meet correlation threshold
        self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
        # only keep columns in X, for example, will remove response variable
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self




In [None]:
from sklearn.model_selection import GridSearchCV
import numpy as np

def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model,           # the model to grid search
                        params,          # the parameter set to try 
                        error_score=0.,  # if a parameter set raises an error, continue and score it as 0
                        n_jobs=-1)       # use all available CPU cores
    grid.fit(X, y)           # fit the model and parameters
    # our classical metric for performance
    print("Best Accuracy: {}".format(grid.best_score_))
    # the best parameters that caused the best accuracy
    print("Best Parameters: {}".format(grid.best_params_))
    # the average time it took a model to fit to the data (in seconds)
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # the average time it took a model to predict out of sample data (in seconds)
    # this metric gives us insight into how this model will perform in real-time analysis
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))


In [None]:
from copy import deepcopy
# Import four machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Set up some parameters for our grid search
# We will start with four different machine learning models
# logistic regression, KNN, Decision Tree, and Random Forest
lr_params = {'C':[1e-1, 1e0, 1e1, 1e2], 'penalty':['l1', 'l2']}
knn_params = {'n_neighbors': [1, 3, 5, 7]}
tree_params = {'max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
forest_params = {'n_estimators': [10, 50, 100], 'max_depth': [None, 1, 3, 5, 7]}


# instantiate the four machine learning models
lr = LogisticRegression(solver='liblinear')
knn = KNeighborsClassifier()
d_tree = DecisionTreeClassifier()
forest = RandomForestClassifier()

ccc = CustomCorrelationChooser(response=y)
ccc_pipe = Pipeline([('correlation_select', ccc), 
                     ('classifier', d_tree)])
tree_pipe_params = {'classifier__max_depth': 
                    [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
# make a copy of the decision tree pipeline parameters
ccc_pipe_params = deepcopy(tree_pipe_params)

# update that dictionary with feature selector specific parameters
ccc_pipe_params.update({
               'correlation_select__threshold':[0.1, 0.2,.3, 0.4]})

print(ccc_pipe_params)

# better than original (by a little), and a bit faster on 
# average overall
get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y)  

In [None]:
ccc_selected=CustomCorrelationChooser(y,threshold=0.2)
ccc_selected.fit_transform(X)
ccc_selected.cols_to_keep

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
k_best = SelectKBest(f_classif, k=4)
k_best.fit_transform(X, y)
p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')

p_values.head(10)
p_values[p_values['p_value'] < .05]

In [None]:
p_values[p_values['p_value'] >= .05]

In [None]:
k_best = SelectKBest(f_classif)

# Make a new pipeline with SelectKBest
select_k_pipe = Pipeline([('k_best', k_best), 
                          ('classifier', d_tree)])

select_k_best_pipe_params = deepcopy(tree_pipe_params)

select_k_best_pipe_params.update({'k_best__k':list(range(3,15))+['all'],  # the 'all' literally does nothing to subset
                                 })
print(select_k_best_pipe_params)
# comparable to our results with correlationchooser
get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)  

In [None]:
k_best = SelectKBest(f_classif,k=3)
k_best.fit_transform(X,y)
p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')
p_values.head(4)


In [None]:
X.iloc[:,[3,7,11,13]]

In [None]:
# instantiate a class that chooses features based
# on feature importances according to the fitting phase
# of a separate decision tree classifier

from copy import deepcopy
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

select_from_pipe = Pipeline([('select', SelectFromModel(DecisionTreeClassifier())), 
                             ('classifier', d_tree)])
tree_pipe_params = {'classifier__max_depth': [1, 3, 5, 7]}
select_from_pipe_params = deepcopy(tree_pipe_params)

select_from_pipe_params.update({
              'select__threshold': [.01, .05, .1, "mean", "median", "2.*mean"],
              'select__estimator__max_depth': [None, 1, 3, 5, 7]
              })

print(select_from_pipe_params)

# not better than original
get_best_model_and_accuracy(select_from_pipe, 
                            select_from_pipe_params, 
                            X, y)  



In [None]:
select_from_pipe.set_params(**{'select__threshold': 0.01, 
                               'select__estimator__max_depth': None, 
                               'classifier__max_depth': 1})


# fit our pipeline to our data
select_from_pipe.steps[0][1].fit(X, y)

# list the columns that the decision tree selected by calling the get_support() method of SelectFromModel
X.columns[select_from_pipe.steps[0][1].get_support()]

<mark> 1. Sanity check </mark>

<mark> 2. What happens if we keep only the worst columns </mark>

In [None]:
# sanity check
# what happens if we keep only the worst columns
the_worst_of_X = X[X.columns.drop(['housing', 'duration', 'poutcome_failure', 'poutcome_success'])]
the_best_of_X = X.loc[:,['age','balance','housing','day','duration','campaign','pdays','previous','poutcome_success','job_technician']]
the_super_best_of_X = X.loc[:,['housing', 'duration', 'poutcome_failure', 'poutcome_success']]

# much worse than the original 0.8203 without removing anything
# this goes to show that selecting the wrong features will 
# hurt predictive performance
get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y)  
get_best_model_and_accuracy(d_tree, tree_params, the_best_of_X, y)
get_best_model_and_accuracy(d_tree, tree_params, the_super_best_of_X, y)  


In [None]:
features=X.columns[select_from_pipe.steps[0][1].get_support()].to_frame().reset_index()['index']
X_bestfeatures=X.loc[:,np.array(features)]
X_bestfeatures

###  <span style="color:blue">  <b>MODEL BUILDING </b></span>


####  KNN algorithm: finding the best k, then measuring the accuracy using that k value.

In [None]:
X = X.loc[:,['housing', 'duration', 'poutcome_failure', 'poutcome_success']]

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=1, stratify=y)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score


scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
accuracy = []

for i in [3,5,7,9,11,13,15]:
  classifier = KNeighborsClassifier(n_neighbors=i,p=2,metric='euclidean') # try each candidate k, using the L^2 (Euclidean) norm
  classifier.fit(X_train_sc,y_train)
  y_pred =  classifier.predict(X_test_sc)
  accuracy.append(accuracy_score(y_test,y_pred))

accuracy

In [None]:
import matplotlib.pyplot as plt
lst_x=range(3,17,2)
plt.plot(lst_x,accuracy,"r-")
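ax = plt.gca()  # reuse the axes holding the line plot; in recent matplotlib versions plt.axes() adds a new, empty one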
ax.set_xticks([3,5,7,9,11,13,15])
plt.xlabel("k")
plt.ylabel("Accuracy of Predictions")
plt.title("Plot of Accuracy vs k value for kNN")
plt.show()

#### Best value of k is 9 
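The k above is chosen by comparing accuracy on the test split, which means the test split is no longer an untouched estimate of performance. One common alternative, sketched here on synthetic data from `make_classification` rather than on the bank data, is to pick k by cross-validation on the training split only and keep the scaler inside the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four features; not the bank dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=4, n_informative=3,
                                   n_redundant=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=1, stratify=y_toy)

# Scaling is part of the pipeline, so each CV fold is scaled using its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9, 11, 13, 15]},
                      cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best k:", search.best_params_["knn__n_neighbors"])
print("held-out accuracy:", search.score(X_te, y_te))
```

The step names `scale` and `knn` are arbitrary labels chosen for this sketch.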

In [None]:
Model=[]
TP=[]
FP=[]
TN=[]
FN=[]
accuracy_table=[]
TPR_lst=[]
TNR_lst=[]

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
classifier_bestK = KNeighborsClassifier(n_neighbors=9) # best k found above; the default metric is already the L^2 (Euclidean) norm

classifier_bestK.fit(X_train_sc,y_train)
y_pred=classifier_bestK.predict(X_test_sc)
accuracy=accuracy_score(y_test,y_pred)
cm=confusion_matrix(y_test,y_pred,labels=[0,1])
cm_display=metrics.ConfusionMatrixDisplay(confusion_matrix = cm,display_labels=['Not subscribed','Subscribed'])
cm_display.plot()
plt.show()

true_neg=cm[0][0]
true_pos=cm[1][1]
false_pos=cm[0][1]
false_neg=cm[1][0]

TPR = true_pos/(true_pos + false_neg)
TNR = true_neg/(true_neg + false_pos)

Model.append('Knn')
TN.append(true_neg)
TP.append(true_pos)
FP.append(false_pos)
FN.append(false_neg)
TPR_lst.append(TPR)
TNR_lst.append(TNR)
accuracy_table.append(accuracy)

In [None]:
accuracy

In [None]:
X.shape

##### Decision Tree algorithm

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
accuracy_dtree=[]
for i in range(1,16):
  tree_clf = DecisionTreeClassifier(max_depth=i)
  tree_clf.fit(X_train,y_train)
  y_pred =  tree_clf.predict(X_test)
  accuracy_dtree.append(accuracy_score(y_test,y_pred))

accuracy_dtree


In [None]:
import matplotlib.pyplot as plt
lst_x=range(1,16,1)
plt.plot(lst_x,accuracy_dtree,"r-")
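ax = plt.gca()  # reuse the axes holding the line plot; in recent matplotlib versions plt.axes() adds a new, empty one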
ax.set_xticks([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
plt.xlabel("Maxdepth")
plt.ylabel("Accuracy of Predictions")
plt.title("Plot of Maxdepth vs Accuracy value for Decision tree")
plt.show()

### A max depth of 4 gives higher accuracy than the other values tested.
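Rather than reading the best depth off the plot, it can be picked programmatically from the list of accuracies. A small sketch with made-up accuracy values (they are placeholders, not results from this notebook):

```python
import numpy as np

depths = list(range(1, 16))
# Made-up accuracies for illustration only; one value per depth.
toy_accuracy = [0.80, 0.82, 0.84, 0.86, 0.85] + [0.83] * 10

# np.argmax returns the first index of the maximum, so ties go to the shallowest tree.
best_idx = int(np.argmax(toy_accuracy))
best_depth = depths[best_idx]
print(best_depth, toy_accuracy[best_idx])
```

Preferring the shallowest depth among ties is a deliberate choice here: a smaller tree is easier to interpret and less likely to overfit.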

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
tree_clf = DecisionTreeClassifier(max_depth=4)  # best depth found in the sweep above
tree_clf.fit(X_train,y_train)
y_pred =  tree_clf.predict(X_test)
accuracy=accuracy_score(y_test,y_pred)
cm=confusion_matrix(y_test,y_pred,labels=[0,1])
cm_display=metrics.ConfusionMatrixDisplay(confusion_matrix = cm,display_labels=['Not subscribed','Subscribed'])
cm_display.plot()
plt.show()

true_neg=cm[0][0]
true_pos=cm[1][1]
false_pos=cm[0][1]
false_neg=cm[1][0]

TPR = true_pos/(true_pos + false_neg)
TNR = true_neg/(true_neg + false_pos)

Model.append('DecisionTree')
TN.append(true_neg)
TP.append(true_pos)
FP.append(false_pos)
FN.append(false_neg)
TPR_lst.append(TPR)
TNR_lst.append(TNR)
accuracy_table.append(accuracy)

In [None]:
X_test

### RandomForest Classifier

In [None]:
from collections import Counter
def max_frequency_label(ensemble_values):
    """
    Return the label with the highest count.

    Parameters:
    ensemble_values (dict or iterable): When a dict is passed, its values are
        used as the counts, so the key with the largest value is returned.

    Returns:
    String: The label with the highest count (or largest value).
    """
    res = Counter(ensemble_values)
    res = max(res, key=res.get)
    return str(res)

def min_frequency_label(ensemble_values):
    """
    Return the label with the lowest count.

    Parameters:
    ensemble_values (dict or iterable): When a dict is passed, its values are
        used as the counts, so the key with the smallest value is returned.

    Returns:
    String: The label with the lowest count (or smallest value).
    """
    res = Counter(ensemble_values)
    res = min(res, key=res.get)
    return str(res)



dict_best={}
dict_best_matrix={}
for N in range(1,11):
    drange=[1,2,3,4,5]


    print(' during the number of decision trees in Random forest : {} '.format(N))
    for d in range(1,6):

        rf_clf = RandomForestClassifier(n_estimators=N,criterion='entropy',max_depth=d)
        rf_clf.fit(X_train,y_train)
        y_pred = rf_clf.predict(X_test)
        mat = confusion_matrix(y_test, y_pred)

        acc_calc=metrics.accuracy_score(y_test, y_pred)
        print("Accuracy for tree : {} and depth : {} is {}:".format(N,d,acc_calc))
        
        keyval=str(N)+'_'+str(d)+'_tree'
        dict_best[keyval]=acc_calc
        dict_best_matrix[keyval]=mat

bestkey=max_frequency_label(dict_best)
print(' THE BEST KEY COMBINATION OF N is {} and d is : {} '.format(bestkey.split('_')[0],bestkey.split('_')[1]))
print(' THE BEST ACCURACY FOR BEST COMBINATION OF N and d is : {}'.format(dict_best[bestkey]))
true_neg=dict_best_matrix[bestkey][0][0]
true_pos=dict_best_matrix[bestkey][1][1]
false_pos=dict_best_matrix[bestkey][0][1]
false_neg=dict_best_matrix[bestkey][1][0]

TPR = true_pos/(true_pos + false_neg)
TNR = true_neg/(true_neg + false_pos)
Model.append('RandomForestClassifier')
TN.append(true_neg)
TP.append(true_pos)
FP.append(false_pos)
FN.append(false_neg)
TPR_lst.append(TPR)
TNR_lst.append(TNR)
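accuracy_table.append(dict_best[bestkey])  # accuracy of the best random forest, not the value left over from the decision tree cell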


In [None]:
print(' BEST CONFUSION MATRIX IS : {}'.format(dict_best_matrix[bestkey]))
sns.heatmap(dict_best_matrix[bestkey],square=True, annot=True, fmt = 'd', cbar=True, xticklabels=['Not subscribed','Subscribed'], yticklabels=['Not subscribed','Subscribed'])
# confusion_matrix puts true labels on the rows (y axis) and predicted labels on the columns (x axis)
plt.xlabel('predicted label')
plt.ylabel('true label')

In [None]:
accuracy_table

#### Gaussian NB. 


In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn import metrics

model = GaussianNB()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
mat = confusion_matrix(y_test, y_pred)
cm_display=metrics.ConfusionMatrixDisplay(confusion_matrix = mat,display_labels=['Not subscribed','Subscribed'])
cm_display.plot()
plt.show()
accuracy=metrics.accuracy_score(y_test, y_pred)
print('Accuracy for gaussian model is ', accuracy)
Model.append('GaussianNB')
tn, fp, fn, tp = mat.ravel()
TN.append(tn)
TP.append(tp)
FP.append(fp)
FN.append(fn)
TPR_lst.append(tp/(tp + fn))
TNR_lst.append(tn/(tn + fp))
accuracy_table.append(accuracy)

#### Logistic Regression Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import pandas as pd
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(X_train,y_train)
y_pred=log_reg.predict(X_test)
mat = confusion_matrix(y_test, y_pred)
cm_display=metrics.ConfusionMatrixDisplay(confusion_matrix = mat,display_labels=['Not subscribed','Subscribed'])
cm_display.plot()
plt.show()
accuracy=metrics.accuracy_score(y_test, y_pred)
print('Accuracy for Logistic Regression model is ', accuracy)
Model.append('LogisticRegression')
tn, fp, fn, tp = mat.ravel()
TN.append(tn)
TP.append(tp)
FP.append(fp)
FN.append(fn)
TPR_lst.append(tp/(tp + fn))
TNR_lst.append(tn/(tn + fp))
accuracy_table.append(accuracy)

#### Linear SVM 


In [None]:





from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import pandas as pd
from sklearn.svm import LinearSVC # support vector classification
c=10
# Fit model to data
linear_svm = LinearSVC(C=c,loss="hinge")
linear_svm.fit(X_train_sc, y_train)
y_pred = linear_svm.predict(X_test_sc)

mat = confusion_matrix(y_test, y_pred)
cm_display=metrics.ConfusionMatrixDisplay(confusion_matrix = mat,display_labels=['Not subscribed','Subscribed'])
cm_display.plot()
plt.show()
accuracy=metrics.accuracy_score(y_test, y_pred)
print('Accuracy for SVM  model is ', accuracy)
tn, fp, fn, tp = mat.ravel()
Model.append('LinearSVC')
TN.append(tn)
TP.append(tp)
FP.append(fp)
FN.append(fn)
TPR_lst.append(tp/(tp + fn))
TNR_lst.append(tn/(tn + fp))
accuracy_table.append(accuracy)

In [None]:
accuracy_table

### VOTING CLASSIFIER  AND BAGGING CLASSIFIER

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report

log_clf = LogisticRegression(solver='liblinear')
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
        estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
        voting = 'hard')
voting_clf.fit(X_train, y_train)
y_pred=voting_clf.predict(X_test)


print(classification_report(y_test, y_pred))
mat = confusion_matrix(y_test, y_pred)
cm_display=metrics.ConfusionMatrixDisplay(confusion_matrix = mat,display_labels=['Not subscribed','Subscribed'])
cm_display.plot()
plt.show()
accuracy=metrics.accuracy_score(y_test, y_pred)
print('Accuracy obtained is ',accuracy)
Model.append('VotingClassifier')
tn, fp, fn, tp = mat.ravel()
TN.append(tn)
TP.append(tp)
FP.append(fp)
FN.append(fn)
TPR_lst.append(tp/(tp + fn))
TNR_lst.append(tn/(tn + fp))
accuracy_table.append(accuracy)

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))





In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

bag_clf = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=500,
        max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)


print(classification_report(y_test, y_pred))
mat = confusion_matrix(y_test, y_pred)
cm_display=metrics.ConfusionMatrixDisplay(confusion_matrix = mat,display_labels=['Not subscribed','Subscribed'])
cm_display.plot()
plt.show()
accuracy=metrics.accuracy_score(y_test, y_pred)
print('Accuracy obtained is ',accuracy)

Model.append('BaggingClassifier')
tn, fp, fn, tp = mat.ravel()
TN.append(tn)
TP.append(tp)
FP.append(fp)
FN.append(fn)
TPR_lst.append(tp/(tp + fn))
TNR_lst.append(tn/(tn + fp))
accuracy_table.append(accuracy)

In [None]:
df=pd.DataFrame({'Model':Model,'TP':TP,'FP':FP,'TN':TN,'FN':FN,'accuracy':accuracy_table,'TPR':TPR_lst,'TNR':TNR_lst})
df.head(20)

### Model outputs :

Among all the models, the decision tree classifier performs best, with the highest accuracy and good TPR and TNR values as well.
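The TPR and TNR columns in the comparison table come from the four cells of the confusion matrix. The sketch below wraps that arithmetic in a small helper; `rates_from_labels` is a hypothetical name introduced here, and the labels are toy values:

```python
from sklearn.metrics import confusion_matrix

def rates_from_labels(y_true, y_pred):
    """Return TPR, TNR and accuracy for binary labels encoded as 0/1."""
    # With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "TPR": tp / (tp + fn),                      # recall for the "subscribed" class
        "TNR": tn / (tn + fp),                      # recall for the "not subscribed" class
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Toy labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
rates = rates_from_labels(y_true, y_pred)
print(rates)
```

Using one helper for every model would avoid repeating the same bookkeeping in each cell and keeps the accuracy, TPR and TNR of a model tied to the same predictions.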


In [None]:
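import pickle
# use a context manager so the file is flushed and closed after writing
with open('model.pkl','wb') as model_file:
    pickle.dump(tree_clf, model_file)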
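Before a pickled model is served from an API, it is worth checking that it loads and predicts the same way as the in-memory object. A minimal round-trip sketch on a toy tree; the file name `toy_model.pkl` and the data are placeholders, not artifacts of this project:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data for illustration only.
X_toy = np.array([[0, 100], [1, 50], [0, 900], [1, 700]])
y_toy = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    with open(path, "wb") as fh:
        pickle.dump(clf, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(clf.predict(X_toy), restored.predict(X_toy))
print("round trip OK:", same)
```

A pickle should only be loaded from a trusted source and with a scikit-learn version compatible with the one that wrote it.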