# Machine Learning Project by Teodor Chakarov

Automobiles are a big part of everyday life for the average person. They help us be fast, flexible and independent. However, they also bring social problems such as air pollution, traffic jams and car accidents.
That is why countries set regulations that require at least one insurance policy in order to drive a car. People with expensive vehicles should be obligated to insure them in order to drive those fast cars.

Here I'm going to see how many people tend to trust an insurance company and build prediction models for Classification and Regression problems.


# Part 1 - Vehicle Insurance 

In this machine learning part, I'm going to inspect the data and try to build a model for an insurance company. I have a dataset of people who use this insurance company's Health insurance products. The dataset attributes are:
1) Gender

2) Age

3) Driving License (1 - Yes, 0 - No)

4) Region Code - Unique code for the region of the customer

5) Previously Insured - (0 - Person hasn't got previous vehicle insurance, 1 - Person has got previous vehicle insurance)

6) Vehicle Age

7) Previous Vehicle Damage

8) Annual Premium - The amount customer needs to pay as premium in the year

9) Policy Sales Channel - Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.

10) Vintage - Number of Days, Customer has been associated with the company

11) Response - Does the customer want to get vehicle insurance (0 - No, 1 - Yes)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import seaborn as sns


In [None]:
np.random.seed(42)

In [None]:
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV

from sklearn.linear_model import LogisticRegressionCV

from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import SVC

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.decomposition import PCA

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, plot_confusion_matrix, plot_roc_curve

from lightgbm import LGBMClassifier

In [None]:
insurance = pd.read_csv("/kaggle/input/health-insurance-cross-sell-prediction/train.csv")

In [None]:
insurance.head()

In [None]:
insurance = insurance.drop('id', axis = 1)

In [None]:
insurance.describe().T

In [None]:
insurance.isna().sum()

In [None]:
insurance.dtypes

## Data Exploration


I'm going to briefly explore the dataset, look for correlations and see how the columns relate to each other.

In [None]:
print(f"Number of observations: {insurance.shape[0]}")

In [None]:
colors = ['blue', 'red']
plt.title('Insurance Clients based on Gender',fontsize=15)
circle = plt.Circle((0, 0), 0.6, color = 'white')
insurance['Gender'].value_counts().plot(kind='pie', figsize=(8, 8), rot=1, colors=colors, autopct = '%.2f%%')
plt.axis('off')
plt.legend()
plt.show()

We can see an almost equal distribution between men and women.

In [None]:
plt.title('Vehicle age distribution',fontsize=15)
insurance['Vehicle_Age'].value_counts().plot(kind='bar', figsize=(8, 8))
plt.xlabel('Years of a car')
plt.ylabel('Count')
plt.legend()
plt.show()

### Customers with previous insurance

In [None]:
# as_index expects a boolean; the default (True) keeps the group keys as the index
health = insurance.groupby(['Gender', 'Previously_Insured']).count()
health

In [None]:
colors = ['#1849CA', 'crimson', 'green', 'pink']
circle = plt.Circle((0, 0), 0.6, color = 'white')
health.plot(x= 'Gender', y='Age',kind='pie', figsize=(8, 8), rot=1, colors=colors, autopct = '%.2f%%')
plt.title('Insured by Gender')
plt.legend()
plt.axis('off')
plt.show()

Clearly, men without previous insurance outnumber men who have it, while women are nearly equally distributed.



In [None]:
plt.figure(figsize=(8,6))
plt.title('Vehicle damage distribution',fontsize=15)
insurance['Vehicle_Damage'].value_counts().plot(kind='bar', figsize=(8, 8))
plt.xlabel('Damaged car?')
plt.ylabel('Count')
plt.legend()
plt.show()

Here again we have an equal distribution between the two categories.

In [None]:
men = insurance[insurance['Gender'] =='Male']
female = insurance[insurance['Gender'] == 'Female']

In [None]:
print(men.shape)
print(female.shape)

#### What did people choose for insurance?

In [None]:
people_without_insuranse_accept= insurance[(insurance['Previously_Insured'] == 0) & (insurance['Response'] == 1)]
people_with_insuranse =  insurance[(insurance['Previously_Insured'] == 1) & (insurance['Response'] == 1)]

In [None]:
print(people_without_insuranse_accept.shape)
print(people_with_insuranse.shape)

In [None]:
print(f"As we see people who don't have previous insurance and who will pay for one are {people_without_insuranse_accept.shape[0]} and people who will continue paying are {people_with_insuranse.shape[0]}")

In [None]:
people_without_insuranse_reject = insurance[( insurance['Previously_Insured'] == 0) & (insurance['Response'] == 0)]
people_with_insuranse_reject =  insurance[(insurance['Previously_Insured'] == 1) & (insurance['Response'] == 0)]

In [None]:
print(people_with_insuranse_reject.shape)
print(people_without_insuranse_reject.shape)

In [None]:
print(f"We can see that people who don't have previous insurance and won't pay for one are {people_without_insuranse_reject.shape[0]} and people who will stop paying are {people_with_insuranse_reject.shape[0]}")

In [None]:
plt.figure(figsize=(8,6))

plt.title('Are people into vehicle insurance or not')

sns.countplot(x = 'Previously_Insured', hue='Response', data = insurance)
plt.ylabel("Count")

plt.show()

#### Our dataset until this point is very well balanced, but here we can see that people's response about future insurance is really low.
We can see that a bigger share of the people who will pay for one are people who don't have previous insurance.

People's response isn't balanced. Either people are not satisfied with the insurance company's products or they don't need one.
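
The imbalance can also be put into numbers instead of reading it off the plot. The snippet below is only a sketch on a made-up toy table (the values are not taken from the real dataset); it shows how the share of positive responses per group could be computed.

```python
import pandas as pd

# Toy data standing in for the real `insurance` DataFrame (values are made up).
toy = pd.DataFrame({
    "Previously_Insured": [0, 0, 0, 0, 1, 1, 1, 1],
    "Response":           [1, 0, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of positive responses in each group.
response_rate = toy.groupby("Previously_Insured")["Response"].mean()
print(response_rate)
```

The same `groupby(...).mean()` call should work on the real `insurance` DataFrame, since `Response` is already stored as 0/1.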


### Distribution

In [None]:
plt.figure(figsize=(8,6))
plt.title('Age Distribution')
# `fill` replaces the deprecated `shade` argument in recent seaborn versions
m = sns.kdeplot(x = men['Age'], fill = True, legend = True, label = 'Male')
w = sns.kdeplot(x = female['Age'], fill = True, legend = True, label = 'Female')
plt.legend()
plt.show()

We tend to have more people in their young adult years (20-28) and around 40-45.

In [None]:
plt.figure(figsize=(8,6))

plt.title('Vehicle Damage and customers response')
sns.countplot(x = 'Vehicle_Damage', hue='Response', data = insurance)

plt.xlabel("Vehicle Damage")
plt.ylabel("Count")
plt.show()

We can see that people whose car has been damaged before tend to want an insurance.

In [None]:
plt.figure(figsize=(8,6))

plt.title('Vehicle Insured Clients')
vehicle_damage = insurance[['Gender', 'Response', 'Age']]
vehicle_damage = vehicle_damage[vehicle_damage['Response'] == 1]
men = vehicle_damage[vehicle_damage.Gender == 'Male']
female = vehicle_damage[vehicle_damage.Gender == 'Female']
# legend takes a boolean, not the string 'True'; `fill` replaces the deprecated `shade`
m = sns.kdeplot(x = men['Age'], fill = True, legend = True, label = 'Male')
w = sns.kdeplot(x = female['Age'], fill = True, legend = True, label = 'Female')

plt.legend()
plt.show()

Age definitely affects people's response about the insurance. People between the ages of 30 and 50 tend to look for an insurance.

In [None]:
plt.figure(figsize=(8,6))

plt.title('Previously Insured Clients')
vehicle_damage = insurance[['Gender', 'Previously_Insured', 'Age']]
vehicle_damage = vehicle_damage[vehicle_damage['Previously_Insured'] == 1]
men = vehicle_damage[vehicle_damage.Gender == 'Male']
female = vehicle_damage[vehicle_damage.Gender == 'Female']
# legend takes a boolean, not the string 'True'; `fill` replaces the deprecated `shade`
m = sns.kdeplot(x = men['Age'], fill = True, legend = True, label = 'Male')
w = sns.kdeplot(x = female['Age'], fill = True, legend = True, label = 'Female')

plt.legend()
plt.show()

Most of the previously insured clients are young.

### In conclusion:
1) In general our dataset is well balanced, except for people's response about new insurance.

2) Age, previous vehicle damage and previous insurance are related to the customer's response.

3) In the machine learning part we have to make sure to stratify the Response equally between the training and testing sets!

## Preparing for Machine Learning

We need to transform our string columns into numeric dummy columns in order to use them in the algorithms.
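
As a small illustration of what this encoding does, here is a sketch on a made-up three-row table (not the real data). It shows that `pd.get_dummies` leaves numeric columns alone and creates one 0/1 column per category of each string column.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Age": [25, 44, 31],
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# Numeric columns are kept as they are; every string column becomes
# one indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Printing `encoded.columns` before renaming anything by hand is a cheap way to check that the new names line up with the generated column order.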

In [None]:
insurance.head()

In [None]:
insurance.shape

In [None]:
insurance_categorical = pd.get_dummies(insurance)

In [None]:
insurance_categorical.columns = ['Age', 'Driving_License' , 'Region_Code', 'Previously_Insured',
                                 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Response',
                                 'Gender_Female', 'Gender_Male','Vehicle_Age_1-2', 'Vehicle_Age_<1',
                                 'Vehicle_Age_>2', 'Vehicle_Damage_No', 'Vehicle_Damage_Yes']

In [None]:
insurance_categorical.head()

In [None]:
insurance_categorical.shape

### Feature Selection

We are going to look at the correlations and try to exclude some of the unimportant features in this dataset.

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(insurance_categorical.corr(), annot = True, fmt = '.1g')

plt.show()

So here we can clearly see the correlations in the dataset.

Vintage and Annual Premium are not strongly correlated with any of the other features, so we can exclude them in this feature selection.

Region Code doesn't have strong relations either, but for now I think it is part of the bigger picture and I'm going to use it.

Apart from the dummy features, I don't see any pair of features correlated above 0.9 that would have to be excluded as redundant.
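
Reading a big heatmap by eye is error-prone, so the 0.9 check can also be done in code. This is a sketch on synthetic columns (the names `a`, `a_copy` and `b` are hypothetical); it lists every pair of columns whose absolute correlation is above the threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=200),  # nearly collinear with "a"
    "b": rng.normal(size=200),                        # independent of "a"
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

On the real data the same lines could be run on `insurance_categorical.corr()`; the complementary dummy columns would be expected to show up in the result.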

In [None]:
insurance_categorical = insurance_categorical[['Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
                   'Policy_Sales_Channel', 'Gender_Female', "Gender_Male",
                   'Vehicle_Age_1-2', 'Vehicle_Age_<1', 'Vehicle_Age_>2', 'Vehicle_Damage_No',
                          'Vehicle_Damage_Yes', 'Response']]


In [None]:
insurance_categorical.head()

In [None]:
insurance_categorical.shape

### Normalization

I am going to scale the data using the standard MinMax scaler.
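
MinMax scaling maps each column to `(x - min) / (max - min)`. The sketch below uses made-up numbers and shows one possible variation, not what the notebook does: the scaler is fitted on a training part only and then reused on unseen rows, so the test data does not influence the learned minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for a single column.
train = np.array([[20.0], [40.0], [60.0]])
test = np.array([[30.0], [80.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=20 and max=60 from the training rows
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.25 1.5 ]
```

Values outside the training range (80 here) end up outside [0, 1], which is expected with this approach.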

In [None]:
scaler = MinMaxScaler()

In [None]:
insurance_categorical_scaled = pd.DataFrame(scaler.fit_transform(insurance_categorical), index=insurance_categorical.index,
                                            columns=insurance_categorical.columns)

In [None]:
insurance_categorical_scaled.describe().T

In [None]:
pca = PCA()
pca_data = pca.fit_transform(insurance_categorical_scaled)
np.cumsum(pca.explained_variance_ratio_)

The cumulative explained variance shows how much of the information is carried by the first few principal components. I will first try without dropping any columns and then I will drop some of them.
I want to see what results we are going to get for each hypothesis.
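
The cumulative sum can be turned into a concrete number of components. This is a sketch on synthetic data; the 95% threshold is an assumption chosen for illustration and is not used anywhere else in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
base = rng.normal(size=(300, 2))
# Four columns, two of which are noisy copies of the other two.
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.01 * rng.normal(size=300),
    base[:, 1] + 0.01 * rng.normal(size=300),
])

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Index of the first component at which the cumulative ratio reaches 0.95, plus one.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative, n_components)
```

Because two of the four columns are near duplicates, two components are enough to pass the threshold here.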

### Split data

Since we have 381109 observations, I'm going to split them into training/testing sets 70/30 and will use cross-validation with 5 splits of the training set.

I'm going to stratify on 'Response' because, as we saw, people who respond with 'No' outnumber those who respond with 'Yes' and we want both sets to keep the same proportion.
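
To show what stratification buys us, here is a sketch on a made-up target with 12% positives (the ratio is an assumption for illustration only). With `stratify`, both parts of the split keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 12% positive labels.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 120 + [0] * 880)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42
)

# Both shares stay at the original 12%.
print(y_train.mean(), y_test.mean())
```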

In [None]:
insurance_categorical_scaled_attribues = insurance_categorical_scaled.drop('Response', axis = 1)
insurance_categorical_scaled_target = insurance_categorical_scaled['Response']

In [None]:
print(insurance_categorical_scaled_attribues.shape)
print(insurance_categorical_scaled_target.shape)

In [None]:
insurance_attrributes_train, insurance_attrributes_test,insurance_target_train, insurance_target_test = train_test_split(insurance_categorical_scaled_attribues,
                                                                                                                         insurance_categorical_scaled_target,
                                                                                                                         train_size = 0.7, stratify = insurance_categorical_scaled_target,
                                                                                                                        random_state = 42)

In [None]:
print(insurance_attrributes_train.shape)
print(insurance_attrributes_test.shape)
print(insurance_target_train.shape)
print(insurance_target_test.shape)

In [None]:
k_fold = StratifiedKFold(n_splits = 5)

## Machine Learning algorithms

The problem can be classified as Binary Classification

I'm going to look for the best algorithm using grid search (for hyperparameter tuning), basic linear classification algorithms and ensemble methods as well.


The functions that I will use are:

    1) GetModelScores - here I fit the model and print the scores for both the training and testing sets
    
    2) GetOnlyScores - here I only print the scores of an already fitted model

In [None]:
def GetModelScores (estimator, X_train, X_test, y_train, y_test):
    scores_train = pd.DataFrame(columns= ['Accuracy','F1 Score','Precision','Recall','ROC_AUC'])
    scores_test = pd.DataFrame(columns= ['Accuracy','F1 Score','Precision','Recall','ROC_AUC'])
    
    model = estimator
    model.fit(X_train, y_train)
    
    prediction_train = model.predict(X_train)
    prediction_test = model.predict(X_test)
    
    # ROC AUC needs class probabilities; fall back to 0 for estimators without predict_proba
    try:
        score_train = model.predict_proba(X_train)[:,1]
        roc_train= roc_auc_score(y_train, score_train, average = "weighted")
    except AttributeError:
        roc_train = 0
        
    try:
        score_test = model.predict_proba(X_test)[:,1]
        roc_test= roc_auc_score(y_test, score_test, average = "weighted")
    except AttributeError:
        roc_test = 0
    
    
    
    scores_train['Accuracy'] = accuracy_score(y_train, prediction_train)*100,
    scores_train['F1 Score'] = f1_score(y_train, prediction_train, average = "weighted")*100,
    scores_train['Precision'] = precision_score(y_train, prediction_train, average = "weighted")*100,
    scores_train['Recall'] = recall_score(y_train, prediction_train, average = "weighted")*100,
    scores_train['ROC_AUC'] = roc_train*100
    
       
    scores_test['Accuracy'] = accuracy_score(y_test, prediction_test)*100,
    scores_test['F1 Score'] = f1_score(y_test, prediction_test, average = "weighted")*100,
    scores_test['Precision'] = precision_score(y_test, prediction_test, average = "weighted")*100,
    scores_test['Recall'] = recall_score(y_test, prediction_test, average = "weighted")*100,
    scores_test['ROC_AUC'] = roc_test*100
    
    print(scores_train)
    print(scores_test)

In [None]:
def GetOnlyScores (estimator, y_test, X_test, y_train, X_train):
    scores_train = pd.DataFrame(columns= ['Accuracy','F1 Score','Precision','Recall','ROC_AUC'])
    scores_test = pd.DataFrame(columns= ['Accuracy','F1 Score','Precision','Recall','ROC_AUC'])
    
    prediction_train = estimator.predict(X_train)
    prediction_test = estimator.predict(X_test)
    
    # ROC AUC needs class probabilities; fall back to 0 for estimators without predict_proba
    try:
        score_train = estimator.predict_proba(X_train)[:,1]
        roc_train= roc_auc_score(y_train, score_train, average = "weighted")
    except AttributeError:
        roc_train = 0
        
    try:
        score_test = estimator.predict_proba(X_test)[:,1]
        roc_test= roc_auc_score(y_test, score_test, average = "weighted")
    except AttributeError:
        roc_test = 0
    
   
    scores_train['Accuracy'] = accuracy_score(y_train, prediction_train)*100,
    scores_train['F1 Score'] = f1_score(y_train, prediction_train, average = "weighted")*100,
    scores_train['Precision'] = precision_score(y_train, prediction_train, average = "weighted")*100,
    scores_train['Recall'] = recall_score(y_train, prediction_train, average = "weighted")*100,
    scores_train['ROC_AUC'] = roc_train*100
       
    scores_test['Accuracy'] = accuracy_score(y_test, prediction_test)*100,
    scores_test['F1 Score'] = f1_score(y_test, prediction_test, average = "weighted")*100,
    scores_test['Precision'] = precision_score(y_test, prediction_test, average = "weighted")*100,
    scores_test['Recall'] = recall_score(y_test, prediction_test, average = "weighted")*100,
    scores_test['ROC_AUC'] = roc_test*100

    
    print(scores_train)
    print(scores_test)
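
One thing to keep in mind with the weighted averages used in these helpers: on an unbalanced target they are dominated by the majority class. The sketch below uses a made-up target and a dummy "model" that always predicts 'No'; it compares the weighted F1 score with the scores for the positive class only.

```python
from sklearn.metrics import f1_score, recall_score

# Made-up labels: 9 negatives, 1 positive, and predictions that are always 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
positive_f1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)
positive_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

# The weighted score looks good although no positive case is ever found.
print(round(weighted_f1, 3), positive_f1, positive_recall)
```

Looking at the positive-class scores next to the weighted ones may give a more honest picture of how well the 'Yes' responses are detected.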

### H0: Using all the attributes 

I'm going to try only basic algorithms without hyperparameter tuning and look at their scores.

#### 1) Logistic Regression

In [None]:
GetModelScores(LogisticRegressionCV(), insurance_attrributes_train, insurance_attrributes_test, insurance_target_train, insurance_target_test)

#### 2) Decision Tree

In [None]:
GetModelScores(DecisionTreeClassifier(),insurance_attrributes_train, insurance_attrributes_test, insurance_target_train, insurance_target_test)

#### 3) Random Forest

In [None]:
GetModelScores(RandomForestClassifier(), insurance_attrributes_train, insurance_attrributes_test, insurance_target_train, insurance_target_test)

#### 4) K-Neighbors

In [None]:
#GetModelScores(KNeighborsClassifier(), insurance_attrributes_train, insurance_attrributes_test, insurance_target_train, insurance_target_test)

#### I also tried an SVC but it is really slow on data of this size: Accuracy: 0.877, F1 Score: 0.820, Precision: 0.771, Recall: 0.877

Based on the ROC score I will perform hyperparameter tuning with Grid Search on Logistic Regression and Random Forest.

#### 1) Logistic Regression

In [None]:
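# LogisticRegressionCV expects Cs to be an int or a list of floats, so each candidate is wrapped in a list
parameters = {
    'Cs': [[0.001], [0.01], [1], [10], [100]],
    "max_iter": [30, 50, 70]
}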

In [None]:
grid_logistic = GridSearchCV(LogisticRegressionCV(), param_grid = parameters, scoring = 'roc_auc', cv = k_fold, n_jobs =-1)

In [None]:
grid_logistic.fit(insurance_attrributes_train, insurance_target_train)

In [None]:
grid_logistic.best_estimator_

In [None]:
grid_logistic.cv_results_

In [None]:
GetOnlyScores(grid_logistic, insurance_target_test, insurance_attrributes_test,
              insurance_target_train, insurance_attrributes_train)

We can see that mean_test_score is about the same as the ROC_AUC on the testing set, which is a good sign that the model is not overfitting.
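
This comparison can be done directly in code. The sketch below runs on synthetic data from `make_classification` and uses a plain `LogisticRegression` with a small hypothetical grid instead of the `LogisticRegressionCV` setup above, just to keep it short; it puts the best cross-validated AUC next to the AUC on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, unbalanced data (about 15% positives); not the insurance dataset.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 1, 100]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5),
)
grid.fit(X_train, y_train)

cv_auc = grid.best_score_  # mean AUC over the folds for the best candidate
test_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(f"CV AUC {cv_auc:.3f} vs test AUC {test_auc:.3f}")
```

A test AUC far below the cross-validated one would hint at overfitting; similar values, as described above, are the reassuring case.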

#### 2) Random Forest

In [None]:
parameters = {
    "n_estimators": [50, 200, 400],
    "max_depth": [10, 50, 70]
}

In [None]:
grid_forest = GridSearchCV(RandomForestClassifier(), parameters, scoring = 'roc_auc', cv = k_fold, n_jobs =-1)
grid_forest.fit(insurance_attrributes_train, insurance_target_train)

grid_forest.best_estimator_


In [None]:
grid_forest.cv_results_

In [None]:
GetOnlyScores(grid_forest, insurance_target_test, insurance_attrributes_test, 
              insurance_target_train, insurance_attrributes_train)

In [None]:
print(f'We see that the ROC score is better than logistic regression also the F1_score so for now we can stop with Random Forest with {grid_forest.best_estimator_}')

In [None]:
plot_confusion_matrix(grid_forest, insurance_attrributes_test, insurance_target_test, normalize='pred')

Because the target variable is highly unbalanced, less than 50% of the predicted positives are true positives.
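
How the percentages in the matrix should be read depends on the `normalize` argument. This is a sketch with made-up labels; it shows that `normalize='pred'` divides by the column totals (so the diagonal is the precision per class), while `normalize='true'` divides by the row totals (so the diagonal is the recall per class).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 8 negatives, 2 positives; two negatives are predicted as positive.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 6 + [1] * 2 + [1, 1]

by_pred = confusion_matrix(y_true, y_pred, normalize="pred")  # columns sum to 1
by_true = confusion_matrix(y_true, y_pred, normalize="true")  # rows sum to 1

print(by_pred)  # bottom-right cell: precision of the positive class (0.5)
print(by_true)  # bottom-right cell: recall of the positive class (1.0)
```

As far as I know, `plot_confusion_matrix` was removed in scikit-learn 1.2; on newer versions `ConfusionMatrixDisplay.from_estimator` takes the same estimator, data and `normalize` arguments.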

### H1: Machine Learning without high correlation attributes

Here I will see the results if I drop some of the highly correlated columns. That way I can perform manual dimensionality reduction and try to get better results.
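
The redundant dummy columns can also be avoided at encoding time. This is only a sketch of an alternative on a made-up table, not what the cells below do: `drop_first=True` drops the first category of every encoded column, so the perfectly complementary column never gets created.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Damage": ["Yes", "No", "No"],
})

# The first category in sorted order ("Female", "No") is dropped for each column.
reduced = pd.get_dummies(toy, drop_first=True)
print(list(reduced.columns))  # ['Gender_Male', 'Vehicle_Damage_Yes']
```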

In [None]:
insurance_lower_dim = insurance_categorical.drop(['Gender_Female', 'Vehicle_Age_<1', 'Vehicle_Damage_No'], axis = 1)

In [None]:
insurance_lower_dim.shape

In [None]:
insurance_dim_target = insurance_lower_dim.Response
insurance_dim_attributes = insurance_lower_dim.drop('Response', axis=1)

In [None]:
scaler = MinMaxScaler()

In [None]:
insurance_lower_dim_scaled = scaler.fit_transform(insurance_dim_attributes)

In [None]:
insurance_lower_dim_scaled.shape

In [None]:
dim_scaled_attributes_train, dim_scaled_attributes_test,dim_target_train, dim_target_test = train_test_split(insurance_lower_dim_scaled, insurance_dim_target,
                                                                                                            test_size = 0.3,stratify = insurance_dim_target,
                                                                                                            random_state = 42)


In [None]:
print(dim_scaled_attributes_train.shape)
print(dim_scaled_attributes_test.shape)
print(dim_target_train.shape)
print(dim_target_test.shape)

#### Random Forest Classifier for dimension reduction dataset

In [None]:
GetModelScores(RandomForestClassifier(), dim_scaled_attributes_train, dim_scaled_attributes_test,dim_target_train, dim_target_test)

#### Logistic Regression for dimension reduction dataset

In [None]:
GetModelScores(LogisticRegressionCV(cv = k_fold), dim_scaled_attributes_train, dim_scaled_attributes_test,dim_target_train, dim_target_test)

### Conclusion between H0 and H1

I am going to use the smaller dataset since the performance is the same as with the full 12-attribute one.
That way I can reduce the risk of overfitting and the models will train faster.

### BOOSTING 

I want to see whether boosting ML algorithms can give better results for the classes.

#### AdaBoostClassifier

In [None]:
GetModelScores(AdaBoostClassifier(), dim_scaled_attributes_train, dim_scaled_attributes_test,dim_target_train, dim_target_test)

#### GradientBoostingClassifier

In [None]:
GetModelScores(GradientBoostingClassifier(), dim_scaled_attributes_train, dim_scaled_attributes_test,dim_target_train, dim_target_test)

#### LGBMClassifier

In [None]:
GetModelScores(LGBMClassifier(), dim_scaled_attributes_train, dim_scaled_attributes_test,dim_target_train, dim_target_test)

#### We can see that LGBMClassifier is the best boosting algorithm and it is really fast with big data. I'm going to perform hyperparameter Tuning
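
Before starting a grid search on data of this size it helps to know how many models will be trained. The sketch below counts the candidates for the grid used in this section with `ParameterGrid` and multiplies by the 5 folds of `k_fold`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in this section.
parameters_grid = {
    "num_leaves": [5, 10, 50],
    "n_estimators": [200, 400, 600],
    "reg_lambda": [5, 50, 100],
}
n_splits = 5  # matches StratifiedKFold(n_splits = 5)

n_candidates = len(ParameterGrid(parameters_grid))  # 3 * 3 * 3 combinations
total_fits = n_candidates * n_splits                # plus one final refit on the whole training set
print(n_candidates, total_fits)  # 27 135
```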

In [None]:
parameters_grid = {
             'num_leaves': [5, 10, 50], 
             'n_estimators': [200, 400, 600],
             'reg_lambda': [5, 50, 100]
        }

In [None]:
grid_booster = GridSearchCV(LGBMClassifier(), parameters_grid, scoring = 'roc_auc',
                            n_jobs = -1, cv = k_fold)

In [None]:
grid_booster.fit(dim_scaled_attributes_train, dim_target_train)

In [None]:
print(f'Here we can see with hyperparameters: {grid_booster.best_estimator_} we have best scores for:')

In [None]:
GetOnlyScores(grid_booster, dim_target_test, dim_scaled_attributes_test,dim_target_train, dim_scaled_attributes_train)

In [None]:
plot_confusion_matrix(grid_booster, dim_scaled_attributes_test, dim_target_test, normalize='pred')
plt.show()

We see that for the Negative response we have 88%, but for the Positive one only 47%. That's because we have an unbalanced Response.
To get better scores in this matrix I'm going to perform oversampling.

### H2: UNDER- AND OVER-SAMPLING

I'm going to use the combined method SMOTETomek to see how well the model is going to perform.

In [None]:
from imblearn.combine import SMOTETomek

In [None]:
insurance_lower_dim.Response.value_counts()

We can see original distribution of the target labels 

In [None]:
oversample_attributes = insurance_lower_dim.drop('Response', axis = 1)
oversample_target = insurance_lower_dim.Response

In [None]:
balance_data = SMOTETomek()

In [None]:
oversample_attributes_res, oversample_target_res = balance_data.fit_resample(oversample_attributes, oversample_target)

In [None]:
oversample_target_res.value_counts()

We can see that the positive responses are now equal in number to the negative ones, because I performed oversampling together with a little bit of undersampling.
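
A caveat worth keeping in mind: when resampling happens before the train/test split, synthetic rows can end up in the test set, which may make the test scores look better than they would on real, unbalanced data. The sketch below shows the alternative order on made-up data. It uses simple random oversampling from `sklearn.utils.resample` as a stand-in for SMOTETomek, so it is not the method used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up data: 200 rows, 10% positives.
X = np.arange(200).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 180)

# Split first, so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Oversample the minority class in the training part only.
minority = X_train[y_train == 1]
majority = X_train[y_train == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)

X_train_bal = np.vstack([majority, minority_up])
y_train_bal = np.array([0] * len(majority) + [1] * len(minority_up))

print(np.bincount(y_train_bal))  # balanced training labels
print(np.bincount(y_test))       # test labels keep the original imbalance
```

With imbalanced-learn the same idea would be to call `fit_resample` on the training part only.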

In [None]:
sc = MinMaxScaler()

In [None]:
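# Keep the scaled attributes under their own name (the original line overwrote oversample_target)
oversample_attributes_scaled = sc.fit_transform(oversample_attributes_res)

In [None]:
# Split the scaled attributes so that the scaling is actually used by the model
oversample_attributes_train, oversample_attributes_test, oversample_target_train, oversample_target_test = train_test_split(oversample_attributes_scaled,
                                                                                             oversample_target_res, test_size = 0.3, 
                                                                                            random_state = 42)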

#### GridSearch with LGBMClassifier

In [None]:
mod_params = {
              'n_estimators':[400, 600, 800],
              'num_leaves': [10, 50, 80],
              'reg_lambda': [0.001, 1, 5, 10]
}

In [None]:
mod = GridSearchCV(LGBMClassifier(), mod_params, scoring = 'roc_auc', cv = k_fold)

In [None]:
mod.fit(oversample_attributes_train, oversample_target_train)

In [None]:
mod.best_estimator_

In [None]:
GetOnlyScores(mod, oversample_target_test, oversample_attributes_test, oversample_target_train, oversample_attributes_train)

In [None]:
plot_confusion_matrix(mod, oversample_attributes_test, oversample_target_test, normalize='pred')
plt.show()

In [None]:
plot_roc_curve(mod, oversample_attributes_test, oversample_target_test, name = "Certainty of the algorithm")
plt.show()

We can see how certain the model is in predicting the output classes. Looking at the graph, the curve rises steeply towards the top-left corner instead of drifting to the False Positive (right) side.
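
For readers who want to see what is behind the curve, here is a sketch with four made-up scores. `roc_curve` returns the false positive and true positive rates at each threshold, and `roc_auc_score` gives the area under that curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# 3 of the 4 (negative, positive) pairs are ranked correctly, so the AUC is 0.75.
print(fpr, tpr, auc)
```

An AUC of 0.5 corresponds to the diagonal (random guessing) and 1.0 to a curve that goes straight up to the top-left corner.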

### CONCLUSION

We can see that some scores are lower in comparison to the parameter-tuned LGBMClassifier, but the roc_auc score is better and the confusion matrix also gives us better results. For now I'm satisfied with our last model, for which we performed:

1) MinMax scaling

2) Feature selection, in which we reduced the original 15 columns to 10

3) Over- and under-sampling to deal with the unbalanced dataset

4) We chose the best ML algorithm, the boosting algorithm LGBMClassifier. It is fast with big data and we got the best scores with it.

5) Grid Search with Cross Validation, which gave us the best estimator.

6) And at last we combined all of these steps to get 91% True Negatives and 76% True Positives