<h1><b>Project Type - Supervised Machine learning (Classification Model)
<h1><b>Contribution - Individual

<h1><b>Project Summary


* This project aimed to help an insurance company expand its business by offering vehicle insurance to its existing health insurance customers. The company wanted to find out which customers were likely to be interested in buying vehicle insurance. They collected data on their past customers and used it to build a predictive model.

* The first step was to clean and explore the data. They made sure there were no duplicate or empty values in the data, and they organized the data in a way that could be easily analyzed. They also looked for any patterns or trends in the data through exploratory analysis. They discovered that a higher number of interested customers were male.

* To create the predictive model, they compared four different machine learning algorithms: Logistic Regression, K-Nearest Neighbors, Random Forest, and XGBoost. They used various metrics to evaluate the models' performance, including accuracy, precision, recall, F1-score, average precision, and ROC AUC score. They also fine-tuned the models using hyperparameter tuning to improve their performance.

* The dataset used in the study had information on 381,109 customers and 12 features. They conducted feature engineering to handle missing values, outliers, and correlations between features. They also transformed categorical variables into numerical ones. The dataset was split into a training set and a testing set, with a ratio of 70:30.

* After evaluating the four machine learning algorithms, it was found that Gradient Boosting provided the best overall performance in terms of accuracy, precision, recall, and F1 score. The Gradient Boosting algorithm achieved an accuracy of 84% and an average precision, recall, and F1 score of 84%, 85%, and 84%, respectively. It also achieved a ROC AUC score of 84%. While other algorithms such as Random Forest, XGBoost, KNN, and Logistic Regression performed well with accuracy scores ranging from 80% to 82%, they did not outperform the Gradient Boosting algorithm.

* These results suggest that the Gradient Boosting algorithm is effective in predicting customer interest in vehicle insurance. The predictive model can be used to target marketing campaigns towards potential customers. Overall, the project showed how machine learning algorithms can help identify potential customers and expand a business's customer base.

<h1><b>Business Context

* Vehicle insurance is similar to medical insurance, but instead of covering medical expenses, it covers damages related to vehicles. Every year, customers pay a premium amount to an insurance company. In case of an accident or any unfortunate event involving the insured vehicle, the insurance company provides compensation, known as the "sum assured," to the customer.

* Now, the insurance company wants to develop a model that can predict whether a customer would be interested in purchasing vehicle insurance. By understanding which customers are more likely to be interested, the company can create targeted communication strategies to reach out to them. This will help the company optimize its business model and increase its revenue.

* By building a predictive model using customer data, such as demographics, vehicle details, and policy information, the company can predict the likelihood of a customer being interested in vehicle insurance. This model will enable the company to focus its communication efforts on those customers who are more likely to be interested in purchasing insurance. By reaching out to the right customers, the company can improve its business model and increase its revenue by selling more vehicle insurance policies.






<h1><b>Problem Statement
 
* The goal of this project is to predict whether customers are interested in buying vehicle insurance using a dataset that contains information about 381,109 customers and 12 different characteristics. These characteristics include details about demographics like gender, age, and region, as well as information about vehicles such as their age and damage history. The dataset also includes information about the insurance policy, such as the premium amount and how the customer learned about the insurance.

* The main focus is on the "Response" column, which indicates whether a customer is interested in buying vehicle insurance or not. This is the variable we want to predict accurately.

* The aim is to develop a model that can analyze the given information about each customer and predict their interest in vehicle insurance. By training the model on this dataset and utilizing various machine learning techniques, we can build a predictive model that will be able to identify potential customers who are likely to be interested in purchasing vehicle insurance.

* The ultimate goal is to use this predictive model to target marketing efforts towards those customers who are most likely to be interested in vehicle insurance, thus increasing the efficiency and effectiveness of the insurance company's marketing strategy.

# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np

# Importing Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import XGBRFClassifier

from sklearn.model_selection import GridSearchCV

from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score


# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv("/content/drive/MyDrive/capstone project 3/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv")
df

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
# making a copy of the data for safety
df_copy = df.copy()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(len(df[df.duplicated()]))

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum().sort_values(ascending= False).reset_index().rename(columns={'index':'Columns',0:'Null values'})

In [None]:
# Visualizing the missing values
plt.figure(figsize=(14, 5))
sns.heatmap(df.isnull(), cmap='viridis', yticklabels=False)
plt.xlabel("column_name", size=14, weight="bold")
plt.title("Missing Values in Column",fontweight="bold",size=17)
plt.show()


### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

<h1><b>Variable Description

>id: Unique ID for the customer

>Gender : Gender of the customer

>Age : Age of the customer

>Driving_License : 0 = Customer does not have DL, 1 = Customer already has DL

>Region_Code : Unique code for the region of the customer

>Previously_Insured : 1 = Customer already has Vehicle Insurance, 0 = Customer doesn't have Vehicle Insurance

>Vehicle_Age : Age of the Vehicle

>Vehicle_Damage : 1 = Customer got his/her vehicle damaged in the past. 0 = Customer didn't get his/her vehicle damaged in the past.

>Annual_Premium : The amount customer needs to pay as premium in the year

>Policy_Sales_Channel : Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.

>Vintage : Number of Days, Customer has been associated with the company

>Response : 1 : Customer is interested, 0 : Customer is not interested

In [None]:
# Dataset Describe
df.describe(include="all")

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique().reset_index().rename(columns={'index':'Columns',0:'Unique values'})

## 3. ***Data Wrangling***

>Data wrangling (also known as data munging) is the practice of cleansing, restructuring, and enriching raw data. It is a critical step because raw data is rarely usable for analysis as-is.

>Raw data is complex because it has not been processed or integrated into a system. With data wrangling, these records are transformed into a standard format that helps highlight valuable insights. The process entails consolidating data into one location and rectifying any missing information or errors.


<h1>Drop all columns that are not required for our analysis

In [None]:
# drop Id columns 
df.drop('id',inplace=True,axis=1)

<h1>Extract numerical and categorical columns

In [None]:
# make a function to extract categorical and numerical columns
def extract_cat_num(df):
  '''
  This function extracts the categorical and numerical column names from a dataset.
  Columns with dtype 'object' are treated as categorical, all others as numerical.
  '''

  cat_col=[col for col in df.columns if df[col].dtype=='object']
  num_col=[col for col in df.columns if df[col].dtype!='object']
  return cat_col,num_col

In [None]:
cat_col,num_col=extract_cat_num(df)

In [None]:
df.nunique()

In [None]:
# fetch unique value in categorical feature columns

for col in cat_col:
  print('{} has {} values'.format(col,df[col].unique()))
  print('\n')

<h1>Visualized categorical columns

In [None]:
plt.figure(figsize=(20,4),dpi=200)

for i,feature in enumerate(cat_col):
  plt.subplot(1,3,i+1)
  sns.countplot(x=df[feature])
  plt.title(feature,fontsize=16,color='red')

In [None]:
# fetch numerical columns 
num_col

<h1>Visualized numerical column

In [None]:
plt.figure(figsize=(10,6),dpi=200)

for i,feature in enumerate(num_col):
  plt.subplot(3,3,i+1)
  df[feature].hist()
  plt.title(feature, fontsize=12,color='red')
  plt.tight_layout()

In [None]:
# check data types 
df.dtypes

###<h1> What all manipulations have you done and insights you found?

>Dropped the id column

>No attribute shows any discrepancy, so no column needs to be corrected.

>Categorical columns are: Gender, Vehicle_Age, Vehicle_Damage

>Numerical columns are: Age, Driving_License, Region_Code, Previously_Insured, Annual_Premium, Policy_Sales_Channel, Vintage, Response

>Response attribute is our Output

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

####<h1> Chart - 1        Pie chart on dependent variable

In [None]:
# Chart - 1 visualization code

df['Response'].value_counts().plot(kind='pie',figsize=(15,6), autopct="%1.1f%%",shadow=False,labels=['Non_Interested(%)','Interested(%)'])
plt.title('Response of customers',color='red');

# count of Target variable
df['Response'].value_counts()

#####<h1> 1. Why did you pick the specific chart?

>A pie chart is a useful tool to display the distribution of various categories in a dataset. By dividing the circle into proportional sections, each representing a different category, the pie chart allows for a clear comparison of the relative size of each category. The use of different colors for each section further enhances the clarity of the representation and makes it easier to understand and interpret the data.

#####<h1> 2. What is/are the insight(s) found from the chart?

>The data shows that a large majority (87.7%) of customers are not interested, while a smaller portion (12.3%) are interested. The response variable is imbalanced with more instances of "not interested" than "interested".
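>With classes this imbalanced, accuracy alone can mislead: a model that always predicts "not interested" already scores about as high as the majority share. The sketch below illustrates this using the 87.7% / 12.3% split read from the chart. The counts are an assumption, scaled to 1,000 customers for readability, not the dataset's actual counts.

```python
# Illustrative counts scaled to 1,000 customers (assumption, not the real data)
n_not_interested = 877
n_interested = 123
total = n_not_interested + n_interested

# A "model" that always predicts the majority class (Response = 0)
baseline_accuracy = n_not_interested / total

# It never predicts Response = 1, so it finds none of the interested customers
baseline_recall_interested = 0 / n_interested

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Recall on interested customers: {baseline_recall_interested:.3f}")
```

>This is why precision, recall, and F1-score are reported alongside accuracy later in the notebook.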

#####<h1> 3. Will the gained insights help creating a positive business impact? 
<h3>Are there any insights that lead to negative growth? Justify with specific reason.


>It depends on the specific business scenario and the insights that were gained from the pie chart. The pie chart provides information on the proportion of different categories in a dataset, but it is up to the business to use that information in a meaningful way to drive positive impact.

>For example, if the pie chart showed a large proportion of customers who were not interested, the business could use that information to identify areas for improvement and increase customer engagement. On the other hand, if the pie chart showed a large proportion of customers who were interested, the business could capitalize on that by focusing on maintaining and growing that customer base.

>Ultimately, the impact will depend on how the insights are applied and acted upon by the business.

####<h1> Chart - 2    Visualized categorical variable with target *variable*

In [None]:
# Chart - 2 visualization code

# distribution of categorical variables in the dataset

plt.figure(figsize=(20,6),dpi=200)

for i, feature in enumerate(cat_col):
  plt.subplot(1, len(cat_col), i+1)
  sns.countplot(x=df[feature], hue='Response', data=df)
  plt.title(feature, fontsize=16, color='red')
  plt.tight_layout()

#####<h1> 1. Why did you pick the specific chart?

>A count plot, also referred to as a bar plot, is a visualization technique that displays the frequency of each category in a categorical or nominal variable. The frequency counts are represented as bars, making it simple to understand the distribution of values in the dataset. Furthermore, the y-axis can be adjusted to show not just the count, but also other statistics such as the percentage of total values for each category. This additional information helps to provide deeper insights into the data and facilitates easy comparison of the proportions of different categories.
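>Raw counts can hide differences in rates when categories have very different sizes. As a hedged sketch of the "percentage" view mentioned above, the snippet below computes the share of each Response value within every category using `pd.crosstab`. The `toy` frame and its values are made up for illustration and stand in for the notebook's `df`.

```python
import pandas as pd

# Toy data standing in for the notebook's df (values are made up for illustration)
toy = pd.DataFrame({
    "Vehicle_Damage": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Response":       [1,     1,     0,     0,     0,    0,    0,    1],
})

# Share of each Response value within every Vehicle_Damage category (each row sums to 1)
response_share = pd.crosstab(toy["Vehicle_Damage"], toy["Response"], normalize="index")
print(response_share)
```

>The same call on the real data would take `df[feature]` and `df['Response']` for each column in `cat_col`.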

#####<h1> 2. What is/are the insight(s) found from the chart?

>Males are more interested in vehicle insurance than females, as seen by comparing the counts of interested customers in the Gender panel.

>In terms of Vehicle_Age, customers with vehicles 1-2 years old are the most interested in insurance, followed by those with vehicles < 1 year and > 2 years old.

>Customers whose vehicles have been damaged in the past (Vehicle_Damage) are more interested in insurance than customers without vehicle damage.

#####<h1> 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

>If the goal is to target customers who are more likely to purchase insurance, the insights from these plots, namely that male customers and customers with 1-2 year old vehicles are more interested in insurance, could inform targeted marketing efforts.

####<h1> Chart - 3 Continuous variable with target variable

In [None]:
# Chart - 3 visualization code

# checking Outliers in numeric features using seaborn boxplot

# list of continuous columns in dataset

# named cont_cols so that Python's built-in list is not shadowed
cont_cols=['Age','Region_Code','Annual_Premium','Policy_Sales_Channel','Vintage']

plt.figure(figsize=(12,6), dpi=200)

for i, feature in enumerate(cont_cols):
    plt.subplot(2, 3, i+1)
    sns.boxplot(x='Response', y=df[feature], data=df)
    plt.title(feature, fontsize=16, color='red')

plt.tight_layout()

#####<h1> 1. Why did you pick the specific chart?

>A box plot, or box and whisker plot, is a visualization technique used to show the distribution of continuous data, not categorical data. The plot displays information about the shape of the distribution, including the median, quartiles, and outliers. The box of the plot displays the interquartile range (IQR), which represents the range between the first and third quartiles (25th and 75th percentiles), while the whiskers extend to the minimum and maximum values, excluding outliers. Outliers are plotted as individual points outside of the whiskers.

>The box plot provides a concise summary of the data, making it useful for comparing distributions across different groups or for identifying potential outliers or skewness in the data. However, it is not suitable for visualizing categorical or nominal data. For categorical data, you would typically use a bar plot instead.
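>The outlier rule behind a box plot can be sketched numerically. The snippet below assumes the common 1.5 × IQR convention (the default whisker length in matplotlib and seaborn) and uses made-up values, not the notebook's Annual_Premium column.

```python
import numpy as np

# Toy values (made up); the last one is deliberately extreme
values = np.array([20, 22, 25, 27, 30, 31, 33, 35, 38, 200], dtype=float)

# First and third quartiles, and the interquartile range (the height of the box)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points farther than 1.5 * IQR from the box are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr, outliers)
```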

#####<h1> 2. What is/are the insight(s) found from the chart?

>The figure contains one subplot per continuous feature, each a box plot of that feature split by the target variable Response (one box per Response value). This allows a visual assessment of the relationship between the feature and the target, and the points beyond the whiskers show potential outliers.

>From the plots, only the Annual_Premium feature has outliers, which may affect the performance of machine learning algorithms. It's important to consider this when selecting or preprocessing features for modeling.



#####<h1> 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

>If the insights lead to informed decisions and effective strategies, they have the potential to drive positive business impact.

####<h1> Chart - 4  KDE plot of all numerical columns with target variable


In [None]:
# Chart - 4 visualization code

plt.figure(figsize = (16, 16))
plt.suptitle("Analysis Of Variable Response",fontweight="bold", fontsize=20)

# fill=True shades the area under each curve (it replaces the deprecated shade=True)
plt.subplot(4,3,1)
sns.kdeplot(x='Age', hue='Response', palette = 'Set2', fill=True, data=df)

plt.subplot(4,3,2)
sns.kdeplot(x='Region_Code', hue='Response', palette = 'Set2', fill=True, data=df)

plt.subplot(4,3,3)
sns.kdeplot(x='Annual_Premium', hue='Response', palette = 'Set2', fill=True, data=df)

plt.subplot(4,3,4)
sns.kdeplot(x='Policy_Sales_Channel', hue='Response', palette = 'Set2', fill=True, data=df)

plt.subplot(4,3,5)
sns.kdeplot(x='Vintage', hue='Response', palette = 'Set2', fill=True, data=df)


#####<h1> 1. Why did you pick the specific chart?

>A kernel density estimate (KDE) plot shows the distribution of a continuous variable as a smooth curve. Plotting one curve per Response value with seaborn's kdeplot makes it easy to compare how each numerical feature is distributed for interested and non-interested customers.

#####<h1> 2. What is/are the insight(s) found from the chart?

>From these plots, we learned that Age and Annual_Premium are positively skewed.
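>Skewness can also be checked numerically rather than read off a plot. A minimal sketch with pandas on a made-up right-skewed sample follows; on the real data the same call would be `df['Age'].skew()` or `df['Annual_Premium'].skew()`.

```python
import pandas as pd

# Toy right-skewed sample (made up): most values are small, a few are large
sample = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 10, 25])

# A positive value indicates a longer right tail (positive skew)
skewness = sample.skew()

print(f"skewness = {skewness:.2f}")
# In a positively skewed sample the mean is pulled above the median
print(f"mean = {sample.mean():.2f}, median = {sample.median():.2f}")
```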

####<h1> Chart - 5 Relationship between age and target variable

In [None]:
# Chart - 5 visualization code
#Age VS Response
plt.figure(figsize=(20,10))
sns.countplot(x='Age',hue='Response',data=df);
plt.tight_layout()

#####<h1> 1. Why did you pick the specific chart?

>In this count plot the x-axis is the customer's age and the bar heights are the numbers of customers with each Response value. The 20-30 age group has a higher count of customers interested in vehicle insurance than the 50+ age groups, while the 30-50 age group has the highest overall interest in vehicle insurance.

#####<h1> 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

>If the target audience for vehicle insurance is primarily in the 20-30 and 30-50 age groups, marketing efforts can be concentrated on those groups.


#####<h1><b>Hypothesis Testing<h1>


>Null hypothesis (H0): there is no relationship between the predictors and the response variable (vehicle insurance purchase).

>Alternative hypothesis (Ha): there is a relationship between the predictors and the response variable.

In [None]:
# performing the independent t-test on the numerical variables
import scipy.stats as stats

# collect one row of results per variable
rows = []

# run a loop over all numerical variables (the target itself is skipped)
for i in num_col:
  if i == 'Response':
    continue
  tstats= stats.ttest_ind(df.loc[df['Response']==1,i],df.loc[df['Response']==0,i])
  rows.append({'Variable Name': i, 'T-statistic': tstats[0], 'P-value': tstats[1]})

# build the result table, most significant variables first
tstats_df=pd.DataFrame(rows).sort_values(by='P-value').reset_index(drop=True)
tstats_df

>Variables and their P-value

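Here our level of significance (alpha) is 0.05. The variables id and Vintage have p-values above 0.05, so they are not significant and we fail to reject the null hypothesis for them. For the remaining variables the p-value is below 0.05, so we reject the null hypothesis.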

#####<h1><b>Feature Engineering<h1><b>

In [None]:
df.dtypes

<h3>convert variable to appropriate datatypes :-

 <H4>changing categorical value to numerical value

In [None]:
# Assume vehicle_Age as an ordinal categorical 

from sklearn.preprocessing import OrdinalEncoder

# Define the ordering of categories
age_ordering = ['< 1 Year', '1-2 Year', '> 2 Years']

# Create an ordinal encoder with the specified ordering
encoder = OrdinalEncoder(categories=[age_ordering])

# Fit the encoder on the 'Vehicle_Age' column of df and replace it with the encoded values
df['Vehicle_Age'] = encoder.fit_transform(df[['Vehicle_Age']])

In [None]:
# OneHotEncoding using pandas (Gender and Vehicle_Damage as a nominal category)

df=pd.get_dummies(df,columns=['Gender','Vehicle_Damage'],drop_first=True)

In [None]:
df.head()

<h1><b>Handling Outlier

In [None]:
# Visualizing the outlier using boxplot

plt.figure(figsize=(10,5))
sns.boxplot(y='Annual_Premium',x='Response',data=df )

In [None]:
# Visualizing the variation of Annual_Premium using rugplot

plt.figure(figsize=(20,4),dpi=200)
sns.rugplot(y='Response',x='Annual_Premium',data=df);


In [None]:
# Number of rows having Annual_Premium > 135000

outliers = df.loc[df['Annual_Premium']>135000]
outliers.shape

In [None]:
# drop rows having Annual_Premium > 135000
df = df[df['Annual_Premium']<=135000]
df.shape

In [None]:
# again Visualizing the outlier

plt.figure(figsize=(10,5))
sns.boxplot(y='Annual_Premium',x='Response',data=df );

In [None]:
plt.figure(figsize=(20,4),dpi=200)
sns.rugplot(y='Response',x='Annual_Premium',data=df)

>Removing outliers from the dataset can have an effect on the model performance. If the outliers are affecting the model negatively, i.e., increasing the error rate or bias, removing them may improve the model performance by reducing the error rate or bias.

<h1><b>Check Correlation and Multicollinearity between features

>When performing feature selection in machine learning, it is recommended to perform correlation analysis first to identify highly correlated features. Once this analysis is complete, a variety of feature selection techniques can be used to identify the most important features.
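>One way to turn the correlation matrix into a list of candidate features to review is to scan its upper triangle for pairs above a cut-off. The sketch below uses made-up data, and the 0.8 threshold is a hypothetical choice for illustration, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy features (made up): "b" is almost a copy of "a", "c" is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once (diagonal excluded)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # hypothetical cut-off, chosen for illustration
high_pairs = [(r, c) for r in upper.index for c in upper.columns
              if pd.notna(upper.loc[r, c]) and upper.loc[r, c] > threshold]
print(high_pairs)
```

>On the notebook's data the same scan would start from `df.corr()`.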

In [None]:
df.head(5)

In [None]:
corr=df.corr().round(2)
plt.figure(figsize=(10,4),dpi=200)
sns.heatmap(corr,annot=True,cmap = 'YlOrBr')
plt.title('Correlation between all the variables', size=16)
plt.show()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Function to calculate Multicollinearity
def calc_vif(X):

  # VIF dataframe
  vif = pd.DataFrame()
  vif["feature"] = X.columns
  
  # calculating VIF for each feature of the dataframe passed in (not the global df)
  values = X.values.astype(float)
  vif["VIF"] = [variance_inflation_factor(values, i) for i in range(X.shape[1])]
  return(vif)

In [None]:
calc_vif(df)

In [None]:
df=df.drop('Driving_License',axis=1)

In [None]:
corr=df.corr().round(2)
plt.figure(figsize=(10,4),dpi=200)
sns.heatmap(corr,annot=True,cmap = 'YlOrBr')
plt.title('Correlation between all the variables', size=16)
plt.show()

In [None]:
calc_vif(df)

<h2>Select the best features for the model

In [None]:
#separating the dependent and independent variables

X=df.drop(columns='Response')
y=df['Response']

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
ordered_rank_features = SelectKBest(score_func=chi2,k=9)
ordered_feature = ordered_rank_features.fit(X,y)

In [None]:
# check score of all feature

ordered_feature.scores_

In [None]:
# make dataframe and store in a variable

datascores = pd.DataFrame(ordered_feature.scores_, columns=['Score'])

In [None]:
datascores

In [None]:
# make dataframe from the column names of X and store in a variable

dfcols = pd.DataFrame(X.columns)

In [None]:
# concatenate both dataframes

features_rank = pd.concat([dfcols, datascores],axis=1)

In [None]:
features_rank

In [None]:
# give column name to feature_rank dataframe

features_rank.columns = ['feature','score']

In [None]:
# fetch top 8 features based on score
 
features_rank.nlargest(8,'score')

In [None]:
selected_columns = features_rank.nlargest(8,'score')['feature'].values

In [None]:
selected_columns

In [None]:
X_new = X[selected_columns]

In [None]:
# final independent feature look

X_new.head()

<h1><b>Handling Imbalanced Data

In [None]:
# Dependent Column Value Counts
print(df.Response.value_counts())
print(" ")

# Dependent Variable Column Visualization
df['Response'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['Not-Interested(%)','Interested(%)'],
                               colors=['skyblue','red'],
                               explode=[0,0]
                              )

#####<b>Do you think the dataset is imbalanced? Explain Why

>The dependent column has a class ratio of 88:12. A model trained on such data is likely to be biased towards predicting the majority class, so the dataset should be balanced before model creation.

In [None]:
# Handling the imbalanced dataset using SMOTE

# importing SMOTE to make our dataset balanced
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# fit predictor and target variable
X_smote, y_smote = smote.fit_resample(X_new,y)

print('Original dataset shape {} \n Resampled dataset shape {}'.format(len(df),len(y_smote)))

#####<b>What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

SMOTE (Synthetic Minority Over-sampling Technique) was used to balance the 88:12 dataset.

 >SMOTE is a machine learning technique for dealing with the problems that arise when working with an imbalanced dataset. In practice, imbalanced datasets are common and many ML algorithms are sensitive to class imbalance, so techniques like SMOTE are used to improve their performance.

 >SMOTE is a data augmentation algorithm that creates synthetic data points from the original data. It can be thought of as a more sophisticated version of oversampling.

 >SMOTE has the advantage of not creating duplicate data points; instead it creates synthetic data points that differ slightly from the original ones.

 >Because of these advantages, I used SMOTE to balance the dataset.
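>The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the imblearn implementation: it uses a single nearest minority neighbour instead of sampling among k neighbours, and the points and the helper name `synthetic_point` are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in 2-D (made up for illustration)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [3.0, 3.1], [3.2, 2.9]])

def synthetic_point(points, idx, rng):
    """Create one synthetic sample between points[idx] and its nearest neighbour."""
    base = points[idx]
    dists = np.linalg.norm(points - base, axis=1)
    dists[idx] = np.inf                      # ignore the point itself
    neighbour = points[np.argmin(dists)]
    lam = rng.random()                       # random position along the segment, in [0, 1)
    return base + lam * (neighbour - base)

new_point = synthetic_point(minority, 0, rng)
print(new_point)
```

>The new point lies on the line segment between an existing minority sample and its neighbour, which is why SMOTE output differs slightly from the original points instead of duplicating them.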

In [None]:
X_new.shape

<h2><b>Data Splitting

In [None]:
# Dividing the dataset into train and test set

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_smote,y_smote,test_size=0.3,random_state=0)

#####<b>What data splitting ratio have you used and why?

>Dividing the data into training and testing sets is a common approach in machine learning to evaluate the performance of a model. The idea is to use the training data to estimate the parameters of the model, and the testing data to evaluate the performance of the model on new, unseen data.

>Here the data is divided in a 70:30 ratio (test_size=0.3): 70% of the rows are used for training and 30% for testing. This split leaves enough data to estimate the parameters of the model while also keeping enough data to evaluate its performance reliably.

>The choice of split ratio depends on the size of the dataset and the complexity of the model. With a large dataset, a smaller training share (e.g., 70/30) still leaves plenty of rows for training, while with a small dataset a larger training share (e.g., 90/10) may be needed.

>In general, the goal is to find the right balance between the variance of the parameter estimates and the variance of the performance statistics, so that neither is too high. Therefore, I chose a 70:30 ratio.
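>As a hedged variation on the split above, the sketch below splits first and keeps the class ratio equal in both parts with `stratify`. The toy arrays are made up. Splitting before any oversampling is one way to keep the test set made of real observations only; it is shown here as an option, not as what this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (made up): 880 negatives, 120 positives
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 880 + [1] * 120)

# 70:30 split; stratify keeps the 88:12 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())

# Any oversampling (e.g. SMOTE) would then be fitted on X_tr, y_tr only,
# leaving X_te, y_te as untouched, real observations.
```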

<h1><b>Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train= scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#####<b>Which method have you used to scale you data and why?

>I used MinMaxScaler because it preserves the shape of the original distribution. Note that MinMaxScaler does not reduce the influence of outliers. By default it scales each feature to the range 0 to 1.
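>What min-max scaling computes can be written out by hand: each value is mapped to (x - min) / (max - min), with min and max learned from the training data only. The sketch below uses made-up numbers to show why test values can land outside the 0 to 1 range.

```python
import numpy as np

# Toy feature values (made up)
train = np.array([10.0, 20.0, 30.0, 50.0])
test = np.array([5.0, 40.0, 60.0])

# "fit": learn min and max from the training data only
lo, hi = train.min(), train.max()

# "transform": apply the same min and max to both sets
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)  # stays within [0, 1]
print(test_scaled)   # values beyond the training range fall outside [0, 1]
```

>This mirrors calling `fit_transform` on the training set and `transform` on the test set, as done in the cell above.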

<h1><b>Model Implementation

The following algorithms are used in the ML implementation:

1. Logistic Regression

2. K-Nearest Neighbors

3. RandomForestClassifier

4. XGBoost classifier

<h2><b> 1.Apply Logistic Regression:

In [None]:
# Model Implementation
clf = LogisticRegression(fit_intercept=True, max_iter=10000)

# Fit the Algorithm
clf.fit(X_train, y_train)

In [None]:
# Checking the coefficients
clf.coef_

In [None]:
# Checking the intercept value
clf.intercept_

In [None]:
# Predict on the model
# Get the predicted probabilities
train_preds = clf.predict_proba(X_train)
test_preds = clf.predict_proba(X_test)

In [None]:
# Get the predicted classes
train_class_preds = clf.predict(X_train)
test_class_preds = clf.predict(X_test)

In [None]:
# Get the accuracy scores
train_accuracy = accuracy_score(y_train, train_class_preds)
test_accuracy = accuracy_score(y_test, test_class_preds)

print("The accuracy on train data is ", train_accuracy)
print("The accuracy on test data is ", test_accuracy)

1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#  confusion matrix for train 
labels = ['Not_Interested', 'Interested']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax) #annot=True to annotate cells, fmt='d' to show integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)


In [None]:
# Get the confusion matrix for test

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='d', ax = ax); #annot=True to annotate cells, fmt='d' to show integer counts

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
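# report metrics for train data
# classification_report expects (y_true, y_pred); swapping them interchanges precision and recall
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

# ROC AUC computed here from hard class predictions rather than predicted probabilities
print("roc_auc_score_train")
print(metrics.roc_auc_score(y_train, train_class_preds))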

In [None]:
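# report metrics for test data
# classification_report expects (y_true, y_pred); swapping them interchanges precision and recall
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

# ROC AUC computed here from hard class predictions rather than predicted probabilities
print("roc_auc_score_test")
print(metrics.roc_auc_score(y_test, test_class_preds))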

Based on the results of the Logistic Regression algorithm:

>The classifier has high precision (93%) for the "Interested" class: most of the customers it predicts as interested actually are. Its recall is lower (72%), which means it misses a substantial share of the customers who are actually interested.

>Overall accuracy is 78-79%, with an average precision of 79%, recall of 82%, and F1-score of 78%. The model performs moderately well, but not excellently. The high precision indicates that relatively few of the predicted positives are false positives.

>The ROC AUC score is 78%, which indicates that the model's ability to distinguish between positive and negative classes is moderate.

>The classifier can be improved with hyperparameter tuning.
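
A note on the ROC AUC figures: the cells above pass hard class predictions to `roc_auc_score`. ROC AUC is normally computed from continuous scores, for example the positive-class column of `predict_proba`, and with 0/1 predictions it reduces to the average of the two per-class recalls. The sketch below illustrates the difference on a made-up six-row example (the labels and scores are hypothetical, not taken from this dataset), using the pairwise-ranking definition of AUC in plain Python.

```python
def pairwise_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]                    # hypothetical labels
scores = [0.10, 0.40, 0.35, 0.80, 0.60, 0.70]  # hypothetical predicted probabilities
hard_preds = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_preds)
print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```

In the notebook this would correspond to something like `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`, where `model` stands for whichever fitted classifier is being evaluated.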

2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearchCV)
logistic = LogisticRegression(max_iter=100)
solvers = ['lbfgs']
# 'lbfgs' supports only the 'l2' penalty (or no penalty)
penalty = ['l2']
c_values = [1000, 100, 10, 1.0, 0.1, 0.01, 0.001]

# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)

# cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(logistic, param_grid=grid, n_jobs=-1, cv=5, scoring='f1',error_score=0)

# Fit the Algorithm
grid_result=grid_search.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))


# Predict on the model
# Get the predicted classes
train_class_preds = grid_result.predict(X_train)
test_class_preds = grid_result.predict(X_test)

In [None]:
# result dataframe for train data
lr_train_roc=roc_auc_score(y_train, train_class_preds)
lr_train_acc = accuracy_score(y_train, train_class_preds)
lr_train_prec = precision_score(y_train, train_class_preds)
lr_train_rec = recall_score(y_train, train_class_preds)
lr_train_f1 = f1_score(y_train, train_class_preds)

results = pd.DataFrame([['Logistic Regression', lr_train_acc,lr_train_prec,lr_train_rec, lr_train_f1,lr_train_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
# result dataframe for test data
lr_test_roc=roc_auc_score(y_test, test_class_preds)
lr_test_acc = accuracy_score(y_test, test_class_preds)
lr_test_prec = precision_score(y_test, test_class_preds)
lr_test_rec = recall_score(y_test, test_class_preds)
lr_test_f1 = f1_score(y_test, test_class_preds)

results = pd.DataFrame([['Logistic Regression', lr_test_acc,lr_test_prec,lr_test_rec, lr_test_f1,lr_test_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
# hypertuned report metrics for train data
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score_train")
print(metrics.roc_auc_score(y_train, train_class_preds))

In [None]:
# hypertuned report metrics for test data
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score_test")
print(metrics.roc_auc_score(y_test, test_class_preds))

>GridSearchCV is a popular method for hyperparameter tuning that combines grid search with cross-validation. Grid search tries every combination of the specified hyperparameter values, while cross-validation scores each combination on several train/validation splits of the training data. It can be computationally expensive, because the number of combinations grows quickly with the number of hyperparameters and values.

>The precision and recall values for both "Non-Interested" (64% precision and 91% recall) and "Interested" (94% precision and 72% recall) classes are reasonably good, suggesting that the model is not making too many false positive or false negative predictions.

>The GridSearchCV method has helped in optimizing the hyperparameters of the logistic regression algorithm, leading to improved performance on the dataset compared to the default parameters.

<h2><b>2. K-Nearest Neighbors (KNN):


In [None]:
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

#Setup arrays to store training and test accuracies
neighbors = np.arange(1,15)
train_accuracy =np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i,k in enumerate(neighbors):
    # Setup a knn classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # Fit the model
    knn.fit(X_train, y_train)
    
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    
    # Compute accuracy on the test set
    test_accuracy[i] = knn.score(X_test, y_test)

In [None]:
# Generate plot

plt.title('k-NN Varying number of neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

In [None]:
# take k=4

knn = KNeighborsClassifier(n_neighbors=4)

# Fit the model
knn.fit(X_train,y_train)

In [None]:
# Predict on the model
# Making predictions on train and test data
train_class_preds = knn.predict(X_train)
test_class_preds = knn.predict(X_test)

1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Get the confusion matrix for train 

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Get the confusion matrix test

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
kn_train_roc=roc_auc_score(y_train, train_class_preds)
kn_train_acc = accuracy_score(y_train, train_class_preds)
kn_train_prec = precision_score(y_train, train_class_preds)
kn_train_rec = recall_score(y_train, train_class_preds)
kn_train_f1 = f1_score(y_train, train_class_preds)

results = pd.DataFrame([['K-Nearest Neighbors', kn_train_acc,kn_train_prec,kn_train_rec, kn_train_f1,kn_train_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
kn_test_roc=roc_auc_score(y_test, test_class_preds)
kn_test_acc = accuracy_score(y_test, test_class_preds)
kn_test_prec = precision_score(y_test, test_class_preds)
kn_test_rec = recall_score(y_test, test_class_preds)
kn_test_f1 = f1_score(y_test, test_class_preds)

results = pd.DataFrame([['K-Nearest Neighbors', kn_test_acc,kn_test_prec,kn_test_rec, kn_test_f1,kn_test_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
# report metrics for train data
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score_train")
print(metrics.roc_auc_score(y_train, train_class_preds))

In [None]:
# report metrics for test data
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score_test")
print(metrics.roc_auc_score(y_test, test_class_preds))

#####<b>Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Based on the results of the KNN algorithm:

 >On the training data, the KNN classifier has high precision and recall for both classes: 91% precision and 86% recall for "Non-Interested", and 85% precision and 91% recall for "Interested". The model therefore makes relatively few false positive and false negative predictions.

 >The overall training accuracy, average precision, recall, and F1-score are all 88%, as is the ROC AUC score.

 >The testing results are lower than the training results but still good: 81% accuracy and 81% average precision, recall, and F1-score.

<h2><b>3. Random Forest Classifier:


In [None]:
# Create an instance of the RandomForestClassifier
rf_model = RandomForestClassifier()

# Fit the Algorithm
rf_model.fit(X_train,y_train)

# Predict on the model
# Making predictions on train and test data
train_class_preds = rf_model.predict(X_train)
test_class_preds = rf_model.predict(X_test)

In [None]:
# Calculating accuracy on train and test
train_accuracy = accuracy_score(y_train,train_class_preds)
test_accuracy = accuracy_score(y_test,test_class_preds)

print("The accuracy on train dataset is", train_accuracy)
print("The accuracy on test dataset is", test_accuracy)

1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#confusion matrix for train

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
#confusion matrix for test

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# report metrics for train data
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score_train")
print(metrics.roc_auc_score(y_train, train_class_preds))

In [None]:
# report metrics for test data
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score_test")
print(metrics.roc_auc_score(y_test, test_class_preds))

Based on the results of the RandomForest algorithm:- 

>The RandomForest algorithm has high precision (98%) and recall (98%) for non-interested customers in the training dataset. However, the precision and recall for non-interested customers are slightly lower in the testing dataset at 82% and 87%, respectively.

>The F1-score for both non-interested and interested customers is high (98%) in the training dataset, but lower in the testing dataset at 85%. This indicates that the model is overfitting to the training dataset and not performing as well on new data.

>The accuracy is high (98%) in the training dataset, but lower in the testing dataset at 85%. This is consistent with the F1-score results and suggests that the model is overfitting to the training dataset.

>The average precision, recall, and F1-score are all high (98%) in the training dataset, but lower in the testing dataset at 85%. This is again consistent with the overfitting observed in the F1-score and accuracy results.

>The ROC AUC score is also high (98%) in the training dataset, but lower in the testing dataset at 85%. This suggests that the model's ability to distinguish between positive and negative classes is still good, but not as good as the performance on the training dataset.

>Hyperparameter tuning techniques can be used to attempt to improve the model's performance on the testing dataset and reduce the overfitting to the training dataset.

In conclusion, the RandomForest algorithm shows a high level of performance on the training dataset but lower performance on the testing dataset, suggesting overfitting. Hyperparameter tuning techniques can be used to improve the model's generalization to new data.

2. Cross- Validation & Hyperparameter Tuning

In [None]:
# n_estimators-----> Number of trees
# max_depth--------> Maximum depth of trees
# min_samples_split------> Minimum number of samples required to split a node 
# min_samples_leaf-------> Minimum number of samples required at each leaf node

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearchCV)
# random forest model
randomForest = RandomForestClassifier(random_state=0)
parameters = {'n_estimators':[50,80,100],'max_depth':[4,6,8],
             'min_samples_split':[50,100,150],
             'min_samples_leaf':[40,50]
             }
# Fit the Algorithm
rf_grid= GridSearchCV(randomForest, parameters, scoring='f1', cv=3)
rf_grid.fit(X_train,y_train)

In [None]:
# model best parameters
print(f'The best fit is found to be {rf_grid.best_params_}')

In [None]:
# Predict on the model
# Making predictions on train and test data
train_class_preds = rf_grid.predict(X_train)
test_class_preds = rf_grid.predict(X_test)

In [None]:
# Visualizing evaluation Metric Score chart
# Get the confusion matrix for train 

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Get the confusion matrix test

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# result dataframe for train data
rf_train_roc=roc_auc_score(y_train, train_class_preds)
rf_train_acc = accuracy_score(y_train, train_class_preds)
rf_train_prec = precision_score(y_train, train_class_preds)
rf_train_rec = recall_score(y_train, train_class_preds)
rf_train_f1 = f1_score(y_train, train_class_preds)

results = pd.DataFrame([['Random Forest', rf_train_acc,rf_train_prec,rf_train_rec, rf_train_f1,rf_train_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
# result dataframe for test data
rf_test_roc=roc_auc_score(y_test, test_class_preds)
rf_test_acc = accuracy_score(y_test, test_class_preds)
rf_test_prec = precision_score(y_test, test_class_preds)
rf_test_rec = recall_score(y_test, test_class_preds)
rf_test_f1 = f1_score(y_test, test_class_preds)

results = pd.DataFrame([['Random Forest', rf_test_acc,rf_test_prec,rf_test_rec, rf_test_f1,rf_test_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
# hypertuned report metrics for train data
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score_train")
print(metrics.roc_auc_score(y_train, train_class_preds))

In [None]:
# hypertuned report metrics for test data
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score_test")
print(metrics.roc_auc_score(y_test, test_class_preds))

#####<b>Which hyperparameter optimization technique have you used and why?

>I used GridSearchCV, which applies the grid search technique to find the hyperparameter values that give the best model performance.

>The goal is to find the hyperparameter values that give the best predictions from the model. One option is manual search, a trial-and-error process that can take a very long time even for a single model.

>For this reason, methods such as Random Search and Grid Search were introduced. Grid Search evaluates every combination of the specified hyperparameter values and selects the combination with the best score. This makes it time-consuming and computationally expensive as the number of hyperparameters grows.

>GridSearchCV additionally performs cross-validation: each combination is scored on held-out folds of the training data rather than on the data it was fitted on.

>That is why I used the GridSearchCV method for hyperparameter optimization.
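
To make the cost concrete, the grid used in the Random Forest cell above can be enumerated with the standard library. This is only a counting sketch of what an exhaustive grid search has to fit, not a reimplementation of GridSearchCV; the fold count of 3 mirrors `cv=3` from that cell.

```python
from itertools import product

# Same grid as the Random Forest search above
parameters = {
    'n_estimators': [50, 80, 100],
    'max_depth': [4, 6, 8],
    'min_samples_split': [50, 100, 150],
    'min_samples_leaf': [40, 50],
}
cv_folds = 3

# Every combination of one value per hyperparameter
combinations = [dict(zip(parameters, values)) for values in product(*parameters.values())]

n_combinations = len(combinations)
n_fits = n_combinations * cv_folds  # one fit per combination per fold

print(n_combinations, n_fits)
print(combinations[0])
```

With the default `refit=True`, GridSearchCV performs one further fit of the best combination on the full training set.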



#####<b>Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

After hyperparameter tuning of the Random Forest algorithm:

 >For the training dataset, Non-Interested customers have a precision of 71%, recall of 90% and f1-score of 79%. For Interested customers, precision is 92%, recall is 76% and f1-score is 83%. The accuracy is 82% and average precision, recall & f1-score are 82%, 83% and 81% respectively, with a ROC AUC score of 82%.

 >For the testing dataset, Non-Interested customers have a precision of 71%, recall of 90% and f1-score of 80%. For Interested customers, precision is 92%, recall is 76% and f1-score is 83%. The accuracy is 81% and average precision, recall & f1-score are 84%, 81% and 81% respectively, with a ROC AUC score of 81%.

 >Training and testing scores are now nearly identical, so the tuned model no longer shows the overfitting seen with the default parameters (98% training versus 85% testing accuracy), at the cost of a lower testing accuracy (81% versus 85%).

<h2><b>4. XGBoost Classifier:

In [None]:
# ML Model - 4 Implementation
xg_model = XGBClassifier()

# Fit the Algorithm
xg_models=xg_model.fit(X_train,y_train)

# Predict on the model
# Making predictions on train and test data
train_class_preds = xg_models.predict(X_train)
test_class_preds = xg_models.predict(X_test)

1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Get the confusion matrix for train 

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix: XgBoost Classifier')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Get the confusion matrix test

labels = ['Non_Interested', 'Interested']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix:  XgBoost Classifier')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score_train")
print(metrics.roc_auc_score(y_train, train_class_preds))

In [None]:
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score_test")
print(metrics.roc_auc_score(y_test, test_class_preds))

Based on the results of the Xgboost algorithm:-

 >For training dataset, Non-Interested customer has a precision of 74%, recall of 89% and f1-score of 81%. For Interested customer, precision is 90%, recall is 78% and f1-score is 84%.

 >The accuracy is 82% and average precision, recall & f1-score are 82%, 83% and 82% respectively with a ROC AUC score of 82%.

 >For testing dataset, Non-interested customer has a precision of 75%, recall of 89% and f1-score of 81%. For Interested customer, precision is 90%, recall is 78% and f1-score is 84%.

 >The accuracy is 88% and average precision, recall & f1-score are 84%, 83% and 83% respectively with a ROC AUC score of 82%.

#####<b>2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 4 Implementation with hyperparameter optimization techniques (RandomizedSearchCV)

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Set up the XGBoost classifier
xgb = XGBClassifier(random_state=0)

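# Define the hyperparameter search space
# min_samples_split / min_samples_leaf are scikit-learn tree parameters that XGBoost
# does not use; min_child_weight and subsample are XGBoost's own regularization controls.
# The values chosen for these two are illustrative assumptions.
parameters = {'n_estimators': [50, 80, 100],
              'max_depth': [4, 6, 8],
              'min_child_weight': [1, 5, 10],
              'subsample': [0.8, 1.0]}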

# Use RandomizedSearchCV to find the best hyperparameters
random_search = RandomizedSearchCV(xgb, parameters, scoring='roc_auc', cv=5)

# Fit the model on the training data
random_search.fit(X_train, y_train)

In [None]:
# model best parameters
print(f'The best fit is found to be {random_search.best_params_}')
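
Unlike grid search, a randomized search fits only a sample of the candidate settings; scikit-learn's `RandomizedSearchCV` draws `n_iter=10` candidates by default. The sketch below mimics that sampling with the standard library on a grid of the same shape as the one above (3 × 3 × 3 × 2 values). The parameter names `p1` to `p4`, their values, and the seed are placeholders, and this is not how scikit-learn implements the sampling internally.

```python
import random
from itertools import product

# Placeholder grid with the same shape as the search space above (3 x 3 x 3 x 2)
grid = {
    'p1': [50, 80, 100],
    'p2': [4, 6, 8],
    'p3': ['a', 'b', 'c'],
    'p4': ['x', 'y'],
}
n_iter = 10   # RandomizedSearchCV's default number of sampled candidates
cv_folds = 5  # cv=5, as in the cell above

all_candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

rng = random.Random(0)                        # fixed seed so the sample is repeatable
sampled = rng.sample(all_candidates, n_iter)  # sampling without replacement

grid_fits = len(all_candidates) * cv_folds  # cost of an exhaustive grid search
random_fits = len(sampled) * cv_folds       # cost of the randomized search
print(grid_fits, random_fits)
```

The randomized search trades completeness for speed: it may miss the best combination in the grid, but it needs far fewer model fits.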

In [None]:
# Predict on the model
# Making predictions on train and test data
train_class_preds = random_search.predict(X_train)
test_class_preds = random_search.predict(X_test)


In [None]:
# result dataframe for train data
Xgb_train_roc=roc_auc_score(y_train, train_class_preds)
Xgb_train_acc = accuracy_score(y_train, train_class_preds)
Xgb_train_prec = precision_score(y_train, train_class_preds)
Xgb_train_rec = recall_score(y_train, train_class_preds)
Xgb_train_f1 = f1_score(y_train, train_class_preds)

results = pd.DataFrame([['XGBoost Classifier', Xgb_train_acc,Xgb_train_prec,Xgb_train_rec, Xgb_train_f1,Xgb_train_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
# result dataframe for test data
Xgb_test_roc=roc_auc_score(y_test, test_class_preds)
Xgb_test_acc = accuracy_score(y_test, test_class_preds)
Xgb_test_prec = precision_score(y_test, test_class_preds)
Xgb_test_rec = recall_score(y_test, test_class_preds)
Xgb_test_f1 = f1_score(y_test, test_class_preds)

results = pd.DataFrame([['XGBoost Classifier', Xgb_test_acc,Xgb_test_prec,Xgb_test_rec, Xgb_test_f1,Xgb_test_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

In [None]:
# hypertuned report metrics for train data
print(metrics.classification_report(y_train, train_class_preds))
print(" ")

print("roc_auc_score_train")
print(metrics.roc_auc_score(y_train, train_class_preds))

In [None]:
# hypertuned report metrics for test data
print(metrics.classification_report(y_test, test_class_preds))
print(" ")

print("roc_auc_score_test")
print(metrics.roc_auc_score(y_test, test_class_preds))

#####<b>Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

After hyperparameter tuning, the results are:

 >For training dataset, Non-Interested customer has a precision of 78%, recall of 90% and f1-score of 83%. For Interested customer, precision is 91%, recall is 80% and f1-score is 84%.

 >The accuracy is 84% and average precision, recall & f1-score are 84%, 85% and 84% respectively with a ROC AUC score of 84%.

 >For testing dataset, Non-interested customer has a precision of 78%, recall of 89% and f1-score of 83%. For Interested customer, precision is 91%, recall is 80% and f1-score is 85%.

 >The accuracy is 84% and average precision, recall & f1-score are 84%, 85% and 84% respectively with a ROC AUC score of 84%.

#####<b>1. Which Evaluation metrics did you consider for a positive business impact and why?

>When both false negatives and false positives need to be minimized, the F1-score is the metric to consider, because it balances precision and recall. For cross-selling, recall is usually given more importance, since a false negative is an interested customer who is never contacted, but precision should not be neglected. The goal is a high recall together with at least a moderate F1-score.
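
As a worked example of how F1 combines the two quantities, the snippet below recomputes the F1-score from the rounded test-set precision (91%) and recall (80%) reported above for the Interested class of the tuned XGBoost model. The F-beta variant with `beta=2` is an added illustration of weighting recall more heavily; it was not used in this project.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall (beta > 1 favours recall)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.91, 0.80  # rounded values from the tuned XGBoost test report

f1 = f_beta(precision, recall)          # beta=1 gives the usual F1-score
f2 = f_beta(precision, recall, beta=2)  # recall counts more than precision

print(round(f1, 3), round(f2, 3))
```

The F1 result of about 0.85 agrees with the 85% F1-score reported for that class.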

#####<b>2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
from prettytable import PrettyTable

# Summarizing the results obtained
test = PrettyTable(['Sl. No.','Model','Train_Accuracy','Test_Accuracy', 'Train_Precision','Test_Precision','Train_Recall','Test_Recall','Train_F1_score','Test_F1_score'])
test.add_row(['1','Logistic Regression',lr_train_acc,lr_test_acc,lr_train_prec,lr_test_prec,lr_train_rec,lr_test_rec,lr_train_f1,lr_test_f1])
test.add_row(['2','K-Nearest Neighbors',kn_train_acc,kn_test_acc,kn_train_prec,kn_test_prec,kn_train_rec,kn_test_rec,kn_train_f1,kn_test_f1])
test.add_row(['3','Random Forest',rf_train_acc,rf_test_acc,rf_train_prec,rf_test_prec,rf_train_rec,rf_test_rec,rf_train_f1,rf_test_f1])
test.add_row(['4','XGBoost Classifier',Xgb_train_acc,Xgb_test_acc,Xgb_train_prec,Xgb_test_prec,Xgb_train_rec,Xgb_test_rec,Xgb_train_f1,Xgb_test_f1])

print(test)

In [None]:
# Plotting Recall scores

ML_models = ['Logistic Regression','K Nearest Neighbors','Random Forests','XG Boost']
train_recalls = [lr_train_rec,kn_train_rec,rf_train_rec,Xgb_train_rec]
test_recalls = [lr_test_rec,kn_test_rec,rf_test_rec,Xgb_test_rec]
  
X_axis = np.arange(len(ML_models))

plt.figure(figsize=(10,5))
plt.barh(X_axis - 0.2, train_recalls, 0.4, label = 'Train Recall')
plt.barh(X_axis + 0.2, test_recalls, 0.4, label = 'Test Recall')
  
plt.yticks(X_axis,ML_models)
plt.xlabel("Recall score")
plt.title("Recall score for each model")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',title='Legend')
plt.show()

#####<b>3. Explain the model which you have used and the feature importance using any model explainability tool?

>We will use Shapley values to explain the black-box model (Random Forest).

>Shapley values show how much each feature contributes to the model's prediction, which makes the model more explainable.
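
To illustrate what a Shapley value is, the sketch below computes exact Shapley values for a toy three-feature function by averaging each feature's marginal contribution over all feature orderings. The function `toy_model`, the instance, and the all-zero baseline are hypothetical stand-ins. For the actual Random Forest one would normally rely on a dedicated library such as `shap` rather than this brute-force enumeration, which is only feasible for a handful of features.

```python
from itertools import permutations

def toy_model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2 * x[0] + x[1] * x[2]

def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every ordering of the features."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start from the baseline input
        previous_output = model(current)
        for feature in order:
            current[feature] = instance[feature]  # reveal one feature at a time
            output = model(current)
            contributions[feature] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]

instance = [1, 2, 3]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The values sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that lets each value be read as a feature's share of the prediction.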

<h1><b>Conclusion

<b>Here are the key points from the conclusion of the Health Insurance Cross Sell Prediction project:

 * The goal of the project was to identify existing Health Insurance customers who are likely to be interested in purchasing Vehicle Insurance.

 * The tuned XGBoost classifier, a gradient boosting algorithm, provided the best overall performance in terms of accuracy, precision, recall, and F1 score.
 
 * The model achieved an accuracy of 84% and an average precision, recall, and F1 score of 84%, 85%, and 84%, respectively.

 * Other algorithms such as Random Forest, KNN, and Logistic Regression also performed well, with accuracy scores ranging from 80% to 82%, but did not outperform the tuned XGBoost model.

 * The findings suggest that gradient boosting is an effective machine learning approach for predicting customer interest in vehicle insurance.

 * The model could be used to inform targeted marketing campaigns for the insurance company.

 * Exploratory Data Analysis revealed that more males were interested in Vehicle Insurance.
 
 * Feature engineering was used to transform categorical variables into numerical variables.
 
 * The dataset consisted of 381,109 observations and 12 features.
 
 * Evaluation metrics used for the models included precision, recall, f1-score, accuracy, average precision, and ROC AUC score.
 
 * Hyperparameter tuning was used to improve the performance of the models.

<H2><b>*Hurrah! You have successfully completed your Machine Learning Capstone Project !!!*