# **Project Name**    - Health Insurance Cross Sell



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Student Name-**   - Rutik Retwade


# **Project Summary -**

The Health Insurance Cross Sell Prediction dataset contains the details of 381,109 customers, with 10 predictor variables and one target variable that records whether a customer is interested in purchasing insurance. Our initial steps involved data collection, data cleaning to check for null values, distribution analysis, and consideration of outliers. After this, we performed typecasting to ensure that the data was in the proper format for visualization.

Our analysis continued with an in-depth exploratory data analysis (EDA) phase, during which we crafted univariate, bivariate, and multivariate plots to unearth valuable insights. These insights informed our decisions regarding the subsequent steps in the machine learning (ML) model pipeline.

To prepare the data for modeling, we engaged in feature engineering and tackled multicollinearity among the independent variables by employing the Variance Inflation Factor (VIF). Notably, we chose not to address outliers, as their removal could lead to a loss of significant information and potentially introduce bias into the results.
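
As a rough illustration of the VIF check described above, the sketch below computes the VIF of each column as 1 / (1 - R²), where R² comes from regressing that column on the remaining ones. The `compute_vif` helper and the toy columns `a`, `b`, `c` are hypothetical and are not the notebook's own code; a library routine such as statsmodels' `variance_inflation_factor` would normally be used instead.

```python
import numpy as np
import pandas as pd


def compute_vif(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    values = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        # Add an intercept column so R^2 is measured around the mean.
        design = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[name] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


# Toy data (hypothetical, not the project dataset): "b" is nearly a copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})
vif = compute_vif(toy)
print(vif.round(2))
```

In this toy example the near-duplicate columns `a` and `b` receive large VIF values, while the independent column `c` stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.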

We also recognized that certain features were categorical in nature and needed to be encoded into numerical values for machine learning algorithms. We accomplished this using Binary Label Encoding.

The dataset presented a challenge of high class imbalance in the target variable, Response. To mitigate this, we applied the Synthetic Minority Oversampling Technique (SMOTE) to create a balanced dataset.
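
The core idea behind SMOTE can be sketched in a few lines: each synthetic row is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. The `smote_like_oversample` function below is a simplified, hypothetical illustration on made-up data, not the implementation used in this project; in practice a library version such as `SMOTE` from imbalanced-learn would be the usual choice.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    # (O(n^2) memory, so this is only suitable for small arrays).
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k nearest neighbours for each new row.
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])


# Toy minority class (hypothetical): 20 samples with 2 features.
rng = np.random.default_rng(42)
minority = rng.normal(size=(20, 2))
synthetic = smote_like_oversample(minority, n_new=80)
print(synthetic.shape)
```

Because every synthetic row is a convex combination of two real minority rows, it always lies inside the range spanned by the original minority samples. Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class balance.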

With the data prepared, we split it into train and test sets, stratified so that both classes are represented in each. We then implemented a range of machine learning models, starting with the simple yet effective Logistic Regression, followed by Decision Trees, Random Forests, Naive Bayes, and XGBoost. We evaluated model performance using several classification metrics, including Precision, Recall, F1 Score, Accuracy, and AUC-ROC. We also examined the confusion matrix to determine the number of correctly and incorrectly classified customers.
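
To show how the metrics named above relate to the confusion matrix, here is a minimal sketch that derives them from the four confusion-matrix counts. The `classification_metrics` helper and the label vectors are hypothetical examples; the notebook itself imports the equivalent functions from `sklearn.metrics`.

```python
import numpy as np


def classification_metrics(y_true, y_pred):
    """Compute the confusion-matrix counts and the metrics derived from them."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())  # interested, predicted interested
    tn = int(((y_true == 0) & (y_pred == 0)).sum())  # not interested, predicted not
    fp = int(((y_true == 0) & (y_pred == 1)).sum())  # not interested, predicted interested
    fn = int(((y_true == 1) & (y_pred == 0)).sum())  # interested, predicted not
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 1 = interested, 0 = not interested.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With an imbalanced target, accuracy alone can look good even when few interested customers are found, which is why precision, recall, and F1 are reported alongside it.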

# **GitHub Link -**

# **Problem Statement**


**Our client, an insurance company, seeks your expertise to develop a predictive model that can anticipate whether policyholders from the previous year would express interest in the company's Vehicle Insurance offerings. Insurance policies involve a financial arrangement where the company commits to provide compensation for specified losses, damages, illnesses, or death in exchange for regular premium payments from the customers. Premiums are the recurring payments made by customers to secure these guarantees.**

**For instance, a customer may pay an annual premium of Rs. 5000 for a health insurance cover of Rs. 200,000. In the event of illness and hospitalization, the insurance provider covers the hospitalization costs up to Rs. 200,000. This may seem financially challenging for the company, but it relies on the principle of probabilities. While 100 customers may pay the same premium, only a few (e.g., 2-3) may require hospitalization that year, spreading the risk among all policyholders.**

**Similarly, in the context of vehicle insurance, customers pay an annual premium, and in the unfortunate event of an accident, the insurance company provides compensation, known as the 'sum assured,' to the customer.**

**Creating a predictive model to determine a customer's interest in Vehicle Insurance is invaluable for the company. It enables the company to tailor its communication strategy, reach out to potential customers effectively, and optimize its business model and revenue generation.**

**To make this prediction, you have access to customer information, including demographics (gender, age, region code type), vehicle details (vehicle age, damage status), and policy-related data (premium amount, sourcing channel), among other factors.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that each algorithm follows the format below.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation and hyperparameter tuning.

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business, and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Basic
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

# ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Evaluation Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import log_loss

# Hyper Parameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV


# Miscellaneous
import time
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mounting of drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path =('/content/drive/MyDrive/Colab Notebooks/Almabetter/ML Classification project/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv')
dataset=pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Number of (rows, columns) are',dataset.shape)

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

**The check above shows that the dataset contains no duplicate rows.**

#### Missing Values/Null Values

In [None]:
# Check the null values in each column
def find_null_values(data):
    print('Following shows the number of Missing values present in the dataset:')
    print(data.isnull().sum())

find_null_values(dataset)


In [None]:
# Visualizing the missing values

def visualize_null_values(data):
    print('Missing values through figure')
    plt.figure(figsize=(14, 5))
    sns.heatmap(data.isnull(), cbar=True, yticklabels=False)
    plt.xlabel("Column Name", size=14, weight="bold")
    plt.title("Missing Values in Columns", fontweight="bold", size=17)
    plt.show()

visualize_null_values(dataset)


### What did you know about your dataset?

1. The dataset 'TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION' consists of **381,109 rows and 12 columns**.
2. There are **no null or missing values**, so no imputation is needed.
3. **No duplicate rows** are present, which supports data integrity.
4. The dataset contains **4 numerical features and 5 categorical features**, and the target variable, Response, is numeric.
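
The checks behind these observations can be gathered into one helper, which makes them easy to rerun after each cleaning step. This is a minimal sketch: `summarize_dataset` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect shape, null count, duplicate count, and column types in one place."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "null_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Toy frame (made-up values); the last row repeats the first one.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Age": [44, 76, 47, 44],
    "Response": [1, 0, 1, 1],
})
summary = summarize_dataset(toy)
print(summary)
```

On the project data the same call would be `summarize_dataset(dataset)`.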

## ***2. Understanding Your Variables***

In [None]:
# Dataset
dataset.columns

In [None]:
# Dataset Describe
dataset.describe(include='all')

### Variables Description

1. **id**: A unique identifier for each customer.

2. **Gender**: The gender of the customer.

3. **Age**: The age of the customer.

4. **Driving_License**: A binary indicator where 0 represents a customer who does not have a driving license, and 1 represents a customer who already has a driving license.

5. **Region_Code**: A unique code that identifies the region of the customer.

6. **Previously_Insured**: A binary indicator where 1 represents a customer who already has vehicle insurance, and 0 represents a customer who doesn't have vehicle insurance.

7. **Vehicle_Age**: The age of the customer's vehicle.

8. **Vehicle_Damage**: A binary indicator where 1 represents a customer whose vehicle has been damaged in the past, and 0 represents a customer whose vehicle hasn't been damaged.

9. **Annual_Premium**: The amount the customer needs to pay as a premium for the insurance policy in the year.

10. **Policy_Sales_Channel**: An anonymized code that represents the channel through which the company reaches out to the customer. This could include different agents, mail, phone, in-person contact, etc.

11. **Vintage**: The number of days the customer has been associated with the insurance company.

12. **Response**: The target variable indicating customer interest, where 1 represents a customer who is interested, and 0 represents a customer who is not interested in the insurance product.

These columns contain essential information for analyzing and predicting customer responses to the health insurance cross-sell.
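
The project summary mentions binary label encoding for the categorical columns. A minimal sketch of that step is shown below, assuming that `Gender` takes the values 'Male' and 'Female' (as the plots in this notebook suggest) and that `Vehicle_Damage` is stored as 'Yes' and 'No' in the raw file. The mapping dictionary and the toy rows are illustrative assumptions.

```python
import pandas as pd

# Toy rows (made-up values) standing in for the raw columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# Explicit mappings keep the encoding readable and reproducible.
binary_maps = {
    "Gender": {"Female": 0, "Male": 1},
    "Vehicle_Damage": {"No": 0, "Yes": 1},
}

encoded = toy.copy()
for column, mapping in binary_maps.items():
    # Fail loudly on unexpected categories instead of silently producing NaN.
    unknown = set(encoded[column].unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped values in {column}: {unknown}")
    encoded[column] = encoded[column].map(mapping)

print(encoded)
```

An explicit dictionary is used here rather than an automatic encoder so that the meaning of 0 and 1 is visible in the code.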

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print('Check Number of unique value in each column')
print(dataset.nunique())
print('--'*50)

In [None]:
# Check Unique Values for each variable.
for i in dataset.columns.tolist():
  print(f"The unique values of variable '{i}' are:", dataset[i].unique())
  print()
  print('--'*50)
  print()

## 3. ***Data Wrangling***

Data wrangling, indeed, plays a crucial role in the data analysis process. It involves a series of steps aimed at cleaning, transforming, and organizing raw data into a format that is suitable for analysis. This process is essential for several reasons:

1. **Data Quality**: Data often contains errors, missing values, inconsistencies, and outliers. Data wrangling helps identify and rectify these issues, ensuring that the data is accurate and reliable for analysis.

2. **Data Integration**: In many cases, data is collected from various sources and in different formats. Data wrangling involves merging or joining these disparate datasets to create a unified and comprehensive dataset.

3. **Data Transformation**: Raw data may not be in a suitable format for analysis. Data wrangling can involve converting data types, creating new variables, and aggregating or summarizing information to make it more amenable to analysis.

4. **Handling Missing Data**: Dealing with missing data is a common challenge. Data wrangling helps decide how to handle missing values, whether through imputation or exclusion, to ensure that they don't hinder analysis.

5. **Data Scaling**: Scaling or normalizing data is often necessary to bring variables onto a common scale. This is crucial for certain machine learning algorithms that are sensitive to the magnitude of variables.

6. **Feature Engineering**: Data wrangling can involve creating new features from existing data, which can enhance the predictive power of machine learning models.

7. **Data Reduction**: In some cases, data may be too large or have too many irrelevant variables. Data wrangling can involve reducing the dimensionality of data while retaining the most critical information.

8. **Data Aggregation**: Aggregating data over time periods or geographical regions can be useful for trend analysis and summarization.

9. **Data Exploration**: During data wrangling, exploratory data analysis is often performed to gain insights into the data, identify patterns, and generate hypotheses for further analysis.

In summary, data wrangling is a critical step in the data analysis process. It ensures that data is accurate, consistent, and in the right format for analysis, enabling more meaningful and reliable insights to be derived from the data.

### Data Wrangling Code

In [None]:
# Check the data type of each column
dataset.info()

In [None]:
dataset.head()

In [None]:
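# Drop the id column from the dataset; it is only an identifier.
# errors='ignore' lets the cell be re-run without failing once the column is gone.
dataset.drop(columns=['id'], inplace=True, errors='ignore')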

In [None]:
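# Group the customers' ages into categories
def convert_numerical_to_categorical(df):
    # Categorize the Age feature: 20 to 45 -> YoungAge, over 45 up to 65 -> MiddleAge, everything else -> OldAge.
    # Note: the final else branch also catches any age below 20.
    df['Age_Group'] = df['Age'].apply(lambda x: 'YoungAge' if 20 <= x <= 45 else 'MiddleAge' if 45 < x <= 65 else 'OldAge')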

convert_numerical_to_categorical(dataset)

In [None]:
dataset.columns

In [None]:
Continuous_variable = ['Age', 'Region_Code', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']

### What all manipulations have you done and insights you found?

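We dropped the `id` column, since it is only an identifier, and derived a categorical `Age_Group` feature (YoungAge, MiddleAge, OldAge) from `Age`. We also listed the columns treated as continuous (`Age`, `Region_Code`, `Annual_Premium`, `Policy_Sales_Channel`, `Vintage`) for use in the distribution plots.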
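
The age grouping can also be expressed with `pd.cut`, which states the bin edges explicitly. The sketch below assumes integer ages and reuses the same three labels; the sample ages are made up. Unlike the lambda-based version, an age below 20 falls outside every bin and becomes `NaN` instead of being labelled `OldAge`.

```python
import numpy as np
import pandas as pd

# Made-up ages covering each boundary, plus one value below the first bin.
ages = pd.Series([20, 45, 46, 65, 66, 18], name="Age")

# Bins are right-inclusive: (19, 45], (45, 65], (65, inf).
age_group = pd.cut(
    ages,
    bins=[19, 45, 65, np.inf],
    labels=["YoungAge", "MiddleAge", "OldAge"],
)
print(age_group)
```

Whether out-of-range ages should become missing values or be assigned to a group is a modelling decision; the point of the sketch is only that `pd.cut` makes that choice visible.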

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Display the number of Males and Females in the given dataset
print('*' * 55)
print('Gender Distribution in the Dataset')
gender_counts = dataset['Gender'].value_counts()
print(gender_counts)

# Create a pie plot to visualize the distribution of Gender
colors = ['skyblue', 'lightcoral']
plt.figure(figsize=(15, 6))
# Take the labels from the counts themselves so they always match the slices
plt.pie(gender_counts, labels=gender_counts.index.tolist(), colors=colors, autopct="%1.1f%%", startangle=90, shadow=True)
plt.title('Gender Distribution', fontdict={'fontsize': 15, 'fontweight': 'bold'})
plt.show()

# Visualize the Relationship between Gender and Response
sns.set(style="whitegrid")
plt.figure(figsize=(10, 5))
sns.countplot(x="Response", hue="Gender", palette={"Male": "skyblue", "Female": "lightcoral"}, data=dataset)
plt.xlabel('Response', fontdict={'fontsize': 12})
plt.ylabel('Count', fontdict={'fontsize': 14})
plt.title('Response vs. Gender', fontdict={'fontsize': 15, 'fontweight': 'bold'})


##### 1. Why did you pick the specific chart?

We used a pie chart to show the share of each gender in the dataset; it is well suited to comparing a small number of proportions of a whole.

To explore the relationship between Gender and Response, we used a count plot, which shows how many customers of each gender fall into each response class.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart, it's evident that there are 206,089 individuals, accounting for 54.1% of the total, in the Male category, and 175,020 individuals, which makes up 45.9%, in the Female category within the dataset. This observation indicates that the number of individuals in the Male category is notably higher compared to the Female category.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The count plot suggests that more men than women are interested in buying vehicle insurance. Men also make up a larger share of the dataset, so part of this gap reflects group size rather than a higher response rate. On this evidence, the company could give men somewhat more attention in its vehicle insurance campaigns.
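
Raw counts can be misleading when the groups differ in size, so it helps to compare response rates as well. The sketch below uses made-up rows to show how a group with more interested customers in absolute terms can still have the lower rate; the numbers are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Made-up data: 6 men (1 interested) and 4 women (1 interested).
toy = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "Response": [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Count, number interested, and response rate per gender.
rates = toy.groupby("Gender")["Response"].agg(
    customers="count",
    interested="sum",
    response_rate="mean",
)
print(rates)
```

On the project data the same aggregation would be applied to `dataset` directly, and it works for any categorical column in place of `Gender`.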

#### Chart - 2

In [None]:
# Chart - 2: Visualizations

# Pie chart for the distribution of age groups
plt.figure(figsize=(8, 8))
dataset['Age_Group'].value_counts().plot(kind='pie', explode=(0.1, 0.1, 0.1), shadow=True, autopct='%1.2f%%', pctdistance=1.5, labeldistance=1.8)
plt.title('Age Group Distribution', fontsize=15, fontweight='bold')

# Catplot to explore the relationship between Response and Age Group
sns.catplot(x="Response", hue="Age_Group", kind="count", palette="pastel", data=dataset)
plt.xlabel('Response', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Response vs. Age Group', fontsize=15, fontweight='bold')


##### 1. Why did you pick the specific chart?

We used a pie chart to show how the parts make up the whole, which makes the share of each age group easy to compare.

To see how two variables relate, we used a catplot, which counts the customers in each age group for each response class.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows that:

- About 67.53% of the data is from the young age group.
- Middle-aged people make up around 25.11% of the data.
- The old-age group accounts for roughly 7.37% of the data.

What's notable is that a larger number of responses come from the young age group, suggesting they are more interested in our offerings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The pie chart and count plot show that:

- Younger people are quite interested in buying vehicle insurance, which is great for our business.
- Middle-aged individuals are somewhat less interested compared to the younger age group.
- Response from older people is very low.

In simple terms, concentrating on the younger age group can boost our business, as they are showing the most interest.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

print('*'*55)
print('How many customers have a driving license')
print(dataset['Driving_License'].value_counts())


# Distribution of Driving License through a pie plot
print('*'*55)
dataset['Driving_License'].value_counts().plot(kind='pie', figsize=(15,6), autopct="%1.1f%%", startangle=90,shadow=True,
                               labels=['Having Driving License(%)','Not Having Driving License(%)'],
                               colors=['Brown','green'], explode=[0,0])

In [None]:
plt.figure(figsize=(5,5))
sns.barplot(x='Driving_License', y='Response', data=dataset)
plt.show()

##### 1. Why did you pick the specific chart?

We picked a pie chart and a bar plot. The pie chart expresses a part-to-whole relationship, so the share of customers with and without a driving license is easy to compare. The bar plot shows the mean Response, that is, the response rate, for each value of Driving_License.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows that 99.8% of customers hold a driving license and only 0.2% do not. Most of the positive responses come from customers who hold a driving license.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The charts indicate that customers who hold a driving license are the ones who can contribute to the business, while customers without one are of little use for a vehicle insurance campaign. We should therefore target customers who hold a driving license.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(20,10))
sns.countplot(x='Region_Code',hue='Response',data=dataset)

##### 1. Why did you pick the specific chart?

A count plot is a bar chart of category frequencies that lets us compare groups in the data. It shows which groups are most common and how they relate to each other. Here we use it to see how the different region codes are connected to responses.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Previously_Insured value counts
print('*'*55)
print('How many customers were previously insured')
print(dataset['Previously_Insured'].value_counts())

# Previously_Insured distribution as a pie chart
print('*'*50)
dataset['Previously_Insured'].value_counts().plot(kind='pie', figsize=(15,6), autopct="%1.1f%%", startangle=180, shadow=True,
                               labels=['Not Previously Insured(%)','Previously Insured(%)'],
                               colors=['purple','green'],explode=[0,0])

In [None]:
# Chart - 5 visualization code (bar plot)
plt.figure(figsize=(8,8))
sns.barplot(x='Previously_Insured', y='Response', data=dataset)
plt.show()

##### 1. Why did you pick the specific chart?

We picked a pie chart because it expresses a part-to-whole relationship, so the share of customers who were and were not previously insured is easy to compare. The bar plot then shows the response rate for each group.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows that 54.2% of customers were not previously insured and 45.8% were.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Display the value counts for the 'Vehicle_Damage' column
print("Count of 'Vehicle_Damage' values:")
print(dataset['Vehicle_Damage'].value_counts())

# Create a pie chart to visualize the distribution of 'Vehicle_Damage'
colors = ['purple', 'green']
plt.figure(figsize=(15, 6))
dataset['Vehicle_Damage'].value_counts().plot(kind='pie', autopct="%1.1f%%", startangle=180, shadow=True, labels=['Vehicle has Damage', 'Vehicle has no damage'], colors=colors, explode=[0, 0])
plt.title('Vehicle Damage Distribution')


In [None]:
# Chart - 6 Visualization

plt.figure(figsize=(6, 6))
sns.barplot(x='Vehicle_Damage', y='Response', data=dataset, palette='muted')
plt.show()


##### 1. Why did you pick the specific chart?

We chose a pie chart to illustrate the part-to-whole relationship in the data. By dividing a circle into differently colored portions, it makes the share of customers with and without past vehicle damage easy to compare. The bar plot then shows the response rate for each group.

##### 2. What is/are the insight(s) found from the chart?

The pie chart above provides these insights:

- Approximately 50.5% of the data belongs to customers with vehicles that have damage.
- About 49.5% of the data is from customers whose vehicles do not have damage.

The accompanying bar plot compares the response rate of customers with and without past vehicle damage.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Vehicle Age Distribution
plt.figure(figsize=(8, 5))
sns.countplot(x='Vehicle_Age', hue='Response', data=dataset, palette='Set2')
plt.xlabel('Vehicle Age', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.title('Vehicle Age Distribution', fontsize=15, fontweight='bold')
plt.show()



##### 1. Why did you pick the specific chart?

We picked a count plot because it shows the relationship between vehicle age and Response, which helps us analyze the dataset.

##### 2. What is/are the insight(s) found from the chart?

- Customers whose vehicle is less than 1 year old have a very low chance of purchasing insurance.
- Customers whose vehicle is 1 to 2 years old are more likely to be interested in insurance than the other two categories.

These observations emphasize the significance of vehicle age in determining customer interest in insurance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The data shows that customers whose vehicle is 1 to 2 years old have the greatest impact on the business, as they are the most likely to choose the vehicle insurance.

#### Chart - 8

In [None]:
# Plot a histogram with a density curve for each continuous column to inspect its distribution
def Distribution_plot(data):
  for col in Continuous_variable:
    fig = plt.figure(figsize=(9,6))
    ax = fig.gca()
    feature = data[col]
    # sns.distplot is deprecated; histplot with kde=True gives the equivalent view
    sns.histplot(feature, kde=True, stat='density', ax=ax)
    # Dashed lines mark the mean (magenta) and the median (cyan)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
  plt.show()

Distribution_plot(dataset)

##### 1. Why did you pick the specific chart?

We used a distribution plot (a histogram with a density curve) for the univariate analysis of each continuous variable. The dashed lines mark the mean and the median of each variable, which makes any skew easy to spot.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Correlation Heatmap visualization code
plt.figure(figsize=(12,10))
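# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(dataset.select_dtypes(include='number').corr(), cmap='PiYG', annot=True)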

##### 1. Why did you pick the specific chart?

  **Answer.** We picked a heatmap to see how strongly each variable is correlated with every other variable.

##### 2. What is/are the insight(s) found from the chart?

**Observation:**
From the heatmap, we can see that there is no strong correlation between any pair of variables, so we can proceed to modeling without removing features on account of correlation.

Variables negatively correlated with Response:
- Previously_Insured
- Policy_Sales_Channel
- Vintage

Variables positively correlated with Response:
- Age
- Driving License
- Region Code
- Annual Premium

This information helps us understand the relationships between the variables in the dataset.
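
Reading the signs off a heatmap can be error-prone, so it may help to list each feature's correlation with the target directly. The following is a small sketch on made-up rows, not on the project data; the values were chosen only to show one negative and one positive correlation.

```python
import pandas as pd

# Made-up rows; Gender is non-numeric and is excluded from the correlation.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Age": [25, 30, 45, 50, 28, 47],
    "Previously_Insured": [1, 1, 0, 0, 1, 0],
    "Response": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every numeric feature with the target, sorted ascending.
target_corr = (
    toy.select_dtypes(include="number")
    .corr()["Response"]
    .drop("Response")
    .sort_values()
)
print(target_corr)
```

Replacing `toy` with `dataset` would give the corresponding ranking for the real features.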


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Pair Plot visualization code
sns.pairplot(dataset,hue='Response')

##### 1. Why did you pick the specific chart?

A pair plot is like a visual guide that helps us figure out which features work best together to explain relationships between variables or create distinct groups. It's a way to see patterns in the data and how features relate to each other. Think of it as a visual map of these relationships, which can be handy for understanding the data better.

##### 2. What is/are the insight(s) found from the chart?

The chart above tells us that there is not a strong linear relationship between the variables, and the data points are not easily separated by simple straight lines. This means that the data doesn't show clear linear patterns, and it might require more complex models or approaches to analyze and understand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

We have different statistical tests for different scenarios:
1. Single categorical feature -> One-proportion test
2. Two categorical features -> Chi-squared test
3. One numerical feature -> One-sample t-test
4. One numerical and one categorical (= 2 categories) feature -> Two-sample t-test
5. One numerical and one categorical (> 2 categories) feature -> ANOVA test
6. Two numerical features -> Correlation test

Let's define three hypothetical statements and perform the appropriate test for each.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis :** There is no relationship between the "Age" of the customer and their "Response" to the Health Insurance offer.

**Alternate Hypothesis :** There is a relationship between "Age" and "Response" to the Health Insurance offer.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import chi2_contingency

# chi2_contingency expects a table of counts, so cross-tabulate Age against Response
contingency_table = pd.crosstab(dataset['Age'], dataset['Response'])
stat, p, dof, expected = chi2_contingency(contingency_table)

# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')

##### Which statistical test have you done to obtain P-Value?

Chi Square Test


##### Why did you choose the specific statistical test?

We used the chi-square test to determine whether there is a significant association between two variables, here 'Age' and 'Response'. The test shows a significant association between age and response, so we reject the null hypothesis that the two variables are independent.
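The chi-square test works on a table of counts rather than on the raw columns. The sketch below shows one way to build that table with `pd.crosstab`, after binning age so that each cell holds a reasonable number of customers. The data is synthetic, and the `toy` frame, the `Age_Band` column and the bin edges are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for the real dataset (assumption: illustrative values only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(20, 80, size=1000),
    "Response": rng.integers(0, 2, size=1000),
})

# Bin Age so that every cell of the table has a reasonable count
toy["Age_Band"] = pd.cut(toy["Age"], bins=[19, 35, 50, 65, 80],
                         labels=["20-35", "36-50", "51-65", "66-80"])

# Rows = age bands, columns = response classes, cells = customer counts
table = pd.crosstab(toy["Age_Band"], toy["Response"])

stat, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={stat:.3f}, p={p:.4f}, dof={dof}")
```

The degrees of freedom equal (rows - 1) x (columns - 1), so four age bands and two response classes give 3.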


### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis :** There is no relation between "Driving_License" and "Response"

**Alternate Hypothesis :** There is a relationship between "Driving_License" and "Response"

#### 2. Perform an appropriate statistical test.

In [None]:

# Perform Statistical Test to obtain P-Value

from scipy.stats import pearsonr


first_sample1 = dataset["Driving_License"].head(200000)
second_sample1 = dataset["Response"].head(200000)

stat, p = pearsonr(first_sample1, second_sample1)
print('stat=%.3f, p = %.5f'%(stat, p))
if p > 0.05:
  print('Fail to reject Null Hypothesis')
else:
  print('Reject Null Hypothesis')






##### Which statistical test have you done to obtain P-Value?

Pearson Correlation

##### Why did you choose the specific statistical test?

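We used the Pearson correlation test to determine whether there is a significant linear association between two variables, here 'Driving_License' and 'Response'. A p-value at or below 0.05 leads us to reject the null hypothesis and conclude that the two variables are associated; a larger p-value means we fail to reject it and have no evidence of an association. With a sample this large, a significant p-value does not by itself imply a strong relationship, so the size of the correlation coefficient should be read alongside it.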
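Both variables in this test are binary, and in that case the Pearson coefficient is the phi coefficient, which ties it directly to the chi-square statistic of the 2x2 table: chi2 = n * r^2 when no continuity correction is applied. The sketch below checks that identity on synthetic 0/1 data; the arrays `a` and `b` are made up and are not the notebook's columns.

```python
import numpy as np
from scipy.stats import pearsonr, chi2_contingency

# Two synthetic binary variables (assumption: illustrative data only)
rng = np.random.default_rng(1)
a = rng.integers(0, 2, size=500)
# b agrees with a roughly 70% of the time
b = np.where(rng.random(500) < 0.7, a, 1 - a)

# Pearson correlation of two 0/1 variables (the phi coefficient)
r, p_r = pearsonr(a, b)

# 2x2 table of counts: rows indexed by a, columns by b
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (a, b), 1)

# Chi-square test without Yates' continuity correction
chi2, p_chi2, dof, _ = chi2_contingency(table, correction=False)
n = len(a)
print(f"r={r:.4f}, n*r^2={n * r**2:.4f}, chi2={chi2:.4f}")
```

Because the two statistics carry the same information for binary data, either test can be used for 'Driving_License' against 'Response'.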

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis :** There is no relation between "Vintage" and "Response"

**Alternate Hypothesis :** There is a relationship between "Vintage" and "Response"

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value


from scipy.stats import chi2_contingency

# chi2_contingency expects a table of counts, so cross-tabulate Vintage against Response
contingency_table = pd.crosstab(dataset['Vintage'], dataset['Response'])
stat, p, dof, expected = chi2_contingency(contingency_table)

# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Reject null hypothesis')
else:
    print('Fail to reject null hypothesis')

##### Which statistical test have you done to obtain P-Value?

Chi Square Test

##### Why did you choose the specific statistical test?

We used the chi-square test to determine whether there is a significant association between two variables, here 'Vintage' and 'Response'. A p-value at or below 0.05 leads us to reject the null hypothesis and conclude that 'Vintage' and 'Response' are associated; a larger p-value means we fail to reject it and have no evidence of an association.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

**There are no missing values present in the dataset, so there is no need to handle missing values.**

#### What all missing value imputation techniques have you used and why did you use those techniques?

Large datasets often contain missing values, which have to be handled before modelling. No imputation was needed here, but these are the common techniques for handling missing values:



1.   Deleting Rows with missing values

2.   Impute missing values for continuous variable

3.   Using Algorithms that support missing values

4.   Prediction of missing values

5.   Imputation using Deep Learning Library
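The first two techniques in the list can be sketched in a few lines of pandas. The frame below is hypothetical (the real dataset has no gaps), and median/mode imputation is only one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (assumption: hypothetical values, the real dataset has no missing data)
toy = pd.DataFrame({
    "Annual_Premium": [2630.0, np.nan, 40454.0, 28619.0, np.nan],
    "Vehicle_Damage": ["Yes", "No", None, "Yes", "Yes"],
})

# Technique 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Technique 2: impute, using the median for the continuous column
# and the most frequent value (mode) for the categorical column
imputed = toy.copy()
imputed["Annual_Premium"] = imputed["Annual_Premium"].fillna(imputed["Annual_Premium"].median())
imputed["Vehicle_Damage"] = imputed["Vehicle_Damage"].fillna(imputed["Vehicle_Damage"].mode()[0])

print(dropped.shape)
print(imputed)
```

Dropping rows is simplest but discards data, while imputation keeps every row at the cost of slightly distorting the column's distribution.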


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# defining the code for outlier detection and percentage using IQR.
def find_outliers(data):
    outliers = []
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q2 = np.percentile(data, 50)
    q3 = np.percentile(data, 75)
    print(f"q1:{q1}, q2:{q2}, q3:{q3}")

    IQR = q3-q1
    lwr_bound = q1-(1.5*IQR)
    upr_bound = q3+(1.5*IQR)
    print(f"Lower bound: {lwr_bound}, Upper bound: {upr_bound}, IQR: {IQR}")

    for i in data:
        if (i<lwr_bound or i>upr_bound):
            outliers.append(i)
    len_outliers= len(outliers)
    print(f"Total number of outliers are: {len_outliers}")

    print(f"Total percentage of outlier is: {round(len_outliers*100/len(data),2)} %")

In [None]:
# code to find outliers
def showoutliers(data):
  plt.figure(figsize=(30,15))
  for n,column in enumerate(data.describe().columns):
    plt.subplot(5, 4, n+1)
    sns.boxplot(data[column])
    plt.title(f'{column.title()}',weight='bold')
    plt.tight_layout()


showoutliers(dataset)

In [None]:
#Define variable
continuous_variable = ['Region_Code', 'Annual_Premium', 'Policy_Sales_Channel','Age']
categorical_variable = [ 'Driving_License','Previously_Insured','Vintage', 'Response']
object_data= ['Gender', 'Vehicle_Age', 'Vehicle_Damage', 'Age_Group']

In [None]:
# Determining the IQR, lower and upper bounds, and number of outliers present in each continuous numerical feature
for feature in continuous_variable:
  print('--'*50)
  print('** Percentage of outliers of continuous variable of columns **')
  print(feature,":")
  find_outliers(dataset[feature])
  print("\n")

In [None]:
# Handling Outliers & Outlier treatments
# Defining the function that treats outliers with the IQR technique
def treat_outliers_iqr(data):
    # Calculate the first and third quartiles
    q1, q3 = np.percentile(data, [25, 75])

    # Calculate the interquartile range (IQR)
    iqr = q3 - q1

    # Compute the outlier fences
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)

    # Treat the outliers: values below the lower fence become q1, values above the upper fence become q3
    treated_data = [q1 if x < lower_bound else q3 if x > upper_bound else x for x in data]
    # Cast back to int (this truncates any fractional part)
    treated_data_int = [int(value) for value in treated_data]

    return treated_data_int

In [None]:
# Passing each feature in continuous_variable through the outlier treatment function defined above
for feature in continuous_variable:
  dataset[feature]= treat_outliers_iqr(dataset[feature])


# Re-checking the IQR, lower and upper bounds, and number of outliers in each continuous numerical feature
for feature in continuous_variable:
  print('--'*50)
  print('** After treating outliers, the percentage of outliers is: **')
  print(feature,":")
  find_outliers(dataset[feature])
  print("\n")

In [None]:
plt.figure(figsize=(30,15))
for n,column in enumerate(dataset.describe().columns):
  plt.subplot(5, 4, n+1)
  sns.boxplot(dataset[column])
  plt.title(f'{column.title()}',weight='bold')
  plt.tight_layout()


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.
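As a sketch of the IQR treatment applied in this section, the same rule can be written in vectorised form with NumPy: values outside the 1.5 x IQR fences are replaced by the nearest quartile. The function name `cap_outliers_iqr` and the sample values are hypothetical and only illustrate the rule.

```python
import numpy as np

def cap_outliers_iqr(values):
    """Replace values outside the 1.5*IQR fences with the nearest quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values below the lower fence become q1, values above the upper fence become q3
    capped = np.where(values < lower, q1, values)
    capped = np.where(values > upper, q3, capped)
    return capped

sample = [1, 2, 3, 4, 5, 100]  # hypothetical values with one obvious outlier
result = cap_outliers_iqr(sample)
print(result)
```

A common alternative is to clip to the fences themselves (for example with `np.clip(values, lower, upper)`), which moves outliers less far than replacing them with the quartile.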

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Create copy of the dataset
df = dataset.copy()
#Drop the Age column because we have created Age_group columns
df.drop('Age', axis=1, inplace= True)

# List of variable that have to make label encoding
col =['Gender', 'Vehicle_Damage', 'Vehicle_Age', 'Age_Group']

#Import library for label Encoding
from sklearn import preprocessing

In [None]:
# Code for label Encoding
def Label_encoding(data):
  label_encoder = preprocessing.LabelEncoder()

  for var in col:
    data[var] = label_encoder.fit_transform(data[var])

Label_encoding(df)

df.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

I've used Label Encoding to convert the 'Gender', 'Vehicle_Damage', 'Vehicle_Age', and 'Age_Group' columns into a numerical format. Most machine learning algorithms require numeric inputs, so this step lets the models work with these categorical variables. Label Encoding replaces each category with an integer code, which keeps the data compact and adds no extra columns.
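One detail worth knowing is that `LabelEncoder` numbers the categories in sorted order, which need not match their natural order. The sketch below shows this for vehicle age and contrasts it with an explicit mapping. The category strings are assumed to match the `Vehicle_Age` values in the dataset, and `age_order` is a hypothetical mapping introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Category strings assumed to match the Vehicle_Age values in the dataset
vehicle_age = pd.Series(["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"])

# LabelEncoder assigns codes following the sorted order of the class labels
encoder = LabelEncoder()
codes = encoder.fit_transform(vehicle_age)
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# Explicit mapping that follows the natural age order instead (hypothetical)
age_order = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
ordinal_codes = vehicle_age.map(age_order)
print(ordinal_codes.tolist())
```

Tree-based models are largely indifferent to the code order, but for linear models such as Logistic Regression an ordering that reflects the real progression can be easier to interpret.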

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), cmap ='PiYG', annot = True)

Let's keep in our final dataframe only the features that carry independent information about the dependent variable. For this we use the Variance Inflation Factor (VIF) technique to detect multicollinearity among the predictors.

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Import VIF library
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Code to calculate VIF
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
calc_vif(df[[i for i in df.describe().columns ]])

We now recalculate the VIF (Variance Inflation Factor) after excluding:

the Driving_License variable, because it gives a VIF of 30.00, which is very high, so we do not include it.
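For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it with plain NumPy on synthetic data and adds an intercept to each regression. As far as we know, statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so whether a constant column is present can change the values; treat that as an assumption worth checking. The helper name `vif_with_intercept` is hypothetical.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        # With an intercept the residuals have zero mean, so this is the usual R^2
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic features (assumption: illustrative data only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = rng.normal(size=2000)
x3 = x1 + x2 + rng.normal(scale=0.1, size=2000)  # nearly a linear combination of x1 and x2

vif_independent = vif_with_intercept(np.column_stack([x1, x2]))
vif_collinear = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(vif_independent)  # both close to 1
print(vif_collinear)    # all three strongly inflated
```

A VIF near 1 means a feature is almost unexplained by the others, while values above roughly 5 to 10 are commonly read as a sign of multicollinearity.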

In [None]:
# calculating the vif by excluding some features which are not giving any information
calc_vif(df[[i for i in df.describe().columns if i not in ['Driving_License']]])

In [None]:
final_df = df[[i for i in df.describe().columns if i not in ['Driving_License']]]

##### What all feature selection methods have you used  and why?

We plotted seaborn's correlation heatmap to see the relationship of each feature with the target variable and with the other features, and then used the Variance Inflation Factor (VIF) to check for multicollinearity. Driving_License was dropped because of its very high VIF.

##### Which all features you found important and why?

In [None]:
final_df.columns

We have selected these features and found that none of them has a high VIF that would affect our dataset. The features that are important for our dataset are stored in final_df.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale you data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x = final_df.drop(columns='Response')
y = final_df['Response']

## Importing train_test_split from sklearn
from sklearn.model_selection import train_test_split

## Spliting data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, stratify=y)

In [None]:
# Checking the distribution of the dependent variable in the training dataset
print("Distribution of data for dependent variable in train :")
print(y_train.value_counts())

print('-'*50)

# Checking the distribution of the dependent variable in the test dataset
print("Distribution of data for dependent variable in test :")
print(y_test.value_counts())

##### What data splitting ratio have you used and why?

We have split the whole dataset in the ratio of 80% (Training) and 20% (Test). The split is stratified on Response, so both sets keep the class proportions of the full dataset.
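The effect of `stratify` can be seen on a small synthetic target. In this sketch the 12% positive rate and the names `X_toy` and `y_toy` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 12% positives (assumption: illustrative numbers only)
y_toy = np.array([1] * 120 + [0] * 880)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(f"positive rate, full:  {y_toy.mean():.3f}")
print(f"positive rate, train: {y_tr.mean():.3f}")
print(f"positive rate, test:  {y_te.mean():.3f}")
```

Without stratification the minority share of a small test set can drift noticeably from one random split to the next.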

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

The dataset is highly imbalanced. The Response column is 0 for customers who are not interested in Health Insurance and 1 for customers who are interested, so let's check how many customers fall into each class.

In [None]:
# Handling Imbalanced Dataset (If needed)

print('Check whether the given data is balanced or not')

print('--'*50)
plt.title("Response Class Distribution")
sns.countplot(x=df['Response'])
df['Response'].value_counts()

In [None]:
# Importing SMOTE for balancing the dataset
from imblearn.over_sampling import SMOTE

# Fitting the data
smote = SMOTE(sampling_strategy='minority', random_state=0)
x_sm, y_sm = smote.fit_resample(x, y)

# Checking value counts for both classes before and after handling class imbalance:
for target,label in [[y,"Before"],[y_sm,'After']]:
  print(label+' Handling Class Imbalance:')
  print(target.value_counts(),'\n')

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

We use SMOTE to address the imbalance. The Synthetic Minority Oversampling Technique, or SMOTE for short, is a type of data augmentation for the minority class. It works by generating new synthetic examples that lie close, in feature space, to existing minority-class examples. In this notebook SMOTE is fitted on the full feature set (x, y) and the resampled data is then split again into train and test sets, so the test set also contains synthetic samples.
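The core idea of SMOTE can be sketched with NumPy alone: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. This is a simplified illustration and not the imbalanced-learn implementation; the function `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    k = min(k, n - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # For each new point: a random base sample and one of its k nearest neighbours
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Random position on the segment between the base sample and the neighbour
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Synthetic minority class with two features (assumption: illustrative data only)
minority = np.random.default_rng(42).normal(size=(20, 2))
synthetic = smote_like_samples(minority, n_new=50)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies. A common practice is to fit the oversampler on the training split only, so that the test set contains no synthetic rows.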

In [None]:
# Re-splitting the dataset after using SMOTE
x_smote_train, x_smote_test, y_smote_train, y_smote_test = train_test_split(x_sm,y_sm , test_size = 0.2, random_state = 0)

In [None]:
# Check the distribution of the dependent variable in the train dataset
print("Distribution of data for dependent variable in train :")
print(y_smote_train.value_counts())

print('-'*50)

# Check the distribution of the dependent variable in the test dataset
print("Distribution of data for dependent variable in test :")
print(y_smote_test.value_counts())

In [None]:

## Rescaling your data
# Importing StandardScaler for Data Scaling
from sklearn.preprocessing import StandardScaler

# Creating object
std_scaler= StandardScaler()

# Fit and Transform
x_smote_train= std_scaler.fit_transform(x_smote_train)
x_smote_test= std_scaler.transform(x_smote_test)

## ***7. ML Model Implementation***

### ML Model - 1

## **1) Logistic Regression**

In [None]:
# Metrics used inside analyse_model
from sklearn.metrics import classification_report, confusion_matrix, roc_curve

# Defining a function to train the input model and visualise evaluation metrics: classification report, confusion matrix and AUC-ROC curve
def analyse_model(model, x_train, x_test, y_train, y_test):

  '''Takes a classifier model, train set and test set as input, plots the evaluation metrics and returns the fitted model'''

  # Fitting the model
  model.fit(x_train,y_train)

  # Printing the best parameters when the model is a search object such as GridSearchCV
  if hasattr(model, 'best_params_'):
    print(f"The best parameters are: {model.best_params_}")

  # Plotting Evaluation Metrics for train and test dataset
  for x, act, label in ((x_train, y_train, 'Train-Set'),(x_test, y_test, "Test-Set")):

    # Getting required metrics
    pred = model.predict(x)
    pred_proba = model.predict_proba(x)[:,1]
    report = pd.DataFrame(classification_report(y_pred=pred, y_true=act, output_dict=True))
    fpr, tpr, thresholds = roc_curve(act, pred_proba)

    # Classification report
    plt.figure(figsize=(18,3))
    plt.subplot(1,3,1)
    sns.heatmap(report.iloc[:-1, :-2].T, annot=True, cmap=sns.color_palette("crest", as_cmap=True),fmt=".2f",annot_kws={"fontsize":14, "fontweight":"bold"},linewidths=1.0)
    plt.title(f'{label} Classification Report')

    # Confusion Matrix
    plt.subplot(1,3,2)
    matrix= confusion_matrix(y_true=act, y_pred=pred)
    sns.heatmap(matrix, annot=True, cmap=sns.color_palette("flare", as_cmap=True),fmt="d", annot_kws={"fontsize":14, "fontweight":"bold"},linewidths=1.0)
    plt.title(f'{label} Confusion Matrix')
    plt.xlabel('Predicted labels')
    plt.ylabel('Actual labels')

    # AUC_ROC Curve
    plt.subplot(1,3,3)
    plt.plot([0,1],[0,1],'k--')
    plt.plot(fpr,tpr,label=f'AUC = {np.round(np.trapz(tpr,fpr),3)}')
    plt.legend(loc=4)
    plt.title(f'{label} AUC_ROC Curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.tight_layout()

  plt.show()

  return model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Importing LogisticRegression from sklearn
from sklearn.linear_model import LogisticRegression

# Fitting Logistic Regression Model and Visualizing evaluation Metric Score chart
logistic_classifier = LogisticRegression(fit_intercept=True, penalty='l2',max_iter=20000,random_state=0)

# Train the model and plot its evaluation metrics
analyse_model(logistic_classifier, x_smote_train, x_smote_test, y_smote_train, y_smote_test)





After applying Logistic Regression, which is a relatively simple binary classification model, we obtained the following performance metrics:

- Accuracy: 79%
- Precision: 73% for class 1 and 89% for class 0
- Recall: 92% for class 1 and 66% for class 0
- F1 score: 81% for class 1 and 76% for class 0
- AUC-ROC curve: 85.3% for the training dataset and 85.4% for the testing dataset.

These metrics evaluate how well the model predicts customer interest in buying health insurance. The model is reasonably accurate, and for class 1 it favours recall (92%) over precision (73%): it finds most of the interested customers at the cost of more false positives.
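To make the relationship between these metrics concrete, the short example below derives them from confusion-matrix counts. The counts are hypothetical and are not the notebook's results.

```python
# Hypothetical confusion-matrix counts for class 1 (not the notebook's results)
tp, fn, fp, tn = 90, 10, 30, 70

precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}, accuracy={accuracy:.2f}")
```

For a cross-sell campaign, recall on class 1 reflects how many interested customers are reached, while precision reflects how much outreach is spent on customers who are not interested.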

### ML Model - 2 Decision tree

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Import the Decision Tree algorithm
from sklearn.tree import DecisionTreeClassifier


# Fitting the Decision Tree algorithm
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,max_depth=3, min_samples_leaf=5)

# Analyse the model
analyse_model(clf_gini,x_smote_train,x_smote_test, y_smote_train, y_smote_test)

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

After implementing Logistic Regression, we proceeded with the Decision Tree algorithm and found a slight increase in accuracy, from 79% to 80%.

--> Accuracy of 80%

--> Precision of 73% for class 1 and 93% for class 0

--> Recall of 95% for class 1 and 65% for class 0

--> F1 score of 83% for class 1 and 77% for class 0

--> AUC_ROC of 84.9% for Train and 85% for Test

### **ML Model - 3 Random Forest**

In [None]:
# Importing RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Fitting RandomForestClassifier Model
RF_classifier = RandomForestClassifier(n_estimators=500,max_depth=3,n_jobs=-1,random_state=0)

# Analysing the model and Visualizing evaluation Metric Score chart
analyse_model(RF_classifier, x_smote_train, x_smote_test, y_smote_train, y_smote_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

After implementing Logistic Regression and the Decision Tree, we implemented the Random Forest algorithm and found a slight increase in accuracy, from 80% to 81%.

--> Accuracy of 81%

--> Precision of 75% for class 1 and 90% for class 0.

--> Recall of 93% for class 1 and 69% for class 0.

--> F1 score of 83% for class 1 and 78% for class 0.

--> AUC_ROC of 86.8% for both Train and Test.

For comparison, the Decision Tree classifier achieved a recall of 0.95 on both the train and test datasets, along with an AUC-ROC of 84.9% on train and 85% on test.

Compared with Logistic Regression, Random Forest shows a slight improvement: accuracy rises from 79% to 81%, class 1 precision from 73% to 75%, class 1 recall from 92% to 93%, and the AUC-ROC from about 85% to 86.8% on both the training and testing datasets.

The Decision Tree still has the higher class 1 recall (95% on both the training and testing datasets), which means it is effective at identifying customers interested in health insurance, but its accuracy (80%) and AUC-ROC (84.9% on train, 85% on test) are lower than those of Random Forest.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:

# Importing GridSearchCV from sklearn
from sklearn.model_selection import GridSearchCV

# Defining classifier instance
classifier= RandomForestClassifier(random_state=0)

# Defining parameters
grid_values = {'n_estimators':[150,250,300,350], 'max_depth':[7,8,10]}

# Fitting RandomForestClassifier Model with GridSearchCV
RF_grid_classifier = GridSearchCV(classifier, param_grid = grid_values, scoring = 'roc_auc', cv=3)

# Analysing the model
analyse_model(RF_grid_classifier, x_smote_train, x_smote_test, y_smote_train, y_smote_test)

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV as the hyperparameter optimization technique. It evaluates every combination of the hyperparameter values listed in the grid using cross-validation (3 folds here, scored by ROC AUC) and selects the best-scoring combination. The search is exhaustive within the grid, but it can only find values that the grid contains.
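The cost of a grid search grows with the product of the grid sizes. For the Random Forest grid above, 4 values of `n_estimators` times 3 values of `max_depth` gives 12 candidates, and with `cv=3` that is 36 fits before the final refit. The sketch below shows the same bookkeeping on a small synthetic problem; the decision tree, the grid values and the toy data are assumptions chosen to keep the example fast.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic classification problem (assumption: illustrative data only)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical grid: 3 x 2 = 6 candidate combinations
param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_toy, y_toy)

n_candidates = len(list(product(*param_grid.values())))
print(f"{n_candidates} candidates x 3 folds = {n_candidates * 3} fits")
print(search.best_params_, round(search.best_score_, 3))
```

When the grid becomes large, a randomized search over the same ranges is a common way to keep the number of fits manageable.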

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

We evaluate our ML model using different metrics such as Recall, F1 score, Accuracy and AUC-ROC, each of which indicates how closely the predictions match the actual responses. In our case, every metric shows little difference between the train and test data, which suggests that the tuned model generalises well rather than overfitting.

## **ML Model - Naive Bayes Classifier**

In [None]:
# Import Library
from sklearn.naive_bayes import GaussianNB

# Fitting model
NB_classifier =GaussianNB()

# Analyse the model
analyse_model(NB_classifier, x_smote_train, x_smote_test, y_smote_train, y_smote_test)

After implementing the Naive Bayes algorithm, we obtained the following results:

- Accuracy: 78%
- Precision: 70% for class 1 and 94% for class 0.
- Recall: 96% for class 1 and 59% for class 0.
- F1 score: 81% for class 1 and 72% for class 0.
- AUC-ROC curve: 84.2% for the training dataset and 84.1% for the testing dataset.

Naive Bayes achieved an accuracy of 78%, which is lower compared to other models. However, it showed a high recall rate of 96% for identifying customers interested in health insurance. The precision for class 0 is relatively high at 94%, indicating that the model is good at correctly classifying customers not interested in health insurance. The AUC-ROC curve scores indicate a reasonable level of performance.

## **ML Model - XGBoost**

In [None]:

# Importing XGBClassifier
from xgboost import XGBClassifier

# Fitting XGBClassifier Model
XGB_classifier = XGBClassifier(n_estimators=150,max_depth=3,n_jobs=-1,random_state=0)

# Analysing the model and Visualizing evaluation Metric Score chart
analyse_model(XGB_classifier, x_smote_train, x_smote_test, y_smote_train, y_smote_test)

In [None]:

# Importing GridSearchCV from sklearn
from sklearn.model_selection import GridSearchCV

# Defining classifier instance
classifier= XGBClassifier(random_state=0)

# Defining parameters
grid_values = {'learning_rate':[0.01, 0.1,1],'n_estimators':[250,300,350], 'max_depth':[2,3,4]}

# Fitting XGBClassifier Model with GridSearchCV
XGB_grid_classifier = GridSearchCV(classifier, param_grid = grid_values, scoring = 'roc_auc', cv=3)

# Analysing the model
analyse_model(XGB_grid_classifier, x_smote_train, x_smote_test, y_smote_train, y_smote_test)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
# Storing the classification metrics of each model in order to build a dataframe and compare them
models = ["Logistic_regression","Decision Tree",'Random Forest', 'Naive Bayes',"XGboost"]
Accuracy =  [0.79,  0.80, 0.81, 0.78, 0.84]
Precision = [0.73,  0.73, 0.77, 0.70, 0.81]
Recall =    [0.92,  0.95, 0.93, 0.96, 0.90]
F1_Score=   [0.81,  0.83, 0.84, 0.81, 0.85]
AUCROC =    [0.85,  0.86, 0.90, 0.84, 0.93]

# Create dataframe from the lists
data = {'Models': models,
        'Accuracy' : Accuracy,
        'Precision': Precision,
        'Recall': Recall,
        'F1_Score': F1_Score,
        'AUCROC': AUCROC,
       }
metric_df = pd.DataFrame(data)

# Printing dataframe
metric_df

We have chosen the XGBoost model because it gives the highest **AUC-ROC (93.7% for train and 92.8% for test)** and the highest **accuracy (86% for train and 84% for test)**, and the difference between its class 0 and class 1 scores is the smallest compared to the other models.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
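A minimal sketch of the save-and-reload round trip with `pickle` is shown below. It uses a small Logistic Regression fitted on synthetic data as a stand-in for the final model, and the file name `best_model.pkl` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model fitted on synthetic data (assumption: not the notebook's final model)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_model.pkl"  # hypothetical file name
    # Save the fitted model
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    # Load it back
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

# Sanity check: the restored model predicts the same labels on unseen rows
unseen = rng.normal(size=(5, 3))
same_predictions = np.array_equal(model.predict(unseen), restored.predict(unseen))
print(same_predictions)
```

For deployment, any scaler fitted during preprocessing needs to be saved alongside the model so that new data goes through the same transformation. Pickle files should only be loaded from trusted sources.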

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

**Conclusion from EDA**

Here are the key insights from the dataset:

1. In the dataset, 54.1% of the customers are male, while 45.9% are female. Male customers show a higher interest in buying health insurance.

2. Approximately 67.53% of the customers fall into the age group of 20 to 45, 25.11% belong to the age group of 45 to 65, and 7.37% are above the age of 65.

3. Nearly 99.8% of the customers in the dataset have a valid driving license, while only 0.2% do not possess a driving license.

4. About 45.8% of the customers in the dataset already have insurance coverage.

These insights provide a better understanding of the customer demographics and their response to health insurance.

**Conclusion from Machine Learning Model**

We applied five different classification machine learning models to analyze the data, including Logistic Regression, Decision Tree, Random Forest, Naive Bayes, and XG Boost.

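Out of these models, XG Boost performed best overall, achieving the highest precision, F1 score, accuracy, and AUC-ROC score. Its recall (0.90) is the lowest of the five models, with Naive Bayes reaching the highest recall (0.96), so XG Boost trades a little recall for better precision. Overall, XG Boost is the most effective model for this dataset and can provide the best predictions for customer responses.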

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***
