# **Project Name - Health Insurance Cross Sell Prediction**


*   **Project Type** - Classification
*   **Contribution** - Individual








# **Problem Statement**

**BUSINESS PROBLEM OVERVIEW**

---

Our client is an insurance company that has provided health insurance to its customers. It now needs your help building a model to predict whether last year's policyholders (customers) will also be interested in the vehicle insurance the company offers. An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer pays regularly to an insurance company for this guarantee. For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurer will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays compensation (called ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

To predict whether a customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.


# **Let's Begin**

## **1. Know Your Data**


**Import Libraries**



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Dataset Loading**

In [None]:
import pandas as pd

# Load data set
df = pd.read_csv('/content/drive/MyDrive/Project/ML/capson/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv')

**Dataset First View**

In [None]:
df.head()

**Dataset Rows & Columns count**

In [None]:
df.shape

**Data Information**

In [None]:
df.info()

**Duplicate Values**

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

**Missing Values/Null Values**

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar= False)

**What did you know about your dataset?**

This dataset is provided by an insurance company to help build a model that predicts whether past health insurance policyholders would be interested in purchasing vehicle insurance. The dataset contains various features related to customer demographics, vehicle details, and policy specifics.

The dataset has 381,109 rows and 12 columns. There are no missing values and no duplicate rows.
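
Since `Response` is the modelling target, it may also be worth checking at this stage how balanced it is. The following is a minimal sketch, assuming a DataFrame with a binary `Response` column. `summarize_target` is a hypothetical helper name, and the small frame is made-up stand-in data, not the project dataset.

```python
import pandas as pd


def summarize_target(frame: pd.DataFrame, target: str = "Response") -> pd.DataFrame:
    """Return the count and share of each class in the target column."""
    counts = frame[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Made-up stand-in data; in the notebook, pass `df` instead.
toy = pd.DataFrame({"Response": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})
summary = summarize_target(toy)
print(summary)
```

A strongly uneven share between the two classes at this point is an early hint that plain accuracy will be a misleading metric later on.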

# **2. Understanding Your Variables**

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')


###  Variables Description









* **Gender:** Gender of the policyholder (Male, Female).
* **Age:** Age of the policyholder.
* **Driving_License:** Whether the policyholder has a driving license (1: Yes, 0: No).
* **Region_Code:** Unique code for the policyholder's region.
* **Previously_Insured:** Whether the policyholder already has vehicle insurance (1: Yes, 0: No).
* **Vehicle_Age:** Age of the policyholder's vehicle (> 2 Years, 1-2 Year, < 1 Year).
* **Vehicle_Damage:** Whether the policyholder's vehicle has been damaged in the past (Yes, No).
* **Annual_Premium:** Amount the policyholder needs to pay as the annual premium.
* **Policy_Sales_Channel:** Code for the channel through which the policy was sold.
* **Vintage:** Number of days the customer has been associated with the company.
* **Response:** Target variable indicating whether the customer is interested in vehicle insurance (1: Yes, 0: No).

# **3. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables.**

## 1 .Demographic Insights:

*  **Q**. How does age distribution differ between male and female customers?



In [None]:
# Set the aesthetics for the plots
sns.set(style="whitegrid")

# Age and Gender Distribution
plt.figure(figsize=(10, 6))
sns.boxplot(x='Gender', y='Age', data=df)
plt.title('Age Distribution by Gender')
plt.show()

1. Why did you pick the specific chart?

A box plot is chosen for visualizing the age distribution by gender for several reasons:



*   Summary Statistics: Box plots provide a five-number summary (minimum, first quartile, median, third quartile, and maximum) which is useful for understanding the distribution of age for each gender.
*   Outliers: They easily highlight outliers in the data, which could be important for identifying unusual age values within each gender.


*   Comparison: They facilitate easy comparison between different groups (male and female) by displaying the distribution and spread of age data side-by-side.
*   Visual Clarity: They provide a clear and concise way to visualize the central tendency and variability of age for each gender.


---







2. What is/are the insight(s) found from the chart?



*   The median age for males is slightly higher than that for females. This indicates that the central age tendency for males is higher compared to females.
*   Both males and females have a wide age range, but the range for males is slightly broader. This suggests that the male customer base spans a wider age spectrum.


*   The IQR for males is wider than that for females, indicating more variability in the age of male customers. Females have a more concentrated age distribution.
*   There are several outliers in the age distribution for females, primarily on the higher end. These outliers could represent a small segment of older female customers who might have different needs or preferences.


---







3. Will the gained insights help creating a positive business impact?

The insights from the box plot highlight important trends in age distribution across genders, allowing the insurance company to refine its marketing, product development, customer segmentation, resource allocation, and risk assessment strategies. These adjustments can lead to improved customer satisfaction, better-targeted marketing efforts, and optimized resource use, ultimately driving business growth and profitability.

**Q Is there a relationship between gender and interest in vehicle insurance (Response)?**

In [None]:
# Count of Male and Female Customers
plt.figure(figsize=(6, 4))
sns.countplot(x='Gender', data=df)
plt.title('Count of Male and Female Customers')
plt.show()

 1. Why Did You Pick the Specific Chart?



I chose the count plot for the following reasons:

* **Simplicity:** A count plot is straightforward and easy to interpret. It clearly shows the number of observations in each category.
* **Comparison:** It effectively compares the counts of male and female customers, allowing us to quickly see any disparities in gender distribution.
* **Visual Clarity:** Count plots are particularly useful for categorical data, making it an ideal choice to visualize the distribution of gender in the dataset.


---



2. What is/are the insight(s) found from the chart?
* **Gender Distribution:** The chart shows that the number of male customers is slightly higher than the number of female customers. This indicates a gender imbalance in the customer base, with males being more represented.

* **Potential Market Segment:** The significant number of both male and female customers suggests that there is a substantial market for both genders. However, the company might want to investigate why there is a higher proportion of male customers and whether this reflects market trends or indicates a gap in attracting female customers.


---



3. Will the Gained Insights Help Create a Positive Business Impact?

Yes, the insights from this chart can help create a positive business impact in several ways:

**Targeted Marketing Strategies:**

* **Addressing Imbalance:** Understanding the gender imbalance allows the company to tailor marketing campaigns to attract more female customers, thereby balancing the customer base.
* **Enhanced Engagement:** Tailoring communication and marketing efforts to the needs and preferences of each gender can enhance customer engagement and improve conversion rates.

**Product Development:**

* **Gender-Specific Products:** With knowledge of the gender distribution, the company can develop or adjust products to better meet the needs of both male and female customers, potentially increasing customer satisfaction and loyalty.

**Customer Segmentation:**

* **Better Segmentation:** Knowing the distribution of male and female customers helps in more accurate customer segmentation, allowing the company to create targeted communication strategies for different segments, optimizing resource allocation and marketing spend.

**Strategic Decisions:**

* **Market Analysis:** This insight provides a foundation for further market analysis. The company can investigate factors contributing to the gender imbalance, such as product appeal, pricing strategies, or marketing channels used, and adjust strategies accordingly.
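
The count plot above shows how many male and female customers there are, but the question about a relationship between gender and `Response` needs the two columns looked at together. One hedged way to do that is a cross-tabulation plus a chi-square test of independence. The sketch below assumes columns named `Gender` and `Response`. `gender_response_table` is a hypothetical helper, and the frame is made-up stand-in data, so its numbers say nothing about the real customers.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def gender_response_table(frame: pd.DataFrame):
    """Cross-tabulate Gender against Response and test for independence."""
    table = pd.crosstab(frame["Gender"], frame["Response"])
    # Share of Response = 1 within each gender
    rates = table.div(table.sum(axis=1), axis=0)[1]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return table, rates, p_value


# Made-up stand-in data; in the notebook, pass `df` instead.
toy = pd.DataFrame({
    "Gender": ["Male"] * 60 + ["Female"] * 40,
    "Response": [1] * 12 + [0] * 48 + [1] * 4 + [0] * 36,
})
table, rates, p_value = gender_response_table(toy)
print(table)
print(rates)
print("p-value:", p_value)
```

A small p-value would suggest that interest differs by gender. With a dataset this large, even tiny differences can be statistically significant, so the per-gender rates are worth reading alongside it.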

## 2. Insurance Insights:

**Q.What is the distribution of annual premiums?**

In [None]:
# Annual Premium Analysis
plt.figure(figsize=(8, 3))
sns.histplot(df['Annual_Premium'], kde=True)
plt.title('Distribution of Annual Premiums')
plt.show()

plt.figure(figsize=(8, 5))
sns.boxplot(y='Annual_Premium', data=df)
plt.title('Box Plot of Annual Premiums')
plt.show()

1. Why I Picked the Specific Charts

* Histogram with KDE (Kernel Density Estimate):

 * **Purpose:** The histogram with a KDE overlay provides a clear view of the distribution of the Annual_Premium values. It helps us see how the premiums are spread out, where they cluster, and if there are any noticeable peaks or gaps.

 * **Usefulness:** This is useful for understanding the overall distribution of annual premiums, identifying common premium values, and detecting any skewness in the data.
* Box Plot:

 * **Purpose:** The box plot is used to visualize the distribution of Annual_Premium with a focus on identifying outliers, the interquartile range (IQR), and the median.
 * **Usefulness:** This is useful for identifying outliers in the data and understanding the spread and central tendency of the annual premiums. It helps in detecting any unusual data points that might need further investigation.


---




2. What is/are the insight(s) found from the chart?

**Histogram with KDE**

 * **Distribution Shape:** The histogram shows that the majority of annual premiums are clustered towards the lower end of the scale. There is a long tail extending towards the higher premium values, indicating that while most premiums are low, there are some very high premium values.
 * **Skewness:** The data is highly skewed to the right, meaning there are a few very high premium values compared to the bulk of the data.
 * **Modes:** There are distinct peaks near the lower end, suggesting that certain premium amounts are more common.

**Box Plot**
* **Median:** The median annual premium is relatively low, close to the bottom of the box. This reinforces the observation from the histogram that most premiums are on the lower side.
* **IQR (Interquartile Range):** The box plot shows a relatively small IQR, indicating that the middle 50% of the data is tightly clustered.
* **Outliers:** There are many outliers above the upper whisker, which confirms the presence of high premium values. These outliers represent a small portion of the data but have significantly higher premiums.


---



3. Will the gained insights help creating a positive business impact?

The visualizations provide valuable insights into the distribution of annual premiums. Understanding the skewness, the common premium values, and the outliers can help the company refine its marketing strategies, adjust pricing models, manage risk more effectively, and ultimately enhance customer retention and satisfaction. These actions will contribute positively to the company's business impact by aligning services and products with customer needs and behaviors.
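
The skewness and outlier observations can also be put into numbers. The sketch below uses the common 1.5 × IQR rule, which is the same convention a box plot uses for its whiskers by default. `iqr_outlier_summary` is a hypothetical helper name, and the series is made-up stand-in data rather than the real `Annual_Premium` column.

```python
import pandas as pd


def iqr_outlier_summary(values: pd.Series) -> dict:
    """Summarize skewness and count points outside the 1.5 * IQR fences."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = (values < lower) | (values > upper)
    return {
        "skew": values.skew(),  # > 0 means a longer right tail
        "iqr": iqr,
        "lower_fence": lower,
        "upper_fence": upper,
        "n_outliers": int(outliers.sum()),
    }


# Made-up stand-in data; in the notebook, pass df['Annual_Premium'] instead.
toy_premiums = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100], dtype=float)
premium_summary = iqr_outlier_summary(toy_premiums)
print(premium_summary)
```

Whether the flagged values should be capped, transformed (for example with a log), or left alone depends on the model. Tree-based models are usually less sensitive to them than linear ones.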

**Q How does vehicle age affect the likelihood of a customer being interested in vehicle insurance?**

In [None]:
# Vehicle Age and Insurance Interest
plt.figure(figsize=(6, 4))
sns.countplot(x='Vehicle_Age', hue='Response', data=df)
plt.title('Vehicle Age vs. Insurance Interest')
plt.show()


**Insights from the Chart**

* **Count Plot**

* **Chart Selection:** The count plot is chosen because it effectively displays the frequency of categorical data. Here, it shows the count of customers interested in vehicle insurance (Response = 1) and not interested (Response = 0) across different vehicle age categories.

**Insights**

**1. Distribution Across Vehicle Ages:**

* **> 2 Years:** Few customers have vehicles older than 2 years, and among these, even fewer are interested in vehicle insurance.
* **1-2 Year:** The majority of customers have vehicles aged 1-2 years, and a significant portion of these customers show interest in vehicle insurance.
* **< 1 Year:** The count of customers with vehicles less than 1 year old is also high, but the interest in insurance is lower compared to the 1-2 year age group.

**2. Interest in Insurance by Vehicle Age:**

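* In absolute numbers, customers with vehicles aged 1-2 years make up the largest group of interested customers.
* Customers with vehicles older than 2 years or less than 1 year contribute fewer interested customers. Because the three groups differ greatly in size, the share of interested customers within each group should be checked before ranking the groups by interest.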

**Positive Business Impact**

**1. Targeted Marketing:**

* **Focus on 1-2 Year Vehicles:** The company can target marketing efforts towards customers with vehicles aged 1-2 years, as they show the highest interest in purchasing insurance. Tailored marketing campaigns can be designed to highlight the benefits of vehicle insurance specifically for this group.
* **Educational Campaigns for Other Groups:** For customers with vehicles older than 2 years or less than 1 year, educational campaigns can be created to inform them about the importance and benefits of vehicle insurance, potentially increasing their interest.

**2. Product Development:**

* Customized Insurance Plans: Develop specialized insurance plans or discounts for vehicles aged 1-2 years to capitalize on the high interest in this segment.

* Incentives for Low-Interest Groups: Offer incentives or tailored packages for customers with vehicles older than 2 years or less than 1 year to increase their interest in purchasing insurance.

**3. Customer Retention and Acquisition:**

* Retention of High-Interest Group: Implement strategies to retain customers with 1-2 year-old vehicles by offering loyalty benefits or renewal discounts.
* Acquisition Strategies: Use the insights to acquire new customers by targeting similar demographics and vehicle age groups that show high interest in vehicle insurance.

**4. Risk Management:**

* Understanding Risk Profiles: The company can use these insights to better understand the risk profiles associated with different vehicle ages, helping to price premiums more accurately and manage risk more effectively.
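
A count plot compares absolute counts, so a large group can look "more interested" simply because it has more customers. Comparing the share of `Response = 1` within each group avoids that. The sketch below assumes columns named `Vehicle_Age` and `Response`. `response_rate_by` is a hypothetical helper, and the frame is made-up stand-in data chosen only to show how counts and rates can disagree.

```python
import pandas as pd


def response_rate_by(frame: pd.DataFrame, column: str, target: str = "Response") -> pd.DataFrame:
    """Return customers, interested customers, and response rate per category."""
    grouped = frame.groupby(column)[target].agg(customers="count", interested="sum")
    grouped["response_rate"] = grouped["interested"] / grouped["customers"]
    return grouped.sort_values("response_rate", ascending=False)


# Made-up stand-in data; in the notebook, call response_rate_by(df, 'Vehicle_Age').
toy = pd.DataFrame({
    "Vehicle_Age": ["< 1 Year"] * 50 + ["1-2 Year"] * 40 + ["> 2 Years"] * 10,
    "Response": [1] * 5 + [0] * 45 + [1] * 10 + [0] * 30 + [1] * 4 + [0] * 6,
})
rates_by_age = response_rate_by(toy, "Vehicle_Age")
print(rates_by_age)
```

In this toy frame the 1-2 year group has the most interested customers, yet the smallest group has the highest rate. The same helper can be applied to `Driving_License` or `Previously_Insured`.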

## 3. Behavioral Insights:

**Q.Does having a driving license affect the interest in vehicle insurance?**

In [None]:
# Driving License and Insurance Interest
plt.figure(figsize=(6, 4))
sns.countplot(x='Driving_License', hue='Response', data=df)
plt.title('Driving License vs. Insurance Interest')
plt.xlabel('Driving License (1 = Yes, 0 = No)')
plt.ylabel('Count')
plt.show()


**1. Why did you pick the specific chart?**

**Count Plot:** The count plot is chosen because it is effective for displaying the frequency distribution of categorical variables. It allows us to compare the number of customers interested and not interested in vehicle insurance across different driving license statuses.


---

**2. What is/are the insight(s) found from the chart?**

Based on the generated plot, we can gain the following insights:

**1. Distribution by Driving License Status:**

* The majority of customers possess a driving license (Driving_License = 1).
* The count of customers without a driving license (Driving_License = 0) is significantly lower.

**2. Interest in Vehicle Insurance:**

* Among customers with a driving license, there is a noticeable proportion interested in vehicle insurance (Response = 1).
* The plot will show if there is a significant difference in interest levels between customers with and without a driving license.


---
**3. Will the gained insights help create a positive business impact?**

**1. Targeted Marketing:**

* **Driving License Holders:** If customers with driving licenses show higher interest in vehicle insurance, marketing efforts can be focused on this group. Personalized marketing campaigns can highlight the benefits of vehicle insurance for licensed drivers.
* **Non-Driving License Holders:** If there is an interest among non-license holders, educational campaigns can be developed to inform them about the value of vehicle insurance, even for non-drivers, possibly targeting secondary policies or family members.

**2. Product Development:**

* **Special Offers for License Holders:** Develop special offers or discounts for customers with driving licenses to incentivize them further.
* **Alternative Products:** For customers without driving licenses, the company can explore alternative insurance products or complementary services that might appeal to them.

**3. Customer Segmentation:**

* **License-Based Segmentation:** Segment customers based on their driving license status to tailor communication and product offerings more effectively. This can enhance customer satisfaction and increase conversion rates.


**Q. How does previous insurance status (Previously_Insured) correlate with the response to vehicle insurance?**

In [None]:
# Previous Insurance Status and Insurance Interest
plt.figure(figsize=(6, 4))
sns.countplot(x='Previously_Insured', hue='Response', data=df)
plt.title('Previous Insurance Status vs. Insurance Interest')
plt.xlabel('Previously Insured (1 = Yes, 0 = No)')
plt.ylabel('Count')
plt.show()


1. What is/are the insight(s) found from the chart?

Based on the generated plot, we can gain the following insights:

**1. Distribution by Previous Insurance Status:**

* We can observe how many customers had previous insurance (Previously_Insured = 1) versus those who did not (Previously_Insured = 0).

**2. Interest in Vehicle Insurance:**

* The plot will show if there is a significant difference in interest levels between customers who had previous insurance and those who did not. For example, it might reveal that customers without previous insurance are more interested in vehicle insurance, or vice versa.


---
2. Will the gained insights help create a positive business impact?

Analyzing the relationship between previous insurance status and interest in vehicle insurance provides valuable insights that can inform marketing strategies, product development, customer segmentation, and risk management. By leveraging these insights, the company can create a positive business impact by aligning its offerings with customer preferences and behaviors.


## 4. Sales Channel Insights:

**Q.Which policy sales channels are most effective in converting customers to vehicle insurance?**

In [None]:
# Set the plot size
plt.figure(figsize=(12, 6))

# Create a count plot of Policy Sales Channel with hue as Response
sns.countplot(x='Policy_Sales_Channel', hue='Response', data=df)

# Set the title and labels
plt.title('Policy Sales Channels vs. Insurance Interest')
plt.xlabel('Policy Sales Channel')
plt.ylabel('Count')

# Show the plot
plt.show()


**Insights from the Chart**

1. High Conversion Channels:
* Sales channels with a higher number of Response = 1 (interested in vehicle insurance) compared to Response = 0 are more effective.
2. Low Conversion Channels:
* Sales channels with a higher number of Response = 0 (not interested in vehicle insurance) indicate less effective channels for converting customers.

**Potential Business Impact**
1. Targeted Efforts:

* Focus marketing and sales efforts on channels that show a higher conversion rate to maximize effectiveness and ROI.

2. Resource Allocation:

* Allocate more resources to high-performing channels and investigate low-performing ones to understand and improve their conversion rates.

3. Strategic Planning:

* Use these insights for strategic planning and improving overall sales strategies by leveraging the most effective sales channels.
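
With many distinct channel codes, a count plot becomes hard to read, and channels with only a handful of customers can show extreme conversion rates by chance. One possible approach is to compute the conversion rate per channel, keep only channels above a minimum volume, and list the top few. In the sketch below, `top_channels`, `min_customers`, and `top_n` are hypothetical names and thresholds, and the frame is made-up stand-in data.

```python
import pandas as pd


def top_channels(frame: pd.DataFrame, min_customers: int = 1000, top_n: int = 10) -> pd.DataFrame:
    """Rank sales channels by conversion rate, ignoring low-volume channels."""
    stats = frame.groupby("Policy_Sales_Channel")["Response"].agg(["size", "mean"])
    stats = stats.rename(columns={"size": "customers", "mean": "conversion_rate"})
    stats = stats[stats["customers"] >= min_customers]
    return stats.nlargest(top_n, "conversion_rate")


# Made-up stand-in data; in the notebook, call top_channels(df).
toy = pd.DataFrame({
    "Policy_Sales_Channel": [26] * 20 + [152] * 30 + [99] * 2,
    "Response": [1] * 6 + [0] * 14 + [1] * 3 + [0] * 27 + [1] * 2,
})
best = top_channels(toy, min_customers=10, top_n=2)
print(best)
```

Channel 99 converts every customer in the toy data, but with only two customers it is filtered out. The volume threshold is a judgment call and should be set with the real channel sizes in mind.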

**Q.Is there any significant difference in the vintage (number of days customer has been associated with the company) between those interested and not interested in vehicle insurance?**

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Response', y='Vintage', data=df)
plt.title('Vintage vs. Insurance Interest')
plt.xlabel('Response (0: Not Interested, 1: Interested)')
plt.ylabel('Vintage (Days)')
plt.show()


**Interpretation of Results**

**Box Plot:** The box plot will show the distribution of vintage values for both groups. If there is a significant difference, we may see a noticeable shift or difference in the medians and the spread of the data.

**Conclusion**

By visualizing the data, and ideally confirming the visual impression with a statistical test, we can judge whether vintage is an important factor influencing customer interest in vehicle insurance. This can help the company tailor its strategies based on customer tenure.
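
The box plot gives only a visual comparison, so a statistical test can complement it. The sketch below uses a two-sided Mann-Whitney U test, which is one reasonable choice because it does not assume normally distributed values. It is an illustration, not the test used in this project. `compare_vintage` is a hypothetical helper, and the frame is randomly generated stand-in data.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def compare_vintage(frame: pd.DataFrame) -> dict:
    """Compare Vintage between interested and not interested customers."""
    interested = frame.loc[frame["Response"] == 1, "Vintage"]
    not_interested = frame.loc[frame["Response"] == 0, "Vintage"]
    stat, p_value = mannwhitneyu(interested, not_interested, alternative="two-sided")
    return {
        "median_interested": interested.median(),
        "median_not_interested": not_interested.median(),
        "p_value": p_value,
    }


# Randomly generated stand-in data; in the notebook, call compare_vintage(df).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Vintage": rng.integers(10, 300, size=500),
    "Response": rng.integers(0, 2, size=500),
})
vintage_result = compare_vintage(toy)
print(vintage_result)
```

A large p-value together with near-identical medians would suggest that vintage carries little signal on its own.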

# **4. Implementing Feature Engineering**

In [None]:
new_df = df.copy()
print(new_df.head())

In [None]:

# Create age bins
new_df['Age_Bin'] = pd.cut(new_df['Age'], bins=[0, 25, 45, 65, 100], labels=['Young', 'Middle-aged', 'Senior', 'Elder'])

# Annual premium divided by tenure in days (adding 1 avoids division by zero)
new_df['Premium_per_Day'] = new_df['Annual_Premium'] / (new_df['Vintage'] + 1)

# Create interaction terms
new_df['Age_Premium_Interaction'] = new_df['Age'] * new_df['Annual_Premium']
new_df['Age_Vintage_Interaction'] = new_df['Age'] * new_df['Vintage']

# Previous insurance and damage interaction
new_df['Prev_Ins_Damage_Interaction'] = new_df['Previously_Insured'].astype(str) + '_' + new_df['Vehicle_Damage']

# Group region codes into 'Low', 'Medium', 'High' by how often each code occurs
region_counts = new_df['Region_Code'].value_counts()
region_freq = new_df['Region_Code'].map(region_counts)
new_df['Region_Code_Grouped'] = np.select([region_freq < 100, region_freq < 200], ['Low', 'Medium'], default='High')

# Display the first few rows to check new features
new_df.head()


**Explanation of New Features**

**1. Age Bins:** Categorizes customers into different age groups which might have different tendencies towards vehicle insurance.

**2. Premium per Day:** Divides the annual premium by the customer's tenure in days (Vintage), giving a premium-to-tenure ratio that could correlate with interest in insurance.

**3. Interaction Terms:** These terms can capture combined effects of age with premium and vintage with age on the response variable.

**4. Previous Insurance and Damage Interaction:** This combined feature can capture the interaction effect of having previous insurance and vehicle damage.

**5. Region Code Grouping:** Simplifies region codes into broader categories which might help in reducing the complexity and improving model performance.


---




**Visualizing New Features**

In [None]:

# Create a figure and a set of subplots
fig, axs = plt.subplots(1, 2, figsize=(15, 6))

# Count plot for Age Bins
sns.countplot(x='Age_Bin', data=new_df, ax=axs[0])
axs[0].set_title('Count of Customers in Age Bins')

# Box plot for Premium per Day
sns.boxplot(y='Premium_per_Day', data=new_df, ax=axs[1])
axs[1].set_title('Box Plot of Premium per Day')

plt.tight_layout()
plt.show()

# Create another figure and a set of subplots
fig, axs = plt.subplots(1, 2, figsize=(15, 6))

# Count plot for Previous Insurance and Damage Interaction
sns.countplot(x='Prev_Ins_Damage_Interaction', data=new_df, ax=axs[0])
axs[0].set_title('Count of Previous Insurance and Damage Interaction')

# Count plot for Region Code Grouped
sns.countplot(x='Region_Code_Grouped', data=new_df, ax=axs[1])
axs[1].set_title('Count of Region Code Grouped')

plt.tight_layout()
plt.show()



# **5. Data Preparation**

**Encoding Categorical Variables:** Convert categorical features into numerical format using techniques like one-hot encoding or label encoding.

In [None]:
new_df.head()

In [None]:
#Encoding Categorical Variables

# One-hot encode categorical features
new_df = pd.get_dummies(new_df, columns=['Gender', 'Vehicle_Age', 'Vehicle_Damage', 'Age_Bin', 'Prev_Ins_Damage_Interaction', 'Region_Code_Grouped'])

# Encode 'Response' if not already done
new_df['Response'] = LabelEncoder().fit_transform(new_df['Response'])


**Scaling/Normalization**

In [None]:
# Initialize the scaler
scaler = StandardScaler()

# Apply scaling to numerical features
# Note: the scaler is fitted on the full dataset here, before the train/test split
num_cols = ['Age', 'Annual_Premium', 'Vintage', 'Premium_per_Day', 'Age_Premium_Interaction', 'Age_Vintage_Interaction']
new_df[num_cols] = scaler.fit_transform(new_df[num_cols])


**Splitting the Dataset**

In [None]:
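# Define feature matrix (X) and target vector (y)
# 'id' (if present) is only a row identifier, so it is dropped along with the target
X = new_df.drop(columns=['Response', 'id'], errors='ignore')
y = new_df['Response']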

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
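
Fitting the scaler before splitting lets test-set statistics influence the training features. A common alternative is to split first, stratify on the target so both parts keep the same class ratio, and put the scaler inside a pipeline so it is fitted on the training part only. The sketch below illustrates that pattern on randomly generated stand-in data. The variable names (`X_toy`, `X_tr`, and so on) are chosen to avoid clashing with the notebook's own variables and are not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data with an imbalanced binary target
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "Age": rng.integers(20, 85, size=400),
    "Annual_Premium": rng.normal(30000, 15000, size=400),
})
y_toy = pd.Series((rng.random(400) < 0.12).astype(int), name="Response")

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The scaler inside the pipeline is fitted on the training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

scaler_step = model.named_steps["standardscaler"]
print("Scaler means (from training data):", scaler_step.mean_)
print("Positive share train/test:", y_tr.mean(), y_te.mean())
```

Calling `model.predict_proba(X_te)` then applies the training-set scaling to the test rows automatically.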


# **6. Model Training and Evaluation**

**Select and Train Models**

In [None]:
# Initialize models
log_reg = LogisticRegression()
rf = RandomForestClassifier()
gb = GradientBoostingClassifier()

# Train models
log_reg.fit(X_train, y_train)
rf.fit(X_train, y_train)
gb.fit(X_train, y_train)

**Evaluate Models**

In [None]:
# Predictions
log_reg_pred = log_reg.predict(X_test)
rf_pred = rf.predict(X_test)
gb_pred = gb.predict(X_test)

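# Evaluation
# Note: roc_auc_score receives hard class predictions here, not predicted probabilities,
# so each score equals the balanced accuracy at the default 0.5 threshold.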
print("Logistic Regression:")
print(classification_report(y_test, log_reg_pred))
print("ROC AUC Score:", roc_auc_score(y_test, log_reg_pred))

print("Random Forest:")
print(classification_report(y_test, rf_pred))
print("ROC AUC Score:", roc_auc_score(y_test, rf_pred))

print("Gradient Boosting:")
print(classification_report(y_test, gb_pred))
print("ROC AUC Score:", roc_auc_score(y_test, gb_pred))
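
ROC AUC is meant to measure how well a model ranks positives above negatives, so it is normally computed from scores such as `predict_proba(...)[:, 1]` rather than from 0/1 predictions. When hard labels are passed instead, the result reduces to the balanced accuracy at the chosen threshold. The sketch below shows the difference on synthetic data from `make_classification`, so the numbers are illustrative only and do not describe this project's models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in for the real features
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.88, 0.12], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)              # 0/1 predictions at threshold 0.5
scores = clf.predict_proba(X_te)[:, 1]       # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", bal_acc)
print("AUC from probabilities:", auc_from_scores)
```

For the notebook's models, the equivalent call would be along the lines of `roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])`.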


# **7. Model Selection and Hyperparameter Tuning**

**Hyperparameter Tuning using Grid Search**

In [None]:
# Import necessary library
from sklearn.model_selection import GridSearchCV

# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize Grid Search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='roc_auc', n_jobs=-1)

# Fit Grid Search
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best ROC AUC Score:", grid_search.best_score_)
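
`best_score_` is a cross-validated score on the training data, so it is usually worth also checking the tuned model on the held-out test set. The sketch below does this with a deliberately small grid on synthetic data so that it runs quickly. The grid values and data are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data as a stand-in for the real features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Small illustrative grid; a real search would cover more values
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=1),
    param_grid={"max_depth": [3, None], "min_samples_split": [2, 10]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X_tr, y_tr)

# best_estimator_ is refitted on the whole training set by default
best_model = search.best_estimator_
test_auc = roc_auc_score(y_te, best_model.predict_proba(X_te)[:, 1])

print("Best parameters:", search.best_params_)
print("Cross-validated ROC AUC:", search.best_score_)
print("Held-out test ROC AUC:", test_auc)
```

A test score well below the cross-validated one would be a sign of overfitting to the training folds.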



# **Conclusion of the Project**

**Project Overview**

The project aimed to predict whether policyholders of a health insurance company would be interested in vehicle insurance. The dataset included various demographic, vehicle-related, and policy-related features.


**Models and Performance**

Three models were trained and evaluated:


 **1.   Logistic Regression**

 **2.   Random Forest**

 **3.   Gradient Boosting**



---


   **Model Evaluation**

1. **Logistic Regression:**

* Precision, recall, and F1-score for class 1 (interested in vehicle insurance) were all zero, indicating that the model failed to identify any positive cases.

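* ROC AUC Score: 0.5. Because the score was computed from hard class predictions, this value reflects a model that in effect predicts class 0 for every customer at the default threshold; it does not measure how well the predicted probabilities rank customers.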

2. **Random Forest:**

* The model had a better performance compared to logistic regression but still showed low recall and F1-score for class 1.

* ROC AUC Score: 0.529, indicating slight improvement but still not satisfactory.

3. **Gradient Boosting:**

* Similar to logistic regression, the gradient boosting model failed to identify positive cases.

* ROC AUC Score: 0.5, again the value obtained when no customer is predicted as interested at the default threshold.

**Conclusion of the Project**

The objective of the project was to predict whether policyholders with health insurance would be interested in vehicle insurance using various machine learning models. After training and evaluating the models (Logistic Regression, Random Forest, Gradient Boosting), the results indicated that the models performed poorly in predicting the interest in vehicle insurance, especially for the minority class (those interested in vehicle insurance).

**Key Insights:**

1. **Model Performance:**

* Logistic Regression and Gradient Boosting scored about 0.5 on ROC AUC as computed from hard class predictions, meaning they predicted almost no customers as interested at the default threshold.

* Random Forest performed slightly better with a ROC AUC score of 0.53 but still not satisfactory.

2. **Class Imbalance:**

* There is a significant class imbalance in the dataset, with the majority of customers not interested in vehicle insurance. This imbalance affected the model's ability to learn and predict the minority class.
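
One low-effort way to counter class imbalance in scikit-learn is the `class_weight="balanced"` option, which weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below compares a plain and a weighted logistic regression on synthetic data. It is an illustration under assumed data, and the effect on the real dataset would need to be measured separately.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data as a stand-in for the real features
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=8, weights=[0.9, 0.1], random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

# Weights that class_weight="balanced" applies to each class
balanced_weights = compute_class_weight("balanced", classes=np.unique(y_tr), y=y_tr)
print("Balanced class weights:", balanced_weights)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall for class 1: share of truly interested customers that are found
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print("Recall (plain):   ", recall_plain)
print("Recall (weighted):", recall_weighted)
```

Weighting typically trades some precision for recall on the minority class, so both metrics should be checked against the business cost of a missed lead versus a wasted contact.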

**Business Solutions and Recommendations**

1. **Address Class Imbalance:**

* **Upsampling/Downsampling:** Implement techniques like SMOTE (Synthetic Minority Over-sampling Technique) or undersampling the majority class to balance the dataset.

* **Class Weight Adjustment:** Modify the class weights in the model to place more importance on the minority class.

2. **Feature Engineering:**

* **New Features:** Introduce new features that may have a stronger correlation with vehicle insurance interest, such as customer interaction data, socio-economic factors, and customer feedback.

* **Interaction Terms:** Create interaction terms between variables to capture more complex relationships.

3. **Model Tuning and Selection:**


* **Hyperparameter Tuning:** Perform extensive hyperparameter tuning using techniques like Grid Search or Random Search to improve model performance.

* **Advanced Models:** Explore more advanced models like XGBoost, LightGBM, or deep learning techniques which might capture more complex patterns.

4. **Customer Segmentation:**

* **Segmentation Analysis:** Segment the customers based on demographics, vehicle age, and previous insurance status to identify potential target groups for vehicle insurance.

* **Personalized Marketing:** Use the segmented data to develop targeted marketing campaigns tailored to different customer segments.

5. **Data Enrichment:**

* **External Data:** Enrich the dataset with external data sources such as credit scores, social media activity, or other behavioral data to gain more insights into customer preferences and improve model accuracy.

* **Feedback Loop:** Implement a feedback loop where the model is continuously updated with new data and retrained to improve its accuracy over time.

6. **Cross-Sell Strategy:**

* **Targeted Offers:** Provide targeted offers or discounts on vehicle insurance for existing health insurance customers.

* **Bundling Products:** Create bundled insurance products that offer both health and vehicle insurance at a discounted rate to incentivize cross-purchasing.




# **8. GitHub**

GitHub link = [Click Here](https://github.com/rohit80025/Health-Insurance-Cross-Sell-Prediction-ML.git)
