<a href="https://colab.research.google.com/github/vkstar444/TRAIN-HEALTH-INSURANCE-CROSS-SELL-PREDICTION/blob/main/TRAIN_HEALTH_INSURANCE_CROSS_SELL_PREDICTIO_DEEP_LEARNING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Project Summary -**

### Summary of Health Insurance Cross-Sell Prediction Dataset

The dataset contains 381,109 rows and 12 columns, which represent information about individuals and their interest in purchasing health insurance. The aim is to predict whether a customer will buy insurance (captured in the `Response` column). Below is a detailed breakdown of the dataset and key insights.

#### Key Features:
1. **ID**: A unique identifier for each individual (the `id` column in the data). It serves only as a reference and is dropped before modeling.
   
2. **Gender**: This categorical variable has two values: "Male" and "Female." Gender might influence purchasing behavior or insurance uptake trends, though it's important to test whether this assumption holds.

3. **Age**: The `Age` column gives each customer's age. Age matters in health insurance because older individuals tend to have different needs and risk profiles than younger customers. Health risks generally increase with age, so older customers may be more likely to buy insurance, a hypothesis the data should confirm.

4. **Driving_License**: This binary feature indicates whether the individual has a valid driving license (1 = Yes, 0 = No). While it seems irrelevant to health insurance, it may be correlated with other behavioral traits or eligibility conditions for certain insurance products.

5. **Region_Code**: This numeric feature encodes different regions where customers live. Regional patterns could help in identifying locations where insurance penetration is higher or lower, which can guide marketing and outreach strategies.

6. **Previously_Insured**: A critical factor, this binary variable (1 = Yes, 0 = No) indicates whether the customer is already insured. Customers who are already insured may be less likely to purchase another insurance product, which makes this feature highly informative for predicting the target outcome.

7. **Vehicle_Age**: This categorical variable classifies the customer's vehicle into three groups: "< 1 Year", "1-2 Year", and "> 2 Years." While vehicle age doesn’t directly impact health insurance needs, it may signal how risk-averse or financially conservative the individual is, which might indirectly affect their likelihood to buy insurance.

8. **Vehicle_Damage**: This is another binary variable (Yes/No) indicating whether the vehicle has been damaged in the past. People with a damaged vehicle may have a higher risk profile or could be more inclined to buy insurance for protection, including health insurance.

9. **Annual_Premium**: The premium the customer pays for their current insurance policy. This continuous feature may reflect the customer's financial capacity and interest in insurance products: higher premiums may indicate more comprehensive insurance, while lower premiums may indicate basic coverage.

10. **Policy_Sales_Channel**: This feature is a numeric code representing the distribution channel through which the insurance was sold (e.g., online, through agents, or other means). Understanding the effectiveness of different sales channels is crucial for tailoring marketing efforts and identifying the channels with the highest conversion rates.

11. **Vintage**: This indicates the number of days the customer has been associated with the insurance provider. Customers who have been with a provider for longer periods might show higher loyalty, but could also be less likely to switch or purchase additional products if their needs are already met.

12. **Response**: This is the target variable (1 = Yes, 0 = No), which indicates whether the customer purchased the health insurance or not. The goal is to predict this outcome based on the other features in the dataset.

#### Initial Insights:
- **Previously_Insured** is likely to be one of the most important features. If a customer is already insured, they are likely to decline new insurance offers, resulting in a `Response` of 0.
- **Annual_Premium** and **Vintage** could be key predictors of customer behavior. Higher premiums might indicate higher engagement or financial readiness to buy more insurance, while longer vintage could point to greater loyalty and likelihood of buying more products.
- **Vehicle_Age** and **Vehicle_Damage** could offer indirect insights into the customer’s risk tolerance and propensity to invest in protection, including health insurance.

#### Conclusion:
The dataset is rich in features that relate directly and indirectly to an individual's likelihood to buy health insurance. To fully unlock insights, a thorough exploratory data analysis (EDA) and feature engineering would be necessary to uncover relationships between features and to build a predictive model. Factors such as previous insurance status, age, and premium amount will likely play a crucial role in determining insurance purchase decisions.

By focusing on these relationships, companies can better target potential customers and improve their cross-selling efforts.

# **GitHub Link -**

https://github.com/vkstar444/TRAIN-HEALTH-INSURANCE-CROSS-SELL-PREDICTION

# **Problem Statement**



The goal of this project is to predict whether a customer will purchase a health insurance product based on their demographic, vehicle-related, and policy-related information. This problem addresses the need for insurance companies to identify customers who are more likely to buy additional insurance products, enabling more targeted marketing strategies and efficient allocation of resources.

With a large dataset of over 380,000 records, we aim to build a predictive model that can accurately forecast customer behavior, particularly their likelihood to respond positively to health insurance offers. Key factors include demographic variables (e.g., age, gender, region), vehicle characteristics (e.g., vehicle age, past damage), and insurance history (e.g., whether the customer is already insured, annual premium paid). The objective is to optimize cross-selling efforts by predicting the `Response` variable, which indicates whether a customer purchased health insurance (`1` for Yes, `0` for No).

The model will be instrumental in helping insurance companies:
- **Increase sales**: By targeting customers with the highest probability of purchasing.
- **Enhance customer retention**: By identifying loyal or long-term customers likely to respond to cross-sell offers.
- **Optimize marketing strategies**: By understanding key customer attributes that drive purchase decisions.

The challenge lies in finding meaningful patterns within the data that can distinguish between likely buyers and non-buyers, thereby improving the efficiency and effectiveness of the company's sales operations.

# ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv('/content/drive/MyDrive/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION(Classification)_DEEP_LEARNING/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv')

### Dataset First View

In [None]:
# Dataset First Look
print('Dataset First Look')
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("\nDataset Rows & Columns count:")
print(f"Rows: {data.shape[0]}, Columns: {data.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
print("\nDataset Information:")
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("\nDataset Duplicate Value Count:")
print(f"Duplicate Values: {data.duplicated().sum()}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("\nMissing Values/Null Values Count:")
print(data.isnull().sum())

In [None]:
# Visualizing the missing values
print("\nVisualizing the missing values:")
sns.heatmap(data.isnull(), cbar=False)
plt.show()

### What did you know about your dataset?

#### Understanding the Dataset

The dataset consists of 381,109 records, each representing a customer, with 12 columns that provide demographic, vehicle-related, and insurance-related information. Here’s a breakdown of what is known about the dataset:

#### Key Attributes:
1. **Demographic Information**:
   - **Gender**: Categorical variable indicating the gender of the customer (Male or Female).
   - **Age**: Numerical variable giving the age of the customer. Age is expected to be a useful predictor of insurance purchase, since health risks generally increase with age.
   - **Region_Code**: Numeric code for the region where the customer lives.
   - **Driving_License**: Binary variable indicating whether the customer has a valid driving license. Most customers have a driving license, so this feature may add little variation, but it could still be relevant.

2. **Vehicle-Related Information**:
   - **Vehicle_Age**: Categorical variable with three groups (`< 1 Year`, `1-2 Year`, `> 2 Years`) giving the age of the customer's vehicle. Vehicle age can be an indicator of financial conservatism or risk aversion.
   - **Vehicle_Damage**: Binary variable indicating whether the customer’s vehicle has been damaged before. This variable indirectly hints at the customer’s risk tolerance.

3. **Insurance-Related Information**:
   - **Previously_Insured**: A binary variable that shows whether the customer already has insurance. This is a highly informative feature because customers who are already insured may not be interested in purchasing additional coverage.
   - **Annual_Premium**: A continuous variable representing the amount paid for the current insurance policy. Higher premiums may reflect a customer’s financial capacity and inclination toward buying more comprehensive coverage.
   - **Policy_Sales_Channel**: Numeric variable that represents the sales channel (e.g., agents, online) through which the policy was sold. Different sales channels might have varying effectiveness.
   - **Vintage**: The number of days the customer has been associated with the insurance provider. Customers with a longer association might exhibit higher loyalty, impacting their likelihood to buy more products.

4. **Target Variable**:
   - **Response**: This is the main target variable (1 = Yes, 0 = No), representing whether the customer bought health insurance or not.

#### Data Type Summary:
- The dataset mixes categorical variables (e.g., Gender, Vehicle_Age, Vehicle_Damage) and numerical variables (e.g., Age, Annual_Premium, Vintage).
- According to the null-value check above, no column has missing values.
- The dataset is well-structured, and most columns have a plausible relationship to the target variable (`Response`) that the modeling step can test.

#### Key Insights:
- **Previously_Insured** is likely to be a critical feature because customers who already have insurance may not be interested in buying more, leading to a `Response` of 0.
- **Annual_Premium** and **Vintage** could help in predicting customer behavior, with higher premiums indicating a greater likelihood to buy more products, and longer vintage suggesting loyalty.
- **Vehicle_Damage** and **Vehicle_Age** may indirectly influence the customer’s risk profile and affect their insurance decisions.

Overall, the dataset contains rich information that can be used to predict whether a customer will purchase health insurance, which is useful for building predictive models for marketing and sales optimization.
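
The insights above are hypotheses until they are checked against the data. A minimal sketch of such a check follows, using a small made-up frame in place of `data`. The rows in `toy` are illustrative assumptions, not values from the dataset; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `data` frame.
toy = pd.DataFrame({
    "Previously_Insured": [0, 0, 0, 0, 1, 1, 1, 1],
    "Vehicle_Damage": ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"],
    "Response": [1, 0, 0, 1, 0, 0, 0, 0],
})

# Share of each class in the target (useful for spotting class imbalance).
class_share = toy["Response"].value_counts(normalize=True)

# The mean of a 0/1 target within a group is that group's response rate.
rate_by_insured = toy.groupby("Previously_Insured")["Response"].mean()
rate_by_damage = toy.groupby("Vehicle_Damage")["Response"].mean()

print(class_share)
print(rate_by_insured)
print(rate_by_damage)
```

Running the same `groupby` calls on `data` before encoding would show whether `Previously_Insured` and `Vehicle_Damage` separate buyers from non-buyers as strongly as expected.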

In [None]:
# Convert categorical variables to numeric
data['Gender'] = data['Gender'].map({'Male': 1, 'Female': 0})
data['Vehicle_Damage'] = data['Vehicle_Damage'].map({'Yes': 1, 'No': 0})
data['Previously_Insured'] = data['Previously_Insured'].astype(int)
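
One caveat with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN` without raising an error. The sketch below shows the effect on a hypothetical series (the misspelled `"male"` is a deliberate assumption for illustration), and how to count unmapped values afterwards.

```python
import pandas as pd

# Hypothetical labels; "male" deliberately does not match the mapping keys.
gender = pd.Series(["Male", "Female", "male", "Female"])

mapped = gender.map({"Male": 1, "Female": 0})

# Labels absent from the dictionary turn into NaN, so count them explicitly.
n_unmapped = int(mapped.isna().sum())
print(f"Unmapped values: {n_unmapped}")
```

A check such as `data[['Gender', 'Vehicle_Damage']].isna().sum()` right after the mapping cell would confirm that every label in the real data was covered.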

In [None]:
# One-hot encode the multi-level categorical variables
# (dtype=int yields 0/1 columns rather than booleans)
data = pd.get_dummies(data, columns=['Vehicle_Age', 'Policy_Sales_Channel'], drop_first=True, dtype=int)
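
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical frame. The channel codes are made-up placeholders, and `dtype=int` is used only so the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical rows; the channel codes are placeholders, not real codes.
toy = pd.DataFrame({
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"],
    "Policy_Sales_Channel": [26, 152, 26, 124],
})

encoded = pd.get_dummies(
    toy,
    columns=["Vehicle_Age", "Policy_Sales_Channel"],
    drop_first=True,  # drop one level per column to avoid a redundant indicator
    dtype=int,        # 0/1 integers instead of booleans
)

print(encoded.columns.tolist())
print(encoded)
```

A column with k distinct levels becomes k - 1 indicator columns, and the dropped level is represented by all of its indicators being zero. For a code column with many distinct values, such as `Policy_Sales_Channel`, this can widen the feature matrix considerably.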

In [None]:
# Define features and target variable
X = data.drop(['id', 'Response'], axis=1)
y = data['Response']

In [None]:
# Split the data into training and testing sets
# (stratify=y keeps the Response class ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
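
The scaler is fitted on the training rows only and then reused on the test rows, so no test-set statistics leak into preprocessing. A minimal sketch on synthetic data (the arrays `X_toy` and `y_toy` are random placeholders, not the notebook's features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # hypothetical features
y_toy = rng.integers(0, 2, size=200)                      # hypothetical 0/1 target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

scaler_toy = StandardScaler()
X_tr_s = scaler_toy.fit_transform(X_tr)  # learns mean/std from training rows only
X_te_s = scaler_toy.transform(X_te)      # reuses those statistics on the test rows

# Training columns end up with mean ~0 and std ~1; test columns are only close.
print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
print(X_te_s.mean(axis=0).round(3), X_te_s.std(axis=0).round(3))
```

The test columns are not exactly standardized, which is expected: they are transformed with the training statistics, just as unseen data would be.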

In [None]:
# Build the Deep Learning model
model = keras.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [None]:
model.summary()
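
The parameter totals reported by `model.summary()` can be reproduced by hand: each `Dense` layer has one weight per input-output pair plus one bias per unit. The helper `dense_param_count` below is a hypothetical illustration, and `n_features = 10` is an assumed placeholder, since the real input width depends on how many one-hot columns the encoding produced.

```python
def dense_param_count(n_inputs, layer_units):
    """Total trainable parameters of a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        total += fan_in * units + units  # weights plus biases
        fan_in = units                   # this layer's output feeds the next
    return total


n_features = 10  # hypothetical; use X_train.shape[1] in the notebook
total_params = dense_param_count(n_features, [64, 32, 1])

# (10*64 + 64) + (64*32 + 32) + (32*1 + 1) = 704 + 2080 + 33
print(total_params)  # 2817
```

Only the first layer's count changes with the number of input features; the two later layers always contribute 2,080 and 33 parameters for this architecture.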

In [None]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
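
The `binary_crossentropy` loss scores how well the sigmoid output matches the 0/1 label: for one example with label y and predicted probability p, the loss is -(y log p + (1 - y) log(1 - p)), averaged over the batch. The NumPy function below is a sketch of that formula for illustration, not Keras's implementation; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log() never receives exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_example.mean())


labels = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]  # mostly right and confident
unsure = [0.5, 0.5, 0.5, 0.5]     # always predicts 50%

loss_confident = binary_crossentropy(labels, confident)
loss_unsure = binary_crossentropy(labels, unsure)

print(round(loss_confident, 4))
print(round(loss_unsure, 4))  # equals ln(2) for constant 0.5 predictions
```

A model that always outputs 0.5 scores ln 2 (about 0.693), so a validation loss near that value would suggest the network has learned little beyond guessing.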



In [None]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")
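
Accuracy alone can hide poor performance on the positive class when buyers are a small share of customers, so it may help to look at a confusion matrix and ROC AUC as well. The sketch below uses made-up labels and probabilities; in the notebook, the probabilities would presumably come from `model.predict(X_test).ravel()`, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels and predicted probabilities (2 buyers out of 10).
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob_toy = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.40, 0.10, 0.60, 0.70, 0.35])

# Turn probabilities into class predictions with an assumed 0.5 threshold.
y_pred_toy = (y_prob_toy >= 0.5).astype(int)

accuracy_toy = float((y_pred_toy == y_true_toy).mean())
cm = confusion_matrix(y_true_toy, y_pred_toy)  # rows: true class, columns: predicted
auc = roc_auc_score(y_true_toy, y_prob_toy)    # threshold-independent ranking quality

print(f"Accuracy: {accuracy_toy:.2f}")
print(cm)
print(f"ROC AUC: {auc:.3f}")
```

In this toy case the accuracy is 0.80 even though only one of the two buyers is identified, which is the kind of gap the confusion matrix makes visible.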

In [None]:
# Plot training history
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()

In [None]:


# Plot training and validation loss
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()