<a href="https://colab.research.google.com/github/swapnil-mergal/HEALTH-INSURANCE-CROSS-SELL-PREDICTION-PROJECT/blob/main/Individual_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HEALTH INSURANCE CROSS SELL PREDICTION

# **Problem Statement**

Our client is an insurance company that has provided Health Insurance to its customers. They now need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in the Vehicle Insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/- so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000/-, that is where the concept of probabilities comes into the picture. For example, there may be 100 customers like you, each paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
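The arithmetic behind this risk-sharing example can be checked with a short sketch. It uses only the figures quoted above; the assumption that every hospitalised customer claims the full cover is ours and gives a worst-case bound, not a typical payout.

```python
# Figures from the example above: 100 customers, Rs. 5000 premium, Rs. 200,000 cover.
customers = 100
premium_per_customer = 5_000   # Rs. per year
max_cover = 200_000            # Rs. per hospitalised customer

# Total premium collected in one year
premium_pool = customers * premium_per_customer

def worst_case_payout(claim_count: int) -> int:
    """Upper bound on payouts, assuming every claimant uses the full cover."""
    return claim_count * max_cover

for claims in (2, 3):  # the "say 2-3" hospitalisations from the text
    payout = worst_case_payout(claims)
    print(f"{claims} claims: pool Rs. {premium_pool:,} vs worst-case payout Rs. {payout:,}")
```

With two full claims the pool covers the payout; with three it does not. This is why premiums are priced from the expected claim rate and the expected claim size rather than from the maximum cover.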

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays compensation (called ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

To predict whether a customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

# **Attribute Information**

1. id : Unique ID for the customer

2. Gender : Gender of the customer

3. Age : Age of the customer

4. Driving_License : 0 : Customer does not have DL, 1 : Customer already has DL

5. Region_Code : Unique code for the region of the customer

6. Previously_Insured : 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

7. Vehicle_Age : Age of the Vehicle

8. Vehicle_Damage : 1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past

9. Annual_Premium : The amount the customer needs to pay as premium in the year

10. Policy_Sales_Channel : Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.

11. Vintage : Number of days the customer has been associated with the company

12. Response : 1 : Customer is interested, 0 : Customer is not interested
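A quick way to confirm that a loaded file matches this attribute list is to compare column names. The sketch below is only illustrative: the helper `missing_columns` and the toy frame are hypothetical, and the column names are the ones used by the code cells in this notebook.

```python
import pandas as pd

# Column names as used by the code cells in this notebook.
EXPECTED_COLUMNS = [
    "id", "Gender", "Age", "Driving_License", "Region_Code",
    "Previously_Insured", "Vehicle_Age", "Vehicle_Damage",
    "Annual_Premium", "Policy_Sales_Channel", "Vintage", "Response",
]

def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df, in list order."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Tiny made-up frame that lacks most columns, just to show the helper.
toy = pd.DataFrame({"id": [1], "Gender": ["Male"], "Age": [44]})
print(missing_columns(toy))
```

On the real data, `missing_columns(df1)` returning an empty list would indicate that all twelve attributes are present.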

# Loading Dataset

In [None]:
# Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Models
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Evaluation Metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve


# Load Dataset

In [None]:
#reading dataset
df1=pd.read_csv('/content/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.CSV')


In [None]:
df1.head()

In [None]:
#Descriptive statistics
df1.describe()

In [None]:
#Shape of data
df1.shape

In [None]:
df1.columns

* Our dataset has no null values.

In [None]:
df1.info()

# Step 3: Exploratory Data Analysis (EDA)

# missing values

In [None]:
print("Missing Values:\n", df1.isnull().sum())


In [None]:
df1.columns

# Correlation heatmap

In [None]:
# Encode the string-valued columns on a copy, so df1 keeps its original
# labels for the plots and explicit mappings in later cells.
# Policy_Sales_Channel is already numeric and needs no encoding.
categorical_cols = ['Gender', 'Vehicle_Age', 'Vehicle_Damage']

df_corr = df1.copy()
for col in categorical_cols:
    # LabelEncoder is imported at the top; fit a fresh encoder per column
    df_corr[col] = LabelEncoder().fit_transform(df_corr[col])

# Correlation heatmap on the fully numeric copy
plt.figure(figsize=(12, 8))
sns.heatmap(df_corr.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

# Visualize Target Variable

In [None]:
df_sample = df1.sample(n=10000, random_state=42)  # 10,000 random rows
sns.countplot(x=df_sample['Response'])
plt.title("Distribution of Target Variables (Sampled)")
plt.show()

# Vehicle Age Distribution

In [None]:
sns.countplot(x=df1['Vehicle_Age'])  # pass the column as x; recent seaborn treats a positional argument as data
plt.title("Vehicle Age Distribution")
plt.show()


# Step 4: Preprocess Data
# Encoding Categorical Variables

In [None]:
# Encode 'Gender', 'Vehicle_Age', 'Vehicle_Damage'
le = LabelEncoder()
df1['Gender'] = le.fit_transform(df1['Gender'])
df1['Vehicle_Age'] = df1['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})
df1['Vehicle_Damage'] = le.fit_transform(df1['Vehicle_Damage'])
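One thing to watch with `.map()` is that any value missing from the dictionary silently becomes NaN, for example if `Vehicle_Age` was already converted to integers by an earlier cell. A minimal sketch of a stricter version follows; the helper name `encode_vehicle_age` is hypothetical, and the labels are the ones used in the mapping above.

```python
import pandas as pd

# Category labels exactly as they appear in the mapping used above.
VEHICLE_AGE_ORDER = {'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2}

def encode_vehicle_age(series: pd.Series) -> pd.Series:
    """Map labels to ordinal codes and fail loudly on unmapped values."""
    encoded = series.map(VEHICLE_AGE_ORDER)
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped Vehicle_Age values: {list(unmapped)}")
    return encoded.astype(int)

toy = pd.Series(['< 1 Year', '> 2 Years', '1-2 Year'])
print(encode_vehicle_age(toy).tolist())  # [0, 2, 1]
```

Raising an error here is a design choice: it surfaces a double-encoding mistake immediately instead of letting NaNs reach the model.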


# Feature Scaling

In [None]:
scaler = StandardScaler()
numerical_features = ['Age', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']
df1[numerical_features] = scaler.fit_transform(df1[numerical_features])
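The cell above fits the scaler on the full dataset before splitting, so test rows influence the scaling statistics. A common alternative, sketched here on synthetic data (the `*_demo` names and the random features are assumptions, not part of the dataset), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic stand-in features
y_demo = rng.integers(0, 2, size=200)                     # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics learned from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6))
```

The training columns end up with mean zero, while the test columns are only approximately centred, which is expected.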


# Split Data

In [None]:
# Features and Target
X = df1.drop(columns=['Response'])
y = df1['Response']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Step 5: Implement Algorithms
1. Logistic Regression



In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))


2. Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))


In [None]:

# Clean Column Names
df1.columns = df1.columns.str.strip()

# Use a Subset of the Data for Faster Execution (Optional)
df1_sample = df1.sample(n=10000, random_state=42)  # Sample 10,000 rows for testing

# Define Features and Target Variable
X = df1_sample.drop(['Response', 'id'], axis=1, errors='ignore')  # Drop target & ID columns
y = df1_sample['Response']

# Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the Data for Faster SVM Convergence
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a Linear SVM (Faster Alternative to SVC)
from sklearn.svm import LinearSVC  # not covered by the imports at the top of the notebook
svm_model = LinearSVC(max_iter=1000, random_state=42)
svm_model.fit(X_train_scaled, y_train)

# Make Predictions
y_pred = svm_model.predict(X_test_scaled)

# Evaluate the Model
print(classification_report(y_test, y_pred))

3. Support Vector Machine

In [None]:
svm = SVC(probability=True)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))


4. Decision Tree

In [None]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))


5. XGBoost

In [None]:
xgb = XGBClassifier(eval_metric='logloss')  # use_label_encoder is deprecated in recent XGBoost releases
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))


# Step 6: Evaluate Models
# Confusion Matrix and Classification Report



In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2 and is not needed below

models = {'Logistic Regression': lr, 'Random Forest': rf, 'SVM': svm, 'Decision Tree': dt, 'XGBoost': xgb}

for name, model in models.items():
    print(f"\n{name}")
    print("Confusion Matrix:\n", confusion_matrix(y_test, model.predict(X_test)))
    print("Classification Report:\n", classification_report(y_test, model.predict(X_test)))


# ROC CURVE

In [None]:
plt.figure(figsize=(10, 8))

for name, model in models.items():
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], 'k--')
plt.title("ROC Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()


# Step 7: Save and Export Results

In [None]:
# Save model predictions
output = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_xgb})
output.to_csv("predictions.csv", index=False)


In [None]:
#Dependent variable 'Response'
plt.figure(figsize=(8,7))
sns.set_theme(style='whitegrid')
sns.countplot(x=df1['Response'],data=df1)

* From the figure above we can see that the data is highly imbalanced.
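The imbalance can also be stated as numbers rather than read off the plot. The sketch below uses a made-up 9:1 target to show the idea; the helper `class_balance` is hypothetical, and on the real data it would be called as `class_balance(df1['Response'])`.

```python
import pandas as pd

def class_balance(target: pd.Series) -> pd.DataFrame:
    """Count and share of each class in a target column."""
    counts = target.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Made-up target with a 9:1 split, only for illustration.
toy_response = pd.Series([0] * 9 + [1] * 1)
summary = class_balance(toy_response)
print(summary)
print("imbalance ratio:", summary["count"].max() / summary["count"].min())
```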

In [None]:
#Distribution of Age
plt.figure(figsize=(15,8))
sns.countplot(x=df1['Age'],data=df1)

* From the age distribution above we can see that most customers are between 21 and 25 years old. There are few customers above the age of 60.

In [None]:

plt.figure(figsize=(7,9))
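insured_counts = df1['Previously_Insured'].value_counts()
plt.pie(insured_counts, autopct='%.0f%%', shadow=True, startangle=200, explode=[0.01,0])
# value_counts() sorts by frequency, so take legend labels from the index (1 = insured)
plt.legend(labels=list(insured_counts.index.map({1: 'Insured', 0: 'Not insured'})))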
plt.show()


* Customers split roughly 54% to 46% between the two Previously_Insured groups.
* Customers who are not previously insured are likely to be interested.

In [None]:
plt.figure(figsize=(15,9))
a=df1['Annual_Premium']
sns.histplot(a, kde=True, color='purple')  # distplot is deprecated in recent seaborn releases

* From the distribution plot we can infer that the annual premium variable is right-skewed.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x=df1['Annual_Premium'])

* From the boxplot above we can see that there are a lot of outliers in the annual premium.
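A boxplot flags points beyond 1.5 times the interquartile range from the quartiles by default, and the same rule can be applied directly to count the outliers. This is a minimal sketch on made-up values; the helper `iqr_outlier_mask` is hypothetical, and on the real data it would be applied to `df1['Annual_Premium']`.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up premiums with one obvious extreme value.
toy_premium = pd.Series([10, 12, 11, 13, 12, 11, 100])
mask = iqr_outlier_mask(toy_premium)
print(toy_premium[mask].tolist())  # [100]
print("outlier share:", mask.mean())
```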

In [None]:
plt.figure(figsize=(5,7))
sns.countplot(x=df1['Vehicle_Damage'])

* Customers with Vehicle_Damage are likely to buy insurance

In [None]:
df1['Vehicle_Age'].hist();

* From the plot above we can see that most customers have vehicles between 1 and 2 years old, and very few have vehicles older than 2 years.

# Bivariate analysis

In [None]:
#Age VS Response
plt.figure(figsize=(16,8))
sns.countplot(data=df1, x='Age',hue='Response', palette='CMRmap_r')
plt.xlabel('Age response')
plt.ylabel('count')
plt.show()

* People aged 31 to 50 are more likely to respond.
* Young people below 30 are not interested in vehicle insurance.


In [None]:
#Gender vs Response
df1.groupby(['Gender', 'Response']).size().unstack().plot(kind = 'bar', stacked = True)

* The male category is slightly larger than the female category, and its chance of buying the insurance is also slightly higher.

In [None]:
plt.figure(figsize = (10,6) )
sns.countplot(data = df1, x = 'Vehicle_Age', hue = 'Response', palette='Dark2_r')
plt.xlabel('Vehicle Age', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.title('Vehicle Age and Customer Response analysis', fontsize = 19)
plt.show()

* Customers with a vehicle age of 1-2 years are more likely to be interested than the other two groups.

* Customers with Vehicle_Age < 1 year have very little chance of buying insurance.

In [None]:
sns.barplot(x = 'Response', y ='Annual_Premium', data = df1)

* People who respond have a slightly higher annual premium.

In [None]:
plt.figure(figsize = (20, 8))
sns.heatmap(df1.corr(), annot = True)

* The target variable is not much affected by the Vintage variable. We can drop the least correlated variables.

# Encoding Object columns

* Changing categorical values to numerical values

In [None]:
df1['Gender'] = df1['Gender'].map({'Female':1, 'Male':0})
df1.head()

In [None]:
df1['Vehicle_Age']= df1['Vehicle_Age'].map({'< 1 Year':0,'1-2 Year':1,'> 2 Years':2})
df1.head()

In [None]:
df1['Vehicle_Damage']=df1['Vehicle_Damage'].map({'Yes':1, 'No':0})
df1.head()

# Feature Engineering and Feature Selection

In [None]:
correlation = df1.corr()
correlation['Response'].sort_values(ascending = False)[1:]

In [None]:
X=df1.drop(columns=['id','Driving_License','Policy_Sales_Channel','Vintage','Response'])# independent variable
y = df1['Response']# dependent variable

In [None]:
# Fill any NaNs with the column mode

# mode() returns a Series (ties are possible), so take its first value
fill_mode = lambda col: col.fillna(col.mode().iloc[0])
X = X.apply(fill_mode, axis=0)
df1 = df1.apply(fill_mode, axis=0)

# Model Building

In [None]:
# check for imbalance in data
df1['Response'].value_counts()

* We can clearly see that there is a huge difference between the two class counts.
* Standard ML techniques such as Decision Tree and Logistic Regression are biased towards the majority class and tend to ignore the minority class. To address this issue we use a resampling technique.
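To make the idea concrete, here is a minimal sketch of random oversampling done by hand on synthetic data. It departs from the notebook's own cell in two ways. It uses `sklearn.utils.resample` instead of imbalanced-learn's `RandomOverSampler`. It also oversamples only the training split, a choice that keeps duplicated minority rows out of the test set. The toy frame and its `feature` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in: 90 negatives and 10 positives.
toy = pd.DataFrame({"feature": np.arange(100), "Response": [0] * 90 + [1] * 10})

# Split first, keeping the class ratio in both parts.
train, test = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["Response"]
)

majority = train[train["Response"] == 0]
minority = train[train["Response"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
train_balanced = pd.concat([majority, minority_upsampled])

print(train_balanced["Response"].value_counts().to_dict())
print("test rows left untouched:", len(test))
```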


In [None]:
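#Resampling
from collections import Counter
from imblearn.over_sampling import RandomOverSampler  # from the imbalanced-learn package

ros = RandomOverSampler(random_state=0)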
X_new,y_new= ros.fit_resample(X, y)

print("After Random Over Sampling Of Minor Class Total Samples are :", len(y_new))
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))


# Splitting the data in train and test sets

In [None]:
X_train, X_test ,y_train, y_test=  train_test_split(X_new, y_new, random_state=42, test_size=0.3)
X_train.shape, X_test.shape , y_train.shape, y_test.shape

In [None]:
# Normalizing the Dataset using Standard Scaling Technique.
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

# Logistic Regression

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
data = pd.read_csv('/content/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.CSV')

# Separate features and target variable
X = data.drop('Response', axis=1)  # Features
y = data['Response']               # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess the data using ColumnTransformer and Pipeline
categorical_features = ['Gender']  # Add more categorical columns if needed
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Instantiate the LogisticRegression model
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, pred)
classification_rep = classification_report(y_test, pred)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_rep)


**Model Evaluation**

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Evaluate the model
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
conf_matrix = confusion_matrix(y_test, pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print("Confusion Matrix:\n", conf_matrix)


In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Predicted probability of the positive class from the fitted pipeline
prob = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, prob)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--', color='black')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()


**confusion_matrix**

In [None]:
matrix= confusion_matrix(y_test, pred)
print(matrix)
sns.heatmap(matrix ,annot=True, fmt='g')

* From the confusion matrix we see that the model predicts both positive and negative responses.
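The headline metrics can be derived directly from the four cells of a confusion matrix. The counts below are hypothetical and are not this model's results; they only illustrate how accuracy can look high on imbalanced data while recall stays low.

```python
import numpy as np

# Hypothetical counts in scikit-learn's layout: rows = actual, columns = predicted.
matrix_demo = np.array([[80, 10],
                        [ 6,  4]])
tn, fp, fn, tp = matrix_demo.ravel()

accuracy_demo = (tp + tn) / matrix_demo.sum()
precision_demo = tp / (tp + fp)   # of predicted positives, how many are right
recall_demo = tp / (tp + fn)      # of actual positives, how many are found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"accuracy={accuracy_demo:.2f} precision={precision_demo:.2f} "
      f"recall={recall_demo:.2f} f1={f1_demo:.2f}")
```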

In [None]:
print(classification_report(y_test, pred))  # argument order is (y_true, y_pred)

# Random Forest Classifier

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
data = pd.read_csv('/content/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.CSV')

# Separate features and target variable
X = data.drop('Response', axis=1)  # Features
y = data['Response']               # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess the data using OneHotEncoder for categorical variables
categorical_features = ['Gender']  # Add more categorical columns if needed
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Instantiate the RandomForestClassifier model
RF_model = RandomForestClassifier(random_state=42)

# Create a pipeline with preprocessing and modeling steps
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RF_model)
])

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
rf_pred = model.predict(X_test)

# Obtain prediction probabilities
rf_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, rf_pred)
classification_rep = classification_report(y_test, rf_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_rep)


**Model Evaluation**

In [None]:
# Evaluation
r_rf=  recall_score(y_test, rf_pred)
print("recall_score : ", r_rf)

p_rf= precision_score(y_test, rf_pred)
print("precision_score :",p_rf)

f1_rf= f1_score(y_test, rf_pred)
print("f1_score :", f1_rf)

A_rf= accuracy_score(y_test, rf_pred)
print("accuracy_score :",A_rf)

acu_rf = roc_auc_score(y_test, rf_proba)  # (y_true, y_score): use probabilities, not hard labels
print("ROC_AUC Score:",acu_rf)
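ROC AUC measures how well scores rank positives above negatives, so it is normally computed from predicted probabilities rather than from hard 0/1 labels. The sketch below uses made-up values (the `*_demo` arrays are assumptions) to show that thresholding the scores first throws away ranking information and changes the result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and scores, for illustration only.
y_true_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0])
proba_demo = np.array([0.1, 0.2, 0.3, 0.35, 0.8, 0.4, 0.45, 0.05])
labels_demo = (proba_demo >= 0.5).astype(int)  # hard labels at a 0.5 threshold

auc_from_proba = roc_auc_score(y_true_demo, proba_demo)
auc_from_labels = roc_auc_score(y_true_demo, labels_demo)

print("AUC from probabilities:", round(auc_from_proba, 3))
print("AUC from hard labels:  ", round(auc_from_labels, 3))
```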

In [None]:
fpr, tpr, _ = roc_curve(y_test, rf_proba)

plt.title('Random Forest ROC curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle="--",color='black')
plt.show()

**confusion_matrix**

In [None]:
matrix= confusion_matrix(y_test,rf_pred)
print(matrix)
sns.heatmap(matrix ,annot=True, fmt='g')

The confusion matrix shows that the model is now much better at predicting positive responses.



In [None]:
print(classification_report(y_test, rf_pred))  # argument order is (y_true, y_pred)

The model performs very well, so we can use it to predict unknown data.

# XGBoost

In [None]:


# Separate features and target variable
X = data.drop('Response', axis=1)
y = data['Response']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess the data using OneHotEncoder for categorical variables
categorical_features = ['Gender', 'Vehicle_Age', 'Vehicle_Damage']  # Add more categorical columns if needed
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Instantiate the XGBClassifier model
xgb_model = XGBClassifier(random_state=42)

# Create a pipeline with preprocessing and modeling steps
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', xgb_model)
])

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
xgb_pred = model.predict(X_test)

# Obtain prediction probabilities
xgb_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, xgb_pred)
classification_rep = classification_report(y_test, xgb_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_rep)


**Model Evaluation**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the dataset
data = pd.read_csv('/content/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.CSV')

# Separate features and target variable
X = data.drop('Response', axis=1)
y = data['Response']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess the data using OneHotEncoder for categorical variables
categorical_features = ['Gender', 'Vehicle_Age', 'Vehicle_Damage']
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Instantiate the XGBClassifier model
xgb_model = XGBClassifier(random_state=42)

# Create a pipeline with preprocessing and modeling steps
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', xgb_model)
])

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
XG_pred = model.predict(X_test)

# Obtain prediction probabilities
XG_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model
r_XG = recall_score(y_test, XG_pred)
print("recall_score:", r_XG)

p_XG = precision_score(y_test, XG_pred)
print("precision_score:", p_XG)

f1_XG = f1_score(y_test, XG_pred)
print("f1_score:", f1_XG)

A_XG = accuracy_score(y_test, XG_pred)
print("accuracy_score:", A_XG)

acu_XG = roc_auc_score(y_test, XG_proba)
print("ROC_AUC Score:", acu_XG)


In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, XG_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='XGBoost')
plt.plot([0, 1], [0, 1], linestyle='--', color='black')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()


**confusion_matrix**

In [None]:
matrix= confusion_matrix(y_test,XG_pred)
print(matrix)
sns.heatmap(matrix ,annot=True, fmt='g')

From the confusion matrix we see that the model is a bit better at predicting positive responses.

In [None]:
print(classification_report(y_test, XG_pred))  # argument order is (y_true, y_pred)

# Comparing the Models



In [None]:
# Check column names
for col in X_train.columns:
    if '[' in col or ']' in col or '<' in col:
        new_col_name = col.replace('[', '_').replace(']', '_').replace('<', '_')
        X_train.rename(columns={col: new_col_name}, inplace=True)

# Check column names
for col in X_test.columns:
    if '[' in col or ']' in col or '<' in col:
        new_col_name = col.replace('[', '_').replace(']', '_').replace('<', '_')
        X_test.rename(columns={col: new_col_name}, inplace=True)

# Now proceed with training your XGBoost model
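One possible way to put the models side by side is a single metrics table. The helper `compare_models` below is hypothetical and is shown with made-up predictions for two placeholder models; it is a sketch of the comparison step, not the results of this notebook.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def compare_models(y_true, results):
    """Build a metrics table; results maps a model name to (labels, probabilities)."""
    rows = {}
    for name, (labels, proba) in results.items():
        rows[name] = {
            "accuracy": accuracy_score(y_true, labels),
            "precision": precision_score(y_true, labels, zero_division=0),
            "recall": recall_score(y_true, labels, zero_division=0),
            "f1": f1_score(y_true, labels, zero_division=0),
            "roc_auc": roc_auc_score(y_true, proba),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up labels and scores for two placeholder models.
y_toy = [0, 0, 1, 1]
toy_results = {
    "model_a": ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),
    "model_b": ([0, 1, 0, 1], [0.3, 0.6, 0.4, 0.7]),
}
table = compare_models(y_toy, toy_results)
print(table.sort_values("roc_auc", ascending=False))
```

With the variables defined earlier, a call could look like `compare_models(y_test, {"Random Forest": (rf_pred, rf_proba), "XGBoost": (XG_pred, XG_proba)})`, provided all predictions come from the same test split.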


# Conclusion


* In this project, we aimed to predict whether existing health insurance customers would be interested in purchasing vehicle insurance, utilizing machine learning techniques. The dataset provided valuable insights into various customer attributes that could influence their decision. We employed three different classification models - Logistic Regression, Random Forest, and XGBoost - to build predictive models. The evaluation of these models included key performance metrics such as accuracy, recall, precision, f1-score, and ROC AUC.

* Our analysis revealed that the XGBoost model outperformed the other two models in terms of accuracy, recall, precision, and f1-score. This indicates that XGBoost is particularly effective at capturing true positives, minimizing false negatives, and providing an overall better balance between precision and recall. Additionally, the ROC AUC score for the XGBoost model indicated its strong discriminative power in distinguishing between positive and negative cases.

* The success of the project can be attributed to several factors, including the careful preprocessing of data, feature engineering, and selecting appropriate models. The one-hot encoding of categorical variables, standardization of numerical features, and appropriate train-test splitting ensured the models' accuracy and generalization to new data.

* It is important to note that while the models demonstrated promising results, the ultimate application of the models should be accompanied by careful consideration of business context, potential risks, and cost-benefit analysis. Moreover, continuous monitoring and periodic model updates are essential to maintain their relevance and effectiveness over time.

* In conclusion, this project showcases the effectiveness of machine learning techniques in predicting customer behavior, specifically in the context of cross-selling vehicle insurance to health insurance customers. The results obtained can provide valuable insights to insurance companies in optimizing their marketing strategies and enhancing customer engagement.