<a href="https://www.kaggle.com/code/nurulsakinah/insurance-cross-selling-classification-xgb?scriptVersionId=194502458" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

Aim: build a model that predicts whether the company's policyholders (customers) from the past year will also be interested in its Vehicle Insurance.<br>
Evaluation metric: ROC-AUC (higher is better).
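
To make the metric concrete, here is a minimal sketch of what ROC-AUC measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The helper name `roc_auc` is hypothetical and the pairwise loop is only meant for tiny inputs; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, the metric does not depend on any classification threshold.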


# 1. Import libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from xgboost.sklearn import XGBClassifier
import lightgbm as lgbm
import catboost as cb
import xgboost as xgb
from sklearn.metrics import roc_auc_score
import optuna



# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

# List the files available in the read-only /kaggle/input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# 2. Load the dataset

In [None]:
# read file
df_ori = pd.read_csv('/kaggle/input/playground-series-s4e7/train.csv')
df_test = pd.read_csv('/kaggle/input/playground-series-s4e7/test.csv')

# 3. Overview and understand the data


In [None]:
# get a brief look of the dataset
df_ori.head()

Check the dataset size. The training set is huge: over 11 million rows.

In [None]:
# check shape of the data
df_ori.shape

Check for duplicates and missing values. <br>
No missing values or duplicates were detected.

In [None]:
df_ori.duplicated().sum()

In [None]:
# check missing values
df_ori.isnull().sum()

In [None]:
# check data type
df_ori.info()

# 4. Data Preprocessing

Since the dataset is huge, memory usage can be reduced by converting the columns to smaller data types.<br>
This memory optimization strategy was taken from 
<a href="https://www.kaggle.com/code/jmascacibar/optimizing-memory-usage-with-insurance-cross-sell?kernelSessionId=186392861">this notebook</a> written by JMASCACIBAR.

In [None]:
def converting_datatypes(df):
    '''Reduce memory usage by casting columns to categorical or smaller integer dtypes.'''
    df = df.copy()
    dtypes = {
        'Gender': 'category',
        'Vehicle_Age': 'category',
        'Vehicle_Damage': 'category',
        'Age': 'int8',
        'Driving_License': 'int8',
        'Region_Code': 'int8',
        'Previously_Insured': 'int8',
        'Annual_Premium': 'int32',
        'Policy_Sales_Channel': 'int16',
        'Vintage': 'int16',
        'Response': 'int8',
    }
    try:
        # 'Response' is absent from the test set, so convert only the columns that are present
        present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
        df = df.astype(present)
        # df.info() prints its report itself and returns None, so it is not wrapped in print()
        df.info(memory_usage='deep')
    except Exception as e:
        print(f"An error occurred: {e}")
    return df

Apply the function above to the dataset.

In [None]:
# convert the data types; the function prints the memory usage after conversion
df = converting_datatypes(df_ori)
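
As an alternative to hard-coding a dtype per column, pandas can pick the smallest integer type that holds the observed values. The sketch below is an illustration on a made-up three-row frame, not the approach used in this notebook; `downcast_integers` and `demo` are hypothetical names. Columns that contain non-integer values are left unchanged by `downcast="integer"`.

```python
import numpy as np
import pandas as pd


def downcast_integers(frame):
    """Return a copy where each numeric column uses the smallest integer dtype that fits."""
    out = frame.copy()
    for col in out.select_dtypes(include="number").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Toy data that mimics a few columns of the insurance dataset
demo = pd.DataFrame({
    "Age": np.array([20, 45, 85], dtype="int64"),
    "Vintage": np.array([10, 150, 299], dtype="int64"),
    "Region_Code": np.array([28.0, 8.0, 46.0]),  # whole numbers stored as float64
})
small = downcast_integers(demo)

print(small.dtypes)
print(demo.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

The explicit mapping used above remains the safer choice when the training and test sets must end up with identical dtypes, because automatic downcasting depends on the values present in each frame.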

Separate the categorical features from the continuous ones so that appropriate plots can be chosen for the exploratory data analysis later. The number of unique values in each column indicates which features are categorical.

In [None]:
# check unique values in each column
unique_col = df.nunique()
unique_col

## Separate categorical & continuous variable

In [None]:
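# Columns with at most `max_unique` distinct values are treated as categorical
# (this group also contains the target 'Response')
max_unique = 54
# count unique values once; nunique() is slow on a dataset of this size
n_unique = df.nunique()
# get column names
categorical_col = n_unique[n_unique <= max_unique].index.tolist()
cont_col = n_unique[n_unique > max_unique].index.tolist()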
# drop 'id'
cont_col.remove('id') 

print(categorical_col)
print(cont_col)

## Encode categorical features
The categorical features are encoded as numbers so that they can be included in the correlation analysis.

In [None]:
def encode_categorical_features(df):    
    df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
    df['Vehicle_Age'] = df['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})
    df['Vehicle_Damage'] = df['Vehicle_Damage'].map({'No': 0, 'Yes': 1})
    return df

In [None]:
df = encode_categorical_features(df)
df.head()
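
One thing to watch with `Series.map` and a dictionary: any value that has no entry in the dictionary becomes `NaN` without raising an error. The sketch below shows a small guard that fails loudly instead. `map_strict` is a hypothetical helper and is not part of the notebook.

```python
import pandas as pd


def map_strict(series, mapping):
    """Map values through `mapping`, raising if any value has no entry."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped values in {series.name!r}: {sorted(unknown)}")
    return series.map(mapping)


vehicle_age = pd.Series(["< 1 Year", "1-2 Year", "> 2 Years"], name="Vehicle_Age")
age_codes = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
print(map_strict(vehicle_age, age_codes).tolist())
```

A check like this is mainly useful when the same encoding is applied to a second file, where an unexpected label would otherwise go unnoticed.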

# 5. Exploratory Data Analysis

Explore the data by visualising each feature.<br>

## Data distribution
Distribution of the target 'Response'

In [None]:
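# Check whether the target is imbalanced
response_data = df['Response'].value_counts()

# create a single figure (a separate plt.figure() call would leave an empty extra figure)
fig, ax = plt.subplots(figsize=(4, 4))

# Add percentages to the pie chart; labels come from the index so they always match the slices
ax.pie(response_data, labels=response_data.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff'])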
ax.set_title("Piechart of 'Response' in data", fontsize = 11)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

The pie chart shows that the data is imbalanced, with Response 0 as the majority class.

Distribution of continuous variables

In [None]:
# Boxplot
# Creating grid of subplots
fig, ax = plt.subplots(2, 2, figsize=(13, 13))

ax = ax.flatten()
# Loop through columns and plot box plots
for idx, col in enumerate(cont_col):
    sns.boxplot(data=df, y=col, ax=ax[idx], color='skyblue')
    ax[idx].set_title(f'Box Plot of {col}')
    ax[idx].set_ylabel(col)

# Adjust the layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Flatten the axes array for easier iteration
axes = axes.flatten()

# Loop through columns and axes
for col, ax in zip(cont_col, axes):
    sns.histplot(df[col], bins=20, kde=True, ax=ax)
    ax.set_title(f'Histogram and KDE of {col}', fontsize=12)
    ax.set_xlabel(col, fontsize=10)
    ax.set_ylabel('Count', fontsize=10)  # histplot shows counts by default

# Adjust the layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

Distribution for categorical variables

In [None]:
# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Flatten the axes array for easier iteration
axes = axes.flatten()

# Loop through columns and axes; zip stops at the shorter sequence,
# so at most four categorical columns are plotted
for col, ax in zip(categorical_col, axes):
    sns.countplot(data=df, x=col, hue='Response', ax=ax)
    ax.set_title(f'Count Plot of {col}', fontsize=12)
    ax.set_xlabel(col, fontsize=10)
    ax.set_ylabel('Count', fontsize=10)

# Adjust the layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

The data shows that most customers hold a driving license, and that in these plots Response 0 is more frequent than Response 1.

## Correlation matrix

In [None]:
def correlation_analysis(df):
    '''Visualise the correlation matrix
    and return the most strongly correlated feature pairs.'''

    # Correlation matrix
    df_num = df.select_dtypes(include=[np.number, 'category'])

    # Compute the correlation matrix
    corr = df_num.corr()

    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(15, 10))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask
    sns.heatmap(corr, annot=True, mask=mask, cmap=cmap,
                fmt='.2f', cbar=True, annot_kws={"size": 12})

    # unstack the matrix
    corr_unstacked = corr.unstack()

    # Convert the Series to a DataFrame and reset the index
    corr_df = pd.DataFrame(corr_unstacked).reset_index()
    corr_df.columns = ['Feature1', 'Feature2', 'Correlation']

    # Remove self-correlations (correlation of a feature with itself)
    corr_df = corr_df[corr_df['Feature1'] != corr_df['Feature2']]

    # Get the absolute values of the correlations
    corr_df['Correlation'] = corr_df['Correlation'].abs()

    # Sort the DataFrame by absolute correlation values in descending order
    corr_df = corr_df.sort_values(by='Correlation', ascending=False)

    corr_df['sorted_features'] = corr_df.apply(lambda row: tuple(
        sorted([row['Feature1'], row['Feature2']])), axis=1)
    corr_df = corr_df.drop_duplicates(subset=['sorted_features'])

    # Extract the top 5 correlated feature pairs
    top_related = corr_df.head(5)

    return top_related


top_correlated = correlation_analysis(df)

top_correlated


In [None]:
# Extract feature pairs for visualization
top_pairs = top_correlated[['Feature1', 'Feature2']].values
top_pairs
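
The pair extraction above removes self-correlations and mirrored duplicates by sorting the feature names in each row. A shorter route to the same kind of result, shown here only as a sketch on toy data, is to keep the cells strictly above the diagonal of the correlation matrix. `top_correlated_pairs` and `demo` are hypothetical names.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, k=5):
    """Return the k feature pairs with the largest absolute correlation."""
    corr = frame.corr().abs()
    # True strictly above the diagonal: each pair once, no self-correlation
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # masked cells become NaN and are dropped explicitly
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(k)


demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.5],
    "c": [4, 1, 3, 2],
})
print(top_correlated_pairs(demo, k=2))
```

The result is a Series indexed by `(Feature1, Feature2)` tuples, so the pairs can be read from its index for later plotting.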

# 6. Model Building
The target is separated from the features, and the data is then split 80/20 into a training set and a test set.

In [None]:
# separate the Response as y
y = df['Response']
# get X by dropping column id and 'Response' from df
x = df.drop(columns=['id', 'Response'])

# Split to train and test set by 80/20
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=0)

print(f'Training set size: {x_train.shape}, {y_train.shape}')
print(f'Testing set size: {x_test.shape}, {y_test.shape}')
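
Since the target is imbalanced, one optional variation is to pass `stratify` to `train_test_split`, so that both parts keep the same share of positive responses. This is not what the split above does; the sketch below only illustrates the option on synthetic data, and `X_demo` and `y_demo` are made-up names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y_demo = np.array([1] * 120 + [0] * 880)  # 12% positives, synthetic
X_demo = rng.normal(size=(1000, 3))

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo)

# With stratification both parts keep roughly the original positive rate
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"valid positive rate: {y_va.mean():.3f}")
```

With 11 million rows a plain random split already gives very similar class ratios, so the effect is most visible on small samples.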

## XGBoost
The hyperparameters were tuned with Optuna on a 10% sample of the data.

In [None]:
# Hyperparameter tuning with Optuna

# Extract 10% of the data
x_sample, _, y_sample, _ = train_test_split(x, y, test_size=0.90, random_state=0)

# Split the extracted 10% data into 80% train and 20% test sets
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(x_sample, y_sample, test_size=0.20, random_state=0)

# Verify the sizes of the splits
print(f'Training set size: {x_train_s.shape}, {y_train_s.shape}')
print(f'Testing set size: {x_test_s.shape}, {y_test_s.shape}')

def objective(trial):
    # Upper bound on boosting rounds; early stopping usually ends training sooner.
    # xgb.train ignores an 'n_estimators' entry in params, so it is passed as num_boost_round.
    num_boost_round = trial.suggest_int('n_estimators', 7500, 15000, log=True)
    params = {
        'eta': trial.suggest_float('eta', 0.01, 0.3, log=True),
        'alpha': trial.suggest_float('alpha', 0.01, 0.5, log=True),
        'subsample': trial.suggest_float('subsample', 0.75, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.2, 0.5),
        'max_depth': trial.suggest_int('max_depth', 10, 15),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 5),
        'gamma': trial.suggest_float('gamma', 1e-8, 1e-3, log=True),
        'max_bin': trial.suggest_int('max_bin', 100000, 300000),
        'tree_method': 'hist',
        'eval_metric': 'auc',
        'objective': 'binary:logistic',
        'seed': 42,  # the native API uses 'seed' rather than 'random_state'
    }

    # Categorical handling is enabled on the DMatrix, not in the booster params
    dtrain = xgb.DMatrix(x_train_s, label=y_train_s, enable_categorical=True)
    dvalid = xgb.DMatrix(x_test_s, label=y_test_s, enable_categorical=True)

    watchlist = [(dtrain, 'train'), (dvalid, 'eval')]

    model = xgb.train(params, dtrain, num_boost_round=num_boost_round, evals=watchlist,
                      early_stopping_rounds=50, verbose_eval=False)

    # Score the validation set using the trees up to the best iteration
    preds = model.predict(dvalid, iteration_range=(0, model.best_iteration + 1))
    auc = roc_auc_score(y_test_s, preds)

    return auc


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, timeout=600)


print("Best params: ", study.best_params)
print("Best AUC value: ", study.best_value)
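
In the objective above, `eta`, `alpha` and `gamma` are sampled on a log scale. The sketch below illustrates what that means using only the standard library: the logarithm of the value is drawn uniformly, so small values are tried as often as large ones. `sample_log_uniform` is a hypothetical helper for illustration and is not how Optuna is implemented internally.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
draws = [sample_log_uniform(0.01, 0.3, rng) for _ in range(10000)]

# The midpoint on a log scale is the geometric mean, about 0.0548 for this range
geometric_mean = math.sqrt(0.01 * 0.3)
share_below = sum(d < geometric_mean for d in draws) / len(draws)
print(f"geometric mean: {geometric_mean:.4f}")
print(f"share of draws below it: {share_below:.3f}")
```

With plain uniform sampling over the same range, only about 15% of the draws would fall below 0.0548, so learning rates near 0.01 would rarely be explored.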

In [None]:
# Parameters found by Optuna, with n_estimators set to 10000
model = XGBClassifier(
    n_estimators=10000,
    eta=0.28376548783458155,
    alpha=0.16464911888229491,
    subsample=0.8654931054006644,
    colsample_bytree=0.30702002596633265,
    max_depth=15,
    min_child_weight=5,
    gamma=0.0009847067724641178,
    max_bin=102824,
    tree_method='hist',  # same tree method as in the tuning run
    eval_metric='auc',
    early_stopping_rounds=50,  # set on the estimator; newer XGBoost releases no longer accept it in fit()
    random_state=42,
    enable_categorical=True
)


# Train the model with early stopping on the hold-out set
model.fit(
    x_train,
    y_train,
    eval_set=[(x_test, y_test)],
    verbose=200
)

# Print the best iteration
print("Best iteration:", model.best_iteration)

# Use the underlying booster to predict on validation set using the best iteration
booster = model.get_booster()
y_pred_prob = booster.predict(xgb.DMatrix(x_test, enable_categorical=True), iteration_range=(0, model.best_iteration + 1))
auc = roc_auc_score(y_test, y_pred_prob)
print(f"Validation AUC: {auc:.5f}")
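
The early-stopping rule used above can be pictured with a few lines of plain Python: training stops once the validation metric has failed to improve for a given number of consecutive rounds, and the best round seen so far is kept. This is a simplified sketch for a metric that is maximized, such as AUC; the function name and the score list are made up, and XGBoost's own callback offers more options than shown here.

```python
def best_iteration_with_early_stopping(scores, patience):
    """Return (best_index, stop_index) for a metric that should be maximized.

    Training stops at the first round that is `patience` rounds past the
    best one without an improvement.
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(scores) - 1


# Validation AUC per boosting round (made-up numbers)
val_auc = [0.800, 0.850, 0.870, 0.869, 0.868, 0.871, 0.870]

print(best_iteration_with_early_stopping(val_auc, patience=2))  # (2, 4)
print(best_iteration_with_early_stopping(val_auc, patience=3))  # (5, 6)
```

With a patience of 2 the later improvement at round 5 is never reached, which is why a small patience can stop training too early when the validation curve is noisy.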


# 7. Model Evaluation 


In [None]:
from yellowbrick.features import FeatureImportances
from yellowbrick.classifier import ConfusionMatrix, ClassificationReport, ROCAUC, DiscriminationThreshold

fig, axes = plt.subplots(2, 2, figsize=(15, 15))

model.importance_type = 'total_gain'

visualgrid = [
    FeatureImportances(model,  ax=axes[0][0], colormap= 'winter'),
    ConfusionMatrix(model, ax=axes[0][1], cmap= 'GnBu'),
    ClassificationReport(model, ax=axes[1][0], cmap= 'GnBu'),
    ROCAUC(model, ax=axes[1][1]),
]

for viz in visualgrid:
    viz.fit(x_train, y_train)
    viz.score(x_test, y_test)
    viz.finalize()

plt.show()
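
Unlike ROC-AUC, the confusion matrix and the classification report depend on the threshold that turns a predicted probability into a class. The sketch below shows that dependence on a handful of made-up predictions; `confusion_counts` and `precision_recall` are hypothetical helpers written in plain Python.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a given probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


def precision_recall(counts):
    """Precision and recall for the positive class (0.0 when undefined)."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


y_true = [0, 0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.2, 0.7, 0.4]

for threshold in (0.5, 0.3):
    counts = confusion_counts(y_true, y_prob, threshold)
    print(threshold, counts, precision_recall(counts))
```

Lowering the threshold from 0.5 to 0.3 raises recall from 0.5 to 1.0 in this toy example, which is the usual trade-off to consider on an imbalanced target.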

# 8. Submission

In [None]:
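# Data preprocessing for submission
# convert data types
df_test = converting_datatypes(df_test)
# apply the same categorical encoding that was used for the training data
df_test = encode_categorical_features(df_test)
# extract ids
test_ids = df_test['id']
# remove id from dataset
df_test.drop(columns=['id'], inplace=True)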

In [None]:
# make predictions
df_test_dmatrix = xgb.DMatrix(df_test, enable_categorical=True)
y_pred = booster.predict(df_test_dmatrix,iteration_range=(0, model.best_iteration + 1))

In [None]:
# Create submission file
submission = pd.DataFrame({
    'id': test_ids,
    'Response': y_pred
})
submission

In [None]:
submission.to_csv("submission.csv", index=False)
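
Before uploading, it can help to run a few sanity checks on the submission frame. The sketch below assumes the layout built above (an `id` column and a `Response` column holding predicted probabilities); `check_submission` and `expected_ids` are hypothetical names, and the toy frames stand in for the real data.

```python
import pandas as pd


def check_submission(sub, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(sub.columns) != ["id", "Response"]:
        return ["columns should be exactly ['id', 'Response']"]
    problems = []
    if sub["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(sub["id"]) != set(expected_ids):
        problems.append("ids do not match the test set")
    if sub["Response"].isna().any():
        problems.append("missing predictions")
    elif not sub["Response"].between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems


good = pd.DataFrame({"id": [10, 11, 12], "Response": [0.02, 0.91, 0.40]})
bad = pd.DataFrame({"id": [10, 11, 11], "Response": [0.02, 1.70, 0.40]})

print(check_submission(good, [10, 11, 12]))
print(check_submission(bad, [10, 11, 12]))
```

In the notebook, the equivalent call would pass `submission` and `test_ids`.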

# 9. Conclusion
Possible improvements: <br>
examine each feature in more depth during EDA <br>
build an LGBM model for comparison<br>