# Homesite Quote Conversion Challenge

Before asking someone on a date or skydiving, it's important to know your likelihood of success. The same goes for quoting home insurance prices to a potential customer. Homesite, a leading provider of homeowners insurance, does not currently have a dynamic conversion rate model that can give them confidence a quoted price will lead to a purchase. 

Using an anonymized database of information on customer and sales activity, including property and coverage information, Homesite is challenging you to predict which customers will purchase a given quote. Accurately predicting conversion would help Homesite better understand the impact of proposed pricing changes and maintain an ideal portfolio of customer segments. 

## Main Challenges 

The dataset is large, roughly 260K rows (samples) and 298 features, and on top of that the data is anonymized, so feature engineering would be largely guesswork and brute force. I decided to handle this with feature selection followed by a boosting model.

__I implemented two feature selection strategies:__

- __Mutual information:__
Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

- __Recursive Feature Elimination:__
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a specific attribute or a callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
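
To make the two strategies concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the feature counts and the `ExtraTreesClassifier` settings are assumptions chosen for illustration; they are not the values used on the Homesite data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif

# Synthetic stand-in for the real data: 20 features, 5 of them informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Filter method: score every feature by mutual information with the target, keep the top 5.
kbest = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
mi_idx = kbest.get_support(indices=True)

# Wrapper method: refit the estimator and drop the 5 weakest features per round.
rfe = RFE(ExtraTreesClassifier(n_estimators=50, random_state=0),
          n_features_to_select=5, step=5).fit(X, y)
rfe_idx = np.flatnonzero(rfe.support_)

print("Mutual information picks:", mi_idx)
print("RFE picks:", rfe_idx)
```

The two methods need not agree: mutual information scores each feature on its own, while RFE judges features in the context of the others.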

After inspecting the selected features and performing EDA on them, I decided to treat all features as categorical.

Once I had reduced the features from 298 to 50, I tried two models: a simple __Logistic Regression__ with one-hot encoding and __LightGBM__. With logistic regression I was able to reach a ROC-AUC score of 0.95, but the model took a long time to train because of the large number of one-hot encoded columns.

I tuned two LightGBM models with __Optuna__, a hyperparameter optimization framework. One feature I like is that it can stop (prune) runs early for unpromising combinations of values, which makes it practical to search a larger hyperparameter space.
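
The idea behind median-based pruning can be sketched in a few lines of plain Python. This is only an illustration of the rule, not Optuna's implementation: the function `should_prune`, its arguments and the AUC curves are all hypothetical, and Optuna's own `MedianPruner` has more options (for example `n_warmup_steps` and `n_startup_trials`).

```python
import statistics

def should_prune(current_value, step, finished_curves, warmup_steps=1):
    """Return True if a maximised metric is below the median of earlier trials at `step`."""
    if step < warmup_steps:
        return False
    # Collect what earlier trials scored at the same step (skip trials that stopped sooner).
    peers = [curve[step] for curve in finished_curves if len(curve) > step]
    if not peers:
        return False
    return current_value < statistics.median(peers)

# Hypothetical validation AUC per boosting round for three finished trials.
history = [
    [0.80, 0.85, 0.90],
    [0.78, 0.84, 0.88],
    [0.82, 0.86, 0.91],
]

print(should_prune(0.70, step=1, finished_curves=history))  # below the median of 0.85
print(should_prune(0.87, step=1, finished_curves=history))  # above the median of 0.85
```

A trial that is stopped this way frees its remaining training budget for other hyperparameter combinations.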

The first model was trained on the features obtained from mutual information and reached a ROC-AUC score of 0.93. The second model was trained on the features obtained from RFE and reached a ROC-AUC score of 0.96+. For the final submission I got a score of 0.9627 on the private leaderboard.

Finally, I used an Sklearn Pipeline to streamline the prediction workflow for the test set. This let me avoid storing the feature encodings for all 50 feature columns separately.
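
As a rough sketch of why a Pipeline helps here: the encoder is fitted together with the model, so the same encoding is replayed on new data without saving it separately. The toy columns `FieldA` and `FieldB`, the `OneHotEncoder` step and the logistic regression are assumptions for illustration; the final notebook may use different steps.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the anonymized categorical columns.
train = pd.DataFrame({
    "FieldA": ["x", "y", "x", "z", "y", "x"],
    "FieldB": [1, 2, 1, 3, 2, 1],
    "target": [0, 1, 0, 1, 1, 0],
})
test = pd.DataFrame({"FieldA": ["x", "w"], "FieldB": [1, 9]})  # "w" and 9 are unseen

pipe = Pipeline([
    # The fitted encoder remembers the training categories; unseen ones become all zeros.
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(train[["FieldA", "FieldB"]], train["target"])
proba = pipe.predict_proba(test)[:, 1]  # the encoding is applied automatically
print(proba)
```

Because the encoder lives inside the fitted pipeline, persisting one object is enough to reproduce both the preprocessing and the model.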

## Key Learning 

- Feature Selection Techniques 
- Sklearn Pipeline 

## Part2 Notebook: Final Implementation 
https://www.kaggle.com/sumeetsawant/insurance-quote-xgboost-and-pipeline-auc-0-9627 


## Upvote if you like the work 
LinkedIn: https://www.linkedin.com/in/sawantsumeet/

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings 
import gc
#warnings.filterwarnings('ignore')

from sklearn import model_selection
from sklearn import linear_model ,metrics
%matplotlib inline 

In [None]:
#Clear memory from previous run if any 
gc.collect()

In [None]:
import zipfile
with zipfile.ZipFile('/kaggle/input/homesite-quote-conversion/train.csv.zip', 'r') as zip_ref:
    zip_ref.extractall('./')

In [None]:
df=pd.read_csv('./train.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
color=sns.color_palette()[3]
sns.countplot(x=df['QuoteConversion_Flag'],color=color);

## Imbalanced dataset 
Options worth exploring:
- Undersampling
- Artificial data generation 
- Class weights 
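
Of these options, class weights are the cheapest to try. The sketch below shows how scikit-learn derives "balanced" weights; the 8:2 toy labels are an assumption for illustration, not the real class ratio. The same weighting is what `class_weight='balanced'` requests in the logistic regression cell later on.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with an 8:2 imbalance (hypothetical; the real ratio comes from the count plot).
y_toy = np.array([0] * 8 + [1] * 2)

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))  # the minority class gets the larger weight
```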


In [None]:
# Plot value counts for 50 randomly chosen columns
columns=df.columns.to_list()

# replace=False guarantees that no column is picked twice
feat=np.random.choice(columns,size=50,replace=False)

plt.figure(figsize=(30,30))
for i,sample in enumerate(feat):
    plt.subplot(10,5,i+1)
    sns.countplot(x=df[sample],color=color);
    plt.title(sample);
plt.tight_layout()

Some features are obvious candidates to drop because they contain only a single value.
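
Such columns can also be found programmatically rather than by eye. The following is a small sketch on a toy frame; the column names are made up, and the list of columns actually dropped in this notebook was chosen by hand.

```python
import numpy as np
import pandas as pd

# Toy frame: "const" has one value, "empty" is all missing, "useful" varies.
toy = pd.DataFrame({
    "const": [1, 1, 1, 1],
    "empty": [np.nan] * 4,
    "useful": ["a", "b", "a", "c"],
})

# nunique() ignores NaN by default, so an all-null column counts 0 distinct values.
n_distinct = toy.nunique()
to_drop = n_distinct[n_distinct <= 1].index.tolist()
print(to_drop)

toy_reduced = toy.drop(columns=to_drop)
```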

### Feature Selection using univariate analysis 

- Filter methods  : Using univariate analysis (Mutual information) 
- Wrapper methods : Using Recursive Feature Elimination 

In [None]:
from sklearn import feature_selection
from sklearn.feature_selection import mutual_info_classif

X = df.copy()
y = X.pop('QuoteConversion_Flag')

# Dropping null columns and columns having only one value 
X.drop(columns=['PersonalField7', 'PersonalField84', 'PropertyField3', 'PropertyField4',
                'PropertyField29', 'PropertyField32', 'PropertyField34',
                'PropertyField36', 'PropertyField38',
                'GeographicField10A','PropertyField6'], axis=1,inplace=True)

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

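# Selecting the top 50 features using mutual_info_classif
kbest_mutual=feature_selection.SelectKBest(score_func=mutual_info_classif,k=50)
kbest_mutual.fit(X,y)  # only the fitted selector is needed; the transformed array is not used

# Index the original frame with the selected positions so the column names are kept
cols = kbest_mutual.get_support(indices=True)
X_mutual=X.iloc[:,cols]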

### Selecting features using RFE 

In [None]:
from sklearn import ensemble 

estimator = ensemble.ExtraTreesClassifier()
selector = feature_selection.RFE(estimator, n_features_to_select=50, step=10)

selector.fit(X, y)

X_RFE=X.iloc[:,selector.get_support()]

#print("Selected Features: %s" % (X.columns[selector.get_support()]))
print("Feature Ranking: %s" % (selector.ranking_))

In [None]:
# Exporting the selected features into dataset for future use 
X_RFE=pd.concat([X_RFE,y],axis=1)
X_mutual=pd.concat([X_mutual,y],axis=1)

X_RFE.to_csv('./RFE_features.csv',index=False,header=True)
X_mutual.to_csv('./mutual_info_features.csv',index=False,header=True)

### Implementing models after feature selection 

In [None]:
#Import columns from dataset

#df=pd.read_csv('../input/insurancefeatures-homesite/train.csv')

RFE_columns=pd.read_csv('../input/insurancefeatures-homesite/RFE_features.csv').columns



In [None]:
# Take just the columns found useful via the RFE analysis from the main dataset  

df_RFE=df.loc[:,RFE_columns]

## Check the number of unique values in each remaining column 

for col in df_RFE.columns:
    print(col,"       :   ",df_RFE[col].nunique())
    

In [None]:
## Visualize the features 

columns=[col for col in df_RFE.columns if col not in ['SalesField8','Original_Quote_Date']]
plt.figure(figsize=(30,30))
for i in range(30):
    plt.subplot(6,5,i+1)
    sns.countplot(x=df_RFE[columns[i]],color=color);
    plt.title(columns[i]);
plt.tight_layout()

del columns 
gc.collect()
# Most columns look categorical 

In [None]:
#Create Folds

df_RFE=df_RFE.reset_index(drop=True)

df_RFE['fold']=-1
kf=model_selection.StratifiedKFold(n_splits=5)
for fold ,(train_idx,val_id) in enumerate(kf.split(X=df_RFE,y=df_RFE['QuoteConversion_Flag'])):
    df_RFE.loc[val_id,'fold']=fold


In [None]:
#Preprocess data 
# Convert to one-hot vectors, dropping the first dummy level and the original column

df_RFE.drop(columns=['Original_Quote_Date','SalesField8'],inplace=True)

columns=[col for col in df_RFE.columns if col not in ['fold','QuoteConversion_Flag']]

# A single get_dummies call encodes every listed column and removes the originals
df_RFE=pd.get_dummies(df_RFE,columns=columns,drop_first=True)

### Trying Various Models 

### Logistic Regression

In [None]:
## Let's fit a logistic regression model, which can also serve as our baseline 

val_ROC_score=[]
train_ROC_score=[]

for fold in np.arange(5):
    
    
    print(f' fold {fold}')
    
    df_train=df_RFE[df_RFE['fold']!=fold]
    df_val=df_RFE[df_RFE['fold']==fold]
    
    lr=linear_model.LogisticRegression(max_iter=100,n_jobs=-1, random_state=42,
                                      class_weight='balanced')
    
    X_train=df_train.drop(columns=['fold','QuoteConversion_Flag'],axis=1)
    y_train=df_train['QuoteConversion_Flag']
    
    X_val=df_val.drop(axis=1,columns=['fold','QuoteConversion_Flag'])
    y_val=df_val['QuoteConversion_Flag']
    
    
    lr.fit(X_train,y_train)
    
    y_pred_train=lr.predict_proba(X_train)
    y_pred=lr.predict_proba(X_val)
    
    
    # roc_auc_score expects a 1-D array of positive-class probabilities for binary targets
    val_ROC_score.append(metrics.roc_auc_score(y_val,y_pred[:,1]))
    train_ROC_score.append(metrics.roc_auc_score(y_train,y_pred_train[:,1]))
    
    print(f'Completed fold {fold}')
    
    
print(f'Mean_ROC_Score using Logistic Regression is: {np.mean(val_ROC_score)}')



### Light GBM with Recursive Feature Elimination and Tuned with Optuna 

In [None]:
## Prepare the dataset 


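## Label Encode the columns 
## Note: this step expects df_RFE to hold the raw RFE-selected columns,
## not the one-hot encoded frame built for logistic regression above

from sklearn import preprocessing

for column in df_RFE.columns:
    lb=preprocessing.LabelEncoder()
    # Compare names with != ; `not in` on a string would be a substring test
    if column != 'QuoteConversion_Flag':
        # Plain column assignment lets the column take the encoder's integer dtype
        df_RFE[column]=lb.fit_transform(df_RFE[column].values)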

### Tune the model 

In [None]:
## Try baseline Light GBM with the above features 

import lightgbm as lgb
import logging
import optuna
import sys

# 1. Define an objective function to be maximized.
def objective(trial):
    
    
    data, target = df_RFE.drop('QuoteConversion_Flag',axis=1),df_RFE['QuoteConversion_Flag']
    train_x, valid_x, train_y, valid_y =model_selection.train_test_split(data, target, test_size=0.15,stratify=target)
    dtrain = lgb.Dataset(train_x, label=train_y)
    dvalid=lgb.Dataset(valid_x,label=valid_y)

    # 2. Suggest values of the hyperparameters using a trial object.
    param = {
        'objective': 'binary',
        'metric': 'auc',
        'device_type':'cpu',
        'seed':42,
        'verbosity':-1,
        'boosting_type': trial.suggest_categorical('boosting_type',['gbdt','rf']),
        # suggest_float replaces the deprecated suggest_loguniform / suggest_uniform
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'max_bin':trial.suggest_int('max_bin',2,26,step=2),
        'learning_rate':trial.suggest_float('lr',0.1,1,log=True),
        'num_iterations':200
    }
    
    pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "auc",'valid')

    gbm=lgb.train(param, dtrain,valid_sets=[dvalid],valid_names=['valid'],callbacks=[pruning_callback])
    
    auc=gbm.best_score['valid']['auc']
    
    return auc

# 3. Create a study object and optimize the objective function.

optuna.logging.get_logger("optuna").addHandler(logging.StreamHandler(sys.stdout))
optuna.logging.set_verbosity(optuna.logging.WARNING)

study = optuna.create_study(direction='maximize',
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
                            sampler=optuna.samplers.TPESampler(),
                            study_name='Light GBM optimization')

study.optimize(objective, n_trials=120)

### Display the best model's score and hyper-parameters 

In [None]:
print("Best trial:")
trial=study.best_trial

print("  Value: {}".format(trial.value))

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

### Light GBM with Mutual information features and Tuned with Optuna 
Here I use the features selected via mutual information.

In [None]:
### Using Dataset generated from mutual_info_features 

#Import columns from dataset

#df=pd.read_csv('../input/insurancefeatures-homesite/train.csv')

mutual_columns=pd.read_csv('../input/insurancefeatures-homesite/mutual_info_features.csv').columns

# Take just the columns found useful via the mutual information analysis from the main dataset  

df_mutual=df.loc[:,mutual_columns]

## Check the number of unique values in each remaining column 

for col in df_mutual.columns:
    print(col,"       :   ",df_mutual[col].nunique())

In [None]:
## Prepare the dataset 

## Label Encode the columns 

from sklearn import preprocessing

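for column in df_mutual.columns:
    lb=preprocessing.LabelEncoder()
    # Compare names with != ; `not in` on a string would be a substring test
    if column != 'QuoteConversion_Flag':
        # Plain column assignment lets the column take the encoder's integer dtype
        df_mutual[column]=lb.fit_transform(df_mutual[column].values)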

### Tune the model 

In [None]:
## Try baseline Light GBM with the above features 

import lightgbm as lgb
import logging
import optuna
import sys

# 1. Define an objective function to be maximized.
def objective(trial):
    
    
    data, target = df_mutual.drop('QuoteConversion_Flag',axis=1),df_mutual['QuoteConversion_Flag']
    train_x, valid_x, train_y, valid_y =model_selection.train_test_split(data, target, test_size=0.15,stratify=target)
    dtrain = lgb.Dataset(train_x, label=train_y)
    dvalid=lgb.Dataset(valid_x,label=valid_y)

    # 2. Suggest values of the hyperparameters using a trial object.
    param = {
        'objective': 'binary',
        'metric': 'auc',
        'device_type':'cpu',
        'seed':42,
        'verbosity':-1,
        'boosting_type': trial.suggest_categorical('boosting_type',['gbdt','rf']),
        # suggest_float replaces the deprecated suggest_loguniform / suggest_uniform
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'max_bin':trial.suggest_int('max_bin',2,26,step=2),
        'learning_rate':trial.suggest_float('lr',0.1,1,log=True),
        'num_iterations':200
    }
    
    pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "auc",'valid')

    gbm=lgb.train(param, dtrain,valid_sets=[dvalid],valid_names=['valid'],callbacks=[pruning_callback])
    
    auc=gbm.best_score['valid']['auc']
    
    return auc

# 3. Create a study object and optimize the objective function.

optuna.logging.get_logger("optuna").addHandler(logging.StreamHandler(sys.stdout))
optuna.logging.set_verbosity(optuna.logging.WARNING)

study = optuna.create_study(direction='maximize',
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
                            sampler=optuna.samplers.TPESampler(),
                            study_name='Light GBM optimization')

study.optimize(objective, n_trials=120)


### Display the best model's score and hyper-parameters 

In [None]:
print("Best trial:")
trial=study.best_trial

print("  Value: {}".format(trial.value))

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

## Feature Selection articles 
https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e
https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
    
## Useful functions 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

## Model 
https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db#:~:text=Unlike%20CatBoost%20or%20LGBM%2C%20XGBoost,supplying%20categorical%20data%20to%20XGBoost
https://medium.com/sfu-cspmp/xgboost-a-deep-dive-into-boosting-f06c9c41349 

## Optuna
https://neptune.ai/blog/optuna-vs-hyperopt