# Homesite Quote Conversion Challenge

Before asking someone on a date or skydiving, it's important to know your likelihood of success. The same goes for quoting home insurance prices to a potential customer. Homesite, a leading provider of homeowners insurance, does not currently have a dynamic conversion rate model that can give them confidence a quoted price will lead to a purchase. 

Using an anonymized database of information on customer and sales activity, including property and coverage information, Homesite is challenging you to predict which customers will purchase a given quote. Accurately predicting conversion would help Homesite better understand the impact of proposed pricing changes and maintain an ideal portfolio of customer segments. 

## Main Challenges 

This dataset is large, with roughly 260K rows (samples) and 298 features. To add to the challenge, the data is anonymized, so feature engineering would be largely guesswork and brute force. I decided to handle this with feature selection and boosting.

__I implemented two feature selection strategies:__

- __Mutual information:__
Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

- __Recursive Feature Elimination:__
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a specific attribute or a callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
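
The two strategies above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the selection code used for this notebook: the dataset, the choice of `LogisticRegression` as the RFE estimator, and the target of 5 features are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the anonymized data (assumption): 20 features, 5 informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Mutual information: score every feature against the label, then keep the top 5.
mi_scores = mutual_info_classif(X, y, random_state=0)
mi_top = np.argsort(mi_scores)[::-1][:5]

# RFE: fit the estimator, drop the weakest feature, and repeat until 5 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)
rfe_features = np.flatnonzero(rfe.support_)

print("Top features by mutual information:", sorted(mi_top.tolist()))
print("Features kept by RFE:", rfe_features.tolist())
```

The two methods do not have to agree: mutual information scores each feature on its own, while RFE ranks features by how the fitted estimator uses them together.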

After inspecting the selected features and performing EDA on them, I decided to treat all features as categorical.

Once I had reduced the feature set from 298 to 50, I tried two models: a simple __Logistic regression__ with one-hot encoding and __LightGBM__. With logistic regression I was able to reach a ROC-AUC score of 0.95, but the model took a long time to train because one-hot encoding produced a large number of columns.
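
A rough sketch of the logistic regression baseline, showing how one-hot encoding inflates the column count. The toy dataframe, the column names `field_a` and `field_b`, and the rule that generates the label are all hypothetical, chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 2000

# Hypothetical categorical features: one with 4 levels, one with 25 levels.
toy = pd.DataFrame({
    "field_a": rng.choice(["A", "B", "C", "D"], size=n),
    "field_b": rng.integers(0, 25, size=n),
})

# Hypothetical label rule with noise, so the task is learnable but not trivial.
signal = (toy["field_a"] == "A").astype(int) + (toy["field_b"] > 12).astype(int)
y = (signal + rng.normal(0, 0.5, size=n) > 1).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    toy, y, stratify=y, random_state=42
)

logit = Pipeline([
    # Every category level becomes its own 0/1 column.
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression(max_iter=1000)),
])
logit.fit(X_train, y_train)

n_columns = logit.named_steps["one_hot"].transform(X_train).shape[1]
auc = roc_auc_score(y_val, logit.predict_proba(X_val)[:, 1])
print(f"{X_train.shape[1]} raw columns became {n_columns} one-hot columns")
print(f"Validation ROC-AUC: {auc:.3f}")
```

With 50 high-cardinality columns instead of two, the encoded matrix grows much wider, which is consistent with the slow training noted above.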

I tuned the hyperparameters of two LightGBM models with __Optuna__, a hyperparameter optimization framework. One feature I like is that it can stop (prune) runs for unpromising combinations of values early, which makes it feasible to search a larger hyperparameter space.

The first model was trained on the features obtained using mutual information and gave a ROC-AUC score of 0.93. The second model was trained on the features obtained from RFE and gave a ROC-AUC score of 0.96+. For the final test submission I got a score of 0.9627 on the private leaderboard.

Finally, I used a Sklearn Pipeline to streamline the prediction workflow for the test set. This saved me from separately storing the fitted encodings for all 50 feature columns.
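
A minimal sketch of why the pipeline helps: the fitted encoder is stored inside it, so new rows are encoded the same way at prediction time, and categories never seen in training map to the chosen sentinel value. The toy data and column names are hypothetical, and `DecisionTreeClassifier` stands in for LightGBM to keep the example dependency-light.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data with two categorical columns.
train = pd.DataFrame({
    "field_a": ["A", "B", "A", "C", "B", "C"],
    "field_b": ["x", "y", "x", "y", "x", "y"],
})
labels = [1, 0, 1, 0, 0, 0]

pipe = Pipeline([
    # Unseen categories are encoded as -99 instead of raising an error.
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-99)),
    ("model", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(train, labels)

# "Z" never appeared in training; the stored encoder still handles it.
new_rows = pd.DataFrame({"field_a": ["A", "Z"], "field_b": ["x", "y"]})
encoded = pipe.named_steps["encoder"].transform(new_rows)
proba = pipe.predict_proba(new_rows)

print(encoded)
print(proba)
```

Calling `predict_proba` on the pipeline applies the encoder and the model in one step, so no separate lookup table of category codes has to be kept around.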

## Key Learning 

- Feature Selection Techniques 
- Sklearn Pipeline 

## Part 1 Notebook
I will also link the Part 1 notebook, where I did the optimization work and some EDA on the dataset.


## Upvote if you like the work 
LinkedIn: https://www.linkedin.com/in/sawantsumeet/

## Import Library 

In [None]:
import pandas as pd 
import numpy as np 
import gc
import lightgbm as gbm
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import LabelEncoder,OrdinalEncoder
from sklearn import model_selection,metrics 
import warnings
warnings.filterwarnings("ignore")

## File Loading 

In [None]:
## Import the data 
mutual_columns=pd.read_csv('../input/insurancefeatures-homesite/mutual_info_features.csv').columns
RFE_columns=pd.read_csv('../input/insurancefeatures-homesite/RFE_features.csv').columns

df=pd.read_csv('../input/insurancefeatures-homesite/train.csv')

df_mutual=df[mutual_columns]

## Drop two columns from df_RFE: they have too many categories, and the anonymized
## feature columns are modeled as purely categorical

df_RFE=df[RFE_columns].drop(columns=['Original_Quote_Date','SalesField8'])

# Free the full dataframe, which is no longer needed
del df
gc.collect()

## Declaring the model parameters tuned using Optuna 

In [None]:
# Parameters for the two LightGBM models. These parameters were obtained by hyperparameter optimization with Optuna


## Parameters for the LightGBM model trained on Recursive Feature Elimination features
RFE_params={ 
    'boosting_type': 'gbdt',
    'lambda_l1': 4.540006226304331e-08,
    'lambda_l2': 4.715716309514142,
    'num_leaves': 105,
    'feature_fraction': 0.89,
    'bagging_fraction': 1,
    'bagging_freq': 4,
    'min_child_samples': 65,
    'max_bin': 20,
    'learning_rate': 0.14, }

### Parameters for Light GBM using Mutual info 
mutual_info_params={'boosting_type': 'gbdt',
    'lambda_l1': 4.956734949314487e-08,
    'lambda_l2': 2.278541145546624e-08,
    'num_leaves': 131,
    'feature_fraction': 0.6,
    'bagging_fraction': 0.76,
    'bagging_freq': 2,
    'min_child_samples': 21,
    'max_bin': 18,
    'learning_rate': 0.15}

## Initialize the models
RFE_gbm=gbm.LGBMClassifier(**RFE_params)
mutual_gbm=gbm.LGBMClassifier(**mutual_info_params)

## Sklearn Pipeline to train the model

In [None]:
### Pipeline Implementation 

## model 1

X_train,X_val,y_train,y_val=model_selection.train_test_split(df_mutual.drop('QuoteConversion_Flag',axis=1),
                                                             df_mutual['QuoteConversion_Flag'],random_state=42,
                                                            stratify=df_mutual['QuoteConversion_Flag'])

GBM1=Pipeline([
                ('label_encoder',OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=-99)),
                 ('mutual_gbm',mutual_gbm)                
            ])

GBM1.fit(X_train,y_train)
y_predict_mutual=GBM1.predict_proba(X_val)


## Model 2 

X_train,X_val,y_train,y_val=model_selection.train_test_split(df_RFE.drop('QuoteConversion_Flag',axis=1),
                                                             df_RFE['QuoteConversion_Flag'],random_state=42,
                                                            stratify=df_RFE['QuoteConversion_Flag'])

GBM2=Pipeline([
                ('label_encoder',OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=-99)),
                ('RFE_gbm',RFE_gbm)                
            ])


GBM2.fit(X_train,y_train)
y_predict_RFE=GBM2.predict_proba(X_val)




## Taking the average of both predictions (both splits use the same random_state and
## the same labels, so the validation rows line up)
y_avg=(y_predict_mutual[:,1]+y_predict_RFE[:,1])/2



In [None]:
## Validation AUC 

print("AUC score for the Light GBM ensemble is:{:.2f}".format(metrics.roc_auc_score(y_val,y_avg)))

## Test Set Submission 

In [None]:
## Test Submission 

## Extract the test set 
import zipfile
with zipfile.ZipFile('/kaggle/input/homesite-quote-conversion/test.csv.zip', 'r') as zip_ref:
    zip_ref.extractall('./')
    
## Extract the submission file 

with zipfile.ZipFile('../input/homesite-quote-conversion/sample_submission.csv.zip', 'r') as zip_ref:
    zip_ref.extractall('./')
    
## Load the test set
df_test=pd.read_csv('./test.csv')

## Keep only the columns required for prediction

RFE_columns=[col for col in RFE_columns if col != 'QuoteConversion_Flag'] # Test won't have the label column
df_test=df_test[RFE_columns].drop(columns=['Original_Quote_Date','SalesField8'])

# Predict with the GBM2 pipeline. The pipeline makes this easy because the fitted feature encodings are stored inside it
y_test= GBM2.predict_proba(df_test)

# Store the values obtained on the test set into the submission file 
df_submission=pd.read_csv('./sample_submission.csv')

df_submission['QuoteConversion_Flag']=y_test[:,1]

df_submission.to_csv('./LightGBM_RFE_Features.csv',index=False)