#  Credit Card Leads Prediction 


***A JOB-A-THON conducted by Analytics Vidhya***

## Problem Statement:

### Estimate the probability that a customer is interested in getting a credit card. This helps the bank cross-sell across different bank account categories, using variables such as Age, Occupation, Avg_Account_Balance, etc.

## Solution:
We will cover data processing, data modeling with gradient boosting, and selecting an optimal threshold for the best split.

## Import necessary libraries

In [None]:
import pandas as pd  # data processing, CSV file I/O
import numpy as np  # linear algebra

#data visualization library
import matplotlib.pyplot as plt
plt.rc("font", size=14)
import seaborn as sns
sns.set()

#machine learning library

#label encoding library
from sklearn.preprocessing import LabelEncoder

#cross validation score,kfold and stratified kfold library
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import KFold, StratifiedKFold

#gradient boosting library
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

#roc auc score library
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

#filter out warnings
import warnings
warnings.simplefilter(action='ignore')

### Loading The Train Data And Test Data

In [None]:
data_train=pd.read_csv("../input/jobathon-may-2021-credit-card-lead-prediction/Train_Data.csv")
data_test=pd.read_csv("../input/jobathon-may-2021-credit-card-lead-prediction/Test_Data.csv")

### Analyzing The Train And Test Datasets

In [None]:
data_train.head()

In [None]:
data_test.head()

### Checking The Rows In Train And Test Data

In [None]:
data_train.shape

In [None]:
data_test.shape

### Checking The Null Values In Train And Test Data 

In [None]:
data_train.info()

It can be observed that only the 'Credit_Product' column has missing values.
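
`info()` reports non-null counts, so the missing values have to be read off by subtraction. As a minimal sketch, the same check can be made explicit with `isnull()`. The toy frame below is made up for illustration and only stands in for `data_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_train (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [30, 45, 52, 28],
    "Credit_Product": ["No", np.nan, "Yes", np.nan],
})

# Number and share of missing values per column
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Running the same two lines on `data_train` and `data_test` would list every column with its missing count in one table.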

In [None]:
data_test.info()

### Checking Duplicated Values In Train And Test Data

In [None]:
data_train[data_train.duplicated(keep= False)]

In [None]:
data_test[data_test.duplicated(keep=False)]

In [None]:
data_train.describe()

In [None]:
data_test.describe()

### Data Visualizations

In [None]:
sns.set_style('darkgrid')#set background 

fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.countplot(data=data_train, x='Credit_Product', hue='Is_Lead', ax=ax, palette='rocket')
ax.set_title('Credit_Product - Is_Lead Plot', size=20, loc='Left', y=1.04)

sns.despine()#remove axes spines
plt.show()

As only the 'Credit_Product' column has missing values, we visualize it first.

In [None]:
plt.figure(figsize=(16, 7))
temp = data_train.copy()
temp['Age'] = pd.cut(temp.Age, bins=[20, 35, 50, 65, 80, 95])

sns.countplot(data=temp, x='Age', hue='Is_Lead', palette='ocean_r')

plt.show()

It was found that age can be divided into age groups.
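
The grouping above relies on `pd.cut`. The sketch below uses the same bin edges on a handful of made-up ages to show which group each value falls into. The label names are hypothetical and were chosen only for readability.

```python
import pandas as pd

ages = pd.Series([23, 35, 36, 50, 64, 85], name="Age")  # made-up ages
bins = [20, 35, 50, 65, 80, 95]  # same edges as the plot above
labels = ["21-35", "36-50", "51-65", "66-80", "81-95"]  # hypothetical label names

# right=True (the default) makes each bin include its upper edge, e.g. (20, 35]
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.value_counts(sort=False))
```

Because the upper edge is inclusive, an age of exactly 35 lands in the first group rather than the second.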

In [None]:
numerical = ['Age','Vintage','Avg_Account_Balance']
sns.pairplot(data=data_train,x_vars=numerical, hue = 'Is_Lead', palette='BuPu')

We now plot the numerical variables to look at their distributions.

In [None]:
temp1 = data_train.copy()
temp1[numerical] = np.log(data_train[numerical])
sns.pairplot(data=temp1,x_vars=numerical, hue = 'Is_Lead', palette='OrRd')

We log transform the variables and plot them again.

### Combining The Train And Test Data

In [None]:
def get_combined_data():
    train = pd.read_csv('../input/jobathon-may-2021-credit-card-lead-prediction/Train_Data.csv')
    test = pd.read_csv('../input/jobathon-may-2021-credit-card-lead-prediction/Test_Data.csv')
    # Drop the target so train and test share the same feature columns
    train = train.drop(columns='Is_Lead')
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    combined = pd.concat([train, test], ignore_index=True)
    # ID is an identifier, not a predictive feature
    combined = combined.drop(columns='ID')
    return combined

In [None]:
combined = get_combined_data()
combined.describe()

In [None]:
combined.shape
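
Stacking train and test means the two parts have to be separated again before modeling. As a hedged sketch of one way to do that without tracking row counts, `pd.concat` accepts a `keys` argument that tags each row with its origin. The toy frames and the names `both`, `train_part` and `test_part` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the train and test files (values are made up)
toy_train = pd.DataFrame({"ID": ["a", "b", "c"], "Age": [30, 41, 55], "Is_Lead": [0, 1, 0]})
toy_test = pd.DataFrame({"ID": ["d", "e"], "Age": [27, 63]})

# Stack the feature columns; the outer index level records where each row came from
both = pd.concat(
    [toy_train.drop(columns="Is_Lead"), toy_test],
    keys=["train", "test"],
).drop(columns="ID")

# Split back without hard-coding the number of training rows
train_part = both.loc["train"]
test_part = both.loc["test"]
print(train_part.shape, test_part.shape)
```

Any preprocessing applied to `both` reaches the two parts identically, and the split stays correct even if the input files change size.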

### Filling The Null Values

In [None]:
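def impute_Credit_Product():
    global combined
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    combined['Credit_Product'] = combined['Credit_Product'].fillna('Yes')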

In [None]:
impute_Credit_Product()

In [None]:
combined.info()
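
Filling every gap with 'Yes' is one possible choice. The sketch below compares it on made-up values with an alternative that keeps missingness as its own level. The 'Unknown' label is an assumption for illustration and is not what this notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the Credit_Product column
toy = pd.DataFrame({"Credit_Product": ["No", np.nan, "Yes", np.nan, "No"]})

# Option used above: treat every missing value as 'Yes'
filled_yes = toy["Credit_Product"].fillna("Yes")

# Alternative (an assumption, not the notebook's method): keep missingness as its own level
filled_unknown = toy["Credit_Product"].fillna("Unknown")

print(filled_yes.value_counts().to_dict())
print(filled_unknown.value_counts().to_dict())
```

If a third level such as 'Unknown' were used, the `{'No':0,'Yes':1}` mapping applied to this column later would need a matching third entry, otherwise those rows would become NaN again.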

### Label Encoding The Categorical Variables

In [None]:
def process_gender():
    global combined
    combined['Gender'] = combined['Gender'].map({'Male':1,'Female':0})   

In [None]:
def process_Occupation():
    global combined
    combined['Occupation'] = combined['Occupation'].map({'Other':0,'Salaried':1,'Self_Employed':2,'Entrepreneur':3})

In [None]:
def process_Credit_Product():
    global combined
    combined['Credit_Product'] = combined['Credit_Product'].map({'No':0,'Yes':1})

In [None]:
def process_Is_Active():
    global combined
    combined['Is_Active'] = combined['Is_Active'].map({'No':0,'Yes':1})

In [None]:
process_gender()
process_Occupation()
process_Credit_Product()
process_Is_Active()

In [None]:
combined.head()

In [None]:
combined.shape

In [None]:
label = LabelEncoder()
var_label = ['Region_Code','Channel_Code']
for i in var_label:
    combined[i]=label.fit_transform(combined[i])
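
The loop above refits one `LabelEncoder` object for each column, so after it finishes only the last column's mapping is still available. A minimal sketch of keeping one encoder per column, in case the integer codes need to be decoded later, could look like this. The sample codes are made up and `encoders` is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Region_Code": ["RG268", "RG277", "RG268"],  # made-up sample values
    "Channel_Code": ["X3", "X1", "X1"],
})

# One fitted encoder per column, kept so the integer codes can be decoded later
encoders = {}
for col in ["Region_Code", "Channel_Code"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
print(encoders["Channel_Code"].inverse_transform([0, 1]))
```

`LabelEncoder` assigns codes in sorted order of the class names, so the integers carry no meaning beyond identity. Tree-based models such as the ones used below can usually cope with that.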

### Normalize The Data

In [None]:
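# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(combined["Avg_Account_Balance"], kde=True)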

In [None]:
combined["Avg_Account_Balance"]=np.log(combined["Avg_Account_Balance"])

In [None]:
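# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(combined["Avg_Account_Balance"], kde=True)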
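
The effect of the log transform can also be checked numerically through the sample skewness. The sketch below uses synthetic log-normal values, not the competition data, so the numbers only illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed balances (log-normal); not the competition data
balance = pd.Series(rng.lognormal(mean=13.5, sigma=0.8, size=10_000))

skew_raw = balance.skew()
skew_log = np.log(balance).skew()
print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness close to zero after the transform suggests a roughly symmetric distribution, which is what the second plot is meant to show.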

### Splitting The Data

In [None]:
n_train = len(data_train)  # number of training rows (245725 in this dataset)
train = combined[:n_train]
test = combined[n_train:]
targets = data_train.Is_Lead

### User defined function for validating all the models

In [None]:
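def cross_val_score(train, targets, model, params, folds=9):
    """Fit `model(**params)` on each stratified fold and print the fold's ROC AUC.

    Note: this name shadows sklearn's cross_val_score imported above, and only
    the model fitted on the last fold is returned.
    """
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=21)

    for fold, (train_temp, test_temp) in enumerate(skf.split(train, targets)):
        # Rows used for fitting and rows held out for this fold
        x_train, y_train = train.iloc[train_temp], targets.iloc[train_temp]
        x_test, y_test = train.iloc[test_temp], targets.iloc[test_temp]

        models = model(**params)
        # The held-out fold also drives early stopping, so the printed AUC is
        # slightly optimistic. Passing early_stopping_rounds/verbose to fit()
        # depends on the installed library version.
        models.fit(x_train, y_train,
                   eval_set=[(x_test, y_test)],
                   early_stopping_rounds=100,
                   verbose=400)

        pred = models.predict_proba(x_test)[:, 1]
        roc = roc_auc_score(y_test, pred)
        print(f"fold {fold}: roc_auc_score: {roc}")
        print("-" * 50)

    return models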
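
The validation loop above prints one score per fold. A common extension, sketched here under stated assumptions, is to store the out-of-fold probabilities so that a single AUC over all training rows can be reported. The data is synthetic, and scikit-learn's `GradientBoostingClassifier` is only a stand-in for the three boosting libraries used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary problem standing in for the lead data
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.75], random_state=21)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)
oof = np.zeros(len(y))  # out-of-fold probability for every training row
fold_scores = []

for fold, (fit_idx, val_idx) in enumerate(skf.split(X, y)):
    # scikit-learn's booster is a stand-in for CatBoost/LightGBM/XGBoost
    clf = GradientBoostingClassifier(n_estimators=50, random_state=21)
    clf.fit(X[fit_idx], y[fit_idx])
    oof[val_idx] = clf.predict_proba(X[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[val_idx], oof[val_idx]))
    print(f"fold {fold}: roc_auc_score = {fold_scores[-1]:.4f}")

print(f"mean fold AUC: {np.mean(fold_scores):.4f}")
print(f"out-of-fold AUC: {roc_auc_score(y, oof):.4f}")
```

Each row is scored by a model that never saw it during fitting, so the out-of-fold vector can also be reused later, for example when choosing a classification threshold.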

### CatBoost

In [None]:
cat_params= {'n_estimators': 20000, 
                  'depth': 4, 
                  'learning_rate': 0.023, 
                  'colsample_bylevel': 0.655, 
                  'bagging_temperature': 0.921, 
                  'l2_leaf_reg': 10.133}

In [None]:
result_cat_boost=cross_val_score(train,targets,CatBoostClassifier,cat_params)

### Light Gradient Boosting (LightGBM)

In [None]:
lgb_params= {'learning_rate': 0.045, 
             'n_estimators': 20000, 
             'max_bin': 94,
             'num_leaves': 10, 
             'max_depth': 27, 
             'reg_alpha': 8.457, 
             'reg_lambda': 6.853, 
             'subsample': 0.749}

In [None]:
result_lgb = cross_val_score(train,targets,LGBMClassifier,lgb_params)

### Extreme Gradient Boosting (XGBoost)

In [None]:
xgb_params= {'n_estimators': 20000, 
             'max_depth': 6, 
             'learning_rate': 0.0201, 
             'reg_lambda': 29.326, 
             'subsample': 0.818, 
             'colsample_bytree': 0.235, 
             'colsample_bynode': 0.820, 
             'colsample_bylevel': 0.453}

In [None]:
result_xgb = cross_val_score(train,targets,XGBClassifier,xgb_params)

### Averaging The Predictions Of XGB, LGB And CAT

In [None]:
pred_test_lgb = result_lgb.predict_proba(test)[:,1]
pred_test_xgb = result_xgb.predict_proba(test)[:,1]
pred_test_cat = result_cat_boost.predict_proba(test)[:,1]
prediction = (pred_test_lgb + pred_test_cat+pred_test_xgb)/3
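
The blend above is an equal-weight mean of the three probability vectors. The sketch below reproduces that arithmetic on made-up numbers and shows a weighted variant next to it. The weights are hypothetical and would normally be chosen from validation scores.

```python
import numpy as np

# Made-up probabilities from three models for four test rows
p_lgb = np.array([0.10, 0.80, 0.55, 0.30])
p_xgb = np.array([0.20, 0.70, 0.65, 0.30])
p_cat = np.array([0.00, 0.90, 0.60, 0.30])

# Equal-weight blend: the same arithmetic as the cell above
blend = np.mean([p_lgb, p_xgb, p_cat], axis=0)

# Weighted blend; the weights are hypothetical
weights = np.array([0.4, 0.3, 0.3])
weighted = np.average([p_lgb, p_xgb, p_cat], axis=0, weights=weights)
print(blend, weighted)
```

Averaging tends to help most when the models make different errors. With near-identical predictions the blend simply reproduces any one of them.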

### Output File

In [None]:
output = prediction
df_output = pd.DataFrame()
temp = pd.read_csv('../input/jobathon-may-2021-credit-card-lead-prediction/Test_Data.csv')
df_output['ID'] = temp['ID']
df_output['Is_Lead'] = output
df_output.to_csv('output_cxl.csv',index=False)

# Conclusion
We obtained a good ROC AUC score on the test data.
The threshold chosen for each model gave a decent split, and we have achieved the objective.
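
As a hedged illustration of how such a threshold could be picked, the sketch below applies Youden's J statistic (the largest gap between true positive rate and false positive rate) to the output of `roc_curve`. The labels and probabilities are made up, and this is one possible rule rather than the one necessarily used for the submission.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and predicted probabilities for a validation fold
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.45, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Youden's J statistic: pick the threshold with the largest TPR - FPR gap
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]
y_pred = (y_prob >= best_threshold).astype(int)
print(f"threshold={best_threshold:.2f}, TPR={tpr[best]:.2f}, FPR={fpr[best]:.2f}")
```

The threshold only matters when hard class labels are needed. The ROC AUC itself is computed from the probabilities and does not depend on it.
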
### Future Improvements:
The models' hyperparameters can be tuned further, but because the training data is large, tuning takes a long time.