## <div align = "center"> Credit Card Lead Prediction</div>

## [1]. Business Problem


Happy Customer Bank is a mid-sized private bank that deals in all kinds of banking products, like Savings accounts, Current accounts, investment products, credit products, among other offerings.

The bank also cross-sells products to its existing customers and to do so they use different kinds of communication like tele-calling, e-mails, recommendations on net banking, mobile banking, etc. 

In this case, Happy Customer Bank wants to cross-sell its credit cards to its existing customers. The bank has identified a set of customers who are eligible for these credit cards.<br>

Now, the bank is looking for your help in identifying customers that could show higher intent towards a recommended credit card, given:
- Customer details (gender, age, region etc.)
- Details of his/her relationship with the bank (Channel_Code, Vintage, Avg_Asset_Value etc.)

### Problem Statement 

Given information about a customer and his/her relationship with the bank, predict whether he/she is interested in the credit card.
- 0 if customer is not interested.
- 1 if customer is interested.

**Real-world/Business objectives and constraints**<br>
- No low-latency requirement. 

## [2]. Machine Learning Problem

### Dataset Overview

**Train Data**
<p>


    1. ID - Unique Identifier for a row
    2. Gender - Gender of the Customer
    3. Age - Age of the Customer (in Years)
    4. Region_Code - Code of the Region for the customers
    5. Occupation - Occupation Type for the customer
    6. Channel_Code - Acquisition Channel Code for the Customer  (Encoded)
    7. Vintage - Vintage for the Customer (In Months)
    8. Credit_Product - If the Customer has any active credit product (Home loan,
                      Personal loan, Credit Card etc.)
    9. Avg_Account_Balance - Average Account Balance for the Customer in last 12 Months
    10. Is_Active - If the Customer is Active in last 3 Months
    11. Is_Lead(Target) - If the Customer is interested in the Credit Card
                        - 0 : Customer is not interested
                        - 1 : Customer is interested
  </p>

**Test Data**
<p>


    1. ID - Unique Identifier for a row
    2. Gender - Gender of the Customer
    3. Age - Age of the Customer (in Years)
    4. Region_Code - Code of the Region for the customers
    5. Occupation - Occupation Type for the customer
    6. Channel_Code - Acquisition Channel Code for the Customer  (Encoded)
    7. Vintage - Vintage for the Customer (In Months)
    8. Credit_Product - If the Customer has any active credit product (Home loan,
                      Personal loan, Credit Card etc.)
    9. Avg_Account_Balance - Average Account Balance for the Customer in last 12 Months
    10. Is_Active - If the Customer is Active in last 3 Months

  </p>

### Evaluation Metric

- The evaluation metric for this competition is roc_auc_score across all entries in the test set.
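
As a reminder of what this metric measures, ROC AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below is a minimal pure-Python illustration of that definition on made-up values; the helper name `pairwise_auc` is hypothetical, and in the notebook itself `sklearn.metrics.roc_auc_score` is what gets used.

```python
from itertools import product

def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    A tie in score counts as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (illustrative values, not from the dataset)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because only the ordering of the scores matters, the metric is unaffected by the choice of classification threshold.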

## [3]. Reading Data

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm import tqdm #for printing status bar
import warnings 
warnings.filterwarnings("ignore")

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, confusion_matrix
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from imblearn.over_sampling import SMOTE

### Loading Data 

In [None]:
train_df = pd.read_csv("../input/jobathon-may-2021-credit-card-lead-prediction/train.csv")
test_df = pd.read_csv("../input/jobathon-may-2021-credit-card-lead-prediction/test.csv")

In [None]:
train_df.head()

In [None]:
print("Number of train data points are", train_df.shape[0])
print("Number of test data points are", test_df.shape[0])

In [None]:
train_df.info()

In [None]:
print("Number of rows with null values is", train_df.isnull().any(axis=1).sum())

In [None]:
#printing null values
train_df[train_df.isnull().any(axis=1)]

In [None]:
# printing class label distribution of null values
train_df[train_df.isnull().any(axis=1)]["Is_Lead"].value_counts()

In [None]:
print("Number of duplicate data points is : ", train_df.duplicated().sum())

- There are null values present in the dataset and most of them have class label = 1. We will use this information later.
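
One way to quantify the observation above is to compare the lead rate of rows with and without a missing value. The sketch below does this on a small made-up frame; the column names match the dataset, but the values and the name `toy` are purely illustrative.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "Credit_Product": ["Yes", None, "No", None, "No", None],
    "Is_Lead":        [1,     1,    0,    1,    0,    0],
})

# Split rows by whether Credit_Product is missing, then compare the share of leads
missing_mask = toy["Credit_Product"].isnull()
lead_rate_missing = toy.loc[missing_mask, "Is_Lead"].mean()
lead_rate_present = toy.loc[~missing_mask, "Is_Lead"].mean()

print("Lead rate when Credit_Product is missing:", lead_rate_missing)
print("Lead rate when Credit_Product is present:", lead_rate_present)
```

A large gap between the two rates suggests that missingness itself carries signal, which is a common reason to keep it as its own category instead of imputing it away.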

## [4]. Exploratory Data Analysis

In [None]:
sns.set_style("whitegrid")
sns.countplot(x = "Is_Lead", data=train_df)
plt.title("Is_Lead Distribution")
plt.show()

In [None]:
print("Total number of datapoints is", train_df.shape[0])
print("Number of points in negative class is %d which is %f percent."%((train_df["Is_Lead"] == 0).sum(),100-np.round(train_df["Is_Lead"].mean()*100,2)))
print("Number of points in positive class is %d which is %f percent."%((train_df["Is_Lead"] == 1).sum(),np.round(train_df["Is_Lead"].mean()*100,2)))

- It can be seen that the data is imbalanced.
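
The degree of imbalance can be summarised with two numbers: the share of positives and the negative-to-positive ratio. The latter is also the value that XGBoost's documentation suggests as a typical starting point for `scale_pos_weight`. The sketch below uses made-up labels; in the notebook the real labels live in `train_df["Is_Lead"]`.

```python
from collections import Counter

# Hypothetical toy labels; the real column is train_df["Is_Lead"]
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0]

counts = Counter(toy_labels)
n_neg, n_pos = counts[0], counts[1]

positive_share = n_pos / len(toy_labels)   # fraction of rows in class 1
neg_to_pos_ratio = n_neg / n_pos           # how many negatives per positive

print(f"positive share: {positive_share:.2%}")
print(f"negative-to-positive ratio: {neg_to_pos_ratio:.2f}")
```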

In [None]:
sns.countplot(x = "Gender", data= train_df)
plt.show()

In [None]:
sns.countplot(x = "Is_Lead", data = train_df, hue = "Gender")
plt.title("Distribution of Gender for each class label")
plt.show()

- The gender split is almost the same in both classes.

In [None]:
sns.FacetGrid(data = train_df ,height=8,hue = "Is_Lead")\
    .map(sns.distplot, "Age") \
    .add_legend()
plt.show()

- Older customers dominate class 1, while customers aged 20 to 30 mostly have class label 0.

In [None]:
train_df.Region_Code.value_counts()

In [None]:
plt.figure(figsize=(30,7))
sns.countplot(x = "Region_Code", data = train_df, hue = "Is_Lead", palette="pastel")
plt.title("Distribution of Region Code for each class label")
plt.show()

In [None]:
train_df.Occupation.value_counts()

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = "Is_Lead", data = train_df, hue = "Occupation", palette="pastel")
plt.title("Distribution of Occupation for each class label")
plt.show()

- Salaried employees are more likely to have class label 0, while Entrepreneurs are more likely to have class label 1.

In [None]:
train_df.Channel_Code.value_counts()

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = "Is_Lead", data = train_df, hue = "Channel_Code", palette="pastel")
plt.title("Distribution of Channel Code for each class label")
plt.show()

- Channels X1 and X4 are more likely to have Is_Lead = 0.

In [None]:
sns.FacetGrid(data = train_df ,height=8,hue = "Is_Lead")\
    .map(sns.distplot, "Vintage") \
    .add_legend()
plt.show()

- Class 0 is the majority for low Vintage, while class 1 is the majority for high Vintage.

In [None]:
# one box per class label, so the two balance distributions can be compared side by side
sns.boxplot(x = "Is_Lead", y = "Avg_Account_Balance", data = train_df)
plt.show()

In [None]:
for i in range(0,101,10):
    print(i,"th percentile of avg_account_balance is", np.percentile(train_df["Avg_Account_Balance"], i))

In [None]:
for i in range(90,101):
    print(i,"th percentile of avg_account_balance is", np.percentile(train_df["Avg_Account_Balance"], i))

In [None]:
for i in range(991,1001):
    print(i/10,"th percentile of avg_account_balance is", np.percentile(train_df["Avg_Account_Balance"], i/10))

- There are outliers in the Avg_Account_Balance column. Removing them did not give the desired results, so they are kept.
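
For reference, two common ways of taming a long right tail without dropping rows are percentile capping and a log transform. The sketch below shows both on made-up balances; the 99th-percentile threshold is an arbitrary assumption, not a value used in this notebook.

```python
import numpy as np

# Hypothetical skewed balances; values are made up for illustration
balances = np.array([20_000, 55_000, 80_000, 120_000, 300_000, 9_500_000], dtype=float)

# Option 1: cap values above a chosen percentile (99 is an arbitrary choice here)
cap = np.percentile(balances, 99)
capped = np.minimum(balances, cap)

# Option 2: compress the right tail with log(1 + x), which preserves the ordering
logged = np.log1p(balances)

print("cap:", cap)
print("capped:", capped)
print("logged:", np.round(logged, 2))
```

Tree-based models split on the ordering of feature values, so an order-preserving transform such as the log usually changes their predictions very little. That may be one reason outlier handling made no visible difference here.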

In [None]:
sns.FacetGrid(data = train_df[train_df["Avg_Account_Balance"]<0.4e7] ,height=8,hue = "Is_Lead")\
    .map(sns.distplot, "Avg_Account_Balance") \
    .add_legend()
plt.show()

In [None]:
sns.FacetGrid(data = train_df[train_df["Avg_Account_Balance"]>0.4e7] ,height=8,hue = "Is_Lead")\
    .map(sns.distplot, "Avg_Account_Balance") \
    .add_legend()
plt.show()

- Not much can be inferred from the Avg_Account_Balance column alone.

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = "Is_Lead", data = train_df, hue = "Is_Active", palette="pastel")
plt.title("Distribution of Is_Active for each class label")
plt.show()

In [None]:
train_df.Credit_Product.value_counts()

In [None]:
# filling null values with No_Info
train_df = train_df.fillna("No_Info")
test_df = test_df.fillna("No_Info")

In [None]:
train_df.Credit_Product.value_counts()

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize=(10,7))
sns.countplot(x = "Is_Lead", data = train_df, hue = "Credit_Product", palette="pastel")
plt.title("Distribution of Credit Product for each class label")
plt.show()

- Is_Lead is predominantly 1 when Credit_Product is Yes, and also predominantly 1 when Credit_Product is No_Info.

## [5]. Pre-processing

In [None]:
#creating a copy of dataframe
train_df_preprocessed = train_df.copy()
test_df_preprocessed = test_df.copy()

In [None]:
train_df.head()

In [None]:
train_df_preprocessed.head()

In [None]:
train_df_preprocessed.shape

In [None]:
train_df_preprocessed.info()

- We will be using one hot encoding to encode the features.

In [None]:
# one hot encoding the categorical features
encoder = CountVectorizer()
for column in tqdm(["Gender","Region_Code","Occupation","Channel_Code","Credit_Product","Is_Active"]):
    array1 = encoder.fit_transform(train_df[column]).toarray()
    array1_df = pd.DataFrame(array1, columns= [column + str(i) for i in range(array1.shape[1])])
    array2 = encoder.transform(test_df[column]).toarray()
    array2_df = pd.DataFrame(array2, columns= [column + str(i) for i in range(array1.shape[1])])
    train_df_preprocessed.drop(columns=column, inplace=True)
    test_df_preprocessed.drop(columns = column, inplace = True)
    train_df_preprocessed = pd.concat([train_df_preprocessed,array1_df], axis =1)
    test_df_preprocessed = pd.concat([test_df_preprocessed,array2_df], axis = 1)
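
The loop above uses `CountVectorizer` as a one-hot encoder and names the resulting columns by position. A possible alternative, sketched below on made-up frames, is `pd.get_dummies` followed by a `reindex`, which names columns after the category and guarantees that train and test end up with identical columns. The names `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; the real inputs are train_df and test_df
toy_train = pd.DataFrame({"Occupation": ["Salaried", "Other", "Entrepreneur"], "Age": [30, 45, 52]})
toy_test = pd.DataFrame({"Occupation": ["Other", "Salaried"], "Age": [28, 61]})

categorical_cols = ["Occupation"]

# One indicator column per category, named <column>_<category>
train_encoded = pd.get_dummies(toy_train, columns=categorical_cols)
test_encoded = pd.get_dummies(toy_test, columns=categorical_cols)

# Give the test frame exactly the training columns;
# categories that never occur in test become all-zero columns
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded.columns.tolist())
print(test_encoded)
```

Readable column names make later steps such as feature-importance plots easier to interpret.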
    
    
    

In [None]:
train_df_preprocessed.head()

In [None]:
test_df_preprocessed.head()

In [None]:
#dropping id column
train_df_preprocessed.drop(columns = "ID", inplace = True)
test_df_preprocessed.drop(columns = "ID", inplace = True)

In [None]:
train_df_preprocessed.head()

In [None]:
test_df_preprocessed.head()

In [None]:
# creating X and Y datasets
Y_train_onehot = train_df_preprocessed["Is_Lead"]
X_train_onehot = train_df_preprocessed.drop(columns = "Is_Lead")

In [None]:
Y_train_onehot.value_counts()

In [None]:
X_train_onehot.head()

## [6]. Modelling

- Using tree based algorithms as the dimensionality of data is moderately low and most of the features are binary.
- We will use Ensemble Techniques like Gradient Boosting for better model performance.

In [None]:
#LGBMClassifier with RandomizedSearchCV for hyperparameter tuning
param_dist = {"n_estimators":sp_randint(40,100),
              "colsample_bytree":np.array([0.5,0.6,0.7,0.8,0.9,1]),
              "subsample":np.array([0.5,0.6,0.7,0.8,0.9,1]),
              "reg_lambda":np.array([1e-5,1e-4,1e-3,1e-2,0.1,1,10,100]),
              "reg_alpha":np.array([1e-5,1e-4,1e-3,1e-2,0.1,1,10,100]),
              "min_child_samples": sp_randint(25,65),
                "max_depth": sp_randint(1,20)}

clf1 = LGBMClassifier(boosting_type = "gbdt",n_jobs =-1,random_state = 42)

lgbm_random = RandomizedSearchCV(clf1, param_distributions=param_dist,
                                   n_iter=20,cv=10,scoring='roc_auc',random_state=42,verbose=1)

lgbm_random.fit(X_train_onehot,Y_train_onehot)
print('mean test scores',lgbm_random.cv_results_['mean_test_score'])

In [None]:
lgbm_random.best_params_
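
Instead of copying the best parameters by hand into a new model, the fitted search object can be used directly. The sketch below shows this on synthetic data, with scikit-learn's `GradientBoostingClassifier` standing in for `LGBMClassifier` so that it runs without extra dependencies. The data, the search ranges, and the names `X_toy`, `y_toy`, and `tuned_model` are all assumptions made for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train_onehot / Y_train_onehot
X_toy, y_toy = make_classification(n_samples=300, n_features=8, weights=[0.75], random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for LGBMClassifier
    param_distributions={"n_estimators": randint(20, 40), "max_depth": randint(1, 4)},
    n_iter=3, cv=3, scoring="roc_auc", random_state=0,
)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ has already been refit
# on all of the data passed to fit, using best_params_
tuned_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 4))
```

Using `best_estimator_` avoids transcription mistakes when the search is rerun with a different seed or range.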

In [None]:
final_model = LGBMClassifier(colsample_bytree =  0.8,
 max_depth =  8,
 min_child_samples = 45,
 n_estimators =  55,
 reg_lambda =  0.0001,
 reg_alpha= 0.1,
 subsample =  0.9, n_jobs=-1,boosting_type = "gbdt")
final_model.fit(X_train_onehot,Y_train_onehot)
proba = final_model.predict_proba(X_train_onehot)[:,1]
train_score = roc_auc_score(Y_train_onehot,proba)
cv_score = cross_val_score(final_model,X_train_onehot,Y_train_onehot,scoring="roc_auc",verbose=2,cv =5).mean()
print(cv_score)
print(train_score)

In [None]:
#calculating test probabilites
test_proba = final_model.predict_proba(test_df_preprocessed)[:,1]
final_solution = pd.DataFrame()
final_solution["ID"] = test_df["ID"]
final_solution["Is_Lead"] = test_proba


In [None]:
final_solution

In [None]:
# saving to csv
final_solution.to_csv("final_solution.csv")

In [None]:
df = pd.read_csv("final_solution.csv")
df.head()
df.drop(columns = "Unnamed: 0",inplace = True)

In [None]:
df.head()

In [None]:
df.to_csv("solution.csv", index=False)

### Oversampling

- As there was a class imbalance present in the train data we will try to overcome it by oversampling. For this we will use SMOTE. 
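
The core idea behind SMOTE is to create a synthetic minority point on the line segment between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that interpolation step only, not imblearn's implementation; the function name `synthesize` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, k=2, n_new=5, rng=rng):
    """Create n_new points by interpolating toward a random one of the k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))                    # pick a random minority point
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]          # position 0 is the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                               # position along the segment, in [0, 1)
        new_points.append(points[i] + lam * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority)
print(synthetic)
```

Note that interpolation produces fractional values, so on one-hot encoded columns the synthetic rows are no longer strictly binary. imblearn also provides `SMOTENC` for data that mixes categorical and continuous features, which may be worth considering.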

In [None]:
oversample = SMOTE(random_state=42, n_jobs=-1)
X_train_ovr, Y_train_ovr = oversample.fit_resample(X_train_onehot,Y_train_onehot)
print(X_train_ovr.shape)
print(Y_train_ovr.shape)
print(Y_train_ovr.value_counts())

In [None]:
param_dist = {"n_estimators":sp_randint(40,100),
              "colsample_bytree":np.array([0.5,0.6,0.7,0.8,0.9,1]),
              "subsample":np.array([0.5,0.6,0.7,0.8,0.9,1]),
              "reg_lambda":np.array([1e-5,1e-4,1e-3,1e-2,0.1,1,10,100]),
              "reg_alpha":np.array([1e-5,1e-4,1e-3,1e-2,0.1,1,10,100]),
              "min_child_samples": sp_randint(25,65),
                "max_depth": sp_randint(1,20)}


clf2 = LGBMClassifier(boosting_type = "gbdt",n_jobs =-1,random_state = 21,silent=True)

lgbm_random1 = RandomizedSearchCV(clf2, param_distributions=param_dist,
                                   n_iter=20,cv=10,scoring='roc_auc',random_state=21,verbose=1)

lgbm_random1.fit(X_train_ovr,Y_train_ovr)
print('mean test scores',lgbm_random1.cv_results_['mean_test_score'])

In [None]:
lgbm_random1.best_params_

In [None]:
final_model1 = LGBMClassifier(colsample_bytree =  0.8,
 max_depth =  16,
 min_child_samples = 39,
 n_estimators =  86,
 reg_lambda =  0.001,
 reg_alpha= 1.0,
 subsample =  0.9, n_jobs=-1,boosting_type = "gbdt")
final_model1.fit(X_train_ovr,Y_train_ovr)
proba = final_model1.predict_proba(X_train_onehot)[:,1]
train_score = roc_auc_score(Y_train_onehot,proba)
cv_score = cross_val_score(final_model1,X_train_onehot,Y_train_onehot,scoring="roc_auc",verbose=2,cv =5).mean()
print(cv_score)
print(train_score)
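
One caveat when reading the scores above: `cross_val_score` clones the estimator and refits it on the data it is given, so the CV score here comes from models trained on the original, non-oversampled folds. To estimate how an oversampled model generalises, the resampling has to happen inside each training fold while the validation fold stays untouched. The sketch below shows that pattern on synthetic data, with plain duplication standing in for SMOTE and `LogisticRegression` standing in for the boosted model; all names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data as a stand-in for the one-hot encoded training set
X_toy, y_toy = make_classification(n_samples=400, n_features=6, weights=[0.8], random_state=0)
rng = np.random.default_rng(0)

fold_scores = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in splitter.split(X_toy, y_toy):
    X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]

    # Oversample the minority class in the training fold only
    # (simple duplication here, as a stand-in for SMOTE)
    pos_idx = np.flatnonzero(y_tr == 1)
    neg_idx = np.flatnonzero(y_tr == 0)
    extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
    keep = np.concatenate([neg_idx, pos_idx, extra])

    model = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])  # stand-in model

    # Score on the untouched validation fold
    val_proba = model.predict_proba(X_toy[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_toy[val_idx], val_proba))

print(np.mean(fold_scores))
```

imblearn's own `Pipeline` applies samplers only during `fit`, so wrapping the sampler and the classifier in such a pipeline and passing it to `cross_val_score` or `RandomizedSearchCV` should give the same per-fold behaviour with less code.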

In [None]:
# Calculating test probabilites
test_proba1 = final_model1.predict_proba(test_df_preprocessed)[:,1]
final_solution1 = pd.DataFrame()
final_solution1["ID"] = test_df["ID"]
final_solution1["Is_Lead"] = test_proba1

In [None]:
# saving to csv
final_solution1.to_csv("solution1.csv", index=False)

- To further improve the model performance we will choose the average probability that we get from both of the models.
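
Averaging two prediction files row by row is only correct if both are in the same row order. A slightly more defensive variant, sketched below on made-up predictions, aligns the two frames on `ID` before averaging. The frame names `preds_a` and `preds_b` are hypothetical.

```python
import pandas as pd

# Hypothetical predictions from two models, deliberately in different row orders
preds_a = pd.DataFrame({"ID": ["a1", "b2", "c3"], "Is_Lead": [0.20, 0.70, 0.40]})
preds_b = pd.DataFrame({"ID": ["c3", "a1", "b2"], "Is_Lead": [0.60, 0.10, 0.90]})

# Align on ID (failing loudly if an ID is duplicated), then take the unweighted mean
merged = preds_a.merge(preds_b, on="ID", suffixes=("_a", "_b"), validate="one_to_one")
blended = pd.DataFrame({
    "ID": merged["ID"],
    "Is_Lead": merged[["Is_Lead_a", "Is_Lead_b"]].mean(axis=1),
})
print(blended)
```

In this notebook both files are built from the same `test_df`, so the positional average gives the same result; the merge mainly protects against files produced in separate runs.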

In [None]:
# taking average proba of both models
final_solution1["Is_Lead"] = (final_solution1["Is_Lead"] + df["Is_Lead"])/2

In [None]:
final_solution1.head()

In [None]:
#final output file
final_solution1.to_csv("solution3.csv", index=False)

In [None]:
param_dist = {"n_estimators":sp_randint(40,100),
              "colsample_bytree":np.array([0.5,0.6,0.7,0.8,0.9,1]),
              "subsample":np.array([0.5,0.6,0.7,0.8,0.9,1]),
              "reg_lambda":np.array([1e-5,1e-4,1e-3,1e-2,0.1,1,10,100]),
              "reg_alpha":np.array([1e-5,1e-4,1e-3,1e-2,0.1,1,10,100]),
              "min_child_samples": sp_randint(25,65),
                "max_depth": sp_randint(1,20)}

clf1 = LGBMClassifier(boosting_type = "gbdt",n_jobs =-1,random_state = 10)

lgbm_random = RandomizedSearchCV(clf1, param_distributions=param_dist,
                                   n_iter=20,cv=10,scoring='roc_auc',random_state=42,verbose=1)

lgbm_random.fit(X_train_onehot,Y_train_onehot)
print('mean test scores',lgbm_random.cv_results_['mean_test_score'])

In [None]:
lgbm_random.best_estimator_

In [None]:
final_model = LGBMClassifier(colsample_bytree=0.8, max_depth=16, min_child_samples=39,
               n_estimators=86, random_state=10, reg_alpha=1.0,
               reg_lambda=0.001, subsample=0.8)
final_model.fit(X_train_onehot,Y_train_onehot)
proba = final_model.predict_proba(X_train_onehot)[:,1]
train_score = roc_auc_score(Y_train_onehot,proba)
cv_score = cross_val_score(final_model,X_train_onehot,Y_train_onehot,scoring="roc_auc",verbose=2,cv =5).mean()
print(cv_score)
print(train_score)

In [None]:
test_proba = final_model.predict_proba(test_df_preprocessed)[:,1]
final_solution1 = pd.DataFrame()
final_solution1["ID"] = test_df["ID"]
final_solution1["Is_Lead"] = test_proba

In [None]:
final_solution1.to_csv("submission_lgbm.csv", index = False)

### Stacking classifier

In [None]:
param_dist = {"n_estimators":sp_randint(40,100),
              "colsample_bytree":np.array([0.5,0.6,0.7,0.8,0.9,1]),
              "subsample":np.array([0.5,0.6,0.7,0.8,0.9,1]),
              "reg_lambda":np.array([1e-5,1e-4,1e-3,1e-2,0.1,1,10,100]),
              "reg_alpha":np.array([1e-5,1e-4,1e-3,1e-2,0.1,1,10,100]),
              "min_child_samples": sp_randint(25,65),
                "max_depth": sp_randint(1,20)}

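# Note: boosting_type (below) and min_child_samples (in param_dist) are LightGBM parameter names.
# XGBoost does not define them, so they are expected to be ignored, typically with a warning.
clf1 = XGBClassifier(boosting_type = "gbdt",n_jobs =-1,random_state = 0,verbosity =0,scale_pos_weight = 3.2158)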

xgb_random = RandomizedSearchCV(clf1, param_distributions=param_dist,
                                   n_iter=20,cv=5,scoring='roc_auc',random_state=42,verbose=1)

xgb_random.fit(X_train_onehot,Y_train_onehot)
print('mean test scores',xgb_random.cv_results_['mean_test_score'])

In [None]:
xgb_random.best_estimator_

In [None]:
final_model_1 = XGBClassifier(base_score=0.5, booster='gbtree', boosting_type='gbdt',
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.6,
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=7, min_child_samples=52,
              min_child_weight=1 ,monotone_constraints='()',
              n_estimators=46, n_jobs=-1, num_parallel_tree=1, random_state=0,
              reg_alpha=1e-05, reg_lambda=100.0, scale_pos_weight=3.2158,
              subsample=0.8, tree_method='exact', validate_parameters=1,
              verbosity=0)
final_model_1.fit(X_train_onehot,Y_train_onehot)
proba = final_model_1.predict_proba(X_train_onehot)[:,1]
train_score = roc_auc_score(Y_train_onehot,proba)
cv_score = cross_val_score(final_model_1,X_train_onehot,Y_train_onehot,scoring="roc_auc",verbose=2,cv =5).mean()
print(cv_score)
print(train_score)


In [None]:
test_proba_1 = final_model_1.predict_proba(test_df_preprocessed)[:,1]
final_solution_1 = pd.DataFrame()
final_solution_1["ID"] = test_df["ID"]
final_solution_1["Is_Lead"] = test_proba_1

In [None]:
final_solution_1.to_csv("xgb_solution.csv", index = False)

In [None]:
from mlxtend.classifier import StackingClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
model_1  = LGBMClassifier(colsample_bytree=0.8, max_depth=16, min_child_samples=39,
               n_estimators=86, random_state=10, reg_alpha=1.0,
               reg_lambda=0.001, subsample=0.8)
model_1.fit(X_train_onehot,Y_train_onehot)

model_2 = XGBClassifier(base_score=0.5, booster='gbtree', boosting_type='gbdt',
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.6,
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=7, min_child_samples=52,
              min_child_weight=1 ,monotone_constraints='()',
              n_estimators=46, n_jobs=-1, num_parallel_tree=1, random_state=0,
              reg_alpha=1e-05, reg_lambda=100.0, scale_pos_weight=3.2158,
              subsample=0.8, tree_method='exact', validate_parameters=1,
              verbosity=0)
model_2.fit(X_train_onehot,Y_train_onehot)

In [None]:
# candidate values for C, the inverse regularization strength of the meta-classifier
alpha = [1e-7,1e-6,1e-5, 0.0001,0.001,0.01,0.1,1,10] 
for i in alpha:
    lr = LogisticRegression(C = i )
    sclf = StackingClassifier([model_1,model_2], meta_classifier=lr, use_probas=True )
    sclf.fit(X_train_onehot,Y_train_onehot)
    train_proba = sclf.predict_proba(X_train_onehot)[:,1]
    train_score = roc_auc_score(Y_train_onehot,train_proba)
    cv_score = cross_val_score(sclf, X_train_onehot,Y_train_onehot,scoring="roc_auc",verbose=0,cv =5).mean()
    # %g keeps small values such as 1e-07 readable (%f would print 0.000000)
    print("Stacking classifier for C = %g, train score is %f and cv_score is %f"%(i, train_score, cv_score))
    
    
    

In [None]:
lr = LogisticRegression(C =0.0001)
sclf =StackingClassifier([model_1,model_2], meta_classifier=lr, use_probas=True )
sclf.fit(X_train_onehot,Y_train_onehot)
train_proba = sclf.predict_proba(X_train_onehot)[:,1]
train_score = roc_auc_score(Y_train_onehot,train_proba)
cv_score = cross_val_score(sclf, X_train_onehot,Y_train_onehot,scoring="roc_auc",verbose=0,cv =5).mean()
# report the C actually used here, not the last value left over from the loop variable
print("Stacking classifier for C = %g, train score is %f and cv_score is %f"%(lr.C, train_score, cv_score))

In [None]:
test_proba_sclf = sclf.predict_proba(test_df_preprocessed)[:,1]
final_solution_sclf = pd.DataFrame()
final_solution_sclf["ID"] = test_df["ID"]
final_solution_sclf["Is_Lead"] = test_proba_sclf

In [None]:
final_solution_sclf.to_csv("sclf_submission.csv", index = False)
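
For comparison, scikit-learn ships its own `StackingClassifier`, which trains the meta-learner on out-of-fold predictions of the base models. This is a different class from the mlxtend one used above, which by default fits the meta-classifier on base-model predictions for the same training data (mlxtend offers `StackingCVClassifier` for the out-of-fold variant). The sketch below runs on synthetic data with scikit-learn base models standing in for the tuned LightGBM and XGBoost models; every name and value in it is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train_onehot / Y_train_onehot
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=30, random_state=0)),  # stand-in base model
        ("rf", RandomForestClassifier(n_estimators=30, random_state=0)),      # stand-in base model
    ],
    final_estimator=LogisticRegression(C=1.0),  # smaller C means stronger regularization
    stack_method="predict_proba",
    cv=3,  # the meta-learner is trained on out-of-fold base predictions
)

stack_cv_auc = cross_val_score(stack, X_toy, y_toy, scoring="roc_auc", cv=3).mean()
print(round(stack_cv_auc, 4))
```

Training the meta-learner on out-of-fold predictions reduces the risk that it simply learns to trust whichever base model overfits the training data most.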

In [None]:
final_solution_1 = pd.read_csv("solution3.csv")

In [None]:
final_solution_x = final_solution_1.copy()

In [None]:
# Averaging the stacking output with the averaged output saved in solution3.csv
final_solution_x["Is_Lead"] = (final_solution_1['Is_Lead'] + final_solution_sclf["Is_Lead"])/2

In [None]:
final_solution_x.to_csv("submission_x.csv", index = False)