## Credit Card Lead Prediction


##### Happy Customer Bank is a mid-sized private bank that deals in all kinds of banking products, such as savings accounts, current accounts, investment products, and credit products, among other offerings.

##### The bank also cross-sells products to its existing customers, using different kinds of communication such as telecasting, e-mails, recommendations on net banking, mobile banking, etc.

##### In this case, the Happy Customer Bank wants to cross-sell its credit cards to its existing customers. The bank has identified a set of customers that are eligible for taking these credit cards.

### Features

##### ID :- Unique Identifier for a row

##### Gender :- Gender of the Customer

##### Age :- Age of the Customer (in Years)

##### Region_Code :- Code of the Region for the customers

##### Occupation :- Occupation Type for the customer

##### Channel_Code :- Acquisition Channel Code for the Customer (Encoded)

##### Vintage :- Vintage for the Customer (In Months)

##### Credit_Product :- Whether the Customer has any active credit product (Home loan, Personal loan, Credit Card, etc.)

##### Avg_Account_Balance :- Average Account Balance for the Customer in the last 12 Months

##### Is_Active :- Whether the Customer was Active in the last 3 Months


### Target

##### Is_Lead :- Whether the Customer is interested in the Credit Card

##### 0 : Customer is not interested
##### 1 : Customer is interested

#### Importing all required Libraries and Dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
import time
warnings.filterwarnings(action="ignore")
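# Newer matplotlib releases renamed the seaborn styles; use whichever name is available.
bright = "seaborn-v0_8-bright" if "seaborn-v0_8-bright" in plt.style.available else "seaborn-bright"
plt.style.use([bright,"dark_background"])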

In [None]:
train = pd.read_csv("../input/jobathon-may-2021-credit-card-lead-prediction/train.csv")
test = pd.read_csv("../input/jobathon-may-2021-credit-card-lead-prediction/test.csv")
train.head()

#### Checking for missing values.

In [None]:
for i in train.columns:
    per = (train[i].isnull().sum()/len(train))*100
    print("Feature {} has {}% data missing".format(i,per))
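
The same percentages can be obtained without an explicit loop, because `isnull().mean()` returns the fraction of missing values per column. A minimal sketch on a made-up frame (the `toy` data below is an assumption for illustration, not the competition data):

```python
import pandas as pd

# Toy frame standing in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    "Credit_Product": ["Yes", None, "No", None],
    "Age": [30, 45, 52, 28],
})

# isnull() gives a boolean frame; its column-wise mean is the missing fraction.
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct)
```

Sorting `missing_pct` in descending order would put the columns with the most missing data first.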

#### Removing rows with null values.

In [None]:
print("Shape before removing null values - {}".format(train.shape))
train.dropna(inplace=True,axis=0)
print("Shape after removing null values - {}".format(train.shape))

#### Checking for unique values in categorical features.

In [None]:
for i in train.columns:
    if train[i].dtype=="object":
        print("{} has {} unique values i.e. {}".format(i,train[i].nunique(),train[i].unique()))

#### Data Visualization

In [None]:
fig = px.histogram(train,"Avg_Account_Balance",color="Is_Lead",height=400,width=700,title="Average Account Balance")
fig.show()

In [None]:
train["log_Avg_Account_Balance"] = np.log10(train["Avg_Account_Balance"])

In [None]:
fig = px.histogram(train,"log_Avg_Account_Balance",color="Is_Lead",height=400,width=700,title="Log of Average Account Balance")
fig.show()
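
The log transform is applied because balances span several orders of magnitude and a few very large values stretch the histogram. A rough sketch of the effect on skewness, using hypothetical balances (not values from the dataset); note that `np.log10` is only defined for strictly positive inputs:

```python
import numpy as np
import pandas as pd

# Hypothetical balances spanning several orders of magnitude.
balance = pd.Series([20_000, 50_000, 100_000, 500_000, 10_000_000])
log_balance = np.log10(balance)

# Skewness close to 0 indicates a more symmetric distribution.
skew_before = balance.skew()
skew_after = log_balance.skew()
print("skew before:", round(skew_before, 3))
print("skew after :", round(skew_after, 3))
```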

In [None]:
fig = px.histogram(train,"Age",title="Age",height=400,width=700,color="Is_Lead")
fig.show()

In [None]:
train["log_age"] = np.log10(train["Age"])

In [None]:
fig = px.histogram(train,"log_age",title="Log Age",height=400,width=700,color="Is_Lead")
fig.show()

In [None]:
fig = px.histogram(train,"Vintage",title="Vintage",height=400,width=700,color="Is_Lead")
fig.show()

In [None]:
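# Constant weight per row, so that the summed "values" in the charts below equal row counts.
train["Sr. No"] = np.ones(len(train),dtype=int)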

In [None]:
fig = px.pie(train,names="Occupation",values="Sr. No",color="Occupation",height=500,width=500,title="Occupation")
fig.show()

In [None]:
fig = px.pie(train,names="Gender",values="Sr. No",color="Gender",height=500,width=500,title="Gender")
fig.show()

In [None]:
fig = px.pie(train,names="Channel_Code",values="Sr. No",color="Channel_Code",height=500,width=500,title="Channel_Code")
fig.show()

In [None]:
fig = px.pie(train,names="Credit_Product",values="Sr. No",color="Credit_Product",height=500,width=500,title="Credit Product")
fig.show()

In [None]:
fig = px.pie(train,names="Is_Active",values="Sr. No",color="Is_Active",height=500,width=500,title="Is Active")
fig.show()

In [None]:
fig = px.sunburst(train, path=['Occupation', 'Gender', 'Is_Lead'], values='Sr. No',height=500,width=500)
fig.show()

In [None]:
fig = px.sunburst(train, path=['Channel_Code', 'Gender','Is_Active', 'Is_Lead'], values='Sr. No',height=500,width=500)
fig.show()
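
Pie and sunburst charts sum the `values` column within each category, so slice sizes reflect row counts only when every row carries the same weight. One way to avoid a helper column altogether is to count the rows per category first. A minimal sketch on invented data (the `toy` frame and the `count` column name are assumptions):

```python
import pandas as pd

# Invented occupations, standing in for train["Occupation"].
toy = pd.DataFrame(
    {"Occupation": ["Salaried", "Other", "Salaried", "Self_Employed", "Salaried"]}
)

# One row per category with its number of occurrences.
counts = (
    toy["Occupation"]
    .value_counts()
    .rename_axis("Occupation")
    .reset_index(name="count")
)
print(counts)
```

A frame like `counts` could then be passed to `px.pie(counts, names="Occupation", values="count")`.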

#### Converting categorical data to numerical data.

In [None]:
train["Gender"] = train["Gender"].replace(['Female','Male'],[0,1])
train["Channel_Code"] = train["Channel_Code"].replace(['X3','X1','X2','X4'],[3,1,2,4])
train["Credit_Product"] = train["Credit_Product"].replace(['No','Yes'],[0,1])
train["Is_Active"] = train["Is_Active"].replace(['No','Yes'],[0,1])

In [None]:
plt.figure(figsize=(11,9))
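# Correlate numeric columns only; the helper column "Sr. No" carries no information.
sns.heatmap(train.select_dtypes(include="number").drop(columns=["Sr. No"],errors="ignore").corr(),cmap="spring",annot=True)
plt.title("Correlation Heatmap")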
plt.show()

In [None]:
sns.countplot(x=train["Is_Lead"])
plt.show()

##### Our dataset is imbalanced: target 0 has far more rows than target 1.

##### Creating balanced dataset.

In [None]:
class_count_0, class_count_1 = train['Is_Lead'].value_counts()

class_0 = train[train['Is_Lead'] == 0]
class_1 = train[train['Is_Lead'] == 1]
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)

In [None]:
# Fix the seed so the undersampled subset is the same on every run.
class_0_under = class_0.sample(class_count_1,random_state=101)

train1 = pd.concat([class_0_under, class_1], axis=0)

print("total class of 1 and 0:",train1['Is_Lead'].value_counts())
train1['Is_Lead'].value_counts().plot(kind='bar', title='target')
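
Random undersampling keeps every minority-class row and draws an equally sized random subset of the majority class. A small sketch on a made-up frame (the `toy` data and the seed are assumptions), showing that both classes end up the same size:

```python
import pandas as pd

# Made-up imbalanced target: eight rows of class 0, two rows of class 1.
toy = pd.DataFrame({"Is_Lead": [0] * 8 + [1] * 2, "x": range(10)})

minority = toy[toy["Is_Lead"] == 1]
# Draw as many majority rows as there are minority rows; the seed makes it repeatable.
majority_sample = toy[toy["Is_Lead"] == 0].sample(n=len(minority), random_state=0)

balanced = pd.concat([majority_sample, minority], axis=0)
print(balanced["Is_Lead"].value_counts())
```

The trade-off is that the discarded majority rows are never seen by the model.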

In [None]:
train1.drop(columns=["ID","Age","Sr. No"],inplace=True)

##### Creating dummy variables for remaining categorical features.

In [None]:
train1 = pd.get_dummies(train1,columns=["Region_Code","Occupation"])

In [None]:
train1.drop(columns=["Avg_Account_Balance"],inplace=True)

In [None]:
train1.shape

##### Splitting dataset.

In [None]:
X = train1.drop(columns=["Is_Lead"])
y = train1["Is_Lead"]

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(X,y,test_size=0.2,random_state=101)

##### Scaling data using StandardScaler.

In [None]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

In [None]:
column = X.columns
# Fit the scaler on the training split only, then reuse its statistics on the validation split.
x_train = scale.fit_transform(x_train)
x_train = pd.DataFrame(x_train,columns=column)
x_valid = scale.transform(x_valid)
x_valid = pd.DataFrame(x_valid,columns=column)
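
A scaler is conventionally fitted on the training split only, so that held-out data is transformed with statistics it did not contribute to. A minimal sketch with invented numbers (the arrays below are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented single-feature splits.
train_part = np.array([[1.0], [2.0], [3.0]])
valid_part = np.array([[10.0], [20.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_part)  # learns mean and std from the training split
valid_scaled = scaler.transform(valid_part)      # reuses those training statistics

print("training mean:", scaler.mean_)
print("scaled validation values:", valid_scaled.ravel())
```

The validation values land far from zero here, which correctly signals that they lie outside the training range; refitting on the validation split would hide that.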

#### Training different classification models.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,GradientBoostingClassifier
from sklearn.naive_bayes import BernoulliNB,GaussianNB
from sklearn.linear_model import LogisticRegression

In [None]:
models = []
models.append(("DecisionTreeClassifier",DecisionTreeClassifier()))
models.append(("RandomForestClassifier",RandomForestClassifier()))
models.append(("ExtraTreesClassifier",ExtraTreesClassifier()))
models.append(("GradientBoostingClassifier",GradientBoostingClassifier()))
models.append(("BernoulliNB",BernoulliNB()))
models.append(("GaussianNB",GaussianNB()))
models.append(("LogisticRegression",LogisticRegression()))

In [None]:
from sklearn.metrics import f1_score,precision_score,recall_score,accuracy_score

In [None]:
for name, model in models:
    begin = time.time()
    model.fit(x_train,y_train)
    pred = model.predict(x_train)
    acc = accuracy_score(y_train,pred)
    f1score = f1_score(y_train,pred)
    pre_score = precision_score(y_train,pred)
    rec_score = recall_score(y_train,pred)
    print("For {}".format(name))
    print("On Training data :- Accuracy = {}, F1 Score = {}, Precision = {}, Recall = {}".format(round(acc,4),round(f1score,4),round(pre_score,4),round(rec_score,4)))
    
    pred1 = model.predict(x_valid)
    acc1 = accuracy_score(y_valid,pred1)
    f1score1 = f1_score(y_valid,pred1)
    pre_score1 = precision_score(y_valid,pred1)
    rec_score1 = recall_score(y_valid,pred1)
    print("On Validation data :- Accuracy = {}, F1 Score = {}, Precision = {}, Recall = {}".format(round(acc1,4),round(f1score1,4),round(pre_score1,4),round(rec_score1,4)))   

    end = time.time()
    print("Completion time :- {} sec\n".format(round(end - begin,4)))
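
The scores above are computed from hard 0/1 predictions. For models that expose `predict_proba`, a threshold-free metric such as ROC AUC can be reported alongside them. A hedged sketch on synthetic data (the data from `make_classification` and the variable names are assumptions, not the notebook's data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for the real features.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=101)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Column 1 of predict_proba holds the estimated probability of class 1.
proba = clf.predict_proba(X_va)[:, 1]
auc = roc_auc_score(y_va, proba)
print("Validation ROC AUC:", round(auc, 4))
```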

#### We apply the same preprocessing steps to the test data as to the training data.

In [None]:
test.head()

In [None]:
for i in test.columns:
    per = (test[i].isnull().sum()/len(test))*100
    print("Feature {} has {}% data missing".format(i,per))

In [None]:
test.isnull().sum()

In [None]:
test["Credit_Product"] = test["Credit_Product"].fillna("No")

In [None]:
test["Gender"] = test["Gender"].replace(['Female','Male'],[0,1])
test["Channel_Code"] = test["Channel_Code"].replace(['X3','X1','X2','X4'],[3,1,2,4])
test["Credit_Product"] = test["Credit_Product"].replace(['No','Yes'],[0,1])
test["Is_Active"] = test["Is_Active"].replace(['No','Yes'],[0,1])

In [None]:
test["log_Avg_Account_Balance"] = np.log10(test["Avg_Account_Balance"])

In [None]:
test["log_age"] = np.log10(test["Age"])

In [None]:
x_test = test.drop(columns=["ID","Age","Avg_Account_Balance"])

In [None]:
x_test = pd.get_dummies(x_test,columns=["Region_Code","Occupation"])

In [None]:
x_test.shape

In [None]:
X.shape

In [None]:
x_test = x_test[X.columns]
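
Selecting columns with `x_test[X.columns]` raises a `KeyError` if the test set lacks a category that appeared in training, because `get_dummies` only creates columns for values it sees. A more forgiving variant is `reindex`, sketched here with made-up region codes (the frames and codes are assumptions):

```python
import pandas as pd

# Made-up region codes; the test frame is missing "RG2".
train_toy = pd.get_dummies(
    pd.DataFrame({"Region_Code": ["RG1", "RG2", "RG3"]}), columns=["Region_Code"]
)
test_toy = pd.get_dummies(
    pd.DataFrame({"Region_Code": ["RG1", "RG3"]}), columns=["Region_Code"]
)

# Match the training columns and order; absent dummies are filled with 0.
aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(aligned)
```

Columns that exist only in the test frame are dropped by the same call.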

In [None]:
column = X.columns
X = scale.fit_transform(X)
X = pd.DataFrame(X,columns=column)

##### Based on time and accuracy we select the best model.

In [None]:
model = LogisticRegression()
model.fit(X,y)

In [None]:
label = test["ID"]

In [None]:
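# Scale the test features with the scaler fitted on X before predicting.
pred1 = model.predict(pd.DataFrame(scale.transform(x_test),columns=x_test.columns))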

In [None]:
submission = pd.DataFrame()
submission["ID"] = label
submission["Is_Lead"] = pred1