**Predictive Lead Scoring**

# Import useful libraries and load the data

In [7]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

In [10]:
sys.path.append("..")
data_dir = Path('../data/') 
modules_dir = Path('../modules/')

> The dataset: https://www.kaggle.com/arashnic/banking-loan-prediction

**The content of the dataset**
- ID : Unique Customer ID
- Gender : Gender of the applicant
- DOB : Date of Birth of the applicant
- Lead_Creation_Date : Date on which Lead was created
- City_Code : Anonymised Code for the City
- City_Category: Anonymised City Feature
- Employer_Code: Anonymised Code for the Employer
- Employer_Category1 : Anonymised Employer Feature
- Employer_Category2: Anonymised Employer Feature
- Monthly_Income : Monthly Income in Dollars
- Customer_Existing_Primary_Bank_Code : Anonymised Customer Bank Code
- Primary_Bank_Type: Anonymised Bank Feature
- Contacted: Contact Verified (Y/N)
- Source : Categorical Variable representing source of lead
- Source_Category: Type of Source
- Existing_EMI : EMI of Existing Loans in Dollars
- Loan_Amount: Loan Amount Requested
- Loan_Period: Loan Period (Years)
- Interest_Rate: Interest Rate of Submitted Loan Amount
- EMI: EMI of Requested Loan Amount in dollars
- Var1: Anonymized Categorical variable with multiple levels
- Approved: (Target) Whether the loan is approved (1) or not (0), i.e. whether the customer is a qualified lead (1) or not (0)

In [11]:
train = pd.read_csv(data_dir/'train.csv')

In [12]:
test = pd.read_csv(data_dir/'test.csv')

# Data Wrangling

## Data exploring

**samples**

In [13]:
train.sample(3)

| | ID | Gender | DOB | Lead_Creation_Date | City_Code | City_Category | Employer_Code | Employer_Category1 | Employer_Category2 | Monthly_Income | ... | Contacted | Source | Source_Category | Existing_EMI | Loan_Amount | Loan_Period | Interest_Rate | EMI | Var1 | Approved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5100 | APPQ80212315421 | Female | 21/10/90 | 08/07/16 | C10001 | A | COM0005577 | B | 4.0 | 1057.7 | ... | N | S143 | B | 0.0 | NaN | NaN | NaN | NaN | 0 | 0 |
| 47282 | APPB90149643513 | Male | 01/03/93 | 03/09/16 | C10003 | A | COM0048528 | B | 4.0 | 1718.3 | ... | Y | S122 | G | 240.7 | 20000.0 | 3.0 | 18.25 | 726.0 | 2 | 0 |
| 17660 | APPF30256077949 | Female | 02/08/93 | 26/07/16 | C10002 | A | COM0000396 | C | 4.0 | 1500.0 | ... | N | S159 | B | 1000.0 | NaN | NaN | NaN | NaN | 0 | 0 |


In [14]:
test.sample(3)

| | ID | Gender | DOB | Lead_Creation_Date | City_Code | City_Category | Employer_Code | Employer_Category1 | Employer_Category2 | Monthly_Income | ... | Primary_Bank_Type | Contacted | Source | Source_Category | Existing_EMI | Loan_Amount | Loan_Period | Interest_Rate | EMI | Var1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18949 | APPN60172341115 | Male | 05/12/62 | 22/08/16 | C10037 | B | COM0007477 | A | 4.0 | 2300.0 | ... | P | Y | S133 | B | 0.0 | 10000.0 | 3.0 | 20.0 | 372.0 | 2 |
| 8277 | APPZ10734018529 | Male | 12/04/89 | 08/07/16 | C10004 | A | COM0012758 | B | 4.0 | 2000.0 | ... | G | Y | S133 | B | 0.0 | 22000.0 | 3.0 | 20.0 | 818.0 | 4 |
| 17831 | APPQ80997874023 | Female | 15/12/81 | 17/08/16 | C10006 | A | COM0050172 | B | 4.0 | 2700.0 | ... | P | N | S133 | B | 1300.0 | NaN | NaN | NaN | NaN | 0 |


**target and features**

- Our target is the **Approved** column of the train table: it is 1 if the loan is approved (the customer is a qualified lead) and 0 if it is not.
- The remaining 21 columns are candidate features; after a statistical analysis we will select only the best ones.
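The split between target and candidate features can be sketched as follows. This is a minimal illustration on a made-up stand-in frame, not the real `train.csv`; the names `toy_train`, `X` and `y` are assumptions introduced here.

```python
import pandas as pd

# Tiny stand-in for the train table (values are made up for illustration)
toy_train = pd.DataFrame({
    "ID": ["A1", "A2", "A3"],
    "Monthly_Income": [1500.0, 2300.0, 1800.0],
    "Contacted": ["Y", "N", "Y"],
    "Approved": [0, 1, 0],
})

# Target: the Approved column; candidate features: every other column
y = toy_train["Approved"]
X = toy_train.drop(columns=["Approved"])

print(X.shape, y.shape)
```

On the real data the same two lines applied to `train` would leave 21 candidate feature columns.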

**size and shape**

In [23]:
train.shape

(69713, 22)

In [24]:
test.shape

(30037, 21)

> Let us perform all our operations on the train data and write modular code that can be reused on the test dataset and in future tests.

**missing values and types of columns**

In [27]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69713 entries, 0 to 69712
Data columns (total 22 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ID                                   69713 non-null  object 
 1   Gender                               69713 non-null  object 
 2   DOB                                  69698 non-null  object 
 3   Lead_Creation_Date                   69713 non-null  object 
 4   City_Code                            68899 non-null  object 
 5   City_Category                        68899 non-null  object 
 6   Employer_Code                        65695 non-null  object 
 7   Employer_Category1                   65695 non-null  object 
 8   Employer_Category2                   65415 non-null  float64
 9   Monthly_Income                       69713 non-null  float64
 10  Customer_Existing_Primary_Bank_Code  60322 non-null  object 
 11  Primary_Bank_Type           

> This helps us understand the 22 data columns:
>- DOB and Lead_Creation_Date have the wrong type: they are stored as object (strings) and should be datetime.
>- Several columns have null values, although the target is fortunately complete.

**number of unique values**

In [32]:
train.nunique().sort_values()

Approved                                   2
Gender                                     2
Contacted                                  2
Primary_Bank_Type                          2
City_Category                              3
Employer_Category1                         3
Employer_Category2                         4
Var1                                       5
Loan_Period                                6
Source_Category                            7
Source                                    29
Customer_Existing_Primary_Bank_Code       57
Interest_Rate                             72
Lead_Creation_Date                        92
Loan_Amount                              196
City_Code                                678
EMI                                     2179
Existing_EMI                            3245
Monthly_Income                          5010
DOB                                    10759
Employer_Code                          36617
ID                                     69713
dtype: int

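> We can observe that most variables take only a few distinct values and can be treated as categorical, while columns such as Monthly_Income, Existing_EMI and EMI have thousands of distinct values and are numerical.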
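One way to turn this observation into code is to split the columns by their number of unique values. The helper `split_by_cardinality` and its `max_levels` threshold are hypothetical, introduced only for this sketch; a low count is a hint rather than proof that a column is categorical (Loan_Period, for instance, is a number of years).

```python
import pandas as pd

def split_by_cardinality(df, max_levels=60):
    """Return (low, high): column names with at most / more than max_levels unique values."""
    counts = df.nunique()
    low = counts[counts <= max_levels].index.tolist()
    high = counts[counts > max_levels].index.tolist()
    return low, high

# Made-up stand-in frame; the threshold is lowered to fit its tiny size
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Var1": [0, 2, 4, 0],
    "Monthly_Income": [1057.7, 1718.3, 1500.0, 2300.0],
})
low, high = split_by_cardinality(toy, max_levels=3)
print(low, high)
```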

## Data Cleaning

**mistype**

In [43]:
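# Changing DOB and lead creation date to the appropriate type
def handle_datetime(data):
    # The raw values are day-first strings such as '21/10/90', so the format is given explicitly
    # NOTE: %y reads 00-68 as 2000-2068, so a two-digit birth year like '62' is parsed as 2062
    data['DOB'] = pd.to_datetime(data['DOB'], format='%d/%m/%y')
    data['Lead_Creation_Date'] = pd.to_datetime(data['Lead_Creation_Date'], format='%d/%m/%y')
    return data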

In [45]:
train = handle_datetime(train)

> We could also have parsed these columns as datetimes while reading the file with pandas.
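A minimal sketch of parsing at read time, using the `parse_dates` and `dayfirst` arguments of `pd.read_csv`. The two rows below are made up and read from an in-memory string; on the real file the call would take `data_dir/'train.csv'` instead.

```python
import io
import pandas as pd

# Two made-up rows in the same day/month/two-digit-year layout as the dataset
csv_text = (
    "ID,DOB,Lead_Creation_Date\n"
    "X1,21/10/90,08/07/16\n"
    "X2,01/03/93,03/09/16\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["DOB", "Lead_Creation_Date"],
    dayfirst=True,  # 08/07/16 is 8 July 2016, not 7 August
)
print(toy.dtypes)
```

Two-digit years remain ambiguous with this approach too, so birth years that end up in the future would still need a separate correction.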

**missing values**

In [64]:
# Null count per column; 'percentage' is expressed as a fraction of the rows
null_count = train.isnull().sum().sort_values()
pd.DataFrame({'count': null_count, 'percentage': null_count / train.shape[0]})

| | count | percentage |
|---|---|---|
| ID | 0 | 0.0 |
| Source_Category | 0 | 0.0 |
| Source | 0 | 0.0 |
| Contacted | 0 | 0.0 |
| Var1 | 0 | 0.0 |
| Monthly_Income | 0 | 0.0 |
| Approved | 0 | 0.0 |
| Lead_Creation_Date | 0 | 0.0 |
| Gender | 0 | 0.0 |
| DOB | 15 | 0.000215 |


Our first strategy is to keep all the observations.
- Since most of the columns with missing values are categorical, we will create a category "unknown" wherever the value is missing.
- For the numerical columns (Loan_Amount, Loan_Period, Interest_Rate and EMI), we will use a different approach: we assume that a missing loan amount or period means the client did not request a loan, and replace it with 0.
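The two rules above can be sketched as a small helper. The function `fill_missing` and its explicit column lists are assumptions made for this illustration, and the frame is made-up data; it is not the notebook's final implementation.

```python
import numpy as np
import pandas as pd

def fill_missing(df, cat_cols, num_cols):
    """Sketch: 'unknown' for missing categorical values, 0 for missing loan fields."""
    out = df.copy()  # leave the caller's frame untouched
    out[cat_cols] = out[cat_cols].fillna("unknown")
    out[num_cols] = out[num_cols].fillna(0)
    return out

# Made-up rows: the second lead has no city and no loan details
toy = pd.DataFrame({
    "City_Code": ["C10001", None, "C10003"],
    "Loan_Amount": [20000.0, np.nan, 10000.0],
    "Loan_Period": [3.0, np.nan, 3.0],
    "Interest_Rate": [18.25, np.nan, 20.0],
    "EMI": [726.0, np.nan, 372.0],
})
filled = fill_missing(
    toy,
    cat_cols=["City_Code"],
    num_cols=["Loan_Amount", "Loan_Period", "Interest_Rate", "EMI"],
)
print(filled)
```

Passing the column lists explicitly keeps the rule visible and avoids relying on dtype inference, which matters for a column like Employer_Category2 that is categorical but stored as float.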

In [68]:
from datetime import date

date.today()

datetime.date(2021, 3, 9)

In [72]:
train['age'] = train.DOB.apply(lambda x : date.today().year - x.year)

0        42.0
1        35.0
2        39.0
3        32.0
4        36.0
         ... 
69708    38.0
69709    50.0
69710    29.0
69711    43.0
69712    32.0
Name: DOB, Length: 69713, dtype: float64

In [74]:
from datetime import datetime, date

def create_age_column(data):
    # Age in calendar years as of today (month and day of birth are ignored);
    # DOB must already be a datetime column
    return date.today().year - data.DOB.dt.year
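Because `date.today()` changes from one run to the next, an alternative is to measure age at the time the lead was created. The sketch below is a swapped-in variant, not the notebook's method: the helper `age_at_lead_creation` is hypothetical, it assumes both columns still hold the raw day/month/two-digit-year strings, and it shifts birth dates that parse into the future back by a century.

```python
import pandas as pd

def age_at_lead_creation(df):
    """Approximate age in calendar years when the lead was created."""
    dob = pd.to_datetime(df["DOB"], format="%d/%m/%y")
    created = pd.to_datetime(df["Lead_Creation_Date"], format="%d/%m/%y")
    # %y maps 00-68 to 2000-2068, so a birth year like '62' lands in the future;
    # move such dates back by 100 years
    in_future = dob > created
    dob = dob.where(~in_future, dob - pd.DateOffset(years=100))
    return created.dt.year - dob.dt.year

# Made-up rows reusing date strings of the kind seen in the samples
toy = pd.DataFrame({
    "DOB": ["05/12/62", "21/10/90"],
    "Lead_Creation_Date": ["22/08/16", "08/07/16"],
})
ages = age_at_lead_creation(toy)
print(ages.tolist())
```

Like `create_age_column`, this only subtracts calendar years, so it can be off by one for leads created before the birthday in that year.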

In [75]:

def handle_missing_val(data):
    # Categorical columns
    # First create an age column from DOB
    data['age'] = create_age_column(data)
    # Replace the missing values of the text (object) columns by "unknown"
    cat_cols = data.select_dtypes(include='object').columns
    data[cat_cols] = data[cat_cols].fillna('unknown')
    return data