# Advanced Certification Program in Computational Data Science

## A program by IISc and TalentSprint

### Assignment: CatBoost, XGBoost and LightGBM

## Learning Objectives
At the end of the experiment, you will be able to:

* perform data preprocessing
* perform feature transformation
* implement CatBoost, XGBoost and LightGBM models to perform classification on the Lending Club dataset

In [None]:
#@title Walkthrough Video
from IPython.display import HTML
HTML("""<video width="420" height="240" controls>
<source src="https://cdn.chn.talentsprint.com/content/CatBoost_LightGBM_XGBoost.mp4">
</video>""")

## Introduction

**XGBoost** was originally developed by researchers at the University of Washington and is maintained by open-source contributors. It is available in Python, R, Java, Ruby, Swift, Julia, C, and C++. Like LightGBM, XGBoost uses the gradients of candidate cuts to select the next cut, but it also uses the Hessian (the second derivative) when ranking cuts. Computing the second derivative comes at a slight cost, but it gives a better estimate of which cut to use.
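
To make the role of the gradient and the Hessian concrete, here is a minimal sketch of the second-order split score for log loss. The helper `split_gain` and the toy data are illustrative assumptions, not XGBoost's internal code; `reg_lambda` and `gamma` are simply named after the corresponding XGBoost regularisation parameters.

```python
import numpy as np

def split_gain(grad, hess, left_mask, reg_lambda=1.0, gamma=0.0):
    """Second-order gain of splitting one node into a left and a right child."""
    def score(g, h):
        # Structure score of a leaf: (sum of gradients)^2 / (sum of Hessians + lambda)
        return g.sum() ** 2 / (h.sum() + reg_lambda)

    right_mask = ~left_mask
    return 0.5 * (score(grad[left_mask], hess[left_mask])
                  + score(grad[right_mask], hess[right_mask])
                  - score(grad, hess)) - gamma

# Toy node (assumption): 6 samples, current predicted probability 0.5 for all
y = np.array([0, 0, 0, 1, 1, 1])
p = np.full(6, 0.5)
grad = p - y          # first derivative of log loss w.r.t. the raw score
hess = p * (1 - p)    # second derivative of log loss w.r.t. the raw score

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
good_gain = split_gain(grad, hess, x < 3.5)  # cut that separates the classes
poor_gain = split_gain(grad, hess, x < 1.5)  # cut that isolates one sample
print(good_gain, poor_gain)
```

The cut that separates the two classes receives the larger gain, which is how the candidate cuts are ranked in this sketch.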

**CatBoost** is developed and maintained by the Russian search company Yandex and is available in Python, R, C++, Java, and Rust. CatBoost distinguishes itself from LightGBM and XGBoost by focusing on optimizing decision trees for categorical variables, that is, variables whose values may have no inherent relation to one another (e.g. apples and oranges).
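
One way CatBoost can turn a categorical value into a number is an ordered target statistic: each row is encoded using only the targets of rows that came before it. The function below is a simplified illustration under that assumption; the name `ordered_target_stat`, the `prior` value, and the toy data are hypothetical and do not reproduce CatBoost's actual implementation.

```python
def ordered_target_stat(categories, targets, prior=0.5):
    """Encode each category using only the targets seen in earlier rows."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of earlier targets for this category
        encoded.append((seen_sum + prior) / (seen_count + 1))
        # Only now add the current row, so it never sees its own target
        sums[cat] = seen_sum + y
        counts[cat] = seen_count + 1
    return encoded

fruits = ["apple", "orange", "apple", "apple", "orange"]
labels = [1, 0, 0, 1, 1]
encoded = ordered_target_stat(fruits, labels)
print(encoded)
```

Because a row never contributes to its own encoding, the target does not leak into the feature, which is the motivation for processing rows in order.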

**LightGBM** is a gradient boosting framework developed by Microsoft. The framework implements the LightGBM algorithm and is available in Python, R, and C. LightGBM is distinctive in that it can construct trees using Gradient-based One-Side Sampling (GOSS).
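
The idea behind GOSS can be sketched in a few lines: keep the rows with the largest gradients, randomly sample some of the rest, and up-weight the sampled rows to compensate. The function `goss_sample`, its default rates, and the random toy gradients are assumptions made for illustration; this is not LightGBM's internal code.

```python
import numpy as np

def goss_sample(grad, top_rate=0.2, other_rate=0.1, seed=0):
    """Return the selected row indices and their weights for one boosting round."""
    rng = np.random.default_rng(seed)
    n = len(grad)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rows sorted by decreasing absolute gradient
    order = np.argsort(-np.abs(grad))
    top_idx = order[:n_top]                       # always kept
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    # Sampled small-gradient rows are amplified by (1 - a) / b
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights

grad = np.random.default_rng(42).normal(size=100)
idx, w = goss_sample(grad)
print(len(idx), w[:3], w[-3:])
```

With these rates, only 30 of the 100 rows are used to evaluate splits, while the weights keep the contribution of the small-gradient rows roughly in proportion.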

To learn more about how CatBoost, XGBoost and LightGBM compare, refer to the articles below:
- [Article 1](https://cdn.iisc.talentsprint.com/CDS/Assignments/Module2/catboost%20vs%20lightgbm%20vs%20xgboost.pdf)
- [Article 2](https://cdn.iisc.talentsprint.com/CDS/Assignments/Module2/catboost%20lightgbm%20xgboost%202.pdf)

## Dataset Description

Lending Club is a lending platform that lends money to people at an interest rate based on their credit history and other factors. We will analyze this data, pre-process it as needed, and build a machine learning model that can identify a potential defaulter based on their history of transactions with Lending Club.

This dataset contains 42538 rows and 144 columns. **Many of these 144 columns consist mostly of null values.**

To learn more about the Lending Club dataset features, refer to [this page](https://www.openintro.org/data/index.php?data=loans_full_schema).

### Setup Steps:

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}

In [None]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}

In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M2_AST_07_Catboost_XGBoost_LightGBM_A" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://cdn.iisc.talentsprint.com/CDS/Datasets/LoanStats3a.csv")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://learn-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



###  Import required packages

In [None]:
!pip -qq install catboost

In [None]:
!pip install scikit-learn==1.4.2

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('whitegrid')
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier, Pool, metrics, cv
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

### Load Dataset

In [None]:
# Load the raw loan stats dataset
# YOUR CODE HERE

## Data Preprocessing

In [None]:
# View the top 5 rows of data
# YOUR CODE HERE

In [None]:
# Size of the dataset
# YOUR CODE HERE

In [None]:
# Checking info of the raw dataframe
# YOUR CODE HERE

### Check for missing values in the dataset

In [None]:
# Check missing values
# YOUR CODE HERE

In [None]:
# Total percentage of null values in the data
# YOUR CODE HERE

From the above, we can see that about 63% of the values in the overall data are null.

Let's visualize the null values using the heatmap.

In [None]:
# Checking for null values using a heatmap as a visualizing tool
# YOUR CODE HERE

As the heatmap above shows, there are a lot of null values in the dataset, and we have to deal with them carefully.

### Handling missing values in the data

- Select columns in which fewer than 40% of the values are null

In [None]:
# Creating a dataframe to display the number of columns in each range of null-value percentage
# YOUR CODE HERE

# Store the columns count separately for each range
# YOUR CODE HERE

From the above results, we can see that only 53 of the 144 columns have fewer than 40% null values.

In [None]:
# Considering only those columns in which fewer than 40% of the values are null
# YOUR CODE HERE

By keeping only the columns with few null values, we reduced the total number of columns from 144 to 53.

Note that we will deal with the null values remaining in these 53 columns later.
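
As a rough illustration of this filtering step, the sketch below computes the overall and per-column null percentages on a small made-up frame and keeps only the columns below the 40% threshold. The frame `toy` and its column names are assumptions for demonstration, not the Lending Club data.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): one complete, one mostly-null, one lightly-null column
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000, 5000],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "few_missing": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Share of null cells in the whole frame, in percent
overall_null_pct = toy.isnull().sum().sum() / toy.size * 100

# Share of null values per column, in percent
null_pct = toy.isnull().mean() * 100

# Keep only the columns with fewer than 40% null values
kept = toy.loc[:, null_pct < 40]
print(round(overall_null_pct, 2), list(kept.columns))
```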

### Removing columns having a single distinct value

In [None]:
# Checking for columns that hold only a single value, i.e. constant columns
# YOUR CODE HERE

In [None]:
# After observing the above output, we are dropping columns which have single values in them
# YOUR CODE HERE
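
One possible way to find and drop constant columns is sketched below on a made-up frame. The frame `toy` and its values are assumptions; note that `nunique()` ignores nulls by default, so a column holding one value plus nulls also counts as constant here.

```python
import pandas as pd

# Toy frame (assumption): two constant columns and one informative column
toy = pd.DataFrame({
    "policy_code": [1, 1, 1, 1],
    "loan_amnt": [1000, 2500, 1000, 4000],
    "initial_list_status": ["f", None, "f", "f"],
})

# Number of distinct non-null values per column
n_unique = toy.nunique()

# Columns with at most one distinct value carry no information for a model
constant_cols = n_unique[n_unique <= 1].index.tolist()
reduced = toy.drop(columns=constant_cols)
print(constant_cols, list(reduced.columns))
```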

### Extract features from datetime columns

In [None]:
# Non-numerical columns
# YOUR CODE HERE

In [None]:
# Check which columns need to be converted to datetime
# YOUR CODE HERE

In [None]:
# Converting objects to datetime columns
# YOUR CODE HERE

In [None]:
# Checking the new datetime columns
# YOUR CODE HERE

In [None]:
# Considering only year of joining for 'earliest_cr_line' column
# YOUR CODE HERE

In [None]:
# Adding new features by getting month and year from [issue_d, last_pymnt_d, and last_credit_pull_d] columns
# YOUR CODE HERE

In [None]:
# Feature extraction
# YOUR CODE HERE

In [None]:
# Dropping the original features to avoid data redundancy
# YOUR CODE HERE
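
The datetime steps above can be sketched on a toy column as follows. The `Mon-YYYY` string format (for example `Dec-2011`) is an assumption about how the dates are stored; check the actual values before relying on it.

```python
import pandas as pd

# Toy column (assumption): dates stored as 'Mon-YYYY' strings, with one null
toy = pd.DataFrame({"issue_d": ["Dec-2011", "Jan-2010", None]})

# Parse the strings into datetimes; nulls become NaT
issue = pd.to_datetime(toy["issue_d"], format="%b-%Y")

# Extract year and month as separate numerical features
toy["issue_d_year"] = issue.dt.year
toy["issue_d_month"] = issue.dt.month

# Drop the original column to avoid redundant information
toy = toy.drop(columns=["issue_d"])
print(toy)
```

Rows whose date was null end up with null year and month values, which is why the reduced dataset still needs a null-handling pass afterwards.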

### Check for missing values in reduced dataset

In [None]:
# Checking for null values in the updated dataframe
# YOUR CODE HERE

### Handling Null values in reduced dataset

In [None]:
# Checking for Percentage of null values
# YOUR CODE HERE

In [None]:
# Dropping the 29 rows which have null values in a few columns
# YOUR CODE HERE

In [None]:
# Checking again for Percentage of null values
# YOUR CODE HERE

Now, impute the missing values with the median for the columns **'last_pymnt_d_year', 'last_pymnt_d_month', 'last_credit_pull_d_year', 'last_credit_pull_d_month', 'tax_liens'**, since null values make up less than 0.5% of the rows in each of these columns.

In [None]:
# Imputing the null values with the median value
# YOUR CODE HERE
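
A minimal sketch of median imputation on a made-up frame is shown below; the values are assumptions chosen only to make the result easy to check.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption) with one null per column
toy = pd.DataFrame({
    "tax_liens": [0.0, 0.0, np.nan, 1.0, 0.0],
    "last_pymnt_d_year": [2012.0, 2013.0, 2013.0, np.nan, 2014.0],
})

cols = ["tax_liens", "last_pymnt_d_year"]

# fillna with a Series of medians fills each column with its own median
toy[cols] = toy[cols].fillna(toy[cols].median())
print(toy)
```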

For the **'revol_util'** column, fill null values with the median, which is a string whose value is close to 50%:

In [None]:
# For 'revol_util' column, fill null values with 50%
# YOUR CODE HERE

# Extracting numerical value from string
# YOUR CODE HERE

# Converting string to float
# YOUR CODE HERE
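
The three steps for a percentage column stored as text can be sketched on a toy series like this; the sample values are assumptions.

```python
import pandas as pd

# Toy series (assumption): percentages stored as strings, with one null
revol_util = pd.Series(["83.7%", None, "9.4%"])

# Fill nulls with the string '50%', then strip the sign and convert to float
revol_util = revol_util.fillna("50%")
revol_util = revol_util.str.rstrip("%").astype(float)
print(revol_util.tolist())
```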

In [None]:
# Unique values in 'pub_rec_bankruptcies' column
# YOUR CODE HERE

From the above, we can see that the **'pub_rec_bankruptcies'** column is highly imbalanced. It is therefore reasonable to fill its null values with the median (0), since a model built on this column would be skewed heavily towards 0 anyway.

In [None]:
# YOUR CODE HERE

In [None]:
# Unique values in 'emp_length' column
# YOUR CODE HERE

In [None]:
# Separating null values by assigning a random string
# YOUR CODE HERE

# Filling '< 1 year' as '0 years' of experience and '10+ years' as '10 years'
# YOUR CODE HERE

# Then extract numerical value from the string
# YOUR CODE HERE

# Converting its datatype to float
# YOUR CODE HERE
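
A possible sketch of the string-to-number conversion for an employment-length column is given below on toy values. For simplicity it leaves nulls as NaN instead of assigning them a placeholder string, which is a deliberate deviation from the steps listed above.

```python
import pandas as pd

# Toy series (assumption) covering the special cases and one null
emp_length = pd.Series(["10+ years", "< 1 year", "3 years", None, "1 year"])

# Map the two special labels to plain 'N years' strings
emp_length = emp_length.replace({"< 1 year": "0 years", "10+ years": "10 years"})

# Pull out the digits and convert to float; nulls stay NaN
emp_length = emp_length.str.extract(r"(\d+)", expand=False).astype(float)
print(emp_length.tolist())
```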

In [None]:
# Checking again for Percentage of null values
# YOUR CODE HERE

In [None]:
# Removing redundant features and features which have percentage null values > 5%
# YOUR CODE HERE

### Converting categorical columns to numerical columns


In [None]:
df1.head(2)

In [None]:
# Unique values in 'term' column
# YOUR CODE HERE

In [None]:
# Unique values in 'int_rate' column
# YOUR CODE HERE

In [None]:
# Converting 'term' and 'int_rate' to numerical columns
# YOUR CODE HERE

Among the address-related features, we keep the **'addr_state'** column and exclude the **'zip_code'** column.

In [None]:
df2 = df1.drop('zip_code', axis = 1)

In [None]:
# One hot encoding on categorical columns
# YOUR CODE HERE

In [None]:
# Label encoding on 'grade' column
# YOUR CODE HERE

In [None]:
# Update 'grade' column
# YOUR CODE HERE

In [None]:
# Label encoding on 'sub_grade' column
# YOUR CODE HERE

In [None]:
# Update 'sub_grade' column
# YOUR CODE HERE
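
The two encodings used in this section can be compared on a small made-up frame. The frame `toy` is an assumption; `pd.get_dummies` creates one indicator column per category, while `LabelEncoder` maps the sorted categories to consecutive integers.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumption): one nominal column and one ordered grade column
toy = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "RENT"],
    "grade": ["B", "A", "C"],
})

# One-hot encoding: one indicator column per category
toy = pd.get_dummies(toy, columns=["home_ownership"])

# Label encoding: sorted categories A, B, C become 0, 1, 2
encoder = LabelEncoder()
toy["grade"] = encoder.fit_transform(toy["grade"])
print(toy)
```

Label encoding suits `grade` here because alphabetical order happens to match the natural order of the grades; for categories with no order, one-hot encoding avoids implying one.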

In [None]:
df2.head(2)

In [None]:
# Target feature
# YOUR CODE HERE

In [None]:
# Prediction features
# YOUR CODE HERE
# Target variable
# YOUR CODE HERE

In [None]:
# Label encoding the target variable
# YOUR CODE HERE

In [None]:
X.head(2)

### Split data into training and testing set

In [None]:
# Split the data into train and test
# YOUR CODE HERE

## Model Building

In [None]:
# Using DecisionTree as base model
# YOUR CODE HERE

In [None]:
# Prediction using DecisionTree
# YOUR CODE HERE
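
The split, fit, predict, and score steps can be sketched with a small helper on synthetic data. The helper `fit_and_score`, the synthetic dataset, and the chosen `test_size` and `random_state` values are assumptions for illustration, not the expected solution for the Lending Club data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_and_score(model, X_train, X_test, y_train, y_test):
    """Fit any estimator with fit/predict and return its test accuracy."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)

# Synthetic binary classification data (assumption)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the rows, keeping the class balance in both parts
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

baseline_acc = fit_and_score(
    DecisionTreeClassifier(random_state=42),
    X_train_toy, X_test_toy, y_train_toy, y_test_toy,
)
print(f"Decision tree baseline accuracy: {baseline_acc:.3f}")
```

Since `CatBoostClassifier`, `XGBClassifier`, and `LGBMClassifier` also expose `fit` and `predict`, the same helper should accept them in place of the decision tree, which makes it easy to compare all four models on identical splits.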

### CatBoost

In [None]:
# Create CatBoostClassifier object
# YOUR CODE HERE

In [None]:
#cat_features = list(range(0, X.shape[1]))
# YOUR CODE HERE

In [None]:
# Prediction using CatBoost
# YOUR CODE HERE

In [None]:
# Classification report for CatBoost model
# YOUR CODE HERE

### XGBoost

In [None]:
# Create XGBClassifier object
# YOUR CODE HERE

In [None]:
# Fit on training set
# YOUR CODE HERE

In [None]:
# Prediction using XGBClassifier
# YOUR CODE HERE

In [None]:
# Classification report for XGBoost
# YOUR CODE HERE

### LightGBM

In [None]:
# Create LGBMClassifier object
# YOUR CODE HERE

In [None]:
# Fit on training set
# YOUR CODE HERE

In [None]:
# Prediction using LGBMClassifier
# YOUR CODE HERE

In [None]:
# Classification report for LGBM
# YOUR CODE HERE

## Reference Reading:

https://neptune.ai/blog/when-to-choose-catboost-over-xgboost-or-lightgbm

### Please answer the questions below to complete the experiment:




In [None]:
#@title Select the FALSE statement: { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["", "CatBoost can internally handle categorical variables in the data", "XGBoost cannot handle categorical features by itself, it only accepts numerical values similar to Random Forest", "LightGBM uses a histogram-based method for selecting the best split in order to speed up the training process", "All the above", "None of the above"]

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")