# Introduction and Documentation for the Pipeline

The goal of this project is to build a machine learning pipeline that automates retraining and fine-tuning the model every time we receive new data.

The architecture for this pipeline has been divided into 6 sections: 

1. Data Ingestion 
2. Data Preprocessing 
3. Feature Engineering
4. Model Training
5. Model Evaluation 
6. Model Deployment (not implemented in this project)
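
Since the stages run one after another, the overall flow can be pictured as a plain linear chain in which each stage consumes the output of the previous one. The sketch below is only an illustration of that idea: `run_linear_pipeline` and the toy lambdas are hypothetical stand-ins, not the classes used later in this notebook.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; the function maps an input payload to an output payload.
Stage = Tuple[str, Callable[[Any], Any]]


def run_linear_pipeline(stages: List[Stage], payload: Any) -> Any:
    """Run the stages in order, feeding each stage the previous stage's output."""
    for name, stage in stages:
        payload = stage(payload)
    return payload


# Toy stand-ins for the real steps (hypothetical, for illustration only).
stages = [
    ("ingest", lambda _: [3, 1, 2]),
    ("preprocess", lambda rows: sorted(rows)),
    ("feature_engineering", lambda rows: [r * 10 for r in rows]),
]

result = run_linear_pipeline(stages, None)
print(result)
```

A linear chain like this is the simplest special case of a DAG: every stage has exactly one upstream dependency.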

## Pipeline Design

Initially I looked into more complex ML pipeline architectures, such as Directed Acyclic Graphs (DAGs) and the Single Leader Architecture. After reviewing the problem in the question paper a few more times, along with the initial model built in 'lightgbm_build.ipynb', I realized that a simpler workflow that borrows some concepts from the DAG architecture would be much more suitable. The next section shows the initial and final drafts of the pipeline design.

## Drafts

### Initial Draft Image 

<img src="../images/initial_draft.jpg" alt="Initial Draft Image" />

### Final Draft Image 

<img src="../images/final_draft.jpg" alt="Final Draft Image" />

### Discussion 

From the diagrams above, it can be observed that not much changed between my initial idea and the final design. The final draft image gives a high-level overview of the processes that happen within each step. To keep it simple, I did not add any descriptions to the diagram, as the explanations and reasoning are given in the notebook alongside the corresponding step. Overall, the pipeline has been kept simple and is designed so that additional components and other pre-processing steps can be added. Comments have been added at each important line of code to explain what the code is doing.

I refer to 'lightgbm_build.ipynb' because, while working on that model, I had already planned most of the steps needed for this pipeline. While designing the pipeline I had to make some adjustments and change the placement of a few pre-processing steps. In the final steps I saved the pipeline, alongside the latest model as a reference.

# Importing Libraries

In [36]:
#libraries for data exploration and preprocessing
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import boxcox
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
import sklearn.metrics as metrics
import numpy as np

#library for graph plotting 
import matplotlib.pyplot as plt
import seaborn as sn

#library for model building 
import lightgbm as lgb

from sklearn.model_selection import GridSearchCV

import pickle

#libraries to build pipeline 
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin

#saving the pipeline
import joblib

import warnings

# Data Ingestion 

In this section we read the data from the file 'loan.csv', which is used to retrain the model. Note that we assume <b>the only file that changes is 'loan.csv'</b>, so the pipeline is run each time that file changes.
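
One possible way to detect that 'loan.csv' has changed is to compare a hash of the file against the hash recorded after the previous run. This is a minimal sketch under that assumption; `file_digest`, `has_changed`, and the marker file are hypothetical names and are not part of this project.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(data_path, marker_path):
    """Return True if the data file differs from the digest stored in the marker file.

    The marker file (hypothetical) is updated whenever a change is detected.
    """
    current = file_digest(data_path)
    marker = Path(marker_path)
    previous = marker.read_text().strip() if marker.exists() else None
    if current == previous:
        return False
    marker.write_text(current)
    return True


# Possible usage (the marker path is an assumption):
# if has_changed('../data/loan.csv', '../data/loan.sha256'):
#     ...rerun the pipeline...
```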

In [37]:
#import the data file from Notebook 1
file_path = '../data/loan.csv'
def load_data(file_path):
    data = pd.read_csv(file_path, parse_dates=['applicationDate', 'originatedDate'])
    
    #load initial model 
#     model_path = "models/"
#     with open(model_path + 'initial_model.pkl', 'rb') as file:
#         loaded_model = pickle.load(file)
    
    return data

In [38]:
data = load_data(file_path)

# Data Preprocessing

Within this section, we carry out the data preprocessing in the steps shown below:
1. Keeping only the records of applications that were funded
2. Imputing missing/NA values with 0 for 'nPaidOff'
3. Encoding 'payFrequency'
4. Encoding 'loanStatus' (keeping only the four final statuses and mapping them to a binary target)
5. One-hot encoding 'leadType'
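
To make the effect of these steps concrete, here is a small sketch that applies them to a made-up four-row dataframe. The rows and the 'leadType' values are invented for illustration, and `Series.map` is used here as a stand-in for the `replace` calls in the class below.

```python
import pandas as pd

# Toy data (invented): one unfunded row and one row with a non-final loan status.
toy = pd.DataFrame({
    "isFunded": [1, 1, 0, 1],
    "nPaidOff": [2.0, None, 1.0, 0.0],
    "payFrequency": ["B", "W", "M", "S"],
    "loanStatus": ["Paid Off Loan", "Charged Off", "Paid Off Loan", "New Loan"],
    "leadType": ["lead", "organic", "lead", "lead"],
})

# 1. keep funded applications only
funded = toy[toy["isFunded"] == 1].copy()
# 2. missing nPaidOff becomes 0
funded["nPaidOff"] = funded["nPaidOff"].fillna(0)
# 3. encode payFrequency as integers
funded["payFrequency"] = funded["payFrequency"].map({"B": 0, "W": 1, "M": 2, "S": 3, "I": 4})
# 4. keep the four final statuses and map them to a binary target
status_map = {"Paid Off Loan": 1, "Settlement Paid Off": 1,
              "Settled Bankruptcy": 0, "Charged Off": 0}
funded = funded[funded["loanStatus"].isin(list(status_map))].copy()
funded["loanStatus"] = funded["loanStatus"].map(status_map)
# 5. one-hot encode leadType
encoded = pd.get_dummies(funded, columns=["leadType"])

print(encoded)
```

Only two of the four toy rows survive: the unfunded row is removed in step 1 and the 'New Loan' row is removed in step 4.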

In [39]:
class DataPreprocessing(BaseEstimator, TransformerMixin):
    def __init__(self):
        print("*"*50)
        print(">>>> Pipeline Started\n")
        print(">>>>>>> Data Preprocessing\n")
    
    def fit(self, X, y = None):
        #nothing is learned in this step; return self as the scikit-learn API expects
        print(">>>> fit called")
        return self
        
    def transform(self, X):
        #keep funded applications only; copy so the caller's dataframe is left untouched
        X = X.query('isFunded==1').copy()
        #fill in nPaidOff with 0
        X['nPaidOff'] = X['nPaidOff'].fillna(0)
        
        #encode payFrequency
        new_vals = {
            'payFrequency': {'B': 0, 'W': 1, 'M': 2, 'S': 3, 'I': 4}
        }
        X = X.replace(new_vals)
        
        #encode loanStatus
        valid_group = ['Paid Off Loan', 'Settlement Paid Off', 'Settled Bankruptcy', 'Charged Off']
        #remove any entry not from the defined 4 
        X = X[X.loanStatus.isin(valid_group)]
        new_loanstatus = {
            'loanStatus': {'Paid Off Loan': 1, 'Settlement Paid Off': 1, 'Settled Bankruptcy': 0, 'Charged Off': 0}
        }
        X = X.replace(new_loanstatus)
        
        #one hot encoding on leadType
        X = pd.get_dummies(X, columns=['leadType'])
        
        return X
    
    #fit_transform is inherited from TransformerMixin (fit followed by transform)

# Feature Engineering 

In this section I carried out the following feature engineering steps:

1. Feature Creation: Created a new feature, 'processTime' (the time in hours between 'applicationDate' and 'originatedDate'), intended to improve the model's ability to predict loan application risk
2. Feature Selection: Dropped irrelevant columns such as 'approved' and 'isFunded', as none of them had any direct link to loan application risk
3. Scaling: Applied min-max scaling to 'loanAmount' and 'originallyScheduledPaymentAmount', and Box-Cox to 'nPaidOff' and 'leadCost'
4. Sampling: I used over-sampling to balance the class distribution of the training set, as the model would otherwise be heavily biased: around 97% of the set was class '1' and a mere 3% was class '0'
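
For reference, the two scaling transforms in step 3 can be written out as follows. These are the standard textbook definitions, given here as a reading aid and not as anything specific to this project.

Min-max scaling maps a column onto the interval $[0, 1]$:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

The Box-Cox transform with parameter $\lambda$ is defined for strictly positive $x$:

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[2ex]
\ln x, & \lambda = 0
\end{cases}
$$

Because Box-Cox requires $x > 0$ and both 'nPaidOff' and 'leadCost' can be 0, the code adds 1 to each column before transforming it. When `scipy.stats.boxcox` is called without a fixed `lmbda`, it estimates $\lambda$ by maximum likelihood and returns it as the second value, which the code discards with `_`.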

In [40]:
class FeatureEngineering(BaseEstimator, TransformerMixin):
    def __init__(self):
        print(">>>> Feature Engineering\n")
    
    def fit(self, X, y = None):
        #nothing is learned in this step; return self as the scikit-learn API expects
        print(">>>> Feature Engineering fit called")
        return self
        
    def transform(self, X):
        #work on a copy so the incoming dataframe is not modified in place
        X = X.copy()
        #add process time (hours between application and origination)
        X['processTime'] = (X['originatedDate'] - X['applicationDate']).dt.total_seconds() / 3600
        
        #drop irrelevant columns
        X = X.drop(['originated', 'isFunded', 'approved', 'originatedDate', 
                    'applicationDate', 'fpStatus', 'state'], axis="columns")
        
        #applying min-max scaling to loanAmount and originallyScheduledPaymentAmount
        scaler = MinMaxScaler()
        X[['loanAmount', 'originallyScheduledPaymentAmount']] = scaler.fit_transform(
            X[['loanAmount', 'originallyScheduledPaymentAmount']])

        #applying box-cox to leadCost and nPaidOff as both distributions are heavily concentrated near 0 and positively skewed
        #box-cox needs strictly positive input, hence the + 1
        X['nPaidOff'], _ = boxcox(X['nPaidOff'] + 1)
        X['leadCost'], _ = boxcox(X['leadCost'] + 1)
        
        #split data into x and y.
        #drop the unique identifiers
        X , y = X.drop(['loanId', 'anon_ssn', 'loanStatus', 'clarityFraudId'], axis=1), X["loanStatus"]
        #balance the classes by random over-sampling
        ros = RandomOverSampler(random_state=8)
        X, y = ros.fit_resample(X, y)
        
        #note: this returns an (X, y) tuple, which ModelTrain unpacks
        return X, y
    
    #fit_transform is inherited from TransformerMixin (fit followed by transform)

# Model Training

In this section I build and fit the model on the given data.
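
One point worth keeping in mind: in this notebook the over-sampling happens before the train/test split, so duplicated minority rows can end up in both sets, which may make validation scores look more optimistic than they would be on unseen data. A common alternative is to split first and over-sample only the training portion. The sketch below illustrates that ordering with plain Python lists; `stratified_split_then_oversample` is a hypothetical helper, not the `RandomOverSampler` and `train_test_split` calls used in the notebook.

```python
import random
from collections import Counter


def stratified_split_then_oversample(labels, test_fraction=0.2, seed=8):
    """Split indices per class, then over-sample only the training indices."""
    rng = random.Random(seed)

    # group row indices by class label
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    # stratified split: take the same fraction of every class for the test set
    train_by_class, test_idx = {}, []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_test = int(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_by_class[label] = indices[n_test:]

    # random over-sampling: draw extra training rows (with replacement)
    # until every class matches the size of the largest class
    target = max(len(indices) for indices in train_by_class.values())
    train_idx = []
    for indices in train_by_class.values():
        train_idx.extend(indices)
        train_idx.extend(rng.choices(indices, k=target - len(indices)))
    return train_idx, test_idx


# Toy labels mimicking the roughly 97% / 3% imbalance described above.
labels = [1] * 970 + [0] * 30
train_idx, test_idx = stratified_split_then_oversample(labels)
train_counts = Counter(labels[i] for i in train_idx)
print(train_counts, len(test_idx))
```

With this ordering the test set keeps the original class ratio and shares no rows with the training set.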

In [41]:
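class ModelTrain(BaseEstimator, TransformerMixin):
    def __init__(self):
        print(">>>> Model Training\n")
    
    def fit(self, X, y = None):
        print(">>>> Model Fitting\n")
        print("*"*50)
        print(">>>> Model has been built ")
        #unpack the (X, y) tuple returned by FeatureEngineering.transform
        X, y = X
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)
        #building the model (max_depth=-1 means no depth limit)
        model = lgb.LGBMClassifier(learning_rate=0.09,max_depth=-1,random_state=42)
        #lightgbm >= 4.0 no longer accepts verbose in fit(); use callbacks=[lgb.log_evaluation(20)] there
        model.fit(X_train,y_train,eval_set=[(X_test,y_test),(X_train,y_train)],
                  verbose=20,eval_metric='auc')
        
        #keep the results on the estimator so that the saved pipeline contains the trained model
        self.model_ = model
        self.splits_ = (X_train, X_test, y_train, y_test)
        return self
        
    def transform(self, X):
        #training is triggered here because the notebook drives the pipeline with pipeline.transform(data)
        self.fit(X)
        return (self.model_, *self.splits_)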

In [42]:
#build the pipeline
pipeline = Pipeline(steps=[
    ("preprocess_data", DataPreprocessing()),
    ("feature_engineer", FeatureEngineering()),
    ("model_train", ModelTrain())
])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    model, X_train, X_test, y_train, y_test = pipeline.transform(data)

**************************************************
>>>> Pipeline Started

>>>>>>> Data Preprocessing

>>>> Feature Engineering

>>>> Model Training

>>>> Model Fitting

**************************************************
>>>> Model has been built 
[20]	training's auc: 0.890784	training's binary_logloss: 0.522686	valid_0's auc: 0.875799	valid_0's binary_logloss: 0.53108
[40]	training's auc: 0.948498	training's binary_logloss: 0.439175	valid_0's auc: 0.935419	valid_0's binary_logloss: 0.449588
[60]	training's auc: 0.972176	training's binary_logloss: 0.38058	valid_0's auc: 0.961683	valid_0's binary_logloss: 0.392074
[80]	training's auc: 0.985576	training's binary_logloss: 0.328205	valid_0's auc: 0.978299	valid_0's binary_logloss: 0.340598
[100]	training's auc: 0.991376	training's binary_logloss: 0.287082	valid_0's auc: 0.986534	valid_0's binary_logloss: 0.299895


# Model Evaluation 

As discussed in the previous notebook, we cannot use accuracy to judge whether the model is performing well because of the class imbalance.  
So here I calculated the AUC score to check how well the model is performing.
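
AUC measures how well the model ranks positives above negatives, so its value depends on whether it is given predicted probabilities or hard 0/1 labels. The evaluation cell in this notebook passes `model.predict(X_test)`, which returns hard labels; this may be why the reported score is lower than the validation AUC printed in the training log. The sketch below shows the effect on toy numbers with a small pairwise AUC function (`pairwise_auc` is a hypothetical helper written for this illustration, not a library call).

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (invented numbers): both positives are ranked above both negatives.
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.2, 0.3, 0.9]
# Thresholding at 0.5 turns the positive with probability 0.3 into a predicted 0.
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_probabilities, auc_from_labels)  # 1.0 0.75
```

With scikit-learn, the probability-based score would be obtained by passing `model.predict_proba(X_test)[:, 1]` to `roc_auc_score`.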

In [43]:
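#get the auc score 
#note: model.predict returns hard 0/1 labels; passing model.predict_proba(X_test)[:, 1]
#instead would score the ranking of the predicted probabilities
print("*"*50)
AUC = metrics.roc_auc_score(y_test, model.predict(X_test))
print(f"Initial Model AUC: {AUC*100: .2f}%")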

**************************************************
Initial Model AUC:  92.76%


# Model Fine-Tuning

For model fine-tuning, I used GridSearchCV to find the best parameters for the model in order to improve its performance. However, to save computational resources, I added a check so that fine-tuning only happens if the model's current AUC score is below 90%. A score of 90% or above indicates that the model already performs very well, so there is no need to fine-tune it further.
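
To give a sense of the computational cost being avoided, the size of the search can be worked out directly from the parameter grid and the number of cross-validation folds used in the cell below. A short sketch of that arithmetic:

```python
from itertools import product

# Same grid as in the grid search cell of this notebook.
param_grid = {
    'learning_rate': [0.1, 0.09, 0.05, 0.01],
    'max_depth': [-1, 3, 5, 7, 10],
    'num_leaves': [20, 30, 40, 50, 75, 100],
}
cv_folds = 5

# Every combination of the listed values is one candidate model.
n_candidates = len(list(product(*param_grid.values())))
# Each candidate is trained once per cross-validation fold.
n_cv_fits = n_candidates * cv_folds

print(n_candidates, n_cv_fits)  # 120 600
```

So a full search trains 600 LightGBM models during cross-validation, plus one final refit of the best candidate when `GridSearchCV` is left at its default `refit=True`.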

In [44]:
#define the gridsearch function 

def grid_search(X_train, y_train):
    #initializing a LightGBM model and calling GridSearch on it 
    new_model = lgb.LGBMClassifier(random_state=40)
    param_grid = {
        'learning_rate': [0.1, 0.09, 0.05, 0.01],
        'max_depth': [-1, 3, 5, 7, 10],
        'num_leaves': [20, 30, 40, 50, 75, 100]
    }

    #5-fold cross-validated search, scored by ROC AUC
    search = GridSearchCV(new_model, param_grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    
    return search


In [45]:
#only undergo grid search if the model's auc is less than 90% otherwise it is a waste of computational resources 
threshold = 0.90
if AUC < threshold:
    improved = grid_search(X_train, y_train)
    #best_estimator_ is already refit on the full training set with the best parameters
    model = improved.best_estimator_

In [46]:
#re-evaluate the model (it is fine-tuned only if the grid search ran)
print("*"*50)
AUC = metrics.roc_auc_score(y_test, model.predict(X_test))
print(f"Model AUC: {AUC*100: .2f}%")

**************************************************
Model AUC:  92.76%


In [47]:
# Save the entire pipeline
joblib.dump(pipeline, '../pipelines/trained_pipeline.joblib')

#save the latest model as reference
joblib.dump(model, '../models/latest_model.joblib')

['../models/latest_model.joblib']

In [50]:
# make the html file 
from nbconvert import HTMLExporter
import nbformat

notebook_file = 'ml_pipeline.ipynb'

# Read the notebook
with open(notebook_file, 'r', encoding='utf-8') as notebook_file_content:
    notebook_content = nbformat.read(notebook_file_content, as_version=4)

# Create an HTMLExporter instance
html_exporter = HTMLExporter()

# Converting notebook to HTML file
(html_output, resources) = html_exporter.from_notebook_node(notebook_content)

html_output_file = 'ml_pipeline.html'

with open(html_output_file, 'w', encoding='utf-8') as html_file:
    html_file.write(html_output)

#confirmation message
print(f'The notebook has been successfully converted to HTML. Output saved to {html_output_file}.')


The notebook has been successfully converted to HTML. Output saved to ml_pipeline.html.


In [49]:
!pip3 freeze

absl-py==2.0.0
anyio==4.2.0
appnope==0.1.3
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asgiref==3.7.2
asttokens==2.4.1
astunparse==1.6.3
async-lru==2.0.4
attrs==23.2.0
Babel==2.14.0
beautifulsoup4==4.12.2
bleach==6.1.0
cachetools==5.3.2
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.0
coloredlogs==15.0.1
comm==0.2.0
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
distlib==0.3.7
Django==4.2.6
executing==2.0.1
fastjsonschema==2.19.1
filelock==3.13.0
flatbuffers==23.5.26
fqdn==1.5.1
gast==0.5.4
google-auth==2.23.3
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
grpcio==1.59.0
h5py==3.10.0
humanfriendly==10.0
idna==3.4
imageio==2.31.5
imbalanced-learn==0.11.0
imblearn==0.0
ipykernel==6.26.0
ipython==8.17.2
ipywidgets==8.1.1
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
joblib==1.3.2
json5==0.9.14
jsonpointer==2.4
jsonschema==4.20.0
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3


# Notes and Future Improvements 

From this entire process I took a few notes that I would like to mention here:
- Much of the data pre-processing was implemented on the assumption that, aside from the column 'nPaidOff', every column would have no missing entries. An improvement would be to add further pre-processing steps that impute data for other features that may have missing entries.
- In the feature-engineering step, I had a vague idea for another feature that could be created, but I was not confident that it would actually improve the model's ability to predict loan application risk.
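
As a rough idea of what the imputation improvement in the first note could look like, the sketch below fills numeric columns with their median and other columns with their most frequent value. This is only one possible strategy and has not been tried on the loan data; `impute_missing` and the toy dataframe are hypothetical.

```python
import pandas as pd


def impute_missing(frame):
    """Return a copy with missing values filled: median for numeric columns, mode otherwise."""
    out = frame.copy()
    for column in out.columns:
        if not out[column].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[column]):
            out[column] = out[column].fillna(out[column].median())
        else:
            out[column] = out[column].fillna(out[column].mode().iloc[0])
    return out


# Toy data (invented) with one missing value per column.
toy = pd.DataFrame({
    "loanAmount": [500.0, None, 700.0],
    "payFrequency": ["B", None, "B"],
})
filled = impute_missing(toy)
print(filled)
```

In a retraining setting, the fill values would normally be learned from the training data and reused at prediction time, for example with scikit-learn's `SimpleImputer`, which is already imported at the top of this notebook.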


Total time spent: ~40hrs  
Notebook runtime: 20 mins