<a href="https://colab.research.google.com/github/reuben-mwangi/Loan-Prediction/blob/main/Deploying_machine_learning_model_using_Streamlit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview of Machine Learning Lifecycle
Let’s start with the overall machine learning lifecycle and the different steps that are
involved in creating a machine learning project. Broadly, the entire machine learning lifecycle can be
described as a combination of six stages:

## Stage 1: Problem Definition
 The first and most important part of any project is to define the problem statement. Here, we want to
describe the aim or the goal of our project and what we want to achieve at the end.
## Stage 2: Hypothesis Generation
 Once the problem statement is finalized, we move on to the hypothesis generation part. Here, we try to
point out the factors/features that can help us to solve the problem at hand.
## Stage 3: Data Collection

After generating hypotheses, we get the list of features that are useful for a problem. Next, we collect the
data accordingly. This data can be collected from different sources.

## Stage 4: Data Exploration and Pre-processing
After collecting the data, we move on to explore and pre-process it. These steps help us to generate
meaningful insights from the data. We also clean the dataset in this step, before building the model.
## Stage 5: Model Building
Once we have explored and pre-processed the dataset, the next step is to build the model. Here, we create
predictive models in order to build a solution for the project.

## Stage 6: Model Deployment
Once you have the solution, you want to showcase it and make it accessible for others. And hence, the
final stage of the machine learning lifecycle is to deploy that model.
These are the six stages of the machine learning lifecycle. The aim of this project is to understand the last
stage, model deployment, in detail using Streamlit. However, before diving deep into the deployment part,
I will briefly walk through the remaining stages of the lifecycle along with their implementation in
Python.

## Understanding the Problem Statement: Automating Loan Prediction
The project that I have picked is automating the loan eligibility process. The task is
to predict whether the loan will be approved or not based on the details provided by customers. Here is the
problem statement for this project:
### Automate the loan eligibility process based on the customer details provided while filling out the online application form.

Based on the details provided by customers, we have to create a model that can decide whether or not their
loan should be approved. This completes the problem definition part of the first stage of the machine
learning lifecycle. The next step is to generate hypotheses and point out the factors that will help us to
predict whether the loan for a customer should be approved or not.
As a starting point, here are a few factors that I think will be helpful for us with respect to this
project:

1. Amount of loan: The total amount of loan applied for by the customer. My hypothesis here is that the
higher the amount of loan, the lower the chances of loan approval, and vice versa.
2. Income of applicant: The income of the applicant (customer) can also be a deciding factor. A higher
income should lead to a higher probability of loan approval.
3. Education of applicant: The educational qualification of the applicant can also be a vital factor in
predicting the loan status of a customer. My hypothesis is that the higher the educational qualification
of the applicant, the higher the chances of their loan approval.

These are some factors that can be useful to predict the loan status of a customer. Obviously, this is a very
small list, and you can come up with many more hypotheses. But, since the focus of this article is on
model deployment, I will leave this hypothesis generation part for you to explore further.
Next, we need to collect the data. We know certain features that we want like the income details,
educational qualification, and so on. 
We have some variables related to the loan, like the loan ID, which is the unique ID for each customer, Loan
Amount and Loan Amount Term, which tells us the amount of loan in thousands and the term of the loan in
months respectively. Credit History represents whether a customer has any previous unclear debts or not.
Apart from this, we have customer details as well, like their Gender, Marital Status, Educational
qualification, income, and so on. Using these features, we will create a predictive model that will predict
the target variable which is Loan Status representing whether the loan will be approved or not.
Now we have finalized the problem statement, generated the hypotheses, and collected the data. Next is
the data exploration and pre-processing phase, in which we explore the dataset and pre-process it. The
common steps in this phase are as follows:

1. Univariate Analysis
2. Bivariate Analysis
3. Missing Value Treatment
4. Outlier Treatment
5. Feature Engineering

Exploring the variables individually is called univariate analysis. Exploring the effect of one
variable on another, or exploring two variables at a time, is bivariate analysis. We also look for any
missing values or outliers that might be present in the dataset and deal with them. We might also
create new features from the existing ones, which is referred to as feature engineering. Again, I will
not focus much on these data exploration parts and will only do the necessary pre-processing.
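
As a rough illustration of the five steps listed above, here is a minimal pandas sketch. The toy table, the median fill, the IQR-based cap, and the `LoanToIncome` feature are all assumptions made up for this example; they are not the treatment used later in this notebook, which simply drops rows with missing values.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only; NOT the real loan dataset.
toy = pd.DataFrame({
    "ApplicantIncome": [2500, 4000, 3200, 81000, 5100, 2900],
    "LoanAmount": [100.0, 128.0, np.nan, 360.0, 150.0, 95.0],
    "Credit_History": [1, 1, 0, 1, 1, 0],
    "Loan_Status": [1, 1, 0, 1, 1, 0],
})

# 1. Univariate analysis: summary statistics of a single variable.
income_summary = toy["ApplicantIncome"].describe()

# 2. Bivariate analysis: approval rate within each credit-history group.
approval_by_credit = toy.groupby("Credit_History")["Loan_Status"].mean()

# 3. Missing value treatment: fill the missing loan amount with the median.
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].median())

# 4. Outlier treatment: cap incomes above the IQR-based upper fence.
q1, q3 = toy["ApplicantIncome"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
toy["IncomeCapped"] = toy["ApplicantIncome"].clip(upper=upper_fence)

# 5. Feature engineering: loan amount relative to the capped income.
toy["LoanToIncome"] = toy["LoanAmount"] / toy["IncomeCapped"]

print(income_summary)
print(approval_by_credit)
print(toy)
```

Each step here is a single line on purpose; on real data, every one of them deserves a closer look before settling on a treatment.
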
After exploring and pre-processing the data, next comes the model building phase. Since it is a
classification problem, we can use any of the classification models like the logistic regression, decision
tree, random forest, etc. I have tried all of these 3 models for this problem and random forest produced the
best results. So, I will use a random forest as the predictive model for this project.
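
The comparison described above can be sketched as follows. This is only an outline of the procedure: it runs on synthetic data from `make_classification` rather than the loan dataset, and the hyperparameters (`max_iter`, `max_depth`) are assumptions, so the scores it prints say nothing about which model wins on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself uses the loan dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=10
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

# The three candidate classifiers mentioned in the text.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=10),
    "random forest": RandomForestClassifier(max_depth=4, random_state=10),
}

# Fit each candidate on the training split and score it on the validation split.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_val, clf.predict(X_val))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```
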




## Machine Learning model for Automating Loan Prediction

In [None]:
import pandas as pd

train = pd.read_csv("/content/train_ctrUa4K.csv")

train.head()

Here are the first five rows from the dataset. We know that machine learning models take only numbers as
inputs and can not process strings. So, we have to deal with the categories present in the dataset and
convert them into numbers.
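
Before applying the conversion to the real columns, it may help to see how a dictionary-based mapping behaves on a toy column. In this small sketch the value "Other" is a hypothetical category with no entry in the dictionary; such values, like values that were already missing, come out as NaN.

```python
import pandas as pd

# Toy column for illustration; "Other" is a hypothetical value that is
# absent from the mapping dictionary, and None stands for a missing entry.
gender = pd.Series(["Male", "Female", "Other", None])

encoded = gender.map({"Male": 0, "Female": 1})

print(encoded.tolist())
print(encoded.isnull().sum())
```

This is one reason to check for missing values after the mapping step rather than before it.
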

In [None]:
train['Gender']= train['Gender'].map({'Male':0, 'Female':1})
train['Married']= train['Married'].map({'No':0, 'Yes':1})
train['Loan_Status']= train['Loan_Status'].map({'N':0, 'Y':1})

Here, we have converted the categories present in the Gender, Married and Loan_Status variables into
numbers using the map method of pandas. Next, let’s check if there are any missing values in the
dataset:

In [None]:
train.isnull().sum()

So, there are missing values in several variables, including Gender, Married, and LoanAmount. Next,
we will remove all the rows which contain any missing values:

In [None]:
train = train.dropna()

train.isnull().sum()

Now there are no missing values in the dataset. Next, we will separate the dependent (Loan_Status) and
the independent variables:


In [None]:
x= train[["Gender","Married","ApplicantIncome","LoanAmount","Credit_History"]]

y= train.Loan_Status

x.shape, y.shape

For this particular project, I have picked only the 5 variables that I think are most relevant: Gender,
Married, ApplicantIncome, LoanAmount, and Credit_History. These are stored in the variable x, and the
target variable is stored in another variable, y. There are 480 observations available. Next, let’s move on
to the model building stage.
Here, we will first split our dataset into a training and validation set, so that we can train the model on the
training set and evaluate its performance on the validation set.


In [None]:
from sklearn.model_selection import train_test_split

x_train,x_cv,y_train,y_cv = train_test_split(x,y,test_size = 0.2,random_state = 10)
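
As a quick check of what a test_size of 0.2 means for the 480 rows reported above, the sketch below splits a stand-in list of the same length. The names `rows`, `train_rows`, and `cv_rows` are hypothetical and exist only for this illustration.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 480 rows left after dropping missing values.
rows = list(range(480))

train_rows, cv_rows = train_test_split(rows, test_size=0.2, random_state=10)

# 20 percent of 480 goes to validation, the rest to training.
print(len(train_rows), len(cv_rows))
```

So the model is trained on 384 rows and validated on the remaining 96.
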

We have split the data using the train_test_split function from the sklearn library keeping the test_size as
0.2 which means 20 percent of the total dataset will be kept aside for the validation set. Next, we will train
the random forest model using the training set:

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(max_depth=4, random_state=10)

model.fit(x_train, y_train)

Here, I have kept the max_depth as 4 for each of the trees of our random forest and stored the trained
model in a variable named model. Now that our model is trained, let’s check its performance on both the
training and validation sets:

In [None]:
from sklearn.metrics import accuracy_score

pred_cv = model.predict(x_cv)

accuracy_score(y_cv,pred_cv)

The model is 80% accurate on the validation set. Let’s check the performance on the training set too:

In [None]:
pred_train = model.predict(x_train)
accuracy_score(y_train, pred_train)
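
For reference, the accuracy reported by accuracy_score is just the fraction of predictions that match the true labels. The plain-Python sketch below spells that out; the helper `accuracy` and the label lists are made up for illustration and are not model output.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only.
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
y_pred_demo = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)
```

Two of the ten toy predictions are wrong, so the accuracy is 0.8. A training accuracy far above the validation accuracy would hint at overfitting.
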

Performance on the training set is close to that on the validation set. So, the model has
generalized well. Finally, we will save this trained model so that it can be used in the future to make
predictions on new observations.

In [None]:
# Saving the model

import pickle

# the with-block closes the file automatically once the model is written
with open("classifier.pkl", mode="wb") as pickle_out:
    pickle.dump(model, pickle_out)

We are saving the model in pickle format and storing it as classifier.pkl. This file holds the trained model,
and we will load it again while deploying the model.
This completes the first five stages of the machine learning lifecycle. Next, we will explore the last stage,
which is model deployment. We will be deploying this loan prediction model so that it can be accessed by
others. To do so, we will use Streamlit, a recent tool that offers one of the simplest ways of building web
apps and deploying machine learning and deep learning models.
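
To illustrate how a pickle file written in one place is read back in another, here is a small round trip. It is a sketch under assumptions: a plain dictionary stands in for the fitted random forest, and the file goes to a temporary directory under the hypothetical name `classifier_demo.pkl` so that the real classifier.pkl is left untouched.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted random forest.
stand_in_model = {"name": "demo", "max_depth": 4}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "classifier_demo.pkl")

# Write the object to disk in binary mode.
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back, as the deployed app will do with classifier.pkl.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

Keep in mind that a pickle file should only be loaded from a source you trust, since unpickling can execute arbitrary code.
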



## Introduction to Streamlit

As per the founders of Streamlit, it is the fastest way to build data apps and share them. It is a recent
model deployment tool that simplifies the entire model deployment cycle and lets you deploy your models
quickly. I have been exploring this tool for the past couple of weeks and as per my experience, it is a
simple, quick, and interpretable model deployment tool.
Here are some of the key features of Streamlit which I found really interesting and useful:
1. It quickly turns data scripts into shareable web applications. You just have to pass a running script to
the tool and it can convert that to a web app.
2. Everything in Python. The best thing about Streamlit is that everything we do is in Python. Starting
from loading the model to creating the frontend, all can be done using Python.
3. All for free. It is open source and hence no cost is involved. You can deploy your apps without paying
for them.
4. No front-end experience required. Model deployment generally contains two parts, a frontend and a
backend. The backend is generally a working model, a machine learning model in our case, which is
built in Python. The front end generally requires some knowledge of other languages
like JavaScript. Using Streamlit, we can create this front end in Python itself. So, we need not
learn any other programming languages or web development techniques. Understanding Python is
enough.
Let’s say we are deploying the model without using Streamlit. In that case, the entire pipeline will look
something like this:
1. Model Building
2. Creating a Python script
3. Writing a Flask app
4. Creating the front end: JavaScript
5. Deploying

We will first build our model and convert it into a Python script. Then we will have to create the web app
using, say, Flask. We will also have to create the front end for the web app, and here we will have to use
JavaScript. Finally, we will deploy the model. So, notice that we need knowledge of Python to build the
model and then a thorough understanding of JavaScript and Flask to build the front end and deploy the
model. Now, let’s look at the deployment pipeline if we use Streamlit:

1. Model Building
2. Creating a Python script
3. Creating the front end: Python
4. Deploying

Here we will build the model and create a Python script for it. Then we will build the front end for the app,
which will also be in Python, and finally, we will deploy the model.



# Model Deployment of the Loan Prediction model using Streamlit

In [None]:
# we start by installing basics
!pip install -q pyngrok

!pip install -q streamlit

!pip install -q streamlit_ace


We have installed 3 libraries here. pyngrok is a Python wrapper for ngrok, which opens secure
tunnels from public URLs to localhost. This will help us to host our web app. Streamlit will be used to build
the web app itself.
Next, we will have to create a separate session in Streamlit for our app. You can download the
sessionstate.py file from here and store that in your current working directory. This will help you to create
a session for your app. Finally, we have to create the python script for our app. Let me show the code first
and then I will explain it to you in detail:

In [None]:
%%writefile app.py

import pickle
import streamlit as st

# loading the trained model
with open("classifier.pkl", "rb") as pickle_in:
  classifier = pickle.load(pickle_in)


# defining the function which will make the prediction using the data which the user inputs
# (st.cache_data replaces the deprecated st.cache; it needs Streamlit 1.18 or later)
@st.cache_data
def prediction(Gender, Married, ApplicantIncome, LoanAmount, Credit_History):

  # pre-processing user input
  if Gender == "Male":
    Gender = 0
  else:
    Gender = 1

  if Married == "Unmarried":
    Married = 0
  else:
    Married = 1

  if Credit_History == "Unclear Debts":
    Credit_History = 0
  else:
    Credit_History = 1

  # the training data stores LoanAmount in thousands
  LoanAmount = LoanAmount / 1000

  # making predictions
  result = classifier.predict(
    [[Gender, Married, ApplicantIncome, LoanAmount, Credit_History]]
  )

  if result[0] == 0:
    pred = "Rejected"
  else:
    pred = "Approved"
  return pred


# this is the main function in which we define our webpage


def main():
  # front end elements of the web page
  html_temp = """
  <div style="background-color:yellow;padding:13px">
  <h1 style="color:black;text-align:center;">Streamlit Loan Prediction ML App</h1>
  </div>
  """

  # display the front end aspect
  st.markdown(html_temp, unsafe_allow_html=True)

  # following lines create boxes in which user can enter data required to make prediction
  Gender = st.selectbox("Gender", ("Male", "Female"))
  Married = st.selectbox("Marital_Status", ("Unmarried", "Married"))
  ApplicantIncome = st.number_input("Applicants monthly income")
  LoanAmount = st.number_input("Total loan amount")
  Credit_History = st.selectbox("Credit_History", ("Unclear Debts", "No unclear Debts"))
  result = ""

  # when "predict" is clicked, make the prediction and display it
  if st.button("predict"):
    result = prediction(Gender, Married, ApplicantIncome, LoanAmount, Credit_History)
    st.success("Your loan is {}".format(result))
    print(LoanAmount)

if __name__ == "__main__":
  main()
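
The input-encoding step of the script can also be checked on its own, without launching Streamlit or loading the model. The helper `encode_inputs` below is hypothetical (it is not part of app.py); it is a sketch that mirrors the same rules, including dividing the loan amount by 1000 because LoanAmount is stored in thousands in the training data.

```python
def encode_inputs(gender, married, applicant_income, loan_amount, credit_history):
    """Turn the form values into a numeric row in the training column order."""
    gender_code = 0 if gender == "Male" else 1
    married_code = 0 if married == "Unmarried" else 1
    credit_code = 0 if credit_history == "Unclear Debts" else 1
    # Mirrors the script: the entered amount is converted to thousands.
    return [gender_code, married_code, applicant_income, loan_amount / 1000, credit_code]


# Example form values, chosen only for illustration.
row = encode_inputs("Female", "Married", 5000, 120000, "No unclear Debts")
print(row)
```

The order of the returned values matches the columns used for training: Gender, Married, ApplicantIncome, LoanAmount, Credit_History.
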
  

Alright, let’s now host this app on a public URL using the pyngrok library.

In [None]:
!streamlit run app.py &>/dev/null&

Here, we are first running the python script. And then we will connect it to a public URL:

In [None]:
from pyngrok import ngrok

public_url = ngrok.connect("8501")

public_url

This will generate a public link through which the app can be accessed.