### Feature Engineering

In [None]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import random
import time

In [2]:
# Seed Python's random module before any modeling so that results can be replicated.
# (The scikit-learn calls below are seeded separately through their random_state arguments.)
random.seed(100)

#### Dataset

In [3]:
dataset = pd.read_csv('Financial-Data.csv')

In [4]:
dataset = dataset.drop(columns= ['months_employed']) # drop the column that we found to be faulty

#### Feature engineering

In [5]:
dataset['personal_account_months'] = (dataset.personal_account_m + (dataset.personal_account_y * 12))

In [6]:
dataset[ ['personal_account_m', 'personal_account_y', 'personal_account_months'] ].head()

Unnamed: 0,personal_account_m,personal_account_y,personal_account_months
0,6,2,30
1,2,7,86
2,7,1,19
3,2,7,86
4,2,8,98


In [7]:
dataset = dataset.drop(columns= ['personal_account_m', 'personal_account_y'])

#### Data Preprocessing

To encode the categorical variables, we leverage the pandas function `pd.get_dummies`.
<br><br>
It finds all the categorical variables (all the variables that are not numeric) and encodes each of them into its own set of dummy variables, all in one quick call.

##### Dummy variables

In [8]:
dataset = pd.get_dummies(dataset) # replace dataset with the result of pd.get_dummies on the dataset

In [9]:
dataset.columns

Index(['Entry_id', 'age', 'home_owner', 'income', 'years_employed',
       'current_address_year', 'has_debt', 'amount_requested', 'risk_score',
       'risk_score_2', 'risk_score_3', 'risk_score_4', 'risk_score_5',
       'ext_quality_score', 'ext_quality_score_2', 'inquiries_last_month',
       'e_signed', 'personal_account_months', 'pay_schedule_bi-weekly',
       'pay_schedule_monthly', 'pay_schedule_semi-monthly',
       'pay_schedule_weekly'],
      dtype='object')

##### explained


- The pay schedule columns are the bi-weekly label, the monthly label, the semi-monthly label and the weekly label.
    - So `get_dummies` has split that one column into four different ones.
<br><br>
- We need to remove one of these columns from the dataset to avoid the dummy variable trap.
    - If we keep all four, they are linearly dependent: the four columns always sum to 1, so any one of them can be derived from the other three. What we want is linearly independent columns.
<br>
    - By removing one of them, we make the remaining columns linearly independent again.
<br><br>
- So which one are we going to remove? (Your choice.)
    - I am personally going to remove semi-monthly, just because semi-monthly is the oddest of the pay schedules.
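
The dependence described above can be checked on a toy column. This is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not rows from `Financial-Data.csv`): every row's dummy columns sum to 1, so the dropped column can always be rebuilt from the other three.

```python
import pandas as pd

# Hypothetical toy data; not taken from the real dataset.
toy = pd.DataFrame({"pay_schedule": ["weekly", "monthly", "bi-weekly",
                                     "semi-monthly", "weekly"]})

# dtype=int gives 0/1 columns instead of booleans.
dummies = pd.get_dummies(toy, dtype=int)

# Each row has exactly one 1 across the four dummy columns.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# Drop one column; it can be rebuilt from the remaining three.
reduced = dummies.drop(columns=["pay_schedule_semi-monthly"])
rebuilt = 1 - reduced.sum(axis=1)
print((rebuilt == dummies["pay_schedule_semi-monthly"]).all())
```

Because the dropped column carries no extra information, removing it loses nothing for the models that follow.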
<br><br>

### Data Pre-processing

In [10]:
dataset = dataset.drop(columns= ['pay_schedule_semi-monthly'])

In [11]:
# Set aside the columns that are useful but are not going to be part of the training features
# (the response variable and the user identifier)
response = dataset['e_signed']
users = dataset['Entry_id']
dataset = dataset.drop(columns= ['e_signed', 'Entry_id'])

##### Splitting into Train and Test Set

The `train_test_split` function is going to return four items:
<br>
The feature matrices (the independent variables, X) for the train and test sets, as well as 
<br>
The dependent variable (y) for the train and test sets.
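
As a quick illustration of those four return values, here is a small sketch on made-up data (the `X` and `y` objects below are hypothetical stand-ins, not the loan dataset). With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two features and a binary response.
X = pd.DataFrame({"age": range(20, 30), "income": range(1000, 2000, 100)})
y = pd.Series([0, 1] * 5, name="e_signed")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
# The split keeps rows paired: features and response share index labels.
print(X_te.index.equals(y_te.index))
```

The original index labels survive the split, which is what later lets us map predictions back to users.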

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset, # X value
                                                    response, # y value
                                                    test_size=0.2, # hold out 20% of the rows as the test set
                                                    random_state=0
                                                   )
# We have a train test split immediately.

##### Feature Scaling

In [13]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler() # we first create the standard scaler by instantiating the class

The scaler returns a NumPy array rather than a pandas DataFrame, meaning that the result loses the column names and the index.
<br>
We want an actual pandas DataFrame, so we wrap the result in `pd.DataFrame`.

In [14]:
# fit the scaler to the X train set, then transform X train based on that scaling
# This is the result of X train being scaled.
X_train2 = pd.DataFrame(sc_X.fit_transform(X_train)) 

In [15]:
# we're going to actually do the same thing - test set
# only transform because we already have fitted the scaler
X_test2 = pd.DataFrame(sc_X.transform(X_test))

In [16]:
# Next, copy the columns and the indexes over to the X_train2 and X_test2 datasets,
# because those were lost during scaling
X_train2.columns = X_train.columns.values
X_test2.columns = X_test.columns.values

X_train2.index = X_train.index.values
X_test2.index = X_test.index.values

So now X_train and X_test have been scaled properly, and the indexes and the columns have been recovered.<br>
The only thing left to do is to set X_train to this new X_train2 value (and likewise for X_test).<br>
This just keeps the X_train and X_test names consistent in the lines that follow.

In [17]:
X_train = X_train2
X_test = X_test2
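
The same scale-then-restore steps can be written more compactly by passing the columns and index when the DataFrame is built. A minimal sketch, assuming small made-up frames (`train`, `test` and their values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames with non-default index labels.
train = pd.DataFrame({"age": [25, 35, 45], "income": [2000, 3000, 4000]},
                     index=[7, 2, 9])
test = pd.DataFrame({"age": [30], "income": [2500]}, index=[4])

scaler = StandardScaler()
# Fit on the training rows only, then reuse the same means and
# standard deviations for the test rows.
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)
print(train_scaled)
print(test_scaled)
```

Fitting only on the training set keeps information about the test rows from leaking into the scaling.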

### Model Building 

<u><i><b>Comparing Models:</b></i></u>
<br>
We're going to apply different models to our dataset and compare the results that we get out of each one.<br>
At the end, we're going to select the one best model and move forward with it.

#### Model 1: Logistic Regression

In [18]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# L1 penalty (the lasso penalty):
# ---> it penalizes the model when a particular variable's coefficient grows too large, and can shrink some coefficients to exactly zero.

# Note: no penalty argument is passed above, so scikit-learn's default penalty (L2) is what was actually fitted.
# An L1 penalty may not be needed, but adding it would help make sure
# we're comparing the very best version of each of the different methods that we can use.
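
For reference, here is a hedged sketch of how an L1-penalized logistic regression can be requested. It runs on synthetic stand-in data from `make_classification` (hypothetical, not the loan dataset), and the names `X_demo`, `y_demo` and `l1_clf` are illustrative. The L1 penalty needs a solver that supports it, such as `liblinear` or `saga`; the exact spelling of these options may vary between scikit-learn releases, so check the documentation for your installed version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical); not the loan dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# L1 needs a solver that supports it, such as 'liblinear' or 'saga'.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", random_state=0)
l1_clf.fit(X_demo, y_demo)

# One row of coefficients, one entry per feature.
print(l1_clf.coef_.shape)
```

Swapping this into the notebook would change the reported metrics, so the table below should be regenerated if it is used.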

In [19]:
# Predicting test set
y_pred = classifier.predict(X_test)

In [20]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score 

In [21]:
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [22]:
# put results in a pandas DataFrame
# The first entry is the name of the model: Logistic Regression
# The "(Lasso)" suffix refers to the intended L1 (lasso) penalty; the classifier fitted above uses scikit-learn's default penalty
# Then name the columns
results = pd.DataFrame([['Logistic Regression (Lasso)', acc, prec, rec, f1]], 
                       columns= ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
0,Logistic Regression (Lasso),0.562535,0.576386,0.706432,0.634817


<br>
- Precision: 
    - Precision is the number of true positives divided by the number of true positives plus false positives: TP / (TP + FP).
    - What does that mean?
<br>
-- Out of all the predicted positives, it tells us what share were predicted right.
<br><br>
- Recall score:
    - Recall is true positives divided by true positives plus false negatives: TP / (TP + FN).
    - So out of all the actual positives, the model predicted positive around 70% of the time.
    - Recall (about 71%) is well above precision (about 58%), which means the model produces more false positives than false negatives. In other words, there is some bias in this model toward predicting the positive class.
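
The definitions above can be written out directly. This is a small sketch in plain Python; the confusion counts `tp`, `fp` and `fn` are hypothetical, chosen only to illustrate the formulas. The last line checks that the F1 score in the table is the harmonic mean of the reported precision and recall.

```python
def precision(tp, fp):
    # Share of predicted positives that are truly positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Share of actual positives that were predicted positive.
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Hypothetical confusion counts, for illustration only.
tp, fp, fn = 70, 50, 30
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))

# The F1 reported above follows from the reported precision and recall.
print(round(f1(0.576386, 0.706432), 4))
```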


#### Model 2: Support Vector Machine (SVM)

##### Linear Kernel
<br>
SVM(Linear)
<br>

In [23]:
from sklearn.svm import SVC # importing Support Vector Classifier

classifier = SVC(random_state=0, kernel='linear')
classifier.fit(X_train, y_train)

In [24]:
# Predicting test set
y_pred = classifier.predict(X_test)

In [25]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score 

In [26]:
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [27]:
model_results = pd.DataFrame([['SVM (Linear)', acc, prec, rec, f1]], 
                       columns= ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
model_results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
0,SVM (Linear),0.568118,0.577597,0.735477,0.647045


In [28]:
results = pd.concat([results, model_results], ignore_index=True) # append this to the initial results table (public API instead of the private _append)
results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
0,Logistic Regression (Lasso),0.562535,0.576386,0.706432,0.634817
1,SVM (Linear),0.568118,0.577597,0.735477,0.647045


In [29]:
# results =  results.drop([1, 2, 3, 4], axis='index') ---- drop rows

- Accuracy, precision: 
    - Almost identical to logistic regression.

- Recall: 
    - Pretty high (about 74%).
    - It's even higher than the previous one.
    - So the linear support vector machine still leans toward predicting the positive class.

- F1 score is about 0.65:
    - A little bit higher, but not that much better.
---------
<br>
So now let's do the same for a different kernel.
<br>

##### SVM: RBF Kernel

In [30]:
from sklearn.svm import SVC # importing Support Vector Classifier

classifier = SVC(random_state=0, kernel='rbf')
classifier.fit(X_train, y_train)

In [31]:
y_pred = classifier.predict(X_test)

In [32]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score 

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [33]:
model_results = pd.DataFrame([['SVM (RBF)', acc, prec, rec, f1]], 
                       columns= ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True) # append this to the initial results table
results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
0,Logistic Regression (Lasso),0.562535,0.576386,0.706432,0.634817
1,SVM (Linear),0.568118,0.577597,0.735477,0.647045
2,SVM (RBF),0.591569,0.60573,0.690871,0.645505


<br>The RBF kernel gives us better accuracy: about 2.3 percentage points above the linear SVM and almost three above logistic regression.
The precision increases as well, and the recall goes down a little bit.<br>

#### Model 3: Random Forest

In [34]:
from sklearn.ensemble import RandomForestClassifier 

classifier = RandomForestClassifier(random_state=0, n_estimators=100, criterion='entropy')
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

In [35]:
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score 

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [36]:
# model with n=100 trees
model_results = pd.DataFrame([['Random Forest (n=100)', acc, prec, rec, f1]], 
                       columns= ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True) # append this to the initial results table
results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
0,Logistic Regression (Lasso),0.562535,0.576386,0.706432,0.634817
1,SVM (Linear),0.568118,0.577597,0.735477,0.647045
2,SVM (RBF),0.591569,0.60573,0.690871,0.645505
3,Random Forest (n=100),0.62172,0.640098,0.678942,0.658948


<br>
Random forest with 100 trees has given us 62% accuracy, which is almost exactly 3 percentage points higher than the RBF SVM, which had the highest accuracy so far.
<br><br>
Precision has gone up and recall has gone down by about one point, meaning that the model is more balanced.
<br><br>
So now that we have decided which model to use, the last step is to validate this model, to see if it performs as these numbers suggest.

#### K-Fold Cross Validation

In [37]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=classifier, # the classifier we built: random forest with 100 trees
                            X = X_train,
                            y = y_train,
                            cv = 10) # number of folds = 10 

print("Random Forest Classifier Accuracy: %0.2f (+/-%0.2f)" %(accuracies.mean(), accuracies.std() *2))

Random Forest Classifier Accuracy: 0.63 (+/-0.03)


The accuracy that we got from k-fold cross-validation was 63%, plus or minus 
about three points (the printed figure is two standard deviations), meaning that it typically falls between 60% and 66%.
<br><br>
This is slightly higher than the single test-set run recorded in the results table (62%).
<br><br>
So we have good evidence that random forest is the best of the options we tried.<br>
We're going to be using it going forward.
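
To make the "+/-" figure concrete, here is a sketch of the same summary computed by hand. The per-fold accuracies in `fold_scores` are hypothetical, picked only so that the printed line has the same form as the output above.

```python
import statistics

# Hypothetical per-fold accuracies, chosen for illustration.
fold_scores = [0.61, 0.63, 0.64, 0.62, 0.65, 0.63, 0.60, 0.64, 0.63, 0.65]

mean_acc = statistics.fmean(fold_scores)
# pstdev is the population standard deviation, matching NumPy's default.
spread = 2 * statistics.pstdev(fold_scores)

print("Accuracy: %0.2f (+/-%0.2f)" % (mean_acc, spread))
```

The interval is the mean plus or minus two standard deviations across folds, not a single standard deviation.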

#### Parameter Tuning - Grid Search

We're going to fine-tune this model.<br>
We're going to do parameter tuning.<br>
What this accomplishes is finding the random forest parameters that 
give us the highest accuracy in our model.
<br><br>
We select a range of candidate values for each of those parameters and try every single combination
of those to see which one performs best with our data. That is what grid search does.
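
"Every single combination" adds up quickly. The sketch below counts the combinations for the grid used in the first entropy round, using only the standard library; the variable names `combos` and `n_folds` are illustrative.

```python
from itertools import product

# Same grid as the first entropy round.
parameters = {"max_depth": [3, None],
              "max_features": [1, 5, 10],
              "min_samples_split": [2, 5, 10],
              "min_samples_leaf": [1, 5, 10],
              "bootstrap": [True, False],
              "criterion": ["entropy"]}

# Cartesian product of all candidate values.
combos = list(product(*parameters.values()))
n_folds = 10
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
```

With 10-fold cross-validation each combination is fitted ten times, which is why the search is timed and why the number of cores matters.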

##### Round 1: Entropy

In [38]:
# we're going to give it parameters, which is a dictionary of parameter names and their candidate values
parameters = {"max_depth":[3, None],
             "max_features":[1, 5, 10],
             "min_samples_split":[2, 5, 10],  
             "min_samples_leaf":[1, 5, 10],
             "bootstrap":[True, False],
             "criterion":["entropy"]}
# min_samples_split uses 2, 5 and 10 instead of 1, 5 and 10 because 2 is its default (and its minimum integer value).
# bootstrap: there are only two possible values for this argument - True / False
# criterion: this ensures that we're trying this round with entropy and nothing else.

In [39]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=classifier,   # model = random forest;
                          param_grid=parameters,
                          scoring="accuracy",      # So we want to judge which model is best simply based on its accuracy
                          cv=10,                   # k-fold cross validation
                          n_jobs=1                 # single core; set n_jobs=-1 to use all of your cores
                          )  
# n_jobs: 
#   n_jobs=1 runs the search on a single core,
#   so it doesn't take too much of a toll on your computer while it runs.
#   If you want it to finish as soon as possible, use n_jobs=-1, 
#   which means using every single core available on your computer.


In [None]:
# fit this model

t0 = time.time() # we use the time library because we want to time how long all of this takes

grid_search = grid_search.fit(X_train, y_train)

t1 = time.time()  # set the final time

print("Took %0.2f seconds" %(t1-t0)) # float of 2 decimals
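
The notebook does not show how the winning parameters are read back. A fitted `GridSearchCV` exposes them as `best_params_` and `best_score_`. Here is a small sketch on synthetic stand-in data (hypothetical, not the loan dataset; `demo_search` and the tiny grid are illustrative and kept small so it runs quickly):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical); not the loan dataset.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    scoring="accuracy",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

best_params = demo_search.best_params_   # winning combination
best_score = demo_search.best_score_     # its mean cross-validated accuracy
print(best_params, round(best_score, 3))
```

In this notebook the equivalent calls would be `grid_search.best_params_` and `grid_search.best_score_` after the fit above.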

In [None]:
# if some weird error shows up: 
#                               - pip install joblib
#                               - conda install joblib

##### Round 2: Entropy

We're going to base the new grid on the previous results, searching a narrower range around each of the best values from round 1.
<br>

In [None]:
parameters = {"max_depth":[None],             # prev: none is the most optimal max depth
             "max_features":[3, 5, 7],        # prev: 5; so we try 3 and 7 on the edges to cover more range 
             "min_samples_split":[8, 10, 12], # prev: 10
             "min_samples_leaf":[1, 2, 3],    # prev: 1
             "bootstrap":[True],       # prev: True
             "criterion":["entropy"]} 

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=classifier,  
                          param_grid=parameters,
                          scoring="accuracy",      
                          cv=10,                   
                          n_jobs=1                
                          )  

In [None]:
t0 = time.time()
grid_search = grid_search.fit(X_train, y_train)
t1 = time.time()
print("Took %0.2f seconds" %(t1-t0))

<br>The results are exactly the same as the ones we got in the first round.<br>
This tells us that the parameters found in the first round are the best of all the ones we have tried.
<br><br>
So we're going to stick with those and run our tests on that particular set of parameters.
<br><br>
Now, the final part here is to apply our model to the test set so that we can see whether this set of parameters actually improves the accuracy or not.<br>

###### Predicting Test Set

In [None]:
y_pred = grid_search.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score 

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

model_results = pd.DataFrame([['Random Forest (n=100), GSx2 + Entropy', acc, prec, rec, f1]], 
                       columns= ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True) # append this to the initial results table
results

##### Round 1: Gini

So this grid search model is actually performing better than the original random forest model.<br>
The second round didn't give us any lift, but the first one did.
<br><br>
Now, we attempted this grid search on the entropy version of Random Forest.<br>
We can do the exact same, but using the gini version.

In [None]:
parameters = {"max_depth":[3, None],
             "max_features":[1, 5, 10],
             "min_samples_split":[2, 5, 10],  
             "min_samples_leaf":[1, 5, 10],
             "bootstrap":[True, False],
             "criterion":["gini"]}

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=classifier,   
                          param_grid=parameters,
                          scoring="accuracy",      
                          cv=10,                   
                          n_jobs=1 # n_jobs=-1               
                          )  

In [None]:
t0 = time.time() 

grid_search = grid_search.fit(X_train, y_train)

t1 = time.time()  

print("Took %0.2f seconds" %(t1-t0)) 

##### Round 2: Gini

In [None]:
parameters = {"max_depth":[None],            # None 
             "max_features":[8, 10, 12],     # 10
             "min_samples_split":[2, 3, 4],  # 2
             "min_samples_leaf":[8, 10, 12], # 10
             "bootstrap":[True],      
             "criterion":["gini"]} 

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=classifier,  
                          param_grid=parameters,
                          scoring="accuracy",      
                          cv=10,                   
                          n_jobs=1                
                          )  

In [None]:
t0 = time.time()
grid_search = grid_search.fit(X_train, y_train)
t1 = time.time()
print("Took %0.2f seconds" %(t1-t0))

In [None]:
# Predicting Test Set

y_pred = grid_search.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score 

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

model_results = pd.DataFrame([['Random Forest (n=100), GSx2 + Gini', acc, prec, rec, f1]], 
                       columns= ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])

results = pd.concat([results, model_results], ignore_index=True) # append this to the initial results table
results

- So we see that the random forest with grid search applied two times on Gini gave us an accuracy of 63.5% on the test set, a precision of 64.9%, a recall of 70% and an F1 score of about 67.3%.
<br><br>
- So random forest with entropy and with Gini are not that different.
- We can assume that the differences here are just due to randomness. Overall we're going to stick with the entropy random forest as the best model among those we tried.
<br><br>

<br>
The difference between Gini and entropy has a lot to do with what this criterion means. <br>
This criterion is the splitting criterion. <br>
It decides how a parent node is partitioned into two child regions in each decision tree that this random forest is made of.
<br>
Every split needs a particular splitting criterion.
<br><br>ENTROPY:<br>
Entropy is a criterion, with a whole equation behind it, that measures how mixed the classes in a node are.
<br>
Splits are chosen to maximize the information gained, that is, the reduction in entropy after every split.
<br><br>GINI:<br>
The Gini version, on the other hand, minimizes the probability of mislabeling.
<br>
So when it does the splitting, it favors splits whose leaves would rarely be mislabeled.
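
Both criteria can be written as short functions of a node's class proportions. This is a minimal sketch in plain Python using the standard textbook definitions; the function names and example proportions are illustrative, and scikit-learn's internal implementation may differ in detail.

```python
from math import log2

def gini(probs):
    """Gini impurity: chance of mislabeling a random sample from the node
    when labels are drawn from the node's own class proportions."""
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# A 50/50 node, a mostly-pure node and a pure node.
for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, round(gini(probs), 3), round(entropy(probs), 3))
```

Both measures are largest for an evenly mixed node and fall to zero for a pure one, which helps explain why the two criteria often produce similar forests.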


### Model Conclusion

We combine the predictions with their actual values and the user identifiers, mapping every user to the prediction we made for that user.

<b> Formatting Final Results </b>

In [None]:
# concatenate y_test (the e_signed column) and our user identifier (the Entry_id column);
# rows are aligned on the index, so dropna() keeps only the test-set rows
final_results = pd.concat([y_test, users], axis=1).dropna() # each Series becomes a column

final_results['predictions'] = y_pred

final_results = final_results[['Entry_id', 'e_signed', 'predictions']] # reorder this results in a way that makes more sense
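
Assigning `y_pred` by position assumes the rows of `final_results` are in the same order as `y_test`. As a hedged alternative that does not depend on how `pd.concat` orders the rows, the frame can be built directly on the index of `y_test`. The toy values below are hypothetical:

```python
import pandas as pd

# Hypothetical toy stand-ins for users, y_test and y_pred.
users = pd.Series([101, 102, 103, 104, 105], name="Entry_id")
y_test = pd.Series([1, 0], index=[3, 1], name="e_signed")
y_pred = [1, 0]  # predictions, in the same row order as y_test

final_results = pd.DataFrame(
    {
        "Entry_id": users.loc[y_test.index].to_numpy(),  # look up ids by label
        "e_signed": y_test.to_numpy(),
        "predictions": y_pred,
    },
    index=y_test.index,
)
print(final_results)
```

Every column here is taken in the order of `y_test`, so each prediction stays attached to the right user.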

In [None]:
# We already know that the accuracy for this is on the level of 63%, so that's pretty good.

### Final Remarks

- The goal: predicting the likelihood of a user e-signing a loan based on their financial history.
<br>
- Our model has given us around 64% accuracy. 
- With it, we have an algorithm that can help predict whether or not a user will complete the e-signing step.
<br>
- One way to leverage this model is to target those predicted not to reach the e-sign phase with customized onboarding.
    - This means that when a lead arrives from a marketplace, they may receive a different onboarding experience based on how likely they are to finish the general (original) onboarding process.
    - This can help our company minimize how many people drop off from the onboarding funnel.
        - This funnel of screens is only as effective as we, as a company, decide to build it.
        - Therefore, any user drop-off in this funnel falls entirely on our shoulders.
<br>
- This is entirely about how we design the onboarding so that we maximize how many people reach that screen.
    - So with new onboarding screens built intentionally to lead these users to finalize the loan application, we can attempt to get more than 40% of the users predicted not to finish the process to actually complete the e-sign step.

- And why 40%?
    - Well, we know that our model has around 64% accuracy.
    - If we take that accuracy as a rough estimate of how often a "will not sign" prediction is correct, then about 64% of the people predicted not to reach the e-sign step really would not reach it.
    - Of course, this means that only around 36 to 40% of this subset of people are expected to reach the e-sign step and complete it anyway.
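
The arithmetic behind that reasoning can be sketched as follows. All numbers are hypothetical and only illustrate the calculation; in particular, the daily volume and the improved rate are assumptions, not figures from the dataset.

```python
# All numbers here are hypothetical, for illustration only.
predicted_non_signers = 1000      # leads per day flagged "will not sign"
baseline_rate = 0.36              # share who sign anyway (1 - 0.64)
improved_rate = 0.45              # assumed rate after a redesigned onboarding

# Additional completed e-signs per day from the improved onboarding.
extra_signers = predicted_non_signers * (improved_rate - baseline_rate)
print(round(extra_signers))
```

Any lift above the baseline rate translates directly into additional completed loans.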

- So if we can change the onboarding process so that we get more than this 36 to 40% to complete it, then we can substantially increase profits.
<br>
- Many lending companies, like the one we are pretending to work for, provide hundreds if not thousands of loans every day, and they make money on each one of these loans.
<br>
- So if we can increase the percentage of loan takers, and we multiply this percentage by all the hundreds or thousands of people who are given loans every day, we are increasing profits considerably.
- And all of this is done with a simple model.
<br>
- The lesson of this case study is that we don't need complex machine learning models to bring value to a company.
- Many times when data scientists join a company, they join with a very small amount of experience.
- Not everybody is going to have 5 to 10 years of experience, so they may only know how to build simple models.
- But as long as they understand the business context and they know how to leverage the model, the profits are there to be acquired.
<br>
- So never fret if your model is a little bit simple, as long as the results are there.