# Capstone Project: Predicting the need for hospital admittance using machine learning
### (Notebook 2 of 2)

By: Yash Nagpaul

*(Data Science Diploma Candidate, BrainStation)*

## Table of Contents:
1. <a href="#Note:">Introductory note</a>
2. <a href="#Part-3-—-Modelling">Part 3 — Modelling</a>

---
#### Note:
This is the second of two Jupyter notebooks used in this project. In this notebook, we fit models on the cleaned and scaled datasets created in the first notebook.

---
### Part 3 — Modelling

In [1]:
# import helper libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [5]:
X_train = np.loadtxt('./X_train.txt', delimiter=',')
X_test = np.loadtxt('./X_test.txt', delimiter=',')
y_train = np.loadtxt('./y_train.txt', delimiter=',')
y_test = np.loadtxt('./y_test.txt', delimiter=',')

# Sanity check
X_train

array([[-1.55037225e-02, -9.23507447e-01, -4.64896751e-02, ...,
        -1.91282758e-02, -1.05535639e-01, -2.11198305e-03],
       [-1.55037225e-02,  8.88801104e-01, -4.64896751e-02, ...,
        -1.91282758e-02, -1.05535639e-01, -2.11198305e-03],
       [-1.55037225e-02,  6.92875855e-01, -4.64896751e-02, ...,
        -1.91282758e-02, -1.05535639e-01, -2.11198305e-03],
       ...,
       [ 1.13714702e+00, -7.27582198e-01, -4.64896751e-02, ...,
        -1.91282758e-02, -1.05535639e-01, -2.11198305e-03],
       [ 2.28979776e+00, -1.26637663e+00, -4.64896751e-02, ...,
        -1.91282758e-02, -1.05535639e-01, -2.11198305e-03],
       [-1.16815446e+00,  1.23167029e+00, -4.64896751e-02, ...,
        -1.91282758e-02, -1.05535639e-01, -2.11198305e-03]])

## Logistic Regression
- Let's begin with one of the simplest classification models: logistic regression.
    - Briefly, a logistic regression passes a weighted sum of the input variables through the logistic (sigmoid) function and outputs the probability that a binary target variable equals 1.
- First, we will fit a logistic regression with default settings, without any hyperparameter optimization.
- That gives us a quick baseline accuracy to work with and improve upon.
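
As a rough illustration of the description above, the following sketch computes a single prediction by hand. The coefficients, intercept, and feature values are hypothetical numbers chosen for the example, not values from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and intercept, chosen only for illustration
weights = np.array([0.8, -0.5])
intercept = 0.1

# One observation with two (already scaled) features
x = np.array([1.2, -0.3])

# Weighted sum of the inputs, then squashed into a probability
z = intercept + weights @ x
p_admit = sigmoid(z)

# Classify as '1' when the probability is at least 0.5
predicted_class = int(p_admit >= 0.5)
print(round(p_admit, 3), predicted_class)
```

In scikit-learn, `predict_proba` returns these probabilities for a fitted `LogisticRegression`, and `predict` applies the 0.5 threshold for a binary target.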

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from warnings import filterwarnings

# Silence all warnings (e.g. solver convergence warnings) to keep the output readable
filterwarnings('ignore')

logreg = LogisticRegression(random_state=7)
logreg.fit(X_train, y_train)

train_acc = logreg.score(X_train, y_train)
test_acc = logreg.score(X_test, y_test)

print('Model accuracy with training data:', round(train_acc*100,2),"%")
print('Model accuracy with test data:', round(test_acc*100,2),"%")

Model accuracy with training data: 86.07 %
Model accuracy with test data: 86.04 %
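
To put these accuracies in context, it helps to compare them with the accuracy of always predicting the most frequent class. The following is a minimal sketch that uses a synthetic label array (`y_demo`, an assumption for illustration) in place of `y_test`.

```python
import numpy as np

# Synthetic stand-in for y_test: 70% zeros, 30% ones (illustrative only)
y_demo = np.array([0] * 70 + [1] * 30)

# Accuracy of always predicting the most frequent class
class_counts = np.bincount(y_demo.astype(int))
majority_class = int(class_counts.argmax())
baseline_acc = class_counts.max() / len(y_demo)

print('Majority class:', majority_class)
print('Baseline accuracy:', round(baseline_acc * 100, 2), '%')
```

With the real labels, `y_demo` would be replaced by `y_test`. The `astype(int)` call matters there because `np.loadtxt` returns floats by default, and `np.bincount` expects integers.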


#### DISCUSSION:
- Our first model is about 86% accurate on unseen data!
- That is well above the roughly 70% accuracy we would get by guessing '0' for every disposition (since 70% of all dispositions in our dataset are '0').
- It is important to note that we dropped over 300 columns because each was missing more than 10% of its values.
- Even so, we achieve an accuracy that is almost on par with the original study.
- Let us save the model that we just trained into a pickle file:

In [10]:
import pickle

logreg_pkl_file = 'disposition_logreg.pkl'

with open(logreg_pkl_file, 'wb') as file:
    pickle.dump(logreg, file)
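
A saved model is only useful if it can be restored later. The sketch below shows a save-and-reload round trip on a small synthetic dataset; `X_demo`, `y_demo`, and the temporary file path are assumptions made for the example, not part of the project data.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset standing in for X_train / y_train
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

model = LogisticRegression(random_state=7).fit(X_demo, y_demo)

# Save to a temporary file (hypothetical path), then load it back
pkl_path = Path(tempfile.mkdtemp()) / 'demo_logreg.pkl'
with open(pkl_path, 'wb') as file:
    pickle.dump(model, file)

with open(pkl_path, 'rb') as file:
    loaded_model = pickle.load(file)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo), loaded_model.predict(X_demo))
print('Predictions match after reload:', same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that created them, and pickle files from untrusted sources should not be loaded at all.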

- Let's try to improve this result by tuning the regularization hyperparameters:

In [11]:
from sklearn.model_selection import GridSearchCV

# parameter grid for logistic regression
param_grid = {
    'C': np.logspace(-3, 7, 11),
    # The default 'lbfgs' solver does not support the L1 penalty,
    # so only L2 is searched here. We will address L1 later.
    'penalty': ['l2']
}

# GridSearchCV allows us to search over multiple params in a model
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

logreg_cv.fit(X_train, y_train)

# Best hyperparameters
logreg_cv.best_params_

{'C': 0.1, 'penalty': 'l2'}
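
Beyond `best_params_`, a fitted `GridSearchCV` also exposes the cross-validated score of every candidate and, with the default `refit=True`, a best model already refit on the full training data. The following sketch shows both on a small synthetic dataset; `X_demo`, `y_demo`, and the reduced grid are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset standing in for X_train / y_train
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 4))
noise = rng.normal(scale=0.5, size=300)
y_demo = (X_demo[:, 0] - X_demo[:, 2] + noise > 0).astype(int)

# Reduced grid to keep the example fast
demo_grid = {'C': np.logspace(-2, 2, 5), 'penalty': ['l2']}
demo_cv = GridSearchCV(LogisticRegression(random_state=7), demo_grid, cv=5)
demo_cv.fit(X_demo, y_demo)

# Mean cross-validated accuracy for each value of C
results = demo_cv.cv_results_
for C, score in zip(results['param_C'], results['mean_test_score']):
    print(f'C={C:g}: mean CV accuracy = {score:.4f}')

# With refit=True (the default), the best model is already refit on all the data
best_model = demo_cv.best_estimator_
print('Best C:', demo_cv.best_params_['C'])
print('Best mean CV accuracy:', round(demo_cv.best_score_, 4))
```

Comparing the per-candidate scores shows whether the best value of `C` wins by a meaningful margin or whether several values perform almost identically.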

In [12]:
logreg2 = LogisticRegression(random_state=7, C=0.1, penalty='l2')
logreg2.fit(X_train, y_train)

train_acc = logreg2.score(X_train, y_train)
test_acc = logreg2.score(X_test, y_test)

print('Model accuracy with training data:', round(train_acc*100,2),"%")
print('Model accuracy with test data:', round(test_acc*100,2),"%")

Model accuracy with training data: 86.07 %
Model accuracy with test data: 86.03 %


- The tuned model performs essentially the same as the first one: the training accuracy is unchanged at 86.07%, and the test accuracy moves from 86.04% to 86.03%.
- Since the train and test accuracies are nearly equal, the model does not appear to be overfitting; the stronger regularization (C=0.1 instead of the default C=1.0) makes no practical difference.
- Next, we want to try methods that account for any multicollinearity in the dataset.
- I am keen on fitting a logreg model with an L1 penalty. Therefore, I will set the *solver* to 'saga', which supports L1 and is suited to large datasets.
- Then, let us repeat the process of finding the best hyperparameters through a grid search.
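
The plan in the last two bullets could be sketched roughly as follows. This is not the notebook's final code: the synthetic data (`X_demo`, `y_demo`, with two nearly collinear columns), the reduced grid, and the `max_iter` value are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with two nearly collinear features (columns 0 and 1)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(400, 5))
X_demo[:, 1] = X_demo[:, 0] + rng.normal(scale=0.01, size=400)
y_demo = (X_demo[:, 0] + X_demo[:, 3] > 0).astype(int)

# 'saga' supports the L1 penalty; max_iter is raised to help it converge
logreg_l1 = LogisticRegression(penalty='l1', solver='saga',
                               max_iter=5000, random_state=7)

# Reduced grid (an assumption) to keep the example fast
l1_grid = {'C': np.logspace(-2, 1, 4)}
logreg_l1_cv = GridSearchCV(logreg_l1, l1_grid, cv=5)
logreg_l1_cv.fit(X_demo, y_demo)

# L1 regularization can shrink some coefficients exactly to zero
best_l1 = logreg_l1_cv.best_estimator_
n_zero = int(np.sum(best_l1.coef_ == 0))
print('Best C:', logreg_l1_cv.best_params_['C'])
print('Coefficients set exactly to zero:', n_zero, 'of', best_l1.coef_.size)
```

Because an L1 penalty can drive coefficients exactly to zero, it may keep only one of a group of correlated features, which is why it is a common first response to multicollinearity. On the full dataset, 'saga' may need many more iterations than the default to converge.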

In [13]:
# TODO: re-run cells 11 and 12 before proceeding

In [14]:
# TODO: grid search with the L1 penalty ('saga' solver) and everything past that