## Applied - Chapter 5

This question uses the Default data set, which records characteristics of credit card
customers, including whether they defaulted. There are 4 variables:
1. default - boolean (Yes/No): whether the customer defaulted.
2. student - boolean (Yes/No): whether the customer is a student.
3. balance - current credit card balance.
4. income - income of the customer.

#### Import block

In [24]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report
from sklearn.model_selection import KFold, LeaveOneOut, train_test_split, cross_val_score
from sklearn.utils import resample
from sklearn.preprocessing import PolynomialFeatures

import statsmodels.formula.api as smf
from util import print_cm

%matplotlib inline
# Note: on matplotlib >= 3.6 this style is named 'seaborn-v0_8-white'
plt.style.use('seaborn-white')

#### Loading dataset

In [25]:
data_path = 'D:\\PycharmProjects\\ISLR\\data\\'
df = pd.read_excel(f'{data_path}Default.xlsx', usecols=list(range(1,5)))

# Encode the Yes/No columns as 0/1 dummies (default2, student2)
for i in ['default', 'student']:
    df[f'{i}2'] = df[i].astype('category').cat.codes

# Get X and y
X = df[['balance', 'income']]
y = df.default

# preview
df.head()

  default student      balance        income  default2  student2
0      No      No   729.526495  44361.625074         0         0
1      No     Yes   817.180407    12106.1347         0         1
2      No      No  1073.549164  31767.138947         0         0
3      No      No   529.250605  35704.493935         0         0
4      No      No   785.655883  38463.495879         0         0


(a) Fitting a logistic regression (using statsmodels to get a baseline)

In [26]:
model = smf.logit('default2 ~ income + balance', data=df).fit()
print(model.summary().tables[1])

Optimization terminated successfully.
         Current function value: 0.078948
         Iterations 10
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.5405      0.435    -26.544      0.000     -12.393     -10.688
income      2.081e-05   4.99e-06      4.174      0.000     1.1e-05    3.06e-05
balance        0.0056      0.000     24.835      0.000       0.005       0.006
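
To make the coefficients concrete, the fitted model can be turned into a predicted default probability through the logistic function. The sketch below plugs in the rounded coefficients from the table above; the customer values (income of 40000, balance of 1500) are illustrative assumptions, not rows from the data, and default_probability is a hypothetical helper.

```python
import math

# Rounded coefficients copied from the summary table above
INTERCEPT = -11.5405
B_INCOME = 2.081e-05
B_BALANCE = 0.0056

def default_probability(income, balance):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1*income + b2*balance)))."""
    log_odds = INTERCEPT + B_INCOME * income + B_BALANCE * balance
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical customer (values chosen for illustration only)
p = default_probability(income=40_000, balance=1_500)
print(f"Predicted probability of default: {p:.3f}")
```

Because the table rounds the balance coefficient to two significant digits, the result is only approximate; calling model.predict on a DataFrame with these columns would use the unrounded estimates.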


(b) Validation set approach (sklearn)

In [27]:
t_prop = 0.5

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=t_prop, random_state=None)

# Fitting model
regr = skl_lm.LogisticRegression()
pred = regr.fit(X_train, y_train).predict_proba(X_test)

# Results
pred_2 = np.where(pred[:,1] > 0.5, 'Yes', 'No')
print_cm(y_test, pred_2, regr)

Confusion Matrix 
 True         No  Yes
Predicted           
No         4824  175
Yes           1    0 

Classification report 
               precision    recall  f1-score   support

          No      0.965     1.000     0.982      4825
         Yes      0.000     0.000     0.000       175

    accuracy                          0.965      5000
   macro avg      0.482     0.500     0.491      5000
weighted avg      0.931     0.965     0.948      5000





Test error rate $\approx 0.035$, i.e. $(175 + 1)/5000 = 0.0352$ from the confusion matrix above.
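
The test error rate can be read directly off the confusion matrix as the share of off-diagonal counts. A minimal sketch, where error_rate is a hypothetical helper and the counts are copied from the output above:

```python
def error_rate(tn, fp, fn, tp):
    """Fraction of misclassified observations: (FP + FN) / total."""
    total = tn + fp + fn + tp
    return (fp + fn) / total

# Counts read from the confusion matrix above
# (rows are predictions, columns are true labels; "Yes" is the positive class)
rate = error_rate(tn=4824, fp=1, fn=175, tp=0)
print(f"Test error rate: {rate:.4f}")
```

With the arrays from the cell above, the same quantity should also be available as np.mean(pred_2 != y_test).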

Repeat the process three times, using a different test-set size each time:

In [28]:
regr = skl_lm.LogisticRegression()
for t_prop in [0.3, 0.2, 0.1]:
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=t_prop, random_state=None)
    
    # Fitting model
    pred = regr.fit(X_train, y_train).predict(X_test)
    
    # Results
    print(f'For test size of {t_prop}\n')
    print_cm(y_test, pred, regr)

For test size of 0.3

Confusion Matrix 
 True         No  Yes
Predicted           
No         2903   95
Yes           2    0 

Classification report 
               precision    recall  f1-score   support

          No      0.968     0.999     0.984      2905
         Yes      0.000     0.000     0.000        95

    accuracy                          0.968      3000
   macro avg      0.484     0.500     0.492      3000
weighted avg      0.938     0.968     0.952      3000

For test size of 0.2

Confusion Matrix 
 True         No  Yes
Predicted           
No         1931   69
Yes           0    0 

Classification report 
               precision    recall  f1-score   support

          No      0.966     1.000     0.982      1931
         Yes      0.000     0.000     0.000        69

    accuracy                          0.966      2000
   macro avg      0.483     0.500     0.491      2000
weighted avg      0.932     0.966     0.949      2000

For test size of 0.1

Confusion Matrix 
 Tru



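(c) The test error rate is stable across splits: between roughly 0.032 and 0.035 in the runs
shown above. Note that the classifier predicts "No" for almost every observation, so this
error is essentially the share of defaulters in each test set.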
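
A useful reference point for these numbers is the error of a classifier that always predicts "No". The sketch below compares the two for the splits whose full output is shown above; the counts are copied from the printed confusion matrices, and the tuple layout is just an assumption made for this illustration.

```python
# (test size, true "No" count, true "Yes" count, false positives, false negatives),
# copied from the confusion matrices printed above
splits = [
    (0.5, 4825, 175, 1, 175),
    (0.3, 2905, 95, 2, 95),
    (0.2, 1931, 69, 0, 69),
]

results = []
for size, n_no, n_yes, fp, fn in splits:
    total = n_no + n_yes
    model_err = (fp + fn) / total   # logistic regression test error
    null_err = n_yes / total        # error of always predicting "No"
    results.append((size, model_err, null_err))
    print(f"test_size={size}: model={model_err:.4f}, always-No={null_err:.4f}")
```

In these runs the fitted model never beats the always-"No" baseline, which suggests the accuracy figure alone says little about how well defaults are detected.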

(d) Now we add the student dummy variable (student2) to the predictors

In [29]:
# New X
X = df[['income', 'balance', 'student2']]

# Split data (t_prop is still 0.1, left over from the loop above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=t_prop, random_state=None)

# Fitting model
regr = skl_lm.LogisticRegression()
pred = regr.fit(X_train, y_train).predict(X_test)

# Results
print_cm(y_test, pred, regr)


Confusion Matrix 
 True        No  Yes
Predicted          
No         965   35
Yes          0    0 

Classification report 
               precision    recall  f1-score   support

          No      0.965     1.000     0.982       965
         Yes      0.000     0.000     0.000        35

    accuracy                          0.965      1000
   macro avg      0.482     0.500     0.491      1000
weighted avg      0.931     0.965     0.948      1000



Adding the dummy variable student does not help in our case: the test error is still about
0.035 (35 of 1000), and the model again predicts no defaults at all.

## Question 6

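(a) From the model fitted in Question 5(a), the estimated standard errors are both very small:
about 4.99e-06 for income and 2.27e-04 for balance (shown as 0.000 in the table below because of rounding).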

In [30]:
print(model.summary().tables[1])

                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -11.5405      0.435    -26.544      0.000     -12.393     -10.688
income      2.081e-05   4.99e-06      4.174      0.000     1.1e-05    3.06e-05
balance        0.0056      0.000     24.835      0.000       0.005       0.006
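
For reference, these standard errors follow the usual maximum-likelihood formula for logistic regression. This is a standard result stated in generic notation, not something taken from the notebook: with design matrix $X$ (including the intercept column) and fitted probabilities $\hat p_i$,

$$
\widehat{\mathrm{SE}}(\hat\beta_j) = \sqrt{\left[(X^\top \hat W X)^{-1}\right]_{jj}},
\qquad
\hat W = \mathrm{diag}\big(\hat p_i (1 - \hat p_i)\big).
$$

The bootstrap in the next part estimates the same quantity without relying on this formula.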


(b) For this question, I will use resample from sklearn to draw the bootstrap samples. Note
that this might be the fastest way to do this.

First, we define the boot function as follows:

In [115]:
# Define boot
def boot_fn(data):
    data = resample(data, replace=True)
    regr = smf.logit('default2 ~ income + balance', data=data).fit(disp=False)
    return np.array(regr.params)

In [120]:
n_boot = 100

# Run the bootstrap n_boot times and stack the coefficient vectors
result = [boot_fn(df) for _ in range(n_boot)]
para = np.array(result)

# Bootstrap SE = standard deviation of each coefficient across resamples
boot_se = np.std(para, axis=0)

In [121]:
# Compare dataframe
df_compare = pd.DataFrame({'model_se': model.bse, 'boot_SE': boot_se})
df_compare

           model_se   boot_SE
Intercept  0.434772   0.44845
income        5e-06     5e-06
balance    0.000227  0.000234
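
The same pattern (resample, recompute the statistic, take the spread) can be written generically. The sketch below is a minimal NumPy version under stated assumptions: bootstrap_se is a hypothetical helper, the data are synthetic, and the statistic is a sample mean (chosen because its analytic standard error is known), not the logistic regression coefficients from the notebook.

```python
import numpy as np

def bootstrap_se(data, stat_fn, n_boot=1000, seed=0):
    """Bootstrap standard error of stat_fn, resampling rows of data with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # row indices drawn with replacement
        stats.append(stat_fn(data[idx]))
    return np.std(np.array(stats), axis=0, ddof=1)

# Synthetic data (an assumption for illustration, not the Default data)
rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=500)

boot_se_mean = bootstrap_se(sample, np.mean)
analytic_se = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"bootstrap SE: {boot_se_mean:.4f}, analytic SE: {analytic_se:.4f}")
```

The two numbers should agree closely, which mirrors the comparison between boot_SE and model_se in the table above.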


(d)
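Running the bootstrap 100 times gives standard errors very close to the model's own estimates.
Interestingly, fitting the same model with LogisticRegression from sklearn did not give me the
same result! One likely contributor is that sklearn's LogisticRegression applies L2
regularization by default, whereas statsmodels' logit fits by unpenalized maximum likelihood.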
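
A small experiment can show how much an L2 penalty alone moves logistic regression coefficients. The sketch below is a NumPy-only illustration under stated assumptions: fit_logistic is a hypothetical Newton-Raphson helper, the data are synthetic, and the penalty strength of 10 is arbitrary. It does not reproduce sklearn's solver, and it has not been confirmed as the cause of the discrepancy in this notebook; feature scaling and solver convergence are other possible contributors.

```python
import numpy as np

def fit_logistic(X, y, l2=0.0, n_iter=25):
    """Newton-Raphson logistic regression; l2 > 0 adds a ridge penalty on the slopes."""
    Xd = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    penalty = l2 * np.eye(Xd.shape[1])
    penalty[0, 0] = 0.0                            # leave the intercept unpenalized
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xd @ beta)))
        grad = Xd.T @ (y - p) - penalty @ beta
        hess = Xd.T @ (Xd * (p * (1.0 - p))[:, None]) + penalty
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Synthetic data (assumed for illustration; not the Default data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([-2.0, 1.5, -1.0])
prob = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
y = (rng.random(500) < prob).astype(float)

beta_mle = fit_logistic(X, y, l2=0.0)      # unpenalized fit
beta_ridge = fit_logistic(X, y, l2=10.0)   # L2-penalized fit
print("unpenalized:", beta_mle.round(3))
print("penalized:  ", beta_ridge.round(3))
```

The penalized slopes are shrunk toward zero relative to the unpenalized ones. In sklearn the parameter C is the inverse of the regularization strength, so a very large C would be one way to approximate an unpenalized fit when comparing against statsmodels.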
