## Init: Import packages

In [1]:
# general
import numpy as np
import pandas as pd
#sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RepeatedStratifiedKFold

## Solution 1: Overfitting & underfitting

See sol_eval_2.pdf

## Solution 2: Resampling strategies

### a)

The two main advantages of resampling are:

• We are able to use larger training sets (at the expense of test set size) because the high variance that small
test sets cause in the resulting estimate is smoothed out by averaging across repetitions.

• Repeated sampling reduces the risk of getting lucky (or not so lucky) with a particular data split, which
is especially relevant with few observations.
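A minimal simulation of the averaging effect, under simplifying assumptions: the learner's true error rate is fixed at a hypothetical 0.25, every test set holds 100 observations, and the test sets are drawn independently. Real CV folds share training data, so the reduction in practice is smaller than what this sketch shows. The names `simulate_test_error`, `true_error` and `n_test` are illustrative only.

```python
import random
import statistics

def simulate_test_error(rng, n_test, true_error):
    """Error rate observed on one simulated test set of size n_test."""
    mistakes = sum(rng.random() < true_error for _ in range(n_test))
    return mistakes / n_test

rng = random.Random(0)
true_error = 0.25  # hypothetical true misclassification rate
n_test = 100       # assumed test-set size

# Estimates that each rely on a single test set.
single = [simulate_test_error(rng, n_test, true_error) for _ in range(500)]

# Estimates that each average over 30 test sets.
averaged = [
    statistics.mean(simulate_test_error(rng, n_test, true_error) for _ in range(30))
    for _ in range(500)
]

sd_single = statistics.stdev(single)
sd_averaged = statistics.stdev(averaged)
print("SD of single-split estimates:", round(sd_single, 4))
print("SD of averaged estimates:    ", round(sd_averaged, 4))
```

The spread of the averaged estimates is clearly smaller, which is why a small test set per split is tolerable once the results are averaged.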

### b)

In [2]:
# load prepared Data Frame
german_credit = pd.read_csv("../data/german_credit_for_py.csv")
german_credit.head()

Unnamed: 0,credit_risk,status_... < 0 DM,status_... >= 200 DM / salary for at least 1 year,status_0<= ... < 200 DM,status_no checking account,credit_history_all credits at this bank paid back duly,credit_history_critical account/other credits elsewhere,credit_history_delay in paying off in the past,credit_history_existing credits paid back duly till now,credit_history_no credits taken/all credits paid back duly,...,telephone_no,telephone_yes (under customer name),foreign_worker_no,foreign_worker_yes,installment_rate,present_residence,number_credits,duration,amount,age
0,good,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,3.0,1.0,6,1169,67
1,bad,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,1.0,0.0,2.0,1.0,0.0,48,5951,22
2,good,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,2.0,2.0,0.0,12,2096,49
3,good,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,1.0,0.0,2.0,3.0,0.0,42,7882,45
4,bad,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,3.0,1.0,24,4870,53


In [3]:
german_X_raw = german_credit.iloc[:,1:]
german_y_raw = german_credit.iloc[:,0]

#Initialize Encoder for target
enc_target = LabelEncoder()

enc_target.fit(german_y_raw.values.ravel()) 
german_y = enc_target.transform(german_y_raw.values.ravel()) #now numpy array
# you can also use enc_target.fit_transform(german_y_raw.values.ravel()) to combine both steps

german_X = np.asarray(german_X_raw)
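To make the encoding step concrete, here is a plain-Python sketch of the mapping that `LabelEncoder` produces: the distinct labels are sorted and numbered from 0, so `bad` becomes 0 and `good` becomes 1. The helper `encode_labels` is illustrative and not part of scikit-learn; the five example labels are the `credit_risk` values from the `head()` output above.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], classes

# credit_risk values of the first five rows shown above
raw = ["good", "bad", "good", "good", "bad"]
encoded, classes = encode_labels(raw)
print(classes)  # ['bad', 'good']
print(encoded)  # [1, 0, 1, 1, 0]
```

With the fitted encoder from the cell above, `enc_target.classes_` holds the sorted labels and `enc_target.inverse_transform` maps the integer codes back to the original strings.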

In [4]:
# using whole data set to train and predict (this yields the training error)
log_mod = LogisticRegression(random_state=123, max_iter=1000).fit(german_X, german_y) # increase max iterations for convergence
german_pred = log_mod.predict(german_X)
accuracy = log_mod.score(german_X, german_y) # score gives mean accuracy
print("Mean Accuracy: ", accuracy)
print("Mean Classification Error :", 1-accuracy)

Mean Accuracy:  0.781
Mean Classification Error : 0.21899999999999997
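The two printed numbers are related by MCE = 1 - accuracy. A small sketch that computes the error directly from labels and predictions; the function name `mean_classification_error` and the toy vectors are made up for illustration.

```python
def mean_classification_error(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

y_true = [1, 0, 1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
mce = mean_classification_error(y_true, y_pred)
accuracy = 1 - mce
print("MCE:", mce)            # 2 of 8 predictions are wrong
print("Accuracy:", accuracy)
```

The trailing digits in 0.21899999999999997 are floating-point rounding; `round(1 - 0.781, 3)` gives 0.219.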


### c)

### (i) 3x10-CV

In [5]:
random_state = 43
err = []
rkf_3x10 = RepeatedKFold(n_splits=10, n_repeats=3, random_state=random_state)
for train, test in rkf_3x10.split(german_X):
    log_mod = LogisticRegression(max_iter=1000).fit(german_X[train,:], german_y[train])
    err.append(1-log_mod.score(german_X[test,:], german_y[test])) #score gives mean accuracy

res = np.array(err)
print("MCE of 3x10 CV: ", res.mean())

MCE of 3x10 CV:  0.24333333333333335
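For intuition, the index bookkeeping behind repeated k-fold CV can be sketched in plain Python. This is not scikit-learn's implementation and it will not reproduce the splits of `RepeatedKFold` for the same seed; `repeated_kfold_indices` is a hypothetical helper.

```python
import random

def repeated_kfold_indices(n_samples, n_splits, n_repeats, seed):
    """Yield (train, test) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition in every repetition
        # fold i takes every n_splits-th shuffled index, starting at i
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test

splits = list(repeated_kfold_indices(n_samples=1000, n_splits=10, n_repeats=3, seed=43))
print(len(splits))                           # 30 train/test pairs
print(len(splits[0][0]), len(splits[0][1]))  # 900 100
```

Within one repetition every observation is used for testing exactly once; the reported MCE is the mean over all 30 test folds.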


### (ii) 10x3-CV

In [6]:
err = []
rkf_10x3 = RepeatedKFold(n_splits=3, n_repeats=10, random_state=random_state)
for train, test in rkf_10x3.split(german_X):
    log_mod = LogisticRegression(max_iter=1000).fit(german_X[train,:], german_y[train])
    err.append(1-log_mod.score(german_X[test,:], german_y[test])) #score gives mean accuracy

res = np.array(err)
print("MCE of 10x3 CV: ",res.mean())

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


MCE of 10x3 CV:  0.2542998087908267


### (iii) 3x10-CV with stratification for the feature foreign worker

In [7]:
err = []
strat_gkf_10 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=random_state)
# split(X, y) only uses its second argument to form the strata, so we pass the foreign worker column there instead of the target.
for train, test in strat_gkf_10.split(german_X, german_X[:,55]): #index 55 stands for column of foreign workers
    log_mod = LogisticRegression(max_iter=1000).fit(german_X[train,:], german_y[train])
    err.append(1-log_mod.score(german_X[test,:], german_y[test])) #score gives mean accuracy

res = np.array(err)
print("MCE of 3x10-CV with stratification: ", res.mean())

MCE of 3x10-CV with stratification:  0.24666666666666662
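A rough sketch of what stratifying by a feature means for the folds: indices are grouped by the value of the stratification variable and each group is dealt evenly over the folds. The helper `stratified_folds` and the binary feature with 40 of 1000 positive values are hypothetical; scikit-learn's own assignment logic differs in detail.

```python
import random
from collections import defaultdict

def stratified_folds(strata, n_splits, seed):
    """Assign each index to a fold so every stratum is spread evenly."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, value in enumerate(strata):
        by_stratum[value].append(idx)
    folds = [[] for _ in range(n_splits)]
    for members in by_stratum.values():
        rng.shuffle(members)
        # deal the members of this stratum round-robin over the folds
        for position, idx in enumerate(members):
            folds[position % n_splits].append(idx)
    return folds

# hypothetical binary feature: 40 of 1000 observations have value 1
feature = [1] * 40 + [0] * 960
folds = stratified_folds(feature, n_splits=10, seed=43)
counts = [sum(feature[i] for i in fold) for fold in folds]
print(counts)                       # 4 observations with value 1 in each fold
print([len(f) for f in folds])      # 100 observations in each fold
```

Without stratification a rare value could be missing from some test folds entirely; with it, every fold mirrors the overall proportion.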


### (iv) Hold-out with 90% training data

In [8]:
X_train, X_test, y_train, y_test = train_test_split(german_X, german_y, test_size = 0.1, random_state=random_state)
log_mod = LogisticRegression(max_iter=1000).fit(X_train, y_train)
german_pred = log_mod.predict(X_test)
print("MCE of Hold-out split: ", 1-log_mod.score(X_test, y_test)) #score gives mean accuracy

MCE of Hold-out split:  0.28


### d)

Generalization error estimates are fairly stable across the different resampling strategies because we have a
fairly large number (1000) of observations. Still, the pessimistic bias of small training sets is visible: 10x3-CV,
using roughly 67% of the data for training in each split, estimates a higher generalization error (25.4%) than
3x10-CV with roughly 90% training data (24.3%). Stratification by foreign worker does not seem to have much
effect on the estimate (24.7%). However, we see a glaring difference when we use a single 90%-10% split, where
the estimated GE (28.0%) is roughly 3.7 percentage points higher than with 3x10-CV, which is most likely the
result of an unlucky split.

Comparing the results (except for the unreliable one produced by a single split) with the training error from b)
(21.9%) indicates no serious overfitting: the resampled estimates are only about 2.4 to 3.5 percentage points higher.
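A back-of-the-envelope check of why the single split is unreliable, assuming the 100 test observations are independent and using the binomial approximation for the standard error of an error rate. The helper `binomial_standard_error` is illustrative; the input numbers are the outputs printed in c).

```python
import math

def binomial_standard_error(error_rate, n_test):
    """Approximate standard error of an error rate measured on n_test observations."""
    return math.sqrt(error_rate * (1 - error_rate) / n_test)

holdout_mce = 0.28   # hold-out estimate from (iv)
cv_mce = 0.2433      # 3x10-CV estimate from (i), rounded
n_test = 100         # 10% of 1000 observations

se = binomial_standard_error(holdout_mce, n_test)
gap = holdout_mce - cv_mce
print("standard error of hold-out estimate:", round(se, 3))  # about 0.045
print("gap to 3x10-CV estimate:            ", round(gap, 3))  # about 0.037
```

Under these assumptions the gap is smaller than one standard error of the hold-out estimate, so it is well within what a single split can produce by chance.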

### e)

LOO is not a very good idea here: with 1000 observations it requires fitting 1000 models, which takes a long
time compared to the 30 fits of the repeated CV schemes in c). Also, LOO has high variance by nature. Repeated
CV with a sufficient number of folds should give us a pretty good idea about the expected GE of our learner.
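The computational argument can be made explicit by counting model fits. In this sketch LOO is treated as k-fold CV with k equal to the number of observations; `n_model_fits` is a hypothetical helper.

```python
def n_model_fits(n_splits, n_repeats=1):
    """Number of models trained by (repeated) k-fold cross-validation."""
    return n_splits * n_repeats

n_samples = 1000
fits = {
    "3x10-CV": n_model_fits(n_splits=10, n_repeats=3),
    "10x3-CV": n_model_fits(n_splits=3, n_repeats=10),
    "LOO": n_model_fits(n_splits=n_samples),  # one fit per left-out observation
}
for name, count in fits.items():
    print(f"{name}: {count} fits")
```

In scikit-learn, `sklearn.model_selection.LeaveOneOut` provides the corresponding splitter if LOO is needed despite the cost.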