Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below.

Rename this problem sheet as follows:

    ps{number of lab}_{your user name}_problem{number of problem sheet in this lab}
    
for example
    
    ps7_blja_problem2

Submit your homework within one week, i.e., by next Thursday, December 3, 2020, 9 am.

In [None]:
NAME = ""
EMAIL = ""
USERNAME = ""

---

# Introduction to Data Science
## Lab 7: Logistic regression
### Part B: Logistic regression in practice

In this lab, we want to investigate the `Default` data set known from the lecture.
It contains the predictors

- `student` status, either `'Yes'` or `'No'`
- `balance`, i.e., monthly credit card balance
- yearly `income`

and the response

- `default`, which is either `'Yes'` or `'No'`

We first load the necessary modules.

By the way, the command
    
    plt.rcParams['figure.figsize'] = [13, 5]
    
changes the default size of figures (in inches).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [13, 5]

**Task**: Download the file `Default.csv`.
Read it using the `pandas` function `read_csv` and store the `pandas DataFrame` in the variable `D`.
Make sure that:
- the index column is recognized appropriately
- the column titles are correct
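
As a minimal sketch of how `read_csv` treats an index column, the following cell reads a small, made-up CSV string rather than the real `Default.csv`. The names `toy_csv` and `toy` and all values are hypothetical, and the layout (an unnamed first column holding row labels) is an assumption about how such a file might look. With `index_col=0`, `pandas` uses the first column as the index instead of counting it as a data column.

```python
import io
import pandas as pd

# Made-up CSV text for illustration only; the first (unnamed) column holds row labels.
toy_csv = io.StringIO(
    ",default,student,balance,income\n"
    "1,No,No,100.0,30000.0\n"
    "2,No,Yes,200.0,10000.0\n"
    "3,Yes,No,1900.0,25000.0\n"
)

# index_col=0 uses the first column as the index rather than as a data column.
toy = pd.read_csv(toy_csv, index_col=0)

print(toy.shape)          # (rows, data columns), the index is not counted
print(list(toy.columns))  # column titles taken from the header line
```

Without `index_col=0`, the row labels would show up as an extra data column, and the shape would have one column more.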

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'D' in locals()
# pd.Int64Index was removed in pandas 2.0, so check the dtype of the index instead
assert pd.api.types.is_integer_dtype(D.index.dtype)
assert isinstance(D.columns, pd.Index)
assert D.columns[1] == 'student'
assert D.shape == (10000, 4)

**Task**: Inspect the data using the methods you've learned so far, e.g., `describe`, `hist`, `head`, etc.

In [None]:
# Task: Apply here at least two different methods to inspect the data set D

# YOUR CODE HERE
raise NotImplementedError()

You should observe that the output of the method `describe` only contains the predictors `balance` and `income`, but not `default` and `student`.

This is because the `read_csv` function read these values in as **strings**.
We know from the lecture that these variables are categorical (in particular, binary).

In order to process these values, we convert them to the data type `bool`, i.e., we replace the strings in the columns `default` and `student` by Boolean values.
There are a lot of ways to accomplish this task; the easiest might be

    D.replace(to_replace='No', value=False, inplace=True)
    
**Task**: Replace every `'No'` and `'Yes'` in the `DataFrame` by the values `False` and `True`, respectively.
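
Besides `replace`, a comparison is another way to obtain Boolean columns. The following is only a sketch on a made-up frame (`toy` and its values are hypothetical). Note that comparing with `'Yes'` maps every other entry, including typos or missing values, to `False`.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the lab, the values are made up.
toy = pd.DataFrame({
    "default": ["No", "Yes", "No"],
    "student": ["Yes", "No", "No"],
    "balance": [100.0, 2000.0, 500.0],
})

# Comparing with 'Yes' yields a Boolean Series: 'Yes' -> True, anything else -> False.
for col in ["default", "student"]:
    toy[col] = toy[col] == "Yes"

print(toy.dtypes)
```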

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert D.student.dtype == 'bool'
assert D.default.dtype == 'bool'

### Answer the following questions!

Store your answers in the given variables.
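
The questions below mostly combine Boolean masks with aggregation methods. As a sketch of these building blocks on a made-up frame (all names and values here are hypothetical and unrelated to the `Default` data):

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "student": [True, False, True, False],
    "income": [10.0, 40.0, 30.0, 20.0],
})

# Summing a Boolean column counts the True entries.
n_students = int(toy["student"].sum())

# Mean over all rows, and mean restricted to the rows where the mask is True.
mean_income = toy["income"].mean()
mean_income_students = toy.loc[toy["student"], "income"].mean()

# Combine conditions element-wise with & (parentheses are required).
n_high = int((toy["student"] & (toy["income"] > 25)).sum())

# 25% quantile; pandas interpolates linearly between data points by default.
q25 = toy["income"].quantile(0.25)
```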

**Question A**: How many students belong to the data set?

In [None]:
# answer_A = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'answer_A' in locals()

**Question B**: What is the mean **balance** of all samples?

In [None]:
# answer_B = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'answer_B' in locals()

**Question C**: What is the mean income of the **students**?

In [None]:
# answer_C =
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'answer_C' in locals()

**Question D**: How many **students** have an **income** of more than 20,000?

In [None]:
# answer_D = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'answer_D' in locals()

**Question E**: What is the 25% quantile of the predictor **balance**?

In [None]:
# answer_E = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'answer_E' in locals()

### Plotting the data set

Next, we want to plot both the `income` and the `balance` predictors as boxplots, grouped by the `default` status.

**Task**: Complete the plotting command in the following cell. What do you observe?

In [None]:
fig, ax = plt.subplots(1,2)
D.boxplot(column='balance',by='default', ax=ax[0]);
# YOUR CODE HERE
raise NotImplementedError()

You should observe that the credit card balance seems to have a large effect on the default status, while the income does not seem to predict the default status very well.

Finally, the following cell lets you plot the defaulting vs. the non-defaulting samples of the data set. No task here!

In [None]:
D.plot(y='income', x='balance', kind='scatter',c = D.default, cmap = 'coolwarm', marker='.');

### Fitting a logistic regression model
Next, we want to fit a logistic regression model to our data.
Use the `LogisticRegression` class in the module `sklearn.linear_model`.
Its usage is similar to that of `LinearRegression`: create the model object, then call its `fit` method.

You can find the documentation of this class [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
There are a lot of optional arguments; the most important might be the unassuming parameter `C`, the inverse of the regularization strength used in the algorithm that solves the maximum likelihood problem.

We will discuss regularization later in the lecture as well as in the labs. For now, it suffices if you keep the following in mind:

**The larger you choose `C`, the less the problem will be regularized.**

**Task**: Fit a logistic regression model that predicts the probability of `default` using `balance` as predictor. You should obtain the following values: $\beta_0: -10.6513$, $\beta_\text{balance}: 0.0055$.

Choose the following optional parameters:
* set the regularization parameter `C = 1e10` (which is scientific notation for $C = 10^{10}$, and thus very large)
* set the error tolerance to `tol=1e-10`
* set the solver to `solver = 'liblinear'`

in this and the upcoming problems.
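
The general fitting pattern with these options can be sketched on synthetic data. Everything in this cell (the data, the random seed, and the names `x`, `X`, `y`, `model`) is made up for illustration and is not the `Default` data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-predictor data (made up): larger x makes class True more likely.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
p = 1 / (1 + np.exp(-(x - 5)))
y = rng.uniform(size=200) < p

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features).
X = x.reshape(-1, 1)

# Large C means almost no regularization; tol is the stopping tolerance of the solver.
model = LogisticRegression(C=1e10, tol=1e-10, solver="liblinear")
model.fit(X, y)

print(model.intercept_)  # array of shape (1,)
print(model.coef_)       # array of shape (1, n_features)
```

A single column selected from a `DataFrame` with double brackets, e.g. `df[['col']]`, is already two-dimensional and needs no `reshape`.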

In [None]:
from sklearn.linear_model import LogisticRegression
# YOUR CODE HERE
raise NotImplementedError()

**Task**: Store the intercept of the model in a variable `intercept0` and the regression coefficient in the variable `reg_coef0`.
These quantities represent the coefficients of a linear model for the log-odds, i.e.,

$$
\log \left( \frac{p(x)}{1-p(x)} \right) = \beta_0 + \beta_1 \, x
$$
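
Solving the log-odds equation for $p(x)$ gives $p(x) = 1 / (1 + e^{-(\beta_0 + \beta_1 x)})$. The following sketch evaluates this formula by hand, using the rounded coefficients quoted in the task above. The helper name `default_probability` is hypothetical, and because the coefficients are rounded, the results only approximate what a fitted model returns.

```python
import math

def default_probability(x, beta0, beta1):
    """Invert the log-odds: p(x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Rounded coefficients quoted in the task statement.
beta0, beta1 = -10.6513, 0.0055

p_1000 = default_probability(1000, beta0, beta1)
p_2000 = default_probability(2000, beta0, beta1)
print(p_1000, p_2000)
```

Such a manual evaluation is a convenient cross-check for the output of `predict_proba`.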

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'intercept0' in locals()
assert 'reg_coef0' in locals()

**Task**:
Predict the probability of `default` for `balance` values of $\$$1,000 and $\$$2,000 and store your answers in the variables `pod_1000` and `pod_2000`, respectively.

Use the method `predict_proba` of a `LogisticRegression` model.

**Note**: The model assumes that your data has the same format as your original training data. Therefore, you might have to reshape the input into the correct format.
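
To illustrate the input format, here is a sketch with made-up training data (all names and values are hypothetical). `predict_proba` expects one row per sample and one column per feature, and it returns one column per class in the order given by `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data with a single predictor.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([False, False, True, False, True, True])

model = LogisticRegression(C=1e10, tol=1e-10, solver="liblinear").fit(X_train, y_train)

# Two new values, reshaped to 2-D: one row per sample, one column per feature.
x_new = np.array([1.5, 4.5]).reshape(-1, 1)
proba = model.predict_proba(x_new)

print(model.classes_)  # column order of proba
print(proba[:, 1])     # probability of the class listed second in classes_
```

Each row of `proba` sums to one, so for a Boolean response the second column is the probability of `True`.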

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'pod_1000' in locals()
assert 'pod_2000' in locals()

You should observe that the probability of default of an individual with a credit card balance of $\$$1,000 is approximately 0.57\%.
The probability of default of an individual with a credit card balance of $\$$2,000 is approximately 58.6\%.

Now, we want to incorporate the predictors `income` and `student` status as well. This can be done easily using the same methods.

Execute the following code cell to train a new logistic regression model.

In [None]:
lr2= LogisticRegression(solver='liblinear', tol=1e-10, C=1e10)
X = D.loc[:,['balance','income','student']]
y = D.loc[:,'default']
reg2 = lr2.fit(X,y)

**Task**: Store the intercept of the new model in the variable `intercept_full` as well as the coefficients in variables `beta_balance`, `beta_income`, `beta_student`, resp.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'intercept_full' in locals()
assert 'beta_balance' in locals()
assert 'beta_income' in locals()
assert 'beta_student' in locals()

**Task**:
What is the default probability of a student and of a non-student, each with a credit card balance of $\$$1,500 and an income of $\$$40,000?
Store your answers in the variables `pod_student` and `pod_nonStudent`, resp.
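
With several predictors, it helps to keep track of which coefficient belongs to which column and to pass new samples with the same columns in the same order as the training data. The sketch below uses synthetic data; the column names `a`, `b`, `flag`, the seed, and all values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data (made up): two numeric predictors and one Boolean predictor.
rng = np.random.default_rng(1)
n = 200
X_train = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "flag": rng.uniform(size=n) < 0.5,
})
logit = 1.0 * X_train["a"] - 0.5 * X_train["flag"]
y_train = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression(C=1e10, tol=1e-10, solver="liblinear").fit(X_train, y_train)

# Pair each coefficient with its column name; coef_ has shape (1, n_features).
coefs = dict(zip(X_train.columns, model.coef_[0]))

# New samples: same columns in the same order as the training data.
X_new = pd.DataFrame({"a": [0.5, 0.5], "b": [0.0, 0.0], "flag": [True, False]})
proba = model.predict_proba(X_new)[:, 1]

print(coefs)
print(proba)
```

The two rows of `X_new` differ only in `flag`, so comparing their probabilities isolates the effect of that predictor.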

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert 'pod_student' in locals()
assert 'pod_nonStudent' in locals()

You should observe that a student with a credit card balance of $\$$1,500 and an income of $\$$40,000 has an estimated probability of default of 5.8\%, while a non-student with the same balance and income has an estimated probability of default of 10.5\%.