# Inf2 - Foundations of Data Science
## S2 Week 01: Logistic regression

**Learning outcomes:** In this lab you will learn about logistic regression, interpretation of logistic regression coefficients and generating confidence intervals for logistic regression coefficients. By the end of this lab you should be able to:

- identify what transformations would be helpful to variables before applying logistic regression
- apply logistic regression to a dataset
- interpret the coefficients from application of logistic regression
- apply the bootstrap to logistic regression to obtain confidence intervals
- interpret the confidence intervals

**Remark:** The lecture topic on "Logistic regression" will be helpful background for this lab.

**Data information:** We will look at the credit approval dataset, which we have already looked at during the lectures, and we will try to reconstruct the results ourselves. The dataset was originally published on the [UCI repository with the attribute names and values changed to meaningless symbols](https://archive.ics.uci.edu/ml/datasets/credit+approval). We have used [this version of the dataset](https://github.com/KiranmayiR/Credit_Shiny), in which the attribute names have been inferred. However, we have changed some attributes. 

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Package to display the hints and solutions
from common.show_solutions import show

## A. Prepare the data

Our goal is to use logistic regression to understand which features are most important in the decision to grant an applicant credit.

**Exercise 01:** The first step is to clean our dataset.
- Load the dataset.
- Display the first 20 entries.
- Remove all entries with invalid values.
- Replace all non-numeric values by reasonable numeric values.
- For simplicity, drop the `ZipCode` column.
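
The cleaning steps above can be sketched on a toy data frame. This is only an illustration: the column names other than `ZipCode`, the `"?"` placeholder for missing values and the `a`/`b` and `+`/`-` codes are assumptions, and the real file may encode things differently.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names and codes; the real file may differ.
toy = pd.DataFrame({
    "Gender": ["a", "b", "?", "b"],
    "Age": ["30.5", "?", "41.0", "25.0"],
    "ZipCode": ["00202", "00043", "00280", "00100"],
    "Approved": ["+", "-", "+", "-"],
})

# Treat the placeholder "?" as missing, then drop incomplete rows.
clean = toy.replace("?", np.nan).dropna()

# Map categorical codes to numbers, convert numeric strings to floats,
# and drop the zip code column.
clean = clean.assign(
    Gender=clean["Gender"].map({"a": 0, "b": 1}),
    Age=clean["Age"].astype(float),
    Approved=clean["Approved"].map({"+": 1, "-": 0}),
).drop(columns="ZipCode")

print(clean)
```

Which numeric value you assign to each category is a modelling choice; what matters is that you remember the coding when you interpret the coefficients later.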

**Remark:** Zip codes can have an impact on credit approval. For example, ML algorithms trained on racially biased data, where the information about race has been dropped, can still learn the bias, as people from the same ethnic background tend to live in the same area. However, for logistic regression the zip code, treated as a number, is unlikely to provide any information, as two zip codes that differ by a single digit can be many miles apart. It is worth considering whether the same would be true if we used $k$-Nearest-Neighbours.

In [None]:
# Run this cell to be offered hints and the solution
show(question=1)

In [None]:
# Your code

**Exercise 02:** Let's compare the values for the two genders. Compute the mean of the columns for each gender.
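
A minimal sketch of a group-wise mean on toy data, assuming the gender column is called `Gender` and coded as 0/1 (both are assumptions; use whatever name and coding your cleaned data frame has):

```python
import pandas as pd

# Toy data; the "Gender" column name and its 0/1 coding are assumptions.
toy = pd.DataFrame({
    "Gender": [0, 0, 1, 1],
    "Age": [30.0, 40.0, 20.0, 50.0],
    "Approved": [1, 0, 1, 1],
})

# One row per gender, one column per remaining variable.
means = toy.groupby("Gender").mean()
print(means)
```

Because `Approved` is coded as 0/1 here, its group mean is the proportion of approved applications in that group.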

In [None]:
# Run this cell to be offered hints and the solution
show(question=2)

In [None]:
# Your code

**Exercise 03:**
- Create a pairplot of the data, giving approved and denied applications different colours. 
- As you can see, there are too many variables. Remove the variables that you think are not displayed helpfully in a pairplot.

In [None]:
# Run this cell to be offered hints and the solution
show(question=3)

In [None]:
# Your code

## B. Transform the dataset so logistic regression works better

**Discussion:** We have already applied the log transform to datasets previously.
- Can you remember why this can be helpful?
- For which variables in your dataset would a log transform help? Have a look at the plot above.
- Can you think of data points the log transform might not work for, and how the transform could be modified to fix this?

In [None]:
# Run this cell to be offered hints and the solution
show(question=3.1)

Your answer:

**Exercise 04:** 
- Replace the `Income` variable by transforming it with the function $f(x) = \log_{10}(x + 1)$ to give a version called `LogIncome`.
- Repeat the above step with the `CreditScore` variable to give a log transformed version called `LogCreditScore`.
- Plot a new pairplot to see the new distributions.
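
As a quick check of what $f(x) = \log_{10}(x + 1)$ does, here is a small sketch on made-up income values. The toy values are chosen so that the results are round numbers; note that an income of zero maps to zero rather than to $-\infty$.

```python
import numpy as np
import pandas as pd

# Made-up incomes, chosen so that x + 1 is a power of ten.
toy = pd.DataFrame({"Income": [0, 9, 99, 999]})

# Add the transformed column, then drop the original one.
toy["LogIncome"] = np.log10(toy["Income"] + 1)
toy = toy.drop(columns="Income")

print(toy)
```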

In [None]:
# Run this cell to be offered hints and the solution
show(question=4)

In [None]:
# Your code

## C. Use scikit-learn to run logistic regression

**Exercise 05:** Let's start with the simplest case of logistic regression. We want to know whether age alone is a good feature to predict whether someone is granted credit. 
- Use `LogisticRegression()` to run logistic regression.
- Store the fitted model.
- Store the intercept and the coefficient of the model in `beta0` and `beta1`, respectively. Print the values.
- Create a scatterplot of 50 randomly sampled data points (plotting every point would make the figure hard to read). The x-axis should be `Age` and the y-axis should be `Approved`.
- Add a line plot to your figure that shows the probability predicted by the logistic regression.
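
The following sketch shows the mechanics on synthetic data, not on the credit dataset: the ages and approvals are simulated, and the "true" parameters $-2.0$ and $0.05$ are made up. It fits a one-variable model, extracts `beta0` and `beta1`, and checks that the logistic curve computed by hand agrees with `predict_proba`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data (all values are simulated).
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=300)
p_true = 1 / (1 + np.exp(-(-2.0 + 0.05 * age)))
toy = pd.DataFrame({"Age": age, "Approved": rng.binomial(1, p_true)})

# scikit-learn expects a 2D feature table, hence the double brackets.
model = LogisticRegression(max_iter=1000).fit(toy[["Age"]], toy["Approved"])
beta0 = model.intercept_[0]
beta1 = model.coef_[0, 0]
print(beta0, beta1)

# Predicted probability of approval over a grid of ages, computed two ways.
ages = np.linspace(18, 80, 100)
p_manual = 1 / (1 + np.exp(-(beta0 + beta1 * ages)))
p_sklearn = model.predict_proba(pd.DataFrame({"Age": ages}))[:, 1]
```

Either `p_manual` or `p_sklearn` can be passed to `plt.plot(ages, ...)` to draw the curve on top of the scatterplot.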

In [None]:
# Run this cell to be offered hints and the solution
show(question=5)

In [None]:
# Your code

**Discussion:** Interpret the intercept `model.intercept_`, which is the same as $\hat\beta_0$ in the lecture notes. What quantity does it represent? Describe the characteristics of the customer for whom the independent variables are all zero. Does such a customer exist?
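
As a reminder, and assuming the lecture notes use the usual parameterisation of logistic regression with $k$ independent variables $x_1, \dots, x_k$, the fitted model is linear in the log odds:

$$
\log \frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = \hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_k x_k
$$

Setting every $x_j = 0$ leaves only the intercept on the right-hand side, so that

$$
P(Y = 1 \mid x = 0) = \frac{1}{1 + e^{-\hat\beta_0}}
$$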

Your answer:

**Exercise 06:** Above we have run the logistic regression on only one independent variable in order to be able to plot it in a figure. Now, run a logistic regression on the full data.

In [None]:
# Run this cell to be offered hints and the solution
show(question=6)

In [None]:
# Your code

**Discussion:** Interpret the coefficients `model.coef_`. You may find it helpful to convert the output from sklearn back into a pandas Series with an index. Try to use language that you think would be understandable by a general audience. 
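
One way to make the coefficients easier to read is sketched below on synthetic data. The column `Employed` and all the numbers are made up for illustration. Labelling the coefficients with the column names gives a Series, and exponentiating a coefficient gives the factor by which the odds of approval are multiplied when that variable increases by one unit and the others are held fixed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features and outcome (hypothetical columns, simulated values).
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "Age": rng.uniform(18, 80, size=400),
    "Employed": rng.integers(0, 2, size=400),
})
logit = (-3.0 + 0.03 * X["Age"] + 1.5 * X["Employed"]).to_numpy()
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(max_iter=1000).fit(X, y)

# model.coef_ has shape (1, n_features); label it with the column names.
coefs = pd.Series(model.coef_[0], index=X.columns)

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratios = np.exp(coefs)
print(pd.DataFrame({"coef": coefs, "odds_ratio": odds_ratios}))
```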

Your answer:

## D. How many of these coefficients are meaningful?

How likely is it that some of these coefficients could have arisen by chance? We'd like to find confidence intervals for each coefficient. 

**Exercise 07:** Write a bootstrap function to generate the sampling distribution of all of the coefficients. On each bootstrap iteration, we'd like to store the values of the intercept and all of the coefficients in one row of a dataframe. We'll then be able to plot the distribution of the dataframe, and compute confidence intervals from the marginal distributions. We suggest you follow the pattern in the previous lab, and write:
1. A function that takes a dataframe with the same column names as the credit approval dataset, fits a logistic regression model to the dataset and returns a pandas series containing the intercept and coefficients from the logistic regression
2. A bootstrap function that takes the above function as an `estimator` argument, and, on each bootstrap replication, stores the coefficients in the row of a data frame. It should return the bootstrap samples as a dataframe with an `Intercept` column and then one column for each independent variable. The function doesn't need to return the quantiles or the bootstrap standard error. Note that the column types of the data frame should be `float`.

You can test the first function by making sure it gives you the same results as when you ran the logistic regression on the credit dataset above.  Once you've written the function, try it out on the credit dataset. You can use the `.quantile()` function on the returned data frame to compute the quantiles. You can also look at a pairplot of the bootstrap samples.
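
The two-function pattern described above could look roughly like the sketch below. It is not the reference solution: the function names `fit_logistic` and `bootstrap`, the `target`, `n_boot` and `seed` parameters, and the synthetic one-variable data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def fit_logistic(df, target="Approved"):
    """Fit logistic regression; return intercept and coefficients as a Series."""
    X = df.drop(columns=target)
    model = LogisticRegression(max_iter=1000).fit(X, df[target])
    values = np.concatenate([model.intercept_, model.coef_[0]])
    return pd.Series(values, index=["Intercept", *X.columns], dtype=float)


def bootstrap(df, estimator, n_boot=200, seed=0):
    """Resample rows with replacement; store one estimate per replication."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(df), size=len(df))
        rows.append(estimator(df.iloc[idx]))
    return pd.DataFrame(rows).reset_index(drop=True).astype(float)


# Synthetic stand-in for the credit data (values are simulated).
rng = np.random.default_rng(2)
toy = pd.DataFrame({"Age": rng.uniform(18, 80, size=300)})
p_true = 1 / (1 + np.exp(-(-2.0 + 0.05 * toy["Age"].to_numpy())))
toy["Approved"] = rng.binomial(1, p_true)

samples = bootstrap(toy, fit_logistic, n_boot=100)

# 95% percentile interval for each parameter.
ci = samples.quantile([0.025, 0.975])
print(ci)
```

Passing the estimator as an argument keeps the resampling loop independent of the model, so the same `bootstrap` function can be reused if you later change how the model is fitted.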

In [None]:
# Run this cell to be offered hints and the solution
show(question=7)

In [None]:
# Your code

**Discussion:** What can you conclude from the quantiles? Are any of the relationships you identified earlier open to question, because they may have arisen by chance?

Your answer:

**We need your help:** This is a new course. In order for us to improve the labs for the next iterations, and to make sure that the next labs are better, we need your feedback. Please fill out the following [form](https://forms.office.com/Pages/ResponsePage.aspx?id=sAafLmkWiUWHiRCgaTTcYZmGMCx4KxlMjSTITqjdcXpUQkI2TUdDR1UwU0NBRE80OFVUMVRZM09KQi4u).

## E. Standardised quantities

By setting `max_iter=1000` we were able to ensure that the fitting of the logistic regression model converged. An alternative approach would be to standardise the independent variables and then try fitting logistic regression again. However, the resulting coefficients will refer to the standardised variables, so we'll need to transform them back to obtain the coefficients on the original scale. 

**Optional Exercise:** *If you're keen*, we suggest you just standardise the continuous variables. After running the same bootstrap function as above on the transformed data, you can transform the parameters back using formulae of the form $\beta_{Age} = \frac{b_{Age}}{s_{Age}}$, where $b_{Age}$ is the transformed coefficient returned when the bootstrap function is applied to the transformed data and $s_{Age}$ is the standard deviation of `Age`.
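
The back-transformation can be checked numerically. The sketch below uses synthetic data and `StandardScaler` (which divides by the population standard deviation); both are assumptions about how you choose to standardise. It also back-transforms the intercept using $\beta_0 = b_0 - b_{Age}\, m_{Age} / s_{Age}$, where $m_{Age}$ is the mean of `Age`. That formula is not given above; it follows from substituting $z = (x - m_{Age}) / s_{Age}$ into $b_0 + b_{Age} z$.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (values are simulated).
rng = np.random.default_rng(3)
toy = pd.DataFrame({"Age": rng.uniform(18, 80, size=300)})
p_true = 1 / (1 + np.exp(-(-2.0 + 0.05 * toy["Age"].to_numpy())))
toy["Approved"] = rng.binomial(1, p_true)

# Standardise Age, then fit on the standardised variable.
scaler = StandardScaler().fit(toy[["Age"]])
z = scaler.transform(toy[["Age"]])
model = LogisticRegression().fit(z, toy["Approved"])

b0, b_age = model.intercept_[0], model.coef_[0, 0]
m_age, s_age = scaler.mean_[0], scaler.scale_[0]

# Transform the parameters back to the original Age scale.
beta_age = b_age / s_age
beta0 = b0 - b_age * m_age / s_age

# Both parameterisations give the same log odds for every customer.
log_odds_std = model.decision_function(z)
log_odds_raw = beta0 + beta_age * toy["Age"].to_numpy()
print(beta0, beta_age)
```

Note that `LogisticRegression` applies L2 regularisation by default, so a model fitted directly on the raw variables will generally not give exactly the same coefficients as the back-transformed ones.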

In [None]:
# Run this cell to be offered hints and the solution
show(question=8)

In [None]:
# Your code