<a
    href="https://colab.research.google.com/github/univiemops/tewa1-computational-cognition/blob/main/08%20Regression%20with%20Mixed%20Data%20Types.ipynb"
    target="_blank" rel="noopener"> <img
      src="https://colab.research.google.com/assets/colab-badge.svg"
      alt="Open In Colab"> </a></p>
# Tutorial 8 - Regression with mixed data types

*Written and revised by Jozsef Arato, Mengfan Zhang, Dominik Pegler*  
Computational Cognition Course, University of Vienna  
https://github.com/univiemops/tewa1-computational-cognition

---
## This week's lab:

We will introduce you to categorical predictors and logistic regression. A categorical predictor is a variable that represents categories or groups. These groups can be nominal (with no inherent order, such as gender, colour, etc.) or ordinal (with a natural order, such as education level, economic status, etc.). Logistic regression is a statistical technique used to model the probability of a binary outcome based on one or more predictor variables. Categorical predictors and logistic regression can be essential tools when dealing with classification problems in the field of machine learning.

**Learning goals:** \
After finishing this tutorial, you should be able to ...
* understand categorical variables and encode them properly for statistical analysis
* explore data with categorical predictors
* fit logistic regression and interpret coefficients
* understand common performance metrics for evaluating logistic regression models
---

## 1. Import libraries and the dataset

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from scipy import io, linalg, stats
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# import the kid iq dataset
response = requests.get(
    "https://ucloud.univie.ac.at/index.php/s/TMnJzRCmKD6ZbYB/download"
)
with open("kid_iq.csv", "wb") as f:  # close the file after writing
    f.write(response.content)

df = pd.read_csv("kid_iq.csv")

# Gelman, A., Hill, J., & Vehtari, A. (2020).
# Regression and Other Stories. Cambridge: Cambridge University Press.

## 2. Inspect the dataset

Now, let's first print the data to explore the number of potential predictors, dataset size and the data structure. 

In [None]:
print(np.shape(df))
df

This dataset contains children's IQ test scores and some information about their mothers. We have three continuous variables (*kid_score*, *mom_iq*, *mom_age*) and two categorical variables (*mom_hs* and *mom_work*) in this dataset. \
*mom_hs* is a binary variable (also known as an indicator, dichotomous, or dummy variable) that indicates whether the mother graduated from high school (coded as 1) or not (coded as 0). The *mom_work* variable is defined on a four-point ordered scale representing four different work statuses during the first three years of the child's life.
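
A variable with more than two levels, such as *mom_work*, is often expanded into indicator (dummy) columns before it enters a regression, so that the model does not treat the category codes as evenly spaced numbers. The sketch below shows one common way to do this with pandas. The toy column `work_status` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: a four-level categorical predictor coded 1-4
toy = pd.DataFrame({"work_status": [1, 2, 3, 4, 2, 1]})

# One indicator column per category; drop_first=True removes the first
# level, which then acts as the reference category in a regression
dummies = pd.get_dummies(
    toy["work_status"], prefix="work", drop_first=True, dtype=int
)
print(dummies)
```

Rows belonging to the reference category have zeros in all indicator columns, so each remaining coefficient would describe the difference between that level and the reference level.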

Let's have a quick look at how these variables are correlated. Does the correlation match your expectations?

In [None]:
df.corr()

We can also visualize some of the data to get a better idea. Instead of using `plt.plot`, we will introduce you to another way to create plots, `fig, ax = plt.subplots()`, which has several advantages. It returns both the figure (`fig`) and one or more axes (`ax`) objects, so you can easily customize both the figure and the individual subplots. We'd like to create a figure with two subplots in the cell below. The code for the first one is already there; can you complete the code for the second one to plot 'kid_score' against 'mom_hs'? You can customize your plot as you like.

In [None]:
fig, ax = plt.subplots(ncols=2)  # a figure with a 1x2 grid of axes

# subplot 1
ax[0].scatter(
    df["mom_iq"], df["kid_score"], color="salmon", alpha=0.4
)  #  plot a scatter on the first axes object
ax[0].set_xlabel(
    "mother’s score on an IQ test"
)  # set the x label for the first axes object
ax[0].set_ylabel("kid's test scores")

# subplot 2
# YOUR CODE HERE

plt.tight_layout()  # adjust spacing so that labels and subplots do not overlap

## 3. Fit a linear regression using categorical predictors

Now consider a linear regression of child test scores on two predictors: the mother's high school indicator and her IQ. We will use the `LinearRegression` class from scikit-learn that you learned about last week, but feel free to try one of two other models, `Lasso` and `Ridge`. They are used in the same way as `LinearRegression`, but add a penalty on the size of the coefficients to the loss function (the sum of squared errors). This reduces overfitting and can improve performance on new data. The only extra thing you have to do is choose the parameter `alpha`, which sets the strength of this penalty.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
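
To get a feeling for what the penalty does, the following sketch fits all three models to synthetic data and prints their coefficients. The data, the variable names (`X_toy`, `y_toy`), and the `alpha` values are arbitrary choices for illustration, not recommendations for the kid IQ dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
# Only the first two features influence the outcome
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=100)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=10)": Ridge(alpha=10.0),
    "Lasso(alpha=0.5)": Lasso(alpha=0.5),
}
coefs = {}
for name, model in models.items():
    model.fit(X_toy, y_toy)
    coefs[name] = model.coef_
    print(name, np.round(model.coef_, 2))
```

Both penalized models shrink the coefficients relative to `LinearRegression`, and a larger `alpha` shrinks them more. `Lasso` can additionally set some coefficients exactly to zero.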

In [None]:
X = df[["mom_hs", "mom_iq"]]
y = df["kid_score"]

lr = # your model here
lr.fit(X, y)

Let's look at the coefficients of the predictors and the model fits. How would you interpret the results of the model?

In [None]:
print("Intercept:", lr.intercept_)
print("Slopes:", lr.coef_)
print("Score: ", lr.score(X, y))

The fitted linear regression above has the form *kid_score = 25.73 + 5.95 ∗ mom_hs + 0.56 ∗ mom_iq + error*. Could you answer the following questions?
1. If a child had a mother with an IQ of 0 and who did not finish high school, how would you predict the child's test score?
2. If a child had a mother with an IQ of 0 who graduated from high school (a purely hypothetical case, used only to illustrate the coefficients), how would you predict this child's test score?
3. Comparing children with the same value of mom_hs but whose mothers differ by 2 points in IQ, what difference would you expect to see in the child's test score?

Write down your answers and keep two decimal places in the cell below.

In [None]:
q1 = # YOUR ANSWER HERE
q2 = # YOUR ANSWER HERE
q3 = # YOUR ANSWER HERE

In the cell below, we visualize the predictions and the original data. We plot two separate lines based on the 'mom_hs' predictor. 

In [None]:
y_pred = lr.predict(X)

fig, ax = plt.subplots()
ax.scatter(df["mom_iq"], df["kid_score"])
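# mom_hs == 1: mother graduated from high school
ax.plot(
    df.loc[df["mom_hs"] == 1, "mom_iq"],
    y_pred[df["mom_hs"] == 1],
    color="g",
    label="hs",
)
# mom_hs == 0: mother did not graduate from high school
ax.plot(
    df.loc[df["mom_hs"] == 0, "mom_iq"],
    y_pred[df["mom_hs"] == 0],
    color="r",
    label="no_hs",
)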
ax.legend()
ax.set_xlabel("mother’s score on an IQ test")
ax.set_ylabel("kid's test scores")

### Model with interaction
As you can see from the plot above, the slope of the regression of the child's test score on the mother's IQ is the same for the two groups. We can include an interaction between *mom_hs* and *mom_iq* (a third predictor defined as the product of these two variables) to allow the slope to vary between subgroups. Below we create a new design matrix that includes a column for the interaction term and fit the linear regression again.

In [None]:
df["mom_hs_iq_inter"] = df["mom_iq"] * df["mom_hs"]
X_with_inter = df[["mom_hs", "mom_iq", "mom_hs_iq_inter"]]

print("Design matrix with interaction:")
print(X_with_inter)

In [None]:
lr.fit(X_with_inter, df["kid_score"])
print("Intercept:", lr.intercept_)
print("Slopes:", lr.coef_)
print("Score:", lr.score(X_with_inter, df["kid_score"]))

Then we visualize the predictions again, this time with the interaction term included.

In [None]:
education = ["no_hs", "hs"]
colors = ["r", "g"]

fig, ax = plt.subplots()
ax.scatter(df["mom_iq"], df["kid_score"])

for idx, itm in enumerate(education):
    prediction = lr.predict(X_with_inter[X_with_inter["mom_hs"] == idx])
    ax.plot(
        X_with_inter.loc[X_with_inter["mom_hs"] == idx, "mom_iq"],
        prediction,
        label=itm,
        color=colors[idx],
    )
ax.legend()
ax.set_xlabel("mother’s score on an IQ test")
ax.set_ylabel("kid's test scores")

Great, the regression lines of the two subgroups now have different slopes.
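
To see why the interaction term lets the slopes differ, the model can be written out separately for each group. The symbols $\beta_0, \dots, \beta_3$ are notation introduced for this sketch (intercept, *mom_hs*, *mom_iq*, and the interaction, in the order of the design matrix); they are not values from the fit.

$$
\begin{aligned}
\text{kid\_score} &= \beta_0 + \beta_1\,\text{mom\_hs} + \beta_2\,\text{mom\_iq} + \beta_3\,(\text{mom\_hs} \times \text{mom\_iq}) + \text{error} \\
\text{mom\_hs} = 0:\quad \text{kid\_score} &= \beta_0 + \beta_2\,\text{mom\_iq} + \text{error} \\
\text{mom\_hs} = 1:\quad \text{kid\_score} &= (\beta_0 + \beta_1) + (\beta_2 + \beta_3)\,\text{mom\_iq} + \text{error}
\end{aligned}
$$

The coefficient of the interaction term, $\beta_3$, is therefore the difference in the *mom_iq* slope between the two groups, and $\beta_1$ is the difference in intercepts.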

## 4. Logistic regression

Above we discussed categorical predictors, that is, input variables that represent categorical data. What if our output variable is categorical? For a binary outcome, we can use logistic regression.

The logistic regression model uses the logistic (sigmoid) function to squeeze the output of a linear equation between 0 and 1. $$ P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_nX_n)}} $$
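
Rearranging the formula above shows how the coefficients can be read: the linear part is the log-odds of the outcome. Writing $p = P(Y=1|X)$ as shorthand (introduced here for brevity):

$$ \log\frac{p}{1-p} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_nX_n $$

A one-unit increase in $X_j$ changes the log-odds by $\beta_j$, which means it multiplies the odds $p/(1-p)$ by $e^{\beta_j}$ when the other predictors are held fixed.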

Now, let's first define a sigmoid function by ourselves. The function should take as input *x*, where *x = $\beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_nX_n$* (the linear equation we are familiar with) and return *1/(1+e<sup>-x</sup>)*. 

In [None]:
def my_sigmoid(x):
    return 1 / (1 + np.exp(-x))

We can test this function with x = -5 to 5 and take a look at the results:

In [None]:
x = np.linspace(-5, 5, 20)

fig, ax = plt.subplots()
ax.plot(x, my_sigmoid(x))
ax.set_xlabel("x", fontsize=15)
ax.set_ylabel("Sigmoid(x)", fontsize=15)

Great! The function maps all inputs into the range (0, 1).
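
A few properties of the sigmoid can also be checked numerically. The sketch below redefines the function under the hypothetical name `sigmoid_demo` so that it runs on its own, and compares it with `scipy.special.expit`, SciPy's implementation of the same function.

```python
import numpy as np
from scipy.special import expit


def sigmoid_demo(x):
    # Same formula as my_sigmoid: 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))


x_demo = np.linspace(-5, 5, 11)
s = sigmoid_demo(x_demo)

print(sigmoid_demo(0.0))  # value at the midpoint of the curve
print(np.allclose(s, expit(x_demo)))  # agrees with SciPy's version
print(np.allclose(sigmoid_demo(-x_demo), 1 - s))  # symmetry around 0.5
```

The symmetry check reflects that the probability of "fail" at *x* equals the probability of "pass" at *-x*.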

We will simulate some data for the next steps. We decide to simulate a scenario where students study between 0 and 50 hours for the exam, and we want to predict whether the student can pass or fail the exam based on the time spent studying. However, we want the simulated data to be centred on zero again, i.e. students study -25 to +25 hours relative to the average.

In [None]:
n = 100  # number of students
x = np.linspace(-25, 25, n)
print(x)

In [None]:
fig, ax = plt.subplots()
ax.plot(x, my_sigmoid(x))
ax.set_xlabel("Relative hours studied", fontsize=15)
ax.set_ylabel("P(Success)", fontsize=15)

Hmm, this looks OK-ish, but we haven't implemented the logistic regression model yet. For the next step, we will insert the linear equation into the sigmoid function so that *p(Y) = sigmoid(b<sub>0</sub> + b<sub>1</sub>X)*.

We define parameters for the intercept and the slope as we did before, but now we pass the linear predictor to the `my_sigmoid` function.

In [None]:
b0 = -0.1  # intercept
b1 = 1.5  # slope
lin_pred = b0 + b1 * x
p_y = my_sigmoid(lin_pred)

# Visualize the results
fig, ax = plt.subplots()
ax.plot(x, p_y)
ax.set_xlabel("Relative hours studied", fontsize=15)
ax.set_ylabel("P(Success)", fontsize=15)

### Testing predictions with different slopes
Below, we explore how changing the slope affects the results.

In [None]:
b0 = 0
b1_values = np.linspace(0.1, 3, 10)

fig, ax = plt.subplots(figsize=(12, 8))
for b1 in b1_values:
    ax.plot(x, my_sigmoid(b0 + x * b1), label=np.round(b1, 1))
ax.legend()
ax.set_xlabel("hours studied")
ax.set_ylabel("p(Success)")
ax.grid(True)  # show grid lines

Now try it yourself. We'd like to find good values for b0 and b1 for an exam where almost everyone who studies less than -10 hours (that is, more than 10 hours below the average) fails, but everyone who studies more than +20 hours passes.

In [None]:
# YOUR CODE HERE

Now that we have set up and explored the logistic model, we are almost ready to simulate some data for our output variable. We need NumPy's `random.binomial` generator for this. Test this function in the cells below, and try to understand what it is doing.

In [None]:
for i in range(10):
    print(np.random.binomial(n=1, p=0.5))

In [None]:
np.random.binomial(n=1, p=0.1)

In [None]:
np.random.binomial(n=1, p=0.5, size=10)

Ok, let's start the simulation. The first thing we need to do is simulate 50 students, whose study time is uniformly distributed between -50 and 50 hours compared to the average.

In [None]:
n_std = 50
x = np.random.uniform(-50, 50, n_std)

We can predict the probability of passing the exam for each student using the linear equation combined with the sigmoid function and visualize the results.

In [None]:
b0_exam = 1.5
b1_exam = 0.12
y_pred = my_sigmoid(b0_exam + b1_exam * x)

# Visualization
fig, ax = plt.subplots()
ax.plot(
    np.sort(x), np.sort(y_pred)
)  # sort variables to avoid the lines jumping back and forth
ax.set_xlabel("Relative hours studied", fontsize=14)
ax.set_ylabel("P(Success)", fontsize=14)

We are now ready to use the predicted probabilities to simulate whether each student passed or failed the exam using the `random.binomial` function. In the cell below, we generate data for *y*, which is either 1 (pass) or 0 (fail), based on the corresponding value in *y_pred*, and visualise the results.
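
`binomial` also accepts an array of probabilities, so all outcomes can be drawn in a single call rather than one at a time. A minimal sketch, using NumPy's `default_rng` generator with a fixed seed and made-up probabilities (the names `p_demo`, `y_demo`, and `many` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducible draws

# Hypothetical pass probabilities for five students
p_demo = np.array([0.05, 0.25, 0.5, 0.75, 0.95])

# One Bernoulli draw (n=1) per student, each with its own probability
y_demo = rng.binomial(n=1, p=p_demo)
print(y_demo)

# Averaged over many repetitions, the share of 1s approaches p_demo
many = rng.binomial(n=1, p=p_demo, size=(10_000, 5))
print(many.mean(axis=0))
```

Each simulated outcome is a single coin flip whose bias is the predicted probability, which is why individual outcomes are noisy even though the underlying curve is smooth.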

In [None]:
y = np.zeros(n_std)

for i in range(n_std):
    y[i] = np.random.binomial(1, y_pred[i])

# Visualization
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(
    np.sort(x), np.sort(y_pred)
)  # sort variables to avoid the lines jumping back and forth
ax.set_xlabel("Relative hours studied", fontsize=14)
ax.set_ylabel("P(Success)", fontsize=14)

### Fitting logistic regression

Our next step is to actually fit a logistic regression to the simulated data. Similar to fitting a linear regression, we will use the `LogisticRegression` class from scikit-learn.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_reg = LogisticRegression()
log_reg.fit(x.reshape(-1, 1), y)

You may notice that we use `reshape(-1, 1)` on *x* to solve a dimensionality issue. When fitting the model, the design matrix should have the shape (n_samples, n_features). `reshape(-1, 1)` results in an array with a single column and multiple rows (a column vector).

In [None]:
print("Original array shape:", x.shape)
print("Reshaped array shape:", x.reshape(-1, 1).shape)

Again, similar to `LinearRegression`, we can print the fitted parameters of the logistic regression and compare them with the parameters we used for the simulation.
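
When comparing fitted and simulated parameters, it may help to know that scikit-learn's `LogisticRegression` applies L2 regularization by default (`C=1.0`), which pulls the coefficients toward zero. The sketch below regenerates data under assumed parameters (`b0_true`, `b1_true`, the sample size, and `max_iter` are illustrative choices) and fits the model once with the default setting and once with a very large `C`, which weakens the penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
b0_true, b1_true = 1.5, 0.12  # assumed generating parameters
x_sim = rng.uniform(-50, 50, 500)
p_sim = 1 / (1 + np.exp(-(b0_true + b1_true * x_sim)))
y_sim = rng.binomial(1, p_sim)

X_sim = x_sim.reshape(-1, 1)  # shape (n_samples, n_features)
fit_default = LogisticRegression(max_iter=1000).fit(X_sim, y_sim)  # C=1.0
fit_weak = LogisticRegression(C=1e6, max_iter=1000).fit(X_sim, y_sim)

for name, m in [("default C", fit_default), ("C=1e6", fit_weak)]:
    print(name, "intercept:", m.intercept_[0], "slope:", m.coef_[0, 0])
```

With a few hundred samples the two fits are typically close. The penalty tends to matter more for small samples, which is worth keeping in mind when the fitted values differ from the simulated ones.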

In [None]:
print("Fitted intercept:", log_reg.intercept_)
print("Fitted slope:", log_reg.coef_)
print("Score: ", log_reg.score(x.reshape(-1, 1), y))

print("Simulated intercept:", b0_exam)
print("Simulated slope:", b1_exam)

Additionally, we can use the `predict_proba()` method to obtain the predicted probability of each class for every sample (i.e., the probability that each data point belongs to fail (0) or pass (1)).
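
The relationship between `predict_proba`, `predict`, and `score` can be checked on a small self-contained example. The data below are simulated under an assumed slope and the variable names are hypothetical; `accuracy_score` and `confusion_matrix` come from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(2)
x_toy = rng.uniform(-25, 25, 200)
y_toy = rng.binomial(1, 1 / (1 + np.exp(-0.2 * x_toy)))  # assumed slope 0.2
X_toy = x_toy.reshape(-1, 1)

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)  # columns follow clf.classes_, here [0, 1]
labels = clf.predict(X_toy)  # class with the higher probability

print(proba.sum(axis=1)[:5])  # each row sums to 1
print(accuracy_score(y_toy, labels), clf.score(X_toy, y_toy))  # same value
print(confusion_matrix(y_toy, labels))  # rows: true class, columns: predicted
```

For a classifier, `score` is the accuracy, the share of samples whose predicted class matches the true class. The confusion matrix breaks this number down into correct and incorrect predictions per class.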

In [None]:
ll = log_reg.predict_proba(x.reshape(-1, 1))
print(ll)

Below we create a variable *lls* that records the probability the model assigns to each data point's observed class, and visualize the results.

In [None]:
lls = np.zeros(n_std)
lls[y == 0] = ll[y == 0, 0]  # where true y is 0, use first column of ll
lls[y == 1] = ll[y == 1, 1]  # where true y is 1, use second column of ll

# Visualization
fig, ax = plt.subplots()
cd = ax.scatter(x, y, c=lls)  # color data by model likelihood
ax.plot(
    np.sort(x),
    my_sigmoid(np.sort(x) * log_reg.coef_[0] + log_reg.intercept_),
    label="fitted",
)
# Optional: uncomment to overlay the data-generating curve
# ax.plot(np.sort(x), my_sigmoid(np.sort(x) * b1_exam + b0_exam), label="data gen")
# ax.legend()
fig.colorbar(cd)
ax.set_xlabel("Hours studied", fontsize=14)
ax.set_ylabel("Exam passed", fontsize=14)

## Exercise

First, try to write two functions: one to generate binary outcome data based on the logistic model, and the other one to fit the logistic regression with scikit-learn. You can use the relevant parts of the code above. 

**1. Create function `make_log_dat`**

This function should take three inputs in order: the intercept, the slope, and an array *X* used to make the predictions. The function should first use these three inputs to calculate *P(Y)* for each value of *X*, where *P(Y)* represents the predicted probability of outcome = 1 (same as above). Then, for each *P(Y)*, the function simulates a binary outcome and stores all the outcomes in the array *y*. Make sure that only the array *y* containing the simulated binary outcomes is returned.

**2. Create function `fit_log`**

This function should take two inputs in order: an array *X* and an array *y*. The function should fit the scikit-learn logistic regression to *X* and *y* (pay attention to the shape of *X*), and return the intercept and slope of the fitted logistic regression, in that order.

**3. Compare the difference between the simulated data and the fitted model**

Once you have the above functions ready, consider again the scenario of students studying (-25 to 25 hours) to prepare for the exam and perform the following simulation:
1. change the number of students from 10 to 1010 in steps of 50
2. generate an X and y data set for each value of the number of students
3. fit the logistic regression to the generated data and observe how this affects the difference between the generated data and the fitted model.
4. visualise your results to see how different numbers of students affect the difference between the data generated and the fitted model.

Note: You can choose slope and intercepts arbitrarily.

In [None]:
# YOUR CODE HERE