# Tutorial 6 - Part 2 - Regression with categorical predictors

*Written and revised by Jozsef Arato, Mengfan Zhang, Dominik Pegler*  
Computational Cognition Course, University of Vienna  
https://github.com/univiemops/tewa1-computational-cognition

---
**This tutorial will cover:**

*   Categorical predictors
*   Interactions
*   Training and test set
*   Regularized regression: Ridge and Lasso

---

## 1. Import libraries

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import io, linalg, stats
from sklearn.linear_model import Lasso, LinearRegression, Ridge

## 2. Import data

### 2.1. Download from Moodle:

`kidiq.csv`

Source:
- Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge: Cambridge University Press.

In [None]:
from google.colab import files

uploaded = files.upload()

### 2.2. Loading data into workspace

In [None]:
data = pd.read_csv("kidiq.csv")

## 3. Inspect data set

### 3.1. `print` data table, explore number of potential predictors, dataset size

In [None]:
print(np.shape(data))
data

In [None]:
data.corr()

### 3.2. Visualize some of the data

In [None]:
fig, ax = plt.subplots(ncols=2)
ax[0].scatter(data["mom_iq"], data["kid_score"], color="salmon", alpha=0.4)
ax[0].set_xlabel("mom_iq")
ax[0].set_ylabel("kid_score")

ax[1].scatter(data["mom_hs"], data["kid_score"], alpha=0.2, color="salmon")
ax[1].set_xlabel("mom high school")
ax[1].set_ylabel("kid_score")
ax[1].set_xticks([0, 1])

ax[1].set_xlim([-0.5, 1.5])
plt.tight_layout()

## 4. Fit a linear regression using two predictors: `mom_hs` and `mom_iq`

In [None]:
x = np.column_stack((data["mom_iq"], data["mom_hs"]))

In [None]:
lr = LinearRegression()
lr.fit(x, data["kid_score"])

### 4.1. Check the predictor weights and the score of the model


In [None]:
print("intercept", lr.intercept_)
print("weights/slopes", lr.coef_)
print("score: ", lr.score(x, data["kid_score"]))
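
For a `LinearRegression`, the value returned by `score` is the coefficient of determination $R^2$. The following minimal sketch recomputes it by hand to make that explicit. It runs on synthetic data: the arrays `iq`, `hs`, and `score` and all numbers used to generate them are made up for illustration and only stand in for the kidiq columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the kidiq columns (assumed values, not the real data)
rng = np.random.default_rng(0)
iq = rng.normal(100, 15, size=200)  # continuous predictor
hs = rng.integers(0, 2, size=200)  # binary (dummy-coded) predictor
score = 20 + 0.6 * iq + 6 * hs + rng.normal(0, 18, size=200)

x_demo = np.column_stack((iq, hs))
model = LinearRegression().fit(x_demo, score)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
residuals = score - model.predict(x_demo)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((score - score.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("R^2 from score():", model.score(x_demo, score))
print("R^2 by hand:     ", r2_manual)
```

In a model without interaction, the weight of the binary predictor is the vertical offset between two parallel regression lines, one per group.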

### 4.2. Visualize predictions and true data

In [None]:
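plt.scatter(data["mom_iq"], data["kid_score"])

ypred = lr.predict(x)
# The line may zigzag: predictions also depend on mom_hs,
# and the points are connected in row order
plt.plot(data["mom_iq"], ypred, color="r")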

In [None]:
x

#### Solution 1: Indexing

In [None]:
pred_no_high_school = lr.predict(x[x[:, 1] == 0, :])
pred_high_school = lr.predict(x[x[:, 1] == 1, :])
plt.scatter(data["mom_iq"], data["kid_score"], c=data["mom_hs"])
plt.plot(x[x[:, 1] == 0, 0], pred_no_high_school, label="no hs", color="blue")
plt.plot(x[x[:, 1] == 1, 0], pred_high_school, label="hs", color="orange")
plt.legend()
plt.xlabel("mom_iq")
plt.ylabel("kid_score")

#### Solution 2: Set up a design matrix for prediction

In [None]:
xs = np.array([66, 140])
x_no_hs = np.column_stack((xs, np.zeros(2)))
print(x_no_hs)
xhs = np.column_stack((xs, np.ones(2)))
print(xhs)

pred_no_high_school = lr.predict(x_no_hs)
pred_high_school = lr.predict(xhs)

plt.scatter(data["mom_iq"], data["kid_score"], c=data["mom_hs"])
plt.plot(xs, pred_no_high_school, label="no hs", color="blue")
plt.plot(xs, pred_high_school, label="hs", color="orange")
plt.legend()
plt.xlabel("mom_iq")
plt.ylabel("kid_score")

#### Solution 3: Indexing + `for` loop for groups

In [None]:
education = ["no_hs", "hs"]
colors = ["blue", "orange"]
plt.scatter(data["mom_iq"], data["kid_score"], c=data["mom_hs"])
for ce, e in enumerate(education):
    prediction = lr.predict(x[x[:, 1] == ce, :])
    plt.plot(x[x[:, 1] == ce, 0], prediction, label=e, color=colors[ce])
plt.legend()
plt.xlabel("mom_iq")
plt.ylabel("kid_score")

## 5. Model with interaction

Create a new design matrix that includes a column for the interaction between `mom_hs` and `mom_iq`, fit the model, and visualize the model prediction.

In [None]:
x2 = np.column_stack((data["mom_iq"], data["mom_hs"], data["mom_iq"] * data["mom_hs"]))
print("design matrix with interaction")
print(x2[:10, :])

In [None]:
lr.fit(x2, data["kid_score"])
print(lr.coef_)
print(lr.intercept_)
print(lr.score(x2, data["kid_score"]))
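
With an interaction column, the three weights can be read as two separate regression lines, one per group. The sketch below shows the arithmetic on synthetic data. The generating values (slopes of 1.0 and 0.5, the noise level) and the names `x_int`, `slope_hs`, and so on are assumptions chosen for illustration, not results from the kidiq data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data in which the slope differs by group (assumed values)
rng = np.random.default_rng(1)
iq = rng.normal(100, 15, size=300)
hs = rng.integers(0, 2, size=300)
score = -10 + 1.0 * iq + 50 * hs - 0.5 * iq * hs + rng.normal(0, 5, size=300)

# Same column order as the design matrix above: iq, hs, iq * hs
x_int = np.column_stack((iq, hs, iq * hs))
model = LinearRegression().fit(x_int, score)
b_iq, b_hs, b_inter = model.coef_

# hs == 0: the hs and interaction columns are zero
intercept_no_hs = model.intercept_
slope_no_hs = b_iq

# hs == 1: the hs weight shifts the intercept, the interaction weight shifts the slope
intercept_hs = model.intercept_ + b_hs
slope_hs = b_iq + b_inter

print("no hs: intercept", intercept_no_hs, "slope", slope_no_hs)
print("hs:    intercept", intercept_hs, "slope", slope_hs)
```

The interaction weight is therefore the difference in slopes between the two groups; if it were zero, the lines would be parallel as in the model without interaction.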

In [None]:
education = ["no_hs", "hs"]
colors = ["blue", "orange"]
plt.scatter(data["mom_iq"], data["kid_score"], c=data["mom_hs"])
for ce, e in enumerate(education):
    prediction = lr.predict(x2[x2[:, 1] == ce, :])
    plt.plot(x2[x2[:, 1] == ce, 0], prediction, label=e, color=colors[ce])
plt.legend()
plt.xlabel("mom_iq")
plt.ylabel("kid_score")

## 6. Learning from data

Traditional regression analyses usually fit the model to the entire data set. Such a model does learn from the data, but it risks overfitting, that is, capturing noise and irrelevant patterns. And because every observation was used for fitting, no data are left to evaluate how well the model generalizes. If we instead fit the model to only a subset of the data, we can evaluate it on the held-out remainder and estimate how well it predicts unseen data.


### 6.1. Train-test split

The **train-test split** is a technique used to evaluate machine learning and regression models. It involves dividing the dataset into two parts:
- Training set: Used to train the model by teaching it the relationships between input and output.
- Test set: Reserved to assess the model's performance on new, unseen data.

By splitting the dataset, we can train the model on one part and use the other part to see how well it predicts output values for new data.

To do this, we separate the X and Y data into an 80% training set and a 20% test set using indexing. For now, it is fine to take the first 80% of the rows as training data and the last 20% as test data. (Note that this approach is clearly wrong if the data are ordered.)

In [None]:
n = len(data["kid_score"])
n_train = int(n * 0.8)  # number of rows in the training set
print(n, n_train)

# first 80% of the rows for training
train_x = x2[:n_train, :]
train_y = data["kid_score"][:n_train]

# remaining 20% of the rows for testing
test_x = x2[n_train:, :]
test_y = data["kid_score"][n_train:]

In [None]:
lr.fit(train_x, train_y)

In [None]:
print(lr.score(train_x, train_y))
print(lr.score(test_x, test_y))

### 6.2. Fit the regression model (with interaction term) on the training set, and calculate the score for both the training and the test sets

(This time, use a randomly selected 80% training set and 20% test set.)

In [None]:
#
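
One possible way to draw a random split is to shuffle the row indices and cut the shuffled list at 80%. The sketch below is only an illustration under stated assumptions: it uses synthetic data, and the names ending in `_demo` as well as the seed are made up so that nothing from the cells above is overwritten.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with an interaction (assumed values)
rng = np.random.default_rng(2)
n_demo = 400
iq = rng.normal(100, 15, size=n_demo)
hs = rng.integers(0, 2, size=n_demo)
y_demo = -10 + 1.0 * iq + 50 * hs - 0.5 * iq * hs + rng.normal(0, 15, size=n_demo)
x_demo = np.column_stack((iq, hs, iq * hs))

# Shuffle the row indices, then cut at 80%
shuffled = rng.permutation(n_demo)
n_train_demo = int(n_demo * 0.8)
train_idx, test_idx = shuffled[:n_train_demo], shuffled[n_train_demo:]

train_x_demo, train_y_demo = x_demo[train_idx], y_demo[train_idx]
test_x_demo, test_y_demo = x_demo[test_idx], y_demo[test_idx]

# Fit on the training rows only, score on both parts
model = LinearRegression().fit(train_x_demo, train_y_demo)
print("train score:", model.score(train_x_demo, train_y_demo))
print("test score: ", model.score(test_x_demo, test_y_demo))
```

With the real data, `data["kid_score"]` is a pandas Series, so index it by position with `.iloc[train_idx]` or convert it with `.to_numpy()` first. scikit-learn's `train_test_split` is an alternative to manual indexing.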

### 6.3. Mean Centering and standardization

1. Mean center your data (make a new design matrix for this), then fit and score the model again.

2. Standardize your data and fit the model with interaction again. See how the obtained coefficients and the score change.



In [None]:
#
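
A minimal sketch of one way to approach this, assuming that the two main predictors are centered (or z-scored) first and the interaction column is then rebuilt from the transformed predictors. The data are synthetic, and the helper `design` and the dictionary `fits` are hypothetical names introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumed values, not the real kidiq data)
rng = np.random.default_rng(3)
iq = rng.normal(100, 15, size=300)
hs = rng.integers(0, 2, size=300).astype(float)
y_demo = -10 + 1.0 * iq + 50 * hs - 0.5 * iq * hs + rng.normal(0, 15, size=300)


def design(a, b):
    """Stack two main effects and their product (the interaction)."""
    return np.column_stack((a, b, a * b))


x_raw = design(iq, hs)

# 1. Mean centering: subtract each predictor's mean, then rebuild the interaction
iq_c = iq - iq.mean()
hs_c = hs - hs.mean()
x_centered = design(iq_c, hs_c)

# 2. Standardization: additionally divide by the standard deviation
iq_z = iq_c / iq.std()
hs_z = hs_c / hs.std()
x_standardized = design(iq_z, hs_z)

fits = {}
for name, xm in [("raw", x_raw), ("centered", x_centered), ("standardized", x_standardized)]:
    m = LinearRegression().fit(xm, y_demo)
    fits[name] = {"coef": m.coef_, "intercept": m.intercept_, "score": m.score(xm, y_demo)}
    print(name, m.coef_, m.intercept_, fits[name]["score"])
```

In this sketch the score is identical for all three versions, because centering and scaling only re-express the same predictors; what changes are the weights and the intercept. After centering, each main-effect weight describes the effect of that predictor at the mean of the other one. When working with a train-test split, the means and standard deviations would normally be computed on the training set only.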

### 6.4. Regularized regression


Fit Lasso and Ridge regression on the training set from above and score them on both the training and test sets. See how the scores change, and also how the regression weights (betas) are affected.

*Note:* Ridge and lasso work well with standardized data.


In [None]:
#
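
The sketch below compares ordinary least squares with Ridge and Lasso on one random split of synthetic data. The values `alpha=10.0` for Ridge and `alpha=1.0` for Lasso are arbitrary choices for illustration, and the variable names are made up; the predictors are standardized with training-set statistics, following the note above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic stand-in data (assumed values, not the real kidiq data)
rng = np.random.default_rng(4)
n_demo = 300
iq = rng.normal(100, 15, size=n_demo)
hs = rng.integers(0, 2, size=n_demo).astype(float)
y_demo = -10 + 1.0 * iq + 50 * hs - 0.5 * iq * hs + rng.normal(0, 15, size=n_demo)
x_demo = np.column_stack((iq, hs, iq * hs))

# Random 80/20 split
idx = rng.permutation(n_demo)
cut = int(0.8 * n_demo)
tr, te = idx[:cut], idx[cut:]

# Standardize every column using the training mean and standard deviation
mu = x_demo[tr].mean(axis=0)
sd = x_demo[tr].std(axis=0)
x_tr, x_te = (x_demo[tr] - mu) / sd, (x_demo[te] - mu) / sd
y_tr, y_te = y_demo[tr], y_demo[te]

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=10.0),  # alpha chosen arbitrarily
    "lasso": Lasso(alpha=1.0),  # alpha chosen arbitrarily
}

results = {}
for name, m in models.items():
    m.fit(x_tr, y_tr)
    results[name] = {
        "coef": m.coef_.copy(),
        "train": m.score(x_tr, y_tr),
        "test": m.score(x_te, y_te),
    }
    print(name, "weights:", m.coef_, "train:", results[name]["train"], "test:", results[name]["test"])
```

Both penalties shrink the weights relative to ordinary least squares, and Lasso can set some of them exactly to zero. The training score of a regularized model cannot exceed that of ordinary least squares; whether the test score improves depends on the data and on `alpha`.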

### 6.5. Systematic exploration of Lasso and Ridge

Systematically vary the regularization strength `alpha` and see how this influences the obtained scores and weights (for both Lasso and Ridge). Visualize the results of this investigation.





In [None]:
#
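
One way such an exploration could look is sketched below, assuming a logarithmic grid of `alpha` values between 0.01 and 1000. The grid, the synthetic data, and the names `makers` and `summary` are all choices made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data (assumed values, not the real kidiq data)
rng = np.random.default_rng(5)
n_demo = 300
iq = rng.normal(100, 15, size=n_demo)
hs = rng.integers(0, 2, size=n_demo).astype(float)
y_demo = -10 + 1.0 * iq + 50 * hs - 0.5 * iq * hs + rng.normal(0, 15, size=n_demo)
x_demo = np.column_stack((iq, hs, iq * hs))

# Random 80/20 split, standardized with training-set statistics
idx = rng.permutation(n_demo)
cut = int(0.8 * n_demo)
tr, te = idx[:cut], idx[cut:]
mu, sd = x_demo[tr].mean(axis=0), x_demo[tr].std(axis=0)
x_tr, x_te = (x_demo[tr] - mu) / sd, (x_demo[te] - mu) / sd
y_tr, y_te = y_demo[tr], y_demo[te]

alphas = np.logspace(-2, 3, 20)  # assumed grid of regularization strengths
makers = {
    "Ridge": lambda a: Ridge(alpha=a),
    "Lasso": lambda a: Lasso(alpha=a, max_iter=10_000),
}

# Refit each model for every alpha and store weights and scores
summary = {}
for name, make in makers.items():
    coefs, train_scores, test_scores = [], [], []
    for a in alphas:
        m = make(a).fit(x_tr, y_tr)
        coefs.append(m.coef_.copy())
        train_scores.append(m.score(x_tr, y_tr))
        test_scores.append(m.score(x_te, y_te))
    summary[name] = {
        "coefs": np.array(coefs),
        "train": np.array(train_scores),
        "test": np.array(test_scores),
    }

# Top row: weights as a function of alpha; bottom row: train and test scores
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(9, 6))
for col, name in enumerate(summary):
    axes[0, col].semilogx(alphas, summary[name]["coefs"])
    axes[0, col].set_title(name)
    axes[0, col].set_ylabel("weights")
    axes[1, col].semilogx(alphas, summary[name]["train"], label="train")
    axes[1, col].semilogx(alphas, summary[name]["test"], label="test")
    axes[1, col].set_xlabel("alpha")
    axes[1, col].set_ylabel("score")
    axes[1, col].legend()
plt.tight_layout()
```

In plots of this kind, Ridge weights typically shrink smoothly toward zero as `alpha` grows, while Lasso weights reach exactly zero one after another.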

To draw more general conclusions, we would need to repeat the random train-test separation multiple times and aggregate the behavior of Lasso and Ridge across these splits.
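
A rough sketch of such a repetition, again on synthetic data: the number of repetitions (`n_repeats = 100`) and the fixed `alpha` values are arbitrary assumptions, and the variable names are made up for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic stand-in data (assumed values, not the real kidiq data)
rng = np.random.default_rng(6)
n_demo = 300
iq = rng.normal(100, 15, size=n_demo)
hs = rng.integers(0, 2, size=n_demo).astype(float)
y_demo = -10 + 1.0 * iq + 50 * hs - 0.5 * iq * hs + rng.normal(0, 15, size=n_demo)
x_demo = np.column_stack((iq, hs, iq * hs))

n_repeats = 100  # assumed number of random splits
cut = int(0.8 * n_demo)
test_scores = {"ols": [], "ridge": [], "lasso": []}

for _ in range(n_repeats):
    # New random 80/20 split on every repetition
    idx = rng.permutation(n_demo)
    tr, te = idx[:cut], idx[cut:]

    # Standardize with the statistics of the current training set
    mu, sd = x_demo[tr].mean(axis=0), x_demo[tr].std(axis=0)
    x_tr, x_te = (x_demo[tr] - mu) / sd, (x_demo[te] - mu) / sd

    fresh_models = {
        "ols": LinearRegression(),
        "ridge": Ridge(alpha=10.0),  # alpha chosen arbitrarily
        "lasso": Lasso(alpha=1.0),  # alpha chosen arbitrarily
    }
    for name, m in fresh_models.items():
        m.fit(x_tr, y_demo[tr])
        test_scores[name].append(m.score(x_te, y_demo[te]))

# Summarize the test score across splits
for name, scores in test_scores.items():
    print(name, "mean test score:", np.mean(scores), "sd:", np.std(scores))
```

The spread of the test scores across splits gives a sense of how much a single split can mislead.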