# Tutorial 7 - Introduction to scikit-learn

*Written and revised by Jozsef Arato, Dominik Pegler*  
Computational Cognition Course, University of Vienna  
https://github.com/univiemops/tewa1-computational-cognition

## This week's lab:

In this tutorial, we will introduce you to probably the most important machine learning library in Python: **scikit-learn**. We start by fitting a simple linear model to a well-known dataset (house prices). Then we move on to extensions of the linear model (regularizations) and see if this improves the performance of our model.


In this notebook, we have included many explanations as comments in the code cells. Please read them carefully instead of just pressing the run button.  

**Learning goals:** \
When you have finished this tutorial, you should be able to ...
* fit linear regression models using scikit-learn
* check the quality of the model fit
* apply feature selection
* implement train-test splits to mitigate overfitting and assess the generalization performance of your model on unseen data

**Deadline:** Day of next lab, 10:00

## Import libraries and data

In [None]:
import numpy as np
import pandas as pd
import requests
from matplotlib import pyplot as plt
from scipy import linalg, stats

# new
from sklearn.linear_model import LinearRegression

First, we download the dataset `real_estate.csv` from the internet, then we read it into a pandas DataFrame. As you already know, `DataFrame` objects are often abbreviated as `df` (or `df_something`).

In [None]:
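response = requests.get(
    "https://ucloud.univie.ac.at/index.php/s/crRApfnS4HEm2ar/download"
)
response.raise_for_status()  # stop early if the download failed

# Write the downloaded bytes to disk; the with-block closes the file afterwards
with open("real_estate.csv", "wb") as f:
    f.write(response.content)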

df = pd.read_csv("real_estate.csv")

## 1. Inspect data

This dataset contains various variables for real estate properties, such as transaction date, age, and geographic location. It consists of historical real estate valuations collected in the Sindian District of New Taipei City, Taiwan. We want to see how well these variables predict the price of a house.


Here is the list of our predictor variables:

<div class='alert alert-info'>

- X1 ... the transaction date in decimal (e.g., 2013.25=2013 March, 2013.5=2013 June, etc.)
- X2 ... the house age (unit: year)
- X3 ... the distance to the nearest MRT (metro) station (unit: meter)
- X4 ... the number of convenience stores in the living circle on foot (integer)
- X5 ... the geographic coordinate, latitude. (unit: degree)
- X6 ... the geographic coordinate, longitude. (unit: degree)
</div>

And this is our outcome variable:
<div class='alert alert-info'>

- Y ... house price per unit area (10,000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square meters)
</div>

Now we want to look at the different variables and calculate the mean, minimum and maximum values for each of them. This will give us a rough idea of our data. Hint: You can use the methods from the pandas tutorial, but there is also a way to `describe()` your data with only one method.

In [None]:
# YOUR CODE HERE

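In the next step, we want to create a list named `vars` that contains all variable (column) names of the DataFrame. We will use this list later. Think back to the last tutorial about pandas. Note that this name shadows Python's built-in `vars()` function, which is harmless in this notebook.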

In [None]:
vars = # YOUR CODE HERE

print(vars)

## 2. Data exploration

### 2.1. Explorative data visualization

Visualize the data with scatter plots (one for each of the predictors X1-X6). Note that we are adding `tight_layout()` at the end to conveniently make space for all our labels.

<div class='alert alert-warning'>This plot can be improved a bit. For example, you can change the size and use the <tt>alpha</tt> argument within <tt>plt.scatter()</tt> to get a more pronounced visualization.</div>

In [None]:
plt.figure(
    # figsize=
)

for i, var in enumerate(vars[1:7]):
    plt.subplot(
        2, 3, i + 1
    )  # This means: 2 rows, 3 columns, and current index is i + 1
    plt.scatter(df[var], df[vars[7]])
    plt.xlabel(var)
    plt.ylabel(vars[7])

plt.tight_layout()

### 2.2. Correlation between predictors

Two ways of doing this:

1. NumPy: `np.corrcoef()` function (not `np.correlate()`, which computes the cross-correlation of two sequences)
2. pandas: `.corr()` method

Let's choose pandas. We select only the relevant part of our DataFrame using the `iloc` attribute and call the `.corr()` method on the result. By default, this gives us the Pearson correlation.
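To make the relationship between the pandas and NumPy routes concrete, here is a minimal sketch on made-up data (the DataFrame `toy` and its columns are hypothetical, not the real estate dataset). It assumes NumPy's `np.corrcoef()` as the NumPy counterpart; with `rowvar=False` it treats columns as variables, just like `.corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real estate dataset
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(size=200),  # strongly related to a
        "c": rng.normal(size=200),  # unrelated noise
    }
)

# pandas: columns are treated as variables, Pearson correlation by default
corr_pandas = toy.corr()

# NumPy: rowvar=False tells corrcoef that columns (not rows) are the variables
corr_numpy = np.corrcoef(toy.values, rowvar=False)

print(corr_pandas.round(2))
print(np.allclose(corr_pandas.values, corr_numpy))
```

Both calls return the same matrix of pairwise Pearson correlations; the pandas version simply keeps the column names as labels.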

In [None]:
# YOUR CODE HERE (df.ilo...)

<div class='alert alert-warning'>You can again improve this plot a bit. Here it makes sense to use the <tt>rotation</tt> argument within <tt>plt.xticks().</tt></div>

In [None]:
plt.figure(
    # figsize=
)

offset = 0  # CHOOSE A REASONABLE VALUE

plt.pcolor(df.iloc[:, 1:].corr())
plt.xticks(np.arange(len(vars) - 1) + offset, vars[1:])
plt.yticks(np.arange(len(vars) - 1) + offset, vars[1:])
plt.colorbar()

## 3. Linear regression with a single predictor

Scikit-learn (or sklearn for short) uses an object-oriented programming style (https://www.geeksforgeeks.org/introduction-of-object-oriented-programming/), i.e. a slightly different syntax than NumPy and Matplotlib (but somewhat similar to pandas with its central DataFrame object). In our next example, we will use only one predictor. We want to know how well age predicts the price of an apartment.

### 3.1. Create model and fit to data

The first thing we do is decide which model we want to use. For simplicity, we will use scikit-learn's `LinearRegression` class, which we imported at the beginning of this tutorial, and create an instance of it. We could use any other regression model instead, since the basic procedure is mostly the same across scikit-learn.

In [None]:
lr = LinearRegression()
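The remark that other regression models follow the same procedure can be illustrated with a short sketch. It uses synthetic data (all names such as `X_demo` and `y_demo` are made up for this example) and picks `Ridge`, a regularized linear model from scikit-learn, as one possible alternative; the value `alpha=1.0` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical synthetic data: y depends linearly on two predictors plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Every estimator follows the same steps: create an instance, fit, then score
scores = {}
for model in (LinearRegression(), Ridge(alpha=1.0)):
    model.fit(X_demo, y_demo)
    scores[type(model).__name__] = model.score(X_demo, y_demo)
    print(type(model).__name__, model.coef_.round(2))

print(scores)
```

Only the line that creates the model differs; `fit()` and `score()` are called in exactly the same way.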

Now we are ready to define our data and fit our newly created model to it. A common convention is to use uppercase `X` (because it's a matrix) for the predictor variables and `y` for the outcome variable. You are free to define this any way you like.

In [None]:
vars

In [None]:
X = df[["X2 house age"]]
y = df["Y house price of unit area"]

lr.fit(X, y)

As you may have noticed, we used two pairs of square brackets for `X`. This ensures that `X` remains a two-dimensional object (a DataFrame with one column) rather than a one-dimensional Series. scikit-learn expects the predictors in this 2D form, with one row per observation and one column per predictor, because a model can have more than one predictor. For `y`, we used a single pair of square brackets, so the result is a pandas Series (a vector). The difference becomes clearer if you look at the underlying NumPy arrays via the `values` attribute.
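The difference between the two kinds of indexing can be checked on a tiny, made-up DataFrame (the column names `age` and `price` are placeholders, not the real column names):

```python
import pandas as pd

# Hypothetical toy DataFrame, not the real estate dataset
toy = pd.DataFrame({"age": [5.0, 12.5, 30.0], "price": [40.1, 35.2, 22.8]})

X_toy = toy[["age"]]  # list of column names -> DataFrame (2D)
y_toy = toy["price"]  # single column name   -> Series (1D)

print(type(X_toy).__name__, X_toy.values.shape)  # DataFrame (3, 1)
print(type(y_toy).__name__, y_toy.values.shape)  # Series (3,)
```

The shape `(3, 1)` means three observations and one predictor, whereas `(3,)` is a plain vector of three values.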

### 3.2. Parameters

Parameter estimation is a common goal in computational modeling, where the aim is to find the values of model parameters that best fit observed data or match experimental results. Interpreting these parameters, and analyzing how changes in each parameter affect the model's predictions or dynamics, can provide insights into the real-world phenomenon the model represents.

Let's look at the fitted parameters: the intercept and the coefficients.

In [None]:
lr.intercept_

In [None]:
lr.coef_

In [None]:
plt.axhline(
    y=lr.intercept_,
    color="gray",
    linestyle="--",
    label=f"Intercept: {lr.intercept_:.2f} price units at age 0",
)

y_slope = lr.coef_[0] * np.array([0, 1]) + lr.intercept_
plt.plot(
    np.array([0, 1]),
    y_slope,
    color="red",
    linestyle="--",
    label=f"Slope: {lr.coef_[0]:.2f} price units per year",
)

plt.xlabel("Age")
plt.ylabel("Price")
plt.title("Linear Regression Parameters")
plt.grid()
plt.legend()
plt.show()

### 3.3. Model evaluation

To get a sense of how well our model fits, we can use the coefficient of determination, or R². In scikit-learn it is simply called `score()`. The R² value measures the proportion of variance in the dependent variable (price) that is explained by the independent variable(s) (age) in the model. It can be interpreted as how well the independent variable(s) can predict changes in the dependent variable. It typically ranges from 0 to 1 (perfect fit). However, it can be negative, e.g., when the fit is very poor and you are evaluating on data other than the data the model was fitted to.

In [None]:
lr.score(X, y)
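The definition behind `score()` is R² = 1 - SS_res / SS_tot, i.e., one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The following sketch computes this by hand on a tiny, made-up example (the helper `r2_manual` and all arrays are hypothetical) and also shows how a negative value can arise when the model is scored on data that follow a different trend than the data it was fitted to.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data
X_fit = np.array([[0.0], [1.0], [2.0]])
y_fit = np.array([0.0, 1.0, 2.0])  # rising trend used for fitting
y_other = np.array([2.0, 1.0, 0.0])  # falling trend the model never saw

model = LinearRegression().fit(X_fit, y_fit)


def r2_manual(model, X, y):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - model.predict(X)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


# On the fitting data the line passes through every point (R² close to 1)
print(r2_manual(model, X_fit, y_fit), model.score(X_fit, y_fit))

# On the other data the predictions are worse than simply predicting the mean
print(r2_manual(model, X_fit, y_other), model.score(X_fit, y_other))
```

In the second case the residual sum of squares (8) is four times the total sum of squares (2), so R² is about -3: the model does worse than a constant prediction at the mean.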

<div class='alert alert-warning'>Only for linear regression with a single predictor does the coefficient of determination R² equal the squared correlation coefficient: R² = r². Does your score align with your previously computed correlations?</div>
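This identity can be checked numerically. The sketch below uses synthetic data (the variables `x_demo` and `y_demo` and the numbers used to generate them are made up) and compares the squared Pearson correlation from `scipy.stats.pearsonr` with the R² of a one-predictor regression.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: one predictor, noisy negative linear trend
rng = np.random.default_rng(3)
x_demo = rng.uniform(0, 40, size=150)
y_demo = 45.0 - 0.3 * x_demo + rng.normal(scale=5.0, size=150)

# Pearson correlation between predictor and outcome
r, _ = stats.pearsonr(x_demo, y_demo)

# R² of a linear regression with this single predictor
X_demo = x_demo.reshape(-1, 1)  # scikit-learn expects a 2D predictor matrix
r2 = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

print(r**2, r2)
```

Note that the sign of `r` is lost when squaring: a negative correlation and a positive one of the same size give the same R².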

### 3.4. Prediction using the regression model

In addition to parameter estimation, prediction can be a goal of computational modeling (and is the predominant goal in machine learning) by using the model to predict future outcomes based on observed data and estimated parameters.

We will make use of the `predict()` method. In the case of a simple model like linear regression, we could calculate the predictions by hand because we only have two parameters (e.g., `lr.intercept_ + lr.coef_ * X_new`). But it's good to get familiar with this method as in more complex models it may become impossible to compute the predictions manually.

In [None]:
y_pred = lr.predict(X)

y_pred[:5]  # A quick sanity check with the first 5 predictions
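The manual calculation mentioned above can be sketched as follows. The example uses synthetic data and a hypothetical `X_new` with three new predictor values; the matrix product `X_new @ model.coef_` is used here instead of a plain multiplication so that the same line would also work with several predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data with a single predictor
rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 40, size=(50, 1))
y_demo = 45.0 - 0.3 * X_demo[:, 0] + rng.normal(scale=5.0, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Hypothetical new observations (2D: one row per observation)
X_new = np.array([[0.0], [10.0], [25.0]])

# Prediction by hand: intercept + predictors times coefficients
manual = model.intercept_ + X_new @ model.coef_

print(manual)
print(np.allclose(manual, model.predict(X_new)))
```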

Now it's your turn to visualize the prediction line using Matplotlib.

In [None]:
# YOUR CODE HERE

Sometimes it's good to go slowly and learn more thoroughly, but sometimes it makes sense to go quickly to get results. If you understand Matplotlib well, it is extremely flexible and you can use it to visualize almost anything. But there is another library that helps you get results quickly: **seaborn** (https://seaborn.pydata.org/). Seaborn creates a bunch of ready-made Matplotlib plots with fewer lines of code. Here is an example for our regression problem:

In [None]:
import seaborn as sns

sns.regplot(x=X, y=y, line_kws=dict(color="firebrick"), scatter_kws=dict(alpha=0.3))

With only one line of code you plot your data, labels, the regression line, and even a 95% confidence interval around it. While it is good to learn Matplotlib from the ground up, to get things done it is often helpful to use a higher level interface like seaborn that saves us from writing a lot of code. If you want to have more control over your plots we encourage you to stick with Matplotlib.

## 4. Linear regression with more predictors

Now it's time to move on and use the four variables X1 through X4 in a combined model. To do this, we create a combined predictor matrix from our original data frame, containing only the predictors we want to use. We can use our `vars` variable that we created earlier ...

In [None]:
vars

In [None]:
X = df[vars[1:5]]
y = df[vars[7]]

... create a model and fit it to the new data.

In [None]:
lr = LinearRegression()

lr.fit(X, y)
lr.coef_
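With several predictors, `coef_` is an unlabeled array with one entry per column of `X`, in column order. One way to keep track of which coefficient belongs to which predictor is to wrap it in a pandas Series, as in this sketch on synthetic data (the columns `f1` to `f3` and the true coefficients are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: y depends on f1 and f2, but not on f3
rng = np.random.default_rng(5)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])
y_demo = 2.0 * X_demo["f1"] - 1.0 * X_demo["f2"] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X_demo, y_demo)

# coef_ follows the column order of X, so the column names can serve as labels
coefs = pd.Series(model.coef_, index=X_demo.columns)
print(coefs.round(2))
```

Keep in mind that each coefficient is expressed in the units of its predictor, so coefficients of predictors measured on different scales cannot be compared directly.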

Now let's compute the R² and see how including more predictors changed the model's performance:

In [None]:
lr.score(X, y)

---
<div class='alert alert-info'>You can submit this assignment as a group.</div>

##  Exercise 1: Feature selection

Let's now focus on selecting the most important features or variables from our dataset to create our model. Our goal is to improve model performance by minimizing dimensionality and complexity. To achieve this, we will

 1. Add the predictors **sequentially**, i.e., first fit a model to only X1, then another model to X1 and X2, until you have included all the features up to X6, store the resulting `score()` (R²) for each of the models.
 2. Do the same, but add predictors in a **random** order one-by-one.
 3. Do the same, but add predictors in the order of the **Pearson correlation** with the outcome variable Y (starting with the largest).
 4. Plot the obtained scores from 1-3 as three lines.
 
You can use for-loops to iterate over the different models.

Please add a sentence about what you are observing.

In [None]:
# YOUR CODE HERE

## Exercise 2: Train-test split

**Overfitting** is a common problem in machine learning, where the model fits the training data too closely, even picking up noise and irrelevant details, which makes it difficult to generalize to new data. The train-test split helps by splitting the data into two parts: one for training the model and another for testing how well it generalizes to new data. There is a built-in function for this in scikit-learn (`train_test_split`), but we want you to build your own. To achieve this, we will:

1. Split the `X` and `y` data into an 80% training set (`X_train`, `y_train`) and a 20% test set (`X_test`, `y_test`).

- Option 1: Simply use the first 80% of rows for the training set and the rest for the test set. You can use indexing for this: e.g., `df[0:int(len(df) * 0.8)]` selects the first 80% of the rows of `df`.

- Option 2: Randomly select 80% of the data as training, and the rest for test. This is the better approach, but make sure that both X and y are created from the same random selection (e.g., the rows in `X_test` correspond to the rows in `y_test`, ...).


2. Fit the model to the training set, and compute the `score()` both for the training and the test set

3. Similar to the previous exercise, try to find the combination of predictors that best explains the test set and compute the `score()` for both the training set and the test set.

4. Try to visualize the last results in a similar way as in Exercise 1 (one plot with a line for the training set and a line for the test set).

Please add a sentence about what you are observing.

In [None]:
# YOUR CODE HERE
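For comparison with your own implementation, here is a minimal sketch of scikit-learn's built-in `train_test_split` from `sklearn.model_selection`. It runs on synthetic data (all arrays, the number of noise predictors, and the `random_state` value are arbitrary choices for illustration) and is not a substitute for the hand-written split requested above. Many predictors that are unrelated to the outcome are included on purpose, so that the gap between training and test score becomes visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: one informative predictor, 40 pure-noise predictors
rng = np.random.default_rng(6)
n = 100
signal = rng.normal(size=(n, 1))
noise_features = rng.normal(size=(n, 40))
X_demo = np.hstack([signal, noise_features])
y_demo = 2.0 * signal[:, 0] + rng.normal(scale=1.0, size=n)

# Random 80/20 split; X and y are split with the same row selection
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("train R²:", train_r2)
print("test R²: ", test_r2)
```

Because `X_demo` and `y_demo` are passed to the same call, the rows of `X_test` are guaranteed to correspond to the entries of `y_test`, which is the property your own random split needs to preserve as well.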