# Tutorial 7 - Linear Regression using scikit-learn

*Written and revised by Jozsef Arato, Mengfan Zhang, Dominik Pegler*  
Computational Cognition Course, University of Vienna  
https://github.com/univiemops/tewa1-computational-cognition

---
**This tutorial will cover:**

- fitting linear regression models

- checking coefficients of fitted models

- checking quality of model fit  

---

## Import libraries

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy import linalg, stats

## Import data

In [None]:
data = pd.read_csv("Real estate.csv")

Data Set Information:

The historical market data set for real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan.

Attribute Information:

The inputs are as follows

X1=the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)

X2=the house age (unit: year)

X3=the distance to the nearest MRT station (unit: meter)

X4=the number of convenience stores in the living circle on foot (integer)

X5=the geographic coordinate, latitude. (unit: degree)

X6=the geographic coordinate, longitude. (unit: degree)

The output is as follows

Y=the house price per unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square meters)



In [None]:
data

Data set size (number of rows, number of columns):

In [None]:
np.shape(data)

In [None]:
data["Y house price of unit area"]

In [None]:
data["Y house price of unit area"].to_numpy()

In [None]:
data["Y house price of unit area"][3:]

In [None]:
list(data)

In [None]:
vars = list(data)  # list of column names; note that this name shadows Python's built-in vars()
print(vars)

In [None]:
print(data.iloc[0:3, 5:7])

Different ways of accessing a column from a dataframe (by position, by a name looked up in `vars`, and by the name directly):

In [None]:
print(data.iloc[0:3, 7])
print(data[vars[7]][0:3])
print(data["Y house price of unit area"][0:3])

## Explorative data visualization
Visualize the data with scatter plots, one per predictor (X1-X6), each plotted against the outcome Y.

In [None]:
plt.figure(figsize=(9, 4))
for c, v in enumerate(vars[1:7]):
    plt.subplot(2, 3, c + 1)
    plt.scatter(data[v], data[vars[7]])
    plt.xlabel(v)
    plt.ylabel(vars[7])
plt.tight_layout()

## Correlation between predictors
We use `stats.pearsonr`

to calculate the Pearson correlation between each pair of variables (the predictors X1-X6 and the outcome Y).
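
As a small aside before the loop, the sketch below shows what `stats.pearsonr` returns for a single pair of variables. The arrays `a` and `b` are made-up toy data, not columns of the real estate data set; the first element of the result is the correlation coefficient, which is what the loop stores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the toy data is reproducible
a = rng.normal(size=100)
b = 2.0 * a + rng.normal(size=100)  # b is constructed to correlate positively with a

result = stats.pearsonr(a, b)
r = result[0]        # Pearson correlation coefficient
p_value = result[1]  # p-value for the null hypothesis of zero correlation

# np.corrcoef returns the full 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(a, b)[0, 1]
print(r, p_value, r_numpy)
```

Indexing with `[0]` keeps only the coefficient and discards the p-value.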



In [None]:
n = len(vars) - 1
corrs = np.zeros((n, n))
for cv1, v1 in enumerate(vars[1:]):
    for cv2, v2 in enumerate(vars[1:]):
        corrs[cv1, cv2] = stats.pearsonr(data[v1], data[v2])[0]

In [None]:
plt.pcolor(corrs)
plt.xticks(np.arange(n) + 0.5, vars[1:], rotation=70)
plt.yticks(np.arange(n) + 0.5, vars[1:])
plt.colorbar()

In [None]:
data.iloc[:, 1:].corr()

In [None]:
data.corr()

In [None]:
plt.pcolor(data.iloc[:, 1:].corr())
plt.figure()
plt.pcolor(corrs)
plt.xticks(np.arange(n) + 0.5, vars[1:], rotation=70)
plt.yticks(np.arange(n) + 0.5, vars[1:])
plt.colorbar()

## Linear regression with a single predictor
Using the sklearn library.

Now we use only one predictor, the house age, to predict the house price.


In [None]:
from sklearn.linear_model import LinearRegression

sklearn uses an object-oriented programming style.

This is a slightly different syntax from numpy and matplotlib

(but somewhat similar to pandas: a pandas dataframe is an object).
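
To make the object-oriented pattern concrete, here is a minimal sketch on made-up toy data (the names `x_1d`, `x_2d`, `y_toy`, and `model` are illustrative and not part of the tutorial). The estimator is created first, then `fit` stores the results on the object itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data (hypothetical): y = 3 + 2*x exactly, without noise
x_1d = np.arange(10, dtype=float)
y_toy = 3.0 + 2.0 * x_1d

model = LinearRegression()   # create the estimator object
x_2d = x_1d.reshape(-1, 1)   # sklearn expects predictors of shape (n_samples, n_features)
model.fit(x_2d, y_toy)       # fitting stores the estimated parameters on the object

# attributes ending in an underscore exist only after fitting
print(model.intercept_, model.coef_)
```

Because the toy data is noise-free, the fitted intercept and slope recover 3 and 2 up to floating-point precision.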

In [None]:
lr = LinearRegression()  # we create a linear regression object

In [None]:
# lr.fit(x, y)  # general pattern: x is a 2-D predictor array, y the target (both are defined further below)

In [None]:
type(lr)

In [None]:
lr.fit(data[vars[2:4]], data[vars[7]])

In [None]:
vars[2:4]

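Fitting the regression model with house age (X2) as the single predictor. sklearn expects a 2-D predictor array, hence the reshape: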

In [None]:
x = data[vars[2]].to_numpy()
print(np.shape(x))
x = x.reshape(-1, 1)
print(np.shape(x))
y = data[vars[7]]
lr.fit(x, y)

Check the fitted parameters: the intercept and the coefficients.

In [None]:
lr.intercept_

In [None]:
lr.coef_

In [None]:
-2.54477973e-01

`score` returns the coefficient of determination (R²) of the model.
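
For reference, the coefficient of determination is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this against `score` on made-up toy data (the names `x_toy`, `y_toy`, and `model` are illustrative, and the numbers are not taken from the real estate data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x_toy = rng.uniform(0, 40, size=(200, 1))  # one hypothetical predictor
y_toy = 40.0 - 0.25 * x_toy[:, 0] + rng.normal(scale=5.0, size=200)

model = LinearRegression().fit(x_toy, y_toy)
preds = model.predict(x_toy)

ss_res = np.sum((y_toy - preds) ** 2)         # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(x_toy, y_toy))
```

Both printed values should agree up to floating-point precision.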


In [None]:
lr.score(x, y)  # lr was last fitted on the single predictor x, so we score on the same input

In [None]:
# lr.predict(data[vars[2:4]])  # would fail here: lr is currently fitted on a single predictor

Prediction of the regression model:
1. use the built-in LinearRegression.predict() method
2. calculate the prediction by hand, using the intercept and the slope (coef_)
3. compare the predictions obtained in the two ways

In [None]:
preds = lr.predict(x)
print(preds[0:5])
preds2 = lr.intercept_ + lr.coef_ * x  # same values as preds, but shape (n, 1) because x is 2-D
print(preds2[0:5])

visualize prediction using matplotlib

In [None]:
plt.scatter(x, y)
plt.plot(x, preds2, color="r")
plt.xlabel(vars[2])
plt.ylabel(vars[7])

## Multiple linear regression

Now let's use the four measurements X1-X4 in a combined model.

For this we build a combined predictor matrix from our original dataframe, containing only the predictors we want to use:

In [None]:
print(vars[1:5])
x = data[vars[1:5]]
print(type(x))

y = data[vars[7]]

fit multiple linear regression

In [None]:
lr2 = LinearRegression()

In [None]:
lr2.fit(x, y)

In [None]:
lr2.coef_

Observe the fitted parameters and the goodness of fit,

and compare them to the single-predictor model above.

In [None]:
lr2.score(x, y)

## Comparison to scipy.linalg.lstsq

Letting the model fit an intercept parameter is equivalent to adding a column of ones to the design matrix.

We compare both variants and also use `lstsq` to fit the same regression model.
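
The equivalence can be checked numerically. The sketch below uses made-up toy data (the names `x_toy`, `y_toy`, `xx_toy`, `with_intercept`, and `beta` are illustrative) and compares the sklearn fit with the `lstsq` solution on a design matrix that has a leading column of ones.

```python
import numpy as np
from scipy import linalg
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x_toy = rng.normal(size=(50, 3))  # three hypothetical predictors
y_toy = 1.5 + x_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# design matrix with a leading column of ones, which plays the role of the intercept
xx_toy = np.column_stack((np.ones(len(y_toy)), x_toy))

with_intercept = LinearRegression(fit_intercept=True).fit(x_toy, y_toy)
beta = linalg.lstsq(xx_toy, y_toy)[0]  # first returned element is the least-squares solution

print(beta)
print(with_intercept.intercept_, with_intercept.coef_)
```

The first entry of `beta` corresponds to the intercept, and the remaining entries correspond to `coef_`.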

In [None]:
x = data[vars[1:5]].to_numpy()
y = data[vars[7]].to_numpy()
xx = np.column_stack((np.ones(len(y)), x))
print(xx)

In [None]:
lr_int_1 = LinearRegression(fit_intercept=False)  # the intercept is carried by the column of ones in xx
lr_int_1.fit(xx, y)
print(lr_int_1.coef_)
print(lr_int_1.intercept_)

In [None]:
-1.15887478e04

In [None]:
lr_int_1 = LinearRegression(fit_intercept=True)  # the intercept is fitted by sklearn, so we pass x without ones
lr_int_1.fit(x, y)
print(lr_int_1.coef_)
print(lr_int_1.intercept_)

In [None]:
linalg.lstsq(xx, y)[0]

## Feature selection
 1. add predictors (features) one by one, use X1 only first, and add all predictors sequentially until X6, and plot the obtained score for each model (as a function of the number of predictors)
 2. add predictors in a random order one-by-one, and plot the obtained score for each model
 3. add predictors in the order of their absolute Pearson correlation with the outcome variable Y (starting with the largest), and plot the obtained score

In [None]:
n_features = 6
lr = LinearRegression()
y = data[vars[7]]
for f in np.arange(1, n_features + 1):
    print(vars[1 : 1 + f])
    x = data[vars[1 : 1 + f]]
    lr.fit(x, y)
    score = lr.score(x, y)

    print(score)
    print(lr.coef_)
    plt.scatter(f, score)
plt.xlabel("num of features", fontsize=16)
plt.ylabel("score", fontsize=16)
plt.yticks(fontsize=13)
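
One possible way to approach item 3 of the list above is to rank the predictors by their absolute correlation with the outcome before adding them. The sketch below does this on a made-up toy dataframe (the names `toy`, `target`, `order`, and `scores` are illustrative, and the columns `a`, `b`, `c` are not from the real estate data); plotting is left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n_rows = 300
toy = pd.DataFrame(
    {
        "a": rng.normal(size=n_rows),
        "b": rng.normal(size=n_rows),
        "c": rng.normal(size=n_rows),  # unrelated to the target by construction
    }
)
# hypothetical outcome: depends strongly on b, weakly on a, not on c
target = 3.0 * toy["b"] + 1.0 * toy["a"] + rng.normal(scale=0.5, size=n_rows)

# absolute Pearson correlation of each predictor with the outcome, largest first
abs_corr = toy.corrwith(target).abs().sort_values(ascending=False)
order = list(abs_corr.index)

scores = []
model = LinearRegression()
for k in range(1, len(order) + 1):
    cols = order[:k]  # the k predictors with the largest absolute correlation
    model.fit(toy[cols], target)
    scores.append(model.score(toy[cols], target))

print(order)
print(scores)
```

Since each model is scored on the data it was fitted to, the score cannot decrease as predictors are added; this is one reason the homework below asks for a separate test set.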

## Homework
### Training and test set
1. split the X and Y data into an 80% training set and a 20% test set

Option 1: take the first 80% of the data as training and the rest as test. You can use indexing for this, e.g. `data[0:int(len(data)*.8)]` selects the first 80% of the rows (this works for a numpy array and for a dataframe)

Option 2: randomly select 80% of the data as training and use the rest as test (this is the better approach)

!!! Try not to use the built-in train-test split function (`train_test_split`)!

2. fit the model to the training set, and calculate the score for both the training and the test set

3. combine with the previous exercise: try to find the combination of predictors that best explains the test data (try different combinations of predictors, fit on the training data, calculate the score for both training and test).

4. try to visualize, with a similar figure to Slide 7 of Lecture 4 (just with the score on the y-axis, instead of the error)

