# **Lab 1: Linear Regression**

In the first lecture, you were introduced to univariate and multivariate linear regression. In this lab, we will see how to train such models using sklearn.


## Exercise 1: Simulate the learning process of Univariate Linear Regression

We will manually fit a univariate linear regression model to predict a given target variable (salary) as closely as possible using the provided feature (years of experience).

We will be using a dataset available here: https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab01/ex1/salary_data.csv

The steps are:
1.   Load data
2.   Explore the dataset
3.   Make predictions with Default Parameters for a Linear Regression
4.   Update parameters

---
### 0. Setup Environment

In [1]:
# Do not modify this code
!pip install -q utstd

from utstd.folders import *
from utstd.ipyrenders import *

lab = LabExFolder(
  course_code="36106",
  lab="lab01",
  exercise="ex01"
)
lab.run()

Mounted at /content/gdrive

You can now save your data files in: /content/36106/labs/lab01/ex01/data


In [2]:
import warnings
warnings.simplefilter(action='ignore')

### 1. Load Data

**[1.1]** Import the pandas, numpy and altair packages

In [3]:
import pandas as pd
import numpy as np
import altair as alt

**[1.2]** Load the dataset

In [4]:
data = pd.read_csv('https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab01/ex1/salary_data.csv')

**[1.3]** Display the dataset

In [5]:
data

|   | YearsExperience | Salary |
|---|---|---|
| 0 | 1.1 | 39343.0 |
| 1 | 1.3 | 46205.0 |
| 2 | 1.5 | 37731.0 |
| 3 | 2.0 | 43525.0 |
| 4 | 2.2 | 39891.0 |
| 5 | 2.9 | 56642.0 |
| 6 | 3.0 | 60150.0 |
| 7 | 3.2 | 54445.0 |
| 8 | 3.2 | 64445.0 |
| 9 | 3.7 | 57189.0 |


### 2. Explore the dataset

**[2.1]** Create a scatter plot with Altair on the dataset

In [6]:
scatter_chart = alt.Chart(data).mark_circle(size=60, color='orange').encode(
    x='YearsExperience',
    y='Salary'
)

**[2.2]** Display the scatter plot

In [7]:
scatter_chart

### 3. Make Predictions with Default Parameters for a Linear Regression

**[3.1]** Extract the feature and the target into 2 separate variables, `x` and `y`

In [8]:
x = data['YearsExperience'].values
y = data['Salary'].values

**[3.2]** Create 2 variables called `a` and `b` that will take the value 0

In [9]:
# Placeholder for student's code

In [10]:
# Solution
a = 0
b = 0

**[3.3]** Create a variable called `y_hat` that will calculate the result of `a * x + b` (univariate linear regression)

In [11]:
# Placeholder for student's code

In [12]:
# Solution
y_hat = a * x + b
y_hat

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
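Because `x` is a NumPy array, `a * x + b` is evaluated element by element, so a single expression yields one prediction per row. The sketch below illustrates this on the first three `YearsExperience` values from the dataset preview. The names `x_sample`, `a_demo`, `b_demo` and `y_hat_demo`, and the parameter values, are hypothetical and chosen only for illustration.

```python
import numpy as np

# First three YearsExperience values shown in the dataset preview
x_sample = np.array([1.1, 1.3, 1.5])

# With a = 0 and b = 0 every prediction is 0, whatever the input
zeros = 0 * x_sample + 0

# Hypothetical non-zero parameters, picked only to show the arithmetic
a_demo, b_demo = 10000, 20000

# The scalars are applied to each element: one prediction per value of x_sample
y_hat_demo = a_demo * x_sample + b_demo

print(zeros)
print(y_hat_demo)
```

The result always has the same shape as `x`, which is why `y_hat` above contains 30 zeros, one per row of the dataset.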

**[3.4]** Create a dataframe called `preds` that will contain the values of `x` and `y_hat`, with the respective column names `x` and `y`

In [13]:
# Placeholder for student's code

In [14]:
# Solution
preds = pd.DataFrame({'x': x, 'y': y_hat})

**[3.5]** Display the content of `preds`

In [15]:
# Placeholder for student's code

In [16]:
# Solution
preds

|   | x | y |
|---|---|---|
| 0 | 1.1 | 0.0 |
| 1 | 1.3 | 0.0 |
| 2 | 1.5 | 0.0 |
| 3 | 2.0 | 0.0 |
| 4 | 2.2 | 0.0 |
| 5 | 2.9 | 0.0 |
| 6 | 3.0 | 0.0 |
| 7 | 3.2 | 0.0 |
| 8 | 3.2 | 0.0 |
| 9 | 3.7 | 0.0 |


**[3.6]** Create a line chart with `preds`


In [17]:
line_chart = alt.Chart(preds).mark_line().encode(
    x='x',
    y='y'
)

**[3.7]** Display the line chart

In [18]:
scatter_chart + line_chart

**[3.8]** Calculate the difference between the predictions `y_hat` and the actual target values `y`. Save these into a variable called `errors`

In [19]:
# Placeholder for student's code

In [20]:
# Solution
errors = y_hat - y

**[3.9]** Display the content of `errors`

In [21]:
# Placeholder for student's code

In [22]:
# Solution
errors

array([ -39343.,  -46205.,  -37731.,  -43525.,  -39891.,  -56642.,
        -60150.,  -54445.,  -64445.,  -57189.,  -63218.,  -55794.,
        -56957.,  -57081.,  -61111.,  -67938.,  -66029.,  -83088.,
        -81363.,  -93940.,  -91738.,  -98273., -101302., -113812.,
       -109431., -105582., -116969., -112635., -122391., -121872.])

**[3.10]** Calculate and display the sum of the differences between the predictions `y_hat` and the actual target values `y`

In [23]:
sum_error = errors.sum()
sum_error

np.float64(-2280090.0)
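A sum of signed errors has a limitation worth keeping in mind: positive and negative errors cancel, so a total near 0 does not by itself mean the predictions are close to the targets. The following sketch uses made-up toy values (not the salary data) to show the effect. It also computes the mean absolute error and the mean squared error, two common measures in which errors do not cancel. All names ending in `_toy` are hypothetical.

```python
import numpy as np

# Made-up targets and deliberately poor predictions (not the salary data)
y_toy = np.array([1.0, 2.0, 3.0])
y_hat_toy = np.array([3.0, 2.0, 1.0])

# Signed errors: +2, 0 and -2
errors_toy = y_hat_toy - y_toy

# The positive and negative errors cancel in the plain sum
sum_error_toy = errors_toy.sum()

# Absolute and squared errors are never negative, so they cannot cancel
mae_toy = np.abs(errors_toy).mean()
mse_toy = (errors_toy ** 2).mean()

print(sum_error_toy, mae_toy, mse_toy)
```

Here the sum of errors is 0 even though two of the three predictions are off by 2, while the other two measures stay above 0.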

### 4. Update Parameters

**[4.1]** Define a reusable function that creates a chart displaying the input data and the predictions, and also prints the sum of errors

In [24]:
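def check_parameters(a, b, x, y):
  """Plot the data against the line a * x + b and print the sum of errors."""
  # Predictions of the univariate linear model
  y_hat = a * x + b

  # One dataframe for the observed points, one for the predicted line
  source = pd.DataFrame({'x': x, 'y': y})
  preds = pd.DataFrame({'x': x, 'y': y_hat})

  scatter_chart = alt.Chart(source).mark_circle(size=60, color='orange').encode(
    x='x',
    y='y'
  )

  line_chart = alt.Chart(preds).mark_line().encode(
    x='x',
    y='y'
  )

  # Signed differences between predictions and actual values
  errors = y_hat - y

  sum_error = errors.sum()

  print(sum_error)

  # Layer the predicted line on top of the scatter plot
  return scatter_chart + line_chart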

**[4.2]** Choose 2 values for the model parameters (`a` and `b`) and use the defined function to check the prediction results

In [25]:
# Placeholder for student's code

In [26]:
# Solution
check_parameters(a=100, b=0, x=x, y=y)

-2264150.0


**[4.3]** Choose 2 new values for the model parameters (`a` and `b`) and use the defined function to check the prediction results

In [27]:
# Placeholder for student's code

In [28]:
# Solution
check_parameters(a=5000, b=0, x=x, y=y)

-1483090.0


**[4.4]** Choose 2 new values for the model parameters (`a` and `b`) and use the defined function to check the prediction results

In [29]:
# Placeholder for student's code

In [30]:
# Solution
check_parameters(a=10000, b=0, x=x, y=y)

-686090.0


**[4.5]** Choose 2 new values for the model parameters (`a` and `b`) and use the defined function to check the prediction results

In [31]:
# Placeholder for student's code

In [32]:
# Solution
check_parameters(a=10000, b=10000, x=x, y=y)

-386090.0
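The totals printed so far can be checked by hand, because the sum of errors expands to `a * sum(x) + n * b - sum(y)`. The sketch below takes `n = 30` from the length of `y_hat` and `sum(y) = 2280090` from the total printed with `a = 0` and `b = 0`. It infers `sum(x)` from the total printed with `a = 100`, so that value is derived from the outputs above rather than read from the dataset. The function name `sum_error_formula` and the constant names are hypothetical.

```python
# Values taken from the outputs printed earlier in this exercise
N = 30                # number of rows (length of y_hat)
SUM_Y = 2280090.0     # sum of errors was -2280090 when a = 0 and b = 0

# Inferred from the a = 100, b = 0 run: 100 * sum(x) - sum(y) = -2264150
SUM_X = (-2264150.0 + SUM_Y) / 100


def sum_error_formula(a, b):
    """Sum of (a * x + b - y) over all rows, written with the totals only."""
    return a * SUM_X + N * b - SUM_Y


# Parameter pairs tried so far
for a, b in [(0, 0), (100, 0), (5000, 0), (10000, 0), (10000, 10000)]:
    print(a, b, round(sum_error_formula(a, b), 1))
```

Each unit added to `a` shifts the total by `sum(x)`, and each unit added to `b` shifts it by `n`. This is why raising `b` from 0 to 10000 moved the total by 300000.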


**[4.6]** Choose 2 new values for the model parameters (`a` and `b`) and use the defined function to check the prediction results

In [33]:
# Placeholder for student's code

In [34]:
# Solution
check_parameters(a=10000, b=20000, x=x, y=y)

-86090.0


**[4.7]** Choose 2 new values for the model parameters (`a` and `b`) and use the defined function to check the prediction results

In [35]:
# Placeholder for student's code

In [36]:
# Solution
check_parameters(a=10000, b=25000, x=x, y=y)

63910.0


**[4.8]** Choose 2 new values for the model parameters (`a` and `b`) and use the defined function to check the prediction results

In [37]:
# Placeholder for student's code

In [38]:
# Solution
check_parameters(a=9500, b=25000, x=x, y=y)

-15790.0


**[4.9]** Choose 2 new values for the model parameters (`a` and `b`) and use the defined function to check the prediction results

In [39]:
# Placeholder for student's code

In [40]:
# Solution
check_parameters(a=9450, b=25800, x=x, y=y)

240.0
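A sum of errors close to 0 is also what a least-squares fit with an intercept produces, although the reverse does not hold: many `(a, b)` pairs give a total of 0 without being the best fit. As a point of comparison, the sketch below computes the closed-form least-squares slope and intercept on the ten rows shown in the dataset preview. It does not use the full 30-row dataset, so the resulting values will differ from the hand-tuned ones above. Names with the `_preview` and `_ls` suffixes are hypothetical.

```python
import numpy as np

# The ten rows shown in the dataset preview (a subset of the full data)
x_preview = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7])
y_preview = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0,
                      56642.0, 60150.0, 54445.0, 64445.0, 57189.0])

x_mean = x_preview.mean()
y_mean = y_preview.mean()

# Closed-form least-squares solution for a line with an intercept
a_ls = ((x_preview - x_mean) * (y_preview - y_mean)).sum() / ((x_preview - x_mean) ** 2).sum()
b_ls = y_mean - a_ls * x_mean

# Signed errors of the fitted line, computed as in step 3.8
errors_ls = a_ls * x_preview + b_ls - y_preview

print(a_ls, b_ls, errors_ls.sum())
```

The printed sum is 0 up to floating-point rounding, because the least-squares line always passes through the point of means `(x_mean, y_mean)`.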
