# **Lab 1: Linear Regression**

During the first lecture, you were introduced to univariate and multivariate linear regression. In this lab, we will see how to train such models using scikit-learn (sklearn).


## Exercise 2: Sklearn

We are going to use a dataset from the World Bank that contains some key indicators for Australia (https://data.worldbank.org/country/AU).

We will be loading the dataset from the following link:
https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab01/ex2/WorldBank_OpenData_Australia.csv

We will build a univariate linear regression model that predicts the "CO2 emissions (kt)" in Australia from the population feature.
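
For reference, a univariate linear regression model can be written as below. The symbols are notation chosen here for illustration: $x$ stands for the population, $\hat{y}$ for the predicted CO2 emissions (kt), and $\beta_0$, $\beta_1$ for the intercept and slope learned from the data.

$$
\hat{y} = \beta_0 + \beta_1 x
$$

Ordinary least squares, which is what sklearn's `LinearRegression` uses, picks the intercept and slope that minimise the sum of squared errors over the $n$ observations:

$$
\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
$$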

The steps are:
1.   Load and explore dataset
2.   Visualise relationship between target and feature
3.   Data Preparation
4.   Split Datasets
5.   Assess Baseline Performance
6.   Train Univariate Linear Regression Model
7.   Analyse Model's Performance

---
### 0. Setup Environment

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
# Do not modify this code

import os
import sys
from pathlib import Path

COURSE = "36106"
LAB = "lab01"
EXE = "ex02"
DATA = "data"

lab_path = f"{COURSE}/labs/{LAB}/{EXE}"
root_path = "./"

if os.getenv("COLAB_RELEASE_TAG"):

    from google.colab import drive
    from pathlib import Path

    print("###### Install required Python packages ######")
    ! pip install -r https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/requirements.txt

    print("\n###### Connect to personal Google Drive ######")
    drive.mount("/content/gdrive")

    root_path = "/content/gdrive/MyDrive/"

print("\n###### Setting up folders ######")
folder_path = f"{root_path}/{lab_path}/"
root_path = Path(folder_path)
root_path.mkdir(parents=True, exist_ok=True)
print(f"\nYou can now save your data files in:")
%cd {folder_path}

###### Install required Python packages ######

###### Connect to personal Google Drive ######
Mounted at /content/gdrive

###### Setting up folders ######

You can now save your data files in:
/content/gdrive/MyDrive/36106/labs/lab01/ex02


### 1. Load and Explore Dataset

**[1.1]** Import the pandas package

In [None]:
import pandas as pd

**[1.2]** Create a variable called `file_url` containing the link to the CSV file

In [None]:
file_url = "https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab01/ex2/WorldBank_OpenData_Australia.csv"

**[1.3]** Load the data into a DataFrame called `df`

In [None]:
df = pd.read_csv(file_url)

**[1.4]** Display the data of `df`



In [None]:
df.head()

Unnamed: 0,Year,Population,Life_expectancy,Urban_population,CO2,Land_area,GDP,Gross_savings,Interest_rate
0,1960,10276477,70.817073,54.199946,88202.351,7682300.0,18577668272,31.185194,
1,1961,10483000,70.973171,54.615272,90589.568,7682300.0,19652816665,30.63597,2.20744
2,1962,10742000,70.942439,54.662232,94912.961,7682300.0,19892485161,28.926923,5.579625
3,1963,10950000,70.911707,54.998932,101029.517,7682300.0,21507447643,29.92085,3.273874
4,1964,11167000,70.880976,55.317928,108979.573,7682300.0,23764139321,31.226731,1.731211
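
Beyond `head()`, a few other pandas calls are commonly used at the exploration stage. The sketch below is only an illustration: it rebuilds the first five rows shown above (a subset of the columns) into a DataFrame named `sample`, a name introduced here, so that it runs without downloading the file. In the lab you would apply the same calls to `df`.

```python
import pandas as pd

# First five rows copied from the df.head() output above (subset of columns)
sample = pd.DataFrame({
    "Year": [1960, 1961, 1962, 1963, 1964],
    "Population": [10276477, 10483000, 10742000, 10950000, 11167000],
    "CO2": [88202.351, 90589.568, 94912.961, 101029.517, 108979.573],
    "Interest_rate": [float("nan"), 2.20744, 5.579625, 3.273874, 1.731211],
})

# Dimensions as (rows, columns)
print(sample.shape)

# Number of missing values per column
missing_counts = sample.isna().sum()
print(missing_counts)

# Summary statistics for the numeric columns
print(sample.describe())
```

In these five rows, only `Interest_rate` has a missing value (for 1960).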


### 2. Visualise relationship between target and feature

**[2.1]** Import the altair package

In [None]:
import altair as alt

**[2.2]** Create a scatter plot with `Population` as the x-axis and `CO2` as the y-axis

In [None]:
scatter_chart = alt.Chart(df).mark_point().encode(
    x='Population',
    y='CO2'
)


### 3. Data Preparation

**[3.1]** Drop any rows with missing values in the `CO2` or `Population` columns and save the result in a new DataFrame called `df_cleaned`

In [None]:
df_cleaned = df.dropna(how='any', subset=['CO2', 'Population'])
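
To see what the `subset` argument does, here is a minimal sketch on a toy DataFrame. The name `toy` and the missing CO2 value in the middle row are hypothetical, added only to illustrate the behaviour; the other values come from the rows displayed earlier.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Population": [10276477, 10483000, 10742000],
    "CO2": [88202.351, np.nan, 94912.961],        # hypothetical gap in the middle row
    "Interest_rate": [np.nan, 2.20744, 5.579625],  # real gap from the 1960 row
})

# Only CO2 and Population are checked for missing values
toy_cleaned = toy.dropna(how="any", subset=["CO2", "Population"])
print(toy_cleaned)
```

The middle row is dropped because its `CO2` value is missing. The first row is kept even though `Interest_rate` is missing, since that column is not listed in `subset`.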

### 4. Split Datasets

Note: You can download the content of `df_cleaned` here: https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab01/ex2/df_cleaned.csv

**[4.1]** Extract the feature and target respectively into 2 variables called `X` and `y`

In [None]:
X = df_cleaned['Population']
y = df_cleaned['CO2']

### 5. Assess Baseline Performance

**[5.1]** Import the DummyRegressor class from sklearn

In [None]:
from sklearn.dummy import DummyRegressor

**[5.2]** Instantiate the DummyRegressor class into a variable called `base_reg`

In [None]:
base_reg = DummyRegressor(strategy='mean')

**[5.3]** Fit the model with X and y

In [None]:
base_reg.fit(X, y)

**[5.4]** Save the predictions into a variable called `y_preds`

In [None]:
y_preds = base_reg.predict(X)

**[5.5]** Calculate the sum of errors between the target variable and the average value

In [None]:
(y_preds - y).sum()

-8.731149137020111e-10
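
A result this close to zero is expected: when every prediction is the mean of `y`, the positive and negative errors cancel out exactly, apart from floating-point rounding. The sum of signed errors therefore says little about how far the predictions are from the actual values. The sketch below illustrates this on the first five rows of the dataset and also computes the mean absolute error for comparison. The names `X_demo`, `y_demo` and `baseline` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor

# First five observations, copied from the df.head() output above
X_demo = pd.DataFrame({"Population": [10276477, 10483000, 10742000, 10950000, 11167000]})
y_demo = pd.Series([88202.351, 90589.568, 94912.961, 101029.517, 108979.573], name="CO2")

# Baseline that always predicts the mean of the target
baseline = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
baseline_preds = baseline.predict(X_demo)

errors = baseline_preds - y_demo

# Signed errors cancel out, so their sum is (numerically) zero
sum_of_errors = errors.sum()

# Absolute errors do not cancel, so their mean reflects the typical miss
mean_abs_error = errors.abs().mean()

print(sum_of_errors)
print(mean_abs_error)
```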

### 6. Train Univariate Linear Regression Model

**[6.1]** Import the LinearRegression class from sklearn

In [None]:
from sklearn.linear_model import LinearRegression

**[6.2]** Instantiate the LinearRegression class into a variable called `reg`

In [None]:
reg = LinearRegression()

**[6.3]** Fit the model with the prepared data

In [None]:
# Double brackets return a 2D DataFrame, which LinearRegression expects for the features
X = df_cleaned[['Population']]
reg.fit(X, y)
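
`X` is redefined with double brackets here because `LinearRegression.fit` expects a 2D feature matrix with one row per observation and one column per feature. The sketch below shows the difference in shape on the first five rows of the dataset. The names `df_demo`, `X_series` and `X_frame` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# First five observations, copied from the df.head() output above
df_demo = pd.DataFrame({
    "Population": [10276477, 10483000, 10742000, 10950000, 11167000],
    "CO2": [88202.351, 90589.568, 94912.961, 101029.517, 108979.573],
})
y_demo = df_demo["CO2"]

X_series = df_demo["Population"]    # single brackets: 1D Series
X_frame = df_demo[["Population"]]   # double brackets: 2D DataFrame

print(X_series.shape)  # (5,)
print(X_frame.shape)   # (5, 1)

# Fitting on the 1D Series is rejected with a ValueError
try:
    LinearRegression().fit(X_series, y_demo)
    fit_1d_failed = False
except ValueError:
    fit_1d_failed = True

# Fitting on the 2D DataFrame works
model = LinearRegression().fit(X_frame, y_demo)
```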


### 7. Analyse Model's Performance


**[7.1]** Use your trained model to predict the outcome for all observations and save them into a variable called `y_preds`



In [None]:
y_preds = reg.predict(X)
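
For a univariate model, `predict` simply applies the fitted line: intercept plus slope times the feature. The sketch below checks this on the first five rows of the dataset by reading the fitted parameters from `coef_` and `intercept_`. The names `X_demo`, `y_demo` and `model` are introduced here for illustration, and the fitted values will differ from those of `reg`, which is trained on the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# First five observations, copied from the df.head() output above
X_demo = pd.DataFrame({"Population": [10276477, 10483000, 10742000, 10950000, 11167000]})
y_demo = pd.Series([88202.351, 90589.568, 94912.961, 101029.517, 108979.573], name="CO2")

model = LinearRegression().fit(X_demo, y_demo)

slope = model.coef_[0]        # one coefficient per feature
intercept = model.intercept_  # constant term

# Recompute the predictions by hand from the fitted line
manual_preds = intercept + slope * X_demo["Population"].to_numpy()
model_preds = model.predict(X_demo)

print(slope, intercept)
print(np.allclose(manual_preds, model_preds))
```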

**[7.2]** Display the content of `y_preds`

In [None]:
y_preds

array([ 94564.4791406 ,  99713.01343603, 106169.77797322, 111355.13327722,
       116764.85491649, 122274.29492699, 128830.77783541, 132520.35757095,
       137755.57206057, 144087.68863372, 150170.5092788 , 160890.23418611,
       166873.33645996, 171934.04379992, 180484.89413296, 184722.92491027,
       188213.06790335, 192176.87315977, 196315.18556585, 200204.20204385,
       204641.66956362, 210500.12387343, 216757.45166816, 221519.00389443,
       225881.68263577, 231216.61549662, 237708.28146374, 243828.49649803,
       250517.10624834, 257552.23733867, 263802.08625556, 269259.17412116,
       274519.31820359, 278807.20816651, 283493.97161436, 288903.69325363,
       294861.86593467, 299997.36205305, 304833.70305775, 310193.5655114 ,
       315852.58307875, 322334.27720875, 328277.4921341 , 334360.31277918,
       340143.97831057, 346810.15142734, 354366.31100736, 357599.67919451,
       368109.99552224, 379141.34033964, 387618.6483739 , 395303.7936671 ,
       405112.11759095, 4

**[7.3]** Calculate the sum of errors between the target variable and the predictions

In [None]:
(y_preds - y).sum()

-3.245077095925808e-09
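
As with the baseline, the sum of signed errors is close to zero: a least-squares line fitted with an intercept always has residuals that sum to zero, so this quantity cannot tell the two models apart. Metrics that do not let errors cancel are more informative. The sketch below compares the mean baseline and the linear regression on the first five rows of the dataset using RMSE and R² from `sklearn.metrics`. The variable names are introduced here for illustration, and the values will differ from those obtained on the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# First five observations, copied from the df.head() output above
X_demo = pd.DataFrame({"Population": [10276477, 10483000, 10742000, 10950000, 11167000]})
y_demo = pd.Series([88202.351, 90589.568, 94912.961, 101029.517, 108979.573], name="CO2")

base_preds = DummyRegressor(strategy="mean").fit(X_demo, y_demo).predict(X_demo)
reg_preds = LinearRegression().fit(X_demo, y_demo).predict(X_demo)

# Root mean squared error: typical size of an error, in the target's unit (kt)
rmse_base = np.sqrt(mean_squared_error(y_demo, base_preds))
rmse_reg = np.sqrt(mean_squared_error(y_demo, reg_preds))

# R squared: share of the target's variance explained by the model
r2_base = r2_score(y_demo, base_preds)
r2_reg = r2_score(y_demo, reg_preds)

print(f"Baseline  RMSE={rmse_base:.1f}  R2={r2_base:.3f}")
print(f"LinearReg RMSE={rmse_reg:.1f}  R2={r2_reg:.3f}")
```

A model is only useful if it beats the baseline on such metrics. Here the regression has a lower RMSE than the mean predictor, whose R² is zero by construction.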

**[7.4]** Create a dataframe called `preds` that will contain the values of `X` and `y_preds` with the respective columns names: `Population` and `y`

In [None]:
# Solution
# Copy X so that adding a column does not modify X or trigger a SettingWithCopyWarning
preds = X.copy()
preds['y'] = y_preds
preds


Unnamed: 0,Population,y
0,10276477,94564.479141
1,10483000,99713.013436
2,10742000,106169.777973
3,10950000,111355.133277
4,11167000,116764.854916
5,11388000,122274.294927
6,11651000,128830.777835
7,11799000,132520.357571
8,12009000,137755.572061
9,12263000,144087.688634


**[7.5]** Create a line chart for the predictions:

In [None]:
line_chart = alt.Chart(preds).mark_line().encode(
    x='Population',
    y='y'
  )

**[7.6]** Display the combined scatter plot and line chart to compare predictions against actuals

In [None]:
scatter_chart + line_chart