# Regression Exercises

This exercise uses the [students' scores in the Portuguese subject](../dataset/student-por.csv) dataset. To complete it, please refer to the [regression](regression.ipynb) notebook.

```python
# load libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
```

```python
studentPorDf = pd.read_csv('../dataset/student-por.csv')
studentPorDf.head()
```

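| Unnamed: 0 | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |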


## Exercise 1
Display the basic information and the statistical summary of the dataset.

<details>
  <summary>Click for answer</summary>

  ```python
  # display basic info
  studentPorDf.info()

  # statistical summary with 2 decimal places
  studentPorDf.describe().round(2)
  ```
</details>

## Exercise 2
Group the dataset by `famsize` and `Fjob`, then calculate the following aggregates for each column:
- **`absences`**: `sum`, `max`
- **`studytime`**: `sum`, `mean`, `max`
- **`G1`**: `min`, `mean`, `max`, `std`
- **`G2`**: `min`, `mean`, `max`, `std`
- **`G3`**: `min`, `mean`, `max`, `std`

<details>
  <summary>Click for answer</summary>

  ```python
  studentPorDf.groupby(['famsize', 'Fjob']).agg({'absences': ['sum', 'max'], 'studytime': ['sum', 'mean', 'max'],
                                                 'G1': ['min', 'mean', 'max', 'std'], 'G2': ['min', 'mean', 'max', 'std'],
                                                 'G3': ['min', 'mean', 'max', 'std']}).round(2)
  ```
</details>

## Exercise 3
Calculate the correlation matrix of the numeric variables.
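
As a reminder of what each entry of such a matrix represents, here is a minimal pure-Python sketch of the Pearson correlation coefficient, which is the default method of pandas' `DataFrame.corr()`. The function name `pearson_r` and the toy values are illustrative assumptions and are not taken from the dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # covariance numerator and the two sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# toy values (hypothetical, not taken from student-por.csv)
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
print(round(pearson_r(hours, score), 3))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value close to 0 indicates a weak one.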

<details>
  <summary>Click for answer</summary>

  ```python
  stPor_correlation = studentPorDf.select_dtypes(include=['number']).corr().round(3)
  stPor_correlation
  ```
</details>

## Exercise 4
Plot a correlation heatmap of the numeric variables.

<details>
  <summary>Click for answer</summary>

  ```python
  plt.figure(figsize=(12, 6))
  sns.heatmap(stPor_correlation, annot=True, cmap='coolwarm', fmt='.2f')
  ```
</details>

## Exercise 5
1. Select three features, other than **G1** and **G2**, that you expect to contribute to the prediction performance.
2. Use **G3** as the target to be predicted.
3. Then, split the dataset into `train` and `test` sets with proportions of **80%** and **20%**, respectively.

<details>
  <summary>Click for answer</summary>

  ```python
  # do not forget to import the following library
  from sklearn.model_selection import train_test_split

  X = studentPorDf[['Medu', 'Fedu', 'studytime']]  # predictor features; you may try other features
  y = studentPorDf['G3'].to_numpy().reshape(-1, 1)  # target to predict

  # split dataset
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  ```
</details>

## Exercise 6
1. Create a linear regression model and fit it using the train set.
2. Make predictions using the test set.
3. Evaluate the model performance using the `MAE`, `MAPE`, `MSE`, `RMSE`, and `R²` metrics.
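
For reference, the metrics listed above can be written out by hand. The following is a minimal pure-Python sketch of their definitions, not the scikit-learn implementation; the helper name `regression_metrics` and the toy values are assumptions made for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MAPE (%), MSE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the true value, so it is undefined when any y_true is 0
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MAPE": mape, "MSE": mse, "RMSE": rmse, "R2": r2}

# toy values (hypothetical, not results from the dataset)
y_true = [10, 12, 14, 16]
y_pred = [11, 12, 13, 18]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Lower is better for the error metrics, while `R²` is better the closer it gets to 1.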

<details>
  <summary>Click for answer</summary>

  ```python
  # load the model and the metrics for evaluation
  from sklearn.linear_model import LinearRegression
  # note: root_mean_squared_error requires scikit-learn 1.4 or newer
  from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error, r2_score

  # fit the linear regression model with multiple features
  reg_model = LinearRegression()
  reg_model.fit(X_train, y_train)

  # predict using test dataset
  y_pred = reg_model.predict(X_test)

  # evaluate the model
  mae = mean_absolute_error(y_test, y_pred)
  mse = mean_squared_error(y_test, y_pred)
  rmse = root_mean_squared_error(y_test, y_pred)
  r2 = r2_score(y_test, y_pred)

  print(f"MAE: {mae:.4f}")
  # approximate MAPE as MAE relative to the mean target; the exact per-sample
  # MAPE becomes extremely large if any true G3 value is 0
  print(f"MAPE (approx.): {mae / y_test.mean() * 100:.2f}%")
  print(f"MSE: {mse:.4f}")
  print(f"RMSE: {rmse:.4f}")
  print(f"R² Score: {r2:.4f}")

  # coefficient of determination (R²) on the train set, for comparison
  score = reg_model.score(X_train, y_train)
  print(f'Coefficient of determination (train set): {score:.4f}')
  ```
</details>

## Exercise 7
Plot the comparison of `test` and `predicted` values.

<details>
  <summary>Click for answer</summary>

  ```python
  # plot true values against predicted values of the G3 score
  plt.scatter(y_test, y_pred)
  # dashed diagonal marks perfect predictions (predicted == true)
  max_val = max(y_test.max(), y_pred.max())
  plt.plot([0, max_val], [0, max_val], '--k')
  plt.xlabel('True G3 score')
  plt.ylabel('Predicted G3 score')
  plt.show()
  ```
</details>

## 🧠 Does your model perform well?

If you don't think so, improve the model!
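
One possible way to judge this is to compare the model's error against a naive baseline that always predicts the mean of the training targets. The sketch below uses made-up numbers and a hypothetical helper `mae`; substitute your own `y_train`, `y_test`, and `y_pred` to apply it.

```python
def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# toy values (hypothetical, not results from the dataset)
y_train = [10, 12, 11, 13, 14]
y_test = [11, 15, 9]
y_pred = [12, 13, 10]  # stand-in for the model's predictions

# baseline: always predict the mean of the training targets
baseline_value = sum(y_train) / len(y_train)
baseline_pred = [baseline_value] * len(y_test)

model_mae = mae(y_test, y_pred)
baseline_mae = mae(y_test, baseline_pred)
print(f"model MAE: {model_mae:.2f}, baseline MAE: {baseline_mae:.2f}")
print("model beats baseline" if model_mae < baseline_mae else "model does not beat baseline")
```

If the model barely beats this baseline, the selected features probably carry little information about **G3**. scikit-learn's `DummyRegressor(strategy="mean")` provides the same baseline if you prefer to stay within the library.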