# **Lab1: Regression**
In *lab 1*, you need to finish:

1.  Basic Part: Implement the regression model to predict people's grip force from their weight.
You can use either Matrix Inversion or Gradient Descent.


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts grip force differently from the basic part (for example, with more variables).




---
# 1. Basic Part (50%)
In the first part, you need to implement a regression model to predict grip force.

Please save the prediction result in a CSV file and submit it to Kaggle.

### Import Packages

> Note: You **cannot** import any other package


In [57]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

### Global attributes
Define the global attributes\
You can also add your own global attributes here

In [58]:
training_dataroot = 'lab1_basic_training.csv' # Training data file named as 'lab1_basic_training.csv'
testing_dataroot = 'lab1_basic_testing.csv'   # Testing data file named as 'lab1_basic_testing.csv'
output_dataroot = 'lab1_basic.csv' # Output file will be named as 'lab1_basic.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be a list with 100 elements

### Load the Input File
First, load the basic input files **lab1_basic_training.csv** and **lab1_basic_testing.csv**.

The input data are stored in *training_datalist* and *testing_datalist*.

In [59]:
# Read input csv to datalist (pd.read_csv opens the file itself, so no open() wrapper is needed)
training_datalist = pd.read_csv(training_dataroot).to_numpy()
testing_datalist = pd.read_csv(testing_dataroot).to_numpy()

### Implement the Regression Model

> Note: We recommend using the functions defined below, but you can also define your own functions.

#### Step 1: Split Data
Split the data in *training_datalist* into a training dataset and a validation dataset.
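
The split can be tried out on a small toy array before it is used on the real data. The sketch below is only an illustration: `split_rows`, its `seed` parameter, and the toy array are hypothetical names that are not part of the lab template. It shuffles row indices with a seeded generator so that the split is reproducible.

```python
import numpy as np

def split_rows(data, split_ratio, seed=0):
    """Hypothetical helper: shuffle rows reproducibly, then split them."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)        # seeded generator for repeatable shuffles
    order = rng.permutation(len(data))       # shuffled row indices
    split_index = int(len(data) * split_ratio)
    return data[order[:split_index]], data[order[split_index:]]

toy = np.arange(20).reshape(10, 2)           # 10 rows, 2 columns
train, val = split_rows(toy, 0.8)
print(train.shape, val.shape)
```

Every row ends up in exactly one of the two parts, so stacking them back together recovers the original rows in a different order.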


In [60]:
def SplitData(data, split_ratio):
    """
    Splits the given dataset into training and validation sets based on the specified split ratio.

    Parameters:
    - data (numpy.ndarray): The dataset to be split. It is expected to be a 2D array where each row represents a data point and each column represents a feature.
    - split_ratio (float): The ratio of the data to be used for training. For example, a value of 0.8 means 80% of the data will be used for training and the remaining 20% for validation.

    Returns:
    - training_data (numpy.ndarray): The portion of the dataset used for training.
    - validation_data (numpy.ndarray): The portion of the dataset used for validation.

    """
    data = np.array(data) # Copy into a new numpy array so the caller's array is not shuffled
    np.random.shuffle(data) # Shuffle the rows in place
    split_index = int(len(data) * split_ratio) # Number of rows that go to the training set
    training_data = data[:split_index] # Rows data[0] to data[split_index-1]
    validation_data = data[split_index:] # Rows data[split_index] to the last row

    return training_data, validation_data

#### Step 2: Preprocess Data
Handle unreasonable data and missing data

> Hint 1: Outliers and missing data can be addressed by either removing them or replacing them using statistical methods (e.g., the mean of all data).

> Hint 2: Missing data are represented as `np.nan`, so functions like `np.isnan()` can be used to detect them.

> Hint 3: Methods such as the Interquartile Range (IQR) can help detect outliers
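
As a rough illustration of Hint 3, the sketch below flags values that fall outside the usual IQR fences. The helper name `iqr_outlier_mask`, the factor `k=1.5`, and the sample weights are assumptions made for this example, not part of the lab template.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Hypothetical helper: True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    # Quartiles are computed while ignoring missing entries
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN evaluate to False, so missing values are not flagged here
    return (values < lower) | (values > upper)

weights = np.array([60.0, 62.0, 58.0, 61.0, 59.0, 300.0, np.nan])  # made-up sample
mask = iqr_outlier_mask(weights)
print(mask)
```

The flagged entries can then be removed or replaced, for example with the column mean, as Hint 1 suggests.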

In [61]:
def PreprocessData(data):
    """
    Preprocess the given dataset and return the result.

    Parameters:
    - data (numpy.ndarray): The dataset to preprocess. It is expected to be a 2D array where each row represents a data point and each column represents a feature.

    Returns:
    - preprocessedData (numpy.ndarray): Preprocessed data.
    """
    # Change to pandas.DataFrame if it is numpy.array
    if isinstance(data, np.ndarray):
        preprocessData = pd.DataFrame(data)
    else:
        preprocessData = data.copy()

    # Replace missing values with the mean of their respective columns
    preprocessData.fillna(preprocessData.mean(), inplace=True)

    # Handle outliers
    for column in preprocessData.columns:
        # Calculate the mean and standard deviation
        mean = preprocessData[column].mean()
        std = preprocessData[column].std()
        
        # Define the threshold for outliers
        threshold = 3 * std
        
        # Identify outliers
        is_outlier = (preprocessData[column] - mean).abs() > threshold
        
        # Replace outliers with the mean
        preprocessData.loc[is_outlier, column] = mean

    return preprocessData.values


#### Step 3: Implement Regression
You have to use Gradient Descent to finish this part
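
For reference, the quantities that gradient descent works with can be written out as follows. This is a sketch that assumes the loss is the mean squared error over $m$ data points, with feature weights $w_1$, intercept $w_0$, and learning rate $\eta$; the symbol $\eta$ is introduced here only for illustration.

```latex
\hat{y} = X w_1 + w_0, \qquad
L(w_1, w_0) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2

\nabla_{w_1} L = -\frac{2}{m} X^{\top} (y - \hat{y}), \qquad
\frac{\partial L}{\partial w_0} = -\frac{2}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)

w_1 \leftarrow w_1 - \eta \, \nabla_{w_1} L, \qquad
w_0 \leftarrow w_0 - \eta \, \frac{\partial L}{\partial w_0}
```

Each iteration evaluates the two gradients at the current weights and then takes one step against them.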

In [62]:
def Regression(dataset):
    """
    Performs regression on the given dataset and return the coefficients.

    Parameters:
    - dataset (numpy.ndarray): A 2D array where each row represents a data point.

    Returns:
    - w (numpy.ndarray): The coefficients of the regression model. For example, y = w[0] + w[1] * x + w[2] * x^2 + ...
    """
    # Get X and y from dataset
    X = dataset[:, : -1]
    y = dataset[:, -1]

    # TODO: Decide on the degree of the polynomial
    # degree = 2  # For example, quadratic regression

    # # Add polynomial features to X
    # X_poly = np.ones((X.shape[0], 1))  # Add intercept term (column of ones)
    # for d in range(1, degree + 1):
    #     X_poly = np.hstack((X_poly, X ** d))  # Add x^d terms to feature matrix

    # Initialize the feature weights and the intercept to zero
    num_dimensions = X.shape[1]  # Number of feature columns (the intercept is kept separately in w0)
    w1 = np.zeros(num_dimensions)
    w0 = np.zeros(1)

    # Hyperparameters
    num_iteration = 100000
    learning_rate = 0.00001
    prev_loss = float('inf')
    prev_gradient_length = float('inf')

    # Gradient Descent
    m = len(y)  # Number of data points
    for iteration in range(num_iteration):
        # TODO: Prediction using current weights and compute error
        y_pred = X.dot(w1) + w0
        error = y - y_pred

        # TODO: Compute gradient
        gradient_w0 = -2/m * np.sum(error)
        gradient = -2/m * X.T.dot(error) # gradient of w1
        loss = np.sum(error ** 2) / m

        # Stop early if the loss increased, the gradient is nearly zero,
        # or the gradient norm has stopped changing
        gradient_length = np.linalg.norm(gradient)
        if (loss > prev_loss
                or gradient_length < 0.001
                or abs(gradient_length - prev_gradient_length) < 0.00001):
            break
        prev_loss = loss
        prev_gradient_length = gradient_length

        # Update the weights
        w1 = w1 - learning_rate * gradient
        w0 = w0 - learning_rate * gradient_w0

        # Print progress every 1000 iterations (cost is half the mean squared error)
        if iteration % 1000 == 0:
            cost = np.mean(error ** 2) / 2
            print(f"Iteration {iteration}, Cost: {cost}, w1:{w1}, loss:{loss}, gradient:{np.linalg.norm(gradient)}")
    # "noise" in the summary line is the intercept w0
    print(f"w:{w1}, loss:{prev_loss}, noise:{w0}, gradient:{gradient}")
    # Only the feature weights are returned; the intercept w0 is discarded
    return w1


#### Step 4: Make Prediction
Make prediction of testing dataset and store the value in *output_datalist*
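
If a model follows the convention `y = w[0] + w[1] * x + w[2] * x^2 + ...` described in the `Regression` docstring, prediction needs the powers of `x` and the intercept. The sketch below shows one possible way to do that for a single input variable; `predict_polynomial` and the sample coefficients are hypothetical and are not what the cell below implements.

```python
import numpy as np

def predict_polynomial(w, x):
    """Hypothetical helper: evaluate w[0] + w[1]*x + w[2]*x^2 + ... for each x."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float).ravel()
    # Columns are x^0, x^1, ..., x^(len(w)-1); the x^0 column carries the intercept
    powers = np.vander(x, N=len(w), increasing=True)
    return powers.dot(w)

preds = predict_polynomial([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
print(preds)
```

With `w = [1, 2, 3]`, the input `x = 2` gives `1 + 2*2 + 3*4 = 17`.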

In [63]:

def MakePrediction(w, test_dataset):
    """
    Predicts the output for a given test dataset using a regression model.

    Parameters:
    - w (numpy.ndarray): The feature weights returned by Regression, one per column of the test dataset.
    - test_dataset (numpy.ndarray): A 2D array where each row is a data point and each column is a feature.

    Returns:
    - numpy.ndarray: A 1D array of predicted values, one per row of the test dataset.
    """
    # Weighted sum of the features; no intercept term is added
    y_pred = test_dataset.dot(w)

    return y_pred


#### Step 5: Train Model and Generate Result

Use the above functions to train your model on training dataset, and predict the answer of testing dataset.

Save your predicted values in `output_datalist`

> Notice: **Remember to include the coefficients of your model in the report**
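
The validation step can be scored with the mean absolute percentage error (MAPE). The sketch below is a minimal illustration on made-up numbers: the `mape` helper, the toy validation array, and the weight `0.8` are assumptions, and the last column is taken to be the target, as in `Regression`.

```python
import numpy as np

def mape(y_true, y_pred):
    """Hypothetical helper: mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumes no true value is zero, otherwise the ratio is undefined
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Toy stand-in for a validation set: one feature column, then the target column
validation = np.array([[50.0, 40.0], [60.0, 50.0], [80.0, 64.0]])
w = np.array([0.8])                       # made-up weight for illustration
y_pred = validation[:, :-1].dot(w)        # same dot product MakePrediction uses
score = mape(validation[:, -1], y_pred)
print(f"Validation MAPE: {score:.2f}%")
```

A lower MAPE on the validation set suggests that the chosen hyperparameters generalize better to unseen data.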



In [64]:
# (1) Split data
TrainingData, ValidationData = SplitData(training_datalist, 0.8)

# (2) Preprocess data
Training = PreprocessData(TrainingData)
Validation = PreprocessData(ValidationData)

# (3) Train regression model
w = Regression(Training)

# (4) Predict validation dataset's answer, calculate MAPE comparing to the ground truth

# (5) Make prediction of testing dataset and store the values in output_datalist
output_datalist = MakePrediction(w, testing_datalist)
print(f"output_datalist : {output_datalist}")

Iteration 0, Cost: 1152.3912699600933, w1:[0.06141445], loss:2304.7825399201865, gradient:6141.445390318805
w:[0.67815982], loss:221.7539916253852, noise:[0.01261724], gradient:[-0.00067921]
output_datalist : [36.34936637 34.95235714 54.04933768 38.43809862 46.4539477  43.94475636
 38.92637369 43.06314859 42.92751663 41.63901297 47.40337144 62.72978338
 56.35508107 34.78959878 43.74130841 43.53786047 43.94475636 38.85855771
 37.63787003 45.70797189 56.96542491 45.02981207 57.50795277 55.81255322
 34.85741477 61.98380758 52.69301804 38.24821387 44.96199609 35.80683852
 56.21944911 46.72521162 43.53786047 43.47004449 51.60796233 39.14338483
 30.66638708 61.44127973 29.90684808 41.29993306 34.11143896 46.79302761
 54.11715367 40.41832529 34.92523075 53.37117786 38.24821387 36.34936637
 35.33212664 45.02981207 50.99761849 37.97694994 40.68958922 46.86084359
 42.04590886 49.8447468  40.55395726 53.64244179 35.8746545  57.64358473
 56.01600116 33.09419923 57.3723208  54.52404956 49.8447468  

### *Write the Output File*

Write the prediction to output csv and upload the file to Kaggle
> Format: 'Id', 'gripForce'


In [65]:
# Assume that output_datalist is a list (or 1d array) with length = 100

with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  writer.writerow(['Id', 'gripForce'])
  for i in range(len(output_datalist)):
    writer.writerow([i,output_datalist[i]])


# 2. Advanced Part (45%)
In the second part, you need to implement regression differently from the basic part to improve your grip force predictions. You must use more than two features.

You can choose either matrix inversion or gradient descent for this part
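
For comparison with gradient descent, the matrix-inversion route solves for the weights in closed form, `w = (X^T X)^(-1) X^T y`. The sketch below is one possible version: it uses `np.linalg.pinv` (a pseudo-inverse, swapped in here so that a singular `X^T X` does not raise an error) and adds a column of ones for the intercept. The helper name and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Hypothetical helper: least-squares weights, intercept first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that w[0] acts as the intercept
    X_aug = np.hstack((np.ones((X.shape[0], 1)), X))
    # The pseudo-inverse equals (X^T X)^(-1) X^T when X_aug has full column rank
    return np.linalg.pinv(X_aug).dot(y)

# Toy data generated from y = 1 + 2*x1 - 1*x2
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
w = fit_normal_equation(X, y)
print(w)
```

Unlike gradient descent, this approach has no learning rate or iteration count to tune, although it becomes costly when the number of features is very large.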

We have provided `lab1_advanced_training.csv` for your training

> Notice: Be cautious of the "gender" attribute, as it is represented by "F"/"M" rather than a numerical value.

Please save the prediction result in a CSV file and submit it to Kaggle

In [66]:
training_dataroot = 'lab1_advanced_training.csv' # Training data file named as 'lab1_advanced_training.csv'
testing_dataroot = 'lab1_advanced_testing.csv'   # Testing data file named as 'lab1_advanced_testing.csv'
output_dataroot = 'lab1_advanced.csv' # Output file will be named as 'lab1_advanced.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be a list with 3000 elements

In [67]:
# Read input csv to datalist (pd.read_csv opens the file itself, so no open() wrapper is needed)
training_datalist = pd.read_csv(training_dataroot).to_numpy()
testing_datalist = pd.read_csv(testing_dataroot).to_numpy()

In [68]:
# (1) Encode gender (column 1) as a number: 'F' -> 0.0, any other value -> 1.0, then split data
training_datalist[:, 1] = np.where(training_datalist[:, 1] == 'F', 0.0, 1.0)
testing_datalist[:, 1] = np.where(testing_datalist[:, 1] == 'F', 0.0, 1.0)
TrainingData, ValidationData = SplitData(training_datalist, 0.8)

# (2) Preprocess data
Training = PreprocessData(TrainingData)
Validation = PreprocessData(ValidationData)

# (3) Train regression model
w = Regression(Training)

# (4) Predict validation dataset's answer, calculate MAPE comparing to the ground truth

# (5) Make prediction of testing dataset and store the values in output_datalist
output_datalist = MakePrediction(w, testing_datalist)
print(f"output_datalist : {output_datalist}")

Iteration 0, Cost: 1157.7337787295444, w1:[0.03289787 0.00067201 0.15513614 0.06164369 0.02014554 0.07233999
 0.11986839], loss:2315.4675574590888, gradient:22126.281932586637
Iteration 1000, Cost: 77.45720173100827, w1:[-0.10672214  0.05029661  0.20822151  0.24749481 -0.50730779 -0.00572193
  0.08028704], loss:154.91440346201654, gradient:28.251497533091662
Iteration 2000, Cost: 75.7224216079323, w1:[-7.44286196e-02  8.49393959e-02  2.15176877e-01  2.46860612e-01
 -6.79376936e-01 -3.68377407e-04  8.94397180e-02], loss:151.4448432158646, gradient:10.653095853634827


Iteration 3000, Cost: 75.45299636518531, w1:[-0.06134661  0.11527976  0.21841669  0.24584133 -0.74168672  0.00663145
  0.08869305], loss:150.90599273037063, gradient:4.7470557268763045
Iteration 4000, Cost: 75.38140073943204, w1:[-0.0564171   0.14402314  0.21972082  0.24527494 -0.76413043  0.01128831
  0.08691545], loss:150.76280147886408, gradient:3.1455302239767957
Iteration 5000, Cost: 75.33778724841984, w1:[-0.05457618  0.17213978  0.22022031  0.24488231 -0.77188311  0.01382575
  0.08565532], loss:150.6755744968397, gradient:2.835009049941702
Iteration 6000, Cost: 75.29842576128345, w1:[-0.05390213  0.19998535  0.22040142  0.24455193 -0.77416447  0.01507035
  0.08492251], loss:150.5968515225669, gradient:2.780119216484651
w:[-0.05370253  0.22160505  0.22045059  0.24431068 -0.77443216  0.01554619
  0.08458673], loss:150.53662359264874, noise:[-0.01335584], gradient:[-0.01566332 -2.76766794 -0.00343833  0.03044537 -0.01131661 -0.04302964
  0.03395052]
output_datalist : [46.5926722149

# Save the Code File
Please save your code and submit it as an ipynb file! (**Lab1.ipynb**)

In [69]:
# Assume that output_datalist is a list (or 1d array) with length = 3000

with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  writer.writerow(['Id', 'gripForce'])
  for i in range(len(output_datalist)):
    writer.writerow([i,output_datalist[i]])
