# Supervised Learning Techniques
In this notebook, we work on supervised learning and visualization. Specifically, we build regression models to predict changes in heart rate from features extracted from a dataset of user workout records.

## Table of Contents
1. [Data Preparation and Feature Engineering](#)
1. [Model Training and Application](#)
1. [Model Evaluation and Visualization](#)
1. [Model Improvement with More Features (Multivariate Regression)](#)
1. [Model Comparison](#)

**View this notebook in:**

<a href="https://colab.research.google.com/github/tky1026/nus_it5100f/blob/main/notebook/assignment2.ipynb" target="_blank">
    <img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab&style=for-the-badge" alt="Open in Colab"/>
</a>

<a href="https://github.com/tky1026/nus_it5100f" target="_blank">
    <img src="https://img.shields.io/badge/GitHub-Pages-blue?logo=github&style=for-the-badge" alt="GitHub Pages"/>
</a>

<a href="https://github.com/tky1026/nus_it5100f/blob/main/notebook/assignment2.ipynb" target="_blank">
    <img src="https://img.shields.io/badge/View-Notebook-orange?logo=jupyter&style=for-the-badge" alt="View Notebook"/>
</a>

---

**Note:**

> This notebook is a continuation of the previous data preprocessing and exploratory data analysis steps. Before proceeding, it's important to have completed the previous steps, where we cleaned and transformed the Endomondo dataset. As a result of those steps, we saved the preprocessed data into a CSV file.
>
> The prerequisite for this notebook is the CSV file (`endomondo_proper_cleaned_expanded.csv`) generated during the earlier part of the project. It contains the cleaned dataset, to which we will now apply various supervised learning techniques.

## Setup

Before we dive into the supervised learning techniques, we need to prepare our environment. Here, we import the necessary libraries, mount Google Drive to access the dataset, and load the preprocessed data from the previous assignment.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

# Define the path to the csv file
file_path_to_csv = "/content/drive/My Drive/IT5100F Project/data/endomondo_proper_cleaned_expanded.csv"

# Load the data into DataFrame
endomondo_df = pd.read_csv(file_path_to_csv)

# Display the first few rows of the DataFrame
endomondo_df.head()

Mounted at /content/drive


| Unnamed: 0 | altitude | heart_rate | id | latitude | longitude | speed | sport | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | 41.6 | 100 | 396826535.0 | 60.173349 | 24.64977 | 6.8652 | bike | 2014-08-24 16:45:46 |
| 1 | 40.6 | 111 | 396826535.0 | 60.17324 | 24.650143 | 16.4736 | bike | 2014-08-24 16:45:54 |
| 2 | 40.6 | 120 | 396826535.0 | 60.17298 | 24.650911 | 19.1988 | bike | 2014-08-24 16:46:05 |
| 3 | 38.4 | 119 | 396826535.0 | 60.172478 | 24.650669 | 20.4804 | bike | 2014-08-24 16:46:18 |
| 4 | 37.0 | 120 | 396826535.0 | 60.171861 | 24.649145 | 31.3956 | bike | 2014-08-24 16:46:34 |
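
The `Unnamed: 0` column in the output above is what pandas typically produces when a DataFrame is saved with its index and then read back without declaring it. The following is a minimal sketch on a toy frame (the values and the names `toy`, `naive`, and `restored` are illustrative, not part of the project) showing how `index_col=0` avoids the extra column:

```python
import io
import pandas as pd

# Toy frame standing in for the preprocessed data (illustrative values only).
toy = pd.DataFrame({"heart_rate": [100, 111], "speed": [6.8652, 16.4736]})

# to_csv() writes the index by default, as a first column with an empty header.
csv_text = toy.to_csv()

# Reading it back without options turns that column into "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index restores the original layout.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Whether to drop the column or load it as the index is a judgment call; it only matters later if all numeric columns are passed to a model without selecting features explicitly.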


## 1 Data Preparation and Feature Engineering

In this section, we will perform data engineering tasks to prepare new features for supervised learning. Specifically, we will calculate differences for heart rate, speed, and altitude between consecutive rows for each user. Additionally, we will compute the elapsed time for each workout and handle any resulting NaN values.

- Note: Each user has only one workout session, so the per-user differences below never span two workouts.

### 1.1 Compute Heart Rate Difference

In [4]:
# Add a new column to the dataframe called `heart_rate_diff` that calculates the difference between the current heart rate and the previous heart rate for each user.
endomondo_df['heart_rate_diff'] = endomondo_df.groupby('id')['heart_rate'].diff()
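
To see why the difference is taken per `id` rather than over the whole column, here is a small sketch on made-up data (the frame `toy` and the column names `plain_diff` and `grouped_diff` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "heart_rate": [100, 111, 120, 90, 95],
})

# A plain diff ignores user boundaries: row 3 compares user 2 against user 1.
toy["plain_diff"] = toy["heart_rate"].diff()

# A grouped diff restarts at each id, so every user's first row is NaN.
toy["grouped_diff"] = toy.groupby("id")["heart_rate"].diff()

print(toy)
```

In the plain version, the first record of user 2 gets a spurious drop of 30 beats; in the grouped version it is `NaN`, which is handled separately later.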

### 1.2 Compute Speed Difference

In [5]:
# Add a new column to the dataframe called `speed_diff` that calculates the difference between the current speed and the previous speed for each user.
endomondo_df['speed_diff'] = endomondo_df.groupby('id')['speed'].diff()

### 1.3 Compute Altitude Difference

In [6]:
# Add a new column to the dataframe called `altitude_diff` that calculates the difference in altitude between the current and previous row for each user.
endomondo_df['altitude_diff'] = endomondo_df.groupby('id')['altitude'].diff()
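
All of these row-to-row differences assume the records of each workout are already in chronological order. The sketch below, on toy data with deliberately shuffled rows (the names `toy` and `toy_sorted` are hypothetical), shows one way to check and enforce that ordering before calling `diff()`:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1],
    "timestamp": pd.to_datetime([
        "2014-08-24 16:46:05",
        "2014-08-24 16:45:46",
        "2014-08-24 16:45:54",
    ]),
    "altitude": [40.6, 41.6, 40.6],
})

# The rows are out of order, so a diff here would mix up "previous" and "current".
print(toy["timestamp"].is_monotonic_increasing)

# Sort by id and then by time, and rebuild a clean index.
toy_sorted = toy.sort_values(["id", "timestamp"]).reset_index(drop=True)
toy_sorted["altitude_diff"] = toy_sorted.groupby("id")["altitude"].diff()

print(toy_sorted)
```

If the real data is already sorted, as the sample rows shown earlier suggest, the sort leaves the row order unchanged.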

### 1.4 Compute Time Elapsed

In [8]:
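# Add a new column to the dataframe called `time_elapsed`, the time in seconds between the start of the workout and the current record
endomondo_df['timestamp'] = pd.to_datetime(endomondo_df['timestamp'])
# The earliest timestamp of each id is taken as the start of that workout
workout_start = endomondo_df.groupby('id')['timestamp'].transform('min')
endomondo_df['time_elapsed'] = (endomondo_df['timestamp'] - workout_start).dt.total_seconds()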
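
"Time elapsed" can be read in two ways: the interval since the previous record, or the time since the workout began. The sketch below contrasts the two on toy timestamps (the column names `seconds_since_prev` and `seconds_since_start` are hypothetical and used only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1],
    "timestamp": pd.to_datetime([
        "2014-08-24 16:45:46",
        "2014-08-24 16:45:54",
        "2014-08-24 16:46:05",
    ]),
})

# Interval since the previous record: the first row of each id is NaN.
toy["seconds_since_prev"] = (
    toy.groupby("id")["timestamp"].diff().dt.total_seconds()
)

# Time since the workout started: the first row of each id is 0.
start = toy.groupby("id")["timestamp"].transform("min")
toy["seconds_since_start"] = (toy["timestamp"] - start).dt.total_seconds()

print(toy)
```

The two columns carry different information: the first describes the sampling interval, the second how far into the workout each record falls.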

### 1.5 Replace NaN Values
After computing differences, the first record for each user has `NaN` in the `heart_rate_diff`, `speed_diff`, and `altitude_diff` columns, because there is no previous record to compare against.

We replace these `NaN` values with `0`, treating the first record as showing no change. The same replacement is applied to `time_elapsed` so that every engineered column is free of missing values.

In [13]:
# Replace NaN values in the computed columns with 0
computed_cols = ['heart_rate_diff', 'speed_diff', 'altitude_diff', 'time_elapsed']
endomondo_df[computed_cols] = endomondo_df[computed_cols].fillna(0)

# Verify there are no more NaN values
for col in computed_cols:
    print(f"Number of NaN values in {col}: {endomondo_df[col].isna().sum()}")

Number of NaN values in heart_rate_diff: 0
Number of NaN values in speed_diff: 0
Number of NaN values in altitude_diff: 0
Number of NaN values in time_elapsed: 0
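
A quick sanity check that can be run before filling is to confirm that the number of `NaN` values in a difference column equals the number of distinct `id` values, since each user contributes exactly one first record. A minimal sketch on toy data (the frame and the variable names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "speed": [6.9, 16.5, 19.2, 10.0, 12.5],
})
toy["speed_diff"] = toy.groupby("id")["speed"].diff()

# Before filling: one NaN per id, coming from each user's first record.
n_nan_before = int(toy["speed_diff"].isna().sum())
n_ids = toy["id"].nunique()

# After filling: no NaN values remain.
toy["speed_diff"] = toy["speed_diff"].fillna(0)
n_nan_after = int(toy["speed_diff"].isna().sum())

print(n_nan_before, n_ids, n_nan_after)
```

If the count before filling were larger than the number of ids, the source column itself would contain missing values, and filling with `0` would hide them.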


## 2 Model Training and Application

## 3 Model Evaluation and Visualization

## 4 Model Improvement with More Features (Multivariate Regression)

## 5 Model Comparison
