# Supervised Learning Model: Human Activity Recognition

**Author:** Phong Trang Tran Thanh (Python version)
**Date:** August 7, 2025

## Overview

### Data Description
This project examines the Weight Lifting Exercise dataset provided by Velloso et al. (2013). Six participants wore accelerometers on the belt, forearm, arm, and dumbbell, and the recorded movements are used to determine whether each exercise was performed correctly. The label has 5 classes: A, B, C, D, and E. Only class A is the correct movement; the other classes correspond to common training mistakes.

* **Training Data:** [pml-training.csv](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv)
* **Testing Data:** [pml-testing.csv](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv)

### The Project Goal
This project aims to use the accelerometer data to build a supervised learning model that predicts whether a movement was performed correctly (class A) or with one of the common mistakes (classes B to E). Two models will be built and evaluated:
1.  Random Forest model
2.  Support Vector Machine (SVM)

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Set plot style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 7.25)

print("Libraries imported successfully!")

## 1. Data Loading and Preprocessing

In [None]:
# URLs for the datasets
url_train = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# Load the datasets using pandas
# The na_values parameter handles strings that should be treated as NaN
training_df = pd.read_csv(url_train, na_values=["", "NA", "#DIV/0!"])
testing_df = pd.read_csv(url_test, na_values=["", "NA", "#DIV/0!"])

print("Dimension of the training dataset:", training_df.shape)
print("Dimension of the testing dataset:", testing_df.shape)
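
To illustrate what the `na_values` argument does, here is a minimal sketch on a hypothetical four-row CSV (the column names `sensor_reading` and `summary_stat` and all values are made up for illustration). pandas already treats empty fields and `NA` as missing by default, so the entry that matters most is `#DIV/0!`, a spreadsheet-style error string that would otherwise cause the column to be read as text.

```python
import io

import pandas as pd

# Hypothetical CSV text (not real project data).
csv_text = (
    "sensor_reading,summary_stat,classe\n"
    "1.41,-1.2,A\n"
    "1.42,#DIV/0!,A\n"
    "1.48,NA,B\n"
    "1.45,,C\n"
)

# Without the extra marker, "#DIV/0!" survives as a string and the column is not numeric.
without_markers = pd.read_csv(io.StringIO(csv_text))

# With the marker list used in this notebook, all three placeholders become NaN.
with_markers = pd.read_csv(io.StringIO(csv_text), na_values=["", "NA", "#DIV/0!"])

print("dtype without na_values:", without_markers["summary_stat"].dtype)
print("dtype with na_values:   ", with_markers["summary_stat"].dtype)
print("missing values:", with_markers["summary_stat"].isna().sum())
```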

### Initial Data Exploration
Let's check the distribution of the target variable `classe`.

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(x='classe', data=training_df, palette="viridis", order=sorted(training_df['classe'].unique()))
plt.title('Number of Observations in Each Class')
plt.xlabel('Exercise Class')
plt.ylabel('Count')
plt.show()

The classes are reasonably balanced, so the models are trained without resampling or class weighting.
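
A bar chart shows balance only visually. As a rough sketch, the same check can be expressed numerically with `value_counts(normalize=True)`. The label series below is a hypothetical stand-in for `training_df['classe']`, and the counts are invented; `imbalance_ratio` is an illustrative name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical label column standing in for training_df['classe'] (invented counts).
toy_classe = pd.Series(list("AAAABBBCCCDDDEEE"), name="classe")

# Relative frequency of each class, ordered by label.
class_share = toy_classe.value_counts(normalize=True).sort_index()
print(class_share.round(3))

# Ratio of the most frequent to the least frequent class; values close to 1 indicate balance.
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```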

### Split Data into Training and Validation Sets
We will split the original training data into a new training set (80%) and a validation set (20%) to evaluate our models. We pass `stratify=y` so that both sets keep approximately the same class distribution.

In [None]:
# Separate features (X) and target (y)
X = training_df.drop('classe', axis=1)
y = training_df['classe']

# Split the data
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

print("Training set dimensions:", X_train.shape)
print("Validation set dimensions:", X_validation.shape)
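
To see the effect of `stratify`, the class shares of the two splits can be compared side by side. The following is a minimal sketch on hypothetical toy data (100 invented rows with deliberately uneven labels); the names `toy_X`, `toy_y`, and `shares` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with uneven class sizes (not the project data).
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["A"] * 40 + ["B"] * 20 + ["C"] * 20 + ["D"] * 10 + ["E"] * 10)

X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123, stratify=toy_y
)

# Class shares per split; with stratification the two columns should agree closely.
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "validation": y_val.value_counts(normalize=True),
}).sort_index()
print(shares)
```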

## 2. Feature Selection for Modeling

We apply a multi-step feature selection process to the training data. The same transformations will be applied to the validation and test sets.
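
The key idea is that every selection decision is learned from the training split and then applied unchanged elsewhere. A minimal sketch of that pattern, assuming a missing-value filter as the example step: the helper `learn_sparse_columns` and the toy frames are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def learn_sparse_columns(train: pd.DataFrame, max_missing: float = 0.5) -> pd.Index:
    """Return columns whose share of missing values in `train` exceeds `max_missing`."""
    missing_share = train.isna().mean()
    return missing_share[missing_share > max_missing].index


# Hypothetical toy frames (not the project data).
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [np.nan, np.nan, np.nan, 1.0]})
toy_valid = pd.DataFrame({"a": [5.0, 6.0], "b": [2.0, 3.0]})

# The decision is learned from the training frame only ...
sparse_cols = learn_sparse_columns(toy_train)

# ... and applied as-is to the other frame, even though 'b' is complete there.
toy_train_clean = toy_train.drop(columns=sparse_cols)
toy_valid_clean = toy_valid.drop(columns=sparse_cols)
print("Dropped:", list(sparse_cols))
print("Validation columns:", list(toy_valid_clean.columns))
```

Learning the column list from the training split alone keeps information from the validation and test sets out of the selection step.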

### Step 1: Remove columns with a high percentage of NA values
First, we remove any column that contains more than 50% NA values.

In [None]:
# Calculate the percentage of missing values for each column
missing_percentage = X_train.isnull().sum() / len(X_train)
cols_to_drop_na = missing_percentage[missing_percentage > 0.5].index

# Drop these columns from all datasets
X_train = X_train.drop(columns=cols_to_drop_na)
X_validation = X_validation.drop(columns=cols_to_drop_na)
testing_df = testing_df.drop(columns=cols_to_drop_na)

print(f"Dropped {len(cols_to_drop_na)} columns with >50% missing values.")
print("New training set dimensions:", X_train.shape)

### Step 2: Remove descriptive and zero-variance columns
Next, we remove metadata columns and predictors with zero variance. `VarianceThreshold()` with its default settings drops only columns that are constant in the training set.

In [None]:
# Exclude descriptive columns
descriptive_cols = ["Unnamed: 0", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window"]
# Ensure these columns exist before trying to drop them
cols_to_drop_desc = [col for col in descriptive_cols if col in X_train.columns]
X_train = X_train.drop(columns=cols_to_drop_desc)
X_validation = X_validation.drop(columns=cols_to_drop_desc)
testing_df = testing_df.drop(columns=cols_to_drop_desc)

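# Remove zero-variance predictors (the default threshold=0.0 drops only constant columns)
selector = VarianceThreshold()
selector.fit(X_train)
cols_to_keep = X_train.columns[selector.get_support()]

X_train = X_train[cols_to_keep]
X_validation = X_validation[cols_to_keep]
# Keep 'problem_id' in the test set: it is needed to label the final predictions
testing_df = testing_df[list(cols_to_keep) + ['problem_id']]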

print("Dropped descriptive columns and zero-variance features.")
print("Final training set dimensions after feature selection:", X_train.shape)

After feature selection, the remaining predictors are numeric sensor measurements that are mostly complete and not constant, which gives a more manageable and relevant set.
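
Note that `VarianceThreshold()` with its default `threshold=0.0` removes only constant columns. If a stricter, near-zero-variance filter were wanted, one option is a positive threshold. The sketch below uses a hypothetical toy frame and an arbitrary cutoff of `0.01`; because variance depends on each feature's units, any real cutoff would need to be chosen with the feature scales in mind.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy predictors (not the project data).
toy = pd.DataFrame({
    "constant": [1.0] * 6,
    "almost_constant": [0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
    "informative": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Default: keep every column whose variance is strictly greater than 0.
default_selector = VarianceThreshold().fit(toy)
# Stricter: keep only columns whose variance exceeds an arbitrary cutoff of 0.01.
strict_selector = VarianceThreshold(threshold=0.01).fit(toy)

kept_default = list(toy.columns[default_selector.get_support()])
kept_strict = list(toy.columns[strict_selector.get_support()])
print("Kept with default threshold:", kept_default)
print("Kept with threshold=0.01:  ", kept_strict)
```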

## 3. Model Training and Validation

### 3.1. Training with Random Forest

In [None]:
# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=200, random_state=123, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Make predictions on the validation set
rf_predictions = rf_model.predict(X_validation)

# Evaluate the model
rf_accuracy = accuracy_score(y_validation, rf_predictions)
print(f"Random Forest Model Accuracy: {rf_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_validation, rf_predictions))

In [None]:
# Plot the confusion matrix
cm_rf = confusion_matrix(y_validation, rf_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues',
            xticklabels=sorted(y_validation.unique()),
            yticklabels=sorted(y_validation.unique()))
plt.title('Random Forest Confusion Matrix', fontsize=16)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
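
The accuracy reported above and the confusion matrix are two views of the same counts: accuracy is the sum of the diagonal divided by the total, and per-class recall is each diagonal entry divided by its row sum. A small worked sketch with hypothetical labels (invented, not model output):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (invented for illustration).
y_true = ["A", "A", "B", "B", "C", "C", "C", "D", "E", "E"]
y_pred = ["A", "A", "B", "C", "C", "C", "B", "D", "E", "A"]

labels = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Correct predictions sit on the diagonal (rows = true label, columns = predicted label).
accuracy_from_cm = np.trace(cm) / cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(cm)
print("Accuracy from matrix:", accuracy_from_cm)
print("accuracy_score:      ", accuracy_score(y_true, y_pred))
print("Recall per class:", dict(zip(labels, recall_per_class.round(3))))
```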

### 3.2. Training with Support Vector Machine (SVM)

In [None]:
# Initialize and train the SVM model
print("Training SVM model... (This may take a moment)")
svm_model = SVC(random_state=123)
svm_model.fit(X_train, y_train)

# Make predictions
svm_predictions = svm_model.predict(X_validation)

# Evaluate the SVM model
svm_accuracy = accuracy_score(y_validation, svm_predictions)
print(f"SVM Model Accuracy: {svm_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_validation, svm_predictions))

In [None]:
# Plot the confusion matrix for SVM
cm_svm = confusion_matrix(y_validation, svm_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='Greens',
            xticklabels=sorted(y_validation.unique()),
            yticklabels=sorted(y_validation.unique()))
plt.title('SVM Confusion Matrix', fontsize=16)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
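
The SVM above is trained with default settings on unscaled features. Kernel SVMs are generally sensitive to feature scale, so one possible variation, which this notebook does not use, is to standardize the features inside a pipeline. The sketch below runs on synthetic data from `make_classification` and is only an illustration of the mechanics; it makes no claim about results on the project data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the project data).
toy_X, toy_y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=123
)
toy_X[:, 0] *= 1000.0  # exaggerate the scale of one feature

X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123, stratify=toy_y
)

# The scaler is fitted on the training split only and reused at prediction time.
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=123))
scaled_svm.fit(X_tr, y_tr)

scaled_accuracy = scaled_svm.score(X_val, y_val)
print(f"Validation accuracy with scaling (toy data): {scaled_accuracy:.4f}")
```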

## 4. Final Prediction on the Test Data

The Random Forest model performed exceptionally well on the validation set, so we use it to predict the outcomes for the 20 samples in the final `testing` dataset.

In [None]:
# The testing_df has already been preprocessed to have the same columns as X_train
# The 'problem_id' column is needed for the final submission but not for prediction
final_test_features = testing_df.drop(columns=['problem_id'])
final_predictions = rf_model.predict(final_test_features)

# Create a DataFrame for the results
prediction_results = pd.DataFrame({
    'problem_id': testing_df['problem_id'],
    'predicted_classe': final_predictions
})

print("Predictions on the final 20 test samples:")
print(prediction_results.to_string(index=False))