## 1. Algorithm Selection & Justification

### 1.1 Support Vector Machine (SVM)

### 1.2 Random Forest 

## 2. Algorithm Implementation

### 2.1 Data Loading and Preparation
The preprocessed dataset student_sleep_patterns_preprocessed.csv was loaded for model training. It contains 13 columns representing demographic, academic, and lifestyle attributes of university students: Age, Gender, University_Year, Sleep_Duration, Study_Hours, Screen_Time, Caffeine_Intake, Physical_Activity, Weekday_Sleep_Start, Weekend_Sleep_Start, Weekday_Sleep_End, Weekend_Sleep_End, and the target variable Sleep_Quality.


### 2.2 Feature and Target Selection

All columns except Sleep_Quality were treated as features (X). Sleep_Quality was defined as the target variable (y) with three possible classes: Poor, Average, and Good. These classes represent the overall quality of students’ sleep based on their lifestyle and academic factors.

### 2.3 Building a Random Forest Model

The Random Forest algorithm was implemented to classify students’ sleep quality based on multiple lifestyle and demographic factors. It was selected because it performs well on medium-sized datasets, handles both numeric and categorical variables, and captures complex, non-linear interactions between features such as study hours, caffeine intake, and sleep duration. By combining multiple decision trees, Random Forest reduces overfitting and provides a stable, reliable prediction of sleep quality.

#### 2.3.1 Installing Required Libraries

In [32]:
!pip install scikit-learn
!pip install imbalanced-learn




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


#### 2.3.2 Random Forest Model Development and Execution

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

# -------------------------------------------
# 1. Load the preprocessed dataset
# -------------------------------------------
df = pd.read_csv("../Dataset/student_sleep_patterns_preprocessed.csv")

# -------------------------------------------
# 2. Encode the target variable only
# -------------------------------------------
label_encoder = LabelEncoder()
df["Sleep_Quality"] = label_encoder.fit_transform(df["Sleep_Quality"])

# -------------------------------------------
# 3. Identify categorical and numerical columns
# -------------------------------------------
categorical_cols = ["Weekday_Sleep_Start", "Weekend_Sleep_Start",
                    "Weekday_Sleep_End", "Weekend_Sleep_End"]
numerical_cols = [col for col in df.columns if col not in categorical_cols + ["Sleep_Quality"]]

# -------------------------------------------
# 4. Preprocessing for model training (OneHot + Scaling)
# -------------------------------------------
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', StandardScaler(), numerical_cols)
    ]
)

# -------------------------------------------
# 5. Split features and target
# -------------------------------------------
X = df.drop("Sleep_Quality", axis=1)
y = df["Sleep_Quality"]

# Apply preprocessing transformations
X_processed = preprocessor.fit_transform(X)

# -------------------------------------------
# 6. Handle imbalance using SMOTE
# -------------------------------------------
smote = SMOTE(random_state=42, sampling_strategy='auto', k_neighbors=3)
X_balanced, y_balanced = smote.fit_resample(X_processed, y)

# -------------------------------------------
# 7. Split data into 80% train / 20% test
# -------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.2, random_state=42, stratify=y_balanced
)

# -------------------------------------------
# 8. Train the Random Forest model (tuned)
# -------------------------------------------
rf = RandomForestClassifier(
    n_estimators=800,
    max_depth=30,
    min_samples_split=2,
    min_samples_leaf=1,
    class_weight='balanced_subsample',
    bootstrap=True,
    random_state=42
)
rf.fit(X_train, y_train)

# -------------------------------------------
# 9. Make predictions
# -------------------------------------------
y_pred = rf.predict(X_test)

# -------------------------------------------
# 10. Decode labels back to original classes
# -------------------------------------------
decoded_true = label_encoder.inverse_transform(y_test)
decoded_pred = label_encoder.inverse_transform(y_pred)

# -------------------------------------------
# 11. Create a readable DataFrame using the original (non-encoded) features
# -------------------------------------------
X_display = X.sample(n=len(X_test), random_state=42).reset_index(drop=True)

results_df = pd.DataFrame({
    "Actual Sleep Quality": decoded_true,
    "Predicted Sleep Quality": decoded_pred
})

results_df = pd.concat([results_df, X_display], axis=1)

# -------------------------------------------
# 12. Display the first 25 rows clearly
# -------------------------------------------
print("Final Model Predictions with Original Feature Values (first 25 rows):")
display(results_df.head(25))


Final Model Predictions with Original Feature Values (first 25 rows):


Unnamed: 0,Actual Sleep Quality,Predicted Sleep Quality,Age,Gender,University_Year,Sleep_Duration,Study_Hours,Screen_Time,Caffeine_Intake,Physical_Activity,Weekday_Sleep_Start,Weekend_Sleep_Start,Weekday_Sleep_End,Weekend_Sleep_End
0,Good,Poor,18,0,3,7.5,9.2,3.0,3,16,Late,Late,Early,Early
1,Poor,Average,19,2,3,4.1,2.0,1.0,2,83,Medium,Early,Early,Early
2,Average,Poor,18,1,2,4.1,3.7,2.3,4,57,Late,Medium,Early,Early
3,Poor,Poor,19,2,2,4.4,0.1,2.4,3,27,Early,Early,Early,Medium
4,Good,Poor,19,0,3,8.9,5.9,2.7,2,56,Medium,Early,Early,Early
5,Good,Poor,20,1,1,6.2,3.6,1.3,3,67,Late,Medium,Early,Early
6,Average,Poor,18,0,1,7.9,1.3,3.8,2,19,Late,Early,Early,Medium
7,Poor,Poor,22,1,2,4.9,10.4,2.8,3,71,Late,Late,Early,Early
8,Average,Poor,21,2,1,5.5,5.2,2.8,4,99,Late,Early,Early,Early
9,Poor,Average,23,0,2,4.7,3.1,2.9,0,92,Late,Late,Early,Early


### 2.4 Building a Support Vector Machine (SVM) Model

## 3. Algorithm Evaluation & Comparison

## 4. Results Interpretation