<a href="https://colab.research.google.com/github/up2113232/up2113232_coursework/blob/dev/Q1_folder/ML_approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Question 1: Traditional Machine Learning Approach
Predicting GAD_T, SWL_T, and SPIN_T from Online Gaming Anxiety Data

 **Dataset:** Online Gaming Anxiety Data from Kaggle


Introduction
 This notebook explores traditional (non-neural network) machine learning approaches for predicting three psychological metrics from online gaming data:
 1. **GAD_T**: Generalized Anxiety Disorder score
 2. **SWL_T**: Satisfaction With Life score  
 3. **SPIN_T**: Social Phobia Inventory score

We'll compare multiple traditional ML algorithms to establish a performance baseline before moving to neural networks in Q2.

Objectives
 - Load and explore the gaming anxiety dataset
 - Preprocess data for machine learning
 - Implement and compare traditional ML models
 - Evaluate model performance using appropriate metrics
 - Interpret results and draw conclusions


In [1]:
# Import necessary libraries
import sys
import os

# Add parent directory to path to import our functions
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualisations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")


Next, we import the helper functions from our functions.py file. Each function is documented in that file.

In [3]:
from functions import clean_data, encode_features, split_data, scale_features
from functions import evaluate_regression_model

Next, we load the dataset.

In [4]:
df = pd.read_csv('gaming_anxiety_data.csv', encoding='ISO-8859-1')


Next, we select the feature columns the models will use and the target columns we want to predict.

In [5]:
feature_columns = ['GADE', 'Game', 'Hours', 'earnings', 'whyplay',
                   'streams', 'Narcissism', 'Gender',
                   'Age', 'Work', 'Playstyle']
target_columns = ['GAD_T', 'SWL_T', 'SPIN_T']

df = df[feature_columns + target_columns].copy()

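Next, we clean the data. As the output below shows, this step reports the missing values in each column and removes duplicate rows; rows that still contain missing values are dropped after encoding.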

In [6]:
# Clean the data
print("Cleaning dataset...")
df_cleaned_initial = clean_data(df)

Cleaning dataset...
Missing values per column:
GADE          649
Game            0
Hours          30
earnings        0
whyplay         0
streams       100
Narcissism     23
Gender          0
Age             0
Work           38
Playstyle       0
GAD_T           0
SWL_T           0
SPIN_T        650
dtype: int64
Removed 0 duplicate rows
Removed 51 duplicate rows
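
The `clean_data` helper lives in functions.py and is not shown in this notebook. Judging only from the output above, it reports missing values per column and drops duplicate rows. The following is a minimal sketch of that behaviour and assumes nothing else about the real implementation; the name `clean_data_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_data: report missing values, drop duplicates."""
    print("Missing values per column:")
    print(df.isnull().sum())

    # Drop exact duplicate rows and report how many were removed
    before = len(df)
    deduped = df.drop_duplicates().reset_index(drop=True)
    print(f"Removed {before - len(deduped)} duplicate rows")
    return deduped


# Toy data (not taken from the real dataset): rows 0 and 1 are identical
toy = pd.DataFrame({
    "Hours": [10.0, 10.0, np.nan, 25.0],
    "Gender": ["Male", "Male", "Female", "Other"],
})
toy_cleaned = clean_data_sketch(toy)
```

Note that this sketch leaves missing values in place, which matches the notebook: they are only removed later with `dropna()`.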


Next, we encode any non-numerical data in our feature columns as numbers, because the models can only learn from numerical input. We then drop every row that still contains a missing value so that we can begin training.

In [19]:
# Encode the string values as corresponding numbers
df_encoded = encode_features(df_cleaned_initial)

# Drop any rows with values that are still missing or could not be encoded
df_clean = df_encoded.dropna()

print(f"\n Original Missing values: {df.isnull().sum().sum()}")
print(f" Missing values after cleaning and encoding: {df_clean.isnull().sum().sum()}")
if df_clean.isnull().sum().sum() > 0:
    print("Columns with missing values in df_clean:")
    print(df_clean.isnull().sum()[df_clean.isnull().sum() > 0])


 Original Missing values: 1490
 Missing values after cleaning and encoding: 0
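
The `encode_features` helper is defined in functions.py and is not shown here, so its exact encoding scheme is unknown. One common approach, sketched below purely as an assumption, maps each distinct string in a non-numeric column to an integer code and keeps missing entries as NaN so that the later `dropna()` removes them. The name `encode_features_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def encode_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for encode_features: integer-code every non-numeric column."""
    encoded = df.copy()
    for col in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[col]):
            # Categories are sorted, then numbered 0, 1, 2, ...; missing values get -1
            codes = encoded[col].astype("category").cat.codes
            # Turn the -1 marker back into NaN so dropna() can remove those rows
            encoded[col] = codes.astype("float64").where(codes != -1)
    return encoded


# Toy data (not taken from the real dataset)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Hours": [10.0, 20.0, 5.0, np.nan],
})
toy_encoded = encode_features_sketch(toy)
toy_clean = toy_encoded.dropna()
```

Integer codes impose an arbitrary ordering on the categories. That tends to matter more for linear regression than for tree-based models such as a random forest.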


Next, we define which ML models to use. We compare two: linear regression and a random forest.

In [21]:
# Define our ML models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)}

Next, we combine several helper functions in a single training function to train our models; see the functions file for how each helper works. In short, it splits the data into training and testing sets and then standardises the features (the code comments indicate a standard scaler, which rescales each feature to zero mean and unit variance) so that features measured on very different scales are treated comparably by the models.
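
The `split_data` and `scale_features` helpers are defined in functions.py and are not shown here. A minimal sketch of what the two steps might look like together is given below, assuming they wrap scikit-learn's `train_test_split` and `StandardScaler`; the function name `split_and_scale_sketch` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale_sketch(X, y, test_size=0.2, random_state=42):
    """Hypothetical equivalent of split_data followed by scale_features."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    # Fit on the training set only, so test-set statistics never leak into training
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler


# Synthetic data: two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(50, 10, 200), rng.normal(0.5, 0.1, 200)])
y_demo = rng.normal(size=200)
X_tr, X_te, y_tr, y_te, demo_scaler = split_and_scale_sketch(X_demo, y_demo)
```

After scaling, each training feature has mean 0 and standard deviation 1. The test set is transformed with the training statistics, so its mean and standard deviation will only be approximately 0 and 1.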

In [22]:
from sklearn.base import clone

all_results = {}
X = df_clean[feature_columns]

def run_experiment_for_target(X, y, target_name, test_size=0.2, random_state=42):
    """Run the complete ML process for a specific target variable."""

    print(f"\n EXPERIMENT FOR TARGET: {target_name}")

    # Split the data (80% training, 20% testing with the default test_size)
    X_train, X_test, y_train, y_test = split_data(X, y, test_size=test_size, random_state=random_state)

    # Scale the features using the standard scaler function
    X_train_scaled, X_test_scaled, scaler = scale_features(X_train, X_test)

    results = {}

    # Train and evaluate each model
    for model_name, base_model in models.items():
        print(f"\nTraining {model_name}...")

        # Fit a fresh copy so the model stored for one target is not
        # overwritten when the same estimator is refitted for the next target
        model = clone(base_model)
        model.fit(X_train_scaled, y_train)

        # Evaluate model
        metrics = evaluate_regression_model(model, X_train_scaled, X_test_scaled,
                                           y_train, y_test, f"{model_name} - {target_name}")

        # Store results
        results[model_name] = {
            'model': model,
            'metrics': metrics,
            'predictions': model.predict(X_test_scaled)
        }

    return X_test_scaled, y_test, results

for target_name in target_columns:
    y = df_clean[target_name]
    X_test_scaled, y_test, results = run_experiment_for_target(X, y, target_name)
    all_results[target_name] = {
        'X_test': X_test_scaled,
        'y_test': y_test,
        'results': results
    }
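
The metrics reported in this cell's output (R², MSE and MAE on both the training and test splits) come from `evaluate_regression_model`, which is defined in functions.py and not shown here. A minimal sketch of how such a helper could compute them with scikit-learn follows. The name `evaluate_regression_sketch`, the dictionary keys and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate_regression_sketch(model, X_train, X_test, y_train, y_test, label):
    """Hypothetical stand-in for evaluate_regression_model; expects a fitted model."""
    metrics = {}
    for split, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        metrics[f"{split}_r2"] = r2_score(y_part, pred)
        metrics[f"{split}_mse"] = mean_squared_error(y_part, pred)
        metrics[f"{split}_mae"] = mean_absolute_error(y_part, pred)

    print(f"\nEvaluation for {label}")
    print(f"Train R²: {metrics['train_r2']:.4f}")
    print(f"Test R²:  {metrics['test_r2']:.4f}")
    print(f"Train MSE: {metrics['train_mse']:.4f}")
    print(f"Test MSE:  {metrics['test_mse']:.4f}")
    print(f"Train MAE: {metrics['train_mae']:.4f}")
    print(f"Test MAE:  {metrics['test_mae']:.4f}")
    return metrics


# Synthetic, noise-free linear data, so a linear model should fit almost exactly
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1]
demo_model = LinearRegression().fit(X_demo[:80], y_demo[:80])
demo_metrics = evaluate_regression_sketch(
    demo_model, X_demo[:80], X_demo[80:], y_demo[:80], y_demo[80:],
    "Linear Regression - demo",
)
```

Reporting the same metrics on both splits makes overfitting visible: a train R² far above the test R² indicates a model that has memorised the training data rather than learned a pattern that generalises.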


 EXPERIMENT FOR TARGET: GAD_T

Training Linear Regression...

Evaluation for Linear Regression - GAD_T
Train R²: 0.1542
Test R²:  0.1769
Train MSE: 18.9640
Test MSE:  17.6324
Train MAE: 3.2574
Test MAE:  3.1538

Training Random Forest...

Evaluation for Random Forest - GAD_T
Train R²: 0.9063
Test R²:  0.3546
Train MSE: 2.0998
Test MSE:  13.8260
Train MAE: 1.0949
Test MAE:  2.8555

 EXPERIMENT FOR TARGET: SWL_T

Training Linear Regression...

Evaluation for Linear Regression - SWL_T
Train R²: 0.0907
Test R²:  0.1008
Train MSE: 47.9460
Test MSE:  45.2426
Train MAE: 5.7451
Test MAE:  5.5995

Training Random Forest...

Evaluation for Random Forest - SWL_T
Train R²: 0.8695
Test R²:  0.0689
Train MSE: 6.8819
Test MSE:  46.8472
Train MAE: 2.1043
Test MAE:  5.5870

 EXPERIMENT FOR TARGET: SPIN_T

Training Linear Regression...

Evaluation for Linear Regression - SPIN_T
Train R²: 0.0824
Test R²:  0.0913
Train MSE: 158.4433
Test MSE:  159.0071
Train MAE: 9.8954
Test MAE:  9.8860

Training Random