# Problematic Internet Use | Part D : Severity Prediction using Machine Learning

In this part, we will use machine learning techniques such as regression, decision trees, and correlation-based feature analysis to answer the following question:

**Can we use HBN instruments to predict a child’s score on the problematic internet use scale, and if so, which features are most informative to the prediction?**

## Importing Dependencies

We will import all dependencies for this part of the project in the cell below.

In [47]:
import pandas as pd
import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn import preprocessing
import statsmodels.api as sm 
from sklearn.ensemble import RandomForestClassifier

## Data Preparation

Before we can train a model and examine feature importance, we need to prepare the data. Let's review all of the features present in the dataset:

| Instrument                         | Field                          | Description                   |
|------------------------------------|--------------------------------|-------------------------------|
| Identifier | `id` | Participant's ID |
| Demographics | `Basic_Demos-Enroll_Season` | Season of enrollment |
| Demographics | `Basic_Demos-Age` | Age of participant |
| Demographics | `Basic_Demos-Sex` | Sex of participant |
| Children's Global Assessment Scale | `CGAS-Season` | Season of participation |
| Children's Global Assessment Scale | `CGAS-CGAS_Score` | Children's Global Assessment Scale Score |
| Physical Measures | `Physical-Season` | Season of participation |
| Physical Measures | `Physical-BMI` | Body Mass Index (kg/m^2) |
| Physical Measures | `Physical-Height` | Height (in) |
| Physical Measures | `Physical-Weight` | Weight (lbs) |
| Physical Measures | `Physical-Waist_Circumference` | Waist circumference (in) |
| Physical Measures | `Physical-Diastolic_BP` | Diastolic BP (mmHg) |
| Physical Measures | `Physical-HeartRate` | Heart rate (beats/min) |
| Physical Measures | `Physical-Systolic_BP` | Systolic BP (mmHg) |
| FitnessGram Vitals and Treadmill | `Fitness_Endurance-Season` | Season of participation |
| FitnessGram Vitals and Treadmill | `Fitness_Endurance-Max_Stage` | Maximum stage reached |
| FitnessGram Vitals and Treadmill | `Fitness_Endurance-Time_Mins` | Exact time completed: Minutes |
| FitnessGram Vitals and Treadmill | `Fitness_Endurance-Time_Sec` | Exact time completed: Seconds |
| FitnessGram Child | `FGC-Season` | Season of participation |
| FitnessGram Child | `FGC-FGC_CU` | Curl up total |
| FitnessGram Child | `FGC-FGC_CU_Zone` | Curl up fitness zone |
| FitnessGram Child | `FGC-FGC_GSND` | Grip Strength total (non-dominant) |
| FitnessGram Child | `FGC-FGC_GSND_Zone` | Grip Strength fitness zone (non-dominant) |
| FitnessGram Child | `FGC-FGC_GSD` | Grip Strength total (dominant) |
| FitnessGram Child | `FGC-FGC_GSD_Zone` | Grip Strength fitness zone (dominant) |
| FitnessGram Child | `FGC-FGC_PU` | Push-up total |
| FitnessGram Child | `FGC-FGC_PU_Zone` | Push-up fitness zone |
| FitnessGram Child | `FGC-FGC_SRL` | Sit & Reach total (left side) |
| FitnessGram Child | `FGC-FGC_SRL_Zone` | Sit & Reach fitness zone (left side) |
| FitnessGram Child | `FGC-FGC_SRR` | Sit & Reach total (right side) |
| FitnessGram Child | `FGC-FGC_SRR_Zone` | Sit & Reach fitness zone (right side) |
| FitnessGram Child | `FGC-FGC_TL` | Trunk lift total |
| FitnessGram Child | `FGC-FGC_TL_Zone` | Trunk lift fitness zone |
| Bio-electric Impedance Analysis | `BIA-Season` | Season of participation |
| Bio-electric Impedance Analysis | `BIA-BIA_Activity_Level_num` | Activity Level |
| Bio-electric Impedance Analysis | `BIA-BIA_BMC` | Bone Mineral Content |
| Bio-electric Impedance Analysis | `BIA-BIA_BMI` | Body Mass Index |
| Bio-electric Impedance Analysis | `BIA-BIA_BMR` | Basal Metabolic Rate |
| Bio-electric Impedance Analysis | `BIA-BIA_DEE` | Daily Energy Expenditure |
| Bio-electric Impedance Analysis | `BIA-BIA_ECW` | Extracellular Water |
| Bio-electric Impedance Analysis | `BIA-BIA_FFM` | Fat Free Mass |
| Bio-electric Impedance Analysis | `BIA-BIA_FFMI` | Fat Free Mass Index |
| Bio-electric Impedance Analysis | `BIA-BIA_FMI` | Fat Mass Index |
| Bio-electric Impedance Analysis | `BIA-BIA_Fat` | Body Fat Percentage |
| Bio-electric Impedance Analysis | `BIA-BIA_Frame_num` | Body Frame |
| Bio-electric Impedance Analysis | `BIA-BIA_ICW` | Intracellular Water |
| Bio-electric Impedance Analysis | `BIA-BIA_LDM` | Lean Dry Mass |
| Bio-electric Impedance Analysis | `BIA-BIA_LST` | Lean Soft Tissue |
| Bio-electric Impedance Analysis | `BIA-BIA_SMM` | Skeletal Muscle Mass |
| Bio-electric Impedance Analysis | `BIA-BIA_TBW` | Total Body Water |
| Physical Activity Questionnaire (Adolescents) | `PAQ_A-Season` | Season of participation |
| Physical Activity Questionnaire (Adolescents) | `PAQ_A-PAQ_A_Total` | Activity Summary Score (Adolescents) |
| Physical Activity Questionnaire (Children) | `PAQ_C-Season` | Season of participation |
| Physical Activity Questionnaire (Children) | `PAQ_C-PAQ_C_Total` | Activity Summary Score (Children) |
| Parent-Child Internet Addiction Test | `PCIAT-Season` | Season of participation |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_01` | How often does your child disobey time limits you set for online use? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_02` | How often does your child neglect household chores to spend more time online? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_03` | How often does your child prefer to spend time online rather than with the rest of your family? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_04` | How often does your child form new relationships with fellow online users? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_05` | How often do you complain about the amount of time your child spends online? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_06` | How often do your child's grades suffer because of the amount of time he or she spends online? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_07` | How often does your child check his or her e-mail before doing something else? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_08` | How often does your child seem withdrawn from others since discovering the Internet? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_09` | How often does your child become defensive or secretive when asked what he or she does online? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_10` | How often have you caught your child sneaking online against your wishes? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_11` | How often does your child spend time alone in his or her room playing on the computer? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_12` | How often does your child receive strange phone calls from new "online" friends? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_13` | How often does your child snap, yell, or act annoyed if bothered while online? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_14` | How often does your child seem more tired and fatigued than he or she did before the Internet came along? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_15` | How often does your child seem preoccupied with being back online when off-line? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_16` | How often does your child throw tantrums with your interference about how long he or she spends online? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_17` | How often does your child choose to spend time online rather than doing once enjoyed hobbies and/or outside interests? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_18` | How often does your child become angry or belligerent when you place time limits on how much time he or she is allowed to spend online? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_19` | How often does your child choose to spend more time online than going out with friends? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_20` | How often does your child feel depressed, moody, or nervous when off-line which seems to go away once back online? |
| Parent-Child Internet Addiction Test | `PCIAT-PCIAT_Total` | Total Score |
| Sleep Disturbance Scale | `SDS-Season` | Season of participation |
| Sleep Disturbance Scale | `SDS-SDS_Total_Raw` | Total Raw Score |
| Sleep Disturbance Scale | `SDS-SDS_Total_T` | Total T-Score |
| Internet Use | `PreInt_EduHx-Season` | Season of participation |
| Internet Use | `PreInt_EduHx-computerinternet_hoursday` | Hours of using computer/internet |

The dataset currently includes all the `PCIAT` columns, which record the item-level breakdown of the Parent-Child Internet Addiction Test. Our dependent variable, the Severity Impairment Index (`sii`), is directly determined by these features, so the `PCIAT` columns are effectively dependent variables too and should be removed before proceeding.

Furthermore, there are metadata columns such as `id`, as well as many redundant season columns (`CGAS-Season`, `FGC-Season`, etc.) that provide the same kind of information. Let's remove these too, keeping only the first one, `Basic_Demos-Enroll_Season`.

In [2]:
# Reading in the data
df = pd.read_csv('cleaned_data.csv', index_col=False)

# Dropping the necessary columns
columns_to_drop = [
    'id', 'BIA-Season', 'CGAS-Season', 'SDS-Season', 'PCIAT-Season', 
    'Physical-Season', 'FGC-Season', 'PreInt_EduHx-Season', 
    'Fitness_Endurance-Season','PAQ_C-Season', 
    'PCIAT-PCIAT_17', 'PCIAT-PCIAT_16', 'PCIAT-PCIAT_18', 
    'PCIAT-PCIAT_07', 'PCIAT-PCIAT_13', 'PCIAT-PCIAT_05', 
    'PCIAT-PCIAT_08', 'PCIAT-PCIAT_15', 'PCIAT-PCIAT_09', 
    'PCIAT-PCIAT_19', 'PCIAT-PCIAT_12', 'PCIAT-PCIAT_04', 
    'PCIAT-PCIAT_03', 'PCIAT-PCIAT_06', 'PCIAT-PCIAT_14', 
    'PCIAT-PCIAT_20', 'PCIAT-PCIAT_01', 'PCIAT-PCIAT_10', 
    'PCIAT-PCIAT_11', 'PCIAT-PCIAT_02', 'PCIAT-PCIAT_Total'
]
df = df.drop(columns=columns_to_drop)

Let's see how many categorical features we are left with now.

In [25]:
print(df.select_dtypes(include=['object']).columns)

Index(['Basic_Demos-Enroll_Season'], dtype='object')


We need to numerically represent the categorical seasonal feature before we can provide it as input to any machine learning model. For this purpose, we can use One-Hot Encoding, which is a widely used method where each category is converted into a binary (0/1) variable, creating a new column for each category. For example, for `Basic_Demos-Enroll_Season`:

| `Basic_Demos-Enroll_Season` | Fall | Spring | Summer | Winter |
|----------|--------|--------|--------|--------|
| Fall     | 1      | 0      | 0      | 0      |
| Spring   | 0      | 1      | 0      | 0      |
| Summer   | 0      | 0      | 1      | 0      |
| Winter   | 0      | 0      | 0      | 1      |

In [3]:
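# One-Hot Encoding
# drop_first=False keeps all four season columns; drop_first=True would drop
# one of them, which avoids perfect multicollinearity in regression models
df = pd.get_dummies(df, columns=['Basic_Demos-Enroll_Season'], drop_first=False, dtype=int)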

# Preview of the dataset
display(df.head())

| Unnamed: 0 | Basic_Demos-Age | Basic_Demos-Sex | CGAS-CGAS_Score | Physical-BMI | Physical-Height | Physical-Weight | Physical-Diastolic_BP | Physical-HeartRate | Physical-Systolic_BP | Fitness_Endurance-Max_Stage | ... | BIA-BIA_TBW | PAQ_C-PAQ_C_Total | SDS-SDS_Total_Raw | SDS-SDS_Total_T | PreInt_EduHx-computerinternet_hoursday | sii | Basic_Demos-Enroll_Season_Fall | Basic_Demos-Enroll_Season_Spring | Basic_Demos-Enroll_Season_Summer | Basic_Demos-Enroll_Season_Winter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 0 | 51.0 | 16.877316 | 46.0 | 50.8 | 68.0 | 81.0 | 114.0 | 5.0 | ... | 32.6909 | 2.55 | 39.0 | 55.0 | 3.0 | 2 | 1 | 0 | 0 | 0 |
| 1 | 9 | 0 | 65.0 | 14.03559 | 48.0 | 46.0 | 75.0 | 70.0 | 122.0 | 5.0 | ... | 27.0552 | 2.34 | 46.0 | 64.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 10 | 1 | 71.0 | 16.648696 | 56.5 | 75.6 | 65.0 | 94.0 | 117.0 | 5.0 | ... | 44.7213 | 2.17 | 38.0 | 54.0 | 2.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 9 | 0 | 71.0 | 18.292347 | 56.0 | 81.6 | 60.0 | 97.0 | 117.0 | 6.0 | ... | 45.9966 | 2.451 | 31.0 | 45.0 | 0.0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 13 | 1 | 50.0 | 22.279952 | 59.5 | 112.2 | 60.0 | 73.0 | 102.0 | 5.0 | ... | 63.1265 | 4.11 | 40.0 | 56.0 | 0.0 | 1 | 0 | 1 | 0 | 0 |


## Ordinal Logistic Regression

Let's review `sii` again. The target variable `sii` is defined as:
- 0: None (`PCIAT-PCIAT_Total` from 0 to 30)
- 1: Mild (`PCIAT-PCIAT_Total` from 31 to 49)
- 2: Moderate (`PCIAT-PCIAT_Total` from 50 to 79)
- 3: Severe (`PCIAT-PCIAT_Total` 80 and more)
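The banding above can be written as a small lookup. The following is a minimal sketch, not the code used to build the dataset: the function name `sii_from_total` and the list `SII_LOWER_BOUNDS` are hypothetical, and only the cut points (31, 50, 80) come from the definition above.

```python
from bisect import bisect_right

# Lower bounds of the Mild, Moderate and Severe bands (from the definition above).
SII_LOWER_BOUNDS = [31, 50, 80]


def sii_from_total(total):
    """Map a PCIAT total score to an sii category (0-3).

    Hypothetical helper, for illustration only.
    """
    # bisect_right counts how many lower bounds are <= total,
    # which is the index of the band the score falls into.
    return bisect_right(SII_LOWER_BOUNDS, total)


for score in (0, 30, 31, 49, 50, 79, 80, 100):
    print(score, "->", sii_from_total(score))
```

The uneven band widths (31, 19 and 30 points for the first three bands) are the reason the spacing between `sii` categories cannot be assumed uniform.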

The variable `sii` is categorical, meaning it represents discrete categories rather than continuous numerical values. However, unlike general categorical variables, the values of `sii` — 0, 1, 2, and 3 — have an inherent order. For instance:

- A higher value of `sii` (e.g., 3) indicates a greater degree of problematic behaviours compared to a lower value (e.g., 0).
- While the ordering exists, the spacing between categories is not necessarily uniform (e.g., the difference between 0 and 1 is not the same as the difference between 1 and 2 in terms of `PCIAT-PCIAT_Total`).

This ordinal nature of `sii` makes it distinct from purely nominal variables (like "red", "green", "blue") or continuous variables (like height or temperature), requiring a modeling approach that respects this structure.

### Why Not Treat This as a Linear Regression Problem?

Linear regression models the relationship between independent variables (features) and a continuous dependent variable by fitting a straight line (or hyperplane in higher dimensions). However,

- Linear regression assumes the dependent variable has an interval scale, meaning the "distance" between categories is uniform. For `sii`, this assumption is not valid because the differences between consecutive categories are not guaranteed to be equal.
- Linear regression treats the category codes 0 to 3 as exact numeric magnitudes rather than as ordered labels, so the fit depends on an arbitrary choice of codes.
- It also predicts continuous values, such as 2.5 instead of 2 or 3, which are not valid categories and have to be rounded or clipped after the fact.
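To make the last point concrete, here is a toy least-squares fit. The `x` and `y` values are made up for illustration and are not taken from the HBN dataset; the sketch only shows that a fitted line returns fractional values, and can even leave the 0 to 3 range.

```python
# Made-up ordinal labels (0-3) against a single made-up feature.
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 1, 2, 3]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form simple linear regression (ordinary least squares).
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x


def predict(value):
    """Return the fitted line's prediction for a feature value."""
    return intercept + slope * value


for value in (0, 3, 6):
    print(value, round(predict(value), 3))
```

In this toy fit the prediction for `x = 0` is negative and the prediction for `x = 6` exceeds 3, so neither corresponds to a valid category without further post-processing.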

### Why Not Use Standard Logistic Regression?

Logistic regression is a classification technique for binary or multiclass problems. It models the probability of each class as a function of the features using the logistic (sigmoid) function, or its softmax generalization when there are more than two classes. However,

- Multiclass logistic regression treats all categories as distinct and unordered. For example, it sees the relationship between 0 → 1 and 0 → 3 as equivalent, ignoring the natural progression in the ordering.

While logistic regression is better suited than linear regression for categorical data, it is not optimal for ordered categorical variables like `sii`.

### Why Ordinal Logistic Regression Is a Good Choice

Ordinal logistic regression is a statistical technique specifically designed for ordered categorical dependent variables, making it a natural fit for `sii`. It bridges the gap between logistic regression and linear regression by accounting for both the ordinal nature of the target and the categorical nature of the outcomes.

Instead of predicting probabilities for each individual category, ordinal logistic regression models the cumulative probability of `sii` being less than or equal to a given category $j$:

$P(\mathrm{sii} \le j) = \dfrac{1}{1 + \exp\left(-(\alpha_j - X\beta)\right)}$

Where,
- $\alpha_j$ is the threshold for category $j$,
- $X$ represents the input features,
- $\beta$ are the model coefficients.

The model learns thresholds $\alpha_1, \alpha_2, \alpha_3, \dots$ that partition the cumulative probabilities into the respective categories. These thresholds reflect the boundaries between consecutive categories, capturing the ordinal structure.

The coefficients $\beta$ represent the effect of each feature on the latent variable that underlies the ordinal categories of `sii`. The model respects the ordering of the categories while recognizing that the exact distances between them are unknown. Each coefficient can be interpreted in terms of how its feature shifts the likelihood of `sii` falling into higher or lower categories.
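The step from cumulative probabilities to per-category probabilities can be sketched as follows. This is a minimal illustration of the formula above, not the statsmodels implementation; the function name `category_probabilities`, the example thresholds, and the example values of the linear predictor are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(eta, thresholds):
    """Return P(sii = j) for every category, given eta = X.beta.

    thresholds must be increasing; K thresholds yield K + 1 categories.
    """
    # Cumulative probabilities P(sii <= j); the last category always reaches 1.
    cumulative = [sigmoid(alpha - eta) for alpha in thresholds] + [1.0]
    # Successive differences give the individual category probabilities.
    probs = [cumulative[0]]
    for lower, upper in zip(cumulative, cumulative[1:]):
        probs.append(upper - lower)
    return probs


thresholds = [0.5, 1.5, 3.0]  # hypothetical threshold values
for eta in (-1.0, 0.0, 2.0):
    print(eta, [round(p, 3) for p in category_probabilities(eta, thresholds)])
```

As the linear predictor grows, probability mass moves from the lowest category toward the highest one, which is how a positive coefficient raises the likelihood of a more severe `sii`.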

Overall, ordinal logistic regression allows us to avoid the pitfalls of linear regression (invalid predictions) and standard logistic regression (loss of ordering information).

## Model Fitting, Predictions and Evaluation

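We shall now fit an ordinal regression model and evaluate its accuracy, among other metrics, on the test dataset. Given the large size of our dataset, we shall use a 70-30 train-test split. Note that the code below fits the model with `distr='probit'`, which replaces the logistic function in the formula above with the standard normal CDF; passing `distr='logit'` instead would fit the ordinal logistic model described above.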

In [53]:
# Getting our input and output dataframes/series
X = df.drop(columns=['sii'])
y = df['sii']

# Feature scaling via standardization (zero mean, unit variance)
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
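One caveat about the cell above: because the scaler is fitted on all rows before the split, the means and standard deviations it uses include the test rows. A common alternative is to compute these statistics on the training rows only and reuse them for the test rows, which is what `fit_transform` on `X_train` followed by `transform` on `X_test` does in the random forest cell of this notebook. The sketch below shows the idea in plain Python on made-up numbers; `fit_standardizer` and `standardize` are hypothetical helper names.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Scale values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


train = [10.0, 12.0, 14.0, 16.0]  # made-up training feature
test = [11.0, 20.0]               # made-up test feature

mean, std = fit_standardizer(train)          # statistics from the training rows only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

With this ordering, the test rows have no influence on the scaling parameters, so the test metrics are not affected by information from the held-out data.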

In [57]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fitting the model
model = OrderedModel(y_train, X_train, distr='probit')
result = model.fit(method='bfgs')
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.861171
         Iterations: 124
         Function evaluations: 129
         Gradient evaluations: 129
                             OrderedModel Results                             
Dep. Variable:                    sii   Log-Likelihood:                -1648.3
Model:                   OrderedModel   AIC:                             3399.
Method:            Maximum Likelihood   BIC:                             3682.
Date:                Sun, 01 Dec 2024                                         
Time:                        06:13:23                                         
No. Observations:                1914                                         
Df Residuals:                    1863                                         
Df Model:                          48                                         
                                             coef    std err          z      P>|z|      [0.025      0.975]
-------



In [58]:
# Making predictions
predictions = result.predict(X_test)
predicted_classes = predictions.idxmax(axis=1)  # The class with the highest predicted probability

# Evaluate the model
print("\nClassification Report:")
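# zero_division=0 reports 0.00 without a warning for classes that are never predicted
print(classification_report(y_test, predicted_classes, zero_division=0))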
print(f"Accuracy: {accuracy_score(y_test, predicted_classes):.2f}")


Classification Report:
              precision    recall  f1-score   support

           0       0.69      0.91      0.79       491
           1       0.37      0.18      0.24       209
           2       0.31      0.22      0.26       111
           3       0.00      0.00      0.00        10

    accuracy                           0.62       821
   macro avg       0.34      0.33      0.32       821
weighted avg       0.55      0.62      0.57       821

Accuracy: 0.62
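For context, it helps to compare this accuracy with a trivial baseline. Using only the `support` column of the report above, the sketch below computes the accuracy of always predicting the most frequent class; the variable names are illustrative.

```python
# Test-set class counts, copied from the support column of the report above.
support = {0: 491, 1: 209, 2: 111, 3: 10}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Always predicting class 0 would score about 0.60 on this test set, so the model's 0.62 is only a modest improvement, and the per-class recall values suggest that most of its accuracy comes from the majority class.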




In [None]:
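# Showing that a random forest performs worse
# (note: this cell uses an 80-20 split, so the comparison with the
# 70-30 split used for the ordinal model is only approximate)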

X = df.drop(columns=['sii'])
y = df['sii']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [50]:
accuracy_score(y_test, y_pred)

0.5886654478976234