# **Child Mind Institute - Relating Physical Activity to Problematic Internet Use**

‣ GitHub page for this project 👉  [here]()

‣ An article from the institute on Summer Screen Time Use 👉  [read here](https://childmind.org/article/screen-time-and-summer/)



### The Problem at Hand 🧑‍💻 *(taken from the competition homepage 👉  [read here](https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/overview))*

"In today’s digital age, problematic internet use among children and adolescents is a growing concern. Better understanding this issue is crucial for addressing mental health problems such as depression and anxiety.

Current methods for measuring problematic internet use in children and adolescents are often complex and require professional assessments. This creates access, cultural, and linguistic barriers for many families. Due to these limitations, problematic internet use is often not measured directly, but is instead associated with issues such as depression and anxiety in youth.

Conversely, physical & fitness measures are extremely accessible and widely available with minimal intervention or clinical expertise. Changes in physical habits, such as poorer posture, irregular diet, and reduced physical activity, are common in excessive technology users. We propose using these easily obtainable physical fitness indicators as proxies for identifying problematic internet use, especially in contexts lacking clinical expertise or suitable assessment tools."

**What does this mean?** The Child Mind Institute has tasked the public with building predictive machine learning models that determine a participant's Severity Impairment Index (SII), a metric measuring the level of problematic internet use among children and adolescents, based on physical activity, health, and lifestyle factors. The aim is to identify signs of problematic internet use early so that the parent or caretaker can take preventative measures.

### The Data at Hand 📊

We will be working with the Child Mind Institute's *Healthy Brain Network (HBN)* dataset, a clinical sample of roughly 5,000 youth and adolescents (aged 5-22) who have undergone various clinical and research screenings for the institute. The institute has conveniently separated the relevant data into two distinct categories for us.

The first is tabular data comprising measurements from various instruments, assessments, and questionnaires - in particular, it includes an assessment called the Parent-Child Internet Addiction Test (PCIAT), which is used to calculate the SII of each participant - we'll refer to this data as **feature data**. The second is time-series data collected with a wrist accelerometer given to roughly 1,000 participants to wear for up to 30 days continually while at home and going about their daily lives. The data collected from this device includes physical activity and other metrics - we'll refer to this data as **actigraphy data**.

For each we have been provided with a **train set**, on which we will train our models, and a **test set**, on which we will evaluate their performance. The train set is a full dataset that includes the SII, which is our **target variable**, and the PCIAT results used to calculate it - the test set is a much smaller collection of data that is missing this information. Our objective then is to train the models to accurately predict SII values *for each entry in the test set*.

Because the feature data and the actigraphy data are very different in nature, we will use an **ensemble approach**: analyzing, engineering features, and training models separately for each, then merging the results for the final submission. The actigraphy data is dense time-series data, so a neural network is a natural candidate for it. For the feature data, simpler baseline ML models should be appropriate.
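To make the merging step concrete, here is a minimal sketch of one possible way to combine the two sets of predictions. It is an illustration only, not the method used later in this notebook: the function name `blend_predictions`, the `weight` parameter, and the example numbers are all hypothetical. It assumes each model outputs a continuous SII estimate, and that participants without actigraphy data have `NaN` in the actigraphy predictions.

```python
import numpy as np

def blend_predictions(feature_pred, actigraphy_pred, weight=0.5):
    """Hypothetical helper: weighted average of two models' SII estimates.

    Falls back to the feature-model prediction wherever the actigraphy
    prediction is missing (NaN), then rounds to the nearest SII class (0-3).
    """
    feature_pred = np.asarray(feature_pred, dtype=float)
    actigraphy_pred = np.asarray(actigraphy_pred, dtype=float)

    blended = np.where(
        np.isnan(actigraphy_pred),
        feature_pred,
        weight * feature_pred + (1 - weight) * actigraphy_pred,
    )
    # round to the nearest integer and keep results inside the valid range
    return np.clip(np.rint(blended), 0, 3).astype(int)

# made-up predictions for four participants; the second has no actigraphy data
final = blend_predictions([0.2, 1.4, 2.9, 3.6], [0.4, np.nan, 2.3, 3.8])
print(final)
```

Equal weighting is an arbitrary starting point here; in practice the weight would need to be chosen by validation.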

### Competition Evaluation 📝

The result will be evaluated based on the **quadratic weighted kappa**, which measures the agreement between two outcomes. This metric typically varies from 0 (random agreement) to 1 (complete agreement), and can fall below 0 when agreement is worse than chance. The submission file will consist of two columns, one for id and one for SII, with a row for each participant in the test set. An example submission has been given to us on the Kaggle page.

In [20]:
# load pandas
import pandas as pd

# load sample submission
sample = pd.read_csv("/Users/tomragus/Library/CloudStorage/OneDrive-UCSanDiego/CMI-PIU-Model/data/sample_submission.csv")

# display sample submission
print("Sample submission")
print(f"Submission shape: {sample.shape}")
sample

Sample submission
Submission shape: (20, 2)


Unnamed: 0,id,sii
0,00008ff9,0
1,000fd460,1
2,00105258,2
3,00115b9f,3
4,0016bb22,0
5,001f3379,1
6,0038ba98,2
7,0068a485,3
8,0069fbed,0
9,0083e397,1
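For reference, the quadratic weighted kappa described above can be computed with scikit-learn's `cohen_kappa_score` by passing `weights="quadratic"`. The sketch below uses made-up labels purely to show the call; it is not the competition's scoring code.

```python
from sklearn.metrics import cohen_kappa_score

# made-up true and predicted SII labels, for illustration only
y_true = [0, 1, 2, 3, 0, 1]
y_same = [0, 1, 2, 3, 0, 1]   # identical to y_true
y_off = [0, 2, 2, 3, 1, 1]    # two predictions off by one class

# quadratic weights penalise a disagreement by the squared class distance
perfect = cohen_kappa_score(y_true, y_same, weights="quadratic")
partial = cohen_kappa_score(y_true, y_off, weights="quadratic")

print(f"Perfect agreement: {perfect:.3f}")
print(f"Partial agreement: {partial:.3f}")
```

Because the weights grow with the squared distance between classes, predicting 3 for a true 0 costs far more than predicting 1.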


### Credit 📚

Parts of this notebook, particularly in the EDA phase, were adapted from [Antonina Dolgorukova](https://datadelic.dev/)'s brilliant EDA notebooks for this competition. I highly encourage checking out her notebooks - they are extremely in-depth and very well written.

‣ *Feature EDA 👉  [read here](https://www.kaggle.com/code/antoninadolgorukova/cmi-piu-features-eda/notebook)*

‣ *Actigraphy EDA 👉  [read here](https://www.kaggle.com/code/antoninadolgorukova/cmi-piu-actigraphy-data-eda)*


## ***Feature data***

### Let's start by taking a peek into our feature data:

In [None]:
# load train set
train = pd.read_csv("/Users/tomragus/Library/CloudStorage/OneDrive-UCSanDiego/CMI-PIU-Model/data/train.csv")

# display first 5 rows of train set
print("""Train set: where the 'features' live""")
print(f"Train shape: {train.shape}")
display(train.head())

Train set: where the 'features' live
Train shape: (3960, 82)


Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,...,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,sii
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,...,4.0,2.0,4.0,55.0,,,,Fall,3.0,2.0
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,...,0.0,0.0,0.0,0.0,Fall,46.0,64.0,Summer,0.0,0.0
2,00105258,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,...,2.0,1.0,1.0,28.0,Fall,38.0,54.0,Summer,2.0,0.0
3,00115b9f,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,...,3.0,4.0,1.0,44.0,Summer,31.0,45.0,Winter,0.0,1.0
4,0016bb22,Spring,18,1,Summer,,,,,,...,,,,,,,,,,


In [9]:
# load test set
test = pd.read_csv("/Users/tomragus/Library/CloudStorage/OneDrive-UCSanDiego/CMI-PIU-Model/data/test.csv")

# display first 5 rows of test set
print("""Test set: what we will evaluate our models on""")
print(f"Test shape: {test.shape}")
display(test.head())

Test set: what we will evaluate our models on
Test shape: (20, 59)


Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,...,BIA-BIA_TBW,PAQ_A-Season,PAQ_A-PAQ_A_Total,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,...,32.6909,,,,,,,,Fall,3.0
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,...,27.0552,,,Fall,2.34,Fall,46.0,64.0,Summer,0.0
2,00105258,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,...,,,,Summer,2.17,Fall,38.0,54.0,Summer,2.0
3,00115b9f,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,...,45.9966,,,Winter,2.451,Summer,31.0,45.0,Winter,0.0
4,0016bb22,Spring,18,1,Summer,,,,,,...,,Summer,1.04,,,,,,,


It can be tricky to figure out what all of these abbreviations mean - thankfully, the Child Mind Institute was kind enough to include a **data dictionary** for this competition, which gives some extra information for each variable. Here is a little preview - [you can view the full file on the Kaggle page](https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/data?select=data_dictionary.csv).

In [10]:
# load data dictionary
data_dict = pd.read_csv("/Users/tomragus/Library/CloudStorage/OneDrive-UCSanDiego/CMI-PIU-Model/data/data_dictionary.csv")

# display first 5 rows of data dictionary
print("""Data Dictionary: what each feature means""")
print(f"Data Dictionary shape: {data_dict.shape}")
display(data_dict.head())

Data Dictionary: what each feature means
Data Dictionary shape: (81, 6)


Unnamed: 0,Instrument,Field,Description,Type,Values,Value Labels
0,Identifier,id,Participant's ID,str,,
1,Demographics,Basic_Demos-Enroll_Season,Season of enrollment,str,"Spring, Summer, Fall, Winter",
2,Demographics,Basic_Demos-Age,Age of participant,float,,
3,Demographics,Basic_Demos-Sex,Sex of participant,categorical int,01,"0=Male, 1=Female"
4,Children's Global Assessment Scale,CGAS-Season,Season of participation,str,"Spring, Summer, Fall, Winter",


While this only gives us a snippet of the data at hand, we can see the SII and the PCIAT scores on the right side of the train set. The SII scores range from 0 to 3, with 0 representing no impairment, and 3 representing severe impairment. So, we can think of the problem as training our models to **classify** each id in the test set into one of the 4 SII classes (0, 1, 2 or 3). Classification calls for **supervised learning**.

Looking at the shape of the train set, we can see that it has almost 4,000 entries 🤯 This is good news: the more data we have, the more reliably we can train and tune our models.

### Loading the rest of our libraries...

In [None]:
# load all libraries
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os
from scipy import stats
from scipy.stats import pearsonr, spearmanr
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, make_scorer
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from datetime import datetime
import xgboost as xgb
import lightgbm as lgb
import warnings
from matplotlib.colors import ListedColormap

warnings.filterwarnings("ignore", category=RuntimeWarning)

### **Exploratory Data Analysis - Feature Data**

Let us first look at the features that are related to the SII and are not present in the test set.

In [None]:
# isolating train-only features
train_cols = set(train.columns)
test_cols = set(test.columns)
columns_not_in_test = sorted(list(train_cols - test_cols))

# adding information from the data dictionary
data_dict[data_dict['Field'].isin(columns_not_in_test)]

Unnamed: 0,Instrument,Field,Description,Type,Values,Value Labels
54,Parent-Child Internet Addiction Test,PCIAT-Season,Season of participation,str,"Spring, Summer, Fall, Winter",
55,Parent-Child Internet Addiction Test,PCIAT-PCIAT_01,How often does your child disobey time limits ...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."
56,Parent-Child Internet Addiction Test,PCIAT-PCIAT_02,How often does your child neglect household ch...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."
57,Parent-Child Internet Addiction Test,PCIAT-PCIAT_03,How often does your child prefer to spend time...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."
58,Parent-Child Internet Addiction Test,PCIAT-PCIAT_04,How often does your child form new relationshi...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."
59,Parent-Child Internet Addiction Test,PCIAT-PCIAT_05,How often do you complain about the amount of ...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."
60,Parent-Child Internet Addiction Test,PCIAT-PCIAT_06,How often do your child's grades suffer becaus...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."
61,Parent-Child Internet Addiction Test,PCIAT-PCIAT_07,How often does your child check his or her e-m...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."
62,Parent-Child Internet Addiction Test,PCIAT-PCIAT_08,How often does your child seem withdrawn from ...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."
63,Parent-Child Internet Addiction Test,PCIAT-PCIAT_09,How often does your child become defensive or ...,categorical int,012345,"0=Does Not Apply, 1=Rarely, 2=Occasionally, 3=..."


Here we see each item in the Parent-Child Internet Addiction Test (PCIAT). Each question (the Description column) assesses a different aspect of a child's behavior related to internet use. Responses are given on a scale from 0 to 5, and the total score indicates the severity of internet addiction.

We also have the season of participation in PCIAT-Season and total score in PCIAT-PCIAT_Total; so there are a total of 22 PCIAT test-related columns.
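The count of 22 can be checked with a little arithmetic. The sketch below rebuilds the PCIAT column names from the naming pattern visible in the tables above (items `_01` through `_20`, which is an assumption for the items not shown) and compares the result with the reported train and test shapes.

```python
# Column names rebuilt from the pattern seen in the data dictionary;
# items 10-17 are assumed to follow the same PCIAT-PCIAT_XX naming.
pciat_items = [f"PCIAT-PCIAT_{i:02d}" for i in range(1, 21)]
pciat_cols = ["PCIAT-Season"] + pciat_items + ["PCIAT-PCIAT_Total"]

# the target column is also absent from the test set
train_only_cols = pciat_cols + ["sii"]

# shapes reported earlier: train has 82 columns, test has 59
n_train_cols, n_test_cols = 82, 59

print(f"PCIAT-related columns: {len(pciat_cols)}")
print(f"Train-only columns: {len(train_only_cols)}")
print(f"Matches shape difference: {n_train_cols - n_test_cols == len(train_only_cols)}")
```

So the 23-column gap between the two sets is fully explained by the 22 PCIAT columns plus `sii`.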

Here we will verify that the PCIAT-PCIAT_Total values align with the corresponding SII categories by calculating the minimum and maximum scores for each SII category:

In [None]:
# calculate max and min
pciat_min_max = train.groupby('sii')['PCIAT-PCIAT_Total'].agg(['min', 'max'])
pciat_min_max = pciat_min_max.rename(columns={'min': 'Minimum PCIAT total Score', 'max': 'Maximum total PCIAT Score'})
pciat_min_max

Unnamed: 0_level_0,Minimum PCIAT total Score,Maximum total PCIAT Score
sii,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,0.0,30.0
1.0,31.0,49.0
2.0,50.0,79.0
3.0,80.0,93.0


In [21]:
# display range for each level of severity
data_dict[data_dict['Field'] == 'PCIAT-PCIAT_Total']['Value Labels'].iloc[0]

'Severity Impairment Index: 0-30=None; 31-49=Mild; 50-79=Moderate; 80-100=Severe'
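The ranges in this label can be turned into a small binning sketch. The example below uses `pd.cut` with bin edges taken from the label (0-30, 31-49, 50-79, 80-100) and applies it to a handful of made-up totals; the variable names are illustrative, and missing totals are simply left as `NaN`.

```python
import numpy as np
import pandas as pd

# made-up PCIAT totals covering each boundary, plus one missing value
totals = pd.Series([0, 30, 31, 49, 50, 79, 80, 100, np.nan])

# right-inclusive bins: (-1, 30], (30, 49], (49, 79], (79, 100]
bin_edges = [-1, 30, 49, 79, 100]

# labels=False returns the bin index (0-3), which matches the SII class
sii_from_total = pd.cut(totals, bins=bin_edges, labels=False)

print(pd.DataFrame({"PCIAT total": totals, "SII": sii_from_total}))
```

Applied to the boundary values, this reproduces the minimum and maximum totals per SII class shown in the table above.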