The Effects of the Covid-19 Pandemic on Various Life Factors of Students

Here is a summary of what we have covered in our exploration of the effects of the Covid-19 pandemic on students:

Data Wrangling:

Cleaned and prepared a dataset containing various measures of student performance and potential interventions.
Addressed missing values, standardized numerical data, and ensured categorical data was properly encoded.

Exploratory Data Analysis (EDA):

Identified key trends and patterns in the data, such as the distribution of performance measures and the prevalence of certain interventions.
Visualized relationships between different variables to understand how various factors influence student performance.

Preprocessing and Training:

Focused on preparing the data for modeling by handling rare classes in the target variable and encoding categorical variables.
Used techniques like SMOTE to address class imbalance in the dataset.
Trained and evaluated various machine learning models, including Decision Trees, Random Forests, Gradient Boosting, Logistic Regression, XGBoost, and LightGBM.

We now move into the modeling process to examine the effectiveness of multiple intervention strategies. The results will inform which interventions we recommend to schools to help their students move past the effects of the pandemic. As always, we begin by importing the necessary packages, loading the dataset, and checking that it loads properly.

In [27]:
#import packages
import pandas as pd
import numpy as np
import warnings
import os
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
import xgboost as xgb
import lightgbm as lgb
warnings.filterwarnings("ignore")

#Load the data
# Get the path to the "documents" folder
documents_folder = os.path.expanduser('~/Documents')

# Specify the file name
file_name = 'covid_interventions.csv'

# Construct the full file path
file_path = os.path.join(documents_folder, file_name)

# Load the CSV file into a DataFrame
interventions = pd.read_csv(file_path)

interventions.head()

Unnamed: 0,Author,Group,T0,Scale,Measure,SD,N,T1,Scale.1,Measure.1,SD.1,N.1,T2,Scale.2,Measure.2,SD.2,N.2
0,Cataldi,Control-Y,,BMI,22.48,2.2,15.0,Immediately,BMI,22.45,2.12,15.0,,,,,
1,,,,Waist circumference (cm),74.87,7.59,,,Waist circumference (cm),74.9,7.53,,,,,,
2,,,,Squat test (rep),28.89,2.4,,,Squat test (rep),29.4,2.56,,,,,,
3,,,,Push-up test (rep),9.13,4.2,,,Push-up test (rep),9.53,4.34,,,,,,
4,,,,Lunge test (rep),31.13,5.4,,,Lunge test (rep),31.4,6.07,,,,,,
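
Beyond eyeballing `head()`, a few quick checks can confirm that the file loaded as expected. The sketch below is illustrative only: it uses a small stand-in frame named `demo` (an assumption, built from a few values in the rows printed above) rather than the real `interventions` DataFrame, and reports the shape, the column dtypes, and the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few values copied from the head() output above.
# 'demo' is a hypothetical name; the real data lives in 'interventions'.
demo = pd.DataFrame({
    "Author": ["Cataldi", np.nan, np.nan],
    "Scale": ["BMI", "Waist circumference (cm)", "Squat test (rep)"],
    "Measure": [22.48, 74.87, 28.89],
    "SD": [2.2, 7.59, 2.4],
    "N": [15.0, np.nan, np.nan],
})

# Number of rows and columns
print(demo.shape)

# Data type of each column (text columns load as object, numbers as float64)
print(demo.dtypes)

# Count of missing values in each column
missing_counts = demo.isna().sum()
print(missing_counts)
```

Running the same three calls on `interventions` would show which columns are text, which are numeric, and how sparse each one is before any modeling decisions are made.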


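Next, let's build a preprocessing pipeline. Numeric columns have their NaN values replaced with the column mean and are then standardized. Categorical columns have their NaN values replaced with the most frequent category and are then one-hot encoded. The preprocessed features feed a logistic regression model.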

In [46]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assuming your data is loaded into a pandas DataFrame called 'interventions'
# Define X and y
X = interventions.drop(columns=['N.2'])  # Drop the target variable column
y = interventions['N.2']  # Target variable

# Define preprocessing steps for numeric and categorical features.
# Select the numeric columns by dtype instead of listing them by hand: a hand-written
# list that includes the dropped target 'N.2' raises
# "ValueError: A given column is not a column of the dataframe",
# and the time labels T0, T1, T2 hold text such as 'Immediately'.
numeric_features = X.select_dtypes(include='number').columns.tolist()
categorical_features = [col for col in X.columns if col not in numeric_features]

# Define transformers for numeric and categorical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with most frequent category
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Create a pipeline with preprocessing and the model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
pipeline.fit(X_train, y_train)

Next, we tune the pipeline with a grid search over the hyperparameter grid stored in clf_info['params'], using 2-fold cross-validation.


In [13]:
from sklearn.model_selection import KFold

# Define the number of folds for cross-validation
cv = KFold(n_splits=2, shuffle=True, random_state=42)

# Perform grid search with adjusted cross-validation
grid_search = GridSearchCV(pipeline, clf_info['params'], cv=cv, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)


Fitting 2 folds for each of 27 candidates, totalling 54 fits
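
The contents of `clf_info['params']` are not shown in this notebook, so the following is only a sketch of what such a grid might look like. It assumes a hypothetical grid with three values for each of three logistic regression parameters, which yields 3 × 3 × 3 = 27 candidates, the same count reported above. The data is synthetic, and the names `demo_pipeline`, `param_grid`, and `demo_search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the interventions dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42)),
])

# Hypothetical grid. Keys use the '<step name>__<parameter>' convention so that
# GridSearchCV can route each value to the right step of the pipeline.
param_grid = {
    "classifier__C": [0.1, 1.0, 10.0],
    "classifier__tol": [1e-5, 1e-4, 1e-3],
    "classifier__max_iter": [200, 500, 1000],
}

# Number of parameter combinations the search will try
n_candidates = len(ParameterGrid(param_grid))

# Same cross-validation setup as the cell above: 2 shuffled folds
cv = KFold(n_splits=2, shuffle=True, random_state=42)
demo_search = GridSearchCV(demo_pipeline, param_grid, cv=cv)
demo_search.fit(X_demo, y_demo)

print("Candidates:", n_candidates)
print("Best parameters:", demo_search.best_params_)
print("Best CV accuracy:", round(demo_search.best_score_, 3))
```

After fitting, `best_params_` and `best_score_` summarize the winning combination, and `cv_results_` holds the per-candidate scores for closer inspection.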


Now, let's define classifiers with their respective hyperparameter grids. Then, we will iterate through each classifier, apply SMOTE for oversampling, and perform grid search cross-validation to find the best parameters for each model.

In [41]:
# Remove every row that contains at least one missing value
# (dropna() does not check for non-numeric values)
cleaned_data = interventions.dropna()

# Check whether any column still contains missing values
missing_values = cleaned_data.isnull().any()
print("Missing values in cleaned data:")
print(missing_values)

# Now you can proceed with fitting your pipeline using the cleaned data


Missing values in cleaned data:
Author       False
Group        False
T0           False
Scale        False
Measure      False
SD           False
N            False
T1           False
Scale.1      False
Measure.1    False
SD.1         False
N.1          False
T2           False
Scale.2      False
Measure.2    False
SD.2         False
N.2          False
dtype: bool


In [45]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Assuming your cleaned dataset is stored in a DataFrame called 'cleaned_data'
X_cleaned = cleaned_data.drop(columns=['Measure'])  # Drop the target column 'Measure' to get the feature matrix
y_cleaned = cleaned_data['Measure']  # Select the column 'Measure' as the target vector

print("Shape of X_cleaned:", X_cleaned.shape)
print("Shape of y_cleaned:", y_cleaned.shape)


Shape of X_cleaned: (0, 16)
Shape of y_cleaned: (0,)
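
A shape of (0, 16) means that `dropna()` removed every row: in this table each row has at least one empty cell, so no row survives, and the all-False missing-value report above is trivially true for an empty frame. The sketch below reproduces the effect on a small stand-in frame and shows one possible alternative. It assumes that blank `Author`, `Group`, and `N` cells inherit the value from the row above, which is a guess about the table's layout, not something confirmed by the source.

```python
import numpy as np
import pandas as pd

# Stand-in frame with values copied from the head() output; 'demo' is a hypothetical name
demo = pd.DataFrame({
    "Author": ["Cataldi", np.nan, np.nan],
    "Group": ["Control-Y", np.nan, np.nan],
    "Scale": ["BMI", "Waist circumference (cm)", "Squat test (rep)"],
    "Measure": [22.48, 74.87, 28.89],
    "N": [15.0, np.nan, np.nan],
    "T2": [np.nan, np.nan, np.nan],
})

# dropna() drops any row with at least one NaN; every row here has one
all_dropped = demo.dropna()
print(all_dropped.shape)  # (0, 6)

# Assumed layout: blank Author/Group/N cells repeat the value from the row above
filled = demo.copy()
filled[["Author", "Group", "N"]] = filled[["Author", "Group", "N"]].ffill()

# Drop columns that are entirely empty, then rows missing the target 'Measure'
kept = filled.dropna(axis=1, how="all").dropna(subset=["Measure"])
print(kept.shape)  # (3, 5)
```

Restricting `dropna` with `subset=` or `how="all"` keeps rows that are usable for the chosen target, leaving the remaining gaps for the imputers in the pipeline to handle.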


In [None]:
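# Step 5: Feature Scaling

from sklearn.preprocessing import StandardScaler

# StandardScaler only accepts numeric input, so select the numeric columns first
numeric_cols = interventions.select_dtypes(include='number').columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Scale the numeric features (NaN values are ignored when fitting and kept in the output)
interventions_scaled = scaler.fit_transform(interventions[numeric_cols])

# Convert the scaled features back to a DataFrame, keeping the original row index
interventions_scaled_df = pd.DataFrame(
    interventions_scaled, columns=numeric_cols, index=interventions.index
)

# Check the scaled dataset
print(interventions_scaled_df.head())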
