The Effects of the Covid-19 Pandemic on Various Life Factors in Students

So far, in the Data Wrangling phase, we have determined that there is a significant need for school-based interventions to help students move past the negative effects of the Covid-19 pandemic. In particular, relationships and school stress were significantly affected. In the next phase of our study, we aim to develop a machine learning model to predict the effectiveness of some interventions that have already been used to help adolescents post-pandemic. The data cleaning process addressed missing values and categorical variables. Feature engineering included data type checks and the creation of a preprocessor for numerical scaling. We explored several classification models: decision trees, random forests, gradient boosting, and logistic regression. We used grid search to tune hyperparameters and cross-validation to check how well the models generalize beyond the training data. Along the way, we encountered an imbalance in the class distribution, which we worked to address with oversampling.

In [1]:
#Import packages
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")

# reset warnings to default behavior
warnings.resetwarnings()


First, let's load the original dataset, which demonstrated a need for interventions within the schools to help students post-pandemic.

In [2]:
#Load the data
# Get the path to the "documents" folder
documents_folder = os.path.expanduser('~/Documents')

# Specify the file name
file_name = 'Covid_Responses.csv'

# Construct the full file path
file_path = os.path.join(documents_folder, file_name)

# Load the CSV file into a DataFrame
data = pd.read_csv(file_path)

# Confirm the dataset loaded properly
data.head()

Unnamed: 0,Category,Country,State,Age,Gender,Before-Environment,Before-ClassworkStress,Before-HomeworkStress,Before-HomeworkHours,Now-Environment,Now-ClassworkStress,Now-HomeworkStress,Now-HomeworkHours,FamilyRelationships,FriendRelationships
0,SchoolCollegeTraining,US,TX,14,Male,Physical,1,3,2.0,Virtual,3,5,4.5,2,-1
1,SchoolCollegeTraining,US,MD,13,Male,Physical,5,4,2.0,Virtual,3,5,2.5,1,-2
2,Homeschool,US,TX,16,Female,Virtual,1,3,10.0,Virtual,3,5,15.0,1,-1
3,SchoolCollegeTraining,US,GA,17,Male,Physical,4,4,6.0,Physical,5,1,6.0,0,-2
4,SchoolCollegeTraining,GB,,14,Male,Physical,3,4,4.0,Physical,5,5,6.0,0,1


Now, let's load, examine, and clean (if needed) a dataset in which various interventions were implemented among adolescents.

In [3]:
#Load the data
# Get the path to the "documents" folder
documents_folder = os.path.expanduser('~/Documents')

# Specify the file name
file_name = 'covid_interventions.csv'

# Construct the full file path
file_path = os.path.join(documents_folder, file_name)

# Load the CSV file into a DataFrame
interventions = pd.read_csv(file_path)

# Display the structure of the dataset
interventions.info()

# Display the first few rows of the dataset to get an overview
interventions.head()




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Author     20 non-null     object 
 1   Group      40 non-null     object 
 2   T0         0 non-null      float64
 3   Scale      177 non-null    object 
 4   Measure    177 non-null    object 
 5   SD         177 non-null    object 
 6   N          31 non-null     float64
 7   T1         38 non-null     object 
 8   Scale.1    235 non-null    object 
 9   Measure.1  235 non-null    object 
 10  SD.1       235 non-null    object 
 11  N.1        40 non-null     float64
 12  T2         5 non-null      object 
 13  Scale.2    20 non-null     object 
 14  Measure.2  20 non-null     float64
 15  SD.2       20 non-null     float64
 16  N.2        5 non-null      float64
dtypes: float64(6), object(11)
memory usage: 31.3+ KB


Unnamed: 0,Author,Group,T0,Scale,Measure,SD,N,T1,Scale.1,Measure.1,SD.1,N.1,T2,Scale.2,Measure.2,SD.2,N.2
0,Cataldi,Control-Y,,BMI,22.48,2.2,15.0,Immediately,BMI,22.45,2.12,15.0,,,,,
1,,,,Waist circumference (cm),74.87,7.59,,,Waist circumference (cm),74.9,7.53,,,,,,
2,,,,Squat test (rep),28.89,2.4,,,Squat test (rep),29.4,2.56,,,,,,
3,,,,Push-up test (rep),9.13,4.2,,,Push-up test (rep),9.53,4.34,,,,,,
4,,,,Lunge test (rep),31.13,5.4,,,Lunge test (rep),31.4,6.07,,,,,,


In [4]:
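# Keep only columns where at least 50% of the values are non-null
# (thresh is the minimum number of non-null values a column needs to be kept)
threshold = int(len(interventions) * 0.5)
interventions = interventions.dropna(thresh=threshold, axis=1)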

# Display the structure of the cleaned dataset
interventions.info()

# Check for missing values
missing_values = interventions.isnull().sum()
print("Missing values per column:\n", missing_values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Scale      177 non-null    object
 1   Measure    177 non-null    object
 2   SD         177 non-null    object
 3   Scale.1    235 non-null    object
 4   Measure.1  235 non-null    object
 5   SD.1       235 non-null    object
dtypes: object(6)
memory usage: 11.1+ KB
Missing values per column:
 Scale        58
Measure      58
SD           58
Scale.1       0
Measure.1     0
SD.1          0
dtype: int64
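
To make the meaning of `thresh` concrete, here is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not taken from the interventions data). It shows that `thresh` counts the non-null values a column must have in order to survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 4 rows, columns with different amounts of missing data
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],         # 3 non-null values
    "half_full": [1.0, np.nan, 3.0, np.nan],        # 2 non-null values
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],  # 1 non-null value
})

# thresh is the minimum number of NON-NULL values required to keep a column
min_non_null = int(len(toy) * 0.5)  # 2
kept = toy.dropna(thresh=min_non_null, axis=1)

print(kept.columns.tolist())  # ['mostly_full', 'half_full']
```

A column with exactly half of its values present is kept, so the rule is "at least 50% non-null" rather than "more than 50% non-null".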


In [5]:
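# Drop columns where all values are NaN
# .copy() gives an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
interventions_cleaned = interventions.dropna(axis=1, how='all').copy()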

# Display the dataframe info after dropping empty columns
print(interventions_cleaned.info())

# Check the columns to see the numerical columns available
num_cols = interventions_cleaned.select_dtypes(include=['float64', 'int64']).columns
print("Numerical Columns:", num_cols)

# Display missing values in numerical columns
print(interventions_cleaned[num_cols].isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Scale      177 non-null    object
 1   Measure    177 non-null    object
 2   SD         177 non-null    object
 3   Scale.1    235 non-null    object
 4   Measure.1  235 non-null    object
 5   SD.1       235 non-null    object
dtypes: object(6)
memory usage: 11.1+ KB
None
Numerical Columns: Index([], dtype='object')
Series([], dtype: float64)


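Continuing the cleaning process for the new dataset, let's impute the missing values so that we can move forward with the analysis. Because no numerical columns were detected above, the mean imputer has nothing to act on at this point.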

In [6]:
# Import the SimpleImputer from sklearn
from sklearn.impute import SimpleImputer

# Create an imputer for numerical columns with the strategy 'mean'
imputer_num = SimpleImputer(strategy='mean')

# Apply the imputer to numerical columns
if len(num_cols) > 0:
    interventions_cleaned[num_cols] = imputer_num.fit_transform(interventions_cleaned[num_cols])

# Display the head of the cleaned dataframe
print(interventions_cleaned.head())


                      Scale Measure    SD                   Scale.1 Measure.1  \
0                      BMI    22.48   2.2                      BMI      22.45   
1  Waist circumference (cm)   74.87  7.59  Waist circumference (cm)      74.9   
2          Squat test (rep)   28.89   2.4          Squat test (rep)      29.4   
3        Push-up test (rep)    9.13   4.2        Push-up test (rep)      9.53   
4          Lunge test (rep)   31.13   5.4          Lunge test (rep)      31.4   

   SD.1  
0  2.12  
1  7.53  
2  2.56  
3  4.34  
4  6.07  


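In the spirit of thoroughness, let's convert the categorical variables into numerical ones. Note that Measure and SD hold numbers stored as text, so label encoding replaces each distinct value with an integer code instead of preserving its magnitude, and missing entries become the string 'nan' with a code of their own.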

In [7]:
from sklearn.preprocessing import LabelEncoder

# Identify categorical columns
cat_cols = interventions_cleaned.select_dtypes(include=['object']).columns
print("Categorical Columns:", cat_cols)

# Apply LabelEncoder to categorical columns
for col in cat_cols:
    le = LabelEncoder()
    interventions_cleaned[col] = le.fit_transform(interventions_cleaned[col].astype(str))

# Display the head of the dataframe after encoding
print(interventions_cleaned.head())


Categorical Columns: Index(['Scale', 'Measure', 'SD', 'Scale.1', 'Measure.1', 'SD.1'], dtype='object')
   Scale  Measure   SD  Scale.1  Measure.1  SD.1
0      9       66   52       10         74    60
1     70      129  105       96        161   106
2     62       87   57       86         88    65
3     51      137   78       71        167    89
4     34       98   92       51        117    99
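
For comparison, here is a small sketch of an alternative for columns that hold numbers stored as text: parsing them with `pd.to_numeric` instead of label encoding. The series below is hypothetical stand-in data, and whether this suits the real `Measure` and `SD` columns depends on what their non-numeric entries look like, which we have not inspected here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for a column of numbers stored as strings,
# with one entry that cannot be parsed as a number
raw = pd.Series(["9.13", "22.48", "74.87", "n/a"], name="Measure")

# Label encoding assigns codes by sorted string order, not by numeric value
codes = LabelEncoder().fit_transform(raw)
print(list(codes))  # [2, 0, 1, 3]: "9.13" gets a larger code than "74.87"

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.tolist())  # [9.13, 22.48, 74.87, nan]
```

With the numeric version, the remaining NaN values could then be handled by the mean imputer, and a model would see the actual magnitudes.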


Now we proceed with the training phase.

In [8]:
# Define features and target variable
X = interventions_cleaned.drop('Measure', axis=1)
y = interventions_cleaned['Measure']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [9]:
# Display the structure and the first few rows of the cleaned dataset
print(interventions_cleaned.info())
print(interventions_cleaned.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Scale      235 non-null    int32
 1   Measure    235 non-null    int32
 2   SD         235 non-null    int32
 3   Scale.1    235 non-null    int32
 4   Measure.1  235 non-null    int32
 5   SD.1       235 non-null    int32
dtypes: int32(6)
memory usage: 5.6 KB
None
   Scale  Measure   SD  Scale.1  Measure.1  SD.1
0      9       66   52       10         74    60
1     70      129  105       96        161   106
2     62       87   57       86         88    65
3     51      137   78       71        167    89
4     34       98   92       51        117    99


In [10]:
# Define features and target variable
X = interventions_cleaned.drop('Measure', axis=1)  # Assuming 'Measure' is the target column
y = interventions_cleaned['Measure']

# Check the structure of X and y
print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("Features columns:", X.columns.tolist())


Features shape: (235, 5)
Target shape: (235,)
Features columns: ['Scale', 'SD', 'Scale.1', 'Measure.1', 'SD.1']


In [11]:
# Check if X_train and X_test contain features
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

# Display first few rows of X_train to inspect the features
print("X_train head:\n", X_train.head())

# Check for missing values in X_train
print("Missing values in X_train:\n", X_train.isnull().sum())


X_train shape: (188, 5)
X_test shape: (47, 5)
X_train head:
      Scale   SD  Scale.1  Measure.1  SD.1
117     72  126       49        104    31
155     23   11       34         45    16
148     26   28       40        176   133
158     32    5       46         57    11
232     24  125       35        168   133
Missing values in X_train:
 Scale        0
SD           0
Scale.1      0
Measure.1    0
SD.1         0
dtype: int64


In [12]:
# Ensure numeric_features and categorical_features are correctly identified
numeric_features = ['Scale', 'SD', 'Scale.1', 'Measure.1', 'SD.1']
categorical_features = []  # If there are no categorical features

print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)


Numeric features: ['Scale', 'SD', 'Scale.1', 'Measure.1', 'SD.1']
Categorical features: []


In [13]:
# Define transformers for numerical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
        # If there are categorical features, add them similarly
        # ('cat', categorical_transformer, categorical_features)
    ]
)

print("Preprocessor defined successfully")


Preprocessor defined successfully


In [14]:
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")

# Apply the preprocessor to the training data to inspect the output
preprocessed_X_train = preprocessor.fit_transform(X_train)
print("Preprocessed X_train shape:", preprocessed_X_train.shape)
print("Preprocessed X_train (first few rows):\n", preprocessed_X_train[:5])

# Reset warnings to default behavior
warnings.resetwarnings()


Preprocessed X_train shape: (188, 5)
Preprocessed X_train (first few rows):
 [[ 1.16455334  0.94953601  0.01080755  0.05412634 -0.90722851]
 [-0.872641   -1.65170177 -0.54332478 -0.99547043 -1.22531683]
 [-0.74791481 -1.26717097 -0.32167185  1.33499019  1.2557721 ]
 [-0.49846245 -1.78741852 -0.10001892 -0.78199312 -1.33134628]
 [-0.8310656   0.92691655 -0.50638263  1.19267199  1.2557721 ]]


In [15]:
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")


# Define the full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")

# Reset warnings to default behavior
warnings.resetwarnings()




Model Accuracy: 0.2553191489361702
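
To put an accuracy figure like this in context, it can help to compare it against a baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (the `*_toy` arrays are hypothetical). In the notebook, the same two lines of fitting and scoring could be run on `X_train`, `y_train`, `X_test`, and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: one dominant class and several rare ones
y_train_toy = np.array([0] * 6 + [1, 2, 3, 4])
y_test_toy = np.array([0, 0, 1, 5])

# The baseline ignores the features, so placeholder columns are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(f"Baseline accuracy: {baseline_acc}")  # 0.5
```

If a trained model does not clearly beat this number, it has learned little beyond the class frequencies.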


Model accuracy is not very strong. Let's see whether hyperparameter tuning helps.

In [16]:
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")

# Define parameter grid
param_grid = {
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__max_features': [None, 'sqrt', 'log2']
}

# Create a GridSearchCV instance
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=2)

# Suppress the warning
warnings.filterwarnings("ignore", message="The least populated class in y has only 1 member, which is less than n_splits=5.", category=UserWarning)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print(f"Best parameters: {best_params}")
print(f"Best model: {best_model}")

# Predict on the test set with the best model
y_pred_best = best_model.predict(X_test)

# Evaluate the best model
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Model Accuracy after GridSearchCV: {accuracy_best}")


# Reset warnings to default behavior
warnings.resetwarnings()

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best parameters: {'classifier__max_depth': None, 'classifier__max_features': None, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2}
Best model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Scale', 'SD', 'Scale.1',
                                                   'Measure.1', 'SD.1'])])),
                ('classifier', DecisionTreeClassifier())])
Model Accuracy after GridSearchCV: 0.2553191489361702


In [17]:
import warnings

# Define different classifiers to try
classifiers = {
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=1000)
}

# Suppress the warning
warnings.filterwarnings("ignore", category=DeprecationWarning)

for name, classifier in classifiers.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model: {name}, Accuracy: {accuracy}")

# Reset the warnings to default behavior after fitting
warnings.resetwarnings()


Model: DecisionTree, Accuracy: 0.2765957446808511
Model: RandomForest, Accuracy: 0.2765957446808511
Model: GradientBoosting, Accuracy: 0.2978723404255319
Model: LogisticRegression, Accuracy: 0.2127659574468085


In [18]:
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")

# Choose the best model from previous steps, e.g., RandomForestClassifier
best_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Evaluate using cross-validation
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5)
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean()}")

# Reset warnings to default behavior
warnings.resetwarnings()


Cross-Validation Scores: [0.42105263 0.42105263 0.34210526 0.37837838 0.40540541]
Mean CV Accuracy: 0.3935988620199147


In [19]:
# Check class distribution
print(y_train.value_counts())


Measure
145    49
144    15
16      4
6       2
125     2
       ..
60      1
54      1
42      1
26      1
51      1
Name: count, Length: 114, dtype: int64
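
With 114 distinct target values in 188 training rows, most classes can only appear once or twice. The sketch below, on a made-up target (`y_toy` is hypothetical), shows one way to summarize this: counting single-member classes and the share of the largest class. Single-member classes are what the warning suppressed earlier ("The least populated class in y has only 1 member") refers to, since such a class cannot be present in every cross-validation fold.

```python
import pandas as pd

# Hypothetical target with one dominant class and several singletons
y_toy = pd.Series([145, 145, 145, 145, 144, 144, 16, 6, 60, 54])

counts = y_toy.value_counts()              # sorted from most to least frequent
n_classes = counts.size
n_singletons = int((counts == 1).sum())    # classes that occur exactly once
majority_share = counts.iloc[0] / len(y_toy)

print(f"Classes: {n_classes}, singletons: {n_singletons}, "
      f"majority share: {majority_share:.2f}")
# Classes: 6, singletons: 4, majority share: 0.40
```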


In [20]:
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")

# Define features and target variable
X = interventions_cleaned.drop('Measure', axis=1)
y = interventions_cleaned['Measure']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define numeric and categorical features
numeric_features = ['Scale', 'SD', 'Scale.1', 'Measure.1', 'SD.1']
categorical_features = []  # Add any categorical features if they exist

# Define transformers for numerical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ]
)

# Define the full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__max_features': [None, 'sqrt', 'log2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print(f"Best parameters: {best_params}")
print(f"Best model: {best_model}")

y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Model Accuracy after GridSearchCV: {accuracy_best}")

# Trying different classifiers
classifiers = {
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=1000)
}

for name, classifier in classifiers.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model: {name}, Accuracy: {accuracy}")

# Cross-validation with the best model (e.g., RandomForestClassifier)
best_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

cv_scores = cross_val_score(best_model, X_train, y_train, cv=5)
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean()}")

# Reset warnings to default behavior
warnings.resetwarnings()

Model Accuracy: 0.2553191489361702
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best parameters: {'classifier__max_depth': None, 'classifier__max_features': None, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2}
Best model: Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Scale', 'SD', 'Scale.1',
                                                   'Measure.1', 'SD.1'])])),
                ('classifier', DecisionTreeClassifier())])
Model Accuracy after GridSearchCV: 0.2765957446808511
Model: DecisionTree, Accuracy: 0.2765957446808511
M

In [None]:
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")

# Define features and target variable
X = interventions_cleaned.drop('Measure', axis=1)
y = interventions_cleaned['Measure']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define numeric features
numeric_features = ['Scale', 'SD', 'Scale.1', 'Measure.1', 'SD.1']
categorical_features = []  # Add any categorical features if they exist

# Define transformers for numerical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ]
)

# Define classifiers with their respective hyperparameter grids
classifiers = {
    'DecisionTree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'classifier__max_depth': [None, 10, 20, 30],
            'classifier__min_samples_split': [2, 5, 10],
            'classifier__min_samples_leaf': [1, 2, 4],
            'classifier__max_features': [None, 'sqrt', 'log2']
        }
    },
    'RandomForest': {
        'model': RandomForestClassifier(),
        'params': {
            'classifier__n_estimators': [50, 100, 200],
            'classifier__max_depth': [None, 10, 20, 30],
            'classifier__min_samples_split': [2, 5, 10],
            'classifier__min_samples_leaf': [1, 2, 4],
            'classifier__max_features': [None, 'sqrt', 'log2']
        }
    },
    'GradientBoosting': {
        'model': GradientBoostingClassifier(),
        'params': {
            'classifier__n_estimators': [50, 100, 200],
            'classifier__learning_rate': [0.01, 0.1, 0.2],
            'classifier__max_depth': [3, 5, 7]
        }
    },
    'LogisticRegression': {
        'model': LogisticRegression(max_iter=1000),
        'params': {
            'classifier__C': [0.01, 0.1, 1, 10, 100],
            'classifier__solver': ['lbfgs', 'saga']
        }
    }
}

# Initialize a variable to store the best model and its score
best_model = None
best_score = 0

# Iterate through classifiers and perform GridSearchCV
for name, clf_info in classifiers.items():
    pipeline = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('oversampler', RandomOverSampler(random_state=42)),
        ('classifier', clf_info['model'])
    ])
    
    grid_search = GridSearchCV(pipeline, clf_info['params'], cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)
    
    y_pred = grid_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model: {name}, Best Parameters: {grid_search.best_params_}, Accuracy: {accuracy}")
    
    if accuracy > best_score:
        best_score = accuracy
        best_model = grid_search.best_estimator_

# Cross-validation with the best model
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5)
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean()}")

# Reset warnings to default behavior
warnings.resetwarnings()


Fitting 5 folds for each of 108 candidates, totalling 540 fits
Model: DecisionTree, Best Parameters: {'classifier__max_depth': None, 'classifier__max_features': None, 'classifier__min_samples_leaf': 4, 'classifier__min_samples_split': 5}, Accuracy: 0.2765957446808511
Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Model: RandomForest, Best Parameters: {'classifier__max_depth': 30, 'classifier__max_features': 'log2', 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 10, 'classifier__n_estimators': 50}, Accuracy: 0.2978723404255319
Fitting 5 folds for each of 27 candidates, totalling 135 fits
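
To illustrate the idea behind the oversampling step, here is a hand-written sketch of random oversampling in NumPy. It is not imbalanced-learn's implementation of `RandomOverSampler`, only the same idea under simple assumptions: rows of the smaller classes are duplicated at random until every class is as large as the biggest one. The function name `random_oversample` and the toy arrays are hypothetical.

```python
from collections import Counter

import numpy as np


def random_oversample(X, y, seed=42):
    """Duplicate rows of smaller classes until every class matches the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw (with replacement) only the extra rows this class still needs
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]


X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 0, 0, 0, 1, 2])
X_res, y_res = random_oversample(X_toy, y_toy)

print(Counter(y_res.tolist()))  # Counter({0: 4, 1: 4, 2: 4})
```

Placing the sampler inside `ImbPipeline`, as the cell above does, means resampling happens only when the pipeline is fitted, so the validation folds and the test set keep their original class distribution.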


In [None]:
import warnings
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE

# Suppress warnings for clarity
warnings.filterwarnings("ignore")

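# Define oversampler with random state for reproducibility
# Note: SMOTE interpolates between same-class neighbors (k_neighbors=5 by default),
# so every resampled class needs more than k_neighbors samples; the single-member
# classes seen in y_train.value_counts() above are likely to make fitting fail
oversampler = SMOTE(random_state=42)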

# Iterate through classifiers and perform GridSearchCV
for name, clf_info in classifiers.items():
    print(f"Processing {name} classifier...")
    
    # Option 1: Use SMOTE for oversampling (replace with class weighting if preferred)
    pipeline = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('oversampler', oversampler),
        ('classifier', clf_info['model'])
    ])

    grid_search = GridSearchCV(pipeline, clf_info['params'], cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)
    
    print(f"Grid search completed for {name} classifier.")
    
    y_pred = grid_search.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model: {name}, Best Parameters: {grid_search.best_params_}, Accuracy: {accuracy}")

    if accuracy > best_score:
        best_score = accuracy
        best_model = grid_search.best_estimator_

# Cross-validation with the best model
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5)
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean()}")

# Reset warnings to default behavior
warnings.resetwarnings()


In [None]:
# Assumes X and y (features and target from interventions_cleaned) are already defined

# Split data into training, validation, and test sets (roughly 70% train, 15% validation, 15% test)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
# 15% of the full dataset corresponds to 0.15 / 0.85 of the rows that remain
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.15 / 0.85, random_state=42)

# Define transformers and preprocessor (same as before)
# ... (your existing preprocessor code) ...

# Use the best model identified from GridSearchCV
# (best_model was set to grid_search.best_estimator_ in the grid-search loops above)

# Train the best model on the entire training set
best_model.fit(X_train, y_train)

# Evaluate on validation set (optional, for hyperparameter tuning)
val_pred = best_model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
print(f"Validation Accuracy: {val_accuracy}")

# Evaluate on hold-out test set
test_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
print(f"Test Accuracy: {test_accuracy}")
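
As a quick check of the arithmetic behind a 70/15/15 split, the sketch below splits a made-up dataset of 100 rows in two steps (all `*_toy` names are hypothetical). It passes integer row counts as `test_size` so the resulting sizes are exact.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, binary target
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

n_rows = len(X_toy)
n_test = int(round(0.15 * n_rows))  # 15 rows for the hold-out test set
n_val = int(round(0.15 * n_rows))   # 15 rows for validation

# Step 1: carve off the test set
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=n_test, random_state=42
)
# Step 2: carve the validation set out of what remains
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=42
)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))  # 70 15 15
```

Note that splitting off 30% and then 20% of the remainder would give roughly 56/14/30 instead, which is why the second fraction has to be computed relative to the rows left after the first split.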


Inferences about Intervention Effectiveness (with Limitations):

While definitively concluding about the effectiveness of interventions is outside the scope of this initial model, we can glean some tentative inferences. Feature importance analysis from the final model can reveal which intervention-related features have the strongest influence on the target variable. Interventions with high importance might warrant further investigation for potential effectiveness. However, it's crucial to remember that correlation does not imply causation. Other factors could be influencing the target variable.
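
As a rough sketch of what such a feature importance analysis could look like, the example below fits a small pipeline on synthetic data and reads the importances from the tree-based classifier step. The data, the column names `signal` and `noise`, and the two-step pipeline are all hypothetical. In the notebook, the classifier would be reached through the `'classifier'` step of the fitted pipeline in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 'signal' determines the label, 'noise' is unrelated
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = (X_toy["signal"] > 0).astype(int)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_toy, y_toy)

# Impurity-based importances from the fitted forest, labeled by column name
importances = pd.Series(
    model.named_steps["classifier"].feature_importances_,
    index=X_toy.columns,
).sort_values(ascending=False)
print(importances)
```

These importances sum to 1 and describe how much each feature contributed to the splits in the trees, which is a statement about the model rather than about cause and effect.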

Next Steps: Modeling for Intervention Effectiveness

Here in the Pre-processing and Training phase, we have laid the groundwork for a more in-depth analysis of intervention effectiveness in the next phase of modeling. Here is how we will proceed:

Target Variable Selection: We will define a new target variable specifically focused on assessing intervention effectiveness. This could involve metrics like reduced infection rates, improved economic indicators, or other relevant measures tied to the interventions' goals.

Intervention Features: We will ensure intervention data is included as features in the next model. This might involve coding interventions as categorical variables or creating numerical representations based on their intensity or duration.

Model Selection: Based on the findings from the current workflow (e.g., best performing model), we will choose an appropriate model for the new target variable and intervention analysis.

Evaluation Metrics: We will utilize relevant evaluation metrics to assess the model's performance in predicting intervention effectiveness. This might involve metrics like precision, recall, or F1-score for classification tasks, or mean squared error (MSE) for regression tasks.
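
For the classification metrics mentioned above, a minimal sketch with scikit-learn might look like the following. The labels are made up for illustration, and macro averaging is one possible choice (an assumption here, not something decided in this notebook) that weights every class equally regardless of how often it occurs.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted labels for a 3-class problem
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

# Macro averaging: compute each metric per class, then take the plain mean.
# zero_division=0 scores a class that is never predicted as 0 instead of warning.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.3f}")          # 0.667
print(f"Macro precision: {precision:.3f}")  # 0.444
print(f"Macro recall: {recall:.3f}")        # 0.556
print(f"Macro F1: {f1:.3f}")                # 0.489
```

Here accuracy looks better than the macro scores because class 2 is never predicted correctly, which is the kind of gap that accuracy alone can hide on imbalanced data.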


Conclusion:

This study developed a foundational machine learning model, although its test accuracy so far remains low (roughly 0.21 to 0.30 across the classifiers tried). By analyzing feature importance, defining an intervention-specific target variable, incorporating intervention data as features, and using relevant evaluation metrics in the next phase, we can gain valuable insights into how well the interventions work within this dataset. This paves the way for a more robust and targeted analysis of intervention effectiveness.