# An Exploration of Analysis Methods on Predictive Models of Student Success

### Alex Beckwith
### May 2023

## Quick Summary

- Built a system to train, test, & evaluate machine learning models
- Applied to educational data from an online university
- Used system to generate predictions
- Analyzed results

## Presentation Itinerary
- Introduction
    - Presentation Itinerary
    - Quick Summary
- Motivations
    - Personal Goals
    - Research Goals
    - Research Questions
- Previous Research
    - Learning Analytics/Education Data Mining
    - Predicting Student Performance
    - Model Evaluation Methods

- Experimental Architecture
    - Dataset
    - Feature Extraction
    - Algorithms & Hyperparameters
    - Model Pipeline
- Model Evaluation
    - Naive Averaging
    - Null Hypothesis Significance Testing (NHST)
    - Bayesian
    - Future Research
- Wrap Up
    - Questions
    - Tools Used
    - Top References

### Abstract 

Machine learning models are not always evaluated with statistical rigor. This can lead to inferential flaws when assumptions are made about the underlying data and the performance data, especially when cross-validation is used. In this paper, a Bayesian method of model evaluation is compared to a non-parametric frequentist method. In addition, a metric for analyzing the fairness of a particular algorithm is tested.

The evaluation techniques were applied to a dataset of student and course data made available by the Open University. A system was built to train and test predictive models of student success. The aim was to predict students at risk of failing or withdrawing from a course using the first 30 days of data extracted from the virtual learning environment. In an applied setting, these predictions could be used to direct additional resources to at-risk students. 

The project included creating a database to cleanse, transform, and analyze the dataset. Features were engineered to use as predictive inputs using a combination of exploratory analysis and inspiration from research. Four different subsets of input features were applied to nine different classification algorithms. Both randomized and exhaustive hyperparameter tuning procedures were experimented with, which created hundreds of distinct hyperparameter settings.  

The Bayesian strategy provided more conclusive results by determining a “region of practical equivalence” as opposed to an inability to reject the null hypothesis. The results were similar to findings from research, which typically had tree-based ensemble methods in the upper-equivalence region. 

The proposed metric for predictive fairness is called the Absolute Between Receiver Operating Characteristic Area (ABROCA). This metric was first introduced at the 2019 International Learning Analytics & Knowledge Conference. A significant relationship was found between ABROCA and the gender ratio of a course, as well as between ABROCA and the ratio of students in a course identifying as having a disability. No significant relationship was found between ABROCA and overall model performance.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import subplots
from seaborn import histplot

from model.params import show_params
from utils.constants import FIGURES_PATH
from utils.db_helpers import DbHelper, Table
from utils.get_figures import (
    PresentationFigures,
    SharedFigures,
    get_edm_venn,
    get_imd_band_displot,
)

In [2]:
dbh = DbHelper.default()
df = dbh.get_table("first30", "all_features")
features = Table("first30", "all_features", df)
logos = FIGURES_PATH / "logos"

In [None]:
df = dbh.get_table("landing", "student_info")
df.loc[:, "imd_band"] = df.loc[:, "imd_band"].apply(
    lambda x: "10-20%" if x == "10-20" else x
)
df.head(20)

In [None]:
get_imd_band_displot(df)

In [None]:
df = features.df.loc[:, ["final_result", "n_days_active", "student_id"]]
gb = df.groupby(["n_days_active", "final_result"]).count().reset_index()
gb.columns = ["n_days_active", "final_result", "count"]
# Show counts for students active on 27 to 30 of the first 30 days
gb[gb.loc[:, "n_days_active"].isin(range(27, 31))]

In [None]:
df = features.df.loc[
    :,
    [
        "avg_days_before_due_submitted",
        "var_days_before_due_submitted",
        "stddev_days_before_due_submitted",
        "min_days_before_due_submitted",
        "student_id",
        "n_days_active",
        "final_result",
    ],
]

# for x in [y for y in df.columns if "days" in y]:
#    print(x)
fig, ax = subplots(nrows=1, ncols=1, figsize=(16, 9), dpi=200)

histplot(
    df,
    x="min_days_before_due_submitted",
    binrange=(-20, 30),
    bins=50,
    hue="final_result",
    hue_order=["Distinction", "Pass", "Fail", "Withdrawn"],
    multiple="stack",
)

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(16, 9), dpi=200)
# One bar per day of activity; binwidth=1 overrides `bins`, so `bins` is omitted
sns.histplot(
    df,
    x="n_days_active",
    hue="final_result",
    hue_order=["Distinction", "Pass", "Fail", "Withdrawn"],
    ax=ax,
    multiple="stack",
    binwidth=1,
)
# figsave("n_days_active")

# Motivations

## Personal Goals
- Research personally relevant topic (education) 
- Apply knowledge of SQL/Python/data from job as data analyst 
- Apply interest/knowledge of predictive models learned independently and in data science program
- Increase knowledge of statistical evaluation methods

## Research Goals
- Evaluate machine learning models using best practices/methods/tooling
- Determine if Bayesian or Frequentist methods are better for machine learning problems
- Test new metric for evaluation of model fairness
- Apply above goals to case study with education dataset

## Research Questions
1. Which models and featuresets are best at predicting student outcomes?
2. How do the results differ when models are compared using naive, frequentist and Bayesian methods? 
3. Is there an association between model predictive performance and Absolute Between Receiver Operating Characteristic Area (ABROCA)?

# Previous Research

## Learning Analytics/Education Data Mining

In [None]:
get_edm_venn()

- Educational Data Mining (EDM) is concerned with developing methods for exploring the unique types of data that come from educational environments.
    - It can also be defined as the application of data mining (DM) techniques to data from educational environments in order to address important educational questions.
- Learning Analytics (LA) can be defined as the measurement, collection, analysis, and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs (Lang, Siemens, Wise, & Gasevic, 2017). This definition involves three crucial elements: data, analysis, and action.
- Source: Educational data mining and learning analytics: An updated survey (Cristobal Romero, Sebastian Ventura)

## Predicting Student Performance

- Within EDM/LA, looked specifically at predicting student performance
- Important to:
    - Detect early to divert resources to students in need
    - Trace knowledge transfer

### Most common types of prediction:
1. Classification
2. Regression
3. Clustering

### Common Predictions:
1. Final outcome
    - Dropout
    - Pass/Fail
2. Final grades
3. Deadline compliance

### Most common algorithms:
1. Tree-based
    - Decision Tree
    - Random Forest
    - Boosted
2. Regression
    - Logistic Regression
    - Linear Regression
3. Support Vector Machines
4. Bayesian
    - Naive Bayes
5. K-Nearest-Neighbor
6. Artificial Neural Networks

- best performing are typically ensemble methods

### Most common student data sources:
1. Computer-based learning environment
    - Massive Open Online Course (MOOCs)
    - Intelligent Tutoring Systems (ITS)
    - Learning Management System
2. In-person

- Online
    - More data available
    - Data more consistent
- Blended learning needs more study

### Most common feature types:
1. Academic data
    - Assessments
2. Demographic data
3. Behavior
    - Virtual learning environment (VLE) interactions
4. Financial aid data

### Feature Extraction Strategy
- Automated vs Expert Engineered vs Crowdsourced
- Automated can perform better, but often less interpretable

In [None]:
PresentationFigures.AUTOML_FEATURE_ENGINEERING.value.image

- AutoML Feature Engineering for Student Modeling Yields High Accuracy, but Limited Interpretability
- Nigel Bosch - University of Illinois Urbana-Champaign
- TSFRESH performed better than both, but was most difficult to interpret
- (Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests)
- Another source suggested crowdsourcing features in addition to 

## Model Evaluation Methods

### Naive Averaging
- Sorting and picking top average value
- Hard to extrapolate

- Simply sorting by metric and picking top value
- Difficult to discern differences between models
- No sense of variability between model types/settings
- Tough to weigh other factors like interpretability & fit time in justified way
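As a minimal sketch of why naive averaging is limiting, the toy example below ranks two models by mean ROC AUC and then shows the per-fold spread that the ranking hides. The scores are made-up illustrative values, not results from this project; only the model codes (`rforest`, `logreg`) are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical per-fold ROC AUC scores (illustrative values only).
scores = pd.DataFrame(
    {
        "model": ["rforest"] * 4 + ["logreg"] * 4,
        "roc_auc": [0.80, 0.70, 0.82, 0.72, 0.75, 0.76, 0.75, 0.74],
    }
)

# Naive approach: average each model's scores, sort, and take the top row.
summary = (
    scores.groupby("model")["roc_auc"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=False)
)
naive_winner = summary.index[0]

print(summary)
print("naive winner:", naive_winner)
```

The top-ranked model here also has the larger standard deviation, which the sorted mean alone does not reveal.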

### Frequentist (Null Hypothesis Significance Testing)

In [None]:
# Critical Difference Diagram Example (Nemenyi test)
PresentationFigures.CRITICAL_DIFFERENCE_NEMENYI.value.image

- Diagram shows regions for which the null hypothesis cannot be rejected
- Friedman test to show whether groups of results are similar (global test)
- Post-hoc Nemenyi Test to indicate if significant difference exists between two models
- Tough to compare large set of models in this way
- Used non-parametric tests to minimize assumptions of distributions of model data
- Better for model output data
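The Friedman and Nemenyi steps can be sketched roughly as follows. This is an illustration on synthetic scores, not the project's evaluation code: the score matrix is randomly generated, and the Nemenyi step is reduced to the critical difference (CD) formula from Demšar (2006), using the tabulated value `q_alpha = 2.343` for three models at a significance level of 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_runs, k = 10, 3

# Synthetic ROC AUC scores: rows are evaluation runs, columns are models.
scores = np.column_stack(
    [
        rng.normal(0.80, 0.01, n_runs),
        rng.normal(0.79, 0.01, n_runs),
        rng.normal(0.70, 0.01, n_runs),
    ]
)

# Global test: do the models differ at all?
statistic, p_value = stats.friedmanchisquare(*scores.T)

# Average rank of each model across runs (rank 1 = highest score).
ranks = stats.rankdata(-scores, axis=1)
avg_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: two models differ significantly when
# their average ranks are further apart than this value.
q_alpha = 2.343  # k = 3 models, alpha = 0.05
critical_difference = q_alpha * np.sqrt(k * (k + 1) / (6 * n_runs))

print(f"Friedman p-value: {p_value:.4f}")
print("average ranks:", avg_ranks)
print(f"critical difference: {critical_difference:.3f}")
```

Pairs of models whose average ranks differ by less than the critical difference are the ones joined by a bar in a critical difference diagram.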

### Bayesian

In [None]:
# Single Dataset ROPE Example
SharedFigures.BAYES_ROPE_PDF.value.image

- Uses Bayesian signed rank test to estimate probability of means being in a prespecified "Region of Practical Equivalence" (ROPE)
- This test used a ROPE value of 0.01, indicating that a 1% difference in means is a wide enough band to consider the performance of two models equivalent for all practical purposes
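To make the ROPE idea concrete, here is a small sketch. Note that it does not implement the signed-rank test named above; it substitutes the closed-form Bayesian correlated t-test described in the same tutorial by Benavoli et al., because that test can be written in a few lines with `scipy`. The function name and the score differences are hypothetical.

```python
import numpy as np
from scipy import stats


def correlated_ttest_rope(diffs, n_folds, rope=0.01):
    """Return P(left), P(rope), P(right) for paired score differences."""
    diffs = np.asarray(diffs, dtype=float)
    n = diffs.size
    # Correlation between folds caused by overlapping training sets.
    rho = 1.0 / n_folds
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * diffs.var(ddof=1))
    # Posterior of the mean difference is a Student t distribution.
    posterior = stats.t(df=n - 1, loc=diffs.mean(), scale=scale)
    p_left = posterior.cdf(-rope)  # second model practically better
    p_right = posterior.sf(rope)  # first model practically better
    p_rope = 1.0 - p_left - p_right  # practically equivalent
    return p_left, p_rope, p_right


# Hypothetical ROC AUC differences (model X minus model Y),
# one per fold from 5 folds x 2 repeats.
diffs = [0.031, 0.027, 0.035, 0.029, 0.040, 0.022, 0.033, 0.030, 0.026, 0.037]
p_left, p_rope, p_right = correlated_ttest_rope(diffs, n_folds=5, rope=0.01)

print(f"P(Y better) = {p_left:.4f}")
print(f"P(equivalent) = {p_rope:.4f}")
print(f"P(X better) = {p_right:.4f}")
```

Unlike a failure to reject a null hypothesis, the middle probability is a direct statement about how likely the two models are to be practically equivalent.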

In [None]:
# Multiple Dataset ROPE Example:
SharedFigures.BAYES_ROPE_POSTERIOR.value.image

- A Bayesian posterior plot resulting from a Bayesian hierarchical correlated t-test
- visualizes the results of Markov-Chain Monte Carlo (MCMC) sampling for the comparison of two models X and Y
- The estimated probability of each outcome is the proportion of samples that fall in each section of the plot.
- "heavier" than signed rank test, so less convenient
- baycomp uses hierarchical for multiple datasets, signed rank for single
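The "proportion of samples in each section" idea can be sketched as below. The Dirichlet draws are only a stand-in for real MCMC output (an assumption made so the example is self-contained); each row plays the role of one posterior sample of the three outcome probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MCMC output: each row is one posterior sample of
# (P(X better), P(practically equivalent), P(Y better)).
samples = rng.dirichlet(alpha=[6.0, 3.0, 1.0], size=5000)

# A sample falls in the section of the plot whose outcome it favors most.
region = samples.argmax(axis=1)
proportions = np.bincount(region, minlength=3) / len(samples)

for label, share in zip(["X better", "equivalent", "Y better"], proportions):
    print(f"{label}: {share:.3f}")
```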

## ABROCA | Slicing Analysis
### (Absolute Between Receiver Operating Characteristic Area)

### Receiver Operating Characteristic (ROC Curve)
- A function of the false positive rate to true positive rate over the range of threshold values for a predictor
- Area under ROC curve (ROC AUC) commonly used as metric to optimize performance of machine learning models.
    - Perfect predictor -> ROC AUC = $1.0$ (correct prediction at all threshold values)
    - Random predictor -> ROC AUC = $0.5$ (equally likely to pick correctly or incorrectly at all threshold values) 
- <a href="https://core.ac.uk/download/pdf/55142552.pdf">Link to more math</a>
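As an illustrative sketch (plain NumPy, with hypothetical helper names), the ROC curve can be traced by sweeping a threshold over the predicted scores, and the area under it computed with the trapezoid rule. The toy inputs check the two reference points above.

```python
import numpy as np


def roc_points(y_true, scores):
    """False and true positive rates at every distinct threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    # Start above every score (nothing predicted positive), then lower
    # the threshold one distinct score at a time.
    thresholds = np.r_[np.inf, np.unique(scores)[::-1]]
    tpr = np.array([(scores >= t)[y_true].mean() for t in thresholds])
    fpr = np.array([(scores >= t)[~y_true].mean() for t in thresholds])
    return fpr, tpr


def roc_auc(fpr, tpr):
    """Area under the ROC curve using the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


y = [0, 0, 1, 1]
perfect = roc_auc(*roc_points(y, [0.1, 0.2, 0.8, 0.9]))
uninformative = roc_auc(*roc_points(y, [0.5, 0.5, 0.5, 0.5]))

print("perfect predictor:", perfect)
print("constant predictor:", uninformative)
```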

In [None]:
# ROC Curve Example:
SharedFigures.HXBOOST_ROC_DEMO.value.image

- Explain ROC: originally used to measure the performance of radar equipment
- The Receiver Operating Characteristic (ROC) is a plot of the false positive rate to true positive rate over the range of threshold values
- Area under ROC curve (ROC AUC) commonly used as metric to optimize performance of machine learning models.
    - Perfect predictor -> ROC AUC = $1.0$ (correct prediction at all threshold values)
    - Random predictor -> ROC AUC = $0.5$ (equally likely to pick correctly or incorrectly at all threshold values) 

### ABROCA | Slicing Analysis
#### (Absolute Between Receiver Operating Characteristic Area)

- Proposed as metric with which to compare predictive model fairness
    - First introduced at 2019 International Learning Analytics and Knowledge Conference

In [None]:
SharedFigures.HXGBOOST_ABROCA_IS_FEMALE.value.image

- To calculate:
    - Split dataset by feature of interest
    - Calculate ROC curves for the model on each part of split dataset
    - Sum absolute values of between-curve area
- How does this relate to fairness?
    - A model that predicts subgroups of split dataset equally would have ABROCA = 0 (Same ROC curves, so no area between)
    - Hypothesis - Higher ABROCA associated with lower predictive fairness
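The calculation steps above can be sketched as follows. This is a simplified illustration rather than the project's implementation: the function name `abroca`, the boolean `in_group` mask, and the grid size are assumptions, and the two ROC curves are compared by interpolating them onto a shared false positive rate grid.

```python
import numpy as np
from sklearn.metrics import roc_curve


def abroca(y_true, y_score, in_group, n_grid=1001):
    """Area between the ROC curves of a subgroup and its complement."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    grid = np.linspace(0.0, 1.0, n_grid)

    curves = []
    for mask in (in_group, ~in_group):
        # ROC curve of the same model on one slice of the dataset.
        fpr, tpr, _ = roc_curve(y_true[mask], y_score[mask])
        curves.append(np.interp(grid, fpr, tpr))

    # Integrate the absolute gap between the curves (trapezoid rule).
    gap = np.abs(curves[0] - curves[1])
    return float(np.sum((gap[1:] + gap[:-1]) / 2 * np.diff(grid)))


y = [0, 1, 0, 1, 0, 1, 0, 1]
group = [True, True, True, True, False, False, False, False]

# Same predictions for both slices: the ROC curves coincide.
same = abroca(y, [0.1, 0.9, 0.2, 0.8, 0.1, 0.9, 0.2, 0.8], group)
# Good predictions for one slice, inverted for the other.
opposite = abroca(y, [0.1, 0.9, 0.2, 0.8, 0.9, 0.1, 0.8, 0.2], group)

print("identical slices:", same)
print("opposite slices:", opposite)
```

Because the gap is taken in absolute value, curves that cross do not cancel out, which is the main difference from simply subtracting the two ROC AUC values.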

## Dataset

### The Open University
- Exclusively online university
- Largest university by enrollment in UK
- Provides one of the largest public learning analytics datasets

### The files
- <a href="https://analyse.kmi.open.ac.uk/open_dataset">Open University Learning Analytics Dataset (OULAD)</a>
    - <a href="http://arx.deidentifier.org/">Anonymized using ARX anonymization tool</a>
- Massive Open Online Courses (MOOCs)
    - 2 years (2013 & 2014)
    - 7 courses
    - 23 presentations
    - 32,593 students
    - 10,655,280 aggregated Virtual Learning Environment (VLE) activity records

- For course to be included in OULAD
    - The number of students in the selected module-presentation is larger than 500.
    - At least two presentations of the module exist.
    - VLE data are available for the module-presentation (since not all the modules are studied via VLE).
    - The module has a significant number of failing students.
- (clicks/student/activity/course/day)

In [None]:
PresentationFigures.SOURCE_ERD_MODEL.value.image

- 7 Tables
    - Course Info
    - Student Info
    - Assessment Info
    - Virtual Learning Environment (VLE) Summaries
        - (clicks per day, per resource, per student)
    - 3 Bridge Tables

In [None]:
PresentationFigures.OULAD_STUDENT_COURSES.value.image

can add course level details as notes

In [None]:
PresentationFigures.OULAD_VS_15.value.image

Compared 2013 & 2014 data with a sample from 2015 to check for significant changes in demographics.
At a significance level of 0.05, none of the null hypotheses would be rejected.

### Age (age_band)

In [None]:
PresentationFigures.OULAD_15_AGE.value.image

In [None]:
PresentationFigures.AGE_BAND_BY_STUDENT.value.image

In [None]:
print("part of effort to anonymize data - big bins")
features.columns["age_band"].desc(
    show_props=True, show_nulls=True, show_series_desc=True
)

### Index of Multiple Deprivation (imd_band)

In [None]:
PresentationFigures.IMD_BAND_IRELAND.value.image

In the current English Indices of Deprivation 2019 (IoD2019), seven domains of deprivation are considered and weighted as follows:
- Income (22.5%)
- Employment (22.5%)
- Education (13.5%)
- Health (13.5%)
- Crime (9.3%)
- Barriers to Housing and Services (9.3%)
- Living Environment (9.3%)

In [None]:
PresentationFigures.OULAD_15_IMD.value.image

lower = more deprived
maybe update data hists to combine

In [None]:
PresentationFigures.IMD_BAND_BY_STUDENT.value.image

In [None]:
features.columns["imd_band"].desc(
    show_series_desc=True, show_props=True, show_nulls=True
)

### Region (region)

In [None]:
PresentationFigures.REGION_BY_STUDENT.value.image

In [None]:
features.columns["region"].desc(show_series_desc=True, show_props=True, show_nulls=True)

### Highest Education

In [None]:
PresentationFigures.HIGHEST_EDUCATION_BY_STUDENT.value.image

In [None]:
features.columns["highest_education"].desc(
    show_series_desc=True, show_props=True, show_nulls=True
)

### Course Domain
- STEM or Social Studies

In [None]:
PresentationFigures.COURSE_DOMAIN_BY_STUDENT.value.image

In [None]:
features.columns["is_stem"].desc(
    show_series_desc=True, show_props=True, show_nulls=True
)

### Final Result

In [None]:
PresentationFigures.FINAL_RESULT_BY_STUDENT.value.image

In [None]:
features.columns["final_result"].desc(
    show_series_desc=True, show_props=True, show_nulls=True
)

# Experimental Architecture

### Data Processing/Analysis

### Initial Database Schemas
- PostgreSQL
- Landing
    - Raw CSV load
- Staging
    - Datatype and naming standardization
- Main
    - Data architecture optimization
    - Categorical/text columns stored in tables linked with integer foreign keys
    - Joined data saved in views 

In [None]:
df = dbh.info_schema.loc[:, ["schema", "name"]]
gb = df.groupby(["schema"]).count()
gb.columns = ["count"]
gb

In [None]:
dbh.show_table("landing", "student_info")

- Landing
    - Raw CSV load
    - so much text
    - 3 columns for unique row
- Staging
    - Datatype and naming standardization
- Main [Maybe ERD]
    - Data architecture optimization
    - Categorical/text columns stored in tables linked with integer foreign keys
    - Joined data saved in views 

In [None]:
dbh.show_table("main", "student_info")

- Landing
    - Raw CSV load
- Staging
    - Datatype and naming standardization
- Main [Maybe ERD]
    - Data architecture optimization
    - Categorical/text columns stored in tables linked with integer foreign keys
    - Joined data saved in views 
    - Third Normal Form

In [36]:
# fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8,8))
# df = dbh.get_table("agg", "course_activities_by_popularity")
# df = df[df.loc[:, "activity_type"].apply(lambda x: x not in  ["homepage", "forumng"])]
# plot = sns.scatterplot(df,
#                 #x="top_course_activity_by_visits",
#                 #y="top_course_activity_by_clicks",
#                 x="n_visits",
#                 y="n_clicks",
#                 ax=ax,
#                 hue="activity_type",
#                 alpha=0.75,
#                 marker="1"
#                 )
# sns.Plot.scale()

## Feature Extraction
- Categories
    - Demographic Info
    - Course Info
    - VLE Interaction Data
    - Assignment Data

- Demographic Info
- Course Info
    - course level
    - course subject
- VLE Interaction Data
- Assignment Data
    - n assignments created/assigned
    - calculated moments about the mean for the number of days early or late students turned in assignments

- Agg
    - Aggregations and calculations
- Feat
    - First pass at organizing features/calculations for predictive models
- First30
    - Version of Feat created using first 30 days of class data
    - Excluded if withdrew before class day 30
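A rough sketch of the assignment-timing features, assuming a hypothetical table with one row per submission. The output column names match those used in `first30.all_features`, but the input columns (`due_day`, `submitted_day`) and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical submissions: days are counted from the start of the course.
submissions = pd.DataFrame(
    {
        "student_id": [1, 1, 1, 2, 2, 3],
        "due_day": [12, 19, 47, 12, 19, 12],
        "submitted_day": [10, 19, 50, 15, 18, 2],
    }
)

# Keep only what is known within the first 30 days of the course.
first30 = submissions[submissions["submitted_day"] <= 30].copy()

# Positive values are early submissions, negative values are late ones.
first30["days_before_due"] = first30["due_day"] - first30["submitted_day"]

assignment_features = first30.groupby("student_id")["days_before_due"].agg(
    avg_days_before_due_submitted="mean",
    var_days_before_due_submitted="var",
    stddev_days_before_due_submitted="std",
    min_days_before_due_submitted="min",
)
print(assignment_features)
```

Students with a single submission get a missing variance and standard deviation, which is one reason an imputing strategy is needed later in the pipeline.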

- Agg
    - Aggregations and calculations
    - [Avg Assignment Days Early by N Days Active]
- Feat
    - First pass at organizing features/calculations for predictive models
    - [N Days Active]
    - [N Distinct Top 5th by Visits]
- First30
    - Version of Feat created using first 30 days of class data
    - Captures 49.60% of all withdrawn students
    - Captures 72.22% of students who withdrew after class started
    - Soon enough to make actionable difference to most withdrawing/failing students
    - [Final Result]
- Model
    - Logging of model execution data
- Eval
    - Organization of model analysis calculations


In [37]:
# col = "n_days_active"
# sql = f"select {col}, final_result from first30.all_features"
# df = dbh.run_pd_query(sql)
# fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(16,9), dpi=1200)
# sns.histplot(df,
#             x=col,
#             hue="final_result",
#             multiple="stack",
#             hue_order=["Distinction", "Pass", "Fail", "Withdrawn"],
#             ax=ax)
# plt.title("Days Active by Student Count")
# plt.xlim(0, 50)
# figsave(col, bbox_inches="tight")

In [None]:
SharedFigures.N_DAYS_ACTIVE_BY_FINAL_RESULT.value.image

example of aggregated feature

In [None]:
# Tables by Schema
PresentationFigures.TABLES_BY_SCHEMA.value.image

- Agg
    - Aggregations and calculations
- Feat
    - First pass at organizing features/calculations for predictive models
- First30
    - Version of Feat created using first 30 days of class data
- Model
    - Logging of model execution data
- Eval
    - Organization of model analysis calculations

In [40]:
# col = "n_total_clicks_by_top_5th_clicks"
# sql = f"select {col}, final_result from first30.all_features"
# df = dbh.run_pd_query(sql)
# fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(16,9), dpi=1200)
# sns.histplot(df,
#             x=col,
#             hue="final_result",
#             multiple="stack",
#             hue_order=["Distinction", "Pass", "Fail", "Withdrawn"],
#             ax=ax)
# plt.title("Total Clicks on Top 5th Popular Sites by Student Count")
# plt.xlim(0, 1000)
# figsave(col, bbox_inches="tight")

In [None]:
SharedFigures.N_TOTAL_CLICKS_BY_TOP_5TH_CLICKS.value.image

example of expert-rec engineered feature

## Classification Algorithms & Hyperparameters

- Grid Search
    - Created large arrays of available hyperparameters
    - Brute-force search through combination of available hyperparameters
- Random Search 
    - Used GridSearch to limit the bounds of hyperparameter settings
    - Created random variables to represent the distribution of particular hyperparameters, limited by results from GridSearch
    - Ran models where each iteration would pick from a model's available parameter combinations and distributions
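The two-stage search can be sketched with scikit-learn as below. This is a toy version under stated assumptions: the data comes from `make_classification`, the grid and the distribution bounds are arbitrary, and a decision tree stands in for the nine algorithms used in the project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Stage 1: exhaustive search over a coarse grid.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, 16], "min_samples_leaf": [1, 5, 25]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
best_depth = grid.best_params_["max_depth"]

# Stage 2: random search over distributions bounded around the grid result.
# randint(low, high) samples integers from low up to high - 1.
low, high = max(1, best_depth // 2), best_depth * 2 + 1
rand = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={
        "max_depth": randint(low, high),
        "min_samples_leaf": randint(1, 26),
    },
    n_iter=10,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print("grid search best:", grid.best_params_)
print("random search best:", rand.best_params_)
```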

first will show gridsearch then all examples will be generated randomly

In [None]:
show_params("hxg_boost", is_rand=False, n=3)

note incremental changes
random state is a way to freeze a random generator seed
only recommended during dev because of "seed optimization"

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">Decision Tree</a> (dtree/DT)
- Simple decision rules are optimized from features to sort data

point out model type code and code for in visual

In [None]:
show_params("dtree")

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html">Ada Boost</a> (ada_boost/ADA)
- Ensemble Method
- Fits on original dataset, then creates copies which weight incorrectly classified instances more heavily in sequential cycles
- Used Decision Tree as base estimator, but can use many 

In [None]:
show_params("ada_boost")

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html">Histogram Gradient-Boosting</a> (hxg_boost/HGB)
- Similar to Ada Boost, but correction based on gradient of loss function from residuals (gradient descent)
- Dataset large enough that Histogram Gradient-Boosting Classifier much faster than Regular Gradient-Boosting Classifier
- Histograms increase training efficiency by bucketing continuous features

In [None]:
show_params("hxg_boost")

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forest</a> (rforest/RF)
- Ensemble Method
- Fits many decision trees on sub-samples of dataset, then uses averaging to boost accuracy and control over-fitting

In [None]:
show_params("rforest")

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html">Extra Trees</a> (etree/ET)
- Ensemble Method
- Fits many decision trees on sub-samples of dataset, then uses averaging to boost accuracy and control over-fitting

In [None]:
show_params("etree")

### <a href="https://www.kaggle.com/code/hkapoor/random-forest-vs-extra-trees/notebook">Extra Trees vs Random Forest</a>
- Both construct many decision trees during execution & avg for classification/regression
- RF uses bootstrapping to sample subsets, ET by default does not
- RF looks for best split, ET randomly selects split
- ET typically will have faster fit times & lower variance, higher bias
- Performance of ET vs RF is often conditional upon feature selection/noisiness
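A quick way to see the sampling difference is to inspect the scikit-learn defaults. The sketch below also cross-validates both ensembles on synthetic data; the scores it prints say nothing about the project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "rforest": RandomForestClassifier(n_estimators=50, random_state=0),
    "etree": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

for name, model in models.items():
    # Mean ROC AUC over 3 folds of the synthetic data.
    auc = cross_val_score(model, X, y, scoring="roc_auc", cv=3).mean()
    print(f"{name}: bootstrap={model.bootstrap}, mean ROC AUC={auc:.3f}")
```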


### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">K-Nearest Neighbor</a> (knn/KNN)
- Calculates most likely value based on proximity to other points in numeric space    

In [None]:
show_params("knn")

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">Logistic Regression</a> (logreg/LOG)
- Calculates most likely value based on contribution of independent variables

In [None]:
show_params("logreg")

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html">Multi-Layer Perceptron</a> (mlp/MLP)

- Simple (vanilla) neural network
- Consists of layers of connected nodes with activation functions
- Optimizes weights of nodes in each layer using backpropagation during training
- Last layer is output layer, which produces most likely result given trained inputs

In [None]:
show_params("mlp")

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html">Support Vector Machines (svc/SVC)</a>
- A hyperplane is optimized to best split the data into different spatial regions

In [None]:
show_params("svc")

### Not Implemented
- Others considered but not implemented due to data preprocessing changes necessary/compute/memory overhead
- <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html">Gaussian Process</a> 
    - (Blew up RAM) 
- <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html">Naive Bayes</a> 
    - (Would need to preprocess data differently) 
- <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html">Gradient Boosting</a> 
    - (Histogram-Based Algorithm more efficient at this scale) 

In [None]:
show_params("compnb")

## Model Pipeline

### Data Preprocessing
- Categorical Data -> One Hot
- Boolean Data -> Bit $\left( True = 1, False = 0 \right)$
- Numeric Data -> Standardized $\left( \mu = 0, \sigma = 1 \right)$
- Imputing Strategy = Constant = $0$
- Variance Threshold = $0$
- Dim Reduction = Principal Component Analysis w/ Maximum Likelihood Estimation

replaced all missing values with zeros to avoid removing important data - not always the best strategy
Standardized -> 0 mean, 1 var
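One way these preprocessing steps could be wired together with scikit-learn is sketched below. The toy DataFrame, the step names, the `to_bit` helper, and the ordering of the steps are assumptions made for illustration; the notebook does not show the project's actual pipeline code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 60

# Toy stand-in for the feature table (values are random).
toy = pd.DataFrame(
    {
        "region": rng.choice(["north", "south", "east"], size=n),
        "is_stem": rng.choice([True, False], size=n),
        "n_days_active": rng.integers(0, 30, size=n).astype(float),
        "n_total_clicks": rng.gamma(2.0, 50.0, size=n),
    }
)
toy.loc[::7, "n_total_clicks"] = np.nan  # simulate missing values


def to_bit(frame):
    """Map True/False to 1/0."""
    return np.asarray(frame).astype(int)


# Numeric columns: fill missing values with 0, then scale to mean 0, std 1.
numeric = Pipeline(
    [
        ("impute", SimpleImputer(strategy="constant", fill_value=0)),
        ("scale", StandardScaler()),
    ]
)

columns = ColumnTransformer(
    [
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
        ("boolean", FunctionTransformer(to_bit), ["is_stem"]),
        ("numeric", numeric, ["n_days_active", "n_total_clicks"]),
    ],
    sparse_threshold=0,  # always return a dense array, which PCA needs here
)

preprocess = Pipeline(
    [
        ("columns", columns),
        ("variance", VarianceThreshold(threshold=0.0)),
        ("pca", PCA(n_components="mle", svd_solver="full")),
    ]
)

X_ready = preprocess.fit_transform(toy)
print(X_ready.shape)
```

With `n_components="mle"`, PCA chooses the number of components itself, so the output width depends on the data.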

### Model Training Settings
- Cross Validation Type = Repeated Stratified K Fold
- Cross Validation Splits $ = 5$
- Cross Validation Repeats $ = 2$
- Runs per Model $ = 10$
- Refit Parameter = ROC AUC
- Feature to Predict = "is_withdraw_or_fail"
- Train-Test Split Ratio = $[0.25, 0.35]$
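The settings above map onto scikit-learn roughly as follows. This is a sketch on synthetic data with a logistic regression as a placeholder model; reading the split ratio as a test-set proportion of 0.3 (inside the listed range) is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic binary target standing in for "is_withdraw_or_fail".
X, y = make_classification(
    n_samples=400, n_features=12, weights=[0.6, 0.4], random_state=0
)

# Hold out a test set, keeping the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 5 stratified folds repeated twice gives 10 ROC AUC scores per model.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, scoring="roc_auc", cv=cv
)

print(len(cv_scores), "scores, mean ROC AUC =", round(cv_scores.mean(), 3))
```

These per-fold scores are the kind of paired measurements that the frequentist and Bayesian comparisons take as input.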

In [53]:
df = dbh.get_table("analysis", "all_runs_results", nrows=10)
cols = [
    x
    for x in df.columns
    if (
        "split" not in x
        and x[:4] != "inc_"
        and x != "name"
        and "_id" not in x
        and "rank" not in x
        and "timestamp" not in x
    )
]
top_10 = df.loc[:, cols]

# Model Evaluation

### Naive Averaging

In [None]:
top_10

In [None]:
# ROC AUC by Fit Time - by Model Type
SharedFigures.ROC_BY_FIT_TIME.value.image

drive home the point that just sorting by average is very limiting

## NHST

In [None]:
# Frequentist Model Comparisons
SharedFigures.FREQUENTIST_ROPE_WINDOWPANE.value.image

- Significance Value = 0.05
- Recall: used the non-parametric Friedman test to check for a global difference, then the pairwise Nemenyi test

## Bayesian

In [None]:
# Bayesian Model Comparisons
SharedFigures.BAYESIAN_ROPE_WINDOWPANE.value.image

- Recall: used the Bayesian signed-rank test to check for ROPE (in this case, ROPE = 0.002)
- Models using all features made better predictions than those restricted to a subset of the feature categories
- (this study got those results, another found just assignment data better)

## ABROCA

In [None]:
SharedFigures.ABROCA_LOGREG_ETREE.value.image

- compares the ABROCA performance of two models
- two models = logreg and etree
- for this run, etree has on avg better predictive performance
    - follows previous research - more data, better predictions
- on abroca, similar for disability, logreg better on gender balance
- future - baycomp on abroca as metric to analyze

In [None]:
SharedFigures.ABROCA_BY_DEMOG_BALANCE.value.image

- follows research -> weak/no relationship between ABROCA and performance (measured by ROC AUC)
- follows research -> quadratic relationship between ABROCA & demographic balance
- (not necessary to sacrifice predictive performance while researching model fairness)
- makes sense because metric modulated is 2D area
- & models can be expected to perform worse with less training data

## Future Research
- More comprehensive metric evaluation of ABROCA
    - Statistics
    - Other demographic characteristics
- Refinement of feature extraction
- Automated optimization/analysis of hyperparameter probability distributions
- Explore relationship between mathematical properties of ROC & ABROCA
- Expend more computing resources on hierarchical comparisons rather than different parameterizations 

# Wrap Up

## Questions?

## Tools Used

In [None]:
PresentationFigures.PYTHON_LOGO.value.image

In [None]:
PresentationFigures.JUPYTER_LOGO.value.image

In [None]:
PresentationFigures.POSTGRESQL_LOGO.value.image

In [None]:
PresentationFigures.PSYCOPG2_LOGO.value.image

In [None]:
PresentationFigures.PANDAS_LOGO.value.image

In [None]:
PresentationFigures.MATPLOTLIB_LOGO.value.image

In [None]:
PresentationFigures.SEABORN_LOGO.value.image

In [None]:
PresentationFigures.NUMPY_LOGO.value.image

Bayesian Statistical Tests

# <a href="https://baycomp.readthedocs.io/en/latest/index.html">baycomp</a>
by:
- Janez Demsar
- Alessio Benavoli
- Giorgio Corani

In [None]:
PresentationFigures.SCIPY_LOGO.value.image

In [None]:
PresentationFigures.SCIKIT_LEARN_LOGO.value.image

## Top References
- <a href="https://arxiv.org/pdf/1606.04316">Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis</a>
- <a href="http://doi.org/10.1145/3303772.3303791">Evaluating the Fairness of Predictive Student Models Through Slicing Analysis</a>
- <a href="http://dx.doi.org/10.18608/jla.2018.52.7">Evaluating Predictive Models of Student Success: Closing the Methodological Gap</a>
- <a href="http://dx.doi.org/10.18608/jla.2015.22.13">Exploring the Link between Online Behaviours and Course Performance in Asynchronous Online High School Courses</a>
- <a href="https://wires.onlinelibrary.wiley.com/doi/10.1002/widm.1355">Educational data mining and learning analytics: An updated survey</a>
