# Analysis of students' performance in math using Bayesian models

## Intro

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

import matplotlib.pyplot as plt

import pymc3 as pm
import arviz as az

az.style.use('arviz-darkgrid')

from sklearn.metrics import mean_squared_error as mse
from sklearn.linear_model import Ridge

Let's read the data and take a first look at it.

In [None]:
df = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')
df.head(5)

There are three possible targets, but we'll be working with only one of them: math score.

First, we need to specify the task to be solved. How to frame a target like this is a well-known interview question, and the preferred answer is that it is a *regression* problem. The reason is that classification metrics cannot compare the size of two mistakes: if the correct answer is 65, a prediction of 64 counts as just as wrong as a prediction of 25. Yet the prediction of 64 is clearly almost right, while 25 is far off.
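A minimal sketch of this argument, using the numbers from the paragraph above. The helper names `zero_one_loss` and `squared_error` are made up for this illustration.

```python
# Toy illustration: the true score is 65, the two predictions are 64 and 25.
y_true = 65
predictions = [64, 25]

def zero_one_loss(y, y_hat):
    """Classification-style loss: 0 if exactly right, 1 otherwise."""
    return int(y != y_hat)

def squared_error(y, y_hat):
    """Regression-style loss: grows with the size of the miss."""
    return (y - y_hat) ** 2

for y_hat in predictions:
    print(y_hat, zero_one_loss(y_true, y_hat), squared_error(y_true, y_hat))
```

The zero-one loss is 1 for both predictions, while the squared error is 1 for the prediction of 64 and 1600 for the prediction of 25.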

Okay, so we specify the target column and the feature columns. Note that all features are categorical.

In [None]:
df = df.drop(columns=['reading score', 'writing score'])
y = df['math score']
X = df[['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']]

Let's one-hot encode (OHE) them. The ``drop='first'`` argument means that the first category of each feature is encoded as all zeros, so a feature with $n$ categories yields $n - 1$ columns.

In [None]:
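# .toarray() densifies the default sparse output; the `sparse` argument was removed in newer scikit-learn releases
X_ohe = OneHotEncoder(drop='first').fit_transform(X).toarray()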
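To see what `drop='first'` does, here is a small sketch on a made-up column (the `toy` frame and its values are not from the dataset). It calls `.toarray()` on the default sparse output instead of passing a flag, since the name of that flag differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature with n = 3 categories
toy = pd.DataFrame({'lunch_kind': ['a', 'b', 'c', 'a']})

encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded.shape)           # 4 rows, n - 1 = 2 columns
print(encoded)                 # rows with the first category 'a' are all zeros
```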

A standard train-test split, holding out 10% of the data for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_ohe, y, test_size=0.1, random_state=123
)

## Bayesian model

Okay, let's create a Bayesian model (the second parameter of each normal distribution is a standard deviation):
$$
bias \sim \mathcal{N}(0, 20),
$$
$$
w_i \sim \mathcal{N}(0, 10),
$$
$$
\sigma \sim |\mathcal{N}(0, 2)|,
$$
$$
\nu \sim Exp(1),
$$
$$
y \sim t(\nu = \nu, \mu = bias + w \cdot X, sd = \sigma).
$$

<p style="text-align: center;"><img src="https://i.ibb.co/vvr7t5m/bayes-model-diag-page-0001.jpg" alt="Model" border="0"></p>
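As a rough sketch of what this generative model says, the following draws one set of parameters from the priors and simulates a few scores with plain NumPy. The feature matrix `X_toy` is made up, the weight scale follows the `sd=10` used in the notebook's model code, and this is a single prior draw, not the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, dim = 5, 12                              # toy sizes; dim matches the 12 one-hot columns
X_toy = rng.integers(0, 2, size=(n_obs, dim))   # made-up binary features

bias = rng.normal(0, 20)                        # bias ~ N(0, 20)
w = rng.normal(0, 10, size=dim)                 # one weight per feature column
sigma = abs(rng.normal(0, 2))                   # half-normal noise scale
nu = rng.exponential(1.0)                       # degrees of freedom ~ Exp(1)

mu = bias + X_toy @ w                           # linear predictor
y_sim = mu + sigma * rng.standard_t(nu, size=n_obs)  # Student-t noise around mu
print(y_sim)
```

Small values of `nu` give very heavy tails, which is what makes the Student-t likelihood tolerant of outliers.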

In [None]:
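DIM = 12  # number of one-hot encoded feature columns; must equal X_train.shape[1]

with pm.Model() as robust_linreg_model:
    # One shared-data container per feature column, so the inputs can be swapped for prediction later
    X_data = np.array([pm.Data(f"X_data_{i+1}", X_train[:, i]) for i in range(DIM)])
    # Priors: intercept, one weight per feature, noise scale, degrees of freedom
    w0 = pm.Normal('w0', mu=0, sd=20)
    w = np.array([pm.Normal(f'w{i+1}', mu=0, sd=10) for i in range(DIM)])
    sigma = pm.HalfNormal('sigma', sd=2)
    nu = pm.Exponential('nu', 1)
    # Student-t likelihood: heavy tails make the fit robust to outliers
    outputs = pm.StudentT('y', mu=w0 + np.sum(w*X_data), sd=sigma, nu=nu, observed=y_train)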
    
pm.model_to_graphviz(robust_linreg_model)

In [None]:
with robust_linreg_model:
    inf_data_robust = pm.sample(draws=2000, tune=2000, chains=2, cores=2, 
                         return_inferencedata=True)

In [None]:
az.summary(inf_data_robust, round_to=2)

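`r_hat` is perfect: a value of 1.0 means the chains agree with each other, so there is no sign of convergence problems.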
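For intuition about what `r_hat` measures, here is a sketch of the classic textbook Gelman-Rubin statistic on synthetic chains. ArviZ reports a more refined variant, so `classic_rhat` below is a simplified stand-in written for this illustration, not the library's implementation.

```python
import numpy as np

def classic_rhat(chains):
    """Classic Gelman-Rubin statistic for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # B: variance of chain means, scaled by n
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return np.sqrt(pooled / within)

rng = np.random.default_rng(0)
mixed = rng.normal(0, 1, size=(2, 2000))             # two chains sampling the same distribution
stuck = mixed + np.array([[0.0], [3.0]])             # second chain shifted away from the first

print(classic_rhat(mixed))   # close to 1
print(classic_rhat(stuck))   # clearly above 1
```

When the chains explore the same distribution, the between-chain variance is negligible and the ratio stays near 1; chains that disagree push it upward.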

In [None]:
az.plot_trace(inf_data_robust, compact=False)

... and the traces look well mixed too.

In [None]:
az.plot_forest(inf_data_robust,
               model_names = ['Robust Linreg'],
               hdi_prob=0.95, figsize=(6, 4));

The 95% credible intervals (HDIs) are narrow enough.
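The intervals in the forest plot are highest density intervals, as requested with `hdi_prob=0.95`. As a sketch of the idea, for a unimodal posterior the HDI is the shortest interval that contains the requested share of the samples. The function `hdi_from_samples` and the synthetic `draws` are assumptions made for this example, not ArviZ code.

```python
import numpy as np

def hdi_from_samples(samples, prob=0.95):
    """Shortest interval containing `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    k = int(np.ceil(prob * n))             # number of samples the interval must contain
    widths = x[k - 1:] - x[:n - k + 1]     # width of every window of k consecutive samples
    start = int(np.argmin(widths))         # the narrowest window is the HDI
    return x[start], x[start + k - 1]

rng = np.random.default_rng(0)
draws = rng.normal(loc=5.0, scale=1.0, size=4000)   # stand-in for posterior draws of one weight
low, high = hdi_from_samples(draws)
print(low, high)   # an interval around 5, about four standard deviations wide
```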

## Predictions

In [None]:
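with robust_linreg_model:
    # Swap the training features for the test features in the shared-data containers
    pm.set_data({f'X_data_{i+1}': X_test[:, i] for i in range(DIM)})
    # Despite the name, samples_train now holds posterior predictive draws for the test set
    samples_train = pm.sample_posterior_predictive(inf_data_robust)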

Let's evaluate our predictions using mean squared error. Before doing this, we need to round the sampled predictions and reduce them to a single final prediction per student. Let's take the most common value among the samples.

In [None]:
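from scipy.stats import mode
# Mode over the sample axis gives one prediction per test row; np.ravel handles both old and new SciPy output shapes
y_pred = np.ravel(mode(np.rint(samples_train['y']), axis=0).mode)
mse(y_test, y_pred)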
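The reduction above works on an array with one row per posterior predictive draw and one column per test student. Here is a small sketch of that step on made-up numbers, with a NumPy-only `column_mode` helper (hypothetical, written for this example) and the posterior mean shown next to it for comparison.

```python
import numpy as np

# Made-up posterior predictive draws: 5 draws (rows) for 3 test students (columns)
draws = np.array([
    [64.2, 70.9, 51.4],
    [65.1, 71.3, 49.6],
    [64.8, 68.7, 50.2],
    [66.9, 71.1, 50.4],
    [65.3, 74.0, 47.8],
])

rounded = np.rint(draws).astype(int)

def column_mode(column):
    """Most frequent value in a 1-D integer array (the smallest value wins ties)."""
    values, counts = np.unique(column, return_counts=True)
    return values[np.argmax(counts)]

mode_pred = np.array([column_mode(rounded[:, j]) for j in range(rounded.shape[1])])
mean_pred = draws.mean(axis=0)

print(mode_pred)   # most common rounded score per student
print(mean_pred)   # posterior mean per student
```

The mode is the choice made in this notebook; the mean is another common point estimate and needs no rounding.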

Comparing this to an ordinary Ridge regression, we see that our model does its job not much worse. That's cool :)

In [None]:
mse(y_test, Ridge().fit(X_train, y_train).predict(X_test))