# Ordinal Regression

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats

from statsmodels.miscmodels.ordinal_model import OrderedModel

We load a Stata data file from the UCLA website. This notebook is inspired by https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/, an R notebook from UCLA.

In [None]:
url = "https://stats.idre.ucla.edu/stat/data/ologit.dta"
data_student = pd.read_stata(url)

In [None]:
data_student.head(5)

In [None]:
data_student.dtypes

In [None]:
data_student['apply'].dtype

This dataset describes how likely undergraduate students are to apply to graduate school, given three exogenous variables:
- `gpa`, their grade point average, a float between 0 and 4.
- `pared`, a binary variable that indicates whether at least one parent went to graduate school.
- `public`, a binary variable that indicates whether the student's current undergraduate institution is public or private.

`apply`, the target variable, is categorical with ordered categories: `unlikely` < `somewhat likely` < `very likely`. It is a `pd.Series` of categorical dtype, which is preferred over a NumPy array.

The model is based on a numerical latent variable $y_{latent}$ that we cannot observe directly but that we can model as a function of the exogenous variables.
The observed variable $y$ is then defined from $y_{latent}$: its category depends on which interval between consecutive thresholds $y_{latent}$ falls into.
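The latent-variable idea can be illustrated with a small simulation. This is a minimal sketch, not the statsmodels implementation: the coefficients `beta`, the cut points `thresholds`, and the simulated data are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, chosen only for illustration.
n_obs = 1000
beta = np.array([1.0, -0.5, 0.6])   # one coefficient per exogenous variable
thresholds = np.array([2.0, 4.0])   # 2 cut points -> 3 ordered categories

# Simulated exogenous variables: two binary columns and one GPA-like column.
exog = np.column_stack([
    rng.integers(0, 2, n_obs),
    rng.integers(0, 2, n_obs),
    rng.uniform(2.0, 4.0, n_obs),
])

# Latent variable: linear predictor plus logistic noise (the "logit" case).
y_latent = exog @ beta + rng.logistic(size=n_obs)

# Observed category: index of the interval between cut points
# that contains y_latent (0, 1 or 2).
y_observed = np.digitize(y_latent, thresholds)

# Number of observations per category.
print(np.bincount(y_observed, minlength=3))
```

Only `y_observed` and `exog` would be available in practice; the regression tries to recover `beta` and `thresholds` from them.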

For more details, see the documentation of `OrderedModel`, [the UCLA webpage](https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/) or this [book](https://onlinelibrary.wiley.com/doi/book/10.1002/9780470594001).


### Probit ordinal regression:

In [None]:
mod_prob = OrderedModel(data_student['apply'],
                        data_student[['pared', 'public', 'gpa']],
                        distr='probit')

res_prob = mod_prob.fit(method='bfgs')
res_prob.summary()

In our model, we have 3 exogenous variables (the $\beta$s, if we keep the documentation's notation), so we have 3 coefficients that need to be estimated.

These 3 estimates and their standard errors are reported in the summary table.

Since there are 3 categories in the target variable (`unlikely`, `somewhat likely`, `very likely`), we have two thresholds to estimate.
As explained in the documentation of the method `OrderedModel.transform_threshold_params`, the first estimated threshold is the actual value, and all the other thresholds are expressed as cumulative exponentiated increments. The actual threshold values can be computed as follows:

In [None]:
num_of_thresholds = 2
mod_prob.transform_threshold_params(res_prob.params[-num_of_thresholds:])
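The transformation described above can also be reproduced by hand. The following is a sketch of the parameterization as the text describes it (first value kept as is, later values exponentiated and cumulatively summed). The helper name `to_thresholds` and the input `raw` are hypothetical, not fitted output, and the library method may additionally pad the result with `-inf` and `inf` as outer bounds, which the sketch omits.

```python
import numpy as np


def to_thresholds(raw):
    """Map unconstrained threshold parameters to increasing cut points.

    Hypothetical helper: the first entry is used as is, the remaining
    entries are exponentiated (so each increment is positive) and the
    result is cumulatively summed.
    """
    raw = np.asarray(raw, dtype=float)
    increments = np.concatenate((raw[:1], np.exp(raw[1:])))
    return np.cumsum(increments)


# Made-up parameter values, for illustration only.
raw = np.array([2.0, 0.7])
print(to_thresholds(raw))
```

Because every increment after the first is an exponential, the resulting cut points are always strictly increasing, whatever values the optimizer proposes.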

### Logit ordinal regression:

In [None]:
mod_log = OrderedModel(data_student['apply'],
                       data_student[['pared', 'public', 'gpa']],
                       distr='logit')

res_log = mod_log.fit(method='bfgs', disp=False)
res_log.summary()

In [None]:
predicted = res_log.model.predict(res_log.params, exog=data_student[['pared', 'public', 'gpa']])
predicted

In [None]:
pred_choice = predicted.argmax(1)
print('Fraction of correct choice predictions')
print((np.asarray(data_student['apply'].values.codes) == pred_choice).mean())
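To see how per-category probabilities like those in `predicted` relate to the thresholds, here is a sketch of the usual cumulative-link formula for the logit case, $P(y = k) = F(c_k - x\beta) - F(c_{k-1} - x\beta)$, with $F$ the logistic CDF. It is an illustration under stated assumptions, not the statsmodels internals: the helper names and all numbers are made up.

```python
import numpy as np


def logistic_cdf(z):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + np.exp(-z))


def category_probs(exog, beta, thresholds):
    """Hypothetical helper: probability of each ordered category per row."""
    xb = np.asarray(exog, dtype=float) @ np.asarray(beta, dtype=float)
    # Pad the cut points with -inf and +inf so every category is an interval.
    cuts = np.concatenate(([-np.inf], thresholds, [np.inf]))
    # Cumulative probability P(y_latent <= cut) for each row and each cut.
    cum = logistic_cdf(cuts[None, :] - xb[:, None])
    # Differences of consecutive cumulative probabilities give P(y = k).
    return np.diff(cum, axis=1)


# Made-up inputs: two observations with columns (pared, public, gpa).
exog = np.array([[0, 0, 3.0],
                 [1, 0, 3.5]])
beta = np.array([1.0, -0.1, 0.6])
thresholds = np.array([2.2, 4.3])

probs = category_probs(exog, beta, thresholds)
pred_codes = probs.argmax(axis=1)  # most probable category per row
print(probs)
print(pred_codes)
```

Each row of `probs` sums to one, and taking the `argmax` over columns gives the predicted category code, as in the cell above.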

### Ordinal regression with a custom cumulative cLogLog distribution:

In addition to `logit` and `probit` regression, any continuous distribution from the `scipy.stats` package can be used for the `distr` argument. Alternatively, one can define a custom distribution by creating a subclass of `rv_continuous` and implementing a few methods.

In [None]:
# using a SciPy distribution
res_exp = OrderedModel(data_student['apply'],
                       data_student[['pared', 'public', 'gpa']],
                       distr=stats.expon).fit(method='bfgs', disp=False)
res_exp.summary()

In [None]:
# minimal definition of a custom scipy distribution.
class CLogLog(stats.rv_continuous):
    def _ppf(self, q):
        return np.log(-np.log(1 - q))

    def _cdf(self, x):
        return 1 - np.exp(-np.exp(x))


cloglog = CLogLog()

# definition of the model and fitting
res_cloglog = OrderedModel(data_student['apply'],
                           data_student[['pared', 'public', 'gpa']],
                           distr=cloglog).fit(method='bfgs', disp=False)
res_cloglog.summary()
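When writing a custom distribution, it can help to check that the `_cdf` and `_ppf` formulas are consistent with each other. The sketch below repeats the two cLogLog formulas as plain NumPy functions (the names `cloglog_cdf` and `cloglog_ppf` are hypothetical stand-ins for the methods above) and verifies that they are mutual inverses on a grid of probabilities.

```python
import numpy as np


def cloglog_cdf(x):
    """Same formula as CLogLog._cdf above."""
    return 1 - np.exp(-np.exp(x))


def cloglog_ppf(q):
    """Same formula as CLogLog._ppf above."""
    return np.log(-np.log(1 - q))


# Grid of probabilities strictly inside (0, 1).
q = np.linspace(0.01, 0.99, 99)

# Round trip: the CDF applied to the quantile should return the probability.
round_trip = cloglog_cdf(cloglog_ppf(q))
print(np.allclose(round_trip, q))
```

A mismatch between the two methods would not necessarily raise an error during fitting, so a quick round-trip check like this is a cheap safeguard.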