Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
8 changed files
with
1,712 additions
and
2 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,247 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Ordinal Regression" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import numpy as np\n", | ||
"import pandas as pd\n", | ||
"import scipy.stats as stats\n", | ||
"\n", | ||
"from statsmodels.miscmodels.ordinal_model import OrderedModel" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Loading a stata data file from the UCLA website.This notebook is inspired by https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ which is a R notebook from UCLA." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"url = \"https://stats.idre.ucla.edu/stat/data/ologit.dta\"\n", | ||
"data_student = pd.read_stata(url)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"data_student.head(5)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"data_student.dtypes" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"data_student['apply'].dtype" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This dataset is about the probability for undergraduate students to apply to graduate school given three exogenous variables:\n", | ||
"- their grade point average(`gpa`), a float between 0 and 4.\n", | ||
"- `pared`, a binary that indicates if at least one parent went to graduate school.\n", | ||
"- and `public`, a binary that indicates if the current undergraduate institution of the student is public or private.\n", | ||
"\n", | ||
"`apply`, the target variable is categorical with ordered categories: `unlikely` < `somewhat likely` < `very likely`. It is a `pd.Serie` of categorical type, this is preferred over NumPy arrays." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The model is based on a numerical latent variable $y_{latent}$ that we cannot observe but that we can compute thanks to exogenous variables.\n", | ||
"Moreover we can use this $y_{latent}$ to define $y$ that we can observe.\n", | ||
"\n", | ||
"For more details see the the Documentation of OrderedModel, [the UCLA webpage](https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/) or this [book](https://onlinelibrary.wiley.com/doi/book/10.1002/9780470594001).\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Probit ordinal regression:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"mod_prob = OrderedModel(data_student['apply'],\n", | ||
" data_student[['pared', 'public', 'gpa']],\n", | ||
" distr='logit')\n", | ||
"\n", | ||
"res_prob = mod_prob.fit(method='bfgs')\n", | ||
"res_prob.summary()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In our model, we have 3 exogenous variables(the $\\beta$s if we keep the documentation's notations) so we have 3 coefficients that need to be estimated.\n", | ||
"\n", | ||
"Those 3 estimations and their standard errors can be retrieved in the summary table.\n", | ||
"\n", | ||
"Since there are 3 categories in the target variable(`unlikely`, `somewhat likely`, `very likely`), we have two thresholds to estimate. \n", | ||
"As explained in the doc of the method `OrderedModel.transform_threshold_params`, the first estimated threshold is the actual value and all the other thresholds are in terms of cumulative exponentiated increments. Actual thresholds values can be computed as follows:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"num_of_thresholds = 2\n", | ||
"mod_prob.transform_threshold_params(res_prob.params[-num_of_thresholds:])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Logit ordinal regression:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"mod_log = OrderedModel(data_student['apply'],\n", | ||
" data_student[['pared', 'public', 'gpa']],\n", | ||
" distr='logit')\n", | ||
"\n", | ||
"res_log = mod_log.fit(method='bfgs', disp=False)\n", | ||
"res_log.summary()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"predicted = res_log.model.predict(res_log.params, exog=data_student[['pared', 'public', 'gpa']])\n", | ||
"predicted" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"pred_choice = predicted.argmax(1)\n", | ||
"print('Fraction of correct choice predictions')\n", | ||
"print((np.asarray(data_student['apply'].values.codes) == pred_choice).mean())" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Ordinal regression with a custom cumulative cLogLog distribution:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In addition to `logit` and `probit` regression, any continuous distribution from `SciPy.stats` package can be used for the `distr` argument. Alternatively, one can define its own distribution simply creating a subclass from `rv_continuous` and implementing a few methods." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# using a SciPy distribution\n", | ||
"res_exp = OrderedModel(data_student['apply'],\n", | ||
" data_student[['pared', 'public', 'gpa']],\n", | ||
" distr=stats.expon).fit(method='bfgs', disp=False)\n", | ||
"res_exp.summary()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# minimal definition of a custom scipy distribution.\n", | ||
"class CLogLog(stats.rv_continuous):\n", | ||
" def _ppf(self, q):\n", | ||
" return np.log(-np.log(1 - q))\n", | ||
"\n", | ||
" def _cdf(self, x):\n", | ||
" return 1 - np.exp(-np.exp(x))\n", | ||
"\n", | ||
"\n", | ||
"cloglog = CLogLog()\n", | ||
"\n", | ||
"# definition of the model and fitting\n", | ||
"res_cloglog = OrderedModel(data_student['apply'],\n", | ||
" data_student[['pared', 'public', 'gpa']],\n", | ||
" distr=cloglog).fit(method='bfgs', disp=False)\n", | ||
"res_cloglog.summary()" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.7" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# -*- coding: utf-8 -*- | ||
""" | ||
Created on Mon Aug 24 11:17:06 2015 | ||
Author: Josef Perktold | ||
License: BSD-3 | ||
""" | ||
|
||
import numpy as np | ||
from scipy import stats | ||
import pandas | ||
|
||
from statsmodels.miscmodels.ordinal_model import OrderedModel | ||
|
||
nobs, k_vars = 1000, 3 | ||
x = np.random.randn(nobs, k_vars) | ||
# x = np.column_stack((np.ones(nobs), x)) | ||
# #constant will be in integration limits | ||
xb = x.dot(np.ones(k_vars)) | ||
y_latent = xb + np.random.randn(nobs) | ||
y = np.round(np.clip(y_latent, -2.4, 2.4)).astype(int) + 2 | ||
|
||
print(np.unique(y)) | ||
print(np.bincount(y)) | ||
|
||
mod = OrderedModel(y, x) | ||
# start_params = np.ones(k_vars + 4) | ||
# start_params = np.concatenate((np.ones(k_vars), np.arange(4))) | ||
start_ppf = stats.norm.ppf((np.bincount(y) / len(y)).cumsum()) | ||
start_threshold = np.concatenate((start_ppf[:1], | ||
np.log(np.diff(start_ppf[:-1])))) | ||
start_params = np.concatenate((np.zeros(k_vars), start_threshold)) | ||
res = mod.fit(start_params=start_params, maxiter=5000, maxfun=5000) | ||
print(res.params) | ||
# res = mod.fit(start_params=res.params, method='bfgs') | ||
res = mod.fit(start_params=start_params, method='bfgs') | ||
|
||
print(res.params) | ||
print(np.exp(res.params[-(mod.k_levels - 1):]).cumsum()) | ||
# print(res.summary()) | ||
|
||
predicted = res.model.predict(res.params) | ||
pred_choice = predicted.argmax(1) | ||
print('Fraction of correct choice predictions') | ||
print((y == pred_choice).mean()) | ||
|
||
print('\ncomparing bincount') | ||
print(np.bincount(res.model.predict(res.params).argmax(1))) | ||
print(np.bincount(res.model.endog)) | ||
|
||
res_log = OrderedModel(y, x, distr='logit').fit(method='bfgs') | ||
pred_choice_log = res_log.predict().argmax(1) | ||
print((y == pred_choice_log).mean()) | ||
print(res_log.summary()) | ||
|
||
# example form UCLA Stats pages | ||
# http://www.ats.ucla.edu/stat/stata/dae/ologit.htm | ||
# requires downloaded dataset ologit.dta | ||
|
||
dataf = pandas.read_stata(r"M:\josef_new\scripts\ologit_ucla.dta") | ||
|
||
# this works but sorts category levels alphabetically | ||
res_log2 = OrderedModel(np.asarray(dataf['apply']), | ||
np.asarray(dataf[['pared', 'public', 'gpa']], float), | ||
distr='logit').fit(method='bfgs') | ||
|
||
# this replicates the UCLA example except | ||
# for different parameterization of par2 | ||
res_log3 = OrderedModel(dataf['apply'].values.codes, | ||
np.asarray(dataf[['pared', 'public', 'gpa']], float), | ||
distr='logit').fit(method='bfgs') | ||
|
||
print(res_log3.summary()) | ||
|
||
# with ordered probit - not on UCLA page | ||
print( | ||
OrderedModel(dataf['apply'].values.codes, | ||
np.asarray(dataf[['pared', 'public', 'gpa']], float), | ||
distr='probit').fit(method='bfgs').summary()) | ||
|
||
|
||
# example with a custom distribution - not on UCLA page | ||
# definition of the SciPy dist | ||
class CLogLog(stats.rv_continuous): | ||
def _ppf(self, q): | ||
return np.log(-np.log(1 - q)) | ||
|
||
def _cdf(self, x): | ||
return 1 - np.exp(-np.exp(x)) | ||
|
||
|
||
cloglog = CLogLog() | ||
|
||
res_cloglog = OrderedModel(dataf['apply'], | ||
dataf[['pared', 'public', 'gpa']], | ||
distr=cloglog).fit(method='bfgs', disp=False) | ||
print(res_cloglog.summary()) |
Oops, something went wrong.