ENH: small fixes + example notebook

Bolzano-Weierstrass authored and josef-pkt committed Sep 3, 2020
1 parent b975648 commit cc353cc

Showing 2 changed files with 250 additions and 4 deletions.
247 changes: 247 additions & 0 deletions examples/notebooks/ordinal_regression.ipynb
@@ -0,0 +1,247 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ordinal Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import scipy.stats as stats\n",
"\n",
"from statsmodels.miscmodels.ordinal_model import OrderedModel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading a stata data file from the UCLA website.This notebook is inspired by https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ which is a R notebook from UCLA."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://stats.idre.ucla.edu/stat/data/ologit.dta\"\n",
"data_student = pd.read_stata(url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student['apply'].dtype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is about the probability for undergraduate students to apply to graduate school given three exogenous variables:\n",
"- their grade point average(`gpa`), a float between 0 and 4.\n",
"- `pared`, a binary that indicates if at least one parent went to graduate school.\n",
"- and `public`, a binary that indicates if the current undergraduate institution of the student is public or private.\n",
"\n",
"`apply`, the target variable is categorical with ordered categories: `unlikely` < `somewhat likely` < `very likely`. It is a `pd.Serie` of categorical type, this is preferred over NumPy arrays."
]
},
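{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ologit.dta` file already ships with the correct dtype, so no conversion is needed here. As a purely illustrative sketch, if the outcome had been stored as plain strings it could be turned into an ordered categorical before fitting:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative only: building an ordered categorical dtype from plain strings\n",
"cat_type = pd.CategoricalDtype(categories=['unlikely', 'somewhat likely', 'very likely'],\n",
"                               ordered=True)\n",
"# data_student['apply'] = data_student['apply'].astype(cat_type)  # not needed for ologit.dta\n",
"cat_type"
]
},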
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model is based on a numerical latent variable $y_{latent}$ that we cannot observe but that we can compute thanks to exogenous variables.\n",
"Moreover we can use this $y_{latent}$ to define $y$ that we can observe.\n",
"\n",
"For more details see the the Documentation of OrderedModel, [the UCLA webpage](https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/) or this [book](https://onlinelibrary.wiley.com/doi/book/10.1002/9780470594001).\n"
]
},
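{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a brief sketch of the standard formulation (the notation here follows the references above and may differ slightly from the `OrderedModel` documentation): the latent variable is modeled as\n",
"\n",
"$$y_{latent} = X\\beta + \\epsilon,$$\n",
"\n",
"and the observed category is $y = j$ whenever $\\theta_{j-1} < y_{latent} \\leq \\theta_j$, so that\n",
"\n",
"$$P(y \\leq j \\mid X) = F(\\theta_j - X\\beta),$$\n",
"\n",
"where $F$ is the CDF of the chosen `distr` (logistic for `logit`, standard normal for `probit`) and the $\\theta_j$ are thresholds estimated together with $\\beta$."
]
},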
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Probit ordinal regression:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mod_prob = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr='logit')\n",
"\n",
"res_prob = mod_prob.fit(method='bfgs')\n",
"res_prob.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our model, we have 3 exogenous variables(the $\\beta$s if we keep the documentation's notations) so we have 3 coefficients that need to be estimated.\n",
"\n",
"Those 3 estimations and their standard errors can be retrieved in the summary table.\n",
"\n",
"Since there are 3 categories in the target variable(`unlikely`, `somewhat likely`, `very likely`), we have two thresholds to estimate. \n",
"As explained in the doc of the method `OrderedModel.transform_threshold_params`, the first estimated threshold is the actual value and all the other thresholds are in terms of cumulative exponentiated increments. Actual thresholds values can be computed as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_of_thresholds = 2\n",
"mod_prob.transform_threshold_params(res_prob.params[-num_of_thresholds:])"
]
},
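{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick cross-check (a minimal sketch, not part of the public API), the same interior thresholds can be reconstructed by hand from the cumulative exponentiated increments described above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# manual reconstruction: keep the first threshold as-is,\n",
"# then add the exponentiated increments cumulatively\n",
"th_params = np.asarray(res_prob.params)[-num_of_thresholds:]\n",
"np.concatenate(([th_params[0]], np.exp(th_params[1:]))).cumsum()"
]
},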
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logit ordinal regression:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mod_log = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr='logit')\n",
"\n",
"res_log = mod_log.fit(method='bfgs', disp=False)\n",
"res_log.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predicted = res_log.model.predict(res_log.params, exog=data_student[['pared', 'public', 'gpa']])\n",
"predicted"
]
},
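{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row of `predicted` holds the probabilities of the three ordered categories for one observation; as a small sanity check (purely illustrative), the rows should sum to one up to floating-point error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# predicted probabilities across the three categories sum to one per row\n",
"predicted.sum(1)"
]
},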
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_choice = predicted.argmax(1)\n",
"print('Fraction of correct choice predictions')\n",
"print((np.asarray(data_student['apply'].values.codes) == pred_choice).mean())"
]
},
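{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where the model makes mistakes, one can also cross-tabulate observed against predicted categories; this is just an illustrative sketch using `pd.crosstab` (the codes 0, 1, 2 correspond to the ordered categories listed above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# cross-tabulation of observed vs. predicted category codes\n",
"obs_codes = np.asarray(data_student['apply'].values.codes)\n",
"pd.crosstab(pd.Series(obs_codes, name='observed'),\n",
"            pd.Series(pred_choice, name='predicted'))"
]
},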
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ordinal regression with a custom cumulative cLogLog distribution:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to `logit` and `probit` regression, any continuous distribution from `SciPy.stats` package can be used for the `distr` argument. Alternatively, one can define its own distribution simply creating a subclass from `rv_continuous` and implementing a few methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# using a SciPy distribution\n",
"res_exp = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr=stats.expon).fit(method='bfgs', disp=False)\n",
"res_exp.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# minimal definition of a custom scipy distribution.\n",
"class CLogLog(stats.rv_continuous):\n",
" def _ppf(self, q):\n",
" return np.log(-np.log(1 - q))\n",
"\n",
" def _cdf(self, x):\n",
" return 1 - np.exp(-np.exp(x))\n",
"\n",
"\n",
"cloglog = CLogLog()\n",
"\n",
"# definition of the model and fitting\n",
"res_cloglog = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr=cloglog).fit(method='bfgs', disp=False)\n",
"res_cloglog.summary()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
7 changes: 3 additions & 4 deletions statsmodels/miscmodels/ordinal_model.py
@@ -106,7 +106,7 @@ def _check_inputs(self, endog, exog):
 
         # Pandas' support
         if (isinstance(exog, pd.DataFrame)) or (isinstance(exog, pd.Series)):
-            exog_name = (exog.name if isinstance(exog, pd.Series)
+            exog_name = ([exog.name] if isinstance(exog, pd.Series)
                          else exog.columns.tolist())
             names['xname'] = exog_name
             exog = np.asarray(exog)
@@ -243,7 +243,6 @@ def fit(self, start_params=None, method='nm', maxiter=500, full_output=1,
 
 class OrderedResults(GenericLikelihoodModelResults):
     @Appender(GenericLikelihoodModelResults.summary.__doc__)
-    def summary(self, yname=None, xname=None, title=None, alpha=.05, **kwargs):
+    def summary(self, yname=None, xname=None, title=None, alpha=.05):
         names = self.model.names
-        print(names)
-        return super(OrderedResults, self).summary(**names, **kwargs)
+        return super(OrderedResults, self).summary(**names)
