
Commit

Merge ce0899b into df1c1a1
josef-pkt committed Sep 21, 2020
2 parents df1c1a1 + ce0899b commit 81c0c3b
Showing 8 changed files with 1,712 additions and 2 deletions.
247 changes: 247 additions & 0 deletions examples/notebooks/ordinal_regression.ipynb
@@ -0,0 +1,247 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ordinal Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import scipy.stats as stats\n",
"\n",
"from statsmodels.miscmodels.ordinal_model import OrderedModel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading a stata data file from the UCLA website.This notebook is inspired by https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ which is a R notebook from UCLA."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://stats.idre.ucla.edu/stat/data/ologit.dta\"\n",
"data_student = pd.read_stata(url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student['apply'].dtype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is about the probability for undergraduate students to apply to graduate school given three exogenous variables:\n",
"- their grade point average(`gpa`), a float between 0 and 4.\n",
"- `pared`, a binary that indicates if at least one parent went to graduate school.\n",
"- and `public`, a binary that indicates if the current undergraduate institution of the student is public or private.\n",
"\n",
"`apply`, the target variable is categorical with ordered categories: `unlikely` < `somewhat likely` < `very likely`. It is a `pd.Serie` of categorical type, this is preferred over NumPy arrays."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model is based on a numerical latent variable $y_{latent}$ that we cannot observe but that we can compute thanks to exogenous variables.\n",
"Moreover we can use this $y_{latent}$ to define $y$ that we can observe.\n",
"\n",
"For more details see the the Documentation of OrderedModel, [the UCLA webpage](https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/) or this [book](https://onlinelibrary.wiley.com/doi/book/10.1002/9780470594001).\n"
]
},
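{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the mechanics (with made-up coefficients and thresholds, not estimated values): the latent index $x'\\beta$ is turned into category probabilities by evaluating the link distribution $F$ between consecutive thresholds, $P(y = j) = F(\\theta_j - x'\\beta) - F(\\theta_{j-1} - x'\\beta)$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative numbers only, not fitted values\n",
"beta = np.array([0.5, -0.1, 0.7])\n",
"thresholds = np.array([2.0, 4.0])  # ascending cut points\n",
"x_i = np.array([1.0, 0.0, 3.4])  # one observation of the exog variables\n",
"\n",
"cum = stats.norm.cdf(thresholds - x_i @ beta)  # P(y <= j) with a probit link\n",
"np.diff(np.concatenate(([0.0], cum, [1.0])))  # category probabilities, sum to 1"
]
},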
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Probit ordinal regression:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mod_prob = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr='logit')\n",
"\n",
"res_prob = mod_prob.fit(method='bfgs')\n",
"res_prob.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our model, we have 3 exogenous variables(the $\\beta$s if we keep the documentation's notations) so we have 3 coefficients that need to be estimated.\n",
"\n",
"Those 3 estimations and their standard errors can be retrieved in the summary table.\n",
"\n",
"Since there are 3 categories in the target variable(`unlikely`, `somewhat likely`, `very likely`), we have two thresholds to estimate. \n",
"As explained in the doc of the method `OrderedModel.transform_threshold_params`, the first estimated threshold is the actual value and all the other thresholds are in terms of cumulative exponentiated increments. Actual thresholds values can be computed as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_of_thresholds = 2\n",
"mod_prob.transform_threshold_params(res_prob.params[-num_of_thresholds:])"
]
},
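{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same arithmetic can be written out by hand as a cross-check (a sketch; the method may additionally pad the result with infinite bounds): keep the first cut point and add the cumulative sum of the exponentiated increments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"th = np.asarray(res_prob.params[-num_of_thresholds:])\n",
"np.concatenate((th[:1], th[:1] + np.exp(th[1:]).cumsum()))"
]
},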
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logit ordinal regression:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mod_log = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr='logit')\n",
"\n",
"res_log = mod_log.fit(method='bfgs', disp=False)\n",
"res_log.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predicted = res_log.model.predict(res_log.params, exog=data_student[['pared', 'public', 'gpa']])\n",
"predicted"
]
},
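{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, the fitted results object exposes prediction directly; this should match the call above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"res_log.predict(data_student[['pared', 'public', 'gpa']])"
]
},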
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_choice = predicted.argmax(1)\n",
"print('Fraction of correct choice predictions')\n",
"print((np.asarray(data_student['apply'].values.codes) == pred_choice).mean())"
]
},
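{
"cell_type": "markdown",
"metadata": {},
"source": [
"`data_student['apply'].values.codes` encodes the ordered categories as the integers 0, 1, 2, in the same order as the columns of `predicted`; the mapping can be inspected directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(enumerate(data_student['apply'].values.categories))"
]
},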
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ordinal regression with a custom cumulative cLogLog distribution:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to `logit` and `probit` regression, any continuous distribution from `SciPy.stats` package can be used for the `distr` argument. Alternatively, one can define its own distribution simply creating a subclass from `rv_continuous` and implementing a few methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# using a SciPy distribution\n",
"res_exp = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr=stats.expon).fit(method='bfgs', disp=False)\n",
"res_exp.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# minimal definition of a custom scipy distribution.\n",
"class CLogLog(stats.rv_continuous):\n",
" def _ppf(self, q):\n",
" return np.log(-np.log(1 - q))\n",
"\n",
" def _cdf(self, x):\n",
" return 1 - np.exp(-np.exp(x))\n",
"\n",
"\n",
"cloglog = CLogLog()\n",
"\n",
"# definition of the model and fitting\n",
"res_cloglog = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr=cloglog).fit(method='bfgs', disp=False)\n",
"res_cloglog.summary()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
15 changes: 13 additions & 2 deletions statsmodels/base/model.py
@@ -789,6 +789,7 @@ def __init__(self, endog, exog=None, loglike=None, score=None,
         if hessian is not None:
             self.hessian = hessian
 
+        hasconst = kwds.pop("hasconst", None)
         self.__dict__.update(kwds)
 
         # TODO: data structures?
@@ -797,7 +798,8 @@ def __init__(self, endog, exog=None, loglike=None, score=None,
         # self.df_model = 9999
         # somewhere: CacheWriteWarning: 'df_model' cannot be overwritten
         super(GenericLikelihoodModel, self).__init__(endog, exog,
-                                                     missing=missing)
+                                                     missing=missing,
+                                                     hasconst=hasconst)
 
         # this will not work for ru2nmnl, maybe np.ndim of a dict?
         if exog is not None:
@@ -1060,7 +1062,9 @@ def predict(self, exog=None, transform=True, *args, **kwargs):
         exog_index = exog.index if is_pandas else None
 
         if transform and hasattr(self.model, 'formula') and (exog is not None):
-            design_info = self.model.data.design_info
+            # allow both locations of design_info, see #7043
+            design_info = (getattr(self.model, "design_info", None) or
+                           self.model.data.design_info)
             from patsy import dmatrix
             if isinstance(exog, pd.Series):
                 # we are guessing whether it should be column or row
@@ -2488,6 +2492,13 @@ def __init__(self, model, mlefit):
         self._cache = {}
         self.__dict__.update(mlefit.__dict__)
 
+        k_params = len(mlefit.params)
+        # checks mainly for adding new models or subclassing
+        if self.df_model + self.model.k_constant != k_params:
+            warnings.warn("df_model + k_constant differs from nparams")
+        if self.df_resid != self.nobs - k_params:
+            warnings.warn("df_resid differs from nobs - nparams")
+
     def summary(self, yname=None, xname=None, title=None, alpha=.05):
         """Summarize the Regression Results
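
The new consistency checks encode the usual degrees-of-freedom bookkeeping: with `k_params` estimated parameters, `df_model + k_constant` should equal `k_params` and `df_resid` should equal `nobs - k_params`. A minimal numeric sketch of the invariant (illustrative values, not taken from statsmodels):

nobs, k_params, k_constant = 100, 4, 1
df_model = k_params - k_constant  # 3 slope coefficients
df_resid = nobs - k_params  # 96 residual degrees of freedom
assert df_model + k_constant == k_params
assert df_resid == nobs - k_params
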
97 changes: 97 additions & 0 deletions statsmodels/examples/ex_ordered_model.py
@@ -0,0 +1,97 @@
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 24 11:17:06 2015
Author: Josef Perktold
License: BSD-3
"""

import numpy as np
from scipy import stats
import pandas

from statsmodels.miscmodels.ordinal_model import OrderedModel

nobs, k_vars = 1000, 3
x = np.random.randn(nobs, k_vars)
# x = np.column_stack((np.ones(nobs), x))
# constant will be in integration limits
xb = x.dot(np.ones(k_vars))
y_latent = xb + np.random.randn(nobs)
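# discretize the latent variable into 5 ordered levels: clipping at +/- 2.4
# and rounding gives integers -2..2, and the shift makes the codes 0..4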
y = np.round(np.clip(y_latent, -2.4, 2.4)).astype(int) + 2

print(np.unique(y))
print(np.bincount(y))

mod = OrderedModel(y, x)
# start_params = np.ones(k_vars + 4)
# start_params = np.concatenate((np.ones(k_vars), np.arange(4)))
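# start the thresholds at normal quantiles of the empirical cumulative
# frequencies; increments after the first cut point enter in logs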
start_ppf = stats.norm.ppf((np.bincount(y) / len(y)).cumsum())
start_threshold = np.concatenate((start_ppf[:1],
                                  np.log(np.diff(start_ppf[:-1]))))
start_params = np.concatenate((np.zeros(k_vars), start_threshold))
res = mod.fit(start_params=start_params, maxiter=5000, maxfun=5000)
print(res.params)
# res = mod.fit(start_params=res.params, method='bfgs')
res = mod.fit(start_params=start_params, method='bfgs')

print(res.params)
print(np.exp(res.params[-(mod.k_levels - 1):]).cumsum())
# print(res.summary())

predicted = res.model.predict(res.params)
pred_choice = predicted.argmax(1)
print('Fraction of correct choice predictions')
print((y == pred_choice).mean())

print('\ncomparing bincount')
print(np.bincount(res.model.predict(res.params).argmax(1)))
print(np.bincount(res.model.endog))

res_log = OrderedModel(y, x, distr='logit').fit(method='bfgs')
pred_choice_log = res_log.predict().argmax(1)
print((y == pred_choice_log).mean())
print(res_log.summary())

# example from UCLA Stats pages
# http://www.ats.ucla.edu/stat/stata/dae/ologit.htm
# requires downloaded dataset ologit.dta

dataf = pandas.read_stata(r"M:\josef_new\scripts\ologit_ucla.dta")
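# alternatively, load the file directly from the UCLA site as the notebook does:
# dataf = pandas.read_stata("https://stats.idre.ucla.edu/stat/data/ologit.dta")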

# this works but sorts category levels alphabetically
res_log2 = OrderedModel(np.asarray(dataf['apply']),
                        np.asarray(dataf[['pared', 'public', 'gpa']], float),
                        distr='logit').fit(method='bfgs')

# this replicates the UCLA example except
# for different parameterization of par2
res_log3 = OrderedModel(dataf['apply'].values.codes,
                        np.asarray(dataf[['pared', 'public', 'gpa']], float),
                        distr='logit').fit(method='bfgs')

print(res_log3.summary())

# with ordered probit - not on UCLA page
print(
    OrderedModel(dataf['apply'].values.codes,
                 np.asarray(dataf[['pared', 'public', 'gpa']], float),
                 distr='probit').fit(method='bfgs').summary())


# example with a custom distribution - not on UCLA page
# definition of the SciPy dist
class CLogLog(stats.rv_continuous):
    def _ppf(self, q):
        return np.log(-np.log(1 - q))

    def _cdf(self, x):
        return 1 - np.exp(-np.exp(x))


cloglog = CLogLog()

res_cloglog = OrderedModel(dataf['apply'],
                           dataf[['pared', 'public', 'gpa']],
                           distr=cloglog).fit(method='bfgs', disp=False)
print(res_cloglog.summary())
