
Commit

Merge ce0899b into df1c1a1
josef-pkt committed Sep 21, 2020
2 parents df1c1a1 + ce0899b commit 81c0c3b
Showing 8 changed files with 1,712 additions and 2 deletions.
247 changes: 247 additions & 0 deletions examples/notebooks/ordinal_regression.ipynb
@@ -0,0 +1,247 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ordinal Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import scipy.stats as stats\n",
"\n",
"from statsmodels.miscmodels.ordinal_model import OrderedModel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading a stata data file from the UCLA website.This notebook is inspired by https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ which is a R notebook from UCLA."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://stats.idre.ucla.edu/stat/data/ologit.dta\"\n",
"data_student = pd.read_stata(url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student['apply'].dtype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is about the probability for undergraduate students to apply to graduate school given three exogenous variables:\n",
"- their grade point average(`gpa`), a float between 0 and 4.\n",
"- `pared`, a binary that indicates if at least one parent went to graduate school.\n",
"- and `public`, a binary that indicates if the current undergraduate institution of the student is public or private.\n",
"\n",
"`apply`, the target variable is categorical with ordered categories: `unlikely` < `somewhat likely` < `very likely`. It is a `pd.Serie` of categorical type, this is preferred over NumPy arrays."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model is based on a numerical latent variable $y_{latent}$ that we cannot observe but that we can compute thanks to exogenous variables.\n",
"Moreover we can use this $y_{latent}$ to define $y$ that we can observe.\n",
"\n",
"For more details see the the Documentation of OrderedModel, [the UCLA webpage](https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/) or this [book](https://onlinelibrary.wiley.com/doi/book/10.1002/9780470594001).\n"
]
},
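{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the mechanics (with made-up coefficients and thresholds, not estimated values): the latent index $x'\\beta$ is turned into category probabilities by evaluating the link distribution $F$ between consecutive thresholds, $P(y = j) = F(\\theta_j - x'\\beta) - F(\\theta_{j-1} - x'\\beta)$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative numbers only, not fitted values\n",
"beta = np.array([0.5, -0.1, 0.7])\n",
"thresholds = np.array([2.0, 4.0])  # ascending cut points\n",
"x_i = np.array([1.0, 0.0, 3.4])  # one observation of the exog variables\n",
"\n",
"cum = stats.norm.cdf(thresholds - x_i @ beta)  # P(y <= j) with a probit link\n",
"np.diff(np.concatenate(([0.0], cum, [1.0])))  # category probabilities, sum to 1"
]
},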
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Probit ordinal regression:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mod_prob = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr='logit')\n",
"\n",
"res_prob = mod_prob.fit(method='bfgs')\n",
"res_prob.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our model, we have 3 exogenous variables(the $\\beta$s if we keep the documentation's notations) so we have 3 coefficients that need to be estimated.\n",
"\n",
"Those 3 estimations and their standard errors can be retrieved in the summary table.\n",
"\n",
"Since there are 3 categories in the target variable(`unlikely`, `somewhat likely`, `very likely`), we have two thresholds to estimate. \n",
"As explained in the doc of the method `OrderedModel.transform_threshold_params`, the first estimated threshold is the actual value and all the other thresholds are in terms of cumulative exponentiated increments. Actual thresholds values can be computed as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_of_thresholds = 2\n",
"mod_prob.transform_threshold_params(res_prob.params[-num_of_thresholds:])"
]
},
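{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same arithmetic can be written out by hand as a cross-check (a sketch; the method may additionally pad the result with infinite bounds): keep the first cut point and add the cumulative sum of the exponentiated increments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"th = np.asarray(res_prob.params[-num_of_thresholds:])\n",
"np.concatenate((th[:1], th[:1] + np.exp(th[1:]).cumsum()))"
]
},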
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logit ordinal regression:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mod_log = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr='logit')\n",
"\n",
"res_log = mod_log.fit(method='bfgs', disp=False)\n",
"res_log.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predicted = res_log.model.predict(res_log.params, exog=data_student[['pared', 'public', 'gpa']])\n",
"predicted"
]
},
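{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, the fitted results object exposes prediction directly; this should match the call above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"res_log.predict(data_student[['pared', 'public', 'gpa']])"
]
},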
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_choice = predicted.argmax(1)\n",
"print('Fraction of correct choice predictions')\n",
"print((np.asarray(data_student['apply'].values.codes) == pred_choice).mean())"
]
},
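{
"cell_type": "markdown",
"metadata": {},
"source": [
"`data_student['apply'].values.codes` encodes the ordered categories as the integers 0, 1, 2, in the same order as the columns of `predicted`; the mapping can be inspected directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(enumerate(data_student['apply'].values.categories))"
]
},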
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ordinal regression with a custom cumulative cLogLog distribution:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to `logit` and `probit` regression, any continuous distribution from `SciPy.stats` package can be used for the `distr` argument. Alternatively, one can define its own distribution simply creating a subclass from `rv_continuous` and implementing a few methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# using a SciPy distribution\n",
"res_exp = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr=stats.expon).fit(method='bfgs', disp=False)\n",
"res_exp.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# minimal definition of a custom scipy distribution.\n",
"class CLogLog(stats.rv_continuous):\n",
" def _ppf(self, q):\n",
" return np.log(-np.log(1 - q))\n",
"\n",
" def _cdf(self, x):\n",
" return 1 - np.exp(-np.exp(x))\n",
"\n",
"\n",
"cloglog = CLogLog()\n",
"\n",
"# definition of the model and fitting\n",
"res_cloglog = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr=cloglog).fit(method='bfgs', disp=False)\n",
"res_cloglog.summary()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
15 changes: 13 additions & 2 deletions statsmodels/base/model.py
@@ -789,6 +789,7 @@ def __init__(self, endog, exog=None, loglike=None, score=None,
         if hessian is not None:
             self.hessian = hessian
 
+        hasconst = kwds.pop("hasconst", None)
         self.__dict__.update(kwds)
 
         # TODO: data structures?
@@ -797,7 +798,8 @@ def __init__(self, endog, exog=None, loglike=None, score=None,
         # self.df_model = 9999
         # somewhere: CacheWriteWarning: 'df_model' cannot be overwritten
         super(GenericLikelihoodModel, self).__init__(endog, exog,
-                                                     missing=missing)
+                                                     missing=missing,
+                                                     hasconst=hasconst)
 
         # this will not work for ru2nmnl, maybe np.ndim of a dict?
         if exog is not None:
@@ -1060,7 +1062,9 @@ def predict(self, exog=None, transform=True, *args, **kwargs):
         exog_index = exog.index if is_pandas else None
 
         if transform and hasattr(self.model, 'formula') and (exog is not None):
-            design_info = self.model.data.design_info
+            # allow both locations of design_info, see #7043
+            design_info = (getattr(self.model, "design_info", None) or
+                           self.model.data.design_info)
             from patsy import dmatrix
             if isinstance(exog, pd.Series):
                 # we are guessing whether it should be column or row
@@ -2488,6 +2492,13 @@ def __init__(self, model, mlefit):
         self._cache = {}
         self.__dict__.update(mlefit.__dict__)
 
+        k_params = len(mlefit.params)
+        # checks mainly for adding new models or subclassing
+        if self.df_model + self.model.k_constant != k_params:
+            warnings.warn("df_model + k_constant differs from nparams")
+        if self.df_resid != self.nobs - k_params:
+            warnings.warn("df_resid differs from nobs - nparams")
+
     def summary(self, yname=None, xname=None, title=None, alpha=.05):
         """Summarize the Regression Results
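
The new consistency checks encode the usual degrees-of-freedom bookkeeping: with `k_params` estimated parameters, `df_model + k_constant` should equal `k_params` and `df_resid` should equal `nobs - k_params`. A minimal numeric sketch of the invariant (illustrative values, not taken from statsmodels):

nobs, k_params, k_constant = 100, 4, 1
df_model = k_params - k_constant  # 3 slope coefficients
df_resid = nobs - k_params  # 96 residual degrees of freedom
assert df_model + k_constant == k_params
assert df_resid == nobs - k_params
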
97 changes: 97 additions & 0 deletions statsmodels/examples/ex_ordered_model.py
@@ -0,0 +1,97 @@
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 24 11:17:06 2015
Author: Josef Perktold
License: BSD-3
"""

import numpy as np
from scipy import stats
import pandas

from statsmodels.miscmodels.ordinal_model import OrderedModel

nobs, k_vars = 1000, 3
x = np.random.randn(nobs, k_vars)
# x = np.column_stack((np.ones(nobs), x))
# constant will be in integration limits
xb = x.dot(np.ones(k_vars))
y_latent = xb + np.random.randn(nobs)
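# discretize the latent variable into 5 ordered levels: clipping at +/- 2.4
# and rounding gives integers -2..2, and the shift makes the codes 0..4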
y = np.round(np.clip(y_latent, -2.4, 2.4)).astype(int) + 2

print(np.unique(y))
print(np.bincount(y))

mod = OrderedModel(y, x)
# start_params = np.ones(k_vars + 4)
# start_params = np.concatenate((np.ones(k_vars), np.arange(4)))
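# start the thresholds at normal quantiles of the empirical cumulative
# frequencies; increments after the first cut point enter in logs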
start_ppf = stats.norm.ppf((np.bincount(y) / len(y)).cumsum())
start_threshold = np.concatenate((start_ppf[:1],
                                  np.log(np.diff(start_ppf[:-1]))))
start_params = np.concatenate((np.zeros(k_vars), start_threshold))
res = mod.fit(start_params=start_params, maxiter=5000, maxfun=5000)
print(res.params)
# res = mod.fit(start_params=res.params, method='bfgs')
res = mod.fit(start_params=start_params, method='bfgs')

print(res.params)
print(np.exp(res.params[-(mod.k_levels - 1):]).cumsum())
# print(res.summary())

predicted = res.model.predict(res.params)
pred_choice = predicted.argmax(1)
print('Fraction of correct choice predictions')
print((y == pred_choice).mean())

print('\ncomparing bincount')
print(np.bincount(res.model.predict(res.params).argmax(1)))
print(np.bincount(res.model.endog))

res_log = OrderedModel(y, x, distr='logit').fit(method='bfgs')
pred_choice_log = res_log.predict().argmax(1)
print((y == pred_choice_log).mean())
print(res_log.summary())

# example from UCLA Stats pages
# http://www.ats.ucla.edu/stat/stata/dae/ologit.htm
# requires downloaded dataset ologit.dta

dataf = pandas.read_stata(r"M:\josef_new\scripts\ologit_ucla.dta")
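# alternatively, load the file directly from the UCLA site as the notebook does:
# dataf = pandas.read_stata("https://stats.idre.ucla.edu/stat/data/ologit.dta")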

# this works but sorts category levels alphabetically
res_log2 = OrderedModel(np.asarray(dataf['apply']),
                        np.asarray(dataf[['pared', 'public', 'gpa']], float),
                        distr='logit').fit(method='bfgs')

# this replicates the UCLA example except
# for different parameterization of par2
res_log3 = OrderedModel(dataf['apply'].values.codes,
                        np.asarray(dataf[['pared', 'public', 'gpa']], float),
                        distr='logit').fit(method='bfgs')

print(res_log3.summary())

# with ordered probit - not on UCLA page
print(
    OrderedModel(dataf['apply'].values.codes,
                 np.asarray(dataf[['pared', 'public', 'gpa']], float),
                 distr='probit').fit(method='bfgs').summary())


# example with a custom distribution - not on UCLA page
# definition of the SciPy dist
class CLogLog(stats.rv_continuous):
    def _ppf(self, q):
        return np.log(-np.log(1 - q))

    def _cdf(self, x):
        return 1 - np.exp(-np.exp(x))


cloglog = CLogLog()

res_cloglog = OrderedModel(dataf['apply'],
                           dataf[['pared', 'public', 'gpa']],
                           distr=cloglog).fit(method='bfgs', disp=False)
print(res_cloglog.summary())
