ENH: small fixes + example notebook

Bolzano-Weierstrass authored and josef-pkt committed Sep 3, 2020
1 parent b975648 commit cc353cc

Showing 2 changed files with 250 additions and 4 deletions.
247 changes: 247 additions & 0 deletions examples/notebooks/ordinal_regression.ipynb
@@ -0,0 +1,247 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Ordinal Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import scipy.stats as stats\n",
"\n",
"from statsmodels.miscmodels.ordinal_model import OrderedModel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading a stata data file from the UCLA website.This notebook is inspired by https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ which is a R notebook from UCLA."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://stats.idre.ucla.edu/stat/data/ologit.dta\"\n",
"data_student = pd.read_stata(url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_student['apply'].dtype"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is about the probability for undergraduate students to apply to graduate school given three exogenous variables:\n",
"- their grade point average(`gpa`), a float between 0 and 4.\n",
"- `pared`, a binary that indicates if at least one parent went to graduate school.\n",
"- and `public`, a binary that indicates if the current undergraduate institution of the student is public or private.\n",
"\n",
"`apply`, the target variable is categorical with ordered categories: `unlikely` < `somewhat likely` < `very likely`. It is a `pd.Serie` of categorical type, this is preferred over NumPy arrays."
]
},
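{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ologit.dta` file already ships with the correct dtype, so no conversion is needed here. As a purely illustrative sketch, if the outcome had been stored as plain strings it could be turned into an ordered categorical before fitting:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative only: building an ordered categorical dtype from plain strings\n",
"cat_type = pd.CategoricalDtype(categories=['unlikely', 'somewhat likely', 'very likely'],\n",
"                               ordered=True)\n",
"# data_student['apply'] = data_student['apply'].astype(cat_type)  # not needed for ologit.dta\n",
"cat_type"
]
},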
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model is based on a numerical latent variable $y_{latent}$ that we cannot observe but that we can compute thanks to exogenous variables.\n",
"Moreover we can use this $y_{latent}$ to define $y$ that we can observe.\n",
"\n",
"For more details see the the Documentation of OrderedModel, [the UCLA webpage](https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/) or this [book](https://onlinelibrary.wiley.com/doi/book/10.1002/9780470594001).\n"
]
},
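{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a brief sketch of the standard formulation (the notation here follows the references above and may differ slightly from the `OrderedModel` documentation): the latent variable is modeled as\n",
"\n",
"$$y_{latent} = X\\beta + \\epsilon,$$\n",
"\n",
"and the observed category is $y = j$ whenever $\\theta_{j-1} < y_{latent} \\leq \\theta_j$, so that\n",
"\n",
"$$P(y \\leq j \\mid X) = F(\\theta_j - X\\beta),$$\n",
"\n",
"where $F$ is the CDF of the chosen `distr` (logistic for `logit`, standard normal for `probit`) and the $\\theta_j$ are thresholds estimated together with $\\beta$."
]
},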
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Probit ordinal regression:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mod_prob = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr='logit')\n",
"\n",
"res_prob = mod_prob.fit(method='bfgs')\n",
"res_prob.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our model, we have 3 exogenous variables(the $\\beta$s if we keep the documentation's notations) so we have 3 coefficients that need to be estimated.\n",
"\n",
"Those 3 estimations and their standard errors can be retrieved in the summary table.\n",
"\n",
"Since there are 3 categories in the target variable(`unlikely`, `somewhat likely`, `very likely`), we have two thresholds to estimate. \n",
"As explained in the doc of the method `OrderedModel.transform_threshold_params`, the first estimated threshold is the actual value and all the other thresholds are in terms of cumulative exponentiated increments. Actual thresholds values can be computed as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"num_of_thresholds = 2\n",
"mod_prob.transform_threshold_params(res_prob.params[-num_of_thresholds:])"
]
},
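{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick cross-check (a minimal sketch, not part of the public API), the same interior thresholds can be reconstructed by hand from the cumulative exponentiated increments described above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# manual reconstruction: keep the first threshold as-is,\n",
"# then add the exponentiated increments cumulatively\n",
"th_params = np.asarray(res_prob.params)[-num_of_thresholds:]\n",
"np.concatenate(([th_params[0]], np.exp(th_params[1:]))).cumsum()"
]
},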
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logit ordinal regression:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mod_log = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr='logit')\n",
"\n",
"res_log = mod_log.fit(method='bfgs', disp=False)\n",
"res_log.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predicted = res_log.model.predict(res_log.params, exog=data_student[['pared', 'public', 'gpa']])\n",
"predicted"
]
},
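{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row of `predicted` holds the probabilities of the three ordered categories for one observation; as a small sanity check (purely illustrative), the rows should sum to one up to floating-point error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# predicted probabilities across the three categories sum to one per row\n",
"predicted.sum(1)"
]
},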
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pred_choice = predicted.argmax(1)\n",
"print('Fraction of correct choice predictions')\n",
"print((np.asarray(data_student['apply'].values.codes) == pred_choice).mean())"
]
},
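{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where the model makes mistakes, one can also cross-tabulate observed against predicted categories; this is just an illustrative sketch using `pd.crosstab` (the codes 0, 1, 2 correspond to the ordered categories listed above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# cross-tabulation of observed vs. predicted category codes\n",
"obs_codes = np.asarray(data_student['apply'].values.codes)\n",
"pd.crosstab(pd.Series(obs_codes, name='observed'),\n",
"            pd.Series(pred_choice, name='predicted'))"
]
},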
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ordinal regression with a custom cumulative cLogLog distribution:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to `logit` and `probit` regression, any continuous distribution from `SciPy.stats` package can be used for the `distr` argument. Alternatively, one can define its own distribution simply creating a subclass from `rv_continuous` and implementing a few methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# using a SciPy distribution\n",
"res_exp = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr=stats.expon).fit(method='bfgs', disp=False)\n",
"res_exp.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# minimal definition of a custom scipy distribution.\n",
"class CLogLog(stats.rv_continuous):\n",
" def _ppf(self, q):\n",
" return np.log(-np.log(1 - q))\n",
"\n",
" def _cdf(self, x):\n",
" return 1 - np.exp(-np.exp(x))\n",
"\n",
"\n",
"cloglog = CLogLog()\n",
"\n",
"# definition of the model and fitting\n",
"res_cloglog = OrderedModel(data_student['apply'],\n",
" data_student[['pared', 'public', 'gpa']],\n",
" distr=cloglog).fit(method='bfgs', disp=False)\n",
"res_cloglog.summary()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
7 changes: 3 additions & 4 deletions statsmodels/miscmodels/ordinal_model.py
@@ -106,7 +106,7 @@ def _check_inputs(self, endog, exog):
 
         # Pandas' support
         if (isinstance(exog, pd.DataFrame)) or (isinstance(exog, pd.Series)):
-            exog_name = (exog.name if isinstance(exog, pd.Series)
+            exog_name = ([exog.name] if isinstance(exog, pd.Series)
                          else exog.columns.tolist())
             names['xname'] = exog_name
             exog = np.asarray(exog)
@@ -243,7 +243,6 @@ def fit(self, start_params=None, method='nm', maxiter=500, full_output=1,
 
 class OrderedResults(GenericLikelihoodModelResults):
     @Appender(GenericLikelihoodModelResults.summary.__doc__)
-    def summary(self, yname=None, xname=None, title=None, alpha=.05, **kwargs):
+    def summary(self, yname=None, xname=None, title=None, alpha=.05):
         names = self.model.names
-        print(names)
-        return super(OrderedResults, self).summary(**names, **kwargs)
+        return super(OrderedResults, self).summary(**names)
