rasbt · rasbt · Dec 22, 2017 · Dec 22, 2017
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -55,6 +55,7 @@ pages:
     - user_guide/evaluate/bootstrap.md
     - user_guide/evaluate/bootstrap_point632_score.md
     - user_guide/evaluate/BootstrapOutOfBag.md
+    - user_guide/evaluate/cochrans_q.md    
     - user_guide/evaluate/confusion_matrix.md
     - user_guide/evaluate/lift_score.md
     - user_guide/evaluate/mcnemar_table.md

diff --git a/docs/sources/CHANGELOG.md b/docs/sources/CHANGELOG.md
@@ -25,6 +25,7 @@ The CHANGELOG for the current development version is available at
     via [takashioya](https://github.com/takashioya))
 - The new `store_train_meta_features` attribute and `pred_meta_features` method for the `StackingCVRegressor` were also added to the `StackingRegressor`, `StackingClassifier`, and `StackingCVClassifier` ([#299](https://github.com/rasbt/mlxtend/pull/299) & [#300](https://github.com/rasbt/mlxtend/pull/300)) 
 - New function (`evaluate.mcnemar_tables`) for creating multiple 2x2 contigency from model predictions arrays that can be used in multiple McNemar (post-hoc) tests or Cochran's Q or F tests, etc. ([#307](https://github.com/rasbt/mlxtend/issues/307))
+- New function (`evaluate.cochrans_q`) for performing Cochran's Q test to compare the accuracy of multiple classifiers. ([#310](https://github.com/rasbt/mlxtend/issues/310))
 
 ##### Changes
 

diff --git a/docs/sources/USER_GUIDE_INDEX.md b/docs/sources/USER_GUIDE_INDEX.md
@@ -27,6 +27,7 @@
 - [bootstrap](user_guide/evaluate/bootstrap.md)
 - [bootstrap_point632_score](user_guide/evaluate/bootstrap_point632_score.md)
 - [BootstrapOutOfBag](user_guide/evaluate/BootstrapOutOfBag.md)
+- [cochrans_q](user_guide/evaluate/cochrans_q.md)
 - [confusion_matrix](user_guide/evaluate/confusion_matrix.md)
 - [lift_score](user_guide/evaluate/lift_score.md)
 - [mcnemar_table](user_guide/evaluate/mcnemar_table.md)

diff --git a/docs/sources/user_guide/evaluate/bootstrap.ipynb b/docs/sources/user_guide/evaluate/bootstrap.ipynb
@@ -269,7 +269,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.1"
+   "version": "3.6.3"
   }
  },
  "nbformat": 4,

diff --git a/docs/sources/user_guide/evaluate/cochrans_q.ipynb b/docs/sources/user_guide/evaluate/cochrans_q.ipynb
@@ -0,0 +1,347 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Cochran's Q Test"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Cochran's Q test for comparing the performance of multiple classifiers."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> `from mlxtend.evaluate import cochrans_q`    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Overview"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Cochran's Q test can be regarded as a generalized version of McNemar's test that can be applied to evaluate multiple classifiers. In a sense, Cochran's Q test is analogous to ANOVA for binary outcomes. \n",
+    "\n",
+    "To compare more than two classifiers, we can use Cochran's Q test, which has a test statistic $Q$ that is approximately, (similar to McNemar's test), distributed as chi-squared with $L-1$ degrees of freedom, where L is the number of models we evaluate (since $L=2$ for McNemar's test, McNemars test statistic approximates a chi-squared distribution with one degree of freedom). \n",
+    "\n",
+    "More formally, Cochran's Q test tests the hypothesis that there is no difference between the classification accuracies [1]: \n",
+    "\n",
+    "$$p_i: H_0 = p_1 = p_2 = \\cdots = p_L.$$\n",
+    "\n",
+    "Let $\\{D_1, \\dots , D_L\\}$ be a set of classifiers who have all been tested on the same dataset. If the L classifiers don't perform differently, then the following Q statistic is distributed approximately as\n",
+    "\"chi-squared\" with $L-1$ degrees of freedom:\n",
+    "\n",
+    "$$Q_C = (L-1) \\frac{L \\sum^{L}_{i=1}G_{i}^{2} - T^2}{LT - \\sum^{N_{ts}}_{j=1} (L_j)^2}.$$\n",
+    "\n",
+    "Here, $G_i$ is the number of objects out of $N_{ts}$ correctly classified by $D_i= 1, \\dots  L$; $L_j$ is the number of classifiers out of $L$ that correctly classified object $\\mathbf{z}_j \\in \\mathbf{Z}_{ts}$, where $\\mathbf{Z}_{ts} = \\{\\mathbf{z}_1, ... \\mathbf{z}_{N_{ts}}\\}$ is the test dataset on which the classifers are tested on; and $T$ is the total number of correct number of votes among the $L$ classifiers [2]:\n",
+    "\n",
+    "$$ T = \\sum_{i=1}^{L} G_i = \\sum^{N_{ts}}_{j=1} L_j.$$\n",
+    "\n",
+    "\n",
+    "To perform Cochran's Q test, we typically organize the classificier predictions in a binary $N_{ts} \\times L$ matrix. The $ij\\text{th}$ entry of such matrix is 0 if a classifier $D_j$ has misclassified a data example (vector) $\\mathbf{z}_i$ and 1 otherwise (if the classifier predicted the class label $l(\\mathbf{z}_i)$ correctly) [2].\n",
+    "\n",
+    "The following example taken from [2] illustrates how the classification results may be organized. For instance, assume we have the ground truth labels of the test dataset `y_true` and the following predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):\n",
+    "\n",
+    "    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                       0, 0, 0, 0, 0])\n",
+    "\n",
+    "    y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0])\n",
+    "\n",
+    "    y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0])\n",
+    "\n",
+    "    y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                          1, 1])\n",
+    "\n",
+    "The table of correct (1) and incorrect (0) classifications may then look as follows:\n",
+    "\n",
+    "|          | $D_1$ (model 1)   | $D_2$ (model 2)   | $D_3$ (model 3)   | Occurrences |\n",
+    "|----------|-------------------|-------------------|-------------------|-------------|\n",
+    "|          | 1                 | 1                 | 1                 | 80          |\n",
+    "|          | 1                 | 1                 | 0                 | 2           |\n",
+    "|          | 1                 | 0                 | 1                 | 0           |\n",
+    "|          | 1                 | 0                 | 0                 | 2           |\n",
+    "|          | 0                 | 1                 | 1                 | 9           |\n",
+    "|          | 0                 | 1                 | 0                 | 1           |\n",
+    "|          | 0                 | 0                 | 1                 | 3           |\n",
+    "|          | 0                 | 0                 | 0                 | 3           |\n",
+    "| Accuracy | 84/100*100% = 84% | 92/100*100% = 92% | 92/100*100% = 92% |             |\n",
+    "\n",
+    "By plugging in the respective value into the previous equation, we obtain the following $Q$ value [2]:\n",
+    "\n",
+    "$$Q_c = 2 \\times \\frac{3 \\times (84^2 + 92^2 + 92^2) - 268^2}{3\\times 268-(80 \\times 9 + 11 \\times 4 + 6 \\times 1)} \\approx 7.5294.$$\n",
+    "\n",
+    "(Note that the $Q$ value in [2] is listed as 3.7647 due to a typo as discussed with the author, the value 7.5294 is the correct one.)\n",
+    "\n",
+    "Now, the Q value (approximating $\\chi^2$) corresponds to a p-value of approx. 0.023 assuming a $\\chi^2$ distribution with $L-1 = 2$ degrees of freedom. Assuming that we chose a significance level of $\\alpha=0.05$, we would reject the null hypothesis that all classifiers perform equally well, since $0.023 < \\alpha$.\n",
+    "\n",
+    "In practice, if we successfully rejected the null hypothesis, we could perform multiple post hoc pair-wise tests -- for example, McNemar tests with a Bonferroni correction -- to determine which pairs have different population proportions.\n",
+    "\n",
+    "\n",
+    "### References\n",
+    "\n",
+    "- [1] Fleiss, Joseph L., Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and proportions. John Wiley & Sons, 2013.\n",
+    "- [2] Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 1 - Cochran's Q test"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from mlxtend.evaluate import cochrans_q\n",
+    "from mlxtend.evaluate import mcnemar_table\n",
+    "from mlxtend.evaluate import mcnemar\n",
+    "\n",
+    "## Dataset:\n",
+    "\n",
+    "# ground truth labels of the test dataset:\n",
+    "\n",
+    "y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                   0, 0, 0, 0, 0])\n",
+    "\n",
+    "\n",
+    "# predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):\n",
+    "\n",
+    "y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0])\n",
+    "\n",
+    "y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0])\n",
+    "\n",
+    "y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
+    "                      1, 1])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Assuming a significance level $\\alpha=0.05$, we can conduct Cochran's Q test as follows, to test the null hypothesis there is no difference between the classification accuracies, $p_i: H_0 = p_1 = p_2 = \\cdots = p_L$:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Q: 7.529\n",
+      "p-value: 0.023\n"
+     ]
+    }
+   ],
+   "source": [
+    "q, p_value = cochrans_q(y_true, \n",
+    "                        y_model_1, \n",
+    "                        y_model_2, \n",
+    "                        y_model_3)\n",
+    "\n",
+    "print('Q: %.3f' % q)\n",
+    "print('p-value: %.3f' % p_value)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since the p-value is smaller than $\\alpha$, we can reject the null hypothesis and conclude that there is a difference between the classification accuracies. As mentioned in the introduction earlier, we could now perform multiple post hoc pair-wise tests -- for example, McNemar tests with a Bonferroni correction -- to determine which pairs have different population proportions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Lastly, let's illustrate that Cochran's Q test is indeed just a generalized version of McNemar's test:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Cochran's Q Chi^2: 5.333\n",
+      "Cochran's Q p-value: 0.021\n"
+     ]
+    }
+   ],
+   "source": [
+    "chi2, p_value = cochrans_q(y_true, \n",
+    "                           y_model_1, \n",
+    "                           y_model_2)\n",
+    "\n",
+    "print('Cochran\\'s Q Chi^2: %.3f' % chi2)\n",
+    "print('Cochran\\'s Q p-value: %.3f' % p_value)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "McNemar's Chi^2: 5.333\n",
+      "McNemar's p-value: 0.021\n"
+     ]
+    }
+   ],
+   "source": [
+    "chi2, p_value = mcnemar(mcnemar_table(y_true, \n",
+    "                                      y_model_1, \n",
+    "                                      y_model_2),\n",
+    "                        corrected=False)\n",
+    "\n",
+    "print('McNemar\\'s Chi^2: %.3f' % chi2)\n",
+    "print('McNemar\\'s p-value: %.3f' % p_value)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## API"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "## cochrans_q\n",
+      "\n",
+      "*cochrans_q(y_target, *y_model_predictions)*\n",
+      "\n",
+      "Cochran's Q test to compare 2 or more models.\n",
+      "\n",
+      "**Parameters**\n",
+      "\n",
+      "- `y_target` : array-like, shape=[n_samples]\n",
+      "\n",
+      "    True class labels as 1D NumPy array.\n",
+      "\n",
+      "\n",
+      "- `*y_model_predictions` : array-likes, shape=[n_samples]\n",
+      "\n",
+      "    Variable number of 2 or more arrays that\n",
+      "    contain the predicted class labels\n",
+      "    from models as 1D NumPy array.\n",
+      "\n",
+      "**Returns**\n",
+      "\n",
+      "\n",
+      "- `q, p` : float or None, float\n",
+      "\n",
+      "    Returns the Q (chi-squared) value and the p-value\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "with open('../../api_modules/mlxtend.evaluate/cochrans_q.md', 'r') as f:\n",
+    "    s = f.read() \n",
+    "print(s)"
+   ]
+  }
+ ],
+ "metadata": {
+  "anaconda-cloud": {},
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/mlxtend/evaluate/__init__.py b/mlxtend/evaluate/__init__.py
@@ -14,10 +14,12 @@
 from .bootstrap_outofbag import BootstrapOutOfBag
 from .bootstrap_point632 import bootstrap_point632_score
 from .permutation import permutation_test
+from .cochrans_q import cochrans_q
 
 
 __all__ = ["scoring", "confusion_matrix",
            "mcnemar_table", "mcnemar_tables",
            "mcnemar", "lift_score",
            "bootstrap", "permutation_test",
-           "BootstrapOutOfBag", "bootstrap_point632_score"]
+           "BootstrapOutOfBag", "bootstrap_point632_score",
+           "cochrans_q"]