cochrans q test #310

Merged
merged 1 commit into from
Dec 22, 2017
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -55,6 +55,7 @@ pages:
- user_guide/evaluate/bootstrap.md
- user_guide/evaluate/bootstrap_point632_score.md
- user_guide/evaluate/BootstrapOutOfBag.md
- user_guide/evaluate/cochrans_q.md
- user_guide/evaluate/confusion_matrix.md
- user_guide/evaluate/lift_score.md
- user_guide/evaluate/mcnemar_table.md
1 change: 1 addition & 0 deletions docs/sources/CHANGELOG.md
@@ -25,6 +25,7 @@ The CHANGELOG for the current development version is available at
via [takashioya](https://github.com/takashioya))
- The new `store_train_meta_features` attribute and `pred_meta_features` method for the `StackingCVRegressor` were also added to the `StackingRegressor`, `StackingClassifier`, and `StackingCVClassifier` ([#299](https://github.com/rasbt/mlxtend/pull/299) & [#300](https://github.com/rasbt/mlxtend/pull/300))
- New function (`evaluate.mcnemar_tables`) for creating multiple 2x2 contingency tables from model prediction arrays that can be used in multiple McNemar (post-hoc) tests or Cochran's Q or F tests, etc. ([#307](https://github.com/rasbt/mlxtend/issues/307))
- New function (`evaluate.cochrans_q`) for performing Cochran's Q test to compare the accuracy of multiple classifiers. ([#310](https://github.com/rasbt/mlxtend/issues/310))

##### Changes

1 change: 1 addition & 0 deletions docs/sources/USER_GUIDE_INDEX.md
@@ -27,6 +27,7 @@
- [bootstrap](user_guide/evaluate/bootstrap.md)
- [bootstrap_point632_score](user_guide/evaluate/bootstrap_point632_score.md)
- [BootstrapOutOfBag](user_guide/evaluate/BootstrapOutOfBag.md)
- [cochrans_q](user_guide/evaluate/cochrans_q.md)
- [confusion_matrix](user_guide/evaluate/confusion_matrix.md)
- [lift_score](user_guide/evaluate/lift_score.md)
- [mcnemar_table](user_guide/evaluate/mcnemar_table.md)
2 changes: 1 addition & 1 deletion docs/sources/user_guide/evaluate/bootstrap.ipynb
@@ -269,7 +269,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
"version": "3.6.3"
}
},
"nbformat": 4,
347 changes: 347 additions & 0 deletions docs/sources/user_guide/evaluate/cochrans_q.ipynb
@@ -0,0 +1,347 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cochran's Q Test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cochran's Q test for comparing the performance of multiple classifiers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> `from mlxtend.evaluate import cochrans_q` "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cochran's Q test can be regarded as a generalized version of McNemar's test that can be applied to evaluate multiple classifiers. In a sense, Cochran's Q test is analogous to ANOVA for binary outcomes. \n",
"\n",
"To compare more than two classifiers, we can use Cochran's Q test, which has a test statistic $Q$ that is approximately, (similar to McNemar's test), distributed as chi-squared with $L-1$ degrees of freedom, where L is the number of models we evaluate (since $L=2$ for McNemar's test, McNemars test statistic approximates a chi-squared distribution with one degree of freedom). \n",
"\n",
"More formally, Cochran's Q test tests the hypothesis that there is no difference between the classification accuracies [1]: \n",
"\n",
"$$p_i: H_0 = p_1 = p_2 = \\cdots = p_L.$$\n",
"\n",
"Let $\\{D_1, \\dots , D_L\\}$ be a set of classifiers who have all been tested on the same dataset. If the L classifiers don't perform differently, then the following Q statistic is distributed approximately as\n",
"\"chi-squared\" with $L-1$ degrees of freedom:\n",
"\n",
"$$Q_C = (L-1) \\frac{L \\sum^{L}_{i=1}G_{i}^{2} - T^2}{LT - \\sum^{N_{ts}}_{j=1} (L_j)^2}.$$\n",
"\n",
"Here, $G_i$ is the number of objects out of $N_{ts}$ correctly classified by $D_i= 1, \\dots L$; $L_j$ is the number of classifiers out of $L$ that correctly classified object $\\mathbf{z}_j \\in \\mathbf{Z}_{ts}$, where $\\mathbf{Z}_{ts} = \\{\\mathbf{z}_1, ... \\mathbf{z}_{N_{ts}}\\}$ is the test dataset on which the classifers are tested on; and $T$ is the total number of correct number of votes among the $L$ classifiers [2]:\n",
"\n",
"$$ T = \\sum_{i=1}^{L} G_i = \\sum^{N_{ts}}_{j=1} L_j.$$\n",
"\n",
"\n",
"To perform Cochran's Q test, we typically organize the classificier predictions in a binary $N_{ts} \\times L$ matrix. The $ij\\text{th}$ entry of such matrix is 0 if a classifier $D_j$ has misclassified a data example (vector) $\\mathbf{z}_i$ and 1 otherwise (if the classifier predicted the class label $l(\\mathbf{z}_i)$ correctly) [2].\n",
"\n",
"The following example taken from [2] illustrates how the classification results may be organized. For instance, assume we have the ground truth labels of the test dataset `y_true` and the following predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):\n",
"\n",
" y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0])\n",
"\n",
" y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0])\n",
"\n",
" y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0])\n",
"\n",
" y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1])\n",
"\n",
"The table of correct (1) and incorrect (0) classifications may then look as follows:\n",
"\n",
"| | $D_1$ (model 1) | $D_2$ (model 2) | $D_3$ (model 3) | Occurrences |\n",
"|----------|-------------------|-------------------|-------------------|-------------|\n",
"| | 1 | 1 | 1 | 80 |\n",
"| | 1 | 1 | 0 | 2 |\n",
"| | 1 | 0 | 1 | 0 |\n",
"| | 1 | 0 | 0 | 2 |\n",
"| | 0 | 1 | 1 | 9 |\n",
"| | 0 | 1 | 0 | 1 |\n",
"| | 0 | 0 | 1 | 3 |\n",
"| | 0 | 0 | 0 | 3 |\n",
"| Accuracy | 84/100*100% = 84% | 92/100*100% = 92% | 92/100*100% = 92% | |\n",
"\n",
"By plugging in the respective value into the previous equation, we obtain the following $Q$ value [2]:\n",
"\n",
"$$Q_c = 2 \\times \\frac{3 \\times (84^2 + 92^2 + 92^2) - 268^2}{3\\times 268-(80 \\times 9 + 11 \\times 4 + 6 \\times 1)} \\approx 7.5294.$$\n",
"\n",
"(Note that the $Q$ value in [2] is listed as 3.7647 due to a typo as discussed with the author, the value 7.5294 is the correct one.)\n",
"\n",
"Now, the Q value (approximating $\\chi^2$) corresponds to a p-value of approx. 0.023 assuming a $\\chi^2$ distribution with $L-1 = 2$ degrees of freedom. Assuming that we chose a significance level of $\\alpha=0.05$, we would reject the null hypothesis that all classifiers perform equally well, since $0.023 < \\alpha$.\n",
"\n",
"In practice, if we successfully rejected the null hypothesis, we could perform multiple post hoc pair-wise tests -- for example, McNemar tests with a Bonferroni correction -- to determine which pairs have different population proportions.\n",
"\n",
"\n",
"### References\n",
"\n",
"- [1] Fleiss, Joseph L., Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and proportions. John Wiley & Sons, 2013.\n",
"- [2] Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 1 - Cochran's Q test"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from mlxtend.evaluate import cochrans_q\n",
"from mlxtend.evaluate import mcnemar_table\n",
"from mlxtend.evaluate import mcnemar\n",
"\n",
"## Dataset:\n",
"\n",
"# ground truth labels of the test dataset:\n",
"\n",
"y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0])\n",
"\n",
"\n",
"# predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):\n",
"\n",
"y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0])\n",
"\n",
"y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0])\n",
"\n",
"y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assuming a significance level $\\alpha=0.05$, we can conduct Cochran's Q test as follows, to test the null hypothesis there is no difference between the classification accuracies, $p_i: H_0 = p_1 = p_2 = \\cdots = p_L$:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q: 7.529\n",
"p-value: 0.023\n"
]
}
],
"source": [
"q, p_value = cochrans_q(y_true, \n",
" y_model_1, \n",
" y_model_2, \n",
" y_model_3)\n",
"\n",
"print('Q: %.3f' % q)\n",
"print('p-value: %.3f' % p_value)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the p-value is smaller than $\\alpha$, we can reject the null hypothesis and conclude that there is a difference between the classification accuracies. As mentioned in the introduction earlier, we could now perform multiple post hoc pair-wise tests -- for example, McNemar tests with a Bonferroni correction -- to determine which pairs have different population proportions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, let's illustrate that Cochran's Q test is indeed just a generalized version of McNemar's test:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cochran's Q Chi^2: 5.333\n",
"Cochran's Q p-value: 0.021\n"
]
}
],
"source": [
"chi2, p_value = cochrans_q(y_true, \n",
" y_model_1, \n",
" y_model_2)\n",
"\n",
"print('Cochran\\'s Q Chi^2: %.3f' % chi2)\n",
"print('Cochran\\'s Q p-value: %.3f' % p_value)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"McNemar's Chi^2: 5.333\n",
"McNemar's p-value: 0.021\n"
]
}
],
"source": [
"chi2, p_value = mcnemar(mcnemar_table(y_true, \n",
" y_model_1, \n",
" y_model_2),\n",
" corrected=False)\n",
"\n",
"print('McNemar\\'s Chi^2: %.3f' % chi2)\n",
"print('McNemar\\'s p-value: %.3f' % p_value)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## cochrans_q\n",
"\n",
"*cochrans_q(y_target, *y_model_predictions)*\n",
"\n",
"Cochran's Q test to compare 2 or more models.\n",
"\n",
"**Parameters**\n",
"\n",
"- `y_target` : array-like, shape=[n_samples]\n",
"\n",
" True class labels as 1D NumPy array.\n",
"\n",
"\n",
"- `*y_model_predictions` : array-likes, shape=[n_samples]\n",
"\n",
" Variable number of 2 or more arrays that\n",
" contain the predicted class labels\n",
" from models as 1D NumPy array.\n",
"\n",
"**Returns**\n",
"\n",
"\n",
"- `q, p` : float or None, float\n",
"\n",
" Returns the Q (chi-squared) value and the p-value\n",
"\n",
"\n"
]
}
],
"source": [
"with open('../../api_modules/mlxtend.evaluate/cochrans_q.md', 'r') as f:\n",
" s = f.read() \n",
"print(s)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
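
The Q statistic described in the notebook's Overview can be reproduced by hand from the occurrence table. The following is a minimal, standard-library-only sketch of that arithmetic; it is not the implementation added in this PR, and the names `TABLE`, `cochrans_q_from_table`, and `chi2_sf` are hypothetical helpers introduced here for illustration.

```python
import math

# Occurrence table from the notebook's Overview:
# (D1 correct?, D2 correct?, D3 correct?) -> number of test examples.
TABLE = {
    (1, 1, 1): 80,
    (1, 1, 0): 2,
    (1, 0, 1): 0,
    (1, 0, 0): 2,
    (0, 1, 1): 9,
    (0, 1, 0): 1,
    (0, 0, 1): 3,
    (0, 0, 0): 3,
}


def cochrans_q_from_table(table):
    """Hypothetical helper: return (Q, degrees of freedom) for a table
    that maps correctness patterns to counts."""
    n_models = len(next(iter(table)))
    # G_i: number of examples that classifier i got right.
    g = [sum(count for pattern, count in table.items() if pattern[i])
         for i in range(n_models)]
    t = sum(g)  # T: total number of correct votes
    # Sum of L_j^2 over all test examples, grouped by pattern.
    sum_lj_sq = sum(count * sum(pattern) ** 2
                    for pattern, count in table.items())
    numerator = n_models * sum(x * x for x in g) - t ** 2
    denominator = n_models * t - sum_lj_sq
    return (n_models - 1) * numerator / denominator, n_models - 1


def chi2_sf(x, df):
    """Hypothetical helper: upper-tail chi-squared probability, built from
    the regularized incomplete-gamma recurrence (no SciPy needed)."""
    half_x = x / 2.0
    if df % 2 == 0:
        a, sf = 1.0, math.exp(-half_x)             # exact for df = 2
    else:
        a, sf = 0.5, math.erfc(math.sqrt(half_x))  # exact for df = 1
    while a < df / 2.0:
        sf += half_x ** a * math.exp(-half_x) / math.gamma(a + 1.0)
        a += 1.0
    return sf


q, df = cochrans_q_from_table(TABLE)
print('Q: %.3f' % q)
print('p-value: %.3f' % chi2_sf(q, df))
```

Run as written, this should print the same `Q: 7.529` and `p-value: 0.023` as the notebook's `cochrans_q` call, since the table encodes the same 100 predictions.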
4 changes: 3 additions & 1 deletion mlxtend/evaluate/__init__.py
@@ -14,10 +14,12 @@
from .bootstrap_outofbag import BootstrapOutOfBag
from .bootstrap_point632 import bootstrap_point632_score
from .permutation import permutation_test
from .cochrans_q import cochrans_q


__all__ = ["scoring", "confusion_matrix",
"mcnemar_table", "mcnemar_tables",
"mcnemar", "lift_score",
"bootstrap", "permutation_test",
"BootstrapOutOfBag", "bootstrap_point632_score"]
"BootstrapOutOfBag", "bootstrap_point632_score",
"cochrans_q"]