implementation of f-test #460

Merged 1 commit on Nov 8, 2018
1 change: 1 addition & 0 deletions docs/sources/CHANGELOG.md
@@ -24,6 +24,7 @@ The CHANGELOG for the current development version is available at
- Added a new convenience function `extract_face_landmarks` based on `dlib` to `mlxtend.image`. ([#458](https://github.com/rasbt/mlxtend/pull/458))
- Added a `method='oob'` option to the `mlxtend.evaluate.bootstrap_point632_score` method to compute the classic out-of-bag bootstrap estimate ([#459](https://github.com/rasbt/mlxtend/pull/459))
- Added a `method='.632+'` option to the `mlxtend.evaluate.bootstrap_point632_score` method to compute the .632+ bootstrap estimate that addresses the optimism bias of the .632 bootstrap ([#459](https://github.com/rasbt/mlxtend/pull/459))
- Added a new `mlxtend.evaluate.ftest` function to perform an F-test for comparing the accuracies of two or more classification models. ([#460](https://github.com/rasbt/mlxtend/pull/460))

##### Changes

1 change: 1 addition & 0 deletions docs/sources/USER_GUIDE_INDEX.md
@@ -30,6 +30,7 @@
- [cochrans_q](user_guide/evaluate/cochrans_q.md)
- [confusion_matrix](user_guide/evaluate/confusion_matrix.md)
- [feature_importance_permutation](user_guide/evaluate/feature_importance_permutation.md)
- [ftest](user_guide/evaluate/ftest.md)
- [lift_score](user_guide/evaluate/lift_score.md)
- [mcnemar_table](user_guide/evaluate/mcnemar_table.md)
- [mcnemar_tables](user_guide/evaluate/mcnemar_tables.md)
286 changes: 286 additions & 0 deletions docs/sources/user_guide/evaluate/ftest.ipynb
@@ -0,0 +1,286 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# F-Test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"F-test for comparing the performance of multiple classifiers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> `from mlxtend.evaluate import ftest` "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the context of evaluating machine learning models, the F-test by George W. Snedecor [1] can be regarded as analogous to Cochran's Q test that can be applied to evaluate multiple classifiers (i.e., whether their accuracies estimated on a test set differ) as described by Looney [2][3]. \n",
"\n",
"More formally, assume the task to test the null hypothesis that there is no difference between the classification accuracies [1]: \n",
"\n",
"$$p_i: H_0 = p_1 = p_2 = \\cdots = p_L.$$\n",
"\n",
"Let $\\{D_1, \\dots , D_L\\}$ be a set of classifiers who have all been tested on the same dataset. If the L classifiers don't perform differently, then the F statistic is distributed according to an F distribution with $(L-1$) and $(L-1)\\times(N)$ degrees of freedom, where $N$ is the number of examples in the test set.\n",
"\n",
"The calculation of the F statistic consists of several components, which are listed below (adopted from [3]).\n",
"\n",
"Sum of squares of the classifiers:\n",
"\n",
"$$\n",
"SSA = N \\sum_{i=1}^{N} (L_j)^2,\n",
"$$\n",
"\n",
"\n",
"where $L_j$ is the number of classifiers out of $L$ that correctly classified object $\\mathbf{z}_j \\in \\mathbf{Z}_{N}$, where $\\mathbf{Z}_{N} = \\{\\mathbf{z}_1, ... \\mathbf{z}_{N}\\}$ is the test dataset on which the classifers are tested on.\n",
"\n",
"The sum of squares for the objects:\n",
"\n",
"$$\n",
"SSB= \\frac{1}{L} \\sum_{j=1}^N (L_j)^2 - L\\cdot N \\cdot ACC_{avg}^2,\n",
"$$\n",
"\n",
"where $ACC_{avg}$ is the average of the accuracies of the different models $ACC_{avg} = \\sum_{i=1}^L ACC_i$.\n",
"\n",
"The total sum of squares:\n",
"\n",
"$$\n",
"SST = L\\cdot N \\cdot ACC_{avg}^2 (1 - ACC_{avg}^2).\n",
"$$\n",
"\n",
"The sum of squares for the classification--object interaction:\n",
"\n",
"$$\n",
"SSAB = SST - SSA - SSB.\n",
"$$\n",
"\n",
"The mean SSA and mean SSAB values:\n",
"\n",
"$$\n",
"MSA = \\frac{SSA}{L-1},\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
"MSAB = \\frac{SSAB}{(L-1) (N-1)}.\n",
"$$\n",
"\n",
"From the MSA and MSAB, we can then calculate the F-value as\n",
"\n",
"$$\n",
"F = \\frac{MSA}{MSAB}.\n",
"$$\n",
"\n",
"\n",
"After computing the F-value, we can then look up the p-value from a F-distribution table for the corresponding degrees of freedom or obtain it computationally from a cumulative F-distribution function. In practice, if we successfully rejected the null hypothesis at a previously chosen significance threshold, we could perform multiple post hoc pair-wise tests -- for example, McNemar tests with a Bonferroni correction -- to determine which pairs have different population proportions.\n",
"\n",
"\n",
"### References\n",
"\n",
"- [1] Snedecor, George W. and Cochran, William G. (1989), Statistical Methods, Eighth Edition, Iowa State University Press.\n",
"- [2] Looney, Stephen W. \"A statistical technique for comparing the accuracies of several classifiers.\" Pattern Recognition Letters 8, no. 1 (1988): 5-9.\n",
"- [3] Kuncheva, Ludmila I. Combining pattern classifiers: methods and algorithms. John Wiley & Sons, 2004.\n",
"\n"
]
},
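The step-by-step calculation above can also be illustrated with a small standalone sketch. The function below is a hypothetical re-implementation of the formulas (it is not the `mlxtend.evaluate.ftest` source) and assumes the p-value is obtained from SciPy's F-distribution survival function with $(L-1)$ and $(L-1)(N-1)$ degrees of freedom. Applied to the data in Example 1 below, it should reproduce an F value of about 3.87.

```python
import numpy as np
from scipy.stats import f as f_dist


def ftest_sketch(y_target, *y_model_predictions):
    """Hypothetical sketch of the F-test formulas above (not the mlxtend source)."""
    num_models = len(y_model_predictions)      # L
    num_examples = y_target.shape[0]           # N

    # binary matrix of shape (L, N): entry (i, j) is 1 if model i
    # classified test example j correctly
    correct = np.array([(y_target == y_pred).astype(int)
                        for y_pred in y_model_predictions])

    accuracies = correct.mean(axis=1)          # ACC_i for each model
    acc_avg = accuracies.mean()                # ACC_avg
    correct_per_object = correct.sum(axis=0)   # L_j for each test example

    # sums of squares as defined in the overview
    ssa = (num_examples * np.sum(accuracies**2)
           - num_models * num_examples * acc_avg**2)
    ssb = (np.sum(correct_per_object**2) / num_models
           - num_models * num_examples * acc_avg**2)
    sst = num_models * num_examples * acc_avg * (1. - acc_avg)
    ssab = sst - ssa - ssb

    msa = ssa / (num_models - 1)
    msab = ssab / ((num_models - 1) * (num_examples - 1))
    f_stat = msa / msab

    dof1 = num_models - 1
    dof2 = (num_models - 1) * (num_examples - 1)
    p_value = f_dist.sf(f_stat, dof1, dof2)
    return f_stat, p_value
```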
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 1 - F-test"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from mlxtend.evaluate import ftest\n",
"\n",
"## Dataset:\n",
"\n",
"# ground truth labels of the test dataset:\n",
"\n",
"y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0])\n",
"\n",
"\n",
"# predictions by 3 classifiers (`y_model_1`, `y_model_2`, and `y_model_3`):\n",
"\n",
"y_model_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0])\n",
"\n",
"y_model_2 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0])\n",
"\n",
"y_model_3 = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 1, 1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assuming a significance level $\\alpha=0.05$, we can conduct Cochran's Q test as follows, to test the null hypothesis there is no difference between the classification accuracies, $p_i: H_0 = p_1 = p_2 = \\cdots = p_L$:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F: 3.873\n",
"p-value: 0.022\n"
]
}
],
"source": [
"f, p_value = ftest(y_true, \n",
" y_model_1, \n",
" y_model_2, \n",
" y_model_3)\n",
"\n",
"print('F: %.3f' % f)\n",
"print('p-value: %.3f' % p_value)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the p-value is smaller than $\\alpha$, we can reject the null hypothesis and conclude that there is a difference between the classification accuracies. As mentioned in the introduction earlier, we could now perform multiple post hoc pair-wise tests -- for example, McNemar tests with a Bonferroni correction -- to determine which pairs have different population proportions."
]
},
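As a follow-up sketch (not part of this PR), such post hoc pairwise comparisons could be carried out with the existing `mcnemar_table` and `mcnemar` functions from `mlxtend.evaluate`, using a Bonferroni-corrected significance threshold. The snippet below reuses `y_true`, `y_model_1`, `y_model_2`, and `y_model_3` from Example 1:

```python
from itertools import combinations

from mlxtend.evaluate import mcnemar, mcnemar_table

models = {'model_1': y_model_1,
          'model_2': y_model_2,
          'model_3': y_model_3}

alpha = 0.05
pairs = list(combinations(sorted(models), 2))
corrected_alpha = alpha / len(pairs)  # Bonferroni correction over 3 pairwise tests

for name_a, name_b in pairs:
    # 2x2 contingency table of correct/incorrect predictions for the two models
    table = mcnemar_table(y_target=y_true,
                          y_model1=models[name_a],
                          y_model2=models[name_b])
    chi2, p = mcnemar(ary=table, corrected=True)
    print('%s vs %s: chi2=%.3f, p=%.4f, reject H0 at corrected alpha: %s'
          % (name_a, name_b, chi2, p, p < corrected_alpha))
```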
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## ftest\n",
"\n",
"*ftest(y_target, *y_model_predictions)*\n",
"\n",
"F-Test test to compare 2 or more models.\n",
"\n",
"**Parameters**\n",
"\n",
"- `y_target` : array-like, shape=[n_samples]\n",
"\n",
" True class labels as 1D NumPy array.\n",
"\n",
"\n",
"- `*y_model_predictions` : array-likes, shape=[n_samples]\n",
"\n",
" Variable number of 2 or more arrays that\n",
" contain the predicted class labels\n",
" from models as 1D NumPy array.\n",
"\n",
"**Returns**\n",
"\n",
"\n",
"- `f, p` : float or None, float\n",
"\n",
" Returns the F-value and the p-value\n",
"\n",
"**Examples**\n",
"\n",
"For usage examples, please see\n",
" [http://rasbt.github.io/mlxtend/user_guide/evaluate/ftest/](http://rasbt.github.io/mlxtend/user_guide/evaluate/ftest/)\n",
"\n",
"\n"
]
}
],
"source": [
"with open('../../api_modules/mlxtend.evaluate/ftest.md', 'r') as f:\n",
" s = f.read() \n",
"print(s)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}
4 changes: 3 additions & 1 deletion mlxtend/evaluate/__init__.py
@@ -22,6 +22,7 @@
from .ttest import paired_ttest_5x2cv
from .holdout import RandomHoldoutSplit
from .holdout import PredefinedHoldoutSplit
from .f_test import ftest


__all__ = ["scoring", "confusion_matrix",
@@ -32,4 +33,5 @@
"cochrans_q", "paired_ttest_resampled",
"paired_ttest_kfold_cv", "paired_ttest_5x2cv",
"feature_importance_permutation",
"RandomHoldoutSplit"]
"RandomHoldoutSplit", "PredefinedHoldoutSplit",
"ftest"]