add 5x2cv paired t test #325

Merged
merged 1 commit on Jan 20, 2018
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -61,6 +61,7 @@ pages:
- user_guide/evaluate/mcnemar_table.md
- user_guide/evaluate/mcnemar_tables.md
- user_guide/evaluate/mcnemar.md
- user_guide/evaluate/paired_ttest_5x2cv.md
- user_guide/evaluate/paired_ttest_kfold_cv.md
- user_guide/evaluate/paired_ttest_resampled.md
- user_guide/evaluate/permutation_test.md
2 changes: 2 additions & 0 deletions docs/sources/CHANGELOG.md
@@ -22,6 +22,8 @@ The CHANGELOG for the current development version is available at
- New function implementing the k-fold paired t-test procedure (`paired_ttest_kfold_cv`)
to compare the performance of two models
(also called k-hold-out paired t-test). ([#324](https://github.com/rasbt/mlxtend/issues/324))
- New function implementing the 5x2cv paired t-test procedure (`paired_ttest_5x2cv`) proposed by Dietterich (1998)
to compare the performance of two models. ([#325](https://github.com/rasbt/mlxtend/issues/325))

##### Changes

1 change: 1 addition & 0 deletions docs/sources/USER_GUIDE_INDEX.md
@@ -33,6 +33,7 @@
- [mcnemar_table](user_guide/evaluate/mcnemar_table.md)
- [mcnemar_tables](user_guide/evaluate/mcnemar_tables.md)
- [mcnemar](user_guide/evaluate/mcnemar.md)
- [paired_ttest_5x2cv](user_guide/evaluate/paired_ttest_5x2cv.md)
- [paired_ttest_kfold_cv](user_guide/evaluate/paired_ttest_kfold_cv.md)
- [paired_ttest_resampled](user_guide/evaluate/paired_ttest_resampled.md)
- [permutation_test](user_guide/evaluate/permutation_test.md)
326 changes: 326 additions & 0 deletions docs/sources/user_guide/evaluate/paired_ttest_5x2cv.ipynb
@@ -0,0 +1,326 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5x2cv paired *t* test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5x2cv paired *t* test procedure to compare the performance of two models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> `from mlxtend.evaluate import paired_ttest_5x2cv` "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 5x2cv paired *t* test is a procedure for comparing the performance of two models (classifiers or regressors)\n",
"that was proposed by Dietterich [1] to address shortcomings in other methods such as the resampled paired *t* test (see [`paired_ttest_resampled`](paired_ttest_resampled.md)) and the k-fold cross-validated paired *t* test (see [`paired_ttest_kfold_cv`](paired_ttest_kfold_cv.md)).\n",
"\n",
"To explain how this method works, let's consider to estimator (e.g., classifiers) A and B. Further, we have a labeled dataset *D*. In the common hold-out method, we typically split the dataset into 2 parts: a training and a test set. In the 5x2cv paired *t* test, we repeat the splitting (50% training and 50% test data) 5 times. \n",
"\n",
"In each of the 5 iterations, we fit A and B to the training split and evaluate their performance ($p_A$ and $p_B$) on the test split. Then, we rotate the training and test sets (the training set becomes the test set and vice versa) compute the performance again, which results in 2 performance difference measures:\n",
"\n",
"$$p^{(1)} = p^{(1)}_A - p^{(1)}_B$$\n",
"\n",
"and\n",
"\n",
"$$p^{(2)} = p^{(2)}_A - p^{(2)}_B.$$\n",
"\n",
"Then, we estimate the estimate mean and variance of the differences:\n",
"\n",
"$\\overline{p} = \\frac{p^{(1)} + p^{(2)}}{2}$\n",
"\n",
"and\n",
"\n",
"$s^2 = (p^{(1)} - \\overline{p})^2 + (p^{(2)} - \\overline{p})^2.$\n",
"\n",
"The variance of the difference is computed for the 5 iterations and then used to compute the *t* statistic as follows:\n",
"\n",
"$$t = \\frac{p_1^{(1)}}{\\sqrt{(1/5) \\sum_{i=1}^{5}s_i^2}},$$\n",
"\n",
"where $p_1^{(1)}$ is the $p_1$ from the very first iteration. The *t* statistic, assuming that it approximately follows as *t* distribution with 5 degrees of freedom, under the null hypothesis that the models A and B have equal performance. Using the *t* statistic, the p value can be computed and compared with a previously chosen significance level, e.g., $\\alpha=0.05$. If the p value is smaller than $\\alpha$, we reject the null hypothesis and accept that there is a significant difference in the two models.\n",
"\n"
]
},
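{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the procedure concrete, the cell below is a minimal sketch (not mlxtend's internal implementation) that computes the 5x2cv *t* statistic and the two-tailed p value directly from the formulas above using scikit-learn estimators; the variable names (`clf_a`, `clf_b`, `variances`) are chosen here purely for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of the 5x2cv paired t test computed by hand;\n",
"# for practical use, prefer mlxtend.evaluate.paired_ttest_5x2cv.\n",
"import numpy as np\n",
"from scipy import stats\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from mlxtend.data import iris_data\n",
"\n",
"X, y = iris_data()\n",
"clf_a = LogisticRegression(random_state=1)\n",
"clf_b = DecisionTreeClassifier(random_state=1)\n",
"\n",
"variances = []\n",
"for i in range(5):\n",
"    # one 50%/50% split per iteration\n",
"    X_1, X_2, y_1, y_2 = train_test_split(X, y, test_size=0.5, random_state=i)\n",
"    # first difference: train on split 1, evaluate on split 2\n",
"    p1 = (clf_a.fit(X_1, y_1).score(X_2, y_2)\n",
"          - clf_b.fit(X_1, y_1).score(X_2, y_2))\n",
"    # second difference: the roles of the two splits are swapped\n",
"    p2 = (clf_a.fit(X_2, y_2).score(X_1, y_1)\n",
"          - clf_b.fit(X_2, y_2).score(X_1, y_1))\n",
"    p_mean = (p1 + p2) / 2.0\n",
"    variances.append((p1 - p_mean)**2 + (p2 - p_mean)**2)\n",
"    if i == 0:\n",
"        p_1_1 = p1  # p_1^(1): first difference of the very first iteration\n",
"\n",
"t = p_1_1 / np.sqrt(np.mean(variances))  # np.mean = (1/5) * sum of the s_i^2\n",
"pvalue = 2.0 * stats.t.sf(np.abs(t), df=5)  # two-tailed, 5 degrees of freedom\n",
"print('t statistic: %.3f' % t)\n",
"print('p value: %.3f' % pvalue)"
]
},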
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References\n",
"\n",
"- [1] Dietterich TG (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. *Neural Comput* 10:1895–1923."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 1 - 5x2cv paired *t* test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assume we want to compare two classification algorithms, logistic regression and a decision tree algorithm:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Logistic regression accuracy: 97.37%\n",
"Decision tree accuracy: 94.74%\n"
]
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from mlxtend.data import iris_data\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"\n",
"X, y = iris_data()\n",
"clf1 = LogisticRegression(random_state=1)\n",
"clf2 = DecisionTreeClassifier(random_state=1)\n",
"\n",
"X_train, X_test, y_train, y_test = \\\n",
" train_test_split(X, y, test_size=0.25,\n",
" random_state=123)\n",
"\n",
"score1 = clf1.fit(X_train, y_train).score(X_test, y_test)\n",
"score2 = clf2.fit(X_train, y_train).score(X_test, y_test)\n",
"\n",
"print('Logistic regression accuracy: %.2f%%' % (score1*100))\n",
"print('Decision tree accuracy: %.2f%%' % (score2*100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that these accuracy values are not used in the paired *t* test procedure as new test/train splits are generated during the resampling procedure, the values above are just serving the purpose of intuition.\n",
"\n",
"Now, let's assume a significance threshold of $\\alpha=0.05$ for rejecting the null hypothesis that both algorithms perform equally well on the dataset and conduct the 5x2cv *t* test:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"t statistic: -1.539\n",
"p value: 0.184\n"
]
}
],
"source": [
"from mlxtend.evaluate import paired_ttest_5x2cv\n",
"\n",
"\n",
"t, p = paired_ttest_5x2cv(estimator1=clf1,\n",
" estimator2=clf2,\n",
" X=X, y=y,\n",
" random_seed=1)\n",
"\n",
"print('t statistic: %.3f' % t)\n",
"print('p value: %.3f' % p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since $p > t$, we cannot reject the null hypothesis and may conclude that the performance of the two algorithms is not significantly different. \n",
"\n",
"While it is generally not recommended to apply statistical tests multiple times without correction for multiple hypothesis testing, let us take a look at an example where the decision tree algorithm is limited to producing a very simple decision boundary that would result in a relatively bad performance:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Decision tree accuracy: 63.16%\n",
"t statistic: 5.386\n",
"p value: 0.003\n"
]
}
],
"source": [
"clf2 = DecisionTreeClassifier(random_state=1, max_depth=1)\n",
"\n",
"score2 = clf2.fit(X_train, y_train).score(X_test, y_test)\n",
"print('Decision tree accuracy: %.2f%%' % (score2*100))\n",
"\n",
"\n",
"t, p = paired_ttest_5x2cv(estimator1=clf1,\n",
" estimator2=clf2,\n",
" X=X, y=y,\n",
" random_seed=1)\n",
"\n",
"print('t statistic: %.3f' % t)\n",
"print('p value: %.3f' % p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assuming that we conducted this test also with a significance level of $\\alpha=0.05$, we can reject the null-hypothesis that both models perform equally well on this dataset, since the p-value ($p < 0.001$) is smaller than $\\alpha$."
]
},
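{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a brief usage note (a sketch, not part of the original examples), the `scoring` parameter documented in the API section below also accepts scikit-learn scoring strings; for the three-class iris data, a macro-averaged F1 score could be used, for instance (here, `'f1_macro'` is assumed to be an appropriate sklearn scorer identifier for this dataset):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: replacing the default accuracy metric with a scikit-learn\n",
"# scoring string; any scorer accepted by sklearn's scoring API works here.\n",
"from mlxtend.evaluate import paired_ttest_5x2cv\n",
"\n",
"\n",
"t, p = paired_ttest_5x2cv(estimator1=clf1,\n",
"                          estimator2=clf2,\n",
"                          X=X, y=y,\n",
"                          scoring='f1_macro',\n",
"                          random_seed=1)\n",
"\n",
"print('t statistic: %.3f' % t)\n",
"print('p value: %.3f' % p)"
]
},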
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## paired_ttest_5x2cv\n",
"\n",
"*paired_ttest_5x2cv(estimator1, estimator2, X, y, scoring=None, random_seed=None)*\n",
"\n",
"Implements the 5x2cv paired t test proposed\n",
"by Dieterrich (1998)\n",
"to compare the performance of two models.\n",
"\n",
"**Parameters**\n",
"\n",
"- `estimator1` : scikit-learn classifier or regressor\n",
"\n",
"\n",
"\n",
"- `estimator2` : scikit-learn classifier or regressor\n",
"\n",
"\n",
"\n",
"- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n",
"\n",
" Training vectors, where n_samples is the number of samples and\n",
" n_features is the number of features.\n",
"\n",
"\n",
"- `y` : array-like, shape = [n_samples]\n",
"\n",
" Target values.\n",
"\n",
"\n",
"- `scoring` : str, callable, or None (default: None)\n",
"\n",
" If None (default), uses 'accuracy' for sklearn classifiers\n",
" and 'r2' for sklearn regressors.\n",
" If str, uses a sklearn scoring metric string identifier, for example\n",
" {accuracy, f1, precision, recall, roc_auc} for classifiers,\n",
" {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error',\n",
" 'median_absolute_error', 'r2'} for regressors.\n",
" If a callable object or function is provided, it has to be conform with\n",
" sklearn's signature ``scorer(estimator, X, y)``; see\n",
" http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html\n",
" for more information.\n",
"\n",
"\n",
"- `random_seed` : int or None (default: None)\n",
"\n",
" Random seed for creating the test/train splits.\n",
"\n",
"**Returns**\n",
"\n",
"- `t` : float\n",
"\n",
" The t-statistic\n",
"\n",
"\n",
"- `pvalue` : float\n",
"\n",
" Two-tailed p-value.\n",
" If the chosen significance level is larger\n",
" than the p-value, we reject the null hypothesis\n",
" and accept that there are significant differences\n",
" in the two compared models.\n",
"\n",
"\n"
]
}
],
"source": [
"with open('../../api_modules/mlxtend.evaluate/paired_ttest_5x2cv.md', 'r') as f:\n",
" s = f.read() \n",
"print(s)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
6 changes: 3 additions & 3 deletions docs/sources/user_guide/evaluate/paired_ttest_kfold_cv.ipynb
Original file line number Diff line number Diff line change
@@ -13,7 +13,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# K-fold cross-validated paired t-test"
"# K-fold cross-validated paired *t* test"
]
},
{
@@ -41,7 +41,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"K-fold cross-validated paired t-test procedure is a common method for comparing the performance of two models (classifiers or regressors) and addresses some of the drawbacks of the [resampled t-test procedure](paired_ttest_resampled.md); however, this method has still the problem that the training sets overlap and is not recommended to be used in practice [1], and techniques such as the [`paired_ttest_5times2_cv`](paired_ttest_5times2_cv.md) should be used instead.\n",
"K-fold cross-validated paired t-test procedure is a common method for comparing the performance of two models (classifiers or regressors) and addresses some of the drawbacks of the [resampled t-test procedure](paired_ttest_resampled.md); however, this method has still the problem that the training sets overlap and is not recommended to be used in practice [1], and techniques such as the [`paired_ttest_5times2cv`](paired_ttest_5times2cv.md) should be used instead.\n",
"\n",
"To explain how this method works, let's consider to estimator (e.g., classifiers) A and B. Further, we have a labeled dataset *D*. In the common hold-out method, we typically split the dataset into 2 parts: a training and a test set. In the k-fold cross-validated paired t-test procedure, we split the test set into *k* parts of equal size, and each of these parts is then used for testing while the remaining *k-1* parts (joined together) are used for training a classifier or regressor (i.e., the standard k-fold cross-validation procedure).\n",
"\n",
@@ -73,7 +73,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 1 - Paired resampled t test"
"## Example 1 - K-fold cross-validated paired *t* test"
]
},
{