rasbt · rasbt · Jan 20, 2018 · Jan 19, 2018 · Jan 20, 2018
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -61,6 +61,7 @@ pages:
     - user_guide/evaluate/mcnemar_table.md
     - user_guide/evaluate/mcnemar_tables.md
     - user_guide/evaluate/mcnemar.md
+    - user_guide/evaluate/paired_ttest_kfold_cv.md
     - user_guide/evaluate/paired_ttest_resampled.md
     - user_guide/evaluate/permutation_test.md
     - user_guide/evaluate/scoring.md

diff --git a/docs/sources/CHANGELOG.md b/docs/sources/CHANGELOG.md
@@ -16,9 +16,12 @@ The CHANGELOG for the current development version is available at
 
 ##### New Features
 
--   New function implementing the resampled paired t-test procedure
+-   New function implementing the resampled paired t-test procedure (`paired_ttest_resampled`)
     to compare the performance of two models
     (also called k-hold-out paired t-test). ([#323](https://github.com/rasbt/mlxtend/issues/323))
+-   New function implementing the k-fold paired t-test procedure (`paired_ttest_kfold_cv`)
+    to compare the performance of two models
+    (also called k-fold cross-validated paired t-test). ([#324](https://github.com/rasbt/mlxtend/issues/324))
 
 ##### Changes
 

diff --git a/docs/sources/USER_GUIDE_INDEX.md b/docs/sources/USER_GUIDE_INDEX.md
@@ -33,6 +33,7 @@
 - [mcnemar_table](user_guide/evaluate/mcnemar_table.md)
 - [mcnemar_tables](user_guide/evaluate/mcnemar_tables.md)
 - [mcnemar](user_guide/evaluate/mcnemar.md)
+- [paired_ttest_kfold_cv](user_guide/evaluate/paired_ttest_kfold_cv.md)
 - [paired_ttest_resampled](user_guide/evaluate/paired_ttest_resampled.md)
 - [permutation_test](user_guide/evaluate/permutation_test.md)
 - [scoring](user_guide/evaluate/scoring.md)

diff --git a/docs/sources/user_guide/evaluate/paired_ttest_kfold_cv.ipynb b/docs/sources/user_guide/evaluate/paired_ttest_kfold_cv.ipynb
@@ -0,0 +1,329 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# K-fold cross-validated paired t-test"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "K-fold paired *t* test procedure to compare the performance of two models"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> `from mlxtend.evaluate import paired_ttest_kfold_cv`    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Overview"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The k-fold cross-validated paired t-test procedure is a common method for comparing the performance of two models (classifiers or regressors) and addresses some of the drawbacks of the [resampled t-test procedure](paired_ttest_resampled.md). However, this method still has the problem that the training sets overlap, so it is not recommended for use in practice [1]; techniques such as the [`paired_ttest_5times2_cv`](paired_ttest_5times2_cv.md) should be used instead.\n",
+    "\n",
+    "To explain how this method works, let's consider two estimators (e.g., classifiers) A and B. Further, we have a labeled dataset *D*. In the common hold-out method, we typically split the dataset into 2 parts: a training and a test set. In the k-fold cross-validated paired t-test procedure, we split the dataset into *k* parts of equal size, and each of these parts is then used for testing while the remaining *k-1* parts (joined together) are used for training a classifier or regressor (i.e., the standard k-fold cross-validation procedure).\n",
+    "\n",
+    "In each k-fold cross-validation iteration, we then compute the difference in performance between A and B, so that we obtain *k* difference measures. Now, by making the assumption that these *k* differences were independently drawn and follow an approximately normal distribution, we can compute the following *t* statistic with *k-1* degrees of freedom according to Student's *t* test, under the null hypothesis that the models A and B have equal performance:\n",
+    "\n",
+    "$$t = \\frac{\\overline{p} \\sqrt{k}}{\\sqrt{\\sum_{i=1}^{k}(p^{(i)} - \\overline{p})^2 / (k-1)}}.$$\n",
+    "\n",
+    "Here, $p^{(i)}$ denotes the difference between the model performances in the $i$th iteration, $p^{(i)} = p^{(i)}_A - p^{(i)}_B$, and $\\overline{p}$ represents the average difference between the model performances, $\\overline{p} = \\frac{1}{k} \\sum^k_{i=1} p^{(i)}$.\n",
+    "\n",
+    "Once we have computed the *t* statistic, we can compute the p value and compare it to our chosen significance level, e.g., $\\alpha=0.05$. If the p value is smaller than $\\alpha$, we reject the null hypothesis and conclude that there is a significant difference between the two models.\n",
+    "\n",
+    "\n",
+    "The problem with this method, and the reason why it is not recommended for use in practice, is that it violates assumptions of Student's *t* test [1]:\n",
+    "\n",
+    "- the differences between the model performances ($p^{(i)} = p^{(i)}_A - p^{(i)}_B$) are not normally distributed because $p^{(i)}_A$ and $p^{(i)}_B$ are not independent\n",
+    "- the $p^{(i)}$'s themselves are not independent because training sets overlap"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### References\n",
+    "\n",
+    "- [1] Dietterich TG (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. *Neural Comput* 10:1895–1923."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 1 - K-fold cross-validated paired t test"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Assume we want to compare two classification algorithms, logistic regression and a decision tree algorithm:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Logistic regression accuracy: 97.37%\n",
+      "Decision tree accuracy: 94.74%\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.tree import DecisionTreeClassifier\n",
+    "from mlxtend.data import iris_data\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "\n",
+    "X, y = iris_data()\n",
+    "clf1 = LogisticRegression(random_state=1)\n",
+    "clf2 = DecisionTreeClassifier(random_state=1)\n",
+    "\n",
+    "X_train, X_test, y_train, y_test = \\\n",
+    "    train_test_split(X, y, test_size=0.25,\n",
+    "                     random_state=123)\n",
+    "\n",
+    "score1 = clf1.fit(X_train, y_train).score(X_test, y_test)\n",
+    "score2 = clf2.fit(X_train, y_train).score(X_test, y_test)\n",
+    "\n",
+    "print('Logistic regression accuracy: %.2f%%' % (score1*100))\n",
+    "print('Decision tree accuracy: %.2f%%' % (score2*100))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that these accuracy values are not used in the paired t-test procedure, since new training/test splits are generated during the k-fold cross-validation procedure; the values above merely serve to build intuition.\n",
+    "\n",
+    "Now, let's assume a significance threshold of $\\alpha=0.05$ for rejecting the null hypothesis that both algorithms perform equally well on the dataset and conduct the k-fold cross-validated t-test:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "t statistic: -1.861\n",
+      "p value: 0.096\n"
+     ]
+    }
+   ],
+   "source": [
+    "from mlxtend.evaluate import paired_ttest_kfold_cv\n",
+    "\n",
+    "\n",
+    "t, p = paired_ttest_kfold_cv(estimator1=clf1,\n",
+    "                              estimator2=clf2,\n",
+    "                              X=X, y=y,\n",
+    "                              random_seed=1)\n",
+    "\n",
+    "print('t statistic: %.3f' % t)\n",
+    "print('p value: %.3f' % p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since $p > \\alpha$, we cannot reject the null hypothesis and may conclude that the performance of the two algorithms is not significantly different.\n",
+    "\n",
+    "While it is generally not recommended to apply statistical tests multiple times without correction for multiple hypothesis testing, let us take a look at an example where the decision tree algorithm is limited to producing a very simple decision boundary that would result in a relatively bad performance:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Decision tree accuracy: 63.16%\n",
+      "t statistic: 13.491\n",
+      "p value: 0.000\n"
+     ]
+    }
+   ],
+   "source": [
+    "clf2 = DecisionTreeClassifier(random_state=1, max_depth=1)\n",
+    "\n",
+    "score2 = clf2.fit(X_train, y_train).score(X_test, y_test)\n",
+    "print('Decision tree accuracy: %.2f%%' % (score2*100))\n",
+    "\n",
+    "\n",
+    "t, p = paired_ttest_kfold_cv(estimator1=clf1,\n",
+    "                             estimator2=clf2,\n",
+    "                             X=X, y=y,\n",
+    "                             random_seed=1)\n",
+    "\n",
+    "print('t statistic: %.3f' % t)\n",
+    "print('p value: %.3f' % p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Assuming that we conducted this test also with a significance level of $\\alpha=0.05$, we can reject the null-hypothesis that both models perform equally well on this dataset, since the p-value ($p < 0.001$) is smaller than $\\alpha$."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## API"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "## paired_ttest_kfold_cv\n",
+      "\n",
+      "*paired_ttest_kfold_cv(estimator1, estimator2, X, y, cv=10, scoring=None, shuffle=False, random_seed=None)*\n",
+      "\n",
+      "Implements the k-fold paired t test procedure\n",
+      "to compare the performance of two models.\n",
+      "\n",
+      "**Parameters**\n",
+      "\n",
+      "- `estimator1` : scikit-learn classifier or regressor\n",
+      "\n",
+      "\n",
+      "\n",
+      "- `estimator2` : scikit-learn classifier or regressor\n",
+      "\n",
+      "\n",
+      "\n",
+      "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n",
+      "\n",
+      "    Training vectors, where n_samples is the number of samples and\n",
+      "    n_features is the number of features.\n",
+      "\n",
+      "\n",
+      "- `y` : array-like, shape = [n_samples]\n",
+      "\n",
+      "    Target values.\n",
+      "\n",
+      "\n",
+      "- `cv` : int (default: 10)\n",
+      "\n",
+      "    Number of splits and iteration for the\n",
+      "    cross-validation procedure\n",
+      "\n",
+      "\n",
+      "- `scoring` : str, callable, or None (default: None)\n",
+      "\n",
+      "    If None (default), uses 'accuracy' for sklearn classifiers\n",
+      "    and 'r2' for sklearn regressors.\n",
+      "    If str, uses a sklearn scoring metric string identifier, for example\n",
+      "    {accuracy, f1, precision, recall, roc_auc} for classifiers,\n",
+      "    {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error',\n",
+      "    'median_absolute_error', 'r2'} for regressors.\n",
+      "    If a callable object or function is provided, it has to conform to\n",
+      "    sklearn's signature ``scorer(estimator, X, y)``; see\n",
+      "    http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html\n",
+      "    for more information.\n",
+      "\n",
+      "\n",
+      "- `shuffle` : bool (default: False)\n",
+      "\n",
+      "    Whether to shuffle the dataset for generating\n",
+      "    the k-fold splits.\n",
+      "\n",
+      "\n",
+      "- `random_seed` : int or None (default: None)\n",
+      "\n",
+      "    Random seed for shuffling the dataset\n",
+      "    for generating the k-fold splits.\n",
+      "    Ignored if shuffle=False.\n",
+      "\n",
+      "**Returns**\n",
+      "\n",
+      "- `t` : float\n",
+      "\n",
+      "    The t-statistic\n",
+      "\n",
+      "\n",
+      "- `pvalue` : float\n",
+      "\n",
+      "    Two-tailed p-value.\n",
+      "    If the chosen significance level is larger\n",
+      "    than the p-value, we reject the null hypothesis\n",
+      "    and accept that there are significant differences\n",
+      "    in the two compared models.\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "with open('../../api_modules/mlxtend.evaluate/paired_ttest_kfold_cv.md', 'r') as f:\n",
+    "    s = f.read() \n",
+    "print(s)"
+   ]
+  }
+ ],
+ "metadata": {
+  "anaconda-cloud": {},
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
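
To make the *t* statistic from the notebook's Overview section concrete, here is a minimal sketch that evaluates the same formula (mean difference times the square root of k, divided by the sample standard deviation of the differences) directly from per-fold scores. The function name `kfold_paired_t` and the per-fold accuracies are hypothetical and made up for illustration; this is not the implementation used by `paired_ttest_kfold_cv`, only a restatement of the formula under the assumption that NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats


def kfold_paired_t(scores_a, scores_b):
    """Return the paired t statistic and two-sided p value for per-fold scores.

    Hypothetical helper for illustration only (not part of mlxtend).
    """
    # p^(i) = p_A^(i) - p_B^(i), one difference per fold
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    k = diffs.shape[0]

    # Average difference over the k folds
    mean_diff = diffs.mean()

    # Sample standard deviation with k-1 in the denominator,
    # i.e. sqrt(sum((p^(i) - mean)^2) / (k-1))
    std_diff = diffs.std(ddof=1)

    t_stat = mean_diff * np.sqrt(k) / std_diff

    # Two-sided p value from Student's t distribution with k-1 degrees of freedom
    p_value = 2.0 * stats.t.sf(np.abs(t_stat), df=k - 1)
    return t_stat, p_value


# Made-up per-fold accuracies for two models (k = 5)
scores_a = [0.90, 0.92, 0.88, 0.91, 0.89]
scores_b = [0.85, 0.90, 0.86, 0.88, 0.87]

t_stat, p_value = kfold_paired_t(scores_a, scores_b)
print('t statistic: %.3f' % t_stat)
print('p value: %.3f' % p_value)
```

Given the same per-fold scores, this should agree with `scipy.stats.ttest_rel`, since the statistic is an ordinary paired t-test applied to the fold-wise differences. The caveats from the Overview still apply: the folds share training data, so the independence assumption behind the p value does not strictly hold.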