rasbt · Jan 19, 2018
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -61,6 +61,7 @@ pages:
     - user_guide/evaluate/mcnemar_table.md
     - user_guide/evaluate/mcnemar_tables.md
     - user_guide/evaluate/mcnemar.md
+    - user_guide/evaluate/paired_ttest_resampled.md
     - user_guide/evaluate/permutation_test.md
     - user_guide/evaluate/scoring.md
   - feature_extraction:

diff --git a/docs/sources/CHANGELOG.md b/docs/sources/CHANGELOG.md
@@ -7,7 +7,7 @@ The CHANGELOG for the current development version is available at
 
 ---
 
-### Version 0.10.1dev
+### Version 0.11.0dev
 
 ##### Downloads
 
@@ -16,7 +16,9 @@ The CHANGELOG for the current development version is available at
 
 ##### New Features
 
-- -
+-   New function implementing the resampled paired t-test procedure
+    to compare the performance of two models
+    (also called k-hold-out paired t-test). ([#323](https://github.com/rasbt/mlxtend/issues/323))
 
 ##### Changes
 

diff --git a/docs/sources/USER_GUIDE_INDEX.md b/docs/sources/USER_GUIDE_INDEX.md
@@ -33,6 +33,7 @@
 - [mcnemar_table](user_guide/evaluate/mcnemar_table.md)
 - [mcnemar_tables](user_guide/evaluate/mcnemar_tables.md)
 - [mcnemar](user_guide/evaluate/mcnemar.md)
+- [paired_ttest_resampled](user_guide/evaluate/paired_ttest_resampled.md)
 - [permutation_test](user_guide/evaluate/permutation_test.md)
 - [scoring](user_guide/evaluate/scoring.md)
 

diff --git a/docs/sources/user_guide/evaluate/paired_ttest_resample.ipynb b/docs/sources/user_guide/evaluate/paired_ttest_resample.ipynb
@@ -0,0 +1,342 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Resampled paired t-test"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Resampled paired t-test procedure to compare the performance of two models"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> `from mlxtend.evaluate import paired_ttest_resampled`    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Overview"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The resampled paired t-test procedure (also called k-hold-out paired t-test) is a popular method for comparing the performance of two models (classifiers or regressors). However, this method has several drawbacks and is not recommended in practice [1]; techniques such as the [`paired_ttest_5times2_cv`](paired_ttest_5times2_cv.md) should be used instead.\n",
+    "\n",
+    "To explain how this method works, let's consider two estimators (e.g., classifiers) A and B. Further, we have a labeled dataset *D*. In the common hold-out method, we typically split the dataset into 2 parts: a training and a test set. In the resampled paired t-test procedure, we repeat this splitting procedure (with typically 2/3 training data and 1/3 test data) *k* times (usually 30). In each iteration, we train A and B on the training set and evaluate them on the test set. Then, we compute the difference in performance between A and B in each iteration so that we obtain *k* difference measures. Now, by making the assumption that these *k* differences were independently drawn and follow an approximately normal distribution, we can compute the following *t* statistic with *k-1* degrees of freedom according to Student's *t* test, under the null hypothesis that the models A and B have equal performance:\n",
+    "\n",
+    "$$t = \\frac{\\overline{p} \\sqrt{k}}{\\sqrt{\\sum_{i=1}^{k}(p^{(i)} - \\overline{p})^2 / (k-1)}}.$$\n",
+    "\n",
+    "Here, $p^{(i)}$ denotes the difference between the model performances in the $i$th iteration, $p^{(i)} = p^{(i)}_A - p^{(i)}_B$, and $\\overline{p}$ represents the average difference between the model performances, $\\overline{p} = \\frac{1}{k} \\sum^k_{i=1} p^{(i)}$.\n",
+    "\n",
+    "Once we have computed the *t* statistic, we can compute the p value and compare it to our chosen significance level, e.g., $\\alpha=0.05$. If the p value is smaller than $\\alpha$, we reject the null hypothesis and accept that there is a significant difference between the two models.\n",
+    "\n",
+    "To summarize the procedure:\n",
+    "\n",
+    "0. i := 0\n",
+    "1. while i < k:\n",
+    "    - split dataset into training and test subsets\n",
+    "    - fit models A and B to the training set\n",
+    "    - compute the performances of A and B on the test set\n",
+    "    - record the performance difference between A and B\n",
+    "    - i := i + 1\n",
+    "2. compute t-statistic\n",
+    "3. compute p value from t-statistic with k-1 degrees of freedom\n",
+    "4. compare p value to chosen significance threshold\n",
+    "\n",
+    "\n",
+    "\n",
+    "The problem with this method, and the reason why it is not recommended in practice, is that it violates the assumptions of Student's *t* test [1]:\n",
+    "\n",
+    "- the differences between the model performances ($p^{(i)} = p^{(i)}_A - p^{(i)}_B$) are not normally distributed because $p^{(i)}_A$ and $p^{(i)}_B$ are not independent\n",
+    "- the $p^{(i)}$'s themselves are not independent because the test sets of the different iterations overlap; the training sets overlap as well"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### References\n",
+    "\n",
+    "- [1] Dietterich TG (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. *Neural Comput* 10:1895–1923."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 1 - Resampled paired t-test"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Assume we want to compare two classification algorithms, logistic regression and a decision tree algorithm:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Logistic regression accuracy: 97.37%\n",
+      "Decision tree accuracy: 94.74%\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.tree import DecisionTreeClassifier\n",
+    "from mlxtend.data import iris_data\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "\n",
+    "X, y = iris_data()\n",
+    "clf1 = LogisticRegression(random_state=1)\n",
+    "clf2 = DecisionTreeClassifier(random_state=1)\n",
+    "\n",
+    "X_train, X_test, y_train, y_test = \\\n",
+    "    train_test_split(X, y, test_size=0.25,\n",
+    "                     random_state=123)\n",
+    "\n",
+    "score1 = clf1.fit(X_train, y_train).score(X_test, y_test)\n",
+    "score2 = clf2.fit(X_train, y_train).score(X_test, y_test)\n",
+    "\n",
+    "print('Logistic regression accuracy: %.2f%%' % (score1*100))\n",
+    "print('Decision tree accuracy: %.2f%%' % (score2*100))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that these accuracy values are not used in the paired t-test procedure, because new test/train splits are generated during the resampling procedure; the values above merely serve to build intuition.\n",
+    "\n",
+    "Now, let's assume a significance threshold of $\\alpha=0.05$ for rejecting the null hypothesis that both algorithms perform equally well on the dataset and conduct the resampled paired t-test:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "t statistic: -1.809\n",
+      "p value: 0.081\n"
+     ]
+    }
+   ],
+   "source": [
+    "from mlxtend.evaluate import paired_ttest_resampled\n",
+    "\n",
+    "\n",
+    "t, p = paired_ttest_resampled(estimator1=clf1,\n",
+    "                              estimator2=clf2,\n",
+    "                              X=X, y=y,\n",
+    "                              random_seed=1)\n",
+    "\n",
+    "print('t statistic: %.3f' % t)\n",
+    "print('p value: %.3f' % p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since $p > \\alpha$, we cannot reject the null hypothesis and may conclude that the performance of the two algorithms is not significantly different.\n",
+    "\n",
+    "While it is generally not recommended to apply statistical tests multiple times without correction for multiple hypothesis testing, let us take a look at an example where the decision tree algorithm is limited to producing a very simple decision boundary that would result in a relatively bad performance:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Decision tree accuracy: 63.16%\n",
+      "t statistic: 39.214\n",
+      "p value: 0.000\n"
+     ]
+    }
+   ],
+   "source": [
+    "clf2 = DecisionTreeClassifier(random_state=1, max_depth=1)\n",
+    "\n",
+    "score2 = clf2.fit(X_train, y_train).score(X_test, y_test)\n",
+    "print('Decision tree accuracy: %.2f%%' % (score2*100))\n",
+    "\n",
+    "\n",
+    "t, p = paired_ttest_resampled(estimator1=clf1,\n",
+    "                              estimator2=clf2,\n",
+    "                              X=X, y=y,\n",
+    "                              random_seed=1)\n",
+    "\n",
+    "print('t statistic: %.3f' % t)\n",
+    "print('p value: %.3f' % p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Assuming that we conducted this test also with a significance level of $\\alpha=0.05$, we can reject the null hypothesis that both models perform equally well on this dataset, since the p value ($p < 0.001$) is smaller than $\\alpha$."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## API"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "## paired_ttest_resampled\n",
+      "\n",
+      "*paired_ttest_resampled(estimator1, estimator2, X, y, num_rounds=30, test_size=0.3, scoring=None, random_seed=None)*\n",
+      "\n",
+      "Implements the resampled paired t-test procedure\n",
+      "to compare the performance of two models\n",
+      "(also called k-hold-out paired t-test).\n",
+      "\n",
+      "**Parameters**\n",
+      "\n",
+      "- `estimator1` : scikit-learn classifier or regressor\n",
+      "\n",
+      "\n",
+      "\n",
+      "- `estimator2` : scikit-learn classifier or regressor\n",
+      "\n",
+      "\n",
+      "\n",
+      "- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]\n",
+      "\n",
+      "    Training vectors, where n_samples is the number of samples and\n",
+      "    n_features is the number of features.\n",
+      "\n",
+      "\n",
+      "- `y` : array-like, shape = [n_samples]\n",
+      "\n",
+      "    Target values.\n",
+      "\n",
+      "\n",
+      "- `num_rounds` : int (default: 30)\n",
+      "\n",
+      "    Number of resampling iterations\n",
+      "    (i.e., train/test splits)\n",
+      "\n",
+      "\n",
+      "- `test_size` : float or int (default: 0.3)\n",
+      "\n",
+      "    If float, should be between 0.0 and 1.0 and\n",
+      "    represent the proportion of the dataset to use\n",
+      "    as a test set.\n",
+      "    If int, represents the absolute number of test examples.\n",
+      "\n",
+      "\n",
+      "- `scoring` : str, callable, or None (default: None)\n",
+      "\n",
+      "    If None (default), uses 'accuracy' for sklearn classifiers\n",
+      "    and 'r2' for sklearn regressors.\n",
+      "    If str, uses a sklearn scoring metric string identifier, for example\n",
+      "    {accuracy, f1, precision, recall, roc_auc} for classifiers,\n",
+      "    {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error',\n",
+      "    'median_absolute_error', 'r2'} for regressors.\n",
+      "    If a callable object or function is provided, it has to conform to\n",
+      "    sklearn's signature ``scorer(estimator, X, y)``; see\n",
+      "    http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html\n",
+      "    for more information.\n",
+      "\n",
+      "\n",
+      "- `random_seed` : int or None (default: None)\n",
+      "\n",
+      "    Random seed for creating the test/train splits.\n",
+      "\n",
+      "**Returns**\n",
+      "\n",
+      "- `t` : float\n",
+      "\n",
+      "    The t-statistic\n",
+      "\n",
+      "\n",
+      "- `pvalue` : float\n",
+      "\n",
+      "    Two-tailed p-value.\n",
+      "    If the chosen significance level is larger\n",
+      "    than the p-value, we reject the null hypothesis\n",
+      "    and accept that there are significant differences\n",
+      "    in the two compared models.\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "with open('../../api_modules/mlxtend.evaluate/paired_ttest_resampled.md', 'r') as f:\n",
+    "    s = f.read() \n",
+    "print(s)"
+   ]
+  }
+ ],
+ "metadata": {
+  "anaconda-cloud": {},
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
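Note on the notebook above: the procedure summarized in its Overview (repeat a random hold-out split *k* times, record the performance difference each time, then form the *t* statistic) can be sketched without any third-party packages. This is a minimal illustration under stated assumptions, not the mlxtend implementation: `resampled_paired_t`, `fit_majority`, `fit_threshold`, and the synthetic one-feature dataset are all hypothetical names and data made up for this sketch.

```python
import math
import random
import statistics


def fit_majority(train):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    """Hypothetical 1-D model: cut halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    cut = (mean0 + mean1) / 2
    if mean1 > mean0:
        return lambda x: int(x > cut)
    return lambda x: int(x <= cut)


def resampled_paired_t(fit_a, fit_b, data, k=30, test_fraction=1 / 3, seed=1):
    """Return (t, diffs) from k random hold-out splits of `data`.

    `fit_a` / `fit_b` take a list of (x, y) pairs and return a predict function.
    """
    rng = random.Random(seed)
    n_test = max(1, int(round(len(data) * test_fraction)))
    diffs = []
    for _ in range(k):
        shuffled = data[:]
        rng.shuffle(shuffled)  # a fresh random train/test split per round
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict_a, predict_b = fit_a(train), fit_b(train)
        acc_a = sum(predict_a(x) == y for x, y in test) / len(test)
        acc_b = sum(predict_b(x) == y for x, y in test) / len(test)
        diffs.append(acc_a - acc_b)  # p^(i) = p_A^(i) - p_B^(i)
    sd = statistics.stdev(diffs)  # sample standard deviation (k-1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) * math.sqrt(k) / sd
    return t, diffs


# Synthetic, made-up data: one feature, two balanced classes.
gen = random.Random(0)
data = ([(gen.gauss(0.0, 1.0), 0) for _ in range(30)]
        + [(gen.gauss(2.0, 1.0), 1) for _ in range(30)])

t, diffs = resampled_paired_t(fit_threshold, fit_majority, data)
print('t statistic: %.3f' % t)
```

The statistic has `k - 1` degrees of freedom. Because the splits overlap, the caveats listed in the Overview apply to this sketch just as they do to the real function.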
diff --git a/mlxtend/__init__.py b/mlxtend/__init__.py
@@ -4,4 +4,4 @@
 #
 # License: BSD 3 clause
 
-__version__ = '0.10.1dev'
+__version__ = '0.11.0dev'
diff --git a/mlxtend/evaluate/__init__.py b/mlxtend/evaluate/__init__.py
@@ -15,11 +15,12 @@
 from .bootstrap_point632 import bootstrap_point632_score
 from .permutation import permutation_test
 from .cochrans_q import cochrans_q
+from .ttest import paired_ttest_resampled
 
 
 __all__ = ["scoring", "confusion_matrix",
            "mcnemar_table", "mcnemar_tables",
            "mcnemar", "lift_score",
            "bootstrap", "permutation_test",
            "BootstrapOutOfBag", "bootstrap_point632_score",
-           "cochrans_q"]
+           "cochrans_q", "paired_ttest_resampled"]
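Note on the newly exported `paired_ttest_resampled`: its API text describes the returned `pvalue` as two-tailed, with `num_rounds - 1` degrees of freedom implied by the Overview. As a rough, dependency-free way to sanity-check a reported value, the two-sided p value can be approximated by numerically integrating the Student's *t* density. The helpers `t_pdf` and `two_sided_p_value` below are hypothetical names for this sketch and are not part of mlxtend; a library routine for the *t* distribution would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t, df, n_intervals=2000):
    """Approximate P(|T| >= |t|) using Simpson's rule on [0, |t|]."""
    if n_intervals % 2:
        raise ValueError("Simpson's rule needs an even number of intervals")
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / n_intervals
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| <= T <= |t|) by symmetry
    return max(0.0, 1.0 - central_mass)  # clamp tiny negative rounding error


# t statistics reported in the notebook, with num_rounds=30, i.e. 29 degrees of freedom
for t_stat in (-1.809, 39.214):
    print('t statistic: %.3f, p value: %.3f' % (t_stat, two_sided_p_value(t_stat, 29)))
```

For the first notebook example (t = -1.809), the result should land close to the reported p value of 0.081, which is above $\alpha=0.05$. For the second (t = 39.214), it is far below it.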