diff --git a/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_1.ipynb b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_1.ipynb new file mode 100644 index 0000000..65bb3f3 --- /dev/null +++ b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_1.ipynb @@ -0,0 +1,668 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4304eb40-80fd-46b5-a6ef-6e7a84d33dbe", + "metadata": {}, + "source": [ + "## Monte Carlo Tests\n", + "To motivate resampling methods, suppose we wish to infer from measurements whether the weights of adult human males in a medical study are normally distributed [[1](https://www.jstor.org/stable/2333709)]. This may be a prerequisite to performing other statistical tests, many of which are developed under the assumption that samples are drawn from a normally distributed population (although some tests are quite robust to deviations from this assumption, as we will see later). The weights are recorded in the array `x` below." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "02824778-802d-4d9d-8cd5-0a69cc028d13", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "x = np.array([148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236]) # weights (lbs)" + ] + }, + { + "cell_type": "markdown", + "id": "fe14279d-496f-45fe-9ce0-937ff5844a3b", + "metadata": {}, + "source": [ + "One way of testing for departures from normality, chosen based on its simplicity rather than its sensitivity, is the [Jarque-Bera test [2]](https://www.sciencedirect.com/science/article/abs/pii/0165176580900245) implemented in SciPy as `scipy.stats.jarque_bera`. The test, like many other hypothesis tests, computes a *statistic* based on the sample and compares its value to the distribution of the statistic derived under the *null hypothesis* that the sample is normally distributed. 
If the value of the statistic is extreme compared to this *null distribution* - that is, if there is a low probability of sampling such data from a normally distributed population - then we have evidence to reject the null hypothesis.\n", + "\n", + "The statistic is calculated based on the skewness and kurtosis of the sample as follows." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "8475312e-7810-4f61-8fed-7fe8297afe28", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "6.982848237344646" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from scipy import stats\n", + "\n", + "def statistic(x):\n", + " # Calculates the Jarque-Bera Statistic\n", + " # Compare against `scipy.stats.jarque_bera`:\n", + " # https://github.com/scipy/scipy/blob/4cf21e753cf937d1c6c2d2a0e372fbc1dbbeea81/scipy/stats/_stats_py.py#L1583-L1637\n", + " n = len(x)\n", + " mu = np.mean(x, axis=0)\n", + " x = x - mu # remove the mean from the data\n", + " s = stats.skew(x) # calculate the sample skewness\n", + " k = stats.kurtosis(x) # calculate the sample kurtosis\n", + " statistic = n/6 * (s**2 + k**2/4)\n", + " return statistic\n", + "\n", + "stat1 = statistic(x)\n", + "stat2, _ = stats.jarque_bera(x)\n", + "np.testing.assert_allclose(stat1, stat2, rtol=1e-14)\n", + "stat1" + ] + }, + { + "cell_type": "markdown", + "id": "7a38c293-1414-4d5c-8c6f-84d9d2e77e93", + "metadata": {}, + "source": [ + "Note that the value of the statistic is unaffected by the scale and location of the data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "138b5da5-d953-4630-97c4-ebd8ad6420c7", + "metadata": {}, + "outputs": [], + "source": [ + "old_location = np.mean(x)\n", + "old_scale = np.std(x)\n", + "x_new = (x - old_location) / old_scale # make location 0 and scale 1\n", + "stat3 = statistic(x_new)\n", + "np.testing.assert_allclose(stat1, stat3, rtol=1e-14)" + ] + }, + { + "cell_type": "markdown", + "id": "69fad794-1d58-49c9-9c75-09f642cdfbf0", + "metadata": {}, + "source": [ + "Consequently, it can be shown that large samples drawn from a normal distribution with any mean and variance will produce statistic values that are distributed according to the chi-squared distribution with two degrees of freedom. We can check this numerically by drawing 1000 samples of size 500 from a standard normal distribution and computing the value of the statistic for each sample." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "9e2c2e75-1f3b-477c-9b6a-20d6bbfd6efd", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "np.random.seed(0)\n", + "n_observations = 500 # number of observations\n", + "n_samples = 1000 # number of samples\n", + "# standard normal distribution can be used, since the\n", + "# statistic is unaffected by location and scale\n", + "norm_dist = stats.norm()\n", + "# Draw 1000 samples, each with 500 observations\n", + "y = norm_dist.rvs(size=(n_observations, n_samples))\n", + "\n", + "# calculate the value of the statistic for each sample\n", + "# we'll call this the \"Monte Carlo null distribution\"\n", + "null_dist_mc = statistic(y)\n", + "\n", + "# the asymptotic null distribution is chi-squared with df=2\n", + "null_dist = stats.chi2(df=2)\n", + "y_grid = np.linspace(0, null_dist.isf(0.001))\n", + "pdf = null_dist.pdf(y_grid)\n", + "\n", + "# compare the two\n", + "import matplotlib.pyplot as plt\n", + "plt.plot(y_grid, pdf)\n", + "plt.hist(null_dist_mc, density=True, bins=100)\n", + "plt.xlim(0, np.max(y_grid))\n", + "plt.xlabel(\"Value of statistic\")\n", + "plt.ylabel(\"Probability Density\")\n", + "plt.legend(['Asymptotic Null Distribution', 'Monte Carlo Null Distribution (500 observations/sample)'])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "87d87faf-f1e1-462f-a94c-cf09d9445767", + "metadata": {}, + "source": [ + "As we can see, the *Monte Carlo null distribution* [[2]](#cite_note-2) of the test statistic when samples are drawn according to the null hypothesis (from a normal distribution) appears to follow the *asymptotic null distribution* (chi-squared with two degrees of freedom). 
\n", + "\n", + "[[2]](#cite_note-2) Named after the Monte Carlo Casino in Monaco, apparently [[3]](https://en.wikipedia.org/wiki/Monte_Carlo_method#History).\n" + ] + }, + { + "cell_type": "markdown", + "id": "4195c3fd-12fe-46b4-b681-83e0f57550ae", + "metadata": {}, + "source": [ + "Note that the originally observed value of the statistic, 6.98, is located in the right tail of the null distribution. Random samples from a normal distribution usually produce statistic values less than 6.98, and only occasionally produce higher values. Therefore, there is a rather low probability of observing such an extreme value of the statistic under the null hypothesis that the sample is drawn from a normal population. This probability is quantified by the *survival function* of the null distribution:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "0f2fad67-7dc3-4bf4-af5f-12bad97321e2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Under the null hypothesis, the chance of drawing a sample that produces a statistic value greater than \n", + "6.982848237344646\n", + "is\n", + "0.03045746622458189\n" + ] + } + ], + "source": [ + "pvalue = null_dist.sf(stat1)\n", + "message = (\"Under the null hypothesis, the chance of drawing a sample \"\n", + " f\"that produces a statistic value greater than \\n{stat1}\\n\"\n", + " f\"is\\n{pvalue}\")\n", + "print(message)" + ] + }, + { + "cell_type": "markdown", + "id": "54c6b2aa-98ce-4463-be49-78732a50083d", + "metadata": {}, + "source": [ + "This is the `pvalue` returned by `stats.jarque_bera`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "bead807c-e722-4269-bfbe-3be8fccb2da8", + "metadata": {}, + "outputs": [], + "source": [ + "np.testing.assert_allclose(pvalue, stats.jarque_bera(x).pvalue, rtol=1e-14)" + ] + }, + { + "cell_type": "markdown", + "id": "78720622-ff97-4248-85a1-6e13b786ea06", + "metadata": {}, + "source": [ + "When the $p$-value is small, we take this as evidence against the null hypothesis, since samples drawn under the null hypothesis have a low probability of producing such an extreme value of the statistic. For better or for worse, a common \"confidence level\" used for statistical tests is 0.99, meaning that the threshold for rejection of the null hypothesis is $p \\leq 0.01$. If we adopt this criterion, then the Jarque-Bera test is inconclusive; it does not provide sufficient evidence to reject the null hypothesis. Although this should *not* be taken as evidence that the null hypothesis is *true*, the lack of evidence against the hypothesis of normality is often considered sufficient to proceed with tests that assume the data is drawn from a normal population." + ] + }, + { + "cell_type": "markdown", + "id": "8feb92d9-99b7-4863-885c-8876a64905f6", + "metadata": {}, + "source": [ + "There are a few shortcomings of the test procedure outlined above. The Monte Carlo null distribution agreed with the asymptotic null distribution when the number of observations per sample was 500, but our original sample `x` had only 11 observations. If we generate a Monte Carlo null distribution of the statistic for a sample size of only 11 observations, there is marked disagreement between the Monte Carlo null distribution and the asymptotic null distribution." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "1fe77f87-b4cc-4b0f-baee-cf79ca2088ea", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Draw 10000 samples, each with 11 observations\n", + "n_observations = 11\n", + "n_samples = 10000\n", + "y = norm_dist.rvs(size=(n_observations, n_samples))\n", + "\n", + "# calculate the value of the statistic for each sample\n", + "null_dist_mc = statistic(y)\n", + "\n", + "# compare the MC and asymptotic distributions\n", + "plt.plot(y_grid, pdf)\n", + "plt.hist(null_dist_mc, density=True, bins=200)\n", + "plt.xlim(0, np.max(y_grid))\n", + "plt.xlabel(\"Value of test statistic\")\n", + "plt.ylabel(\"Probability Density / Observed Frequency\")\n", + "plt.legend(['Asymptotic Null Distribution', 'Monte Carlo Null Distribution (11 observations/sample)'])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "3c15ba7a-53c4-42b4-833c-39d86e2ba969", + "metadata": {}, + "source": [ + "This is because the asymptotic null distribution was derived under the assumption that the number of observations approaches infinity (hence the name \"asymptotic\" null distribution). Apparently, it is quite different from the null distribution of the test statistic when the number of observations is 11.\n", + "\n", + "The true theoretical null distribution when the number of observations is 11 may not be possible to calculate analytically (in a way that can be expressed in terms of a finite number of common functions). So rather than comparing a test statistic against a theoretical null distribution to determine the $p$-value, the approach of the *Monte Carlo test* is to compare the test statistic against the Monte Carlo null distribution." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "17e06351-479d-49b6-989f-7a30bfde43ab", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Asymptotic p-value 0.03045746622458189\n", + "Monte Carlo p-value: 0.0085\n" + ] + } + ], + "source": [ + "res = stats.jarque_bera(x)\n", + "stat, p_asymptotic = res\n", + "count = np.sum(null_dist_mc >= stat)\n", + "p_mc = count / n_samples\n", + "print(f\"Asymptotic p-value {p_asymptotic}\")\n", + "print(f\"Monte Carlo p-value: {p_mc}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e27fd4f9-64eb-42ca-9131-b6f1cefa9698", + "metadata": {}, + "source": [ + "These $p$-values are substantially different, so we might draw different conclusions about the validity of the null hypothesis depending on which test we perform. Under the 1% threshold used above, the Monte Carlo test would suggest that there is evidence for rejection of the null hypothesis whereas the asymptotic test performed by `stats.jarque_bera` would not. In other cases, the opposite may be true. In any case, it seems that the Monte Carlo test should be preferred when the number of observations is small.\n", + "\n", + "`stats.monte_carlo_test` simplifies the process of performing a Monte Carlo test. All we need to provide is the observed data, a function that generates data sampled under the null hypothesis, and a function that computes the test statistic. `monte_carlo_test` returns an object with the observed statistic value, an empirical null distribution of the statistic, and the corresponding $p$-value." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "4869f8dd-4d59-432e-80be-b6ec321aef26", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Observed values of the statistic: 6.982848237344646\n", + "Monte Carlo p-value: 0.0076\n", + "Empirical null distribution shape: (10000,)\n" + ] + } + ], + "source": [ + "res = stats.monte_carlo_test(sample=x, rvs=norm_dist.rvs, statistic=statistic, \n", + " n_resamples=10000, alternative='greater')\n", + "print(f\"Observed values of the statistic: {res.statistic}\")\n", + "print(f\"Monte Carlo p-value: {res.pvalue}\")\n", + "print(f\"Empirical null distribution shape: {res.null_distribution.shape}\")" + ] + }, + { + "cell_type": "markdown", + "id": "46763f1a-a696-41b9-ace3-9124760b7e65", + "metadata": {}, + "source": [ + "Note that the $p$-value here is slightly different than `p_mc` above because the algorithm is stochastic. Nevertheless, `res.pvalue` is much closer to `p_mc` than to `p_asymptotic`: for small samples, the Monte Carlo null distribution generated from a sufficiently large number of random samples is often more accurate than the asymptotic null distribution, despite the error inherent in random sampling.\n", + "\n", + "Here, we also passed optional arguments to control the behavior of `monte_carlo_test`. As one might expect, the parameter `n_resamples` controls the number of samples to draw from the provided *random variate sample* function. Perhaps less obvious is the meaning of `alternative`, which controls which *alternative hypothesis* we are testing, that is, which tail of the null distribution the observed statistic value should be compared against. In this case, we are assessing the null hypothesis that the sample was drawn from a normal population against the alternative that the sample is drawn from a population which tends to produce a *greater* value of the test statistic. 
Perhaps more intuitively, the argument `'greater'` corresponds with the sign of the comparison against the null distribution (i.e., `null_dist_mc >= statistic(x)` from above). Another option for `alternative` is `'less'` (i.e., `null_dist_mc <= statistic(x)`). In some cases, a `'two-sided'` alternative is desired, which is twice the minimum of the \"one-sided\" $p$-values. (More on the choice of this convention below.)\n", + "\n", + "We can improve performance of `monte_carlo_test` by ensuring that our test statistic is \"vectorized\". That is, instead of requiring a one-dimensional sample array as input, the statistic should accept an $n$-dimensional array of samples in which each *axis-slice* (e.g. row, column) is a distinct sample. Our `statistic` function is already vectorized in some sense. Above, we wrote `null_dist_mc = statistic(y)`, and `statistic` computed the statistic for each column of the two-dimensional `y`." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "6d0f8f55-5497-428c-a54e-8664c1cfd6d0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(11, 10000)\n", + "(10000,)\n" + ] + } + ], + "source": [ + "print(y.shape) # 10000 samples, each with 11 observations\n", + "print(statistic(y).shape) # statistic for each column of y" + ] + }, + { + "cell_type": "markdown", + "id": "0c4da9ef-f7b2-4540-a195-b322dd089692", + "metadata": {}, + "source": [ + "However, `monte_carlo_test` requires that the statistic function accept an `axis` argument to compute the statistic along *any* axis. 
Only minor modifications are required:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "b7d6797e-7dab-46ac-a097-ec41716cd8b8", + "metadata": {}, + "outputs": [], + "source": [ + "def statistic_vectorized(x, axis=0):\n", + " # Calculates the Jarque-Bera Statistic\n", + " # Compare against https://github.com/scipy/scipy/blob/4cf21e753cf937d1c6c2d2a0e372fbc1dbbeea81/scipy/stats/_stats_py.py#L1583-L1637\n", + " n = x.shape[axis]\n", + " mu = np.mean(x, axis=axis, keepdims=True)\n", + " x = x - mu # remove the mean from the data\n", + " s = stats.skew(x, axis=axis) # calculate the sample skewness\n", + " k = stats.kurtosis(x, axis=axis) # calculate the sample kurtosis\n", + " statistic = n/6 * (s**2 + k**2/4)\n", + " return statistic\n", + "\n", + "np.testing.assert_allclose(statistic_vectorized(y, axis=0), statistic_vectorized(y.T, axis=1))" + ] + }, + { + "cell_type": "markdown", + "id": "9f5001f8-c828-42f2-848c-8d79157bd159", + "metadata": {}, + "source": [ + "But `monte_carlo_test` becomes much faster:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "198b8734-c516-4394-af01-614712138c10", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "7.15 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" + ] + } + ], + "source": [ + "# Before\n", + "%timeit -r1 -n1 stats.monte_carlo_test(sample=x, rvs=norm_dist.rvs, statistic=statistic, n_resamples=10000, alternative='greater')" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "20212a36-30ab-4486-81d8-fbe8683ecaa6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10.8 ms ± 583 µs per loop (mean ± std. dev. 
of 7 runs, 100 loops each)\n" + ] + } + ], + "source": [ + "# After\n", + "%timeit stats.monte_carlo_test(sample=x, rvs=norm_dist.rvs, statistic=statistic_vectorized, n_resamples=10000, alternative='greater', vectorized=True)" + ] + }, + { + "cell_type": "markdown", + "id": "f376518b-23b7-45bf-9921-4f717378f0aa", + "metadata": {}, + "source": [ + "When a statistical test is already implemented in SciPy (like `stats.jarque_bera`), it becomes even easier to perform a Monte Carlo version of the test. We simply need to \"wrap\" it in another function which only returns the test statistic rather than the full result object." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "4fc52618-5476-4042-9bea-8b463c2767f3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.008" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def statistic_scipy(x):\n", + " # `jarque_bera` returns a result object\n", + " res = stats.jarque_bera(x) \n", + " # Our wrapper returns only the statistic, as required by `monte_carlo_test`\n", + " return res.statistic\n", + "\n", + "res = stats.monte_carlo_test(sample=x, rvs=norm_dist.rvs, statistic=statistic_scipy, n_resamples=10000, alternative='greater')\n", + "res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "0ac75775-40e2-44c7-882b-009b2adef689", + "metadata": {}, + "source": [ + "Of course, besides enabling more accurate tests for small sample sizes, `monte_carlo_test` makes it easy to perform hypothesis tests *not* implemented in SciPy. For instance, suppose we want to assess whether our data is distributed according to a [`rayleigh` distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rayleigh). Just as the \"normal distribution\" is really a *family* of distributions parameterized by mean and standard deviation, so `rayleigh` is a family of distributions rather than one specific distribution. 
In contrast with the normal distribution, however, there are no tests in SciPy specifically designed to determine whether a sample is drawn from a Rayleigh distribution. \n", + "\n", + "The closest options are tests like [`ks_1samp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_1samp.html) and [`cramervonmises`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cramervonmises.html), which can assess whether a sample is drawn from a *specific* distribution. If we want to use these tests to assess whether the sample is distributed according to *any* Rayleigh distribution, one approach is to fit a Rayleigh distribution to the data, and then apply `ks_1samp` and `cramervonmises` to test whether the data were drawn from the fitted Rayleigh distribution." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "7ecf4ac3-3430-4508-b42e-625e3d7f08cc", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "dist_family = stats.rayleigh\n", + "params = dist_family.fit(x)\n", + "dist_specific = dist_family(*params)\n", + "\n", + "z = np.linspace(dist_specific.ppf(0), dist_specific.isf(0.001))\n", + "plt.plot(z, dist_specific.pdf(z))\n", + "plt.plot(x, np.zeros_like(x), 'x')\n", + "plt.legend(('Candidate PDF', 'Observed Data'))\n", + "plt.xlabel('Weight (lb)')\n", + "plt.ylabel('Probability Density')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "bf116179-6679-4ca4-b938-549886b92b0d", + "metadata": {}, + "source": [ + "To the eyes of the author, this does not look like a terrific fit. The mode of the Rayleigh distribution is too far to the right compared to the cluster of observations around 160 lb. Also, according to this Rayleigh distribution, there is zero probability that any weights could be less than ~135 lb, which does not seem realistic. However, the `ks_1samp` and `cramervonmises` tests are both inconclusive, with relatively large $p$-values." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "09860ff5-3a29-4d41-bf9f-d69770ab9aa3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "KstestResult(statistic=0.26884627441317677, pvalue=0.3412228239139401)" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "stats.ks_1samp(x, dist_specific.cdf)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "09867af4-cd95-4286-8eb7-3e79e5b97b7a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "CramerVonMisesResult(statistic=0.17536330558707267, pvalue=0.32395064743536117)" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "stats.cramervonmises(x, dist_specific.cdf)" + ] + }, + { + "cell_type": "markdown", + "id": "7666ae1d-8c0b-48df-b40d-06050da64edc", + "metadata": {}, + "source": [ + "A much more powerful test of the null hypothesis that the data is distributed according to *any* Rayleigh distribution is the [Anderson-Darling Test](https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test). [`scipy.stats.anderson`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html) implements the test for some families of distributions, but not for the Rayleigh distribution. A simple implementation is included below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "e9eb2df7-4560-4b17-98f5-e9611a9223a5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0425042504250425" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def statistic(x):\n", + " \"\"\"Compute the Anderson-Darling statistic A^2\"\"\"\n", + " # fit a distribution to the data\n", + " params = dist_family.fit(x)\n", + " dist = dist_family(*params)\n", + " \n", + " # compute A^2\n", + " x = np.sort(x)\n", + " n = len(x)\n", + " i = np.arange(1, n+1)\n", + " Si = (2*i - 1)/n * (dist.logcdf(x) + dist.logsf(x[::-1]))\n", + " S = np.sum(Si)\n", + " return -n - S\n", + "\n", + "params = dist_family.fit(x)\n", + "dist = dist_family(*params)\n", + "res = stats.monte_carlo_test(x, rvs=dist.rvs, statistic=statistic, alternative='greater')\n", + "res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "db20efd8-6b9f-41d2-a44b-3b26e847ab22", + "metadata": {}, + "source": [ + "Although this does not meet the threshold for significance used above (1%), it does begin to cast doubt on the null hypothesis." + ] + }, + { + "cell_type": "markdown", + "id": "eb114e39-ea07-451d-b6d1-1c965fca311e", + "metadata": {}, + "source": [ + "As we can see, `monte_carlo_test` is a versatile tool for comparing a sample against a distribution by means of an arbitrary statistic. 
Provided a statistic and null distribution, it can replicate the $p$-value of any such test in SciPy, and it may be more accurate than these existing implementations, especially for small samples:\n", + "\n", + "- [`skewtest`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewtest.html)\n", + "- [`kurtosistest`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosistest.html)\n", + "- [`normaltest`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html)\n", + "- [`shapiro`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html)\n", + "- [`anderson`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html)\n", + "- [`chisquare`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html)\n", + "- [`power_divergence`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.power_divergence.html)\n", + "- [`cramervonmises`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cramervonmises.html)\n", + "- [`ks_1samp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_1samp.html)\n", + "- [`binomtest`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binomtest.html)\n", + "\n", + "In addition, `monte_carlo_test` can be used to perform tests not yet implemented in SciPy, such as [the Lilliefors Test](https://www.tandfonline.com/doi/abs/10.1080/01621459.1967.10482916) for normality.\n", + "\n", + "However, there are other types of statistical tests that do not test whether a sample is drawn from a particular distribution or family of distributions, but instead test whether multiple samples are drawn from the same distribution. For these situations, we turn our attention to [Permutation Tests](https://nbviewer.org/github/mdhaber/scipy/blob/resampling_tutorial/doc/source/tutorial/stats/notebooks/resampling_tutorial_2.ipynb)." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2.ipynb b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2.ipynb new file mode 100644 index 0000000..c56bb99 --- /dev/null +++ b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2.ipynb @@ -0,0 +1,386 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c2bf7028-42c0-4310-8bf8-e7e5a9bf27fd", + "metadata": {}, + "source": [ + "## Permutation Tests\n", + "### Exact Tests\n", + "Consider the following experiment from [An Introduction to the Bootstrap](https://books.google.com/books?id=MWC1DwAAQBAJ&printsec=frontcover). A new medical treatment is intended to prolong life after a form of surgery. Sixteen mice are randomly assigned to either a treatment group or control group under the constraint that only seven treatments are available. All mice receive the surgery, but only the treatment group will receive the treatment being studied. The survival time of each mouse after surgery is recorded below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "823b24e2-d26a-4f79-94cb-cc1eb27c8e5b", + "metadata": {}, + "outputs": [], + "source": [ + "# survival times measured in days\n", + "import numpy as np\n", + "x = np.array([94, 197, 16, 38, 99, 141, 23]) # treatment group\n", + "y = np.array([52, 104, 146, 10, 51, 30, 40, 27, 46]) # control group" + ] + }, + { + "cell_type": "markdown", + "id": "e19f78a9-e1c0-4d8a-aeb6-3fa944752948", + "metadata": {}, + "source": [ + "The difference in the mean life after treatment between the two groups suggests that the treatment has a prolonging effect, as hypothesized." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5b3b1ead-55f7-456f-b015-9e36d0380c9d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "30.63492063492064\n" + ] + } + ], + "source": [ + "def statistic(x, y):\n", + " return np.mean(x) - np.mean(y)\n", + "print(statistic(x, y))" + ] + }, + { + "cell_type": "markdown", + "id": "9d016372-3ebe-49e5-a70c-923d8e9a1650", + "metadata": {}, + "source": [ + "It is possible that the treatment has no effect in reality; perhaps the apparent prolonging effect of the treatment is due to the inherent variability in survival times and chance alone. This possibility is typically assessed using Student's t-test. A common formulation of the test begins with the null hypothesis that the survival times `x` and `y` are sampled at random from normal distributions $X$ and $Y$ with means $\\mu_x$ and $\\mu_y$ and a common standard deviation, $\\sigma$. To test the null hypothesis that $\\mu_x = \\mu_y$ against the alternative that $\\mu_x > \\mu_y$, we perform the independent sample t-test with `stats.ttest_ind`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "b993796d-05c7-4b75-bc70-7cd974da76b0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Ttest_indResult(statistic=1.1208453991208167, pvalue=0.14060629239765005)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from scipy import stats\n", + "stats.ttest_ind(x, y, equal_var=True, alternative='greater')" + ] + }, + { + "cell_type": "markdown", + "id": "33e2913b-ac57-493a-a6c8-20f11ab9b933", + "metadata": {}, + "source": [ + "The probability of observing such an extreme test statistic under the null hypothesis (due to chance alone) is greater than 14%, so these data do not seem inconsistent with the null hypothesis. The *point estimate* of the statistic (~30 days) suggested a life-prolonging effect, but such a value of the statistic could quite easily have been observed due to chance alone.\n", + "\n", + "Although the t-test tends to be rather robust to violations of its underlying assumptions (e.g., $X$ and $Y$ do not need to be strictly normally distributed for the test to be reasonably accurate), it is possible to perform a hypothesis test which requires no such assumptions at all. \n", + "\n", + "Instead, let the null hypothesis be that the samples `x` and `y` are drawn from a single distribution ($X = Y = Z$), and test this against the alternative that the two samples are drawn from distributions which would tend to produce greater values of `statistic`. 
\n", + "\n", + "The complete population of mice survival times in the study is really:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "8888960d-bb51-4549-a01c-d1e70f880744", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[ 94 197 16 38 99 141 23 52 104 146 10 51 30 40 27 46]\n" + ] + } + ], + "source": [ + "z = np.concatenate([x, y])\n", + "print(z)" + ] + }, + { + "cell_type": "markdown", + "id": "6204af8f-b858-4d5a-8bed-b4ef1bc00c30", + "metadata": {}, + "source": [ + "Since the mice were randomly divided into the two groups under the constraint that there were only seven treatments available, any selection of seven mice from `z` to form the treatment group `x` was equally likely; the remaining mice would form the control group `y`. Furthermore, if the null hypothesis is true, the mice survival times would be *unaffected by the grouping*. Therefore, each value of the statistic obtained from the possible groupings is equally likely.\n", + "\n", + "We begin our hypothesis test by calculating the value of `statistic` for all possible *permutations*[[2]](#cite_note-2) of mice into the two groups, forming an exact null distribution.\n", + "\n", + "[[2]](#cite_note-2) Here and below, we will refer to the ways of rearranging samples as \"permutations\" even when the word is not strictly appropriate in the technical sense. " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "c7fed236-d06e-478b-b580-cbdd468730eb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0, 0.5, 'Observed Frequency')" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "from itertools import combinations\n", + "import matplotlib.pyplot as plt\n", + "\n", + "def null_distribution(z, nx):\n", + "    # z is the population of mice survival times in the study\n", + "    # nx is the number of mice in the treatment group\n", + "    # Groups are chosen by index rather than by value so that tied\n", + "    # survival times are not silently dropped\n", + "    z = np.asarray(z)\n", + "    indices = set(range(len(z)))\n", + "    null_distribution = []\n", + "    for ix in combinations(range(len(z)), nx):\n", + "        iy = sorted(indices - set(ix))\n", + "        stat = statistic(z[list(ix)], z[iy])\n", + "        null_distribution.append(stat)\n", + "    return np.array(null_distribution)\n", + "\n", + "null_dist = null_distribution(z, len(x))\n", + "plt.hist(null_dist, density=True, bins=50)\n", + "plt.xlabel(\"Value of test statistic\")\n", + "plt.ylabel(\"Observed Frequency\")" + ] + }, + { + "cell_type": "markdown", + "id": "13914661-4e02-47db-82d0-a2302cc1e695", + "metadata": {}, + "source": [ + "We complete the hypothesis test by comparing the observed value of the test statistic to the rest of the null distribution." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "d0b5ba29-952e-4652-8f35-58928e76e6ad", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.1409965034965035\n" + ] + } + ], + "source": [ + "pvalue = np.sum(null_dist >= statistic(x, y)) / len(null_dist)\n", + "print(pvalue)" + ] + }, + { + "cell_type": "markdown", + "id": "90fd3def-2996-4366-a661-76c3216cd634", + "metadata": {}, + "source": [ + "Approximately 14% of the values of the null distribution are greater than or equal to the observed value of the statistic, so there is a 14% probability of observing such an extreme value of the statistic even if the treatment had no effect at all. \n", + "\n", + "Given data and a statistic function, `stats.permutation_test` performs the same test automatically."
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "8ff08473-77e4-4bc9-864e-2f06f19f2948", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "PermutationTestResult(statistic=30.63492063492064, pvalue=0.1409965034965035, null_distribution=array([ 30.63492063, 38. , 51.20634921, ..., -11.01587302,\n", + " -45.55555556, -34.88888889]))\n" + ] + } + ], + "source": [ + "# `alternative` is 'greater' because we are interested in the percentage of values in the\n", + "# null distribution that are greater than the observed value of the test statistic.\n", + "# `n_resamples` is `np.inf` to ensure that all possible permutations are used\n", + "# Note that `(x, y)`, a tuple, is a single argument.\n", + "res = stats.permutation_test((x, y), statistic, alternative='greater', n_resamples = np.inf)\n", + "assert res.pvalue == pvalue\n", + "print(res)" + ] + }, + { + "cell_type": "markdown", + "id": "8b5bb238-7db9-40eb-8221-d703c3d26b70", + "metadata": {}, + "source": [ + "It returns the observed value of the test statistic, the null distribution, and the $p$-value. They are related as above:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "f303201f-cf56-46c1-8ee3-428afca93eb3", + "metadata": {}, + "outputs": [], + "source": [ + "assert np.sum(res.null_distribution >= res.statistic ) / len(res.null_distribution) == res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "ae116861-ded9-4728-a1b1-4c9c34c50fdd", + "metadata": {}, + "source": [ + "Note that the exact $p$-value from the permutation test matches the $p$-value from the t-test quite closely. 
(As we shall see, Ronald Fisher introduced permutation tests primarily to support the use of the t-test in applications where the underlying normality assumptions were not strictly true [[4](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2458144/)].)" + ] + }, + { + "cell_type": "markdown", + "id": "ae96fea0-1255-48ac-9d7f-e56f1db10ae2", + "metadata": {}, + "source": [ + "### Randomized Tests\n", + "The number of possible permutations grows rather quickly as the number of observations increases. Specifically, if $n_x$ and $n_y$ are the numbers of observations in `x` and `y`, respectively, then the number of possible permutations is $\\frac{(n_x + n_y)!}{n_x! n_y!}$. " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "d864ec56-9438-45f2-a41a-cb30e3c1ac35", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "11440\n" + ] + } + ], + "source": [ + "from math import factorial as f\n", + "n_x = len(x)\n", + "n_y = len(y)\n", + "assert len(res.null_distribution) == f(n_x + n_y) // (f(n_x) * f(n_y))\n", + "print(len(res.null_distribution))" + ] + }, + { + "cell_type": "markdown", + "id": "432cff2d-f08e-4be3-99f2-c97a8f751e83", + "metadata": {}, + "source": [ + "When the number of possible permutations is too large, it is common to use a randomly sampled subset of the possible permutations instead. As with `monte_carlo_test`, the maximum number of resamples used by `permutation_test` is controlled using the `n_resamples` parameter."
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "e9625f6e-d527-4e78-89fc-e24a603a2811", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.1494" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# use only 4999 randomly-sampled permutations\n", + "res = stats.permutation_test((x, y), statistic, alternative='greater', n_resamples = 4999)\n", + "res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "e1937a92-5ad8-4164-a6c0-5ee01e6702b0", + "metadata": {}, + "source": [ + "If the number of distinct permutations of the data is less than or equal to `n_resamples`, `permutation_test` performs an exact test, computing the value of the test statistic for each distinct permutation exactly once. If the number of distinct permutations exceeds `n_resamples`, `permutation_test` computes the value of the statistic for `n_resamples` random permutations, and the $p$-value is computed as:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "8e7e5525-5cbf-43df-86b2-7af081ad96e4", + "metadata": {}, + "outputs": [], + "source": [ + "pvalue = (np.sum(res.null_distribution >= res.statistic ) + 1) / (len(res.null_distribution) + 1)\n", + "assert pvalue == res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "459d7ca3-f719-4b8e-a855-8e47846c3ec0", + "metadata": {}, + "source": [ + "Note that `1` is added to both the numerator and denominator when performing the randomized test [[3]](https://www.degruyter.com/document/doi/10.2202/1544-6115.1585/html). This can be thought of as including the observed value of the test statistic in the null distribution, and it ensures that the $p$-value of a randomized test is never zero." + ] + }, + { + "cell_type": "markdown", + "id": "cb30657f-b7db-4762-95c2-1eebb2a05a64", + "metadata": {}, + "source": [ + "A wide variety of common hypothesis tests can be performed as permutation tests. 
We continue with several other examples to explore the flexibility of `permutation_test`, beginning with [Independent-Sample Tests](https://nbviewer.org/github/mdhaber/scipy/blob/resampling_tutorial/doc/source/tutorial/stats/notebooks/resampling_tutorial_2a.ipynb)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2a.ipynb b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2a.ipynb new file mode 100644 index 0000000..76ca6f0 --- /dev/null +++ b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2a.ipynb @@ -0,0 +1,578 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "57c13277-a239-452b-a202-e499e0a068b0", + "metadata": {}, + "source": [ + "### Independent-Sample Tests\n", + "#### Two-sample Test\n", + "In [Individual Comparisons by Ranking Methods](https://www.jstor.org/stable/3001968#metadata_info_tab_contents), Wilcoxon considers two sprays designed to kill flying insects. A subset of the data, the percentage of flies killed in repeated trials of two treatments, is recorded below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "fe264441-ebf5-47f0-83fb-81b824751b2d", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "x = np.array([61, 62, 67, 63, 56, 58])\n", + "y = np.array([60, 68, 59, 72, 64])" + ] + }, + { + "cell_type": "markdown", + "id": "8525c260-6699-4f90-b20c-85e0d948bb3a", + "metadata": {}, + "source": [ + "In the paper, Wilcoxon describes a test to assess whether the two samples are drawn from the same population. The test is now commonly described as a *nonparametric* version of the independent sample t-test - that is, a version of the t-test that does not make the normality (or any particular distributional) assumption. \n", + "\n", + "Suppose we want to test the null hypothesis that the samples are drawn from the same distribution against the alternative that they are drawn from different distributions which tend to produce samples with lower values of the statistic. Under certain assumptions, rejecting the null hypothesis can be interpreted as evidence that the location of the distribution underlying `x` is less than the location of the distribution underlying `y`. We pass the data into [`scipy.stats.mannwhitneyu`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html) with `alternative='less'`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "9fc1c550-67a2-4483-b760-e119b16e6b2d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.16450216450216448" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from scipy import stats\n", + "_, pvalue = stats.mannwhitneyu(x, y, alternative='less')\n", + "pvalue # p-value is greater than our threshold; test is inconclusive" + ] + }, + { + "cell_type": "markdown", + "id": "4b709400-3601-4ce5-9a36-538498abab45", + "metadata": {}, + "source": [ + "Like the mean comparison test in Efron's example, this is an example of an \"independent sample\" test of the null hypothesis that group labels (`x`, `y`) are entirely random. In fact, because `mannwhitneyu` claims to produce an exact $p$-value, we would expect `permutation_test` to return precisely the same $p$-value (using `mannwhitneyu` only to compute the statistic)." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f526ac5a-b0a8-4c60-a47d-324c22215431", + "metadata": {}, + "outputs": [], + "source": [ + "def statistic(x, y):\n", + "    # return just the Mann-Whitney U statistic\n", + "    return stats.mannwhitneyu(x, y, alternative='less').statistic\n", + "\n", + "# \"independent\" is the default `permutation_type`, so we are not required to pass it here\n", + "# We pass `alternative='less'` because lesser values of the statistic are more extreme\n", + "res = stats.permutation_test((x, y), statistic, permutation_type='independent', alternative='less')\n", + "np.testing.assert_allclose(res.pvalue, pvalue, atol=1e-15)" + ] + }, + { + "cell_type": "markdown", + "id": "5c346fa8-1b34-4443-8220-c018b6d0f5a8", + "metadata": {}, + "source": [ + "Just as with `monte_carlo_test`, vectorizing the `statistic` function can greatly improve the speed of the test."
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4f2dae73-2bf1-4c66-89b3-ac1ff0487e4c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "272 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "# Before\n", + "%timeit stats.permutation_test((x, y), statistic, alternative='less')" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "b5c8a5c2-e958-4164-bd19-0b9fffc342b5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "56.5 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "# After\n", + "def statistic_vectorized(x, y, axis=0):\n", + "    # return just the Mann-Whitney U statistic\n", + "    return stats.mannwhitneyu(x, y, axis=axis, alternative='less').statistic\n", + "\n", + "%timeit stats.permutation_test((x, y), statistic_vectorized, alternative='less', vectorized=True)" + ] + }, + { + "cell_type": "markdown", + "id": "a751fc04-e796-4d8e-bf5a-3c47da9e4323", + "metadata": {}, + "source": [ + "Although `mannwhitneyu` provides an exact $p$-value for the data above, `permutation_test` comes in handy when there are ties in the samples. As the [`mannwhitneyu` documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html) states,\n", + "> `'exact'`: computes the exact p-value by comparing the observed statistic against the exact distribution of the statistic under the null hypothesis. **No correction is made for ties.**\n", + "\n", + "The complete data set in Wilcoxon's original paper had ties."
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "ccadcdf2-1341-48c6-bcc3-92be7b80b725", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.014763014763014764 0.01351981351981352\n" + ] + } + ], + "source": [ + "x = [60, 67, 61, 62, 67, 63, 56, 58]\n", + "y = [68, 68, 59, 72, 64, 67, 70, 74]\n", + "res1 = stats.mannwhitneyu(x, y, method='exact', alternative='two-sided')\n", + "# By default, only 9,999 random permutations are used. \n", + "# We pass n_resamples=np.inf to ensure that all 12,870 possible permutations are used\n", + "res2 = stats.permutation_test((x, y), statistic_vectorized, alternative='two-sided', vectorized=True, n_resamples=np.inf)\n", + "print(res1.pvalue, res2.pvalue)" + ] + }, + { + "cell_type": "markdown", + "id": "7c85b266-314c-490f-9a22-a5c1e1400496", + "metadata": {}, + "source": [ + "The two $p$-values are similar despite the ties, but only `permutation_test` is truly \"exact\" in this case. Either way, our 1% threshold for statistical significance is not met, and the test is inconclusive." + ] + }, + { + "cell_type": "markdown", + "id": "6cfb7ac5-6ec6-43a4-ae3b-f5ef32050f17", + "metadata": {}, + "source": [ + "#### Multi-sample Test\n", + "`scipy.stats.kruskal` is a many-sample extension of the Mann-Whitney U test, but SciPy provides only an approximate (asymptotic) $p$-value. It is possible to perform an exact version of the test using `permutation_test` for very small samples, and a randomized test using a subset of the possible permutations may yield more accurate results than the approximation implemented by `kruskal`, especially if there are ties or the sample size is small. 
Using the (artificial) data for milk cap production from [Kruskal and Wallis' original paper](https://www.tandfonline.com/doi/abs/10.1080/01621459.1952.10483441), we have:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "d8879fa7-9576-4d84-acfb-85962ebb0b02", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "KruskalResult(statistic=5.656410256410254, pvalue=0.059118869289796136)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = [340, 345, 330, 342, 338]\n", + "y = [339, 333, 344]\n", + "z = [347, 343, 349, 355]\n", + "stats.kruskal(x, y, z)" + ] + }, + { + "cell_type": "markdown", + "id": "246840a9-f7fe-4af6-942e-ad97540e5730", + "metadata": {}, + "source": [ + "At the expense of some time, the exact $p$-value for these data is given by `permutation_test`." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "5b274945-72d2-4597-a2dd-e9d7efbe293e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.048629148629148626" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def statistic(x, y, z, axis=0):\n", + "    return stats.kruskal(x, y, z, axis=axis).statistic\n", + "\n", + "res = stats.permutation_test((x, y, z), statistic, vectorized=True, alternative='greater', n_resamples=np.inf)\n", + "res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "fc877591-9f75-4990-a035-a6a48b4e4597", + "metadata": {}, + "source": [ + "Note that we passed `alternative='greater'` into `permutation_test` but not into `kruskal`. This is because the `kruskal` test is inherently one-sided: data generated under the null hypothesis tend to produce small positive values of the statistic, and greater values are more exceptional. This raises the point that setting up a permutation test requires some study of the underlying statistic and SciPy's implementation. 
Another example of this is shown in the next section." + ] + }, + { + "cell_type": "markdown", + "id": "2eb247ee-5dc9-4951-9124-11bd3a88c510", + "metadata": {}, + "source": [ + "#### Gotchas\n", + "Suppose that we wish to perform the two-sample Kolmogorov-Smirnov test to test the null hypothesis that two samples were drawn from the same distribution against the alternative that the distribution $X$ underlying sample `x` is [stochastically greater](https://en.wikipedia.org/wiki/Stochastic_ordering) than the distribution $Y$ underlying sample `y`. Roughly speaking, this is the alternative that $X$ \"tends to be\" greater than $Y$.\n", + "\n", + "Here, we'll use randomly generated data that best illustrate some confusing (but important) points. We choose the sizes of the samples to generate the `RuntimeWarning` reported in [gh-14019](https://github.com/scipy/scipy/issues/14019)." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "08329d40-cd24-482b-9a40-24e2ed90b23e", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import numpy as np\n", + "from scipy import stats\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Indeed, the distribution $X$ is stochastically greater than the distribution $Y$\n", + "X = stats.norm(loc=+0.2)\n", + "Y = stats.norm(loc=0)\n", + "x = X.rvs(size=801)\n", + "y = Y.rvs(size=399)\n", + "\n", + "grid = np.linspace(-4, 4, 100)\n", + "plt.plot(grid, X.pdf(grid), 'C0')\n", + "plt.plot(grid, Y.pdf(grid), 'C1')\n", + "plt.hist(x, density=True, color='C0', bins=30, alpha=0.5)\n", + "plt.hist(y, density=True, color='C1', bins=30, alpha=0.5)\n", + "plt.title('Distribution PDFs and Sample Histograms')\n", + "plt.legend(['x', 'y'])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "d10f5fe3-43f8-46e9-8347-01b926170eaa", + "metadata": {}, + "source": [ + "Our first difficulty is determining the correct value of `alternative` to pass into `ks_2samp`. From its [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html), we see that the alternatives are expressed not in terms of the values of the samples or the location of the underlying distributions, but in terms of the _cumulative distribution functions_ of the underlying distributions. \n", + "\n", + "> - `two-sided`: The null hypothesis is that the two distributions are identical, $F(x)=G(x)$ for all $x$; the alternative is that they are not identical.\n", + "> - `less`: The null hypothesis is that $F(x) >= G(x)$ for all $x$; the alternative is that $F(x) < G(x)$ for at least one $x$.\n", + "> - `greater`: The null hypothesis is that $F(x) <= G(x)$ for all $x$; the alternative is that $F(x) > G(x)$ for at least one $x$.\n", + "\n", + "Note that if a distribution $X$ tends to be greater than $Y$, we find that the cumulative distribution function of $X$ lies _below_ the cumulative distribution function of $Y$."
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "5ebe1d2f-f067-46f1-a415-986b19b933be", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(grid, X.cdf(grid), 'C0')\n", + "plt.plot(grid, Y.cdf(grid), 'C1')\n", + "plt.title('Distribution CDFs')\n", + "plt.legend(['x', 'y'])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "c44e1bfe-8ebe-41d1-b96d-90f970cb8311", + "metadata": {}, + "source": [ + "Therefore, to test the alternative that $X$ is stochastically greater than $Y$, we pass `alternative='less'` into `ks_2samp`." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "97815b21-eb3e-401e-9cff-1da838141a70", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "KstestResult(statistic=0.12484394506866417, pvalue=0.00022195733373729093)\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\matth\\AppData\\Local\\Temp\\ipykernel_35644\\3756467023.py:1: RuntimeWarning: ks_2samp: Exact calculation unsuccessful. Switching to method=asymp.\n", + " res1 = stats.ks_2samp(x, y, alternative='less', method='exact')\n" + ] + } + ], + "source": [ + "res1 = stats.ks_2samp(x, y, alternative='less', method='exact')\n", + "print(res1)" + ] + }, + { + "cell_type": "markdown", + "id": "33065824-d0c3-4ad6-9c08-296a160636fd", + "metadata": {}, + "source": [ + "The $p$-value is tiny, confirming what we already know: the data are inconsistent with the null hypothesis, and we have evidence to reject it in favor of the alternative.\n", + "\n", + "The warning states that `ks_2samp` was unable to compute an exact $p$-value, and an asymptotic $p$-value is being returned instead. To determine whether the asymptotic $p$-value is accurate for these sample sizes, we can perform a permutation test." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "649978b2-02ff-42a6-90d1-e04b3051c31c", + "metadata": {}, + "outputs": [], + "source": [ + "def statistic(x, y):\n", + "    return stats.ks_2samp(x, y, alternative='less').statistic\n", + "\n", + "# This would be extremely slow!\n", + "# res2 = stats.permutation_test((x, y), statistic, alternative='greater')" + ] + }, + { + "cell_type": "markdown", + "id": "69c52b5c-f7aa-4eaf-b2ad-9e53723aa9c7", + "metadata": {}, + "source": [ + "The calculation above would be extremely slow to run. Unfortunately, `ks_2samp` does not accept an `axis` argument, so we can't speed it up using vectorization without implementing the statistic ourselves. However, lack of vectorization is not the bottleneck here. Note that the call to `ks_2samp` is quite slow with the default parameters *even for 1D inputs*." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "4825ddb9-5bf9-445f-bb82-031d3b330fc8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "374 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "# No need for the warning; we know the exact calculation is unsuccessful\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "%timeit stats.ks_2samp(x, y, alternative='less', method='exact')" + ] + }, + { + "cell_type": "markdown", + "id": "f728c880-bbde-4003-87a6-c4090925088d", + "metadata": {}, + "source": [ + "By default, `permutation_test` needs to call `ks_2samp` 9999 times, which would take about an hour. We can speed this up dramatically by noting that `permutation_test` only uses `ks_2samp` to compute the test statistic, so the `pvalue` attribute of the `ks_2samp` result object is not used at all. We can use `ks_2samp` to compute essentially the same value of the test statistic, but much faster, by specifying `method='asymp'`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "abfe9b45-4e7b-4284-bd1b-2620eced89ff", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "167 µs ± 947 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + ] + } + ], + "source": [ + "# method='asymp' and method='exact' result in the same statistic value\n", + "res1 = stats.ks_2samp(x, y, alternative='less', method='asymp')\n", + "res2 = stats.ks_2samp(x, y, alternative='less', method='exact')\n", + "np.testing.assert_allclose(res1.statistic, res2.statistic, atol=1e-15)\n", + "\n", + "# but method='asymp' is much faster\n", + "%timeit stats.ks_2samp(x, y, alternative='less', method='asymp')" + ] + }, + { + "cell_type": "markdown", + "id": "bafdc39b-6132-4121-b7b8-6853603b129a", + "metadata": {}, + "source": [ + "Now we can run a randomized `permutation_test` in reasonable time. " + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "c0c64b0b-31c5-4586-b696-36f03642040b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.0002219573337372917 0.9999\n" + ] + } + ], + "source": [ + "def statistic(x, y):\n", + "    return stats.ks_2samp(x, y, alternative='less', method='asymp').statistic\n", + "\n", + "res3 = stats.permutation_test((x, y), statistic, alternative='less')\n", + "print(res1.pvalue, res3.pvalue)" + ] + }, + { + "cell_type": "markdown", + "id": "c9cb3241-fea3-4b3d-90af-a60a7852ffbd", + "metadata": {}, + "source": [ + "This was much faster, but something is still wrong. Either the approximate $p$-value is wildly inaccurate, or we have set up our test incorrectly. The latter turns out to be the case: the value of `alternative` passed into `ks_2samp` changes *the definition of the test statistic*, but a *greater* value of the statistic is always considered more extreme. 
Therefore, even if we wish to perform a test equivalent to `ks_2samp` with `alternative='less'`, we actually need to pass `alternative='greater'` into `permutation_test`!" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3fc268c6-db67-4bee-9c3e-b50ea2896e98", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.0002219573337372917 0.0005\n" + ] + } + ], + "source": [ + "# greater values of the statistic returned by `ks_2samp` are more extreme\n", + "res4= stats.permutation_test((x, y), statistic, alternative='greater')\n", + "print(res1.pvalue, res4.pvalue)" + ] + }, + { + "cell_type": "markdown", + "id": "9b06fd03-ca61-498c-9542-5dabc43bcddd", + "metadata": {}, + "source": [ + "At last, `permutation_test` is invoked correctly. Indeed, the asymptotic $p$-value produced by `ks_2samp` appears to be reliable for these sample sizes." + ] + }, + { + "cell_type": "markdown", + "id": "a8013eef-9aba-475d-b83a-ffbbbe311e8d", + "metadata": {}, + "source": [ + "### Other Tests\n", + "As we can see, `permutation_test` with `permutation_type='independent'` is a versatile tool for comparing independent samples. 
Provided only data and a statistic, it can produce the null distribution and replicate the $p$-value of many such tests in SciPy, and it may be more accurate than these existing implementations, especially for small samples and when there are ties:\n", + "\n", + "- [`ttest_ind`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html)\n", + "- [`cramervonmises_2samp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cramervonmises_2samp.html)\n", + "- [`ks_2samp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html)\n", + "- [`epps_singleton_2samp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.epps_singleton_2samp.html)\n", + "- [`mannwhitneyu`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html)\n", + "- [`kruskal`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html)\n", + "- [`friedmanchisquare`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.friedmanchisquare.html)\n", + "- [`brunnermunzel`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.brunnermunzel.html)\n", + "- [`ansari`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ansari.html)\n", + "- [`bartlett`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html)\n", + "- [`levene`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html)\n", + "- [`anderson_ksamp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson_ksamp.html)\n", + "- [`fligner`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fligner.html)\n", + "- [`median_test`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_test.html)\n", + "- [`mood`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mood.html)\n", + "\n", + "In addition, `permutation_test` with `permutation_type='independent'` can be used to perform tests not yet 
implemented in SciPy.\n", + "\n", + "However, there are other types of permutation tests that do not assume that the samples are entirely independent. We continue the study of `permutation_test` with [Paired-Sample Tests](https://nbviewer.org/github/mdhaber/scipy/blob/resampling_tutorial/doc/source/tutorial/stats/notebooks/resampling_tutorial_2b.ipynb)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2b.ipynb b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2b.ipynb new file mode 100644 index 0000000..9cd2c07 --- /dev/null +++ b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2b.ipynb @@ -0,0 +1,544 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0da5ab55-df7a-48d9-a9e8-a84473eb2a72", + "metadata": { + "tags": [] + }, + "source": [ + "### Paired-Sample Tests\n", + "\n", + "In [The Design of Experiments](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2458144/), Fisher describes an experiment conducted by Charles Darwin to measure the effect of cross-fertilization on plant growth. In Darwin's experiment, pairs of plants were grown from the same batch of seed in the same pot under the same conditions, except that one was self-fertilized and the other was cross-fertilized. 
\n", + "\n", + "> The evident object of these precautions is to increase the sensitiveness of the experiment, by making such differences in growth rate as were to be observed as little as possible dependent from environmental circumstancces, and as much as possible, therefore, from intrinsic differences due to their mode of origin.\n", + "\n", + "The `x` and `y` arrays below record the height of the cross-fertilized and self-fertilized plants, respectively." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "77e2b91f-6d6c-43c0-b6f1-0be309885c9d", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "x = np.array([23.5, 12, 21, 22, 19.125, 21.5, 22.125, 20.375, 18.25, 21.625, 23.25, 21, 22.125, 23, 12]) # (in) cross-fertilized\n", + "y = np.array([17.375, 20.375, 20, 20, 18.375, 18.625, 18.625, 15.25, 16.5, 18, 16.25, 18, 12.75, 15.5, 18]) # (in) self-fertilized\n", + "assert len(x) == len(y) # elements at corresponding positions form a pair" + ] + }, + { + "cell_type": "markdown", + "id": "24b37c06-fb34-493e-ab4b-c49a14185c69", + "metadata": {}, + "source": [ + "The null hypothesis was that the method of fertilization would have no effect, and Darwin's alternative hypothesis was that cross-fertilized plants would be taller. Fisher recommends the t-test, which is implemented for paired (\"related\") samples by `stats.ttest_rel`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "005bb7e7-8751-4298-a13e-c796e149df9e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Ttest_relResult(statistic=2.1479874613311205, pvalue=0.024851472010900447)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from scipy import stats\n", + "# alternative hypothesis 'greater': the mean of the distribution underlying \n", + "# `x` is greater than the mean of the distribution underlying `y` \n", + "res_t = stats.ttest_rel(x, y, alternative='greater')\n", + "res_t" + ] + }, + { + "cell_type": "markdown", + "id": "809a24f6-4345-47f5-b7b8-9c0b64b7384f", + "metadata": {}, + "source": [ + "However, the t-test is derived under the assumption that samples were drawn from a normal population. Fisher reports that \n", + "\n", + "> There has, however, in recent years, been a tendency for theoretical statisticians, not closely in touch with the requirements of experimental data, to stress the element of normality, in the hypothesis tested, as though it were a serious limitation to the test applied.\n", + "\n", + "(The same is true today!) And so he proposes a different test that does not rely on the assumption of normality:\n", + "\n", + "> On the hypothesis that the two series of seeds are random samples from identical populations, and that their sites have been assigned to members of each pair independently at random, the 15 differences [in height between pairs] would each have occurred with equal frequency with a positive or with a negative sign... \n", + "Since *ex hypothesi* each of these $2^{15}$ combinations will occur by chance with equal frequency, a knowledge of how many of them are equal to or greater than the value actually observed affords a direct arithmetical test of the significance of this value.\n", + "\n", + "He goes on to perform the following test."
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "5c328b74-1774-4008-a8fc-d922fa0d94b9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0, 0.5, 'Observed Frequency')" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "from itertools import product\n", + "import matplotlib.pyplot as plt\n", + "\n", + "def statistic(d):\n", + "    # d is the differences in height between paired plants\n", + "    # the statistic is the sum of these differences\n", + "    return np.sum(d)\n", + "\n", + "# compute the statistic for all possible combinations of signs on the elements of `d`\n", + "def null_distribution(d):\n", + "    signs = (-1, 1) \n", + "    n = len(d) # number of observations per sample\n", + "    null_distribution = []\n", + "    for dsigns in product(*[signs]*n):\n", + "        stat = statistic(d * dsigns)\n", + "        null_distribution.append(stat)\n", + "    return null_distribution\n", + "\n", + "d = x - y\n", + "null_dist = null_distribution(d)\n", + "assert len(null_dist) == 2**15\n", + "bins = np.unique(null_dist).tolist()\n", + "bins.append(np.max(null_dist)+1)\n", + "plt.hist(null_dist, density=True, bins=bins)\n", + "plt.xlabel(\"Value of test statistic\")\n", + "plt.ylabel(\"Observed Frequency\")" + ] + }, + { + "cell_type": "markdown", + "id": "fa460986-ab98-448f-878d-35abe9e7c82a", + "metadata": {}, + "source": [ + "> In just 863 cases out of 32,768 the total deviation will have a positive value as great as or greater than that observed.\n", + "\n", + "The $p$-value is simply the ratio of the two."
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "ee2ae5b0-8ed4-4f20-be2d-904365f7ff6a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "t-test p-value: 0.024851472010900447\n", + "permutation test p-value: 0.026336669921875\n" + ] + } + ], + "source": [ + "assert np.sum(null_dist >= statistic(d)) == 863\n", + "pvalue = np.sum(null_dist >= statistic(d)) / len(null_dist)\n", + "print(f\"t-test p-value: {res_t.pvalue}\")\n", + "print(f\"permutation test p-value: {pvalue}\")" + ] + }, + { + "cell_type": "markdown", + "id": "e105edca-1c7e-4223-ab29-ed5db55fcf4f", + "metadata": {}, + "source": [ + "The two tests agree remarkably well in this case, suggesting the applicability of the t-test even when the original data are not strictly normally distributed." + ] + }, + { + "cell_type": "markdown", + "id": "bbcb4fdb-7191-4bc6-82c0-0fdfac445b4d", + "metadata": {}, + "source": [ + "`permutation_test` minimizes the code required to perform the same procedure. The most essential difference here compared to the independent \n", + "sample tests above is that we pass `permutation_type='samples'` instead of `permutation_type='independent'`. Note also that we can simply pass `np.sum` as the statistic because it satisfies the required interface, even with `vectorized=True`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "dd6d90a0-e443-4ec2-a6c5-04eddccea877", + "metadata": {}, + "outputs": [], + "source": [ + "# Note that the first argument of `permutation_test` is a sequence containing all the samples.\n", + "# We only have one sample here, but it still needs to be in a sequence.\n", + "res = stats.permutation_test((d,), np.sum, alternative='greater', permutation_type='samples', vectorized=True, n_resamples=np.inf)\n", + "assert res.pvalue == pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "3ef5fc8c-00a0-4f70-b385-76f189b1f66b", + "metadata": {}, + "source": [ + "The value of the `permutation_type`, \"samples\", comes from the fact that permuting the signs of the differences is equivalent to permuting the *sample* to which each paired observation is assigned. That is, we can obtain the same $p$-value by performing a two-sample test in which observations are exchanged between the two samples in all possible ways *without* breaking up pairs. This is what `permutation_test` does with `permutation_type='samples'` when there are two or more samples." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "c27182a9-ba9d-4b7f-8fb4-66b49eee11bb", + "metadata": {}, + "outputs": [], + "source": [ + "def statistic(x, y, axis=0):\n", + " # observations will be permuted between samples `x` and `y`,\n", + " # changing the sign of the corresponding element of `d` \n", + " d = x - y\n", + " return np.sum(d, axis=axis)\n", + "\n", + "res = stats.permutation_test((x, y), statistic, alternative='greater', permutation_type='samples', n_resamples=np.inf)\n", + "assert res.pvalue == pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "fe585e67-6955-43a5-a509-0b3e1743d22b", + "metadata": {}, + "source": [ + "Other common hypothesis tests can be performed as paired-sample permutation tests. 
We continue with several comparisons against `scipy.stats.wilcoxon` to further illustrate the usage of `permutation_test` with `permutation_type='samples'`.\n", + "\n", + "#### Two-sample Test" + ] + }, + { + "cell_type": "markdown", + "id": "d6b915f5-6182-4060-806c-60aa242071ae", + "metadata": {}, + "source": [ + "[`wilcoxon`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html) is to `ttest_rel` as `mannwhitneyu` is to `ttest_ind`; it is a nonparametric permutation test of the null hypothesis that each observation in a pair is drawn from the same distribution. Because it is a paired-sample test, it computes the $p$-value by reassigning observations between the samples in all possible ways while maintaining the pairs." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "4b4c4b12-0e1d-4a41-a202-3682cae871b1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "WilcoxonResult(statistic=10.0, pvalue=0.765625)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "rng = np.random.default_rng()\n", + "x = rng.random(7)\n", + "y = rng.random(7)\n", + "# Each element in `x` is paired with the corresponding element of `y` \n", + "stats.wilcoxon(x, y, alternative='greater')" + ] + }, + { + "cell_type": "markdown", + "id": "d542cee8-1f20-440f-be69-9812f593099a", + "metadata": {}, + "source": [ + "Using `wilcoxon` to compute only the statistic, we can use `permutation_test` to calculate precisely the same $p$-value." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "65a8170d-26be-4c64-9be3-d89e49699cc8", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\matth\\Desktop\\scipy\\scipy\\stats\\_morestats.py:3380: UserWarning: Sample size too small for normal approximation.\n", + " warnings.warn(\"Sample size too small for normal approximation.\")\n" + ] + }, + { + "data": { + "text/plain": [ + "0.765625" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def statistic(x, y):\n", + "    return stats.wilcoxon(x, y, alternative='greater', method='approx').statistic\n", + "\n", + "res = stats.permutation_test((x, y), statistic, alternative='greater', permutation_type='samples')\n", + "res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "8bd72655-a67b-4493-aa7b-54a4cee5ef98", + "metadata": {}, + "source": [ + "(The warning can be safely ignored because it is referring to the $p$-value returned by the test, whereas we are using only the statistic value.)\n", + "\n", + "Again, the advantage of `permutation_test` is its ability to handle ties in the data; `wilcoxon` does not." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "3f9a43ec-861c-418d-b736-646075f65785", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\matth\\Desktop\\scipy\\scipy\\stats\\_morestats.py:3366: UserWarning: Exact p-value calculation does not work if there are zeros. 
Switching to normal approximation.\n", + " warnings.warn(\"Exact p-value calculation does not work if there are \"\n" + ] + }, + { + "data": { + "text/plain": [ + "WilcoxonResult(statistic=4.5, pvalue=0.8539232992870668)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = rng.integers(0, 5, size=7)\n", + "y = rng.integers(0, 5, size=7)\n", + "stats.wilcoxon(x, y, method='exact')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "226a7dd7-7dc9-4ef1-9161-b2a59bf324fd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(4.5, 1.0)" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "res = stats.permutation_test((x, y), statistic, permutation_type='samples')\n", + "res.statistic, res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "f10349d8-9232-482b-92ed-74e59023dd08", + "metadata": {}, + "source": [ + "#### One-sample Test\n", + "\n", + "The `wilcoxon` statistic does not depend on the specific values in `x` and `y`, only on the values of `x - y`. Instead of passing `x` and `y` into `wilcoxon` separately, `wilcoxon` can accept a single sample - the differences between paired observations." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "c77d4c4c-efec-429b-9af1-b900308f05a5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "WilcoxonResult(statistic=25.0, pvalue=0.0390625)" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = np.array([209, 200, 177, 169, 159, 169, 187])\n", + "y = np.array([151, 168, 147, 164, 166, 163, 176])\n", + "stats.wilcoxon(x - y, alternative='greater', method='exact')" + ] + }, + { + "cell_type": "markdown", + "id": "7d4d542a-a922-47ca-aebc-9328209a1da3", + "metadata": {}, + "source": [ + "As described above, this is relatively common in paired-sample tests (e.g. 
see also `ttest_rel`), so `permutation_test` also supports single samples as input and forms the null distribution by permuting the *signs* of the observations." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "9667f070-749c-45c2-8e51-1e284fda6a1c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0390625" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def statistic(d):\n", + "    # return just the Wilcoxon signed-rank statistic\n", + "    return stats.wilcoxon(d, alternative='greater').statistic\n", + "\n", + "res = stats.permutation_test((x-y,), statistic, alternative='greater', permutation_type='samples')\n", + "res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "ecb0eb66-4c5c-4dd5-8403-a8c1a46f0ebd", + "metadata": {}, + "source": [ + "#### Gotchas\n", + "\n", + "Suppose that we wish to perform the Wilcoxon test with a two-sided alternative. We might expect that `alternative='two-sided'` should be specified everywhere the `alternative` parameter is accepted." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "353b33cc-2bfc-43ea-b870-47fb600aa879", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.078125 0.15625\n" + ] + } + ], + "source": [ + "alternative = 'two-sided'\n", + "res1 = stats.wilcoxon(x - y, alternative=alternative)\n", + "\n", + "def statistic(d):\n", + "    return stats.wilcoxon(d, alternative=alternative).statistic\n", + "\n", + "res2 = stats.permutation_test((x-y,), statistic, alternative=alternative, permutation_type='samples')\n", + "\n", + "print(res1.pvalue, res2.pvalue)" + ] + }, + { + "cell_type": "markdown", + "id": "cefd2711-9f37-46fb-97c1-073362fa7041", + "metadata": {}, + "source": [ + "This guess was not correct; the `pvalue` returned by `permutation_test` is greater by a factor of two. 
Note that the documentation of `wilcoxon` states:\n", + "> If `alternative` is “two-sided”, [the `statistic` is] the sum of the ranks of the differences above or below zero, whichever is smaller.\n", + "\n", + "The sign information that should be carried by the statistic (\"above or below zero\") is not preserved by `statistic`. This can be corrected by passing `alternative='greater'` or `alternative='less'` in the call to `wilcoxon` so that the statistic is always:\n", + "> the sum of the ranks of the differences above zero." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "2aa4d597-f85e-4e31-a5f3-d38b8dd6fef4", + "metadata": {}, + "outputs": [], + "source": [ + "def statistic(d):\n", + "    # `alternative='less'` or `alternative='greater'` will work\n", + "    return stats.wilcoxon(d, alternative='less').statistic\n", + "\n", + "res2 = stats.permutation_test((x-y,), statistic, alternative=alternative, permutation_type='samples')\n", + "\n", + "np.testing.assert_allclose(res2.pvalue, res1.pvalue, atol=1e-15)" + ] + }, + { + "cell_type": "markdown", + "id": "31159ba3-500b-468f-90cd-00647704cd1b", + "metadata": {}, + "source": [ + "### Other Tests\n", + "`permutation_test` with `permutation_type='samples'` is a versatile tool for comparing paired samples. 
Provided only data and a statistic, it can produce the null distribution and replicate the $p$-value of similar tests in SciPy, and it may be more accurate than these existing implementations, especially for small samples and when there are ties:\n", + "\n", + "- [`ttest_rel`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html)\n", + "- [`wilcoxon`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html)\n", + "- [`page_trend_test`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.page_trend_test.html)\n", + "\n", + "In addition, `permutation_test` with `permutation_type='samples'` can be used to perform tests not yet implemented in SciPy.\n", + "\n", + "However, there are yet other types of permutation tests that assume that the samples are neither independent nor paired. We conclude the study of `permutation_test` with [Correlated-Sample Tests](https://nbviewer.org/github/mdhaber/scipy/blob/resampling_tutorial/doc/source/tutorial/stats/notebooks/resampling_tutorial_2c.ipynb)."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2c.ipynb b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2c.ipynb new file mode 100644 index 0000000..de68a7c --- /dev/null +++ b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_2c.ipynb @@ -0,0 +1,532 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "693a57e1-90c1-4947-8bb4-55043cafd19f", + "metadata": {}, + "source": [ + "### Correlated-Sample Tests" + ] + }, + { + "cell_type": "markdown", + "id": "ffeda6a3-f272-4d58-aa8b-a85be4c17f04", + "metadata": {}, + "source": [ + "Hollander and Wolfe's [Nonparametric Statistical Methods](https://onlinelibrary.wiley.com/doi/book/10.1002/9781119196037) considers data from [[6](https://www.jci.org/articles/view/106443)], which studied the relationship between free proline (an amino acid) and total collagen (a protein often found in connective tissue) in diseased human livers.\n", + "\n", + "The `x` and `y` arrays below record the measurements of the two compounds."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "9f6a6603-3dec-4d28-b101-28c550d6e326", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "x = np.array([7.1, 7.1, 7.2, 8.3, 9.4, 10.5, 11.4]) # total collagen (mg/g dry weight of liver)\n", + "y = np.array([2.8, 2.9, 2.8, 2.6, 3.5, 4.6, 5.0]) # free proline (μ mole/g dry weight of liver)" + ] + }, + { + "cell_type": "markdown", + "id": "341c6722-3240-4baa-854f-6cab6d397557", + "metadata": {}, + "source": [ + "The text shows the results of analysis using Spearman's correlation coefficient, a statistic sensitive to linear association between the ranks of the samples. Specifically, the null hypothesis that there is no association between total collagen and free proline is tested against the alternative that there is a positive, linear association between their ranks." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "26d45212-2e68-4177-ad0d-56c6560fd0f3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "SpearmanrResult(correlation=0.7000000000000001, pvalue=0.03995834515444954)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from scipy import stats\n", + "res_asymptotic = stats.spearmanr(x, y, alternative='greater')\n", + "res_asymptotic" + ] + }, + { + "cell_type": "markdown", + "id": "6a472b08-7ec8-4aeb-8588-8dbec497c84f", + "metadata": {}, + "source": [ + "As usual, the $p$-value of SciPy's `stats.spearmanr` approximates the probability of obtaining such an extreme value of the statistic under the null hypothesis. The small $p$-value corresponding to the positive correlation coefficient provides \"marginal evidence\" that the total collagen and free proline are positively correlated.\n", + "\n", + "However, the $p$-value of `stats.spearmanr` is based on an asymptotic approximation, which may not be accurate for such a small sample. 
Even if `spearmanr` did implement an exact $p$-value calculation, it is unlikely that it would support data with ties due to the limitations of common algorithms. Therefore, we consider how a permutation test could be used to compute an exact $p$-value.\n", + "\n", + "Under the null hypothesis, all the proline measurements are independent samples from the same distribution; they are uncorrelated with the measurements of collagen, and the observed pairings have no significance. Therefore, the null distribution is formed by computing the statistic for *all possible pairings* of proline and collagen measurements *without permuting samples*. Because `spearmanr` treats elements `x[i]` and `y[i]` as paired, we can accomplish this by computing the statistic for all possible orderings of only one of the two arrays (e.g. `y`)." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "bfce021a-2bf1-4797-affd-b90baa419591", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0, 0.5, 'Observed Frequency')" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "from itertools import permutations\n", + "import matplotlib.pyplot as plt\n", + "\n", + "def statistic(x, y):\n", + "    return stats.spearmanr(x, y, alternative='greater').correlation\n", + "\n", + "def null_distribution(x, y):\n", + "    # the order of `x` can remain fixed\n", + "    # By generating all possible orderings of `y` (alone),\n", + "    # we explore all possible pairings between observations\n", + "    # in `x` and `y`\n", + "    null_distribution = []\n", + "    for yperm in permutations(y):\n", + "        stat = statistic(x, yperm)\n", + "        null_distribution.append(stat)\n", + "    return null_distribution\n", + "\n", + "null_dist = null_distribution(x, y)\n", + "plt.hist(null_dist, density=True, bins=100)\n", + "plt.xlabel(\"Value of test statistic\")\n", + "plt.ylabel(\"Observed Frequency\")" + ] + }, + { + "cell_type": "markdown", + "id": "e5709630-c43b-49d7-a760-ef5c46b55443", + "metadata": {}, + "source": [ + "The $p$-value is the fraction of values in the null distribution that equal or exceed the observed value of the statistic." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "d0e86da4-b5d0-4a5b-9cbe-1f1192bce745", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Asymptotic p-value: 0.03995834515444954\n", + "Exact p-value: 0.04563492063492063\n" + ] + } + ], + "source": [ + "pvalue = np.sum(null_dist >= statistic(x, y)) / len(null_dist)\n", + "print(f\"Asymptotic p-value: {res_asymptotic.pvalue}\")\n", + "print(f\"Exact p-value: {pvalue}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f272d090-d956-426d-86f4-16018675db28", + "metadata": {}, + "source": [ + "The asymptotic p-value is reasonably accurate in this case, but it is smaller. 
This suggests that the asymptotic approximation is not conservative; that is, it is more likely to lead to a _Type I error_: rejecting the null hypothesis even when it is actually true.\n", + "\n", + "`permutation_test` can perform the same test using `permutation_type='pairings'` (so named because it forms the null distribution by permuting the pairings of the observations without permuting observations between samples)." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "26863c88-a0c5-4bb6-b20b-c9265c8702b3", + "metadata": {}, + "outputs": [], + "source": [ + "def statistic(y):\n", + "    return stats.spearmanr(x, y).correlation\n", + "res = stats.permutation_test((y,), statistic, alternative='greater', permutation_type='pairings', n_resamples=np.inf)\n", + "assert res.pvalue == pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "1e176203-d762-43a9-8a13-95830b8bcdaf", + "metadata": {}, + "source": [ + "Many other correlation tests can be performed as permutation tests. We continue with another example to help avoid common pitfalls in the usage of `permutation_test` with `permutation_type='pairings'`." + ] + }, + { + "cell_type": "markdown", + "id": "713434e9-ad0e-4de0-a2ab-c1c9f15fa3fe", + "metadata": {}, + "source": [ + "#### Gotchas\n", + "\n", + "Another example of a correlation test in SciPy is `scipy.stats.kendalltau`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "c06f4848-7dc1-4d38-9c10-c3b75d5bbedd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "KendalltauResult(correlation=-0.39999999999999997, pvalue=0.48333333333333334)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Each element in `x` is paired with the corresponding element of `y`\n", + "rng = np.random.default_rng()\n", + "x = rng.random(5)\n", + "y = rng.random(5) \n", + "stats.kendalltau(x, y, alternative='two-sided')" + ] + }, + { + "cell_type": "markdown", + "id": "b1d8e8f8-6350-4e8a-9e04-cb9aa9be6b34", + "metadata": {}, + "source": [ + "Like `mannwhitneyu` and `wilcoxon`, `kendalltau` computes its p-value using permutations. Using `kendalltau` to compute only the statistic, we could compute the same $p$-value with `permutation_test`." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "b6039a81-5a7f-4b17-9372-91644d1f522b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.4816" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def statistic(x, y):\n", + " return stats.kendalltau(x, y).correlation\n", + "res = stats.permutation_test((x, y,), statistic, alternative='two-sided', permutation_type='pairings')\n", + "res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "14f3d86c-a40c-4b52-b0f5-4044756e5b3f", + "metadata": {}, + "source": [ + "What happened here? In all cases before, `permutation_test` produced the exact p-value, but here we have only a four-digit approximation.\n", + "\n", + "Note that the null distribution contains only 9999 elements, the default for a randomized test." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "83f73b8b-1112-4f69-812e-6354fcf348c0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9999" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(res.null_distribution)" + ] + }, + { + "cell_type": "markdown", + "id": "7680292d-e64c-4827-aaaa-21a4b42a58fe", + "metadata": {}, + "source": [ + "If we were to allow for unlimited permutations, `permutation_test` would eventually compute the exact answer. " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "6ea2531c-b691-4c03-b644-426311f1d466", + "metadata": {}, + "outputs": [], + "source": [ + "res = stats.permutation_test((x, y), statistic, alternative='two-sided', permutation_type='pairings', n_resamples=np.inf)" + ] + }, + { + "cell_type": "markdown", + "id": "21aa5416-f1f3-4304-a8e9-12d800aaab49", + "metadata": {}, + "source": [ + "Then we can compute the $p$-value as the percentage of elements in the null distribution as extreme as the observed value of the test statistic. In this case, either large or small values are considered more extreme because `alternative='two-sided'`. " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "47c98f9d-5412-4eee-ba26-711088cdd83f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.48333333333333334\n" + ] + } + ], + "source": [ + "pvalue = np.sum(np.abs(res.null_distribution) >= np.abs(statistic(x, y)) ) / len(res.null_distribution)\n", + "print(pvalue)" + ] + }, + { + "cell_type": "markdown", + "id": "fd18314e-6613-4f56-8637-1ead152e627c", + "metadata": {}, + "source": [ + "Note that this definition only makes sense for distributions that are symmetric about a known median. 
To produce the same value for symmetric distributions but generalize to asymmetric distributions, `permutation_test` actually computes the p-value by doubling the minimum of the `'greater'` and `'less'` p-values." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "1cf372d4-209e-4861-9580-7368c1968cdb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.48333333333333334\n" + ] + } + ], + "source": [ + "pvalue_greater = np.sum(res.null_distribution >= statistic(x, y) ) / len(res.null_distribution)\n", + "pvalue_less = np.sum(res.null_distribution <= statistic(x, y) ) / len(res.null_distribution)\n", + "pvalue = 2 * min(pvalue_greater, pvalue_less)\n", + "print(pvalue)" + ] + }, + { + "cell_type": "markdown", + "id": "a912f79d-1fae-42e6-9c2c-744b36c1e285", + "metadata": {}, + "source": [ + "But let's step back a minute - theoretically, there are only $5!=120$ possible pairings of the observations between the two samples, so why did it take so many resamples to compute an exact answer? \n", + "\n", + "`permutation_test` permutes the orders of _all_ provided samples, so we computed all possible permutations of both `x` and `y`. That's $5! \\cdot 5!=14,400$ permutations:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "06f986f4-0662-44b7-9509-9a58de803544", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "14400" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(res.null_distribution)" + ] + }, + { + "cell_type": "markdown", + "id": "e20badf7-1766-4181-80c0-ae71d465ec70", + "metadata": {}, + "source": [ + "This is $5!$ times the amount of work it needed to do, since only the pairings between observations in `x` and `y` affect the statistic, not the order of the pairs within the arrays. 
We improve efficiency by leaving `x` out of the call to `permutation_test` and simply including it as part of the `statistic` itself." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "9a0f9fe9-2eb3-49d0-bf23-555e4048bdb7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.48333333333333334\n" + ] + } + ], + "source": [ + "def statistic(y):\n", + "    return stats.kendalltau(x, y, alternative='two-sided').correlation\n", + "\n", + "res = stats.permutation_test((y,), statistic, alternative='two-sided', permutation_type='pairings')\n", + "assert len(res.null_distribution) == 120\n", + "print(res.pvalue)" + ] + }, + { + "cell_type": "markdown", + "id": "86228230-34c8-4b15-8d32-02329ede4d90", + "metadata": {}, + "source": [ + "Again, `permutation_test` is particularly useful when there are ties because, according to the [`kendalltau` documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html):\n", + "\n", + "> ‘exact’: computes the exact p-value, but can only be used if no ties are present. " + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "9fa298f8-5e90-44bb-9c4a-57f88a3432aa", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ValueError: Ties found, exact method cannot be used.\n" + ] + } + ], + "source": [ + "# with more observations than distinct values,\n", + "# there will be ties within and between samples\n", + "x = rng.integers(0, 5, size=7) \n", + "y = rng.integers(0, 5, size=7)\n", + "try:\n", + "    stats.kendalltau(x, y, method='exact', alternative='two-sided')\n", + "except ValueError as e:\n", + "    print(f\"{type(e).__name__}: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "b8d88171-4457-4140-b34a-239edfdf6d70", + "metadata": {}, + "source": [ + "`permutation_test` has no such restriction. 
Since we are using `kendalltau` only to compute the correlation statistic, we can pass the option `method='asymptotic'` to avoid the computational expense of computing exact p-values." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "899c532f-b6b3-4819-b58d-445c724624be", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.5714285714285714" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def statistic(y):\n", + " return stats.kendalltau(x, y, alternative='two-sided', method='asymptotic').correlation\n", + "\n", + "res = stats.permutation_test((y,), statistic, alternative='two-sided', permutation_type='pairings')\n", + "res.pvalue" + ] + }, + { + "cell_type": "markdown", + "id": "427a5fa2-096a-419e-9f68-29338f6ef4c5", + "metadata": {}, + "source": [ + "### Other Tests\n", + "`permutation_test` with `permutation_type='pairings'` is a versatile tool for assessing association between samples. Provided only data and a statistic, it can produce the null distribution and replicate the $p$-value of similar tests in SciPy, and it may be more accurate than these existing implementations, especially for small samples and when there are ties:\n", + "\n", + "- [`pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)\n", + "- [`spearmanr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html)\n", + "- [`kendalltau`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html)\n", + "- [`somersd`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.somersd.html)\n", + "- [`linregress`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html)\n", + "\n", + "In addition, `permutation_test` with `permutation_type='pairings'` can be used to perform tests not yet implemented in SciPy.\n", + "\n", + "But there is much more to statistics than $p$-values! 
We conclude with a discussion of one of the most versatile techniques of all: [the bootstrap](https://nbviewer.org/github/mdhaber/scipy/blob/resampling_tutorial/doc/source/tutorial/stats/notebooks/resampling_tutorial_3.ipynb)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_3.ipynb b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_3.ipynb new file mode 100644 index 0000000..4d7b58d --- /dev/null +++ b/ipython/ResamplingAndMonteCarloMethods/resampling_tutorial_3.ipynb @@ -0,0 +1,799 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5ddad2d4-6b8e-417b-b2f1-818ec162c523", + "metadata": {}, + "source": [ + "## The Bootstrap\n", + "\"Bootstrapping\" refers to computational techniques for making inferences about a statistic beyond point estimates by treating the samples as though they were the populations of interest. Regarding the origin of the term, [An Introduction to the Bootstrap](https://cindy.informatik.uni-bremen.de/cosy/teaching/CM_2011/Eval3/pe_efron_93.pdf) states:\n", + "\n", + "> The use of the term bootstrap derives from the phrase *to\n", + "pull oneself up by one's bootstrap*, widely thought to be based on\n", + "one of the eighteenth century Adventures of Baron Munchausen,\n", + "by Rudolph Erich Raspe. (The Baron had fallen to the bottom of\n", + "a deep lake. 
Just when it looked like all was lost, he thought to\n", + "pick himself up by his own bootstraps.)\n", + "\n", + "Let us return to the experiment that we considered at the beginning of the discussion of permutation tests. Again, a new medical treatment is intended to prolong life after a form of surgery. Sixteen mice are randomly assigned to either a treatment group or control group. All mice receive the surgery, but only the treatment group will receive the new treatment. The survival time of each mouse after surgery is recorded below.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "7581fc1a-795e-41b1-9de5-4167100d91d5", + "metadata": {}, + "outputs": [], + "source": [ + "# survival times measured in days\n", + "import numpy as np\n", + "x = np.array([94, 197, 16, 38, 99, 141, 23]) # treatment group\n", + "y = np.array([52, 104, 146, 10, 51, 30, 40, 27, 46]) # control group" + ] + }, + { + "cell_type": "markdown", + "id": "ca5fe064-d6b2-493b-8609-4a8d3b7be900", + "metadata": {}, + "source": [ + "The permutation test allowed us to study whether or not the treatment had any effect on the survival times. In many studies, we are interested not only in whether there is an effect; we are also interested in the _magnitude_ of the effect. It would be misleading to report only the difference in mean survival times, especially since the permutation test and t-test showed that there was a ~$14\\%$ chance of observing such an extreme difference in means due to chance alone. In addition to reporting our statistic (the difference in means), we should also report some measurement of our uncertainty.\n", + "\n", + "One way of quantifying our uncertainty is the _standard error_ of our statistic. Suppose we were to perform the same experiment (with new mice) repeatedly. 
Because the mice are random samples from some greater population and there will be some random error in the effect of the treatment, we would not observe the same value of the statistic every time; rather, the values of the statistic would form a distribution. The standard error is the standard deviation of this distribution.\n", + "\n", + "How do we calculate the standard error if we do not know the underlying distribution from which the mice survival times are sampled? The typical approach, which we will not discuss in detail, assumes that the underlying distributions are normal; from this assumption and some math, statisticians have derived a formula to estimate the standard error of the statistic. This approach is limited in applicability, however, as it may not produce a good estimate if the original distributions are non-normal; moreover, standard error formulas are only available for a few statistics. Instead, we take a different approach, beginning with the mild assumption that the observed samples are representative of the distributions from which they were taken. We estimate the standard error by repeatedly *resampling from the observed data* (with replacement), calculating the statistic of the resample each time, and computing the standard deviation of the resulting distribution. This makes sense: to estimate the standard error, we would happily re-sample from the distribution itself if it were available to us. It's not, so we do the next best thing, which is re-sampling from the data we already have." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "e30a2a3c-dae6-4c29-aa9f-d585a4b39213", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Observed Statistic Value: 30.63492063492064\n", + "Standard Error: 27.336478035417056\n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "rng = np.random.default_rng()\n", + "\n", + "def statistic(x, y, axis=0):\n", + " return np.mean(x, axis=axis) - np.mean(y, axis=axis)\n", + "\n", + "def bootstrap_distribution(x, y):\n", + " nx, ny = len(x), len(y)\n", + " N = 1000\n", + " bootstrap_distribution = []\n", + " for i in range(N):\n", + " # random indices to resample from x and y\n", + " ix = rng.integers(0, nx, size=nx)\n", + " iy = rng.integers(0, ny, size=ny)\n", + " xi = x[ix]\n", + " yi = y[iy]\n", + " stat = statistic(xi, yi)\n", + " bootstrap_distribution.append(stat)\n", + " return bootstrap_distribution\n", + "\n", + "boot_dist = bootstrap_distribution(x, y)\n", + "\n", + "plt.hist(boot_dist, density=True, bins=20)\n", + "plt.xlabel(\"Value of test statistic\")\n", + "plt.ylabel(\"Observed Frequency\")\n", + "\n", + "observed_statistic = statistic(x, y)\n", + "standard_error = np.std(boot_dist, ddof=1)\n", + "print(f\"Observed Statistic Value: {observed_statistic}\")\n", + "print(f\"Standard Error: {standard_error}\")" + ] + }, + { + "cell_type": "markdown", + "id": "b7802b49-0a53-4f5f-95b1-a0710ab98a81", + "metadata": {}, + "source": [ + "This is precisely what `bootstrap` does." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "530d48d9-8944-43e3-848b-37831e3630f7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "26.78533925848876" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from scipy import stats\n", + "# `n_resamples=1000` indicates that the statistic will be calculated for\n", + "# each of 1000 resamples.\n", + "# The meaning of `method='percentile'` will be discussed below\n", + "res = stats.bootstrap((x, y), statistic, n_resamples=1000, method='percentile')\n", + "assert res.standard_error == np.std(res.bootstrap_distribution, ddof=1)\n", + "res.standard_error" + ] + }, + { + "cell_type": "markdown", + "id": "bce2cf1c-4a82-4438-afb7-1832690132f2", + "metadata": {}, + "source": [ + "The two standard error estimates differ slightly because the bootstrap algorithm is inherently stochastic, but that is OK. The best we can hope for is an approximation, and these two approximations agree with one another quite well.\n", + "\n", + "An even better way of quantifying the uncertainty, especially when the distribution of the statistic is non-normal, is to produce a *confidence interval* on the statistic. Suppose we perform the experiment repeatedly and produce a \"95% confidence interval\" with bounds $l_i$ and $u_i$ from the data in each experiment $i$; this means that the true value of the statistic (the difference in the *population* means) will be between $l_i$ and $u_i$ in 95% of the replications $i$."
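In symbols, this coverage property can be sketched as follows, writing $\theta$ for the true difference in population means (this notation is assumed here for illustration and does not appear elsewhere in the tutorial):

$$
\Pr\left(l_i \le \theta \le u_i\right) = 0.95
$$

The probability is taken over repeated experiments: $\theta$ is fixed, and the endpoints $l_i$ and $u_i$ are what vary from one replication to the next.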
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "590cc9d0-0b34-4faf-829b-8302b4230553", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ConfidenceInterval(low=-18.494047619047613, high=82.84563492063491)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "res.confidence_interval # 95% confidence interval by default" + ] + }, + { + "cell_type": "markdown", + "id": "935b014a-ec43-4c5f-a480-c804edea3e6a", + "metadata": {}, + "source": [ + "By choosing `method='percentile'` above, we indicated that bootstrap should estimate this confidence interval as the central 95% of the bootstrap distribution - that is, the boundaries of our interval will be the 2.5 and 97.5 percentiles of the bootstrap distribution." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "86857327-0c35-4f5b-9529-9b1c65edd133", + "metadata": {}, + "outputs": [], + "source": [ + "ci_percentile = stats.scoreatpercentile(res.bootstrap_distribution, [2.5, 97.5])\n", + "np.testing.assert_allclose(res.confidence_interval, ci_percentile) # confidence interval is the central 95% of the bootstrap distribution " + ] + }, + { + "cell_type": "markdown", + "id": "86f64385-b4a6-4015-8222-3adc85572384", + "metadata": {}, + "source": [ + "Again, this means that if we were to perform the mice experiment repeatedly and each time use `bootstrap` to compute such a confidence interval from the data, we would expect the confidence interval to contain the true value of the difference in mean survival times 95% of the time. Note also that our confidence interval contains 0. This is closely related to our conclusion from the hypothesis tests above: our data is not inconsistent with the null hypothesis that the treatment has no effect." 
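The percentile interval described above can also be sketched without SciPy. The following is a minimal standard-library illustration under stated assumptions, not the implementation used by `bootstrap`: the helper names `diff_of_means` and `percentile_interval`, the fixed seed, and the simple index-based percentile rule are all choices made for this example.

```python
import random
from statistics import mean

# Survival times (days) copied from the mouse experiment above
x = [94, 197, 16, 38, 99, 141, 23]            # treatment group
y = [52, 104, 146, 10, 51, 30, 40, 27, 46]    # control group

def diff_of_means(a, b):
    """Statistic of interest: difference in sample means."""
    return mean(a) - mean(b)

def percentile_interval(a, b, n_resamples=5000, confidence_level=0.95, seed=0):
    """Hypothetical helper: percentile bootstrap interval for diff_of_means."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    # Resample each group independently, with replacement, and sort the results
    dist = sorted(
        diff_of_means(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    alpha = (1 - confidence_level) / 2
    # Approximate percentiles by indexing into the sorted bootstrap distribution
    low = dist[int(alpha * n_resamples)]
    high = dist[int((1 - alpha) * n_resamples) - 1]
    return low, high

observed = diff_of_means(x, y)
low, high = percentile_interval(x, y)
print(f"observed: {observed:.3f}, interval: ({low:.3f}, {high:.3f})")
```

The endpoints will differ somewhat from those reported by `bootstrap` because the resamples and the percentile interpolation rule are not the same, but the interval should likewise straddle 0.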
+ ] + }, + { + "cell_type": "markdown", + "id": "9e000136-9159-4361-9da8-b2d0f61d30ae", + "metadata": {}, + "source": [ + "### Single-Sample, Scalar-Valued Statistics (and Confidence Intervals)\n", + "This definition of a confidence interval can be difficult to interpret correctly, so we illustrate with a simpler example. Suppose there is an election with only two candidates, `0` and `1`, and all voters will vote for either one or the other (never both, and never for neither). We wish to estimate the percentage of voters who will vote for candidate `1` by performing an experiment before the election: we will ask a random sample of 1000 voters who they will vote for on election day. The results are stored in the array `sample`." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "2a72dd5a-7de6-4d7b-8d6d-e618f7c9f3fa", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "249 for candidate 0, 751 for candidate 1\n" + ] + } + ], + "source": [ + "# Rather than entering `sample` directly, let's generate one to work with.\n", + "# To simulate the results of such an experiment, suppose that the true \n", + "# (but unknown) percentage of voters who will vote for candidate `1` is 75%. 
\n", + "# If we sample voters at random from the population before the election and \n", + "# ask them who they will vote for, the responses will follow a Bernoulli \n", + "# distribution with shape parameter `p=0.75`.\n", + "p = 0.75\n", + "dist = stats.bernoulli(p=p)\n", + "sample = dist.rvs(size=1000)\n", + "vote_for_0 = np.sum(sample == 0)\n", + "vote_for_1 = np.sum(sample == 1)\n", + "print(f\"{vote_for_0} for candidate 0, {vote_for_1} for candidate 1\")" + ] + }, + { + "cell_type": "markdown", + "id": "c9f9532d-46ee-4d59-ab3b-af452790fab1", + "metadata": {}, + "source": [ + "The statistic we wish to estimate is the percentage of voters who will vote for candidate 1, so we can produce a *point estimate* of the statistic from the sample as:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "89b46532-a3b2-448f-b47a-22b92529074c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.751" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def statistic(sample, axis=0):\n", + " return np.sum(sample, axis=axis) / sample.shape[axis]\n", + "statistic(sample)" + ] + }, + { + "cell_type": "markdown", + "id": "b30681e0-52d0-41ac-8e90-11ecee5bf235", + "metadata": {}, + "source": [ + "`bootstrap` can produce a confidence interval around the point estimate." 
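As a rough cross-check on whatever interval `bootstrap` reports for a proportion, one can compare it with the textbook normal-approximation (Wald) interval. This is a different technique from the bootstrap and is not part of the original tutorial; the sketch below assumes the point estimate of 0.751 from 1000 responses shown above, and a 90% confidence level chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

p_hat = 0.751            # point estimate shown above (751 of 1000 for candidate 1)
n = 1000                 # number of voters surveyed
confidence_level = 0.9   # assumed level for this illustration

# Standard error of a sample proportion under the normal approximation
standard_error = sqrt(p_hat * (1 - p_hat) / n)

# Two-sided critical value of the standard normal distribution
z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

wald_low = p_hat - z * standard_error
wald_high = p_hat + z * standard_error
print(f"Wald interval: ({wald_low:.4f}, {wald_high:.4f})")
```

With a sample this large and a proportion far from 0 or 1, the two approaches would be expected to agree closely; large disagreement would be a hint to double-check how the statistic was passed to `bootstrap`.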
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "ee6d8613-22ca-4a59-bf66-c86ed1bdb726", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ConfidenceInterval(low=0.726, high=0.772)" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# As with `permutation_test`, the first argument of `bootstrap` needs to be a *sequence* of samples\n", + "data = (sample,)\n", + "# Passing `confidence_level=0.9` produces a 90% confidence interval\n", + "res = stats.bootstrap(data, statistic, confidence_level=0.9)\n", + "res.confidence_interval" + ] + }, + { + "cell_type": "markdown", + "id": "593bb317-2f96-40a3-8f7f-1499015ea2b2", + "metadata": {}, + "source": [ + "Suppose we perform the same experiment $100$ times, each time collecting new data from the same population, but computing the confidence interval in the same way." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b09ee5a2-1adb-4430-8118-ad16f2ed2549", + "metadata": {}, + "outputs": [], + "source": [ + "# lower and upper limits of confidence intervals produced by `bootstrap`\n", + "n_replications = 100 # 100 replications of the same experiment\n", + "n_observations = 1000 # 1000 observations per sample\n", + "\n", + "# Draw 100 new samples from the same population of voters, each with 1000 observations\n", + "sample = dist.rvs(size=(100, 1000)) \n", + "\n", + "# bootstrap the 90% confidence interval for all 100 samples (10 at a time)\n", + "res = stats.bootstrap((sample,), statistic, confidence_level=0.9, axis=1, batch=10)\n", + "li, ui = res.confidence_interval\n", + " \n", + "# This was equivalent to (but faster than) the following \n", + "# li = np.empty((n_replications,))\n", + "# ui = np.empty((n_replications,))\n", + "# for i in range(n_replications):\n", + "# sample = dist.rvs(size=n_observations) # collect a new sample from the same population of voters\n", + "# res = stats.bootstrap((sample,), 
statistic, confidence_level=0.9, vectorized=False)\n", + "# li[i], ui[i] = res.confidence_interval" + ] + }, + { + "cell_type": "markdown", + "id": "55cbf673-1d90-4436-9755-12c79cd4c054", + "metadata": {}, + "source": [ + "We expect that the confidence interval will contain the true value of the statistic ($p=0.75$) approximately 90% of the time." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "9856f8c6-4f4e-43e6-98e7-476202f27bcb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "87\n" + ] + } + ], + "source": [ + "contained = (li < p) & (p < ui)\n", + "print(np.sum(contained))" + ] + }, + { + "cell_type": "markdown", + "id": "c172f0e3-40cf-49ec-841e-0a3a7ab7bf64", + "metadata": {}, + "source": [ + "### Paired-Sample, Vector-Valued Statistics\n", + "\n", + "[An Introduction to the Bootstrap](https://books.google.com/books?id=MWC1DwAAQBAJ&printsec=frontcover) considers a small data set collected when studying a medical device for continuously delivering an anti-inflammatory hormone to test subjects. The arrays `x` and `y` record the number of hours the device was worn and the amount of hormone remaining in the device, respectively." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "07480ef9-594e-4eb6-9661-25cced033299", + "metadata": {}, + "outputs": [], + "source": [ + "x = np.array([99, 152, 293, 155, 196, 53, 184, 171, 52, 376, 385, 402, 29, 76, 296, 151, 177, 209, 119, 188, 115, 88, 58, 49, 150, 107, 125]) # hours worn\n", + "y = np.array([25.8, 20.5, 14.3, 23.2, 20.6, 31.1, 20.9, 20.9, 30.4, 16.3, 11.6, 11.8, 32.5, 32.0, 18.0, 24.1, 26.5, 25.8, 28.8, 22.0, 29.7, 28.9, 32.8, 32.5, 25.4, 31.7, 28.5]) # amount remaining (units unspecified)" + ] + }, + { + "cell_type": "markdown", + "id": "e1e416f8-7f0e-46f8-9584-388fd67602fe", + "metadata": {}, + "source": [ + "A standard linear regression is performed in SciPy as follows."
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "f64d81c2-d847-425a-bee0-dc53e8efed05", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The slope estimate is: -0.0574462986976377\n", + "The intercept estimate is: 34.16752817399911\n", + "The slope standard error is: 0.004464173160311544\n", + "The intercept standard error is: 0.8671972620941928\n" + ] + } + ], + "source": [ + "res_lr = stats.linregress(x, y)\n", + "\n", + "plt.plot(x, y, '.', label='original data')\n", + "plt.plot(x, res_lr.intercept + res_lr.slope*x, 'r', label='fitted line')\n", + "plt.legend()\n", + "plt.show()\n", + "print(f\"The slope estimate is: {res_lr.slope}\")\n", + "print(f\"The intercept estimate is: {res_lr.intercept}\")\n", + "print(f\"The slope standard error is: {res_lr.stderr}\")\n", + "print(f\"The intercept standard error is: {res_lr.intercept_stderr}\")" + ] + }, + { + "cell_type": "markdown", + "id": "3aa96aa6-be6a-4e9c-a55c-f5e4e14fb6c4", + "metadata": {}, + "source": [ + "`linregress` produces point estimates of the slope and intercept as well as standard errors for each statistic, assuming that the residuals between the best fit line and the data are normally distributed. We can test the normality assumption using `stats.shapiro`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "b5eb0671-a179-47e7-809c-8bc39f56843d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ShapiroResult(statistic=0.9171469211578369, pvalue=0.03371993452310562)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "e = y - res_lr.intercept + res_lr.slope*x\n", + "stats.shapiro(e)" + ] + }, + { + "cell_type": "markdown", + "id": "77de4703-5400-4a4b-a928-0af8878c330a", + "metadata": {}, + "source": [ + "Although the $p$-value does not conclusively reject the null hypothesis at all reasonable confidence levels, it does suggest that we might want to relax the residual normality assumption. `bootstrap` makes no such assumption about the residuals, and it can go beyond the standard errors, producing bias-corrected confidence intervals. The standard errors produced by `bootstrap` match those produced by `linregress` fairly well; however, `linregress` may overestimate these quantities for this data."
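To make concrete what resampling *pairs* means for a regression, here is a minimal standard-library sketch. It is an illustration under stated assumptions rather than what `bootstrap` does internally: the helper `slope_intercept`, the seed, and the choice of 2000 resamples are all made up for this example. The key point is that one set of random indices is applied to both `x` and `y`, so each hours-worn value stays attached to its own hormone measurement.

```python
import random
from statistics import mean, stdev

# Data copied from the arrays defined earlier in this section
x = [99, 152, 293, 155, 196, 53, 184, 171, 52, 376, 385, 402, 29, 76,
     296, 151, 177, 209, 119, 188, 115, 88, 58, 49, 150, 107, 125]
y = [25.8, 20.5, 14.3, 23.2, 20.6, 31.1, 20.9, 20.9, 30.4, 16.3, 11.6,
     11.8, 32.5, 32.0, 18.0, 24.1, 26.5, 25.8, 28.8, 22.0, 29.7, 28.9,
     32.8, 32.5, 25.4, 31.7, 28.5]

def slope_intercept(xs, ys):
    """Hypothetical helper: ordinary least-squares slope and intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(0)   # seeded only so the sketch is repeatable
n = len(x)
slopes, intercepts = [], []
for _ in range(2000):
    # One set of indices for both arrays keeps each (x, y) pair intact
    idx = [rng.randrange(n) for _ in range(n)]
    m, b = slope_intercept([x[i] for i in idx], [y[i] for i in idx])
    slopes.append(m)
    intercepts.append(b)

slope, intercept = slope_intercept(x, y)
print(f"slope: {slope:.5f}, intercept: {intercept:.3f}")
print(f"bootstrap standard errors: {stdev(slopes):.5f}, {stdev(intercepts):.3f}")
```

Resampling `x` and `y` with independent indices instead would destroy the association between them, which is why the pairing matters for this kind of statistic.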
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "b393954f-53a8-4e69-baf2-6f75f79020f3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The slope standard error is: 0.004260082018022721\n", + "The intercept standard error is: 0.7292901429594613\n", + "The confidence interval on the slope is: (-0.06516915578638195, -0.04868503288913481)\n", + "The confidence interval on the intercept is: (32.58844595440181, 35.45478220570269)\n" + ] + } + ], + "source": [ + "def statistic(x, y):\n", + " res = stats.linregress(x, y)\n", + " return res.slope, res.intercept\n", + "\n", + "res = stats.bootstrap((x, y), statistic, vectorized=False, paired=True)\n", + "\n", + "print(f\"The slope standard error is: {res.standard_error[0]}\")\n", + "print(f\"The intercept standard error is: {res.standard_error[1]}\")\n", + "print(f\"The confidence interval on the slope is: {res.confidence_interval.low[0], res.confidence_interval.high[0]}\")\n", + "print(f\"The confidence interval on the intercept is: {res.confidence_interval.low[1], res.confidence_interval.high[1]}\")" + ] + }, + { + "cell_type": "markdown", + "id": "1dceaaf7-ba8e-44d7-afb7-1cc1e74ed491", + "metadata": {}, + "source": [ + "Again, because the statistic has multiple values, a visualization of the bootstrap distribution may be more informative." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "662eee2a-0c36-4f9e-bbaa-8896f328aa8c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "for m, b in res.bootstrap_distribution.T[::10]:\n", + " plt.plot(x, m*x + b, color='b', alpha=0.01)\n", + "plt.plot(x, y, '.', label='original data')\n", + "plt.plot(x, res_lr.intercept + res_lr.slope*x, 'r', label='fitted line')" + ] + }, + { + "cell_type": "markdown", + "id": "a38780c7-38c8-44fe-93be-7e5347e3aea4", + "metadata": {}, + "source": [ + "A major advantage of the bootstrap is that it can produce standard errors and confidence intervals even in more general regression models that have no simple analytical solutions, such as when the regression function is nonlinear in the parameters and when using fitting methods other than least squares." + ] + }, + { + "cell_type": "markdown", + "id": "1a886e39-f2b5-4ea4-8dff-110b7ddb5b45", + "metadata": {}, + "source": [ + "### Gotchas\n", + "\n", + "Our final example will show yet another application of `bootstrap`, chosen to illustrate common pitfalls.\n", + "\n", + "[An Introduction to the Bootstrap](https://books.google.com/books?id=MWC1DwAAQBAJ&printsec=frontcover) presents a study about whether regular doses of aspirin can prevent heart attacks. Subjects were randomly assigned to two groups: 11,037 received aspirin pills, and the remaining 11,034 received placebos. The subjects were instructed to take one pill every other day, and the scientists recorded the number of subjects who experienced a heart attack during the study period: 104 in the aspirin group, and 189 in the placebo group. The statistic to assess the effectiveness of aspirin was the relative prevalence of heart attacks in the aspirin group versus the placebo group."
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "f4153813-8a78-43c6-bd28-2cc6f89d7a33", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.5501149812103875" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = np.zeros(11037) # 11037 subjects in the aspirin group\n", + "x[:104] = 1 # 104 experience a heart attack\n", + "y = np.zeros(11034) # 11034 subjects in the placebo group\n", + "y[:189] = 1 # 189 experience a heart attack\n", + "def statistic(x, y):\n", + " return (np.sum(x)/len(x))/(np.sum(y)/len(y))\n", + "statistic(x, y)" + ] + }, + { + "cell_type": "markdown", + "id": "d1b65fcb-6f4b-480c-a163-085e272e3249", + "metadata": {}, + "source": [ + "The risk of heart attack for aspirin-takers seemed to be approximately half that of placebo-takers.\n", + "\n", + "Suppose we wish to generate a 95% confidence interval to quantify our uncertainty." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "2b1bcd63-2a11-4643-b036-bce7121afb4f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TypeError: bootstrap() takes 2 positional arguments but 3 positional arguments (and 1 keyword-only argument) were given\n" + ] + } + ], + "source": [ + "try:\n", + " stats.bootstrap(x, y, statistic, confidence_level=0.95)\n", + "except Exception as e:\n", + " print(f\"{type(e).__name__}: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f0b5caf3-3434-41a2-b8ea-da8d1ccfe679", + "metadata": {}, + "source": [ + "This reminds us that all the data needs to be passed in as a single sequence, not two separate arguments `x` and `y`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "8d2738da-e79f-4ed3-968d-6533dadb7769", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ValueError: `method = 'BCa' is only available for one-sample statistics\n" + ] + } + ], + "source": [ + "data = (x, y) \n", + "try:\n", + " stats.bootstrap(data, statistic, confidence_level=0.95)\n", + "except Exception as e:\n", + " print(f\"{type(e).__name__}: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "aebd581b-d7df-4179-a6f1-a8d4feab9043", + "metadata": {}, + "source": [ + "`bootstrap` offers a `method` argument that selects how the confidence interval is to be estimated from the `bootstrap` distribution; the three methods `{'BCa', 'percentile', 'basic'}` vary in their performance and accuracy. `BCa` is the most computationally intensive but tends to be the most accurate, so it is the default. However, it is currently only available when our data has only one independent sample, whereas our data consists of two independent samples `x` and `y`. Let's try another option, `percentile`." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "33a84978-c534-4069-8f7e-3171614e0e20", + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " stats.bootstrap(data, statistic, method='percentile', confidence_level=0.95)\n", + "except Exception as e:\n", + " print(f\"{type(e).__name__}: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "47a2f909-b392-4738-ade4-4a80e07f7071", + "metadata": {}, + "source": [ + "Unlike `permutation_test`, `bootstrap` expects `statistic` to be vectorized by default. We can solve this by passing `vectorized=False`."
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "15046441-1e6a-45c8-bdf6-2efa274f127c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ConfidenceInterval(low=0.4324405646124769, high=0.6926221426325354)" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "res = stats.bootstrap(data, statistic, method='percentile', confidence_level=0.95, vectorized=False)\n", + "res.confidence_interval" + ] + }, + { + "cell_type": "markdown", + "id": "09cc4689-4037-4ddb-a457-449015ed5256", + "metadata": {}, + "source": [ + "Alternatively, we can vectorize our statistic by making it accept a parameter `axis` and having it work along the specified axis-slice of N-dimensional arrays `x` and `y`." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "7588b203-8de2-4b52-9ced-64ad374c82a7", + "metadata": {}, + "outputs": [], + "source": [ + "def statistic(x, y, axis=0):\n", + " return (np.sum(x, axis=axis)/x.shape[axis])/(np.sum(y, axis=axis)/y.shape[axis])\n", + "\n", + "try:\n", + " res = stats.bootstrap(data, statistic, method='percentile', confidence_level=0.95)\n", + " res.confidence_interval\n", + "except Exception as e:\n", + " print(f\"{type(e).__name__}: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "7800e793-92cb-4871-815b-06e3da89bb18", + "metadata": {}, + "source": [ + "Depending on your computer's hardware, you may run into a `MemoryError` here. Vectorized computations require a lot of memory. The default value of `n_resamples` is $9,999$, and there are a total of $11,037 + 11,034 = 22,071$ observations. Therefore, the resampled data arrays will contain a total of $9,999 \\cdot 22,071 = 220,687,929$ elements. Each element is stored in double precision (8 bytes), so at least 1.7 GB will be used during the calculation. To relax the memory requirement, we'll process the data in batches of 100 resamples rather than all at once."
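The memory estimate above is simple arithmetic, and the same calculation shows roughly how much a batch size of 100 helps. This is a back-of-the-envelope sketch that counts only the resampled data arrays at 8 bytes per element; actual peak memory depends on SciPy internals that are not modeled here.

```python
n_resamples = 9999                 # default number of resamples
n_observations = 11037 + 11034     # aspirin group + placebo group
bytes_per_element = 8              # double precision (float64)

# All resamples held in memory at once
n_elements = n_resamples * n_observations
total_bytes = n_elements * bytes_per_element
print(n_elements)                  # 220687929
print(total_bytes / 1e9)           # about 1.77 (GB)

# Only 100 resamples held in memory at a time
batch = 100
batch_bytes = batch * n_observations * bytes_per_element
print(batch_bytes / 1e6)           # about 17.7 (MB)
```

Under these assumptions, batching reduces the resampled-data footprint by roughly a factor of 100, at the cost of looping over about 100 batches.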
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "f6769306-af22-42b3-850b-377e40e3b703", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ConfidenceInterval(low=0.4309040629913994, high=0.6944371938603752)" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "res = stats.bootstrap(data, statistic, method='percentile', confidence_level=0.95, batch=100)\n", + "res.confidence_interval" + ] + }, + { + "cell_type": "markdown", + "id": "19a3d153-fb2f-4f32-aa10-a347799c33e9", + "metadata": { + "tags": [] + }, + "source": [ + "## Conclusion\n", + "\n", + "The resampling approaches in SciPy can be used not only to replicate the results of most of SciPy's hypothesis tests, but to\n", + "\n", + "- improve the accuracy of statistical tests for small sample sizes and in the presence of ties,\n", + "- provide standard errors and confidence intervals for arbitrary statistics, and\n", + "- easily implement statistical tests that SciPy does not yet offer." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}