# Confidence intervals

* Confidence intervals regardless of sample size (Student $t$ distribution).
* Confidence intervals for proportions.
* Useful Jupyter tool: the slider.

In [1]:
import scipy.stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact,fixed,IntSlider
import ipywidgets

## Student $t$ distribution

When our sample size $n$ is small, the sample mean $\bar x$ is normal only if the population is normal. So unless we can assume that our population is normally distributed, we can make no progress anyway.

But even then, $s$ is no longer a good estimator for $\sigma$. For small $n$, $s/\sqrt{n}$ underestimates the true variation of $\bar x$, and to correct our formula, we need to replace the normal distribution with the Student $t$ distribution. 

In practice, a tabulated standard $t$ distribution works very much like a tabulated normal distribution.
* First we convert $\bar x$ into a $z$ or $t$ value. The conversion is the same in both cases; the difference lies in which distribution we plan to use: normal ($z$-value) or Student $t$ ($t$-value).
* Then we can find the CDF in our table. The CDF is the area to the left of $z$ or $t$.
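
Written out, the conversion in the first bullet is the usual standardization. Here $\mu$ denotes the population mean (or its hypothesized value), a symbol that is an assumption of this sketch and is not used elsewhere in this section:

$$
t = \frac{\bar x - \mu}{s/\sqrt{n}}
$$

Under the large-sample normal approximation, the same expression is read as a $z$-value.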

But, but, but ...
1. A Student $t$ distribution depends on $n$, so we need a table for each $n$ (or equivalently each $df$).
2. Usually we want the inverse function of the CDF: we know the probability (in confidence interval language, the area $\alpha/2$) and would like to find the corresponding $z$ or $t$ value. To compute the inverse of the CDF with a table, we work the table "backwards". We can, of course, also tabulate the inverse function directly, and Fig. 7.1.6 lists results for some frequently needed values of $\alpha$.
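
Point 1 can be made concrete with a short sketch. It assumes `scipy.stats` (imported at the top of this notebook) and uses its inverse survival function `isf`, one of the functions named later in this section, to print the value that cuts off a right tail of $0.025$ for a few arbitrarily chosen $df$:

```python
import scipy.stats

right_tail = 0.025  # alpha/2 for a 95% confidence level

# critical t value for several degrees of freedom (illustrative choices)
for df in (2, 4, 9, 29, 99, 999):
    t_crit = scipy.stats.t.isf(right_tail, df)
    print(f'df = {df:4d}: t = {t_crit:.3f}')

# limiting case: the standard normal distribution (the df = infinity line)
z_crit = scipy.stats.norm.isf(right_tail)
print(f'normal   : z = {z_crit:.3f}')
```

The $t$ values are larger than the $z$ value for small $df$ and shrink toward it as $df$ grows, which is why a table needs one line per $df$.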

### Task

Make sure you understand that Fig. 7.1.5 lists the CDF of a normal distribution, while Fig. 7.1.6 lists the inverse function for the Student $t$ distribution, with the limiting normal distribution in the last line (the $df=\infty$ line). Locate every entry of the $df=\infty$ line in Fig. 7.1.5.

## Names of functions you already know

First of all, for the time being let's stick to *standard* distributions. For the normal distribution, *standard* implies a mean of $0$ and a standard deviation of $1$. The standard Student $t$ distribution is likewise centered at $0$ and unscaled, although its spread is somewhat larger than that of the standard normal for small $df$. Instead of $X$ and $x$, we use either $Z$ and $z$ or $T$ and $t$ to indicate that we deal with a standard normal or standard $t$ distribution.

We already used a CDF table as in Fig 7.1.5 in various examples. In general, a CDF is a function that takes an $x$, $z$, or $t$ value as input, and returns the probability to the left of that value, in other words, the left area: $P_{left} = \mathrm{CDF}(z)$.  

In addition, we used three related functions without giving them an explicit name:
1. The probability or area to the right of $z$: $P_{right}=1 - \mathrm{CDF}(z)$.
2. The inverse of the CDF: A function that takes the left area as an input and returns the corresponding $z$ or $t$ value.
3. The inverse of $1-\mathrm{CDF}$: A function that takes the right area as an input and returns the corresponding $z$ or $t$ value.

These four functions are closely related. When practicing with a normal distribution, we saw that the table encodes both the CDF (working the table "forward") and its inverse (working the table "backward"), and that the right-area functions follow from the left-area ones using "left area plus right area equals $1$".

Still, you will not be surprised to hear that all four functions have names, and that the names come in handy for understanding the literature and for calling Python functions.
1. The right probability is called the *survival function*, $\mathrm{SF} = 1 - \mathrm{CDF}$.
2. The inverse of the CDF is called the *percent point function*, PPF. For example, for a normal distribution, PPF($0.1$) gives the $z$-value of the 10th percentile.
3. The inverse of the survival function has no special name but is abbreviated ISF. For example, ISF($0.1$) finds the value whose right tail contains $10$%, that is, the 90th percentile.

### `scipy` implementation

A RV having a normal distribution: `rv=scipy.stats.norm`

A RV having a Student $t$ distribution: `rv=scipy.stats.t`

The four functions:

| Function | Normal | Student $t$ |
|---|---|---|
| CDF | `rv.cdf(z)` | `rv.cdf(t, df)` |
| SF  | `rv.sf(z)` | `rv.sf(t, df)` |
| PPF | `rv.ppf(P_left)` | `rv.ppf(P_left, df)` |
| ISF | `rv.isf(P_right)` | `rv.isf(P_right, df)` |

As you can see, the calls are straightforward once you know the names.
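
The relations between the four functions can be checked numerically. The following is a small sketch for the Student $t$ case; the values of `df` and `t_value` are arbitrary choices made for illustration:

```python
import numpy as np
import scipy.stats

rv = scipy.stats.t
df = 4          # illustrative degrees of freedom
t_value = 2.0   # illustrative t value

p_left = rv.cdf(t_value, df)   # area to the left of t_value
p_right = rv.sf(t_value, df)   # area to the right of t_value

# left area plus right area equals 1
print(np.isclose(p_left + p_right, 1.0))
# PPF undoes CDF, and ISF undoes SF
print(np.isclose(rv.ppf(p_left, df), t_value))
print(np.isclose(rv.isf(p_right, df), t_value))
# a right tail of q marks the same point as a left area of 1 - q
print(np.isclose(rv.isf(0.1, df), rv.ppf(0.9, df)))
```

Each comparison should print `True`. Dropping the `df` argument and setting `rv = scipy.stats.norm` gives the same checks for the normal distribution.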

Now we are ready to compute a confidence interval.

In [2]:
# define the distribution
rv=scipy.stats.t

In [3]:
# define the sample 
# normally we would have data in a pandas column and compute xbar and s
n=5
xbar=0.678
s=0.234

In [4]:
# confidence level we want; conf_level = 1 - alpha
conf_level=0.99

In [5]:
# compute the confidence interval
# we use ISF because we want a right tail of alpha/2 
df = n-1
alpha = 1-conf_level
t_alpha_over_2 = rv.isf(alpha/2, df)
E = t_alpha_over_2 * s / np.sqrt(n)
print(f'Confidence interval: {xbar:.3f} ± {E:.3f}  or  ({xbar-E:.3f}, {xbar+E:.3f})')

Confidence interval: 0.678 ± 0.482  or  (0.196, 1.160)
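
As the comment in the cell above notes, $\bar x$ and $s$ would normally be computed from data. Here is a minimal sketch of that step, assuming a hypothetical column named `x` with made-up values. The result is cross-checked against the `interval` method that SciPy distribution objects provide:

```python
import numpy as np
import pandas as pd
import scipy.stats

# hypothetical sample; the values are made up for illustration
data = pd.DataFrame({'x': [0.41, 0.95, 0.62, 0.88, 0.53]})

n = len(data['x'])
xbar = data['x'].mean()
s = data['x'].std()   # pandas returns the sample standard deviation (ddof=1)

conf_level = 0.99
df = n - 1
alpha = 1 - conf_level
E = scipy.stats.t.isf(alpha / 2, df) * s / np.sqrt(n)
print(f'by hand : ({xbar - E:.3f}, {xbar + E:.3f})')

# same interval in one call: centered at xbar, scaled by the standard error
low, high = scipy.stats.t.interval(conf_level, df, loc=xbar, scale=s / np.sqrt(n))
print(f'interval: ({low:.3f}, {high:.3f})')
```

Both lines should show the same interval, since `interval` places equal tails of $\alpha/2$ on either side.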


### Task 

Solve problem 7.2.18 using a $t$-table. Then check your work using the Jupyter cells above. 

## Sliders

Often, I find myself playing with cells such as the example above, where an output (here $E$) is computed from multiple inputs ($n$, mean, $s$, confidence level). I do this either to understand how the output behaves when the input changes, or to optimize the input for a particular purpose.

As long as I have to scroll up, change the input, and re-execute the Jupyter cells fewer than 10 times or so, that is fine, but beyond that the procedure becomes tiresome and clumsy.

A better option is ***sliders***. Sliders enable you to change a variable using the mouse and instantaneously use the new value in a computation, whose result you can simply print or even plot.

Example: Slider for sample size $n$ from $5$ to $50$, step $1$:

In [6]:
@interact(n=(5, 50, 1))
def print_sample_size(n=10):
    print(f'sample size is {n}')


### Task

Modify the `print_sample_size` function to also print the degrees-of-freedom $df$.

Second example: two sliders, one for $n$, one for the confidence level.

In [7]:
@interact(n=(5, 50, 1), conf_level=(90, 99.9, 0.1))
def sampling(n=10, conf_level=95):
    print(f'sample size is {n}, confidence level is {conf_level:.1f}%')


Let's combine the Student $t$ confidence interval calculation from above with sliders:

In [8]:
@interact(n=(5, 50, 1), conf_level=(90, 99.9, 0.1))
def sampling(n=10, conf_level=95):
    xbar=0.678
    s=0.234
    df = n-1
    alpha = 1-conf_level/100
    t_alpha_over_2 = rv.isf(alpha/2, df)
    E = t_alpha_over_2 * s / np.sqrt(n)
    print(f'Confidence interval: {xbar:.3f} ± {E:.3f}  or  ({xbar-E:.3f}, {xbar+E:.3f})')    

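
The tuples passed to `interact` above are a shorthand: `(min, max, step)` is turned into a slider automatically. The names `IntSlider` and `fixed`, imported at the top of this notebook, allow more control. The following is one possible sketch, not the only way to do it. The helper names `margin_of_error` and `show_interval` are made up for this example, and the import is guarded only so that the cell also runs where `ipywidgets` is not installed.

```python
import numpy as np
import scipy.stats


def margin_of_error(n, conf_level, s):
    """Margin of error E of a Student t interval; conf_level is in percent."""
    alpha = 1 - conf_level / 100
    return scipy.stats.t.isf(alpha / 2, n - 1) * s / np.sqrt(n)


def show_interval(n, conf_level, xbar, s):
    """Print the confidence interval for the current slider positions."""
    E = margin_of_error(n, conf_level, s)
    print(f'Confidence interval: {xbar:.3f} ± {E:.3f}  or  ({xbar-E:.3f}, {xbar+E:.3f})')


try:
    from ipywidgets import interact, fixed, IntSlider, FloatSlider
except ImportError:
    interact = None  # no widgets available; the two functions above still work

if interact is not None:
    interact(
        show_interval,
        n=IntSlider(value=10, min=5, max=50, step=1, description='n'),
        conf_level=FloatSlider(value=95, min=90, max=99.9, step=0.1,
                               description='conf. level'),
        xbar=fixed(0.678),  # fixed() passes a constant without creating a widget
        s=fixed(0.234),
    )
```

Keeping the calculation in a plain function means it can also be called without any widget, for example `margin_of_error(5, 99, 0.234)` for the sample from cell `In [3]`.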

### Task

Is the margin of error more sensitive to the sample size or to the confidence level? In what sense?

### Task

Proportion sliders: 
* Copy-and-paste the last Jupyter cell to the cell below. 
* Modify the sliders for input of $n$ and $\hat p$. 
* Check whether the sample is "large", and either print "large sample" or "too small sample".
* Compute the margin of error $E$ (careful, now we need a normal RV).
* Play with your sliders: For which proportions are small samples sufficient? For which proportions do we need rather large samples?

In [9]:
# define the distribution
rv=scipy.stats.norm

In [15]:
@interact(n=(5, 100, 1), p_hat=(0.02, 0.98, 0.02), conf_level=(90, 99.9, 0.1))
def sampling(n=40, p_hat=0.4, conf_level=95):
    sigma=np.sqrt(p_hat*(1-p_hat)/n)
    low=p_hat-3*sigma
    high=p_hat+3*sigma
    print(f'3-sigma interval: [{low:.3f}, {high:.3f}]')
    if low > 0 and high < 1:
        print('large sample')
    else:
        print('too small sample')
    alpha = 1-conf_level/100
    z_alpha_over_2 = rv.isf(alpha/2)
    E = z_alpha_over_2 * sigma
    print(f'Confidence interval: {p_hat:.3f} ± {E:.3f}  or  ({p_hat-E:.3f}, {p_hat+E:.3f})')

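
The "large sample" check in the cell above requires the 3-sigma interval $\hat p \pm 3\sqrt{\hat p(1-\hat p)/n}$ to lie strictly inside $(0, 1)$. Solving the two inequalities for $n$ gives $n > 9(1-\hat p)/\hat p$ and $n > 9\hat p/(1-\hat p)$. The sketch below evaluates the larger of these two bounds for a few arbitrarily chosen proportions. The helper names `is_large_sample` and `large_sample_bound` are hypothetical and introduced only for this example:

```python
import numpy as np


def is_large_sample(n, p_hat):
    """3-sigma criterion from the cell above: interval strictly inside (0, 1)."""
    sigma = np.sqrt(p_hat * (1 - p_hat) / n)
    return bool(p_hat - 3 * sigma > 0 and p_hat + 3 * sigma < 1)


def large_sample_bound(p_hat):
    """n must exceed this value for the 3-sigma criterion to hold."""
    return 9 * max(p_hat / (1 - p_hat), (1 - p_hat) / p_hat)


for p_hat in (0.02, 0.1, 0.4, 0.5, 0.9):
    print(f'p_hat = {p_hat:.2f}: need n > {large_sample_bound(p_hat):.1f}')
```

The bound is smallest at $\hat p = 0.5$ and grows quickly as $\hat p$ approaches $0$ or $1$, which matches what the sliders show.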