### Confidence intervals (CIs)

A [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) is a range of values, computed from the statistics of observed data, that is likely to contain the true value of an unknown population parameter. The confidence *level* (for example, 95% or 99%) describes how reliable the procedure is: if we repeated the sampling many times and built a confidence *interval* from each sample, about that fraction of the intervals would contain the true population parameter. Two scores are commonly used to determine the CIs:
1. **Z-score**: it tells us how many standard deviations an observation is away from the mean of a distribution. For example, in investing, Z-scores are measures of an observation's variability and can be put to use by traders in determining market volatility.
2. **t-score**: like the z-score, it standardizes a score so that it can be compared to other scores, but it relies on the sample standard deviation rather than the population one. In a t-test, the t-score is the ratio of the difference between two groups or samples to the variability within them.

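Note that, as a rule of thumb, the **z-score** is used when the sample size $N$ satisfies $N\geq30$, while the **t-score** is used otherwise. In some situations the data may be non-normal even with more than 30 observations. In that case the z-score is still preferred, because for large samples the central limit theorem makes the sample mean approximately normal.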
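As a rough illustration of why the sample size matters in this rule of thumb, the sketch below compares the two-sided 95% critical value of the standard normal distribution with that of Student's t for a few sample sizes. The sample sizes are arbitrary choices made for this example, and the degrees of freedom are assumed to be $N-1$.

```python
import scipy.stats as stats

confidence = 0.95
alpha = 1 - confidence

# Two-sided critical value of the standard normal distribution
z_crit = stats.norm.ppf(1 - alpha / 2)

# Two-sided critical values of Student's t for a few illustrative sample sizes
# (degrees of freedom assumed to be N - 1)
sample_sizes = [5, 10, 30, 100, 1000]
t_crits = {n: stats.t.ppf(1 - alpha / 2, df=n - 1) for n in sample_sizes}

print(f"z critical value: {z_crit:.3f}")
for n, t_crit in t_crits.items():
    print(f"N = {n:4d}  t critical value: {t_crit:.3f}")
```

The t critical value is always larger than the z one, and the gap shrinks as $N$ grows, which is why the two approaches give nearly the same interval for large samples.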

#### One example: uncertainty on $\mu$

![](images/conf_interval.png)

Given a *standard* normal distribution (i.e., $\mu=0$ and $\sigma^2=1$), a z-score is the value $z_r$ such that $P(-z_r \leq Z \leq z_r) = r$. In other words, we set a target probability $r$ and look for the value $z_r$ that yields it. For example, if the true mean is zero, a standardized sample mean falls in $[-1.645, 1.645]$ 90% of the time. Indeed, $P(-1.645 \leq Z \leq 1.645) = 0.9$.
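The 90% example can be checked numerically. The sketch below, assuming `scipy.stats.norm` for the standard normal distribution, recovers $z_r$ from the target probability $r$ and then verifies the probability mass between $-z_r$ and $z_r$.

```python
import scipy.stats as stats

r = 0.90  # target probability

# z_r leaves (1 - r) / 2 of the probability in each tail
z_r = stats.norm.ppf(1 - (1 - r) / 2)

# Probability mass of the standard normal between -z_r and z_r
prob = stats.norm.cdf(z_r) - stats.norm.cdf(-z_r)

print(f"z_r = {z_r:.3f}, P(-z_r <= Z <= z_r) = {prob:.3f}")
```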

We can define the z-score as
$$
z_{score} = \frac{x-\mu}{\sigma}
$$
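As a small sketch of this definition, the z-score of a few observations can be computed directly with NumPy. The population parameters and the observations below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical population parameters (assumed known for this example)
mu, sigma = 100.0, 15.0

# Hypothetical observations
x = np.array([85.0, 100.0, 130.0])

# Number of standard deviations each observation lies from the mean
z = (x - mu) / sigma
print(z)
```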

Under the normality assumption

$$P\Big( \bar{X} - 1.96\cdot\frac{\sigma}{\sqrt{n}}\ \le \mu \le \ \bar{X} + 1.96\cdot\frac{\sigma}{\sqrt{n}} \Big) = 0.95 $$

defines a confidence interval of 95%. In general, by the central limit theorem and for reasonably large sample size $n$, the above equation is still approximately true. 

Then a 95% CI for $\mu$, when $\sigma$ is known, is:

$$\Big( \bar{X} - 1.96\cdot\frac{\sigma}{\sqrt{n}}\ ,\ \bar{X} + 1.96\cdot\frac{\sigma}{\sqrt{n}} \Big)$$  
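One way to get a feeling for what the 95% means is a small simulation: draw many samples from a population with known $\mu$ and $\sigma$, build the interval for each, and count how often it contains $\mu$. The sketch below does this under assumed values for the population parameters, the sample size and the number of repetitions, none of which come from the text.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=0)

# Assumed population parameters and simulation settings (illustrative only)
mu, sigma = 10.0, 2.0
n, n_trials = 50, 10_000

z = stats.norm.ppf(0.975)  # approximately 1.96

# Each row is one sample of size n drawn from the normal population
samples = rng.normal(loc=mu, scale=sigma, size=(n_trials, n))
x_bar = samples.mean(axis=1)

# Interval with known sigma: x_bar +/- z * sigma / sqrt(n)
half_width = z * sigma / np.sqrt(n)
covered = (x_bar - half_width <= mu) & (mu <= x_bar + half_width)

coverage = covered.mean()
print(f"Fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should be close to 0.95, up to simulation noise.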

More generally, if we define $z_{\alpha/2}$ as the value that cuts off an area of $\alpha/2$ in the upper tail of the standard normal distribution, we can define a $1-\alpha$ confidence interval for the population mean $\mu$ as:

$$\Big( \bar{X} - z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}\ ,\ \bar{X} + z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}} \Big)$$

To calculate $z_{\alpha/2}$ for a 95% confidence interval we can use the `.interval()` or the `.ppf()` method of `scipy.stats.norm`.

In [1]:
import scipy.stats as stats
import numpy as np

Let's say we want a confidence level of 95% for our results. This means taking $\alpha=0.05$ (5%). In other words, we accept a 5% probability that the interval we build does not contain the true parameter (for example, that $\mu$ lies outside the interval around $\hat{x}$).

For the standard normal distribution:

In [5]:
print(f"Confidence Interval: {stats.norm.interval(.95)}")
print(f"Upper z-score:\t{stats.norm.ppf(1 - (1 - .95)/2):.3f}") # ppf = percent point function
print(f"Lower z-score:\t{stats.norm.ppf((1 - .95)/2):.3f}")

Confidence Interval: (-1.959963984540054, 1.959963984540054)
Upper z-score:	1.960
Lower z-score:	-1.960


Let's say now that we measured the mean and standard deviation of a data sample with the following results:
- $\hat{x}=135$ (sample mean)
- $s=25$ (sample standard deviation)
- $n=56$ (sample size)

We want to find the range in which the mean ($\mu$) of the population this sample comes from is likely to lie, given $\hat{x}$. We allow an error of at most 5%, i.e. $\alpha=0.05$. Note that the underlying assumption is that the data is normally distributed, and that, since $n$ is large, $s$ is used in place of $\sigma$. Then:

In [11]:
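x_hat = 135.   # sample mean
s = 25.        # sample standard deviation
n = 56         # sample size
alpha = .05    # 1 - confidence level
zeta_alpha_half = stats.norm.ppf(1 - alpha/2)  # upper-tail critical value z_{alpha/2}

print(f"CI(95%): {x_hat - zeta_alpha_half * s / np.sqrt(n), x_hat + zeta_alpha_half * s / np.sqrt(n)}")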

CI(95%): (128.45221989235253, 141.54778010764747)


In [14]:
stats.norm.interval?

In [15]:
print(f'Calculation with .interval(): {stats.norm.interval(confidence=1-alpha, loc=x_hat, scale=s/np.sqrt(n))}')

Calculation with .interval(): (128.45221989235253, 141.54778010764747)


To make the difference between *z-score* and *t-score* clear: the former is used when $\sigma$ is known, while the latter is used when $\sigma$ is *un*known and has to be estimated from the sample. It follows that we can write:
$$
\mu = \hat{x} \pm z \frac{\sigma}{\sqrt{n}}
$$
where $z$ is the *z*-score.

And
$$
\mu = \hat{x} \pm t_\nu \frac{s}{\sqrt{n}}
$$
where $t_\nu$ is the *t*-score. Here $s$ is the sample standard deviation, and $t_\nu$ depends on the number of degrees of freedom, $\nu = n - 1$.
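As a sketch of the second formula, the interval for the sample used earlier ($\hat{x}=135$, $s=25$, $n=56$) can be recomputed with the *t*-score, assuming $\nu = n-1$ degrees of freedom, and compared with the *z*-based interval. The variable names are chosen for this example only.

```python
import numpy as np
import scipy.stats as stats

x_hat, s, n = 135.0, 25.0, 56
alpha = 0.05
nu = n - 1  # degrees of freedom (assumed n - 1)

# t-based interval computed by hand: x_hat +/- t_nu * s / sqrt(n)
t_crit = stats.t.ppf(1 - alpha / 2, df=nu)
half_width = t_crit * s / np.sqrt(n)
ci_t = (x_hat - half_width, x_hat + half_width)

# Same interval from scipy, and the z-based one for comparison
ci_t_scipy = stats.t.interval(1 - alpha, df=nu, loc=x_hat, scale=s / np.sqrt(n))
ci_z = stats.norm.interval(1 - alpha, loc=x_hat, scale=s / np.sqrt(n))

print(f"t-based CI(95%): ({ci_t[0]:.3f}, {ci_t[1]:.3f})")
print(f"z-based CI(95%): ({ci_z[0]:.3f}, {ci_z[1]:.3f})")
```

The *t*-based interval is slightly wider than the *z*-based one, reflecting the extra uncertainty from estimating $\sigma$ with $s$.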

Note that, while the population and sample means are calculated in the same way (with $N$ the population size and $n$ the sample size), the population and sample variances are different:
$$
\mu=\frac{\sum_i x_i}{N}, \qquad \hat{x}=\frac{\sum_i x_i}{n}
$$
but
$$
\begin{aligned}
\sigma^2 &= \frac{\sum_i (x_i-\mu)^2}{N} \\
s^2 &= \frac{\sum_i (x_i-\hat{x})^2}{n-1}
\end{aligned}
$$

This needs to be taken into account when calculating the *z*- and *t*-scores.
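In NumPy this distinction is controlled by the `ddof` argument of `np.var` and `np.std`: the default `ddof=0` divides by the number of values, while `ddof=1` divides by that number minus one. A minimal sketch on made-up data:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(x)             # ddof=0: divides by len(x)
sample_var = np.var(x, ddof=1)  # ddof=1: divides by len(x) - 1

print(f"Population variance: {pop_var:.3f}")
print(f"Sample variance:     {sample_var:.3f}")
```

When estimating $s$ from a sample for a *t*-based interval, `ddof=1` is the appropriate choice.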