## Confidence Intervals

- in an experiment, we can't observe the variables' true values directly
  - we observe other values
  - we make assumptions as to how they are distributed
  - we can estimate the true value
  - law of large numbers: when our sample is big enough, the sample parameters approach the population parameters
- with continuous values, it's useless to say that the mean is equal to a certain value (why?)
- **confidence interval** - a range of values that we're fairly sure contains the true value
  - how confident? A matter of choice
- **confidence level** - the probability that the value falls within the interval 

## Confidence Intervals - Interpretation

- similar to the probability interpretations
- to illustrate these, let's take a confidence interval [5; 7,3] and a 70% confidence level
- frequency
  - if we perform the experiment many times, 70% of the values will fall in the interval [5; 7,3] and 30% - outside it
- certainty of next trial
  - next time we perform the experiment, we are 70% certain that the value will fall within [5; 7,3]
  - note that this is a statement **about the interval**, not about the value
- typically used confidence levels: 50%; 90%; 95%; 99,7%


## Confidence Intervals and Z-Scores

- observe the Z-distribution (Gaussian, $ \mu = 0, \sigma = 1$)
- what's the probability that a value drawn from it $x \in [-2; 1]$?
  - this corresponds to the shaded area in the graph
  - the cumulative function gives us the area to the left of some value
  - shaded area = $cdf(1) - cdf(-2) = 0,819 = 81,9\%$
- interpretations
  - if we draw many random numbers from the Z-distribution, we expect that 81,9% of them will be in [-2; 1]
  - if we draw one random number, there is 81,9% chance of it being in [-2; 1]
- commonly used intervals
  - $1\sigma \rightarrow 68,27\%; 2\sigma \rightarrow 95,45\%; 3\sigma \rightarrow 99,73\% $
  - also $1,96\sigma \rightarrow 95\%$
![Confidence Z-Scores](confidence-z-scores.png)
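
The areas quoted above can be checked numerically. The following is a minimal sketch using `scipy.stats.norm.cdf`; the helper name `central_area` is an assumption made for illustration and does not come from the original notes.

```python
from scipy import stats as st

# Area under the standard normal curve between -2 and 1
shaded_area = st.norm.cdf(1) - st.norm.cdf(-2)
print(round(shaded_area, 3))  # 0.819


def central_area(k):
    """Probability that a standard normal value lies within k std's of the mean."""
    return st.norm.cdf(k) - st.norm.cdf(-k)


# Commonly used intervals: 1, 2, 3 and 1.96 standard deviations
for k in (1, 2, 3, 1.96):
    print(k, round(central_area(k), 4))
```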

## Confidence Intervals Example

- note that once again we need to subtract the left white region
  - area of shaded region: $p$ (e.g. $p = 0,95$)
  - area of both tails: $1 - p$
  - percentage point of left tail: $\frac{1-p}{2}$
  - percentage point of right tail: $\frac{1-p}{2} + p = \frac{1+p}{2}$
  
```python
import scipy.stats as st

def get_real_confidence_interval(probability, mean, std):
    # Percentage points of the left and right tails
    lower_area = (1 - probability) / 2
    upper_area = (1 + probability) / 2
    return [
        st.norm.ppf(lower_area, mean, std),
        st.norm.ppf(upper_area, mean, std)]
```
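
As a quick sanity check of the percentage-point formulas, a 95% interval on the Z-distribution should come out at roughly $\pm 1,96$. The sketch below is self-contained and assumes nothing beyond `scipy.stats`; the variable names are chosen for illustration. It also compares the result with SciPy's own `norm.interval` helper.

```python
from scipy import stats as st

probability, mean, std = 0.95, 0, 1

# Percentage points of the left and right tails
lower = st.norm.ppf((1 - probability) / 2, mean, std)
upper = st.norm.ppf((1 + probability) / 2, mean, std)
print(lower, upper)  # roughly -1.96 and 1.96

# SciPy's built-in helper returns the same central interval
builtin_lower, builtin_upper = st.norm.interval(probability, loc=mean, scale=std)
print(builtin_lower, builtin_upper)
```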

## Testing Hypotheses

## Hypotheses

- after performing an experiment and getting data, the scientific method requires that we form a hypothesis
  - fact, law, theory and hypothesis are different terms
- in the simplest case, we have two hypotheses
  - null hypothesis ($H_0$) - the status quo is real, "nothing interesting happens"
  - alternate hypothesis ($H_1$) - what we're trying to demonstrate
- types of hypotheses
  - attributive - something exists and can be measured
  - associative - there is a relationship between two behaviors
  - causal - differences in the amount / kind of one behavior cause differences in other behaviors

## Hypotheses - Examples

- examples of hypotheses - study of Disneyland visitors
  - attributive
    - most of the population has heard of Disneyland
    - Disneyland visitors are diverse in demographics
  - associative
    - income level is correlated with visiting Disneyland
    - people who live closer to Disneyland are more apt to visit Disneyland
  - causal
    - frequent exposure to Disneyland advertising results in increased attendance
    - discounting tickets for local residents produces an increase in visitor numbers
- note that attributive hypotheses involve one variable (univariate) while associative and causal hypotheses involve two variables (bivariate)

## Testing a Hypothesis

- in random experiments, we have error sources
  - human error, systematic error, random errors, etc.
- we cannot prove (or reject) a hypothesis with complete certainty
- the errors we can make are two types
  - Type I error - reject $H_0$ while it's true (false positive)
  - Type II error - accept $H_0$ while $H_1$ is true (false negative)
- the possible results can be summarized in the following truth table
  - also called confusion matrix
![Confusion Matrix](confusion-matrix.png)  

- to measure the probability of producing a wrong hypothesis, we use a test statistic - measure of deviations from $H_0$
  - different tests produce different measures (statistics)
  - we accept or reject the null hypothesis based on the value of the test statistic
- let's denote the probability of getting a type I error with $\alpha$
  - each value of the selected test statistic has a corresponding alpha-value
  - we perform the experiment, get data and calculate the test statistic value
  - from that, we calculate the corresponding alpha-value
  - we reject the null hypothesis if $\alpha < \alpha_c$, where $\alpha_c$ is a **critical significance level** chosen in advance


## Z-test

- A Z-test uses the Z-statistic
- $H_0$: standard normal distribution
- example: light bulb factory
  - a factory produces light bulbs with lifetime $X \thicksim N(\mu = 500h, \sigma = 50h)$
  - a sample of 25 bulbs has a mean lifetime $ \bar{x} = 480h$
  - is there something wrong with the production line?
- forming hypotheses
  - $H_0$: The production line works normally, the observed deviation of the sample mean from the population mean is due to chance
  - $H_1$: The production line is broken
- suppose we take a lot of samples from the entire population
  - each sample mean will be different
  - the distribution of sample means will be more or less Gaussian
    - parameters (our best estimate): $\mu_{\bar{x}} = \mu, \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
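    - why these parameters: the mean of $n$ independent observations has variance $\frac{\sigma^2}{n}$, and by the central limit theorem its distribution is approximately Gaussian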
- if $H_0$ is correct, we assume that: $\bar{x} \thicksim N (\mu, \frac{\sigma}{\sqrt{n}})$
- Z-statistic
  - $Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{480 - 500}{50 / \sqrt{25}} = -2 $
- we can see that we are 2 std's below the mean
- how extreme is that?
  - what's the probability that we get results **as extreme or more extreme** than we observed, assuming the null hypothesis is true?
    - less than 5% 
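
The claim about the distribution of sample means can be checked with a small simulation. This is only a sketch: the seed and the number of simulated samples (10 000) are arbitrary assumptions, and the data are generated rather than taken from a real factory.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
mu, sigma, n = 500, 50, 25

# 10 000 simulated samples of 25 bulbs each, one mean per sample
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())  # close to mu = 500
print(sample_means.std())   # close to sigma / sqrt(n) = 10

# Z-statistic for the observed sample mean of 480 h
standard_error = sigma / np.sqrt(n)
z = (480 - mu) / standard_error
print(z)  # -2.0
```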

## Two-tailed Z-test 

- we can get the confidence interval from the Z-statistic 
- we are looking for more extreme values
  - values outside the confidence interval
  - what's the probability $P(|Z| \geq 2)$?
  - we're looking for a value different than the mean
    - we **can't assume** whether it's smaller or larger
    - therefore, we have to look at both "tails" 
- if we assume a significance level $\alpha_c$ of 5%, the results are significant
  - the p-value is $P(|Z| \geq 2) \approx 0,0455 = 4,55\%$
  - we can reject $H_0$ at the 5% level
  - even at lower levels, down to 4,55%
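
The two-tailed probability can be computed directly from the Z-distribution. A minimal sketch, assuming the Z-statistic of -2 from the light bulb example; `norm.sf` is SciPy's survival function, i.e. $1 - cdf$.

```python
from scipy import stats as st

z = -2.0        # Z-statistic from the light bulb example
alpha_c = 0.05  # chosen significance level

# Two-tailed: probability in both tails beyond |z|
p_two_tailed = 2 * st.norm.sf(abs(z))
print(round(p_two_tailed, 4))  # 0.0455
print(p_two_tailed < alpha_c)  # True
```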


## One-tailed Z-test

- the same logic applies, but now we're looking at one tail only
- question: Is the lifespan **significantly lower** than it should be?
- cutoff point: $\alpha_c = 5\%, Z = -2$
  - $P(Z \leq -2) = \frac{0,0455}{2} = 0,02275 = 2,275\% < \alpha_c$
  - answer: Yes, at the given significance level
- question: Is the lifespan significantly higher than it should be?
  - $P(Z \geq -2) = 97,725\% > \alpha_c$
  - answer: No, at the given significance level
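
Both one-tailed questions can be answered with the same two SciPy calls. A short sketch, again assuming $Z = -2$ and $\alpha_c = 5\%$ from the example above:

```python
from scipy import stats as st

z = -2.0
alpha_c = 0.05

p_lower = st.norm.cdf(z)   # P(Z <= -2): is the lifespan significantly lower?
p_higher = st.norm.sf(z)   # P(Z >= -2): is the lifespan significantly higher?

print(round(p_lower, 5), p_lower < alpha_c)    # 0.02275 True
print(round(p_higher, 5), p_higher < alpha_c)  # 0.97725 False
```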

## t-test

- the Z-test requires that we know the standard deviation of the population 
  - Usually not available
- we can use another test statistic, called **t**
- advantages over the Z-test
  - we don't need to know the population $\sigma$
  - it's better when we have very small sample sizes (e.g., n < 30)
  - it can be used for testing the mean of a sample against a standard, but also for comparing two means
    - we can see whether two sets of data are significantly different from each other
- null hypothesis: The test statistic follows Student's t-distribution
  - similar to Gaussian distribution, with "fatter" tails 


## One-Sample t-test

- the details of the calculation are fairly complex but we can do this in code
  - using scipy.stats
- first, we generate 100 random numbers with $ \mu = 5, \sigma = 10$
- then we ask whether the sample mean is equal to the true mean (and other values, just for testing)
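- we get the p-value - the probability of observing a sample mean at least as far from the given mean as ours, assuming the null hypothesis is true
  - i.e. a small p-value is evidence that the true mean differs from the given mean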

```python
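import scipy.stats as st

# 100 draws from N(mean=5, std=10); the p-values below come from one run and will vary
sample_data = st.norm.rvs(5, 10, 100)
print(st.ttest_1samp(sample_data, 5).pvalue) # 0.9301
print(st.ttest_1samp(sample_data, 4).pvalue) # 0.3352
print(st.ttest_1samp(sample_data, 0).pvalue) # 1.104e-6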
```
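
For reference, the calculation behind a one-sample t-test is short enough to write out by hand. The sketch below assumes a two-sided test and uses generated data with an arbitrary seed; it computes $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ and compares the result with `ttest_1samp`.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)   # arbitrary seed
sample = rng.normal(5, 10, 100)  # generated data, as in the example above
mu_0 = 4                         # hypothesised mean

n = len(sample)
# Sample standard deviation (ddof=1) replaces the unknown population sigma
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu_0) / standard_error

# Two-sided p-value from Student's t-distribution with n - 1 degrees of freedom
p_manual = 2 * st.t.sf(abs(t_manual), df=n - 1)

result = st.ttest_1samp(sample, mu_0)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```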


## Independent Two-Sample t-test

- we compare two independent distributions
  - we want to see whether they have the same mean
  - we assume equal variances (scipy can also do tests with unequal variances - important when sample sizes differ)
- example: Grain size
  - we are given data (in grain_data.csv) of grain sizes from two different farms
  - do they differ significantly (at the 95% level)?
  - we can also plot histograms to see what the distributions look like
  
```python  
grain_data = ...
st.ttest_ind(grain_data.GreatNorthern, grain_data.BigFour) 
# Ttest_indResult(statistic=1.312336706487564,
#                 pvalue=0.20792200785311768)
```
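
When the equal-variance assumption is doubtful, `ttest_ind` accepts `equal_var=False`, which performs Welch's t-test instead. The sketch below uses generated data for two hypothetical farms (`farm_a`, `farm_b`), not the grain data set, so the numbers are purely illustrative.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(1)      # arbitrary seed
farm_a = rng.normal(10.0, 2.0, 40)  # hypothetical data: larger sample, small spread
farm_b = rng.normal(10.5, 4.0, 15)  # hypothetical data: smaller sample, large spread

pooled = st.ttest_ind(farm_a, farm_b)                  # assumes equal variances
welch = st.ttest_ind(farm_a, farm_b, equal_var=False)  # Welch's t-test
print(pooled.pvalue, welch.pvalue)

alpha_c = 0.05
print("reject H0" if welch.pvalue < alpha_c else "cannot reject H0")
```

With unequal sample sizes and spreads the two variants generally give different statistics, which is why the choice matters.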

## Paired Two-Sample t-test

- we compare two distributions
  - observations in samples can be paired
  - examples - before / after observations; comparison between two different treatments applied to the same subjects
- example: Drinking water
  - we are given data (in water_data.csv) of Zn concentration in surface and bottom water at 10 different locations
  - does the true average concentration in bottom water exceed that of surface water?
  - we use a paired t-test because the samples are from the same locations
  - it reduces experimental error (and provides stronger evidence)
  
```python  
water_data = ... 
# We use a one-tailed t-test
st.ttest_rel(water_data.surface, water_data.bottom).pvalue / 2 
# 0.00044555772891127738 
```
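
Halving the two-sided p-value is only valid when the observed difference points in the direction of the alternative hypothesis. Recent SciPy versions (1.6 and later) also accept an `alternative` argument that states the direction explicitly. The sketch below uses generated data in place of the water data set; the arrays `surface` and `bottom` are hypothetical.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(2)                  # arbitrary seed
surface = rng.normal(0.50, 0.10, 10)            # hypothetical concentrations
bottom = surface + rng.normal(0.08, 0.03, 10)   # bottom is higher by construction

two_sided = st.ttest_rel(surface, bottom)
# H1: mean(surface) < mean(bottom), i.e. bottom exceeds surface
one_sided = st.ttest_rel(surface, bottom, alternative="less")
print(two_sided.pvalue / 2, one_sided.pvalue)

# A paired t-test is a one-sample t-test on the pairwise differences
differences = surface - bottom
same_test = st.ttest_1samp(differences, 0)
print(two_sided.statistic, same_test.statistic)
```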

## Generalizations to More Variables

- sometimes it's not enough to compare two distributions
  - we may want to compare multiple distributions against the same null hypothesis
  - e.g. how is the percentage of smokers distributed by income and age?
- other times, we create a model and want to evaluate it
  - e.g. a linear regression
  - we can explain some of the variance in the sample
- there are other tests to perform these "checks"
  - ANOVA (Analysis of Variance) - useful for grouped data
    - observe the variance inside groups and between groups
  - chi-square(d) test - can be applied to categorical data
    - two common types
      - how good a model is (goodness of fit)
      - whether two variables are independent

## Analysis of Variance (ANOVA)

- we want to compare several groups
- $H_0$: The means of the groups are the same
- method (scipy.stats.f_oneway())
  - for each group $\Rightarrow$ group mean
    - in-group variance: distances from an individual point to the group mean
    - between-group variance: distances between the means of two groups
  - for the entire data $\Rightarrow$ total mean (mean of all data)
    - also equal to the mean of all group means (when the groups have equal sizes)
    - total variance: in-group + between-group
- F-statistic (Fisher)
  - $F = \frac{\text{variance between groups}}{\text{variance within groups}}$
  - F - large $\Rightarrow$ the variance between groups dominates
  - for each value of F, there's a corresponding p-value
  - if $p \leq p_c$, we can reject $H_0$ 
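
The F-statistic can be computed by hand and compared with `scipy.stats.f_oneway`. This is a sketch on generated data: the three groups, their means and the seed are assumptions made for illustration. The between-group and within-group sums of squares are divided by their degrees of freedom ($k - 1$ and $N - k$) to get the two variances.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(3)  # arbitrary seed
# Three hypothetical groups of 20 observations each
groups = [rng.normal(loc, 1.0, 20) for loc in (5.0, 5.5, 7.0)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Between-group: squared distances from each group mean to the total mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group: squared distances from each point to its group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = st.f.sf(f_manual, k - 1, n_total - k)

result = st.f_oneway(*groups)
print(f_manual, result.statistic)
print(p_manual, result.pvalue)
```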

## Chi-Squared ($\chi^2$) Test

- compares expected (predicted) and observed frequencies
  - is there a significant difference between these?
  - this is a goodness-of-fit measure
    - how well were we able to predict
- statistic: $\chi^2 = \sum \frac{(f_{observed} - f_{estimated})^2}{f_{estimated}}$ (summed over all categories)
- $H_0$: No significant difference between observed and estimated 
- the test returns the value of the statistic and the p-value corresponding to it
- works the same as any other test
- python: scipy.stats.chisquare()
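
A goodness-of-fit check might look like the sketch below. The observed counts are made up for illustration (120 rolls of a die), and the expected frequencies assume a fair die; the statistic is computed by hand and with `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats as st

observed = np.array([18, 22, 16, 25, 19, 20])  # hypothetical counts from 120 rolls
expected = np.full(6, observed.sum() / 6)      # fair die: 20 per face

# Sum of (observed - expected)^2 / expected over all categories
chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(chi2_manual)  # 2.5

result = st.chisquare(observed, f_exp=expected)
print(result.statistic, result.pvalue)

alpha_c = 0.05
print("reject H0" if result.pvalue < alpha_c else "cannot reject H0")
```

For the other common use, testing whether two categorical variables are independent, SciPy provides `scipy.stats.chi2_contingency`, which takes a table of observed counts.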
