
# Review CLT, Confidence Intervals, and Hypothesis Testing


---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Boston Housing dataset

Information about the Boston Housing dataset can be found [here](https://www.kaggle.com/datasets/simpleparadox/bostonhousingdataset).


In [2]:
# Read in the dataset
data = pd.read_csv('datasets/boston.csv')

# we will only explore the NOX and AGE variables
NOX = data['NOX']
AGE = data['AGE']
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


### 1. Find the mean, standard deviation, and the standard error of the mean for variable `AGE`

In [3]:
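print("Mean:", np.mean(AGE))
# Note: np.std defaults to ddof=0 (the population formula);
# np.std(AGE, ddof=1) would give the sample standard deviation instead.
print("Standard Deviation:", np.std(AGE))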

Mean: 68.57490118577076
Standard Deviation: 28.121032570236885


In [6]:
# scipy standard error function
from scipy.stats import sem
print("Standard Error:", sem(AGE))

Standard Error: 1.2513695252583041
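
As a check on what `sem` computes, the sketch below rebuilds the standard error by hand: the sample standard deviation (`ddof=1`) divided by the square root of the sample size. It uses synthetic data as a stand-in so that it runs without `datasets/boston.csv`; the names `values` and `manual_sem` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.stats import sem

# Synthetic stand-in for a column such as AGE (assumption: not the real data)
rng = np.random.default_rng(42)
values = rng.uniform(0, 100, size=506)

n = len(values)
sample_std = np.std(values, ddof=1)    # sample standard deviation
manual_sem = sample_std / np.sqrt(n)   # standard error of the mean

print("Manual SEM:", manual_sem)
print("scipy SEM: ", sem(values))      # sem also uses ddof=1 by default
```

The two printed values should match, which also shows why the standard error shrinks as the sample grows: the denominator is the square root of `n`.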


### 2. Generate a 90%, 95%, and 99% confidence interval for `AGE`

You can use the `scipy.stats.t.interval` function to calculate the endpoints of a confidence interval.

```python
# Endpoints of the central range that contains a `confidence` proportion of the distribution
stats.t.interval(confidence, df, loc=0, scale=1)
```

Arguments:
- `confidence` = the confidence level, between 0 and 1
- `df` = the degrees of freedom: the sample size minus 1
- `loc` = the center of the t-distribution (your point estimate, here the sample mean of the variable)
- `scale` = the scale of the t-distribution (the standard error of your sample mean)

**Interpret the results from all three confidence intervals.**

In [7]:
import scipy
import scipy.stats as stats
from scipy.stats import t

In [13]:
df = len(AGE) - 1

confidence_levels = [0.90, 0.95, 0.99]

for confidence in confidence_levels:
    confidence_percentage = int(confidence * 100)
    print("\n{}% Confidence Interval:".format(confidence_percentage))
    interval = stats.t.interval(confidence, df, np.mean(AGE), sem(AGE))
    print(interval)


90% Confidence Interval:
(66.51279866704186, 70.63700370449965)

95% Confidence Interval:
(66.11636971854321, 71.0334326529983)

99% Confidence Interval:
(65.33936041834139, 71.81044195320013)
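
To make the mechanics behind intervals like these explicit, here is a minimal sketch that builds each interval by hand as mean ± (critical t value × standard error) and compares it with `stats.t.interval`. It runs on a synthetic sample (an assumption, chosen only so the block is self-contained), so its numbers will not match the `AGE` output above.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for AGE (assumption: not the real data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=70, scale=28, size=506)

mean = sample.mean()
se = stats.sem(sample)
df = len(sample) - 1

widths = []
for confidence in [0.90, 0.95, 0.99]:
    # Critical value that leaves (1 - confidence) / 2 in each tail
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
    manual = (mean - t_crit * se, mean + t_crit * se)
    builtin = stats.t.interval(confidence, df, loc=mean, scale=se)
    assert np.allclose(manual, builtin)
    widths.append(manual[1] - manual[0])
    print(f"{round(confidence * 100)}% interval: ({manual[0]:.3f}, {manual[1]:.3f})")
```

The critical value grows with the confidence level, so the 99% interval is wider than the 95% interval, which is wider than the 90% interval. That is the trade-off to keep in mind when interpreting them: more confidence that the interval captures the true mean comes at the cost of a less precise range.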


### 3. Did you rely on the Central Limit Theorem in question 2? Why or why not? Explain.

Yes. The Central Limit Theorem (CLT) played an implicit role in question 2.

The CLT states that if we draw sufficiently large random samples from a population with mean μ and finite standard deviation σ, the sampling distribution of the sample mean is approximately normal, with mean μ and standard deviation σ/√n. This holds whether the population itself is normal or skewed, as long as the sample size is large enough; n > 30 is a common rule of thumb, although heavily skewed populations can need more.

In question 2 we built confidence intervals around the mean of `AGE`, which is a sample statistic. With 506 observations in the dataset, the CLT tells us that the sampling distribution of that mean is approximately normal, regardless of how `AGE` itself is distributed.

This approximate normality is what justifies using the t-distribution (when the population standard deviation is unknown and estimated from the sample, as here) or the z-distribution (when it is known) to calculate the intervals. Both rest on the sample mean being approximately normally distributed, not on the raw data being normal, and that is where the CLT enters our approach.
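
A small simulation can illustrate the claim. The sketch below draws many samples from a strongly skewed population (an exponential distribution, chosen here as an assumption for illustration) and looks at the distribution of their means. The sample size of 50 and the 10,000 repetitions are arbitrary choices.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

n = 50               # size of each sample
n_samples = 10_000   # number of repeated samples

# Exponential population with scale 1: mean 1, standard deviation 1, skewness 2
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print("Mean of sample means:    ", sample_means.mean())
print("Std of sample means:     ", sample_means.std(ddof=1))
print("CLT prediction (1/sqrt n):", 1.0 / np.sqrt(n))
print("Skewness of sample means:", skew(sample_means))
```

The sample means should cluster around the population mean with a spread close to σ/√n, and their skewness should be far smaller than the population's skewness of 2, even though no individual sample looks normal.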

### 4. For the variable `NOX`, generate a 95% confidence interval and interpret it.

In [14]:
df = len(NOX) - 1
nox_mean = np.mean(NOX)
# Use a new name so we don't overwrite the imported `sem` function
nox_sem = sem(NOX)

conf_interval = stats.t.interval(0.95, df, nox_mean, nox_sem)
print(conf_interval)

(0.5445742622921801, 0.5648158562848951)
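
One way to interpret this interval: if we repeated the sampling many times and built a 95% interval from each sample, about 95% of those intervals would contain the true mean `NOX`. The sketch below checks that repeated-sampling reading on a synthetic normal population whose mean is known. The population values, sample size, and number of trials are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_mean, true_sd = 0.55, 0.12   # assumed population values, for illustration
n, n_trials = 100, 2000

# One row per simulated sample
samples = rng.normal(true_mean, true_sd, size=(n_trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# 95% t-interval for every simulated sample
t_crit = stats.t.ppf(0.975, n - 1)
lower = means - t_crit * sems
upper = means + t_crit * sems

coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print("Share of intervals containing the true mean:", coverage)
```

The printed share should land close to 0.95. Note that the 95% describes the procedure, not any single interval: a given interval either contains the true mean or it does not.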


### 5. For the variable `NOX`, we are going to test the hypothesis that the (true) mean is equal to the median in the sample

Here we run a one-sample hypothesis test on the mean.
These are the steps:
1. Define hypothesis
2. Set alpha (Let alpha = 0.05)
3. Calculate point estimate
4. Calculate test statistic
5. Find the p-value
6. Interpret results

In [15]:
# A:
## Step 1: Define hypotheses.
### H_0: mu_NOX = M_NOX
### H_A: mu_NOX != M_NOX

## Step 2: alpha = 0.05.
alpha = 0.05

## Step 3: Calculate point estimate.
sample_mean = NOX.mean()
## The sample median is hard-coded here rather than computed with NOX.median().
sample_median = 0.54
sample_std = NOX.std()  # pandas uses ddof=1 (sample standard deviation) by default
sample_size = len(NOX)

## Step 4: Calculate test statistic.
t_statistic = (sample_mean - sample_median)/(sample_std/sample_size**0.5)

## Step 5: Find p-value.
## t.sf is the survival function, which is 1 - cdf at a given value
## (the proportion of values at least as extreme as the one observed).
## Because our alternative hypothesis is != (rather than greater than or less than),
## we multiply the one-tail probability by 2. (This is called a two-sided test.)
p_value = t.sf(np.abs(t_statistic), sample_size - 1) * 2

## Step 6: Interpret results.
print("Our sample median is {:.4f}.".format(sample_median))
print("Our sample mean is {:.4f}.".format(sample_mean))
print("Our t-statistic is {:.6f}.".format(t_statistic))
print("Our p-value is {:.6f}.".format(p_value))

if p_value < alpha:
    print("We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.")
else:
    print("We fail to reject our null hypothesis and cannot conclude that the true mean NOX value is different from the median NOX value.")

Our sample median is 0.5400.
Our sample mean is 0.5547.
Our t-statistic is 2.852639.
Our p-value is 0.004514.
We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.


**1-sample t-test**

To perform the t-test on a single sample, you can use `scipy.stats.ttest_1samp()`.

Try it out. Do you get the same values?

In [16]:
from scipy import stats

# Use stats.ttest_1samp() with appropriate parameters to check the results above
t_statistic, p_value = stats.ttest_1samp(NOX, sample_median)

print("t-statistic: ", t_statistic)
print("p-value: ", p_value)

t-statistic:  2.8526390677766322
p-value:  0.004513586425934958


### 6. What do you notice about the results from Exercise 4 and Exercise 5? 

**If you were going to generalize this to the relationship between hypothesis tests and confidence intervals, what might you say? Be specific.**

In Exercise 4 and Exercise 5, we ran two different analyses on the variable `NOX`, and they agree: the hypothesized value of 0.54 lies outside the 95% confidence interval (0.545, 0.565), and the two-sided test at alpha = 0.05 rejects it (p ≈ 0.0045).

**Exercise 4: Confidence Interval**

We generated a 95% confidence interval for the mean of `NOX`, which ranged from approximately 0.545 to 0.565. It estimates a range of plausible values for the true population mean of `NOX`.

**Exercise 5: Hypothesis Test**

We tested whether the true population mean of `NOX` equals a specific value (the sample median, 0.54 in this case). With a significance level (alpha) of 0.05, the test gave a t-statistic of about 2.85 and a p-value of approximately 0.0045, so we rejected the null hypothesis that the mean `NOX` value equals the sample median.

**Generalization to the relationship between hypothesis tests and confidence intervals**

The confidence interval in Exercise 4 gives the range of values for the population mean that are consistent with the sample at a chosen confidence level (95% here). The hypothesis test in Exercise 5 asks whether one specific value (the sample median) is consistent with the sample as a value for the population mean.

**Relationship between hypothesis tests and confidence intervals:**

***Directionality***: The confidence intervals used here are two-sided: they give a range of values around a point estimate (e.g., the mean) without favoring a direction. Hypothesis tests can be one-tailed or two-tailed, depending on whether the alternative is that the parameter is greater than, less than, or simply different from a specific value. A two-sided interval pairs with a two-tailed test.

***Overlap***: The p-value of a two-sided t-test and the t-based confidence interval built from the same sample are two views of the same calculation. If the hypothesized value (the null hypothesis value) lies strictly inside the confidence interval, the p-value of the corresponding test is greater than the matching significance level, and we fail to reject the null hypothesis. Conversely, if the hypothesized value falls outside the confidence interval, the p-value is below that significance level, and we reject the null hypothesis.

***Confidence Level vs. Significance Level***: The confidence level of a confidence interval is the complement of the significance level (alpha) used in the matching hypothesis test: confidence level = 1 - alpha. For example, a 95% confidence interval corresponds to a two-tailed hypothesis test with a significance level of 0.05.
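
The correspondence between a two-sided test at level alpha and a (1 - alpha) confidence interval can be checked numerically. The sketch below uses a synthetic sample (an assumption, so that it runs without the dataset) and a hypothetical grid of candidate means `mu_0`. For each candidate it compares "the test rejects" with "the candidate lies outside the interval".

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for NOX (assumption: not the real data)
rng = np.random.default_rng(7)
sample = rng.normal(loc=0.55, scale=0.12, size=200)

alpha = 0.05
df = len(sample) - 1
low, high = stats.t.interval(1 - alpha, df, loc=sample.mean(), scale=stats.sem(sample))

agreements = []
for mu_0 in np.linspace(low - 0.02, high + 0.02, 25):
    p_value = stats.ttest_1samp(sample, mu_0).pvalue   # two-sided by default
    rejected = p_value < alpha
    outside = (mu_0 < low) or (mu_0 > high)
    agreements.append(rejected == outside)

print("Test and interval agree for every candidate value:", all(agreements))
```

Every candidate inside the interval should survive the test and every candidate outside it should be rejected, because both procedures compare the same t-statistic with the same critical value.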

***Interpretation***: Rejecting the null hypothesis and finding the hypothesized value outside the confidence interval are the same conclusion: the sample provides evidence that the parameter differs from the hypothesized value. The interval adds information the test alone does not, namely how far from that value the parameter plausibly is.

In summary, confidence intervals and hypothesis tests are closely related, and the choice between them often depends on the specific research question and whether you want to estimate a parameter or assess a hypothesis about a parameter. The results of both analyses can provide valuable insights into the underlying population characteristics.