<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Review CLT, Confidence Intervals, and Hypothesis Testing


---

### Read in the housing data (code provided).

You can find the original data [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data).

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
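# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions. The next cell reads the data from a local CSV instead.
from sklearn.datasets import load_boston

data_boston = load_boston()
data = pd.DataFrame(data_boston.data, columns=data_boston.feature_names)
NOX = data['NOX']
AGE = data['AGE']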


In [9]:


data = pd.read_csv("./datasets/boston.csv")
NOX = data['NOX']
AGE = data['AGE']

In [11]:
data.head()

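| Unnamed: 0 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |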


### 1. Find the mean, standard deviation, and the standard error of the mean for variable `AGE`

In [13]:
# A:
mean_age = np.mean(AGE)
# Use the sample standard deviation (ddof=1); np.std defaults to ddof=0,
# which would not match scipy.stats.sem.
std_age = np.std(AGE, ddof=1)

# Calculate standard error of the mean
n = len(AGE)
sem_age = std_age / np.sqrt(n)

print(f"Mean of AGE: {mean_age:.2f}")
print(f"Standard Deviation of AGE: {std_age:.2f}")
print(f"Standard Error of AGE: {sem_age:.2f}")

Mean of AGE: 68.57
Standard Deviation of AGE: 28.15
Standard Error of AGE: 1.25


In [14]:
# scipy standard error function
from scipy.stats import sem
sem(AGE)

1.2513695252583041
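
A manual standard error and `scipy.stats.sem` can differ in the later decimal places because `np.std` defaults to `ddof=0` while `sem` defaults to `ddof=1`. The sketch below illustrates this with a small hypothetical sample (the values are made up for illustration and are not taken from the housing data).

```python
import numpy as np
from scipy.stats import sem

# Hypothetical sample, used only for illustration (not the housing data).
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(sample)

# Population formula (ddof=0) versus sample formula (ddof=1).
sem_ddof0 = np.std(sample, ddof=0) / np.sqrt(n)
sem_ddof1 = np.std(sample, ddof=1) / np.sqrt(n)

# scipy.stats.sem defaults to ddof=1, so it matches the sample formula.
sem_scipy = sem(sample)

print(f"ddof=0: {sem_ddof0:.4f}")
print(f"ddof=1: {sem_ddof1:.4f}")
print(f"scipy:  {sem_scipy:.4f}")
```

For large samples the two versions are nearly identical, but the `ddof=1` version is the conventional estimate when the data are a sample rather than the whole population.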

### 2. Generate a 90%, 95%, and 99% confidence interval for `AGE`

You can use the `scipy.stats.t.interval` function to calculate a confidence interval.

```python
# Endpoints of the central range that contains a proportion `confidence` of the distribution
stats.t.interval(confidence, df, loc=0, scale=1)
```

Arguments:
- `confidence` = the confidence level, between 0 and 1 (named `alpha` in SciPy versions before 1.9)
- `df` = the degrees of freedom: the sample size minus 1
- `loc` = the center of the t-distribution (your point estimate, i.e. the sample mean of the variable)
- `scale` = the scale of the t-distribution (the standard error of your sample mean)
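
To see what `t.interval` does with these arguments, the following sketch compares it with the textbook formula, mean ± critical value × standard error. The summary statistics (`sample_mean`, `standard_error`, `n`) are hypothetical values chosen only for illustration.

```python
import numpy as np
from scipy.stats import t

# Hypothetical summary statistics, chosen only for illustration.
sample_mean = 50.0
standard_error = 2.0
n = 30
df = n - 1
confidence = 0.95

# Interval from scipy (first argument passed positionally).
lower, upper = t.interval(confidence, df, loc=sample_mean, scale=standard_error)

# Same interval by hand: mean ± critical value × standard error.
t_crit = t.ppf(1 - (1 - confidence) / 2, df)
manual = (sample_mean - t_crit * standard_error,
          sample_mean + t_crit * standard_error)

print((lower, upper))
print(manual)
```

Passing the confidence level positionally avoids depending on the keyword name, which changed between SciPy versions.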

**Interpret the results from all three confidence intervals.**

In [19]:
from scipy.stats import t

In [21]:
# A: 

df = n - 1

# Calculate the confidence intervals
ci_90 = t.interval(0.90, df, loc=mean_age, scale=sem_age)
ci_95 = t.interval(0.95, df, loc=mean_age, scale=sem_age)
ci_99 = t.interval(0.99, df, loc=mean_age, scale=sem_age)

print(f"90% Confidence Interval: {ci_90}")
print(f"95% Confidence Interval: {ci_95}")
print(f"99% Confidence Interval: {ci_99}")

90% Confidence Interval: (66.51279866704186, 70.63700370449965)
95% Confidence Interval: (66.11636971854321, 71.0334326529983)
99% Confidence Interval: (65.33936041834137, 71.81044195320014)
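
One common way to interpret a confidence level is through repeated sampling: if the study were repeated many times, roughly that share of the resulting intervals would contain the true mean. The simulation below is a minimal sketch of that idea on synthetic normal data; `true_mean`, `true_sd`, `n`, and `n_trials` are assumed values, not quantities from the housing data.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)

# Assumed population and simulation settings (illustrative only).
true_mean, true_sd = 10.0, 3.0
n, n_trials = 50, 2000
confidence = 0.95

# Draw all samples at once: one row per simulated study.
samples = rng.normal(true_mean, true_sd, size=(n_trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Build a t-based interval for every simulated study.
t_crit = t.ppf(1 - (1 - confidence) / 2, n - 1)
lower = means - t_crit * sems
upper = means + t_crit * sems

# Share of intervals that contain the true mean.
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

Raising `confidence` widens every interval, which is why the 99% interval above is wider than the 90% interval.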


### 3. Did you rely on the Central Limit Theorem in question 2? Why or why not? Explain.

A: Yes, implicitly. The t-interval in Question 2 assumes that the sampling distribution of the sample mean of `AGE` is approximately normal. `AGE` itself does not have to be normally distributed: by the Central Limit Theorem (CLT), the sample mean is approximately normal when the sample size is sufficiently large, as it is here.
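
The CLT claim can be illustrated with a small simulation. The sketch below draws from a strongly right-skewed exponential distribution and compares the skewness of the raw draws with the skewness of many sample means. The distribution, the sample size of 500, and the helper `skewness` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def skewness(x):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

# Raw draws from a right-skewed (exponential) population.
population_draws = rng.exponential(scale=1.0, size=100_000)

# Means of 5,000 independent samples, each of size 500.
sample_means = rng.exponential(scale=1.0, size=(5_000, 500)).mean(axis=1)

print(f"Skewness of raw draws:    {skewness(population_draws):.2f}")
print(f"Skewness of sample means: {skewness(sample_means):.2f}")
```

The raw draws are heavily skewed, while the sample means are close to symmetric, which is the behavior the t-interval relies on.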


### 4. For the variable `NOX`, generate a 95% confidence interval and interpret it.

In [23]:
# A:
mean_nox = NOX.mean()
std_dev_nox = NOX.std()
n_nox = len(NOX)
sem_nox = std_dev_nox / np.sqrt(n_nox)
df_nox = n_nox - 1

# Calculate the 95% confidence interval for NOX
ci_95_nox = t.interval(0.95, df_nox, loc=mean_nox, scale=sem_nox)

print(f"95% Confidence Interval for NOX: {ci_95_nox}")

95% Confidence Interval for NOX: (0.5445742622921801, 0.5648158562848951)


### 5. For the variable `NOX`, we are going to test the hypothesis that the (true) mean is equal to the median in the sample

Here we run a one-sample hypothesis test on the mean.
These are the steps:
1. Define the hypotheses
2. Set alpha (Let alpha = 0.05)
3. Calculate the point estimate
4. Calculate the test statistic
5. Find the p-value
6. Interpret the results

In [27]:
# A:
## Step 1: Define hypotheses.
## H_0: mu_NOX = M_NOX
## H_A: mu_NOX != M_NOX

## Step 2: alpha = 0.05.
alpha = 0.05

## Step 3: Calculate point estimate.
sample_mean = NOX.mean()
sample_median = 0.54  # hard-coded value; NOX.median() computes the median directly from the data
sample_std = NOX.std()
sample_size = len(NOX)

## Step 4: Calculate test statistic.
t_statistic = (sample_mean - sample_median)/(sample_std/sample_size**0.5)

## Step 5: Find p-value.
## t.sf is the survival function, which is 1 - cdf at a given value
## (the proportion of values greater than it).
## Because our alternative hypothesis is != (rather than greater than or less than),
## we multiply the one-tail probability by 2. (This is called a two-sided test.)
p_value = t.sf(np.abs(t_statistic), len(NOX)-1) * 2

print("Our sample median is {:.4f}.".format(sample_median))
print("Our sample mean is {:.4f}.".format(sample_mean))
print("Our t-statistic is {:.6f}.".format(t_statistic))
print("Our p-value is {:.6f}.".format(p_value))

if p_value < alpha:
    print("We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.")
elif p_value > alpha:
    print("We fail to reject our null hypothesis and cannot conclude that the true mean NOX value is different from the median.")
else:
    print("Our test is inconclusive.")

Our sample median is 0.5400.
Our sample mean is 0.5547.
Our t-statistic is 2.852639.
Our p-value is 0.004514.
We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.
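
The p-value rule used above is equivalent to comparing the absolute t-statistic with a critical value. The sketch below shows both decision rules side by side; `df` and `t_statistic` are hypothetical numbers chosen for illustration, not the values from this exercise.

```python
from scipy.stats import t

alpha = 0.05
df = 100            # hypothetical degrees of freedom
t_statistic = 2.5   # hypothetical test statistic

# Rule 1: two-sided p-value compared with alpha.
p_value = 2 * t.sf(abs(t_statistic), df)
reject_by_p = p_value < alpha

# Rule 2: |t| compared with the two-sided critical value.
t_crit = t.ppf(1 - alpha / 2, df)
reject_by_crit = abs(t_statistic) > t_crit

print(f"p-value = {p_value:.4f}, reject: {reject_by_p}")
print(f"critical value = {t_crit:.4f}, reject: {reject_by_crit}")
```

Both rules always give the same decision, because the critical value is simply the t-statistic whose two-sided p-value equals alpha.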


**1-sample t-test**

To perform the t-test on a single sample, you can use `scipy.stats.ttest_1samp()`.

Try it out. Do you get the same values?

In [28]:


# Use stats.ttest_1samp() with appropriate parameters to check the results above
from scipy.stats import ttest_1samp

# Perform a one-sample t-test
t_stat, p_val = ttest_1samp(NOX, sample_median)

print("Using scipy's ttest_1samp:")
print("t-statistic: {:.6f}".format(t_stat))
print("p-value: {:.6f}".format(p_val))

if p_val < alpha:
    print("We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.")
elif p_val > alpha:
    print("We fail to reject our null hypothesis and cannot conclude that the true mean NOX value is different from the median.")
else:
    print("Our test is inconclusive.")


Using scipy's ttest_1samp:
t-statistic: 2.852639
p-value: 0.004514
We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.


### 6. What do you notice about the results from Exercise 4 and Exercise 5? 

**If you were going to generalize this to the relationship between hypothesis tests and confidence intervals, what might you say? Be specific.**

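A: In Exercise 5 the hypothesized value (the sample median, 0.54) falls outside the 95% confidence interval for the mean of `NOX` from Exercise 4, roughly (0.5446, 0.5648), and the two-sided t-test rejects the null hypothesis at alpha = 0.05 (p ≈ 0.0045). In general, a two-sided test at significance level alpha rejects H_0: mu = mu_0 exactly when mu_0 lies outside the (1 - alpha) confidence interval built from the same sample, the same standard error, and the same t-distribution.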
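
The link between a confidence interval and a two-sided test can be checked numerically. The sketch below uses synthetic data and a hypothetical helper, `ci_and_test_agree`, which reports whether "mu0 is outside the interval" and "the test rejects" give the same answer for a candidate value `mu0`.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

def ci_and_test_agree(sample, mu0, alpha=0.05):
    """Return True if 'mu0 outside the CI' matches 'test rejects H_0: mu = mu0'."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)

    # (1 - alpha) confidence interval for the mean.
    lower, upper = t.interval(1 - alpha, n - 1, loc=mean, scale=se)
    outside_ci = (mu0 < lower) or (mu0 > upper)

    # Two-sided one-sample t-test against mu0.
    rejects = ttest_1samp(sample, mu0).pvalue < alpha
    return outside_ci == rejects

# Synthetic sample (assumed parameters, for illustration only).
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.55, scale=0.12, size=200)

# Check a grid of candidate null values on both sides of the sample mean.
results = [ci_and_test_agree(sample, mu0) for mu0 in np.linspace(0.45, 0.65, 41)]
print(all(results))
```

The agreement holds because both procedures use the same standard error and the same t-distribution with n - 1 degrees of freedom.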