**Table of contents**<a id='toc0_'></a>    
- [Intro: Let's check schools in the Cali hoods.](#toc1_)    
  - [Hypothesis setting: Houses are more expensive if they're close to a school.](#toc1_1_)    
  - [Hypothesis testing](#toc1_2_)    
    - [Getting distances between houses and the school](#toc1_2_1_)    
- [Confidence intervals](#toc2_)    
  - [Using the normal distribution](#toc2_1_)    
  - [Using the T-distribution](#toc2_2_)    
  - [💡 Do it yourself](#toc2_3_)    
  - [Conclusions](#toc2_4_)    
- [Resources](#toc3_)    
- [References/Acknowledgements](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Intro: Let's check schools in the Cali hoods.](#toc0_)

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats

Today we will start off with the infamous [California housing dataset](https://www.kaggle.com/datasets/camnugent/california-housing-prices/):

In [None]:
houses = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/california_housing_census.csv')
houses.head(2)

In [None]:
houses.shape

There are two schools at these coordinates:

In [None]:
school1_latitude = 37.39
school1_longitude = -122.13
school2_latitude = 33.83
school2_longitude = -118.48

## <a id='toc1_1_'></a>[Hypothesis setting: Houses are more expensive if they're close to a school.](#toc0_)

In data analyst terms, the `median_house_value` is higher if the distance to a school is lower.

## <a id='toc1_2_'></a>[Hypothesis testing](#toc0_)

### <a id='toc1_2_1_'></a>[Getting distances between houses and the school](#toc0_)

We can assume that the distance between two points x and y, each given as a latitude and a longitude, can be computed using the usual Euclidean distance:  

$distance(x, y) = \sqrt{(x_{longitude} - y_{longitude})^2 + (x_{latitude} - y_{latitude})^2}$  

Let's assume that a point is "close" to another if this distance, measured in degrees of latitude and longitude, is < 0.5.

**Note (for the nerds):** Using the Euclidean distance simply means that we assume the surface we're looking at (that of the neighbourhoods in our dataset) is flat. Given that the Earth is roughly a sphere, that's not **exactly** true, but it's a pretty good approximation if we look at a small enough surface. In our case, we can say that a county in the US is a small enough surface compared to the full surface of the Earth. Also keep in mind that, away from the equator, a degree of longitude covers less ground than a degree of latitude, so a threshold in degrees is only a rough measure of distance.
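
To get a feel for how rough the flat-surface assumption is, here is a minimal sketch that converts degrees into kilometres with the haversine (great-circle) formula. The `haversine_km` helper and the Earth radius constant are illustrative assumptions, not part of the lesson's code; the coordinates are those of school 1, shifted by half a degree north and half a degree east.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Half a degree north vs. half a degree east of school 1
north_km = haversine_km(37.39, -122.13, 37.89, -122.13)
east_km = haversine_km(37.39, -122.13, 37.39, -121.63)
print(round(north_km, 1), round(east_km, 1))
```

The eastward step comes out shorter than the northward one, so "distance < 0.5 degrees" describes an ellipse on the ground rather than a circle. For a quick classroom comparison that is usually acceptable.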

In [None]:
# Create a function to calculate the Euclidean distance between the school coordinates and the neighbourhood coordinates
def distance_school(latitude, longitude, school_lat, school_long):
  return np.sqrt((latitude - school_lat) ** 2 + (longitude - school_long) ** 2)

In [None]:
# Apply the function to whole columns at once (vectorised) for both schools
houses['D_sch1'] = distance_school(houses['latitude'], houses['longitude'], school1_latitude, school1_longitude)
houses['D_sch2'] = distance_school(houses['latitude'], houses['longitude'], school2_latitude, school2_longitude)

In [None]:
# Review dataframe
houses.head()

In [None]:
# Keep the distance to the nearest of the two schools
houses['distance_to_school'] = houses[['D_sch1', 'D_sch2']].min(axis=1)
houses.head()

In [None]:
# Create the close_to_school? feature based on the distance to the nearest school
houses['close_to_school?'] = houses['distance_to_school'] < 0.5
houses.head()

In [None]:
# What is the mean of median_house_value for each close_to_school? group?
houses.groupby('close_to_school?').agg({'median_house_value': 'mean'})

Final reveal: On average, the `median_house_value` is indeed higher for neighbourhoods close to a school.
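
Comparing two group means shows a difference but does not say whether it could be due to chance. One possible way to check this is Welch's t-test, sketched below. The numbers are synthetic stand-ins generated on the spot (not the housing data), and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two groups (made-up values, NOT the housing data)
close_values = rng.normal(loc=250_000, scale=100_000, size=200)
far_values = rng.normal(loc=200_000, scale=100_000, size=2_000)

# Welch's t-test (no equal-variance assumption)
# H0: both means are equal; H1: mean(close) > mean(far)
result = stats.ttest_ind(close_values, far_values, equal_var=False, alternative="greater")
t_stat, p_value = result.statistic, result.pvalue
print(t_stat, p_value)
```

With the real data, `close_values` and `far_values` would be the `median_house_value` columns of the two groups. A small p-value would support the hypothesis, although it would not prove that the schools cause the price difference.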

# <a id='toc2_'></a>[Confidence intervals](#toc0_)

![](https://media.giphy.com/media/1VV5mivAbIHSSiKXL9/giphy.gif)

In [None]:
# Separate the two samples
houses_close = houses[houses['close_to_school?'] == True]
houses_far = houses[houses['close_to_school?'] == False]

In [None]:
# Get a sample of the neighbourhoods close to a school and calculate the sample's stats
houses_close_sample = houses_close.sample(100)
houses_close_mean = houses_close_sample['median_house_value'].mean()
houses_close_std = houses_close_sample['median_house_value'].std()
houses_close_n = len(houses_close_sample)

In [None]:
# Show the dataframe and its stats
display(houses_close_sample.head())
display(houses_close_sample.shape)
display(houses_close_std)
display(houses_close_mean)

## <a id='toc2_1_'></a>[Using the normal distribution](#toc0_)

![](https://imgs.search.brave.com/O4ZLy7nFQpsvh2kthg0EjUzjH5JMJKM3bUxMyeikDXY/rs:fit:860:0:0/g:ce/aHR0cDovL29wZW5i/b29rcy5saWJyYXJ5/LnVtYXNzLmVkdS9w/MTMyLWxhYi1tYW51/YWwvd3AtY29udGVu/dC91cGxvYWRzL3Np/dGVzLzI2LzIwMjAv/MDcvQmVsbC1jdXJ2/ZS5qcGc)

In [None]:
# Calculate the confidence interval manually
print("left end: ", houses_close_mean - 2 * (houses_close_std / np.sqrt(houses_close_n)))
print("right end: ", houses_close_mean + 2 * (houses_close_std / np.sqrt(houses_close_n)))

In [None]:
# Calculate the confidence interval using stats
stats.norm.interval(0.955, loc=houses_close_mean, scale=houses_close_std/np.sqrt(houses_close_n))

It's normal for the two results to differ slightly: the manual calculation uses exactly 2 standard errors, which corresponds to a confidence level of about 95.45%, while `stats.norm.interval(0.955, ...)` uses the multiplier for 95.5%, which is slightly larger than 2 (about 2.005).
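
As a quick check on how a multiplier and a confidence level relate under the normal distribution, the sketch below computes the level covered by exactly 2 standard errors and the multiplier that `stats.norm.interval` applies for 0.955. It uses a standard normal (`loc=0`, `scale=1`) so the upper end of the interval is the multiplier itself.

```python
import scipy.stats as stats

# Confidence level covered by exactly +/- 2 standard errors
level_for_two = stats.norm.cdf(2) - stats.norm.cdf(-2)

# Interval ends of a standard normal for a 0.955 confidence level;
# the upper end is the multiplier applied to the standard error
lower, upper = stats.norm.interval(0.955, loc=0, scale=1)

print(round(level_for_two, 4))
print(round(upper, 4))
```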

What if we prefer 99.6% confidence?

In [None]:
# This is for you to answer during the class ;)

What if we prefer 98% confidence? How can we get the number of standard deviations?

In [None]:
# Get # of standard deviations
z = stats.norm.ppf(.99)   # Why .99 instead of .98?
print(round(z, 2))
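
One way to think about the question above: a two-sided interval leaves half of the remaining probability in each tail, so the quantile to look up is not the confidence level itself. The helper below is a small sketch of that idea; `z_critical` is a hypothetical name introduced here for illustration.

```python
import scipy.stats as stats

def z_critical(confidence):
    """Two-sided critical value of the standard normal for a confidence level."""
    tail = (1 - confidence) / 2      # probability left in EACH tail
    return stats.norm.ppf(1 - tail)  # quantile that leaves `tail` to its right

for level in (0.95, 0.98, 0.997):
    print(level, round(z_critical(level), 2))
```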

In [None]:
# Get the manual confidence interval
print("left end: ", houses_close_mean - z * (houses_close_std / np.sqrt(houses_close_n)))
print("right end: ", houses_close_mean + z * (houses_close_std / np.sqrt(houses_close_n)))

In [None]:
# Is the real (population) mean inside our interval?
houses_close['median_house_value'].mean()
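
A single interval either contains the real mean or it doesn't. The confidence level describes how often intervals built this way would contain it across many samples. The simulation below is a rough sketch of that idea on a synthetic, skewed population (made-up values, not the housing data): it draws many samples of 100 and counts how often the 98% interval covers the true mean.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Synthetic, skewed "population" (made-up values, NOT the housing data)
population = rng.lognormal(mean=12, sigma=0.5, size=50_000)
true_mean = population.mean()

confidence, n, n_repeats = 0.98, 100, 2_000
z = stats.norm.ppf(1 - (1 - confidence) / 2)

hits = 0
for _ in range(n_repeats):
    sample = rng.choice(population, size=n, replace=False)
    half_width = z * sample.std(ddof=1) / np.sqrt(n)
    # Count the interval as a hit if it contains the true mean
    if abs(sample.mean() - true_mean) <= half_width:
        hits += 1

coverage = hits / n_repeats
print(coverage)
```

The printed coverage should land near the chosen confidence level, though with skewed data and a normal-based interval it can fall a little short.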

## <a id='toc2_2_'></a>[Using the T-distribution](#toc0_)

Because the normal distribution underestimates the uncertainty of small samples when the standard deviation is itself estimated from the sample, the statistician William Sealy Gosset developed the t-distribution for this scenario. Its shape depends on the degrees of freedom (the sample size minus one), so the smaller the sample, the more it differs from a normal distribution. Typically, the t-distribution has more extreme observations (fatter tails) than a normal distribution: 

![](https://www.investopedia.com/thmb/wejbVOkkG-O2IyRfbz-vErbEea8=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/norm_vs_t2-1024x941-f3559a8fd4e947d49723541273a7d162.png)  
(Source: [Investopedia](https://www.investopedia.com/terms/t/tdistribution.asp))

> You should use the t-distribution table when the population standard deviation (σ) is not known and the sample size is small (n<30). 

**Fun fact:** The t-distribution is usually called Student's t-distribution because Gosset published it under the pseudonym "Student". Legend has it that he did so because he was working for Guinness and the [brewery wouldn't let him share his real name](https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/full/10.1002/cem.2713).

What is the critical value for a t-distribution if I want my confidence to be 98%?

In [None]:
# Get the two-sided critical value with n - 1 degrees of freedom
t = stats.t.ppf(1 - ((1 - 0.98) / 2), houses_close_n - 1)
t

In [None]:
# Get the interval manually
print("left end: ", houses_close_mean - t * (houses_close_std / np.sqrt(houses_close_n)))
print("right end: ", houses_close_mean + t * (houses_close_std / np.sqrt(houses_close_n)))

In [None]:
# What was the interval previously?
print("left end: ", houses_close_mean - z * (houses_close_std / np.sqrt(houses_close_n)))
print("right end: ", houses_close_mean + z * (houses_close_std / np.sqrt(houses_close_n)))

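We notice that the interval gets slightly wider when switching from the normal to the t-distribution, because the t critical value is larger than the z value for the same confidence level. With a sample of 100 the difference is small.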
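
To see how the gap between the two distributions depends on sample size, the sketch below prints the 98% critical values of the t-distribution for a few sample sizes next to the normal one. The sample sizes are arbitrary choices for illustration.

```python
import scipy.stats as stats

confidence = 0.98
upper_quantile = 1 - (1 - confidence) / 2  # two-sided interval

z_crit = stats.norm.ppf(upper_quantile)
for n in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(upper_quantile, df=n - 1)
    print(n, round(t_crit, 3), round(z_crit, 3))
```

The t value shrinks toward the z value as the sample grows, which is why the choice between the two matters most for small samples.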

## <a id='toc2_3_'></a>[💡 Do it yourself](#toc0_)

Now repeat this exercise for the set of houses away from schools. What do you see?

In [None]:
# Your code here

## <a id='toc2_4_'></a>[Conclusions](#toc0_)

Your conclusion here

# <a id='toc3_'></a>[Resources](#toc0_)

- [Why do we use the square root of the sample size? (intuition)](https://www.drdawnwright.com/why-divide-by-the-square-root-of-n/)
- [How to get the critical value for a chosen confidence level?](https://www.khanacademy.org/math/ap-statistics/xfb5d8e68:inference-categorical-proportions/one-sample-z-interval-proportion/v/critical-value-for-a-given-confidence-level) - 6 min

# <a id='toc4_'></a>[References/Acknowledgements](#toc0_)

Thank you, David Henriques, for the awesome structure and content of your lessons :) 