# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab we apply hypothesis tests to several datasets and interpret the results.

We cover testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), comparing dependent samples (Paired Sample T-Test), and comparing means across multiple groups with Analysis of Variance (ANOVA).

So prepare your hypotheses, and let's get started!
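
As a quick orientation, the sketch below maps each test named above to the `scipy.stats` function that is commonly used for it. The arrays `a`, `b` and `c` are synthetic data generated only for illustration; they are not part of the lab datasets.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples, for illustration only (not lab data)
rng = np.random.default_rng(0)
a = rng.normal(loc=50, scale=10, size=30)
b = rng.normal(loc=55, scale=10, size=30)
c = rng.normal(loc=60, scale=10, size=30)

# One-sample t-test: does the mean of `a` differ from a reference value?
one_sample = st.ttest_1samp(a, popmean=50)

# Two-sample (independent) t-test: do `a` and `b` have different means?
two_sample = st.ttest_ind(a, b)

# Paired t-test: the same units measured twice (equal-length arrays)
paired = st.ttest_rel(a, b)

# One-way ANOVA: do three or more groups share a common mean?
anova = st.f_oneway(a, b, c)

results = [
    ("one-sample", one_sample),
    ("two-sample", two_sample),
    ("paired", paired),
    ("ANOVA", anova),
]
for name, res in results:
    print(f"{name}: statistic = {res.statistic:.3f}, p-value = {res.pvalue:.4f}")
```

Each call returns a result object exposing `statistic` and `pvalue`, so the decision rule (compare `pvalue` with the chosen significance level) is the same for all four tests.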

**Challenge 1**

In this challenge, we will be working with Pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [2]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokemon have, on average, higher HP than Grass-type Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [4]:
import pandas as pd
import scipy.stats as st
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load all the Pokemon data
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")

print("Pokemon data loaded!")
print(df.head())

Pokemon data loaded!
            Name Type 1  Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  \
0      Bulbasaur  Grass  Poison  45      49       49       65       65     45   
1        Ivysaur  Grass  Poison  60      62       63       80       80     60   
2       Venusaur  Grass  Poison  80      82       83      100      100     80   
3  Mega Venusaur  Grass  Poison  80     100      123      122      120     80   
4     Charmander   Fire     NaN  39      52       43       60       50     65   

   Generation  Legendary  
0           1      False  
1           1      False  
2           1      False  
3           1      False  
4           1      False  


- We posit that Legendary Pokemon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [5]:
print("CHALLENGE 1, p 1: Dragon vs Grass HP")

dragon_hp = df[df['Type 1'] == 'Dragon']['HP']
grass_hp = df[df['Type 1'] == 'Grass']['HP']

print(f"Dragon Pokemon HP: mean = {dragon_hp.mean():.2f}, n = {len(dragon_hp)}")
print(f"Grass Pokemon HP: mean = {grass_hp.mean():.2f}, n = {len(grass_hp)}")

# two-sample t-test (two-sided by default, assumes equal variances)
t_stat, p_value = st.ttest_ind(dragon_hp, grass_hp)

print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"Result: p-value ({p_value:.4f}) < {alpha}")
    print("We reject the null hypothesis")
    print("Dragon Pokemon have significantly different HP than Grass Pokemon")
else:
    print(f"Result: p-value ({p_value:.4f}) >= {alpha}")
    print("We fail to reject the null hypothesis")
    print("No significant difference in HP between Dragon and Grass Pokemon")

CHALLENGE 1, p 1: Dragon vs Grass HP
Dragon Pokemon HP: mean = 83.31, n = 32
Grass Pokemon HP: mean = 67.27, n = 70

T-statistic: 3.5904
P-value: 0.0005
Result: p-value (0.0005) < 0.05
We reject the null hypothesis
Dragon Pokemon have significantly different HP than Grass Pokemon
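
The claim being tested is directional ("more HP"), while `st.ttest_ind` runs a two-sided test with pooled variance by default. A minimal sketch of a one-sided Welch variant follows, using the `alternative` and `equal_var` parameters of `scipy.stats.ttest_ind`. The two arrays are hypothetical HP values standing in for the Dragon and Grass groups, not the real data.

```python
import numpy as np
import scipy.stats as st

# Hypothetical HP samples (illustrative stand-ins, not the Pokemon dataset)
dragon_hp = np.array([41, 58, 61, 68, 75, 80, 91, 95, 100, 108])
grass_hp = np.array([40, 45, 50, 55, 60, 60, 65, 70, 75, 80])

# H0: mean(dragon) <= mean(grass)   H1: mean(dragon) > mean(grass)
# equal_var=False selects Welch's t-test, which does not assume equal variances
res = st.ttest_ind(dragon_hp, grass_hp, equal_var=False, alternative="greater")

# Two-sided version of the same test, for comparison
two_sided = st.ttest_ind(dragon_hp, grass_hp, equal_var=False)

print(f"one-sided p-value: {res.pvalue:.4f}")
print(f"two-sided p-value: {two_sided.pvalue:.4f}")
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one, so a two-sided rejection in the hypothesized direction also implies a one-sided rejection at the same significance level.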


In [7]:
print("PART 2: Legendary vs Non-Legendary Stats")

legendary = df[df['Legendary']]
non_legendary = df[~df['Legendary']]

stats_to_test = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

for stat in stats_to_test:
    leg_stat = legendary[stat]
    non_leg_stat = non_legendary[stat]
    
    t_stat, p_value = st.ttest_ind(leg_stat, non_leg_stat)
    
    print(f"\n{stat}:")
    print(f"  Legendary mean: {leg_stat.mean():.2f}")
    print(f"  Non-Legendary mean: {non_leg_stat.mean():.2f}")
    print(f"  P-value: {p_value:.4f}")
    
    if p_value < 0.05:
        print(f"  Result: SIGNIFICANT difference (p < 0.05)")
    else:
        print(f"  Result: NO significant difference (p >= 0.05)")

PART 2: Legendary vs Non-Legendary Stats

HP:
  Legendary mean: 92.74
  Non-Legendary mean: 67.18
  P-value: 0.0000
  Result: SIGNIFICANT difference (p < 0.05)

Attack:
  Legendary mean: 116.68
  Non-Legendary mean: 75.67
  P-value: 0.0000
  Result: SIGNIFICANT difference (p < 0.05)

Defense:
  Legendary mean: 99.66
  Non-Legendary mean: 71.56
  P-value: 0.0000
  Result: SIGNIFICANT difference (p < 0.05)

Sp. Atk:
  Legendary mean: 122.18
  Non-Legendary mean: 68.45
  P-value: 0.0000
  Result: SIGNIFICANT difference (p < 0.05)

Sp. Def:
  Legendary mean: 105.94
  Non-Legendary mean: 68.89
  P-value: 0.0000
  Result: SIGNIFICANT difference (p < 0.05)

Speed:
  Legendary mean: 100.18
  Non-Legendary mean: 65.46
  P-value: 0.0000
  Result: SIGNIFICANT difference (p < 0.05)
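
Running six separate tests at a 5% level each raises the chance of at least one false positive across the whole family of tests. One common, conservative remedy is the Bonferroni correction, which compares every p-value with `alpha / number_of_tests`. The sketch below shows the idea; the labels and p-values in `p_values` are hypothetical placeholders, not results from this lab.

```python
alpha = 0.05

# Hypothetical p-values for six tests (placeholders, not lab results)
p_values = {
    "stat_1": 3.2e-12,
    "stat_2": 0.004,
    "stat_3": 0.02,
    "stat_4": 0.009,
    "stat_5": 0.2,
    "stat_6": 1e-5,
}

# Bonferroni: divide the significance level by the number of tests
threshold = alpha / len(p_values)

# A test is declared significant only if its p-value is below the threshold
decisions = {name: p < threshold for name, p in p_values.items()}

print(f"Per-test threshold: {threshold:.4f}")
for name, significant in decisions.items():
    print(f"{name}: p = {p_values[name]:.2e}, significant = {significant}")
```

With six tests the per-test threshold is 0.05 / 6, roughly 0.0083, so a p-value such as 0.02 would pass an uncorrected 5% test but not the corrected one.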


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [8]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


In [9]:
print(df.columns)
print(df.head())

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='str')
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -114.31     34.19                15.0       5612.0          1283.0   
1    -114.47     34.40                19.0       7650.0          1901.0   
2    -114.56     33.69                17.0        720.0           174.0   
3    -114.57     33.64                14.0       1501.0           337.0   
4    -114.57     33.57                20.0       1454.0           326.0   

   population  households  median_income  median_house_value  
0      1015.0       472.0         1.4936             66900.0  
1      1129.0       463.0         1.8200             80100.0  
2       333.0       117.0         1.6509             85700.0  
3       515.0       226.0         3.1917             73400.0  
4       624.0       262.0         1.9250             65500.0  

**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.
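
To make the closeness rule concrete, here is a small worked sketch for single points. The reference coordinates and the 0.50 threshold come from the statement above; the helper name `is_close` and the second test point are hypothetical. Distances are computed directly on the longitude/latitude values, so the threshold is presumably expressed in the same units (degrees).

```python
import math

SCHOOL = (-118.0, 34.0)    # (longitude, latitude), from the lab statement
HOSPITAL = (-122.0, 37.0)  # (longitude, latitude), from the lab statement
THRESHOLD = 0.50

def is_close(lon, lat):
    """Return True if the point lies within THRESHOLD of the school or the hospital."""
    d_school = math.hypot(lon - SCHOOL[0], lat - SCHOOL[1])
    d_hospital = math.hypot(lon - HOSPITAL[0], lat - HOSPITAL[1])
    return d_school < THRESHOLD or d_hospital < THRESHOLD

# First row of the dataset (-114.31, 34.19): about 3.69 from the school
d_first = math.hypot(-114.31 - SCHOOL[0], 34.19 - SCHOOL[1])
print(f"{d_first:.4f}", is_close(-114.31, 34.19))

# Hypothetical point near the school: sqrt(0.3**2 + 0.2**2) is about 0.36
print(is_close(-118.3, 34.2))
```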

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
 

In [10]:
print("CHALLENGE 2: CALIFORNIA'S HOUSING")

# Load housing data
df_housing = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")

print("\nHousing data loaded!")
print(df_housing.head())

CHALLENGE 2: CALIFORNIA'S HOUSING

Housing data loaded!
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -114.31     34.19                15.0       5612.0          1283.0   
1    -114.47     34.40                19.0       7650.0          1901.0   
2    -114.56     33.69                17.0        720.0           174.0   
3    -114.57     33.64                14.0       1501.0           337.0   
4    -114.57     33.57                20.0       1454.0           326.0   

   population  households  median_income  median_house_value  
0      1015.0       472.0         1.4936             66900.0  
1      1129.0       463.0         1.8200             80100.0  
2       333.0       117.0         1.6509             85700.0  
3       515.0       226.0         3.1917             73400.0  
4       624.0       262.0         1.9250             65500.0  


In [12]:
# define school and hospital coordinates as reference points
school_coords = (-118, 34)
hospital_coords = (-122, 37)

# function to calculate euclidean distance
def euclidean_distance(lon, lat, target_lon, target_lat):
    return np.sqrt((lon - target_lon)**2 + (lat - target_lat)**2)

# calculate distances
df_housing['dist_to_school'] = euclidean_distance(
    df_housing['longitude'], 
    df_housing['latitude'],
    school_coords[0],
    school_coords[1]
)

df_housing['dist_to_hospital'] = euclidean_distance(
    df_housing['longitude'],
    df_housing['latitude'], 
    hospital_coords[0],
    hospital_coords[1]
)

# determine if close to either school or hospital
threshold = 0.50
df_housing['close_to_amenity'] = (
    (df_housing['dist_to_school'] < threshold) | 
    (df_housing['dist_to_hospital'] < threshold)
)

# split into two groups
close_houses = df_housing[df_housing['close_to_amenity']]['median_house_value']
far_houses = df_housing[~df_housing['close_to_amenity']]['median_house_value']

print(f"Houses close to school/hospital: n = {len(close_houses)}")
print(f"Houses far from school/hospital: n = {len(far_houses)}")

print(f"\nMean house value (close): ${close_houses.mean():.2f}")
print(f"Mean house value (far): ${far_houses.mean():.2f}")

# two-sample t-test (two-sided by default, assumes equal variances)
t_stat, p_value = st.ttest_ind(close_houses, far_houses)

print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nResult: p-value ({p_value:.4f}) < {alpha}")
    print("We reject the null hypothesis")
    print("Houses close to schools/hospitals ARE significantly more expensive")
else:
    print(f"\nResult: p-value ({p_value:.4f}) >= {alpha}")
    print("We fail to reject the null hypothesis")
    print("No significant difference in price based on proximity to school/hospital")

print("Done")

Houses close to school/hospital: n = 6829
Houses far from school/hospital: n = 10171

Mean house value (close): $246951.98
Mean house value (far): $180678.44

T-statistic: 38.0463
P-value: 0.0000

Result: p-value (0.0000) < 0.05
We reject the null hypothesis
Houses close to schools/hospitals ARE significantly more expensive
Done
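
A printed p-value of `0.0000` only means the value is below the four-decimal display precision; it is not exactly zero. One possible way to report such values more informatively is sketched below. The helper `format_p` and its `floor` parameter are hypothetical names introduced for this example.

```python
def format_p(p, floor=1e-4):
    """Format a p-value, reporting very small values as '< floor' instead of 0.0000."""
    if p < floor:
        return f"< {floor:g}"
    return f"{p:.4f}"

# Example p-values chosen only to exercise both branches
print(format_p(3e-20))    # very small value, reported as a bound
print(format_p(0.0312))   # ordinary value, reported with four decimals

# Scientific notation is another option when the magnitude matters
print(f"{3e-20:.2e}")
```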
