# Lab | Hypothesis Testing

**Objective**

This lab applies hypothesis testing to real datasets. For each scenario, you state the null and alternative hypotheses, choose an appropriate test, and interpret the result at a given significance level.

The tests covered are the one-sample t-test (the mean of a single sample), the two-sample t-test (differences between independent groups), the paired-sample t-test (differences within dependent samples), and Analysis of Variance (ANOVA), which compares means across multiple groups.

**Challenge 1**

In this challenge, we will work with Pokémon data, available here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [2]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [14]:
dragon_hp = df[df["Type 1"]=="Dragon"]["HP"]
grass_hp = df[df["Type 1"]=="Grass"]["HP"]


In [15]:
#Set the hypothesis

#H0: mean HP Dragon = mean HP Grass
#H1: mean HP Dragon > mean HP Grass

#significance level = 0.05

In [16]:
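from scipy import stats

# Welch's t-test (equal_var=False): does not assume equal variances in the two groups
t_stat, p_two_sided = stats.ttest_ind(dragon_hp, grass_hp, equal_var=False)
# Halving the two-sided p-value gives the one-sided p-value for H1 (Dragon > Grass)
# only when t_stat > 0; for t_stat < 0 it would be 1 - p_two_sided / 2
p_one_sided = p_two_sided / 2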

print("t-statistic:", t_stat)
print("two-sided p-value:", p_two_sided)
print("one-sided p-value (Dragon > Grass):", p_one_sided)




t-statistic: 3.3349632905124063
two-sided p-value: 0.0015987219490841199
one-sided p-value (Dragon > Grass): 0.0007993609745420599


In [17]:
alpha = 0.05

if (t_stat > 0) and (p_one_sided < alpha):
    print("Reject H0: Dragons have significantly higher mean HP than Grass at 5% level.")
else:
    print("Fail to reject H0: no significant evidence that Dragons have higher mean HP.")

Reject H0: Dragons have significantly higher mean HP than Grass at 5% level.


In [18]:
# We reject H0: at the 5% level, Dragon-type Pokémon have significantly higher mean HP than Grass-type Pokémon.
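
As a possible alternative to halving the two-sided p-value by hand, SciPy 1.6 and later accept an `alternative` argument in `ttest_ind`. The sketch below compares the two routes on small made-up samples; `group_a` and `group_b` are hypothetical and are not taken from the Pokémon data.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for illustration only (not taken from pokemon.csv)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=80, scale=20, size=30)
group_b = rng.normal(loc=65, scale=20, size=70)

# Two-sided Welch test, as in the cell above
t_stat, p_two_sided = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-sided Welch test for H1: mean(group_a) > mean(group_b)
t_greater, p_greater = stats.ttest_ind(
    group_a, group_b, equal_var=False, alternative="greater"
)

# The halving shortcut is only valid when the statistic points in the direction of H1
p_halved = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print("one-sided p (alternative='greater'):", p_greater)
print("one-sided p (derived from two-sided):", p_halved)
```

Both routes should give the same one-sided p-value; passing `alternative` simply removes the need to check the sign of the t-statistic separately.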

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) than Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [20]:
import pandas as pd
from scipy.stats import f_oneway


url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv'
df = pd.read_csv(url)


# Named stat_cols (not stats) so the scipy.stats module imported earlier is not shadowed
stat_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Boolean mask: True for Legendary, False for Non-Legendary
legendary = df['Legendary'] == True

# Perform one-way ANOVA for each stat.
# With only two groups this is equivalent to a pooled-variance two-sample t-test (F = t^2).
results = {}
for stat in stat_cols:
    group1 = df[~legendary][stat].dropna()  # Non-Legendary
    group2 = df[legendary][stat].dropna()   # Legendary
    f_stat, p_value = f_oneway(group1, group2)
    results[stat] = {'F': round(f_stat, 4), 'p-value': round(p_value, 6)}

print("ANOVA Results (α=0.05):")
for stat, res in results.items():
    sig = "Significant" if res['p-value'] < 0.05 else "Not significant"
    print(f"{stat}: F={res['F']}, p={res['p-value']} → {sig}")

# Display mean stats by group
print("\nMean stats by group:")
print(df.groupby('Legendary')[stat_cols].mean().round(2))


ANOVA Results (α=0.05):
HP: F=64.5793, p=0.0 → Significant
Attack: F=108.1043, p=0.0 → Significant
Defense: F=51.5702, p=0.0 → Significant
Sp. Atk: F=201.396, p=0.0 → Significant
Sp. Def: F=121.8319, p=0.0 → Significant
Speed: F=95.3598, p=0.0 → Significant

Mean stats by group:
              HP  Attack  Defense  Sp. Atk  Sp. Def   Speed
Legendary                                                  
False      67.18   75.67    71.56    68.45    68.89   65.46
True       92.74  116.68    99.66   122.18   105.94  100.18


In [None]:
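# For all six stats we reject the null hypothesis of equal means at the 5% level (p < 0.05).
# The p-values print as 0.0 only because they are rounded to six decimals; they are very small, not exactly zero.
# The group means show that Legendary Pokémon score higher on every stat than Non-Legendary Pokémon.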
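
The loop above runs `f_oneway` on exactly two groups, which gives the same result as a pooled-variance two-sample t-test (F equals t squared). One possible refinement, sketched below, is to run Welch's t-test per stat and apply a Bonferroni correction because several tests are run at once. The `toy` DataFrame and its values are hypothetical and only stand in for the Pokémon data.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, ttest_ind

# Hypothetical toy data for illustration only (not pokemon.csv)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Legendary": [False] * 60 + [True] * 20,
    "HP": np.concatenate([rng.normal(67, 25, 60), rng.normal(93, 22, 20)]),
    "Speed": np.concatenate([rng.normal(65, 27, 60), rng.normal(100, 22, 20)]),
})
stat_cols = ["HP", "Speed"]
alpha = 0.05
alpha_bonferroni = alpha / len(stat_cols)  # Bonferroni: split alpha across the tests

for col in stat_cols:
    non_leg = toy.loc[~toy["Legendary"], col]
    leg = toy.loc[toy["Legendary"], col]

    f_stat, p_anova = f_oneway(non_leg, leg)
    t_pooled, p_pooled = ttest_ind(non_leg, leg, equal_var=True)
    t_welch, p_welch = ttest_ind(non_leg, leg, equal_var=False)

    # With two groups, F equals the squared pooled t-statistic and the p-values coincide
    print(f"{col}: F={f_stat:.4f}, t_pooled^2={t_pooled**2:.4f}, "
          f"p_anova={p_anova:.3e}, p_welch={p_welch:.3e}, "
          f"significant after Bonferroni: {p_welch < alpha_bonferroni}")
```

Welch's test drops the equal-variance assumption, which may matter here because the Legendary group is much smaller than the Non-Legendary group.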

**Challenge 2**

In this challenge, we will work with California housing data, available here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
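
The first two hints can be sketched in vectorized form with NumPy, which avoids calling a Python function once per row. This is a minimal sketch under stated assumptions: `distance_to` is a hypothetical helper, and the `toy` rows are made up rather than read from the housing file.

```python
import numpy as np
import pandas as pd

school = (-118, 34)
hospital = (-122, 37)

def distance_to(df, point):
    """Euclidean distance from every row's (longitude, latitude) to `point`."""
    return np.hypot(df["longitude"] - point[0], df["latitude"] - point[1])

# Hypothetical toy rows for illustration only (not california_housing.csv)
toy = pd.DataFrame({
    "longitude": [-118.2, -121.9, -114.31],
    "latitude": [34.1, 37.3, 34.19],
})

threshold = 0.50
# A row is "close" if it lies within the threshold of the school OR the hospital
toy["is_close"] = (distance_to(toy, school) < threshold) | (
    distance_to(toy, hospital) < threshold
)
print(toy)
```

The resulting boolean column can then be used to split the house values into a close group and a far group before running the test.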
 

In [None]:
#1.1 Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.

In [10]:
import math

school = (-118, 34) 
hospital = (-122, 37) 

def euclidean_distance(point1, point2): 
    return math.sqrt((point1[0] - point2[0])**2 + (point1[1] - point2[1])**2) 

def is_close(house_coords, threshold=0.50):
    dist_school = euclidean_distance(house_coords, school)
    dist_hospital = euclidean_distance(house_coords, hospital)
    return dist_school < threshold or dist_hospital < threshold

#test 
house = (-118.2, 34.1)
print("Is the house close?", is_close(house))


Is the house close? True


In [None]:
#1.2 Divide your dataset into houses close and far from either a hospital or school.

In [19]:
import math  # needed by euclidean_distance below
from scipy import stats
school = (-118, 34)
hospital = (-122, 37)



# H0: Mean house value is the same for houses close vs far from a school or hospital 
# H1: Houses close to a school or hospital are more expensive 

def euclidean_distance(point1, point2): 
    return math.sqrt((point1[0] - point2[0])**2 + (point1[1] - point2[1])**2) 


def is_close(lon, lat, threshold=0.50):
    dist_school = euclidean_distance((lon, lat), school)
    dist_hospital = euclidean_distance((lon, lat), hospital)
    return dist_school < threshold or dist_hospital < threshold


df["is_close"] = df.apply(lambda row: is_close(row["longitude"], row["latitude"]), axis=1)


close_houses = df[df["is_close"]]["median_house_value"] 
far_houses = df[~df["is_close"]]["median_house_value"]

t_stat, p_value = stats.ttest_ind(close_houses, far_houses, equal_var=False) 

print("Mean house value (close):", close_houses.mean()) 
print("Mean house value (far):", far_houses.mean()) 
print("t-statistic:", t_stat) 
print("p-value:", p_value)


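alpha = 0.05
# H1 is one-sided (close > far): require a positive t-statistic and halve the two-sided p-value
if t_stat > 0 and p_value / 2 < alpha:
    print("Conclusion: Reject H0. Houses close to a school or hospital are significantly more expensive.")
else:
    print("Conclusion: Fail to reject H0. No significant evidence that houses close to a school or hospital are more expensive.")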

Mean house value (close): 246951.98213501245
Mean house value (far): 180678.44105790975
t-statistic: 37.992330214201516
p-value: 3.0064957768592614e-301
Conclusion: Reject H0. Houses close to a school or hospital are significantly more expensive.


In [None]:
# We reject H0: at the 5% level, houses close to a school or hospital are significantly more expensive than houses far from both.