# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab, we apply hypothesis tests to several scenarios and interpret what the results say about the data.

We cover testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), comparing dependent samples (Paired Sample T-Test), and comparing means across multiple groups with Analysis of Variance (ANOVA).

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [4]:
#libraries
import pandas as pd
import scipy.stats as stats
import numpy as np



In [5]:
pokemon_df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
pokemon_df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, with 5% significance, comment on your findings.

In [8]:
#we can use a Two Sample T-Test. This test is appropriate because we are comparing the means of two independent groups (Dragon-type and Grass-type Pokémon) to see if there is a statistically significant difference between them.
# Filter the data for Dragon and Grass types
dragon_hp = pokemon_df[pokemon_df['Type 1'] == 'Dragon']['HP']
grass_hp = pokemon_df[pokemon_df['Type 1'] == 'Grass']['HP']

# Check for normality (e.g., using Shapiro-Wilk test)
shapiro_dragon = stats.shapiro(dragon_hp)
shapiro_grass = stats.shapiro(grass_hp)
print(f"Shapiro-Wilk Test for Dragon-type HP: p-value = {shapiro_dragon.pvalue}")
print(f"Shapiro-Wilk Test for Grass-type HP: p-value = {shapiro_grass.pvalue}")

# Check for equal variances (e.g., using Levene's test)
levene_test = stats.levene(dragon_hp, grass_hp)
print(f"Levene's Test for Equal Variances: p-value = {levene_test.pvalue}")

# Perform the Two Sample T-Test (assuming unequal variances)
t_stat, p_value = stats.ttest_ind(dragon_hp, grass_hp, equal_var=False, alternative='greater')
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Interpret the results
if p_value < 0.05:
    print("Reject the null hypothesis: Dragon-type Pokémon have significantly higher average HP than Grass-type Pokémon.")
else:
    print("Fail to reject the null hypothesis: There is no evidence that Dragon-type Pokémon have higher average HP than Grass-type Pokémon.")

Shapiro-Wilk Test for Dragon-type HP: p-value = 0.3419890019248819
Shapiro-Wilk Test for Grass-type HP: p-value = 0.11214356726520774
Levene's Test for Equal Variances: p-value = 0.17484739934331067
T-statistic: 3.3349632905124063, P-value: 0.0007993609745420599
Reject the null hypothesis: Dragon-type Pokémon have significantly higher average HP than Grass-type Pokémon.
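
To make the `equal_var=False` option less of a black box, here is a minimal sketch of what Welch's t-test computes. It uses synthetic samples, not the Pokémon data, and the names `sample_a`, `sample_b`, and `welch_t` are hypothetical helpers introduced only for this illustration. The manual statistic and one-sided p-value should agree with `stats.ttest_ind`.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-ins for two independent samples (NOT the Pokémon data).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=80, scale=20, size=30)
sample_b = rng.normal(loc=65, scale=18, size=70)


def welch_t(a, b):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    # Per-sample variance of the mean (sample variance divided by sample size).
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_manual, df_manual = welch_t(sample_a, sample_b)
# One-sided p-value for the alternative "mean of a is greater than mean of b".
p_manual = stats.t.sf(t_manual, df_manual)

t_scipy, p_scipy = stats.ttest_ind(
    sample_a, sample_b, equal_var=False, alternative="greater"
)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.3g}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.3g}")
```

Because the variances are not pooled, the degrees of freedom are generally not an integer; that is expected for Welch's test.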


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, with 5% significance, comment on your findings.


In [14]:
from statsmodels.multivariate.manova import MANOVA
# Select the relevant columns
stats_df = pokemon_df[['Legendary', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
stats_df = stats_df.rename(columns={'Sp. Atk': 'SpAtk', 'Sp. Def': 'SpDef'})

# Perform MANOVA
# The columns containing spaces and dots were renamed above so they can be used directly in the formula
manova = MANOVA.from_formula('HP + Attack + Defense + SpAtk + SpDef + Speed ~ Legendary', data=stats_df)
results = manova.mv_test()
print(results)

# Interpret the results
# The first row of the 'stat' table is Wilks' lambda; .iloc avoids deprecated positional indexing with [0]
p_value = results.results['Legendary']['stat']['Pr > F'].iloc[0]
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in stats between Legendary and Non-Legendary Pokémon.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in stats between Legendary and Non-Legendary Pokémon.")

                   Multivariate linear model
                                                                
----------------------------------------------------------------
       Intercept         Value  Num DF  Den DF   F Value  Pr > F
----------------------------------------------------------------
          Wilks' lambda  0.0592 6.0000 793.0000 2100.8338 0.0000
         Pillai's trace  0.9408 6.0000 793.0000 2100.8338 0.0000
 Hotelling-Lawley trace 15.8953 6.0000 793.0000 2100.8338 0.0000
    Roy's greatest root 15.8953 6.0000 793.0000 2100.8338 0.0000
----------------------------------------------------------------
                                                                
----------------------------------------------------------------
          Legendary        Value  Num DF  Den DF  F Value Pr > F
----------------------------------------------------------------
             Wilks' lambda 0.7331 6.0000 793.0000 48.1098 0.0000
            Pillai's trace 0.2669 6.0000 793.

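
A significant MANOVA says the groups differ on the stats taken together, but not which stats drive the difference. One possible follow-up, sketched here under assumptions, is a Welch t-test per stat with a Bonferroni-adjusted threshold. The data below is synthetic (not the Pokémon dataset), and `demo_df`, `stat_cols`, and `followup` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in data (NOT the Pokémon dataset): two stats and a boolean group flag.
rng = np.random.default_rng(1)
n = 200
demo_df = pd.DataFrame({
    "Legendary": rng.random(n) < 0.2,
    "HP": rng.normal(70, 15, n),
    "Attack": rng.normal(80, 20, n),
})
# Shift one stat for the flagged group so there is a real difference to detect.
demo_df.loc[demo_df["Legendary"], "Attack"] += 30

stat_cols = ["HP", "Attack"]
alpha = 0.05
# Bonferroni: split the overall 5% level evenly across the individual tests.
bonferroni_alpha = alpha / len(stat_cols)

followup = {}
for col in stat_cols:
    flagged = demo_df.loc[demo_df["Legendary"], col]
    others = demo_df.loc[~demo_df["Legendary"], col]
    # Two-sided Welch test, matching the "different stats" hypothesis.
    t_stat, p_value = stats.ttest_ind(flagged, others, equal_var=False)
    followup[col] = {"t": t_stat, "p": p_value, "reject": p_value < bonferroni_alpha}

print(pd.DataFrame(followup).T)
```

Bonferroni is a conservative choice; it is used here only because it is simple to state, not because the lab prescribes it.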


**Challenge 2**

In this challenge, we will be working with california-housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [16]:
california_df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
california_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, with 5% significance, comment on your findings.


In [18]:
import pandas as pd
import numpy as np
import scipy.stats as stats

# Function to calculate Euclidean distance
def euclidean_distance(lat1, lon1, lat2, lon2):
    return np.sqrt((lat1 - lat2)**2 + (lon1 - lon2)**2)

# Coordinates for school and hospital, given as (longitude, latitude)
school_coords = (-118, 34)
hospital_coords = (-122, 37)

# Calculate distances (vectorized over the whole column instead of row-by-row apply)
california_df['distance_to_school'] = euclidean_distance(california_df['latitude'], california_df['longitude'], school_coords[1], school_coords[0])
california_df['distance_to_hospital'] = euclidean_distance(california_df['latitude'], california_df['longitude'], hospital_coords[1], hospital_coords[0])

# Determine if a house is close to either a school or a hospital
california_df['close_to_school_or_hospital'] = (california_df['distance_to_school'] < 0.50) | (california_df['distance_to_hospital'] < 0.50)

# Split the data into two groups
close_houses = california_df[california_df['close_to_school_or_hospital']]['median_house_value']
far_houses = california_df[~california_df['close_to_school_or_hospital']]['median_house_value']

# Perform the Two Sample T-Test (Welch's version, assuming unequal variances)
# Note: this is the default two-sided test, while the hypothesis is directional ("more expensive"),
# so the direction is read from the sign of the t-statistic.
t_stat, p_value = stats.ttest_ind(close_houses, far_houses, equal_var=False)
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Interpret the results
if p_value < 0.05 and t_stat > 0:
    print("Reject the null hypothesis: Houses close to either a school or a hospital are significantly more expensive.")
else:
    print("Fail to reject the null hypothesis: There is no evidence that houses close to a school or a hospital are more expensive.")

T-statistic: 37.992330214201516, P-value: 3.0064957768592614e-301
Reject the null hypothesis: Houses close to either a school or a hospital are significantly more expensive.
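
Since the hypothesis here is directional, it may help to see how a one-sided test relates to the two-sided one. The sketch below uses synthetic samples, not the housing data, and `close_demo` and `far_demo` are hypothetical names. When the t-statistic is positive, the `alternative="greater"` p-value is half of the two-sided p-value.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-ins for the two price groups (NOT the housing data).
rng = np.random.default_rng(2)
close_demo = rng.normal(loc=250_000, scale=90_000, size=500)
far_demo = rng.normal(loc=200_000, scale=100_000, size=2_000)

# Default: two-sided alternative (the means differ in either direction).
two_sided = stats.ttest_ind(close_demo, far_demo, equal_var=False)
# Directional alternative: the first group's mean is greater than the second's.
one_sided = stats.ttest_ind(
    close_demo, far_demo, equal_var=False, alternative="greater"
)

# The statistic is the same either way; only the tail area used for the p-value changes.
print(f"t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.3g}")
print(f"one-sided p = {one_sided.pvalue:.3g}")
```

If the t-statistic were negative, the one-sided p-value would instead be close to 1, which is why the sign matters when a two-sided result is read directionally.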
