# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab, we apply hypothesis tests to several scenarios and interpret what the results say about the data.

We cover testing the mean of a single sample (one-sample t-test), comparing two independent groups (two-sample t-test), comparing dependent samples (paired-sample t-test), and comparing means across multiple groups with Analysis of Variance (ANOVA).

So, grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [99]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [100]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.

In [102]:
column_summary_df = pd.DataFrame({
    'Column Name': df.columns,
    'Data Type': df.dtypes.values,
    'Percentage Null': df.isnull().mean().values * 100,
    'Unique Values': df.nunique().values
})

#column_summary_df.to_excel('column_summary_df.xlsx', index=False)
print(column_summary_df)

   Column Name Data Type  Percentage Null  Unique Values
0         Name    object            0.125            799
1       Type 1    object            0.000             18
2       Type 2    object           48.250             18
3           HP     int64            0.000             94
4       Attack     int64            0.000            111
5      Defense     int64            0.000            103
6      Sp. Atk     int64            0.000            105
7      Sp. Def     int64            0.000             92
8        Speed     int64            0.000            108
9   Generation     int64            0.000              6
10   Legendary      bool            0.000              2


In [103]:
# H0: There is no difference in average HP between Dragon and Grass type Pokémon.
# H1: Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon.

In [104]:
dragon_hp = df[df['Type 1'] == 'Dragon']['HP']
grass_hp = df[df['Type 1'] == 'Grass']['HP']
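
Whether to use a pooled t-test or Welch's t-test depends on whether the two groups have similar variances. A minimal sketch of such a check with Levene's test follows; the sample values are hypothetical stand-ins so the snippet runs on its own, and in the notebook `dragon_hp` and `grass_hp` would be passed instead.

```python
from scipy.stats import levene

# Hypothetical HP samples (not taken from the dataset);
# in the notebook, pass dragon_hp and grass_hp instead.
hypothetical_group_a = [78, 90, 65, 100, 85, 72, 95, 80]
hypothetical_group_b = [45, 60, 80, 65, 70, 55, 75, 50, 66, 58]

# Levene's test: H0 is that both groups have equal variances.
# center='median' is the variant that is more robust to non-normal data.
levene_stat, p_levene = levene(hypothetical_group_a, hypothetical_group_b, center='median')

# If H0 is not rejected, a pooled t-test (equal_var=True) is defensible;
# otherwise Welch's t-test (equal_var=False) is the safer choice.
equal_var = p_levene >= 0.05
print(f"Levene statistic: {levene_stat:.3f}, p-value: {p_levene:.3f}, equal_var={equal_var}")
```

Using `equal_var=False` regardless of this check is also a common and conservative choice, since Welch's test remains valid when the variances happen to be equal.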

In [105]:
from scipy.stats import ttest_ind

# Welch's t-test (equal_var=False does not assume equal variances); two-sided by default
t_stat, p_value = ttest_ind(dragon_hp, grass_hp, equal_var=False)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

T-Statistic: 3.3349632905124063
P-Value: 0.0015987219490841199


In [106]:
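# The t-statistic (3.33) is the standardized difference between the two sample means (Dragon HP minus Grass HP).
# The p-value (0.0016) is the probability of observing a difference at least this extreme if the null hypothesis were true.
# Since the p-value is below 0.05, we reject the null hypothesis. The test above is two-sided; because the t-statistic is positive,
# the one-sided p-value for H1 (Dragon > Grass) is half of it (about 0.0008), so Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon.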
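
The alternative hypothesis here is one-sided (Dragon > Grass), while `ttest_ind` runs a two-sided test unless told otherwise. The sketch below shows how the one-sided version could be requested with the `alternative` argument (available in recent SciPy releases). The sample values are hypothetical, chosen only so the snippet runs without the dataset.

```python
from scipy.stats import ttest_ind

# Hypothetical HP samples (not taken from the dataset);
# in the notebook, pass dragon_hp and grass_hp instead.
hypothetical_dragon_hp = [78, 90, 65, 100, 85, 72, 95, 80]
hypothetical_grass_hp = [45, 60, 80, 65, 70, 55, 75, 50, 66, 58]

# Two-sided Welch test: H1 is "the means differ".
t_two, p_two = ttest_ind(hypothetical_dragon_hp, hypothetical_grass_hp, equal_var=False)

# One-sided Welch test: H1 is "the first mean is greater than the second".
t_one, p_one = ttest_ind(hypothetical_dragon_hp, hypothetical_grass_hp,
                         equal_var=False, alternative='greater')

print(f"two-sided p = {p_two:.5f}, one-sided p = {p_one:.5f}")
```

The t-statistic is the same in both calls. When it is positive, the one-sided p-value is half the two-sided one.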

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with non-Legendary Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.


In [108]:
# H0: Legendary and non-Legendary Pokémon have the same mean values across the six stats.
# H1: The mean of at least one stat differs between Legendary and non-Legendary Pokémon.

In [109]:
stat_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
legendary_stats = df.loc[df['Legendary'], stat_cols]
non_legendary_stats = df.loc[~df['Legendary'], stat_cols]

In [110]:
from statsmodels.multivariate.manova import MANOVA

# Select the grouping column and the six stats; rename columns containing dots and spaces so they are valid in a formula
df_manova = df[['Legendary', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].copy()
df_manova = df_manova.rename(columns={'Sp. Atk': 'Sp_Atk', 'Sp. Def': 'Sp_Def'})
df_manova['Legendary'] = df_manova['Legendary'].astype(int)

# Create the MANOVA model
manova = MANOVA.from_formula('HP + Attack + Defense + Sp_Atk + Sp_Def + Speed ~ Legendary', data=df_manova)
result = manova.mv_test()

print(result)

                   Multivariate linear model
                                                                
----------------------------------------------------------------
       Intercept         Value  Num DF  Den DF   F Value  Pr > F
----------------------------------------------------------------
          Wilks' lambda  0.0592 6.0000 793.0000 2100.8338 0.0000
         Pillai's trace  0.9408 6.0000 793.0000 2100.8338 0.0000
 Hotelling-Lawley trace 15.8953 6.0000 793.0000 2100.8338 0.0000
    Roy's greatest root 15.8953 6.0000 793.0000 2100.8338 0.0000
----------------------------------------------------------------
                                                                
----------------------------------------------------------------
          Legendary        Value  Num DF  Den DF  F Value Pr > F
----------------------------------------------------------------
             Wilks' lambda 0.7331 6.0000 793.0000 48.1098 0.0000
            Pillai's trace 0.2669 6.0000 793.

In [111]:
# In the Legendary block of the output, the p-value is reported as 0.0000 (Wilks' lambda = 0.7331, F = 48.11), which is below 0.05.
# We therefore reject the null hypothesis: the set of stats differs between Legendary and non-Legendary Pokémon.
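
A significant MANOVA says that the groups differ on the stats taken together, but not which stats drive the difference. One possible follow-up, which is not part of the original lab, is a Welch t-test per stat with a Bonferroni-corrected threshold. The data below is randomly generated and purely hypothetical (group sizes, means and spreads are assumptions); in the notebook, the columns of `legendary_stats` and `non_legendary_stats` would be used instead.

```python
import numpy as np
from scipy.stats import ttest_ind

stat_names = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Hypothetical stand-in data so the sketch runs on its own.
rng = np.random.default_rng(42)
hypothetical_legendary = {s: rng.normal(100, 25, size=60) for s in stat_names}
hypothetical_non_legendary = {s: rng.normal(70, 25, size=740) for s in stat_names}

alpha = 0.05
bonferroni_alpha = alpha / len(stat_names)  # correct for running six tests

results = {}
for s in stat_names:
    # Welch's t-test per stat (two-sided, no equal-variance assumption)
    t_stat, p_value = ttest_ind(hypothetical_legendary[s],
                                hypothetical_non_legendary[s],
                                equal_var=False)
    results[s] = (t_stat, p_value, bool(p_value < bonferroni_alpha))

for s, (t_stat, p_value, significant) in results.items():
    print(f"{s}: t = {t_stat:.2f}, p = {p_value:.3g}, significant = {significant}")
```

Bonferroni is a conservative correction; it is used here only because it is simple to state and apply.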

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [114]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


In [115]:
column_summary_df = pd.DataFrame({
    'Column Name': df.columns,
    'Data Type': df.dtypes.values,
    'Percentage Null': df.isnull().mean().values * 100,
    'Unique Values': df.nunique().values
})

#column_summary_df.to_excel('column_summary_df_2.xlsx', index=False)
print(column_summary_df)

          Column Name Data Type  Percentage Null  Unique Values
0           longitude   float64              0.0            827
1            latitude   float64              0.0            840
2  housing_median_age   float64              0.0             52
3         total_rooms   float64              0.0           5533
4      total_bedrooms   float64              0.0           1848
5          population   float64              0.0           3683
6          households   float64              0.0           1740
7       median_income   float64              0.0          11175
8  median_house_value   float64              0.0           3694


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.
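
As a small worked example of this rule, the sketch below checks two hypothetical points against the 0.50 threshold. The function name `is_close` and the test points are illustrative assumptions; since the coordinates are longitude and latitude, the distance is measured in degrees.

```python
import math

SCHOOL = (-118, 34)
HOSPITAL = (-122, 37)
THRESHOLD = 0.50

def is_close(longitude, latitude):
    """Return True if the point is within THRESHOLD of the school or the hospital."""
    dist_school = math.hypot(longitude - SCHOOL[0], latitude - SCHOOL[1])
    dist_hospital = math.hypot(longitude - HOSPITAL[0], latitude - HOSPITAL[1])
    return dist_school < THRESHOLD or dist_hospital < THRESHOLD

# Hypothetical points, not rows from the dataset
print(is_close(-118.3, 34.2))  # distance to school = sqrt(0.3**2 + 0.2**2), about 0.36 -> True
print(is_close(-120.0, 35.5))  # distance to both = sqrt(2**2 + 1.5**2) = 2.5 -> False
```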

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at a 5% significance level, comment on your findings.
 

In [133]:
from scipy.stats import ttest_ind

# Function to calculate Euclidean distance
def euclidean_distance(x1, y1, x2, y2):
    # Calculate the Euclidean distance between two points (x1, y1) and (x2, y2)
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

# Coordinates for school and hospital
school_coords = (-118, 34)
hospital_coords = (-122, 37)

# Calculate distances (vectorized: the function accepts whole columns, so no row-wise apply is needed)
df['dist_to_school'] = euclidean_distance(df['longitude'], df['latitude'], *school_coords)
df['dist_to_hospital'] = euclidean_distance(df['longitude'], df['latitude'], *hospital_coords)

# Determine if each house is close to a school or hospital
df['is_close'] = ((df['dist_to_school'] < 0.50) | (df['dist_to_hospital'] < 0.50)).astype(int)

close_houses = df[df['is_close'] == 1]['median_house_value']
far_houses = df[df['is_close'] == 0]['median_house_value']

# Welch's t-test (equal_var=False does not assume equal variances); two-sided by default
t_stat, p_value = ttest_ind(close_houses, far_houses, equal_var=False)

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

# Interpretation of results (the test is two-sided, so the sign of t_stat gives the direction)
if p_value < 0.05 and t_stat > 0:
    print("Houses close to a school or hospital have a significantly higher median house value.")
elif p_value < 0.05:
    print("Houses close to a school or hospital have a significantly lower median house value.")
else:
    print("There is no significant difference in median house value based on proximity to a school or hospital.")

T-Statistic: 37.992330214201516
P-Value: 3.0064957768592614e-301
Houses close to a school or hospital have a significantly higher median house value.
