# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab, we apply hypothesis tests to several scenarios and interpret what the results say about the data.

We cover testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), and comparing dependent samples (Paired Sample T-Test). We also use Analysis of Variance (ANOVA) to compare means across more than two groups.
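As a rough orientation, the four tests named above map onto `scipy.stats` functions as sketched below. The samples `a`, `b`, and `c` are synthetic values drawn only for illustration and are not part of the lab data.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples (assumption: illustration only, not lab data)
rng = np.random.default_rng(0)
a = rng.normal(50, 10, size=30)
b = rng.normal(55, 10, size=30)
c = rng.normal(60, 10, size=30)

# One-sample t-test: is the mean of `a` different from 50?
one_sample = st.ttest_1samp(a, popmean=50)

# Two-sample (Welch) t-test: do independent groups `a` and `b` differ in mean?
two_sample = st.ttest_ind(a, b, equal_var=False)

# Paired t-test: treats a[i] and b[i] as two measurements of the same unit
paired = st.ttest_rel(a, b)

# One-way ANOVA: do the means of `a`, `b`, and `c` differ?
anova = st.f_oneway(a, b, c)

results_overview = {
    "one-sample": one_sample,
    "two-sample": two_sample,
    "paired": paired,
    "anova": anova,
}
for name, res in results_overview.items():
    print(f"{name}: statistic={res.statistic:.3f}, p-value={res.pvalue:.4f}")
```

Each call returns a result object with `statistic` and `pvalue` attributes, so the decision step (compare `pvalue` with the significance level) looks the same for all four tests.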

So grab your statistical tools, state your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
import pandas as pd
import scipy.stats as st
import numpy as np


In [42]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
#df.to_csv('C:\\Users\\Rike\\Documents\\git\\Statistics\\lab-hypothesis-testing\\pokemon.csv')
df.head(20)

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
7,Mega Charizard X,Fire,Dragon,78,130,111,130,85,100,1,False
8,Mega Charizard Y,Fire,Flying,78,104,78,159,115,100,1,False
9,Squirtle,Water,,44,48,65,50,64,43,1,False


- We posit that Pokémon of type Dragon have, on average, higher HP than Pokémon of type Grass. Choose the proper test and, at a 5% significance level, comment on your findings.

In [33]:
# all Pokémon with "Dragon" as Type 1 or Type 2
df_dragon = df[(df["Type 1"] == "Dragon") | (df["Type 2"] == "Dragon")]
# all Pokémon with "Grass" as Type 1 or Type 2
df_grass = df[(df["Type 1"] == "Grass") | (df["Type 2"] == "Grass")]

In [35]:
df_dragon['Name'].count()
# df_dragon

50

In [17]:
df_grass['Name'].count()

95

In [81]:
# H0: mean HP of Dragon <= mean HP of Grass
# H1: mean HP of Dragon > mean HP of Grass
# Welch's t-test (equal_var=False) on the raw HP values, one-sided
t_statistic, p_value = st.ttest_ind(df_dragon['HP'], df_grass['HP'], equal_var=False, alternative='greater')
print(t_statistic, p_value)  # p-value is about 5.09e-05 < 0.05, so we reject H0

4.097528915272702 5.090769061176926e-05


In [29]:
# only divide by 2 if alternative='greater' was not used
#p_value_one_tailed = p_value / 2   # because we are only interested in one tail
#p_value_one_tailed                # either way, the p-value is far below 0.05, so we reject the null hypothesis
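The relationship between the two-sided and the one-sided p-value mentioned in the comments above can be checked with a small sketch. The samples `x` and `y` are synthetic (an assumption for illustration), and `one_sided_from_two_sided` is a hypothetical helper, not a SciPy function.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples (assumption: illustration only, not the Pokémon data)
rng = np.random.default_rng(1)
x = rng.normal(85, 20, size=50)
y = rng.normal(67, 20, size=95)


def one_sided_from_two_sided(t_stat, p_two_sided):
    """Convert a two-sided p-value to the one-sided p-value for H1: mean(x) > mean(y).

    Halving is only valid when the t-statistic points in the direction of H1 (t > 0);
    otherwise the one-sided p-value is 1 - p/2.
    """
    if t_stat > 0:
        return p_two_sided / 2
    return 1 - p_two_sided / 2


# Two-sided Welch test, then manual conversion
t_two, p_two = st.ttest_ind(x, y, equal_var=False)
p_manual = one_sided_from_two_sided(t_two, p_two)

# One-sided Welch test computed directly by SciPy
t_one, p_one = st.ttest_ind(x, y, equal_var=False, alternative='greater')

print(p_manual, p_one)
```

Passing `alternative='greater'` avoids the manual step and the risk of halving a p-value whose t-statistic has the wrong sign.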

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) than Non-Legendary Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.


In [41]:
df.columns

Index(['Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

In [50]:
# H0: the mean of each stat is the same for Legendary and Non-Legendary Pokémon
# H1: the mean of the stat differs between Legendary and Non-Legendary Pokémon



df_legendary = df[df['Legendary'] == True]
df_non_legendary = df[df['Legendary'] == False]

# list of stats to test
stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# save results stat and p
results = {}

# t-Test for every statistic
for stat in stats:
    t_statistic, p_value = st.ttest_ind(df_legendary[stat], df_non_legendary[stat], equal_var=False)
    
    # save statistic and p-value for every stat in stats in dict results
    results[stat] = {'t_statistic': t_statistic, 'p_value': p_value}

# print every result to support the decision
for stat, result in results.items():
    print(f"{stat} - t-statistic: {result['t_statistic']:.4f}, p-value: {result['p_value']:.30f}")
# every p-value is far below 0.05, so the null hypothesis is rejected for all six stats

HP - t-statistic: 8.9814, p-value: 0.000000000000100269117080352841
Attack - t-statistic: 10.4381, p-value: 0.000000000000000252037244923665
Defense - t-statistic: 7.6371, p-value: 0.000000000048269984949193315837
Sp. Atk - t-statistic: 13.4174, p-value: 0.000000000000000000001551461411
Sp. Def - t-statistic: 10.0157, p-value: 0.000000000000002294932786405283
Speed - t-statistic: 11.4750, p-value: 0.000000000000000001049016311882
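
The decision step for the six tests can be sketched explicitly. The p-values below are copied from the output above. The Bonferroni threshold (`alpha / 6`) is an optional, stricter criterion added here as an assumption, since running six tests at 5% each raises the chance of at least one false positive. The lab itself only asks for the 5% level.

```python
alpha = 0.05

# p-values copied from the printed results above
p_values = {
    'HP': 0.000000000000100269117080352841,
    'Attack': 0.000000000000000252037244923665,
    'Defense': 0.000000000048269984949193315837,
    'Sp. Atk': 0.000000000000000000001551461411,
    'Sp. Def': 0.000000000000002294932786405283,
    'Speed': 0.000000000000000001049016311882,
}

# Optional stricter threshold (assumption: Bonferroni correction for 6 tests)
alpha_bonferroni = alpha / len(p_values)

decisions = {}
for stat, p in p_values.items():
    decisions[stat] = {
        'reject_at_alpha': p < alpha,
        'reject_at_bonferroni': p < alpha_bonferroni,
    }
    print(f"{stat}: p={p:.3e}, reject H0 at 5%: {p < alpha}, "
          f"reject H0 after Bonferroni: {p < alpha_bonferroni}")
```

With p-values this small, the conclusion is the same under either threshold.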


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [68]:
df_house = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df_house.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at a 5% significance level, comment on your findings.
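
The first two hints can be sketched as a small function applied to whole columns at once. The function name `euclidean_distance`, the frame `houses`, and its first two rows are hypothetical and chosen for illustration. The third row uses the coordinates of the first record in the dataset.

```python
import numpy as np
import pandas as pd


def euclidean_distance(lon, lat, ref_lon, ref_lat):
    """Euclidean distance in coordinate units (degrees) to a reference point.

    Works on scalars and on pandas Series / NumPy arrays alike.
    """
    return np.sqrt((lon - ref_lon) ** 2 + (lat - ref_lat) ** 2)


# Reference points and threshold from the task description
school = (-118, 37)
hospital = (-122, 34)
threshold = 0.50

# Small illustrative frame (assumption: rows 0 and 1 are made up)
houses = pd.DataFrame({
    'longitude': [-118.3, -122.0, -114.31],
    'latitude': [37.2, 34.0, 34.19],
})

houses['dist_school'] = euclidean_distance(houses['longitude'], houses['latitude'], *school)
houses['dist_hospital'] = euclidean_distance(houses['longitude'], houses['latitude'], *hospital)

# "Close" means within the threshold of either the school or the hospital
houses['is_close'] = (houses['dist_school'] < threshold) | (houses['dist_hospital'] < threshold)

print(houses)
```

Operating on whole columns avoids an explicit Python loop over rows, and the boolean `is_close` column can be used directly to split the data into the two groups.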
 

In [69]:
# euc_dist = np.sqrt((lon1 - lon2)**2 + (lat1 - lat2)**2)
lon_sch = -118
lat_sch = 37
lon_hosp = -122
lat_hosp = 34

In [70]:
distance_sch = []
distance_hosp = []
for i in range(len(df_house)):
    dist_sch = np.sqrt((df_house.loc[i, 'longitude'] - lon_sch)**2 + ((df_house.loc[i, 'latitude'] - lat_sch)**2))
    dist_hosp = np.sqrt((df_house.loc[i, 'longitude'] - lon_hosp)**2 + ((df_house.loc[i, 'latitude'] - lat_hosp)**2))
    distance_sch.append(dist_sch)
    distance_hosp.append(dist_hosp)


In [71]:
df_house['dist_school'] = distance_sch
df_house['dist_hospital'] = distance_hosp
df_house

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,dist_school,dist_hospital
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,4.638125,7.692347
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,4.384165,7.540617
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,4.773856,7.446456
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,4.801510,7.438716
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,4.850753,7.442432
...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,7.211380,6.957298
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,7.275232,7.064630
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,7.944533,8.170410
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,7.920227,8.132035


In [80]:
print("Min Dist to School:", df_house['dist_school'].min())
print("Max Dist to School:", df_house['dist_school'].max())
print("Min Dist to hospital:", df_house['dist_hospital'].min())
print("Max Dist to hospital:", df_house['dist_hospital'].max())

Min Dist to School: 0.31575306807694153
Max Dist to School: 7.944532711242367
Min Dist to hospital: 1.574198208612878
Max Dist to hospital: 8.232988521794503


In [78]:
# threshold_degrees for "near" hospital or school
threshold = 0.50  

# filter df_house for houses near school or hospital and houses far from both
df_close = df_house[(df_house['dist_school'] < threshold) | (df_house['dist_hospital'] < threshold)]
df_far = df_house[(df_house['dist_school'] >= threshold) & (df_house['dist_hospital'] >= threshold)]

print("Count near:", len(df_close))
print("Count far:", len(df_far))


Count near: 5
Count far: 16995


In [84]:
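# H0: mean value close <= mean value far; H1: mean value close > mean value far (one-sided Welch's t-test)
t_stat, p_value = st.ttest_ind(df_close['median_house_value'], df_far['median_house_value'], equal_var=False, alternative='greater')
print(t_stat, p_value)  # p-value is about 0.99997 > 0.05, so we fail to reject H0; the negative t means the 5 close neighborhoods are cheaper on average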

-17.174167998688404 0.9999738999071939
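
A p-value this close to 1 is what a one-sided test returns when the data point in the opposite direction of H1. The sketch below illustrates this with synthetic samples. The names `close_sample` and `far_sample` and their values are assumptions and do not come from the housing data.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples (assumption: illustration only, not the housing data)
rng = np.random.default_rng(2)
close_sample = rng.normal(150_000, 20_000, size=5)
far_sample = rng.normal(200_000, 100_000, size=1000)

# Same data, both one-sided alternatives
t_stat, p_greater = st.ttest_ind(close_sample, far_sample, equal_var=False, alternative='greater')
_, p_less = st.ttest_ind(close_sample, far_sample, equal_var=False, alternative='less')

print(f"t = {t_stat:.3f}")
print(f"p (H1: close > far) = {p_greater:.6f}")
print(f"p (H1: close < far) = {p_less:.6f}")
print(f"sum of both one-sided p-values = {p_greater + p_less:.6f}")
```

The two one-sided p-values sum to 1, so a negative t-statistic always yields a p-value above 0.5 for `alternative='greater'`. With only 5 observations in the close group, any conclusion in either direction should be treated with caution.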
