# Hypothesis testing challenge

We will use the [2016 Olympics in Rio de Janeiro](https://www.kaggle.com/rio2016/olympic-games/) _data set_, which contains data on the athletes who competed in those Games.

This _data set_ has general information about 11538 athletes, such as name, nationality, height, weight and sport. We will be especially interested in the numerical variables `height` and `weight`. The analyses made here are part of an Exploratory Data Analysis (EDA).

## _Setup_ 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sct
import seaborn as sns

In [None]:
%matplotlib inline

from IPython.core.pylabtools import figsize


figsize(12, 8)

sns.set()

In [4]:
athletes = pd.read_csv("athletes.csv")

In [6]:
def get_sample(df, col_name, n=100, seed=42):
    np.random.seed(seed)
    
    random_idx = np.random.choice(df[col_name].dropna().index, size=n, replace=False)
    
    return df.loc[random_idx, col_name]
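
As a point of comparison, a similar helper can be sketched with pandas' own `Series.sample`. This is only an illustrative alternative, not part of the challenge: the name `get_sample_pandas` and the toy frame are hypothetical, and because it draws its random numbers differently it should not be expected to select the same rows as `get_sample` for the same seed.

```python
import numpy as np
import pandas as pd


def get_sample_pandas(df, col_name, n=100, seed=42):
    """Hypothetical variant of get_sample: draw n non-null values without replacement."""
    # dropna() removes missing values first, so NaN can never be sampled.
    return df[col_name].dropna().sample(n=n, replace=False, random_state=seed)


# Toy data standing in for the athletes DataFrame (values are made up).
toy = pd.DataFrame({"height": [1.72, np.nan, 1.98, 1.83, np.nan, 1.81, 1.68, 1.75]})

first = get_sample_pandas(toy, "height", n=4)
second = get_sample_pandas(toy, "height", n=4)

print(first.isna().any())    # checks whether any missing value was drawn
print(first.equals(second))  # checks that the same seed reproduces the sample
```

Either way, fixing the seed is what keeps the answers to the questions below reproducible.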

### Starting the analysis

In [7]:
athletes.head()

| | id | name | nationality | sex | dob | height | weight | sport | gold | silver | bronze |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 736041664 | A Jesus Garcia | ESP | male | 10/17/69 | 1.72 | 64.0 | athletics | 0 | 0 | 0 |
| 1 | 532037425 | A Lam Shin | KOR | female | 9/23/86 | 1.68 | 56.0 | fencing | 0 | 0 | 0 |
| 2 | 435962603 | Aaron Brown | CAN | male | 5/27/92 | 1.98 | 79.0 | athletics | 0 | 0 | 1 |
| 3 | 521041435 | Aaron Cook | MDA | male | 1/2/91 | 1.83 | 80.0 | taekwondo | 0 | 0 | 0 |
| 4 | 33922579 | Aaron Gate | NZL | male | 11/26/90 | 1.81 | 71.0 | cycling | 0 | 0 | 0 |


In [11]:
athletes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11538 entries, 0 to 11537
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           11538 non-null  int64  
 1   name         11538 non-null  object 
 2   nationality  11538 non-null  object 
 3   sex          11538 non-null  object 
 4   dob          11537 non-null  object 
 5   height       11208 non-null  float64
 6   weight       10879 non-null  float64
 7   sport        11538 non-null  object 
 8   gold         11538 non-null  int64  
 9   silver       11538 non-null  int64  
 10  bronze       11538 non-null  int64  
dtypes: float64(2), int64(4), object(5)
memory usage: 991.7+ KB


## Question 1

Considering a sample of size 3000 from the `height` column obtained with the `get_sample()` function, run the Shapiro-Wilk normality test with the `scipy.stats.shapiro()` function. Can we say that heights are normally distributed based on this test (at the 5% significance level)? Respond with a boolean (`True` or `False`).

In [12]:
def q1():
    data = get_sample(df=athletes, col_name='height', n=3000)
    # We fail to reject normality only if the p-value exceeds alpha = 0.05.
    return bool(sct.shapiro(data)[1] > 0.05)
    
q1()

False
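
The decision rule behind this answer can be sketched on synthetic data. The helper name `fails_to_reject_normality` and the exponential sample are assumptions made purely for illustration; they are not part of the challenge. Note that a `True` result would only mean the test found no evidence against normality, not that the data are proven normal.

```python
import numpy as np
import scipy.stats as sct


def fails_to_reject_normality(sample, alpha=0.05):
    """Return True when the Shapiro-Wilk p-value exceeds alpha."""
    p_value = sct.shapiro(sample)[1]
    return bool(p_value > alpha)


rng = np.random.default_rng(0)
# Synthetic, clearly non-normal data (exponential), used only for illustration.
skewed = rng.exponential(scale=1.0, size=500)

print(fails_to_reject_normality(skewed))  # False: strongly skewed data is rejected
```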

## Question 2

Repeat the procedure above, now using the Jarque-Bera normality test via the `scipy.stats.jarque_bera()` function. Can we now say that heights are normally distributed (at the 5% significance level)? Respond with a boolean (`True` or `False`).

In [13]:
def q2():
    data = get_sample(df=athletes, col_name='height', n=3000)
    # We fail to reject normality only if the p-value exceeds alpha = 0.05.
    return bool(sct.jarque_bera(data)[1] > 0.05)

q2()

False
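
The Jarque-Bera statistic depends only on the sample skewness $S$ and excess kurtosis $K$: $JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)$, compared against a chi-squared distribution with 2 degrees of freedom. A minimal sketch of that computation follows, using a synthetic normal sample as a stand-in for the heights (the data and variable names are assumptions for illustration only).

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(0)
# Synthetic stand-in for the height sample; not the real data.
x = rng.normal(loc=1.77, scale=0.11, size=3000)

n = x.size
s = sct.skew(x)      # biased sample skewness
k = sct.kurtosis(x)  # biased excess kurtosis (0 for a normal distribution)

# Jarque-Bera statistic and its asymptotic chi-squared(2) p-value.
jb_manual = n / 6 * (s**2 + k**2 / 4)
p_manual = sct.chi2.sf(jb_manual, df=2)

jb_scipy, p_scipy = sct.jarque_bera(x)
print(np.isclose(jb_manual, jb_scipy), np.isclose(p_manual, p_scipy))
```

Because the test looks only at these two moments, it can miss departures from normality that leave skewness and kurtosis unchanged.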

## Question 3

Now consider a sample of size 3000 from the `weight` column obtained with the `get_sample()` function. Run the D'Agostino-Pearson normality test using the `scipy.stats.normaltest()` function. Can we say that the weights come from a normal distribution at the 5% significance level? Respond with a boolean (`True` or `False`).

In [19]:
def q3():
    data = get_sample(df=athletes, col_name='weight', n=3000)
    # We fail to reject normality only if the p-value exceeds alpha = 0.05.
    return bool(sct.normaltest(data)[1] > 0.05)

q3()

False
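
The D'Agostino-Pearson test combines a skewness test and a kurtosis test into one omnibus statistic, $K^2 = Z_s^2 + Z_k^2$, which is referred to a chi-squared distribution with 2 degrees of freedom. The sketch below rebuilds it from `scipy.stats.skewtest` and `scipy.stats.kurtosistest` on synthetic right-skewed data; the sample and variable names are illustrative assumptions, not the athletes' weights.

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for the weights; not the real data.
x = rng.gamma(shape=20.0, scale=3.6, size=3000)

z_skew = sct.skewtest(x)[0]      # z-score of the skewness test
z_kurt = sct.kurtosistest(x)[0]  # z-score of the kurtosis test

# Omnibus statistic and its chi-squared(2) p-value.
k2_manual = z_skew**2 + z_kurt**2
p_manual = sct.chi2.sf(k2_manual, df=2)

k2_scipy, p_scipy = sct.normaltest(x)
print(np.isclose(k2_manual, k2_scipy), np.isclose(p_manual, p_scipy))
```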

## Question 4

Apply a logarithmic transformation to the weight sample from question 3 and repeat the same procedure. Can we claim normality of the transformed variable at the 5% significance level? Respond with a boolean (`True` or `False`).

In [20]:
def q4():
    data = np.log(get_sample(df=athletes, col_name='weight', n=3000))
    # We fail to reject normality only if the p-value exceeds alpha = 0.05.
    return bool(sct.normaltest(data)[1] > 0.05)

q4()

False
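
The motivation for trying a log transformation is that it compresses a long right tail. The sketch below shows the effect on synthetic data that is exactly log-normal, which is an idealized assumption; the real weights evidently are not, since the test above still rejects normality after the transformation.

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(0)
# Synthetic log-normal "weights" (median around 70); illustrative values only.
raw = rng.lognormal(mean=np.log(70), sigma=0.25, size=3000)
logged = np.log(raw)

skew_raw = sct.skew(raw)
skew_logged = sct.skew(logged)

# The log transformation pulls the right tail in, shrinking the skewness.
print(f"skewness before: {skew_raw:.3f}, after: {skew_logged:.3f}")
```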

> __For questions 5, 6, and 7 below, consider all tests performed at the 5% significance level__.

## Question 5

Select all Brazilian, American, and Canadian athletes into `DataFrame`s called `bra`, `usa`, and `can`, respectively. Perform a hypothesis test comparing the mean heights (`height`) of `bra` and `usa`, treating them as independent samples with unequal variances, using the `scipy.stats.ttest_ind()` function. Can we say that the means are statistically equal? Respond with a boolean (`True` or `False`).

In [21]:
def q5():
    bra = athletes.loc[athletes['nationality'] == 'BRA', 'height']
    usa = athletes.loc[athletes['nationality'] == 'USA', 'height']
    # Welch's t-test (equal_var=False); NaN heights are ignored.
    return bool(sct.ttest_ind(bra, usa, equal_var=False, nan_policy='omit')[1] > 0.05)

q5()

False
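
Calling `ttest_ind` with `equal_var=False` runs Welch's t-test, which does not pool the two variances and uses the Welch-Satterthwaite approximation for the degrees of freedom. A minimal sketch of that calculation on two synthetic samples follows; the sample sizes, means and spreads are made-up assumptions, not the athletes' heights.

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(0)
# Two synthetic height samples with different sizes and variances.
a = rng.normal(loc=1.76, scale=0.11, size=400)
b = rng.normal(loc=1.78, scale=0.12, size=550)

# Squared standard error of each sample mean (unbiased variances).
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
# Welch-Satterthwaite degrees of freedom.
dof = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
p_manual = 2 * sct.t.sf(abs(t_manual), dof)  # two-sided p-value

t_scipy, p_scipy = sct.ttest_ind(a, b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```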

## Question 6

Repeat the procedure of question 5, but now between the heights of `bra` and `can`. Can we now say that the means are statistically equal? Respond with a boolean (`True` or `False`).

In [22]:
def q6():
    bra = athletes.loc[athletes['nationality'] == 'BRA', 'height']
    can = athletes.loc[athletes['nationality'] == 'CAN', 'height']
    # Welch's t-test (equal_var=False); NaN heights are ignored.
    return bool(sct.ttest_ind(bra, can, equal_var=False, nan_policy='omit')[1] > 0.05)

q6()

True

## Question 7

Repeat the procedure of question 6, but now between the heights of `usa` and `can`. What p-value is returned? Respond with a single scalar rounded to eight decimal places.

In [23]:
def q7():
    usa = athletes.loc[athletes['nationality'] == 'USA', 'height']
    can = athletes.loc[athletes['nationality'] == 'CAN', 'height']
    return float(sct.ttest_ind(usa, can, equal_var= False, nan_policy= 'omit')[1].round(8))

q7()

0.00046601