# Challenge 4


In this challenge, we're going to practice hypothesis testing. We will use the [2016 Olympics in Rio de Janeiro](https://www.kaggle.com/rio2016/olympic-games/) _data set_, which contains data on the athletes who competed in the games.

This _data set_ has general information about 11538 athletes, such as name, nationality, height, weight and sport. We are especially interested in the numerical variables `height` and `weight`. The analyses made here are part of an Exploratory Data Analysis (EDA).

## _Setup_ 

In [48]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sct
import seaborn as sns
import statsmodels.api as sm

In [49]:
athletes = pd.read_csv("athletes.csv")

In [50]:
def get_sample(df, col_name, n=100, seed=42):
    """Get a sample from a column of a dataframe.
    
    It drops any numpy.nan entries before sampling. The sampling
    is performed without replacement.
    
    This is an example of a numpydoc-style docstring, for those who haven't seen one yet.
    
    Parameters
    ----------
    df : pandas.DataFrame
        Source dataframe.
    col_name : str
        Name of the column to be sampled.
    n : int
        Sample size. Default is 100.
    seed : int
        Random seed. Default is 42.
    
    Returns
    -------
    pandas.Series
        Sample of size n from dataframe's column.
    """
    np.random.seed(seed)
    
    random_idx = np.random.choice(df[col_name].dropna().index, size=n, replace=False)
    
    return df.loc[random_idx, col_name]
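
A quick way to see what `get_sample` guarantees is to run it on a toy frame. The sketch below repeats the function body from the cell above so that it runs on its own; the `toy` DataFrame and its values are made up for illustration and are not taken from `athletes.csv`.

```python
import numpy as np
import pandas as pd


def get_sample(df, col_name, n=100, seed=42):
    # Same logic as the notebook's helper, repeated so this block is self-contained.
    np.random.seed(seed)
    random_idx = np.random.choice(df[col_name].dropna().index, size=n, replace=False)
    return df.loc[random_idx, col_name]


# Hypothetical data: eight heights, two of them missing.
toy = pd.DataFrame({"height": [1.70, np.nan, 1.82, 1.65, np.nan, 1.90, 1.75, 1.68]})

first = get_sample(toy, "height", n=4)
second = get_sample(toy, "height", n=4)

print(first.isna().any())     # False: NaN rows are dropped before sampling
print(first.index.is_unique)  # True: sampling is without replacement
print(first.equals(second))   # True: the same seed gives the same sample
```

Because the helper reseeds NumPy's global generator on every call, two calls with the same arguments return the same rows. This is why questions 1 and 2 end up testing the exact same height sample.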

# Refactoring code with OOP


After completing the challenge, I refactored the code, as is good practice, to make it easier to reuse and understand and to improve performance.

In [51]:
import scipy.stats as sct

class HypothesisTests(object):
    """
    Class used to test hypotheses by means of p-value analysis.
    """
    
    def __init__(self, alpha=0.1):
        """
        Initialize the instance.
        
        Args:
            alpha: Significance level compared against the p-value.
        
        Attributes:
            alpha: Significance level compared against the p-value.
            p_value: p-value returned by the most recent statistical test.
            stats: Test statistic returned by the most recent statistical test.
            is_true: True if the null hypothesis was not rejected
                (p_value >= alpha), False otherwise.
        """
        
        self.alpha = alpha
        self.p_value = 0
        self.stats = 0
        self.is_true = False
    
    
    @property
    def alpha(self):
        
        return self.__alpha
    
    
    @alpha.setter
    def alpha(self, alpha):
        
        self.__alpha = alpha
    
    
    def test_shapiro(self, arr):
        """
        Test the hypothesis that an iterable follows a normal distribution using the
        Shapiro-Wilk test, and update stats and p_value.
        Update is_true: True if normality is not rejected at level alpha, False otherwise.
        
        args:
            arr: iterable of floats or integers
            
        return:
            None
        """
        
        self.stats, self.p_value = sct.shapiro(arr)
        self.is_true = not (self.p_value < self.alpha)
        
        
    def test_jarquebera(self, arr):
        """
        Test the hypothesis that an iterable follows a normal distribution using the
        Jarque-Bera test, and update stats and p_value.
        Update is_true: True if normality is not rejected at level alpha, False otherwise.
        
        args:
            arr: iterable of floats or integers
            
        return:
            None
        """
        
        self.stats, self.p_value = sct.jarque_bera(arr)
        self.is_true = not (self.p_value < self.alpha)
        

    def test_normaltest(self, arr):
        """
        Test the hypothesis that an iterable follows a normal distribution using the
        D'Agostino-Pearson test (scipy.stats.normaltest), and update stats and p_value.
        Update is_true: True if normality is not rejected at level alpha, False otherwise.
        
        args:
            arr: iterable of floats or integers
            
        return:
            None
        """
        
        self.stats, self.p_value = sct.normaltest(arr)
        self.is_true = not (self.p_value < self.alpha)
        
    
    def test_ttest_ind(self, arr_1, arr_2, equal_var=True):
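        """
        Test the hypothesis that the means of two iterables are statistically the same
        using a t-test for independent samples, and update stats and p_value.
        Update is_true: True if equality of means is not rejected at level alpha, False otherwise.
        
        args:
            arr_1: iterable of floats or integers
            arr_2: iterable of floats or integers
            equal_var: if True, assume equal population variances (Student's t-test);
                if False, perform Welch's t-test
            
        return:
            None
        """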
        
        self.stats, self.p_value = sct.ttest_ind(arr_1, arr_2, equal_var=equal_var)
        self.is_true = not (self.p_value < self.alpha)       
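
Every `test_*` method above ends with the same decision rule, `is_true = not (p_value < alpha)`. A minimal sketch of that rule as a standalone function follows; `fail_to_reject` is a hypothetical name, and the exponential sample is synthetic. Note that `True` here only means the test found no evidence against the null hypothesis at level `alpha`, not that the null hypothesis has been proven.

```python
import numpy as np
import scipy.stats as sct


def fail_to_reject(p_value, alpha=0.05):
    """Mirror of the class's rule: True when the null hypothesis is not rejected."""
    return not (p_value < alpha)


print(fail_to_reject(0.20))  # True: p-value above alpha
print(fail_to_reject(0.05))  # True: a p-value exactly equal to alpha does not reject
print(fail_to_reject(0.01))  # False: p-value below alpha

# Synthetic, clearly non-normal data: an exponential sample.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)

stat, p_value = sct.shapiro(skewed)
print(fail_to_reject(p_value))  # False: normality is rejected
```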

## Analysis

In [52]:
# Question 01

shapiro = HypothesisTests(0.05)
shapiro.test_shapiro(get_sample(athletes, 'height', n=3000))
q1_result = shapiro.is_true

# Question 02

jarque_bera = HypothesisTests(0.05)
jarque_bera.test_jarquebera(get_sample(athletes, 'height', n=3000))
q2_result = jarque_bera.is_true

# Question 03
normaltest = HypothesisTests(0.05)
normaltest.test_normaltest(get_sample(athletes, 'weight', n=3000))
q3_result = normaltest.is_true

# Question 04
normaltest_log = HypothesisTests(0.05)
normaltest_log.test_normaltest(np.log(get_sample(athletes, 'weight', n=3000)))
q4_result = normaltest_log.is_true


bra = athletes[athletes['nationality'] == 'BRA']
usa = athletes[athletes['nationality'] == 'USA']
can = athletes[athletes['nationality'] == 'CAN']

column = 'height'

# Question 05
ttest_ind_usa_bra = HypothesisTests(0.05)
ttest_ind_usa_bra.test_ttest_ind(usa[column].dropna(), bra[column].dropna(), equal_var=False)
q5_result = ttest_ind_usa_bra.is_true

# Question 06
ttest_ind_can_bra = HypothesisTests(0.05)
ttest_ind_can_bra.test_ttest_ind(can[column].dropna(), bra[column].dropna(), equal_var=False)
q6_result = ttest_ind_can_bra.is_true

# Question 07
ttest_ind_can_usa = HypothesisTests(0.05)
ttest_ind_can_usa.test_ttest_ind(can[column].dropna(), usa[column].dropna(), equal_var=False)
q7_result = round(ttest_ind_can_usa.p_value, 8)

## Question 1

Considering a sample of size 3000 from the `height` column obtained with the `get_sample()` function, run the Shapiro-Wilk normality test with the `scipy.stats.shapiro()` function. Can we say that heights are normally distributed based on this test (at the 5% significance level)? Respond with a boolean (`True` or `False`).
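
To get a feel for what `scipy.stats.shapiro()` returns, the sketch below runs it on two synthetic samples of size 3000, one drawn from a normal distribution and one from a right-skewed log-normal distribution. The distribution parameters are arbitrary assumptions chosen for illustration, not estimates from the athletes data.

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=1.77, scale=0.11, size=3000)     # synthetic
skewed_sample = rng.lognormal(mean=0.0, sigma=0.5, size=3000)   # synthetic

w_normal, p_normal = sct.shapiro(normal_sample)
w_skewed, p_skewed = sct.shapiro(skewed_sample)

# W close to 1 means the sample quantiles line up well with normal quantiles.
print(f"normal: W={w_normal:.4f}, p={p_normal:.4f}")
print(f"skewed: W={w_skewed:.4f}, p={p_skewed:.3g}")

# The skewed sample is rejected at the 5% significance level.
print(p_skewed < 0.05)
```

With 3000 observations the test has enough power to flag fairly small departures from normality, so it helps to look at a histogram or Q-Q plot alongside the p-value.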

In [53]:
def q1():
    
    return q1_result

## Question 2

Repeat the same procedure above, but now using the Jarque-Bera normality test through the `scipy.stats.jarque_bera()` function. Can we now say that heights are normally distributed (at the 5% significance level)? Respond with a boolean (`True` or `False`).
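
The Jarque-Bera statistic is built from the sample skewness `S` and the excess kurtosis `K` as `JB = n / 6 * (S**2 + K**2 / 4)`, and is compared against a chi-squared distribution with 2 degrees of freedom. The sketch below recomputes it by hand on a synthetic sample (an assumption, standing in for the height sample) and checks the result against `scipy.stats.jarque_bera()`.

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(42)
x = rng.normal(loc=1.77, scale=0.11, size=3000)  # synthetic stand-in for heights

n = x.size
s = sct.skew(x)      # biased sample skewness (SciPy default)
k = sct.kurtosis(x)  # biased excess kurtosis, 0 for a normal distribution

jb_manual = n / 6 * (s**2 + k**2 / 4)
p_manual = sct.chi2.sf(jb_manual, df=2)

jb_scipy, p_scipy = sct.jarque_bera(x)

# Both comparisons should agree up to floating-point error.
print(np.isclose(jb_manual, jb_scipy), np.isclose(p_manual, p_scipy))
```

Because the statistic only looks at skewness and kurtosis, it can disagree with Shapiro-Wilk on the same sample.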

In [54]:
def q2():
    return q2_result

## Question 3

Now consider a sample of size 3000 from the `weight` column obtained with the `get_sample()` function. Run the D'Agostino-Pearson normality test using the `scipy.stats.normaltest()` function. Can we say that the weights come from a normal distribution at a 5% significance level? Respond with a boolean (`True` or `False`).
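
`scipy.stats.normaltest()` implements the D'Agostino-Pearson omnibus test: it squares the z-scores from `skewtest()` and `kurtosistest()`, sums them, and refers the result to a chi-squared distribution with 2 degrees of freedom. A small sketch on a synthetic right-skewed sample (an assumption standing in for the weights) checks that decomposition:

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(42)
x = rng.lognormal(mean=4.2, sigma=0.2, size=3000)  # synthetic, right-skewed

z_skew, _ = sct.skewtest(x)      # z-score for the skewness
z_kurt, _ = sct.kurtosistest(x)  # z-score for the kurtosis

k2_manual = z_skew**2 + z_kurt**2
p_manual = sct.chi2.sf(k2_manual, df=2)

k2_scipy, p_scipy = sct.normaltest(x)

print(np.isclose(k2_manual, k2_scipy))
print(p_scipy < 0.05)  # True: this skewed sample is rejected at the 5% level
```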

In [55]:
def q3():
    return q3_result

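## Question 4

Apply a logarithmic transformation to the `weight` sample from question 3 and repeat the same procedure. Can we say that the transformed weights come from a normal distribution at a 5% significance level? Respond with a boolean (`True` or `False`).

In [56]: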
def q4():
    return q4_result
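
The `q4` result comes from running the same D'Agostino-Pearson test on `np.log` of the weight sample (the `normaltest_log` lines in the Analysis cell). A log transform often pulls in a long right tail. The sketch below shows the effect on synthetic log-normal data, which is an assumption for illustration: by construction it becomes exactly normal after the transform, which real weights need not do.

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(42)
raw = rng.lognormal(mean=4.2, sigma=0.2, size=3000)  # synthetic, positive, right-skewed

stat_raw, p_raw = sct.normaltest(raw)
stat_log, p_log = sct.normaltest(np.log(raw))

print(f"raw: statistic={stat_raw:.2f}, p={p_raw:.3g}")
print(f"log: statistic={stat_log:.2f}, p={p_log:.3g}")

# The transform removes most of the right skew in this sample.
print(sct.skew(raw) > sct.skew(np.log(raw)))
```

`np.log` is only defined for positive values, a condition that weights satisfy.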

## Question 5

Get all Brazilian, American and Canadian athletes in `DataFrame`s called `bra`, `usa` and `can`, respectively. Perform a hypothesis test to compare the mean heights (`height`) for independent samples with unequal variances, using the `scipy.stats.ttest_ind()` function between `bra` and `usa`. Can we say that the means are statistically equal? Respond with a boolean (`True` or `False`).
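
Passing `equal_var=False` to `scipy.stats.ttest_ind()` selects Welch's t-test, which does not assume equal population variances. A sketch of the computation on two synthetic samples (sizes, means, and spreads are made-up assumptions, not the `bra` and `usa` data) shows the statistic and the Welch-Satterthwaite degrees of freedom by hand:

```python
import numpy as np
import scipy.stats as sct

rng = np.random.default_rng(42)
a = rng.normal(loc=1.76, scale=0.11, size=470)  # synthetic group A
b = rng.normal(loc=1.79, scale=0.12, size=560)  # synthetic group B

# Variance of each sample mean, using the unbiased sample variance.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))

# Two-sided p-value from Student's t distribution.
p_manual = 2 * sct.t.sf(abs(t_manual), df)

t_scipy, p_scipy = sct.ttest_ind(a, b, equal_var=False)

# Both comparisons should agree up to floating-point error.
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```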

In [57]:
def q5():
    return q5_result

## Question 6

Repeat the procedure of question 5, but now between the heights of `bra` and `can`. Can we now say that the means are statistically equal? Respond with a boolean (`True` or `False`).

In [58]:
def q6():
    return q6_result

## Question 7

Repeat the procedure of question 6, but now between the heights of `usa` and `can`. What is the p-value returned? Respond with a single scalar rounded to eight decimal places.

In [59]:
def q7():
    return q7_result