### Key points

1. H0 (the null hypothesis) states that the data follow the proposed distribution; the expected frequencies are computed under H0

2. The observed frequencies are the actual counts, which are given in the data

3. H1 (the alternative hypothesis) states that the data do not follow the proposed distribution

4. Compute the test statistic from the expected and the observed frequencies

    - import scipy.stats as st
    - st.chisquare --> the function that runs the chi-square goodness-of-fit test
    - f_obs --> stands for the observed values
    - f_exp --> stands for the expected values

5. Information about p-value:
    - If the p-value is below 0.05, we reject the null hypothesis at the 5% significance level: the difference between observed and expected frequencies is statistically significant.
    - If the p-value is between 0.05 and 0.1, the evidence against the null hypothesis is weak: we would reject it only at the 10% level, not at the 5% level.
    - If the p-value is above 0.1, we do not reject the null hypothesis.
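
The steps above can be sketched end to end on a small made-up example. The die-roll counts below are hypothetical and chosen only for illustration; the sketch assumes a fair die under H0 and also computes the statistic by hand to show what `st.chisquare` reports.

```python
import numpy as np
import scipy.stats as st

# Hypothetical data: 120 rolls of a die (illustrative only, not from this notebook)
observed = np.array([18, 22, 16, 25, 19, 20])

# Under H0 (fair die) every face is expected 120 / 6 = 20 times
expected = np.full(6, observed.sum() / 6)

# Test statistic by hand: sum of (O - E)^2 / E
manual_stat = ((observed - expected) ** 2 / expected).sum()

# Same statistic and its p-value from SciPy (degrees of freedom = 6 - 1 = 5)
stat, p_value = st.chisquare(f_obs=observed, f_exp=expected)

# Decision at the 5% significance level
alpha = 0.05
reject_h0 = p_value < alpha

print(manual_stat, stat, p_value, reject_h0)
```

The hand-computed statistic and the SciPy statistic should agree, which makes it clear that the p-value is simply the upper-tail probability of that statistic under a chi-square distribution.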

## Question 1
The following table shows the number of 6-point scores per match in American rugby in the 1979 season.

![](table1.png)

Based on these results, we create a Poisson distribution with parameter λ = 2.435 (the sample mean). At the .05 significance level, is there any reason to believe that the number of scores is a Poisson variable?

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as st
from scipy.stats import poisson


In [3]:
mu = 2.435

poisson_dist = poisson(mu)

# 1. Create an empty list for the expected Poisson probabilities

expected = []

# 2.
# - Loop over 0 to 6 scores and append the Poisson probability P(X = n) for each number of scores
# - Then append the tail probability P(X >= 7), so that the expected probabilities sum to 1

for n in range(0,7):
    expected.append(poisson_dist.pmf(n))
    
expected.append(poisson_dist.sf(6)) # sf(6) = P(X > 6), i.e. 7 or more scores

print(expected)
    

[0.08759774704805763, 0.21330051406202033, 0.2596933758705097, 0.21078445674823038, 0.12831503804548525, 0.06248942352815135, 0.025360291048508066, 0.012459153649037175]
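
A quick sanity check on the list above is that the eight probabilities sum to 1, since the last entry is meant to absorb the whole upper tail. The sketch below rebuilds the list independently; as far as I know, recent SciPy versions also require the observed and expected totals passed to `st.chisquare` to agree, so this check guards against an error later on.

```python
import numpy as np
from scipy.stats import poisson

mu = 2.435
poisson_dist = poisson(mu)

# P(X = 0), ..., P(X = 6), followed by the tail P(X >= 7)
expected = [poisson_dist.pmf(k) for k in range(7)]
expected.append(poisson_dist.sf(6))

# The categories cover every possible count, so the total should be 1
total = np.sum(expected)
print(total)
```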


In [12]:
# 1. Set the hypotheses

# H0: the number of scores follows a Poisson distribution with mean 2.435
# H1: the number of scores does not follow a Poisson distribution with mean 2.435


# 2. Define the significance level

alpha = 0.05


In [13]:
# 3. Define the sample: the observed frequencies from the table

O = np.array([35,99,104,110,62,25,10,3])
E = np.array(expected)*448 # 448 is the total number of matches, i.e. the sum of the observed frequencies

In [14]:
# 4. Compute the test statistic with chisquare
# 5. Calculate the p-value

stats, p_value = st.chisquare(f_obs=O, f_exp=E)

print(stats)
print(p_value)

6.491310681109792
0.48368890685373034


In [15]:
# 6. Make a decision

p_value < alpha

False

#### Comments

Because the p-value (0.484) is not smaller than alpha (0.05), we do not reject H0: the data give no evidence against the number of scores following a Poisson distribution with a mean of 2.435. Note that failing to reject H0 does not prove that H0 is true.
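
One refinement worth considering: the mean 2.435 was estimated from the same sample, and the usual convention is to remove one extra degree of freedom per estimated parameter. The sketch below passes `ddof=1` to `st.chisquare` for that purpose. It also recomputes the sample mean from the observed counts under the assumption, not stated in the original, that the last category is counted as exactly 7 scores.

```python
import numpy as np
import scipy.stats as st
from scipy.stats import poisson

O = np.array([35, 99, 104, 110, 62, 25, 10, 3])
counts = np.arange(8)  # assumption: the last bin ("7 or more") is counted as exactly 7

# Sample mean implied by the table under that assumption
mu_hat = (counts * O).sum() / O.sum()

# Expected frequencies, built the same way as in the cells above
dist = poisson(2.435)
probs = np.append(dist.pmf(np.arange(7)), dist.sf(6))
E = probs * O.sum()

# Default: degrees of freedom = 8 - 1 = 7
stat, p_plain = st.chisquare(f_obs=O, f_exp=E)

# ddof=1 removes one more degree of freedom for the estimated mean: 8 - 1 - 1 = 6
stat_adj, p_adj = st.chisquare(f_obs=O, f_exp=E, ddof=1)

print(mu_hat, stat, p_plain, p_adj)
```

The statistic itself is unchanged; only the reference distribution differs. Since 6.49 is well below the 5% critical value for 6 degrees of freedom (about 12.59), the decision not to reject H0 stays the same.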

## BONUS/OPTIONAL - Question 2
Let's analyze a discrete distribution. To analyze the number of defective items in a factory in the city of Medellín, we took a random sample of n = 60 articles and observed the number of defectives in the following table:

![](table2.png)

A Poisson distribution was proposed, since it is defined for x = 0, 1, 2, 3, ..., using the following model:

![](image1.png)
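
For reference, the standard Poisson probability mass function is written out below. The image above presumably shows this model; the symbol λ for the mean of the distribution is an assumption about its notation.

$$
P(X = x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}, \qquad x = 0, 1, 2, 3, \dots
$$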

For some extra insights check the following link: https://online.stat.psu.edu/stat504/node/63/ 

Does the distribution of defective items follow this distribution?

In [None]:
# your code here

## Question 3
A quality control engineer takes a sample of 10 tires coming off an assembly line on each of 200 days. Based on the number of defective tires observed per sample (data below), the engineer would like to verify whether 5% of all tires have defects (that is, whether the samples come from a binomial population with n = 10 and p = 0.05).

![](table3.png)


In [18]:
# We define the binomial distribution used to compute the expected probabilities

n = 10
p = 0.05

from scipy.stats import binom

binomial_dist = binom(n,p)


In [19]:
expected_3 = []

# 2.
# - Loop over 0 and 1 defects and append the binomial probability P(X = k) for each
# - Then append the tail probability P(X >= 2) for 2 or more defects, so that the probabilities sum to 1

for k in range(0,2):  # k instead of n, so the sample size n = 10 is not overwritten
    expected_3.append(binomial_dist.pmf(k))
    
expected_3.append(binomial_dist.sf(1)) # sf(1) = P(X > 1), i.e. 2 or more defects

print(expected_3)

[0.5987369392383787, 0.3151247048623047, 0.08613835589931637]


In [20]:
# 1. Set the hypothesis

## H0: follows binomial distr (n = 10, p = 0.05)
## H1: doesn't follow binomial distr (n = 10, p = 0.05) 

# 2. Define the significance level
alpha = 0.05

# 3. Determine the sample

O = np.array([138,53,9]) # observed number of days with 0, 1, and 2 or more defective tires
E = np.array(expected_3)*(138+53+9) 
# we multiply each expected probability by the total number of observations (200 days)
# to get the expected frequencies

# 4. and 5. Compute the test statistic and the p-value
stats, p_value = st.chisquare(f_obs=O, f_exp=E)

print(stats)
print(p_value)


8.30617951954277
0.015715783395951168


In [21]:
# 6. Make a decision

p_value < alpha

True

#### Comments:

Because the p-value (0.016) is smaller than alpha (0.05), we reject H0: the data give evidence that the number of defective tires per sample does not follow a binomial distribution with n = 10 and p = 0.05.
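
To see where the statistic comes from, it can be reproduced by hand from the formula sum((O - E)^2 / E), with the p-value taken from the chi-square distribution with 3 - 1 = 2 degrees of freedom. The sketch below is self-contained and uses only the numbers from the cells above; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as st
from scipy.stats import binom

O = np.array([138, 53, 9])      # days with 0, 1, and 2 or more defective tires
dist = binom(10, 0.05)
probs = np.array([dist.pmf(0), dist.pmf(1), dist.sf(1)])
E = probs * O.sum()             # expected number of days in each category

# Contribution of each category to the statistic: (O - E)^2 / E
contributions = (O - E) ** 2 / E
manual_stat = contributions.sum()

# Upper-tail probability of the chi-square distribution with 2 degrees of freedom
manual_p = st.chi2.sf(manual_stat, df=len(O) - 1)

print(E)
print(contributions)
print(manual_stat, manual_p)
```

Looking at the individual contributions shows which categories drive the rejection, which the single statistic from `st.chisquare` does not reveal.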


## Question 4
A researcher gathers information about the physical activity patterns of fifth-grade children at a public primary school. He defines three categories of physical activity (Low, Medium, High). He also asks about the regular consumption of sugary drinks at school, and defines two categories (Yes = consumed, No = not consumed). We would like to evaluate, at a 5% significance level, whether there is an association between patterns of physical activity and the consumption of sugary drinks for the children of this school. The results are in the following table: 

![](table4.png)

In [22]:
# 1. Set the hypothesis

## H0: physical activity and the consumption of sugary drinks are independent for the children of the school
## H1: physical activity and the consumption of sugary drinks are associated (not independent) for the children of the school

In [23]:
# 2. Define the significance level

alpha = 0.05

In [25]:
# 3. Define the sample

sugar = np.array([[32, 12],
                  [14, 32],
                  [6, 9]])

sugar

array([[32, 12],
       [14, 32],
       [ 6,  9]])

In [28]:
# 4. Compute the statistics
# 5. Calculate the p_value

st.chi2_contingency(sugar)


(16.726380674288794,
 0.00023329884325305633,
 2,
 array([[21.79047619, 22.20952381],
        [22.78095238, 23.21904762],
        [ 7.42857143,  7.57142857]]))

In [29]:
# 6. Decision

st.chi2_contingency(sugar)[1] < alpha  # index 1 is the p-value, taken from the result instead of copied by hand


True

#### Comments:

Because the p-value (0.00023) is smaller than alpha (0.05), we reject H0 at the 5% significance level: the data give evidence that physical activity and the consumption of sugary drinks are associated for the children of this school.
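
As a cross-check, the four values returned by `st.chi2_contingency` can be unpacked into named variables, and the expected counts under independence can be rebuilt from the row and column totals. This is a minimal sketch using the same table; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

sugar = np.array([[32, 12],
                  [14, 32],
                  [6, 9]])

# Unpack the statistic, p-value, degrees of freedom and expected counts
chi2_stat, p_value, dof, expected = st.chi2_contingency(sugar)

# Expected counts under independence: (row total * column total) / grand total
row_totals = sugar.sum(axis=1)
col_totals = sugar.sum(axis=0)
manual_expected = np.outer(row_totals, col_totals) / sugar.sum()

alpha = 0.05
print(chi2_stat, p_value, dof)
print(np.allclose(expected, manual_expected))
print(p_value < alpha)
```

The degrees of freedom equal (rows - 1) * (columns - 1) = (3 - 1) * (2 - 1) = 2, matching the third value in the output above.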
