## Question 1
The following table indicates the number of 6-point scores in an American rugby match in the 1979 season.

![](table1.png)

Based on these results, we fit a Poisson distribution with parameter λ = 2.435, the sample mean. Is there any reason to believe, at the 0.05 significance level, that the number of scores is a Poisson variable?

In [3]:
from scipy.stats import poisson
from scipy.stats import chi2
from scipy.stats import chisquare
mu = 2.435
observed = [35,99,104,110,62,25,10,3]
expected = [(poisson.pmf(x, mu) * 448) for x in range(len(observed) - 1)]
expected.append(448 - sum(expected))

n = len(observed)
alpha = 0.05
dof = n - 2 # n - 1 - estimated parameters (one parameter, the mean, is estimated, so n - 1 - 1)

# chisquare returns (statistic, p-value); its p-value uses the default n - 1 degrees of freedom (ddof=0)
chi_2, p_value = chisquare(observed, f_exp = expected)
critical_value = chi2.ppf(q = 1-alpha, df = dof)

print('p-value: ' + str(p_value))
print('chi-2: ' + str(chi_2))
print('critical value: ' + str(critical_value))

p-value: 0.48368890685372257
chi-2: 6.491310681109862
critical value: 12.591587243743977


In [4]:
# chi-squared statistic computed by hand - the lower it is, the more closely observed and expected counts agree
from math import pow
sum(pow(expected[x] - observed[x],2) / expected[x] for x in range(len(observed)))

6.491310681109861

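- The chi-squared statistic (6.49) is below the critical value (12.59) and the p-value is higher than 0.05, so we do not reject the null hypothesis: the data are consistent with a Poisson distribution. The reported p-value uses 7 degrees of freedom rather than 6, but the conclusion is the same either way.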
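The degrees-of-freedom adjustment for the estimated mean can also be passed to `chisquare` directly through its `ddof` argument. The following is a minimal sketch that reuses the counts and mean from the cell above; the variable names `total` and `stat` are introduced here for illustration only.

```python
from scipy.stats import chi2, chisquare, poisson

mu = 2.435      # sample mean, as given in the question
total = 448     # total number of observations
observed = [35, 99, 104, 110, 62, 25, 10, 3]

# Expected counts for 0..6 scores; the last bin absorbs the tail (7 or more)
expected = [poisson.pmf(k, mu) * total for k in range(len(observed) - 1)]
expected.append(total - sum(expected))

# ddof=1 removes one extra degree of freedom for the estimated mean
stat, p_value = chisquare(observed, f_exp=expected, ddof=1)
dof = len(observed) - 1 - 1
critical_value = chi2.ppf(0.95, df=dof)

print(f"chi-2: {stat:.3f}  dof: {dof}  p-value: {p_value:.3f}  critical value: {critical_value:.3f}")
```

With this adjustment the p-value and the critical value refer to the same reference distribution (6 degrees of freedom).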

## Question 2
The following are the ordered values of a random sample of SAT scores (university entrance exam) for several students: 852, 875, 910, 933, 957, 963, 981, 998, 1010, 1015, 1018, 1023, 1035, 1048, 1063. In previous years, the scores were modeled by N(985, 50). Based on the sample, is there any reason to believe that there has been a change in the distribution of scores this year? Use the level alpha = 0.05.

In [5]:
from scipy.stats import kstest
from scipy.stats import norm

mu = 985
s = 50

observed = [852, 875, 910, 933, 957, 963, 981, 998, 1010, 1015, 1018, 1023, 1035, 1048, 1063]
expected = norm.rvs(loc = mu, scale = s, size = len(observed))

chisquare(observed, f_exp = expected)

Power_divergenceResult(statistic=116.66726447173679, pvalue=2.8211923789642365e-18)

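- The p-value from this call is far below 0.05, which would lead us to reject the null hypothesis that this year's scores follow N(985, 50). However, `chisquare` is applied here to the raw scores against a single random draw from N(985, 50), so the statistic changes on every run and does not measure goodness of fit. The conclusion should rest on a test of the sample against the N(985, 50) distribution itself, such as the imported `kstest`.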
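One way to test the sample directly against N(985, 50) is the Kolmogorov-Smirnov test, which compares the empirical CDF of the scores with the normal CDF. The following is a minimal sketch, assuming that 50 is the standard deviation (as the `scale = s` argument in the cell above does).

```python
from scipy.stats import kstest

mu = 985   # mean from previous years
s = 50     # assumed to be the standard deviation

observed = [852, 875, 910, 933, 957, 963, 981, 998, 1010, 1015, 1018, 1023, 1035, 1048, 1063]

# One-sample KS test: largest gap between the empirical CDF and the N(mu, s) CDF
result = kstest(observed, 'norm', args=(mu, s))

print('KS statistic: ' + str(result.statistic))
print('p-value: ' + str(result.pvalue))
```

On this sample the largest gap between the two CDFs is about 0.16 and the p-value is well above 0.05, so under this test there is no evidence at the 5% level of a change in the distribution.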

## Question 3
Let's analyze a discrete distribution. To analyze the number of defective items in a factory in the city of Medellín, we took a random sample of n = 60 articles and observed the number of defectives in the following table:

![](table4.png)

A Poisson distribution was proposed, since it is defined for x = 0, 1, 2, 3, ..., using the following model:

![](image1.png)

Does the distribution of defective items follow this distribution?

In [6]:
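n = 60
defect = [0,1,3,4]          # number of defectives in each category
observed = [32, 15, 9, 4]   # number of articles observed in each category
# Estimate the Poisson mean from the sample
mu = sum([defect[x]*observed[x] for x in range(len(observed))]) / n
# Expected counts evaluated at x = 0, 1, 2, 3 (no tail bin, so they do not sum exactly to n)
expected = [(poisson.pmf(x, mu) * n) for x in range(len(observed))]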

chisquare(observed, f_exp = expected)

Power_divergenceResult(statistic=6.3034965141828545, pvalue=0.09774272956839174)

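- The p-value (about 0.098) is above 0.05, so we do not reject the null hypothesis that the defect counts follow the proposed Poisson distribution. This p-value uses the default 3 degrees of freedom; because mu is estimated from the sample, the result is sensitive to that choice.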
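Because the Poisson mean is estimated from the sample, the reference chi-squared distribution arguably loses one more degree of freedom. A small sketch of how much that matters, using only the statistic reported in the output above and assuming four categories (the names `p_default` and `p_adjusted` are illustrative):

```python
from scipy.stats import chi2

statistic = 6.3034965141828545  # chi-squared statistic reported above
k = 4                           # number of categories

p_default = chi2.sf(statistic, df=k - 1)       # what chisquare reports by default
p_adjusted = chi2.sf(statistic, df=k - 1 - 1)  # one parameter (mu) estimated from the data

print(f"df = {k - 1}: p-value = {p_default:.4f}")
print(f"df = {k - 2}: p-value = {p_adjusted:.4f}")
```

With 3 degrees of freedom the p-value matches the output above (about 0.098); with 2 it drops to about 0.043, below the 5% threshold, so the conclusion depends on this choice.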

## Question 4
A quality control engineer takes a daily sample of 10 tires coming off an assembly line and would like to verify, on the basis of the number of defective tires observed over 200 days (table below), whether it is true that 5% of all tires have defects (that is, whether the sample comes from a binomial population with n = 10 and p = 0.05).

![](table6.png)


In [8]:
from scipy.stats import binom

n = 10
p = 0.05
observed = [138, 53, 9]
# Expected number of days (out of 200) for each category except the last
expected = [(binom.pmf(x, n, p)*200) for x in range(len(observed) - 1)]
# The last category absorbs the remaining probability mass
expected.append(200 - sum(expected))

chisquare(observed, f_exp = expected)

Power_divergenceResult(statistic=8.306179519542717, pvalue=0.01571578339595159)

- The p-value is about 1.6%, so at a 5% significance level we reject the null hypothesis that the daily defect counts follow a binomial distribution with n = 10 and p = 0.05. The test itself does not indicate a direction, but the defect rate may be lower than 5%.
- At a 1% significance level, however, we would not reject the null hypothesis.
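To see where the discrepancy comes from, it helps to print the observed and expected counts side by side. A minimal sketch, assuming the three categories in the table are 0, 1, and 2 or more defective tires (the table itself is not reproduced here, so the labels are an assumption):

```python
from scipy.stats import binom

n = 10       # tires sampled per day
p = 0.05     # hypothesized defect rate
days = 200   # number of days observed

observed = [138, 53, 9]
labels = ['0', '1', '2 or more']  # assumed category labels

# Expected days for 0 and 1 defectives; the last category takes the remainder
expected = [binom.pmf(k, n, p) * days for k in range(len(observed) - 1)]
expected.append(days - sum(expected))

for label, o, e in zip(labels, observed, expected):
    print(f"{label:>10}: observed = {o:3d}, expected = {e:6.1f}")
```

The zero-defect category is observed more often than the binomial model predicts (138 days against roughly 120), which is consistent with a defect rate below 5%.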

## Question 5
A researcher gathers information about the patterns of physical activity (AF) of fifth-grade children at a public primary school. He defines three categories of physical activity (1 = Low, 2 = Medium, 3 = High). He also asks about the regular consumption of sugary drinks at school, and defines two categories (1 = consumed, 0 = not consumed). We would like to evaluate whether there is an association between patterns of physical activity and the consumption of sugary drinks for the children of this school, at a 5% significance level. The results are in the following table:

![](table5.png)

In [9]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

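# Physical activity levels (baja = low, media = medium, alta = high)
af = ['baja' , 'media', 'alta']
# Number of children who do (si) and do not (no) consume sugary drinks
si = [32, 14, 6]
no = [12, 22, 9]

d = {'af': af, 'si': si, 'no': no}
ds = pd.DataFrame(d).set_index('af')

# Index 1 of the result is the p-value
chi2_contingency(ds)[1]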

0.004719280137040844

- The p-value (about 0.005) is well below 5%, so we reject the null hypothesis that the two variables are independent and conclude that there is an association between activity level and consumption of sugary drinks.
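`chi2_contingency` returns more than the p-value. A short sketch that unpacks the statistic, the degrees of freedom, and the expected counts under independence; the rows are assumed to be low, medium, and high activity, and the columns consumed / not consumed, matching the lists in the cell above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: low, medium, high activity; columns: consumes, does not consume
table = np.array([[32, 12],
                  [14, 22],
                  [6, 9]])

stat, p_value, dof, expected = chi2_contingency(table)

print('chi-2: ' + str(stat))
print('p-value: ' + str(p_value))
print('degrees of freedom: ' + str(dof))
print('expected counts under independence:')
print(expected.round(2))
```

Comparing `expected` with the observed table shows which cells drive the association; here the low-activity row has more consumers than independence would predict.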