## Question 1
A researcher gathers information about the physical activity patterns of fifth-grade children at a public primary school. He defines three categories of physical activity (Low, Medium, High). He also asks about regular consumption of sugary drinks at school and defines two categories (Yes = consumed, No = not consumed). We would like to evaluate, at the 5% significance level, whether there is an association between physical activity patterns and the consumption of sugary drinks among the children of this school. The results are in the following table:

![](table4.png)

In [1]:
import scipy.stats as stats
import numpy as np

# Rows: Low, Medium, High physical activity; columns: sugary drinks Yes, No
observed = np.array([[32, 12], [14, 22], [6, 9]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p_value}")
print(f"Degrees of freedom: {dof}")
print(f"Expected frequencies:\n{expected}")

Chi-square statistic: 10.712198008709638
p-value: 0.004719280137040844
Degrees of freedom: 2
Expected frequencies:
[[24.08421053 19.91578947]
 [19.70526316 16.29473684]
 [ 8.21052632  6.78947368]]


In [None]:
"""
Expected frequencies under the null hypothesis (no association):
Low Physical Activity: Sugary Drinks (Yes: 24.08, No: 19.92)
Medium Physical Activity: Sugary Drinks (Yes: 19.71, No: 16.29)
High Physical Activity: Sugary Drinks (Yes: 8.21, No: 6.79)

Chi-square statistic: The value of 10.71 is the sum, over all six cells, of (observed - expected)² / expected. 
It measures how far the observed counts are from the counts expected if physical activity and sugary drink 
consumption were independent.

p-value (0.0047): This is the probability of observing the data (or something more extreme) assuming that the null hypothesis 
is true. The null hypothesis states that there is no association between physical activity levels and sugary drink consumption.

Degrees of freedom (2): This value is calculated as (number of rows - 1) * (number of columns - 1) in the contingency table, 
here (3 - 1) * (2 - 1) = 2. It indicates the number of values that are free to vary in the calculation.

Expected frequencies: These are the frequencies that we would expect to find if there were no association between physical 
activity and sugary drink consumption. They are calculated based on the marginal totals and the overall sample size.

H₀: There is no association between physical activity levels and the consumption of sugary drinks among the children.
H₁: There is an association between physical activity levels and the consumption of sugary drinks among the children.

Since the p-value (0.0047) is less than the significance level of 0.05, we reject the null hypothesis. 
There is statistically significant evidence of an association between the level of physical activity and 
the consumption of sugary drinks among the children in the study: the proportion of children who consume 
sugary drinks differs across the physical activity levels.
"""
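
The expected frequencies, the statistic, and the p-value described above can be reproduced by hand as a sanity check. The following is a minimal sketch, not part of the original solution; the names `expected_manual`, `chi2_manual`, `dof_manual`, and `p_manual` are introduced here for illustration only. It assumes no continuity correction applies, which holds here because the table has more than one degree of freedom.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the cell above
# (rows: Low, Medium, High activity; columns: sugary drinks Yes, No).
observed = np.array([[32, 12], [14, 22], [6, 9]])

row_totals = observed.sum(axis=1)  # one total per activity level
col_totals = observed.sum(axis=0)  # one total per drink category
n = observed.sum()                 # overall sample size

# Expected count under independence: row total * column total / n
expected_manual = np.outer(row_totals, col_totals) / n

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

print(f"Manual chi-square: {chi2_manual}")
print(f"Manual degrees of freedom: {dof_manual}")
print(f"Manual p-value: {p_manual}")
```

The three printed values should agree with the output of `stats.chi2_contingency` shown earlier.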

## [OPTIONAL] Question 2
The following table indicates the number of 6-point scores in an American rugby match in the 1979 season.

![](table1.png)

Based on these results, we create a Poisson distribution with parameter λ equal to the sample mean, λ = 2.435. At the 0.05 significance level, is there any reason to believe that the number of scores is a Poisson variable?

See [this guide](https://www.geeksforgeeks.org/how-to-create-a-poisson-probability-mass-function-plot-in-python/) for how to create a Poisson distribution and how to calculate the expected observations using the probability mass function (pmf).
A Poisson distribution is a discrete probability distribution. It gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events.
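
As a small illustration of the definition above, the probability mass function can be written directly from its formula, P(X = k) = e^(-λ) λ^k / k!, and compared with `scipy.stats.poisson.pmf`. This is only a sketch; the helper name `poisson_pmf` and the variable `lam` are introduced here for illustration and are not part of the exercise.

```python
import math
from scipy.stats import poisson

lam = 2.435  # sample mean quoted in the question


def poisson_pmf(k, lam):
    """Return P(X = k) = exp(-lam) * lam**k / k! for a Poisson variable X."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Compare the hand-written formula with SciPy for the first few values of k
for k in range(4):
    print(k, poisson_pmf(k, lam), poisson.pmf(k, lam))
```

Multiplying each probability by the number of observations gives the expected count for that value of k.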

In [3]:
from scipy.stats import poisson, chisquare

observed_scores = np.array([35, 99, 104, 110, 62, 25, 10, 3])
total_scores = 448
lambda_poisson = 2.435

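# Expected counts for k = 0..6 scores: Poisson pmf times the number of observations
expected_frequencies = []
for k in range(7):
    expected_frequencies.append(poisson.pmf(k, lambda_poisson) * total_scores)

# The last category pools the upper tail P(X >= 7) so the expected counts sum to the total
expected_frequencies.append(poisson.sf(6, lambda_poisson) * total_scores)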

expected_frequencies = np.array(expected_frequencies)

chi2, p_value = chisquare(observed_scores, expected_frequencies)

print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p_value}")
print(f"Expected frequencies: {expected_frequencies}")

Chi-square statistic: 6.491310681109792
p-value: 0.48368890685373034
Expected frequencies: [ 39.24379068  95.5586303  116.34263239  94.43143662  57.48513704
  27.99526174  11.36141039   5.58170083]


In [None]:
"""
(H₀): The number of scores in the rugby matches follows a Poisson distribution with the mean λ = 2.435.
(H₁): The number of scores in the rugby matches does not follow a Poisson distribution with the mean λ = 2.435.

Chi-square Statistic (6.49):
This value measures the discrepancy between the observed and expected frequencies under the Poisson distribution assumption. 
The higher this value, the greater the difference between observed and expected frequencies.

p-value (0.484):
The p-value is the probability of obtaining a Chi-square statistic as extreme as, or more extreme than, the observed one,
assuming the null hypothesis is true. Here, a p-value of 0.484 means that if the number of scores truly followed the 
Poisson distribution with the specified mean, a discrepancy at least this large would occur about 48.4% of the time.

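At the significance level of 0.05, the p-value of 0.484 is much greater than 0.05. Therefore, we fail to reject the 
null hypothesis: the data give no evidence that the number of scores departs from a Poisson distribution with λ = 2.435.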
"""
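
One refinement worth considering: `chisquare` uses k - 1 = 7 degrees of freedom by default, but λ = 2.435 was estimated from the same sample, which is usually taken to cost one additional degree of freedom. The sketch below reruns the test with `ddof=1` (6 degrees of freedom) and compares the statistic with the 5% critical value. It is an illustrative variant, not the original solution, and assumes the last category is treated as "7 or more"; the names `n_obs`, `p_adj`, and `critical` are introduced here.

```python
import numpy as np
from scipy.stats import poisson, chisquare, chi2 as chi2_dist

observed_scores = np.array([35, 99, 104, 110, 62, 25, 10, 3])
n_obs = observed_scores.sum()  # 448 observations in total
lambda_poisson = 2.435         # sample mean, estimated from the data

# Expected counts for k = 0..6, plus the pooled upper tail P(X >= 7)
ks = np.arange(7)
probs = np.append(poisson.pmf(ks, lambda_poisson), poisson.sf(6, lambda_poisson))
expected = probs * n_obs

# ddof=1 removes one extra degree of freedom for the estimated parameter
stat, p_adj = chisquare(observed_scores, expected, ddof=1)

# 5% critical value with (number of categories - 1 - 1) degrees of freedom
df_adj = len(observed_scores) - 1 - 1
critical = chi2_dist.ppf(0.95, df=df_adj)

print(f"Chi-square statistic: {stat}")
print(f"Adjusted p-value (df={df_adj}): {p_adj}")
print(f"Critical value at 5%: {critical}")
```

Under these assumptions the statistic is unchanged and stays below the critical value, so the conclusion is the same: we fail to reject the null hypothesis.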