In [1]:
## importing libraries

from scipy import stats
import pandas as pd
import numpy as np

## Question 1
The following table indicates the number of 6-point scores in an American rugby match in the 1979 season.

![](table1.png)

Based on these results, we fit a Poisson distribution whose parameter is the sample mean, λ = 2.435. Is there any reason to believe, at the .05 significance level, that the number of scores is a Poisson variable?

In [2]:
## data import
scores =[i for i in range(8)]
times = [35,99,104,110,62,25,10,3]
scores_df = pd.DataFrame(times, index=scores, columns=['#Times'])

In [3]:
scores_df

|   | #Times |
|---|--------|
| 0 | 35 |
| 1 | 99 |
| 2 | 104 |
| 3 | 110 |
| 4 | 62 |
| 5 | 25 |
| 6 | 10 |
| 7 | 3 |


In [4]:
# Probability of each outcome under the fitted Poisson distribution.
# The last category means "7 or more", so its probability is the upper tail:
# 1 minus the sum of the probabilities for 0..6, not pmf(7).

## Poisson parameters
s_mean = 2.435
poisson_prob = [stats.poisson.pmf(k=score, mu=s_mean) for score in scores[:-1]]
poisson_prob.append(1 - sum(poisson_prob))

sum(poisson_prob)  # sanity check: the probabilities sum to 1

scores_df['poisson_prob'] = poisson_prob

# expected frequency of each category
scores_df['exp_value'] = scores_df['#Times'].sum() * scores_df['poisson_prob']

# squared difference between observed and expected frequencies
scores_df['diff'] = (scores_df['#Times'] - scores_df['exp_value'])**2

# chi-squared statistic: squared differences scaled by the expected frequencies
chi2_score = sum(scores_df['diff']/scores_df['exp_value'])
chi2_score

6.4913106811098205

In [5]:

## CI (acceptance region of the test)
alpha = 0.05
# According to the documentation, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
# every parameter estimated from the data (here, the sample mean) must be subtracted from the degrees of freedom:
# dof = k - 1 - p, where k is the number of categories and p the number of estimated parameters. Here p = 1.
p = 1
dof = len(scores) - 1 - p  # k - 1 - p = 8 - 1 - 1 = 6

## Hypothesis test
# H0: the number of scores follows a Poisson distribution
# Ha: the number of scores does not follow a Poisson distribution

# Upper critical value of the chi-squared distribution:
# H0 is not rejected if chi2_score falls inside [0, upper_CI].

#http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Calcul_P_Value.pdf

upper_CI = stats.chi2.ppf(q=1-alpha, df=dof)

In [6]:
CI = [0, upper_CI]
CI

[0, 12.591587243743977]

`Conclusions`: 

Since the chi2_score value is 6.49131, it falls inside the interval [0, 12.59] we have calculated. That means that we **cannot** reject the null hypothesis that the number of scores follows a Poisson distribution.
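As a cross-check of the manual computation, the same test can be sketched with `scipy.stats.chisquare`, which returns the statistic together with a p-value. This is only a minimal sketch: the variable names (`observed`, `lam`, `probs`, `expected`) are illustrative, and `ddof=1` is used on the assumption that one parameter (the mean) was estimated from the data.

```python
from scipy import stats

# Observed frequencies for 0, 1, ..., 6 and "7 or more" scores (from the table above).
observed = [35, 99, 104, 110, 62, 25, 10, 3]
lam = 2.435  # Poisson parameter used in the text

# Expected probabilities: pmf for 0..6, upper tail for the last category.
probs = [stats.poisson.pmf(k, lam) for k in range(7)]
probs.append(1 - sum(probs))
expected = [sum(observed) * p for p in probs]

# ddof=1 removes one extra degree of freedom for the estimated mean,
# so the p-value is computed with dof = k - 1 - 1 = 6.
result = stats.chisquare(f_obs=observed, f_exp=expected, ddof=1)
print(result.statistic, result.pvalue)
```

A p-value above 0.05 leads to the same decision as comparing the statistic with the critical value.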

## Question 2
The following are the ordered values of a random sample of SAT scores (a university entrance exam) for several students: 852, 875, 910, 933, 957, 963, 981, 998, 1010, 1015, 1018, 1023, 1035, 1048, 1063. In previous years, the scores followed a normal distribution N(985, 50). Based on the sample, is there any reason to believe that the distribution of scores has changed this year? Use the significance level alpha = 0.05.

In [10]:
SAT_scores = [852, 875, 910, 933, 957, 963, 981, 998, 1010, 1015, 1018, 1023, 1035, 1048, 1063]

# previous years' values
mu = 985
sigma = 50
alpha = 0.05

# Since we are working with a continuous variable, we apply a different test:
# the Kolmogorov-Smirnov test (K-S test).
# Here the sample is compared with a random sample drawn from N(mu, sigma),
# so the statistic and p-value change from run to run.
pop_SAT = stats.norm.rvs(loc=mu, scale=sigma, size=len(SAT_scores))

stats.ks_2samp(SAT_scores, pop_SAT)

# One-sample alternative that compares directly against the N(mu, sigma) CDF:
#stats.kstest(SAT_scores, 'norm', args=(mu, sigma))

Ks_2sampResult(statistic=0.2, pvalue=0.9383310279844598)

`Insights`: 

We **cannot** reject the null hypothesis, since the p-value (0.938) is well above 0.05: the sample gives no evidence that the distribution of scores has changed.
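Because the reference distribution N(985, 50) is fully specified, the comparison can also be made without simulating a second sample. The following is a minimal sketch using the one-sample form `scipy.stats.kstest`, which avoids the run-to-run variation of a random draw; it assumes that 50 is the standard deviation, as in the cell above.

```python
from scipy import stats

SAT_scores = [852, 875, 910, 933, 957, 963, 981, 998, 1010,
              1015, 1018, 1023, 1035, 1048, 1063]
mu, sigma = 985, 50  # previous years' mean and standard deviation

# One-sample K-S test: compares the empirical CDF of the sample
# with the theoretical CDF of N(mu, sigma).
result = stats.kstest(SAT_scores, 'norm', args=(mu, sigma))
print(result.statistic, result.pvalue)
```

If the p-value is above alpha = 0.05, the null hypothesis of no change is not rejected.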

## Question 3
Let's analyze a discrete distribution. To analyze the number of defective items in a factory in the city of Medellín, we took a random sample of n = 60 articles and observed the number of defectives in the following table:

![](table4.png)

A Poisson distribution was proposed, since it is defined for x = 0, 1, 2, 3, ..., using the following model:

![](image1.png)

Does the distribution of defective items follow this distribution?

In [99]:
## data import
def_items =[i for i in range(5)]
obs_freq = [32,15,0,9,4]
obs_df = pd.DataFrame(obs_freq, index=def_items, columns=['obs_freq'])
obs_df



|   | obs_freq |
|---|----------|
| 0 | 32 |
| 1 | 15 |
| 2 | 0 |
| 3 | 9 |
| 4 | 4 |


In [115]:
# Probability of each outcome under a Poisson distribution
# whose parameter is the sample mean.
sample_mean = sum(obs_freq[x] * x for x in def_items)/sum(obs_freq)

# The last category is treated as "4 or more": its probability is 1 minus the others.
poisson_prob2 = [stats.poisson.pmf(k=item, mu=sample_mean) for item in def_items[:-1]]
poisson_prob2.append(1 - sum(poisson_prob2))

sum(poisson_prob2)  # sanity check: the probabilities sum to 1

obs_df['poisson_prob'] = poisson_prob2

# expected frequency of each category
obs_df['exp_value'] = obs_df['obs_freq'].sum() * obs_df['poisson_prob']

# squared difference between observed and expected frequencies
obs_df['diff'] = (obs_df['obs_freq'] - obs_df['exp_value'])**2

# chi-squared statistic: squared differences scaled by the expected frequencies
chi2_score = sum(obs_df['diff']/obs_df['exp_value'])
chi2_score

34.32169618960069

In [124]:
obs_df

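|   | obs_freq | poisson_prob | exp_value | diff |
|---|----------|--------------|-----------|------|
| 0 | 32 | 0.380349 | 22.820925 | 84.255411 |
| 1 | 15 | 0.36767 | 22.060228 | 49.846818 |
| 2 | 0 | 0.177707 | 10.662443 | 113.687701 |
| 3 | 9 | 0.057261 | 3.435676 | 30.961699 |
| 4 | 4 | 0.017012 | 1.020727 | 8.876068 |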


In [122]:
## CI (acceptance region of the test)
alpha = 0.05
# According to the documentation, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
# every parameter estimated from the data (here, the sample mean) must be subtracted from the degrees of freedom:
# dof = k - 1 - p, where k is the number of categories and p the number of estimated parameters. Here p = 1.
p = 1
dof = len(def_items) - 1 - p  # k - 1 - p = 5 - 1 - 1 = 3

## Hypothesis test
# H0: the number of defective items follows a Poisson distribution
# Ha: the number of defective items does not follow a Poisson distribution

# Upper critical value of the chi-squared distribution:
# H0 is not rejected if chi2_score falls inside [0, upper_CI].

#http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/en_Tanagra_Calcul_P_Value.pdf

upper_CI = stats.chi2.ppf(q=1-alpha, df=dof)

In [123]:
CI = [0, upper_CI]
CI

[0, 7.814727903251179]

`Conclusions`: 

Since the chi2_score value is 34.32, it falls outside the interval we have calculated. That means that we **can** reject the null hypothesis that the number of defective items follows a Poisson distribution.
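One caveat: the expected frequencies for 3 and 4 defectives (about 3.44 and 1.02 in the table above) are below 5, which is a common rule-of-thumb threshold for the chi-squared approximation. The following sketch, which is an added robustness check and not part of the original analysis, pools categories 2, 3 and 4 into a single "2 or more" class and repeats the test. The pooling choice and the names `pooled_obs`, `pooled_exp` and `chi2_pooled` are assumptions made for illustration.

```python
from scipy import stats

observed = [32, 15, 0, 9, 4]  # frequencies for 0, 1, 2, 3, 4 defectives
n = sum(observed)
lam = sum(k * f for k, f in enumerate(observed)) / n  # sample mean

# Pool categories 2, 3 and 4 into one "2 or more" class so that
# every expected frequency is at least 5.
pooled_obs = [observed[0], observed[1], sum(observed[2:])]
probs = [stats.poisson.pmf(0, lam), stats.poisson.pmf(1, lam)]
probs.append(1 - sum(probs))  # upper tail: P(X >= 2)
pooled_exp = [n * p for p in probs]

chi2_pooled = sum((o - e) ** 2 / e for o, e in zip(pooled_obs, pooled_exp))
dof = len(pooled_obs) - 1 - 1  # one parameter (the mean) was estimated
critical = stats.chi2.ppf(0.95, df=dof)
print(chi2_pooled, critical, chi2_pooled > critical)
```

Under these assumptions the pooled statistic still exceeds the critical value, so the decision to reject the null hypothesis does not depend on the sparse categories.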

## Question 4
A quality control engineer takes a sample of 10 tires that come off an assembly line and would like to verify, on the basis of the data that follows (the number of tires with defects observed over 200 days), whether it is true that 5% of all tires have defects (that is, whether the sample comes from a binomial population with n = 10 and p = 0.05).

In [None]:
size = 10 
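The observed frequencies for this question are not reproduced in this notebook, so the test itself cannot be completed here. As a starting point, the expected frequencies under the hypothesized binomial model can be sketched as follows. This assumes that each of the 200 days contributes one sample of 10 tires; `p_defect`, `days` and `expected_days` are illustrative names.

```python
from scipy import stats

size = 10        # tires per sample (n in the binomial model)
p_defect = 0.05  # hypothesized proportion of defective tires
days = 200       # number of days observed (assumed: one sample per day)

# Expected number of days with k defective tires, for k = 0..10.
expected_days = [days * stats.binom.pmf(k, size, p_defect) for k in range(size + 1)]
for k, e in enumerate(expected_days):
    print(k, round(e, 2))
```

These expected frequencies would then be compared with the observed ones using the same chi-squared statistic as in the previous questions. Here no parameter is estimated from the data, since p = 0.05 is given by the hypothesis.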


## Question 5
A researcher gathers information about the patterns of physical activity (PA) of fifth-grade children at a public primary school. He defines three categories of physical activity (1 = Low, 2 = Medium, 3 = High). He also asks about the regular consumption of sugary drinks at school and defines two categories (1 = consumed, 0 = not consumed). We would like to evaluate, at the 5% significance level, whether there is an association between patterns of physical activity and the consumption of sugary drinks for the children of this school. The results are in the following table:

![](table5.png)

In [None]:
#your answer here
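A question of this kind is usually approached with a chi-squared test of independence on the contingency table. The following is a minimal sketch using `scipy.stats.chi2_contingency`. The counts below are placeholders invented for illustration, since the values in table5.png are not available here, and must be replaced with the real data before drawing any conclusion.

```python
import numpy as np
from scipy import stats

# PLACEHOLDER counts, NOT the values from table5.png.
# Rows: sugary drinks (0 = not consumed, 1 = consumed).
# Columns: physical activity (1 = Low, 2 = Medium, 3 = High).
table = np.array([[10, 20, 30],
                  [30, 20, 10]])

# H0: physical activity and sugary drink consumption are independent.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2_stat, p_value, dof)
```

For a 2 x 3 table the test has (2 - 1)(3 - 1) = 2 degrees of freedom, and H0 would be rejected at the 5% level if `p_value` is below 0.05.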