## Inferential statistics


Answer every **Question**.

References:
- SciPy      
https://www.scipy.org/

- Wikipedia Chi-squared test    
https://en.wikipedia.org/wiki/Chi-squared_test

- Wikipedia F-test, ANOVA    
https://en.wikipedia.org/wiki/F-test

- Wikipedia Pearson's Correlation Coefficient    
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

- Statistics - Student T Test   
https://www.tutorialspoint.com/statistics/student_t_test.htm

- Stat Trek Chi-Square Distribution   
https://stattrek.com/probability-distributions/chi-square.aspx

- Jake Huneycutt, Running Chi-Square Tests with Die Roll Data in Python   
https://towardsdatascience.com/running-chi-square-tests-in-python-with-die-roll-data-b9903817c51b


### Z-score (standard score)

The `z-score` is the signed fractional number of standard deviations an observation or data point is above the mean value of what is being observed or measured.

If the population mean and population standard deviation are known, the standard score of a raw score $x$ is calculated as:

$$z=\dfrac{(x - \mu)}{\sigma}$$

where:  
$\mu$ is the mean of the population.  
$\sigma$ is the standard deviation of the population.

When the population mean and the population standard deviation are unknown, the standard score may be calculated using the sample mean and sample standard deviation as estimates of the population values.

In these cases, the z score is:

$$z=\dfrac{(x - \bar x)}{S}$$

where:  
$\bar {x}$ is the mean of the sample.   
$S$ is the standard deviation of the sample.
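
As a minimal sketch of the sample-based formula, the snippet below standardizes a small array with NumPy. The data values are hypothetical and chosen only for illustration; `ddof=1` makes NumPy divide by $n-1$, which is the usual sample standard deviation.

```python
import numpy as np

# Hypothetical sample, used only for illustration.
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

x_bar = sample.mean()       # sample mean
s = sample.std(ddof=1)      # sample standard deviation (divides by n - 1)

# z = (x - x_bar) / S, applied element-wise
z_scores = (sample - x_bar) / s
print(z_scores)
```

Note that `scipy.stats.zscore` uses `ddof=0` by default, so it matches this calculation only when called with `ddof=1`.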


Example:

Suppose that student A scored 1800 on the SAT, and student B scored 24 on the ACT. Which student performed better relative to other test-takers?

| |SAT|ACT|
|---|---|---|
|Mean|1500|21|
|Standard deviation|300|5|

The z-score for student A is $z={x-\mu  \over \sigma }={1800-1500 \over 300}=1$

The z-score for student B is $z={x-\mu  \over \sigma }={24-21 \over 5}=0.6$

Because student A's z-score is higher, student A performed better relative to other test-takers.
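
The same comparison can be sketched in a few lines of Python. The helper name `z_score` is hypothetical; the means and standard deviations come from the table above.

```python
def z_score(x, mu, sigma):
    """Standard score of x given a population mean and standard deviation."""
    return (x - mu) / sigma

z_a = z_score(1800, 1500, 300)   # student A, SAT
z_b = z_score(24, 21, 5)         # student B, ACT

# The higher z-score indicates the better result relative to other test-takers.
better = "A" if z_a > z_b else "B"
print(z_a, z_b, better)
```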


#### Question: Using Scipy

`scipy.stats.zscore`

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html

Calculate the z-score for the following array.
    

In [0]:
%matplotlib inline
import numpy as np

a = np.array([ 0.7972,  0.0767,  0.4383,  0.7866,  0.8091,
               0.1954,  0.6307,  0.6599,  0.1065,  0.0508])



In [0]:

# your work here



This function preserves ndarray subclasses and also works with matrices and masked arrays. The returned array contains the z-score of each element in the original array.

### Student's t-Test

The `t-test` is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

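A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term (typically the population standard deviation) were known. When that scaling term is estimated from the data, the test statistic follows a Student's t-distribution instead.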

The t-test can be used, for example, to determine if the means of two sets of data are significantly different from each other.

The t-test also tells you how significant the differences are with a p-value.

The t score is a ratio between the difference between two groups and the difference within the groups. 

The larger the t score, the more difference there is between groups. The smaller the t score, the more similarity there is between groups. 

A t score of 3 means that the groups are three times as different from each other as they are within each other. 

https://en.wikipedia.org/wiki/Student%27s_t-test 

For applying t-test, the value of t-statistic is computed. For this, the following formula is used:

$t=\dfrac{\text{Deviation from the population parameter}}{\text{Standard Error of the sample statistic}}$

where $t$ is the test statistic used for the test of hypothesis.

#### Test of Hypothesis about a population

$t= \dfrac{\bar{X} - \mu}{ S/ \sqrt{n}}$

where $S=\sqrt{\dfrac{\sum (X - \bar{X})^2}{n-1}}$ is the sample standard deviation.
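
A minimal sketch of this one-sample statistic, assuming a small hypothetical data set and a hypothesized mean `mu_0` (neither is part of the assignment). The manual value is cross-checked against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements and hypothesized mean, for illustration only.
sample = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7])
mu_0 = 5.0

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)                      # sample standard deviation (n - 1 denominator)
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# Cross-check against SciPy's one-sample t-test (two-sided by default).
result = stats.ttest_1samp(sample, popmean=mu_0)
print(t_manual, result.statistic, result.pvalue)
```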



#### Question: Student's t-test

A sample of $n=9$ taken from a population has a sample mean of $41.5$ inches, and the sum of squared deviations from this mean is $72$ square inches.

Show whether the assumed population mean of $44.5$ inches is reasonable.

Degrees of freedom: $v = n - 1 = 9 - 1 = 8$.

For a two-tailed test, if $v=8$, $t_{0.05}=2.306$. 

$\bar{X}=41.5$

$\mu=44.5$

$n=9$

$\sum(X - \bar{X})^2=72$

Take the null hypothesis that the population mean is $44.5$:

$H_0: \mu=44.5$ and $H_1: \mu \ne 44.5$

If $|t|$ is greater than $t_{0.05}$, reject the null hypothesis and conclude that the assumed population mean is unreasonable.
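
The tabulated critical value can also be obtained from SciPy instead of a printed table. This is a small sketch using `scipy.stats.t.ppf`; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05   # two-tailed significance level
dof = 8        # degrees of freedom, n - 1

# For a two-tailed test, alpha is split across both tails of the distribution.
t_crit = stats.t.ppf(1 - alpha / 2, dof)
print(round(t_crit, 3))
```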


In [0]:
# Your work here




#### T-test for the means of two independent samples of scores.

`scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')`

This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

For the following t-tests, what do the results mean?


In [0]:
from scipy import stats
np.random.seed(12345678)

# Test with sample with identical means:

rvs1 = stats.norm.rvs(loc=5,scale=10,size=500)
rvs2 = stats.norm.rvs(loc=5,scale=10,size=500)
print( stats.ttest_ind(rvs1,rvs2) )

print(  stats.ttest_ind(rvs1,rvs2, equal_var = False) )

In [0]:
# ttest_ind underestimates p for unequal variances:

rvs3 = stats.norm.rvs(loc=5, scale=20, size=500)
print( stats.ttest_ind(rvs1, rvs3) )

print( stats.ttest_ind(rvs1, rvs3, equal_var = False) )

In [0]:
# When n1 != n2, the equal variance t-statistic is no longer equal to the unequal variance t-statistic:

rvs4 = stats.norm.rvs(loc=5, scale=20, size=100)
print( stats.ttest_ind(rvs1, rvs4) )

print( stats.ttest_ind(rvs1, rvs4, equal_var = False) )

In [0]:
# T-test with different means, variance, and n:

rvs5 = stats.norm.rvs(loc=8, scale=20, size=100)
print( stats.ttest_ind(rvs1, rvs5) )

print( stats.ttest_ind(rvs1, rvs5, equal_var = False) )

### F-test

An F-test is any statistical test in which the test statistic has an `F-distribution` under the null hypothesis. 

It is often used when comparing statistical models that have been fitted to a data set in order to identify the model that best fits the population from which the data were sampled.

A common application is the analysis of variance (ANOVA), where the F-test compares the variability between group means with the variability within groups.

ANOVA can be thought of as an extension of the t-test. The independent t-test compares the means of a condition between two groups. ANOVA is used to compare the means of a condition across more than two groups.

The formula for the one-way ANOVA F-test statistic is:

$$F={\frac  {{\text{explained variance}}}{{\text{unexplained variance}}}}$$

or

$$F={\frac  {{\text{between-group variability}}}{{\text{within-group variability}}}}$$

The "explained variance", or "between-group variability" is:

$${\displaystyle \sum _{i=1}^{K}n_{i}({\bar {Y}}_{i\cdot }-{\bar {Y}})^{2}/(K-1)}$$

where ${\bar  {Y}}_{{i\cdot }}$ denotes the sample mean in the $i^{th}$ group, $n_{i}$ is the number of observations in the $i^{th}$ group, ${\bar  {Y}}$ denotes the overall mean of the data, and $K$ denotes the number of groups.

The "unexplained variance", or "within-group variability" is:

$$\sum _{i=1}^{K}\sum _{j=1}^{n_{i}}\left(Y_{ij}-{\bar {Y}}_{i\cdot }\right)^{2}/(N-K)$$

where $Y_{ij}$ is the $j^{th}$ observation in the $i^{th}$ out of $K$ groups and $N$ is the overall sample size. 

Note that when there are only two groups for the one-way ANOVA F-test, $F = t^2$, where $t$ is the Student's t statistic.
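
The between-group and within-group formulas above can be sketched directly in NumPy. The three groups below are hypothetical values chosen for easy arithmetic; the result is compared with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical groups, for illustration only.
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

K = len(groups)                              # number of groups
N = sum(g.size for g in groups)              # overall sample size
grand_mean = np.concatenate(groups).mean()   # overall mean of the data

# Explained (between-group) variance: weighted squared distance of group means
between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups) / (K - 1)

# Unexplained (within-group) variance: squared distance of points from their group mean
within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - K)

F_manual = between / within

# Cross-check against SciPy's one-way ANOVA.
F_scipy, p_value = stats.f_oneway(*groups)
print(F_manual, F_scipy, p_value)
```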

#### ANOVA Example

This example uses data measuring the effects of different doses of a clinical drug, Difficile, on libido. It contains two columns of interest, “dose” and “libido”. Dose records the dosing level (“placebo”, “low”, or “high”), and libido is measured on a 7-point Likert scale, with 7 being the highest and 1 being the lowest.
    
https://pythonfordatascience.org/anova-python/

In [0]:
import pandas as pd
import scipy.stats as stats
#import researchpy as rp
import statsmodels.api as sm
from statsmodels.formula.api import ols
    
import matplotlib.pyplot as plt

# Loading data
df = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/difficile.csv")
df.drop('person', axis= 1, inplace= True)

# Recoding value from numeric to string
# (assign the result back; in-place replace on a selected column is unreliable in recent pandas)
df['dose'] = df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'})
    
# Summary statistics
df['libido'].describe()

count    15.000000
mean      3.466667
std       1.767430
min       1.000000
25%       2.000000
50%       3.000000
75%       4.500000
max       7.000000
Name: libido, dtype: float64

We are really interested in the data by dosing.

In [0]:
df['libido'].groupby(df['dose']).describe()

| dose | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| high | 5.0 | 5.0 | 1.581139 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 |
| low | 5.0 | 3.2 | 1.30384 | 2.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| placebo | 5.0 | 2.2 | 1.30384 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |


#### ANOVA with scipy.stats

With `scipy.stats`, the function to use is `stats.f_oneway()`. The general call looks like this:

`stats.f_oneway(data_group1, data_group2, data_group3, data_groupN)`

In [0]:
stats.f_oneway(df['libido'][df['dose'] == 'high'], 
             df['libido'][df['dose'] == 'low'],
             df['libido'][df['dose'] == 'placebo'])

F_onewayResult(statistic=5.11864406779661, pvalue=0.024694289538222603)

The F-statistic is 5.119 and the p-value is 0.025, indicating an overall significant effect of medication on libido at the 0.05 level. However, the test does not tell us which dose groups differ from each other.

### Correlation Coefficients

Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y. 

It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. 

It is widely used in the sciences.

#### For a sample

Pearson's correlation coefficient when applied to a sample is commonly represented by $r_{xy}$.

The formula for $r_{xy}$ can be derived by substituting estimates of the covariances and variances. 

Given paired data ${\displaystyle \left\{(x_{1},y_{1}),\ldots ,(x_{n},y_{n})\right\}}$ consisting of $n$ pairs, $r_{xy}$ is defined as:

$${\displaystyle r_{xy}={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}}}$$

where: 
- $n$ is sample size  
- $x_{i},y_{i}$ are the individual sample points indexed with $i$  
- ${\bar {x}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}$ (the sample mean); and analogously for ${\bar {y}}$
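
As a minimal sketch, the definition of $r_{xy}$ can be evaluated term by term with NumPy and checked against `np.corrcoef`. The paired data below are hypothetical.

```python
import numpy as np

# Hypothetical paired data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()   # deviations of x from its sample mean
dy = y - y.mean()   # deviations of y from its sample mean

# Numerator: sum of products of deviations; denominator: product of root sums of squares
r_xy = (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

# Cross-check against NumPy's correlation matrix (off-diagonal entry).
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_xy, r_numpy)
```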


#### Question: Pearson Correlation

`scipy.stats.pearsonr(x, y)`

Pearson correlation coefficient and p-value for testing non-correlation.

Returns: Pearson’s correlation coefficient, Two-tailed p-value.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

Use `scipy.stats.pearsonr(x, y)` to calculate the correlation coefficient and p-value for each of the following two pairs of arrays.
    

In [0]:
a = np.array([0, 0, 0, 1, 1, 1, 1])
b = np.arange(7)

# Your work here


In [0]:
a = [1, 2, 3, 4, 5]
b = [10, 9, 2.5, 6, 4]

# Your work here


### Chi-squared test

A chi-squared test, also written as $\chi^2$ test, is any statistical hypothesis test where the sampling distribution of the test statistic is a $\chi^2$ distribution when the null hypothesis is true.  

The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

It is typically used as a goodness-of-fit test of a sample with respect to the population.


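For a random sample of size $n$ drawn from a normal population, the chi-square statistic relating the sample standard deviation to the population standard deviation is:

$$\chi^2 = \dfrac{(n-1)\,s^2}{\sigma^2}$$

where:  
$\sigma$ is the standard deviation of the population.  
$s$ is the standard deviation of the sample.  
$n$ is the number of sample observations.

This statistic follows a chi-square distribution with $n-1$ degrees of freedom.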
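
A short sketch of this statistic in Python, using hypothetical values for $n$, $s$, and $\sigma$ (not the values from the question that follows). The upper-tail probability is taken from `scipy.stats.chi2.sf` with $n-1$ degrees of freedom.

```python
from scipy import stats

# Hypothetical inputs, for illustration only.
n = 10        # number of sample observations
s = 5.0       # sample standard deviation
sigma = 4.0   # population standard deviation

chi2_stat = (n - 1) * s ** 2 / sigma ** 2
dof = n - 1

# Probability of a statistic at least this large under a chi-square distribution
p_upper = stats.chi2.sf(chi2_stat, dof)
print(chi2_stat, p_upper)
```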

#### Question: Chi Squared Statistic

The Big Tech Company has developed a new cell phone battery. On average, the battery lasts 60 minutes on a single charge. The standard deviation is 4 minutes.

Suppose the manufacturing department runs a quality control test. They randomly select 7 batteries. The standard deviation of the selected batteries is 6 minutes. What would be the chi-square statistic represented by this test?



In [0]:
# Your work here


#### Chi squared test example

If we roll a standard 6-sided die a thousand times, we expect each number to come up approximately 1/6 of the time (about 16.67%). A chi-square test can help determine whether a die is ‘fair’ or whether die-roll generators (such as those used in software) are generating ‘random’ results.

Assume we have the following dice roll data.

In [0]:
import numpy as np
a1 = [6, 4, 5, 10]
a2 = [8, 5, 3, 3]
a3 = [5, 4, 8, 4]
a4 = [4, 11, 7, 13]
a5 = [5, 8, 7, 6]
a6 = [7, 3, 5, 9]
dice = np.array([a1, a2, a3, a4, a5, a6])

In [0]:
from scipy import stats

stats.chi2_contingency(dice)

(16.490612061288754,
 0.35021521809742745,
 15,
 array([[ 5.83333333,  5.83333333,  5.83333333,  7.5       ],
        [ 4.43333333,  4.43333333,  4.43333333,  5.7       ],
        [ 4.9       ,  4.9       ,  4.9       ,  6.3       ],
        [ 8.16666667,  8.16666667,  8.16666667, 10.5       ],
        [ 6.06666667,  6.06666667,  6.06666667,  7.8       ],
        [ 5.6       ,  5.6       ,  5.6       ,  7.2       ]]))

The first value (16.49) is the chi-square statistic. 

The third number in the output is the degrees of freedom. This can be calculated by taking the number of rows minus one and multiplying this result by the number of columns minus one.

In this instance:
    
Rows = 6 [die rolls 1–6]

Columns = 4 [samples]

So we take (6 − 1) and multiply by (4 − 1) to get 15 degrees of freedom.

With the chi-square statistic and the degrees of freedom, we can find the p-value.

The p-value is what we use to determine significance (or independence in this case). 

Depending on the test, we are generally looking for a threshold at either 0.05 or 0.01. 

Our test is significant (i.e. we reject the null hypothesis) if we get a p-value below our threshold.

For our purposes, we’ll use 0.01 as the threshold. 

In this particular example, the p-value (the second number in our output: 0.3502) is well above 0.01, so the result is not statistically significant and we fail to reject the null hypothesis.
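
The p-value reported above can be recovered from the chi-square statistic and the degrees of freedom. The sketch below assumes the statistic is copied from the output shown earlier and uses the chi-square survival function `scipy.stats.chi2.sf`.

```python
from scipy import stats

chi2_stat = 16.490612061288754   # statistic copied from the output above
dof = (6 - 1) * (4 - 1)          # (rows - 1) * (columns - 1)

# Upper-tail probability of the chi-square distribution
p_val = stats.chi2.sf(chi2_stat, dof)
print(dof, p_val)
```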


In [0]:
chi2_stat, p_val, dof, ex = stats.chi2_contingency(dice)
print("===Chi2 Stat===")
print(chi2_stat)
print("\n")
print("===Degrees of Freedom===")
print(dof)
print("\n")
print("===P-Value===")
print(p_val)
print("\n")
print("===Contingency Table===")
print(ex)

===Chi2 Stat===
16.490612061288754


===Degrees of Freedom===
15


===P-Value===
0.35021521809742745


===Contingency Table===
[[ 5.83333333  5.83333333  5.83333333  7.5       ]
 [ 4.43333333  4.43333333  4.43333333  5.7       ]
 [ 4.9         4.9         4.9         6.3       ]
 [ 8.16666667  8.16666667  8.16666667 10.5       ]
 [ 6.06666667  6.06666667  6.06666667  7.8       ]
 [ 5.6         5.6         5.6         7.2       ]]


The array at the end of the output is the table of expected frequencies, computed from the row and column totals of all samples.

Note that in this case some expected values are quite far from the 1/6 share per face that we know a fair die should produce. This is because the sample is too small to measure the population accurately.
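
To test fairness against the theoretical 1/6 probability directly, one option is a goodness-of-fit test with `scipy.stats.chisquare`. The sketch below assumes, as in the example above, that each row of `dice` holds the counts for one face and each column is one sample; pooling the samples this way is an illustrative choice, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# Same die-roll data as above: rows are faces 1-6, columns are samples.
dice = np.array([[6, 4, 5, 10],
                 [8, 5, 3, 3],
                 [5, 4, 8, 4],
                 [4, 11, 7, 13],
                 [5, 8, 7, 6],
                 [7, 3, 5, 9]])

observed = dice.sum(axis=1)                  # total count per face across samples
expected = np.full(6, observed.sum() / 6)    # a fair die gives each face 1/6 of the rolls

stat, p = stats.chisquare(observed, f_exp=expected)
print(observed, stat, p)
```

With six categories, this test has 5 degrees of freedom, and the same p-value threshold discussed above applies.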