# The Chi-square Goodness of Fit Test 

The chi-square goodness of fit test checks whether the observed distribution of a variable fits a given (expected) distribution. When the calculated p-value is less than the significance level, we reject the null hypothesis. This is different from the [chi-square test for independence](https://medium.com/@shinichiokada/chi-square-test-for-independence-for-ib-diploma-mathematics-e197aafbd771?source=friends_link&sk=0dc57f1115d88885affa7a04aadfded9), where we test whether two variables are independent.

You can find all the code at this [link](https://github.com/shinokada/python-for-ib-diploma-mathematics/blob/master/The%20Chi-square%20Goodness%20of%20Fit%20Test%20and%20The%20Pooled%20two-sample%20t-test%20.ipynb).

$H_0$ (null hypothesis): In the chi-square goodness of fit test, the null hypothesis states that there is no significant difference between the observed and the expected values.

$H_1$ (alternative hypothesis): In the chi-square goodness of fit test, the alternative hypothesis states that there is a significant difference between the observed and the expected values.


We import `chisquare` from `scipy.stats`.

In [45]:
# https://gist.github.com/shinokada/10a50e6850d8309f7eaed74c3dd0d27a

from scipy.stats import chisquare

significance = 0.05
observed = [51, 24, 13, 12]
expected = [48, 25, 15, 12]

# chisquare returns the test statistic and the p-value
chi, pval = chisquare(observed, expected)

print('chi-square=%.6f, p-value=%.2f\n' % (chi, pval))

if pval < significance:
    print("""At %.2f level of significance, we reject the null hypothesis and accept H1.
There is a significant difference between the observed and the expected values.""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis.
There is no significant difference between the observed and the expected values.""" % (significance))


chi-square=0.494167, p-value=0.92

At 0.05 level of significance, we fail to reject the null hypothesis.
There is no significant difference between the observed and the expected values.
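
To see where the test statistic comes from, the chi-square value can be reproduced by hand as the sum of (observed − expected)² / expected over all categories. The following is a minimal sketch using only plain Python; the variable names `chi_square`, `df` and `critical_value` are illustrative, and the critical value 7.815 (chi-square distribution, 3 degrees of freedom, 5% level) is assumed from a standard statistical table rather than computed.

```python
# Observed and expected counts from the example above
observed = [51, 24, 13, 12]
expected = [48, 25, 15, 12]

# Chi-square statistic: sum of (O - E)^2 / E over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus 1
df = len(observed) - 1

# Assumed table value: chi-square critical value for df = 3 at the 5% level
critical_value = 7.815

print('chi-square=%.6f, df=%d' % (chi_square, df))
# We reject H0 only if the statistic exceeds the critical value
print('reject H0:', chi_square > critical_value)
```

Comparing the statistic with the critical value leads to the same decision as comparing the p-value with the significance level.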


# The Pooled two-sample t-test 

In IB Diploma Mathematics Applications and Interpretation, you are required to perform the pooled two-sample t-test. This test compares the means of two independent sets of data that are sampled from normally distributed populations. The pooled version assumes that the two populations have the same variance.

## One-tailed tests

The first kind is the one-tailed test. It is a hypothesis test with an alternative hypothesis that considers only one side of the distribution curve; for example, $H_1:\mu<\mu_0$ or $H_1:\mu>\mu_0$.

\begin{align}
\text{If }\ H_0:\mu_1 \geq \mu_2, \text{ then } H_1:\mu_1<\mu_2 \tag{1}\\
\text{If }\ H_0:\mu_1 \leq \mu_2, \text{ then } H_1:\mu_1>\mu_2 \tag{2}
\end{align}


We are going to use the [statsmodels](https://www.statsmodels.org/dev/index.html) module to perform the pooled two-sample t-test.

We need to install it from a terminal. 

If you are using Anaconda,

`conda install -c conda-forge statsmodels`

Otherwise, you can install it with `pip`:

`pip install statsmodels`

We set the alternative hypothesis with the `alternative` option:

- `'two-sided'` (default): the difference in means is not equal to `value` (0 by default), $H_1:\mu_1\ne\mu_2$
- `'larger'`: the difference in means is larger than `value`, $H_1:\mu_1>\mu_2$
- `'smaller'`: the difference in means is smaller than `value`, $H_1:\mu_1<\mu_2$

In the following we set it to `smaller` which means 
$$H_0:\mu_1\geq\mu_2$$
$$H_1:\mu_1<\mu_2$$

Since IB requires the pooled test, we set it as `usevar='pooled'`.

In [47]:
# https://gist.github.com/shinokada/88dc8e2a2c868c97e15dc0cdc2f63572

import statsmodels.stats.weightstats as sm

significance = 0.05
list1=[3,5,4,6,6,5,3,2,3,4,5,3,4]
list2=[4,6,6,7,6,4,4,4,3,6,5,4,5]
tstat, pvalue, df = sm.ttest_ind(
    list1,list2,
    alternative='smaller',
    usevar='pooled')

print("""Test statistic=%.2f,
p-value=%.4f,
degrees of freedom=%.0f\n""" % (tstat, pvalue, df))

if pvalue < significance:
    print("""At %.2f level of significance,
we reject the null hypothesis.
Mean 1 is less than mean 2.""" % (significance))
else:
    # Failing to reject H0 does not show that mean 1 is greater
    print("""At %.2f level of significance,
we fail to reject the null hypothesis.
There is insufficient evidence that mean 1 is less than mean 2.""" % (significance))

Test statistic=-1.77,
p-value=0.0451,
degrees of freedom=24

At 0.05 level of significance,
we reject the null hypothesis.
Mean 1 is less than mean 2.
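
The test statistic and the degrees of freedom reported above can be reproduced by hand from the pooled variance formula. Here is a minimal sketch using only the Python standard library; the names `pooled_var` and `std_err` are illustrative and not part of statsmodels.

```python
from math import sqrt
from statistics import mean, variance

list1 = [3, 5, 4, 6, 6, 5, 3, 2, 3, 4, 5, 3, 4]
list2 = [4, 6, 6, 7, 6, 4, 4, 4, 3, 6, 5, 4, 5]
n1, n2 = len(list1), len(list2)

# Degrees of freedom of the pooled test
df = n1 + n2 - 2

# Pooled variance: sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * variance(list1) + (n2 - 1) * variance(list2)) / df

# Standard error of the difference between the two sample means
std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic for the difference in means
tstat = (mean(list1) - mean(list2)) / std_err

print('Test statistic=%.2f, degrees of freedom=%d' % (tstat, df))
```

The statistic is negative because the mean of `list1` is smaller than the mean of `list2`, which is why the `smaller` alternative gives a small p-value.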


If you want to test $H_1:\mu_1>\mu_2$, you need to set the `alternative` to `larger` which means 

$$H_0:\mu_1\leq\mu_2$$
$$H_1:\mu_1>\mu_2$$

In [54]:
# https://gist.github.com/shinokada/29b65b084a1aa05a016d8081faaa40a2

import statsmodels.stats.weightstats as sm

significance = 0.05
list1=[3,5,4,6,6,5,3,2,3,4,5,3,4]
list2=[4,6,6,7,6,4,4,4,3,6,5,4,5]
tstat, pvalue, df = sm.ttest_ind(
    list1,list2,
    alternative='larger',
    usevar='pooled')

print("""Test statistic=%.2f,
p-value=%.4f,
degrees of freedom=%.0f\n""" % (tstat, pvalue, df))

if pvalue < significance:
    print("""At %.2f level of significance,
we reject the null hypothesis.
Mean 1 is greater than mean 2.""" % (significance))
else:
    # Failing to reject H0 does not show that mean 1 is less
    print("""At %.2f level of significance,
we fail to reject the null hypothesis.
There is insufficient evidence that mean 1 is greater than mean 2.""" % (significance))

Test statistic=-1.77,
p-value=0.9549,
degrees of freedom=24

At 0.05 level of significance,
we fail to reject the null hypothesis.
There is insufficient evidence that mean 1 is greater than mean 2.


## Two-tailed tests

The other kind is the two-tailed test. It is a hypothesis test with an alternative hypothesis that considers both sides of the distribution curve; for example, $H_1:\mu\ne\mu_0$.
Here we test whether the mean of the first set is significantly different from the mean of the second set in either direction.

This means:

$$H_0:\mu_1=\mu_2$$
$$H_1:\mu_1\ne\mu_2$$


In [56]:
# https://gist.github.com/shinokada/967875874850b70c1a7950bfb12202f5

import statsmodels.stats.weightstats as sm

significance = 0.05
list1=[3,5,4,6,6,5,3,2,3,4,5,3,4]
list2=[4,6,6,7,6,4,4,4,3,6,5,4,5]
tstat, pvalue, df = sm.ttest_ind(
    list1,list2,
    alternative='two-sided',
    usevar='pooled')

print("""Test statistic=%.2f,
p-value=%.4f,
degrees of freedom=%.0f\n""" % (tstat, pvalue, df))

if pvalue < significance:
    print("""At %.2f level of significance,
we reject the null hypothesis.
Mean 1 is not equal to mean 2.""" % (significance))
else:
    print("""At %.2f level of significance,
we fail to reject the null hypothesis.
There is insufficient evidence that mean 1 differs from mean 2.""" % (significance))

Test statistic=-1.77,
p-value=0.0903,
degrees of freedom=24

At 0.05 level of significance,
we fail to reject the null hypothesis.
There is insufficient evidence that mean 1 differs from mean 2.
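
The same data gave three different p-values, and they are related because the t distribution is symmetric: the two one-tailed p-values add up to 1, and the two-tailed p-value is twice the smaller one-tailed p-value. The short sketch below checks this using only the rounded values printed above, so small differences in the last digit are expected; the variable names are illustrative.

```python
# p-values reported above for the same two lists (rounded to 4 decimal places)
p_smaller = 0.0451    # alternative='smaller'
p_larger = 0.9549     # alternative='larger'
p_two_sided = 0.0903  # alternative='two-sided'

# The two one-tailed p-values are complementary
complement_sum = p_smaller + p_larger

# The two-tailed p-value doubles the smaller tail
doubled_tail = 2 * min(p_smaller, p_larger)

print('p_smaller + p_larger = %.4f' % complement_sum)
print('2 * smaller tail = %.4f, reported two-sided = %.4f'
      % (doubled_tail, p_two_sided))
```

This is why the one-tailed test with `smaller` rejects the null hypothesis at the 0.05 level while the two-tailed test on the same data does not.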


## Using csv file

This link is a data set "Brain Size and Intelligence". You can 

# Reference

- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

- https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ttest_ind.html

- https://www.statisticssolutions.com/chi-square-goodness-of-fit-test/