# <br>Goodness of Fit Test<br>

## Table of Contents
1. [Introduction](#section1)<br>
    - 1.1 [Problem Statement](#section101)<br/>
    - 1.2 [Hypothesis](#section102)<br/> 
2. [Chi-Square Stats](#section2)
3. [Conclusion](#section3)

<a id=section1></a>

# 1. Introduction

A **goodness-of-fit test** determines whether a **sample matches** the **population**. It is applied when you have one categorical variable from a single population, and it is used to determine whether sample data are consistent with a hypothesized distribution.


For the chi-square goodness-of-fit computation, the following equation is used:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed value and $E_i$ is the expected value for category $i$.

A model fits the data well if the **differences** between the observed values and the expected values are **small**.
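To make the computation concrete, here is a minimal sketch that evaluates the statistic by hand and compares it with `scipy.stats.chisquare`. The counts are hypothetical and chosen only for illustration; they are not taken from the dataset used in this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for three categories (illustration only)
observed = np.array([18, 22, 20])
expected = np.array([20, 20, 20])

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# scipy returns the same statistic plus a p-value (df = categories - 1)
result = stats.chisquare(f_obs=observed, f_exp=expected)

print(chi2_manual, result.statistic, result.pvalue)
```

Both totals are 60 here, so the expected counts are already on the same scale as the observed counts.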

<a id=section101></a>
## 1.1 Problem Statement

In this notebook, the **goodness-of-fit test** is applied to test whether the proportions of **weed** purchases by quality in the **Actual** year (2015) are the same as in the **Expected** year (2014).<br>

- **Assumption:** The proportion of people who bought High, Medium and Low quality weed in Jan-2014 is the expected proportion for Jan-2015<br>

![image.png](attachment:image.png)

<a id=section102></a>
## 1.2 Hypothesis Statement
- **Null Hypothesis**: The actual population proportions (2015) are equal to the expected proportions (data from 2014)
- **Alternate Hypothesis**: The actual population proportions are different from the expected proportions
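
The decision rule behind these hypotheses can be sketched as follows. The significance level of 0.05 is an assumption, since the notebook does not state one. With three quality categories the test has two degrees of freedom.

```python
from scipy import stats

alpha = 0.05   # assumed significance level (not stated in the notebook)
k = 3          # categories: High, Medium, Low
df = k - 1     # degrees of freedom for a goodness-of-fit test

# Reject the null hypothesis when the chi-square statistic exceeds this value
critical_value = stats.chi2.ppf(1 - alpha, df)
print(round(critical_value, 3))
```

Equivalently, the null hypothesis is rejected when the p-value is smaller than `alpha`.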

### Importing Required Packages

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib as mpl
%matplotlib inline
import seaborn as sns
sns.set(color_codes=True)
weed_pd = pd.read_csv("")

In [2]:
# Parse the date column so that month and year can be extracted
weed_pd["date"] = pd.to_datetime(weed_pd["date"])
weed_pd["month"] = weed_pd["date"].dt.month
weed_pd["year"] = weed_pd["date"].dt.year

In [3]:
weed_pd.head()

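|   | State | HighQ | HighQN | MedQ | MedQN | LowQ | LowQN | date | month | year |
|---|-------|-------|--------|------|-------|------|-------|------|-------|------|
| 0 | Alabama | 339.06 | 1042 | 198.64 | 933 | 149.49 | 123 | 2014-01-01 | 1 | 2014 |
| 1 | Alaska | 288.75 | 252 | 260.6 | 297 | 388.58 | 26 | 2014-01-01 | 1 | 2014 |
| 2 | Arizona | 303.31 | 1941 | 209.35 | 1625 | 189.45 | 222 | 2014-01-01 | 1 | 2014 |
| 3 | Arkansas | 361.85 | 576 | 185.62 | 544 | 125.87 | 112 | 2014-01-01 | 1 | 2014 |
| 4 | California | 248.78 | 12096 | 193.56 | 12812 | 192.92 | 778 | 2014-01-01 | 1 | 2014 |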


<a id=section2></a>
## 2. Chi-Square Stats

### Check whether the proportions of people who bought weed in Jan 2015 conformed to the norm

In [4]:
weed_jan2014 = weed_pd[(weed_pd.year==2014) & (weed_pd.month==1)][["HighQN", "MedQN", "LowQN"]]
weed_jan2015 = weed_pd[(weed_pd.year==2015) & (weed_pd.month==1)][["HighQN", "MedQN", "LowQN"]]

In [5]:
weed_jan2014.head()

|   | HighQN | MedQN | LowQN |
|---|--------|-------|-------|
| 0 | 1042 | 933 | 123 |
| 1 | 252 | 297 | 26 |
| 2 | 1941 | 1625 | 222 |
| 3 | 576 | 544 | 112 |
| 4 | 12096 | 12812 | 778 |


In [6]:
weed_jan2015.head()

|    | HighQN | MedQN | LowQN |
|----|--------|-------|-------|
| 51 | 1539 | 1463 | 182 |
| 52 | 350 | 475 | 37 |
| 53 | 2638 | 2426 | 306 |
| 54 | 846 | 836 | 145 |
| 55 | 16512 | 19151 | 1096 |


In [7]:
# Column totals across all rows for each month
Expected = weed_jan2014.sum(axis=0).to_numpy()
Observed = weed_jan2015.sum(axis=0).to_numpy()

In [8]:
print ("Expected:", Expected, "\n" , "Observed:", Observed)

Expected: [2918004 2644757  263958] 
 Observed: [4057716 4035049  358088]
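
Since the hypothesis is about proportions, it can help to look at the shares of each quality level rather than the raw totals. The sketch below hard-codes the totals printed above; the lowercase variable names are introduced here only for illustration.

```python
import numpy as np

# Totals as printed above
expected = np.array([2918004, 2644757, 263958])   # Jan 2014
observed = np.array([4057716, 4035049, 358088])   # Jan 2015

# Convert counts to proportions of each year's total
expected_prop = expected / expected.sum()
observed_prop = observed / observed.sum()

for name, e, o in zip(["HighQN", "MedQN", "LowQN"], expected_prop, observed_prop):
    print(f"{name}: 2014 = {e:.4f}, 2015 = {o:.4f}")
```

Note that the 2015 total is larger than the 2014 total, so the raw counts are not directly comparable.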


#### When the observed and expected values match, the power-divergence statistic is **zero**.

In [9]:
stats.chisquare(Observed, Expected)

Power_divergenceResult(statistic=1209562.2775169075, pvalue=0.0)
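
The call above compares raw totals from two months whose overall counts differ, so the statistic also picks up the growth in volume. Newer SciPy versions may also raise a `ValueError` when the observed and expected sums do not agree. A minimal sketch of one way to handle this, assuming the intent is to test proportions only, is to rescale the 2014 counts to the 2015 total before running the test. The lowercase names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Totals as printed earlier in the notebook
expected = np.array([2918004, 2644757, 263958])   # Jan 2014
observed = np.array([4057716, 4035049, 358088])   # Jan 2015

# Expected counts for 2015 under the 2014 proportions
expected_scaled = expected / expected.sum() * observed.sum()

result = stats.chisquare(f_obs=observed, f_exp=expected_scaled)
print(result.statistic, result.pvalue)
```

With counts this large, even small shifts in proportion produce a very small p-value, so the rescaled test is still expected to reject the null hypothesis.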

### Insights :

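- The reported **p-value is 0.0** (too small to represent in floating point), far below any conventional significance level, so we **reject** the **null** hypothesis and conclude that the data do not follow the expected proportions.

- The power-divergence statistic is far larger than zero (value = **1209562**), which indicates that the Expected and Actual values are **diverging**. Because the Expected counts were not rescaled to the 2015 total, part of this value reflects the growth in overall volume rather than a shift in proportions.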

<a id=section3></a>
## 3. Conclusion
- We **reject** the **null** hypothesis. The proportions in Jan **2015** are different from what was expected. That is, the observed data do not match the expected population proportions.

-------------------------------------------------------------------------------------------------------------------------------