# Notebook: Statistical tests with avocado price data

This notebook summarizes common statistical tests, using avocado price data as the subject.

Previous notebooks<br>

Classification method<br>
https://www.kaggle.com/urayukitaka/notebook-classification-method<br>
Regression method<br>
https://www.kaggle.com/urayukitaka/notebook-regression-method<br>
Dimension reduction method<br>
https://www.kaggle.com/urayukitaka/notebook-dimension-reduction<br>
Image preprocessing OpenCV library<br>
https://www.kaggle.com/urayukitaka/notebook-image-preprocessing-opencv-library<br>

### Statistical method
- Normality test
- t test
- chi-square test
- F test
- ANOVA (one-way)
- ANOVA (two-way)

In [None]:
# Basic library
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Statistics library
from scipy.stats import norm
from scipy import stats
import scipy
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import statsmodels.api as sm
import statsmodels.stats.anova as anova

# random values (import only what is used instead of a wildcard import)
from numpy.random import randn

# Visualization
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
import seaborn as sns

## Data loading and checks

In [None]:
df = pd.read_csv("/kaggle/input/avocado-prices/avocado.csv", header=0)

In [None]:
df.head()

In [None]:
# Null value
df.isnull().sum()

In [None]:
sns.pairplot(df.sample(200))

# Normality test
Test whether the average price is normally distributed.

### Visualization check

Distribution plot and Q-Q (probability) plot with the raw data.

In [None]:
# data
data = df["AveragePrice"]

# calculation of skew and kurtosis
skew = scipy.stats.skew(data)
kurt = scipy.stats.kurtosis(data)

# basic check with a distribution plot and a probability plot
fig, ax = plt.subplots(1,2,figsize=(20,6))
sns.distplot(data, fit=norm, ax=ax[0])
ax[0].set_ylabel("frequency")
ax[0].set_title("Distribution plot\n<skewness:%.2f>\n<kurtosis:%.2f>" % (skew,kurt))
stats.probplot(data, plot=ax[1])
ax[1].set_title("Probability plot")

Apply a log transform to the raw data.

In [None]:
# data
data = np.log(df["AveragePrice"])

# calculation of skew and kurtosis
skew = scipy.stats.skew(data)
kurt = scipy.stats.kurtosis(data)

# basic check with a distribution plot and a probability plot
fig, ax = plt.subplots(1,2,figsize=(20,6))
sns.distplot(data, fit=norm, ax=ax[0])
ax[0].set_ylabel("frequency")
ax[0].set_title("Distribution plot\n<skewness:%.2f>\n<kurtosis:%.2f>" % (skew,kurt))
stats.probplot(data, plot=ax[1])
ax[1].set_title("Probability plot")

The raw distribution is right-skewed (its mass sits toward the left, with a long right tail) because the skewness is greater than 0. In this case, taking the logarithm may bring the data closer to a normal distribution. As a result, the normality was improved.
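The effect of the log transform on skewness can be reproduced on synthetic data. This is a minimal sketch, not the notebook's own data: the log-normal sample `prices` is a made-up stand-in for `AveragePrice`, and its parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed "prices" (assumption: log-normal stand-in for AveragePrice)
prices = rng.lognormal(mean=0.3, sigma=0.4, size=5000)

# Skewness before and after the log transform
skew_raw = stats.skew(prices)
skew_log = stats.skew(np.log(prices))

print("skewness raw: %.2f" % skew_raw)
print("skewness log: %.2f" % skew_log)
```

For a sample like this, the raw skewness is clearly positive while the log-transformed skewness is close to 0.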

## Shapiro Wilk test
The null hypothesis is that the population is normally distributed.

In [None]:
# data
data = np.log(df["AveragePrice"])

# scipy.stats; the Shapiro-Wilk p-value may be inaccurate for N > 5000, so sample 4999 rows
WS, p = stats.shapiro(data.sample(4999))

In [None]:
print("p value:{}".format(p))

The p-value is small, so we can reject the null hypothesis. That is, the data are not consistent with a normal distribution.

## Kolmogorov–Smirnov test

In [None]:
# data
data = np.log(df["AveragePrice"])

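# scipy.stats; "norm" alone means the standard normal N(0, 1),
# so pass the sample mean and standard deviation as loc and scale
KS, p = stats.kstest(data, "norm", args=(data.mean(), data.std()))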

In [None]:
print("p value:{}".format(p))

For this data the test does not support normality: a small p-value means the null hypothesis of a normal distribution is rejected.
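When `kstest` is called with only `"norm"`, the sample is compared with the standard normal N(0, 1); location and scale have to be passed through `args`. The sketch below illustrates the difference on synthetic data (the sample and its parameters are made up). Note that estimating the parameters from the same sample makes the resulting p-value only approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical normal sample that is NOT standard normal (mean 0.3, std 0.3)
sample = rng.normal(loc=0.3, scale=0.3, size=2000)

# Compared with N(0, 1): rejected even though the sample is normal
_, p_default = stats.kstest(sample, "norm")

# Compared with a normal using the sample's own mean and standard deviation
_, p_fitted = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))

print("p value vs N(0, 1):        {}".format(p_default))
print("p value vs fitted normal:  {}".format(p_fitted))
```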

## Anderson-Darling test

In [None]:
# data
data = np.log(df["AveragePrice"])

statistic, critical_values, significance_level = scipy.stats.anderson(data, "norm")

In [None]:
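print("statistic:{}".format(statistic))
print("critical values:{}".format(critical_values))
print("significance level (%):{}".format(significance_level))

The Anderson-Darling result is read by comparing the test statistic with the critical values: if the statistic is larger than the critical value for a given significance level, the null hypothesis of normality is rejected at that level. The critical values and significance levels alone do not show normality.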
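A minimal sketch of how an Anderson-Darling result can be interpreted, using a synthetic, clearly non-normal sample (the exponential data are an assumption made for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical non-normal sample
sample = rng.exponential(size=1000)

result = stats.anderson(sample, dist="norm")

# Reject normality at a level when the statistic exceeds that level's critical value
for crit, level in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > crit else "cannot reject"
    print("level {}%: critical value {:.3f} -> {} normality".format(level, crit, verdict))
```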

### For reference, let's artificially create a normal distribution and execute a normality test.

In [None]:
# create standard normal data; keep N < 5000 for the Shapiro-Wilk test
norm_data = randn(4999)

# Visualization
sns.distplot(norm_data)

In [None]:
# Shapiro-Wilk test (scipy.stats)
WS, p = stats.shapiro(norm_data)
print("Shapiro Wilk test p value:{}".format(p))

# Kolmogorov–Smirnov test (scipy.stats) against the standard normal
KS, p = stats.kstest(norm_data, "norm")
print("Kolmogorov–Smirnov test p value:{}".format(p))

# Anderson-Darling test (scipy.stats)
statistic, critical_values, significance_level = scipy.stats.anderson(norm_data, "norm")
print("statistic:{}".format(statistic))
print("critical values:{}".format(critical_values))
print("significance level (%):{}".format(significance_level))

If the data are normally distributed, the null hypothesis cannot be rejected. That is, the data are consistent with a normally distributed population.

# t-test (test for difference in mean)

*From here on, raw prices (not log-transformed) are used, except in the t-test cells below, which apply the tests to log prices.

Objective: To test the difference between the 2015 and 2016 average prices. <br>

Null hypothesis: The average prices for 2015 and 2016 are the same. <br>

Alternative hypothesis: The average prices for 2015 and 2016 are different.<br>
(We cannot say that the average prices for 2015 and 2016 are the same.) <br>

Significance level: 5% <br>

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x="year", y="AveragePrice", data=df)
print("2015 average price:{}".format(df.query("year==2015")["AveragePrice"].mean()))
print("2016 average price:{}".format(df.query("year==2016")["AveragePrice"].mean()))

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(df.query("year==2015")["AveragePrice"])
sns.distplot(df.query("year==2016")["AveragePrice"])
print("2015 average price:{}".format(df.query("year==2015")["AveragePrice"].mean()))
print("2016 average price:{}".format(df.query("year==2016")["AveragePrice"].mean()))

### Unpaired, Student's t-test

In [None]:
# data (log-transformed prices)
price_2015 = np.log(df.query("year==2015")["AveragePrice"].values)
price_2016 = np.log(df.query("year==2016")["AveragePrice"].values)

# scipy.stats, Student's t-test (assumes equal variances)
stats.ttest_ind(price_2015, price_2016)

p-value < 0.05, so we can reject the null hypothesis. As a result, the average prices for 2015 and 2016 are different.

### Unpaired, Welch's t-test

In [None]:
# scipy.stats, equal_var=False gives Welch's t-test (no equal-variance assumption)
stats.ttest_ind(price_2015, price_2016, equal_var=False)

p-value < 0.05, so we can reject the null hypothesis. As a result, the average prices for 2015 and 2016 are different.
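To make clear what `equal_var=False` changes, Welch's statistic and its degrees of freedom (Welch-Satterthwaite) can be computed by hand and compared with `ttest_ind`. This is a sketch on synthetic samples; `a` and `b` are made-up data, not the avocado prices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical samples with different sizes and variances
a = rng.normal(1.0, 0.2, size=300)
b = rng.normal(1.1, 0.4, size=500)

# Squared standard errors of each sample mean
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Same test through scipy
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)

print("manual: t={:.4f}, p={:.3g}".format(t_manual, p_manual))
print("scipy : t={:.4f}, p={:.3g}".format(t_scipy, p_scipy))
```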

### Unpaired, Mann-Whitney U test

In [None]:
# scipy.stats; request the two-sided test explicitly
stats.mannwhitneyu(price_2015, price_2016, alternative="two-sided")

p-value < 0.05, so we can reject the null hypothesis. As a result, the price distributions for 2015 and 2016 are different (the Mann-Whitney U test compares ranks rather than means).
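Because the Mann-Whitney U test only uses ranks, a monotonic transform such as the logarithm does not change its result. The sketch below checks this on synthetic data (the two log-normal samples are made up for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical positive, right-skewed samples
a = rng.lognormal(mean=0.2, sigma=0.3, size=300)
b = rng.lognormal(mean=0.3, sigma=0.3, size=400)

# The test on the raw values and on the log values
u_raw, p_raw = stats.mannwhitneyu(a, b, alternative="two-sided")
u_log, p_log = stats.mannwhitneyu(np.log(a), np.log(b), alternative="two-sided")

print("raw: U={}, p={:.3g}".format(u_raw, p_raw))
print("log: U={}, p={:.3g}".format(u_log, p_log))
```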

# Chi-square test (Test of independence)

Objective: To test whether the average price by type is independent of the year. <br>

Null hypothesis: Average prices are independent. <br>

Alternative hypothesis: Average prices are dependent. <br>
(We cannot say that average prices are independent.)

Significance level: 5% <br>

In [None]:
# pivot table
pivot = pd.pivot_table(df, index="type", columns="year", values="AveragePrice", aggfunc="mean")
pivot.head()

In [None]:
# visualization
plt.figure(figsize=(10,6))
plt.plot(pivot.T.index, pivot.T["conventional"])
plt.plot(pivot.T.index, pivot.T["organic"])
plt.xlabel("year")
plt.xticks([2015,2016,2017,2018])
plt.ylabel("Average price")
plt.yticks([0.5,1,1.5,2])

In [None]:
# scipy.stats
x2, p, dof, expected = scipy.stats.chi2_contingency(pivot)

In [None]:
# result
print("x2:{}".format(x2))
print("p:{}".format(p))
print("dof:{}".format(dof))
print("expected:\n{}".format(expected))

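p-value > 0.05, so we cannot reject the null hypothesis. As a result, the average prices are treated as independent. Note that `chi2_contingency` is designed for tables of observed frequencies (counts), so applying it to a table of mean prices only illustrates the procedure.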
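For comparison, here is `chi2_contingency` applied to a table of counts, the kind of input it is designed for. The counts below are made up for illustration (rows as type, columns as a hypothetical price band); with the avocado data, such a table could be built with `pd.crosstab` after binning the prices.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts: rows = type (conventional, organic),
# columns = price band (low, middle, high)
observed = np.array([[300, 150, 50],
                     [100, 200, 200]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

print("x2:{}".format(chi2))
print("p:{}".format(p))
print("dof:{}".format(dof))
print("expected:\n{}".format(expected))
```

The expected table holds the counts implied by independence, that is, row total times column total divided by the grand total.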

# F test (Test of variance)

Objective: To test the difference between the 2015 and 2016 price variances. <br>

Null hypothesis: The price variances for 2015 and 2016 are the same. <br>

Alternative hypothesis: The price variances for 2015 and 2016 are different.<br>
(We cannot say that the price variances for 2015 and 2016 are the same.) <br>

Significance level: 5% <br>

In [None]:
# data
price_2015 = df.query("year==2015")["AveragePrice"].values
price_2016 = df.query("year==2016")["AveragePrice"].values

# scipy.stats, Bartlett's test for equal variances
scipy.stats.bartlett(price_2015, price_2016)

In [None]:
# Visualization check
plt.figure(figsize=(10,6))
sns.distplot(price_2015)
sns.distplot(price_2016)
plt.xlabel("AveragePrice")
plt.title("Distribution \n variance at 2015 %.2f \n variance at 2016 %.2f" % (price_2015.var(), price_2016.var()))

p-value < 0.05, so we can reject the null hypothesis. As a result, the variances are different.
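The cell above uses Bartlett's test. A classical two-sample F test, based on the ratio of the sample variances, can be sketched with `scipy.stats.f`. The helper `f_test` is a hypothetical name introduced here, the data are synthetic, and the test assumes both samples are normally distributed.

```python
import numpy as np
from scipy import stats

def f_test(x, y):
    """Two-sided F test for equal variances of two samples (assumes normality)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    f_stat = x.var(ddof=1) / y.var(ddof=1)
    dfn, dfd = x.size - 1, y.size - 1
    # One-sided tail probability in the direction of the observed ratio
    if f_stat > 1:
        p_one = stats.f.sf(f_stat, dfn, dfd)
    else:
        p_one = stats.f.cdf(f_stat, dfn, dfd)
    return f_stat, min(1.0, 2 * p_one)

rng = np.random.default_rng(5)

# Hypothetical samples with standard deviations 1.0 and 1.5
x = rng.normal(0, 1.0, size=400)
y = rng.normal(0, 1.5, size=400)

f_stat, p = f_test(x, y)
print("F={:.3f}, p={:.3g}".format(f_stat, p))
```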

# ANOVA, one-way analysis of variance

Objective: To test the difference between the average prices of 2015, 2016, 2017 and 2018. <br>

Null hypothesis: The average price for each year is the same. <br>

Alternative hypothesis: The average prices differ between years.<br>
(We cannot say that the average price for each year is the same.) <br>

Significance level: 5% <br>

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x="year", y="AveragePrice", data=df)
print("2015 average price:{}".format(df.query("year==2015")["AveragePrice"].mean()))
print("2016 average price:{}".format(df.query("year==2016")["AveragePrice"].mean()))
print("2017 average price:{}".format(df.query("year==2017")["AveragePrice"].mean()))
print("2018 average price:{}".format(df.query("year==2018")["AveragePrice"].mean()))

### 1st) Perform a normality test on the four samples.

In [None]:
# data
price_2015 = df.query("year==2015")["AveragePrice"].values
price_2016 = df.query("year==2016")["AveragePrice"].values
price_2017 = df.query("year==2017")["AveragePrice"].values
price_2018 = df.query("year==2018")["AveragePrice"].values

print("Shapiro Wilk test")
print("price_2015 p-value:{}".format(stats.shapiro(price_2015[:4999])[1]))
print("price_2016 p-value:{}".format(stats.shapiro(price_2016[:4999])[1]))
print("price_2017 p-value:{}".format(stats.shapiro(price_2017[:4999])[1]))
print("price_2018 p-value:{}".format(stats.shapiro(price_2018[:4999])[1]))

Result) None of the data is normally distributed.

### 2nd) Test for homoscedasticity

If the data are normal ⇒ Bartlett test, otherwise ⇒ Levene test (it can be used to some extent even if the data are not normally distributed).

This time, I select the Levene test.

Null hypothesis: the four samples have equal variances.
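The selection rule above can be sketched as a small helper. `equal_variance_test` is a hypothetical name introduced for illustration, and the exponential samples are synthetic; using Shapiro-Wilk as the normality check is an assumption taken from the previous step.

```python
import numpy as np
from scipy import stats

def equal_variance_test(*groups, alpha=0.05):
    """Use Bartlett's test if every group passes Shapiro-Wilk, else Levene's test.

    Returns the name of the test that was used and its p-value.
    """
    all_normal = all(stats.shapiro(g)[1] > alpha for g in groups)
    if all_normal:
        return "bartlett", stats.bartlett(*groups)[1]
    return "levene", stats.levene(*groups)[1]

rng = np.random.default_rng(6)

# Hypothetical non-normal samples with different spreads
skewed_a = rng.exponential(scale=1.0, size=500)
skewed_b = rng.exponential(scale=2.0, size=500)

name, p_value = equal_variance_test(skewed_a, skewed_b)
print("test used: {}, p-value: {:.3g}".format(name, p_value))
```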

In [None]:
# Levene test
stats.levene(price_2015, price_2016, price_2017, price_2018)

In [None]:
# Reference) Bartlett test
stats.bartlett(price_2015, price_2016, price_2017, price_2018)

p-value < 0.05, so the null hypothesis of equal variances can be rejected.

Therefore, a non-parametric method (for example, the Kruskal-Wallis test) is needed.
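One non-parametric alternative to one-way ANOVA is the Kruskal-Wallis H test, available as `scipy.stats.kruskal`. The sketch below uses four synthetic, right-skewed groups as stand-ins for the yearly prices; the group parameters are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical right-skewed groups, one per "year", with different locations
groups = [rng.lognormal(mean=m, sigma=0.3, size=300) for m in (0.0, 0.1, 0.2, 0.05)]

# Null hypothesis: all groups come from the same distribution
h_stat, p = stats.kruskal(*groups)
print("H={:.3f}, p-value:{}".format(h_stat, p))
```

With the notebook's data, the call would take `price_2015`, `price_2016`, `price_2017` and `price_2018` as arguments in the same way.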

### ANOVA without correspondence (unpaired)

Null hypothesis: The average of each year is equal.

The null hypothesis of equal variances was rejected, so the assumptions of ANOVA are not met. The analysis of variance is still performed here to record the method.

In [None]:
# Reference) Parametric version, valid if equal variances can be assumed.
f, p = stats.f_oneway(price_2015, price_2016, price_2017, price_2018)

print("p-value:{}".format(p))

p-value < 0.05, so we can reject the null hypothesis. As a result, the averages are different.
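The F statistic reported by `f_oneway` is the ratio of the between-group to the within-group mean square. A minimal sketch that computes it by hand and compares it with scipy, on synthetic groups (made-up data, not the avocado prices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Hypothetical groups with different means and sizes
groups = [rng.normal(loc=m, scale=0.3, size=n)
          for m, n in [(1.3, 200), (1.4, 250), (1.5, 220), (1.35, 100)]]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Sums of squares between and within groups
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = all_values.size - len(groups)

# F statistic and upper-tail p-value
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual: F={:.4f}, p={:.3g}".format(f_manual, p_manual))
print("scipy : F={:.4f}, p={:.3g}".format(f_scipy, p_scipy))
```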

# ANOVA, two-way analysis of variance


Objective: To test the average price across year (2015, 2016, 2017, 2018) and type (conventional, organic). <br>

Null hypothesis: The average prices are the same for every year and type, and there is no interaction. <br>

Alternative hypothesis: The average prices differ by year, by type, or through their interaction.<br>
(We cannot say that the average prices are the same.) <br>

Significance level: 5% <br>

In [None]:
# Check data frame summary
df.groupby(["year", "type"])["AveragePrice"].mean()

In [None]:
# Create dataframe
sample_data = df[["AveragePrice", "type", "year"]]
sample_data.head()

In [None]:
# statsmodels OLS with both factors and their interaction
formula = 'AveragePrice ~ C(type)+C(year) + C(type):C(year)'

model = ols(formula, sample_data).fit()

# Result
model.summary()

In [None]:
aov_table = sm.stats.anova_lm(model, typ=2)
print(aov_table)

The p-values are small for both factors and their interaction, so the null hypotheses can be rejected.<br>
Therefore, the average prices differ by type and by year, and the effect of type depends on the year.
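To show what the ANOVA table contains, the sums of squares of a two-way ANOVA can be computed by hand for a balanced design (the same number of observations in every cell). This is a sketch on synthetic data: the cell sizes, effects and noise level are made up, and the real avocado data are not guaranteed to be balanced.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# Hypothetical balanced design: 2 types x 4 years, 50 observations per cell
n_type, n_year, n_rep = 2, 4, 50
type_effect = np.array([0.0, 0.5])
year_effect = np.array([0.0, 0.1, 0.2, 0.05])
y = (1.0 + type_effect[:, None, None] + year_effect[None, :, None]
     + rng.normal(0, 0.3, size=(n_type, n_year, n_rep)))

grand = y.mean()
mean_type = y.mean(axis=(1, 2))   # mean per type
mean_year = y.mean(axis=(0, 2))   # mean per year
mean_cell = y.mean(axis=2)        # mean per (type, year) cell

# Sums of squares for main effects, interaction, and residual
ss_type = n_year * n_rep * ((mean_type - grand) ** 2).sum()
ss_year = n_type * n_rep * ((mean_year - grand) ** 2).sum()
ss_inter = n_rep * ((mean_cell - mean_type[:, None] - mean_year[None, :] + grand) ** 2).sum()
ss_err = ((y - mean_cell[:, :, None]) ** 2).sum()
ss_total = ((y - grand) ** 2).sum()

df_type, df_year = n_type - 1, n_year - 1
df_inter = df_type * df_year
df_err = n_type * n_year * (n_rep - 1)
ms_err = ss_err / df_err

# F statistic and p-value for each term
results = {}
for term, ss, dof in [("type", ss_type, df_type),
                      ("year", ss_year, df_year),
                      ("type:year", ss_inter, df_inter)]:
    f_stat = (ss / dof) / ms_err
    results[term] = (f_stat, stats.f.sf(f_stat, dof, df_err))
    print("{:10s} F={:.3f}, p={:.3g}".format(term, *results[term]))
```

In a balanced design the four sums of squares add up to the total sum of squares, which is a quick check on the decomposition.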