# Statistical analysis on biovolume with temperature and media as factors

First we import all relevant libraries. Then we import our biovolume data (biovolume_SBWW.xlsx and biovolume_BBM.xlsx); the dataframes are displayed below.

Since we have multiple groups defined by two independent variables (temperature and media) and want to compare them, we use a two-way ANOVA. Before doing so, we need to check that the assumptions of the two-way ANOVA are not violated. The assumptions are normality and equal variance, which we check with the Shapiro-Wilk test and Levene's test, respectively.

In [34]:
import numpy as np
from scipy import stats
import pandas as pd
import scikit_posthocs as sp

In [35]:
# Make a dataframe from each Excel file
df1 = pd.read_excel('biovolume_SBWW.xlsx')
df2 = pd.read_excel('biovolume_BBM.xlsx')
display(df1)

Unnamed: 0,SBWW_10,SBWW_15,SBWW_20,SBWW_25
0,2001000000,3071000000,1650000000,2921000000
1,1609000000,1788000000,1874000000,3737000000
2,1504000000,5224000000,1509000000,2423000000


In [36]:
display(df2)

Unnamed: 0,BBM_10,BBM_15,BBM_20,BBM_25
0,1726000000,2463000000,2296000000,4258000000
1,1756000000,2775000000,1876000000,3819000000
2,1422000000,2097000000,1828000000,5049000000


In [37]:
# Put each column of the dataframe into a list
SBWW_10 = df1['SBWW_10'].tolist()
SBWW_15 = df1['SBWW_15'].tolist()
SBWW_20 = df1['SBWW_20'].tolist()
SBWW_25 = df1['SBWW_25'].tolist()
data1 = [SBWW_10, SBWW_15, SBWW_20, SBWW_25]
print(data1)

[[2001000000, 1609000000, 1504000000], [3071000000, 1788000000, 5224000000], [1650000000, 1874000000, 1509000000], [2921000000, 3737000000, 2423000000]]


In [38]:
# Put each column of the dataframe into a list
BBM_10 = df2['BBM_10'].tolist()
BBM_15 = df2['BBM_15'].tolist()
BBM_20 = df2['BBM_20'].tolist()
BBM_25 = df2['BBM_25'].tolist()
data2 = [BBM_10, BBM_15, BBM_20, BBM_25]
print(data2)

[[1726000000, 1756000000, 1422000000], [2463000000, 2775000000, 2097000000], [2296000000, 1876000000, 1828000000], [4258000000, 3819000000, 5049000000]]


In [39]:
# Levene's test for equal variance across temperatures (SBWW), centered on the median
stats.levene(SBWW_10, SBWW_15, SBWW_20, SBWW_25, center='median')

LeveneResult(statistic=1.9224103238567891, pvalue=0.20450453669099114)

In [40]:
# Levene's test for equal variance across temperatures (BBM), centered on the median
stats.levene(BBM_10, BBM_15, BBM_20, BBM_25, center='median')

LeveneResult(statistic=0.7495201570154548, pvalue=0.552557316227051)

In [41]:
print(stats.shapiro(SBWW_10),
stats.shapiro(SBWW_15),
stats.shapiro(SBWW_20),
stats.shapiro(SBWW_25))

ShapiroResult(statistic=0.8999643325805664, pvalue=0.38538697361946106) ShapiroResult(statistic=0.9790767431259155, pvalue=0.7227675318717957) ShapiroResult(statistic=0.9830556511878967, pvalue=0.7506838440895081) ShapiroResult(statistic=0.9808510541915894, pvalue=0.7348626255989075)


In [42]:
print(stats.shapiro(BBM_10),
stats.shapiro(BBM_15),
stats.shapiro(BBM_20),
stats.shapiro(BBM_25))

ShapiroResult(statistic=0.8167734742164612, pvalue=0.1552039533853531) ShapiroResult(statistic=0.9978899359703064, pvalue=0.9122374057769775) ShapiroResult(statistic=0.8260316848754883, pvalue=0.17828816175460815) ShapiroResult(statistic=0.9734259843826294, pvalue=0.6872667074203491)


## Normality and equal variance

Neither the Shapiro-Wilk tests nor Levene's tests give a significant result (all p > 0.05), so we have no evidence against normality or equal variance. With only three replicates per group these tests have little power, so this is a weak check. We therefore proceed with a two-way ANOVA.
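The tests above treat each medium separately and check normality within groups of three values. As a complementary check, the sketch below runs Levene's test across all eight media and temperature combinations at once, and applies Shapiro-Wilk to the pooled residuals. In a two-way model with interaction, each residual is the observation minus its group mean, so the residuals can be computed directly. The `groups` dictionary and the variable names are illustrative assumptions; the values are copied from the lists printed above.

```python
import numpy as np
from scipy import stats

# Biovolume values copied from the printed lists, keyed by (media, temperature).
# The dictionary layout is an assumption made for this sketch.
groups = {
    ("SBWW", 10): [2001000000, 1609000000, 1504000000],
    ("SBWW", 15): [3071000000, 1788000000, 5224000000],
    ("SBWW", 20): [1650000000, 1874000000, 1509000000],
    ("SBWW", 25): [2921000000, 3737000000, 2423000000],
    ("BBM", 10): [1726000000, 1756000000, 1422000000],
    ("BBM", 15): [2463000000, 2775000000, 2097000000],
    ("BBM", 20): [2296000000, 1876000000, 1828000000],
    ("BBM", 25): [4258000000, 3819000000, 5049000000],
}

# Levene's test across all eight media x temperature cells at once
levene_all = stats.levene(*groups.values(), center="median")

# Residuals of the full factorial model: each observation minus its cell mean
residuals = np.concatenate(
    [np.asarray(values, dtype=float) - np.mean(values) for values in groups.values()]
)

# Shapiro-Wilk on the 24 pooled residuals instead of 3 values per group
shapiro_resid = stats.shapiro(residuals)

print(levene_all)
print(shapiro_resid)
```

Pooling the residuals gives the normality test 24 values to work with rather than three, which is usually a more informative check for ANOVA than per-group tests.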

# Preparations for the test 

The data must be imported in long format (one row per observation, with media and temperature as columns) so that the model formula can refer to both factors. We also import the required packages.
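The long format can also be built from the two wide tables instead of a separate file. The following is a minimal sketch, assuming the column names follow the `<media>_<temperature>` pattern seen in the tables above. The names `wide_sbww`, `wide_bbm` and `long_df` are hypothetical, and the frames are re-created inline from the displayed values so that the block runs on its own.

```python
import pandas as pd

# Wide tables re-created from the values displayed earlier (one column per group)
wide_sbww = pd.DataFrame({
    "SBWW_10": [2001000000, 1609000000, 1504000000],
    "SBWW_15": [3071000000, 1788000000, 5224000000],
    "SBWW_20": [1650000000, 1874000000, 1509000000],
    "SBWW_25": [2921000000, 3737000000, 2423000000],
})
wide_bbm = pd.DataFrame({
    "BBM_10": [1726000000, 1756000000, 1422000000],
    "BBM_15": [2463000000, 2775000000, 2097000000],
    "BBM_20": [2296000000, 1876000000, 1828000000],
    "BBM_25": [4258000000, 3819000000, 5049000000],
})

# Stack all columns into one label column and one value column
long_df = pd.concat([wide_bbm, wide_sbww], axis=1).melt(
    var_name="group", value_name="biovolume"
)

# Split labels such as "BBM_10" into the two factors
long_df[["media", "temperature"]] = long_df["group"].str.split("_", expand=True)
long_df["temperature"] = long_df["temperature"].astype(int)
long_df = long_df[["media", "temperature", "biovolume"]]

print(long_df.head())
```

Deriving the long table from the wide ones keeps a single source of data, so the assumption checks and the ANOVA cannot drift apart through separate files.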

In [43]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [44]:
df3 = pd.read_excel('biovolume_BBM_SBWW.xlsx')
display(df3)

Unnamed: 0,media,temperature,biovolume
0,BBM,10.0,1726000000.0
1,BBM,10.0,1756000000.0
2,BBM,10.0,1422000000.0
3,BBM,15.0,2463000000.0
4,BBM,15.0,2775000000.0
5,BBM,15.0,2097000000.0
6,BBM,20.0,2296000000.0
7,BBM,20.0,1876000000.0
8,BBM,20.0,1828000000.0
9,BBM,25.0,4258000000.0


In [45]:
model = ols('biovolume ~ C(media) + C(temperature) + C(media):C(temperature)', data=df3).fit()
sm.stats.anova_lm(model, typ=2)

Unnamed: 0,sum_sq,df,F,PR(>F)
C(media),1.757882e+17,1.0,0.337947,0.569117
C(temperature),1.637179e+19,3.0,10.49141,0.000465
C(media):C(temperature),3.972998e+18,3.0,2.545986,0.09254
Residual,8.322639e+18,16.0,,
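To make the table easier to read, the F column can be reproduced by hand: each F value is the term's mean square (sum of squares divided by degrees of freedom) divided by the residual mean square. The sketch below uses only the rounded values from the table above; the variable names are illustrative.

```python
# (sum of squares, degrees of freedom) copied from the ANOVA table
anova_rows = {
    "C(media)": (1.757882e17, 1),
    "C(temperature)": (1.637179e19, 3),
    "C(media):C(temperature)": (3.972998e18, 3),
}
ss_resid, df_resid = 8.322639e18, 16

# Residual mean square: the within-group variance estimate
ms_resid = ss_resid / df_resid

f_ratios = {}
for term, (ss, df) in anova_rows.items():
    # F = mean square of the term / residual mean square
    f_ratios[term] = (ss / df) / ms_resid
    print(f"{term}: F = {f_ratios[term]:.4f}")
```

The recomputed ratios agree with the F column of the table up to rounding, which confirms that temperature has by far the largest mean square relative to the within-group variation.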
