# Module 17: One-Way ANOVA

In this module, we discuss how to compare the means of multiple groups using analysis of variance, or ANOVA. The ANOVA procedure generalizes the two-sample t-test for means that we saw in Module 12 to more than two groups. We also look at how to verify some of the assumptions of our inference procedures.

## The ANOVA F-Test

Suppose that you want to open a business, but you're not sure what type. The five types of business you are considering are making pizza, baking, selling shoes, selling gifts and selling pet supplies. You would like to choose something with a low startup cost.

You gather data on several businesses of each type and compile your results in the following dataset.

In [None]:
# Read in the data and print the first few rows
data.bus = read.csv("Businesses.csv")
head(data.bus)

The two variables in this dataset are "Type" and "Cost". Note that cost is given in thousands of dollars.

Suppose that we want to test whether the average startup cost is the same across these business types. Our null hypothesis is that all five of the business types have the same average startup cost, and our alternative hypothesis is that they do not all have the same average startup cost. We can test these hypotheses using an ANOVA F-test.
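The hypotheses can also be written in symbols. The notation below is our own and is not used elsewhere in the module: each $\mu$ stands for the population mean startup cost of one business type, $k = 5$ is the number of groups, and $N$ is the total number of businesses in the dataset.

$$
H_0: \mu_{\text{pizza}} = \mu_{\text{baking}} = \mu_{\text{shoes}} = \mu_{\text{gifts}} = \mu_{\text{pets}}
\qquad \text{versus} \qquad
H_a: \text{not all of these means are equal}
$$

The F-statistic compares the variation between the group means with the variation within the groups:

$$
F = \frac{\text{MSG}}{\text{MSE}} = \frac{\text{SSG} / (k - 1)}{\text{SSE} / (N - k)}
$$

Here SSG is the between-group sum of squares and SSE is the within-group (error) sum of squares. A large value of $F$ means the group means differ by more than we would expect from the variation within groups alone, which is evidence against $H_0$.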

We start by fitting an ANOVA model to this dataset in R using the "aov()" function. The "aov()" function has two main inputs: "formula" and "data". The "formula" input describes the relationship between the variables in our model, with the response on the left of the "~" and the grouping variable on the right. See Module 6 for more information about formulas in R. The "data" input tells R which data frame contains those variables. The "aov()" function produces an "aov" object that we can save and use later.

Notice that the "aov()" function is very similar to the "lm()" function we used in Modules 6 and 16 to fit linear regression models. The two functions also produce similar output.
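To see how closely the two functions are related, here is a minimal sketch that fits the same one-way model with both. It uses "PlantGrowth", a dataset built into R, purely as a stand-in so that the code runs without any external file. The object names "fit.aov", "fit.lm", "f.aov" and "f.lm" are our own choices.

```r
# PlantGrowth is a built-in R dataset (plant weight for three treatment groups),
# used here only as a stand-in that needs no external file
fit.aov = aov(formula = weight ~ group, data = PlantGrowth)
fit.lm = lm(formula = weight ~ group, data = PlantGrowth)

# The ANOVA table from aov() and the one from anova() on the lm fit
summary(fit.aov)
anova(fit.lm)

# Extract the F-statistic from each fit and check that they agree
f.aov = summary(fit.aov)[[1]][["F value"]][1]
f.lm = summary(fit.lm)$fstatistic[["value"]]
all.equal(f.aov, f.lm)
```

With a single grouping variable, the overall F-statistic reported by "lm()" is the same quantity as the ANOVA F-statistic, so the final line should report that the two values agree.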

Let's fit an ANOVA model to our business dataset.

In [None]:
# Fit our ANOVA model
fit.bus = aov(formula = Cost ~ Type, data = data.bus)

We can evaluate our model using the "summary()" function on our "aov" object.

In [None]:
summary(fit.bus)

This table contains a lot of information, but the key part is the entry in the last column, labelled "Pr(>F)", of the row for our predictor variable, "Type". This number is the p-value for our ANOVA F-test. The p-value here is 0.0184, which is less than 0.05, so we reject the null hypothesis at the 5% significance level and conclude that not all of the business types we considered have the same average startup cost.
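If you would rather pull the p-value out of the table than read it off by eye, the following sketch shows one way. It again uses the built-in "PlantGrowth" dataset as a stand-in, and the names "fit", "tab", "p.value" and "alpha" are our own. It relies on the fact that the first element of the summary of an "aov" object is a data frame with one row per term.

```r
# Stand-in model on a built-in dataset so the example runs on its own
fit = aov(formula = weight ~ group, data = PlantGrowth)

# The first element of the summary is the ANOVA table as a data frame
tab = summary(fit)[[1]]

# Row 1 is the grouping variable; "Pr(>F)" is the p-value column
p.value = tab[1, "Pr(>F)"]
p.value

# Compare the p-value with a 5% significance level
alpha = 0.05
p.value < alpha
```

The same indexing should work for the business model above by replacing "fit" with "fit.bus".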

## Checking ANOVA Assumptions

One of the assumptions of the ANOVA F-test is that all groups have the same population standard deviation. A rule of thumb for checking this assumption is to calculate the sample standard deviation of each group and compare the largest with the smallest. If the largest sample standard deviation is more than double the smallest, we conclude that the population standard deviations are too different for an ANOVA F-test to be valid. If the ratio is less than two, we have no evidence that the population standard deviations differ substantially, so we treat them as similar enough to carry out our analysis.

We can compute the standard deviation of each group in R using the "by()" function. The "by()" function has three main inputs: "data", "INDICES" and "FUN". The "data" input tells R which values to work with. Note that here we pass "by()" a single column of values (a vector), not the whole data frame. The "INDICES" input tells R which group each value in "data" belongs to. The "FUN" input tells R what function to apply to each group.

Let's use the "by()" function to calculate the sample standard deviation of startup costs for each of our five business types. We want to calculate the standard deviation of each group, so we should set the "FUN" input equal to "sd", the name of the R function that calculates standard deviations.

In [None]:
by(data = data.bus$Cost, INDICES = data.bus$Type, FUN = sd)

Note that the "INDICES" and "FUN" input names must be written entirely in capital letters, because R is case-sensitive.

The largest sample standard deviation is 38.9 for baking, and the smallest sample standard deviation is 27.1 for pet supplies. It is pretty clear that 38.9 is less than double 27.1, but let's calculate their ratio anyway.

In [None]:
38.9 / 27.1

The ratio between our largest and smallest sample standard deviations is about 1.44, which is less than two. Therefore, the assumption of equal population standard deviations across groups is reasonable.
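Typing rounded numbers by hand works, but the ratio can also be computed directly from the data. The sketch below does this with "tapply()", a close relative of "by()" that returns a plain named vector of results. Note that its grouping input is called "INDEX" rather than "INDICES". As before, the built-in "PlantGrowth" dataset is only a stand-in, and the names "group.sds" and "sd.ratio" are our own.

```r
# Sample standard deviation of the response within each group
group.sds = tapply(X = PlantGrowth$weight, INDEX = PlantGrowth$group, FUN = sd)
group.sds

# Ratio of the largest to the smallest sample standard deviation
sd.ratio = max(group.sds) / min(group.sds)
sd.ratio

# Rule of thumb: the equal-spread assumption is reasonable if this is TRUE
sd.ratio < 2
```

For the business data, the same approach would use "data.bus$Cost" and "data.bus$Type" in place of the "PlantGrowth" columns. Working from the unrounded standard deviations also avoids any rounding error in the ratio.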