# LESSON GOALS
In this lesson we will learn more about tests that we can perform with SciPy. These tests allow us to make decisions based on data and compare information in two or more variables.

# INTRODUCTION
The field of statistics helps us make decisions using data. In previous lessons, we have looked at the comparison of one sample to a constant or the comparison of two samples to each other. In this lesson, we will use statistical tools to examine a number of features at once. We will also learn about linear regression using SciPy.

# ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

For example, we might use ANOVA when testing multiple UI designs at once on an e-commerce website to see whether sales differ between the designs.

The hypothesis test that we are examining is:

![alt text](https://camo.githubusercontent.com/8ef2e56f4ff102812fab4f02f7c14332b4054fe2/68747470733a2f2f73332d65752d776573742d312e616d617a6f6e6177732e636f6d2f69682d6d6174657269616c732f75706c6f6164732f646174612d7374617469632f696d616765732f616e6f76612d6879706f7468657369732e706e67)


Where μ represents a mean and there are a total of k means that we are comparing.
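In case the image above does not render, the standard one-way ANOVA hypotheses can be written out as follows. This is the textbook form and is assumed to match what the image shows:

$$
H_0:\ \mu_1 = \mu_2 = \dots = \mu_k
\qquad\text{versus}\qquad
H_1:\ \text{at least one } \mu_i \text{ differs from the others}
$$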

ANOVA results are typically presented as a table of values from which we compute a p-value for our hypothesis. The p-value comes from an F-test, which compares two variances through their ratio.

With the ANOVA, we compare the variation between the groups to the variation within the groups. The F statistic is the ratio of the two, so a sufficiently large F statistic corresponds to a sufficiently small p-value. In that case we reject the null hypothesis and conclude that there is significant variation between the groups, and therefore that at least one of the means is different.
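As a rough illustration, SciPy's `scipy.stats.f_oneway` runs a one-way ANOVA directly on the samples and returns the F statistic with its p-value. The sketch below uses the UI-design scenario; the sales figures and variable names are invented for the example and do not come from any real experiment.

```python
from scipy import stats

# Hypothetical daily sales under three UI designs (invented values).
design_a = [20, 22, 19, 24, 25]
design_b = [28, 30, 27, 26, 29]
design_c = [18, 17, 21, 20, 19]

# One-way ANOVA: H0 says all three designs have the same mean sales.
f_stat, p_value = stats.f_oneway(design_a, design_b, design_c)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")

# Decision at the conventional 5% significance level.
if p_value < 0.05:
    print("Reject H0: at least one design has a different mean.")
else:
    print("Fail to reject H0.")
```

Note that a significant result only says that some mean differs; it does not say which design is responsible.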

This is how we would construct an ANOVA table:

![alt text](https://camo.githubusercontent.com/f4be2d25ab16745e51ad1d15349badb3329ec78d/68747470733a2f2f73332d65752d776573742d312e616d617a6f6e6177732e636f6d2f69682d6d6174657269616c732f75706c6f6164732f646174612d7374617469632f696d616765732f616e6f76612e706e67)
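The construction can also be sketched by hand, assuming the image shows the standard one-way ANOVA table (sums of squares, degrees of freedom, mean squares, and F). The groups below are invented, and the result is cross-checked against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for k = 3 groups (invented values).
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([8.0, 9.0, 10.0]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
# The upper-tail probability of the F distribution gives the p-value.
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's built-in one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

Both routes should agree up to floating-point rounding, which is a useful sanity check on the table's arithmetic.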

In [72]:
import pandas as pd


### We would like to show that the number of cylinders differs by fuel type

In [73]:
data = pd.read_csv('https://raw.githubusercontent.com/loukjsmalbil/datasets_ws/master/vehicles.csv')

In [74]:
len(data)

35952

In [75]:
data.head()

| | Make | Model | Year | Engine Displacement | Cylinders | Transmission | Drivetrain | Vehicle Class | Fuel Type | Fuel Barrels/Year | City MPG | Highway MPG | Combined MPG | CO2 Emission Grams/Mile | Fuel Cost/Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AM General | DJ Po Vehicle 2WD | 1984 | 2.5 | 4.0 | Automatic 3-spd | 2-Wheel Drive | Special Purpose Vehicle 2WD | Regular | 19.388824 | 18 | 17 | 17 | 522.764706 | 1950 |
| 1 | AM General | FJ8c Post Office | 1984 | 4.2 | 6.0 | Automatic 3-spd | 2-Wheel Drive | Special Purpose Vehicle 2WD | Regular | 25.354615 | 13 | 13 | 13 | 683.615385 | 2550 |
| 2 | AM General | Post Office DJ5 2WD | 1985 | 2.5 | 4.0 | Automatic 3-spd | Rear-Wheel Drive | Special Purpose Vehicle 2WD | Regular | 20.600625 | 16 | 17 | 16 | 555.4375 | 2100 |
| 3 | AM General | Post Office DJ8 2WD | 1985 | 4.2 | 6.0 | Automatic 3-spd | Rear-Wheel Drive | Special Purpose Vehicle 2WD | Regular | 25.354615 | 13 | 13 | 13 | 683.615385 | 2550 |
| 4 | ASC Incorporated | GNX | 1987 | 3.8 | 6.0 | Automatic 4-spd | Rear-Wheel Drive | Midsize Cars | Premium | 20.600625 | 14 | 21 | 16 | 555.4375 | 2550 |


In [76]:
selecteddf = data[['Fuel Type', 'Cylinders']].copy()

In [77]:
selecteddf.head()

| | Fuel Type | Cylinders |
|---|---|---|
| 0 | Regular | 4.0 |
| 1 | Regular | 6.0 |
| 2 | Regular | 4.0 |
| 3 | Regular | 6.0 |
| 4 | Premium | 6.0 |


In [78]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [79]:
# Copy the column under a name without a space so it can be used in the formula below
selecteddf['FuelType'] = selecteddf['Fuel Type']

In [80]:
model = ols('Cylinders ~ C(FuelType)', data=selecteddf).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(FuelType) | 8993.133198 | 12.0 | 264.650686 | 0.0 |
| Residual | 101770.695796 | 35939.0 | | |


The reported p-value is 0.0: the true value is not exactly zero, but it is too small to display and certainly smaller than 0.05. Therefore, we reject the null hypothesis and conclude that the mean number of cylinders differs for at least one fuel type.
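Since this lesson focuses on SciPy, the same kind of test can also be sketched with `scipy.stats.f_oneway` by splitting the cylinder counts into one array per fuel type. The small DataFrame below is an invented stand-in for the vehicles data (only the column names match the lesson), so its numbers say nothing about the real dataset.

```python
import pandas as pd
from scipy import stats

# Tiny invented stand-in for the vehicles data; only the column names are real.
toy = pd.DataFrame({
    'Fuel Type': ['Regular'] * 4 + ['Premium'] * 4 + ['Diesel'] * 4,
    'Cylinders': [4, 4, 6, 4, 6, 8, 6, 8, 4, 6, 6, 4],
})

# One array of cylinder counts per fuel type; dropna() is a precaution,
# because f_oneway does not ignore missing values by default.
groups = [
    g['Cylinders'].dropna().to_numpy()
    for _, g in toy.groupby('Fuel Type')
]

f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

For a single categorical factor, this F statistic should match the one in the statsmodels ANOVA table computed on the same rows; the table additionally reports the sums of squares and degrees of freedom.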