In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline

# Load the data

df = pd.read_csv('./automobileEDA.csv')

df.head()

Unnamed: 0.1,Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,horsepower-binned,diesel,gas
0,0,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,...,9.0,111.0,5000.0,21,27,13495.0,11.190476,Medium,0,1
1,1,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,...,9.0,111.0,5000.0,21,27,16500.0,11.190476,Medium,0,1
2,2,1,122,alfa-romero,std,two,hatchback,rwd,front,94.5,...,9.0,154.0,5000.0,19,26,16500.0,12.368421,Medium,0,1
3,3,2,164,audi,std,four,sedan,fwd,front,99.8,...,10.0,102.0,5500.0,24,30,13950.0,9.791667,Medium,0,1
4,4,2,164,audi,std,four,sedan,4wd,front,99.4,...,8.0,115.0,5500.0,18,22,17450.0,13.055556,Medium,0,1


# Analysis of Variance - ANOVA

ANOVA is a statistical test of whether the means of two or more groups differ significantly.

Here it helps us judge whether a categorical variable, which splits the data into groups, is associated with a numerical variable such as `price`.

## What do we obtain from ANOVA?
ANOVA mainly returns 2 values:
- F-test score: the ratio of the variation between the group means to the variation within the groups. ANOVA starts from the assumption that all groups share the same mean and measures how far the group means deviate from it. The larger the score, the larger the differences between the group means relative to the spread inside each group.

- P-value: the probability of getting an F-test score at least this large if all the group means were in fact equal. A small P-value (0.05 is the usual threshold) means the differences are statistically significant.
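
To make the ratio concrete, here is a minimal sketch that computes the F-test score by hand and compares it with `stats.f_oneway`. The three small samples and the helper name `one_way_f` are made up for illustration and are not part of the automobile data.

```python
import numpy as np
from scipy import stats

# Toy data (made up for illustration, not taken from the automobile dataset)
groups = [
    np.array([10.0, 12.0, 11.0, 13.0]),
    np.array([20.0, 19.0, 22.0, 21.0]),
    np.array([15.0, 14.0, 16.0, 17.0]),
]

def one_way_f(samples):
    """Compute the one-way ANOVA F statistic and p-value by hand."""
    k = len(samples)                          # number of groups
    n_total = sum(len(s) for s in samples)    # total number of observations
    grand_mean = np.concatenate(samples).mean()

    # Between-group sum of squares: distance of each group mean from the grand mean
    ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    f_stat = ms_between / ms_within
    # p-value: probability of an F at least this large if all group means were equal
    p_value = stats.f.sf(f_stat, k - 1, n_total - k)
    return f_stat, p_value

f_manual, p_manual = one_way_f(groups)
f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

Both lines should report the same F-test score and P-value, which shows that the score is simply the between-group mean square divided by the within-group mean square.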

## When to use ANOVA?
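- ANOVA is used to compare the means of two or more groups. It is most useful with three or more; with exactly two groups it gives the same result as an independent two-sample t-test that assumes equal variances.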
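
The two-group case can be checked with a short sketch: the F-test score from `stats.f_oneway` should equal the square of the t statistic from `stats.ttest_ind`, and the P-values should match. The samples `a` and `b` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Toy samples (made up for illustration)
a = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
b = np.array([14.0, 15.0, 13.0, 16.0, 17.0])

f_stat, p_anova = stats.f_oneway(a, b)
# ttest_ind defaults to equal_var=True (Student's t-test), which is the
# variant that matches one-way ANOVA
t_stat, p_ttest = stats.ttest_ind(a, b)

print("F   =", f_stat, " p =", p_anova)
print("t^2 =", t_stat ** 2, " p =", p_ttest)
```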



## Drive Wheels

ANOVA compares groups defined by the values of a single variable, so the `groupby()` method will come in handy. The ANOVA function computes the group means itself, so there is no need to average the data beforehand.

Let's see how the `drive-wheels` variable affects the `price` variable.


In [8]:
df_gptest = df[['drive-wheels', 'price']]

In [9]:
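# Group by a single column name (not a one-element list) so that
# get_group('4wd') accepts a plain string key in current pandas versions
grouped_test2 = df_gptest.groupby('drive-wheels')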
grouped_test2.head(2)


Unnamed: 0,drive-wheels,price
0,rwd,13495.0
1,rwd,16500.0
3,fwd,13950.0
4,4wd,17450.0
5,fwd,15250.0
136,4wd,7603.0


In [10]:
df_gptest

Unnamed: 0,drive-wheels,price
0,rwd,13495.0
1,rwd,16500.0
2,rwd,16500.0
3,fwd,13950.0
4,4wd,17450.0
...,...,...
196,rwd,16845.0
197,rwd,19045.0
198,rwd,21485.0
199,rwd,22470.0


We can retrieve the values of a single group with the `get_group()` method of the grouped object:


In [12]:
grouped_test2.get_group('4wd')['price']


4      17450.0
136     7603.0
140     9233.0
141    11259.0
144     8013.0
145    11694.0
150     7898.0
151     8778.0
Name: price, dtype: float64
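
Before running the test, it can help to check how many observations each group holds and what its average price is, since a very small group gives the test little to work with. The sketch below does this on a toy frame; the column names mirror `df_gptest`, but the values are made up for illustration.

```python
import pandas as pd

# Toy frame with the same column names as df_gptest (values are made up)
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "price": [7000.0, 8000.0, 7500.0, 16000.0, 18000.0, 17000.0,
              9000.0, 11000.0, 10000.0],
})

# One row per group: number of cars, mean price, and standard deviation
summary = toy.groupby("drive-wheels")["price"].agg(["count", "mean", "std"])
print(summary)
```

On the real data, the same call would be `df_gptest.groupby('drive-wheels')['price'].agg(['count', 'mean', 'std'])`.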

We can use the function `f_oneway` in the module `stats` to obtain the *F-test score* and *P-value*.


In [14]:
# ANOVA

f_val, p_val = stats.f_oneway(
    grouped_test2.get_group('fwd')['price'],
    grouped_test2.get_group('rwd')['price'],
    grouped_test2.get_group('4wd')['price'],
)

print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23
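
Listing every `get_group()` call by hand gets unwieldy when a variable has many categories. A possible alternative, sketched here on a toy frame with made-up values, is to collect the price series of every group while iterating over the grouped object and unpack them into `f_oneway`. The names `toy` and `price_by_group` are illustrative only.

```python
import pandas as pd
from scipy import stats

# Toy frame with the same column names as df_gptest (values are made up)
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "price": [7000.0, 8000.0, 7500.0, 16000.0, 18000.0, 17000.0,
              9000.0, 11000.0, 10000.0],
})

# Iterating over a GroupBy yields (group_key, sub-frame) pairs
price_by_group = {key: grp["price"] for key, grp in toy.groupby("drive-wheels")}

# Unpack all groups into f_oneway, whatever their number
f_val, p_val = stats.f_oneway(*price_by_group.values())
print(sorted(price_by_group))
print("ANOVA results: F=", f_val, ", P =", p_val)
```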


The large F-test score shows that the mean price differs between the groups by much more than the spread within them, and a P-value of almost 0 means the difference is statistically significant. But this only tells us that at least one group differs from the others. Does every pair of groups differ this strongly?

Let us examine each pair separately.

#### fwd and rwd


In [15]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)


ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23


The F-test score is about 130.55 and the P-value about 2.24e-23. The mean prices of fwd and rwd cars differ strongly, and the difference is highly statistically significant.

#### 4wd and rwd


In [16]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)


ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333


The F-test score is much lower here, about 8.58, with a P-value of about 0.004. The difference in mean price between 4wd and rwd cars is still statistically significant at the 0.05 level, but the evidence is weaker than for the fwd and rwd groups.

#### 4wd and fwd


In [17]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])

print( "ANOVA results: F=", f_val, ", P =", p_val)


ANOVA results: F= 0.665465750252303 , P = 0.41620116697845666


The F-test score is very low (about 0.67) and the P-value (about 0.42) is far above 0.05, so there is no evidence that the mean prices of 4wd and fwd cars differ.
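
The three pairwise tests above can also be written as a loop. The sketch below runs on a toy frame with made-up values and additionally applies a Bonferroni correction, which divides the 0.05 threshold by the number of comparisons to account for testing several pairs. The correction is an assumption added here for illustration; the analysis above does not use it.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Toy frame with the same column names as df_gptest (values are made up)
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd", "4wd"],
    "price": [7000.0, 8000.0, 7500.0, 16000.0, 18000.0, 17000.0,
              9000.0, 11000.0, 10000.0],
})

prices = {key: grp["price"] for key, grp in toy.groupby("drive-wheels")}
pairs = list(combinations(sorted(prices), 2))

alpha = 0.05
# Bonferroni correction (an added assumption): stricter threshold per comparison
alpha_adjusted = alpha / len(pairs)

results = {}
for a, b in pairs:
    f_val, p_val = stats.f_oneway(prices[a], prices[b])
    results[(a, b)] = (f_val, p_val)
    verdict = "significant" if p_val < alpha_adjusted else "not significant"
    print(f"{a} vs {b}: F = {f_val:.3f}, p = {p_val:.4g} "
          f"({verdict} at {alpha_adjusted:.4f})")
```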


### Conclusion: Important Variables

We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:

Continuous numerical variables:
- Length
- Width
- Curb-weight
- Engine-size  
- Horsepower
- City-mpg
- Highway-mpg
- Wheel-base
- Bore

Categorical variables:

- Drive-wheels


Now we can move on to building a machine learning model to automate the analysis. Feeding the model with variables that meaningfully affect the price will improve its prediction performance.
