In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df = pd.read_csv('./automobileEDA.csv')

df.head()

Unnamed: 0.1,Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,horsepower-binned,diesel,gas
0,0,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,...,9.0,111.0,5000.0,21,27,13495.0,11.190476,Medium,0,1
1,1,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,...,9.0,111.0,5000.0,21,27,16500.0,11.190476,Medium,0,1
2,2,1,122,alfa-romero,std,two,hatchback,rwd,front,94.5,...,9.0,154.0,5000.0,19,26,16500.0,12.368421,Medium,0,1
3,3,2,164,audi,std,four,sedan,fwd,front,99.8,...,10.0,102.0,5500.0,24,30,13950.0,9.791667,Medium,0,1
4,4,2,164,audi,std,four,sedan,4wd,front,99.4,...,8.0,115.0,5500.0,18,22,17450.0,13.055556,Medium,0,1


# Correlation and Causation

**Correlation**: A statistical measure of the extent to which two variables are interdependent.

**Causation**: A cause-and-effect relationship between two variables, where a change in one produces a change in the other.

It is important to distinguish the two: correlation does not imply causation.

When we observe two variables over a long period, we often want to know whether a change in one is associated with a change in the other.

## Examples
1. Smoking and lung cancer
2. Rain and umbrellas
3. Ice cream sales and temperature

> NB: Correlation doesn't imply causation.
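A small simulation can make the warning concrete. The sketch below uses made-up data, not the automobile dataset: a hypothetical `temperature` variable drives both `ice_cream_sales` and a second invented quantity, `sunburn_cases`. The two outcomes never influence each other, yet they end up strongly correlated. Removing the effect of temperature from each (by taking residuals of a straight-line fit) should make most of that correlation disappear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounder: daily temperature in degrees Celsius.
temperature = rng.uniform(10, 35, size=500)

# Both quantities depend on temperature, but not on each other.
ice_cream_sales = 20 * temperature + rng.normal(0, 40, size=500)
sunburn_cases = 3 * temperature + rng.normal(0, 10, size=500)

# Raw correlation between the two outcomes.
r = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]

def residuals(y, x):
    """Return y minus its least-squares straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after removing the linear effect of temperature from both.
r_partial = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunburn_cases, temperature),
)[0, 1]

print(f"Raw correlation:                  {r:.2f}")
print(f"After controlling for temperature: {r_partial:.2f}")
```

The high raw correlation here reflects the shared cause, not any effect of ice cream on sunburn.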



## Correlation Statistics

### Pearson Correlation

This is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
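For reference, the coefficient is the covariance of X and Y divided by the product of their standard deviations. The following is a minimal sketch of that textbook formula in NumPy; the helper name `pearson_r` and the toy inputs are illustrative only and are not part of the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(x, [2.0, 4.0, 6.0, 8.0, 10.0])    # perfectly increasing line
r_down = pearson_r(x, [10.0, 8.0, 6.0, 4.0, 2.0])  # perfectly decreasing line
r_numpy = np.corrcoef(x, [2.0, 4.0, 6.0, 8.0, 10.0])[0, 1]

print(r_up, r_down, r_numpy)
```

The two toy cases sit at the extremes of the range described above, and `np.corrcoef` should agree with the hand-written version.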

Pearson correlation is the default method used by the pandas `DataFrame.corr()` method.



In [6]:
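# Restrict to numeric columns first; recent pandas versions raise an error
# if corr() is called on a frame that still contains text columns such as 'make'.
df.select_dtypes(include='number').corr()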



Unnamed: 0.1,Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,diesel,gas
Unnamed: 0,1.0,-0.162764,-0.241092,0.125517,0.161848,0.043976,0.252015,0.06482,-0.047764,0.244734,-0.163636,0.144301,-0.022474,-0.195662,0.027956,0.020344,-0.118214,-0.099157,0.121454,-0.121454
symboling,-0.162764,1.0,0.466264,-0.535987,-0.365404,-0.242423,-0.55016,-0.233118,-0.110581,-0.140019,-0.008245,-0.182196,0.075819,0.27974,-0.035527,0.036233,-0.082391,0.066171,-0.196735,0.196735
normalized-losses,-0.241092,0.466264,1.0,-0.056661,0.019424,0.086802,-0.373737,0.099404,0.11236,-0.029862,0.055563,-0.114713,0.217299,0.239543,-0.225016,-0.181877,0.133999,0.238567,-0.101546,0.101546
wheel-base,0.125517,-0.535987,-0.056661,1.0,0.876024,0.814507,0.590742,0.782097,0.572027,0.493244,0.158502,0.250313,0.371147,-0.360305,-0.470606,-0.543304,0.584642,0.476153,0.307237,-0.307237
length,0.161848,-0.365404,0.019424,0.876024,1.0,0.85717,0.492063,0.880665,0.685025,0.608971,0.124139,0.159733,0.579821,-0.28597,-0.665192,-0.698142,0.690628,0.657373,0.211187,-0.211187
width,0.043976,-0.242423,0.086802,0.814507,0.85717,1.0,0.306002,0.866201,0.729436,0.544885,0.188829,0.189867,0.615077,-0.2458,-0.633531,-0.680635,0.751265,0.673363,0.244356,-0.244356
height,0.252015,-0.55016,-0.373737,0.590742,0.492063,0.306002,1.0,0.307581,0.074694,0.180449,-0.062704,0.259737,-0.087027,-0.309974,-0.0498,-0.104812,0.135486,0.003811,0.281578,-0.281578
curb-weight,0.06482,-0.233118,0.099404,0.782097,0.880665,0.866201,0.307581,1.0,0.849072,0.64406,0.167562,0.156433,0.757976,-0.279361,-0.749543,-0.794889,0.834415,0.785353,0.221046,-0.221046
engine-size,-0.047764,-0.110581,0.11236,0.572027,0.685025,0.729436,0.074694,0.849072,1.0,0.572609,0.209523,0.028889,0.822676,-0.256733,-0.650546,-0.679571,0.872335,0.745059,0.070779,-0.070779
bore,0.244734,-0.140019,-0.029862,0.493244,0.608971,0.544885,0.180449,0.64406,0.572609,1.0,-0.05539,0.001263,0.566936,-0.267392,-0.582027,-0.591309,0.543155,0.55461,0.054458,-0.054458
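To check that Pearson really is the default, one can compare `corr()` with `corr(method="pearson")`. The sketch below does this on a tiny invented frame (the column names and values are placeholders, not rows from `automobileEDA.csv`) and drops the text column first with `select_dtypes`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "engine_size": [97, 109, 130, 152, 183],
    "price": [7000, 9500, 13500, 16500, 21000],
    "make": ["a", "b", "c", "d", "e"],  # non-numeric column
})

numeric = toy.select_dtypes(include="number")  # keeps engine_size and price

default_corr = numeric.corr()                   # no method given
explicit_corr = numeric.corr(method="pearson")  # Pearson requested explicitly

print(default_corr)
print(np.allclose(default_corr, explicit_corr))
```

`corr()` also accepts `method="spearman"` and `method="kendall"` when a rank-based measure is more appropriate.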



#### P-value

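The p-value is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. A small p-value therefore indicates that the observed correlation is unlikely to be due to chance alone. Normally, we choose a significance level of 0.05: we call a correlation statistically significant when its p-value falls below 0.05, accepting a 5% risk of flagging a correlation that does not really exist.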
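One way to build intuition for the p-value is a permutation test, which is a different technique from the analytic test that `scipy.stats.pearsonr` performs. Shuffling one variable destroys any real association, so the fraction of shuffles that produce a correlation at least as strong as the observed one approximates the p-value. The sketch below uses synthetic data and an arbitrary number of permutations; both are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic, strongly related variables (illustration only).
x = rng.normal(size=60)
y = x + 0.5 * rng.normal(size=60)

observed_r, analytic_p = stats.pearsonr(x, y)

# Shuffle y many times and count how often chance alone matches the observed r.
n_permutations = 5000
count = 0
for _ in range(n_permutations):
    shuffled_r = np.corrcoef(x, rng.permutation(y))[0, 1]
    if abs(shuffled_r) >= abs(observed_r):
        count += 1

# Add 1 to numerator and denominator so the estimate is never exactly zero.
permutation_p = (count + 1) / (n_permutations + 1)

print("observed r:", observed_r)
print("analytic p-value:", analytic_p)
print("permutation p-value:", permutation_p)
```

With a relationship this strong, very few or no shuffles should come close to the observed coefficient, so both p-values are expected to be small.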

By convention, when the p-value is

- < 0.001, there is strong evidence that the correlation is significant.
- < 0.05, there is moderate evidence that the correlation is significant.
- < 0.1, there is weak evidence that the correlation is significant.
- > 0.1, there is no evidence that the correlation is significant.
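The convention above can be written as a small helper. This is only a sketch: the function name `evidence_from_p_value` and the returned labels are invented here, and a p-value of exactly 0.1 is treated as "none", which the list above leaves unspecified.

```python
def evidence_from_p_value(p_value):
    """Map a p-value to the evidence labels used in the convention above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"

for p in (8.0e-20, 0.01, 0.07, 0.5):
    print(p, "->", evidence_from_p_value(p))
```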

We can obtain this information using the `stats` module of the `scipy` library.


In [7]:
from scipy import stats


Now we calculate the Pearson correlation coefficient and P-value of various attributes of the dataset against price.

**Wheel-base vs Price**


In [8]:
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)


The Pearson Correlation Coefficient is 0.584641822265508  with a P-value of P = 8.076488270733218e-20


- Conclusion:

Since the p-value is < 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship is only moderate (~0.585).


**Horsepower vs Price**

Now we calculate the Pearson correlation coefficient and P-value of 'horsepower' and 'price'.


In [9]:
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)


The Pearson Correlation Coefficient is 0.8095745670036559  with a P-value of P = 6.369057428260101e-48


- Conclusion:

Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).

**Length vs Price**

Now we calculate the Pearson correlation coefficient and P-value of 'length' and 'price'.


In [10]:
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.6906283804483638  with a P-value of P = 8.016477466159556e-30


- Conclusion:

Since the p-value is < 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).


**Width vs Price**


In [11]:
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)


The Pearson Correlation Coefficient is 0.7512653440522673  with a P-value of P = 9.200335510481646e-38


- Conclusion:

Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).

This analysis can be done for all the other variables in the dataset.
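Rather than repeating the cell for each column, the same calculation can be looped over every numeric column. The helper below is a sketch, not part of the original notebook: `pearson_against` is a hypothetical name, it drops rows with missing numeric values before testing (an assumption about how gaps should be handled), and it is demonstrated on synthetic data. On the real data it would be called as `pearson_against(df, 'price')`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def pearson_against(frame, target):
    """Pearson r and p-value of every numeric column against `target`."""
    # Keep numeric columns only and drop rows with missing values.
    numeric = frame.select_dtypes(include="number").dropna()
    rows = []
    for column in numeric.columns:
        if column == target:
            continue
        coef, p_value = stats.pearsonr(numeric[column], numeric[target])
        rows.append({"feature": column, "pearson_r": coef, "p_value": p_value})
    # Strongest relationships (by absolute value) first.
    result = pd.DataFrame(rows).sort_values(
        "pearson_r", key=lambda s: s.abs(), ascending=False
    )
    return result.reset_index(drop=True)

# Synthetic demonstration data (not the automobile dataset).
rng = np.random.default_rng(1)
size = rng.normal(100, 10, 200)
demo = pd.DataFrame({
    "size": size,
    "noise": rng.normal(0, 1, 200),
    "label": ["x"] * 200,  # non-numeric, ignored by the helper
    "price": 50 * size + rng.normal(0, 100, 200),
})

summary = pearson_against(demo, "price")
print(summary)
```

Sorting by absolute value keeps strong negative correlations, such as those between fuel economy and price in the table earlier, near the top alongside strong positive ones.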
