### Disclaimer
The information contained in this notebook and any accompanying files is proprietary and confidential to the participants of the Machine Learning Technician program. It should not be copied, distributed or reproduced in whole or in part, nor passed to any third party, without written permission from the Alberta Machine Intelligence Institute (Amii).

# L2-1: Basic EDA

Hi! Welcome to your first lab!

First things first: to be able to use this and all other Colab notebooks that are shared with you, you should make a copy of each in your own personal Google Drive. You can accomplish this by choosing *File > Save a copy in Drive* from the menu. Once you have the file in your own Google Drive (it will be located in the 'Colab Notebooks' folder), you can execute the code cells and observe the results to see how data science code works.

Also, for this notebook to work, create a folder called `datasets` (use exactly this spelling without capitalizations or extra spaces) in the 'Colab Notebooks' folder in your own Google Drive. Then, place a copy of the supplied `automobile.csv` file in that 'datasets' subfolder (again, make sure you use exactly this spelling without capitalizations or extra spaces). 

Then, click on the folder icon on the left-hand sidebar, click on the icon titled 'Mount Drive' (the black folder icon with the Google Drive logo) and follow the instructions to give the Colab notebook permission to access your Google Drive folder and use the data file. (You may have to get an authorization code from a web link and supply it; if that is required, you will be given instructions.)

Now, you can execute the next cell which 'mount's your Google Drive, so you can access the data file you supplied:

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The next (hidden) code cell shows how the dataset was generated (processed from a web URL), for those of you who are interested once you are done with the lab. Understanding that cell is not a goal of this lab, so we will not explain it, but it is available for you to study if you feel curious.

In [None]:
#@title How dataset was generated.

# import numpy as np
# import pandas as pd

# files_dir = "/content/drive/MyDrive/Colab Notebooks/Datasets/EDA_Video/"

# dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases" + \
#               "/autos/imports-85.data"
# column_names = ["symboling", "normalized-losses", "make", "fuel-type", \
#                 "aspiration", "num-of-doors", "body-style", "drive-wheels", \
#                 "engine-location", "wheel-base", "length", "width", \
#                 "height", "curb-weight", "engine-type", "num-of-cylinders", \
#                 "engine-size", "fuel-system", "bore", "stroke", \
#                 "compression-ratio", "horsepower", "peak-rpm", "city-mpg", \
#                 "highway-mpg", \
#                  "price"]
# df = pd.read_csv(dataset_url, names=column_names).replace("?", np.nan)

# filename = files_dir + "automobile.csv"
# df.to_csv(filename, index=None)

# %reset -f

The original dataset is taken from:
[Dua, D. and Graff, C. (2019). UCI Machine Learning Repository: http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.](https://archive.ics.uci.edu/ml/datasets/Automobile)

So, let's import some packages we will use. More specifically, we will use [NumPy](https://numpy.org/) for representing and processing vectors and matrices, [pandas](https://pandas.pydata.org/docs/index.html) for data analysis and manipulation and [Plotly](https://plotly.com/python/) (express) for visualization:

In [12]:
import numpy as np
import pandas as pd
import plotly.express as px

Now, let's load our data file into a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) by using the [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function defined in the pandas module, and call that DataFrame `df`:

In [13]:
files_dir = "/content/drive/MyDrive/Colab Notebooks/datasets/"
filename = files_dir + "automobile.csv"

df = pd.read_csv(filename)
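
If you are curious how missing entries end up as `NaN`: the hidden cell above replaces the `?` markers of the raw file with `np.nan` before saving. As a minimal sketch, assuming a tiny made-up CSV (the rows and the name `toy_df` are illustrative and not taken from `automobile.csv`), `read_csv()` can also do that conversion while loading, through its `na_values` parameter:

```python
import io

import pandas as pd

# Made-up CSV text for illustration only; "?" marks a missing value.
toy_csv = io.StringIO(
    "make,horsepower,price\n"
    "brand-a,111,13495\n"
    "brand-b,?,16500\n"
    "brand-c,154,?\n"
)

# na_values tells read_csv which strings to treat as missing (NaN).
toy_df = pd.read_csv(toy_csv, na_values="?")

print(toy_df)
print(toy_df.dtypes)  # numeric columns that contain NaN are loaded as float64
```

Our `automobile.csv` was already cleaned this way, so the plain `read_csv(filename)` call is enough for this lab.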

Now, we can take a look at our DataFrame object `df` and eyeball our table. That's usually a first step in the EDA process. We look at which data types and value ranges appear in each column, whether there are errors or missing values and, generally, what the data contains. We use this as a first tool to help us **understand the data**, the true goal of the EDA process:

In [14]:
df

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115.0,5500.0,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
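203,-1,95.0,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106.0,4800.0,26,27,22470.0
204,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114.0,5400.0,19,25,22625.0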


You can check a lot of things in this table!

We can explicitly check the size of our DataFrame object in terms of how many rows and columns by using the [`shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) attribute of the DataFrame:

In [15]:
df.shape

(205, 26)

We can also take a look at the names of our columns by using the [`columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html) attribute:

In [16]:
df.columns

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')
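
Since we also want to know what data type each column holds, the `dtypes` attribute is a natural companion to `columns`. Here is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are hypothetical, not rows from our dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for a few automobile columns.
toy_df = pd.DataFrame({
    "make": ["brand-a", "brand-b", "brand-c"],
    "city-mpg": [21, 24, 18],
    "price": [13495.0, None, 17450.0],
})

# One dtype per column: integers, floats and text are stored differently.
print(toy_df.dtypes)

# info() combines column names, non-null counts and dtypes in one report.
toy_df.info()
```

You could run `df.dtypes` or `df.info()` on the real DataFrame in the same way.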

Sometimes, when the table is too big, you want to take a look at the first few rows which you can do by using the [`head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method of our DataFrame object `df`:

In [17]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


You can also check the last few rows by [`tail()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html):

In [18]:
df.tail()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
200,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114.0,5400.0,23,28,16845.0
201,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160.0,5300.0,19,25,19045.0
202,-1,95.0,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134.0,5500.0,18,23,21485.0
203,-1,95.0,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.4,23.0,106.0,4800.0,26,27,22470.0
204,-1,95.0,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114.0,5400.0,19,25,22625.0


You can also get a table of some statistical summaries of the data by using the [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method:

In [19]:
df.describe()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,164.0,205.0,205.0,205.0,205.0,205.0,205.0,201.0,201.0,205.0,203.0,203.0,205.0,205.0,201.0
mean,0.834146,122.0,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,3.329751,3.255423,10.142537,104.256158,5125.369458,25.219512,30.75122,13207.129353
std,1.245307,35.442168,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,0.273539,0.316717,3.97204,39.714369,479.33456,6.542142,6.886443,7947.066342
min,-2.0,65.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,0.0,94.0,94.5,166.3,64.1,52.0,2145.0,97.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7775.0
50%,1.0,115.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5200.0,24.0,30.0,10295.0
75%,2.0,150.0,102.4,183.1,66.9,55.5,2935.0,141.0,3.59,3.41,9.4,116.0,5500.0,30.0,34.0,16500.0
max,3.0,256.0,120.9,208.1,72.3,59.8,4066.0,326.0,3.94,4.17,23.0,288.0,6600.0,49.0,54.0,45400.0


One caveat: when you have a mixture of numeric and non-numeric (categorical) data, `describe()` displays stats only for the numeric columns by default, as you can see.
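
If you would rather see everything in one table, `describe()` accepts an `include` parameter. The sketch below uses a small made-up DataFrame (`toy_df` is hypothetical) to show the difference. It is only one option, and in this lab we take a different route by describing the column groups separately.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column.
toy_df = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel", "gas"],
    "city-mpg": [21, 24, 30, 19],
})

# Default behaviour: only the numeric column is summarized.
print(toy_df.describe())

# include="all" adds count/unique/top/freq rows for non-numeric columns;
# statistics that do not apply to a column show up as NaN.
summary = toy_df.describe(include="all")
print(summary)
```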

Let's specify the name of the label and numerical and non-numerical columns for further processing then:

In [21]:
label = "price"
numerical_features = np.array(["symboling", "normalized-losses", "wheel-base", \
                               "length", "width", "height", "curb-weight", \
                               "engine-size", "bore", "stroke", \
                               "compression-ratio", "horsepower", "peak-rpm", \
                               "city-mpg", "highway-mpg"])
non_numerical_features = np.array([column for column in df.columns \
                                   if column not in numerical_features and \
                                      column != label])
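
Typing the column names by hand keeps us in full control. As an alternative sketch, assuming the dtypes inferred by pandas match what we consider numeric, `select_dtypes()` can build the same kind of lists automatically. The names `toy_df`, `toy_label`, `toy_numerical` and `toy_non_numerical` are hypothetical and the data is made up:

```python
import pandas as pd

# Hypothetical toy data standing in for a few automobile columns.
toy_df = pd.DataFrame({
    "make": ["brand-a", "brand-b", "brand-c"],
    "engine-size": [130, 152, 109],
    "price": [13495.0, 16500.0, 13950.0],
})
toy_label = "price"

# select_dtypes(include="number") keeps integer and float columns.
numeric_columns = toy_df.select_dtypes(include="number").columns
toy_numerical = [c for c in numeric_columns if c != toy_label]
toy_non_numerical = [c for c in toy_df.columns
                     if c not in numeric_columns and c != toy_label]

print(toy_numerical)
print(toy_non_numerical)
```

Keep in mind that a column which should be numeric but contains stray text is loaded as non-numeric, so automatically inferred lists still deserve a quick review.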

Now, we can use `describe()` separately for the numeric feature columns:

In [22]:
df[numerical_features].describe()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg
count,205.0,164.0,205.0,205.0,205.0,205.0,205.0,205.0,201.0,201.0,205.0,203.0,203.0,205.0,205.0
mean,0.834146,122.0,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,3.329751,3.255423,10.142537,104.256158,5125.369458,25.219512,30.75122
std,1.245307,35.442168,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,0.273539,0.316717,3.97204,39.714369,479.33456,6.542142,6.886443
min,-2.0,65.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0
25%,0.0,94.0,94.5,166.3,64.1,52.0,2145.0,97.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0
50%,1.0,115.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5200.0,24.0,30.0
75%,2.0,150.0,102.4,183.1,66.9,55.5,2935.0,141.0,3.59,3.41,9.4,116.0,5500.0,30.0,34.0
max,3.0,256.0,120.9,208.1,72.3,59.8,4066.0,326.0,3.94,4.17,23.0,288.0,6600.0,49.0,54.0


...and non-numeric feature columns:

In [23]:
df[non_numerical_features].describe()

Unnamed: 0,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,engine-type,num-of-cylinders,fuel-system
count,205,205,205,203,205,205,205,205,205,205
unique,22,2,2,2,5,3,2,7,7,8
top,toyota,gas,std,four,sedan,fwd,front,ohc,four,mpfi
freq,32,185,168,114,96,120,202,148,159,94


...as well as the label column:

In [24]:
df[label].describe()

count      201.000000
mean     13207.129353
std       7947.066342
min       5118.000000
25%       7775.000000
50%      10295.000000
75%      16500.000000
max      45400.000000
Name: price, dtype: float64

For categorical features, we can see how many unique values exist for each feature by using the [`nunique()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html) method of a DataFrame:

In [25]:
df[non_numerical_features].nunique()

make                22
fuel-type            2
aspiration           2
num-of-doors         2
body-style           5
drive-wheels         3
engine-location      2
engine-type          7
num-of-cylinders     7
fuel-system          8
dtype: int64

We can also print the unique values themselves in a nice format by iterating through the different (non-numeric) features and calling `unique()` on each column:

In [26]:
for feature in non_numerical_features:
  print("{:22s}: {}".format(feature, df[feature].unique()))

make                  : ['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen'
 'volvo']
fuel-type             : ['gas' 'diesel']
aspiration            : ['std' 'turbo']
num-of-doors          : ['two' 'four' nan]
body-style            : ['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']
drive-wheels          : ['rwd' 'fwd' '4wd']
engine-location       : ['front' 'rear']
engine-type           : ['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf' 'dohcv']
num-of-cylinders      : ['four' 'six' 'five' 'three' 'twelve' 'two' 'eight']
fuel-system           : ['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' '4bbl' 'idi' 'spdi']
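
Knowing which values exist is a start, but we often also want to know how often each one occurs. A minimal sketch with `value_counts()`, using a made-up Series (`toy_doors` is hypothetical and its values are not taken from our dataset):

```python
import pandas as pd

# Hypothetical categorical column with one missing entry.
toy_doors = pd.Series(["two", "four", "four", None, "four"],
                      name="num-of-doors")

# dropna=False keeps missing values as their own category in the tally.
counts = toy_doors.value_counts(dropna=False)
print(counts)
```

On the real data, `df["num-of-doors"].value_counts(dropna=False)` would give the same kind of tally.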


We can also see how many examples have non-missing values for each feature by using the [`count()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html) method of the DataFrame. Comparing these counts with the total number of rows (205) tells us whether we have missing values and, if so, how many:

In [27]:
df.count()

symboling            205
normalized-losses    164
make                 205
fuel-type            205
aspiration           205
num-of-doors         203
body-style           205
drive-wheels         205
engine-location      205
wheel-base           205
length               205
width                205
height               205
curb-weight          205
engine-type          205
num-of-cylinders     205
engine-size          205
fuel-system          205
bore                 201
stroke               201
compression-ratio    205
horsepower           203
peak-rpm             203
city-mpg             205
highway-mpg          205
price                201
dtype: int64
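
If you prefer to read off the number of missing values directly instead of subtracting from the row count, `isna().sum()` gives the complementary view. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a few missing entries.
toy_df = pd.DataFrame({
    "bore": [3.47, None, 3.19, 3.19],
    "num-of-doors": ["two", "four", None, "four"],
    "city-mpg": [21, 19, 24, 18],
})

# count() reports non-missing entries; isna().sum() reports the missing ones.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# The two views always add up to the number of rows.
print((toy_df.count() + missing_per_column == len(toy_df)).all())

# Fraction of missing entries per column.
print(missing_per_column / len(toy_df))
```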

You should have a good idea about the data we are working with at this point, if you are using these tools to help you understand the data, as well as consulting experts.

We can also use visualization to help us with that. Let's plot a scatterplot matrix of the features by using the [`scatter_matrix()`](https://plotly.com/python/splom/) function of Plotly express:

In [28]:
px.scatter_matrix(df, dimensions=numerical_features, color=label,
                  width=len(numerical_features) * 200,
                  height=len(numerical_features) * 200)

You can use this to get an understanding of the values of each feature, the (pairwise) distribution shapes and the correlations between features.

One other tool we can use is to look at the correlations directly. Let's first calculate the correlations, without visualizing them, by using the [`corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) method:

In [29]:
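# numeric_only=True skips the text columns (recent pandas versions raise an error otherwise)
df.corr(numeric_only=True)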

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
symboling,1.0,0.528667,-0.531954,-0.357612,-0.232919,-0.541038,-0.227691,-0.10579,-0.134205,-0.008965,-0.178515,0.071622,0.274573,-0.035823,0.034606,-0.082391
normalized-losses,0.528667,1.0,-0.074362,0.02322,0.105073,-0.432335,0.119893,0.167365,-0.036167,0.065627,-0.132654,0.295772,0.264597,-0.258502,-0.210768,0.203254
wheel-base,-0.531954,-0.074362,1.0,0.874587,0.795144,0.589435,0.776386,0.569329,0.490378,0.161477,0.249786,0.352297,-0.361052,-0.470414,-0.544082,0.584642
length,-0.357612,0.02322,0.874587,1.0,0.841118,0.491029,0.877728,0.68336,0.60748,0.129739,0.158414,0.555003,-0.287325,-0.670909,-0.704662,0.690628
width,-0.232919,0.105073,0.795144,0.841118,1.0,0.27921,0.867032,0.735433,0.559204,0.182956,0.181129,0.642482,-0.219957,-0.642704,-0.677218,0.751265
height,-0.541038,-0.432335,0.589435,0.491029,0.27921,1.0,0.295572,0.067149,0.176195,-0.056999,0.261214,-0.110711,-0.322272,-0.04864,-0.107358,0.135486
curb-weight,-0.227691,0.119893,0.776386,0.877728,0.867032,0.295572,1.0,0.850594,0.649045,0.168929,0.151362,0.751034,-0.266306,-0.757414,-0.797465,0.834415
engine-size,-0.10579,0.167365,0.569329,0.68336,0.735433,0.067149,0.850594,1.0,0.59409,0.206675,0.028971,0.810773,-0.244618,-0.653658,-0.67747,0.872335
bore,-0.134205,-0.036167,0.490378,0.60748,0.559204,0.176195,0.649045,0.59409,1.0,-0.055909,0.005203,0.577273,-0.264269,-0.594584,-0.594572,0.543436
stroke,-0.008965,0.065627,0.161477,0.129739,0.182956,-0.056999,0.168929,0.206675,-0.055909,1.0,0.18617,0.090254,-0.071493,-0.042906,-0.044528,0.08231
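
To see what a single entry of this table means: by default, `corr()` computes the Pearson correlation coefficient, which is the covariance of two columns divided by the product of their standard deviations. The sketch below recomputes it by hand for a handful of illustrative values (`toy_df` is hypothetical and far too small to say anything about the real dataset) and compares the result with pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns, no missing values.
toy_df = pd.DataFrame({
    "engine-size": [130.0, 152.0, 109.0, 136.0, 141.0],
    "city-mpg": [21.0, 19.0, 24.0, 18.0, 23.0],
})

x = toy_df["engine-size"].to_numpy()
y = toy_df["city-mpg"].to_numpy()

# Pearson's r: covariance of x and y divided by the product of their
# standard deviations (the normalization constants cancel out).
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

r_pandas = toy_df.corr().loc["engine-size", "city-mpg"]
print(r_manual, r_pandas)  # the two values agree up to rounding
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.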


Now, we can pass these correlations to [`imshow()`](https://plotly.com/python/imshow/), the Plotly express function for showing images, to visualize them as a heatmap:

In [30]:
px.imshow(df.corr(numeric_only=True))

Finally, we can get an even better idea about the distribution of each single feature by plotting histograms. For numerical features, we use the [`histogram()`](https://plotly.com/python/histograms/) function of Plotly express to build each figure, with a box plot on the margin (`marginal="box"`) to give a better idea of the values, and the [`show()`](https://plotly.com/python/renderers/) method of each figure to display it (you can scroll through the results):

In [31]:
for feature in numerical_features:
  fig = px.histogram(df, x=feature, marginal="box")
  fig.show()
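
If you wonder what the marginal box plot summarizes, the sketch below computes the usual ingredients by hand: the quartiles, the interquartile range (IQR) and the 1.5 × IQR fences often used to flag potential outliers. The values in `toy_values` are made up, and Plotly may compute quartiles with a slightly different method, so treat the numbers as an approximation of what the plot shows:

```python
import pandas as pd

# Hypothetical toy values for a single numeric feature.
toy_values = pd.Series([13, 19, 21, 24, 24, 26, 30, 49], name="city-mpg")

# Quartiles split the sorted values into four equally sized parts.
q1, median, q3 = toy_values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# A common convention (Tukey's rule) flags points beyond 1.5 * IQR
# from the quartiles as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy_values[(toy_values < lower_fence) | (toy_values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```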

...and by plotting bar charts for the non-numeric features with the same `histogram()` function (this time, marginal box plots are not applicable):

In [32]:
for feature in non_numerical_features:
  fig = px.histogram(df, x=feature)
  fig.show()

We can plot the histogram for the label column as well. Since this is a regression problem, the label is numeric, so we can again add a box plot on the margin:

In [1]:
px.histogram(df, x=label, marginal="box")

Remember, these are only tools, not end goals. **The end goal is to understand the data**.

This concludes our first lab.

That's all Folks!