The purpose of this notebook is to work through an example of a data analysis. We will show how we perform some common preprocessing / analysis tasks to gather more information about the data. This will inform and prepare us for the modelling step.

The data we will be looking at is taken from https://www.kaggle.com/epa/fuel-economy. If it is not already in the */kaggle/input/fuel-economy* folder then you can add it as follows: Go to file --> Add or upload data. Look for "fuel economy" and click on _Add_ next to **Vehicle Fuel Economy Estimates, 1984-2017**.

The goal with this dataset is to see if we can predict the fuel economy of a car based on some of its characteristics.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline

We will first read the data in and look at some basic information.
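As a minimal sketch of this first step, the cell below reads a tiny made-up CSV from memory instead of the Kaggle file (the rows and the reduced column list are assumptions for illustration). The `low_memory=False` argument of `pd.read_csv` is one common way to avoid the mixed-type "DtypeWarning" that large CSV files can trigger.

```python
import io
import pandas as pd

# Made-up stand-in for the Kaggle CSV; the real file has many more columns.
csv_text = """Make,Model,Year,Combined MPG (FT1)
Toyota,Corolla,1995,28
Ford,F150,2001,15
Honda,Civic,2010,31
"""

# low_memory=False parses the file in one go, so column types are inferred
# from all rows at once rather than chunk by chunk.
df = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names
```

With the real data you would pass the path of the CSV file in the input folder instead of the `io.StringIO` object.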

In [None]:
# ignore warning about "Dtypes"

In [None]:
# print out the number of rows and columns (dimensions)

In [None]:
# print the names of the columns

There are too many columns to analyse right now, so we will focus on a few. The MPG-related columns all carry much the same information, so we will pick one. The columns that come after the MPG columns don't look very relevant, so we'll skip them as well.
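Selecting a subset of columns can be sketched as follows. The toy frame and the particular columns kept are assumptions for illustration, not the notebook's actual selection.

```python
import pandas as pd

# Toy frame; values are made up and only a few columns are shown.
df = pd.DataFrame({
    "Make": ["Toyota", "Ford"],
    "Model": ["Corolla", "F150"],
    "City MPG (FT1)": [25, 13],
    "Highway MPG (FT1)": [33, 18],
    "Combined MPG (FT1)": [28, 15],
})

# Keep a hand-picked subset. The .copy() gives an independent frame, so
# later modifications do not trigger SettingWithCopyWarning.
keep = ["Make", "Model", "Combined MPG (FT1)"]
data = df[keep].copy()
print(data)
```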

In [None]:
# select columns

The easiest way to get to know your data is to have a look at it in its raw form.

In [None]:
# print first 5 rows

In [None]:
# print last 5 rows

Next we go one step deeper and look at all the data and summaries of the values.
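A small sketch of both summaries, on a made-up frame (the values and the `Drive` labels are assumptions): `DataFrame.info` lists the dtype and non-null count per column, and `DataFrame.describe` gives summary statistics.

```python
import numpy as np
import pandas as pd

# Toy frame with a couple of deliberately missing entries.
df = pd.DataFrame({
    "Drive": ["FWD", "RWD", None, "FWD"],
    "Engine Cylinders": [4.0, 8.0, 4.0, np.nan],
    "Combined MPG (FT1)": [28, 15, 30, 27],
})

# dtype and number of non-null values per column
df.info()

# include="all" also summarises non-numeric columns (count, unique, top, freq)
summary = df.describe(include="all")
print(summary)
```

The non-null counts are a quick first hint at which columns have missing values.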

In [None]:
# print column information

In [None]:
# print summary statistics per column

**All prior information about the data is an assumption until confirmed by the data**

### Duplicates

Duplicated data can be caused by errors during data entry or during data collection. For example, your SQL query may have included a join which resulted in duplicates. If all records were duplicated equally, it would not be a huge problem: you would simply have more data than necessary, and training your model might take extra time. However, if only some records are duplicated, their effect on the model results will be amplified.

In this example it is not obvious whether we're dealing with duplicates. Take the first rows:

In [None]:
# print first 2 rows

In [None]:
id_columns = ['Make', 'Model', 'Class', 'Drive', 'Transmission',
              'Transmission Descriptor', 'Engine Cylinders', 'Engine Displacement',
              'Turbocharger', 'Supercharger', 'Fuel Type', 'Fuel Type 1',
              'Combined MPG (FT1)']

# print number of duplicates

In [None]:
# print example of duplicates

We could drop a few more columns from the identifier columns, which would probably surface more duplicates, but we'll leave it at that.
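The duplicate check in this section can be sketched on a toy frame. The rows and the shortened identifier list below are made up for illustration; dropping the repeats at the end is one possible follow-up, not necessarily what this notebook does.

```python
import pandas as pd

# Toy frame: rows 0 and 1 differ only in Year, which is not an identifier here.
df = pd.DataFrame({
    "Make": ["Toyota", "Toyota", "Ford", "Toyota"],
    "Model": ["Corolla", "Corolla", "F150", "Corolla"],
    "Year": [1995, 1996, 2001, 1997],
    "Combined MPG (FT1)": [28, 28, 15, 30],
})
id_columns = ["Make", "Model", "Combined MPG (FT1)"]

# duplicated() marks every repeat after the first occurrence of a combination
n_duplicates = df.duplicated(subset=id_columns).sum()
print(n_duplicates)

# keep=False marks all members of a duplicated group, handy for inspection
print(df[df.duplicated(subset=id_columns, keep=False)])

# keep only the first row of each identifier combination
deduplicated = df.drop_duplicates(subset=id_columns)
```

Which columns count as identifiers is a judgment call: the fewer columns in `id_columns`, the more rows will be flagged.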

### Categorical Values

We summarise categorical values by looking at their counts.
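A minimal sketch of such a count, on a made-up `Drive` column (the labels are assumptions). Note that `Series.value_counts` skips missing values unless `dropna=False` is passed.

```python
import pandas as pd

# Toy categorical column with one missing entry.
drive = pd.Series(["FWD", "RWD", "FWD", None, "4WD", "FWD"], name="Drive")

# counts per category, most frequent first; missing values are left out
counts = drive.value_counts()
print(counts)

# dropna=False adds a row for the missing values
counts_with_missing = drive.value_counts(dropna=False)
print(counts_with_missing)
```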

In [None]:
# print counts for Make

In [None]:
# print counts for Class

In [None]:
# print counts for Drive, remember Drive had a few missing values

We will come back to the missing values.

In [None]:
# print counts for Turbocharger

In [None]:
# print counts for Supercharger

In [None]:
# print counts for Fuel Type


### Numerical Values

We looked at summary statistics for the numerical values but we can go a little deeper by looking at the distribution of values.
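As a rough sketch (values made up), a distribution can be tabulated without plotting: a count per value for a discrete column, and a count per bin for a continuous one. In a notebook, `Series.hist()` would draw the corresponding histogram.

```python
import numpy as np
import pandas as pd

# Toy columns; the values are illustrative only.
cylinders = pd.Series([4, 4, 6, 8, 4, 6], name="Engine Cylinders")
mpg = pd.Series([28.0, 30.0, 21.0, 15.0, 27.0, 22.0], name="Combined MPG (FT1)")

# Discrete column: number of rows per distinct value, ordered by value
cyl_dist = cylinders.value_counts().sort_index()
print(cyl_dist)

# Continuous column: number of rows in each of 3 equal-width bins
bin_counts, bin_edges = np.histogram(mpg, bins=3)
print(bin_edges)
print(bin_counts)
```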

In [None]:
# print distribution for Engine Cylinders

In [None]:
# print distribution for Engine Displacement

In [None]:
# print distribution for Combined MPG (FT1)

### Missing Values

Dealing with missing values is necessary because most ML algorithms don't know what to do with _NaN_ values. _NaN_ ("not a number") is a special floating-point value that pandas uses to mark missing entries; it does not behave like an ordinary number or string (it is not even equal to itself). So it requires manual intervention to make sure _NaN_ values are dealt with correctly.

There are a number of strategies we can apply here, and which one fits depends on the source of the missing values. There is extensive research on this topic, referred to as missing value imputation, but we'll stick to some simple solutions.
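Before choosing a strategy it helps to know how much is missing. A minimal sketch on a made-up frame (the values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries in every column.
df = pd.DataFrame({
    "Drive": ["FWD", None, "RWD"],
    "Turbocharger": ["T", None, None],
    "Engine Cylinders": [4.0, np.nan, 8.0],
})

# isna() gives a True/False frame; summing counts the True values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# NaN itself is a float, and it is not equal to itself
print(type(np.nan))
print(np.nan == np.nan)
```

Because of that last property, test for missing values with `isna()` / `notna()` rather than with `==`.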

#### Errors

When missing values are the result of errors, whether during data entry or during data collection (e.g. a wrong query), I think it is best to drop the affected records if possible.
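Dropping the affected records can be sketched as follows, on a made-up frame. `DataFrame.dropna` with `subset` removes only the rows where the listed columns are missing.

```python
import pandas as pd

# Toy frame: one row has no Drive value.
df = pd.DataFrame({
    "Drive": ["FWD", None, "RWD", "FWD"],
    "Combined MPG (FT1)": [28, 19, 15, 30],
})

# number of rows where Drive is present
print(df["Drive"].notna().sum())

# drop rows with a missing Drive; other columns are not considered
cleaned = df.dropna(subset=["Drive"])
print(len(df), len(cleaned))
```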

In [None]:
# print number of non missing values for Drive

#### Laziness

Sometimes a missing value implies that the information is either not known or not applicable. In these cases we simply make it explicit by providing a fill value. There are other methods of "filling" the gaps, see the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) for more information.
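A minimal sketch of an explicit fill, on made-up data. The fill value `"N"` and the existing labels `"T"` and `"S"` are assumptions for illustration; pick whatever label makes "not applicable" obvious in your data.

```python
import pandas as pd

# Toy frame: a missing entry here is taken to mean "the car does not have one".
df = pd.DataFrame({
    "Turbocharger": ["T", None, None],
    "Supercharger": [None, "S", None],
})

fill_value = "N"  # hypothetical label for "not present"
cols = ["Turbocharger", "Supercharger"]

# replace missing entries in the selected columns only
df[cols] = df[cols].fillna(fill_value)
print(df)
```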

In [None]:
# fill missing values for Turbocharger and Supercharger

### Outliers

Outliers can also be caused by errors (e.g. a misreading in measurement equipment), or they might not be representative of the relationships we're considering. In either case it's a matter of detecting them and removing them.

A simple way of detecting outliers is to take the mean (or median) and add one or two standard deviations. Anything above this threshold can be considered an outlier.
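The rule above can be sketched as follows, using the mean plus two standard deviations as the threshold. The MPG values are made up, with one deliberately extreme entry.

```python
import pandas as pd

# Toy MPG values; the last one is deliberately extreme.
mpg = pd.Series([20, 22, 21, 23, 19, 24, 22, 21, 20, 80],
                name="Combined MPG (FT1)")

# threshold = mean + 2 * (sample) standard deviation
threshold = mpg.mean() + 2 * mpg.std()

# boolean mask of values above the threshold
is_outlier = mpg > threshold
print(is_outlier.sum())

# keep everything that is not flagged
clean = mpg[~is_outlier]
```

Keep in mind that extreme values inflate both the mean and the standard deviation, so the threshold itself shifts with the outliers; using the median as the centre is less sensitive to that.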

In [None]:
# print number of outliers for MPG

In [None]:
# create clean dataset

## Relationship Analysis

The above is all to do with cleaning the data to make modelling easier. But before we move on to that stage, we can still extract more information from the dataset by looking at relationships between variables. The best way to do this is through visualisations.
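As a numeric companion to the plots (not a replacement for them), pairwise correlations give a first impression of how the numerical columns move together. This is a sketch on made-up values; `DataFrame.corr` computes Pearson correlation by default, which only captures linear relationships.

```python
import pandas as pd

# Toy numerical columns; the values are illustrative only.
df = pd.DataFrame({
    "Engine Cylinders": [4, 4, 6, 8, 8],
    "Engine Displacement": [1.6, 2.0, 3.0, 5.0, 5.7],
    "Combined MPG (FT1)": [30, 28, 22, 16, 14],
})

# symmetric matrix of pairwise Pearson correlations
corr = df.corr()
print(corr.round(2))
```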

In [None]:
# print pair plot


### Boxplots

When it comes to the relationship between a categorical variable and a numerical variable, boxplots are the way to go. They display the distribution of the numerical values per category.
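The numbers behind such a plot can be sketched with a grouped summary: the quartiles per category form the box, and the median is the line inside it. The data below is made up. Note that the whiskers of a typical boxplot follow a rule based on the interquartile range rather than the plain minimum and maximum shown here.

```python
import pandas as pd

# Toy frame: MPG values for two made-up drive categories.
df = pd.DataFrame({
    "Drive": ["FWD", "FWD", "FWD", "RWD", "RWD", "RWD"],
    "Combined MPG (FT1)": [26, 28, 30, 14, 16, 21],
})

# per-category summary, reduced to the five numbers relevant for a boxplot
box_stats = (
    df.groupby("Drive")["Combined MPG (FT1)"]
      .describe()[["min", "25%", "50%", "75%", "max"]]
)
print(box_stats)
```

To draw the plot itself, pandas offers `df.boxplot(column="Combined MPG (FT1)", by="Drive")`, which relies on matplotlib.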

In [None]:
# print boxplot for Drive vs MPG

In [None]:
# print boxplot for Turbocharger vs MPG

In [None]:
# print boxplot for Supercharger vs MPG

In [None]:
# print boxplot for Engine Cylinders vs MPG

There are several variants of the boxplot (violin plot, swarm plot) that may give you more information, but in this case that would be overkill.