#### Introduction to Statistical Learning, Exercise 2.2

__Please do yourself a favour and only look at the solutions after you honestly tried to solve the exercises.__

# Explore the Auto Data Set

### A. Quantitative vs. Qualitative Variables

First, use the `islpy.datasets` module to read the documentation of the `Auto` data set. Then read it in and assign the data frame to the variable `auto`.

Which of the predictors (or variables, features...) are quantitative and which are qualitative? For some predictors this distinction is not clear cut. Write down your reasons for the different classifications.

In [None]:
import pandas as pd
from islpy import datasets

In [None]:
help(datasets.Auto)

In [None]:
auto = datasets.Auto()
auto.head()

The clearly quantitative variables are `mpg`, `displacement`, `horsepower`, `acceleration` and `weight`.

The clearly qualitative variables are `origin` and `name`. The `name` column would serve better as a row label than as a variable.

The variables `cylinders` and `year` are a bit ambiguous. The number of different values for `cylinders` is small enough that we are inclined to classify it as qualitative. The variable `year` should probably also be classified as qualitative, but there might be scenarios where treating it as quantitative is appropriate.
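One quick, admittedly rough, way to support this classification is to count the distinct values in each column. The sketch below uses a tiny made-up frame (not rows from `Auto`), and the helper name `distinct_counts` is hypothetical.

```python
import pandas as pd


def distinct_counts(df):
    """Return the number of distinct values per column, smallest first."""
    return df.nunique().sort_values()


# Made-up toy rows for illustration only; these are NOT taken from Auto.
toy = pd.DataFrame({
    'mpg': [18.0, 15.5, 24.0, 31.2, 27.9, 20.1],
    'cylinders': [8, 8, 4, 4, 4, 6],
    'origin': [1, 1, 3, 2, 3, 1],
})

print(distinct_counts(toy))
```

Calling `distinct_counts(auto)` on the real data applies the same check; columns with only a handful of distinct values are candidates for being treated as qualitative.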

### B. Variable Ranges

What is the *range* of each clearly quantitative predictor? You can use the `min()` and `max()` methods on a selection of columns from the data set to find out.

In [None]:
mins = auto[['mpg', 'displacement', 'weight',
             'horsepower', 'acceleration']].min()
maxs = auto[['mpg', 'displacement', 'weight',
             'horsepower', 'acceleration']].max()
ranges = maxs - mins
print(ranges)

### C. Variable Mean & Standard Deviation

What are the means and standard deviations of the clearly quantitative variables? You can use the `mean()` and `std()` methods on a column selection, or you can use the `describe()` method on the selection. The latter provides more information.

In [None]:
auto[['mpg', 'displacement', 'weight', 'horsepower', 'acceleration']].mean()

In [None]:
auto[['mpg', 'displacement', 'weight', 'horsepower', 'acceleration']].std()

In [None]:
auto[['mpg', 'displacement', 'weight', 'horsepower', 'acceleration']].describe()
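
It is worth knowing which standard deviation `std()` reports. By default pandas uses the sample formula (divisor n - 1, i.e. `ddof=1`), which matches `statistics.stdev` from the standard library. A minimal check on made-up numbers (not values from `Auto`):

```python
import statistics

import pandas as pd

# Made-up values for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = pd.Series(values)

sample_sd = s.std()            # pandas default: ddof=1 (divide by n - 1)
population_sd = s.std(ddof=0)  # divide by n instead

print(sample_sd, statistics.stdev(values))
print(population_sd, statistics.pstdev(values))
```

For a data set with a few hundred rows the two versions differ only slightly, but the distinction matters when comparing results with other tools.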

### D. Looking at a Subset of a Data Set

Create a subset of the `Auto` data set by removing the 10th through 85th (inclusive) observations. You can use the `drop()` method of the data frame and specify a `range`. Do not modify the original data set.

What are the range, mean and standard deviation of the quantitative variables in the subset?

In [None]:
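# drop() returns a new data frame; `auto` is only modified if inplace=True.
# drop() works on index labels, so we select the labels of the 10th through
# 85th rows (zero-based positions 9 to 84) explicitly.
sub_auto = auto.drop(auto.index[9:85])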
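
Note that `drop()` removes rows by index label, not by position. The two coincide only when the index is the default 0, 1, 2, ... and even then labels are zero-based. A small sketch on a made-up frame whose labels have a gap shows the difference; the names `toy`, `by_label` and `by_position` are just for illustration.

```python
import pandas as pd

# Made-up frame with a gap in the index labels (label 2 is missing).
toy = pd.DataFrame({'x': [10, 20, 30, 40, 50]}, index=[0, 1, 3, 4, 5])

# Drop by label: removes the rows labelled 3 and 4.
by_label = toy.drop(range(3, 5))

# Drop by position: removes the rows at zero-based positions 3 and 4,
# which carry the labels 4 and 5.
by_position = toy.drop(toy.index[3:5])

print(by_label['x'].tolist())
print(by_position['x'].tolist())
print(len(toy))  # the original frame still has all of its rows
```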

In [None]:
mins = sub_auto[['mpg', 'displacement', 'weight',
                 'horsepower', 'acceleration']].min()
maxs = sub_auto[['mpg', 'displacement', 'weight',
                 'horsepower', 'acceleration']].max()
ranges = maxs - mins
print(ranges)

In [None]:
sub_auto[['mpg', 'displacement', 'weight', 'horsepower', 'acceleration']].mean()

In [None]:
sub_auto[['mpg', 'displacement', 'weight', 'horsepower', 'acceleration']].std()

In [None]:
sub_auto[['mpg', 'displacement', 'weight',
          'horsepower', 'acceleration']].describe()

### E. Investigate the Data Set

Using the full data set, investigate the features graphically. You can use scatter plots or other graphical tools of your choice. The plots should highlight some relationships among the variables. Comment on your findings.

This exercise does not have a unique solution. We have to decide what to investigate.

For convenience, we first make some changes to the data set: use the car model names as row names and replace the origin indicators with the corresponding string labels.

In [None]:
auto.rename({'name': 'model'}, axis='columns', inplace=True)
auto.set_index('model', inplace=True)
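# Assign the result back: an in-place replace on a selected column is
# deprecated in recent pandas versions and may not update `auto`.
auto['origin'] = auto['origin'].replace({1: 'America', 2: 'Europe', 3: 'Japan'})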
auto.head()

The number of variables in the `Auto` data set is small, so we can make a full pair plot to look for relations to investigate. This is not possible (or helpful) if the number of features is too large. We colour code the origin so we can more easily spot relations involving this variable. 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
sns.pairplot(data=auto, hue='origin')
plt.show()

We can already read off some interesting relations: clearly `horsepower`, `displacement` and `weight` tend to be higher in American cars. We look at this in more detail next.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(24, 6))
sns.boxplot(x='origin', y='horsepower', data=auto, ax=ax[0])
sns.boxplot(x='origin', y='displacement', data=auto, ax=ax[1])
sns.boxplot(x='origin', y='weight', data=auto, ax=ax[2])
plt.show()

The box plots clearly confirm that American cars are more powerful, heavier and have larger engines.
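
The same comparison can be made numerically by grouping on `origin` and taking medians, which correspond to the centre lines of the box plots. A sketch with a hypothetical helper `medians_by`, assuming the grouping column and the value columns exist in the frame:

```python
import pandas as pd


def medians_by(df, group, columns):
    """Median of each column in `columns` within each level of `group`."""
    return df.groupby(group)[columns].median()
```

With the notebook's data this would be called as `medians_by(auto, 'origin', ['horsepower', 'displacement', 'weight'])`.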

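We expect `acceleration` to improve with `horsepower` and to worsen with `weight`. Note that in this data set `acceleration` is the time in seconds to go from 0 to 60 mph, so smaller values mean a quicker car. We investigate the relationship of these variables depending on the `origin`.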

In [None]:
sns.relplot(x='horsepower', y='acceleration', data=auto, hue='origin', size='weight')
plt.show()

As expected, acceleration improves with power: the time to reach 60 mph falls as `horsepower` rises. The heavy American cars have more powerful engines and so still have good acceleration. Japanese and European cars tend to be lighter and can achieve good acceleration with less powerful engines.

### F. Variables Related to Mileage

Suppose we wish to predict gas mileage (`mpg`) on the basis of other variables. Do your plots suggest that any of the other variables might be useful for predicting `mpg`? Justify your answer.

The pair plot matrix suggests `mpg` is related to `horsepower` and `weight`. This is not surprising, as more energy is needed to move more weight. `mpg` also differs between the regions of `origin`.

We make a relation plot to confirm this:

In [None]:
sns.relplot(x='mpg', y='horsepower', data=auto, hue='origin', size='weight')
plt.show()
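
To put numbers on what the plots suggest, one option is to compute the Pearson correlation of each numeric column with `mpg`. The helper below is only a sketch: the name `correlations_with` is hypothetical, and it assumes the quantitative columns are stored with numeric dtypes.

```python
import pandas as pd


def correlations_with(df, target='mpg'):
    """Pearson correlation of every numeric column with `target`."""
    # Keep only numeric columns; string columns such as labels are skipped.
    numeric = df.select_dtypes(include='number')
    # Correlate each numeric column with the target and drop the
    # trivial self-correlation of 1.0.
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values()
```

With the notebook's data this would be called as `correlations_with(auto)`. Values near -1 or 1 indicate a strong linear relationship, while a curved relationship can be understated by this measure.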