You can now practice everything covered in this chapter on a new dataset.

The dataset is called "Wine Quality" and was downloaded from: http://archive.ics.uci.edu/ml/datasets/Wine+Quality.

It contains the characteristics of 6497 wines and their ``quality``, a grade between 0 and 10 given by 3 wine experts. On the website, the data is split into white and red wines, but we have merged them for this exercise. We added an additional attribute ``type``, containing the type of wine (red or white).

So that you can practice handling missing values, some values in the original dataset have been deleted or modified.

In [1]:
# Import the functions for machine learning

%run 1-functions.ipynb

### 1. Import the dataset

+ Import the dataset called ``wine.csv`` into the variable ``data`` __(2 points)__.

In [44]:
# import dataset

import pandas as pd

# Begin answer
data = pd.read_csv('wine.csv')
# End answer

In [32]:
### BEGIN HIDDEN TESTS
assert not data.empty
### END HIDDEN TESTS

In [33]:
### BEGIN HIDDEN TESTS
assert not data['fixed acidity'].empty
### END HIDDEN TESTS

### 2. Get to know the dataset

Understand the dataset: what it looks like, its attributes, and their distributions. Plot the distribution of the attributes.

+ Place the head of the dataset in the variable ``head`` __(1 point)__.
+ Place the description of the numerical attributes in the variable ``description`` __(1 point)__.
+ How many unique values does the attribute ``type`` contain? Place this number in the variable ``unique_type`` __(1 point)__.

In [34]:
# examine dataset

# Begin answer
head = data.head()
description = data.describe()
unique_type = len(data['type'].unique())
# End answer

In [35]:
### BEGIN HIDDEN TESTS
assert head.equals(data.head())
### END HIDDEN TESTS

In [36]:
### BEGIN HIDDEN TESTS
assert description.equals(data.describe())
### END HIDDEN TESTS

In [37]:
### BEGIN HIDDEN TESTS
assert unique_type == len(data['type'].unique())
### END HIDDEN TESTS

### 3. Detect the missing values

+ There is only one attribute which contains missing values. Place the number of missing values for this attribute in the variable ``missing_nb`` __(1 point)__.

Think about how to deal with the missing values.
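
One possible way to find which attribute contains missing values is to count them per column. The sketch below uses a small made-up DataFrame named `toy` (not the real `wine.csv`); the same two calls should apply to `data`.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only; not taken from wine.csv.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 12.0],
    "density": [0.9978, np.nan, 0.9940, np.nan],
    "type": ["red", "red", "white", "white"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```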

In [40]:
# detect missing values

# Begin answer
missing_nb = data['density'].isna().sum()
# End answer

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
fixed acidity           6497 non-null float64
volatile acidity        6497 non-null float64
citric acid             6497 non-null float64
residual sugar          6497 non-null float64
chlorides               6497 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 5235 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
type                    6497 non-null object
dtypes: float64(11), int64(1), object(1)
memory usage: 659.9+ KB


In [39]:
### BEGIN HIDDEN TESTS
assert missing_nb == data['density'].isna().sum()
### END HIDDEN TESTS

+ The attribute containing the missing values can be recovered with a reasonable accuracy using the attributes ``residual sugar`` and ``alcohol``. Use this formula to recover the attribute: ``attribute = -0.0014 * alcohol + 0.0002 * residual sugar + 1.0082`` __(2 points)__.
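
To get a feel for what "reasonable accuracy" means, one option is to apply the formula to rows where the attribute is still known and measure the mean absolute error. The sketch below does this on made-up values in a hypothetical DataFrame `toy`; on the real data, the check would have to be restricted to rows whose value is not missing, and run before the missing values are filled.

```python
import pandas as pd

# Made-up toy rows for illustration only; not taken from wine.csv.
toy = pd.DataFrame({
    "alcohol": [9.4, 10.5, 12.0],
    "residual sugar": [1.9, 6.0, 2.5],
    "density": [0.9978, 0.9950, 0.9920],
})

# Formula from the exercise statement.
estimate = -0.0014 * toy["alcohol"] + 0.0002 * toy["residual sugar"] + 1.0082

# Absolute difference between the known value and the estimate, row by row.
abs_error = (toy["density"] - estimate).abs()
print(abs_error.mean())
```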

In [51]:
# recover the missing values

# Begin answer
data.loc[data['density'].isna(), 'density'] = -0.0014 * data['alcohol'] + 0.0002 * data['residual sugar'] + 1.0082
# End answer

In [53]:
### BEGIN HIDDEN TESTS
assert data.loc[data['density'].isna()].empty
assert data.iloc[6]['density'] == -0.0014 * data.iloc[6]['alcohol'] + 0.0002 * data.iloc[6]['residual sugar'] + 1.0082
### END HIDDEN TESTS

### 4. Regression model

+ Use the dataset with all the attributes to predict the attribute ``quality`` with a regression model (you can use the functions we used throughout the chapter). Place the MAE in the variable ``mae_regression`` __(2 points)__.
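
As a reminder of what the MAE measures, here is a minimal sketch that fits an ordinary least-squares model with NumPy and computes the mean absolute error on held-out rows. It does not use the chapter's helper functions from `1-functions.ipynb`, and the data is synthetic; the names `X`, `y`, and `mae_example` are assumptions made for illustration. With the real dataset, the text attribute ``type`` would need to be encoded numerically or left out before fitting a model like this one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the wine data: 200 rows, 3 numeric attributes.
X = rng.normal(size=(200, 3))
true_weights = np.array([0.5, -0.3, 0.8])
y = X @ true_weights + 6.0 + rng.normal(scale=0.1, size=200)

# Hold out the last 50 rows for evaluation.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Ordinary least squares; the column of ones models the intercept.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Predict on the held-out rows.
A_test = np.column_stack([X_test, np.ones(len(X_test))])
predictions = A_test @ coef

# Mean absolute error: average of |true label - prediction|.
mae_example = np.mean(np.abs(y_test - predictions))
print(mae_example)
```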

5. Get the MAE and plot the predictions vs. the true labels.

6. Build other models with different attributes. Try to find the combination of attributes which gives the best performance.

7. Predict the attribute ``quality`` with a classification model, get the MAE and plot the results.

8. Split the dataset on a well-chosen attribute, and predict the attribute ``quality`` separately on each of the resulting datasets.
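
Splitting a DataFrame on an attribute can be sketched with `groupby`. The example below splits a made-up DataFrame `toy` on ``type``; choosing ``type`` is only an assumption here, since the exercise leaves the choice of attribute open.

```python
import pandas as pd

# Made-up toy data for illustration only; not taken from wine.csv.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 12.0, 11.2],
    "quality": [5, 5, 6, 7, 6],
    "type": ["red", "red", "white", "white", "white"],
})

# One sub-DataFrame per distinct value of the splitting attribute.
# The splitting column is dropped because it is constant within each subset.
subsets = {
    name: group.drop(columns="type")
    for name, group in toy.groupby("type")
}

for name, subset in subsets.items():
    print(name, len(subset))
```

Each entry of `subsets` can then be passed to a model on its own, and the resulting MAE values compared with the one obtained on the full dataset.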