# Exploratory Data Analysis (EDA): Wine Quality Dataset

In this part, we will apply our newly gained skills to the well-known [wine dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality). Each wine is described by several attributes obtained from physicochemical tests and by a quality score (from 0 to 10). The dataset suits many ML tasks, such as:

- Testing outlier detection algorithms that can detect the few excellent or poor wines
- Modeling the relationship between wine quality and wine attributes (using regression or classification)
- Applying attribute (feature) selection techniques to remove unimportant features

> ✏️ The example is inspired by {cite}`packtpublishing`.

Let's start by importing packages:

In [1]:
import pandas as pd

Let's load our data (there are two datasets, one for red and one for white wines):

In [2]:
red_wine_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
white_wine_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

# load both CSV files into DataFrames (the files are semicolon-delimited)
df_red = pd.read_csv(red_wine_url, delimiter=";")
df_white = pd.read_csv(white_wine_url, delimiter=";")

In [3]:
# TODO - basic checks
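
One possible set of basic checks is sketched below. The helper name `basic_checks` and the tiny `sample` frame are assumptions made for illustration (the values are invented, not taken from the dataset); in the notebook you would call the helper on `df_red` and `df_white` instead.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame) -> dict:
    """Collect a few quick sanity checks for a DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # column name -> dtype name
        "missing": df.isna().sum().to_dict(),       # missing values per column
        "duplicates": int(df.duplicated().sum()),   # fully duplicated rows
    }


# Invented rows that only mimic the structure of the wine data
sample = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 9.8, None],
        "pH": [3.51, 3.20, 3.20, 3.16],
        "quality": [5, 5, 5, 6],
    }
)

report = basic_checks(sample)
print(report)
```

Calling `df.head()` and `df.info()` alongside such a helper gives a quick visual impression of the data as well.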

As we will later combine our datasets into a single one, let's add an extra column to indicate _category_:

In [4]:
df_white['wine_category'] = 'white'
df_red['wine_category'] = 'red'

A colleague in our department, a wine expert, gave us a tip: the numeric quality score may be too granular for certain types of analysis, and it is fine to divide wines into three quality categories, _low_ (score of 5 or below), _medium_ (6 or 7), and _high_ (8 or above):

In [5]:
# map a numeric score to a label: <= 5 is low, 6-7 is medium, 8 and above is high
def to_quality_label(value):
    if value <= 5:
        return 'low'
    return 'medium' if value <= 7 else 'high'

# add the label to both datasets as a categorical column with a fixed category order
for df in (df_red, df_white):
    labels = df['quality'].apply(to_quality_label)
    df['quality_label'] = pd.Categorical(labels, categories=['low', 'medium', 'high'])
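
As an alternative sketch, the same three bins can be expressed with `pd.cut`. The `quality` series below is invented to cover every bin boundary, and the lower edge of 0 is an assumption about the smallest possible score. Note that `pd.cut` returns an *ordered* categorical, which differs slightly from the unordered one created above.

```python
import pandas as pd

# Invented quality scores covering every bin boundary
quality = pd.Series([3, 5, 6, 7, 8, 9])

# Right-closed bins: [0, 5] -> low, (5, 7] -> medium, (7, 10] -> high
quality_label = pd.cut(
    quality,
    bins=[0, 5, 7, 10],
    labels=["low", "medium", "high"],
    include_lowest=True,
)

print(quality_label.tolist())
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```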

In [6]:
# TODO - analysis of red wine - summary statistics (variables), see distribution of quality, pair plot, correlation plot, violin plots, jointplot, relationship with quality
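
A minimal, pandas-only sketch of the first steps listed above (summary statistics, distribution of quality, and correlation with quality) might look as follows. The `sample` frame is an invented stand-in for `df_red`, so the printed numbers say nothing about the real data; the plots (pair plot, violin plots, joint plot) would need a plotting library and are not shown here.

```python
import pandas as pd

# Invented rows that only mimic the structure of df_red
sample = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 10.5, 11.2, 12.8, 13.0],
        "volatile acidity": [0.70, 0.88, 0.60, 0.40, 0.35, 0.30],
        "quality": [5, 5, 6, 6, 7, 8],
    }
)

# Summary statistics for every numeric variable
summary = sample.describe()

# Distribution of the quality scores, ordered by score
quality_counts = sample["quality"].value_counts().sort_index()

# Pearson correlation of each numeric attribute with quality
quality_corr = (
    sample.select_dtypes("number")
    .corr()["quality"]
    .drop("quality")
    .sort_values(ascending=False)
)

print(summary)
print(quality_counts)
print(quality_corr)
```

Selecting numeric columns first keeps the correlation step working once text columns such as `wine_category` are present.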

In [7]:
# TODO - analysis of white wine

In [8]:
# TODO - combine datasets
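
One way to combine the two datasets is to stack their rows with `pd.concat` and then shuffle them. The sketch below uses invented stand-ins (`red`, `white`) for `df_red` and `df_white`; the variable name `wines` and the seed value are assumptions as well.

```python
import pandas as pd

# Invented stand-ins for df_red and df_white (same columns in both)
red = pd.DataFrame(
    {"alcohol": [9.4, 9.8], "quality": [5, 6], "wine_category": "red"}
)
white = pd.DataFrame(
    {"alcohol": [8.8, 10.1, 11.0], "quality": [6, 6, 7], "wine_category": "white"}
)

# Stack the rows and renumber the index from 0
wines = pd.concat([red, white], ignore_index=True)

# Shuffle the rows so red and white wines are interleaved
# (fixed seed so the order is reproducible)
wines = wines.sample(frac=1, random_state=42).reset_index(drop=True)

print(wines)
```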

In [9]:
# TODO - analysis of combined data
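
With the combined data, a natural first step is to compare red and white wines side by side. The sketch below assumes a combined frame named `wines` (here filled with invented rows) that carries the `wine_category` and `quality_label` columns added earlier.

```python
import pandas as pd

# Invented stand-in for the combined dataset
wines = pd.DataFrame(
    {
        "wine_category": ["red", "red", "red", "white", "white", "white"],
        "quality_label": ["low", "medium", "medium", "low", "medium", "high"],
        "alcohol": [9.4, 10.5, 11.0, 8.8, 10.1, 12.4],
        "quality": [5, 6, 7, 5, 6, 8],
    }
)

# Mean of each numeric attribute per wine type
means_by_category = wines.groupby("wine_category")[["alcohol", "quality"]].mean()

# How many wines of each type fall into each quality label
label_counts = pd.crosstab(wines["wine_category"], wines["quality_label"])

print(means_by_category)
print(label_counts)
```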

## Exercises

#### What sorts of passengers were more likely to survive the sinking of the Titanic?

For the exercise, you will use the famous [Titanic dataset](https://www.kaggle.com/c/titanic). The sinking of the Titanic is one of the most infamous shipwrecks in history.

Unfortunately, there weren't enough lifeboats for everyone on board the Titanic, and 1502 of the 2224 passengers and crew died. While some element of luck was involved in surviving, some groups of people seemed more likely to survive than others.

In [10]:
# load the dataset
titanic_url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
titanic_data = pd.read_csv(titanic_url)

# TODO: your answer here

## Resources

```{bibliography}
:filter: docname in docnames
```