# Exploratory Analysis of Wine Types and Quality Data (Solution)

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

This module covers exploratory data analysis of wine types and quality using physicochemical attributes. We will analyze two datasets from the UCI Machine Learning Repository containing red and white wine samples to understand patterns and relationships between wine properties, types and quality ratings.

## Learning Objectives

- Understand and analyze relationships between wine physicochemical properties and quality ratings
- Apply statistical analysis techniques to wine attribute data
  - Perform descriptive statistical analysis
  - Conduct inferential statistical tests like ANOVA
- Create effective visualizations to explore wine data patterns
  - Generate univariate distribution plots
  - Produce multivariate relationship plots
- Build analytical frameworks for wine type and quality prediction

### Tasks to complete

- Perform descriptive statistical analysis
- Conduct inferential statistical tests
- Generate univariate distribution visualizations
- Create multivariate relationship plots
- Analyze patterns between wine attributes and quality

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries


## Get Started

### Set up conda environment

Ensure that you have created the conda environment using the environment file included in this repository, e.g.,

```
# Create conda environment
conda env create -f conda_env_submodule_4.yml

# Register the kernel
python -m ipykernel install --user \
    --name=nigms_sandbox_ud__submodule_4 \
    --display-name "Python (NIGMS Sandbox UD, Submodule 4)"
```

Then, when starting the notebook, select the appropriate kernel from the list.

Note that you may need to restart Jupyter Lab for these changes to take effect.

### Import necessary libraries


In [None]:
# Import necessary dependencies
# We will use matplotlib and seaborn for exploratory data analysis and visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

# make your plot outputs appear and be stored within the notebook.
%matplotlib inline 

# Set up plotting style
sns.set(style="whitegrid", palette="pastel")
plt.rcParams["figure.figsize"] = (10, 6)

## Problem Statement

“Given a dataset, or in this case two datasets that deal with physicochemical properties of wine, can you
guess the wine type and quality?” We will process,
analyze, visualize, and model our dataset based on standard Machine Learning and data mining workflow
models like the CRISP-DM model.

The datasets used are available in the very popular UCI Machine Learning Repository
under the name of Wine Quality Data Set. You can access more details at https://archive.ics.uci.edu/ml/datasets/wine+quality. There are two datasets, one for red wines and the other for white wines.

We will be trying to solve the following major problems by
leveraging Machine Learning and data analysis on our wine quality dataset.

- Predict if each wine sample is a red or white wine.
- Predict the quality of each wine sample, which can be low, medium, or high.


## Load and merge datasets


In [None]:
# Load datasets
red_wine = pd.read_csv("../../Data/winequality-red.csv", sep=";")
white_wine = pd.read_csv("../../Data/winequality-white.csv", sep=";")

# Add wine type and quality labels
red_wine["wine_type"] = "red"
white_wine["wine_type"] = "white"

def categorize_quality(quality):
    """Categorize wine quality into low, medium, or high."""
    if quality <= 5:
        return "low"
    elif quality <= 7:
        return "medium"
    else:
        return "high"

red_wine["quality_label"] = red_wine["quality"].apply(categorize_quality)
white_wine["quality_label"] = white_wine["quality"].apply(categorize_quality)

# Merge datasets and shuffle
wines = pd.concat([red_wine, white_wine]).sample(frac=1, random_state=42).reset_index(drop=True)

## Understand dataset features and values


In [None]:
print(white_wine.shape, red_wine.shape)
print(wines.info())

We have 4898 white wine data points and 1599 red wine data points. The
merged dataset contains a total of 6497 data points and we also get an idea of numeric and categorical
attributes.
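A quick way to double-check a merge like this is to compare row counts and look at which columns are numeric or categorical. The sketch below is illustrative only: `red_demo` and `white_demo` are tiny made-up frames (hypothetical stand-ins, not the UCI data), and the same checks could be run on `wines`.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the red and white wine tables
red_demo = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})
white_demo = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 7]})
red_demo["wine_type"] = "red"
white_demo["wine_type"] = "white"

merged_demo = pd.concat([red_demo, white_demo], ignore_index=True)

# Row counts of the parts should add up to the row count of the whole
rows_match = len(merged_demo) == len(red_demo) + len(white_demo)

# Per-type counts, column names split by kind, and missing values per column
type_counts = merged_demo["wine_type"].value_counts()
numeric_cols = merged_demo.select_dtypes(include="number").columns.tolist()
categorical_cols = merged_demo.select_dtypes(exclude="number").columns.tolist()
missing_per_column = merged_demo.isnull().sum()

print(rows_match, type_counts.to_dict(), numeric_cols, categorical_cols)
```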


In [None]:
# Let’s take a peek at our dataset to see some sample data points.
wines.head()

## Domain knowledge about wine and its attributes


### Understanding Wine and Types

Wine is an alcoholic beverage made from grapes, which are fermented without the addition of sugars, acids, enzymes, water, or other nutrients.

Red wine is made from dark red and black grapes. The color usually ranges from various shades of red, brown and violet. This is produced with whole grapes including the skin which adds to the color and flavor of red wines, giving it a rich flavor.

White wine is made from white grapes with no skins or seeds. The color is usually straw-yellow, yellow-green, or yellow-gold. Most white wines have a light and fruity flavor as compared to richer red wines.


### Understanding Wine Attributes and Properties

The 14 attributes are described as follows:

- **fixed acidity:** Acids are one of the fundamental properties of wine and contribute greatly to the taste of the wine. Reducing acids significantly might lead to wines tasting flat. Fixed acids include tartaric, malic, citric, and succinic acids which are found in grapes (except succinic). This variable is usually expressed in $\frac{g(\text{tartaric acid})}{dm^3}$ in the dataset.

- **volatile acidity:** These acids are to be distilled out from the wine before completing the production process. Volatile acidity is primarily constituted of acetic acid, though other acids like lactic, formic, and butyric acids might also be present. An excess of volatile acids is undesirable and leads to unpleasant flavor. In the US, the legal limits of volatile acidity are 1.2 g/L for red table wine and 1.1 g/L for white table wine. The volatile acidity is expressed in $\frac{g(\text{acetic acid})}{dm^3}$ in the dataset.

- **citric acid:** This is one of the fixed acids which gives a wine its freshness. Usually most of it is consumed during the fermentation process and sometimes it is added separately to give the wine more freshness. It's usually expressed in $\frac{g}{dm^3}$ in the dataset.

- **residual sugar:** This typically refers to the natural sugar from grapes which remains after the fermentation process stops, or is stopped. It's usually expressed in $\frac{g}{dm^3}$ in the dataset.

- **chlorides:** This is usually a major contributor to saltiness in wine. It's usually expressed in $\frac{g(\text{sodium chloride})}{dm^3}$ in the dataset.

- **free sulfur dioxide:** This is the part of the sulfur dioxide that, when added to a wine, is said to be free after the remaining part binds. Winemakers will always try to get the highest proportion of free sulfur to bind. These compounds are also known as sulfites, and too much of them is undesirable and gives a pungent odour. This variable is expressed in $\frac{mg}{dm^3}$ in the dataset.

- **total sulfur dioxide:** This is the sum total of the bound and the free sulfur dioxide ($SO_2$). Here, it's expressed in $\frac{mg}{dm^3}$. This is mainly added to kill harmful bacteria and preserve quality and freshness. There are usually legal limits for sulfur levels in wines and excess of it can even kill good yeast and give out undesirable odour.

- **density:** This can be represented as a comparison of the weight of a specific volume of wine to an equivalent volume of water. It is generally used as a measure of the conversion of sugar to alcohol. Here, it's expressed in $\frac{g}{cm^3}$.

- **pH:** Also known as the potential of hydrogen, this is a numeric scale to specify the acidity or basicity of the wine. Fixed acidity contributes the most towards the pH of wines. As you might know, solutions with a pH less than 7 are acidic, while solutions with a pH greater than 7 are basic. With a pH of 7, pure water is neutral. Most wines have a pH between 2.9 and 3.9 and are therefore acidic.

- **sulphates:** These are mineral salts containing sulfur. Sulphates are to wine as gluten is to food. They are a regular part of winemaking around the world and are considered essential. They are connected to the fermentation process and affect the wine aroma and flavor. Here, they are expressed in $\frac{g(\text{potassium sulphate})}{dm^3}$.

- **alcohol:** Wine is an alcoholic beverage. Alcohol is formed as a result of yeast converting sugar during the fermentation process. The percentage of alcohol can vary from wine to wine. Hence it is not a surprise for this attribute to be a part of this dataset. It's usually measured in % vol or alcohol by volume (ABV).

- **quality:** Wine experts graded the wine quality between 0 (very bad) and 10 (very excellent). The eventual quality score is the median of at least three evaluations made by the same wine experts.

- **wine_type:** Since we originally had two datasets for red and white wine, we introduced this attribute in the final merged dataset to indicate the type of wine for each data point. A wine can either be a 'red' or a 'white' wine. One of the prediction problems stated earlier is to predict the type of wine by looking at other wine attributes.

- **quality_label:** This is a derived attribute from the `quality` attribute. We bucket or group wine quality scores into three qualitative buckets, namely low, medium, and high. Wines with a quality score of 3, 4, or 5 are low quality, scores of 6 and 7 are medium quality, and scores of 8 and 9 are high quality wines. The second prediction problem stated earlier is to predict this wine quality label based on other wine attributes.
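The low/medium/high bucketing can also be expressed as a single vectorized call instead of a row-wise `apply`. The sketch below is one possible alternative, not the approach used in this notebook; it uses `pd.cut` on a small example series (`scores` is made up for illustration) and should give the same labels as `categorize_quality` for integer scores.

```python
import numpy as np
import pandas as pd

# Example quality scores spanning the values named above (3 to 9)
scores = pd.Series([3, 4, 5, 6, 7, 8, 9])

# Right-closed bins: (-inf, 5] -> low, (5, 7] -> medium, (7, inf) -> high
labels = pd.cut(
    scores,
    bins=[-np.inf, 5, 7, np.inf],
    labels=["low", "medium", "high"],
)

print(pd.DataFrame({"quality": scores, "quality_label": labels}))
```

`pd.cut` returns an ordered categorical, which also keeps the low < medium < high ordering available for sorting and plotting.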


## Exploratory Data Analysis and Visualizations

Standard Machine Learning and analytics workflows recommend processing, cleaning, analyzing, and
visualizing your data before moving on to modeling it. We will follow the same workflow here.


### Descriptive Statistics


In [None]:
# Let’s build a descriptive summary table on various wine attributes separated by wine type.
# Summary statistics for key attributes based on wine types
subset_attributes = ["residual sugar", "total sulfur dioxide", "sulphates", "alcohol", "volatile acidity", "quality"]
summary = wines.groupby("wine_type")[subset_attributes].describe().T
print(summary)

We can see that the mean residual sugar and total sulfur dioxide content in
white wine seem to be much higher than in red wine. Also, the mean values of sulphates and volatile acidity
seem to be higher in red wine than in white wine.


In [None]:
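# Calculate summary statistics for key attributes based on wine quality labels
quality_attributes = ["alcohol", "volatile acidity", "pH", "quality"]
quality_summary = wines.groupby("quality_label")[quality_attributes].describe().T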
print(quality_summary)

Interestingly, mean alcohol levels seem to increase based on the rating of the
wine quality. We also see that pH levels are almost consistent across the wine samples of varying quality.


### Inferential Statistics

Inferential statistics draws inferences about a population from a data sample. The idea is to use
statistical methods and models to test hypotheses. Each test involves a null hypothesis and an
alternative hypothesis. If the test result is statistically significant at a pre-set significance level
(e.g., if the obtained p-value is less than the 5% significance level), we reject the null hypothesis in
favor of the alternative hypothesis. Otherwise, if the result is not statistically significant, we fail to
reject the null hypothesis; this does not prove that the null hypothesis is true.


#### ANOVA

A great statistical tool for testing whether means differ among subsets of data is
the one-way ANOVA test. ANOVA stands for “analysis of variance,” a statistical model that can
be used to analyze statistically significant differences among the means of various groups. This is
basically achieved using a statistical test that helps us determine whether or not the means of several groups
are equal.

- The null hypothesis, $H_0$, states that the means of all the groups are equal.
- The alternative hypothesis, $H_A$, states that at least two group means are statistically
  significantly different from each other.

Usually the F-statistic and its associated p-value are used to determine statistical
significance. Typically a p-value less than 0.05 is taken to be a statistically significant result, where we reject
the null hypothesis in favor of the alternative.

In our case, three data subsets or groups are created from the data based on wine quality ratings. The
first test compares mean wine alcohol content across these groups and the second test compares
mean wine pH levels. The null hypothesis is that the group means for low, medium, and high
quality wine are the same, and the alternative hypothesis is that there is a statistically significant difference
between at least two group means.
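To make the F-statistic less of a black box, here is a minimal sketch that computes it by hand as the ratio of the between-group mean square to the within-group mean square, and compares the result with `scipy.stats.f_oneway`. The three arrays are made-up alcohol readings for illustration only, not values from the wine data.

```python
import numpy as np
from scipy import stats

# Made-up alcohol readings for three quality groups (illustrative only)
groups = [
    np.array([9.2, 9.5, 9.8, 10.0]),     # "low"
    np.array([10.1, 10.6, 11.0, 11.3]),  # "medium"
    np.array([11.8, 12.1, 12.5, 12.9]),  # "high"
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
n = all_values.size  # total number of observations

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
# Upper-tail probability of the F distribution with (k - 1, n - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.3f}, p = {p_manual:.3g}")
print(f"scipy:  F = {f_scipy:.3f}, p = {p_scipy:.3g}")
```

A large F means the group means are far apart relative to the scatter inside each group, which is what drives the p-value down.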


In [None]:
# Perform ANOVA for alcohol and pH levels across quality labels
def perform_anova(data, attribute):
    """Perform ANOVA test for a given attribute across quality labels."""
    groups = [data[data["quality_label"] == label][attribute] for label in ["low", "medium", "high"]]
    F, p = stats.f_oneway(*groups)
    print(f"ANOVA for {attribute}: F-statistic = {F:.2f}, p-value = {p:.4f}")

perform_anova(wines, "alcohol")  # perform the one-way ANOVA test based on alcohol levels
# perform the one-way ANOVA test based on pH levels
perform_anova(wines, "pH")

We can clearly see we have a p-value much less than 0.05 in the first test and
greater than 0.05 in the second test. This tells us that there is a statistically significant difference in alcohol
level means for at least two groups out of the three (rejecting the null hypothesis in favor of the alternative).
However, in case of pH level means, we do not reject the null hypothesis and thus we conclude that the pH
level means across the three groups are not statistically significantly different.


In [None]:
# We can even visualize these two features and observe the means.
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
f.suptitle("Wine Quality - Alcohol Content/pH", fontsize=14)
f.subplots_adjust(top=0.85, wspace=0.3)

# Show boxplot of wine quality classes vs wine alcohol percentage
sns.boxplot(x="quality_label", y="alcohol", data=wines, ax=ax1)
ax1.set_xlabel("Wine Quality Class", size=12, alpha=0.8)
ax1.set_ylabel("Wine Alcohol %", size=12, alpha=0.8)

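# Show boxplot of wine quality classes vs wine pH.
sns.boxplot(x="quality_label", y="pH", data=wines, ax=ax2)
ax2.set_xlabel("Wine Quality Class", size=12, alpha=0.8)
ax2.set_ylabel("Wine pH", size=12, alpha=0.8)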

The boxplots in the figure above show stark differences in wine alcohol content distributions
across wine quality classes, whereas the pH distributions look very similar.


### Univariate Analysis

Univariate
analysis involves analyzing data such that at any instance of analysis we are only dealing with one variable or
feature. No relationships or correlations are analyzed among multiple variables. The simplest way to easily
visualize all the variables in your data is to build some histograms.
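For intuition about what a histogram draws, the sketch below computes the bin edges and per-bin counts directly with `np.histogram`. The values are synthetic (drawn from a gamma distribution purely for illustration, not taken from the wine data), and 15 bins are used to mirror the plots in this section.

```python
import numpy as np

# Synthetic, right-skewed values standing in for a wine attribute (not the UCI data)
rng = np.random.default_rng(42)
values = rng.gamma(shape=2.0, scale=1.5, size=500)

# Split the observed range into 15 equal-width bins and count the values in each
counts, bin_edges = np.histogram(values, bins=15)

# Each bin is half-open [left, right), except the last one, which includes its right edge
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```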


In [None]:
# visualize distributions of data values for all features of red wines
red_wine.hist(
    bins=15,
    color="red",
    edgecolor="black",
    linewidth=1.0,
    xlabelsize=8,
    ylabelsize=8,
    grid=False,
)
plt.tight_layout(rect=(0, 0, 1.2, 1.2))
rt = plt.suptitle("Red Wine Univariate Plots", x=0.65, y=1.25, fontsize=14)

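# visualize distributions of data values for all features of white wines
white_wine.hist(
    bins=15,
    color="white",
    edgecolor="black",
    linewidth=1.0,
    xlabelsize=8,
    ylabelsize=8,
    grid=False,
)
plt.tight_layout(rect=(0, 0, 1.2, 1.2))
wt = plt.suptitle("White Wine Univariate Plots", x=0.65, y=1.25, fontsize=14)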

In [None]:
# take residual sugar and plot the distributions across data pertaining to red and white wine samples.
fig = plt.figure(figsize=(10, 4))
title = fig.suptitle("Residual Sugar Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Residual Sugar")
ax1.set_ylabel("Frequency")
ax1.set_ylim([0, 2500])
ax1.text(
    8, 1000, r"$\mu$=" + str(round(red_wine["residual sugar"].mean(), 2)), fontsize=12
)
r_freq, r_bins, r_patches = ax1.hist(
    red_wine["residual sugar"], color="red", bins=15, edgecolor="black", linewidth=1
)

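# plot residual sugar distribution in white wine samples
ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("White Wine")
ax2.set_xlabel("Residual Sugar")
ax2.set_ylabel("Frequency")
ax2.set_ylim([0, 2500])
ax2.text(
    30, 1000, r"$\mu$=" + str(round(white_wine["residual sugar"].mean(), 2)), fontsize=12
)
w_freq, w_bins, w_patches = ax2.hist(
    white_wine["residual sugar"], color="white", bins=15, edgecolor="black", linewidth=1
)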

We can see that residual sugar content tends to be higher in white wine
samples than in red wine samples.


In [None]:
# take sulphates and plot the distributions across data pertaining to red and white wine samples.
fig = plt.figure(figsize=(10, 4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Frequency")
ax1.set_ylim([0, 1200])
ax1.text(1.2, 800, r"$\mu$=" + str(round(red_wine["sulphates"].mean(), 2)), fontsize=12)
r_freq, r_bins, r_patches = ax1.hist(
    red_wine["sulphates"], color="red", bins=15, edgecolor="black", linewidth=1
)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("White Wine")
ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Frequency")
ax2.set_ylim([0, 1200])
ax2.text(
    0.8, 800, r"$\mu$=" + str(round(white_wine["sulphates"].mean(), 2)), fontsize=12
)
w_freq, w_bins, w_patches = ax2.hist(
    white_wine["sulphates"], color="white", bins=15, edgecolor="black", linewidth=1
)

We can see the sulphate content is slightly more in red wine samples
as compared to white wine samples.


In [None]:
# take alcohol and plot the distributions across data pertaining to red and white wine samples.
fig = plt.figure(figsize=(10, 4))
title = fig.suptitle("Alcohol Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

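# plot alcohol distribution in red wine samples
ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Alcohol")
ax1.set_ylabel("Frequency")
ax1.set_ylim([0, 1200])
ax1.text(12, 800, r"$\mu$=" + str(round(red_wine["alcohol"].mean(), 2)), fontsize=12)
r_freq, r_bins, r_patches = ax1.hist(
    red_wine["alcohol"], color="red", bins=15, edgecolor="black", linewidth=1
)

# plot alcohol distribution in white wine samples
ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("White Wine")
ax2.set_xlabel("Alcohol")
ax2.set_ylabel("Frequency")
ax2.set_ylim([0, 1200])
ax2.text(12, 800, r"$\mu$=" + str(round(white_wine["alcohol"].mean(), 2)), fontsize=12)
w_freq, w_bins, w_patches = ax2.hist(
    white_wine["alcohol"], color="white", bins=15, edgecolor="black", linewidth=1
)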


We can see that the alcohol content is similar in both types on average. Of
course, frequency counts are higher in all cases for white wine because we have more white wine sample
records than red wine records.


In [None]:
# take quality and plot the distributions across data pertaining to red and white wine samples.
fig = plt.figure(figsize=(10, 4))
title = fig.suptitle("Wine Types - Quality", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Quality")
ax1.set_ylabel("Frequency")
rw_q = red_wine["quality"].value_counts()
rw_q = (list(rw_q.index), list(rw_q.values))
ax1.set_ylim([0, 2500])
ax1.tick_params(axis="both", which="major", labelsize=8.5)
bar1 = ax1.bar(rw_q[0], rw_q[1], color="red", edgecolor="black", linewidth=1)


ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("White Wine")
ax2.set_xlabel("Quality")
ax2.set_ylabel("Frequency")
ww_q = white_wine["quality"].value_counts()
ww_q = (list(ww_q.index), list(ww_q.values))
ax2.set_ylim([0, 2500])
ax2.tick_params(axis="both", which="major", labelsize=8.5)
bar2 = ax2.bar(ww_q[0], ww_q[1], color="white", edgecolor="black", linewidth=1)

In [None]:
# take quality_label categorical features and plot the distributions across data pertaining to red and white wine samples.

fig = plt.figure(figsize=(10, 4))
title = fig.suptitle("Wine Type - Quality Label", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title("Red Wine")
ax1.set_xlabel("Quality Class")
ax1.set_ylabel("Frequency")
rw_q = red_wine["quality_label"].value_counts()
rw_q = (list(rw_q.index), list(rw_q.values))
ax1.set_ylim([0, 3200])
bar1 = ax1.bar(
    list(range(len(rw_q[0]))),
    rw_q[1],
    color="red",
    edgecolor="black",
    linewidth=1,
    tick_label=rw_q[0],
)

ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title("White Wine")
ax2.set_xlabel("Quality Class")
ax2.set_ylabel("Frequency")
ww_q = white_wine["quality_label"].value_counts()
ww_q = (list(ww_q.index), list(ww_q.values))
ax2.set_ylim([0, 3200])
bar2 = ax2.bar(
    list(range(len(ww_q[0]))),
    ww_q[1],
    color="white",
    edgecolor="black",
    linewidth=1,
    tick_label=ww_q[0],
)

It is quite evident that high quality wine samples are far fewer than low and medium
quality wine samples.
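Class imbalance like this is easy to quantify alongside the bar charts. The sketch below uses a made-up label column (`demo_labels` is hypothetical, and its counts are invented for illustration); on the real data the same two calls could be applied to `wines["quality_label"]`.

```python
import pandas as pd

# Hypothetical label column; the counts are made up for illustration
demo_labels = pd.Series(
    ["low"] * 6 + ["medium"] * 10 + ["high"] * 1, name="quality_label"
)

counts = demo_labels.value_counts()                       # absolute frequencies
shares = demo_labels.value_counts(normalize=True).round(3)  # relative frequencies

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

Knowing the class shares up front matters later, because a model can look accurate simply by rarely predicting the smallest class.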


### Multivariate Analysis

Analyzing multiple feature variables and their relationships is what multivariate analysis is all about. We
would want to see if there are any interesting patterns and relationships among the physicochemical
attributes of our wine samples, which might be helpful in our modeling process in the future.

One of the best
ways to analyze features is to build a pairwise correlation plot depicting the correlation coefficient between
each pair of features in the dataset.


In [None]:
# Correlation heatmap
plt.figure(figsize=(12, 8))
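# Keep only numeric columns; the string columns (wine_type, quality_label) cannot be correlated
corr = wines.select_dtypes(include="number").corr()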
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Wine Attributes Correlation Heatmap", fontsize=14)
plt.show()

We can see a strong negative
correlation between density and alcohol and a strong positive correlation between total and free sulfur
dioxide, which is expected.
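The positive correlation between free and total sulfur dioxide follows from how the two are defined: the total is the sum of the free and bound parts. A minimal sketch with synthetic numbers (uniform random values chosen only for illustration, not the wine data) shows the effect:

```python
import numpy as np
import pandas as pd

# Synthetic values (not the UCI data): total SO2 is built as free + bound
rng = np.random.default_rng(0)
free_so2 = rng.uniform(5, 60, size=200)
bound_so2 = rng.uniform(20, 120, size=200)
demo = pd.DataFrame(
    {
        "free sulfur dioxide": free_so2,
        "total sulfur dioxide": free_so2 + bound_so2,
    }
)

# Pearson correlation matrix, the default method of DataFrame.corr()
corr_matrix = demo.corr()
r = corr_matrix.loc["free sulfur dioxide", "total sulfur dioxide"]
print(f"Pearson r = {r:.2f}")
```

Even though the free and bound parts are generated independently here, the total still correlates positively with the free part because it contains it.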


In [None]:
# visualize patterns and relationships among multiple variables
# using pairwise plots and use different hues for the wine types
# essentially plotting three variables at a time.
cols = ["wine_type", "quality", "sulphates", "volatile acidity"]
pp = sns.pairplot(
    wines[cols],
    hue="wine_type",
    height=1.8,
    aspect=1.8,
    palette={"red": "#FF9999", "white": "#FFE888"},
    plot_kws=dict(edgecolor="black", linewidth=0.5),
)
fig = pp.fig
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle("Wine Attributes Pairwise Plots", fontsize=14)

We can notice several interesting patterns, which are in alignment with
some insights we obtained earlier.

- Presence of higher sulphate levels in red wines as compared to white wines
- Lower sulphate levels in wines with high quality ratings
- Lower levels of volatile acids in wines with high quality ratings
- Presence of higher volatile acid levels in red wines as compared to white wines

To observe relationships among features with a more microscopic view, joint plots are excellent visualization
tools specifically for multivariate visualizations.


In [None]:
# plot relationship between sulphates, and quality ratings for red wines
rj = sns.jointplot(
    x="quality",
    y="sulphates",
    data=red_wine,
    kind="reg",
    ylim=(0, 2),
    color="red",
    space=0,
    height=4.5,
    ratio=4,
)
rj.ax_joint.set_xticks(list(range(3, 9)))
fig = rj.fig
fig.subplots_adjust(top=0.9)
t = fig.suptitle("Red Wine Sulphates - Quality", fontsize=12)

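# plot relationship between sulphates, and quality ratings for white wines
wj = sns.jointplot(
    x="quality",
    y="sulphates",
    data=white_wine,
    kind="reg",
    ylim=(0, 2),
    color="#FFE160",
    space=0,
    height=4.5,
    ratio=4,
)
wj.ax_joint.set_xticks(list(range(3, 10)))
fig = wj.fig
fig.subplots_adjust(top=0.9)
t = fig.suptitle("White Wine Sulphates - Quality", fontsize=12)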

The seaborn framework provides facet grids that
help us visualize a higher number of variables in two-dimensional plots.


In [None]:
# visualize relationships between wine type, quality ratings, volatile acidity, and alcohol volume levels.
g = sns.FacetGrid(
    wines,
    col="wine_type",
    hue="quality_label",
    col_order=["red", "white"],
    hue_order=["low", "medium", "high"],
    aspect=1.2,
    height=3.5,
    palette=sns.light_palette("navy", 3),
)
g.map(
    plt.scatter,
    "volatile acidity",
    "alcohol",
    alpha=0.9,
    edgecolor="white",
    linewidth=0.5,
)
fig = g.fig
fig.subplots_adjust(top=0.8, wspace=0.3)
fig.suptitle("Wine Type - Alcohol - Quality - Acidity", fontsize=14)
l = g.add_legend(title="Wine Quality Class")

Not only are we able to successfully visualize
four variables, but also we can see meaningful relationships among them. Higher quality wine samples
(depicted by darker shades) have lower levels of volatile acidity and higher levels of alcohol content as
compared to wine samples with medium and low ratings. Besides this, we can also see that volatile acidity
levels are slightly lower in white wine samples as compared to red wine samples.


In [None]:
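# visualize relationships between wine type, quality ratings, volatile acidity, and total sulfur dioxide.
g = sns.FacetGrid(
    wines,
    col="wine_type",
    hue="quality_label",
    col_order=["red", "white"],
    hue_order=["low", "medium", "high"],
    aspect=1.2,
    height=3.5,
    palette=sns.light_palette("green", 3),
)
g.map(
    plt.scatter,
    "volatile acidity",
    "total sulfur dioxide",
    alpha=0.9,
    edgecolor="white",
    linewidth=0.5,
)
fig = g.fig
fig.subplots_adjust(top=0.8, wspace=0.3)
fig.suptitle("Wine Type - Sulfur Dioxide - Quality - Acidity", fontsize=14)
l = g.add_legend(title="Wine Quality Class")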

We can see that _volatile acidity_ as well as _total sulfur dioxide_ are
considerably lower in high quality wine samples. Also, total sulfur dioxide is considerably higher in white
wine samples than in red wine samples. However, volatile acidity levels are slightly lower in white
wine samples than in red wine samples, as we also observed in the previous plot.

A nice way to visualize numerical features segmented by groups (categorical variables) is to use box
plots. Let’s try to visualize the relationship between
wine alcohol levels grouped by wine quality ratings.


In [None]:
# Visualizing relationships between wine types: quality and alcohol content
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle("Wine Type - Quality - Alcohol Content", fontsize=14)

# Show relationships between wine alcohol content and wine quality
sns.boxplot(
    x="quality",
    y="alcohol",
    hue="wine_type",
    data=wines,
    palette={"red": "#FF9999", "white": "white"},
    ax=ax1,
)
ax1.set_xlabel("Wine Quality", size=12, alpha=0.8)
ax1.set_ylabel("Wine Alcohol %", size=12, alpha=0.8)

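# Show relationships between wine alcohol content and wine quality classes (labels)
sns.boxplot(
    x="quality_label",
    y="alcohol",
    hue="wine_type",
    data=wines,
    order=["low", "medium", "high"],
    palette={"red": "#FF9999", "white": "white"},
    ax=ax2,
)
ax2.set_xlabel("Wine Quality Class", size=12, alpha=0.8)
ax2.set_ylabel("Wine Alcohol %", size=12, alpha=0.8)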

l = plt.legend(loc="best", title="Wine Type")

Each box plot in the figure depicts the distribution of alcohol level
for a particular wine quality rating, separated by wine type. The box itself depicts the inter-quartile range
and the line inside depicts the median value of alcohol. The whiskers extend to the smallest and largest
values that are not flagged as outliers (by default, values within 1.5 times the inter-quartile range of the
box), and outliers are depicted by individual points.
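The numbers behind a box plot can be reproduced by hand. The sketch below assumes the common 1.5 × IQR convention for whiskers (which is also seaborn's default, `whis=1.5`) and uses a small made-up sample with one extreme value.

```python
import numpy as np

# Made-up sample with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# Quartiles and inter-quartile range (the box and the line inside it)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```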

We can clearly observe that the wine alcohol by volume
distribution has an increasing trend with higher quality ratings.

Similarly, we can also use
violin plots to visualize distributions of numeric features over categorical features.

In [None]:
# Visualizing relationships between wine types: quality and acidity
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle("Wine Type - Quality - Acidity", fontsize=14)

# Show relationships between wine volatile acidity and wine quality
sns.violinplot(
    x="quality",
    y="volatile acidity",
    hue="wine_type",
    data=wines,
    split=True,
    inner="quart",
    linewidth=1.3,
    palette={"red": "#FF9999", "white": "white"},
    ax=ax1,
)
ax1.set_xlabel("Wine Quality", size=12, alpha=0.8)
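ax1.set_ylabel("Wine Volatile Acidity", size=12, alpha=0.8)

# Show relationships between wine volatile acidity and wine quality classes (labels)
sns.violinplot(
    x="quality_label",
    y="volatile acidity",
    hue="wine_type",
    data=wines,
    order=["low", "medium", "high"],
    split=True,
    inner="quart",
    linewidth=1.3,
    palette={"red": "#FF9999", "white": "white"},
    ax=ax2,
)
ax2.set_xlabel("Wine Quality Class", size=12, alpha=0.8)
ax2.set_ylabel("Wine Volatile Acidity", size=12, alpha=0.8)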

l = plt.legend(loc="upper right", title="Wine Type")

Each violin plot typically depicts the inter-quartile range with the median which is shown
with dotted lines in this figure. We have built a split-violin plot in this case depicting both types of
wine.

It is quite evident that red wine samples have higher volatile acidity than their white wine counterparts.
Also, we can see an overall decrease in volatile acidity with higher quality for red wine samples, but not so much
for white wine samples.

These code snippets and examples should give you some good frameworks and blueprints to perform effective exploratory data analysis on your datasets in the future.


## Conclusion

Through this analysis, we learned:

- How to effectively analyze relationships between wine properties and quality
- Statistical techniques for comparing wine attribute distributions
- Methods for visualizing multi-dimensional wine data relationships
- Patterns distinguishing high vs low quality wines
- Differences between red and white wine characteristics

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
