# Vinho Verde: Wine Quality Analysis

## Introduction

Wine is an alcoholic beverage made by fermenting grape juice, and many factors determine its final quality. Volatile acidity (VA) is mainly a measure of the acetic acid in wine; if the VA rises above a certain threshold, the wine is considered spoiled and tastes like vinegar. Citric acid, on the other hand, is used to increase acidity and enhance the flavour of the wine. Residual sugar is the sugar left over from the grapes after fermentation; higher sugar levels make the wine sweeter. While sugar controls sweetness, chlorides contribute to saltiness: more sodium chloride makes a saltier beverage. pH is also essential to wine quality: a low pH makes the wine taste crisp and tart, whereas a high pH leaves it susceptible to bacterial growth. In addition, a higher sulphate content lowers the quality of the wine by making it taste duller.
 
**How do different variables change the overall quality of the wine? Are certain variables more impactful and important than others? Do certain variables have a threshold after which the quality sharply decreases?**

The goal of this project is to answer these questions by making scatter plots that compare the wines at both the individual and the group level. We will also classify the wines in order to better visualize their quality. Alcohol content affects a wine's flavour, texture, and overall quality as well, so it must be considered together with the factors above when answering our questions.

The dataset was collected by Paulo Cortez of the University of Minho, Portugal, together with A. Cerdeira, F. Almeida, T. Matos, and J. Reis of the Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal. It describes Vinho Verde wines produced in northern Portugal. The dataset (from 2009) records only the physicochemical properties of the wine, such as acidity, pH, sulphates, and residual sugar; it omits grape type, wine brand, selling price, and other factors that could come down to each person's subjective preference.

**Citation:**

P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009

## Preliminary exploratory data analysis

In [None]:
library(tidyverse)
library(digest)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

We use the `read_delim()` function with `delim = ";"` to read the file from the web. The file separates fields with semicolons but writes decimals with a period, whereas `read_csv2()` expects a comma as the decimal mark and can drop the decimal points from values such as pH and alcohol.

In [None]:
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_red <- read_delim(url, delim = ";")

We clean the column names to remove spaces using `make.names()`, then remove the columns we don't wish to use with `select()`. Every column is already parsed as a double, so no type conversion is needed.

In [3]:
colnames(wine_red) <- make.names(colnames(wine_red))
wine_red <- wine_red |>
            select(-fixed.acidity, -free.sulfur.dioxide, -total.sulfur.dioxide, -density)

wine_red

| volatile.acidity | citric.acid | residual.sugar | chlorides | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.70 | 0.00 | 1.9 | 0.076 | 3.51 | 0.56 | 9.4 | 5 |
| 0.88 | 0.00 | 2.6 | 0.098 | 3.20 | 0.68 | 9.8 | 5 |
| 0.76 | 0.04 | 2.3 | 0.092 | 3.26 | 0.65 | 9.8 | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0.510 | 0.13 | 2.3 | 0.076 | 3.42 | 0.75 | 11.0 | 6 |
| 0.645 | 0.12 | 2.0 | 0.075 | 3.57 | 0.71 | 10.2 | 5 |
| 0.310 | 0.47 | 3.6 | 0.067 | 3.39 | 0.66 | 11.0 | 6 |

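The effect of the reader choice can be checked on a small inline sample before downloading anything. The sketch below is only an illustration: `sample_text` is a hypothetical three-column excerpt written in the same layout as the file (`;` between fields, `.` as the decimal mark), and both readers are applied to it so the parsed values can be compared by eye.

```r
library(readr)

# Hypothetical excerpt in the file's layout: `;` separates fields, `.` is the decimal mark.
sample_text <- "pH;alcohol;quality\n3.51;9.4;5\n3.20;9.8;5\n"

# read_delim() with delim = ";" keeps the default locale, where `.` is the decimal mark.
parsed <- read_delim(I(sample_text), delim = ";", show_col_types = FALSE)
parsed

# read_csv2() assumes `,` as the decimal mark and `.` as a grouping mark,
# so the same text may come back with its decimal points lost.
mangled <- read_csv2(I(sample_text), show_col_types = FALSE)
mangled
```

Comparing `parsed$pH` with `mangled$pH` is a quick sanity check: pH values for wine should fall roughly between 3 and 4.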

We split the dataset into training (75%) and testing (25%) sets, stratified by `quality`.

In [4]:
wine_split <- initial_split(wine_red, prop = 0.75, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

wine_train

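| volatile.acidity | citric.acid | residual.sugar | chlorides | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.70 | 0.00 | 1.9 | 0.076 | 3.51 | 0.56 | 9.4 | 5 |
| 0.76 | 0.04 | 2.3 | 0.092 | 3.26 | 0.65 | 9.8 | 5 |
| 0.70 | 0.00 | 1.9 | 0.076 | 3.51 | 0.56 | 9.4 | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0.25 | 0.29 | 2.2 | 0.054 | 3.40 | 0.76 | 10.9 | 7 |
| 0.37 | 0.43 | 2.3 | 0.063 | 3.17 | 0.81 | 11.2 | 7 |
| 0.32 | 0.44 | 2.4 | 0.061 | 3.29 | 0.80 | 11.6 | 7 |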

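Because `initial_split()` samples rows at random, the rows that land in the training set change from run to run unless a seed is set first. The sketch below shows one way to make a stratified split repeatable and to check the resulting proportions. It runs on a synthetic stand-in (`toy_wine`, with invented values), and the seed value is arbitrary; neither comes from the original analysis.

```r
library(tidymodels)

set.seed(2023)  # arbitrary seed, chosen only so the split is repeatable

# Synthetic stand-in for the wine data; values are invented for illustration.
toy_wine <- tibble(
  alcohol = runif(400, min = 8.4, max = 14.9),
  quality = sample(3:8, size = 400, replace = TRUE)
)

# Same call pattern as the split above: 75% training, stratified by quality.
toy_split <- initial_split(toy_wine, prop = 0.75, strata = quality)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of each quality score within each set; the two sets should look similar.
split_shares <- bind_rows(train = toy_train, test = toy_test, .id = "set") |>
  count(set, quality) |>
  group_by(set) |>
  mutate(share = n / sum(n)) |>
  ungroup()
split_shares
```

Calling `set.seed()` once near the top of the notebook would have the same effect for every random step that follows.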

We count the `NA` values in each column of the training set:

In [30]:
apply(X = is.na(wine_train), MARGIN = 2, FUN = sum)

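The same count can also be written in tidyverse style, which returns a one-row tibble rather than a named vector. This is only a sketch on a toy tibble (`toy`, with one value deliberately set to `NA`); with the real data, `wine_train` would take its place.

```r
library(tidyverse)

# Toy tibble with one deliberately missing value, for illustration only.
toy <- tibble(
  pH      = c(3.51, NA, 3.26),
  alcohol = c(9.4, 9.8, 9.8)
)

# Count the NA values in every column.
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts
```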
## Method
We will conduct our data analysis by creating scatter plots and assessing individually how each variable relates to the quality of the wine. We remove four columns that we expect to contribute little to the analysis: fixed acidity, density, and free and total sulfur dioxide. Fixed acidity refers to the wine's non-volatile acids rather than to a constant value, but we expect it, like density and the two sulfur dioxide measures, to have little noticeable impact on the quality of the wine. The remaining seven predictors are the ones we expect to affect quality, so it is important to see how their variability changes the wine and which one has the biggest impact for the smallest change.
One way we will visualize the result is with scatter plots. On the y-axis we will have the quality of the wine on a 0 to 10 scale, and on the x-axis the variable it is being compared with, e.g. residual sugar, sulphates, or volatile acidity. This gives us one plot per variable, showing how each one relates to the overall quality of the wine.

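The plot described above (quality on the y-axis, one physicochemical variable on the x-axis) could be sketched with ggplot2 as a faceted scatter plot, one panel per variable. The code below is a minimal illustration under stated assumptions: `toy_wine` is a small invented stand-in for `wine_train`, and the jitter and transparency settings are our own choices to reduce overplotting of the integer quality scores.

```r
library(tidyverse)

# Toy stand-in for `wine_train`; values are invented for illustration only.
toy_wine <- tibble(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.66, 0.31),
  alcohol          = c(9.4, 9.8, 9.8, 11.0, 10.2, 11.6),
  quality          = c(5, 5, 5, 6, 5, 7)
)

# Reshape so each row is one (wine, variable) pair, which lets us facet by variable.
toy_long <- toy_wine |>
  pivot_longer(cols = -quality, names_to = "variable", values_to = "value")

# One panel per variable; each panel gets its own x-axis range.
quality_plot <- ggplot(toy_long, aes(x = value, y = quality)) +
  geom_jitter(alpha = 0.4, width = 0, height = 0.15) +
  facet_wrap(vars(variable), scales = "free_x") +
  labs(x = "Measured value", y = "Quality score")

quality_plot
```

Faceting keeps all variables in a single figure, which makes it easier to compare how strongly each one relates to quality.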
## Expected Outcomes and Significance
1. **What do we expect to find?**

We expect to find which variables have the biggest impact on wine quality and whether certain variables are simply more impactful than others. It would also be interesting to see whether a variable has a "breaking point": a threshold beyond which quality drops sharply, for example if quality falls off steeply once volatile acidity passes 0.4. Lastly, we would like to find the ideal level of each variable for achieving the perfect (or closest to it) Vinho Verde.

2. **What impact could such findings have?**

Determining wine quality can promote wine testing and ensure safety not only within the wine sector but across the food and beverage industry as a whole. Being able to determine the quality of a commodity can inform consumers and motivate sellers to consistently improve their practices and uphold a certain standard of quality. Furthermore, the quality of a wine can serve as a reflection of its contents, which provides important insight into whether the wine is safe to drink and follows food health and safety guidelines.

3. **What future questions could this lead to?**

Knowing which variables are most impactful, and which ones need the most attention, can be an important tool for winemakers perfecting their craft. If they know that pH is a particularly sensitive variable with which quality shifts sharply, they can focus on refining the winemaking process to keep pH stable throughout. This also works the other way: knowing that certain variables can be compromised on (e.g. citric acid), winemakers can worry less about keeping citric acid perfectly stable and instead favour a process that raises or lowers it in order to tune the pH and increase the quality of their wine.
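
The two questions running through this section, which variables matter most and whether any of them has a threshold, could be explored numerically as well as visually. The sketch below is one possible approach, not the project's settled method: it ranks predictors by the absolute Spearman correlation with quality, then bins one predictor and compares mean quality per bin. The data are synthetic (`toy`), built so that quality drops once a hypothetical volatile acidity value passes 0.4; the seed and bin edges are arbitrary.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed so the synthetic data are repeatable

# Synthetic data: quality is about 7 below a volatile acidity of 0.4 and about 5 above it.
toy <- tibble(
  volatile.acidity = runif(300, min = 0.1, max = 1.0),
  sulphates        = runif(300, min = 0.3, max = 1.2)
) |>
  mutate(quality = if_else(volatile.acidity > 0.4, 5, 7) + round(rnorm(300, sd = 0.5)))

# Rank predictors by the strength of their monotonic association with quality.
ranking <- toy |>
  summarise(across(-quality, ~ cor(.x, quality, method = "spearman"))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "rho") |>
  arrange(desc(abs(rho)))
ranking

# Look for a threshold: mean quality within fixed-width bins of one predictor.
binned <- toy |>
  mutate(bin = cut(volatile.acidity, breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0))) |>
  group_by(bin) |>
  summarise(mean_quality = mean(quality), n = n())
binned
```

A sudden step in `mean_quality` between adjacent bins would hint at a breaking point, while a gradual slope would suggest a smoother relationship. Correlation only captures monotonic trends, so a variable with an ideal middle value could rank low here and still matter.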
