# Vinho Verde: Wine Quality Analysis

## Introduction

Wine is an alcoholic beverage made by fermenting grape juice, and many factors determine its final quality. Volatile acidity (VA) is chiefly a measure of the acetic acid in wine; if the VA is above a certain threshold, the wine is considered spoiled and tastes like vinegar. Citric acid, on the other hand, is used to increase acidity and enhance the flavour of the wine. Residual sugar is the sugar left over from the grapes after fermentation; higher sugar levels make the wine sweeter. While sugar controls sweetness, chlorides contribute to saltiness: more sodium chloride makes a saltier beverage. pH is essential to wine quality: a low pH makes the wine taste crisp and tart, whereas a high pH leaves it susceptible to bacterial growth. Higher sulphate content can also lower quality by making the wine taste duller. Finally, alcohol content affects a wine's flavour, texture, and overall quality.

**How do different variables change the overall quality of the wine? Are certain variables more impactful and important than others? Do certain variables have a threshold after which the quality sharply decreases?**

The goal of the project is to answer these three questions by drawing scatter plots that compare the wines at both the individual and the group level. We will also classify the wines in order to better visualize their quality. All of the factors described above must be considered when answering our questions.

The dataset we use was collected by Paulo Cortez of the University of Minho, Portugal, together with A. Cerdeira, F. Almeida, T. Matos, and J. Reis of the Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal. It describes Vinho Verde wines produced in northern Portugal. The dataset (from 2009) records only the physicochemical properties of the wine, such as acidity, pH, sulphates, and residual sugar; it ignores grape type, wine brand, selling price, and other attributes that come down to each person's subjective preference.

**Citation:**

P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

## Preliminary exploratory data analysis

In [116]:
library(tidyverse)
library(digest)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

We use the `read_csv2()` function to read the file from the web. As the message below reports, `read_csv2()` treats `,` as the decimal mark and `.` as the grouping mark, whereas this file uses `.` as the decimal mark. Values in the columns parsed as numbers therefore lose their decimal point (7.4 is read as 74), and the outputs that follow reflect that parsing; `read_delim(url, delim = ";")` reads the same file with `.` as the decimal mark.

In [117]:
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_red <- read_csv2(url)

ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.

“One or more parsing issues, see `problems()` for details”
Rows: 1599 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (5): volatile acidity, citric acid, chlorides, density, sulphates
dbl (2): total sulfur dioxide, quality

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

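Because the source file separates fields with `;` but writes decimals with `.`, one option is `read_delim()` with an explicit delimiter. The sketch below is a minimal illustration on two inline rows laid out like the UCI file; the inline text and the name `wine_sample` are made up for the example, and this is not the loading code behind the outputs in this report.

```r
library(readr)

# Two made-up rows in the same layout as the UCI file:
# ";" as the field delimiter and "." as the decimal mark.
sample_text <- I('"fixed acidity";"volatile acidity";"quality"\n7.4;0.7;5\n11.2;0.28;6\n')

# With delim = ";" and the default locale, "." is read as the decimal mark.
wine_sample <- read_delim(sample_text, delim = ";", show_col_types = FALSE)

# The decimals are preserved (7.4 stays 7.4 rather than becoming 74).
wine_sample$`fixed acidity`
```

Applied to the real file, the equivalent call would be `read_delim(url, delim = ";")`, after which the summaries would need to be recomputed.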

We clean the column names with `make.names()`, which replaces the spaces with dots, then convert the predictor columns to `dbl` using `mutate()` with `as.double()`, and finally drop the columns we don't wish to use with `select()`.

In [118]:
colnames(wine_red) <- make.names(colnames(wine_red))

wine_red <- wine_red |>
            mutate(across(fixed.acidity:sulphates, as.double)) |>
            select(-free.sulfur.dioxide, -pH, -alcohol, -density)

wine_red

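| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | total.sulfur.dioxide | sulphates | quality |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 74 | 0.70 | 0.00 | 19 | 0.076 | 34 | 0.56 | 5 |
| 78 | 0.88 | 0.00 | 26 | 0.098 | 67 | 0.68 | 5 |
| 78 | 0.76 | 0.04 | 23 | 0.092 | 54 | 0.65 | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 63 | 0.510 | 0.13 | 23 | 0.076 | 40 | 0.75 | 6 |
| 59 | 0.645 | 0.12 | 2 | 0.075 | 44 | 0.71 | 5 |
| 6 | 0.310 | 0.47 | 36 | 0.067 | 42 | 0.66 | 6 |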


We split the dataset into a training set (75% of the rows) and a testing set (the remaining 25%), stratifying by `quality`.

In [119]:
wine_split <- initial_split(wine_red, prop = 0.75, strata = quality)
wine_train <- training(wine_split)
wine_test <- testing(wine_split)

wine_train

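| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | total.sulfur.dioxide | sulphates | quality |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 74 | 0.70 | 0.00 | 19 | 0.076 | 34 | 0.56 | 5 |
| 78 | 0.88 | 0.00 | 26 | 0.098 | 67 | 0.68 | 5 |
| 78 | 0.76 | 0.04 | 23 | 0.092 | 54 | 0.65 | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 84 | 0.37 | 0.43 | 23 | 0.063 | 19 | 0.81 | 7 |
| 74 | 0.36 | 0.30 | 18 | 0.074 | 24 | 0.70 | 8 |
| 7 | 0.56 | 0.17 | 17 | 0.065 | 24 | 0.68 | 7 |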
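`initial_split()` samples rows at random, so the training set changes from run to run unless a seed is set. A minimal sketch of a reproducible split follows, assuming an arbitrary seed value and a small made-up data frame; `toy` and `split_with_seed()` are hypothetical names introduced for illustration.

```r
library(rsample)

# Made-up data standing in for wine_red: 200 rows with quality scores 3 to 8.
toy <- data.frame(x = seq_len(200), quality = rep(3:8, length.out = 200))

# Hypothetical helper: fix the random seed, then make the stratified 75/25 split.
split_with_seed <- function(data, seed) {
  set.seed(seed)
  initial_split(data, prop = 0.75, strata = quality)
}

first_train  <- training(split_with_seed(toy, seed = 1234))  # 1234 is arbitrary
second_train <- training(split_with_seed(toy, seed = 1234))

# With the same seed, both calls select the same training rows.
identical(first_train, second_train)
```

In the notebook itself, the same effect could be had by calling `set.seed()` once in the cell that creates `wine_split`.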


**1.**
Checking how many NA values are present in each column using the `apply()` function. 

In [120]:
apply(X = is.na(wine_train), MARGIN = 2, FUN = sum)

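The training set has 1198 rows, so any column with an NA count of 0 has 1198 complete observations. The `total.sulfur.dioxide` entries are missing from the mean and range summaries that follow, which suggests that this column contains at least one NA, in line with the parsing issues reported when the file was read.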
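As a small illustration of the NA check, the sketch below counts missing values per column on a made-up data frame (`toy` is hypothetical). `colSums(is.na(...))` is used here as a shorter equivalent of the `apply()` call above.

```r
# Made-up data frame with one missing value in column `a`.
toy <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

# Count NA values per column; same result as apply(is.na(toy), 2, sum).
na_counts <- colSums(is.na(toy))
na_counts

# The number of rows is a property of the data frame, not of the NA counts.
nrow(toy)

# Rows with no missing value in any column.
sum(complete.cases(toy))
```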

**2.**
Summarizing the means and ranges of all variables using `summarize()` and `map_df()`.

In [122]:
# mean_summary <- summarize(wine_train, across(fixed.acidity:sulphates, mean))
mean_summary <- wine_train |>
                select(-quality) |>
                map_df(mean)

range_summary <- summarize(wine_train, across(fixed.acidity:sulphates, range))

mean_summary
range_summary

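| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | total.sulfur.dioxide | sulphates |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 75.16361 | 0.5278631 | 0.2723289 | 26.18114 | 0.08748998 | NA | 0.6563523 |


| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | total.sulfur.dioxide | sulphates |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 5 | 0.12 | 0.0 | 2 | 0.012 | NA | 0.33 |
| 159 | 1.58 | 0.79 | 465 | 0.467 | NA | 1.98 |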


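The range table gives the minimum (first row) and maximum (second row) of each predictor in the training set. An entry is missing where the column contains NA values, because `mean()` and `range()` return NA unless `na.rm = TRUE` is set.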
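If missing values are present, the summaries can skip them with `na.rm = TRUE`. The sketch below computes the mean, minimum, and maximum in a single `summarize()` call on made-up data; `toy` and `toy_summary` are hypothetical names, and this is one possible approach rather than the code behind the tables above.

```r
library(dplyr)

# Made-up predictors; `sugar` has one missing value.
toy <- tibble(
  sugar     = c(1.9, 2.6, NA, 2.3),
  chlorides = c(0.076, 0.098, 0.092, 0.075)
)

# One column per predictor/statistic pair, named like sugar_mean, sugar_min, ...
toy_summary <- toy |>
  summarize(across(
    everything(),
    list(
      mean = \(x) mean(x, na.rm = TRUE),
      min  = \(x) min(x, na.rm = TRUE),
      max  = \(x) max(x, na.rm = TRUE)
    )
  ))

toy_summary
```

Naming the functions in the list keeps the minimum and maximum in separate, labelled columns instead of two unlabelled rows.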

**3.**
Finding the mean of each predictor at each level of the quality scale. This will help us check later whether our predictions of wine quality make logical sense.

In [123]:
group_by(wine_train, quality) |>
        summarize(across(fixed.acidity:sulphates, mean))

| quality | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | total.sulfur.dioxide | sulphates |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 3 | 77.42857 | 0.9242857 | 0.08 | 25.42857 | 0.122 | 24.57143 | 0.5657143 |
| 4 | 71.36842 | 0.7089474 | 0.1434211 | 26.34211 | 0.08084211 | 35.60526 | 0.5457895 |
| 5 | 73.63158 | 0.5825536 | 0.2424951 | 27.11696 | 0.09351462 | NA | 0.6225341 |
| 6 | 74.99372 | 0.4937866 | 0.2789958 | 22.94142 | 0.08484728 | 40.69247 | 0.6733473 |
| 7 | 81.41333 | 0.3964 | 0.3794 | 33.33333 | 0.07655333 | 34.46667 | 0.7414667 |
| 8 | 80.0 | 0.3858333 | 0.4641667 | 25.75 | 0.07283333 | 25.0 | 0.7641667 |


From this table, we can see that quality 4 wines, for example, have an average residual sugar value of about 26.3.

**4.**
Finding overall summary of dataset using `summary()`.

In [124]:
summary(wine_train)

 fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
 Min.   :  5.00   Min.   :0.1200   Min.   :0.0000   Min.   :  2.00  
 1st Qu.: 68.00   1st Qu.:0.3900   1st Qu.:0.1000   1st Qu.: 17.00  
 Median : 77.00   Median :0.5200   Median :0.2600   Median : 22.00  
 Mean   : 75.16   Mean   :0.5279   Mean   :0.2723   Mean   : 26.18  
 3rd Qu.: 91.00   3rd Qu.:0.6400   3rd Qu.:0.4300   3rd Qu.: 26.00  
 Max.   :159.00   Max.   :1.5800   Max.   :0.7900   Max.   :465.00  
                                                                    
   chlorides       total.sulfur.dioxide   sulphates         quality     
 Min.   :0.01200   Min.   :  6.00       Min.   :0.3300   Min.   :3.000  
 1st Qu.:0.07100   1st Qu.: 21.75       1st Qu.:0.5500   1st Qu.:5.000  
 Median :0.08000   Median : 37.00       Median :0.6200   Median :6.000  
 Mean   :0.08749   Mean   : 45.97       Mean   :0.6564   Mean   :5.636  
 3rd Qu.:0.09100   3rd Qu.: 61.00       3rd Qu.:0.7300   3rd Qu.:6.000  
 Max.   :0

## Visualization

## Method
We will conduct our data analysis by plotting each variable against quality and assessing individually how it changes the quality of the wine. We also remove four columns from the data: free sulfur dioxide, pH, alcohol, and density (see the `select()` call in the preprocessing step). Density, for instance, differs very little between wines and so is unlikely to have a noticeable impact on quality. Fixed acidity, despite its name, is not constant across wines (the name refers to the non-volatile acids, and the training summary shows it varies), so it stays in the data. The remaining seven predictors and the quality score make up the eight columns we use, and it is important to see how variability in each predictor changes the wine and which one has the biggest impact for the smallest change.
One way we will visualize the results is with scatter plots. On the y-axis we will have the quality of the wine, which is scored on a 0-10 scale, and on the x-axis the variable it is being compared with (residual sugar, sulphates, volatile acidity, and so on). This gives us one graph per variable, each showing how that variable relates to the overall quality of the wine.
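One such plot, with quality on the y-axis and a single predictor on the x-axis, might be sketched as follows. This is a minimal illustration assuming `ggplot2` (loaded with the tidyverse); `toy_wine` holds a few rows copied from the table previews above rather than the full training set, and the jitter and transparency settings are illustrative choices.

```r
library(ggplot2)

# A handful of (volatile.acidity, quality) pairs taken from the previews above.
toy_wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.51, 0.645, 0.31, 0.37, 0.36, 0.56),
  quality          = c(5, 5, 5, 6, 5, 6, 7, 8, 7)
)

# Quality takes only integer values, so a little vertical jitter
# keeps overlapping points visible.
quality_plot <- ggplot(toy_wine, aes(x = volatile.acidity, y = quality)) +
  geom_jitter(width = 0, height = 0.15, alpha = 0.6) +
  labs(x = "Volatile acidity", y = "Quality score")

quality_plot
```

Swapping `toy_wine` for `wine_train` and changing the `x` aesthetic would give the corresponding plot for each predictor.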

## Expected Outcomes and Significance
1. **What do we expect to find?**

We expect to find which variables have the biggest impact on wine quality and whether certain variables are simply more impactful than others. It would also be interesting to see whether a variable has a “breaking point”, a threshold past which quality drops sharply, for example if quality falls off steeply once volatile acidity exceeds 0.4. Lastly, we would like to identify the ideal level of each variable for achieving the perfect (or closest to it) Vinho Verde.

2. **What impact could such findings have?**

Determining wine quality can promote wine testing and ensure safety not only within the wine sector but across the food and beverage industry. Being able to determine the quality of a commodity can inform consumers and motivate sellers to consistently improve their practices and uphold a certain standard of quality. Furthermore, the quality of a wine can serve as a reflection of its contents, which provides important insight into whether the wine is safe to drink and follows food health and safety guidelines.

3. **What future questions could this lead to?**

Knowing which variables are more impactful, and which need the most attention, can be an important tool for winemakers perfecting their craft. If they know that pH is a particularly sensitive variable, one with which quality shifts sharply, they can focus on refining the winemaking process to keep pH stable throughout. This can also work the other way: knowing that certain variables (e.g. citric acid) can be compromised, winemakers can worry less about keeping citric acid stable and instead favour a process that raises or lowers its level in order to perfect the pH and increase the quality of their wine.
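The “breaking point” idea from the first expected outcome could be explored by binning a predictor and comparing mean quality across the bins. The sketch below is a minimal illustration, not a finished method: `toy` reuses a few rows from the table previews above, and the bin edges (0.2, 0.4, 0.6, 0.8, 1.0) are arbitrary assumptions.

```r
library(dplyr)

# A handful of (volatile.acidity, quality) pairs taken from the previews above.
toy <- tibble(
  volatile.acidity = c(0.31, 0.36, 0.37, 0.51, 0.56, 0.645, 0.70, 0.76, 0.88),
  quality          = c(6, 8, 7, 6, 7, 5, 5, 5, 5)
)

# Cut the predictor into intervals, then average quality within each interval.
binned <- toy |>
  mutate(va_bin = cut(volatile.acidity, breaks = c(0.2, 0.4, 0.6, 0.8, 1.0))) |>
  group_by(va_bin) |>
  summarize(n = n(), mean_quality = mean(quality))

binned
```

A sharp drop in `mean_quality` between two neighbouring bins would point to a candidate threshold, to be checked against the full training set.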
