In [None]:
# Loading libraries
library(tidyverse)
library(digest)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 12)

In [None]:
# Loading the data
ww_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
rw_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
ww <- read_delim(ww_url, delim = ";", col_names = TRUE)
rw <- read_delim(rw_url, delim = ";", col_names = TRUE)

colnames(ww) <- make.names(colnames(ww))
colnames(rw) <- make.names(colnames(rw))

Below we summarize each dataset to get a preliminary sense of the variables and their ranges.

In [None]:
# Red wine summary statistics
"red wine statistics"
summary(rw)

# White wine summary statistics
"white wine statistics"
summary(ww)

Below we convert the multi-class quality score to a binary variable, where 0 represents "bad" (below-average) wines and 1 represents "good" (above-average) wines. We do this deliberately to achieve better class balance, since quality is approximately normally distributed and centered between 5 and 6.

We split the classes at the mean quality score, which we calculate separately for each wine dataset.
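
The thresholding rule can be illustrated on a small made-up vector of scores. The values below are hypothetical and are not taken from either dataset; the sketch only mirrors the "below the mean maps to 0, otherwise 1" rule and then checks the resulting class balance.

```r
# Hypothetical quality scores, for illustration only
toy_quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8)

# Threshold at the sample mean: below average -> 0, otherwise -> 1
toy_avg <- mean(toy_quality)
toy_binary <- factor(ifelse(toy_quality < toy_avg, 0, 1), levels = c(0, 1))

# Class counts show how balanced the resulting split is
toy_counts <- table(toy_binary)
toy_counts
```

Because the scores are integers and the mean usually is not, no observation sits exactly on the threshold in a case like this one.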

In [None]:
options(repr.plot.height = 15, repr.plot.width = 15)

# Calculating the wine quality averages
rw_avg <- mean(rw$quality)
ww_avg <- mean(ww$quality)

# Converting quality to binary scores as factors
rw_bin <- rw %>%
    mutate(binary.quality = as_factor(if_else(quality < rw_avg, 0,1)))

ww_bin <- ww %>%
    mutate(binary.quality = as_factor(if_else(quality < ww_avg, 0,1)))

# Converting quality to binary scores without factorizing
rw_bin_non_fct <- rw %>%
    mutate(binary.quality = if_else(quality < rw_avg, 0,1))

ww_bin_non_fct <- ww %>%
    mutate(binary.quality = if_else(quality < ww_avg, 0,1))

We use ggpairs() to create a summary of the variables and their respective distributions. We also use cor() to calculate the correlation between each predictor variable and the response variable (binary.quality).

We use the information presented below to decide which dataset offers better prospects for a fruitful analysis.
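
As a minimal sketch of the correlation step, the example below uses a hypothetical toy data frame (not the wine data) and base R only. It shows that cor() returns a square matrix, from which we keep the response column and rank the predictors by absolute correlation. With a 0/1 response, the Pearson correlation computed here is the point-biserial correlation.

```r
# Hypothetical data: two predictors and a 0/1 response (illustration only)
toy <- data.frame(
    x1 = c(1, 2, 3, 4, 5, 6),
    x2 = c(2, 1, 4, 3, 6, 5),
    y  = c(0, 0, 0, 1, 1, 1)
)

# cor() returns a square matrix; keep the column for the response
corr_mat <- cor(toy)
corr_y <- corr_mat[, "y"]

# Drop the response's correlation with itself, then rank by absolute strength
corr_y <- corr_y[names(corr_y) != "y"]
ranked <- corr_y[order(abs(corr_y), decreasing = TRUE)]
ranked
```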

In [None]:
# Red wine plot and Pearson correlations
rw_plot <- ggpairs(rw_bin)

rw_corr <- rw_bin_non_fct %>%
    cor() %>% # cor() returns a matrix; convert it to a data frame so we can use dplyr verbs
    as.data.frame() %>% 
    select(binary.quality) %>%
    arrange(desc(abs(binary.quality))) %>%
    tail(-2) # drop the top two rows: binary.quality itself and the original quality score

rw_plot
rw_corr

In [None]:
# White wine plot and Pearson correlations
ww_plot <- ggpairs(ww_bin)

ww_corr <- ww_bin_non_fct %>%
    cor() %>%
    as.data.frame() %>%
    select(binary.quality) %>%
    arrange(desc(abs(binary.quality))) %>%
    tail(-2)

ww_plot
ww_corr

At first glance, the quality of a red wine seems to depend more strongly on its physicochemical composition than the quality of a white wine does. We combine the two correlation tables below to compare them directly.

In [None]:
total_corr <- merge(ww_corr, rw_corr, by = 'row.names', all = TRUE) %>%
    rename("variables" = Row.names, "white.wine" = binary.quality.x, "red.wine" = binary.quality.y) %>%
    mutate("stronger.correl" = if_else(abs(white.wine) > abs(red.wine), "white", "red")) # compare magnitudes, not signed values

total_corr

Confirming our intuition, the quality of a red wine appears to depend more on its physicochemical composition than that of a white wine: of the 10 variables, red wine has the stronger correlation for 6.
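
When judging which correlation is "stronger", the sign should be ignored, since a correlation of -0.30 indicates a stronger relationship than one of -0.10. The sketch below illustrates the comparison and the tally on hypothetical correlation values; these are made up for illustration and are not the notebook's results.

```r
# Hypothetical correlations with the binary response (not the notebook's results)
toy_corr <- data.frame(
    variables  = c("a", "b", "c", "d"),
    white.wine = c(0.30, -0.10, 0.05, -0.25),
    red.wine   = c(0.40, -0.30, -0.02, 0.10)
)

# Strength is the magnitude of the correlation, so compare absolute values
toy_corr$stronger.correl <- ifelse(
    abs(toy_corr$white.wine) > abs(toy_corr$red.wine), "white", "red"
)

# Count how many variables each wine type has the stronger correlation for
toy_tally <- table(toy_corr$stronger.correl)
toy_tally
```

For variable "b", a signed comparison would pick white (-0.10 > -0.30), whereas the absolute comparison picks red.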

Therefore, we use the red wine dataset for all subsequent analysis. We now visualize the distributions of the five predictors most strongly correlated with red wine quality.

In [None]:
options(repr.plot.height = 7, repr.plot.width = 7)

rw_strongest <- total_corr %>%
    select(variables, red.wine) %>%
    arrange(desc(abs(red.wine)))

rw_strongest

# Alcohol density plot
alcohol_dens <- rw_bin %>%
    ggplot(aes(x = alcohol, group = binary.quality, fill = binary.quality)) +
    geom_density(adjust = 1.5, alpha = 0.4) +
    labs(x = "Alcohol Level in Wine", y = "Density of Observations", fill = "Wine Quality", title = "Density of Alcohol Observations by Wine Quality")

# Volatile acidity density plot
volatile_acidity_dens <- rw_bin %>%
    ggplot(aes(x = volatile.acidity, group = binary.quality, fill = binary.quality)) +
    geom_density(adjust = 1.5, alpha = 0.4) +
    labs(x = "Volatile Acidity Levels in Wine", y = "Density of Observations", fill = "Wine Quality", title = "Density of Volatile Acidity Observations by Wine Quality")

# Total sulfur dioxide density plot
total_sulfur_dioxide_dens <- rw_bin %>%
    ggplot(aes(x = total.sulfur.dioxide, group = binary.quality, fill = binary.quality)) +
    geom_density(adjust = 1.5, alpha = 0.4) +
    labs(x = "Total Sulfur Dioxide Level in Wine", y = "Density of Observations", fill = "Wine Quality", title = "Density of Total Sulfur Dioxide by Wine Quality")

# Sulphates density plot
sulphates_dens <- rw_bin %>%
    ggplot(aes(x = sulphates, group = binary.quality, fill = binary.quality)) +
    geom_density(adjust = 1.5, alpha = 0.4) +
    labs(x = "Sulphate Level in Wine", y = "Density of Observations", fill = "Wine Quality", title = "Density of Sulphate Observations by Wine Quality")

# Citric acid density plot
citric_acid_dens <- rw_bin %>%
    ggplot(aes(x = citric.acid, group = binary.quality, fill = binary.quality)) +
    geom_density(adjust = 1.5, alpha = 0.4) +
    labs(x = "Citric Acid Level in Wine", y = "Density of Observations", fill = "Wine Quality", title = "Density of Citric Acid Observations by Wine Quality")

alcohol_dens
volatile_acidity_dens
total_sulfur_dioxide_dens
sulphates_dens
citric_acid_dens

As the plots above show, alcohol level has the strongest positive correlation with the quality of a red wine. The other four variables also show moderate correlations, so we must decide which predictors to include in order to obtain the highest accuracy. This process is called feature selection, and we will use a forward selection procedure.
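
To make the idea concrete, here is a minimal sketch of forward selection in base R. Everything in it is an assumption for illustration: the data are simulated (`x1`, `x2`, `x3` and `y` are hypothetical names), the classifier is a logistic regression fitted with glm(), and accuracy is estimated on a single holdout set. It is not the model or the evaluation scheme used in the rest of this analysis.

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated data (hypothetical): y depends on x1 and x2; x3 is pure noise
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- as.integer(2 * sim$x1 - 1.5 * sim$x2 + rnorm(n) > 0)

# Hold out a quarter of the rows to estimate accuracy
test_idx <- sample(n, n / 4)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Accuracy of a logistic regression fitted on `train` and scored on `test`
holdout_accuracy <- function(predictors) {
    fit <- glm(reformulate(predictors, response = "y"),
               data = train, family = binomial)
    pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
    mean(pred == test$y)
}

# Forward selection: at each step, add the predictor that gives the best accuracy
candidates <- c("x1", "x2", "x3")
selected <- character(0)
history <- data.frame(size = integer(0), added = character(0), accuracy = numeric(0))

while (length(candidates) > 0) {
    scores <- sapply(candidates, function(p) holdout_accuracy(c(selected, p)))
    best <- names(which.max(scores))
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
    history <- rbind(history,
                     data.frame(size = length(selected), added = best, accuracy = max(scores)))
}

history
```

The `history` table records the best accuracy at each model size; one would typically pick the smallest model after which accuracy stops improving meaningfully. In practice, cross-validation on the training data is a more reliable estimate than a single holdout split.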

In [None]:
options(repr.plot.width = 12, repr.plot.height = 12)
set.seed(6238)

In [None]:
# Subset the "good" red wines (binary.quality == 1) and count them
rw_good <- rw_bin %>%
    filter(binary.quality == 1)
nrow(rw_good)