Wine is an alcoholic drink, typically made from fermented grapes. Differences in the raw materials and the production process can change a wine's physicochemical composition and its sensory properties. We explore the question: based on physicochemical data, can we classify whether a wine's quality is above or below average? We use the “Wine Quality Data Set”, which records red and white vinho verde wine samples from the north of Portugal; the analysis below uses the white wine samples. Each row holds the test results for one wine sample, including physicochemical variables (e.g. fixed acidity, volatile acidity, residual sugar) and a quality score. Our goal is to find the relationship between the physicochemical measurements and the quality of the wine.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
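# Loading the data
# The file is semicolon-delimited but uses "." as the decimal mark, so read_delim()
# is used here; read_csv2() assumes "," as the decimal mark and can misparse the numbers.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
white_wine <- read_delim(file = url, delim = ";", col_names = TRUE)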
white_wine

In [None]:
## Data Cleaning ##
colnames(white_wine) <- make.names(colnames(white_wine))

# Make sure these columns are numeric
white_wine <- white_wine %>%
    mutate(across(c(chlorides, volatile.acidity, citric.acid, residual.sugar, density, sulphates),
                  ~ as.numeric(as.character(.x))))

# Extracting the columns we want
whitewine <- white_wine %>%
    select(volatile.acidity, sulphates, pH, total.sulfur.dioxide, alcohol, chlorides, quality)
whitewine

# Finding the mean quality of white wine
avg_quality <- mean(whitewine$quality)
avg_quality

# Making the quality binary (0 = below average, 1 = at or above average)
white_wine_binary_temp <- whitewine %>%
    mutate(binary_quality = if_else(quality < avg_quality, 0, 1))

white_wine_binary_temp

# The two classes are not equally represented, so we downsample class 1
# to the size of class 0 to get an equal number of 0 and 1 observations
set.seed(2020)  # arbitrary seed so the random subsample is reproducible
wine1 <- filter(white_wine_binary_temp, binary_quality == 1)
nrow(wine1)
wine0 <- filter(white_wine_binary_temp, binary_quality == 0)
nrow(wine0)

# Use the size of class 0 instead of a hard-coded row count
wine1_subset <- sample_n(wine1, nrow(wine0))
nrow(wine1_subset)

white_wine_binary <- bind_rows(wine1_subset, wine0)
nrow(white_wine_binary)

In [None]:
# Visualizing the original and the binary quality distributions
white_wine_plot <- ggplot(white_wine, aes(x = quality)) +
    geom_histogram(binwidth = 0.5) +
    xlab("Quality score")
white_wine_plot

white_wine_plot_binary <- ggplot(white_wine_binary, aes(x = binary_quality)) +
    geom_histogram(binwidth = 0.5) +
    xlab("Quality (0 or 1)")
white_wine_plot_binary

In [None]:
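# Splitting the data into training and testing sets, stratified on the class label
set.seed(2020)  # arbitrary seed so the split is reproducible
whitewine_split <- initial_split(white_wine_binary, prop = 0.74, strata = binary_quality)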
whitewine_train <- training(whitewine_split)
whitewine_test <- testing(whitewine_split)

In [None]:
## Creating summary tables and visualizations

summary(whitewine_train)
whitewine_train_summary <- do.call(cbind, lapply(whitewine_train, summary))

whitewine_train_summary

chlorides.plot <- ggplot(whitewine_train, aes(x = binary_quality, y = chlorides)) +
    geom_point() +
    xlab("Quality (0 or 1)") +
    ylab("Chlorides (g sodium chloride / dm^3)")
chlorides.plot

Classification is the process of predicting a categorical label of a data object based on its features and properties. In classification, we locate identifiers or boundary conditions that correspond to a particular label or category.

We first created 11 scatter plots, each plotting the target variable, binary quality, against one predictor variable, such as pH. From these plots, we identified six variables that appear to have a notable influence on the target variable: volatile acidity, total sulfur dioxide, sulphates, alcohol, chlorides, and pH.
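
The scatter plots described above could be produced in one step by reshaping the data and faceting by predictor. The following is a minimal sketch, not the code used for the original plots: the helper name `plot_predictors` is hypothetical, and `toy` is a small simulated data frame standing in for `whitewine_train`.

```r
library(tidyverse)

# Hypothetical helper: one scatter-plot panel per predictor, each plotted
# against the 0/1 class label. Assumes a column named `binary_quality`;
# every other column is treated as a predictor.
plot_predictors <- function(data) {
  data %>%
    pivot_longer(cols = -binary_quality,
                 names_to = "predictor",
                 values_to = "value") %>%
    ggplot(aes(x = binary_quality, y = value)) +
    geom_point(alpha = 0.3) +
    facet_wrap(~ predictor, scales = "free_y") +
    xlab("Quality (0 or 1)") +
    ylab("Predictor value")
}

# Simulated stand-in data (not taken from the wine data set)
set.seed(1)
toy <- tibble(
  alcohol = c(rnorm(50, 9.5, 0.5), rnorm(50, 12, 0.5)),
  pH = rnorm(100, 3.2, 0.15),
  binary_quality = rep(c(0, 1), each = 50)
)

predictor_plot <- plot_predictors(toy)
predictor_plot
```

On the real data, the raw `quality` column would need to be dropped before calling the helper, since otherwise it would appear as one of the panels.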

For our classification, we will use the K-nearest neighbors (KNN) classification model. We will train it on the training set and then evaluate its predictions of wine quality on the testing set. To visualize the data, we will plot each pair of predictor variables on the x and y axes and color the points by the target variable, binary quality. We can also compare classifiers, for example across different values of K, to find the one with the highest accuracy.
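
The KNN workflow described above might look roughly like the sketch below. It is an illustration under several assumptions rather than the final analysis: it assumes the `kknn` package is installed as the engine, uses a placeholder K of 5, and runs on a small simulated data frame (`toy`) standing in for `whitewine_train` and `whitewine_test`. The helper name `fit_knn` is hypothetical.

```r
library(tidymodels)

# Hypothetical helper: fit a KNN classifier on `train` and report its
# accuracy on `test`. Assumes a 0/1 column named `binary_quality`;
# every other column is used as a predictor.
fit_knn <- function(train, test, k = 5) {
  # Classification models in tidymodels expect a factor outcome
  train <- mutate(train, binary_quality = factor(binary_quality, levels = c(0, 1)))
  test  <- mutate(test,  binary_quality = factor(binary_quality, levels = c(0, 1)))

  # KNN is distance-based, so the predictors are standardized first
  knn_recipe <- recipe(binary_quality ~ ., data = train) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) %>%
    set_engine("kknn") %>%
    set_mode("classification")

  knn_fit <- workflow() %>%
    add_recipe(knn_recipe) %>%
    add_model(knn_spec) %>%
    fit(data = train)

  # Predict on the held-out data and compute accuracy
  predict(knn_fit, test) %>%
    bind_cols(test) %>%
    accuracy(truth = binary_quality, estimate = .pred_class)
}

# Simulated stand-in data (not taken from the wine data set)
set.seed(1)
toy <- tibble(
  alcohol = c(rnorm(50, 9.5, 0.5), rnorm(50, 12, 0.5)),
  volatile.acidity = c(rnorm(50, 0.35, 0.05), rnorm(50, 0.25, 0.05)),
  binary_quality = rep(c(0, 1), each = 50)
)
toy_split <- initial_split(toy, prop = 0.75, strata = binary_quality)
toy_accuracy <- fit_knn(training(toy_split), testing(toy_split), k = 5)
toy_accuracy
```

When applying something like this to the wine data, the raw `quality` column should be removed from both sets first: the class label is derived from it, so keeping it as a predictor would leak the answer into the model.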

We expect that higher alcohol, lower volatile acidity, and lower chlorides lead to higher wine quality. It is hard to tell from the scatter plots how the other three variables affect quality, so further analysis is needed to establish whether they matter.
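
One simple way to check expectations like these is to compare the mean of each predictor between the two classes. The sketch below shows the idea on simulated data (`toy`); the values are made up for illustration and say nothing about the actual wine data.

```r
library(tidyverse)

# Simulated stand-in data (not taken from the wine data set)
set.seed(2)
toy <- tibble(
  alcohol = c(rnorm(50, 9.5, 0.5), rnorm(50, 12, 0.5)),
  chlorides = c(rnorm(50, 0.05, 0.01), rnorm(50, 0.04, 0.01)),
  binary_quality = rep(c(0, 1), each = 50)
)

# Mean of every predictor within each quality class
class_means <- toy %>%
  group_by(binary_quality) %>%
  summarise(across(everything(), mean))
class_means
```

A predictor whose class means differ clearly relative to its spread is a more promising candidate than one whose means are nearly identical.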

By predicting wine quality from the above variables, we can give wineries a clearer indicator of what is considered a good-quality wine. Our findings could help wineries reflect on their production process.

Areas for further analysis include whether variables not included in this dataset, such as final sale price, could improve our prediction; whether a better combination of variables exists; and whether the model transfers to red wines as well.