In [2]:
library(tidyverse)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

Introduction 

Winemaking is a large global industry, with over 330 billion USD worth of wine sold in 2020. Inexpensive bottles sell for around 15 USD, while premium ones can fetch 500 USD or more. A wine's price is dictated partly by its quality, which is in turn linked to factors such as sugar content, acidity, and alcohol content, among many others. The dataset below, collected in 2009, records quality ratings for red wines alongside these measurements.

In [3]:
red<- read_csv2("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv") 
red

[36mℹ[39m Using [34m[34m"','"[34m[39m as decimal and [34m[34m"'.'"[34m[39m as grouping mark. Use `read_delim()` for more control.

“One or more parsing issues, see `problems()` for details”
[1mRows: [22m[34m1599[39m [1mColumns: [22m[34m12[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ";"
[31mchr[39m (5): volatile acidity, citric acid, chlorides, density, sulphates
[32mdbl[39m (2): total sulfur dioxide, quality

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
<dbl>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<dbl>
74,0.7,0,19,0.076,11,34,0.9978,351,0.56,94,5
78,0.88,0,26,0.098,25,67,0.9968,32,0.68,98,5
78,0.76,0.04,23,0.092,15,54,0.997,326,0.65,98,5
112,0.28,0.56,19,0.075,17,60,0.998,316,0.58,98,6
74,0.7,0,19,0.076,11,34,0.9978,351,0.56,94,5
74,0.66,0,18,0.075,13,40,0.9978,351,0.56,94,5
79,0.6,0.06,16,0.069,15,59,0.9964,33,0.46,94,5
73,0.65,0,12,0.065,15,21,0.9946,339,0.47,10,7
78,0.58,0.02,2,0.073,9,18,0.9968,336,0.57,95,7
75,0.5,0.36,61,0.071,17,102,0.9978,335,0.8,105,5
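
The import messages above hint at a parsing problem. `read_csv2()` treats the comma as the decimal mark and the period as a grouping mark, while the file appears to use periods for decimals (values such as 0.7 and 0.9978). That would explain why five columns were read as text and why the other measurements show no decimal point (74, 351, 94). The following is a minimal sketch of one alternative, `read_delim()` with an explicit `;` delimiter, run on a two-row inline sample. The sample rows, their decimal placement, and the names `sample_csv` and `parsed` are assumptions made for illustration and are not part of the notebook.

```r
library(readr)

# Hypothetical two-row sample in the same layout as the file:
# ";" separates fields and "." is assumed to be the decimal mark.
# I() tells readr that this string is literal data rather than a path.
sample_csv <- I(paste(
  "fixed acidity;volatile acidity;pH;alcohol;quality",
  "7.4;0.7;3.51;9.4;5",
  "7.8;0.88;3.2;9.8;5",
  sep = "\n"
))

# read_delim() keeps the default "." decimal mark, so every column
# in this sample is parsed as a number.
parsed <- read_delim(sample_csv, delim = ";", show_col_types = FALSE)
parsed
```

If the real file parses the same way, the `as.numeric()` conversion of citric acid would no longer be needed.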


This dataset contains many variables that play a part in determining the quality of red wine. For this model, we focus on how citric acid, residual sugar, and alcohol content affect the rating of red wine. Below, we standardize those three variables, split the data into training and testing sets, and keep only the columns we need.

In [4]:
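# Treat the rating as a categorical outcome and make column names syntactic
# (make.names() turns spaces into dots, e.g. "citric acid" -> "citric.acid").
red_data <- red |>
    mutate(quality = as_factor(quality))
colnames(red_data) <- make.names(colnames(red_data))

# citric acid was read as text, so convert it before standardizing the predictors.
red_data_scaled <- red_data |>
    mutate(citric.acid = as.numeric(citric.acid)) |>
    mutate(scaled_citric_acid = scale(citric.acid, center = TRUE),
           scaled_residual_sugar = scale(residual.sugar, center = TRUE),
           scaled_alcohol = scale(alcohol, center = TRUE))

# 75/25 split, stratified by quality so both sets have similar rating proportions.
red_split <- initial_split(red_data_scaled, prop = 0.75, strata = quality)
red_train <- training(red_split)
red_test <- testing(red_split)

# Keep only the three predictors (raw and scaled) and the outcome.
red_train_set <- red_train |>
    select(citric.acid, residual.sugar, alcohol,
           scaled_citric_acid, scaled_residual_sugar, scaled_alcohol, quality)
red_test_set <- red_test |>
    select(citric.acid, residual.sugar, alcohol,
           scaled_citric_acid, scaled_residual_sugar, scaled_alcohol, quality)
red_train_set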




citric.acid,residual.sugar,alcohol,scaled_citric_acid,scaled_residual_sugar,scaled_alcohol,quality
<dbl>,<dbl>,<dbl>,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>",<fct>
0.00,19,94,-1.39103710,-0.18264984,-0.04317968,5
0.00,26,98,-1.39103710,-0.01701753,-0.04317968,5
0.04,23,98,-1.18569949,-0.08800280,-0.04317968,5
0.00,19,94,-1.39103710,-0.18264984,-0.04317968,5
0.00,18,94,-1.39103710,-0.20631160,-0.04317968,5
0.06,16,94,-1.08303069,-0.25363512,-0.04317968,5
0.36,61,105,0.45700139,0.81114405,-0.04317968,5
0.08,18,92,-0.98036188,-0.20631160,-0.04317968,5
0.36,61,105,0.45700139,0.81114405,-0.04317968,5
0.00,16,99,-1.39103710,-0.25363512,-0.04317968,5


The above training set contains the data that we will use to train our model. 
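
Note that the cell above standardizes the predictors before splitting, so the testing rows contribute to the means and standard deviations used for scaling. A common alternative is to estimate those constants from the training rows only and then apply them to both sets. The following is a minimal base R sketch of that idea on toy data. The data frames `train` and `test` and their values are hypothetical stand-ins for `red_train` and `red_test`.

```r
# Hypothetical toy data standing in for the training and testing sets.
train <- data.frame(alcohol = c(9.4, 9.8, 10.0, 10.5))
test  <- data.frame(alcohol = c(9.5, 11.0))

# Estimate the centering and scaling constants from the training rows only.
alcohol_mean <- mean(train$alcohol)
alcohol_sd   <- sd(train$alcohol)

# Apply the same constants to both sets, so the testing rows never
# influence how the predictors are standardized.
train$scaled_alcohol <- (train$alcohol - alcohol_mean) / alcohol_sd
test$scaled_alcohol  <- (test$alcohol - alcohol_mean) / alcohol_sd
```

Computing the scaled values this way also yields plain numeric columns rather than the one-column matrices (`<dbl[,1]>`) that `scale()` returns.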

In [6]:
counts<-red_train_set |>
        group_by(quality) |>
        summarize(n=n())
counts

counts_sugar<-red_train_set |>
        group_by(residual.sugar, quality) |>
        summarize(n=n())
counts_sugar

quality,n
<fct>,<int>
3,8
4,38
5,509
6,481
7,150
8,12


[1m[22m`summarise()` has grouped output by 'residual.sugar'. You can override using
the `.groups` argument.


residual.sugar,quality,n
<dbl>,<fct>,<int>
2,4,5
2,5,53
2,6,44
2,7,10
2,8,1
3,5,6
3,6,9
4,5,2
4,6,6
4,7,2


The two tables provide more insight into the wrangled data. The "counts" table gives the number of red wines that received each rating; most of the wines were rated 5 or 6 (509 and 481 of the 1,198 wines in the training set). The "counts_sugar" table is longer and shows how many wines received each rating at each residual sugar value. We could do the same with citric acid or alcohol content, but the wider range of values would lengthen the table even more. Next, we create a plot to help visualize the distribution of ratings.
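
As a side note, the same tallies can be written more compactly. The following is a minimal sketch on toy data, assuming dplyr is attached. The tibble `toy` and its values are hypothetical stand-ins for `red_train_set`.

```r
library(dplyr)

# Hypothetical toy rows standing in for red_train_set.
toy <- tibble(
  residual.sugar = c(1.9, 1.9, 2.6, 2.6, 2.6),
  quality = factor(c(5, 5, 5, 6, 6))
)

# count() is shorthand for group_by() followed by summarize(n = n()),
# and it returns an ungrouped result.
counts <- toy |>
  count(quality)

# When summarize() is used with several grouping columns,
# .groups = "drop" returns an ungrouped tibble and silences the
# "has grouped output by" message.
counts_sugar <- toy |>
  group_by(residual.sugar, quality) |>
  summarize(n = n(), .groups = "drop")
```

Dropping the grouping also avoids surprises if the summary table is passed on to later `mutate()` or `summarize()` calls.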

In [15]:
# Bar chart of how many wines in the training set received each quality rating.
# geom_bar() counts the rows per rating itself, so no y aesthetic is needed.
plot <- ggplot(red_train_set, aes(x = quality)) +
    geom_bar() +
    labs(x = "Quality rating", y = "Number of wines")
plot
