# Machine Learning Exercise with `R`

There is still a lot to learn about machine learning, and we have barely scratched the surface of it. There are many ways to refine a model that we didn't touch on in this module (don't worry, these will be covered throughout your curriculum), such as data transformation, elegant methods for automated feature selection, and unsupervised learning.

For these exercises, complete only **ONE** of the exercise notebooks, either `Python` or `R`. You will predict wine quality using both a Decision Tree and Naïve Bayes. Treat the exercises as extended practice: you are free to refine the models however you see fit, as long as you use both Decision Tree and Naïve Bayes.

The questions will guide you a bit, but if you want to experiment, or if data exploration leads you to a better model, feel free to pursue it. If you go this route, leave comments in the code justifying your choices.

### Read in Packages

In [1]:
library(tree)
library(ggplot2)
library(e1071)

### Read in the Data

Today we will be using the Red Wine Quality data. The target variable is numeric, so we are going to discretize it a bit before we get to the activities.

In [2]:
wine <- read.csv('/dsa/data/all_datasets/wine-quality/winequality-red.csv', sep = ";")
head(wine)

fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.66,0.0,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5


In [3]:
# if wine quality is less than 6, assign the value "bad".
# if 6 or greater, assign "good". create a new target called
# taste
wine$taste <- ifelse(wine$quality < 6, 'bad', 'good') 

# 6 is the most popular value by a lot in this set, so 
# we are going to assign it a unique value. We will call 
# this "normal" as it is in the middle of the distribution.
wine$taste[wine$quality == 6] <- 'normal'

# make this target variable categorical
wine$taste <- as.factor(wine$taste)

# remove the old target, since it is no longer needed
wine <- wine[,-12]
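
The same three-way binning can also be written with a single call to `cut()`. The sketch below is an illustration on a made-up `quality_demo` vector (an assumption, not the wine data) and checks that it agrees with the `ifelse` approach used above. Note that `cut()` orders the levels bad, normal, good, whereas `as.factor()` orders them alphabetically.

```r
# Made-up quality scores for illustration only (not taken from the wine data)
quality_demo <- c(3, 4, 5, 6, 6, 7, 8)

# Approach used in the notebook: ifelse first, then overwrite the 6s
taste_ifelse <- ifelse(quality_demo < 6, "bad", "good")
taste_ifelse[quality_demo == 6] <- "normal"

# Alternative: one call to cut() with right-closed intervals
# (-Inf, 5] -> bad, (5, 6] -> normal, (6, Inf] -> good
taste_cut <- cut(quality_demo,
                 breaks = c(-Inf, 5, 6, Inf),
                 labels = c("bad", "normal", "good"))

# Both approaches assign the same label to every score
agree <- all(as.character(taste_cut) == taste_ifelse)
print(table(taste_cut))
```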

**Exercise 1**: Create a training data set and testing data set from the `wine` data frame. Make sure that the rows are randomly selected. The training set should be constructed from 60% of the data; call it `train`. The testing set should be called `test` and should be constructed from the **other** 40% of the data. Be sure to call `set.seed(123)` first.

In [96]:
# Code for exercise 1 goes here
# *****************************

set.seed(123)
numrow <- nrow(wine)
# sample 60% of the row indices (without replacement) for the training set
train_ind <- sample(seq_len(numrow), size = as.integer(0.6 * numrow))
train <- wine[train_ind,]
test <- wine[-train_ind,]
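
A quick sanity check on a split like this is to confirm that the two pieces have the expected sizes and share no rows. The sketch below repeats the same sampling steps on a placeholder data frame with 1599 rows (the row count reported in the tree summary in this notebook); `demo`, `train_demo`, and `test_demo` are illustrative names, not objects from the exercise.

```r
# Placeholder data frame with the same number of rows as the wine data
demo <- data.frame(id = seq_len(1599))

set.seed(123)
n <- nrow(demo)
train_ind <- sample(seq_len(n), size = as.integer(0.6 * n))
train_demo <- demo[train_ind, , drop = FALSE]
test_demo  <- demo[-train_ind, , drop = FALSE]

# Sizes: 959 training rows and 640 testing rows
print(c(train = nrow(train_demo), test = nrow(test_demo)))

# No row appears in both sets, and together they cover every row
no_overlap <- length(intersect(train_demo$id, test_demo$id)) == 0
full_cover <- setequal(c(train_demo$id, test_demo$id), demo$id)
```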

**Exercise 2**: Create a formula for the prediction task. First predict using all of the variables other than the target. In order to avoid typing out all of the variables, you can use the following notation:

```splus
target ~ .
```

The "." tells `R` to use all other variables in the dataset (that are not the target) as inputs.
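
To see what the dot stands for, `terms()` can expand it against a data frame. The sketch below uses a tiny made-up data frame `demo` (an assumption for illustration, not the wine data).

```r
# Tiny made-up data frame: one target column and two predictors
demo <- data.frame(
  taste   = factor(c("bad", "good", "normal")),
  alcohol = c(9.4, 11.2, 10.1),
  pH      = c(3.51, 3.20, 3.26)
)

f <- taste ~ .

# terms() resolves "." against the columns of `data`, excluding the target
expanded <- attr(terms(f, data = demo), "term.labels")
print(expanded)  # "alcohol" "pH"
```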

In [97]:
# Code for exercise 2 goes here
# *****************************

frmla = taste ~ .

**Exercise 3**: Create a Decision Tree model using the `tree` function. Make sure that you pass the newly created formula as a parameter and specify the training data set. Be sure to name this object something (in the examples, we called it `tr`). Then run a summary on the object. 

In [98]:
# Code for exercise 3 goes here
# *****************************

tr <- tree(frmla, data = wine)

Pay attention to the output of the summary.

**Exercise 4**: What is the misclassification error rate of the tree using the **testing** set?

In [99]:
# Code for exercise 4 goes here
# *****************************

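summary(tr)

# summary() reports the error on the data the tree was fit to. Here that is
# all 1599 rows of `wine` (see the call in the output below), so 0.3815
# (about 38%) is a resubstitution error rate, not a testing-set error rate.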


Classification tree:
tree(formula = frmla, data = wine)
Variables actually used in tree construction:
[1] "alcohol"              "sulphates"            "total.sulfur.dioxide"
Number of terminal nodes:  8 
Residual mean deviance:  1.558 = 2479 / 1591 
Misclassification error rate: 0.3815 = 610 / 1599 
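
The rate printed by `summary()` is computed on the rows the tree was fit to. One way to get a testing-set rate instead is to fit on the training rows, predict classes for the held-out rows, and tabulate the result. The sketch below does this on the built-in `iris` data as a stand-in (an assumption, so that it runs without the wine file); `tr_demo`, `train_demo`, and `test_demo` are illustrative names, and the same pattern should carry over to `frmla`, `train`, and `test`.

```r
library(tree)

# Stand-in data: built-in iris, split 60/40 as in Exercise 1
set.seed(123)
idx <- sample(seq_len(nrow(iris)), size = as.integer(0.6 * nrow(iris)))
train_demo <- iris[idx, ]
test_demo  <- iris[-idx, ]

# Fit on the training rows only
tr_demo <- tree(Species ~ ., data = train_demo)

# Predict class labels for the held-out rows
pred <- predict(tr_demo, newdata = test_demo, type = "class")

# Confusion table (rows = predicted, columns = actual) and test error rate
conf <- table(predicted = pred, actual = test_demo$Species)
test_error <- 1 - sum(diag(conf)) / sum(conf)
print(conf)
print(test_error)
```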

**Exercise 5**: Now create a Naïve Bayes classifier using the formula and training data. Be sure to name this model something (in the other notebooks, we called it `m`).

In [100]:
# Code for exercise 5 goes here
# *****************************

m <- naiveBayes(frmla, data = train)
m


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      bad      good    normal 
0.4734098 0.1303441 0.3962461 

Conditional probabilities:
        fixed.acidity
Y            [,1]     [,2]
  bad    8.191189 1.530286
  good   8.928000 2.106771
  normal 8.458947 1.834325

        volatile.acidity
Y             [,1]      [,2]
  bad    0.5984141 0.1872713
  good   0.4045200 0.1430999
  normal 0.4914079 0.1687192

        citric.acid
Y             [,1]      [,2]
  bad    0.2390749 0.1853319
  good   0.3744800 0.1905444
  normal 0.2882105 0.1950711

        residual.sugar
Y            [,1]     [,2]
  bad    2.512445 1.298201
  good   2.676400 1.299699
  normal 2.471053 1.327730

        chlorides
Y              [,1]       [,2]
  bad    0.09217401 0.05137602
  good   0.07500800 0.02245443
  normal 0.08388947 0.03630374

        free.sulfur.dioxide
Y            [,1]     [,2]
  bad    16.43612 10.69295
  good   

**Exercise 6**: What is the misclassification error rate of the Naïve Bayes classifier using the **testing** set?

In [101]:
# Code for exercise 6 goes here
# *****************************

table(predict(m, test[,-12]), test[,12])

# misclassification error rate of the Naïve Bayes classifier is 241/640 = 37.7%
# (off-diagonal counts in the table below: 640 - (222 + 47 + 130) = 241)

        
         bad good normal
  bad    222    8     98
  good     7   47     30
  normal  61   37    130
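
The error rate can be read straight off a confusion table: everything off the diagonal is a misclassification. As a worked check, the sketch below re-enters the printed counts by hand into a matrix named `conf_m` (an illustrative name) and does the arithmetic.

```r
# Confusion matrix copied from the printed table (rows = predicted, columns = actual)
classes <- c("bad", "good", "normal")
conf_m <- matrix(c(222,  8,  98,
                     7, 47,  30,
                    61, 37, 130),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(predicted = classes, actual = classes))

correct <- sum(diag(conf_m))   # 222 + 47 + 130 = 399
total   <- sum(conf_m)         # 640 test rows
errors  <- total - correct     # 241

error_rate <- errors / total
print(round(error_rate, 4))    # 0.3766
```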

Take a look at the summary of the tree created in Exercise 3. It shows us the features that it used for the classification task. 

**Exercise 7**: Create a new formula that predicts `taste` using only the features that the decision tree defined. Be sure to name this formula something different from the old formula.

In [78]:
# Code for exercise 7 goes here
# *****************************


frmla2 = taste ~ alcohol + sulphates + total.sulfur.dioxide
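
Typing the variable names by hand works for three features. As a hedged alternative, base R's `reformulate()` builds the same kind of formula from a character vector; the vector `used_vars` below is typed in from the tree summary output, and `frmla2_demo` is an illustrative name.

```r
# Features listed under "Variables actually used in tree construction"
used_vars <- c("alcohol", "sulphates", "total.sulfur.dioxide")

# Build taste ~ alcohol + sulphates + total.sulfur.dioxide programmatically
frmla2_demo <- reformulate(used_vars, response = "taste")
print(frmla2_demo)
```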


**Exercise 8**: Now create a Naïve Bayes classifier using this reduced formula and the training data. Be sure to name this model something other than your original Naïve Bayes model.

In [79]:
# Code for exercise 8 goes here
# *****************************


m2 <- naiveBayes(frmla2, data = train)
m2



Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      bad      good    normal 
0.4734098 0.1303441 0.3962461 

Conditional probabilities:
        alcohol
Y             [,1]      [,2]
  bad     9.942401 0.7196797
  good   11.574133 0.9948421
  normal 10.641535 1.0472273

        sulphates
Y             [,1]      [,2]
  bad    0.6209031 0.1747981
  good   0.7485600 0.1372818
  normal 0.6681842 0.1439680

        total.sulfur.dioxide
Y            [,1]     [,2]
  bad    53.25881 36.24222
  good   36.41600 32.42365
  normal 41.64474 25.59287
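
For numeric predictors, the two columns in each table are the per-class mean (`[,1]`) and standard deviation (`[,2]`), and the classifier treats each feature as Gaussian within a class. The sketch below recomputes a posterior by hand from the numbers printed above for one example wine (the feature values of the first row shown by `head(wine)`). It illustrates the idea only and leaves out details of `predict()` in `e1071`, such as its handling of very small probabilities; `stats`, `x`, and `posterior` are illustrative names.

```r
# Class priors copied from "A-priori probabilities"
prior <- c(bad = 0.4734098, good = 0.1303441, normal = 0.3962461)

# Per-class mean and sd for each feature, in the order bad, good, normal
stats <- list(
  alcohol = cbind(mean = c(9.942401, 11.574133, 10.641535),
                  sd   = c(0.7196797, 0.9948421, 1.0472273)),
  sulphates = cbind(mean = c(0.6209031, 0.7485600, 0.6681842),
                    sd   = c(0.1747981, 0.1372818, 0.1439680)),
  total.sulfur.dioxide = cbind(mean = c(53.25881, 36.41600, 41.64474),
                               sd   = c(36.24222, 32.42365, 25.59287))
)

# Example wine: feature values from the first row of head(wine)
x <- c(alcohol = 9.4, sulphates = 0.56, total.sulfur.dioxide = 34)

# log posterior (up to a constant) = log prior + sum of Gaussian log densities
log_post <- log(prior)
for (feature in names(stats)) {
  s <- stats[[feature]]
  log_post <- log_post +
    dnorm(x[[feature]], mean = s[, "mean"], sd = s[, "sd"], log = TRUE)
}

# Normalize so the three class probabilities sum to 1
posterior <- exp(log_post - max(log_post))
posterior <- posterior / sum(posterior)
print(round(posterior, 3))
```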


**Exercise 9**: Does using only these select features create a better model according to the testing data misclassification error rate?

In [102]:
# Code for exercise 9 goes here
# *****************************

table(predict(m, test[,-12]), test[,12]) # misclassification error rate = 241/640 = 37.7%

table(predict(m2, test[,-c(1,2,3,4,5,6,8,9)]), test[,12]) # misclassification error rate = 249/640 = 38.9%

# No. According to the tables below, using only these select features gives a
# slightly higher testing misclassification error rate (38.9% vs. 37.7%), so the
# reduced model is marginally worse, not better.

        
         bad good normal
  bad    222    8     98
  good     7   47     30
  normal  61   37    130

        
         bad good normal
  bad    242   13    119
  good     2   20     10
  normal  46   59    129
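
Overall error hides where each model gains or loses. As a rough sketch, the two printed tables can be re-entered as matrices to compare the overall error rate and the per-class recall (the share of each actual class that was predicted correctly); `conf_m`, `conf_m2`, and the helpers `to_conf()`, `error_rate()`, and `recall()` are illustrative names.

```r
classes <- c("bad", "good", "normal")

# Build a labelled confusion matrix (rows = predicted, columns = actual)
to_conf <- function(v) {
  matrix(v, nrow = 3, byrow = TRUE,
         dimnames = list(predicted = classes, actual = classes))
}

conf_m  <- to_conf(c(222,  8,  98,  7, 47, 30, 61, 37, 130))  # all features
conf_m2 <- to_conf(c(242, 13, 119,  2, 20, 10, 46, 59, 129))  # three features

# Share of off-diagonal (misclassified) rows
error_rate <- function(conf) 1 - sum(diag(conf)) / sum(conf)

# Correct predictions for each actual class, divided by that class's total
recall <- function(conf) diag(conf) / colSums(conf)

print(round(c(all_features = error_rate(conf_m),
              three_features = error_rate(conf_m2)), 4))
# all_features 0.3766, three_features 0.3891

print(round(rbind(all_features = recall(conf_m),
                  three_features = recall(conf_m2)), 3))
```

In these tables the reduced model recovers more of the actual `bad` wines (242 vs. 222 of 290) but far fewer of the actual `good` wines (20 vs. 47 of 92).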

# Save your notebook, then `File > Close and Halt`