In [None]:
# load the packages
library('ggplot2')
library('dplyr')

# ignore the commands below; these just make sure plots fit on the screen
library('repr')
options(repr.plot.width=3, repr.plot.height=3)

# Lesson 15: Assessing Your Models

Today:
1. Assessing your models
    + Accuracy
    + Other ways to measure goodness of models
    + Improving your models
        + Incorporating more features
        + k-Nearest Neighbor Classifiers

## 1. Measuring "Goodness" of Classifiers

In Lesson 13, we created a function called `predict_tumor_class()`:

`Z <- predict_tumor_class( X, Y )`

where
+ X = clump thickness value
+ Y = marginal adhesion value
+ Z = the prediction that your decision tree classifier makes for the given values of X and Y
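
A quick way to sanity-check the rule is to call the function on a few hand-picked values. The sketch below restates the rule in compact form so that it runs on its own (the full definition appears later in this lesson). The input values are made up for illustration, and the arguments are passed by name so their order cannot be mixed up.

```r
# Compact restatement of the Lesson 13 rule, so this snippet is self-contained:
# benign (0) only when marginal_adhesion < 4 AND clump_thickness < 7
predict_tumor_class <- function( clump_thickness, marginal_adhesion ){
    if( marginal_adhesion < 4 && clump_thickness < 7 ){
        class_predicted <- 0
    }else{
        class_predicted <- 1
    }
    return( class_predicted )
}

# made-up inputs, one per branch of the decision tree
low_low  <- predict_tumor_class( clump_thickness = 5, marginal_adhesion = 1 )  # 0 (benign)
thick    <- predict_tumor_class( clump_thickness = 8, marginal_adhesion = 1 )  # 1 (malignant)
adhesive <- predict_tumor_class( clump_thickness = 3, marginal_adhesion = 6 )  # 1 (malignant)
```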

**Example: Encoding a simple classifier (version 2)**

<table>
    <tr>
        <td><img src="lec20-knn-illustration2_wline2.jpg" width="600"></td>
        <td><img src="dec_tree1b.jpg" width="600"></td>
    </tr>
</table>  

Recall that our classifier was constructed based on our observation of the training dataset.
+ The original cancer dataset has 683 rows.
+ We take the first 400 rows as our training dataset and the last 283 rows as our test dataset.

In [None]:
cancerdata <- read.csv('../../shared/datasets/cancer.csv')
dim(cancerdata)

In [None]:
# ---------------
# this part simply puts together the pieces we have done previously into one giant code cell

# 1. THE DATASET
# as we did in lesson 11, split into training and test datasets:
#cancerdata$Class <- factor(cancerdata$Class)

# split the cancer dataset into two: training data (the first 400 rows), test data (the remaining 283 rows)
training_data <- cancerdata[1:400,]
test_data <- cancerdata[401:683,]


# 2. THE CLASSIFIER 
# here is the classifier from Lesson 13

# Encoding a simple classifier
# If marginal_adhesion is less than 4 AND clump_thickness is less than 7, the tumor is classified as 0 (benign); 
# else, it is classified as 1 (malignant)

predict_tumor_class <- function( clump_thickness, marginal_adhesion ){
    
    if( marginal_adhesion < 4 ){

        if( clump_thickness < 7 ){
            class_predicted <- 0
        }else{
            class_predicted <- 1
        }

    }else{
        class_predicted <- 1
    }
    
    return( class_predicted )
}


# 3. PREDICT THE CLASS OF EACH ROW OF THE TEST DATASET, USING A FOR LOOP
#   we did this in lesson 14:

# create an empty data frame, 2 columns, 283 rows

predictions <- data.frame(matrix(nrow = 283, ncol = 2))
names(predictions) <- c( 'class_actual', 'class_predicted' )

# column containing the actual class is from the Class column of the test data
predictions$class_actual <- test_data$Class

# store the predictions in this data frame
for( row in 1:283 ){

    output <- predict_tumor_class( test_data$Clump.Thickness[row],
                                   test_data$Marginal.Adhesion[row] )

    predictions$class_predicted[row] <- output
}

head(predictions)
# -------------------


# Next, check how good our predictions are, by comparing to the actual class

# count how many predictions are incorrect and how many are correct
#    add a new column called "error"
#    if actual class is equal to predicted class, error is 0; else, error is 1
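
The check described in the comments above can be sketched as follows. The small `toy_predictions` data frame is made up purely so the snippet runs on its own; in the lesson, the same steps would be applied to the `predictions` data frame built in the loop. Accuracy is taken here to be the fraction of rows where the predicted class equals the actual class.

```r
# made-up stand-in for the predictions data frame (illustration only)
toy_predictions <- data.frame(
    class_actual    = c(0, 0, 1, 1, 0, 1),
    class_predicted = c(0, 1, 1, 1, 0, 0)
)

# error is 0 when the prediction matches the actual class, 1 otherwise
toy_predictions$error <- ifelse(
    toy_predictions$class_actual == toy_predictions$class_predicted, 0, 1 )

# count incorrect and correct predictions
n_incorrect <- sum( toy_predictions$error )
n_correct   <- nrow( toy_predictions ) - n_incorrect

# accuracy = fraction of predictions that are correct
accuracy <- n_correct / nrow( toy_predictions )

# cross-tabulate actual vs. predicted classes to see which kind of error occurs
confusion <- table( actual    = toy_predictions$class_actual,
                    predicted = toy_predictions$class_predicted )
print( confusion )
```

The off-diagonal cells of `confusion` separate the two kinds of mistakes (benign called malignant, and malignant called benign), which a single accuracy number hides.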








### The k Nearest Neighbor Classifier
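
As a rough illustration of the idea, a k-nearest-neighbor classifier labels a new point with the most common class among the k training points closest to it. The sketch below is one minimal way to write this in base R, assuming Euclidean distance and a majority vote. The helper name `knn_predict_one` and the toy training values are hypothetical and not part of the lesson's code; only the column names come from the cancer dataset.

```r
# Hypothetical helper (not from the lesson): classify one new point by
# majority vote among its k nearest training points (Euclidean distance)
knn_predict_one <- function( train_x, train_class, new_point, k = 3 ){
    # difference between every training row and the new point, column by column
    diffs <- sweep( as.matrix( train_x ), 2, as.numeric( new_point ) )
    distances <- sqrt( rowSums( diffs^2 ) )

    # row indices of the k closest training points
    nearest <- order( distances )[1:k]

    # most common class among those neighbors
    votes <- table( train_class[nearest] )
    return( as.numeric( names( votes )[which.max( votes )] ) )
}

# made-up training data (illustration only)
toy_train <- data.frame(
    Clump.Thickness   = c(1, 2, 3, 8, 9, 10),
    Marginal.Adhesion = c(1, 1, 2, 6, 7, 8),
    Class             = c(0, 0, 0, 1, 1, 1)
)
features <- toy_train[ , c('Clump.Thickness', 'Marginal.Adhesion') ]

pred_a <- knn_predict_one( features, toy_train$Class, c(2, 2), k = 3 )  # near the class-0 points
pred_b <- knn_predict_one( features, toy_train$Class, c(9, 6), k = 3 )  # near the class-1 points
```

With two classes, an odd `k` avoids tied votes; when a tie does occur, `which.max()` returns the first of the tied classes.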

In [None]:
#knn




In [None]:
# make an empty data frame to store the predictions
predictions <- data.frame( matrix( nrow = 283, ncol = 3   ))
names(predictions) <- c('class_actual', 'class_predicted', 'error')

head(predictions)