<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Cars-File" data-toc-modified-id="Cars-File-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Cars File</a></span><ul class="toc-item"><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Model-Building" data-toc-modified-id="Model-Building-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model Building</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Decision-Trees" data-toc-modified-id="Decision-Trees-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Decision Trees</a></span></li><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Random Forest</a></span></li><li><span><a href="#Multilayer-Perceptron" data-toc-modified-id="Multilayer-Perceptron-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Multilayer Perceptron</a></span></li></ul></li><li><span><a href="#Accuracy-metrics" data-toc-modified-id="Accuracy-metrics-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Accuracy metrics</a></span></li><li><span><a href="#Results" data-toc-modified-id="Results-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Results</a></span></li></ul></li></ul></div>

## Cars File

### Data Preparation

**Get all the packages and data ready**

In [1]:
# read in all packages
library(data.table)
library(caret)
library(dummy)
library(nnet)
library(randomForest)
library(RWeka)
# set options
options(warn=-1)
# read data
data <- fread("train.arff.csv")

“package ‘caret’ was built under R version 3.4.1”
Loading required package: lattice
Loading required package: ggplot2
“unknown timezone 'zone/tz/2017c.1.0/zoneinfo/Asia/Kolkata'”
dummy 0.1.3
dummyNews()
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin

“package ‘RWeka’ was built under R version 3.4.3”

**Check the data types**

In [2]:
# check data types
dtypes <- as.matrix(sapply(data, class))

**Change the data types of the columns from characters to factors**

In [3]:
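# number of distinct values per column
num_levels <- as.matrix(sapply(data, function(x){length(unique(x))}))
# convert every column to a factor; lapply keeps the factor class on any R version
data_class_changed <- as.data.frame(lapply(data, as.factor))
# confirm that every column is now a factor
target_levels <- as.matrix(sapply(data_class_changed, class))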
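
To see what the conversion does, here is a minimal sketch on a made-up two-column data frame. The name `toy` and its values are illustrative and not taken from `train.arff.csv`. It uses `lapply`, which keeps each column as a factor regardless of the R version.

```r
# Toy data; column names and values are hypothetical
toy <- data.frame(buying = c("low", "high", "low"),
                  doors  = c("2", "4", "4"),
                  stringsAsFactors = FALSE)

# Convert every column to a factor
toy_factors <- as.data.frame(lapply(toy, as.factor))

# Inspect the class and the number of levels per column
sapply(toy_factors, class)
sapply(toy_factors, nlevels)
```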

**Check for missing values**

In [4]:
# count blank (" ") entries per column
num_nas <- apply(data_class_changed, 2, function(x) sum(x == " "))

<p style="color:green;"><b>Has no missing values&#8593;</b></p>
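
A slightly broader check could count `NA` values and empty or whitespace-only strings together. The sketch below does this on a made-up data frame; `toy` and the helper name `count_missing` are assumptions for illustration only.

```r
# Toy data with one blank, one NA and one empty string (all hypothetical)
toy <- data.frame(buying = c("low", " ", "high"),
                  safety = c("med", NA, ""),
                  stringsAsFactors = FALSE)

# Count entries that are NA or contain nothing but whitespace
count_missing <- function(x) {
  x <- as.character(x)
  sum(is.na(x) | trimws(x) == "")
}

num_missing <- sapply(toy, count_missing)
num_missing
```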

**Note**: in this exercise, we won't split the dataset into training and test sets due to a shortage of time, so all metrics reported below are computed on the training data.
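
For reference, a hold-out split could be added later along these lines. This is only a sketch on a made-up data frame using base R; the 80/20 ratio, the seed and all names are assumptions and not part of this analysis.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Toy data; values are hypothetical
toy <- data.frame(x = 1:10,
                  label = factor(rep(c("acc", "unacc"), each = 5)))

# Sample 80% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```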

**Check balance of data**

In [5]:
target_levels_dist <- 100*table(data_class_changed$class)/dim(data_class_changed)[1]
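
The same percentage calculation can also be written with `prop.table`, which divides each count by the total. A small sketch on made-up labels (the vector `labels` is hypothetical):

```r
# Hypothetical labels: three "unacc" and one "acc"
labels <- factor(c("unacc", "unacc", "unacc", "acc"))

# Share of each level in percent
dist_pct <- 100 * prop.table(table(labels))
dist_pct
```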

**Rename the `class` column, since `class` is also a function in R**

In [6]:
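# rename by name rather than by position so the code does not depend on column order
colnames(data_class_changed)[colnames(data_class_changed) == "class"] <- "category_type"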

### Model Building

#### Logistic Regression

In [7]:
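# one-hot encode the predictors (x and y are used for the multilayer perceptron below)
covariates <- names(data_class_changed)[names(data_class_changed) != "category_type"]
x <- dummy(data_class_changed[covariates])
# convert the indicator columns to numeric 0/1
x <- sapply(x, function(xx){as.numeric(xx)-1})
# sapply() returns a matrix, so use colnames() rather than names()
new_names <- colnames(x)
y <- data_class_changed$category_type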
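
The encoding step turns each factor into 0/1 indicator columns. To illustrate the idea without the `dummy` package, the sketch below builds the same kind of indicator matrix with base R's `model.matrix`. The toy data and the choice of `model.matrix` are illustrative assumptions, and the column names it produces differ from those of `dummy()`.

```r
# Toy predictor with three levels (hypothetical values)
toy <- data.frame(safety = factor(c("low", "med", "high", "med")))

# "~ safety - 1" drops the intercept so every level gets its own column
onehot <- model.matrix(~ safety - 1, data = toy)
onehot
```

Each row contains exactly one 1. Note that with several factors in the formula, `model.matrix` keeps all levels only for the first one unless the contrasts are adjusted.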

In [8]:
data_class_changed$category_type <- relevel(data_class_changed$category_type, ref = "unacc")
model_logreg <- multinom(category_type ~ ., data = data_class_changed)
predictions_logreg <- predict(model_logreg,data_class_changed)

# weights:  68 (48 variable)
initial  value 1677.416177 
iter  10 value 463.498200
iter  20 value 365.099598
iter  30 value 238.326744
iter  40 value 184.638230
iter  50 value 171.070923
iter  60 value 170.716280
iter  70 value 170.672318
iter  80 value 170.660067
iter  90 value 170.659530
final  value 170.659524 
converged
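
`relevel()` moves the chosen level to the front, and `multinom()` treats the first factor level as the reference category. A minimal base R sketch on made-up values shows the effect on the level order:

```r
# Hypothetical target values
f <- factor(c("acc", "unacc", "good"))
levels(f)                        # alphabetical by default

# Make "unacc" the first (reference) level
f2 <- relevel(f, ref = "unacc")
levels(f2)
```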


#### Decision Trees

In [9]:
model_dt <- J48(category_type~., data=data_class_changed)
predictions_dt <- predict(model_dt, data_class_changed)

#### Random Forest

In [10]:
model_rf <- randomForest(category_type~., data=data_class_changed)
predictions_rf <- predict(model_rf, data_class_changed)

#### Multilayer Perceptron

In [11]:
model_mlp <- caret::train(x, y, method="mlp")
predictions_mlp <- predict(model_mlp, x)

Loading required package: Rcpp

Attaching package: ‘RSNNS’

The following objects are masked from ‘package:caret’:

    confusionMatrix, train



### Accuracy metrics

In [12]:
y_true <- as.factor(data_class_changed[,"category_type"])

# build all the confusion matrices
cm_logreg <- as.data.frame.matrix(table(predictions_logreg, y_true))
cm_dt <- as.data.frame.matrix(table(predictions_dt, y_true))
cm_rf <- as.data.frame.matrix(table(predictions_rf, y_true))
cm_mlp <- as.data.frame.matrix(table(predictions_mlp, y_true))

In [13]:
# sort rows and columns alphabetically so all matrices share the same class order
cm_logreg <- cm_logreg[order(rownames(cm_logreg)),order(colnames(cm_logreg))]
cm_dt     <- cm_dt[order(rownames(cm_dt)),order(colnames(cm_dt))]
cm_rf     <- cm_rf[order(rownames(cm_rf)),order(colnames(cm_rf))]
cm_mlp    <- cm_mlp[order(rownames(cm_mlp)),order(colnames(cm_mlp))]

In [14]:
# function to calculate class-wise accuracy, precision and recall
# conf_mat: rows are predicted classes, columns are true classes
classification_metrics <- function(conf_mat,model_type){
    precision <- diag(as.matrix(conf_mat)) / rowSums(conf_mat)
    recall <- diag(as.matrix(conf_mat)) / colSums(conf_mat)
    # per-class share of all instances classified correctly; sums to the overall accuracy
    accuracy <- diag(as.matrix(conf_mat)) / sum(conf_mat)
    df <- data.frame(accuracy,precision,recall)
    colnames(df) <- paste(model_type,c("accuracy","precision","recall"),sep="_")
    df
}
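
As a worked example of these formulas, the sketch below applies them to a made-up 2x2 confusion matrix with predictions in rows and true classes in columns. The matrix values and the name `share` are hypothetical.

```r
# Hypothetical confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(8, 1,
               2, 9),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("a", "b"), actual = c("a", "b")))

tp        <- diag(cm)
precision <- tp / rowSums(cm)   # correct / all predicted as the class
recall    <- tp / colSums(cm)   # correct / all actually in the class
share     <- tp / sum(cm)       # correct / all instances

round(cbind(share, precision, recall), 3)
sum(share)                      # overall accuracy
```

For class `a`, 8 of the 9 instances predicted as `a` are correct (precision), and 8 of the 10 instances that truly are `a` are found (recall).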

In [15]:
# get all metrics
metrics_logreg <- classification_metrics(cm_logreg, "logreg")
metrics_dt <- classification_metrics(cm_dt, "dt")
metrics_rf <- classification_metrics(cm_rf, "rf")
metrics_mlp <- classification_metrics(cm_mlp, "mlp")

In [16]:
# combine all and rearrange
metric_comparison <- cbind(metrics_logreg, metrics_dt,metrics_rf,metrics_mlp)
# rows are in alphabetical order after the rearrange step: acc, good, unacc, vgood
rownames(metric_comparison) <- rownames(cm_logreg)
models <- c("logreg", "dt", "rf", "mlp")
metrics <- c("accuracy", "recall", "precision")
# all model/metric column names, e.g. "logreg_accuracy"
model_metrics <- as.vector(outer(models, metrics, paste, sep="_"))
# column names per metric, sorted alphabetically by model
accy <- sort(grep("accuracy", model_metrics, value=TRUE))
recl <- sort(grep("recall", model_metrics, value=TRUE))
prec <- sort(grep("precision", model_metrics, value=TRUE))

### Results

The table below compares the classification metrics for each model and each level of the target variable ("acc", "good", "unacc", "vgood"). Because the data was not split into training and test sets, all values are measured on the training data. Three metrics are calculated for each model and level:
<ul>
<li><i><b>Accuracy: </b></i>as computed here, the fraction of all instances that belong to the level and are correctly classified; summing it over the four levels gives the model's overall accuracy</li>
<li><i><b>Recall: </b></i>the proportion of instances that actually belong to the level that the model assigns to it (true positives out of all actual positives)</li>
<li><i><b>Precision: </b></i>the proportion of instances assigned to the level by the model that actually belong to it (true positives out of all predicted positives)</li>
</ul>

<img src="Precisionrecall.png" width = 300px>

In [17]:
metric_comparison[c(accy,recl,prec)]

| level | dt_accuracy | logreg_accuracy | mlp_accuracy | rf_accuracy | dt_recall | logreg_recall | mlp_recall | rf_recall | dt_precision | logreg_precision | mlp_precision | rf_precision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acc | 0.19834711 | 0.1892562 | 0.21322314 | 0.21818182 | 0.9090909 | 0.8674242 | 0.9772727 | 1.0 | 0.8727273 | 0.8513011 | 0.9555556 | 0.9962264 |
| good | 0.02727273 | 0.03553719 | 0.04214876 | 0.04297521 | 0.6346154 | 0.8269231 | 0.9807692 | 1.0 | 0.825 | 0.86 | 0.85 | 1.0 |
| unacc | 0.68099174 | 0.6677686 | 0.68512397 | 0.69586777 | 0.9774614 | 0.9584816 | 0.9833926 | 0.9988138 | 0.9763033 | 0.9676647 | 1.0 | 1.0 |
| vgood | 0.03884298 | 0.04132231 | 0.04214876 | 0.04214876 | 0.9215686 | 0.9803922 | 1.0 | 1.0 | 0.9215686 | 0.8928571 | 1.0 | 1.0 |
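
Since each accuracy value in the table is a share of all instances, adding the four rows of an accuracy column gives that model's overall accuracy on the training data. The sketch below does this with the values copied from the table; the object names are illustrative.

```r
# Per-level accuracy shares copied from the table above (one row per level)
acc_shares <- data.frame(
  dt     = c(0.19834711, 0.02727273, 0.68099174, 0.03884298),
  logreg = c(0.1892562,  0.03553719, 0.6677686,  0.04132231),
  mlp    = c(0.21322314, 0.04214876, 0.68512397, 0.04214876),
  rf     = c(0.21818182, 0.04297521, 0.69586777, 0.04214876)
)

# Overall accuracy per model is the column sum
overall_accuracy <- colSums(acc_shares)
round(overall_accuracy, 4)
```

The sums come to roughly 0.945 for the decision tree, 0.934 for logistic regression, 0.983 for the multilayer perceptron and 0.999 for the random forest.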
