<img src="https://ibm.box.com/shared/static/hhxv35atrvcom7qc1ngn632mkkg5d6l3.png" width="200">

<h2 align="center">Toronto - Big Data University Meetup</h2>
<h1 align="center">Data Mining Algorithms</h1>
<h3 align="center">October 26, 2015</h3>
<h4 align="center"><a href="https://linkedin.com/in/polonglin">Polong Lin</a></h4>
<h4 align="center"><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a></h4>

<hr>

## Welcome to Data Scientist Workbench

Data Scientist Workbench is an environment that hosts multiple data science tools:
- Python notebooks (PySpark pre-installed)
- R notebooks (SparkR pre-installed)
- Scala notebooks (Spark pre-installed)
- <a href = "https://datascientistworkbench.com/rstudio">RStudio</a>
- <a href = "https://datascientistworkbench.com/openrefine">OpenRefine</a>


### Initial setup

In [None]:
%%bash
pip install rpy2

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
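# Detach SparkR (if it is attached) so its functions do not mask base R ones
if ("package:SparkR" %in% search()) detach("package:SparkR")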
R.Version()$version.string

<br>
<hr>

<h1 align="center">Classification Trees in R</h1>

## Get the data

In [2]:
!wget -O recipes.csv https://ibm.box.com/shared/static/g9l7707576o1pbj9xoozpofyh2fxah6o.csv


## Import data

In [None]:
%%R
#May take 10-30 seconds to read
recipes <- read.csv("recipes.csv")
head(recipes)

## Data Cleaning

In [None]:
%%R
## Lowercase the country names, then map each country to its cuisine name
recipes$country <- tolower(as.character(recipes$country))
recipes$country[recipes$country == "china"] <- "chinese"
recipes$country[recipes$country == "france"] <- "french"
recipes$country[recipes$country == "germany"] <- "german"
recipes$country[recipes$country == "india"] <- "indian"
recipes$country[recipes$country == "israel"] <- "jewish"
recipes$country[recipes$country == "italy"] <- "italian"
recipes$country[recipes$country == "japan"] <- "japanese"
recipes$country[recipes$country == "korea"] <- "korean"
recipes$country[recipes$country == "mexico"] <- "mexican"
recipes$country[recipes$country == "scandinavia"] <- "scandinavian"
recipes$country[recipes$country == "thailand"] <- "thai"
recipes$country[recipes$country == "vietnam"] <- "vietnamese"
## Convert every column (country and ingredients) to a factor
recipes[] <- lapply(recipes, as.factor)
str(recipes)

## Most Popular Ingredients

In [None]:
%%R
## Count the recipes in which each ingredient appears (cells equal to "Yes").
## The country column is excluded because it is the label, not an ingredient.
ingredients <- recipes[, names(recipes) != "country"]
ing_count <- sapply(ingredients, function(x) sum(x == "Yes"))

## One row per ingredient, sorted from most to least common
ing_df <- data.frame(ingredient = names(ing_count), count = as.numeric(ing_count))
ing_df[order(ing_df$count, decreasing = TRUE), ]

In [None]:
%%R
## Load libraries (uncomment the install lines if the packages are not installed yet)
#install.packages("rpart", repos = "http://cran.utstat.utoronto.ca/")
#install.packages("rpart.plot", repos = "http://cran.utstat.utoronto.ca/")
library(rpart) #for classification trees
library(rpart.plot) #to plot rpart trees

## East Asian Recipes

In [1]:
%%R
?rpart


In [None]:
%%R
## Create decision tree on subset of countries (East Asian + Indian)
bamboo_tree <- rpart(formula = country ~ ., 
                     data = recipes[recipes$country %in% c("korean", 
                                                           "japanese", 
                                                           "chinese", 
                                                           "thai",
                                                           "indian"),], 
                     method ="class")
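
For intuition about the cell above: with `method = "class"`, `rpart` by default scores candidate splits with the Gini index and keeps the split that reduces impurity the most. The Python sketch below illustrates only that scoring step on a tiny made-up set of recipes. The data and the helper names `gini` and `split_gain` are assumptions for illustration; this is not rpart's implementation, which also handles priors, surrogate splits, and pruning.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(rows, labels, ingredient):
    """Impurity decrease from splitting on whether a recipe contains `ingredient`."""
    yes = [lab for row, lab in zip(rows, labels) if ingredient in row]
    no = [lab for row, lab in zip(rows, labels) if ingredient not in row]
    n = len(labels)
    weighted = len(yes) / n * gini(yes) + len(no) / n * gini(no)
    return gini(labels) - weighted


# Made-up toy data: each recipe is a set of ingredients with a cuisine label.
rows = [
    {"soy_sauce", "rice"},
    {"soy_sauce", "ginger"},
    {"cumin", "rice"},
    {"cumin", "turmeric"},
]
labels = ["japanese", "japanese", "indian", "indian"]

# Score each candidate ingredient and keep the one with the largest gain.
gains = {ing: split_gain(rows, labels, ing) for ing in ["soy_sauce", "cumin", "rice"]}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data `rice` appears in one recipe of each cuisine, so splitting on it does not separate the classes at all, while `soy_sauce` and `cumin` separate them completely.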

In [None]:
%%R
## Plot the East Asian + Indian model
## run "?rpart.plot" if you want to see the arguments for rpart.plot
rpart.plot(bamboo_tree, type = 3, extra = 2, under = TRUE, cex = 0.75, varlen = 0, faclen = 0)

In [None]:
%%R
## Summary of Asian tree
summary(bamboo_tree)

## Training & Testing a Classification Tree

In [None]:
%%R
## Reduce the recipes dataset into East Asian + Indian only
bamboo <- recipes[recipes$country %in% c("korean", "japanese", "chinese", "thai", "indian"),]

print("Total recipes per country")
print(table(as.factor(as.character(bamboo$country))))

## Set sample size per country for testing set
sample_n <- 30


## Take n recipes from each country
set.seed(4) #Set random seed

korean <- bamboo[sample(which(bamboo$country == "korean") , sample_n), ]
japanese <- bamboo[sample(which(bamboo$country == "japanese") , sample_n), ]
chinese <- bamboo[sample(which(bamboo$country == "chinese") , sample_n), ]
indian <- bamboo[sample(which(bamboo$country == "indian") , sample_n), ]
thai <- bamboo[sample(which(bamboo$country == "thai") , sample_n), ]


#Create the testing dataframe
bamboo_test <- rbind(korean,japanese, chinese, thai, indian)

## Create the training dataset (remove test set from original bamboo dataset)
bamboo_train <- bamboo[!(rownames(bamboo) %in% rownames(bamboo_test)),]

## Check that we have 30 recipes from each cuisine
print("----------------------------------------")
print("Training dataset: (Total-30) recipes per country")
print(table(as.factor(as.character(bamboo_train$country))))
print("----------------------------------------")
print("Testing dataset: 30 Sampled recipes per country")
print(table(as.factor(as.character(bamboo_test$country))))
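
The split above holds out the same number of recipes from every cuisine, so the test set is balanced even though the cuisines differ in size. The same idea can be sketched in plain Python. The function name `stratified_holdout` and the label counts are made up for illustration, and the seed value only mirrors the `set.seed(4)` call; Python's generator will not reproduce R's sample.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, n_per_class, seed=4):
    """Return (train_idx, test_idx) with n_per_class test items per label."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)

    # Draw n_per_class indices from each class, without replacement.
    test_idx = []
    for idx in by_class.values():
        test_idx.extend(rng.sample(idx, n_per_class))

    # Everything that was not drawn for testing stays in the training set.
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    return train_idx, sorted(test_idx)


# Made-up class sizes; the real dataset has many more recipes per cuisine.
labels = ["korean"] * 5 + ["thai"] * 8 + ["indian"] * 6
train_idx, test_idx = stratified_holdout(labels, n_per_class=2)
print(len(train_idx), len(test_idx))
```

Because the training set is defined as "everything not held out", the two sets cannot overlap, which is the same guarantee the `rownames` filter provides in the R cell.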

### Train decision tree model on training dataset

In [None]:
%%R

## Train on bamboo_train, which already contains only the five selected cuisines
bamboo_tree_pred <- rpart(formula = country ~ .,
                          data = bamboo_train,
                          method = "class")
#Plot the trained tree
rpart.plot(bamboo_tree_pred, type = 3, extra = 2, under = TRUE, cex = 0.75, varlen = 0, faclen = 0)

### Apply the trained model to the test dataset

In [None]:
%%R
## Predict the cuisine of each held-out recipe (the true country column is dropped first)
bamboo_fit <- predict(bamboo_tree_pred, subset(bamboo_test, select=-c(country)), type = "class")

### Check accuracy of model: Confusion Matrix

In [None]:
%%R
## Rows are predicted cuisines, columns are true cuisines
bamboo_tab <- table(paste(as.character(bamboo_fit),"_pred", sep =""), paste(as.character(bamboo_test$country),"_true", sep =""))
bamboo_tab

#### Confusion Matrix (percentages)

In [None]:
%%R
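## Column percentages: for each true cuisine, the share of test recipes assigned to each predicted cuisine
round(prop.table(bamboo_tab,2)*100,1)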
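
To make the two tables concrete, here is a small Python sketch of the same bookkeeping: count (predicted, true) pairs, convert each true-cuisine column to percentages, and compute overall accuracy. The predictions below are made up, and the helper names `confusion_counts`, `column_percentages`, and `accuracy` are illustrative assumptions, not part of any library.

```python
from collections import Counter


def confusion_counts(predicted, actual):
    """Count (predicted, true) pairs, similar to table(pred, true) in R."""
    return Counter(zip(predicted, actual))


def column_percentages(counts, classes):
    """For each true class, the percentage assigned to each predicted class."""
    pct = {}
    for true in classes:
        col_total = sum(counts[(pred, true)] for pred in classes)
        for pred in classes:
            share = 100 * counts[(pred, true)] / col_total if col_total else 0.0
            pct[(pred, true)] = round(share, 1)
    return pct


def accuracy(counts, classes):
    """Fraction of predictions that land on the diagonal."""
    correct = sum(counts[(c, c)] for c in classes)
    return correct / sum(counts.values())


# Made-up labels for eight test recipes.
actual = ["thai", "thai", "thai", "thai", "indian", "indian", "indian", "indian"]
predicted = ["thai", "thai", "thai", "indian", "indian", "indian", "thai", "thai"]
classes = ["indian", "thai"]

counts = confusion_counts(predicted, actual)
pct = column_percentages(counts, classes)
print(counts)
print(pct)
print(accuracy(counts, classes))
```

Each column of the percentage table sums to 100, and its diagonal entry is the share of that cuisine's test recipes the model classified correctly.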

<hr>

## Resources


### Useful Links:

- **Data Science:** http://bigdatauniversity.com
- **Clustering:** http://bigdatauniversity.com/bdu-wp/bdu-course/machine-learning-cluster-analysis/
- **R-Code:** http://www.statmethods.net/advstats/factor.html
- **Visualize:** http://www.r-bloggers.com/computing-and-visualizing-pca-in-r/
- **Rpart:** [How the rpart package in R uses recursive partitioning](http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
- **Scikit-learn:** [Classification trees using scikit-learn in Python](http://scikit-learn.org/stable/modules/tree.html)
- **Videos:** [“Machine learning – decision trees” by Professor Nando de Freitas](https://www.youtube.com/watch?v=-dCtJjlEEgM)
- **DataCamp:** [Kaggle R tutorial on Titanic survivorship](https://www.datacamp.com/courses/kaggle-tutorial-on-machine-learing-the-sinking-of-the-titanic)
