# DSX Hands-on Workshop



## A bit about Jupyter notebook cell types.

The behavior of a cell is determined by a cell’s type. 

The different types of cells include:

**Code**: Where you can edit and write new code.

**Markdown**: Where you can document the computational process. You can input headings to structure your notebook hierarchically.

**Raw NBConvert**:  Where you can write output directly or put code that you don’t want to run. Raw cells are not evaluated by the notebook.

For the purpose of this lab, a heading will be added but all further notes will be inline with the code by using #.  An example of using Markdown will follow.

If you want to learn more about Markdown, check this out:
[Markdown Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

In [None]:
# Add Markdown Title Here

## R Libraries

Many R functions come in packages, which are free libraries of code written by R's active user community.  There are thousands of helpful R packages but this lab will only require the following:

**caret**: Package of useful functions that help streamline the model building and evaluation process.

**randomForest**: Classification and regression based on a forest of trees using random inputs.

**rpart**: Recursive partitioning for classification, regression and survival trees. An implementation of most of the functionality of the 1984 book by Breiman, Friedman, Olshen and Stone.

**rpart.plot**: Plot 'rpart' models. Extends plot.rpart() and text.rpart() in the 'rpart' package.

**e1071**: Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier.


In [None]:
# Install required libraries if not already present
# This step can take 1-2 minutes.

requiredPackages <- c("caret", "randomForest", "rpart", "rpart.plot", "e1071")

# Install each package only if it cannot be loaded, then report that it is loaded
for (pkg in requiredPackages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    print(paste0('Package [', pkg, '] successfully installed.'))
    library(pkg, character.only = TRUE)
  } else {
    print(paste0('Package [', pkg, '] already installed.'))
  }
  print(paste0('[', pkg, '] loaded.'))
}

## Reproducible Results

In [None]:
# Ensure the process is reproducible
# In statistics, samples are generally chosen at random.  A random number generator
# selects the samples, starting from a seed value.  Setting the seed explicitly
# makes the results reproducible.  So that everyone gets the same results in this
# lab, the seed value was randomly chosen as 3842.
set.seed(3842)
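
As a quick check of what seeding does, the following base R sketch resets the seed before each of two random draws and confirms that the draws match. The names `first_draw` and `second_draw` are illustrative and are not used elsewhere in the lab.

```r
# Draw five random row numbers twice, resetting the seed before each draw
set.seed(3842)
first_draw <- sample(1:100, 5)

set.seed(3842)
second_draw <- sample(1:100, 5)

# The same seed gives the same draws
print(identical(first_draw, second_draw))
```

If you run this sketch in the notebook, run `set.seed(3842)` again afterwards so the later cells start from the same random state as everyone else's.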

## Bluemix Object Storage Connectivity

In [None]:
# Placeholder for R Data Frame Auto-code
# custDataRaw



In [None]:
# Primary data set row count
cat(sprintf("[custDataRaw] has %d rows:\n", nrow(custDataRaw)))

In [None]:
# Summary Stats for entire data set
summary(custDataRaw)

In [None]:
# Create index of data rows to facilitate partitioning
# The createDataPartition function will randomly pick 90% of the rows which will be used for training/testing data sets
# 10% will be left out for a validation data set
trainIndex_temp <- createDataPartition(y= custDataRaw$CHURN, p=0.9, list = FALSE)

# 10% data goes in here (validation)
# Notice the "-" symbol to indicate "not" the 90%
validation  <- custDataRaw[-trainIndex_temp,]

# This now becomes our working data for training and testing
temp_hold <- custDataRaw[trainIndex_temp,]
# Store it back under the friendlier original name; custDataRaw now holds only the 90% working data
custDataRaw <- temp_hold

# The remaining data will be split again for training and testing data
trainIndex <- createDataPartition(y= temp_hold$CHURN, p=0.8, list = FALSE)
train <- temp_hold[trainIndex,]
test <- temp_hold[-trainIndex,]
# 80% for training
# 20% for testing



In [None]:
# Training and Testing data sets row counts
cat(sprintf("[train] has %d rows:\n", nrow(train)))
cat(sprintf("[test] has %d rows:\n", nrow(test)))
cat(sprintf("[validation] has %d rows:\n", nrow(validation)))
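
The two-stage split above leaves roughly 72% of the original rows for training (0.9 x 0.8), 18% for testing and 10% for validation. The sketch below reproduces those proportions on a toy data frame. It is only an illustration under stated assumptions: it uses base R's `sample()` instead of `createDataPartition`, so unlike the lab code it does not keep the CHURN classes balanced across the splits, and the `toy_*` names and the `id` column are hypothetical.

```r
# Toy stand-in for the lab data: 1000 rows with a hypothetical id column
set.seed(3842)
toy <- data.frame(id = 1:1000)

# Stage 1: keep 90% as working data, hold out 10% for validation
keep_idx <- sample(nrow(toy), size = round(0.9 * nrow(toy)))
toy_working <- toy[keep_idx, , drop = FALSE]
toy_validation <- toy[-keep_idx, , drop = FALSE]

# Stage 2: split the working data into 80% training and 20% testing
train_idx <- sample(nrow(toy_working), size = round(0.8 * nrow(toy_working)))
toy_train <- toy_working[train_idx, , drop = FALSE]
toy_test <- toy_working[-train_idx, , drop = FALSE]

# Share of the original rows that ends up in each set
split_shares <- c(train = nrow(toy_train),
                  test = nrow(toy_test),
                  validation = nrow(toy_validation)) / nrow(toy)
print(split_shares)
```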

## Decision Tree Classifier

Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) 
to conclusions about the item's target value (represented in the leaves). It is one of the predictive modelling approaches used in statistics, 
data mining and machine learning. 

If you want to learn more about decision trees, check this out:
[Decision Tree Learning](https://en.wikipedia.org/wiki/Decision_tree_learning)

In [None]:
# Using the training data (train), create a classification tree.
# The target is "CHURN", the predictors are every other variable except ID.
# The target is cast from boolean to a character for ease of model interpretation.

fitCART <- rpart(as.character(CHURN) ~ Gender + Status + Children + Est.Income +
                 Car.Owner + Age + LongDistance + International + Local +
                 Dropped + Paymethod + LocalBilltype + LongDistanceBilltype +
                 Usage + RatePlan,
             data = train,
             method="class")

# The resulting model is placed into an object called "fitCART"

In [None]:
# The rpart.plot library helps us visualize the resulting tree.
rpart.plot(fitCART)

Each node shows
- the predicted class (CHURN)
- the predicted probability of CHURN
- the percentage of observations in the node

In [None]:
# Using the "predict" function we measure our model's performance using the test data
prediction <- predict(fitCART,test,type="class")

In [None]:
# Show side by side, the actual outcome vs. the predicted outcome
finalResults <- data.frame(Actual = test$CHURN, Predicted = prediction)

In [None]:
# Taking a peek at the resulting data frame
head(finalResults[order(finalResults$Actual, decreasing=TRUE), ], 10)

## Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

If you want to learn more about the confusion matrix, check this out:
[Confusion Matrix Cheatsheet](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)
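
To make the terminology concrete, here is a minimal base R sketch that builds a confusion matrix with `table()` and derives three common metrics from it. The ten labels are made up for illustration, and the "T"/"F" coding is an assumption; the lab data may code CHURN differently.

```r
# Hypothetical actual and predicted churn labels for ten customers
actual    <- factor(c("T", "T", "T", "T", "F", "F", "F", "F", "F", "F"),
                    levels = c("F", "T"))
predicted <- factor(c("T", "T", "T", "F", "F", "F", "F", "F", "T", "T"),
                    levels = c("F", "T"))

# Rows are the predictions, columns are the true labels
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Treat "T" (churned) as the positive class
tp <- cm["T", "T"]  # true positives: predicted T, actually T
tn <- cm["F", "F"]  # true negatives: predicted F, actually F
fp <- cm["T", "F"]  # false positives: predicted T, actually F
fn <- cm["F", "T"]  # false negatives: predicted F, actually T

accuracy    <- (tp + tn) / sum(cm)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)       # share of actual churners that were caught
specificity <- tn / (tn + fp)       # share of non-churners correctly cleared
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

caret's `confusionMatrix` reports these same quantities along with several others.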

In [None]:
# Overall, how well did our model perform?
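# Recent caret versions require both arguments to be factors with the same levels
confusionMatrix(prediction, factor(test$CHURN, levels = levels(prediction)))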

## Random Forest Classifier

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

If you want to learn more about Random Forests then check this out:
[Random Forests](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
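
The "mode of the classes" step can be sketched in a few lines of base R. This is only an illustration of majority voting, not how the randomForest package is implemented; the vote matrix and the helper `majority_vote` are hypothetical.

```r
# Hypothetical votes from five trees (columns) for four customers (rows)
votes <- matrix(c("T", "F", "T", "T", "F",
                  "F", "F", "F", "T", "F",
                  "T", "T", "T", "T", "F",
                  "F", "T", "F", "F", "F"),
                nrow = 4, byrow = TRUE)

# Return the most frequent class among one customer's votes
majority_vote <- function(row_votes) {
  counts <- table(row_votes)
  names(counts)[which.max(counts)]
}

# The forest's prediction for each customer is the majority vote of its trees
forest_prediction <- apply(votes, 1, majority_vote)
print(forest_prediction)
```

With an odd number of trees and two classes a tie cannot occur, which keeps the sketch simple.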


In [None]:
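# Train a random forest of 100 trees, trying 3 randomly chosen predictors at each split.
# The model is fit on custDataRaw, which at this point holds the 90% working data;
# the 10% validation set stays unseen.
fitRandomForests <- randomForest(as.factor(CHURN) ~ Gender + Status + Children + Est.Income +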
                    Car.Owner + Age + LongDistance + International + Local +
                    Dropped + Paymethod + LocalBilltype + LongDistanceBilltype +
                    Usage + RatePlan,
                    data=custDataRaw,
                    importance=TRUE,
                    ntree=100,
                    mtry=3
                    )

In [None]:
# A nice feature of Random Forests is that it provides an easy lens into the most important features.
varImpPlot(fitRandomForests, 
           sort=TRUE,
           main="Variable Importance",
           n.var=13)

Random Forests produce two types of importance measures. Accuracy (MeanDecreaseAccuracy) measures how much worse the model predicts the out-of-bag data when the values of a variable are randomly permuted, so a large decrease in accuracy is expected for very predictive variables. Gini (MeanDecreaseGini) digs into the mathematics behind decision trees: it totals how much the splits on a variable reduce node impurity, averaged over all trees. In both cases a high score means the variable was important.
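
To give a feel for the Gini measure, the sketch below computes the impurity decrease for a single hypothetical split in base R. It is a simplified illustration: the node contents and the helper `gini_impurity` are made up, and the randomForest package accumulates such decreases over every split and tree rather than looking at one split in isolation.

```r
# Gini impurity of a node: 1 minus the sum of squared class shares
gini_impurity <- function(labels) {
  shares <- table(labels) / length(labels)
  1 - sum(shares^2)
}

# Hypothetical parent node with 5 churners ("T") and 5 non-churners ("F")
parent <- c(rep("T", 5), rep("F", 5))

# A candidate split sends the rows to two purer child nodes
left  <- c(rep("T", 4), rep("F", 1))
right <- c(rep("T", 1), rep("F", 4))

# Decrease in impurity = parent impurity minus the size-weighted child impurities
weighted_children <- (length(left) * gini_impurity(left) +
                      length(right) * gini_impurity(right)) / length(parent)
gini_decrease <- gini_impurity(parent) - weighted_children
print(gini_decrease)
```

A variable whose splits produce large decreases like this across many trees ends up with a high MeanDecreaseGini.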

Please make a note of the top 10 variables as indicated by **MeanDecreaseAccuracy**.

1.

2.

3.

4.

5.

6.

7.

8.

9.

10.


In [None]:
# Plotting Random Forests' trees is complex and can be misleading.  However, we can plot the Out of Bag (OOB) error rate
# and the error rate of each class as a function of the number of trees generated.
plot(fitRandomForests, main=paste("Error Rate vs. # Trees ( mtry =",fitRandomForests$mtry,")"), 
     type="l", 
     col.main="black",
     lwd=2,
     lty=1)
# One legend entry per plotted curve; the colors follow the plot's default sequence 1, 2, 3, ...
errorColumns <- colnames(fitRandomForests$err.rate)
legend("top", errorColumns, col=seq_along(errorColumns),
       cex=0.8, lty=1, lwd=2, bty="n")

In [None]:
# Overall, how did our Random Forests model perform?
print(fitRandomForests)

In [None]:
# Let's test our model on the validation set, the 10% of the data held out at the start
randomForestsPredictResponse <- predict(fitRandomForests, validation)

In [None]:
# Overall model performance was excellent on a small sampling of data
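# Recent caret versions require both arguments to be factors with the same levels
confusionMatrix(randomForestsPredictResponse,
                reference=factor(validation$CHURN,
                                 levels=levels(randomForestsPredictResponse)))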

# END OF NOTEBOOK EXERCISE