# Machine Learning Overview

+ What is Machine Learning? 
+ Machine Learning vs Traditional Development 
+ Types of Machine Learning 
+ Course Content 
+ Machine Learning and Data Science 

### Machine Learning in Action 

+ Is this email spam? 
+ How will people vote? 
+ What will people buy? 

|Traditional Control Logic|Machine Learning Logic|
|-|-|
|If > Case > While > Until|Data > Algorithm > Data Analysis > Model|

### Machine Learning

Machine learning builds a model from example inputs and uses it to make data-driven predictions, rather than following strictly static program instructions.
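
As a rough illustration of that contrast, the sketch below places a hand-written rule next to a model fitted from examples. The toy data, the 30-minute threshold, and the names `rule_based` and `learned_model` are invented for this example and are not part of the course material.

```r
# Toy examples: minutes of departure delay and whether the flight arrived late.
# These values are made up purely for illustration.
examples <- data.frame(
  dep_delay    = c(0, 2, 5, 8, 12, 20, 25, 35, 40, 60),
  arrived_late = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)

# Traditional logic: a developer hard-codes the decision rule.
rule_based <- function(dep_delay) {
  if (dep_delay > 30) 1 else 0
}

# Machine learning logic: the decision rule is estimated from the example data.
learned_model <- glm(arrived_late ~ dep_delay, data = examples, family = binomial)

new_flight <- data.frame(dep_delay = 45)
rule_result <- rule_based(new_flight$dep_delay)
model_probability <- predict(learned_model, new_flight, type = "response")

rule_result
model_probability
```

Changing the behaviour of `rule_based` means editing code, whereas `learned_model` changes when it is refitted on different examples.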

### Types of Machine Learning
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Reinforcement Machine Learning

### Machine Learning Technique Comparison 
|Supervised |Unsupervised |
|-|-|
|Value prediction |Identify clusters of like data |
|Needs training data containing value being predicted|Data does not contain cluster membership |
|Trained model predicts value in new data|Model provides access to data by cluster |
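
The comparison above can be sketched with R's built-in `iris` data set. The choice of `lm()` and `kmeans()`, the columns used, and the three clusters are assumptions made for this illustration only.

```r
# Supervised: the training data contains the value being predicted (Petal.Width).
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
predicted_width <- predict(fit, data.frame(Petal.Length = 4.5))
predicted_width

# Unsupervised: the data has no cluster membership column; the algorithm assigns one.
set.seed(1)
measurements <- iris[, c("Petal.Length", "Petal.Width")]
clusters <- kmeans(measurements, centers = 3)

# The fitted model gives access to the rows by cluster.
rows_by_cluster <- split(measurements, clusters$cluster)
cluster_sizes <- sapply(rows_by_cluster, nrow)
cluster_sizes
```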

# Understanding Machine Learning With R

### Skills Required

+ Software development experience
+ Experience with data in tables
+ Basic math and statistics skills
+ Passion to understand

### Why This Course? 

+ Add Machine Learning skills 
+ Learn something new
+ Learn about Data Science

<img src="http://2s7gjr373w3x22jf92z99mgm5w.wpengine.netdna-cdn.com/wp-content/uploads/2016/01/data-science-venn-diagram.png" alt="Data Science" height="50%" width="50%" align="left" />

### Machine Learning Workflow 

An **orchestrated and repeatable** pattern which systematically transforms and processes information to create prediction solutions. 

1. Asking the right question
1. Preparing data
1. Selecting the algorithm
1. Training the model
1. Testing the model


<img src="https://www.class-central.com/report/app/uploads/2017/05/ml-workflow.jpg" alt="Machine Learning Workflow" height="50%" width="50%" align="left" />

The machine learning workflow, via [UpX Academy](https://upxacademy.com/introduction-machine-learning/)

### Machine Learning Workflow Guidelines 
+ Early steps are most important (Each step depends on previous steps)
+ Expect to go backwards (Later knowledge affects previous steps)
+ Data is never as you need it (Data will have to be altered)
+ More data is better (More Data => Better Results )
+ Don't pursue a bad solution (Reevaluate, fix or quit)

### 1 - Asking the Right Question

1. Don't we already have the question? 
    + "Predict if a flight will be on-time" 
    + But we need a statement to direct and validate the work 
    + Define the end goal, the starting point, and how to achieve the goal
2. Solution Statement Goals 
    + Define scope (including data sources) 
    + Define target performance 
    + Define context for usage 
    + Define how solution will be created
3. Scope and Data Sources 
    + US flights only 
    + Flights between US airports only 
    + DOT database is a good source 
    + "Using DOT data, predict if a flight would be on time"
4. Data
    + Preliminary data review 
    + Delays tracked, not on-time
    + "Using DOT data, predict if a flight would be delayed"
5. Performance Targets
    + Binary result (True or False) 
    + Coin Flip 50% Accuracy 
    + 70% Accuracy is common target 
    + "Using DOT data, predict with 70+% accuracy if a flight would be delayed"
6. Context 
    + Data driven results 
    + DOT defines "delayed" as arriving 15 or more minutes after the scheduled time 
    + "Using DOT data, predict with 70+% accuracy if a flight would arrive 15+ minutes after the scheduled arrival time."
7. Solution Creation
    + Machine Learning Workflow 
        + Process DOT data 
        + Transform data as required 

**"Use the Machine Learning Workflow to process and transform DOT data to create a prediction model. This model must predict whether a flight would arrive 15+ minutes after the scheduled arrival time with 70+% accuracy."**

### 2 - Preparing Your Data

1. Find the data we need 
1. Inspect and clean the data 
1. Explore the data 
1. Mold the data to Tidy data 

#### Tidy Data 

+ Tidy datasets are easy to manipulate, model and visualize, and have a specific structure. ***-- Hadley Wickham***
    + each **variable** is a **column**
    + each **observation** is a **row** 
    + each type of **observational unit** is a **table**. 
+ 50-80% of an ML project is spent getting, cleaning, and organizing data.
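
A small sketch of molding data into that structure, using base R's `reshape()`. The `wide` data frame and its numbers are invented for illustration: its month columns hold values of one variable, so they are gathered into a `month` column with one observation per row.

```r
# Hypothetical "wide" table: one column per month, so the variable
# "month" is hidden in the column names.
wide <- data.frame(
  carrier = c("AA", "DL"),
  Jan = c(12, 8),
  Feb = c(15, 9)
)

# Reshape so each variable is a column and each observation is a row.
tidy <- reshape(
  wide,
  direction = "long",
  varying   = c("Jan", "Feb"),
  v.names   = "delayed_flights",
  timevar   = "month",
  times     = c("Jan", "Feb"),
  idvar     = "carrier"
)
rownames(tidy) <- NULL
tidy
```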

#### Getting Data

+ Google 
+ Government databases 
+ Professional or company data sources 
+ Your company 
+ Your department 
+ All of the above

#### Data Rules

1. The closer the data is to what you are predicting, the better.
2. Data will never be in the format you need.
3. Accurately predicting rare events is difficult.
4. Track how you manipulate data.

#### Data Cleanup Demo

+ The **US Department of Transportation (DOT)** collects on-time performance data. Get the DOT on-time data from [here](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time).

+ Select the columns listed below, click "Download" to get the data as a Zip file, and extract the CSV file to a folder.

|Columns|Columns|
|-|-|
|DAY_OF_MONTH|DEST_AIRPORT_SEQ_ID| 
|DAY_OF_WEEK|DEST| 
|UNIQUE_CARRIER|DEP_TIME| 
|CARRIER|DEP_DEL15| 
|TAIL_NUM|DEP_TIME_BLK| 
|FL_NUM|ARR_TIME| 
|ORIGIN_AIRPORT_ID|ARR_DEL15| 
|ORIGIN_AIRPORT_SEQ_ID|CANCELLED| 
|ORIGIN|DIVERTED| 
|DEST_AIRPORT_ID|DISTANCE|


In [None]:
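# Read the CSV file. read.csv2() defaults to ';' separators and ',' decimals,
# so the separator is overridden here.
data <- read.csv2('./data/Jan2015_ontime_data.csv', header = TRUE, sep = ',', stringsAsFactors = FALSE)

# Find the number of records
nrow(data)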

In [None]:
# Check column names
names(data)

In [None]:
# Define a vector of specific 3-character airport codes
# It will be used to filter the related records
airports <- c('ATL','LAX','ORD','DFW','JFK','SFO','CLT','LAS','PHX')

In [None]:
# Subset records
data <- subset(data, DEST %in% airports & ORIGIN %in% airports)

# find the no. of records
nrow(data)

In [None]:
# inspect data
head(data, 2)

In [None]:
# inspect data
tail(data, 2)

**NOTE**: Both `head` and `tail` show that an 'X' column was added during the CSV import. It has no values, so it can be discarded.
Setting the column to NULL removes it.

In [None]:
data$X <- NULL

In [None]:
# inspect data again
head(data, 2)

In [None]:
# inspect data again
tail(data, 2)

**NOTE**: The 'X' column is no longer present.
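
One plausible cause of such an empty column, stated here as an assumption rather than something verified against the DOT file, is a trailing comma at the end of every line. The sketch below reproduces that situation with a tiny inline CSV (the contents of `csv_text` are invented).

```r
# Inline CSV where every line, including the header, ends with a comma.
csv_text <- "ORIGIN,DEST,\nATL,LAX,\nJFK,SFO,\n"

raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The empty header field is given a syntactic name and the column holds only NA.
names(raw)
had_x_column <- "X" %in% names(raw)
x_is_empty <- all(is.na(raw$X))

# Assigning NULL drops the column.
raw$X <- NULL
names(raw)
```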

#### Columns to Eliminate

+ Not used 
+ No values 
+ Duplicates 
+ Correlated Columns
    + Same information in a different format,e.g., ID and value associated with ID 

In [None]:
# finding correlation, 1 -> means perfect correlation
cor(data[c('ORIGIN_AIRPORT_ID', 'ORIGIN_AIRPORT_SEQ_ID')])

In [None]:
# finding correlation, 1 -> means perfect correlation
cor(data[c('DEST_AIRPORT_ID', 'DEST_AIRPORT_SEQ_ID')])

The 'ORIGIN_AIRPORT_ID' and 'ORIGIN_AIRPORT_SEQ_ID' columns are correlated, and so are 'DEST_AIRPORT_ID' and 'DEST_AIRPORT_SEQ_ID'.
So, we'll drop 'ORIGIN_AIRPORT_SEQ_ID' and 'DEST_AIRPORT_SEQ_ID' from the data.

In [None]:
data$ORIGIN_AIRPORT_SEQ_ID <- NULL
data$DEST_AIRPORT_SEQ_ID <- NULL

In [None]:
# inspect data
head(data,2)

In [None]:
# inspect data
tail(data, 2)

So the columns are dropped now.

In [None]:
# Checking for mismatch
mismatched_rows <- data[data$CARRIER != data$UNIQUE_CARRIER, ]
nrow(mismatched_rows)

There are no mismatched rows, which means these two columns are identical. So we can drop 'UNIQUE_CARRIER'.

In [None]:
data$UNIQUE_CARRIER <- NULL

In [None]:
# inspect data again
head(data,2)

In [None]:
# inspect data again
tail(data,2)

In [None]:
# On-Time Data
on_time_data <- data[!is.na(data$ARR_DEL15) & !is.na(data$DEP_DEL15) & data$ARR_DEL15 != "" & data$DEP_DEL15 != "" , ]
nrow(on_time_data)

In [None]:
# convert string column values to integer - distance, cancelled, diverted columns etc.
on_time_data$DISTANCE <- as.integer(on_time_data$DISTANCE)
on_time_data$CANCELLED <- as.integer(on_time_data$CANCELLED)
on_time_data$DIVERTED <- as.integer(on_time_data$DIVERTED)

In [None]:
# Factorise columns to view distinct types of values 

on_time_data$CARRIER <- as.factor(on_time_data$CARRIER )
on_time_data$DEP_TIME_BLK <- as.factor(on_time_data$DEP_TIME_BLK )
on_time_data$ORIGIN <- as.factor(on_time_data$ORIGIN )
on_time_data$DEST <- as.factor(on_time_data$DEST )
on_time_data$DAY_OF_WEEK <- as.factor(on_time_data$DAY_OF_WEEK )
on_time_data$ORIGIN_AIRPORT_ID <- as.factor(on_time_data$ORIGIN_AIRPORT_ID )
on_time_data$DEST_AIRPORT_ID <- as.factor(on_time_data$DEST_AIRPORT_ID )
on_time_data$ARR_DEL15 <- as.factor(on_time_data$ARR_DEL15 )
on_time_data$DEP_DEL15 <- as.factor(on_time_data$DEP_DEL15 )
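
To see what these conversions do, here is a minimal sketch on invented values (the `carrier` and `distance_text` vectors are not taken from the DOT data).

```r
# as.factor() stores the distinct values as levels.
carrier <- c("AA", "DL", "AA", "UA", "DL", "AA")
carrier_factor <- as.factor(carrier)

levels(carrier_factor)   # the distinct carriers
table(carrier_factor)    # how often each one occurs

# as.integer() parses numeric strings and drops the fractional part.
distance_text <- c("2475.00", "337.00")
distance <- as.integer(distance_text)
distance
```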

In [None]:
# On-time and delayed flights records
tapply(on_time_data$ARR_DEL15, on_time_data$ARR_DEL15, length)

In [None]:
# Percentage of delayed flights, using the counts reported by tapply() above
delayed_flight_percentage <- (6460 / (25664+ 6460)) * 100
delayed_flight_percentage
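
The same percentage can be computed without typing the counts by hand. In the sketch below, `arr_del15` is a stand-in vector rebuilt from the two counts used above, and the level labels "0" and "1" are an assumption about how the column is coded.

```r
# Stand-in for on_time_data$ARR_DEL15, rebuilt from the counts used above.
arr_del15 <- factor(rep(c("0", "1"), times = c(25664, 6460)))

# Share of each class, in percent.
class_share <- prop.table(table(arr_del15)) * 100
on_time_pct <- class_share[["0"]]
delayed_pct <- class_share[["1"]]

round(class_share, 2)
```

With a split like this, a model that always predicts "not delayed" would already be right for roughly 80% of flights, which is worth keeping in mind when reading an overall accuracy figure.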

### 3 - Selecting Your Algorithm

+ Role of algorithm 
+ Perform algorithm selection 
    + Use solution statement to filter algorithms 
    + Discuss best algorithms 
    + Select one initial algorithm 

<img src='./resources/role_of_algorithm.png' alt='role_of_algorithm' />

#### Algorithm Selection

+ Compare factors 
+ Difference of opinions about which factors are important 
+ You will develop your own factors 

#### Algorithm Decision Factors

+ Learning Type 
    + "Use the Machine Learning Workflow to process and transform DOT data to create a prediction model. This model must predict whether a flight would arrive 15+ minutes after the scheduled arrival time with 70+% accuracy." 
    + Prediction Model => Supervised machine learning 
+ Result Type
    + Regression 
        + Continuous values 
        + price = A * (number of bedrooms) + B * size, etc.
    + Classification 
        + Discrete values 
        + small, medium, large 
        + 1-100, 101-200, 201-300 
        + true or false
    + "... predict whether a flight would arrive 15+ minutes after the scheduled arrival time."
        + ARR_DEL15 -> Binary (TRUE/FALSE) 
        + Algorithm must support classification 
        + Binary classification
+ Complexity
    + Keep it Simple 
    + Eliminate "ensemble" algorithms 
        + Container algorithm 
        + Multiple child algorithms 
        + Boost performance 
        + Can be difficult to debug
+ Basic vs enhanced
    + Enhanced 
        + Variation of Basic 
        + Performance improvements 
        + Additional functionality 
        + More complex 
    + Basic 
        + Simpler 
        + Easier to understand

#### Candidate Algorithms

+ Naive Bayes
    + Based on likelihood and probability 
    + Every feature has the same weight 
    + Requires smaller amount of data
+ Logistic Regression
    + Despite the name, it is used for classification and gives a binary result 
    + Relationships between features are weighted
+ Decision Trees
    + Binary Tree 
    + Node contains decision 
    + Requires enough data to determine nodes and splits

#### Selected Algorithm  -> Logistic Regression

+ Simple - easy to understand 
+ Fast - up to 100X faster 
+ Stable to data changes 

#### Summary

+ Lots of algorithms available 
+ Selection based on 
    + Learning -> Supervised 
    + Result -> Binary classification 
    + Non-ensemble 
    + Basic
+ Logistic Regression selected for training 
    + Simple, fast, and stable

### 4 - Training the Model

+ Understand the training process
+ Caret package
+ Train algorithm with DOT data

**Machine Learning Training**: Letting specific data teach a Machine Learning algorithm to create a specific forecast model.

**Why Retrain?**: 

+ New data => better predictions 
+ Verify training performance with new data

<img src='./resources/training.png' alt='training model' height="50%" width="50%" align="left" />

#### Selecting Training Features

+ We want minimum features (columns) 
+ Selected features 
    + Origin and Destination 
    + Day Of Week 
    + Carrier 
    + Departure Time Block 
    + Arrival Delay 15 (required)

#### CARET (Classification And Regression Training)

+ Caret - R package
+ Toolset for training and evaluation tasks 
    + Data splitting 
    + Pre-processing 
    + Feature selection 
    + Model tuning 
+ Common interface across algorithms 

In [None]:
# install.packages('caret', repos = 'https://cran.r-project.org/')
# install.packages('e1071', repos = 'https://cran.r-project.org/', dependencies=TRUE)

In [None]:
library(caret)

+ Algorithms can use random numbers in training. 
+ The random number sequence is based on a seed number. 

In [None]:
# set the seed so that the same training/test split is generated on every run
set.seed(122515)
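
A quick way to see the effect of the seed, using base R's `sample()` as a stand-in for any step that draws random numbers:

```r
# Same seed, same "random" draw.
set.seed(122515)
first_draw <- sample(1:100, 5)

set.seed(122515)
second_draw <- sample(1:100, 5)

same_draw <- identical(first_draw, second_draw)
same_draw

# Without resetting the seed, the sequence simply continues.
third_draw <- sample(1:100, 5)
third_draw
```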

In [None]:
# set features which will be used for prediction
features <- c('ARR_DEL15','DAY_OF_WEEK','CARRIER','DEST','ORIGIN', 'DEP_TIME_BLK')

In [None]:
# filter data frame based on features
filtered_ontime_data <- on_time_data[,features]

# check the column names of the data frame to be sure it's filtered
names(filtered_ontime_data)

+ Data Splitting Requirements 
    + Divide into two data frames by user specified % 
    + Ensure 'ARR_DEL15' ratio of true to false is preserved in training and testing data frames. 

In [None]:
# Split data into Training and Test Data

training_rows <- createDataPartition(filtered_ontime_data$ARR_DEL15, p=0.70, list = FALSE)

training_data <- filtered_ontime_data[training_rows,] #select training rows
test_data <- filtered_ontime_data[-training_rows,] # select other than training rows

In [None]:
# check percentage of rows in each dataset
nrow(training_data)/(nrow(training_data)+ nrow(test_data))*100
nrow(test_data)/(nrow(training_data)+ nrow(test_data))*100
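
The idea of preserving the class ratio in both data frames can be sketched in base R. This is only an illustration of stratified sampling on invented labels, not a description of how `createDataPartition()` is implemented internally.

```r
set.seed(42)

# Invented labels: 80 on-time flights and 20 delayed flights.
labels <- factor(rep(c("on_time", "delayed"), times = c(80, 20)))

# Sample 70% of the row numbers separately within each class.
rows_by_class <- split(seq_along(labels), labels)
train_rows <- unlist(
  lapply(rows_by_class, function(rows) {
    sample(rows, size = round(0.70 * length(rows)))
  }),
  use.names = FALSE
)

train_labels <- labels[train_rows]
test_labels  <- labels[-train_rows]

# Both parts keep the original 80/20 ratio.
prop.table(table(train_labels))
prop.table(table(test_labels))
```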

### Training algorithm with training data to create a Model

In [None]:
logistic_reg_model <- train(ARR_DEL15 ~ . , data = training_data, method = 'glm', family = 'binomial')

# check some stats about trained model
logistic_reg_model

### 5 - Testing Your Model's Accuracy

+ Evaluate the model against test data 
+ Interpret results 
+ Improve results 

**Note**: Statistics are only data. We define what is good or bad.

In [None]:
# predict delayed flights for the test data set using the trained model
prediction <- predict(logistic_reg_model, test_data)

In [None]:
# generate a confusion matrix to assess the prediction model's capabilities
confusion_matrix <- confusionMatrix(prediction, test_data[, 'ARR_DEL15'])
confusion_matrix

From **Confusion Matrix** we found that:
+ Delayed flights not being predicted 
+ Need to increase prediction accuracy 
+ What are our options? 
    + Improving data quality
    + Selecting the algorithm which performs better
    + Training the model which provides more accurate results
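
The sketch below shows, on invented labels, how a confusion matrix can reveal this kind of problem: a model that never predicts "delayed" can still reach a high overall accuracy. It uses base R's `table()` rather than `confusionMatrix()`, and all values are made up for illustration.

```r
class_levels <- c("on_time", "delayed")

# Invented ground truth: 8 on-time flights and 2 delayed flights.
actual <- factor(c(rep("on_time", 8), rep("delayed", 2)), levels = class_levels)

# A model that always answers "on_time".
predicted <- factor(rep("on_time", 10), levels = class_levels)

# Rows are predictions, columns are the actual classes.
cm <- table(Predicted = predicted, Actual = actual)
cm

# Overall accuracy looks acceptable ...
accuracy <- sum(diag(cm)) / sum(cm)

# ... but none of the delayed flights were caught.
delayed_recall <- cm["delayed", "delayed"] / sum(cm[, "delayed"])

accuracy
delayed_recall
```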

#### Options for Improving performance (Round - 1)

+ Add additional columns 
    + DEP_DEL15
    + Will predict arrival delay 
    + Departure delay detected late 
    + May not be useful predictor 
+ Adjust training settings 
    + Improves performance only slightly, so it is not very effective in this case
+ Select a better algorithm (Ensemble algorithm)
    + Random Forest

#### Prediction using Random Forest Algorithm

In [None]:
library(randomForest)

In [None]:
# creating random forest model
# training_data[-1] -> means excluding the first column i.e., 'ARR_DEL15'
rf_model <- randomForest(training_data[-1], training_data$ARR_DEL15, proximity = TRUE, importance = TRUE)

In [None]:
# predict delayed flights for the test data set using the trained model
prediction <- predict(rf_model, test_data)

In [None]:
# generate a confusion matrix to assess the prediction model's capabilities
confusion_matrix <- confusionMatrix(prediction, test_data[, 'ARR_DEL15'])
confusion_matrix

#### Options for Improving performance (Round - 2) 

+ Adjust training settings 
+ Select a better algorithm 
+ Rethink the problem 
    + What causes delays? 
    + Subject knowledge required? 
    + Research shows weather is a delay factor
    
#### Performance Improvement Cycle

+ Change data, settings, algorithm or all of the above 
+ Improve each cycle 
+ The difficult part is knowing when to stop

#### Summary

+ Evaluated Logistic Regression model 
    + predict() 
    + confusionMatrix() 
+ Improved performance using Random Forest algorithm 
+ Defined future improvements incorporating weather data

### Key Points

+ Machine Learning is here today
+ Machine Learning is data driven
+ Follow the Machine Learning Workflow
    + Ask the right questions
        + Started with question
        + Used requirements and knowledge to transform
        + Resulted in solution statement
    + Preparing data
        + Retrieved data from DOT site
        + Cleaned data
        + Molded data        
    + Selecting the algorithm
        + Learning type
        + Result type
        + Complexity
        + Basic vs. Enhanced
    + Training the model
        + Split data - 70%/30% (training data/test data)
        + Trained with training data
    + Testing the model
        + Predicted with test data
        + Evaluated Prediction
        + Switched to Random Forest
        + Evaluated Random Forest
        + Considered adding weather data        