## Naive Bayes - Hands on Session

### Agenda
  1. Manual Calculation

      ♦ Demonstration with an example

  2. Case Study - Flights Arrival Prediction

      ♦ Problem Description

      ♦ Data Understanding

      ♦ Split the data into Train and Validation sets

      ♦ Data Engineering to prepare the inputs for the Naive Bayes model

      ♦ Build a Naive Bayes model


### Problem Description

##### Predict whether the weather conditions are favorable for playing tennis, based on the historical data collected

In [1]:
### Read data
data<-read.csv("/home/divyas/Lab/Tennis_Data.csv")
head(data)

| | Outlook | Temperature | Humidity | Windy | Class |
|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |


#### Should we play tennis when Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = Strong?

#### Manual Computation



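Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B)

We need to compare two posterior probabilities:

P(Yes|Outlook = Sunny,Temperature = Cool,Humidity = High, Windy = Strong)

P(No|Outlook = Sunny,Temperature = Cool,Humidity = High, Windy = Strong)

Naive Bayes assumes the features are conditionally independent given the class. The denominator P(B) is the same for both classes, so it can be dropped when comparing them; each posterior is then proportional to a product of simple terms:

P(Yes|Outlook = Sunny,Temperature = Cool,Humidity = High, Windy = Strong) ∝

       P(Yes) * P(Outlook = Sunny|Yes) * P(Temperature = Cool|Yes) * P(Humidity = High|Yes) * P(Windy = Strong|Yes)


P(No|Outlook = Sunny,Temperature = Cool,Humidity = High, Windy = Strong) ∝

        P(No) * P(Outlook = Sunny|No) * P(Temperature = Cool|No) * P(Humidity = High|No) * P(Windy = Strong|No)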
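
Written out in full, with the denominator that both class scores share, the posterior under the naive (conditional independence) assumption can be sketched as follows. The symbols x_1, ..., x_n stand for the observed feature values and C for the class, matching the P(xi|C) notation used in this notebook; the formula is the standard textbook form, not something specific to this dataset.

```latex
P(C \mid x_1, \dots, x_n)
  = \frac{P(C)\,\prod_{i=1}^{n} P(x_i \mid C)}
         {\sum_{c \in \{\text{Yes},\,\text{No}\}} P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}
```

Because the denominator does not depend on C, comparing the numerators alone is enough to pick the more likely class.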


In [12]:
data

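| Outlook | Temperature | Humidity | Windy | Class |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |

Only 10 rows appear in this output. The probabilities below are computed from the full data set of 14 rows (9 Yes, 5 No).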


In [14]:
2/9

#### Estimating P(xi|C)

##### Outlook
P(sunny|yes) = 2/9 = 0.22 ; P(sunny|no) = 3/5 = 0.6


P(overcast|yes) = 4/9 = 0.44 ;  P(overcast|no) = 0

P(rain|yes) = 3/9 = 0.33 ;  P(rain|no) = 2/5 = 0.4

##### Temperature
P(hot|yes) = 2/9 = 0.22  ;  P(hot|no) = 2/5 = 0.4

P(mild|yes) = 4/9 = 0.44 ;  P(mild|no) = 2/5 = 0.4

P(cool|yes) = 3/9 = 0.33 ;  P(cool|no) = 1/5 = 0.2

##### Humidity
P(high|yes) = 3/9 = 0.33 ; P(high|no) = 4/5 = 0.8

P(normal|yes) = 6/9 = 0.67 ; P(normal|no) = 1/5 = 0.2

##### Windy
P(strong|yes) = 3/9 = 0.33 ; P(strong|no) = 3/5 = 0.6

P(weak|yes) = 6/9 = 0.67 ; P(weak|no) = 2/5 = 0.4

In [34]:
# Score for Yes, proportional to P(Yes|Outlook = Sunny,Temperature = Cool,Humidity = High, Windy = Strong): P(Yes) * P(Outlook = Sunny|Yes) * P(Temperature = Cool|Yes) * P(Humidity = High|Yes) * P(Windy = Strong|Yes)

(9/14) * (2/9) * (3/9) * (3/9) * (3/9)

In [35]:
# Score for No, proportional to P(No|Outlook = Sunny,Temperature = Cool,Humidity = High, Windy = Strong): P(No) * P(Outlook = Sunny|No) * P(Temperature = Cool|No) * P(Humidity = High|No) * P(Windy = Strong|No)

(5/14) * (3/5) * (1/5) * (4/5) * (3/5)
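
The two products above are unnormalized scores, not probabilities. A minimal sketch of how they could be turned into posteriors that sum to 1 is shown here; the variable names `score`, `posterior` and `predicted` are illustrative and do not appear elsewhere in the notebook.

```r
# Unnormalized class scores, using the fractions from the cells above
score <- c(
  Yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9),
  No  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)
)

# Dividing by the sum plays the role of P(B): the posteriors now add up to 1
posterior <- score / sum(score)
print(round(posterior, 4))

# The predicted class is the one with the larger posterior
predicted <- names(which.max(posterior))
print(predicted)
```

With these numbers, No receives roughly 0.80 of the posterior mass, so the manual calculation says not to play.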

![NaiveBayes.png](attachment:NaiveBayes.png)

### Naive Bayes implementation in R

In [36]:
## load library
library(e1071)

In [40]:
## Convert to factors

for (col in colnames(data)){
    data[,col] = as.factor(data[,col])
}

In [41]:
## Build Naive Bayes Model

model <- naiveBayes(Class ~ ., data = data)
model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
       No       Yes 
0.3571429 0.6428571 

Conditional probabilities:
     Outlook
Y      Overcast      Rain     Sunny
  No  0.0000000 0.4000000 0.6000000
  Yes 0.4444444 0.3333333 0.2222222

     Temperature
Y          Cool       Hot      Mild
  No  0.2000000 0.4000000 0.4000000
  Yes 0.3333333 0.2222222 0.4444444

     Humidity
Y          High    Normal
  No  0.8000000 0.2000000
  Yes 0.3333333 0.6666667

     Windy
Y        Strong      Weak
  No  0.6000000 0.4000000
  Yes 0.3333333 0.6666667


In [46]:
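## Encode the new row with the training factor levels so it is coded the same way as the training data
new_data = data.frame(Outlook = "Sunny",Temperature = "Cool",Humidity = "High", Windy = "Strong")
for (col in colnames(new_data)) new_data[,col] = factor(new_data[,col], levels = levels(data[,col]))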

In [47]:
predict(model,new_data)

## Naive Bayes - Flights Data

### Problem Description


Predict whether a flight will be delayed, using the following variables:

CARRIER - Carrier code (e.g. DH, RU)

DEP_TIME - Departure time in HHMM format (e.g. 1455)

DEST - Destination airport

FL_DATE - Date of the flight

ORIGIN - Origin airport

Weather - Whether the weather is stormy (1 for stormy, 0 for not stormy)

Flight.Status - Target variable: whether the flight was delayed or on time

### Data Reading

In [3]:
### Read data
data <- read.csv("/home/divyas/Lab/Flights_Data.csv", stringsAsFactors = TRUE)


### Data Understanding

In [16]:
## Dimensions

dim(data)

In [17]:
## Structure

str(data)

'data.frame':	2201 obs. of  7 variables:
 $ CARRIER      : Factor w/ 8 levels "CO","DH","DL",..: 5 2 2 2 2 2 2 2 2 2 ...
 $ DEP_TIME     : int  1455 1640 1245 1709 1035 839 1243 1644 1710 2129 ...
 $ DEST         : Factor w/ 3 levels "EWR","JFK","LGA": 2 2 3 3 3 2 2 2 2 2 ...
 $ FL_DATE      : Factor w/ 31 levels "01/01/2004","01/02/2004",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ORIGIN       : Factor w/ 3 levels "BWI","DCA","IAD": 1 2 3 3 3 3 3 3 3 3 ...
 $ Weather      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Flight.Status: Factor w/ 2 levels "delayed","ontime": 2 2 2 2 2 2 2 2 2 2 ...


In [18]:
## First few rows

head(data)

| | CARRIER | DEP_TIME | DEST | FL_DATE | ORIGIN | Weather | Flight.Status |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<fct>` |
| 1 | OH | 1455 | JFK | 01/01/2004 | BWI | 0 | ontime |
| 2 | DH | 1640 | JFK | 01/01/2004 | DCA | 0 | ontime |
| 3 | DH | 1245 | LGA | 01/01/2004 | IAD | 0 | ontime |
| 4 | DH | 1709 | LGA | 01/01/2004 | IAD | 0 | ontime |
| 5 | DH | 1035 | LGA | 01/01/2004 | IAD | 0 | ontime |
| 6 | DH | 839 | JFK | 01/01/2004 | IAD | 0 | ontime |


In [19]:
## Last few rows

tail(data)

| | CARRIER | DEP_TIME | DEST | FL_DATE | ORIGIN | Weather | Flight.Status |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<fct>` |
| 2196 | RU | 650 | EWR | 01/31/2004 | IAD | 0 | ontime |
| 2197 | RU | 644 | EWR | 01/31/2004 | DCA | 0 | ontime |
| 2198 | RU | 1653 | EWR | 01/31/2004 | IAD | 0 | ontime |
| 2199 | RU | 1558 | EWR | 01/31/2004 | DCA | 0 | ontime |
| 2200 | RU | 1403 | EWR | 01/31/2004 | DCA | 0 | ontime |
| 2201 | RU | 1736 | EWR | 01/31/2004 | DCA | 0 | ontime |


### Summary Statistics

In [20]:
## Summary of the data
summary(data)

    CARRIER       DEP_TIME     DEST            FL_DATE     ORIGIN    
 DH     :551   Min.   :  10   EWR: 665   01/22/2004:  86   BWI: 145  
 RU     :408   1st Qu.:1004   JFK: 386   01/06/2004:  85   DCA:1370  
 US     :404   Median :1450   LGA:1150   01/08/2004:  85   IAD: 686  
 DL     :388   Mean   :1369              01/13/2004:  85             
 MQ     :295   3rd Qu.:1709              01/20/2004:  85             
 CO     : 94   Max.   :2330              01/21/2004:  85             
 (Other): 61                             (Other)   :1690             
    Weather        Flight.Status 
 Min.   :0.00000   delayed: 428  
 1st Qu.:0.00000   ontime :1773  
 Median :0.00000                 
 Mean   :0.01454                 
 3rd Qu.:0.00000                 
 Max.   :1.00000                 
                                 

### Split the data into Train and Validation sets

In [21]:
## Split row numbers into 2 sets (70% train, 30% validation)
set.seed(1)
train_rows = sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
validation_rows = setdiff(seq_len(nrow(data)), train_rows)

In [22]:
## Subset into Train and Validation sets
train_data <- data[train_rows,]
validation_data <- data[validation_rows,]

In [23]:
## View the dimensions of the data
dim(data)
dim(train_data)
dim(validation_data)
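
A quick sanity check on the split can be sketched as below. It uses only the row count (2201, as reported by `str(data)`) rather than the data itself, so it runs on its own; the checks confirm that the two index sets do not overlap and together cover every row.

```r
n <- 2201                       # row count reported by str(data)
set.seed(1)
train_rows <- sample(seq_len(n), size = floor(0.7 * n))
validation_rows <- setdiff(seq_len(n), train_rows)

# The two index sets should not overlap and should cover every row
stopifnot(length(intersect(train_rows, validation_rows)) == 0)
stopifnot(length(train_rows) + length(validation_rows) == n)

split_sizes <- c(train = length(train_rows), validation = length(validation_rows))
print(split_sizes)
```

The sizes agree with the 1540 and 661 observations reported for the train and validation sets further down.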

### Data Type Conversion
Check if any data type conversions have to be done.


In [24]:
str(train_data)

'data.frame':	1540 obs. of  7 variables:
 $ CARRIER      : Factor w/ 8 levels "CO","DH","DL",..: 2 1 7 2 3 6 8 5 4 4 ...
 $ DEP_TIME     : int  2133 1254 847 1630 629 1521 900 1455 1503 1534 ...
 $ DEST         : Factor w/ 3 levels "EWR","JFK","LGA": 3 1 3 2 3 1 3 2 3 3 ...
 $ FL_DATE      : Factor w/ 31 levels "01/01/2004","01/02/2004",..: 15 10 31 14 22 7 5 18 9 19 ...
 $ ORIGIN       : Factor w/ 3 levels "BWI","DCA","IAD": 3 2 3 2 2 2 2 1 2 2 ...
 $ Weather      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Flight.Status: Factor w/ 2 levels "delayed","ontime": 1 2 2 2 2 2 2 2 2 1 ...


In [25]:
str(validation_data)

'data.frame':	661 obs. of  7 variables:
 $ CARRIER      : Factor w/ 8 levels "CO","DH","DL",..: 2 2 2 2 2 3 4 4 4 7 ...
 $ DEP_TIME     : int  1640 1245 1709 839 2129 1458 1525 1452 1853 841 ...
 $ DEST         : Factor w/ 3 levels "EWR","JFK","LGA": 2 3 3 2 2 2 2 3 3 3 ...
 $ FL_DATE      : Factor w/ 31 levels "01/01/2004","01/02/2004",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ORIGIN       : Factor w/ 3 levels "BWI","DCA","IAD": 2 3 3 3 3 2 2 2 2 3 ...
 $ Weather      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Flight.Status: Factor w/ 2 levels "delayed","ontime": 2 2 2 2 2 2 2 2 2 2 ...


### Data Engineering

#### Date Manipulation

In [26]:
## Look at the head of Date variable

head(train_data$FL_DATE)

In [28]:
## Parse the date
library(lubridate)

parsed_date = parse_date_time(train_data$FL_DATE, orders = "mdY")

head(parsed_date)


[1] "2004-01-15 UTC" "2004-01-10 UTC" "2004-01-31 UTC" "2004-01-14 UTC"
[5] "2004-01-22 UTC" "2004-01-07 UTC"

In [29]:
## Extracting day

head(day(parsed_date))

In [30]:
## Create new features day, month and year

train_data$day = day(parsed_date)
train_data$month = month(parsed_date)
train_data$year = year(parsed_date)

In [31]:
head(train_data)

| | CARRIER | DEP_TIME | DEST | FL_DATE | ORIGIN | Weather | Flight.Status | day | month | year |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<fct>` | `<int>` | `<dbl>` | `<dbl>` |
| 1017 | DH | 2133 | LGA | 01/15/2004 | IAD | 0 | delayed | 15 | 1 | 2004 |
| 679 | CO | 1254 | EWR | 01/10/2004 | DCA | 0 | ontime | 10 | 1 | 2004 |
| 2177 | UA | 847 | LGA | 01/31/2004 | IAD | 0 | ontime | 31 | 1 | 2004 |
| 930 | DH | 1630 | JFK | 01/14/2004 | DCA | 0 | ontime | 14 | 1 | 2004 |
| 1533 | DL | 629 | LGA | 01/22/2004 | DCA | 0 | ontime | 22 | 1 | 2004 |
| 471 | RU | 1521 | EWR | 01/07/2004 | DCA | 0 | ontime | 7 | 1 | 2004 |


In [32]:
## Check unique values

unique(train_data$month)
unique(train_data$year)

In [33]:
## Remove unnecessary variables

train_data$FL_DATE = NULL
train_data$month = NULL
train_data$year = NULL

In [34]:
## Follow the same steps on Validation
parsed_date = parse_date_time(validation_data$FL_DATE, orders = "mdY")

validation_data$day = day(parsed_date)

validation_data$FL_DATE = NULL
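
If lubridate is not available, the day of the month can also be extracted with base R. The sketch below is an alternative to the cells above, not what the notebook itself uses; the sample dates are taken from the `head(train_data)` output and the variable names are illustrative.

```r
# Dates in month/day/year form, as they appear in FL_DATE
fl_date <- c("01/15/2004", "01/10/2004", "01/31/2004")

# as.Date() needs the format spelled out because the strings are not ISO 8601
parsed <- as.Date(fl_date, format = "%m/%d/%Y")

# format(..., "%d") returns the day as text, so convert it back to an integer
day_of_month <- as.integer(format(parsed, "%d"))
print(day_of_month)
```

When the column is stored as a factor, as FL_DATE is here, wrapping it in `as.character()` before parsing is a reasonable precaution.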

#### Manual Binning

In [35]:
## Check the departure time column

train_data$DEP_TIME

In [36]:
## Create a new placeholder column 

train_data$DEP_TIME_BIN = 0

In [37]:
## Using for loop - create time bins for each hour

for(i in 1:nrow(train_data)) {

    if(train_data$DEP_TIME[i]<100) {train_data$DEP_TIME_BIN[i]="0000-0059"}
    else if(train_data$DEP_TIME[i]<200)  {train_data$DEP_TIME_BIN[i]="0100-0159"}
    else if(train_data$DEP_TIME[i]<300)  {train_data$DEP_TIME_BIN[i]="0200-0259"}
    else if(train_data$DEP_TIME[i]<400)  {train_data$DEP_TIME_BIN[i]="0300-0359"}
    else if(train_data$DEP_TIME[i]<500)  {train_data$DEP_TIME_BIN[i]="0400-0459"}    
    else if(train_data$DEP_TIME[i]<600)  {train_data$DEP_TIME_BIN[i]="0500-0559"}
    else if(train_data$DEP_TIME[i]<700)  {train_data$DEP_TIME_BIN[i]="0600-0659"}
    else if(train_data$DEP_TIME[i]<800)  {train_data$DEP_TIME_BIN[i]="0700-0759"}
    else if(train_data$DEP_TIME[i]<900)  {train_data$DEP_TIME_BIN[i]="0800-0859"}
    else if(train_data$DEP_TIME[i]<1000) {train_data$DEP_TIME_BIN[i]="0900-0959"}
    else if(train_data$DEP_TIME[i]<1100) {train_data$DEP_TIME_BIN[i]="1000-1059"}
    else if(train_data$DEP_TIME[i]<1200) {train_data$DEP_TIME_BIN[i]="1100-1159"}
    else if(train_data$DEP_TIME[i]<1300) {train_data$DEP_TIME_BIN[i]="1200-1259"}
    else if(train_data$DEP_TIME[i]<1400) {train_data$DEP_TIME_BIN[i]="1300-1359"}
    else if(train_data$DEP_TIME[i]<1500) {train_data$DEP_TIME_BIN[i]="1400-1459"}
    else if(train_data$DEP_TIME[i]<1600) {train_data$DEP_TIME_BIN[i]="1500-1559"}
    else if(train_data$DEP_TIME[i]<1700) {train_data$DEP_TIME_BIN[i]="1600-1659"}
    else if(train_data$DEP_TIME[i]<1800) {train_data$DEP_TIME_BIN[i]="1700-1759"}
    else if(train_data$DEP_TIME[i]<1900) {train_data$DEP_TIME_BIN[i]="1800-1859"}
    else if(train_data$DEP_TIME[i]<2000) {train_data$DEP_TIME_BIN[i]="1900-1959"}
    else if(train_data$DEP_TIME[i]<2100) {train_data$DEP_TIME_BIN[i]="2000-2059"}
    else if(train_data$DEP_TIME[i]<2200) {train_data$DEP_TIME_BIN[i]="2100-2159"}
    else if(train_data$DEP_TIME[i]<2300) {train_data$DEP_TIME_BIN[i]="2200-2259"}
    else {train_data$DEP_TIME_BIN[i]="2300-2359"}
}
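
The row-by-row loop makes the binning rule explicit, but the same labels can be produced in one vectorized step. The sketch below assumes departure times are non-negative HHMM integers; `bin_dep_time` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: maps HHMM departure times to hourly bin labels
bin_dep_time <- function(dep_time) {
  # Integer division by 100 gives the hour; anything from 2300 upward
  # falls into the last bin, matching the final else branch of the loop
  hour <- pmin(as.integer(dep_time) %/% 100L, 23L)
  sprintf("%02d00-%02d59", hour, hour)
}

bins <- bin_dep_time(c(10, 847, 1254, 2133, 2330))
print(bins)
```

Applied to a data frame, this would read `train_data$DEP_TIME_BIN = bin_dep_time(train_data$DEP_TIME)`, and the same helper could be reused for the validation set so the two binning steps cannot drift apart.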

In [38]:
head(train_data)

| | CARRIER | DEP_TIME | DEST | ORIGIN | Weather | Flight.Status | day | DEP_TIME_BIN |
|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<int>` | `<fct>` | `<int>` | `<chr>` |
| 1017 | DH | 2133 | LGA | IAD | 0 | delayed | 15 | 2100-2159 |
| 679 | CO | 1254 | EWR | DCA | 0 | ontime | 10 | 1200-1259 |
| 2177 | UA | 847 | LGA | IAD | 0 | ontime | 31 | 0800-0859 |
| 930 | DH | 1630 | JFK | DCA | 0 | ontime | 14 | 1600-1659 |
| 1533 | DL | 629 | LGA | DCA | 0 | ontime | 22 | 0600-0659 |
| 471 | RU | 1521 | EWR | DCA | 0 | ontime | 7 | 1500-1559 |


In [39]:
## Do the same on validation data

validation_data$DEP_TIME_BIN = 0

for(i in 1:nrow(validation_data)) {

    if(validation_data$DEP_TIME[i]<100) {validation_data$DEP_TIME_BIN[i]="0000-0059"}
    else if(validation_data$DEP_TIME[i]<200)  {validation_data$DEP_TIME_BIN[i]="0100-0159"}
    else if(validation_data$DEP_TIME[i]<300)  {validation_data$DEP_TIME_BIN[i]="0200-0259"}
    else if(validation_data$DEP_TIME[i]<400)  {validation_data$DEP_TIME_BIN[i]="0300-0359"}
    else if(validation_data$DEP_TIME[i]<500)  {validation_data$DEP_TIME_BIN[i]="0400-0459"}    
    else if(validation_data$DEP_TIME[i]<600)  {validation_data$DEP_TIME_BIN[i]="0500-0559"}
    else if(validation_data$DEP_TIME[i]<700)  {validation_data$DEP_TIME_BIN[i]="0600-0659"}
    else if(validation_data$DEP_TIME[i]<800)  {validation_data$DEP_TIME_BIN[i]="0700-0759"}
    else if(validation_data$DEP_TIME[i]<900)  {validation_data$DEP_TIME_BIN[i]="0800-0859"}
    else if(validation_data$DEP_TIME[i]<1000) {validation_data$DEP_TIME_BIN[i]="0900-0959"}
    else if(validation_data$DEP_TIME[i]<1100) {validation_data$DEP_TIME_BIN[i]="1000-1059"}
    else if(validation_data$DEP_TIME[i]<1200) {validation_data$DEP_TIME_BIN[i]="1100-1159"}
    else if(validation_data$DEP_TIME[i]<1300) {validation_data$DEP_TIME_BIN[i]="1200-1259"}
    else if(validation_data$DEP_TIME[i]<1400) {validation_data$DEP_TIME_BIN[i]="1300-1359"}
    else if(validation_data$DEP_TIME[i]<1500) {validation_data$DEP_TIME_BIN[i]="1400-1459"}
    else if(validation_data$DEP_TIME[i]<1600) {validation_data$DEP_TIME_BIN[i]="1500-1559"}
    else if(validation_data$DEP_TIME[i]<1700) {validation_data$DEP_TIME_BIN[i]="1600-1659"}
    else if(validation_data$DEP_TIME[i]<1800) {validation_data$DEP_TIME_BIN[i]="1700-1759"}
    else if(validation_data$DEP_TIME[i]<1900) {validation_data$DEP_TIME_BIN[i]="1800-1859"}
    else if(validation_data$DEP_TIME[i]<2000) {validation_data$DEP_TIME_BIN[i]="1900-1959"}
    else if(validation_data$DEP_TIME[i]<2100) {validation_data$DEP_TIME_BIN[i]="2000-2059"}
    else if(validation_data$DEP_TIME[i]<2200) {validation_data$DEP_TIME_BIN[i]="2100-2159"}
    else if(validation_data$DEP_TIME[i]<2300) {validation_data$DEP_TIME_BIN[i]="2200-2259"}
    else {validation_data$DEP_TIME_BIN[i]="2300-2359"}
}

In [40]:
## Remove the old variable

train_data$DEP_TIME = NULL
validation_data$DEP_TIME = NULL

In [41]:
summary(train_data)

    CARRIER     DEST     ORIGIN       Weather        Flight.Status 
 DH     :390   EWR:454   BWI:105   Min.   :0.00000   delayed: 303  
 US     :281   JFK:279   DCA:955   1st Qu.:0.00000   ontime :1237  
 DL     :280   LGA:807   IAD:480   Median :0.00000                 
 RU     :280                       Mean   :0.01364                 
 MQ     :202                       3rd Qu.:0.00000                 
 CO     : 64                       Max.   :1.00000                 
 (Other): 43                                                       
      day        DEP_TIME_BIN      
 Min.   : 1.00   Length:1540       
 1st Qu.: 8.00   Class :character  
 Median :16.00   Mode  :character  
 Mean   :15.89                     
 3rd Qu.:23.00                     
 Max.   :31.00                     
                                   

In [42]:
summary(validation_data)

    CARRIER     DEST     ORIGIN       Weather        Flight.Status
 DH     :161   EWR:211   BWI: 40   Min.   :0.00000   delayed:125  
 RU     :128   JFK:107   DCA:415   1st Qu.:0.00000   ontime :536  
 US     :123   LGA:343   IAD:206   Median :0.00000                
 DL     :108                       Mean   :0.01664                
 MQ     : 93                       3rd Qu.:0.00000                
 CO     : 30                       Max.   :1.00000                
 (Other): 18                                                      
      day        DEP_TIME_BIN      
 Min.   : 1.00   Length:661        
 1st Qu.: 9.00   Class :character  
 Median :16.00   Mode  :character  
 Mean   :16.33                     
 3rd Qu.:24.00                     
 Max.   :31.00                     
                                   

In [43]:
str(train_data)

'data.frame':	1540 obs. of  7 variables:
 $ CARRIER      : Factor w/ 8 levels "CO","DH","DL",..: 2 1 7 2 3 6 8 5 4 4 ...
 $ DEST         : Factor w/ 3 levels "EWR","JFK","LGA": 3 1 3 2 3 1 3 2 3 3 ...
 $ ORIGIN       : Factor w/ 3 levels "BWI","DCA","IAD": 3 2 3 2 2 2 2 1 2 2 ...
 $ Weather      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Flight.Status: Factor w/ 2 levels "delayed","ontime": 1 2 2 2 2 2 2 2 2 1 ...
 $ day          : int  15 10 31 14 22 7 5 18 9 19 ...
 $ DEP_TIME_BIN : chr  "2100-2159" "1200-1259" "0800-0859" "1600-1659" ...


In [44]:
str(validation_data)

'data.frame':	661 obs. of  7 variables:
 $ CARRIER      : Factor w/ 8 levels "CO","DH","DL",..: 2 2 2 2 2 3 4 4 4 7 ...
 $ DEST         : Factor w/ 3 levels "EWR","JFK","LGA": 2 3 3 2 2 2 2 3 3 3 ...
 $ ORIGIN       : Factor w/ 3 levels "BWI","DCA","IAD": 2 3 3 3 3 2 2 2 2 3 ...
 $ Weather      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Flight.Status: Factor w/ 2 levels "delayed","ontime": 2 2 2 2 2 2 2 2 2 2 ...
 $ day          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ DEP_TIME_BIN : chr  "1600-1659" "1200-1259" "1700-1759" "0800-0859" ...


#### Type Casting

In [45]:
## Convert all variables to factor - train

for (col in colnames(train_data)){
    train_data[,col] = as.factor(as.character(train_data[,col]))
}

In [46]:
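## Convert all variables to factor - validation
## Reuse the training levels so that factor codes mean the same thing in both sets;
## values not seen in training become NA

for (col in colnames(validation_data)){
    validation_data[,col] = factor(as.character(validation_data[,col]), levels = levels(train_data[,col]))
}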
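
When train and validation columns are converted to factors separately, the same label can end up with a different integer code in each set if one set is missing a level. This matters if a predict method looks up probabilities by factor code rather than by label, which is an assumption worth checking against the e1071 version in use. A minimal sketch with illustrative values:

```r
train_bin <- c("0600-0659", "0800-0859", "2100-2159")  # illustrative values
valid_bin <- c("0800-0859", "2100-2159")               # one training level is absent

# Converted independently, "0800-0859" becomes code 1 instead of 2
independent <- factor(valid_bin)
print(as.integer(independent))

# Reusing the training levels keeps the codes aligned with the training set
train_f <- factor(train_bin)
aligned <- factor(valid_bin, levels = levels(train_f))
print(as.integer(aligned))
```

Comparing `levels()` of each column across the two sets before modelling is a cheap way to catch this kind of mismatch.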

### Model Building

In [47]:
## load library
library(e1071)

## build model
model = naiveBayes(Flight.Status ~ ., data = train_data)


In [48]:
## check model
model


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
  delayed    ontime 
0.1967532 0.8032468 

Conditional probabilities:
         CARRIER
Y                 CO         DH         DL         MQ         OH         RU
  delayed 0.06270627 0.33003300 0.11221122 0.19471947 0.01320132 0.19801980
  ontime  0.03637833 0.23443816 0.19886823 0.11560226 0.01455133 0.17784964
         CARRIER
Y                 UA         US
  delayed 0.00990099 0.07920792
  ontime  0.01455133 0.20776071

         DEST
Y               EWR       JFK       LGA
  delayed 0.3531353 0.2244224 0.4224422
  ontime  0.2805174 0.1705740 0.5489086

         ORIGIN
Y                BWI        DCA        IAD
  delayed 0.07920792 0.52805281 0.39273927
  ontime  0.06548100 0.64268391 0.29183508

         Weather
Y                  0          1
  delayed 0.93069307 0.06930693
  ontime  1.00000000 0.00000000

         day
Y                   1        
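
The Weather table above assigns probability 0 to stormy weather for on-time flights. A zero would wipe out the whole product for that class, and additive (Laplace) smoothing is one common remedy. The sketch below reproduces the arithmetic by hand in base R. The counts are recovered from the output above (303 delayed flights of which 21 were stormy, and 1237 on-time flights with none stormy). The smoothing formula is the usual add-one form and is assumed, not confirmed here, to match what `naiveBayes` does with its `laplace` argument.

```r
# Weather counts per class in the training set
counts <- matrix(c(282, 21,
                   1237, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Y = c("delayed", "ontime"), Weather = c("0", "1")))

laplace <- 1  # pseudo-count added to every cell

# Plain relative frequencies: the ontime/stormy cell is exactly 0
unsmoothed <- counts / rowSums(counts)

# Add the pseudo-count to each cell and once per level to each row total
smoothed <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

print(round(unsmoothed, 5))
print(round(smoothed, 5))
```

In the notebook, the equivalent step would be to pass the argument when fitting, for example `naiveBayes(Flight.Status ~ ., data = train_data, laplace = 1)`.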

In [49]:
## predict on train

predict(model,train_data)

In [50]:
## predict on validation

predict(model,validation_data)
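
Predictions are easier to judge once they are compared with the actual labels. The sketch below shows the mechanics of a confusion matrix, accuracy, and recall for the delayed class. The two label vectors are hypothetical and serve only as an illustration; they are not results from the flights model.

```r
# Hypothetical labels, only to show the mechanics
actual    <- factor(c("ontime", "ontime", "delayed", "ontime", "delayed", "ontime"),
                    levels = c("delayed", "ontime"))
predicted <- factor(c("ontime", "ontime", "delayed", "ontime", "ontime", "delayed"),
                    levels = c("delayed", "ontime"))

# Rows are the true classes, columns the predicted classes
conf_matrix <- table(Actual = actual, Predicted = predicted)
print(conf_matrix)

# Accuracy: share of all predictions that are correct
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Recall for "delayed": delayed flights caught / all delayed flights
recall_delayed <- conf_matrix["delayed", "delayed"] / sum(conf_matrix["delayed", ])

print(c(accuracy = accuracy, recall_delayed = recall_delayed))
```

With the notebook's objects, the same table could be built as `table(validation_data$Flight.Status, predict(model, validation_data))`. Since 536 of the 661 validation flights are on time, always predicting ontime would already score about 0.81 accuracy, so recall on the delayed class is the more informative number.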