# Messing Around with Different Machine Learning Models in R
## Logistic Regression

Things to try with logistic regression:
1. Use all variables to train the model without cleaning the data (except for removing the Cabin column)
2. Run the logistic regression example from r-bloggers to make sure we get the same results - https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
3. Use only some predictors, i.e. only the 2 strongest columns according to the logistic model
4. Compare accuracy to Azure ML - need to do randomForest - won't be logistic regression
5. Look into cross validation
6. Look at distribution differences when the data is normalized/scaled
7. Compare a model that uses scaled/normalized data with one that doesn't

In [4]:
#Library imports
library(ggplot2)
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [5]:
#Import data and prep
df <- read.csv(file = 'datasets/train.csv', stringsAsFactors = TRUE, na.strings = c("", NA))
head(df)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [6]:
#remove Cabin column (too many missing values to infer) and the identifier-like columns PassengerId, Name, Ticket
df <- df %>% select(-Cabin, -PassengerId, -Name, -Ticket)
head(df)
#check for missing values
sapply(df, function(x) sum(is.na(x)))

Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
1,3,female,26.0,0,0,7.925,S
1,1,female,35.0,1,0,53.1,S
0,3,male,35.0,0,0,8.05,S
0,3,male,,0,0,8.4583,Q


At this point, df contains missing values in both the Age and Embarked columns. We want to know whether the modeling function can handle missing values (as Azure ML does) or whether we have to fix the data first. We attempt to fit glm() on this imperfect data.

### 1. Modeling with Missing Values

In [7]:
#split data for training
train <- df[1:624, ]
test <- df[625:891, ]

#build model
model <- glm(Survived ~. , family = binomial(link = "logit"), data = train)
summary(model)


Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5319  -0.6993  -0.4081   0.6637   2.3737  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.395179   0.761617   7.084 1.40e-12 ***
Pclass      -1.161281   0.199152  -5.831 5.51e-09 ***
Sexmale     -2.569890   0.253866 -10.123  < 2e-16 ***
Age         -0.039385   0.009751  -4.039 5.37e-05 ***
SibSp       -0.334670   0.145759  -2.296   0.0217 *  
Parch        0.027597   0.156123   0.177   0.8597    
Fare        -0.002696   0.003232  -0.834   0.4042    
EmbarkedQ    0.136894   0.722154   0.190   0.8497    
EmbarkedS   -0.303114   0.320984  -0.944   0.3450    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 667.64  on 491  degrees of freedom
Residual deviance: 454.75  on 483  degrees of freedom
  (132 o

Now we assess the accuracy of our model

In [8]:
results <- predict(model, newdata = select(test, -Survived), type = 'response')
results <- ifelse(results > .5, 1, 0)
error <- mean(results != test$Survived)
print(paste("The model is ", 1 - error, " accurate"))

[1] "The model is  NA  accurate"


As the NA result shows, the accuracy calculation breaks down when the data contains missing values. Note, though, that the model still fits and predicts rather than crashing: glm() drops incomplete rows when fitting, and predict() returns NA for any row in test that has a missing value.
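
The behaviour described above can be reproduced on a small synthetic data set. This is only a sketch: `toy`, `x1`, `x2` and `y` are made-up names and the data is simulated, not the Titanic file. It relies on R's default `na.action` option being `na.omit`.

```r
# Synthetic data (hypothetical, not the Titanic set): two predictors, a few NAs in x1.
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))
toy$x1[c(3, 50, 120, 190)] <- NA

train_toy <- toy[1:150, ]
test_toy  <- toy[151:200, ]

# With the default na.action (na.omit), glm() silently drops incomplete rows.
fit <- glm(y ~ ., family = binomial(link = "logit"), data = train_toy)
nobs(fit)            # 147: the 150 training rows minus the 3 with a missing x1

# predict() on new data returns NA for a row with a missing predictor.
p <- predict(fit, newdata = test_toy, type = "response")
sum(is.na(p))        # 1: row 190 falls in the test slice

# mean() propagates NA unless na.rm = TRUE is given.
pred <- ifelse(p > 0.5, 1, 0)
mean(pred != test_toy$y)                 # NA
mean(pred != test_toy$y, na.rm = TRUE)   # error rate on the rows that were scored
```

The same pattern explains the `NA` printed by the accuracy check: a single unscored row is enough to turn the mean into `NA`.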

In [9]:
head(results)

We can see it did classify some rows, meaning it did not fail completely.

In [10]:
sum(is.na(results))

Scanning our results for NA values shows that 47 of the 267 test rows did not receive a prediction. If we tell R to ignore the NA values when computing the error, we get the following score.

In [11]:
error <- mean(results != test$Survived, na.rm = TRUE)
error

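Note that this value is the misclassification rate, not the accuracy: an error of about 20% means the model was roughly 80% accurate on the rows it could score. The rows with missing values still received no prediction at all, so the missing values should be cleaned up before modeling.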
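
Because `mean(results != test$Survived)` measures how often the prediction is wrong, it helps to compute error and accuracy side by side. A minimal sketch with made-up vectors (`predicted` and `actual` are hypothetical, not the notebook's results):

```r
# Hypothetical predictions and labels; NA marks a row the model could not score.
predicted <- c(1, 0, NA, 1, 0, 0, 1, NA, 0, 1)
actual    <- c(1, 0, 1,  0, 0, 0, 1, 0,  1, 1)

scored   <- !is.na(predicted)
error    <- mean(predicted != actual, na.rm = TRUE)  # share of scored rows that are wrong
accuracy <- mean(predicted == actual, na.rm = TRUE)  # share of scored rows that are right

error              # 0.25
accuracy           # 0.75
mean(scored)       # 0.8: share of rows that received any prediction
```

Reporting the coverage next to the accuracy makes it visible that `na.rm = TRUE` scores only part of the test set.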

### 2. Logistic Regression From Example
Using the tutorial at https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/ we test whether we can reproduce its results.

In [12]:
#load messy data
df <- read.csv(file = 'datasets/train.csv', stringsAsFactors = TRUE, na.strings = c("", NA)) #factors are needed for contrasts() below
#look at missing values
sapply(df, function(x) sum(is.na(x)))

In [13]:
#Look at different types of variables
sapply(df, function(x) class(x))

In [14]:
#look at number of different values by each column
sapply(df, function(x) length(unique(x)))

Cabin has so many missing values that imputing it would be too difficult, so we drop it altogether. PassengerId is just an identifier and carries no predictive information, so we drop it as well. The same applies to Ticket and Name.

In [15]:
df <- df %>% select(-Cabin, -PassengerId, -Ticket, -Name)
head(df)

Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
1,3,female,26.0,0,0,7.925,S
1,1,female,35.0,1,0,53.1,S
0,3,male,35.0,0,0,8.05,S
0,3,male,,0,0,8.4583,Q


In [16]:
#fix missing age values with the mean age
meanAge <- mean(df$Age, na.rm = TRUE)
df$Age[is.na(df$Age)] <- meanAge
head(df)

Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.25,S
1,1,female,38.0,1,0,71.2833,C
1,3,female,26.0,0,0,7.925,S
1,1,female,35.0,1,0,53.1,S
0,3,male,35.0,0,0,8.05,S
0,3,male,29.69912,0,0,8.4583,Q


In [17]:
contrasts(df$Sex)
contrasts(df$Embarked)

,male
female,0
male,1


,Q,S
C,0,0
Q,1,0
S,0,1


In [18]:
#remove the rows with NA values for embarked - there were 2
df <- df %>% filter(complete.cases(Embarked))
dim(df)
#check situation of missing values
sapply(df, function(x) sum(is.na(x)))

As we can see above, there are no longer any missing values.

Now we move on to modeling. First we split our data.

In [19]:
train <- df[1:800, ]
test <- df[801:889, ]
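
The split above takes rows in file order, which follows the tutorial. If the row order carried any pattern, a random split would be a safer choice. A minimal sketch of that alternative, using a made-up data frame `toy` with the same 889 rows (the seed value is arbitrary):

```r
# Hypothetical stand-in for df: 889 rows, as after dropping the two Embarked NAs.
set.seed(123)                          # arbitrary seed, only for reproducibility
toy <- data.frame(id = 1:889, y = rbinom(889, 1, 0.4))

train_idx <- sample(nrow(toy), size = 800)   # 800 row numbers drawn without replacement
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]               # the remaining 89 rows

nrow(train_toy)                              # 800
nrow(test_toy)                               # 89
length(intersect(train_toy$id, test_toy$id)) # 0: no row appears in both sets
```

A random split would not reproduce the tutorial's exact accuracy, which is why the notebook keeps the ordered split here.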

Then we build our model

In [20]:
model <- glm(Survived ~., family = binomial(link = 'logit'), data = train)
summary(model)


Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6064  -0.5954  -0.4254   0.6220   2.4165  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.137627   0.594998   8.635  < 2e-16 ***
Pclass      -1.087156   0.151168  -7.192 6.40e-13 ***
Sexmale     -2.756819   0.212026 -13.002  < 2e-16 ***
Age         -0.037267   0.008195  -4.547 5.43e-06 ***
SibSp       -0.292920   0.114642  -2.555   0.0106 *  
Parch       -0.116576   0.128127  -0.910   0.3629    
Fare         0.001528   0.002353   0.649   0.5160    
EmbarkedQ   -0.002656   0.400882  -0.007   0.9947    
EmbarkedS   -0.318786   0.252960  -1.260   0.2076    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1065.39  on 799  degrees of freedom
Residual deviance:  709.39  on 791  degrees of freedom
AIC: 7

Analyzing the model output, Parch, Fare, and Embarked are not statistically significant because their p-values are greater than .05, while SibSp is significant at the .05 level (p = 0.0106). Of the statistically significant predictors, Sex has the lowest p-value, suggesting it is the strongest predictor of survival. Next we examine the table of deviance by running the anova() function on our model.
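
The estimates in the summary are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. A small sketch using coefficients copied from the output above (the vector `coefs` is typed in by hand here, not extracted from the fitted object):

```r
# Coefficients copied from the summary() output above (log-odds scale).
coefs <- c(Pclass = -1.087156, Sexmale = -2.756819, Age = -0.037267, SibSp = -0.292920)

# Multiplicative change in the odds of survival per one-unit increase.
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

Read this way, being male multiplies the odds of survival by roughly 0.06, and each additional year of age by roughly 0.96, holding the other predictors fixed. On the fitted object the same numbers should come from `exp(coef(model))`. The deviance table mentioned above follows.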

In [21]:
anova(model, test = 'Chisq')

,Df,Deviance,Resid. Df,Resid. Dev,Pr(>Chi)
,,,799,1065.3922,
Pclass,1.0,83.6069449,798,981.7853,6.036064e-20
Sex,1.0,240.0135513,797,741.7717,3.906166e-54
Age,1.0,17.4946765,796,724.277,2.881133e-05
SibSp,1.0,10.8423921,795,713.4346,0.0009920249
Parch,1.0,0.8630972,794,712.5715,0.3528734
Fare,1.0,0.9942053,793,711.5773,0.3187167
Embarked,2.0,2.187312,791,709.39,0.3349895
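
Each `Pr(>Chi)` value in the table is the upper-tail probability of a chi-squared distribution, evaluated at the drop in deviance with the listed degrees of freedom. The sketch below recomputes three of them from numbers copied out of the table, as a check on how to read it:

```r
# Deviance drops and degrees of freedom copied from the anova() table above.
dev_drop <- c(Parch = 0.8630972, Fare = 0.9942053, Embarked = 2.187312)
dfs      <- c(Parch = 1,         Fare = 1,         Embarked = 2)

# Upper-tail chi-squared probability of a drop at least this large.
p_values <- pchisq(dev_drop, df = dfs, lower.tail = FALSE)
round(p_values, 4)
```

The results agree with the table's last column (about 0.35, 0.32 and 0.33). Because the terms are added sequentially, each row tests a variable given the ones listed above it.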


Now that we have analyzed our model, it is time to test it. Below we predict on the test data and measure the accuracy.

In [22]:
#predict and then score model
fitted.results <- predict(model, newdata = test %>% select(-Survived), type = 'response')
fitted.results <- ifelse(fitted.results > .5, 1, 0)
head(fitted.results)

In [23]:
# calculate error
error <- mean(fitted.results != test$Survived)
print(paste('Our logistic regression classifier is ', 1 - error, 'accurate or ', (1 - error) * 100, '% accurate'))

[1] "Our logistic regression classifier is  0.842696629213483 accurate or  84.2696629213483 % accurate"


This matches the accuracy reported in the tutorial, so we have reproduced its results.
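
A single accuracy number hides which kind of mistake the classifier makes. One way to see that is a confusion matrix built with `table()`. The sketch below uses made-up labels and probabilities (`actual` and `prob` are hypothetical, not the Titanic predictions) with the same 0.5 cutoff:

```r
# Hypothetical labels and predicted probabilities, not taken from the Titanic data.
actual <- c(0, 0, 1, 1, 0, 1, 0, 1, 1, 0)
prob   <- c(0.10, 0.40, 0.80, 0.30, 0.60, 0.90, 0.20, 0.70, 0.55, 0.05)

predicted <- ifelse(prob > 0.5, 1, 0)          # same 0.5 cutoff as above
cm <- table(predicted = predicted, actual = actual)
cm

accuracy <- sum(diag(cm)) / sum(cm)            # correct predictions / all predictions
accuracy                                       # 0.8
```

With the notebook's objects, the equivalent call would be `table(predicted = fitted.results, actual = test$Survived)`.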

### 3. Using only some predictors

Looking at the previous model, we can identify which predictors are most strongly associated with survival. We now check whether using only those predictors, instead of all variables, improves performance.

In [24]:
#Reading/Tidying data to prepare for model
df <- read.csv(file = 'datasets/train_edited.csv', na.strings = c("", NA))
sapply(df, function(x) sum(is.na(x)))

Here we are only going to use the Pclass, Sex, and Age variables

In [25]:
df <- df %>% select(Survived, Pclass, Sex, Age)
train <- df[1:800, ]
test <- df[801:889, ]

In [26]:
model <- glm(Survived ~., family = binomial(link = 'logit'), data = train)
summary(model)


Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6320  -0.6570  -0.4239   0.6420   2.4093  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.633935   0.472667   9.804  < 2e-16 ***
Pclass      -1.139026   0.125153  -9.101  < 2e-16 ***
Sexmale     -2.646164   0.197657 -13.388  < 2e-16 ***
Age         -0.031477   0.007724  -4.075  4.6e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1065.39  on 799  degrees of freedom
Residual deviance:  724.28  on 796  degrees of freedom
AIC: 732.28

Number of Fisher Scoring iterations: 4


In [27]:
anova(model, test = 'Chisq')

,Df,Deviance,Resid. Df,Resid. Dev,Pr(>Chi)
,,,799,1065.3922,
Pclass,1.0,83.60694,798,981.7853,6.036064e-20
Sex,1.0,240.01355,797,741.7717,3.906166e-54
Age,1.0,17.49468,796,724.277,2.881133e-05


All three predictors are statistically significant. Now we evaluate the model's accuracy on the test data.

In [31]:
fitted.results <- predict(model, newdata = test %>% select(-Survived), type = 'response')
fitted.results <- ifelse(fitted.results > .5, 1, 0)
error <- mean(fitted.results != test$Survived)
print(paste('Our logistic regression classifier is ', 1 - error, 'accurate or ', (1 - error) * 100, '% accurate'))

[1] "Our logistic regression classifier is  0.797752808988764 accurate or  79.7752808988764 % accurate"


#### Conclusion
This model with fewer predictors was less accurate than the model with all predictors: 79.8% vs 84.3%. On a test set of 89 rows that is a difference of 4 correct predictions (71 vs 75), so the two accuracies are fairly close.
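
Test accuracy on 89 rows is a noisy yardstick, so it can also help to compare how well the two models fit the training data. One rough option is a likelihood-ratio test on the residual deviances reported above. This sketch assumes both models were fit on the same 800 training rows, which the identical null deviance suggests but which is not verified here:

```r
# Residual deviances and degrees of freedom copied from the two summary() outputs above.
dev_full    <- 709.39; df_full    <- 791   # all predictors
dev_reduced <- 724.28; df_reduced <- 796   # Pclass, Sex, Age only

lr_stat <- dev_reduced - dev_full          # 14.89
lr_df   <- df_reduced - df_full            # 5 extra parameters in the full model
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value
```

The p-value comes out at about 0.01, which would suggest the extra predictors taken together improve the fit by more than chance. If both fitted objects were kept under separate names (say `model_reduced` and `model_full`, hypothetical names), `anova(model_reduced, model_full, test = 'Chisq')` should give the same comparison directly.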

### 4. Comparing with Azure ML Logistic Regression
In the Azure ML example, using the preset logistic regression module with 90% of the data for training, the model was 79.8% accurate.