**Titanic - Data Analysis + RF Prediction 0.81818**
==============================

**1. Package, Loading and Check Data**
---------------------------------
**2. Exploratory Analysis**
-----------------------

2.1. Age

2.2. Sex

2.3. Age vs Sex

2.4. Pclass vs Sex

2.5. Pclass vs Sex vs Age

2.6. Fare vs Pclass

**3. Data Processing and Exploratory Analysis 2**
---------------------------------------------
3.1. New Variable : Title (From Name)

3.2. New Variable : Family Size (From Name, SibSp and Parch)

3.3. Processing Embarked (Replace missing values with the most common value, S)

3.4. Processing Fare (Replace the missing value with the median fare of Pclass = 3)

3.5. Processing Age (Replace missing values with the median age of each Title)

3.6. New Variable : Child (From Age)

3.7. Correlogram Matrix

**4. Modeling with Random Forest**
------------------------------
4.1. Modeling full_mod[1 : 891]

4.2. Confusion Matrix

**5. Prediction**
-------------
5.1. Prediction .csv

5.2. OOB Error and Gini

**6. Accuracy**
-------------

RF_MODEL : 0.8282828 full_mod[1:891]

RF_MODEL_TEST : **0.8181818** full_mod[892:1309] - **PUBLIC SCORE**




------



**1. Package, Loading and Check Data**
----------------------------------

In [1]:
#_____________________________Package + Data___________________________________
#______________________________________________________________________________

# Packages (plyr is loaded before dplyr so that dplyr's verbs are not masked)
suppressMessages(library('ggplot2'))
suppressMessages(library('ggthemes'))
suppressMessages(library('scales'))
suppressMessages(library('plyr'))
suppressMessages(library('dplyr'))
suppressMessages(library('randomForest'))
suppressMessages(library('corrplot'))

# Load data

train <- read.csv('../input/train.csv', stringsAsFactors = FALSE)
test  <- read.csv('../input/test.csv', stringsAsFactors = FALSE)

full  <- bind_rows(train, test) # rows 1:891 = train, rows 892:1309 = test

options(warn = -1) # suppress warnings for the rest of the notebook

In [2]:
str(full)

In [3]:
summary(full)

**2. Exploratory Analysis**
-----------------------

 - Missing values (handled in part 3):
   - Age : 263 
   - Fare : 1  
   - Embarked : 2
   - Cabin : too many
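
The counts above can be reproduced with a small helper. This is only a sketch on a made-up toy data frame (`toy` and `count_missing` are hypothetical names, not part of the notebook). It assumes that blank text fields such as Embarked and Cabin arrive as empty strings rather than `NA`, which is what `read.csv` does by default for character columns, so both cases are counted.

```r
# Toy stand-in for `full`; the values below are made up for illustration.
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Fare     = c(7.25, 71.28, NA, 8.05),
  Embarked = c("S", "", "C", "S"),
  Cabin    = c("", "C85", "", ""),
  stringsAsFactors = FALSE
)

# Count a value as missing if it is NA or an empty string.
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | col == ""))
}

missing_counts <- count_missing(toy)
print(missing_counts)
```

Applied to the notebook's data, the call would be `count_missing(full)`; the Survived column would then also show the unlabeled test rows as missing.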

**2.1. Age**

In [4]:
#______________________________________________________________________________
#_____________________________Exploratory Analysis_____________________________
#______________________________________________________________________________

# Age vs Survived
ggplot(full[1:891,], aes(Age, fill = factor(Survived))) + 
  geom_histogram(bins=30) + 
  theme_few() +
  xlab("Age") +
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Age vs Survived")

**2.2. Sex**

In [5]:
# Sex vs Survived
ggplot(full[1:891,], aes(Sex, fill = factor(Survived))) + 
  geom_bar(stat = "count", position = 'dodge')+
  theme_few() +
  xlab("Sex") +
  ylab("Count") +
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Sex vs Survived")

**2.3. Age vs Sex**

In [6]:
#Sex vs Survived vs Age 
ggplot(full[1:891,], aes(Age, fill = factor(Survived))) + 
  geom_histogram(bins=30) + 
  theme_few() +
  xlab("Age") +
  ylab("Count") +
  facet_grid(.~Sex)+
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Age vs Sex vs Survived")

**2.4. Pclass vs Sex**

In [7]:
# Pclass vs Survived
ggplot(full[1:891,], aes(Pclass, fill = factor(Survived))) + 
  geom_bar(stat = "count")+
  theme_few() +
  xlab("Pclass") +
  facet_grid(.~Sex)+
  ylab("Count") +
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Pclass vs Sex vs Survived")

**2.5. Pclass vs Sex vs Age**

In [8]:
# Pclass vs Sex vs Age vs Survived
# The points are coloured (not filled), so the legend is named via the colour scale
ggplot(full[1:891,], aes(x = Age, y = Sex)) + 
  geom_jitter(aes(colour = factor(Survived))) + 
  theme_few()+
  facet_wrap(~Pclass) + 
  labs(x = "Age", y = "Sex", title = "Pclass vs Sex vs Age vs Survived")+
  scale_colour_discrete(name = "Survived") + 
  scale_x_continuous(name="Age",limits=c(0, 81))

**2.6. Fare vs Pclass**

In [9]:
# Fare vs Pclass vs Survived
# Fares above the x-axis limit of 270 are dropped from the plot
ggplot(full[1:891,], aes(x = Fare, y = Pclass)) + 
  geom_jitter(aes(colour = factor(Survived))) + 
  theme_few()+
  labs(x = "Fare", y = "Pclass", title = "Fare vs Pclass")+
  scale_colour_discrete(name = "Survived") + 
  scale_x_continuous(name="Fare", limits=c(0, 270), breaks=c(0, 40, 80, 120, 160, 200, 240))

**3. Data Processing and Exploratory Analysis 2**
---------------------------------------------

**3.1. New Variable : Title** (From Name)

In [10]:
#______________________________________________________________________________
#______________________Data processing and ___________________________________
#_________________________exploratory analysis 2______________________________

#__________________________________Title_______________________________________
# Extract titles
full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)

# Titles by Sex
table(full$Sex, full$Title)
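
To see what the regular expression does, here is a minimal sketch on a few made-up names that follow the dataset's "Surname, Title. Given names" layout (the names and the `sample_names` / `titles` variables are hypothetical). The first alternative removes everything up to and including the comma and space; the second removes the first period and everything after it, leaving only the title.

```r
# Made-up names in the "Surname, Title. Given names" layout.
sample_names <- c("Doe, Mr. John",
                  "Roe, Mrs. Jane (Mary Major)",
                  "Poe, the Countess. of (Ann Example)")

# Same pattern as in the notebook: strip "Surname, " and ". Given names".
titles <- gsub('(.*, )|(\\..*)', '', sample_names)

print(titles)
```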

In [11]:
# Groups of rare titles
officer <- c('Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev')
royalty <- c('Dona', 'Lady', 'the Countess','Sir', 'Jonkheer')

# Reassign Mlle, Ms and Mme, then group the rare titles
full$Title[full$Title == 'Mlle']        <- 'Miss' 
full$Title[full$Title == 'Ms']          <- 'Miss'
full$Title[full$Title == 'Mme']         <- 'Mrs' 
full$Title[full$Title %in% royalty]  <- 'Royalty'
full$Title[full$Title %in% officer]  <- 'Officer'

# Surname: the part of Name before the first comma or period
full$Surname <- sapply(full$Name,  
                       function(x) strsplit(x, split = '[,.]')[[1]][1])

In [12]:
# Title vs Survived
ggplot(full[1:891,], aes(Title,fill = factor(Survived))) +
  geom_bar(stat = "count")+
  xlab('Title') +
  ylab("Count") +
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Title vs Survived")+
  theme_few()

**3.2. New Variable : Family Size** (From Name, SibSp and Parch)

In [13]:
#____________________________Family Size________________________________

# Family size = siblings/spouses + parents/children + the passenger
full$Fsize <- full$SibSp + full$Parch + 1

ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
  geom_bar(stat='count', position='dodge') +
  scale_x_continuous(breaks=c(1:11)) +
  xlab('Family Size') +
  ylab("Count") +
  theme_few()+
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Family Size vs Survived")

In [14]:
# FsizeD: discretised family size
full$FsizeD[full$Fsize == 1] <- 'Alone'
full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- 'Small'
full$FsizeD[full$Fsize > 4] <- 'Big'

mosaicplot(table(full$FsizeD, full$Survived), main = 'FsizeD vs Survived',
           ylab = "Survived", xlab = "FsizeD", color = hcl(c(50, 120)))

**3.3. Processing Embarked** (Replace missing values with the most common value, S)

In [15]:
#________________________________Embarked______________________________________
# 2 missing values (rows 62 and 830): impute 'S'
full[c(62, 830), 'Embarked']

full$Embarked[c(62, 830)] <- 'S'

ggplot(full[1:891,], aes(Pclass, fill = factor(Survived))) + 
  geom_bar(stat = "count")+
  theme_few() +
  xlab("Pclass") +
  ylab("Count") +
  facet_wrap(~Embarked) + 
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Embarked vs Pclass vs Survived")

**3.4. Processing Fare** (Replace the missing value with the median fare of Pclass = 3)

In [16]:
#_________________________________Fare_________________________________________ 

# Row 1044 holds the single missing Fare
full[1044, ]

ggplot(full[full$Pclass == '3', ], 
       aes(x = Fare)) +
  geom_density(fill = 'lightgrey', alpha=0.4) + 
  geom_vline(aes(xintercept=median(Fare, na.rm=T)),
             colour='darkred', linetype='dashed', lwd=1) +
  xlab('Fare') +
  ggtitle("Pclass = 3")+
  ylab("Density") +
  scale_x_continuous(labels=dollar_format()) +
  theme_few()

full$Fare[1044] <- median(full[full$Pclass == '3', ]$Fare, na.rm = TRUE)

**3.5. Processing Age** (Replace missing values with the median age of each Title)

In [17]:
# Median age per title
title.age <- aggregate(full$Age,by = list(full$Title), FUN = function(x) median(x, na.rm = TRUE))

# Replace each missing age with the median age of the passenger's title
full[is.na(full$Age), "Age"] <- apply(full[is.na(full$Age), ] , 1, function(x) title.age[title.age[, 1]==x["Title"], 2])

# Remaining NA count (expected to be 0)
sum(is.na(full$Age))
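
The same group-median imputation can also be written without looping over rows. The sketch below is an alternative formulation, not the notebook's own code: it uses base R's `ave()` on a made-up toy data frame (`toy`, `title_median` and `missing_age` are hypothetical names) to compute each title's median age and fill only the missing entries.

```r
# Toy stand-in for `full`; ages and titles are made up for illustration.
toy <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Miss", "Miss", "Miss"),
  Age   = c(30, 40, NA, 20, NA, 24),
  stringsAsFactors = FALSE
)

# Median age within each title, repeated for every row of that title.
title_median <- ave(toy$Age, toy$Title,
                    FUN = function(x) median(x, na.rm = TRUE))

# Fill only the missing ages; observed ages are left untouched.
missing_age <- is.na(toy$Age)
toy$Age[missing_age] <- title_median[missing_age]

print(toy)
```

Because `ave()` returns a vector aligned with the rows of the data frame, no lookup table is needed, at the cost of computing a median for rows that already have an age.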

**3.6. New Variable : Child** (From Age)

In [18]:
#__________________________________Child__________________________________________

full$Child[full$Age < 18] <- 'Child'
full$Child[full$Age >= 18] <- 'Adult'

ggplot(full[1:891,][full[1:891,]$Child == 'Child', ], aes(Sex, fill = factor(Survived))) + 
  geom_bar(stat = "count") + 
  xlab("Sex") +
  ylab("Count") +
  facet_wrap(~Pclass)+
  scale_fill_discrete(name = "Survived") +
  ggtitle("Child vs Sex vs Pclass vs Survived")+
  theme_few()

table(full$Child, full$Survived)

**3.7. Correlogram Matrix (for fun)**

In [19]:
#____________________________Correlogram__________________________
corr_data <- full[1:891,]

## Recode the categorical variables as numbers
corr_data$Embarked <- revalue(corr_data$Embarked, 
                                  c("S" = 1, "Q" = 2, "C" = 3))
corr_data$Sex <- revalue(corr_data$Sex, 
                              c("male" = 1, "female" = 2))
corr_data$Title <- revalue(corr_data$Title, 
                           c("Mr" = 1, "Master" = 2,"Officer" = 3, 
                             "Mrs" = 4,"Royalty" = 5,"Miss" = 6))
corr_data$FsizeD <- revalue(corr_data$FsizeD, 
                         c("Small" = 1, "Alone" = 2, "Big" = 3))
corr_data$Child <- revalue(corr_data$Child, 
                            c("Adult" = 1, "Child" = 2))
corr_data$FsizeD <- as.numeric(corr_data$FsizeD)
corr_data$Child <- as.numeric(corr_data$Child)
corr_data$Sex <- as.numeric(corr_data$Sex)
corr_data$Embarked <- as.numeric(corr_data$Embarked)
corr_data$Title <- as.numeric(corr_data$Title)
corr_data$Pclass <- as.numeric(corr_data$Pclass)
corr_data$Survived <- as.numeric(corr_data$Survived)

corr_data <-corr_data[,c("Survived", "Pclass", "Sex", 
               "FsizeD", "Fare", "Embarked","Title","Child")]

str(corr_data)

In [20]:
mcorr_data <- cor(corr_data)

corrplot(mcorr_data,method="circle")

In [21]:
#_________________________________Factor__________________________________________

full$Child  <- factor(full$Child)
full$Sex  <- factor(full$Sex)
full$Embarked  <- factor(full$Embarked)
full$Title  <- factor(full$Title)
full$Pclass  <- factor(full$Pclass)
full$FsizeD  <- factor(full$FsizeD)

#___________________________Data without Cabin & Ticket __________________________

# Drop Ticket and Cabin by name rather than by column position
full_mod <- full[, !(names(full) %in% c('Ticket', 'Cabin'))]

**4. Modeling with Random Forest**
------------------------------

**4.1. Modeling** full_mod[1 : 891]

In [22]:
#______________________________________________________________________________
#_____________________________Modeling + predict_______________________________
#______________________________________________________________________________

# Split full_mod
train <- full_mod[1:891,]
test <- full_mod[892:1309,]

# random forest
library('randomForest')

set.seed(123)
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title + 
                           FsizeD + Child, data = train)

**4.2. Confusion Matrix**

In [23]:
# Out-of-bag (OOB) predictions on the training set:
# predict() without newdata returns the OOB prediction for each row
rf.fitted <- predict(rf_model)
ans_rf <- as.integer(rf.fitted) - 1  # factor codes 1/2 -> labels 0/1

# Result
table(ans_rf)

print(rf_model)

**5. Prediction**
-------------

**5.1. Prediction .csv**

In [24]:
prediction <- predict(rf_model, test)

# Submission with 2 columns: PassengerId and Survived
solution <- data.frame(PassengerId = test$PassengerId, Survived = prediction)

# .csv
write.csv(solution, file = 'rf_model_sol.csv', row.names = F)

**5.2. OOB Error and Gini**

In [25]:
# Error
plot(rf_model, ylim=c(0,0.36), main = 'RF_MODEL')
legend('topright', colnames(rf_model$err.rate), col=1:3, fill=1:3)

In [26]:
# Variable importance
importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Rank variables by importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Plot variable importance
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
                           y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() + 
  theme_few()

**6. Accuracy**
-------------

**RF_MODEL : 0.8282828** full_mod[1:891]

**RF_MODEL_TEST : 0.8181818** full_mod[892:1309] - **PUBLIC SCORE**


----------
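
For reference, an accuracy figure of this kind is the share of cases whose predicted class matches the actual class. The sketch below shows the arithmetic on made-up labels (`actual`, `predicted`, `confusion` and `accuracy` are hypothetical names, and the values are not the notebook's results). It assumes the training figure is computed this way from the out-of-bag predictions.

```r
# Hypothetical labels, made up for illustration (not the notebook's results).
actual    <- c(0, 0, 1, 1, 0, 1, 0, 1, 0, 0)
predicted <- c(0, 0, 1, 0, 0, 1, 1, 1, 0, 0)

# Confusion matrix: rows = actual class, columns = predicted class.
confusion <- table(Actual = actual, Predicted = predicted)

# Accuracy = correctly classified cases / all cases.
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

With the notebook's objects, the equivalent one-liner would presumably be `mean(ans_rf == train$Survived)`.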
