# Titanic Data Analysis


In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. 
In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice Skills:
 - Binary classification ===> Logistic Regression, LDA...
 - Python and R basics

Goal:
 - It is your job to predict if a passenger survived the sinking of the Titanic or not. 
 - For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.

Metric:
 - Your score is the percentage of passengers you correctly predict. This is known simply as "accuracy".
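
To make the metric concrete, here is a minimal sketch of how accuracy can be computed. The `accuracy` helper and the label values are hypothetical, chosen only for illustration; they are not taken from the competition data.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels for five passengers (illustrative values only)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

score = accuracy(y_true, y_pred)
print(score)  # four of the five predictions match
```

scikit-learn's `accuracy_score` computes the same quantity.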
 
 

## Data Analysis 

In [None]:
#Importing Data Analysis Libs
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Getting .csv files 
df = pd.read_csv('../input/train.csv')
dfTest = pd.read_csv('../input/test.csv')

### Checking Dataframe ...

In [None]:
#Checking the first lines
df.head()

In [None]:
#Data Types
df.dtypes

In [None]:
#Checking the shape of the dataframe
df.shape

In [None]:
dfTest.head(3)

In [None]:
dfTest.shape

In [None]:
#Statistic Summary
df.describe()

In [None]:
dfTest.describe()

The Age column has missing values: its count in the summary above is lower than that of the other columns.

In [None]:
#Checking columns with null values in Dataframe
df.isnull().any()

In [None]:
# Pclass column distribution
df.groupby('Pclass').size()

### Dataframe Transformations...

In [None]:
#Copying Dataframe (.copy() creates an independent copy; plain assignment would only alias df)
dfT = df.copy()

In [None]:
#Checking Unique values from Sex Column
sex_values = df.drop_duplicates('Sex')
print(sex_values['Sex'])

In [None]:
#Checking Unique values from Embarked Column
embarked_values = df.drop_duplicates('Embarked')
print(embarked_values['Embarked'])

In [None]:
#Using mode to fill null values on Embarked Column
dfT['Embarked'] = dfT['Embarked'].fillna(dfT['Embarked'].mode()[0])
dfTest['Embarked'] = dfTest['Embarked'].fillna(dfTest['Embarked'].mode()[0])

In [None]:
#Dummy(one-hot encoded values) to train dataset(columns Sex and Embarked)
dfT = pd.concat([dfT.drop('Sex', axis=1), pd.get_dummies(dfT['Sex'])], axis=1)
dfT = pd.concat([dfT.drop('Embarked', axis=1), pd.get_dummies(dfT['Embarked'])], axis=1)

In [None]:
#Dummy(one-hot encoded values) to test dataset(columns Sex and Embarked)
dfTest = pd.concat([dfTest.drop('Sex', axis=1), pd.get_dummies(dfTest['Sex'])], axis=1)
dfTest = pd.concat([dfTest.drop('Embarked', axis=1), pd.get_dummies(dfTest['Embarked'])], axis=1)
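
As a rough illustration of what the one-hot encoding step produces, the sketch below applies `pd.get_dummies` to a small hypothetical `Sex` column. The `toy_sex` frame and its values are made up for this example.

```python
import pandas as pd

# Hypothetical three-passenger frame (illustrative values only)
toy_sex = pd.DataFrame({"Sex": ["male", "female", "female"]})

# One indicator column per category, ordered alphabetically
encoded = pd.get_dummies(toy_sex["Sex"])

# Cast to int so the output looks the same across pandas versions
# (the raw dtype is uint8 or bool depending on the release)
print(encoded.astype(int))
```

Each row has exactly one indicator set, so `female` and `male` carry the same information; this is worth keeping in mind when reading the correlation matrix later on.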

In [None]:
#Using Mean to fill null values on Column Age
dfT['Age'] = dfT['Age'].fillna(dfT['Age'].mean())
dfTest['Age'] = dfTest['Age'].fillna(dfTest['Age'].mean())

In [None]:
dfT.head(5)

In [None]:
dfTest.head(5)

In [None]:
#Dataset Formatting
dfT.shape

In [None]:
dfTest.shape

In [None]:
#Statistic Summary
dfT.describe()

In [None]:
#Checking nulls on Dataframe
dfT.isnull().any()

In [None]:
dfTest.isnull().any()

In [None]:
#Completing column Fare with 0 instead of Null
dfTest['Fare'] = dfTest['Fare'].fillna(0)

Apart from Cabin, which is dropped later, every column now has the same number of non-null values.

### Data Visualization...

In [None]:
#Visualization libs
#import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Histogram
dfT.hist()
plt.show()

In [None]:
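# Correlation Matrix with names
columns = ['pID', 'surviv', 'pclass','age', 'sibsp', 'parch', 'fare', 'female','male','C','Q','S']
# Only numeric/boolean columns can be correlated; text columns (Name, Cabin, Ticket) are excluded
correlations = dfT.select_dtypes(include=['number', 'bool']).corr()
print(correlations)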

In [None]:
# Plot
fig = plt.figure()
ax = fig.add_subplot(111)
# (ax.set_color_cycle was removed from matplotlib and has no effect on matshow, so it is dropped)
cax = ax.matshow(correlations, interpolation='nearest', vmin = -1, vmax = 1)
fig.colorbar(cax)
ticks = np.arange(0, 12, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(columns)
ax.set_yticklabels(columns)
plt.show()

SibSp and Parch are highly correlated:
 - SibSp = # of siblings / spouses
 - Parch = # of parents / children

Create one new column, FamilySize, by adding SibSp and Parch (plus 1 for the passenger).

In [None]:
#Adding columns SibSp and Parch => FamilySize
dfT['FamilySize'] = dfT['SibSp'] + dfT['Parch'] + 1
dfTest['FamilySize'] = dfTest['SibSp'] + dfTest['Parch'] + 1

In [None]:
dfT.shape

In [None]:
# Recompute the correlations so that the new FamilySize column is included,
# and take the axis labels from the matrix itself so they always match
correlations = dfT.select_dtypes(include=['number', 'bool']).corr()
new_columns = list(correlations.columns)

In [None]:
# New Plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, interpolation='nearest', vmin = -1, vmax = 1)
fig.colorbar(cax)
ticks = np.arange(0, len(new_columns), 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(new_columns, rotation = 90)
ax.set_yticklabels(new_columns)
plt.show()

The number of family members appears to be correlated with fare.
Port of embarkation appears to be correlated with fare.

Port of embarkation seems to influence survival.
Fare seems to influence survival.
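
One simple way to probe observations like these is to compare the mean of `Survived` across groups. The sketch below does this on a small hypothetical frame; `toy_ports` and its values are invented for illustration and are not the Titanic data.

```python
import pandas as pd

# Hypothetical passengers (illustrative values only)
toy_ports = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q", "S"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Mean of a 0/1 column per group = survival rate per port
rate_by_port = toy_ports.groupby("Embarked")["Survived"].mean()
print(rate_by_port)
```

The same pattern could be applied to the original `df` (before the Embarked column is one-hot encoded) to check whether the differences suggested by the correlation plot show up as differences in survival rate.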
      

In [None]:
#Visualizing data with seaborn
import seaborn as sns

In [None]:
#Dropping text columns (Name, Cabin and Ticket) from Dataframe
dfT = dfT.drop(['Name','Cabin','Ticket'],axis=1)
dfTest = dfTest.drop(['Name','Cabin','Ticket'],axis=1)

In [None]:
dfT.dtypes

In [None]:
dfTest.dtypes

In [None]:
# Pairplot   ====> Must have all columns without nulls
sns.pairplot(dfT)  

In [None]:
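# kdeplot of a single numeric column (Age is used here as an example; the columns have very different scales)
sns.kdeplot(dfT['Age'])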

In [None]:
dfT.head(3)

In [None]:
dfT.shape

## Preparing data for Machine Learning

In [None]:
# Importing library
#from pandas import read_csv
from sklearn.preprocessing import MinMaxScaler

colTrain = ['PassengerId', 'Pclass', 'Age', 'Fare', 'female', 'male','C','Q','S' ,'FamilySize', 'Survived']
dfMLTrain = dfT[colTrain]
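# Cast to float so that mixed int/bool columns yield a numeric array rather than an object array
arrayTrain = dfMLTrain.values.astype(float)

colTest = ['PassengerId', 'Pclass', 'Age', 'Fare', 'female', 'male','C','Q','S','FamilySize']
dfMLTest = dfTest[colTest]
arrayTest = dfMLTest.values.astype(float)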

# Splitting array in input and output
XTrain = arrayTrain[:,0:10] 
YTrain = arrayTrain[:,10] 
XTest = arrayTest[:,0:10] 

# Creating new scale (fit on the training data only, then reuse the same scale for the test data)
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledXTrain = scaler.fit_transform(XTrain)
rescaledXTest = scaler.transform(XTest)

# Data transformed
print(rescaledXTrain[0:5,:])
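
The sketch below shows, on small hypothetical arrays, why a scaler is usually fitted on the training data only and then reused for the test data: both sets end up on the same scale. The names `train`, `test` and `demo_scaler` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrices (illustrative values only)
train = np.array([[0.0, 10.0],
                  [5.0, 20.0],
                  [10.0, 30.0]])
test = np.array([[5.0, 40.0]])

demo_scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = demo_scaler.fit_transform(train)  # learns min/max from train
test_scaled = demo_scaler.transform(test)        # reuses the train min/max

print(train_scaled)
print(test_scaled)
```

Test values outside the training range map outside [0, 1] (the second test feature here), which is expected: the test set is measured against the training scale rather than its own.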

### Feature Selection

In [None]:
# Feature Selection using chi2 test

# Import modules
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Selecting the 5 best features that can be used in the prediction model
test = SelectKBest(score_func = chi2, k = 5) 
fit = test.fit(XTrain, YTrain)

# Summarizing score
print(fit.scores_)
features = fit.transform(XTrain)

# Summarizing selected Features
print(features[0:5,:])

chi2 scores from the cell above, in column order:

| Feature | chi2 score |
|---|---|
| PassengerId | 3.31293407e+00 |
| Pclass | 3.08736994e+01 |
| Age | 2.46879258e+01 |
| Fare | 4.51831909e+03 |
| female | 1.70348127e+02 |
| male | 9.27024470e+01 |
| C | 2.04644013e+01 |
| Q | 1.08467891e-02 |
| S | 5.48920482e+00 |
| FamilySize | 3.36787042e-01 |

Features (best to worst, by the scores above): Fare, female, male, Pclass, Age, C, S, PassengerId, FamilySize, Q
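
Reading a ranking off a raw score array is error-prone, so it can help to pair each score with its feature name and sort. The sketch below does this on synthetic data; the feature names `informative`, `noise_a` and `noise_b` and the generated values are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)

# Synthetic non-negative features: one tied to the label, two unrelated
X = np.column_stack([
    y * 5 + rng.randint(0, 2, size=200),  # strongly related to y
    rng.randint(0, 5, size=200),          # noise
    rng.randint(0, 5, size=200),          # noise
])
names = ["informative", "noise_a", "noise_b"]

selector = SelectKBest(score_func=chi2, k=1).fit(X, y)

# Pair names with scores and sort from highest to lowest score
ranking = sorted(zip(names, selector.scores_), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(name, round(float(score), 3))
```

With the notebook's data, the same pattern would use the first ten entries of `colTrain` as `names` and `fit.scores_` as the scores.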

### Selecting Prediction Model

In [None]:
# Import modules
from sklearn import model_selection
#from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neural_network import MLPClassifier

# Defining number of folds
num_folds = 10
num_instances = len(XTrain)
seed = 7

# Preparing models
modelos = []
modelos.append(('LR', LogisticRegression()))
modelos.append(('LDA', LinearDiscriminantAnalysis()))
modelos.append(('NB', GaussianNB()))
modelos.append(('KNN', KNeighborsClassifier()))
modelos.append(('GBC', GradientBoostingClassifier()))
modelos.append(('ETC', ExtraTreesClassifier()))
modelos.append(('RFC', RandomForestClassifier()))
modelos.append(('DTC', DecisionTreeClassifier()))
modelos.append(('ABC', AdaBoostClassifier()))
modelos.append(('BGC', BaggingClassifier()))
modelos.append(('QDA', QuadraticDiscriminantAnalysis()))
modelos.append(('GPC', GaussianProcessClassifier()))
modelos.append(('SVC', SVC()))
modelos.append(('MLPC', MLPClassifier()))


# Model Evaluation
resultados = []
nomes = []

for nome, modelo in modelos:
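    # shuffle=True is required for random_state to take effect (newer scikit-learn raises an error otherwise)
    kfold = model_selection.KFold(n_splits = num_folds, shuffle = True, random_state = seed)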
    cv_results = model_selection.cross_val_score(modelo, XTrain, YTrain, cv = kfold, scoring = 'accuracy')
    resultados.append(cv_results)
    nomes.append(nome)
    msg = "%s: %f (%f)" % (nome, cv_results.mean(), cv_results.std())
    print(msg)

# Boxplot to compare algorithms
fig = plt.figure()
fig.suptitle('Comparison of Classification Algorithms')
ax = fig.add_subplot(111)
plt.boxplot(resultados)
ax.set_xticklabels(nomes)
plt.show()

GBC (Gradient Boosting Classifier) is the best-performing algorithm in this comparison.

## Using Linear Discriminant Analysis (LDA)

In [None]:
# Creating LDA model 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
modelLDA = LinearDiscriminantAnalysis()

# Training model and checking the score
modelLDA.fit(XTrain, YTrain)
modelLDA.score(XTrain, YTrain)

# Collecting coefficients
print('Coefficient: \n', modelLDA.coef_)
print('Intercept: \n', modelLDA.intercept_)

# Predictions
YPredLDA = modelLDA.predict(XTest)

In [None]:
#Checking accuracy on the training set (this is not a held-out estimate)
acc_log = round(modelLDA.score(XTrain, YTrain) * 100, 2)
acc_log
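
Accuracy measured on the data a model was trained on tends to be optimistic. As a hedged sketch, the example below holds out part of a synthetic dataset and reports both numbers; `X_demo`, `y_demo` and `demo_model` are illustrative names, and the data is randomly generated rather than taken from the Titanic set.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(7)

# Synthetic data: the label depends on the first feature plus some noise
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Keep 25% of the rows aside for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7, stratify=y_demo)

demo_model = LinearDiscriminantAnalysis().fit(X_fit, y_fit)
train_acc = demo_model.score(X_fit, y_fit)  # accuracy on data seen in training
val_acc = demo_model.score(X_val, y_val)    # accuracy on held-out data

print(round(train_acc * 100, 2), round(val_acc * 100, 2))
```

The cross-validation scores from the model comparison play the same role for the notebook's models as `val_acc` does here.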

## Using Gaussian NB

In [None]:
# Creating Gaussian NB model 
from sklearn.naive_bayes import GaussianNB
modelGNB = GaussianNB()

# Training model and checking the score
modelGNB.fit(XTrain, YTrain)
modelGNB.score(XTrain, YTrain)

# Predictions
YPredGNB = modelGNB.predict(XTest)

In [None]:
#Checking accuracy
acc_log = round(modelGNB.score(XTrain, YTrain) * 100, 2)
acc_log

## Using Logistic Regression

In [None]:
# Creating logistic regression model 
modelLR = LogisticRegression()

# Training model and checking the score
modelLR.fit(XTrain, YTrain)
modelLR.score(XTrain, YTrain)

# Collecting coefficients
print('Coefficient: \n', modelLR.coef_)
print('Intercept: \n', modelLR.intercept_)

# Predictions
YPredLR = modelLR.predict(XTest)

In [None]:
#Checking accuracy
acc_log = round(modelLR.score(XTrain, YTrain) * 100, 2)
acc_log

## Creating a Gradient Boost Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
modelGBC = GradientBoostingClassifier()

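# 'loss' is left at its default, which is the objective the 'deviance' name referred to;
# newer scikit-learn releases no longer accept the 'deviance' name
gb_param_grid = {'n_estimators' : [100,200,300],
              'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [4, 8],
              'min_samples_leaf': [100,150],
              'max_features': [0.3, 0.1] 
              }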


gsGBC = GridSearchCV(modelGBC,param_grid = gb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsGBC.fit(XTrain,YTrain)

GBC_best = gsGBC.best_estimator_

# Best score
gsGBC.best_score_

# Predictions
YPredGBC=gsGBC.predict(XTest)

In [None]:
#Checking accuracy
acc_log = round(gsGBC.score(XTrain,YTrain) * 100, 2)
acc_log

## Creating an Extra Tree Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

modelETC = ExtraTreesClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, random_state=0)


# Training model and checking the score
modelETC.fit(XTrain, YTrain)
modelETC.score(XTrain,YTrain)


# Predictions
YPredETC=modelETC.predict(XTest)

In [None]:
#Checking accuracy
acc_log = round(modelETC.score(XTrain,YTrain) * 100, 2)
acc_log

## Creating a Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

modelRFC = RandomForestClassifier()

rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}

gsRFC = GridSearchCV(modelRFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 3, verbose = 1)

gsRFC.fit(XTrain, YTrain)
RFC_best = gsRFC.best_estimator_
gsRFC.best_score_

# Predictions
YPredRFC=gsRFC.predict(XTest)

In [None]:
#Checking accuracy
acc_log = round(gsRFC.score(XTrain,YTrain) * 100, 2)
acc_log

## Creating a Decision Tree Classifier(Adaboost)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier


DTC = DecisionTreeClassifier()

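adaDTC = AdaBoostClassifier(DTC, random_state=7)
# The nested-parameter prefix is 'estimator' in newer scikit-learn releases and 'base_estimator' in older ones
est = "estimator" if "estimator" in adaDTC.get_params() else "base_estimator"
ada_param_grid = {est + "__criterion" : ["gini", "entropy"],
              est + "__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[1,2],
              "learning_rate":  [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5]}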


# Training model and checking the score

modelDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

modelDTC.fit(XTrain,YTrain)
modelDTC.score(XTrain,YTrain)

# Predictions
YPredDTC=modelDTC.predict(XTest)



In [None]:
#Checking accuracy
acc_log = round(modelDTC.score(XTrain,YTrain) * 100, 2)
acc_log

In [None]:
#Bagging

from sklearn.ensemble import BaggingClassifier

# Training model and checking the score
# The base estimator is left at its default (a decision tree); the keyword for it differs between scikit-learn versions
modelBgC = BaggingClassifier(n_estimators=10, max_samples=1.0, max_features=1.0, 
                  bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=1, 
                  random_state=None, verbose=0)

modelBgC.fit(XTrain, YTrain)

# Predictions
YPredBgC=modelBgC.predict(XTest)

In [None]:
#Checking accuracy
acc_log = round(modelBgC.score(XTrain,YTrain) * 100, 2)
acc_log

## Combining Models(Ensemble)

In [None]:
#Voting
from sklearn.ensemble import VotingClassifier

# Training model and checking the score
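# Use the tuned estimators found by the grid searches (GBC_best, RFC_best) rather than the untuned GBC or the whole grid search
modelVotC = VotingClassifier(estimators=[('gbc',GBC_best),('rfc',RFC_best),('bgc',modelBgC),('lda',modelLDA),('lr',modelLR),('etc',modelETC),('nb',modelGNB)], voting='hard', n_jobs=3)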
modelVotC.fit(XTrain, YTrain)

# Predictions
YPredVC=modelVotC.predict(XTest)

In [None]:
#Checking accuracy
acc_log = round(modelVotC.score(XTrain,YTrain) * 100, 2)
acc_log

In [None]:
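#Prediction Results
#Exactly one line must be active, otherwise 'survived' is undefined in the next cell.
#GBC is left active here only as an example, since it scored best in the comparison above.
#survived = YPredLR.astype(int)
#survived = YPredLDA.astype(int)
#survived = YPredGNB.astype(int)
survived = YPredGBC.astype(int)
#survived = YPredRFC.astype(int)
#survived = YPredDTC.astype(int)
#survived = YPredVC.astype(int)
#survived = YPredBgC.astype(int)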

In [None]:
#Creating Submission file
submission = pd.DataFrame({
        "PassengerId": dfTest["PassengerId"],
        "Survived": survived
    })

#submission.to_csv('../output/submission.csv', index=False)
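
Before writing the file, it can be worth checking that the submission has the expected shape: one row per test passenger, and only 0 or 1 in `Survived`. The sketch below does this on hypothetical values and writes to an in-memory buffer instead of a path; `passenger_ids`, `predictions` and `demo_submission` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical ids and float predictions (illustrative values only)
passenger_ids = [892, 893, 894]
predictions = [0.0, 1.0, 0.0]

demo_submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": pd.Series(predictions).astype(int),  # 0/1 integers
})

# Basic sanity checks on the submission format
assert len(demo_submission) == len(passenger_ids)
assert set(demo_submission["Survived"].unique()) <= {0, 1}

# Write to an in-memory buffer; a file path could be used instead
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```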