<a href="https://colab.research.google.com/github/vineetjoshi253/Contests-Kaggle_Analytics-Vidhya/blob/master/Titanic/Titanic_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Ignore Warnings**

In [0]:
import warnings
warnings.filterwarnings('ignore')

## 1. Data Preprocessing & Feature Engineering
 

**1.2 Processing Age Attribute**


---


The Age column has 177 missing values. We could fill all of them with the mean of the entire Age column, or be more specific and fill them with group-level means.
Below, we impute the missing age of a male passenger with the mean age of the other male passengers, and likewise use the mean age of the other female passengers for a female passenger.

In [0]:
import pandas as pd
import numpy as np

train_data = pd.read_csv('train.csv')
features = train_data.columns.tolist()

Remove = ['PassengerId','Cabin','Ticket','Fare']

print(train_data.groupby('Sex')['Age'].mean())
print('Average Age: ',train_data['Age'].mean(),sep="")

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
Average Age: 29.69911764705882


In [0]:
# Look up the group means by label rather than relying on their alphabetical order.
Mean = train_data.groupby('Sex')['Age'].mean()
FemaleAvg = Mean['female']
MaleAvg = Mean['male']

In [0]:
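# Fill each missing age with the mean age of the passenger's sex.
# .loc writes into train_data directly; chained indexing such as
# train_data['Age'][i] = ... may write to a temporary copy instead.
missing_age = train_data['Age'].isna()
train_data.loc[missing_age & (train_data['Sex'] == 'male'), 'Age'] = MaleAvg
train_data.loc[missing_age & (train_data['Sex'] != 'male'), 'Age'] = FemaleAvg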
      
print('Null Values Left: ',train_data['Age'].isna().sum(),sep="")

Null Values Left: 0
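
The same per-group imputation can also be written in one step with `groupby(...).transform('mean')`. The sketch below is an illustration on a made-up toy frame (the `toy` DataFrame and its values are assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up for illustration.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female', 'male'],
    'Age': [20.0, 34.0, np.nan, np.nan, 40.0],
})

# Mean age of each passenger's own sex group, aligned row by row with toy.
group_mean = toy.groupby('Sex')['Age'].transform('mean')

# Fill only the missing ages; observed ages are left untouched.
toy['Age'] = toy['Age'].fillna(group_mean)

print(toy)
```

Here the missing male age becomes 30.0 (the mean of 20 and 40) and the missing female age becomes 34.0.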


**1.3 Processing Name Attribute**


---

The specific name of a person is of no direct use to us, but we can try to extract some information from it.

*All names in the dataset have the format "Surname, Title. Name".*

In [0]:
#Function to get title from a name.
def get_title(name):
    if '.' in name:
        return name.split(',')[1].split('.')[0].strip()
    else:
        return 'Unknown'

In [0]:
# Extract the title from every passenger name.
Titles = np.asarray([get_title(name) for name in train_data['Name']])
print(set(Titles))

{'Capt', 'Miss', 'Mr', 'Mlle', 'Mrs', 'Mme', 'Sir', 'the Countess', 'Major', 'Col', 'Dr', 'Don', 'Ms', 'Lady', 'Master', 'Rev', 'Jonkheer'}
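
As an alternative to calling `get_title` row by row, the title can probably be pulled out with a vectorized regular expression. This is a sketch under the assumption that every name follows the "Surname, Title. Name" pattern; the `names` Series is a small hand-made sample and the last entry is deliberately malformed:

```python
import pandas as pd

# Hand-made sample; the last entry has no "Surname, Title." part on purpose.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'No title here',
])

# Capture everything between the first comma and the next period.
titles = names.str.extract(r',\s*([^.]*)\.', expand=False).str.strip()

# Names that do not match the pattern come back as NaN; label them 'Unknown'.
titles = titles.fillna('Unknown')

print(titles.tolist())
```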


In [0]:
def replace_title(title,Sex):
    if title in ['Capt', 'Col', 'Don', 'Jonkheer', 'Major', 'Rev', 'Sir']:
        return 'Mr'
    elif title in ['the Countess', 'Mme', 'Lady','Dona']:
        return 'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 'Miss'
    elif title =='Dr':
        if Sex == 'male':
            return 'Mr'
        else:
            return 'Mrs'
    else:
        return title

In [0]:
for i in range(len(Titles)):
  Titles[i] = replace_title(Titles[i],train_data['Sex'][i])

print(set(Titles))
train_data['Name'] = Titles

{'Master', 'Miss', 'Mr', 'Mrs'}
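
The branches of `replace_title` can also be expressed as a lookup table, which may be easier to extend when a new title shows up. The names `TITLE_MAP` and `replace_title_lookup` below are hypothetical and only mirror the function above:

```python
# Hypothetical lookup table mirroring the branches of replace_title.
TITLE_MAP = {
    'Capt': 'Mr', 'Col': 'Mr', 'Don': 'Mr', 'Jonkheer': 'Mr',
    'Major': 'Mr', 'Rev': 'Mr', 'Sir': 'Mr',
    'the Countess': 'Mrs', 'Mme': 'Mrs', 'Lady': 'Mrs', 'Dona': 'Mrs',
    'Mlle': 'Miss', 'Ms': 'Miss',
}

def replace_title_lookup(title, sex):
    # 'Dr' is the only title whose replacement depends on the passenger's sex.
    if title == 'Dr':
        return 'Mr' if sex == 'male' else 'Mrs'
    # Titles missing from the table ('Mr', 'Mrs', 'Miss', 'Master') pass through.
    return TITLE_MAP.get(title, title)

print(replace_title_lookup('Mlle', 'female'))
print(replace_title_lookup('Dr', 'male'))
```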


**1.4 FamilySize and FarePerPerson**


---

We create two new attributes, 'FamilySize' and 'FarePerPerson'. 'FamilySize' is the sum of the 'SibSp' and 'Parch' attributes, and 'FarePerPerson' divides 'Fare' by the number of people travelling together ('FamilySize' + 1, so that the passenger is counted too).

In [0]:
train_data['FamilySize']=train_data['SibSp']+train_data['Parch']
train_data['FarePerPerson']=train_data['Fare']/(train_data['FamilySize']+1)
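
To make the arithmetic concrete, here is a small worked example on made-up rows (the `toy` frame and its numbers are assumptions for illustration only):

```python
import pandas as pd

# Three made-up passengers: a couple, a solo traveller, and a family of six.
toy = pd.DataFrame({
    'SibSp': [1, 0, 3],
    'Parch': [0, 0, 2],
    'Fare': [20.0, 8.05, 30.0],
})

# Relatives on board, not counting the passenger.
toy['FamilySize'] = toy['SibSp'] + toy['Parch']

# The +1 counts the passenger, so a solo traveller divides by 1.
toy['FarePerPerson'] = toy['Fare'] / (toy['FamilySize'] + 1)

print(toy)
```

The solo traveller keeps the full fare of 8.05, while the family of six ends up with 30.0 / 6 = 5.0 per person.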

**1.5 Processing Embarked**

There are only two missing values in the 'Embarked' column. A little research showed that both of these passengers embarked from port 'S'.

In [0]:
train_data['Embarked'] = train_data['Embarked'].fillna('S')

**1.6 Dropping Unwanted Columns and Encoding To Numerical**

---



In [0]:
for item in Remove:
  train_data.drop(item,inplace=True,axis=1)

print(train_data.columns)

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Embarked', 'FamilySize', 'FarePerPerson'],
      dtype='object')


In [0]:
from sklearn import preprocessing
Encode = ['Name','Sex','Embarked']

label_encoder = preprocessing.LabelEncoder() 
for item in Encode:
  train_data[item]= label_encoder.fit_transform(train_data[item])

print(train_data.sample(3))

     Survived  Pclass  Name  Sex  ...  Parch  Embarked  FamilySize  FarePerPerson
323         1       1     1    0  ...      1         0           1        28.9896
93          0       3     2    1  ...      0         2           0         8.0500
284         1       3     1    0  ...      0         1           0         7.7500

[3 rows x 10 columns]


## 2. Getting Baseline Accuracy

Using K-Fold Cross Validation To Test The Following Models:



*   Random Forest
*   Support Vector Machine
*   XGBoost


In [0]:
features = train_data.columns.tolist()
features.pop(0)

X = np.asarray(train_data[features].values)
Y = np.asarray(train_data['Survived'])
print(X.shape)
print(Y.shape)

(881, 9)
(881,)


### 2.1 Random Forest

---



In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import KFold


kfold = KFold(5)

Results = []
for train,test in kfold.split(X):
  Xtrain = X[train]
  Xtest  = X[test]
  Ytrain = Y[train]
  Ytest = Y[test]
    
  model = RandomForestClassifier()
  model.fit(Xtrain,Ytrain)
  Ypred = model.predict(Xtest)
  Results.append(accuracy_score(Ytest,Ypred))
       
Results = np.asarray(Results)
print('Random Forest')
print(Results.mean())
    

Random Forest
0.8127182845403185
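
The manual fold loop above can be written more compactly with scikit-learn's `cross_val_score`. The sketch below runs on synthetic data (`X_demo` and `y_demo` come from `make_classification` and are assumptions, not the Titanic features), so its scores say nothing about the numbers reported in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X and Y with nine features, like the notebook's matrix.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

model_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# KFold(5) splits the rows into five consecutive blocks without shuffling,
# the same splitter the loop above uses.
scores = cross_val_score(model_demo, X_demo, y_demo, cv=KFold(5), scoring='accuracy')

print(scores)
print(scores.mean())
```

Passing `KFold(5, shuffle=True, random_state=...)` instead would shuffle the rows before splitting, which can matter when the file is ordered in some way.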


**2.1.1 Random Forest: Hyperparameter Tuning Using GridSearch**

---



In [0]:
from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier(random_state=1, n_jobs=-1)

param_grid = {"criterion" : ["gini", "entropy"], "min_samples_leaf" : [1, 5, 10], "min_samples_split" : [2, 4, 10, 12, 16], "n_estimators": [50, 100, 400, 700, 1000]}

gs = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)

gs = gs.fit(X,Y)

print(gs.best_score_)
print(gs.best_estimator_)
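
Rather than copying the printed estimator by hand, the winning settings can be read from `best_params_` and passed to a fresh model. This is a sketch with a deliberately tiny grid on synthetic data (`X_demo`, `y_demo`, `small_grid` and `gs_demo` are illustrative names, not part of the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X and Y.
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

# A much smaller grid than the one above, so the sketch runs quickly.
small_grid = {'min_samples_split': [2, 10], 'n_estimators': [10, 20]}

gs_demo = GridSearchCV(RandomForestClassifier(random_state=1),
                       param_grid=small_grid, scoring='accuracy', cv=3)
gs_demo.fit(X_demo, y_demo)

# best_params_ holds only the searched parameters, so it can be unpacked
# into a new estimator without listing every default explicitly.
tuned = RandomForestClassifier(random_state=1, **gs_demo.best_params_)

print(gs_demo.best_params_)
```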

**2.1.2 Tuned Random Forest**

---



In [0]:
kfold = KFold(5)

Results = []
for train,test in kfold.split(X):
  Xtrain = X[train]
  Xtest  = X[test]
  Ytrain = Y[train]
  Ytest = Y[test]
    
  # Tuned values for the searched parameters; everything else stays at its default.
  # max_features='auto' and min_impurity_split are dropped because recent
  # scikit-learn releases no longer accept them.
  model = RandomForestClassifier(criterion='gini', min_samples_leaf=1,
                       min_samples_split=10, n_estimators=700,
                       n_jobs=-1, random_state=1)
  
  model.fit(Xtrain,Ytrain)
  Ypred = model.predict(Xtest)
  Results.append(accuracy_score(Ytest,Ypred))
       
Results = np.asarray(Results)
print('Random Forest')
print(Results.mean())


Random Forest
0.8342835130970725


### 2.2 Support Vector Machine

In [0]:
from sklearn.svm import SVC
kfold = KFold(5)

Results = []
for train,test in kfold.split(X):
  Xtrain = X[train]
  Xtest  = X[test]
  Ytrain = Y[train]
  Ytest = Y[test]
    
  model = SVC()
  model.fit(Xtrain,Ytrain)
  Ypred = model.predict(Xtest)
  Results.append(accuracy_score(Ytest,Ypred))
       
Results = np.asarray(Results)
print('Support Vector Machine')
print(Results.mean())

Support Vector Machine
0.7605482794042117


**2.2.1 Support Vector Machine: Hyperparameter Tuning Using GridSearch**

---



In [0]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma' : [0.001, 0.01, 0.1, 1]}

model = SVC()
gs = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)
gs = gs.fit(X,Y)

print(gs.best_score_)
print(gs.best_estimator_)

**2.2.2 Tuned Support Vector Machine**

---



In [0]:
kfold = KFold(5)

Results = []
for train,test in kfold.split(X):
  Xtrain = X[train]
  Xtest  = X[test]
  Ytrain = Y[train]
  Ytest = Y[test]
    
  model = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
              decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
              max_iter=-1, probability=False, random_state=None, shrinking=True,
              tol=0.001, verbose=False)
  
  model.fit(Xtrain,Ytrain)
  Ypred = model.predict(Xtest)
  Results.append(accuracy_score(Ytest,Ypred))
       
Results = np.asarray(Results)
print('Support Vector Machine')
print(Results.mean())

Support Vector Machine
0.7911594761171032


### 2.3 XGBoost

In [0]:
import xgboost as xgb
kfold = KFold(5)

Results = []
for train,test in kfold.split(X):
  Xtrain = X[train]
  Xtest  = X[test]
  Ytrain = Y[train]
  Ytest = Y[test]
    
  model =  xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)
  
  model.fit(Xtrain,Ytrain)
  Ypred = model.predict(Xtest)
  Results.append(accuracy_score(Ytest,Ypred))
       
Results = np.asarray(Results)
print('XGBoost')
print(Results.mean())

XGBoost
0.8365498202362609


## 3. Ensemble Model

---



**3.1 Max Voting Ensemble**

In [0]:
from sklearn.ensemble import VotingClassifier

# Tuned random forest; max_features='auto' and min_impurity_split are dropped
# because recent scikit-learn releases no longer accept them.
model1 = RandomForestClassifier(criterion='gini', min_samples_leaf=1,
                       min_samples_split=10, n_estimators=700,
                       n_jobs=-1, random_state=1)

model2 = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
              decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
              max_iter=-1, probability=False, random_state=None, shrinking=True,
              tol=0.001, verbose=False)
  

model3 = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)


kfold = KFold(5)

Results = []
for train,test in kfold.split(X):
  Xtrain = X[train]
  Xtest  = X[test]
  Ytrain = Y[train]
  Ytest = Y[test]
    
  model = VotingClassifier(estimators=[('RF', model1),('SVC',model2),('XGB', model3)], voting='hard')
  
  model.fit(Xtrain,Ytrain)
  Ypred = model.predict(Xtest)
  Results.append(accuracy_score(Ytest,Ypred))
       
Results = np.asarray(Results)
print('Max Voting Ensemble: RF + XGB + SVM')
print(Results.mean())

Max Voting Ensemble: RF + XGB + SVM
0.830861581920904
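
With `voting='hard'`, each classifier casts one vote per passenger and the majority class wins. The sketch below reproduces that rule by hand on made-up predictions (the three arrays are assumptions for illustration, not outputs of the models above):

```python
import numpy as np

# Made-up 0/1 predictions from three classifiers for five passengers.
pred_rf = np.array([1, 0, 1, 0, 1])
pred_svc = np.array([1, 1, 0, 0, 1])
pred_xgb = np.array([0, 0, 1, 0, 1])

# One row per classifier, one column per passenger.
votes = np.vstack([pred_rf, pred_svc, pred_xgb])

# With three binary voters, class 1 wins when at least two of them predict it.
majority = (votes.sum(axis=0) >= 2).astype(int)

print(majority)  # [1 0 1 0 1]
```

Because there are three voters and two classes, a tie cannot occur in this setup.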


**3.2 Final Model**

---



In [0]:
# Tuned random forest; max_features='auto' and min_impurity_split are dropped
# because recent scikit-learn releases no longer accept them.
model1 = RandomForestClassifier(criterion='gini', min_samples_leaf=1,
                       min_samples_split=10, n_estimators=700,
                       n_jobs=-1, random_state=1)
model2 = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
              decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
              max_iter=-1, probability=False, random_state=None, shrinking=True,
              tol=0.001, verbose=False)
  

model3 = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)

model = VotingClassifier(estimators=[('RF', model1),('SVC', model2),('XGB', model3)], voting='hard')

model.fit(X,Y)
print('Model Ready')

Model Ready


## 4. Processing Test Data

---



**4.1 Handling Missing Values**

In [0]:
test_data = pd.read_csv('test.csv')

print(test_data.isna().sum())

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


In [0]:
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
Mean = test_data.groupby('Sex')['Age'].mean().tolist()
FemaleAvg = Mean[0]
MaleAvg = Mean[1]

print(FemaleAvg)
print(MaleAvg)

30.27236220472441
30.27273170731707


In [0]:
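# Fill each missing age with the mean age of the passenger's sex.
# .loc writes into test_data directly; chained indexing such as
# test_data['Age'][i] = ... may write to a temporary copy instead.
missing_age = test_data['Age'].isna()
test_data.loc[missing_age & (test_data['Sex'] == 'male'), 'Age'] = MaleAvg
test_data.loc[missing_age & (test_data['Sex'] != 'male'), 'Age'] = FemaleAvg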
      
print('Null Values Left: ',test_data['Age'].isna().sum(),sep="")

Null Values Left: 0


**4.2 Extracting Titles From Names**

In [0]:
# Extract the title from every passenger name.
Titles = np.asarray([get_title(name) for name in test_data['Name']])
print(set(Titles))

{'Rev', 'Dr', 'Mr', 'Miss', 'Master', 'Col', 'Dona', 'Ms', 'Mrs'}


In [0]:
for i in range(len(Titles)):
  Titles[i] = replace_title(Titles[i],test_data['Sex'][i])

print(set(Titles))
test_data['Name'] = Titles

{'Miss', 'Master', 'Mr', 'Mrs'}


**4.3 FamilySize and FarePerPerson**

In [0]:
test_data['FamilySize']=test_data['SibSp']+test_data['Parch']
test_data['FarePerPerson']=test_data['Fare']/(test_data['FamilySize']+1)

**4.4 Dropping Unwanted Columns and Encoding To Numerical**

In [0]:
for item in Remove:
  test_data.drop(item,inplace=True,axis=1)

print(test_data.columns)

Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked',
       'FamilySize', 'FarePerPerson'],
      dtype='object')


In [0]:
Encode = ['Name','Sex','Embarked']

label_encoder = preprocessing.LabelEncoder() 
for item in Encode:
  test_data[item]= label_encoder.fit_transform(test_data[item])

print(test_data.sample(3))

     Pclass  Name  Sex        Age  ...  Parch  Embarked  FamilySize  FarePerPerson
295       3     2    1  26.000000  ...      0         2           0         7.8958
292       3     2    1  30.272732  ...      0         0           0         7.2292
54        2     2    1  30.272732  ...      0         0           0        15.5792

[3 rows x 9 columns]
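
Refitting a `LabelEncoder` on the test data gives the same integer codes as in training only when both files contain exactly the same categories. The reduced title sets printed above are identical for the two files, so that appears to hold here, but a safer pattern is to fit on the training column and reuse the fitted encoder. A sketch on toy columns (`train_toy` and `test_toy` are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns; 'C' appears in the training column but not in the test column.
train_toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
test_toy = pd.DataFrame({'Embarked': ['Q', 'S']})

# Fit once on the training column, then reuse the same mapping on the test column.
encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_toy['Embarked'])
test_codes = encoder.transform(test_toy['Embarked'])

# For comparison: refitting on the test column alone assigns different codes.
refit_codes = LabelEncoder().fit_transform(test_toy['Embarked'])

print(list(encoder.classes_))  # ['C', 'Q', 'S']
print(test_codes)              # [1 2]
print(refit_codes)             # [0 1]
```

With the refit, 'Q' silently becomes 0, which is the code that meant 'C' during training.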


## 5. Get Submission

---



In [0]:
features = test_data.columns.tolist()
print(features)

Xtest = np.asarray(test_data[features].values)
print(Xtest.shape)

['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked', 'FamilySize', 'FarePerPerson']
(418, 9)


In [0]:
Ypred = model.predict(Xtest)
print(Ypred.shape)

(418,)


In [0]:
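Solution = pd.read_csv('Solution.csv')
# Assign all predictions at once instead of element-wise chained assignment.
Solution['Survived'] = Ypred

# index=False keeps the pandas row index out of the submission file.
Solution.to_csv('FinalSolution.csv', index=False)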
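
If no template file is at hand, the submission can probably be assembled directly from the passenger ids and the predictions. The sketch assumes a two-column layout with 'PassengerId' and 'Survived'; the ids and predictions are made up, and in this notebook the ids would have to be saved before 'PassengerId' is dropped from `test_data`:

```python
import io

import numpy as np
import pandas as pd

# Made-up ids and predictions standing in for test.csv and Ypred.
passenger_ids = pd.Series([892, 893, 894], name='PassengerId')
preds = np.array([0, 1, 0])

submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': preds})

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

print(buffer.getvalue())
```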