
**Import all the required libraries**

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline,FeatureUnion
from sklearn.model_selection import cross_val_predict,GridSearchCV
from sklearn.metrics import precision_score,recall_score,accuracy_score,f1_score
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

**Read the training data. For this example to work, train.csv should be in the same folder as this notebook**

In [2]:
titanic = pd.read_csv('train.csv')

**From this dataset we are trying to predict who survived, based on the other features it contains. Hence, separate the independent variables from the dependent variable**

In [3]:
X = titanic.drop('Survived',axis=1)
# Use a 1-D Series for the target; a one-column DataFrame triggers a DataConversionWarning in fit
y = titanic['Survived']
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB


The output above shows missing values for Age, Cabin and Embarked. Cabin is empty in about 77% of the rows (687 of 891), so we drop it rather than impute it. We also drop PassengerId, since it is just an identifier.
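As a quick check of those proportions, here is a minimal sketch that recomputes the share of missing values from the non-null counts printed by `X.info()`. The variable names are illustrative; on the real frame, `titanic.isna().mean()` should give the same figures for every column.

```python
import pandas as pd

# Non-null counts copied from the X.info() output above
non_null = pd.Series({'Age': 714, 'Cabin': 204, 'Embarked': 889})
total_rows = 891

# Share of rows with a missing value in each column
missing_share = (1 - non_null / total_rows).round(3)
print(missing_share)
```

Cabin stands out at roughly 0.77, while Age (about 0.2) and Embarked (about 0.002) are sparse enough to impute.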

In [4]:
X = X.drop(['Cabin','PassengerId'],axis=1)

In [5]:
X.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


**Feature Engineering**




Feature engineering is one of the most important aspects of building a machine learning model. It is all about extracting useful features from the data at hand. In this data, we can infer whether a passenger was travelling alone or with family by adding up SibSp and Parch. We can also identify the 'Title' of the person from the Name feature.

To extract the new features, I am building a custom feature extractor using the BaseEstimator and TransformerMixin classes

In [6]:
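class CustomFeatureExtractor(BaseEstimator,TransformerMixin):
  """Adds the 'Title' and 'IsAlone' columns and drops 'Name'."""

  def createTitle(self,name):
    # Names look like "Braund, Mr. Owen Harris": the title sits between the comma and the first period
    title=name.split(",")[1].split(".")[0].strip()
    if title in ['Mr','Mrs','Miss','Master','Dr','Capt']:
      return title
    else:
      return 'Others'

  def inferIsAlone(self,row):
    # Family size = siblings/spouses + parents/children + the passenger
    familySize = row['SibSp']+row['Parch']+1
    if familySize==1:
      return 'Yes'
    else:
      return 'No'

  def fit(self,X,y=None):
    return self

  def transform(self,X,y=None):
    # Work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    X['Title']=X['Name'].apply(self.createTitle)
    X['IsAlone'] = X.apply(self.inferIsAlone,axis=1)
    X = X.drop(['Name'],axis=1)
    return X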
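
The same two features can also be derived without row-wise `apply`. The sketch below is a vectorized variant, not the extractor used in this notebook. It assumes every name follows the "Surname, Title. Given names" pattern, and the third sample row is made up purely to exercise the 'Others' branch.

```python
import numpy as np
import pandas as pd

# Two names taken from the head() output above, plus one hypothetical name
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina', 'Doe, Rev. John'],
    'SibSp': [1, 0, 0],
    'Parch': [0, 0, 2],
})
common_titles = ['Mr', 'Mrs', 'Miss', 'Master', 'Dr', 'Capt']

# Capture the text between the comma and the first period
title = sample['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
# Keep the common titles and group everything else under 'Others'
sample['Title'] = title.where(title.isin(common_titles), 'Others')
# A passenger is alone when there are no siblings/spouses and no parents/children aboard
sample['IsAlone'] = np.where(sample['SibSp'] + sample['Parch'] == 0, 'Yes', 'No')
print(sample[['Title', 'IsAlone']])
```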

In [7]:
class FeatureSelector(BaseEstimator,TransformerMixin):
  def __init__(self,features=None):
    
    self.features = features
    
  def fit(self,X,y=None):
    return self
  def transform(self,X,y=None):
    
    return X[self.features]    

In [8]:
class Debug(BaseEstimator,TransformerMixin):
    def fit(self,X,y=None):
        return self

    def transform(self,X,y=None):
        print("Debug: ",type(X))
        return X

Now let's build the pipelines. One handles the numerical attributes. The other runs the feature extractor and then handles the categorical attributes. A FeatureUnion combines their outputs, and a final pipeline adds the classifier on top.

In [9]:

num_attributes = ['Age','SibSp','Parch','Fare']
cat_attributes = ['Pclass','Sex','Embarked','Title','IsAlone']
x_columns = X.columns

In [10]:
num_pipeline = Pipeline([
                         ('num_features',FeatureSelector(features=num_attributes)),
                         ('median_imputer',SimpleImputer(strategy='median')),
                         ('scalar',StandardScaler())
])

In [11]:
cat_pipeLine = Pipeline([
                         ('all_features',FeatureSelector(features=x_columns)),
                         ('extractor',CustomFeatureExtractor()),
                         ('cat_features',FeatureSelector(features=cat_attributes)),
                         ('mode_imputer',SimpleImputer(strategy='most_frequent')),
                         ('1hot',OneHotEncoder(handle_unknown='ignore'))
])

In [12]:
features = FeatureUnion([
                                        
                                        ('num',num_pipeline),
                                        ('cat',cat_pipeLine)
])
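
`ColumnTransformer` is imported at the top of the notebook but not used. As a rough sketch of how it could replace the `FeatureSelector` plus `FeatureUnion` combination for the column-routing part, the example below runs on a small made-up frame. The frame `toy`, the step names and the reduced column lists are illustrative assumptions, and the custom feature extractor is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame with one missing value per column type
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
    'Sex': ['male', 'female', 'female', 'female'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

numeric_steps = Pipeline([
    ('median_imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical_steps = Pipeline([
    ('mode_imputer', SimpleImputer(strategy='most_frequent')),
    ('1hot', OneHotEncoder(handle_unknown='ignore')),
])

# Each entry names a sub-pipeline and the columns it should receive
preprocess = ColumnTransformer([
    ('num', numeric_steps, ['Age', 'Fare']),
    ('cat', categorical_steps, ['Sex', 'Embarked']),
])

features_matrix = preprocess.fit_transform(toy)
print(features_matrix.shape)
```

The column lists take over the role of `FeatureSelector`, so no custom selector class is needed for this part.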

In [19]:
full_pipeline = Pipeline([
                          ('full',features),
                          ('clf',VotingClassifier(estimators=[
                              ('lr',LogisticRegression()),
                              ('svm',SVC(probability=True)),
                              ('dt',DecisionTreeClassifier(random_state=42))
                          ]))
])

In [20]:
params={
    'clf__dt__criterion':['entropy'],
    'clf__dt__splitter':['random'],
    'clf__dt__max_depth':[None,5,7,10],
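    'clf__dt__max_features':[None,'auto','sqrt','log2'],  # 'auto' behaves like 'sqrt' for this classifier; newer scikit-learn releases deprecated and then removed it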
    'clf__dt__min_samples_leaf':[0.05,0.075],
    'clf__dt__min_samples_split':[0.05,0.75,0.1],
    'clf__dt__min_impurity_decrease':[0.1,0.15,0.20],
    'clf__svm__kernel':['linear','poly','rbf'],
    'clf__svm__gamma':[0.1,1,10,100],
    'clf__lr__C':[0.001,0.01,0.1,1,10,100],
    'clf__voting':['hard','soft']
}
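
Before launching the search it helps to know how large this grid is. The sketch below repeats the dictionary above and counts the combinations with `ParameterGrid`. The fold count of 5 matches the `cv=5` passed to `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so the snippet runs on its own
params = {
    'clf__dt__criterion': ['entropy'],
    'clf__dt__splitter': ['random'],
    'clf__dt__max_depth': [None, 5, 7, 10],
    'clf__dt__max_features': [None, 'auto', 'sqrt', 'log2'],
    'clf__dt__min_samples_leaf': [0.05, 0.075],
    'clf__dt__min_samples_split': [0.05, 0.75, 0.1],
    'clf__dt__min_impurity_decrease': [0.1, 0.15, 0.20],
    'clf__svm__kernel': ['linear', 'poly', 'rbf'],
    'clf__svm__gamma': [0.1, 1, 10, 100],
    'clf__lr__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'clf__voting': ['hard', 'soft'],
}

# The grid size is the product of the list lengths
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * 5  # one fit per candidate per fold
print(n_candidates, n_fits)  # 41472 207360
```

With over 200,000 fits, an exhaustive search over this grid is slow on a single worker; trimming the value lists shrinks it multiplicatively.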

In [17]:
grid = GridSearchCV(full_pipeline,params,verbose=3,cv=5)
grid.fit(X,y)

Fitting 5 folds for each of 41472 candidates, totalling 207360 fits
[CV] clf__dt__criterion=entropy, clf__dt__max_depth=None, clf__dt__max_features=None, clf__dt__min_impurity_decrease=0.1, clf__dt__min_samples_leaf=0.05, clf__dt__min_samples_split=0.05, clf__dt__splitter=random, clf__lr__C=0.001, clf__svm__gamma=0.1, clf__svm__kernel=linear, clf__voting=hard 
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
[CV]  clf__dt__criterion=entropy, clf__dt__max_depth=None, clf__dt__max_features=None, clf__dt__min_impurity_decrease=0.1, clf__dt__min_samples_leaf=0.05, clf__dt__min_samples_split=0.05, clf__dt__splitter=random, clf__lr__C=0.001, clf__svm__gamma=0.1, clf__svm__kernel=linear, clf__voting=hard, score=0.855, total=   0.3s
[CV] clf__dt__criterion=entropy, clf__dt__max_depth=None, clf__dt__max_features=None, clf__dt__min_impurity_decrease=0.1, clf__dt__min_samples_leaf=0.05, clf__dt__min_

In [ ]:
print(grid.best_estimator_)

Pipeline(memory=None,
         steps=[('full',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('num',
                                                 Pipeline(memory=None,
                                                          steps=[('num_features',
                                                                  FeatureSelector(features=['Age',
                                                                                            'SibSp',
                                                                                            'Parch',
                                                                                            'Fare'])),
                                                                 ('median_imputer',
                                                                  SimpleImputer(add_indicator=False,
                                                                                copy=True,
                     

In [ ]:
grid.best_score_

0.8350009415604797

In [ ]:
test_data = pd.read_csv('test.csv')
passengerId = test_data['PassengerId']
test_data = test_data.drop(['Cabin','PassengerId'],axis=1)

In [ ]:
y_pred = grid.predict(test_data)

<class 'pandas.core.frame.DataFrame'>
['Age', 'SibSp', 'Parch', 'Fare']
<class 'pandas.core.frame.DataFrame'>
Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Embarked'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
['Pclass', 'Sex', 'Embarked', 'Title', 'IsAlone']


In [ ]:
y_pred

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [ ]:
# Pair each test PassengerId with its predicted label
finalDf = pd.DataFrame({'PassengerId': passengerId, 'Survived': y_pred})

In [ ]:
finalDf.head(5)

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


In [ ]:
finalDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
PassengerId    418 non-null int64
Survived       418 non-null int64
dtypes: int64(2)
memory usage: 6.7 KB


In [ ]:
finalDf.to_csv('prediction_ensemble.csv',index=False)