<a href="https://colab.research.google.com/github/srivatsakr21/Kaggle/blob/master/Titanic_with_full_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import all the required libraries**

In [0]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline,FeatureUnion
from sklearn.model_selection import cross_val_predict,GridSearchCV
from sklearn.metrics import precision_score,recall_score,accuracy_score,f1_score
from sklearn.compose import ColumnTransformer

**Read the training data. For this example to work, train.csv should be in the same folder as this notebook**

In [0]:
titanic = pd.read_csv('train.csv')

**From this dataset we are trying to predict who survived, based on the other features present. Hence, separate the independent variables from the dependent variable**

In [67]:
X = titanic.drop('Survived',axis=1)
y=titanic[['Survived']]
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Name         891 non-null    object 
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB


The summary above shows missing values in Age, Cabin and Embarked. Cabin is populated for only 204 of the 891 rows (about 77% missing), so we drop it rather than try to impute it. We also drop PassengerId, since it is just an identifier.

In [0]:
X = X.drop(['Cabin','PassengerId'],axis=1)
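The missing-value shares behind that decision can be worked out from the counts in the `X.info()` output. The snippet below is a small sketch that only redoes that arithmetic; the non-null counts are copied from the output above and no data file is read.

```python
import pandas as pd

total_rows = 891
# Non-null counts copied from the X.info() output above
non_null = pd.Series({'Age': 714, 'Cabin': 204, 'Embarked': 889})

# Percentage of rows with a missing value, per column
missing_pct = ((total_rows - non_null) / total_rows * 100).round(1)
print(missing_pct)  # Age 19.9, Cabin 77.1, Embarked 0.2
```

On the loaded data itself, `titanic.isna().mean() * 100` should give the same percentages for every column.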

In [69]:
X.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


**Feature Engineering**




Feature engineering is one of the most important aspects of building a machine learning model: it is about extracting useful features from the data at hand. In this dataset, we can infer from SibSp and Parch whether a passenger was travelling alone or with family. We can also identify the passenger's 'Title' from the Name feature.

To extract the new features, I am building a custom feature extractor using the BaseEstimator and TransformerMixin classes

In [0]:
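class CustomFeatureExtractor(BaseEstimator,TransformerMixin):
  """Adds the 'Title' and 'IsAlone' columns and drops 'Name'."""

  def createTitle(self,name):
    # Names look like "Braund, Mr. Owen Harris": the title sits between ',' and '.'
    title=name.split(",")[1].split(".")[0].strip()
    if title in ['Mr','Mrs','Miss','Master','Dr','Major','Capt']:
      return title
    else:
      return 'Others'

  def inferIsAlone(self,row):
    # Family size = siblings/spouses + parents/children + the passenger
    family_size = row['SibSp']+row['Parch']+1
    if family_size==1:
      return 'Yes'
    else:
      return 'No'

  def fit(self,X,y=None):
    # Nothing to learn from the data
    return self

  def transform(self,X,y=None):
    print("Got these columns: ",X.columns)
    X = X.copy()  # work on a copy so the caller's DataFrame is not modified
    X['Title']=X['Name'].apply(self.createTitle)
    X['IsAlone'] = X.apply(self.inferIsAlone,axis=1)
    X = X.drop(['Name'],axis=1)
    print(X.head(5))
    return X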
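To make the two derived features concrete, here is a minimal sketch that computes them on a few rows with vectorised pandas operations instead of `apply`. It is an alternative formulation and not the notebook's own code: the regular expression, the `KNOWN_TITLES` name and the last sample row are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

KNOWN_TITLES = ['Mr', 'Mrs', 'Miss', 'Master', 'Dr', 'Major', 'Capt']

# Three names taken from the X.head() output above, plus one made-up row
sample = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina',
             'Allen, Mr. William Henry', 'Doe, Rev. John'],  # last row is hypothetical
    'SibSp': [1, 0, 0, 0],
    'Parch': [0, 0, 0, 2],
})

# The title is the text between the first ',' and the following '.'
title = sample['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
# Titles outside the known list are grouped under 'Others'
sample['Title'] = title.where(title.isin(KNOWN_TITLES), 'Others')
# A passenger is alone when there are no siblings/spouses and no parents/children
sample['IsAlone'] = np.where(sample['SibSp'] + sample['Parch'] == 0, 'Yes', 'No')

print(sample[['Title', 'IsAlone']])
```

The expected titles are Mr, Miss, Mr and Others, and the expected IsAlone values are No, Yes, Yes and No.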

In [0]:
class FeatureSelector(BaseEstimator,TransformerMixin):
  """Selects the given columns from a DataFrame."""
  def __init__(self,features=None):
    print("Got below features: ",features)
    # Store the argument under its own name so get_params() and clone() work
    self.features = features
  def fit(self,X,y=None):
    return self
  def transform(self,X,y=None):
    print("Got below features in FeatureSelector Transform: ",self.features)
    return X[self.features]
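One scikit-learn convention matters for custom transformers like this one: `get_params()` and `clone()` (which `GridSearchCV` relies on) look up attributes named exactly like the `__init__` arguments. The sketch below illustrates that with a hypothetical `ColumnSelector` class and toy data; neither is part of the notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical selector, used only to illustrate the naming convention."""

    def __init__(self, features=None):
        # The attribute name matches the __init__ argument name
        self.features = features

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.features]


selector = ColumnSelector(features=['Age', 'Fare'])
params = selector.get_params()   # {'features': ['Age', 'Fare']}
selector_copy = clone(selector)  # builds a fresh, unfitted copy from those params

# Toy data, not taken from train.csv
toy = pd.DataFrame({'Age': [22.0, 38.0], 'Fare': [7.25, 71.28],
                    'Sex': ['male', 'female']})
selected = selector_copy.fit_transform(toy)
print(list(selected.columns))
```

If the argument were stored under a different attribute name, `get_params()` would not find it and cloning the estimator would fail.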

Now let's build three pipelines: one for feature engineering, and two more that work on the numerical and categorical attributes respectively.

In [100]:
feature_extractor = Pipeline([
                              ('selector',FeatureSelector(features=X.columns)),
                              ('new_features',CustomFeatureExtractor())
])

Got below features:  Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Embarked'],
      dtype='object')


In [0]:
num_attributes = ['Age','SibSp','Parch','Fare']
cat_attributes = ['Pclass','Name','Sex','Embarked','Title','IsAlone']

In [94]:
num_pipeline = Pipeline([
                         ('selector',FeatureSelector(features=num_attributes)),
                         ('median_imputer',SimpleImputer(strategy='median')),
                         ('scaler',StandardScaler())
])

Got below features:  ['Age', 'SibSp', 'Parch', 'Fare']
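As a quick illustration of what the numerical steps do, the sketch below runs a median imputer followed by a standard scaler on a toy `Age` column with one missing value. The data and the `toy_num_pipeline` name are made up for this example, and the column selector is left out to keep it self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy values with one missing entry (not taken from train.csv)
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0]})

toy_num_pipeline = Pipeline([
    ('median_imputer', SimpleImputer(strategy='median')),  # NaN -> median of 22 and 26 = 24
    ('scaler', StandardScaler()),                          # zero mean, unit variance
])

scaled = toy_num_pipeline.fit_transform(toy)
print(scaled.ravel())  # approximately [-1.2247, 0.0, 1.2247]
```

The missing age is replaced by the median, 24, which is also the column mean here, so it maps to 0 after scaling.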


In [95]:
cat_pipeLine = Pipeline([
                         ('cat_selector',FeatureSelector(features=cat_attributes)),
                         ('mode_imputer',SimpleImputer(strategy='most_frequent')),
                         ('1hot',OneHotEncoder())
])

Got below features:  ['Pclass', 'Name', 'Sex', 'Embarked', 'Title', 'IsAlone']


In [0]:
features = FeatureUnion([
                                        ('ext',feature_extractor),
                                        ('num',num_pipeline),
                                        ('cat',cat_pipeLine)
])

In [0]:
full_pipeline = Pipeline([
                          ('full',features),
                          ('clf',DecisionTreeClassifier())
])

In [103]:
full_pipeline.fit(X,y)

Got below features in FeatureSelector Transform:  Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Embarked'],
      dtype='object')
Got these columns:  Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Embarked'],
      dtype='object')
[[3 'male' 22.0 ... 'S' 'Mr' 'No']
 [1 'female' 38.0 ... 'C' 'Mrs' 'No']
 [3 'female' 26.0 ... 'S' 'Miss' 'Yes']
 ...
 [3 'female' nan ... 'S' 'Miss' 'No']
 [1 'male' 26.0 ... 'C' 'Mr' 'Yes']
 [3 'male' 32.0 ... 'Q' 'Mr' 'Yes']]
Got below features in FeatureSelector Transform:  ['Age', 'SibSp', 'Parch', 'Fare']
Got below features in FeatureSelector Transform:  ['Pclass', 'Name', 'Sex', 'Embarked', 'Title', 'IsAlone']


KeyError: ignored
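The most likely cause of this error is how `FeatureUnion` works: it hands the same input `X` to each of its transformers and concatenates their outputs side by side. The `cat` branch therefore never sees the `Title` and `IsAlone` columns created in the `ext` branch, so selecting them raises a `KeyError`.

One possible fix is to run the feature extraction first and only then split the columns. The sketch below does that with a `ColumnTransformer`. The names `TitleAndAloneAdder`, `preprocess`, `sketch_pipeline` and the toy data are assumptions made for illustration. `Name` is left out of the categorical list because the extractor drops it, and `handle_unknown='ignore'` is an added choice so that unseen categories do not fail at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


class TitleAndAloneAdder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for CustomFeatureExtractor, without the prints."""
    known_titles = ['Mr', 'Mrs', 'Miss', 'Master', 'Dr', 'Major', 'Capt']

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        title = X['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
        X['Title'] = title.where(title.isin(self.known_titles), 'Others')
        X['IsAlone'] = np.where(X['SibSp'] + X['Parch'] == 0, 'Yes', 'No')
        return X.drop(columns=['Name'])


num_attributes = ['Age', 'SibSp', 'Parch', 'Fare']
cat_attributes = ['Pclass', 'Sex', 'Embarked', 'Title', 'IsAlone']  # no 'Name'

# Columns not listed here (such as Ticket) are dropped by default
preprocess = ColumnTransformer([
    ('num', Pipeline([('median_imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), num_attributes),
    ('cat', Pipeline([('mode_imputer', SimpleImputer(strategy='most_frequent')),
                      ('1hot', OneHotEncoder(handle_unknown='ignore'))]), cat_attributes),
])

# Steps run in sequence, so 'preprocess' sees the engineered columns
sketch_pipeline = Pipeline([
    ('new_features', TitleAndAloneAdder()),
    ('preprocess', preprocess),
    ('clf', DecisionTreeClassifier(random_state=0)),
])

# Made-up rows with the same columns as X (not taken from train.csv)
toy_X = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Name': ['Doe, Mr. John', 'Roe, Mrs. Jane', 'Poe, Miss. Ann',
             'Loe, Dr. Sam', 'Moe, Rev. Tom', 'Noe, Master. Tim'],
    'Sex': ['male', 'female', 'female', 'male', 'male', 'male'],
    'Age': [22.0, 38.0, np.nan, 50.0, 41.0, 4.0],
    'SibSp': [1, 1, 0, 0, 0, 3],
    'Parch': [0, 0, 0, 0, 0, 1],
    'Ticket': ['T1', 'T2', 'T3', 'T4', 'T5', 'T6'],
    'Fare': [7.25, 71.28, 7.92, 53.1, 13.0, 21.07],
    'Embarked': ['S', 'C', np.nan, 'S', 'Q', 'S'],
})
toy_y = pd.Series([0, 1, 1, 0, 0, 1], name='Survived')

sketch_pipeline.fit(toy_X, toy_y)
preds = sketch_pipeline.predict(toy_X)
print(preds)
```

With the real data, the same structure would be fitted as `sketch_pipeline.fit(X, titanic['Survived'])`. Because every step lives in one `Pipeline`, it can also be passed to `cross_val_predict` or `GridSearchCV`.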