Here, the goal is to understand why pipelines are needed, by first going through the manual process, which is tedious and error-prone.

We will take the Titanic dataset, build a Decision Tree model using the manual step-by-step process, and then check its prediction accuracy on test data, i.e. how well it predicts whether a passenger on the Titanic survived or not.

In [27]:
import pandas as pd
import numpy as np

In [28]:
df = pd.read_csv("titanic.csv")
df.sample(5)

(index),PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S
614,615,0,3,"Brocklebank, Mr. William Alfred",male,35.0,0,0,364512,8.05,,S
800,801,0,2,"Ponesell, Mr. Martin",male,34.0,0,0,250647,13.0,,S
500,501,0,3,"Calic, Mr. Petar",male,17.0,0,0,315086,8.6625,,S
97,98,1,1,"Greenfield, Mr. William Bertram",male,23.0,0,1,PC 17759,63.3583,D10 D12,C


In [29]:
df = df.drop(columns=['PassengerId','Name','Ticket','Cabin']) # dropping these columns since we are not using them as features to predict whether a passenger will survive or not
df.sample(5)

(index),Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
140,0,3,female,,0,2,15.2458,C
155,0,1,male,51.0,0,1,61.3792,C
95,0,3,male,,0,0,8.05,S
52,1,1,female,49.0,1,0,76.7292,C
359,1,3,female,,0,0,7.8792,Q


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


As seen, we have 891 data points here. 'Survived' is the output column, indicating 1 (survived) or 0 (did not survive) for each passenger.

The Age and Embarked columns have some NaN/null values (177 and 2 missing, respectively), which need to be imputed using the SimpleImputer class.

Pclass, Sex and Embarked are categorical. Sex and Embarked are nominal and stored as strings, so they need to be one-hot encoded (OHE) before model creation; Pclass is already stored as an integer and is used as-is.

Then, after doing these transformations and concatenating everything into a single NumPy array, we will create a Decision Tree model and run a prediction on the test data.



In [31]:
df['Pclass'].value_counts(), df['Sex'].value_counts(), df['Embarked'].value_counts()

(Pclass
 3    491
 1    216
 2    184
 Name: count, dtype: int64,
 Sex
 male      577
 female    314
 Name: count, dtype: int64,
 Embarked
 S    644
 C    168
 Q     77
 Name: count, dtype: int64)

First, we will split the data into training and testing sets ->

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Survived']), df['Survived'], test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((712, 7), (179, 7), (712,), (179,))
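
The 712/179 split follows from test_size=0.2 on 891 rows: scikit-learn rounds the test share up and gives the remaining rows to training. The sketch below reproduces just the sizes using placeholder arrays of the same length (X_dummy and y_dummy are made-up stand-ins, not the Titanic columns).

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891                                  # same row count as the Titanic frame
X_dummy = np.arange(n_rows).reshape(-1, 1)    # placeholder features (assumption)
y_dummy = np.zeros(n_rows, dtype=int)         # placeholder labels (assumption)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=0
)

# 0.2 * 891 = 178.2, which is rounded up to 179 test rows
expected_test = math.ceil(0.2 * n_rows)
print(len(X_tr), len(X_te), expected_test)
```

Because random_state is fixed, rerunning the split gives the same rows in each set every time.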

In [33]:
X_train.sample(5)

(index),Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
226,2,male,19.0,0,0,10.5,S
733,2,male,23.0,0,0,13.0,S
555,1,male,62.0,0,0,26.55,S
152,3,male,55.5,0,0,8.05,S
26,3,male,,0,0,7.225,C


Imputing missing values ->

In [34]:
from sklearn.impute import SimpleImputer

si_age = SimpleImputer() # by default, SimpleImputer replaces the NaN values in a column with the mean of the values in that column.
si_embarked = SimpleImputer(strategy='most_frequent') # here, we override the default strategy so that NaN values are replaced with the most frequently occurring value in the Embarked column.



In [35]:
X_train_age = si_age.fit_transform(X_train[['Age']])
X_train_embarked = si_embarked.fit_transform(X_train[['Embarked']])
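
To make the two strategies concrete, here is a minimal sketch on a made-up five-row frame. The values and the names toy, mean_imputer and mode_imputer are illustrative assumptions, not taken from the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not from titanic.csv)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0, np.nan],
    "Embarked": pd.Series(["S", "C", np.nan, "S", "Q"], dtype=object),
})

mean_imputer = SimpleImputer()                          # default strategy='mean'
mode_imputer = SimpleImputer(strategy="most_frequent")  # fill with the mode

age_filled = mean_imputer.fit_transform(toy[["Age"]])
embarked_filled = mode_imputer.fit_transform(toy[["Embarked"]])

print(mean_imputer.statistics_)   # mean of the non-missing ages
print(age_filled.ravel())         # NaNs replaced by that mean
print(embarked_filled.ravel())    # NaN replaced by the most frequent port
```

Note that both imputers return 2-D NumPy arrays, which is why the column names are lost after this step.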

In [36]:
X_train_age # the missing age values are now replaced with the mean age value

array([[29.74518389],
       [31.        ],
       [31.        ],
       [20.        ],
       [21.        ],
       [45.5       ],
       [22.        ],
       [29.74518389],
       [29.74518389],
       [26.        ],
       [25.        ],
       [21.        ],
       [31.        ],
       [15.        ],
       [29.74518389],
       [29.74518389],
       [65.        ],
       [29.74518389],
       [ 1.        ],
       [34.        ],
       [49.        ],
       [18.        ],
       [29.74518389],
       [70.        ],
       [14.        ],
       [19.        ],
       [30.        ],
       [31.        ],
       [32.        ],
       [16.        ],
       [50.        ],
       [24.        ],
       [56.        ],
       [ 7.        ],
       [ 9.        ],
       [33.        ],
       [19.        ],
       [32.5       ],
       [ 1.        ],
       [45.        ],
       [29.74518389],
       [19.        ],
       [21.        ],
       [ 4.        ],
       [28.        ],
       [17

In [37]:
X_train_embarked # the missing embarked values are now replaced with 'S' since it was the most frequent one in the column.

array([['C'],
       ['S'],
       ['C'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['Q'],
       ['Q'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['C'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['Q'],
       ['S'],
       ['S'],
       ['S'],
      

In [38]:
# we will do the same thing with test data now ->
X_test_age = si_age.transform(X_test[['Age']])
X_test_embarked = si_embarked.transform(X_test[['Embarked']])
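
Note that the test data goes through transform and not fit_transform: the test rows are filled with the statistics learned from the training rows, so nothing about the test set leaks into the preprocessing. A small sketch with toy numbers (train_age, test_age and imputer are hypothetical names):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_age = np.array([[10.0], [20.0], [np.nan], [30.0]])  # toy training values
test_age = np.array([[np.nan], [100.0]])                   # toy test values

imputer = SimpleImputer()      # strategy='mean'
imputer.fit(train_age)         # learns the mean of 10, 20 and 30
test_filled = imputer.transform(test_age)

print(imputer.statistics_)     # mean learned from the training rows only
print(test_filled.ravel())     # the test NaN is filled with the training mean
```

If we had called fit_transform on the test data instead, the missing value would have been filled with 100.0, the mean of the test column itself.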

In [39]:
X_test_embarked

array([['C'],
       ['S'],
       ['Q'],
       ['C'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['C'],
       ['S'],
       ['S'],
       ['Q'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['C'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['Q'],
       ['C'],
       ['S'],
       ['C'],
       ['C'],
       ['S'],
       ['Q'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
       ['S'],
      

OHE on Sex and Embarked columns ->

In [40]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore', dtype=np.int32) # handle_unknown='ignore' tells the encoder how to handle categories that were not present in the training data. If the encoder meets such a category during transform, it encodes it as all zeros instead of raising an error.
ohe_embarked = OneHotEncoder(handle_unknown='ignore', dtype=np.int32)

# we could have created single ohe object and used it like this for both the columns ->
# X_train_sex = ohe.fit_transform(X_train[['Sex','Embarked']])
# But actually we don't want to perform OHE on the 'Embarked' column here directly, rather we want to do it on X_train_embarked that we created earlier after imputing missing values in the Embarked column.
# So, we are creating a separate object here as ohe_embarked and will use it below as shown
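
The effect of handle_unknown='ignore' is easiest to see on a tiny example. The sketch below uses made-up port codes, where "X" is a deliberately invented category that the encoder never saw during fit.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_ports = np.array([["S"], ["C"], ["Q"], ["S"]], dtype=object)  # toy training column
new_ports = np.array([["C"], ["X"]], dtype=object)                  # "X" is unseen (made up)

encoder = OneHotEncoder(handle_unknown="ignore", dtype=np.int32)
encoder.fit(train_ports)

encoded = encoder.transform(new_ports).toarray()
print(encoder.categories_[0])   # categories learned during fit, in sorted order
print(encoded)                  # the row for the unseen "X" is all zeros
```

With the default handle_unknown='error', the same transform call would raise a ValueError for "X".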

In [41]:
X_train_sex = ohe.fit_transform(X_train[['Sex']])
X_train_embarked = ohe_embarked.fit_transform(X_train_embarked)

In [42]:
X_train_sex

<712x2 sparse matrix of type '<class 'numpy.int32'>'
	with 712 stored elements in Compressed Sparse Row format>

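As shown, the male and female categories have been one-hot encoded into 2 separate columns of 1s and 0s. The result is a sparse matrix with 712 stored elements, i.e. exactly one 1 per row.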

In [43]:
X_train_embarked

<712x3 sparse matrix of type '<class 'numpy.int32'>'
	with 712 stored elements in Compressed Sparse Row format>

In [44]:
# we will do the same thing with test data now ->
X_test_sex = ohe.transform(X_test[['Sex']])
X_test_embarked = ohe_embarked.transform(X_test_embarked)

Now, we want to separate out the remaining columns ->

In [45]:
X_train_rem = X_train.drop(columns=['Sex','Age','Embarked'])
X_test_rem = X_test.drop(columns=['Sex','Age','Embarked'])
X_train_rem

(index),Pclass,SibSp,Parch,Fare
140,3,0,2,15.2458
439,2,0,0,10.5000
817,2,1,1,37.0042
378,3,0,0,4.0125
491,3,0,0,7.2500
...,...,...,...,...
835,1,1,1,83.1583
192,3,1,0,7.8542
629,3,0,0,7.7333
559,3,1,0,17.4000


Now we will finally concatenate the transformed columns with the remaining ones ->

In [46]:
print(X_train_rem.shape)
print(X_train_age.shape)
print(X_train_sex.shape)
print(X_train_embarked.shape)

(712, 4)
(712, 1)
(712, 2)
(712, 3)


In [48]:
print(X_test_rem.shape)
print(X_test_age.shape)
print(X_test_sex.shape)
print(X_test_embarked.shape)

(179, 4)
(179, 1)
(179, 2)
(179, 3)


In [58]:
type(X_train_sex), type(X_train_embarked)

(scipy.sparse._csr.csr_matrix, scipy.sparse._csr.csr_matrix)

In [59]:
type(X_train_rem), type(X_train_age)

(pandas.core.frame.DataFrame, numpy.ndarray)

For the concatenation to work properly, every piece must be a dense array-like object such as an ndarray or a DataFrame. X_train_sex and X_train_embarked are SciPy sparse matrices, so we convert them to dense arrays using the toarray() method.

In [56]:
X_train_transformed = np.concatenate((X_train_rem, X_train_age, X_train_sex.toarray(), X_train_embarked.toarray()), axis=1)
X_test_transformed = np.concatenate((X_test_rem, X_test_age, X_test_sex.toarray(), X_test_embarked.toarray()), axis=1)
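
The same idea on toy data: the sketch below stacks a dense block and a sparse one-hot block column-wise, first through toarray() as above, and then, as an alternative, through scipy.sparse.hstack, which keeps the result sparse. The arrays and variable names here are made up for illustration.

```python
import numpy as np
from scipy import sparse

dense_part = np.array([[3.0, 7.25], [1.0, 71.28]])            # toy numeric columns
sparse_part = sparse.csr_matrix(np.array([[0, 1], [1, 0]]))   # toy one-hot block

# Dense route, as in the cell above: densify first, then concatenate
combined_dense = np.concatenate((dense_part, sparse_part.toarray()), axis=1)

# Sparse route: convert the dense block instead and stack with scipy
combined_sparse = sparse.hstack(
    [sparse.csr_matrix(dense_part), sparse_part]
).tocsr()

print(combined_dense.shape)
print(np.array_equal(combined_dense, combined_sparse.toarray()))
```

For a dataset of this size the dense route is perfectly fine; the sparse route mainly matters when one-hot encoding produces a very large number of columns.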

In [60]:
X_train_transformed.shape

(712, 10)

In [61]:
X_test_transformed.shape

(179, 10)

Now, we will use DecisionTreeClassifier class for modelling ->

In [62]:
from sklearn.tree import DecisionTreeClassifier

In [63]:
clf = DecisionTreeClassifier()
clf.fit(X_train_transformed, y_train)

In [64]:
y_pred = clf.predict(X_test_transformed)
y_pred

array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 1])

This output shows that for each of the 179 rows in the test data we got one prediction, which is either 0 (did not survive) or 1 (survived).

In [65]:
y_pred.shape

(179,)

Let's find the accuracy of this against the actual test outputs (y_test) ->

In [66]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.776536312849162
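
accuracy_score is simply the fraction of predictions that match the true labels. A small sketch with toy labels (y_true_toy and y_pred_toy are made up) shows that computing it by hand gives the same number; the last line checks that the score above is consistent with 139 correct predictions out of 179.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true_toy = np.array([0, 1, 1, 0, 1])   # toy true labels
y_pred_toy = np.array([0, 1, 0, 0, 0])   # toy predictions

manual = np.mean(y_true_toy == y_pred_toy)         # fraction of matching positions
library = accuracy_score(y_true_toy, y_pred_toy)   # same quantity from scikit-learn
print(manual, library)

# The notebook's score corresponds to 139 correct out of 179 test rows
print(abs(139 / 179 - 0.776536312849162) < 1e-12)
```

Since DecisionTreeClassifier() is created without a random_state here, the exact accuracy can vary slightly between runs.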

Now, we need to save the model to a file so that we can use it anywhere we want, for example on a website where users enter values such as Age, Sex, etc., and the model predicts for this new data whether that person would have survived on the Titanic or not.

In [67]:
import pickle

In [69]:
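# write each object inside a with-block so that the file is closed properly; the 'models' directory must already exist
for obj, name in [(ohe, 'ohe_sex'), (ohe_embarked, 'ohe_embarked'), (clf, 'clf')]:
    with open(f'models/{name}.pkl', 'wb') as f:
        pickle.dump(obj, f)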

Why did we export 3 separate models above and not just the DecisionTreeClassifier model that we created?

- Inputs coming from a frontend for columns like Sex and Embarked will be strings, and they must be one-hot encoded exactly as during training before our clf can make a prediction. So the fitted encoders also have to be exported, so that we can apply them to new data before passing it to clf. By the same logic, if new inputs can have a missing Age or Embarked, the fitted imputers (si_age and si_embarked) would have to be exported as well.
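
To see what this means at prediction time, here is a rough, self-contained sketch of the manual inference path. It is not the notebook's actual model: the training rows are made up, the feature set is reduced, encode is a hypothetical helper, and the pickle round trip is done in memory instead of through the files in models/.

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy training data (made-up rows) standing in for the real training set
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": ["male", "female", "male", "female", "female", "male"],
    "Age": [22.0, 38.0, 35.0, 35.0, 27.0, 54.0],
    "Embarked": ["S", "C", "S", "S", "Q", "S"],
})
labels = np.array([0, 1, 0, 1, 1, 0])

def encode(frame, sex_encoder, embarked_encoder):
    """Rebuild the feature matrix in the same column order used for training."""
    return np.concatenate(
        (
            frame[["Pclass", "Age"]].to_numpy(),
            sex_encoder.transform(frame[["Sex"]]).toarray(),
            embarked_encoder.transform(frame[["Embarked"]]).toarray(),
        ),
        axis=1,
    )

ohe_sex = OneHotEncoder(handle_unknown="ignore").fit(train[["Sex"]])
ohe_emb = OneHotEncoder(handle_unknown="ignore").fit(train[["Embarked"]])
model = DecisionTreeClassifier(random_state=0)
model.fit(encode(train, ohe_sex, ohe_emb), labels)

# In-memory round trip; with files this would be pickle.dump / pickle.load
ohe_sex2, ohe_emb2, model2 = (
    pickle.loads(pickle.dumps(obj)) for obj in (ohe_sex, ohe_emb, model)
)

# A new input as it might arrive from a form: raw strings, not encoded
new_passenger = pd.DataFrame(
    {"Pclass": [1], "Sex": ["female"], "Age": [30.0], "Embarked": ["C"]}
)
prediction = model2.predict(encode(new_passenger, ohe_sex2, ohe_emb2))
print(prediction)
```

Every fitted object has to be loaded and applied in the right order, and the columns have to be concatenated in exactly the training order, which is the bookkeeping that makes the manual process so easy to get wrong.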