# Section 1-1 - Filling-in Missing Values

In the previous section, we ended up with a smaller set of predictions because we chose to throw away rows with missing values. We build on this approach in this section by filling in the missing data with an educated guess.

We only describe in detail the concepts that are new to this section.

## Pandas - Extracting data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data.csv')

df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]

## Pandas - Cleaning data

In [2]:
df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)

As in the previous section, we review the data types and non-null counts of each column.

In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Sex          712 non-null    object 
 4   Age          565 non-null    float64
 5   SibSp        712 non-null    int64  
 6   Parch        712 non-null    int64  
 7   Fare         712 non-null    float64
 8   Embarked     711 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 50.2+ KB


There are a number of ways to fill in the NaN values of the column Age. For simplicity, we replace them with the average, or mean, of the non-missing values in that column.

In [4]:
age_mean = df_train['Age'].mean()
df_train['Age'] = df_train['Age'].fillna(age_mean)
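To make the effect of these two lines concrete, here is a minimal sketch on a made-up Series (the values and the names `ages`, `toy_mean` and `ages_filled` are illustrative and not part of the Titanic data). It relies on `Series.mean()` skipping NaN by default, so the mean is computed from the observed values only.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (illustrative values only).
ages = pd.Series([20.0, np.nan, 30.0, 40.0, np.nan])

# mean() ignores NaN by default, so this averages 20, 30 and 40.
toy_mean = ages.mean()

# fillna() returns a new Series with each NaN replaced by the given value.
ages_filled = ages.fillna(toy_mean)

print(toy_mean)
print(ages_filled.tolist())
```

Note that `fillna` does not modify `ages` in place, which is why the notebook assigns the result back to the column.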

**Exercise**

- Write the code to replace the NaN values by the median, instead of the mean.

Taking the average does not make sense for the column Embarked, as it is a categorical variable. Instead, we replace the NaN values with the mode, or most frequently occurring value.

In [5]:
from collections import Counter

Counter(df_train['Embarked'])

Counter({'S': 509, 'C': 138, 'Q': 64, nan: 1})

In [6]:
df_train['Embarked'] = df_train['Embarked'].fillna('S')
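As an alternative to reading the mode off the `Counter` output and hard-coding `'S'`, pandas can compute it directly. The following is a minimal sketch on made-up data (the Series `ports` and its values are assumptions for illustration), using `Series.value_counts()` and `Series.mode()`, both of which exclude NaN by default.

```python
import numpy as np
import pandas as pd

# Toy embarkation ports with one missing entry (illustrative values only).
ports = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'])

# value_counts() excludes NaN by default and sorts by frequency.
print(ports.value_counts())

# mode() also skips NaN. It returns a Series because ties are possible,
# so we take the first entry.
most_common = ports.mode()[0]

# Replace the missing entry with the most frequent value.
ports_filled = ports.fillna(most_common)
print(ports_filled.tolist())
```

Computing the mode in code avoids a stale hard-coded value if the training rows change.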

In [7]:
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})
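One behaviour of `Series.map` with a dictionary is worth keeping in mind: any value without a matching key becomes NaN. The sketch below uses a made-up Series with an unexpected category (`'unknown'` is an assumption for illustration and does not occur in the data set) to show how such values can be detected after encoding.

```python
import pandas as pd

# Toy column with one category that the dictionary does not cover.
sex = pd.Series(['male', 'female', 'female', 'unknown'])

# Each value is looked up in the dictionary; 'unknown' has no entry
# and is therefore mapped to NaN.
encoded = sex.map({'female': 0, 'male': 1})

print(encoded.tolist())

# Counting NaN after the mapping is a quick sanity check.
print(encoded.isna().sum())
```

A non-zero count here would mean new missing values were introduced by the encoding step itself.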

We now review details of our training data.

In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Sex          712 non-null    int64  
 4   Age          712 non-null    float64
 5   SibSp        712 non-null    int64  
 6   Parch        712 non-null    int64  
 7   Fare         712 non-null    float64
 8   Embarked     712 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 50.2 KB


We have thus preserved all the rows of our data set, and proceed to create a numerical array for scikit-learn.

In [9]:
X_train = df_train.iloc[:, 2:].values
y_train = df_train['Survived']

## Scikit-learn - Training the model

In [10]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 100, random_state=0)
model = model.fit(X_train, y_train)

## Scikit-learn - Making predictions

We now review what needs to be cleaned in the test data.

In [11]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179 entries, 712 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  179 non-null    int64  
 1   Survived     179 non-null    int64  
 2   Pclass       179 non-null    int64  
 3   Name         179 non-null    object 
 4   Sex          179 non-null    object 
 5   Age          149 non-null    float64
 6   SibSp        179 non-null    int64  
 7   Parch        179 non-null    int64  
 8   Ticket       179 non-null    object 
 9   Fare         179 non-null    float64
 10  Cabin        42 non-null     object 
 11  Embarked     178 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 16.9+ KB


In [12]:
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

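As with the training data, we fill in the NaN values in the columns Age and Embarked with the mean and mode respectively. Note that both values were computed from the training data, so no information from the test set is used to prepare it.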

In [13]:
df_test['Age'] = df_test['Age'].fillna(age_mean)
df_test['Embarked'] = df_test['Embarked'].fillna('S')
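The same "learn the fill value on the training rows, reuse it on the test rows" pattern can also be expressed with scikit-learn's `SimpleImputer`. This is not what the notebook does; it is a minimal sketch of an alternative, on made-up frames named `train` and `test` (both hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training and test frames (illustrative values only).
train = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})
test = pd.DataFrame({'Age': [np.nan, 10.0]})

imputer = SimpleImputer(strategy='mean')

# fit_transform() learns the column mean from the training rows only.
train_filled = imputer.fit_transform(train)

# transform() reuses that training mean for the test rows.
test_filled = imputer.transform(test)

print(imputer.statistics_)
print(test_filled[:, 0].tolist())
```

For the categorical column, `strategy='most_frequent'` plays the role of the mode.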

In [14]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179 entries, 712 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  179 non-null    int64  
 1   Survived     179 non-null    int64  
 2   Pclass       179 non-null    int64  
 3   Sex          179 non-null    object 
 4   Age          179 non-null    float64
 5   SibSp        179 non-null    int64  
 6   Parch        179 non-null    int64  
 7   Fare         179 non-null    float64
 8   Embarked     179 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 12.7+ KB


In [15]:
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test['Embarked'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})

X_test = df_test.iloc[:, 2:].values
y_test = df_test['Survived']

y_prediction = model.predict(X_test)

## Evaluation

As before, we calculate the model's accuracy:

In [16]:
np.sum(y_prediction == y_test) / float(len(y_test))

0.8156424581005587
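The accuracy computed above is simply the fraction of predictions that match the true labels. As a cross-check, the sketch below compares the manual formula with scikit-learn's `accuracy_score` on made-up labels (the arrays `y_true` and `y_pred` are illustrative and unrelated to the result above).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predictions (illustrative values only): 4 of 5 match.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Manual accuracy: number of matches divided by number of samples.
manual = np.sum(y_pred == y_true) / float(len(y_true))

# Equivalent helper from scikit-learn.
via_sklearn = accuracy_score(y_true, y_pred)

print(manual, via_sklearn)
```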

While this accuracy is slightly lower than with our previous approach, the current approach makes a prediction for every row of the test data.

In [17]:
len(y_test)

179

More importantly, all the training data was used to train our model. By ignoring rows with missing values, we would essentially be throwing away information that could be used.
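The difference between the two approaches can be seen on a small made-up frame (the name `toy` and its values are assumptions for illustration). Dropping rows with a missing Age also discards their perfectly valid Fare values, whereas filling in keeps every row.

```python
import numpy as np
import pandas as pd

# Toy data: Age is missing in two of five rows, Fare is always present.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan, 58.0],
    'Fare': [7.25, 71.28, 8.05, 53.10, 26.55],
})

# Previous approach: drop every row that contains a NaN.
dropped = toy.dropna()

# Current approach: fill the missing ages with the column mean.
filled = toy.fillna({'Age': toy['Age'].mean()})

print(len(dropped), len(filled))
```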