## Handling Missing Values

In [1]:
import numpy as np
import pandas as pd

In [4]:
new_df = pd.DataFrame({'col_a': [1,2,4,1, np.nan, np.nan, 5],
                       'col_b': [3,7, np.nan, 9, None, 5, 8],
                       'col_c': ['a', '?', 'x', 'y', '--', np.nan, 'r'],
                       'col_d': [True, True, np.nan, None, False, True, False]})

new_df

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,?,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,--,False
5,,5.0,,True
6,5.0,8.0,r,False


In [6]:
new_df.to_csv('data_saya.csv', index=False)

`np.nan`, `None` and `NaT` (for `datetime64[ns]` types) are the standard missing-value markers in Pandas.
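
As a quick check, the sketch below passes each of the three markers to `pd.isna()`. The `markers` dictionary is only an illustrative name, not part of the notebook.

```python
import numpy as np
import pandas as pd

# The three standard missing-value markers named above.
markers = {"np.nan": np.nan, "None": None, "pd.NaT": pd.NaT}

# pd.isna() accepts scalars as well as Series and DataFrames.
detected = {name: bool(pd.isna(value)) for name, value in markers.items()}
print(detected)
```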

#### Find Missing Values

Pandas provides the `isna()` and `isnull()` methods to detect missing values. `isnull()` is an alias of `isna()`, so both return the same result.
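
A minimal sketch of that equivalence, using a small made-up frame (`sample` is a hypothetical name):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", None, "c"]})

# Both methods produce the same boolean mask, so equals() reports True.
same_mask = sample.isna().equals(sample.isnull())
print(same_mask)
```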

In [8]:
new_df.shape

(7, 4)

In [9]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_a   5 non-null      float64
 1   col_b   5 non-null      float64
 2   col_c   6 non-null      object 
 3   col_d   5 non-null      object 
dtypes: float64(2), object(2)
memory usage: 352.0+ bytes


In [10]:
new_df.isna()

Unnamed: 0,col_a,col_b,col_c,col_d
0,False,False,False,False
1,False,False,False,False
2,False,True,False,True
3,False,False,False,True
4,True,True,False,False
5,True,False,True,False
6,False,False,False,False


`df.isna().any()` returns one boolean per column: `True` if the column contains at least one missing value.

In [11]:
new_df.isna().any()

col_a    True
col_b    True
col_c    True
col_d    True
dtype: bool

Use `.sum()` to count the missing values in each column:

In [13]:
new_df.isnull().sum()

col_a    2
col_b    2
col_c    1
col_d    2
dtype: int64
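
The same boolean mask can also be summed along rows or over the whole frame. The sketch below rebuilds the example frame so that it runs on its own; `mask`, `per_row`, `total` and `rows_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

new_df = pd.DataFrame({'col_a': [1, 2, 4, 1, np.nan, np.nan, 5],
                       'col_b': [3, 7, np.nan, 9, None, 5, 8],
                       'col_c': ['a', '?', 'x', 'y', '--', np.nan, 'r'],
                       'col_d': [True, True, np.nan, None, False, True, False]})

mask = new_df.isna()
per_row = mask.sum(axis=1)                    # missing values in each row
total = int(mask.sum().sum())                 # missing values in the whole frame
rows_with_missing = new_df[mask.any(axis=1)]  # rows with at least one missing value

print(per_row.tolist(), total)
```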

Missing values can also be encoded as placeholder characters, such as the "?" and "--" entries in `col_c`.
Pandas does not detect these characters as missing values by default.

If we know which characters are used as missing-value placeholders in the dataset, we can pass them to `read_csv()` through the `na_values` parameter when loading the data:

In [15]:
missing_values = ['?', '.', '--', '=']
df = pd.read_csv('data_saya.csv', na_values = missing_values)
df

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [16]:
df.isna().sum()

col_a    2
col_b    2
col_c    3
col_d    2
dtype: int64
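
`na_values` also accepts a dictionary that maps column names to their own placeholder lists, which helps when a character such as "?" is a placeholder in one column but real data in another. A minimal sketch, with an in-memory CSV string (hypothetical data) standing in for a file:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "col_a,col_c\n1,a\n2,?\n3,--\n4,y\n"

# Only col_c treats "?" and "--" as missing; other columns are left alone.
df_demo = pd.read_csv(io.StringIO(csv_text), na_values={"col_c": ["?", "--"]})
print(df_demo.isna().sum())
```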

Another option is to use the `replace()` method to convert these placeholders after the DataFrame has been created:

Original DataFrame:

In [22]:
print(new_df.isna().sum())
new_df

col_a    2
col_b    2
col_c    1
col_d    2
dtype: int64


Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,?,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,--,False
5,,5.0,,True
6,5.0,8.0,r,False


New DataFrame with the placeholder characters replaced by `NaN`:

In [18]:
df2 = new_df.replace({'?': np.nan,
                      '--': np.nan})
df2

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [20]:
df2.isna().sum()

col_a    2
col_b    2
col_c    3
col_d    2
dtype: int64
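
Calling `replace()` on the whole DataFrame converts the placeholders in every column. If they should count as missing in one column only, the call can be limited to that column. A sketch under that assumption (the `answer` column and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"answer": ["?", "yes", "no"],   # here "?" is assumed to be real data
                        "col_c": ["a", "?", "--"]})

# Replace the placeholders in col_c only and assign the result back.
df_demo["col_c"] = df_demo["col_c"].replace(["?", "--"], np.nan)
print(df_demo.isna().sum())
```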

## Dropping Missing Values

In [27]:
df2

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


We can drop rows or columns that contain missing values with the `dropna()` method. The `how` parameter sets the condition:
* `how='any'`: drop the row or column if it contains any missing value
* `how='all'`: drop the row or column only if all of its values are missing

In [28]:
# No row is entirely missing, so nothing is dropped here.
df2.dropna(axis=0, how='all', inplace=True)
df2

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [30]:
df2.dropna(axis=0, how='any', inplace=True)
df2

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
6,5.0,8.0,r,False
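
The cells above drop rows (`axis=0`). The same `how` options apply to columns with `axis=1`. A small sketch on a made-up frame (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"full": [1, 2, 3],
                        "partial": [1.0, np.nan, 3.0],
                        "empty": [np.nan, np.nan, np.nan]})

drop_all = df_demo.dropna(axis=1, how="all")   # drops only "empty"
drop_any = df_demo.dropna(axis=1, how="any")   # drops "partial" and "empty"

print(list(drop_all.columns), list(drop_any.columns))
```

Without `inplace=True`, `dropna()` returns a new DataFrame and leaves `df_demo` unchanged.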


We can use the `thresh` parameter to set a threshold for dropping. `thresh` is the minimum number of non-NA values a row or column must contain to be kept; anything with fewer is dropped.

In [35]:
df3 = df.copy()  # copy, so that later changes to df3 do not also modify df
# df3.isna().sum()
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [38]:
df3.dropna(axis=0,thresh=4)

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
6,5.0,8.0,r,False
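
To see why only rows 0 and 6 survive `thresh=4`, it helps to count the non-missing values per row. The sketch below rebuilds the frame with the placeholders already converted to `NaN`; `non_na`, `kept_4` and `kept_3` are illustrative names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col_a': [1, 2, 4, 1, np.nan, np.nan, 5],
                        'col_b': [3, 7, np.nan, 9, np.nan, 5, 8],
                        'col_c': ['a', np.nan, 'x', 'y', np.nan, np.nan, 'r'],
                        'col_d': [True, True, np.nan, np.nan, False, True, False]})

non_na = df_demo.notna().sum(axis=1)        # non-missing values per row
kept_4 = df_demo.dropna(axis=0, thresh=4)   # keep rows with at least 4 non-missing values
kept_3 = df_demo.dropna(axis=0, thresh=3)   # keep rows with at least 3 non-missing values

print(non_na.tolist())
print(kept_4.index.tolist(), kept_3.index.tolist())
```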


## Replacing Missing Values

The `fillna()` method in Pandas replaces missing values with other values.
Missing values can be replaced by:
1. Special value
2. Aggregate value, such as mean, median, etc

#### Replacing with scalar

Replace all `NaN` values with 0. `fillna()` returns a new DataFrame, so `df3` itself is unchanged:

In [39]:
df_replace = df3.fillna(0)
df_replace

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,0,True
2,4.0,0.0,x,0
3,1.0,9.0,y,0
4,0.0,0.0,0,False
5,0.0,5.0,0,True
6,5.0,8.0,r,False


In [41]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False
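
A single scalar is written into every column, which is why `0` also appears in the text column `col_c` and the boolean column `col_d` above. One alternative is to pass a dictionary with a fill value per column. A sketch with made-up fill values (`'unknown'` is an assumption, not a value from the notebook):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col_a': [1.0, np.nan, 5.0],
                        'col_c': ['a', np.nan, 'r'],
                        'col_d': [True, np.nan, False]})

# Columns not listed in the dictionary (col_d here) keep their missing values.
filled = df_demo.fillna({'col_a': 0, 'col_c': 'unknown'})
print(filled)
```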


Fill the missing values in the first column (position `0`) with the column's `mean()`:

In [42]:
df3.iloc[:, 0] = df3.iloc[:, 0].fillna(df3.iloc[:, 0].mean())
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,2.6,,,False
5,2.6,5.0,,True
6,5.0,8.0,r,False
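
The value 2.6 is the mean of the five non-missing entries of `col_a`: (1 + 2 + 4 + 1 + 5) / 5. To fill several numeric columns with their own means in one step, a Series of means can be passed to `fillna()`. A sketch on a rebuilt copy of the numeric columns (`means` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col_a': [1, 2, 4, 1, np.nan, np.nan, 5],
                        'col_b': [3, 7, np.nan, 9, np.nan, 5, 8],
                        'col_c': ['a', 'b', 'x', 'y', 'z', 'q', 'r']})

# mean() skips NaN by default; numeric_only=True ignores the text column.
means = df_demo.mean(numeric_only=True)

# Each column is filled with the entry of `means` that matches its label.
filled = df_demo.fillna(means)
print(means)
```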


In [46]:
df3['col_b'] = df3['col_b'].fillna(df3['col_b'].mode()[0])
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,3.0,x,
3,1.0,9.0,y,
4,2.6,3.0,,False
5,2.6,5.0,,True
6,5.0,8.0,r,False
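
In the cell above, `col_b` is filled with `3.0` because every non-missing value in that column occurs exactly once. In that case `mode()` returns all of them in sorted order, and `[0]` picks the smallest. A small sketch of this behaviour on the same values:

```python
import numpy as np
import pandas as pd

col_b = pd.Series([3, 7, np.nan, 9, np.nan, 5, 8])

# mode() ignores NaN by default and returns every tied value, sorted.
modes = col_b.mode()
first_mode = modes[0]
print(modes.tolist(), first_mode)
```

When all values are tied like this, the mode carries little information, so the mean or median may be a more sensible fill value.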


In [47]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,3.0,x,
3,1.0,9.0,y,
4,2.6,3.0,,False
5,2.6,5.0,,True
6,5.0,8.0,r,False


Fill each missing value with the last valid value before it by using `ffill` (forward fill):

In [49]:
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
df3['col_c'] = df3['col_c'].ffill()
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,a,True
2,4.0,3.0,x,
3,1.0,9.0,y,
4,2.6,3.0,y,False
5,2.6,5.0,y,True
6,5.0,8.0,r,False


Fill each missing value with the next valid value after it by using `bfill` (backward fill):

In [50]:
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions
df3['col_d'] = df3['col_d'].bfill()
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,a,True
2,4.0,3.0,x,False
3,1.0,9.0,y,False
4,2.6,3.0,y,False
5,2.6,5.0,y,True
6,5.0,8.0,r,False
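
Both fills depend on a neighbouring valid value, so they can leave gaps at the edges: forward fill cannot fill a leading `NaN`, and backward fill cannot fill a trailing one. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 2.0, np.nan])

forward = s.ffill()    # the leading NaN has no earlier value to copy
backward = s.bfill()   # the trailing NaN has no later value to copy

print(forward.tolist())
print(backward.tolist())
```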


## Exercise 7

1. Find how many missing values each column of the Titanic data contains.

In [86]:
titanic_df = pd.read_csv('train.csv')
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [87]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [88]:
titanic_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

**Missing values**
* Age = 177
* Cabin = 687
* Embarked = 2
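
These counts are easier to judge as a share of the 891 rows. The sketch below only redoes the arithmetic from the counts reported above, so it does not need `train.csv`; `missing` and `pct_missing` are illustrative names.

```python
import pandas as pd

total_rows = 891  # from titanic_df.info()
missing = pd.Series({'Age': 177, 'Cabin': 687, 'Embarked': 2})

# Percentage of rows with a missing value in each column.
pct_missing = (missing / total_rows * 100).round(2)
print(pct_missing)
```

On the loaded DataFrame, `titanic_df.isna().mean() * 100` should give the same percentages directly.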

2. Replace the missing values with the following values:
    - Embarked 'S'
    - Age 'mean'
    - Cabin 'mode'

Before replacing, `Embarked` has 2 missing values:

In [89]:
titanic_df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [90]:
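# Assign the result back; in-place fillna on a selected column may not update titanic_df in recent pandas versions
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')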

After replacing:

In [105]:
titanic_df['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

In [95]:
age = titanic_df['Age'].value_counts().sum()
print(f'value counts before replacing = {age}')

value counts before replacing = 714


In [96]:
# Fill missing ages with the mean age and assign the result back
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

In [102]:
age = titanic_df['Age'].value_counts().sum()
print(f'value counts after replacing = {age}')

value counts after replacing = 891


In [98]:
cabin = titanic_df['Cabin'].value_counts().sum()
print(f'value counts before replacing = {cabin}')

value counts before replacing = 204


In [99]:
# Fill missing cabins with the most frequent cabin value and assign the result back
titanic_df['Cabin'] = titanic_df['Cabin'].fillna(titanic_df['Cabin'].mode()[0])

In [101]:
cabin = titanic_df['Cabin'].value_counts().sum()
print(f'value counts after replacing = {cabin}')

value counts after replacing = 891


Check the missing values after replacing:

In [79]:
titanic_df.loc[:,['Embarked','Age','Cabin']].isna().sum()

Embarked    0
Age         0
Cabin       0
dtype: int64