
In [1]:
import numpy as np
import pandas as pd

## Pandas Utility Functions
Just as NumPy has np.isnan and np.isinf, pandas has its own helpers for detecting missing values: pd.isnull and pd.isna. They are aliases of each other, and both treat np.nan and None as missing.

In [2]:
pd.isnull(np.nan)

True

In [3]:
pd.isnull(None)

True

In [4]:
pd.isna(np.nan)

True

In [5]:
pd.isna(None)

True

In [6]:
pd.notnull(None)

False

In [7]:
pd.notnull(np.nan)

False

In [8]:
pd.notna(None)

False

In [9]:
pd.notna(np.nan)

False

In [11]:
ser = pd.Series([1,np.nan,7])
pd.isnull(ser)

0    False
1     True
2    False
dtype: bool

In [12]:
ser = pd.Series([1,np.nan,7])
pd.notnull(ser)

0     True
1    False
2     True
dtype: bool

In [13]:
df = pd.DataFrame({
    'Col A':[1,np.nan,7],
    'Col B':[np.nan,2,3],
    'Col C':[np.nan,2,np.nan]
})
pd.isnull(df)

   Col A  Col B  Col C
0  False   True   True
1   True  False  False
2  False  False   True
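
As a quick sanity check on the functions above, the following sketch shows that `pd.isnull` and `pd.isna` are one function under two names, and how they treat a few other scalars. The sample values and the names `same_function` and `checks` are made up for illustration.

```python
import numpy as np
import pandas as pd

# isnull is an alias of isna, so either spelling can be used.
same_function = pd.isnull is pd.isna

# NaT (a missing datetime) counts as missing; an empty string and 0 do not.
checks = {
    "NaT": pd.isna(pd.NaT),
    "empty string": pd.isna(""),
    "zero": pd.isna(0),
}

print(same_function)
print(checks)
```

The empty-string case is worth remembering: blank text fields are not detected as missing unless they are converted to NaN first.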


## Pandas Operations with Missing Values
Pandas manages missing values more gracefully than NumPy. NaNs no longer behave like "viruses" that contaminate every result: aggregations such as count, sum, and mean skip them by default.

In [14]:
pd.Series([1,2,np.nan]).count()

2

In [15]:
pd.Series([1,2,np.nan]).sum()

3.0

In [16]:
pd.Series([2,2,np.nan]).mean()

2.0
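
To make the contrast with NumPy concrete, here is a minimal sketch comparing the same sum in both libraries. It relies on the `skipna` parameter of the pandas aggregations, which defaults to `True`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan]

# NumPy propagates NaN through a plain sum.
numpy_sum = np.array(values).sum()

# pandas skips NaN by default (skipna=True).
pandas_sum = pd.Series(values).sum()

# With skipna=False, pandas propagates NaN just like NumPy.
strict_sum = pd.Series(values).sum(skipna=False)

print(numpy_sum, pandas_sum, strict_sum)
```

Passing `skipna=False` is useful when a missing value should invalidate the whole result instead of being silently ignored.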

### Filtering Missing Data
As we saw with NumPy, we can combine boolean selection with pd.notnull (or pd.isnull) to filter out NaN and null values.

In [21]:
s = pd.Series([1,2,3,np.nan,np.nan,4])
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [22]:
pd.isnull(s)

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [23]:
pd.notnull(s).sum()

4

In [24]:
pd.isnull(s).sum()

2

In [25]:
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

Both notnull and isnull are also available as methods on Series and DataFrames:

In [26]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [27]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [28]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
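
The same boolean-selection pattern carries over to DataFrames. As a sketch (the small frame and the name `rows_with_a` are assumptions for illustration), a mask built from one column keeps only the rows where that column is present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col A': [1, np.nan, 7],
    'Col B': [np.nan, 2, 3],
})

# Keep the rows where 'Col A' is not null; nulls in other columns are allowed.
rows_with_a = df[df['Col A'].notnull()]

print(rows_with_a)
```

Row 0 survives even though its 'Col B' value is NaN, because the mask only looks at 'Col A'.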

### Dropping null values

Boolean selection + notnull() is a bit verbose and repetitive. As we said before, any repetitive task probably has a better, more DRY way. In this case, we can use the dropna method:


In [29]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [30]:
s.dropna()  # returns a new Series without the NaN values

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
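
Two details of dropna are easy to miss: it returns a new object and leaves the original untouched, and the surviving values keep their original index labels. A minimal sketch, with `cleaned` and `renumbered` as illustrative names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# dropna returns a new Series; s itself still holds its NaNs.
cleaned = s.dropna()
print(len(s), len(cleaned))

# The surviving values keep the labels 0, 1, 2, 5.
# reset_index(drop=True) renumbers them from 0.
renumbered = cleaned.reset_index(drop=True)
print(list(cleaned.index), list(renumbered.index))
```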

### Dropping null values on DataFrames
It is simple to drop null values from a Series. With DataFrames there are a few more things to consider, because you can't drop single values; you can only drop entire rows or columns. Let's start with a sample DataFrame:


In [32]:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [34]:
df.shape #gives (rows, columns)

(4, 4)

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  2 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  3 non-null      float64
 3   Column D  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes


In [36]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [37]:
df.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64
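
Counts are easier to judge as proportions. Since True counts as 1 and False as 0, taking the mean of the boolean mask gives the fraction of missing values per column. This is a small sketch on the same sample DataFrame; `missing_fraction` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# Mean of a boolean mask = share of True values = share of nulls per column.
missing_fraction = df.isnull().mean()

print(missing_fraction)
```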

The default dropna behavior drops every row in which any null value is present. That means if a row has 100 values and just a single NaN, the entire row is deleted, so you can lose a lot of valid data.

In [38]:
df.dropna()

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


In [39]:
df.dropna(axis='columns')  # drops every column that contains at least one null

   Column D
0         5
1         8
2        34
3       110


By default, any row or column that contains at least one null value is dropped, which can be too extreme depending on the case. You can control this behavior with the how parameter, which can be either 'any' (the default) or 'all'.

In [40]:
df2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       NaN       NaN
2      30.0      31.0     100.0


In [42]:
df.dropna(how='all')  # drops only rows where ALL values are null; df has no such row, so nothing is removed

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [44]:
df.dropna(how='any') #default: all rows with any null will be removed

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34
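
The difference between the two settings shows more clearly on df2, which contains a row made up entirely of nulls. The sketch below rebuilds df2 so it runs on its own; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100],
})

# how='all' removes only row 1, where every value is null.
fully_empty_removed = df2.dropna(how='all')

# how='any' (the default) also removes row 0 because of its single NaN.
any_null_removed = df2.dropna(how='any')

print(list(fully_empty_removed.index))
print(list(any_null_removed.index))
```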


Use the thresh parameter to set the minimum number of non-null values a row/column must have to be kept:

In [45]:
df.dropna(thresh=3)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34


In [46]:
df.dropna(thresh=3,axis='columns')

   Column B  Column C  Column D
0       2.0       NaN         5
1       8.0       9.0         8
2      31.0      32.0        34
3       NaN     100.0       110
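
One way to read thresh is as a filter on the number of non-null values per row (or per column). The following sketch reproduces `df.dropna(thresh=3)` by hand to show what is being counted; `non_null_per_row` and `manual` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# Number of non-null values in each row: 3, 3, 4, 2.
non_null_per_row = df.notnull().sum(axis=1)

# Keeping rows with at least 3 non-null values matches thresh=3.
manual = df[non_null_per_row >= 3]

print(non_null_per_row.tolist())
print(manual.equals(df.dropna(thresh=3)))
```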


### Filling null values

Sometimes, instead of dropping the null values, we might need to replace them with some other value. This depends heavily on your context and the dataset you're working with. Sometimes a NaN can be replaced with 0, sometimes with the mean of the sample, and other times with the closest value. Again, it depends on the context.

In [47]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

Filling nulls with an arbitrary value

In [48]:
s.fillna(0)

0    1.0
1    2.0
2    3.0
3    0.0
4    0.0
5    4.0
dtype: float64

In [49]:
s.fillna(s.mean())

0    1.0
1    2.0
2    3.0
3    2.5
4    2.5
5    4.0
dtype: float64

Filling nulls with neighboring values (forward fill takes the previous valid value, backward fill takes the next one)

In [50]:
s.ffill()  # forward fill; same result as s.fillna(method='ffill'), which newer pandas versions deprecate

0    1.0
1    2.0
2    3.0
3    3.0
4    3.0
5    4.0
dtype: float64

In [51]:
s.bfill()  # backward fill; same result as s.fillna(method='bfill'), which newer pandas versions deprecate

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    4.0
dtype: float64

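Forward and backward fill can leave null values at the edges of the Series: a forward fill has nothing to copy into a leading NaN, and a backward fill has nothing to copy into a trailing one.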

In [53]:
pd.Series([np.nan,3,np.nan,9]).ffill()  # the leading NaN has no previous value to copy

0    NaN
1    3.0
2    3.0
3    9.0
dtype: float64

In [54]:
pd.Series([1,np.nan,3,np.nan,np.nan]).bfill()  # the trailing NaNs have no next value to copy

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64
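
One common way to deal with those leftover edge values is to chain both directions, so whatever the forward fill cannot reach is picked up by the backward fill. This is only a sketch; whether copying values across gaps is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

s_edges = pd.Series([np.nan, 3, np.nan, 9, np.nan])

# Forward fill first: the leading NaN is still there afterwards.
forward_only = s_edges.ffill()

# A backward fill on top of that takes care of the leading NaN.
filled = s_edges.ffill().bfill()

print(forward_only.tolist())
print(filled.tolist())
```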

### Filling null values on DataFrames
The fillna method also works on DataFrames, and it works similarly. The main differences are that you can specify the axis (rows or columns) along which values are filled (especially for forward and backward fill) and that you have more control over the values passed, for example a different fill value for each column.

In [55]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [57]:
df.fillna({
    'Column A':0,
    'Column B':99,
    'Column C':df['Column C'].mean()
})

   Column A  Column B  Column C  Column D
0       1.0       2.0      47.0         5
1       0.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3       0.0      99.0     100.0       110


In [58]:
df.ffill(axis=1)  # forward fill along each row, left to right; same result as df.fillna(method='ffill', axis=1)

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       5.0
1       NaN       8.0       9.0       8.0
2      30.0      31.0      32.0      34.0
3       NaN       NaN     100.0     110.0
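
Besides a dictionary, fillna on a DataFrame also accepts a Series indexed by column name. That makes it possible to fill every column with its own mean in one call, as in this sketch on the same sample DataFrame (`column_means` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# df.mean() skips NaN and returns one mean per column.
column_means = df.mean()

# Each column's nulls are replaced with that column's mean.
filled = df.fillna(column_means)

print(filled)
```

Filling with the mean keeps each column's average unchanged but reduces its variance, so it is a convenience, not a neutral choice.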


### Checking if there are NAs

The question is: Does this Series or DataFrame contain any missing value? The answer should be yes or no: True or False.

1. Checking the length
If there are missing values, s.dropna() will have fewer elements than s.

In [59]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [60]:
s.dropna().count()

4

In [65]:
len(s)

6

In [66]:
missing_vals = len(s.dropna()) != len(s)
missing_vals

True

In [68]:
missing_values = s.count() != len(s)
missing_values

True

2. More Pythonic solution: any

The methods any and all check whether at least one value in a Series is True, or whether all of them are. They work the same way as Python's built-in any and all:


In [69]:
pd.Series([True,False,False]).any()
# check if ANY of the entries is TRUE

True

In [70]:
pd.Series([True,False,False]).all()
# check if ALL of the entries are TRUE

False

In [71]:
pd.Series([True, True, True]).all()

True

isnull returns a boolean Series with True wherever there is a NaN

In [72]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [73]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [74]:
pd.Series([1,np.nan]).isnull().any()

True

In [76]:
pd.Series([1,2]).isnull().any()

False

In [77]:
s.isnull().any()

True

The same check can also be run on the underlying NumPy array, obtained with .values:

In [78]:
s.isnull().values

array([False, False, False,  True,  True, False])

In [81]:
s.isnull().values.any()

True
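
For DataFrames the question needs one extra step, because any() reduces column by column and returns one answer per column. The sketch below shows the per-column result and two ways to get a single True or False; it also uses the `hasnans` attribute of a Series as a shortcut. The sample data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column D': [5, 8, 34, 110],
})

# Series shortcut: True if the Series contains at least one NaN.
series_has_missing = s.hasnans

# On a DataFrame, any() answers per column.
missing_by_column = df.isnull().any()

# A single yes/no answer: reduce twice, or go through the NumPy array.
frame_has_missing = df.isnull().any().any()
frame_has_missing_np = df.isnull().values.any()

print(series_has_missing, frame_has_missing, frame_has_missing_np)
print(missing_by_column)
```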