
# Data Cleaning
## Handling Missing Data with Pandas

In [2]:
import numpy as np
import pandas as pd

In [3]:
falsy_values = (0, False, None, '', [], {})

In [4]:
any(falsy_values)

False

In [5]:
np.nan

nan

In [6]:
3 + np.nan

nan

In [8]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])
a

array([ 1.,  2.,  3., nan, nan,  4.])

In [9]:
a.sum()

nan

In [10]:
a.mean()

nan

nan propagates through the calculation instead of stopping it. This is more convenient than a regular None value, which would have raised an exception in the previous examples:

In [11]:
3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
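The same difference shows up inside arrays. As a minimal sketch (the variable names `with_none` and `with_nan` are illustrative and not part of the original notebook), an array holding None falls back to the generic object dtype and fails on reduction, while an array holding nan stays numeric:

```python
import numpy as np

# With None, NumPy cannot pick a numeric dtype and uses object instead.
with_none = np.array([1, 2, None])
# With np.nan, the array stays a regular float64 array.
with_nan = np.array([1, 2, np.nan])

print(with_none.dtype)  # object
print(with_nan.dtype)   # float64

# Reducing the nan array propagates nan instead of raising.
total = with_nan.sum()
print(total)  # nan

# Reducing the object array eventually evaluates `int + None` and raises.
try:
    with_none.sum()
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```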

In [12]:
np.inf

inf

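inf behaves much like nan: it propagates through most arithmetic, so an operation involving inf usually returns inf as well. There are exceptions, such as inf - inf, which returns nan.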
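A quick way to check how inf behaves in arithmetic is to try a few operations directly. This is only a small sketch with illustrative variable names: most results stay infinite, but a few cases give 0 or nan instead.

```python
import numpy as np

plus_one = np.inf + 1          # inf
doubled = np.inf * 2           # inf
negated = -np.inf              # -inf
reciprocal = 1 / np.inf        # 0.0
difference = np.inf - np.inf   # nan
times_zero = np.inf * 0        # nan

print(plus_one, doubled, negated, reciprocal, difference, times_zero)
```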

### Checking for nan or inf

In [13]:
np.isnan(np.nan)

True

In [14]:
np.isinf(np.inf)

True

In [15]:
np.isfinite(np.nan), np.isfinite(np.inf)

(False, False)

In [16]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False,  True, False, False])

In [17]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False, False,  True, False])

In [18]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([ True,  True,  True, False, False,  True])
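Because these checks return boolean arrays, they can also be used to count the problematic values before deciding how to handle them. A minimal sketch, assuming the same example array (the name `values` is illustrative):

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, np.inf, 4])

# True counts as 1 when summed, so .sum() gives the number of matches.
nan_count = int(np.isnan(values).sum())
inf_count = int(np.isinf(values).sum())

# ~np.isfinite flags every value that is either nan or inf.
has_non_finite = bool((~np.isfinite(values)).any())

print(nan_count, inf_count, has_non_finite)  # 1 1 True
```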

## Filtering them out

When we encounter nan or other missing values, we need to filter them out before working on the data.

In [19]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False,  True, False, False])

In [20]:
a[~np.isnan(a)]

array([1., 2., 3., 4.])

In [21]:
a[np.isfinite(a)]

array([1., 2., 3., 4.])

With the nan values filtered out (np.isfinite would also drop any inf values, although `a` contains none), we can perform operations on the remaining data.

In [22]:
a[np.isfinite(a)].sum()

10.0

In [23]:
a[np.isfinite(a)].mean()

2.5
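Manual masking is not the only option. As a hedged sketch of two alternatives the notebook does not show, NumPy provides nan-aware reductions (`np.nansum`, `np.nanmean`), and pandas reductions skip NaN by default:

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# NumPy's nan-aware reductions ignore nan without an explicit mask.
nan_sum = np.nansum(a)
nan_mean = np.nanmean(a)

# pandas reductions skip NaN by default (skipna=True).
s = pd.Series(a)
series_sum = s.sum()
series_mean = s.mean()

# dropna() is the Series counterpart of a[~np.isnan(a)].
cleaned = s.dropna()

print(nan_sum, nan_mean, series_sum, series_mean, len(cleaned))
```

Note that `np.nansum` and `np.nanmean` ignore only nan. If the data may also contain inf, masking with `np.isfinite` as above is still needed.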