# Dealing with NaNs in Python

### Pandas 
- [isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html)
- [notna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html)
- [isnull](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html)
- [notnull](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html)

- [missing-data-na](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na)

## [pandas.isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html)

### Detect missing values for an array-like object.

This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).

#### Parameters
obj : scalar or array-like
Object to check for null or missing values.

#### Returns
bool or array-like of bool
For scalar input, returns a scalar boolean. For array input, returns an array of boolean indicating whether each corresponding element is missing.

#### See also

[notna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html)
Boolean inverse of pandas.isna.

[Series.isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isna.html#pandas.Series.isna)
Detect missing values in a Series.

[DataFrame.isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html#pandas.DataFrame.isna)
Detect missing values in a DataFrame.

[Index.isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.isna.html#pandas.Index.isna)
Detect missing values in an Index.

# Examples

In [3]:
import pandas as pd

In [4]:
pd.isna('dog')

False

In [5]:
pd.isna(pd.NA)

True

In [6]:
import numpy as np

In [7]:
pd.isna(np.nan)

True

ndarrays result in an ndarray of booleans.

In [8]:
array = np.array([[1, np.nan, 3], [4, 5, np.nan]])

In [9]:
array

array([[ 1., nan,  3.],
       [ 4.,  5., nan]])

In [10]:
pd.isna(array)

array([[False,  True, False],
       [False, False,  True]])

For indexes, an ndarray of booleans is returned.

In [11]:
index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None, "2017-07-08"])

In [12]:
index

DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'], dtype='datetime64[ns]', freq=None)

For datetime64[ns] types, NaT represents missing values. This is a pseudo-native sentinel value that can be represented by NumPy in a singular dtype (datetime64[ns]). pandas objects provide compatibility between NaT and NaN.

In [13]:
pd.isna(index)

array([False, False,  True, False])
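The compatibility between NaT and NaN mentioned above can be illustrated with a small sketch. The `dates` Series is made up for illustration, and the exact datetime resolution of the dtype may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# A datetime Series with one missing entry; None is converted to NaT.
dates = pd.Series(pd.to_datetime(["2017-07-05", None, "2017-07-08"]))

print(dates.dtype)            # a datetime64 dtype
print(dates.isna().tolist())  # [False, True, False]

# Assigning NaN into a datetime Series stores it as NaT.
dates.iloc[0] = np.nan
print(dates.isna().tolist())  # [True, True, False]
```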

For Series and DataFrame, the same type is returned, containing booleans.

In [14]:
df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])

In [15]:
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | ant | bee | cat |
| 1 | dog | None | fly |


In [16]:
pd.isna(df)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |


In [17]:
pd.isna(df[1])

0    False
1     True
Name: 1, dtype: bool

# Handling missing values

Missing values and NaNs are common in datasets and need to be dealt with before the data can be put to any use. The upcoming sections look at the sources of missing values, their different types, and how to handle them.

## Sources of missing values
Missing values can enter a dataset during either of the following processes.

### Data extraction
This covers data that is available at the source but was missed during extraction. It relates to engineering tasks such as the following:

- Scraping from a website
- Querying from a database
- Extracting from flat files

Common causes of missing values during extraction include the following:

- Regular expressions that return wrong or non-unique results
- A wrong query
- Data stored in a different type than expected
- Incomplete download
- Incomplete processing
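As a minimal sketch of the first cause, a regular expression that does not match every record silently produces missing values. The strings in `raw` and the pattern are hypothetical and stand in for scraped data.

```python
import pandas as pd

# Hypothetical scraped strings; the last one does not follow the expected pattern.
raw = pd.Series(["price: 120 USD", "price: 95 USD", "price unavailable"])

# str.extract returns a missing value wherever the pattern does not match.
price = raw.str.extract(r"price: (\d+) USD", expand=False)
print(price.isna().tolist())  # [False, False, True]

# Converting to numbers keeps the gap as a missing value.
price_num = pd.to_numeric(price)
print(int(price_num.isna().sum()))  # 1
```

Counting the missing values right after an extraction step like this is a cheap way to notice that a pattern or query needs fixing.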

### Data collection
This entails the data points that are not available or are difficult to collect. Suppose you are surveying 100,000 people for the type of electric car they own. In this case, if we encounter someone who doesn't own an electric car, we would have a missing value for that person's car type.

Missing values originating because of data extraction, in theory, can be rectified if we are able to identify the issue that led to the missing value and rerun the extraction process. Missing values originating from data collection issues are difficult to rectify.

How do you know your data has missing values? The easiest way to find out is to run a summary of the dataset, which includes a count of the non-missing values in each numeric column. Since missing values are not counted, this count is lower for columns that contain them. The following summary of the well-known Titanic dataset illustrates this:



In [19]:
titanic = pd.read_csv('../../Data/titanic.csv')

In [21]:
titanic.describe()

| | pclass | survived | age | sibsp | parch | fare | body |
|---|---|---|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 | 121.0 |
| mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.4135 | 1.041658 | 0.86556 | 51.758668 | 97.696922 |
| min | 1.0 | 0.0 | 0.1667 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 72.0 |
| 50% | 3.0 | 0.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 155.0 |
| 75% | 3.0 | 1.0 | 39.0 | 1.0 | 0.0 | 31.275 | 256.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 328.0 |


The age and body columns, and to a lesser extent fare, have missing values: their counts are lower than those of the other columns.

Taking care of missing values is important because they propagate to the results of numeric operations and can lead to incorrect interpretations of the data. Many numeric computations cannot run on them at all. They may also lead to an incorrect hypothesis if only the non-missing part of the data ends up being used.
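The propagation described above can be seen in a short sketch. The array `values` is made up for illustration; the point is that NumPy propagates NaN unless a NaN-aware function is used, while pandas reductions skip NaN by default.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 3.0])

print(np.nan + 1)         # nan: arithmetic with NaN yields NaN
print(values.sum())       # nan: a plain NumPy sum propagates the NaN
print(np.nansum(values))  # 4.0: the NaN-aware variant ignores it

s = pd.Series(values)
print(s.sum())              # 4.0: pandas skips NaN by default (skipna=True)
print(s.sum(skipna=False))  # nan
print(np.nan == np.nan)     # False: NaN never compares equal, so use pd.isna
```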

Missing values can also be classified by type rather than by origin, which is covered shortly. First, the number of missing values in each column can be counted directly:

In [23]:
titanic.isnull().sum()

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64

## Different types of missing values
The following are different types of missing values:

- **Not a Number (NaN):** NaN is a special floating-point value that pandas uses as the default marker for missing numeric data. It can be created using numpy.nan. Because NaN is a float, an integer column that receives a missing value is converted to float64 by default; the nullable integer data type (Int64) keeps the integers and stores the gap as pandas.NA instead.

- **NA:** NA comes mostly from R, where NA is an identifier for a missing value. pandas offers a similar marker, pandas.NA, which is used by its nullable data types.
- **NaT:** This is the equivalent of NaN for timestamp data points.
- **None:** This is Python's built-in null object. pandas treats it as a missing value in object arrays, that is, for data types other than numeric.
- **Null:** This originates when a function doesn't return a value or if the value is undefined. In Python, such a function returns None.
- **Inf:** Inf is infinity, a value that is greater than any other value; -inf is, correspondingly, smaller than any other value. It is generated by calculations whose results are too large or too small to represent, such as floating-point overflow. Often, we need to treat inf as a missing value. In pandas versions that still provide the option, this can be done by specifying the following:

pandas.options.mode.use_inf_as_na = True
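An alternative that does not rely on the global option (which is deprecated as of pandas 2.1) is to convert the infinities to NaN explicitly. The following is a minimal sketch on a made-up Series `s`, assuming the option above has not been enabled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.inf, -np.inf, 4.0])

# By default, infinities are not treated as missing.
print(s.isna().tolist())  # [False, False, False, False]

# Replace both infinities with NaN so they are handled like other missing values.
cleaned = s.replace([np.inf, -np.inf], np.nan)
print(cleaned.isna().tolist())  # [False, True, True, False]
```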

A placeholder infinity variable can also be generated for comparison purposes, as shown in the following example:

import math
test = math.inf
test > pow(10, 10)  # Check whether inf is larger than 10 to the power 10

It returns True.
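The markers listed in this section can be compared side by side. This is a small sketch rather than an exhaustive reference; the Series names are made up for illustration.

```python
import numpy as np
import pandas as pd

print(type(np.nan))  # <class 'float'>: NaN is a floating-point value

# An integer column with a missing value is upcast to float64 by default.
ints = pd.Series([1, 2, None])
print(ints.dtype)  # float64

# The nullable integer dtype keeps integers and marks the gap with pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)        # Int64
print(nullable[2] is pd.NA)  # True

# pd.isna recognises each of these markers.
markers = [np.nan, None, pd.NaT, pd.NA]
print([bool(pd.isna(m)) for m in markers])  # [True, True, True, True]
```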

## Miscellaneous analysis of missing values
To get a sense of how bad the missing value problem is, you may want to find out the following:

- How many cells in a column have a missing value
- Which cells in a column have a missing value
- How many columns have missing values

These tasks can be performed as follows, where data is a DataFrame such as the titanic dataset loaded above:

```python
pd.isnull(data['body'])   # True for each cell that has a missing value
pd.notnull(data['body'])  # True for each cell that doesn't have a missing value
```

Counting cells that have missing values:

```python
pd.isnull(data['body']).values.ravel().sum()   # total number of missing values
pd.notnull(data['body']).values.ravel().sum()  # total number of non-missing values
```

In [25]:
pd.isnull(titanic['age']).values.ravel().sum()

264

In [26]:
pd.notnull(titanic['age']).values.ravel().sum()

1046
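The remaining questions from the list above, which cells are missing and how many columns are affected, can be answered along these lines. The small DataFrame `df` is a made-up stand-in for the real dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "cabin": [None, "C85", None],
    "pclass": [3, 1, 3],
})

# Which cells in a column are missing: the index labels of the missing rows.
missing_rows = df.index[df["age"].isna()].tolist()
print(missing_rows)  # [1]

# How many columns contain at least one missing value, and which ones.
cols_with_missing = df.isna().any()
print(int(cols_with_missing.sum()))                         # 2
print(cols_with_missing[cols_with_missing].index.tolist())  # ['age', 'cabin']
```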

## Strategies for handling missing values
The following are the major strategies for handling missing values.

### Deletion
This will delete the entire row or column that contains the missing value.

Deletion leads to data loss and is not recommended unless there is no other way out.

Deletion can be performed as follows:

Dropping all the rows where all the cells have missing values:

```python
data.dropna(axis=0, how='all')  # axis=0 means rows are dropped
```

Dropping all the rows where any of the cells have missing values:

```python
data.dropna(axis=0,how='any')
```
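To see how the options differ, here is a sketch on a made-up DataFrame `df`. Besides `how`, it uses the `subset`, `thresh` and `axis=1` arguments of `dropna`, which are not covered in the text above. Note that `dropna` returns a new object and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

print(len(df.dropna(how="all")))     # 3: only the all-NaN row (index 1) is dropped
print(len(df.dropna(how="any")))     # 1: only the complete row (index 3) remains
print(len(df.dropna(subset=["c"])))  # 3: only column "c" is inspected
print(len(df.dropna(thresh=2)))      # 3: keep rows with at least 2 non-missing values
print(df.dropna(axis=1, how="any").shape)  # (4, 0): every column has a gap
```

The last line shows why deletion needs care: dropping columns with any missing value removes the whole table here.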

### Imputation
This replaces the missing value with a value that makes sense for the column.

There are various ways in which imputation can be performed. Some of them are as follows:

Imputing all the missing values in a dataset with 0:
```python
data.fillna(0)
```
Imputing all the missing values with specified text:
```python
data.fillna('text')
```
Imputing only the missing values in the body column with 0:
```python
data['body'].fillna(0)
```
Imputing with a mean of non-missing values:
```python
data['age'].fillna(data['age'].mean())
```
Imputing with a forward fill – this works especially well for time series data. Here, a missing value is replaced with the value in the previous row (period):
```python
# Equivalent to the older fillna(method='ffill'), which is deprecated in recent pandas versions
data['age'].ffill()
```
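The imputation calls above all return a new object rather than modifying `data`, so the result has to be assigned to keep it. The following sketch applies three of them to a made-up DataFrame `df`; the new column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, np.nan],
    "body": [np.nan, 135.0, np.nan, np.nan],
})

# Assign the results; the original columns are left untouched.
df["body_filled"] = df["body"].fillna(0)
df["age_mean"] = df["age"].fillna(df["age"].mean())  # mean of 20 and 40 is 30
df["age_ffill"] = df["age"].ffill()                  # carry the previous value forward

print(df["body_filled"].tolist())  # [0.0, 135.0, 0.0, 0.0]
print(df["age_mean"].tolist())     # [20.0, 30.0, 40.0, 30.0]
print(df["age_ffill"].tolist())    # [20.0, 20.0, 40.0, 40.0]
```

A forward fill cannot fill a gap in the first row, since there is no previous value to carry forward.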
