# 1. Handling Missing Data

**Real-World Data Challenges**

- Real-world datasets are often not clean or homogeneous.
- Missing data is common and may be represented in various ways across different sources.


**Missing Data Conventions**

1. Masking Approach:

- Uses a separate Boolean array, or one bit in the data representation, to indicate missing values.
- Drawback: the extra mask adds storage and computation overhead.

2. Sentinel Approach:

- Uses a specific in-band value (a sentinel) to indicate missing data, such as NaN (Not a Number) for floating-point values.
- Drawback: it reduces the range of valid values and may require extra logic in computations.
- Examples of sentinel values: NaN, None, -9999, or a specific bit pattern.
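
The two conventions above can be compared side by side. This is a minimal sketch that assumes `-9999.0` marks a missing reading in some hypothetical raw data; `np.ma.masked_array` stands in for the masking approach and NaN for the sentinel approach.

```python
import numpy as np

# Raw readings where -9999.0 is a (hypothetical) marker for "missing".
raw = np.array([1.0, 2.0, -9999.0, 4.0])

# Masking approach: a separate Boolean array flags the missing entry.
is_missing = raw == -9999.0
masked = np.ma.masked_array(raw, mask=is_missing)
masked_mean = masked.mean()  # masked entries are skipped

# Sentinel approach: store NaN in-band and use a NaN-aware aggregation.
with_nan = np.where(is_missing, np.nan, raw)
sentinel_mean = np.nanmean(with_nan)

print(masked_mean, sentinel_mean)
```

Both means are computed over the three valid readings only. The mask costs an extra array, while the sentinel needs NaN-aware functions.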


**Pandas Missing Data Handling**

- Pandas is built on NumPy, which has no built-in NA value for non-floating-point types.
- Pandas approach: it uses two sentinels for missing data, the special floating-point value **NaN** and Python's **None** object.
  - This balances practicality and complexity, avoiding the need for a special bit pattern for each data type.


## NaN and None Handling Summary

| **Aspect**                | **None**                                             | **NaN**                                              |
|---------------------------|------------------------------------------------------|------------------------------------------------------|
| **Description**           | Python singleton object for missing data.           | Special floating-point value indicating "Not a Number". |
| **Usage**                 | Used in arrays with `dtype=object`.                  | Used in floating-point arrays.                        |
| **Performance**           | Slower operations due to Python-level handling.      | Faster operations since handled natively by floating-point arrays. |
| **Aggregation**           | Aggregations such as `sum()` raise a `TypeError`.    | Plain aggregations return `NaN`; NaN-aware functions such as `np.nansum()`, `np.nanmin()`, `np.nanmax()` skip it. |
| **Conversion**            | Forces `dtype=object` in NumPy, which slows operations and breaks aggregations. | Stays in a native floating-point array; NaN-aware NumPy functions ignore it in calculations. |
| **Pandas Handling**       | Kept as-is in `object` data; automatically converted to `NaN` in numeric data. | Stored natively as `float64`, with automatic conversion for compatible types. |
| **Type Casting in Pandas**| Integer data is cast to `float64` when `None` is introduced. | Integer data is cast to `float64` when `NaN` is introduced. |
| **Special Aggregations**  | Not applicable.                                      | Use `np.nansum()`, `np.nanmin()`, `np.nanmax()` for NaN-aware aggregations. |

## Conversion Table

| **Typeclass** | **Conversion when storing NAs** | **NA Sentinel Value** |
|---------------|----------------------------------|------------------------|
| Floating      | No change                         | `np.nan`               |
| Object        | No change                         | `None` or `np.nan`     |
| Integer       | Cast to `float64`                 | `np.nan`               |
| Boolean       | Cast to `object`                  | `None` or `np.nan`     |
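
The conversion rules in the table can be checked directly. This is a small sketch using default pandas dtype inference; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

float_s = pd.Series([1.5, 2.5, None])      # floating: stays float64
int_s = pd.Series([1, 2, None])            # integer: cast to float64
bool_s = pd.Series([True, False, None])    # Boolean: cast to object

dtypes = {
    "float": str(float_s.dtype),
    "int": str(int_s.dtype),
    "bool": str(bool_s.dtype),
}
print(dtypes)
```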



# 2. Missing Data in Pandas

## 2.1. None: Pythonic missing data

What is None?

- None is a Python singleton object often used to represent missing data in Python code.
- In NumPy, it can only be stored in arrays with `dtype=object` (arrays of Python objects).



In [2]:
import numpy as np 
import pandas as pd 

data1 = np.array([1, None, 3, 4])
data1

array([1, None, 3, 4], dtype=object)

- NumPy infers `dtype=object` because it is the only type that can hold both the integers and None.
- Arrays with dtype=object perform slower operations compared to arrays with native types.
- Operations on object arrays are done at the Python level, leading to more overhead.

In [36]:
data1+1   # Error

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

In [3]:
data1.sum()  # Error

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Aggregation Issues:

- Performing aggregations like sum() on arrays containing None generally results in errors.
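
The failure can be caught explicitly, and the None entries filtered out before aggregating. This is only a sketch of one possible workaround, not a recommended pattern for large arrays.

```python
import numpy as np

data1 = np.array([1, None, 3, 4])

# sum() adds elements at the Python level, so int + None raises TypeError.
try:
    data1.sum()
    sum_raised = False
except TypeError:
    sum_raised = True

# One workaround: drop the None entries before aggregating.
valid = [v for v in data1 if v is not None]
total = sum(valid)

print(sum_raised, total)
```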

## 2.2 NaN: Missing numerical data

What is NaN?

- NaN stands for Not a Number.
- Special **floating-point** value recognized by IEEE floating-point representation.

In [5]:
data2 = np.array([1,np.nan, 3, 4])
data2

array([ 1., nan,  3.,  4.])

In [6]:
data2.dtype  # Numpy has chosen best dtype float which supports fast array operations.

dtype('float64')

- NaN "infects" any operation it touches; results of arithmetic with NaN are NaN.

In [7]:
1 + np.nan

nan

In [8]:
np.nan * 0

nan

- Aggregations over arrays containing NaN return NaN unless NaN-aware functions are used. Unlike the None case, no error is raised.

In [16]:
data2.sum(), data2.min(), data2.max()

(np.float64(nan), np.float64(nan), np.float64(nan))

In [37]:
data2+1

array([ 2., nan,  4.,  5.])

**Special Aggregations:**

- NumPy provides functions to handle NaN values: np.nansum(), np.nanmin(), np.nanmax().

- **NaN is specifically for floating-point values. No equivalent NaN value exists for integers, strings, or other types.**

In [20]:
np.nansum(data2), np.nanmax(data2), np.nanmin(data2)


(np.float64(8.0), np.float64(4.0), np.float64(1.0))
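
A related detail is how to locate NaN in the first place. NaN never compares equal to anything, including itself, so an equality test cannot find it. The sketch below uses `np.isnan` to build a mask and checks that a manual filter agrees with `np.nanmean`.

```python
import numpy as np

data2 = np.array([1, np.nan, 3, 4])

# NaN is not equal to itself, so `data2 == np.nan` would be all False.
nan_equals_itself = np.nan == np.nan

# np.isnan builds the Boolean mask instead.
nan_mask = np.isnan(data2)

manual_mean = data2[~nan_mask].mean()   # mean of the valid entries only
nan_aware_mean = np.nanmean(data2)      # same result via the NaN-aware function

print(nan_equals_itself, nan_mask, manual_mean, nan_aware_mean)
```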

## 2.3 NaN and None in Pandas

- Pandas handles both NaN and None interchangeably and converts between them when appropriate.

In [21]:
pd.Series([1, np.nan, 2, None ])  # Converted None to nan.

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

- **When NA values are present, Pandas automatically type-casts to accommodate them.**
- Example: Integer arrays with NaN are cast to floating-point types.

- String data is stored with `object` dtype by default in pandas 2.x and earlier; a dedicated `string` dtype is also available.

In [23]:
x = pd.Series(range(2), dtype=int)
x  # dtype is int

0    0
1    1
dtype: int64

In [24]:
x[0] = np.nan

In [25]:
x  # dtype upcast to float because of NaN.

0    NaN
1    1.0
dtype: float64

In [29]:
# Redefine x

x = pd.Series(range(2))
x

0    0
1    1
dtype: int64

In [30]:
x[0] = None

In [31]:
x

0    NaN
1    1.0
dtype: float64

- Here, pandas changed None to NaN and dtype to float.

**In numeric data, Pandas automatically converts None to NaN and upcasts the Series or DataFrame column to float. NaN is a native floating-point value, so it performs better than None stored with an object dtype.**

- Using NaN is more efficient than using None.
- Compare the computation times below.

In [34]:
x = pd.Series([None for _ in range(100000)])
%timeit x+1

780 μs ± 21.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [35]:
x = pd.Series([np.nan for _ in range(100000)])
%timeit x+1

25.5 μs ± 398 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
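
Outside a notebook, where `%timeit` is unavailable, the same comparison can be sketched with the standard-library `timeit` module. The repetition count of 20 is an arbitrary choice, and absolute timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000
none_series = pd.Series([None] * n)    # object dtype
nan_series = pd.Series([np.nan] * n)   # float64 dtype

# Time 20 repetitions of the same arithmetic on each Series.
none_time = timeit.timeit(lambda: none_series + 1, number=20)
nan_time = timeit.timeit(lambda: nan_series + 1, number=20)

print(none_series.dtype, nan_series.dtype)
print(f"object: {none_time:.4f} s, float64: {nan_time:.4f} s")
```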


# 3. Operating on Null Values

Pandas provides several methods for detecting, removing, and replacing null values:

1. isnull(): Generates a Boolean mask indicating missing values.

2. notnull(): Opposite of isnull(), indicates non-missing values.

3. dropna(): Removes missing values from data.
   - how='any': Drops rows/columns with any null value.
   - how='all': Drops rows/columns only if all values are null.
   - thresh: Minimum number of non-null values required to keep a row/column.

4. fillna(): Replaces missing values with specified values.





## 3.1 Detecting null values
1. isnull()
2. notnull()

In [39]:
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [40]:
data.isnull()  # Boolean mask

0    False
1     True
2    False
3     True
dtype: bool

In [41]:
data.notnull() 

0     True
1    False
2     True
3    False
dtype: bool
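
The Boolean masks are mostly useful as inputs to other operations. A short sketch, reusing the same `data` Series, that counts the missing entries and selects the non-missing ones:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

n_missing = data.isnull().sum()    # each True counts as 1
present = data[data.notnull()]     # keep only the non-missing entries

print(n_missing)
print(present)
```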

## 3.2 Dropping null values
1. dropna()

In [42]:
# Series data
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [43]:
data.dropna()

0        1
2    hello
dtype: object

In [51]:
# Dataframe data
df = pd.DataFrame([[1, np.nan, 2], [2,3,4], [10, 4, None]], columns=list('ABC'))
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | NaN | 2.0 |
| 1 | 2 | 3.0 | 4.0 |
| 2 | 10 | 4.0 | NaN |


In [52]:
df.isnull()

|   | A | B | C |
|---|---|---|---|
| 0 | False | True | False |
| 1 | False | False | False |
| 2 | False | False | True |


In [53]:
df.dropna()

|   | A | B | C |
|---|---|---|---|
| 1 | 2 | 3.0 | 4.0 |


In [54]:
df.dropna(axis=1)  # drop every column that contains a null value

|   | A |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 10 |


In [55]:
df.dropna(axis='columns')  # same as axis 1.

|   | A |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 10 |


- For finer control over which rows or columns are dropped, use the `how` and `thresh` parameters.

In [56]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | NaN | 2.0 |
| 1 | 2 | 3.0 | 4.0 |
| 2 | 10 | 4.0 | NaN |


In [58]:
df.dropna(how='all')   # drop if all elements in the row are NaN

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | NaN | 2.0 |
| 1 | 2 | 3.0 | 4.0 |
| 2 | 10 | 4.0 | NaN |


In [59]:
df.dropna(how = 'any')  # Drop if any element in a row is NaN

|   | A | B | C |
|---|---|---|---|
| 1 | 2 | 3.0 | 4.0 |


In [60]:
df.dropna(axis=1, how='any')

|   | A |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 10 |
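
The `thresh` parameter is not exercised above, and `how='all'` has no visible effect on `df` because no row is entirely null. The sketch below uses a hypothetical DataFrame `df2` (not part of the original data) with one all-null row, so that both options do something.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is entirely null, row 3 has one non-null value.
df2 = pd.DataFrame(
    [[1, np.nan, 2],
     [2, 3, 4],
     [np.nan, np.nan, np.nan],
     [10, np.nan, np.nan]],
    columns=list("ABC"),
)

only_all_null_dropped = df2.dropna(how="all")      # drops row 2 only
at_least_two = df2.dropna(thresh=2)                # keeps rows with >= 2 non-null values
at_least_two_cols = df2.dropna(axis=1, thresh=2)   # keeps columns with >= 2 non-null values

print(only_all_null_dropped)
print(at_least_two)
print(at_least_two_cols)
```

Here `thresh=2` keeps rows 0 and 1, and along `axis=1` it keeps columns A and C.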


## 3.3 Filling null values

- Instead of dropping the null values we can replace them.
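1. fillna(value)
   - Replaces each missing value with `value`.
2. ffill(axis=) / bfill(axis=)
   - `ffill`: forward propagation of the last valid value.
   - `bfill`: backward propagation of the next valid value.
   - The older spelling `fillna(method='ffill')` / `fillna(method='bfill')` is deprecated in recent pandas versions.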

In [68]:
# Series data
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [62]:
data.fillna(0)  # fill nan with 0

0        1
1        0
2    hello
3        0
dtype: object

In [66]:
data.ffill()  # forward fill; replaces the deprecated fillna(method='ffill')

0        1
1        1
2    hello
3    hello
dtype: object

In [69]:
# Dataframe data
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | NaN | 2.0 |
| 1 | 2 | 3.0 | 4.0 |
| 2 | 10 | 4.0 | NaN |


In [72]:
df.ffill(axis=1)  # forward propagation along each row; replaces the deprecated fillna(method='ffill', axis=1)

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 2.0 |
| 1 | 2.0 | 3.0 | 4.0 |
| 2 | 10.0 | 4.0 | 4.0 |


In [73]:
df.bfill()  # backward fill down each column; replaces the deprecated fillna(method='bfill')

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 3.0 | 2.0 |
| 1 | 2 | 3.0 | 4.0 |
| 2 | 10 | 4.0 | NaN |


- The last value in column C has no later value to propagate backward, so it remains NaN.
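
Filling is not limited to a single scalar or to propagation. As a sketch, `fillna` also accepts a dict keyed by column name, or a Series such as the column means; the fill values `0` and `-1` below are arbitrary illustrations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 4], [10, 4, None]],
                  columns=list("ABC"))

# Fill each column with its own value via a dict keyed by column name.
by_column = df.fillna({"B": 0, "C": -1})

# Fill each column's gaps with that column's mean (mean() skips NaN).
by_mean = df.fillna(df.mean())

print(by_column)
print(by_mean)
```

Filling with the column mean leaves no nulls behind here, unlike `bfill`, which cannot fill a trailing NaN.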