---
# Cleaning Data and Handling Missing Values 
Missing values are inevitable when working with data sets. In this section, we will explore how to clean data, handle missing values, and cast data types. Casting simply means converting a value from one data type to another.


---

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display


In [2]:
# Helper for printing a horizontal rule. Used only to separate displayed output.
def printhr(s: str = None, n: int = 40):
    """Print a horizontal rule made of "=" characters.

    Args:
        s (str, optional): Header message placed in the middle of the rule. Defaults to None.
        n (int, optional): Number of "=" characters. Defaults to 40.
    """

    if s:
        print("=" * int(n / 2), s, "=" * int(n / 2))
    else:
        print("=" * n)

In [5]:
people = {
    "first": ["Lorem", "John", "Jane", "Foo", np.nan, None, "NA"],
    "last": ["Ipsum", "Doe", "Doe", "Bar", np.nan, np.nan, "Missing"],
    "email": [
        "lorem@yahoo.com",
        "john@gmail.com",
        "jane@outlook.com",
        None,
        np.nan,
        "anonymouse@email.com",
        "NA",
    ],
    "age": ["25", "35", "19", "36", None, None, "Missing"],
}

df = pd.DataFrame(people)
display(df)

Unnamed: 0,first,last,email,age
0,Lorem,Ipsum,lorem@yahoo.com,25
1,John,Doe,john@gmail.com,35
2,Jane,Doe,jane@outlook.com,19
3,Foo,Bar,,36
4,,,,
5,,,anonymouse@email.com,
6,NA,Missing,NA,Missing


---
## Missing Values
Before we start working with missing values, we must first understand what a **missing value** is. A missing value is represented by NaN (Not a Number) or None and marks the absence of data in a row or column of a DataFrame. Missing values can arise for various reasons, such as incomplete data, data corruption, or data entry errors. From my (limited) testing, empty cells and the following strings resolve to NaN when data is loaded into a DataFrame from a file (for example with `pd.read_csv`):

- NULL
- null
- None
- nan
- NaN
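As a quick check of the list above, here is a minimal sketch that reads a small, made-up CSV with `pd.read_csv` and its default settings. The CSV text and the column names `id` and `value` are assumptions for illustration only. `None` is left out because, as far as I know, whether it is parsed as missing depends on the pandas version.

```python
import io

import pandas as pd

# Hypothetical CSV content; the "value" column holds markers from the list above,
# plus one empty cell and one ordinary string for comparison.
csv_text = "id,value\n1,NULL\n2,null\n3,nan\n4,NaN\n5,\n6,hello\n"

parsed = pd.read_csv(io.StringIO(csv_text))

# True where read_csv treated the cell as missing.
is_missing = parsed["value"].isna().tolist()
print(is_missing)
```

Only the last row, which holds an ordinary string, should come back as not missing.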

To check whether a value is NaN (or null), use the pandas function `isnull()` or `isna()`. Each returns `True` if the value is missing and `False` otherwise.

pd.isnull(\<value\>)  
or  
pd.isna(\<value\>)


Note: `isnull()` is just an alias of `isna()`. They are equivalent.

---

In [13]:
# Check value:
display(df)
printhr()

# isna() returns False, signifying that the value is not missing
x = df.loc[2, "first"]
display(x, pd.isna(x))
printhr()

# isna() returns True, signifying that the value (None) is missing
y = df.loc[4, "age"]
display(y, pd.isna(y))

Unnamed: 0,first,last,email,age
0,Lorem,Ipsum,lorem@yahoo.com,25
1,John,Doe,john@gmail.com,35
2,Jane,Doe,jane@outlook.com,19
3,Foo,Bar,,36
4,,,,
5,,,anonymouse@email.com,
6,NA,Missing,NA,Missing




'Jane'

False



None

True
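
The check above looks at one value at a time. As a rough sketch, the same test can be applied to a whole DataFrame with the `isna()` method and then summed per column. The data is rebuilt here so the snippet runs on its own, and the name `missing_per_column` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Same example data as the `people` dictionary earlier in this section.
df = pd.DataFrame(
    {
        "first": ["Lorem", "John", "Jane", "Foo", np.nan, None, "NA"],
        "last": ["Ipsum", "Doe", "Doe", "Bar", np.nan, np.nan, "Missing"],
        "email": ["lorem@yahoo.com", "john@gmail.com", "jane@outlook.com",
                  None, np.nan, "anonymouse@email.com", "NA"],
        "age": ["25", "35", "19", "36", None, None, "Missing"],
    }
)

# isna() on a DataFrame returns a DataFrame of booleans;
# summing it counts the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# The strings "NA" and "Missing" are ordinary values, not missing ones.
print(pd.isna("NA"), pd.isna("Missing"))
```

Note that the placeholder strings in row 6 are not counted, since they are regular strings as far as pandas is concerned.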

---
## Handling Missing Values

There are different ways of handling missing values, and the right choice depends on the type of data and the analysis we want to perform. Some common approaches (this is not a comprehensive list), which are the ones discussed here, are:

1. **Deletion**: remove (drop) the rows or columns that contain missing values.
2. **Imputation**: replace missing values with other values, such as the mean, median, or mode of the non-missing values.
3. **Don't Modify**: missing values can be informative in some cases and can be kept in the data set.
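
To make imputation concrete, here is a minimal sketch that fills missing ages with the mean of the known ages. It assumes the ages arrive as strings, as in the example data, so they are first cast to numbers with `pd.to_numeric`. The variable names are illustrative, and using the mean is only one of the options listed above.

```python
import pandas as pd

# Ages stored as strings, with None and a "Missing" placeholder, as in the example data.
age = pd.Series(["25", "35", "19", "36", None, None, "Missing"])

# Cast to numbers; errors="coerce" turns anything that cannot be parsed into NaN.
age_numeric = pd.to_numeric(age, errors="coerce")

# Impute: replace every NaN with the mean of the non-missing ages.
age_filled = age_numeric.fillna(age_numeric.mean())
print(age_filled.tolist())
```

The mean is computed from the four known ages only, because `mean()` skips NaN by default.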

---


### Deletion
pandas has a `dropna()` method that removes missing (NaN) values. Some of its parameters are:

`axis`: {0 or 'index', 1 or 'columns'}, default 0

Determines whether rows or columns that contain missing values are removed.

- 0 or 'index': drop rows that contain missing values.
- 1 or 'columns': drop columns that contain missing values.

Only a single axis is allowed. Passing a tuple or list to drop on multiple axes is no longer supported.

---

`how`: {'any', 'all'}, default 'any'

Determines whether a row or column is removed when it has at least one NA value or only when all of its values are NA.

- 'any': if any NA values are present, drop that row or column.
- 'all': if all values are NA, drop that row or column.

---

`thresh`: *int*, optional

Require that many non-NA values for a row or column to be kept. Cannot be combined with `how`.

---

`ignore_index`: *bool*, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

---
Parameter descriptions are taken from the pandas [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) on `DataFrame.dropna`.
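
The parameters above can be tried on the example data. This is a small sketch rather than a complete tour: the DataFrame is rebuilt so the snippet runs on its own, and the result names (`rows_all`, `rows_thresh`, `cols_any`, `renumbered`) are illustrative. Renumbering is done with `reset_index(drop=True)` here, on the assumption that `ignore_index` may not be available in older pandas versions.

```python
import numpy as np
import pandas as pd

# Same example data as the `people` dictionary earlier in this section.
df = pd.DataFrame(
    {
        "first": ["Lorem", "John", "Jane", "Foo", np.nan, None, "NA"],
        "last": ["Ipsum", "Doe", "Doe", "Bar", np.nan, np.nan, "Missing"],
        "email": ["lorem@yahoo.com", "john@gmail.com", "jane@outlook.com",
                  None, np.nan, "anonymouse@email.com", "NA"],
        "age": ["25", "35", "19", "36", None, None, "Missing"],
    }
)

# how="all": drop a row only when every value in it is missing (row 4).
rows_all = df.dropna(how="all")

# thresh=3: keep only rows that have at least 3 non-missing values.
rows_thresh = df.dropna(thresh=3)

# axis="columns": drop every column that contains a missing value.
# Row 4 is missing everywhere, so no column survives.
cols_any = df.dropna(axis="columns")

# Renumber the surviving rows from 0 after dropping.
renumbered = df.dropna().reset_index(drop=True)

print(list(rows_all.index))
print(list(rows_thresh.index))
print(cols_any.shape)
print(list(renumbered.index))
```

Row 5 survives `how="all"` because it still has an email address, but it fails `thresh=3` since that is its only non-missing value.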

---

In [16]:
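# dropna() with the default arguments (axis=0, how="any") drops every row
# that contains at least one missing value. Row 6 is kept because "NA" and
# "Missing" are ordinary strings, not missing values.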
display(df)
printhr()

df.dropna()

Unnamed: 0,first,last,email,age
0,Lorem,Ipsum,lorem@yahoo.com,25
1,John,Doe,john@gmail.com,35
2,Jane,Doe,jane@outlook.com,19
3,Foo,Bar,,36
4,,,,
5,,,anonymouse@email.com,
6,NA,Missing,NA,Missing




Unnamed: 0,first,last,email,age
0,Lorem,Ipsum,lorem@yahoo.com,25
1,John,Doe,john@gmail.com,35
2,Jane,Doe,jane@outlook.com,19
6,NA,Missing,NA,Missing


In [14]:
### Reference. For me

# df = pd.DataFrame(people)
# x = df.replace("NA", np.nan)
# y = df.replace("Missing", np.nan)
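
Building on the commented-out notes above, here is a hedged sketch of how the placeholder strings could be converted to NaN so that `dropna()` also removes row 6. It assumes "NA" and "Missing" are the only placeholders in the data. The names `placeholders` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as the `people` dictionary earlier in this section.
df = pd.DataFrame(
    {
        "first": ["Lorem", "John", "Jane", "Foo", np.nan, None, "NA"],
        "last": ["Ipsum", "Doe", "Doe", "Bar", np.nan, np.nan, "Missing"],
        "email": ["lorem@yahoo.com", "john@gmail.com", "jane@outlook.com",
                  None, np.nan, "anonymouse@email.com", "NA"],
        "age": ["25", "35", "19", "36", None, None, "Missing"],
    }
)

# Assumed placeholder strings that should count as missing.
placeholders = ["NA", "Missing"]

# Replace the placeholders with NaN, then drop rows with any missing value.
cleaned = df.replace(placeholders, np.nan).dropna()
print(cleaned)
```

With the placeholders converted, only the three fully populated rows (0, 1, and 2) should remain.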