---
# Cleaning Data and Handling Missing Values 
Missing values are inevitable when working with data sets. In this section, we explore how to clean data, handle missing values, and cast data types. Casting simply means converting a value from one data type to another.


---

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display


In [2]:
# Function for printing a horizontal line. For display purpose
def printhr(s: str = None, n: int = 40):
    """Print a horizontal rule of the character "=" of length n.

    Args:
        s (str, optional): Header message. Defaults to None.
        n (int, optional): Number of characters. Defaults to 40.
    """

    if s:
        print("=" * int(n / 2), s, "=" * int(n / 2))
    else:
        print("=" * n)

In [3]:
people = {
    "first": ["Lorem", "John", "Jane", "Foo", np.nan, None, "NA"],
    "last": ["Ipsum", "Doe", "Doe", "Bar", np.nan, np.nan, "Missing"],
    "email": [
        "lorem@yahoo.com",
        "john@gmail.com",
        "jane@outlook.com",
        None,
        np.nan,
        "anonymouse@email.com",
        "NA",
    ],
    "age": ["25", "35", "19", "36", None, None, "Missing"],
    "score": ["89", "75", "82", "85", np.nan, "83", "Missing"],
}

df = pd.DataFrame(people)
display(df)

Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
3,Foo,Bar,,36,85
4,,,,,
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing


---
## Missing Values
Before we start working with missing values, we must first understand what a **missing value** is. A missing value is represented by NaN (Not a Number) or None and marks the absence of data in a row or column of a DataFrame. Missing values can arise for various reasons, such as incomplete data, data corruption, or data entry errors. From my (limited) testing, empty cells and the following strings resolve to NaN when a file is loaded into a DataFrame (for example with `pd.read_csv`):

- NULL
- null
- None
- nan
- NaN
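
The claim above can be checked with a small sketch. It assumes the data arrives as CSV text and is parsed with `pd.read_csv`; the `csv_text` content and its column names are made up for illustration. `None` is left out because I am not certain every pandas version treats it as missing on load.

```python
import io

import pandas as pd

# Hypothetical CSV text: the "value" column holds an empty cell,
# four of the strings listed above, and one ordinary string.
csv_text = (
    "label,value\n"
    "empty,\n"
    "upper,NULL\n"
    "lower,null\n"
    "nan_lower,nan\n"
    "nan_mixed,NaN\n"
    "text,hello\n"
)

loaded = pd.read_csv(io.StringIO(csv_text))

# True where read_csv turned the cell into a missing value
is_missing = loaded.set_index("label")["value"].isna()
print(is_missing)
```

Note that this conversion happens while parsing a file. Strings such as `"NA"` placed directly into a DataFrame from Python objects stay ordinary strings.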

To check whether a value is NaN (or null), use the pandas function `isnull()` or `isna()`, which returns `True` if the value is missing and `False` otherwise.  

pd.isnull(\<value\>)  
or  
pd.isna(\<value\>)


Note: `isnull()` is just an alias of `isna()`. They are equivalent.

---

In [4]:
# Check value:
display(df)
printhr()

# isna() returns False, signifying that the value is not null
x = df.loc[2, "first"]
display(x, pd.isna(x))
printhr()

y = df.loc[4, "age"]
display(y, pd.isna(y))

Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
3,Foo,Bar,,36,85
4,,,,,
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing




'Jane'

False



None

True

In [5]:
# .isna() works on DataFrames and Series too
x = df.isna()
y = df["age"].isna()
display(x, y)

Unnamed: 0,first,last,email,age,score
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,True,False,False
4,True,True,True,True,True
5,True,True,False,True,False
6,False,False,False,False,False


0    False
1    False
2    False
3    False
4     True
5     True
6    False
Name: age, dtype: bool

---
## Handling Missing Values

There are different ways of handling missing values, and the right one depends on the type of data and the analysis we want to perform. Some common approaches (not a comprehensive list), which are discussed here, are:

1. **Deletion**: we simply remove (aka drop) rows or columns containing missing values.
2. **Imputation**: we replace missing values with other values, such as the mean, median, or mode of the non-missing values.
3. **Don't Modify**: missing values can be informative in some cases and can be kept in the data set.
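
Before picking one of these approaches, it can help to see how much data is actually missing. The following is a minimal sketch, assuming a small made-up frame (`sample`) rather than the `df` used in this section; it counts nulls per column, the share of nulls per column, and the number of rows that are entirely empty.

```python
import numpy as np
import pandas as pd

# Hypothetical data, a cut-down imitation of the frame used in this section
sample = pd.DataFrame(
    {
        "first": ["Lorem", "John", np.nan, None],
        "email": ["lorem@yahoo.com", None, np.nan, "anonymouse@email.com"],
        "score": [89.0, 75.0, np.nan, 83.0],
    }
)

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Fraction of each column that is missing (mean of a boolean column)
missing_share = sample.isna().mean()

# Number of rows in which every column is missing
rows_all_missing = sample.isna().all(axis=1).sum()

print(missing_per_column)
print(missing_share)
print(rows_all_missing)
```

A column that is mostly empty may be a candidate for deletion, while a column with only a few gaps may be better suited to imputation.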

---

---

### Deletion

Deleting or dropping an entry is one of the most common ways of handling missing values.

pandas has a `dropna()` method that removes NaN values. By default, it removes every row that has at least one null value.  

Some parameters of `dropna()` are:  


`axis`: {0 or 'index', 1 or 'columns'}, default 0

Determines whether rows or columns that contain missing values are removed.

- 0, or 'index': Drop rows which contain missing values.
- 1, or 'columns': Drop columns which contain missing values.

Only a single axis is allowed; passing a tuple or list to drop on multiple axes is not supported.

<br>

`how`: {'any', 'all'}, default 'any'

Determines whether a row or column is removed when it has at least one NA or all NA.

- 'any': If any NA values are present, drop that row or column.
- 'all': If all values are NA, drop that row or column.

<br>

`thresh`: *int*, optional

Keep only the rows or columns that have at least this many non-NA values. Cannot be combined with `how`.

<br>

`subset`: *column label* or *sequence of labels*, optional

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

<br>

`ignore_index`: *bool*, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

<br>

**Note:** `dropna()` returns a new DataFrame and leaves the original `df` untouched. To modify the original, set the `inplace` parameter to `True` or assign the result back to `df`.  

<br>

From pandas [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) on DataFrame.dropna.
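
To make the parameters above concrete, here is a minimal sketch on a made-up numeric frame (`demo` is hypothetical and not part of this notebook's data). It uses `reset_index(drop=True)` rather than `ignore_index=True` so that it also runs on pandas versions that predate that parameter; both relabel the result 0, 1, ..., n - 1.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely empty, row 3 has a single value
demo = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [np.nan, np.nan, 6.0, np.nan],
        "c": [7.0, np.nan, 9.0, 10.0],
    }
)

# Default (axis=0, how="any"): drop rows with at least one null
rows_any = demo.dropna()

# how="all": drop only rows in which every value is null
rows_all = demo.dropna(how="all")

# thresh=2: keep rows that have at least two non-null values
rows_thresh = demo.dropna(thresh=2)

# axis="columns" with thresh=3: keep columns with at least three non-null values
cols_thresh = demo.dropna(axis="columns", thresh=3)

# Relabel the surviving rows 0..n-1
relabeled = rows_all.reset_index(drop=True)

print(rows_any.index.tolist(), rows_all.index.tolist(), rows_thresh.index.tolist())
print(cols_thresh.columns.tolist(), relabeled.index.tolist())
```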

---

---
**Say we want to remove entries (rows) that contain at least one null value:**

---

In [6]:
display(df)
printhr()

# By default, .dropna() will drop rows that contain ANY null value
df.dropna()

Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
3,Foo,Bar,,36,85
4,,,,,
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing




Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
6,NA,Missing,NA,Missing,Missing


You might notice that the last row, which looks like it holds missing values, was not dropped. Its entries aren't actual missing values, but the strings "NA" and "Missing".

---
**Say we want to remove entries (rows) that have a null value in a specific column.** We can use the `subset` parameter to specify which columns are checked for nulls.

---

In [7]:
display(df)
printhr()

# Drop rows that have null email
a = df.dropna(subset="email")
display(a)
printhr()

# Drop rows that have null email AND null age
b = df.dropna(how="all", subset=["email", "age"])
display(b)

Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
3,Foo,Bar,,36,85
4,,,,,
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing




Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing




Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
3,Foo,Bar,,36,85
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing


---

### Imputation

Imputation is done by replacing null values with another value. 

pandas has a `fillna()` method that can replace null values with a specified value.
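
As listed under Handling Missing Values, a common fill value is the median of the non-missing values. The sketch below shows that variant on a standalone Series that imitates the `score` column. It uses `pd.to_numeric(..., errors="coerce")`, which is my choice for the illustration and not a step taken elsewhere in this notebook, to turn unparseable strings such as "Missing" into NaN while converting the rest to numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical Series imitating the "score" column of this section
scores = pd.Series(["89", "75", "82", "85", np.nan, "83", "Missing"], dtype=object)

# Convert to numbers; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(scores, errors="coerce")

# median() skips NaN by default, so this is the median of the five known scores
known_median = numeric.median()

# Fill the gaps with that median
filled = numeric.fillna(known_median)

print(known_median)
print(filled.tolist())
```

Filling with the median leaves the median of the column unchanged, whereas filling with a constant such as 0 can shift it.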

---

---
**Say we want to get the median of the scores in the data set:**

---

In [8]:
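# Replace custom null values with proper nulls
df2 = df.replace(["Missing", "NA"], np.nan)

# Set null values to 0 (assign the result back instead of using
# inplace=True on a column selection, which newer pandas warns about)
df2["score"] = df2["score"].fillna(0)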

# Convert score column to a numerical data type
df2["score"] = df2["score"].astype(float)

# Get median
df2["score"].median()

82.0

---
**Steps explained**

---

In [9]:
# First, let us examine the df
display(df)
display(df.dtypes)

Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
3,Foo,Bar,,36,85
4,,,,,
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing


first    object
last     object
email    object
age      object
score    object
dtype: object

---

As we can see, the score column is not numerical (it has the object type; specifically, it holds str values), so we cannot compute its median. To work around this, we replace the null values with zero and then convert the score column to float.

---

In [10]:
display(df)
printhr()

# Replace custom null values with proper nulls
# (Custom missing values are explained further later on)
df2 = df.replace(["Missing", "NA"], np.nan)
display(df2)


Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
3,Foo,Bar,,36,85
4,,,,,
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing




Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25.0,89.0
1,John,Doe,john@gmail.com,35.0,75.0
2,Jane,Doe,jane@outlook.com,19.0,82.0
3,Foo,Bar,,36.0,85.0
4,,,,,
5,,,anonymouse@email.com,,83.0
6,,,,,


In [11]:
display(df2)
printhr()

# Set (impute) null values to 0 (int).
# After this, the score column is a mix of strs and ints,
# but it is still an "object" data type.
df2["score"] = df2["score"].fillna(0)
display(df2)


Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25.0,89.0
1,John,Doe,john@gmail.com,35.0,75.0
2,Jane,Doe,jane@outlook.com,19.0,82.0
3,Foo,Bar,,36.0,85.0
4,,,,,
5,,,anonymouse@email.com,,83.0
6,,,,,




Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25.0,89
1,John,Doe,john@gmail.com,35.0,75
2,Jane,Doe,jane@outlook.com,19.0,82
3,Foo,Bar,,36.0,85
4,,,,,0
5,,,anonymouse@email.com,,83
6,,,,,0


---
We have replaced null values with zeroes. Now we can cast the score column's data type from an object to a float, which will allow us to perform numerical functions/methods on it.  

pandas' `.astype()` is a method that works on Series and DataFrames and converts them into a Series or DataFrame of the specified data type, if able.

Note: `.astype()` does not modify the original Series or DataFrame; it returns a new, cast copy.
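
A minimal sketch of the behaviour described above, using two made-up Series (`raw` and `bad` are hypothetical names). It shows that the original keeps its dtype after the cast, that NaN, being a float itself, passes through a cast to float, and that a leftover custom marker such as "Missing" makes the cast fail.

```python
import numpy as np
import pandas as pd

# Hypothetical object-typed Series: numeric strings plus a proper NaN
raw = pd.Series(["89", "75", np.nan], dtype=object)

# astype() returns a new Series; raw itself is left unchanged
cast = raw.astype(float)
print(raw.dtype, cast.dtype)

# A string that is not a number cannot be cast to float
bad = pd.Series(["89", "Missing"], dtype=object)
try:
    bad.astype(float)
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print("cast failed:", exc)
```

This is why custom markers are replaced with proper nulls before the column is cast.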

---


In [12]:
# Convert score column to a numerical data type
df2["score"] = df2["score"].astype(float)

# Get median
median = df2["score"].median()
display(f"MEDIAN: {median}")

'MEDIAN: 82.0'

---
## Custom Missing Values

In some cases, our data may contain custom missing values, mostly in the form of a string, to represent missing data. To let pandas recognize these null values, we can use the `.replace()` method to replace the custom values with a proper null value (either Python's **None** or numpy's **NaN**).
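
When the data comes from a file, an alternative to `.replace()` is to declare the custom markers while loading. The sketch below assumes CSV input and uses the `na_values` parameter of `pd.read_csv`; the CSV text and the marker "unknown" are invented for the example. This is a different route from the one taken in this notebook, which builds its DataFrame from a Python dictionary.

```python
import io

import pandas as pd

# Hypothetical CSV text with two custom markers for missing scores
csv_text = "first,score\nLorem,89\nJohn,Missing\nJane,unknown\n"

# Treat "Missing" and "unknown" as null, in addition to pandas' defaults
loaded = pd.read_csv(io.StringIO(csv_text), na_values=["Missing", "unknown"])

print(loaded)
print(loaded.dtypes)
```

Because the markers are converted during parsing, the `score` column is read as a numeric column straight away and needs no separate cast.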

---

In [13]:
display(df)
printhr()

# Replace "Missing" and "NA" with a properly recognized null value
df3 = df.replace(["Missing", "NA"], np.nan)
display(df3)

Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
3,Foo,Bar,,36,85
4,,,,,
5,,,anonymouse@email.com,,83
6,NA,Missing,NA,Missing,Missing




Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25.0,89.0
1,John,Doe,john@gmail.com,35.0,75.0
2,Jane,Doe,jane@outlook.com,19.0,82.0
3,Foo,Bar,,36.0,85.0
4,,,,,
5,,,anonymouse@email.com,,83.0
6,,,,,


In [14]:
# We can check that pandas recognizes the new NaN values by dropping them
# and comparing the difference (df2 had the same replacement applied earlier)
display(df.dropna())
printhr()
display(df2.dropna())

Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
6,NA,Missing,NA,Missing,Missing




Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89.0
1,John,Doe,john@gmail.com,35,75.0
2,Jane,Doe,jane@outlook.com,19,82.0


---
## Viewing Dropped Values

To view which entries have been dropped, we can use a filter as follows:

---

In [15]:
clean_df = df.dropna()
display(clean_df)
printhr()

# Create a filter which contains indexes of remaining rows
filt = df.index.isin(clean_df.index)

# Invert the filter, effectively making a filter of dropped rows
filt = ~filt

# Apply filter to view dropped values
df.loc[filt]


Unnamed: 0,first,last,email,age,score
0,Lorem,Ipsum,lorem@yahoo.com,25,89
1,John,Doe,john@gmail.com,35,75
2,Jane,Doe,jane@outlook.com,19,82
6,NA,Missing,NA,Missing,Missing




Unnamed: 0,first,last,email,age,score
3,Foo,Bar,,36.0,85.0
4,,,,,
5,,,anonymouse@email.com,,83.0
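
The same rows can also be selected directly from a null mask, without comparing indexes. This is a minimal sketch on a made-up frame (`sample` is hypothetical); it assumes the rows were dropped with the default `dropna()`, that is, any row with at least one null.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with nulls in rows 1 and 2
sample = pd.DataFrame(
    {
        "first": ["Lorem", np.nan, "Jane"],
        "score": ["89", "83", np.nan],
    }
)

# True for every row that contains at least one null
has_null = sample.isna().any(axis=1)

# Rows that the default dropna() would remove, and rows it would keep
dropped = sample[has_null]
kept = sample[~has_null]

print(dropped)
print(kept)
```

If `dropna()` was called with `subset`, `how`, or `thresh`, the mask would have to be built to match those arguments, so the index-based filter above is the more general option.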
