# Handling Missing Data

Dealing with missing data is an operational necessity in data analysis and machine learning. The presence of missing or null values in a dataset can skew analysis, lead to biased results, and ultimately, create unreliable models. Understanding how to manage these missing values is, therefore, a crucial skill in producing accurate, data-driven solutions.

In this guide, we'll cover handling missing data, focusing on two data types: categorical and numerical. We'll look at each type in turn and at the strategies best suited to filling its missing values.

**Key Techniques**

We'll work primarily with Python's pandas library, using methods including:
- **`fillna()`**: A method that fills missing values with a given value, or with per-column values.
- **`mode()`**: An oft-used method for categorical data. It finds the most frequent category, which is then used to replace missing values.
- Statistical approaches like `mean()` and `median()`: Common ways to deal with missing numerical data by substituting a measure of central tendency.
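
As a quick orientation, the methods listed above can be sketched on two small Series. The data and variable names here (`colors`, `prices`) are hypothetical and are not part of the recipes that follow; this is a minimal sketch rather than a recommended workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the recipes in this guide.
colors = pd.Series(["red", "blue", np.nan, "red"])
prices = pd.Series([10.0, np.nan, 30.0, 100.0])

# Count missing entries before filling anything.
n_missing_colors = int(colors.isna().sum())
n_missing_prices = int(prices.isna().sum())

# Categorical: fill with the most frequent value.
colors_filled = colors.fillna(colors.mode()[0])

# Numerical: fill with the mean or the median (both skip NaN by default).
prices_mean_filled = prices.fillna(prices.mean())
prices_median_filled = prices.fillna(prices.median())

print(colors_filled.tolist())
print(prices_mean_filled.tolist())
print(prices_median_filled.tolist())
```

With a large value such as `100.0` in the column, the mean and the median give noticeably different fill values, which is one reason to choose between them deliberately.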


---

# Setup & References

## Import Libraries

In [16]:
import pandas as pd
import numpy as np

## Functions Used

| Function | Description |
| :--- | :--- |
| [pandas.DataFrame.fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) | To fill NA/NaN values using the specified method. |
| [pandas.DataFrame.median](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html) | To return the median of the values for the requested axis. |
| [pandas.DataFrame.select_dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) | To select columns by dtype or array. |

---
# Code Snippets
---

## Categorical Data

#### Cookbook Recipe
  
1. Initialize a DataFrame containing missing values.
2. Determine which columns to process.
3. Compute the mode (most frequently occurring element).
4. Replace NaN entries with the mode.
   * Note: If multiple modes are found, the first one is used. pandas returns the modes in sorted order, so for strings this is the alphabetically first value.
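
The note about ties can be illustrated with a short sketch. The Series `types` below is hypothetical; it only shows that `mode()` can return more than one value and that index `0` picks the first of them.

```python
import pandas as pd

# Hypothetical data with a tie: "Type1" and "Type2" both appear twice.
types = pd.Series(["Type2", "Type1", "Type2", "Type1", None])

# mode() ignores missing values and returns every tied value, sorted.
modes = types.mode()
print(modes.tolist())

# Taking index 0 therefore selects the first tied value in sorted order.
first_mode = modes[0]
print(first_mode)
```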

In [17]:
# Step 1: Initialize a DataFrame with missing values.
data = {
    'Name': ['John', 'Anna', np.nan, 'Linda', 'John'],
    'Type': ['Type1', 'Type2', 'Type2', np.nan, 'Type1'],
    'Country': ['Country1', np.nan, 'Country2', 'Country1', 'Country2'],
}
data = pd.DataFrame(data)
print("Original DataFrame:")
print(data)

# Step 2: Determine which columns to process. (i.e., 'Name', 'Type', 'Country')
# Step 3: Compute the mode (most frequently occurring element).
# Step 4: Replace NaN entries with the mode.
# Note: If multiple modes are found, the first one is used.
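for column in ['Name', 'Type', 'Country']:
    mode = data[column].mode()[0]  # index 0, in case there is more than one mode
    # Assign the result back. Calling fillna(inplace=True) on a column selection
    # is chained assignment, which recent pandas versions warn about and which
    # does not update the original DataFrame under Copy-on-Write.
    data[column] = data[column].fillna(mode)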

print("\nDataFrame after filling NA values with mode:")
print(data)


Original DataFrame:
    Name   Type   Country
0   John  Type1  Country1
1   Anna  Type2       NaN
2    NaN  Type2  Country2
3  Linda    NaN  Country1
4   John  Type1  Country2

DataFrame after filling NA values with mode:
    Name   Type   Country
0   John  Type1  Country1
1   Anna  Type2  Country1
2   John  Type2  Country2
3  Linda  Type1  Country1
4   John  Type1  Country2
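
One edge case the recipe above does not cover is a column with no non-missing values at all: `mode()` then returns an empty result and indexing it with `[0]` fails. A minimal sketch of a guard follows. The helper `fill_with_mode`, the `fallback` parameter, and the `"Unknown"` placeholder are assumptions made for illustration, not part of the original recipe.

```python
import numpy as np
import pandas as pd


def fill_with_mode(series, fallback="Unknown"):
    """Fill NaN with the first mode; use `fallback` when no mode exists."""
    modes = series.mode()  # missing values are ignored by default
    fill_value = modes.iloc[0] if not modes.empty else fallback
    return series.fillna(fill_value)


frame = pd.DataFrame({
    "Type": ["Type1", "Type2", "Type2", np.nan],
    # Entirely missing column: there is nothing to take a mode from.
    "Region": pd.Series([np.nan] * 4, dtype="object"),
})

for column in frame.columns:
    frame[column] = fill_with_mode(frame[column])

print(frame)
```

Whether a placeholder category is appropriate depends on the analysis. Dropping an entirely empty column may be the better choice in some cases.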


---
## Numerical Data

#### Cookbook Recipe

1. Initialize a DataFrame with missing values.
2. Determine the columns for processing.
3. Compute the median value in each column.
4. Replace missing (NaN) values in each column with the computed median.

In [18]:
# Step 1: Initialize a DataFrame with missing values.
data = {
    'Turnover': [100, 200, np.nan, 400, 500, 600, np.nan, 800],
    'Transactions': [1, 2, 3, np.nan, np.nan, 6, 7, 8]
}
data = pd.DataFrame(data)  # Convert dictionary to pandas DataFrame

print("Original DataFrame:")
print(data)

# Step 2: Determine the columns for processing (i.e., 'Turnover' and 'Transactions'),
# Step 3: Compute the median value in each column,
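# Step 4: Replace missing (NaN) values in each column with that column's computed median.
for column in ['Turnover', 'Transactions']:
    median = data[column].median()
    # Assign the result back instead of using inplace=True on a column selection,
    # which recent pandas versions warn about (chained assignment).
    data[column] = data[column].fillna(median)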

print("\nDataFrame after filling NA values with median:")
print(data)


Original DataFrame:
   Turnover  Transactions
0     100.0           1.0
1     200.0           2.0
2       NaN           3.0
3     400.0           NaN
4     500.0           NaN
5     600.0           6.0
6       NaN           7.0
7     800.0           8.0

DataFrame after filling NA values with median:
   Turnover  Transactions
0     100.0           1.0
1     200.0           2.0
2     450.0           3.0
3     400.0           4.5
4     500.0           4.5
5     600.0           6.0
6     450.0           7.0
7     800.0           8.0
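
The same result can be reached without an explicit loop, because `fillna()` accepts a Series of per-column fill values. The sketch below reuses the values from the recipe above under a hypothetical name (`sales`) and returns a new DataFrame rather than modifying the original.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "Turnover": [100, 200, np.nan, 400, 500, 600, np.nan, 800],
    "Transactions": [1, 2, 3, np.nan, np.nan, 6, 7, 8],
})

# DataFrame.median() returns one value per column (NaN is skipped by default).
medians = sales.median()

# Passing that Series to fillna() fills each column with its own median.
sales_filled = sales.fillna(medians)

print(medians)
print(sales_filled)
```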


---
## Mixed Data - select_dtypes

#### Cookbook Recipe

1. Initialize a DataFrame with both numerical and categorical data, and some missing values.
2. Define two helper functions:
   a. `fill_na_with_mean()` fills missing numerical data with the mean of the corresponding column.
   b. `fill_na_with_mode()` fills missing categorical data with the mode (most frequently occurring value) of the corresponding column.
3. Identify numerical and categorical columns in the DataFrame using pandas `select_dtypes()`.
4. Apply the helper functions:
   a. Apply `fill_na_with_mean()` to clean missing numerical data in the DataFrame.
   b. Apply `fill_na_with_mode()` to clean missing categorical data in the DataFrame.
5. Print out the DataFrame after cleaning, which should have no missing values now.

In [19]:
import pandas as pd
import numpy as np

# Step 1: Initialize a DataFrame with some missing values.
data = {
    'Age': [25, 30, 35, np.nan, 45],
    'City': ['New York', 'Seattle', 'San Francisco', 'Austin', np.nan],
    'Income': [50000, 70000, np.nan, 90000, 100000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)


# Step 2a: Define a function to fill missing numerical data with the mean of corresponding column.
def fill_na_with_mean(df, num_cols):
    mean_values = df[num_cols].mean()
    df[num_cols] = df[num_cols].fillna(mean_values)
    return df


# Step 2b: Define a function to fill missing categorical data with the mode (most frequently occurring value) of corresponding column.
def fill_na_with_mode(df, cat_cols):
    mode_values = df[cat_cols].mode().iloc[0]
    df[cat_cols] = df[cat_cols].fillna(mode_values)
    return df


# Step 3: Identify numerical and categorical columns using pandas `select_dtypes()`.
# Step 4a: Apply the function `fill_na_with_mean()` to clean missing numerical data in the DataFrame.
# Step 4b: Apply the function `fill_na_with_mode()` to clean missing categorical data in the DataFrame.
def clean_data(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is left unchanged
    num_cols = df.select_dtypes(include=np.number).columns
    df = fill_na_with_mean(df, num_cols)
    cat_cols = df.select_dtypes(include='object').columns
    df = fill_na_with_mode(df, cat_cols)
    return df


# Run the full cleaning process on the DataFrame.
cleaned_df = clean_data(df)

# Step 5: Print out the cleaned DataFrame that has no missing values.
print("\nCleaned DataFrame:")
print(cleaned_df)


Original DataFrame:
    Age           City    Income
0  25.0       New York   50000.0
1  30.0        Seattle   70000.0
2  35.0  San Francisco       NaN
3   NaN         Austin   90000.0
4  45.0            NaN  100000.0

Cleaned DataFrame:
     Age           City    Income
0  25.00       New York   50000.0
1  30.00        Seattle   70000.0
2  35.00  San Francisco   77500.0
3  33.75         Austin   90000.0
4  45.00         Austin  100000.0
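
Step 5 expects the cleaned DataFrame to have no missing values. Rather than checking the printout by eye, this can be confirmed with `isna().sum()`. The sketch below is an assumption-laden variant of the recipe: it uses a hypothetical name (`people`), inlines the fills instead of calling the helper functions, and treats every non-numeric column as categorical.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "Age": [25, 30, 35, np.nan, 45],
    "City": ["New York", "Seattle", "San Francisco", "Austin", np.nan],
    "Income": [50000, 70000, np.nan, 90000, 100000],
})

# Count missing values per column before cleaning.
missing_before = people.isna().sum()

# Assumption: any column that is not numeric is treated as categorical.
num_cols = people.select_dtypes(include="number").columns
cat_cols = people.columns.difference(num_cols)

# Fill on a copy so the original DataFrame stays available for comparison.
filled = people.copy()
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())
filled[cat_cols] = filled[cat_cols].fillna(filled[cat_cols].mode().iloc[0])

# Count again after cleaning; every entry should now be zero.
missing_after = filled.isna().sum()
print(missing_before)
print(missing_after)
assert int(missing_after.sum()) == 0
```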


---
