### **Module 3: Handling Missing Data**
In this module, students will explore techniques for detecting, handling, and filling missing data, which is a crucial task in data cleaning.

#### **Topics:**
- **Detecting Missing Data:**
  - Identifying missing values using `isnull()` and `notnull()`.

- **Handling Missing Data:**
  - Dropping missing values using `dropna()`.
  - Filling missing values using `fillna()` with mean, median, or custom values.
  - Interpolation for filling in missing values.
  
- **Forward and Backward Filling:**
  - Filling missing data with adjacent values (forward or backward fill).

#### **Hands-on Lab:**
- Load a dataset with missing values and explore different strategies for handling them, such as dropping or imputing missing data.

---

In data preprocessing, missing data is common and can lead to biases and inaccuracies if not handled properly. Below are real-world **examples and scenarios** for each sub-topic.


### **Detecting Missing Data:**

Detecting missing values is the first step in handling an incomplete dataset. `isnull()` and `notnull()` return a Boolean mask with the same shape as the data, and `info()` reports the non-null count for each column.

1. **Real-World Example 1 - Detecting Missing Data in Customer Feedback:**

In [2]:
import pandas as pd

# Example dataset
feedback_data = pd.DataFrame({
    'Customer_ID': [101, 102, 103, 104],
    'Feedback_Score': [5, None, 4, None],
    'Comments': ['Excellent', None, 'Good', 'Satisfactory']
})

# Detect missing values
print(feedback_data.isnull())
print("\n")
print(f"Number of missing values:\n{feedback_data.isnull().sum()}")

   Customer_ID  Feedback_Score  Comments
0        False           False     False
1        False            True      True
2        False           False     False
3        False            True     False


Number of missing values:
Customer_ID       0
Feedback_Score    2
Comments          1
dtype: int64


**Use Case:** Identifying missing feedback in customer surveys to address incomplete reviews.

2. **Real-World Example 2 - Checking Missing Entries in E-commerce Orders:**

In [3]:
orders = pd.DataFrame({
    'Order_ID': [1001, 1002, 1003],
    'Order_Date': ['2025-01-01', None, '2025-01-03'],
    'Amount': [150, 200, None]
})

print("Missing in 'Order_Date':", orders['Order_Date'].isnull().sum())

Missing in 'Order_Date': 1


**Use Case:** Detecting incomplete order records to ensure data consistency in sales reports.

3. **Real-World Example 3 - Checking Completeness of Financial Reports:**

In [4]:
financial_data = pd.DataFrame({
    'Year': [2023, 2024, 2025],
    'Revenue': [1000000, None, 1200000]
})

print(financial_data.notnull())

   Year  Revenue
0  True     True
1  True    False
2  True     True


**Use Case:** Identifying missing revenue data for financial forecasting.
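
The introduction to this section mentions `info()`, which the examples above do not exercise. The following is a minimal sketch that reuses the customer feedback data from the first example. The names `missing_share` and `rows_with_gaps` are illustrative and not part of the original lab.

```python
import pandas as pd

feedback = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104],
    "Feedback_Score": [5, None, 4, None],
    "Comments": ["Excellent", None, "Good", "Satisfactory"],
})

# Print a per-column summary that includes the non-null count and dtype
feedback.info()

# Fraction of missing values per column (the mean of a Boolean mask)
missing_share = feedback.isnull().mean()
print(missing_share)

# Rows that have at least one missing value in any column
rows_with_gaps = feedback[feedback.isnull().any(axis=1)]
print(rows_with_gaps)
```

Looking at the share of missing values per column, rather than only the raw count, can make it easier to decide later whether a column is worth imputing or should be dropped.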

### **Handling Missing Data:**

There are several ways to handle missing data, such as dropping rows/columns, filling with mean/median/custom values, and interpolating missing values.

### **1. Dropping Missing Values (`dropna()`)**

1. **Real-World Example 1 - Dropping Incomplete Customer Entries:**

In [5]:
customer_data = pd.DataFrame({
    'Customer_ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', None],
    'Age': [25, 30, None]
})

# Drop rows with any missing value
cleaned_data = customer_data.dropna()
print(cleaned_data)

   Customer_ID   Name   Age
0            1  Alice  25.0
1            2    Bob  30.0


**Use Case:** Dropping rows with missing names in loyalty programs where identity is crucial.

2. **Real-World Example 2 - Dropping Columns That Contain Missing Values:**

In [22]:
product_data = pd.DataFrame({
    'Product_ID': [1, 2, 3],
    'Category': [None, 'Electronics', 'Clothing'],
    'Price': [None, None, 50]
})

# Drop every column that contains at least one missing value.
# (how='all' would instead drop only columns that are entirely missing.)
cleaned_data = product_data.dropna(axis=1, how='any')
print(cleaned_data)

   Product_ID
0           1
1           2
2           3


**Use Case:** Dropping irrelevant or missing fields from outdated product catalogs.
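
`dropna()` also accepts `how='all'`, `thresh`, and `subset`, which give finer control than dropping on any missing value. The sketch below applies them to the same product data; the result variable names are illustrative.

```python
import pandas as pd

product_data = pd.DataFrame({
    "Product_ID": [1, 2, 3],
    "Category": [None, "Electronics", "Clothing"],
    "Price": [None, None, 50],
})

# how='all' drops a column only if every value in it is missing (none are here)
all_missing_dropped = product_data.dropna(axis=1, how="all")
print(list(all_missing_dropped.columns))

# thresh=2 keeps columns with at least 2 non-missing values, so 'Price' is dropped
mostly_complete = product_data.dropna(axis=1, thresh=2)
print(list(mostly_complete.columns))

# subset restricts the check to the named columns when dropping rows
rows_with_category = product_data.dropna(subset=["Category"])
print(rows_with_category)
```

Using `thresh` is one way to express "too many missing values" as an explicit rule; the right cutoff depends on the dataset.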

### **2. Filling Missing Values (`fillna()`)**

1. **Real-World Example 1 - Filling Missing Sales Data with Mean:**

In [8]:
sales_data = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Sales': [10000, None, 15000]
})

# Fill missing sales data with the mean value
sales_data['Sales'] = sales_data['Sales'].fillna(sales_data['Sales'].mean())
print(sales_data)


  Month    Sales
0   Jan  10000.0
1   Feb  12500.0
2   Mar  15000.0


**Use Case:** Imputing sales data for months with missing records in reports.

2. **Real-World Example 2 - Filling Missing Temperature Data with Median:**

In [23]:
import pandas as pd

weather_data = pd.DataFrame({
    'Day': [1, 2, 3],
    'Temperature': [30.5, None, 29.0]
})

# Fill missing temperatures with the column median and assign the result back
weather_data['Temperature'] = weather_data['Temperature'].fillna(weather_data['Temperature'].median())
print(weather_data)

   Day  Temperature
0    1        30.50
1    2        29.75
2    3        29.00


**Use Case:** Handling missing temperature data in weather prediction systems.

3. **Real-World Example 3 - Filling Custom Values for Missing Shipping Information:**

In [24]:
import pandas as pd

shipping_data = pd.DataFrame({
    'Order_ID': [1, 2, 3],
    'Shipping_Status': ['Delivered', None, None]
})

# Replace missing statuses with the custom value 'Pending' and assign the result back
shipping_data['Shipping_Status'] = shipping_data['Shipping_Status'].fillna('Pending')
print(shipping_data)

   Order_ID Shipping_Status
0         1       Delivered
1         2         Pending
2         3         Pending


**Use Case:** Updating missing shipping statuses to "Pending" in logistics tracking.
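
When several columns need different fill values, `fillna()` can take a dictionary that maps column names to values. The following sketch uses a hypothetical `orders` table (not from the original lab) that combines a numeric and a text column.

```python
import pandas as pd

# Hypothetical data for illustration only
orders = pd.DataFrame({
    "Order_ID": [1, 2, 3, 4],
    "Amount": [100.0, None, 300.0, None],
    "Shipping_Status": ["Delivered", None, "Delivered", None],
})

# Fill each column with its own value: the median for Amount, a label for status
filled = orders.fillna({
    "Amount": orders["Amount"].median(),
    "Shipping_Status": "Pending",
})
print(filled)
```

Columns that are not listed in the dictionary are left unchanged, so this form avoids accidentally filling a column with a value of the wrong type.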

### **3. Interpolation for Filling Missing Values:**

1. **Real-World Example - Interpolating Time Series Data:**

In [13]:
stock_prices = pd.DataFrame({
    'Date': ['2025-01-01', '2025-01-02', '2025-01-03'],
    'Price': [100, None, 105]
})

# Interpolate linearly (the default method), treating rows as equally spaced
stock_prices['Price'] = stock_prices['Price'].interpolate()
print(stock_prices)

         Date  Price
0  2025-01-01  100.0
1  2025-01-02  102.5
2  2025-01-03  105.0


**Use Case:** Filling missing stock prices in a dataset for market analysis.
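
The default linear method ignores the values in a date column and treats rows as equally spaced. When observations are unevenly spaced in time, `interpolate(method='time')` on a Series with a `DatetimeIndex` weights the estimate by the actual gaps. The sketch below uses made-up prices and dates to show the difference.

```python
import pandas as pd

# Hypothetical prices with an uneven gap: 1 day, then 3 days
prices = pd.Series(
    [100.0, None, 108.0],
    index=pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-05"]),
)

# Default linear interpolation: midpoint of the neighbours, regardless of dates
linear = prices.interpolate()

# Time-aware interpolation: 1 of the 4 elapsed days, so a quarter of the way up
by_time = prices.interpolate(method="time")

print(linear)
print(by_time)
```

Here the linear method gives 104.0 for 2 January, while the time-aware method gives 102.0, because only one of the four days between the known prices has elapsed.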

### **Forward and Backward Filling:**

1. **Real-World Example 1 - Forward Fill for Filling Temperature Readings:**

In [17]:
temp_readings = pd.DataFrame({
    'Time': ['9:00 AM', '10:00 AM', '11:00 AM'],
    'Temperature': [22.5, None, None]
})

# Forward fill: copy the last valid reading into each following gap
temp_readings['Temperature'] = temp_readings['Temperature'].ffill()
print(temp_readings)

       Time  Temperature
0   9:00 AM         22.5
1  10:00 AM         22.5
2  11:00 AM         22.5


**Use Case:** Forward filling sensor data when real-time readings are missing.

2. **Real-World Example 2 - Backward Fill for E-commerce Prices:**

In [21]:
prices = pd.DataFrame({
    'Product_ID': [1, 2, 3],
    'Price': [None, 200, 250]
})

# Backward fill: copy the next valid price into each preceding gap
prices['Price'] = prices['Price'].bfill()
print(prices)


   Product_ID  Price
0           1  200.0
1           2  200.0
2           3  250.0
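
Forward filling can silently stretch one stale value over a long gap, and it cannot fill a gap at the very start of a column. A minimal sketch, using hypothetical sensor readings, of capping the fill with `limit` and of chaining `ffill()` with `bfill()`:

```python
import pandas as pd

# Hypothetical readings with a leading gap and a three-value gap in the middle
readings = pd.Series([None, 22.5, None, None, None, 24.0])

# limit=1 fills at most one consecutive missing value after each valid reading
capped = readings.ffill(limit=1)
print(capped)

# Forward fill first, then backward fill to cover the leading gap
combined = readings.ffill().bfill()
print(combined)
```

With `limit=1`, the leading gap and the later part of the long gap stay missing, which keeps them visible for a separate decision instead of hiding them behind a repeated value.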
