### **Module 3: Handling Missing Data**
In this module, students will explore techniques for detecting, handling, and filling missing data, which is a crucial task in data cleaning.

#### **Topics:**
- **Detecting Missing Data:**
  - Identifying missing values using `isnull()` and `notnull()`.

- **Handling Missing Data:**
  - Dropping missing values using `dropna()`.
  - Filling missing values using `fillna()` with mean, median, or custom values.
  - Interpolation for filling in missing values.
  
- **Forward and Backward Filling:**
  - Filling missing data with adjacent values (forward or backward fill).

#### **Hands-on Lab:**
- Load a dataset with missing values and explore different strategies for handling them, such as dropping or imputing missing data.

---

In data preprocessing, missing data is common and can lead to biases and inaccuracies if not handled properly. Below are real-world **examples and scenarios** for each sub-topic.


### **Detecting Missing Data:**

Detecting missing values is the first step in handling incomplete datasets. You can use `isnull()`, `notnull()`, and `info()` to identify missing data.

1. **Real-World Example 1 - Detecting Missing Data in Customer Feedback:**

In [1]:
import pandas as pd

# Example dataset
feedback_data = pd.DataFrame({
    'Customer_ID': [101, 102, 103, 104],
    'Feedback_Score': [5, None, 4, None],
    'Comments': ['Excellent', None, 'Good', 'Satisfactory']
})

# Detect missing values
print(feedback_data.isnull())
print("\n")
print(f"Number of missing values:\n{feedback_data.isnull().sum()}")

   Customer_ID  Feedback_Score  Comments
0        False           False     False
1        False            True      True
2        False           False     False
3        False            True     False


Number of missing values:
Customer_ID       0
Feedback_Score    2
Comments          1
dtype: int64


**Use Case:** Identifying missing feedback in customer surveys to address incomplete reviews.

2. **Real-World Example 2 - Checking Missing Entries in E-commerce Orders:**

In [2]:
orders = pd.DataFrame({
    'Order_ID': [1001, 1002, 1003],
    'Order_Date': ['2025-01-01', None, '2025-01-03'],
    'Amount': [150, 200, None]
})

print("Missing in 'Order_Date':", orders['Order_Date'].isnull().sum())

Missing in 'Order_Date': 1


**Use Case:** Detecting incomplete order records to ensure data consistency in sales reports.

3. **Real-World Example 3 - Checking Completeness of Financial Reports:**

In [3]:
financial_data = pd.DataFrame({
    'Year': [2023, 2024, 2025],
    'Revenue': [1000000, None, 1200000]
})

print(financial_data.notnull())

   Year  Revenue
0  True     True
1  True    False
2  True     True


**Use Case:** Identifying missing revenue data for financial forecasting.
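The `info()` method mentioned at the start of this section is not shown in the examples above. The following is a minimal sketch of how it might be combined with `isnull()` to summarize missingness; the `feedback` DataFrame simply repeats the customer feedback data, and the names `missing_pct` and `incomplete_rows` are illustrative.

```python
import pandas as pd

feedback = pd.DataFrame({
    'Customer_ID': [101, 102, 103, 104],
    'Feedback_Score': [5, None, 4, None],
    'Comments': ['Excellent', None, 'Good', 'Satisfactory']
})

# info() prints the non-null count and dtype of every column
feedback.info()

# Share of missing values per column, as a percentage
missing_pct = feedback.isnull().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
incomplete_rows = feedback[feedback.isnull().any(axis=1)]
print(incomplete_rows)
```

A percentage view is often easier to act on than raw counts when columns differ greatly in length or importance.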

### **Handling Missing Data:**

There are several ways to handle missing data, such as dropping rows/columns, filling with mean/median/custom values, and interpolating missing values.

### **1. Dropping Missing Values (`dropna()`)**

1. **Real-World Example 1 - Dropping Incomplete Customer Entries:**

In [4]:
customer_data = pd.DataFrame({
    'Customer_ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', None],
    'Age': [25, 30, None]
})

# Drop rows with any missing value
cleaned_data = customer_data.dropna()
print(cleaned_data)

   Customer_ID   Name   Age
0            1  Alice  25.0
1            2    Bob  30.0


**Use Case:** Dropping rows with missing names in loyalty programs where identity is crucial.

2. **Real-World Example 2 - Dropping Columns with Too Many Missing Values:**

In [5]:
product_data = pd.DataFrame({
    'Product_ID': [1, 2, 3],
    'Category': [None, 'Electronics', 'Clothing'],
    'Price': [None, None, 50]
})

# Drop columns that contain any missing value
cleaned_data = product_data.dropna(axis=1, how='any')
print(cleaned_data)

   Product_ID
0           1
1           2
2           3


**Use Case:** Dropping irrelevant or missing fields from outdated product catalogs.
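`dropna()` also accepts `how='all'`, `thresh`, and `subset`, which give finer control than dropping on any missing value. The sketch below reuses the product data from this example; the variable names are illustrative, and the threshold of 2 is an arbitrary choice for demonstration.

```python
import pandas as pd

product_data = pd.DataFrame({
    'Product_ID': [1, 2, 3],
    'Category': [None, 'Electronics', 'Clothing'],
    'Price': [None, None, 50]
})

# how='all' drops only columns that are entirely missing (none are here)
all_missing_dropped = product_data.dropna(axis=1, how='all')

# thresh=2 keeps columns that have at least 2 non-missing values
mostly_complete = product_data.dropna(axis=1, thresh=2)

# subset restricts the check to chosen columns when dropping rows
has_category = product_data.dropna(subset=['Category'])

print(all_missing_dropped.columns.tolist())
print(mostly_complete.columns.tolist())
print(has_category)
```

Using `thresh` is one way to express "too many missing values" without discarding a column over a single gap.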

### **2. Filling Missing Values (`fillna()`)**

1. **Real-World Example 1 - Filling Missing Sales Data with Mean:**

In [6]:
sales_data = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Sales': [10000, None, 15000]
})

# Fill missing sales data with the mean value
sales_data['Sales'] = sales_data['Sales'].fillna(sales_data['Sales'].mean())
print(sales_data)


  Month    Sales
0   Jan  10000.0
1   Feb  12500.0
2   Mar  15000.0


**Use Case:** Imputing sales data for months with missing records in reports.

2. **Real-World Example 2 - Filling Missing Temperature Data with Median:**

In [7]:
import pandas as pd

weather_data = pd.DataFrame({
    'Day': [1, 2, 3],
    'Temperature': [30.5, None, 29.0]
})

# Fill missing temperatures with the column median and reassign the column
weather_data['Temperature'] = weather_data['Temperature'].fillna(weather_data['Temperature'].median())
print(weather_data)

   Day  Temperature
0    1        30.50
1    2        29.75
2    3        29.00


**Use Case:** Handling missing temperature data in weather prediction systems.

3. **Real-World Example 3 - Filling Custom Values for Missing Shipping Information:**

In [8]:
import pandas as pd

shipping_data = pd.DataFrame({
    'Order_ID': [1, 2, 3],
    'Shipping_Status': ['Delivered', None, None]
})

# Fill missing shipping statuses with 'Pending' and reassign the column
shipping_data['Shipping_Status'] = shipping_data['Shipping_Status'].fillna('Pending')
print(shipping_data)

   Order_ID Shipping_Status
0         1       Delivered
1         2         Pending
2         3         Pending


**Use Case:** Updating missing shipping statuses to "Pending" in logistics tracking.
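When several columns need different fill values, `fillna()` can take a dictionary that maps column names to values. This is a small sketch under assumed data: the `orders` DataFrame and its values are hypothetical and only combine the ideas from the examples above.

```python
import pandas as pd

# Hypothetical order data with gaps in two columns
orders = pd.DataFrame({
    'Order_ID': [1001, 1002, 1003],
    'Amount': [150.0, None, 300.0],
    'Shipping_Status': ['Delivered', None, None]
})

# Each column gets its own fill value; fillna returns a new DataFrame
filled = orders.fillna({
    'Amount': orders['Amount'].median(),
    'Shipping_Status': 'Pending'
})
print(filled)
```

Because `fillna()` returns a new object here, the original `orders` DataFrame keeps its missing values until the result is assigned back.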

### **3. Interpolation for Filling Missing Values:**

1. **Real-World Example - Interpolating Time Series Data:**

In [9]:
stock_prices = pd.DataFrame({
    'Date': ['2025-01-01', '2025-01-02', '2025-01-03'],
    'Price': [100, None, 105]
})

# Interpolate missing values linearly
stock_prices['Price'] = stock_prices['Price'].interpolate()
print(stock_prices)

         Date  Price
0  2025-01-01  100.0
1  2025-01-02  102.5
2  2025-01-03  105.0


**Use Case:** Filling missing stock prices in a dataset for market analysis.

### **Forward and Backward Filling:**

1. **Real-World Example 1 - Forward Fill for Filling Temperature Readings:**

In [10]:
temp_readings = pd.DataFrame({
    'Time': ['9:00 AM', '10:00 AM', '11:00 AM'],
    'Temperature': [22.5, None, None]
})

# Forward fill
temp_readings['Temperature'] = temp_readings['Temperature'].ffill()
print(temp_readings)

       Time  Temperature
0   9:00 AM         22.5
1  10:00 AM         22.5
2  11:00 AM         22.5


**Use Case:** Forward filling sensor data when real-time readings are missing.

2. **Real-World Example 2 - Backward Fill for E-commerce Prices:**

In [11]:
prices = pd.DataFrame({
    'Product_ID': [1, 2, 3],
    'Price': [None, 200, 250]
})

# Backward fill
prices['Price'] = prices['Price'].bfill()
print(prices)


   Product_ID  Price
0           1  200.0
1           2  200.0
2           3  250.0


**Use Case:** Backfilling prices to ensure products without recent updates have estimated values.
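Forward and backward fills can carry a value across a long gap, which may not be appropriate for stale sensor data. The `limit` parameter caps how many consecutive missing values are filled. The series below is hypothetical and only illustrates the effect.

```python
import pandas as pd

# Hypothetical hourly readings with a three-hour gap
readings = pd.Series([22.5, None, None, None, 24.0])

# Fill at most one consecutive missing value after each known reading
limited = readings.ffill(limit=1)
print(limited)
```

Here only the first missing reading is filled; the remaining two stay missing so they can be handled by another strategy.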

### **Additional Real-World Scenarios:**

1. **Filling Missing Health Data:**

   Fill missing BMI or glucose levels with median values in patient records for health analytics:

In [12]:
import pandas as pd

# Sample DataFrame
patient_data = pd.DataFrame({
    'Patient_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110], 
    'BMI': [22.5, None, None, 24.0, None, 26.0, None, 27.5, 28.0, None]
})

# Fill missing values in 'BMI' column with the median of the column
patient_data['BMI'] = patient_data['BMI'].fillna(patient_data['BMI'].median())

# Display the updated DataFrame
patient_data

   Patient_ID   BMI
0         101  22.5
1         102  26.0
2         103  26.0
3         104  24.0
4         105  26.0
5         106  26.0
6         107  26.0
7         108  27.5
8         109  28.0
9         110  26.0
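A single overall median can blur differences between patient groups. One possible refinement, sketched here, is to fill each gap with the median of the patient's own group. The `Age_Band` column and all values are hypothetical and not part of the dataset above.

```python
import pandas as pd

# Hypothetical patient records with an assumed grouping column
patients = pd.DataFrame({
    'Patient_ID': [101, 102, 103, 104, 105, 106],
    'Age_Band': ['18-40', '18-40', '18-40', '41-65', '41-65', '41-65'],
    'BMI': [22.0, None, 24.0, 27.0, 29.0, None]
})

# Median BMI of each age band, aligned row by row with the original data
group_median = patients.groupby('Age_Band')['BMI'].transform('median')

# Fill each missing BMI with the median of that patient's age band
patients['BMI'] = patients['BMI'].fillna(group_median)
print(patients)
```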


2. **Handling Missing Demographic Data for Marketing Campaigns:**

   Fill missing categorical demographic fields such as "Gender" or "Age Group" with the mode (the example below fills `Age_Group`):

In [13]:
# Sample DataFrame with more records
demographics = pd.DataFrame({
    'User_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Age_Group': [None, '18-25', '26-35', None, '18-25', '36-45', None, '18-25', '26-35', None]
})

# Fill missing values in 'Age_Group' column with the mode (most frequent value)
demographics['Age_Group'] = demographics['Age_Group'].fillna(demographics['Age_Group'].mode()[0])

# Display the updated DataFrame
demographics

   User_ID  Age_Group
0        1      18-25
1        2      18-25
2        3      26-35
3        4      18-25
4        5      18-25
5        6      36-45
6        7      18-25
7        8      18-25
8        9      26-35
9       10      18-25


3. **Imputing Missing Flight Delay Data for Predictive Models:**

   Interpolate missing flight delay times: 

In [14]:
# Sample DataFrame with more records
flight_delays = pd.DataFrame({
    'Flight_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Delay_Minutes': [10, None, 30, None, 20, 50, None, 60, None, 40]
})

# Interpolate missing values in 'Delay_Minutes' column
flight_delays['Delay_Minutes'] = flight_delays['Delay_Minutes'].interpolate()

# Display the updated DataFrame
flight_delays

   Flight_ID  Delay_Minutes
0          1           10.0
1          2           20.0
2          3           30.0
3          4           25.0
4          5           20.0
5          6           50.0
6          7           55.0
7          8           60.0
8          9           50.0
9         10           40.0


**Interpolation filling** refers to the process of estimating missing values in a dataset by using the known values around them. The idea is to "fill in the blanks" with values that make sense, maintaining a smooth transition based on the trend of the data.

---

### **How Interpolation Works:**

1. **Linear Interpolation (default in pandas)**:  
   Estimates missing values by assuming a **linear progression** between the nearest known values on either side of the gap. With the default `method='linear'`, pandas ignores the index and treats rows as equally spaced. For example:
   - If you have values `10` and `30`, and a single missing value between them, the interpolated value is `20`.

2. **Other Types of Interpolation**:
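   - **Polynomial Interpolation** (`method='polynomial'`, requires an `order` argument and SciPy): Fits a polynomial curve and uses it to estimate the missing values.
   - **Spline Interpolation** (`method='spline'`, also requires `order` and SciPy): Fits a smooth piecewise curve through the data points and fills the missing values accordingly.
   - **Time-based Interpolation** (`method='time'`, requires a `DatetimeIndex`): Weights each estimate by the time elapsed between timestamps rather than by row position.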
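To make the difference between position-based and time-based interpolation concrete, here is a short sketch. The price series and its unevenly spaced dates are hypothetical; only pandas is needed, since `method='time'` does not depend on SciPy.

```python
import pandas as pd

# Hypothetical prices on unevenly spaced dates (a 1-day gap, then a 3-day gap)
prices = pd.Series(
    [100.0, None, 108.0],
    index=pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-05'])
)

# Default linear method: treats rows as equally spaced, so the gap is the midpoint
by_position = prices.interpolate()

# Time method: the missing day is 1/4 of the way from Jan 1 to Jan 5
by_time = prices.interpolate(method='time')

print(by_position)
print(by_time)
```

With evenly spaced timestamps the two methods agree; they diverge only when the spacing between observations varies.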

---

### **Example of Linear Interpolation:**

#### Input Data:

| Flight_ID | Delay_Minutes |
|------------|---------------|
| 1          | 10            |
| 2          | None          |
| 3          | 30            |

#### Interpolation Process:
- The missing value lies midway between `10` and `30`, so linear interpolation estimates it as their average:

$$\frac{10 + 30}{2} = 20$$

The missing value becomes `20`.

---

### **Use Cases of Interpolation**:
- Time-series data with missing readings (temperature, stock prices, etc.)
- Experimental data with missing measurements
- Filling gaps in sensor data

---

### **Key Points**:
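- Interpolation works best when a missing value lies between two known values. With the default settings, pandas leaves leading gaps as `NaN` and fills trailing gaps by repeating the last known value.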
- If the gaps are too large or the data is highly non-linear, interpolation may produce inaccurate results.
- Unlike methods like `.fillna()` with a constant or mean, interpolation considers the trend and surrounding values.
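The behavior at the edges of a series is easy to overlook, so a brief sketch may help. The readings below are hypothetical; `limit_area='inside'` is a pandas option that restricts filling to gaps surrounded by known values.

```python
import pandas as pd

# Hypothetical readings with gaps at the start, in the middle, and at the end
readings = pd.Series([None, 10, None, 30, None], dtype=float)

# Default: leading gap stays NaN, trailing gap repeats the last known value
default_fill = readings.interpolate()

# limit_area='inside' fills only gaps that have known values on both sides
inside_only = readings.interpolate(limit_area='inside')

print(default_fill)
print(inside_only)
```

Restricting interpolation to interior gaps avoids presenting repeated edge values as if they were estimates from a trend.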

