# Pandas Tutorial 5: Handling Missing Data - `fillna()`, `interpolate()`, `dropna()`

Building on the previous tutorial about reading and writing files, this tutorial focuses on handling missing data - a common issue in real-world datasets. You will learn to manage missing values in Pandas using the `fillna()`, `interpolate()`, and `dropna()` methods to clean and prepare your data for accurate analysis.

**Topics covered:**
- Introduction
- Converting a string column to date type
- Setting dates as the index with `set_index()`
- Filling missing data with `fillna()`
- Forward filling with `fillna(method="ffill")`
- Backward filling with `fillna(method="bfill")`
- Using the `axis` and `limit` parameters in `fillna()`
- Interpolating data with `interpolate()`
- Time-based interpolation with `interpolate(method="time")`
- Removing rows with missing data using `dropna()`
- Exploring `how` and `thresh` parameters in `dropna()`

This tutorial will help you effectively clean and manage missing data, ensuring your dataset is ready for analysis after import.

In [4]:
import pandas as pd

### Parsing Dates and Setting the Index

The `parse_dates` argument in `read_csv()` converts the "day" column to datetime while the file is read, and `set_index()` then makes that column the DataFrame index. A datetime index is what later allows time-based interpolation.

**Key features:**
- `parse_dates=["day"]`: Parses the "day" column as datetime during import instead of leaving it as strings.
- `set_index('day', inplace=True)`: Makes "day" the index of `df` in place, so rows can be selected and aligned by date.

In [5]:
# Reads the CSV, parsing the "day" column as dates
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\weather_data.csv", parse_dates=["day"])
# Sets the "day" column as the index of the DataFrame
df.set_index('day', inplace=True)
df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


### Filling Missing Data with `fillna()`

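The `fillna()` method replaces missing values (`NaN`) in a DataFrame with a specified value. Here, every missing value is replaced with `0` and the result is stored in `new_df`. Note that the text column `event` also receives `0`, which is rarely what you want; the next section addresses this with per-column values.

**Key features:**
- `fillna(0)`: Replaces all `NaN` values with `0`, in every column regardless of its type.
- Returns a new DataFrame; the original `df` is unchanged unless `inplace=True` is used.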

In [6]:
# Creates a new DataFrame where all missing values (NaN) in df are replaced with 0
new_df = df.fillna(0)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,0
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,0
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
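
A constant such as `0` can distort numeric columns (a temperature of 0 is a real reading). As a hedged alternative, the sketch below fills gaps with the column mean; the `temperature` Series is a hand-typed copy of the column above, not something loaded from the original file.

```python
import pandas as pd

# Temperature readings from the tutorial's dataset, with the same gaps.
temperature = pd.Series([32.0, None, 28.0, None, 32.0, None, None, 34.0, 40.0])

# mean() skips NaN by default, so this is the mean of the five observed values.
mean_temp = temperature.mean()

# Replace every gap with that mean instead of 0.
filled = temperature.fillna(mean_temp)
print(filled)
```

Whether the mean is a sensible default depends on the data; it is shown here only to illustrate that `fillna()` accepts any computed scalar.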


### Filling Missing Data with Different Values for Each Column

You can pass a dictionary to `fillna()` to specify a different replacement value for each column. In this example, missing values in `temperature` and `windspeed` are replaced with `0`, and missing values in `event` with `no event`.

**Key features:**
- `fillna({...})`: A dictionary mapping column names to their fill values.
- Ideal when columns need different default values for missing data, such as numbers for measurements and a label for text.

In [7]:
new_df = df.fillna({
    'temperature': 0,  # Replaces missing values in the 'temperature' column with 0
    'windspeed': 0,  # Replaces missing values in the 'windspeed' column with 0
    'event': 'no event'  # Replaces missing values in the 'event' column with 'no event'
})
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,no event
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,no event
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
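
One detail worth checking is what happens to columns that are left out of the dictionary. The small sketch below (using a made-up three-row frame, not the original file) suggests they are simply left untouched.

```python
import pandas as pd

# A tiny illustrative frame with one gap in each column.
df = pd.DataFrame({
    "temperature": [32.0, None, 28.0],
    "event": ["Rain", None, "Snow"],
})

# Only 'event' is listed, so the gap in 'temperature' stays NaN.
partial = df.fillna({"event": "no event"})
print(partial)
```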


### Forward Filling Missing Data with `fillna(method="ffill")`

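Calling `fillna(method="ffill")` fills each missing value by carrying the last valid value forward, so a gap takes the most recent earlier observation in the same column. Recent pandas 2.x releases emit a `FutureWarning` for the `method` argument of `fillna()`; `df.ffill()` gives the same result, and `df.bfill()` is the counterpart for backward filling.

**Key features:**
- `method="ffill"`: Fills missing values with the last non-null value above them.
- Ideal for time-series or sequential data where the previous value is a reasonable stand-in.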

In [8]:
# Performs forward fill, where missing values are filled with the last valid value from above
new_df = df.fillna(method="ffill")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,9.0,Sunny
2017-01-05,28.0,9.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,32.0,7.0,Sunny
2017-01-09,32.0,7.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
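
For newer pandas versions, the dedicated `ffill()` method is a reasonable substitute for `fillna(method="ffill")`. The sketch below applies it to a hand-typed copy of the `temperature` column and should reproduce the values in the table above.

```python
import pandas as pd

# Temperature column from the tutorial's dataset, with the same gaps.
temperature = pd.Series([32.0, None, 28.0, None, 32.0, None, None, 34.0, 40.0])

# ffill() carries the last valid observation forward into each gap.
forward = temperature.ffill()
print(forward.tolist())
```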


### Backward Filling Missing Data with `fillna(method="bfill")`

Calling `fillna(method="bfill")` fills each missing value with the next valid value below it in the same column.

**Key features:**
- `method="bfill"`: Fills missing values with the next non-null value.
- Useful when a later observation is a better stand-in for the gap than an earlier one.

In [9]:
# Performs backward fill, where missing values are filled with the next valid value from below
new_df = df.fillna(method="bfill")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,28.0,9.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,32.0,7.0,Rain
2017-01-07,32.0,8.0,Rain
2017-01-08,34.0,8.0,Sunny
2017-01-09,34.0,8.0,Cloudy
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
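
Forward and backward fills differ at the edges of a column: a forward fill has nothing to copy into a leading gap, and a backward fill has nothing to copy into a trailing one. The sketch below illustrates this on a made-up five-value Series, using the `ffill()` and `bfill()` methods.

```python
import pandas as pd

# Illustrative data with a gap at the start, in the middle, and at the end.
s = pd.Series([float("nan"), 1.0, float("nan"), 3.0, float("nan")])

forward = s.ffill()   # leading gap stays NaN, the others take the value above
backward = s.bfill()  # trailing gap stays NaN, the others take the value below

print(forward.tolist())
print(backward.tolist())
```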


### Backward Filling Missing Data Across Columns with `fillna(method="bfill", axis="columns")`

Using `fillna()` with `method="bfill"` and `axis="columns"` fills each missing value with the next valid value to its right in the same row. In this dataset the columns hold different kinds of data, so the result copies event names such as "Snow" into `windspeed`; the technique makes sense mainly when all columns hold comparable measurements.

**Key features:**
- `axis="columns"`: Fills missing values across columns (horizontally) within each row.
- A row with no valid value to the right of the gap, such as 2017-01-09, is left unchanged.

In [10]:
# Performs backward fill along the columns, filling missing values with the next valid value in the same row
new_df_1 = df.fillna(method="bfill", axis="columns")
new_df_1

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,9.0,9.0,Sunny
2017-01-05,28.0,Snow,Snow
2017-01-06,7.0,7.0,
2017-01-07,32.0,Rain,Rain
2017-01-08,Sunny,Sunny,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
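
To avoid mixing text into numeric columns, one option is to restrict a column-wise fill to numeric columns only. The sketch below does this on a small made-up frame; the values are illustrative rather than taken from the file.

```python
import pandas as pd

# Two numeric columns only, so a horizontal fill never copies text into numbers.
df = pd.DataFrame({
    "temperature": [None, 28.0, None],
    "windspeed": [9.0, None, None],
})

# Each gap takes the next valid value to its right in the same row.
filled = df.bfill(axis="columns")
print(filled)
```

Row 0 borrows `windspeed` for `temperature`; in row 1 the gap is in the last column, so there is nothing to its right and it stays `NaN`.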


### Forward Filling with a Limit Using `fillna(method="ffill", limit=1)`

The `limit` parameter in `fillna()` caps how many consecutive missing values are filled. With `limit=1`, only the first missing value in each gap is forward filled; the rest of a longer gap stays `NaN`.

**Key features:**
- `limit=1`: Fills at most 1 missing value per gap.
- Useful when a value should only be carried over a short distance.

In [11]:
# Performs forward fill but only fills a maximum of 1 consecutive missing value for each gap
new_df = df.fillna(method="ffill", limit=1)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,9.0,Sunny
2017-01-05,28.0,9.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,32.0,,Sunny
2017-01-09,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
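
The effect of `limit` is easiest to see on a single long gap. The sketch below uses a made-up Series with three consecutive missing values and the `ffill()` method, which accepts the same `limit` argument.

```python
import pandas as pd

# One gap of three consecutive missing values between 1.0 and 5.0.
s = pd.Series([1.0, None, None, None, 5.0])

one_step = s.ffill(limit=1)   # fills only the first value of the gap
two_steps = s.ffill(limit=2)  # fills the first two values of the gap

print(one_step.tolist())
print(two_steps.tolist())
```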


### Interpolating Missing Data with `interpolate()`

The `interpolate()` method estimates missing values from the surrounding data. By default it uses linear interpolation, which treats rows as equally spaced and fills each gap along a straight line between the nearest valid values on either side. Only numeric columns are interpolated; the text column `event` keeps its gaps.

**Key features:**
- `interpolate()`: Fills missing values by estimating them from neighbouring values.
- Ideal for time-series or numerical data that changes gradually from one row to the next.

In [12]:
# Fills missing values by performing linear interpolation between existing data points
new_df = df.interpolate()
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,30.0,9.0,Sunny
2017-01-05,28.0,8.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,7.25,Rain
2017-01-08,32.666667,7.5,Sunny
2017-01-09,33.333333,7.75,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
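
The values 32.666667 and 33.333333 in the table come from dividing the rise from 32.0 to 34.0 into three equal steps. The sketch below reproduces that arithmetic on a four-value Series copied by hand from the `temperature` column.

```python
import pandas as pd

# Temperatures for 2017-01-07 through 2017-01-10, with the two gaps in between.
s = pd.Series([32.0, None, None, 34.0])

# Linear interpolation: two gaps mean three equal steps from 32.0 to 34.0.
step = (34.0 - 32.0) / 3
interpolated = s.interpolate()

print(step)
print(interpolated.tolist())
```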


### Time-Based Interpolation with `interpolate(method="time")`

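Calling `interpolate(method="time")` weights each estimate by the actual time elapsed between index values, so it requires a datetime index. For 2017-01-04, which is three days after 2017-01-01 and one day before 2017-01-05, it gives 29.0 instead of the 30.0 produced by the default linear method.

**Key features:**
- `method="time"`: Interpolates in proportion to the time intervals between rows.
- Best suited for time-series datasets whose rows are unevenly spaced in time, where the default method's equal-spacing assumption does not hold.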

In [13]:
# Performs interpolation based on time, assuming the index is a datetime object
new_df = df.interpolate(method="time")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,29.0,9.0,Sunny
2017-01-05,28.0,8.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,7.25,Rain
2017-01-08,32.666667,7.5,Sunny
2017-01-09,33.333333,7.75,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
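
The difference between the two methods shows up only where dates are unevenly spaced. The sketch below isolates the first three temperature readings (typed by hand from the table) and compares the default linear method with `method="time"`.

```python
import pandas as pd

# Three readings: the gap on 2017-01-04 sits 3 days after the first
# reading and 1 day before the last one.
s = pd.Series(
    [32.0, None, 28.0],
    index=pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
)

linear = s.interpolate()                # ignores the dates: midpoint of 32 and 28
by_time = s.interpolate(method="time")  # 3/4 of the way from 32 to 28

print(linear.iloc[1], by_time.iloc[1])
```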


### Dropping Rows with Missing Data Using `dropna()`

The `dropna()` method removes every row that contains at least one missing value (`NaN`), so only complete rows are retained. In this dataset that leaves just 3 of the 9 rows.

**Key features:**
- `dropna()`: Deletes rows with a missing value in any column.
- Useful when an analysis requires complete rows, at the cost of discarding partial data.

In [14]:
# Drops all rows that contain any missing values (NaN) in the DataFrame df
new_df = df.dropna()
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
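
Because `dropna()` can discard a large share of the data, it may help to look at the affected rows first. The sketch below, on a small made-up frame, selects the rows that contain any `NaN` and confirms that they are exactly the ones `dropna()` removes.

```python
import pandas as pd

# Illustrative frame: one complete row and three rows with at least one gap.
df = pd.DataFrame({
    "temperature": [32.0, None, 28.0, None],
    "windspeed": [6.0, 9.0, None, None],
})

# Boolean mask: True for rows with a missing value in any column.
incomplete_rows = df[df.isna().any(axis=1)]
complete_rows = df.dropna()

print(len(incomplete_rows), "rows would be dropped")
print(complete_rows)
```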


### Dropping Rows Where All Values Are Missing with `dropna(how="all")`

Using `how="all"` in `dropna()` removes rows only if all columns contain missing values, preserving rows with partial data.

**Key features:**
- `how="all"`: Drops rows where every value is missing.
- Useful for keeping rows with some data while removing completely empty rows.

In [15]:
# Drops only rows where all the values are missing (NaN)
new_df = df.dropna(how="all")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
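
The contrast between the default behaviour and `how="all"` can be sketched on three rows: one complete, one partial, and one empty. The frame below is made up for illustration.

```python
import pandas as pd

# Row 0 is complete, row 1 is partially filled, row 2 is entirely missing.
df = pd.DataFrame({
    "temperature": [32.0, None, None],
    "windspeed": [6.0, 7.0, None],
})

drop_any = df.dropna()           # default how="any": keeps only row 0
drop_all = df.dropna(how="all")  # keeps rows 0 and 1, drops the empty row 2

print(drop_any.index.tolist())
print(drop_all.index.tolist())
```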


### Dropping Rows Based on Non-Missing Values with `dropna(thresh=2)`

The `thresh` parameter in `dropna()` ensures rows with at least 2 non-missing values are retained. Rows with fewer will be dropped.

**Key features:**
- `thresh=2`: Keeps rows with at least 2 non-NaN values.
- Useful for retaining rows with enough data while discarding those with too many missing values.

In [16]:
# Drops rows that contain fewer than 2 non-missing (non-NaN) values
new_df = df.dropna(thresh=2)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-07,32.0,,Rain
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
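
One way to read `thresh` is as a filter on the count of non-missing values per row. The sketch below, on a made-up four-row frame, builds that filter by hand and checks that it matches `dropna(thresh=2)`.

```python
import pandas as pd

# Rows hold 3, 2, 1 and 0 non-missing values respectively.
df = pd.DataFrame({
    "temperature": [32.0, None, None, None],
    "windspeed": [6.0, 9.0, 7.0, None],
    "event": ["Rain", "Sunny", None, None],
})

kept = df.dropna(thresh=2)

# Hand-written equivalent: count non-missing values per row and compare.
non_missing = df.notna().sum(axis=1)
manual = df[non_missing >= 2]

print(kept.index.tolist())
```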


### Reindexing a DataFrame with a New Datetime Index

The `reindex()` method aligns a DataFrame to a new index. Here, the DataFrame is reindexed to every day from "01-01-2017" to "01-11-2017", which adds rows of `NaN` for dates absent from the original data (2017-01-02 and 2017-01-03).

**Key features:**
- `pd.date_range()`: Generates a `DatetimeIndex` covering the given range, one entry per day by default.
- `reindex()`: Aligns the DataFrame to the new index, adding `NaN` rows for missing dates.

In [17]:
# Creates a daily DatetimeIndex from January 1, 2017 to January 11, 2017
# (pd.date_range() already returns a DatetimeIndex, so no conversion is needed)
idx = pd.date_range("01-01-2017", "01-11-2017")
# Reindexes df to the full date range, adding all-NaN rows for dates missing from the data
df = df.reindex(idx)
df

Unnamed: 0,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-02,,,
2017-01-03,,,
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
