# Pandas Tutorial 5: Handling Missing Data - `fillna()`, `interpolate()`, `dropna()`

Building on the previous tutorial, where we explored reading and writing CSV and Excel files, we now focus on handling missing data, one of the most common issues when working with real-world datasets. In this tutorial, you will learn how to manage missing values in Pandas using the `fillna()`, `interpolate()`, and `dropna()` methods. These techniques let you clean and prepare your data for more accurate analysis.

**Topics covered:**
- Introduction
- Converting a string column to the date type
- Using dates as the index of a DataFrame with `set_index()`
- Using the `fillna()` method to fill missing data
- Forward filling missing data with `fillna(method="ffill")`
- Backward filling missing data with `fillna(method="bfill")`
- Exploring the `axis` parameter in the `fillna()` method
- Exploring the `limit` parameter in the `fillna()` method
- Using the `interpolate()` method for data interpolation
- Using `interpolate(method="time")` to interpolate based on time
- Using the `dropna()` method to remove rows with missing data
- Exploring the `how` parameter in the `dropna()` method
- Exploring the `thresh` parameter in the `dropna()` method

In this tutorial, you will deepen your understanding of data cleaning by learning how to deal with missing values effectively, ensuring your data is ready for analysis after importing it from files.

In [4]:
import pandas as pd

### Parsing Dates and Setting the Index

When importing data from a CSV file, the `parse_dates` argument ensures that the "day" column is treated as a datetime type instead of a string. Then, the `set_index()` method is used to set the "day" column as the index of the DataFrame, making it easier to work with time-series data.

**Key features:**
- `parse_dates=["day"]`: Converts the "day" column into datetime objects during import.
- `set_index('day')`: Makes the "day" column the index of the DataFrame for better time-based analysis.

In [5]:
# Reads the CSV, parsing the "day" column as dates
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\weather_data.csv", parse_dates=["day"])
# Sets the "day" column as the index of the DataFrame
df.set_index("day", inplace=True)
df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
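
The import step above can be tried without the original file. The sketch below is a minimal, self-contained version that assumes a small inline CSV (`csv_text`, a hypothetical stand-in for the first rows of `weather_data.csv`, written with ISO dates) and a DataFrame named `sample`; neither name comes from the tutorial's own code.

```python
import io

import pandas as pd

# Hypothetical inline stand-in for the first rows of weather_data.csv
csv_text = """day,temperature,windspeed,event
2017-01-01,32,6,Rain
2017-01-04,,9,Sunny
2017-01-05,28,,Snow
"""

# parse_dates converts the "day" column to datetimes while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])

# set_index returns a new DataFrame unless inplace=True is passed
sample = sample.set_index("day")

print(type(sample.index).__name__)  # the index is now a DatetimeIndex
print(sample.isna().sum())          # number of missing values per column
```

Calling `isna().sum()` right after loading is a quick way to see how much data is missing in each column before choosing a strategy.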


### Filling Missing Data with `fillna()`

The `fillna()` method is used to replace missing values (`NaN`) in a DataFrame with a specified value. In this case, all missing values in `df` are replaced with `0`, and the result is stored in `new_df`.

**Key features:**
- `fillna(0)`: Replaces all `NaN` values in the DataFrame with `0`.
- It creates a new DataFrame unless the `inplace=True` argument is used.

In [6]:
# Creates a new DataFrame where all missing values (NaN) in df are replaced with 0
new_df = df.fillna(0)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,0
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,0
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
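
Note that in the output above the `event` column also receives a numeric `0`, which mixes numbers and text in one column. The sketch below, using a small hypothetical numeric frame named `demo`, illustrates the other point made above: `fillna()` returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric-only frame with two missing values
demo = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0],
    "windspeed": [6.0, 9.0, np.nan],
})

# fillna returns a new DataFrame; demo itself is not modified
filled = demo.fillna(0)

missing_before = int(demo.isna().sum().sum())
missing_after = int(filled.isna().sum().sum())
print(missing_before, missing_after)
```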


### Filling Missing Data with Different Values for Each Column

The `fillna()` method allows you to specify different replacement values for each column by passing a dictionary. In this example, missing values in the `temperature` and `windspeed` columns are replaced with `0`, and missing values in the `event` column are replaced with `'no event'`.

**Key features:**
- `fillna({column: value, ...})`: Lets you provide a separate fill value for each column named in the dictionary.
- Useful when different columns require different default values for missing data.

In [7]:
new_df = df.fillna({
    'temperature': 0,  # Replaces missing values in the 'temperature' column with 0
    'windspeed': 0,  # Replaces missing values in the 'windspeed' column with 0
    'event': 'no event'  # Replaces missing values in the 'event' column with 'no event'
})
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,no event
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,no event
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
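
One detail worth checking when passing a dictionary is that columns not named in it are left alone. The following sketch assumes a small hypothetical frame `demo` and deliberately omits `windspeed` from the dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
demo = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0],
    "windspeed": [6.0, 9.0, np.nan],
    "event": ["Rain", np.nan, "Snow"],
})

# Only the columns named in the dictionary are filled
filled = demo.fillna({"temperature": 0, "event": "no event"})

print(filled)
```

Here `windspeed` keeps its missing value because no fill value was supplied for it.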


### Forward Filling Missing Data with `fillna(method="ffill")`

The `fillna(method="ffill")` method fills missing values by propagating the last valid value forward. This is useful when you want to replace missing data with the most recent previous value.

**Key features:**
- `method="ffill"` (forward fill): Fills missing values with the last non-null value encountered before the missing one.
- Ideal for time-series data or sequential data where previous values are meaningful for filling.

In [8]:
# Performs forward fill, where missing values are filled with the last valid value from above
new_df = df.fillna(method="ffill")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,9.0,Sunny
2017-01-05,28.0,9.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,32.0,7.0,Sunny
2017-01-09,32.0,7.0,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
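
In recent pandas releases (2.1 and later, to the best of my knowledge) passing `method=` to `fillna()` is deprecated in favour of the dedicated `DataFrame.ffill()` method. The sketch below uses that method on a small hypothetical frame `demo`; it also shows that a missing value at the very start of a column stays missing, since there is no earlier value to carry forward.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: windspeed starts with a missing value
demo = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, np.nan, np.nan],
    "windspeed": [np.nan, 9.0, np.nan, 7.0, np.nan],
})

# ffill() behaves like fillna(method="ffill") in older pandas versions
filled = demo.ffill()

print(filled)
```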


### Backward Filling Missing Data with `fillna(method="bfill")`

The `fillna(method="bfill")` method fills missing values by propagating the next valid value backward. This is useful when you want to replace missing data with the next available value in the dataset.

**Key features:**
- `method="bfill"` (backward fill): Fills missing values with the next non-null value encountered after the missing one.
- Helpful for datasets where subsequent values are more relevant for filling gaps.

In [9]:
# Performs backward fill, where missing values are filled with the next valid value from below
new_df = df.fillna(method="bfill")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,28.0,9.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,32.0,7.0,Rain
2017-01-07,32.0,8.0,Rain
2017-01-08,34.0,8.0,Sunny
2017-01-09,34.0,8.0,Cloudy
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
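
Backward fill has the mirror-image edge case of forward fill: a missing value at the end of a column has no later value to copy, so it stays missing. A minimal sketch, assuming a hypothetical Series `temps` and using the dedicated `bfill()` method rather than `fillna(method="bfill")`:

```python
import numpy as np
import pandas as pd

# Hypothetical series that ends with a missing value
temps = pd.Series([np.nan, 28.0, np.nan, 34.0, np.nan])

# Each gap takes the next valid value; the trailing NaN has none
filled = temps.bfill()

print(filled)
```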


### Backward Filling Missing Data Across Columns with `fillna(method="bfill", axis="columns")`

When using `fillna()` with `method="bfill"` and `axis="columns"`, the backward fill is applied across columns, filling missing values in each row with the next valid value from the subsequent column. This is useful when the data is arranged such that missing values need to be filled horizontally.

**Key features:**
- `axis="columns"`: Applies the fill operation across columns (horizontally) for each row.
- Suitable for datasets where missing values in a row can be filled with values from the same row but different columns.

In [10]:
# Performs backward fill along the columns, filling missing values with the next valid value in the same row
new_df_1 = df.fillna(method="bfill", axis="columns")
new_df_1

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,9.0,9.0,Sunny
2017-01-05,28.0,Snow,Snow
2017-01-06,7.0,7.0,
2017-01-07,32.0,Rain,Rain
2017-01-08,Sunny,Sunny,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
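
In the output above, filling across columns copies text such as `Snow` into the numeric `windspeed` column, so this option tends to make more sense when the columns hold comparable values. The sketch below assumes a hypothetical numeric-only frame `demo` and uses `bfill(axis="columns")`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where both columns are numeric
demo = pd.DataFrame({
    "temperature": [np.nan, 28.0, np.nan],
    "windspeed": [9.0, np.nan, np.nan],
})

# Each missing cell takes the next valid value to its right in the same row
filled = demo.bfill(axis="columns")

print(filled)
```

The last column can never be filled this way, because there is no column to its right to copy from.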


### Forward Filling with a Limit Using `fillna(method="ffill", limit=1)`

The `limit` parameter in `fillna()` sets the maximum number of consecutive missing values to fill. In this case, `limit=1` means that only the first missing value in each gap is forward filled, and any further consecutive missing values in that gap remain `NaN`.

**Key features:**
- `limit=1`: Restricts the fill operation to a maximum of 1 missing value per gap.
- Useful when you only want to fill part of the missing data rather than filling all consecutive missing values.

In [11]:
# Performs forward fill but only fills a maximum of 1 consecutive missing value for each gap
new_df = df.fillna(method="ffill", limit=1)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,9.0,Sunny
2017-01-05,28.0,9.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,32.0,,Sunny
2017-01-09,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
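
The effect of `limit` is easiest to see on a single gap of three missing values. This sketch assumes a hypothetical Series `temps` and uses `ffill(limit=...)`, which accepts the same `limit` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of three consecutive missing values
temps = pd.Series([32.0, np.nan, np.nan, np.nan, 34.0])

one_step = temps.ffill(limit=1)   # fills only the first NaN in the gap
two_steps = temps.ffill(limit=2)  # fills the first two NaNs in the gap

print(one_step.tolist())
print(two_steps.tolist())
```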


### Interpolating Missing Data with `interpolate()`

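The `interpolate()` method in Pandas fills missing values by estimating them from surrounding data points. By default, it uses linear interpolation, which places each missing value on a straight line between the nearest valid values on either side, treating the rows as equally spaced regardless of the index. In the output below, the non-numeric `event` column is left unchanged.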

**Key features:**
- `interpolate()`: Uses interpolation to estimate missing values based on available data, typically through linear interpolation.
- Suitable for datasets where the data has a logical progression (e.g., time-series data or numerical sequences).

In [12]:
# Fills missing values by performing linear interpolation between existing data points
new_df = df.interpolate()
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,30.0,9.0,Sunny
2017-01-05,28.0,8.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,7.25,Rain
2017-01-08,32.666667,7.5,Sunny
2017-01-09,33.333333,7.75,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
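
The values 32.666667 and 33.333333 in the output above come from splitting the step from 32.0 to 34.0 into three equal parts. The sketch below reproduces that arithmetic on a hypothetical Series `temps`.

```python
import numpy as np
import pandas as pd

# Hypothetical series: two missing values between 32.0 and 34.0
temps = pd.Series([32.0, np.nan, np.nan, 34.0])

# Default linear interpolation treats the rows as equally spaced
filled = temps.interpolate()

# Same result by hand: each step adds (34 - 32) / 3
step = (34.0 - 32.0) / 3
by_hand = [32.0, 32.0 + step, 32.0 + 2 * step, 34.0]

print(filled.tolist())
print(by_hand)
```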


### Time-Based Interpolation with `interpolate(method="time")`

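The `interpolate(method="time")` call fills missing values in proportion to the time elapsed between observations, which requires the index to be a `DatetimeIndex`. This makes it more suitable for time-series data with unevenly spaced dates. In the output below, 2017-01-04 becomes 29.0 rather than the 30.0 produced by linear interpolation, because that date lies three quarters of the way from 2017-01-01 (32.0) to 2017-01-05 (28.0).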

**Key features:**
- `method="time"`: Uses time-based interpolation to estimate missing values, ensuring that the interpolation respects time intervals.
- Ideal for time-series datasets where regular linear interpolation may not be appropriate.

In [13]:
# Performs interpolation based on time, assuming the index is a datetime object
new_df = df.interpolate(method="time")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,29.0,9.0,Sunny
2017-01-05,28.0,8.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,7.25,Rain
2017-01-08,32.666667,7.5,Sunny
2017-01-09,33.333333,7.75,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
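
The difference between the two methods shows up only when the dates are unevenly spaced. The sketch below compares them on a hypothetical three-point Series `temps` that mirrors the first rows of the temperature column.

```python
import numpy as np
import pandas as pd

# Hypothetical series with uneven gaps: 3 days, then 1 day
dates = pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"])
temps = pd.Series([32.0, np.nan, 28.0], index=dates)

# Linear: ignores the dates and takes the midpoint
linear = temps.interpolate()

# Time: weights by elapsed time, 3 of the 4 days have passed
timed = temps.interpolate(method="time")

print(linear.iloc[1], timed.iloc[1])
```

The time-based estimate sits closer to 28.0 because 2017-01-04 is only one day before the next reading but three days after the previous one.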


### Dropping Rows with Missing Data Using `dropna()`

The `dropna()` method removes all rows that contain one or more missing values (`NaN`). This is useful when you only want to work with complete data and discard rows with missing entries.

**Key features:**
- `dropna()`: Removes rows where any column contains a missing value.
- Useful for ensuring that only complete data is used for analysis.

In [14]:
# Drops all rows that contain any missing values (NaN) in the DataFrame df
new_df = df.dropna()
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
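
One way to reason about which rows `dropna()` keeps is to build the equivalent boolean mask by hand. This sketch assumes a hypothetical frame `demo` and checks that selecting the rows where every value is present gives the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 2 each have one missing value
demo = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, 34.0],
    "windspeed": [6.0, 9.0, np.nan, 8.0],
})

complete = demo.dropna()

# Equivalent mask: keep rows where all values are present
mask = demo.notna().all(axis=1)
same = demo[mask].equals(complete)

print(complete)
print(same)
```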


### Dropping Rows Where All Values Are Missing with `dropna(how="all")`

The `how="all"` argument in the `dropna()` method specifies that only rows where **all** columns contain missing values (`NaN`) should be dropped. Rows with partial data (i.e., some non-missing values) will remain in the DataFrame.

**Key features:**
- `how="all"`: Removes rows only if every value in that row is missing.
- Useful when you want to keep rows that have some data but remove rows that are entirely empty.

In [15]:
# Drops only rows where all the values are missing (NaN)
new_df = df.dropna(how="all")
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
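
The contrast between the default behaviour (`how="any"`) and `how="all"` can be sketched on a three-row hypothetical frame `demo` containing one complete row, one partial row, and one entirely empty row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: complete row, partial row, fully empty row
demo = pd.DataFrame({
    "temperature": [32.0, np.nan, np.nan],
    "windspeed": [6.0, 9.0, np.nan],
})

drop_any = demo.dropna()           # default how="any": drops rows 1 and 2
drop_all = demo.dropna(how="all")  # drops only the fully empty row 2

print(len(drop_any), len(drop_all))
```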


### Dropping Rows Based on Non-Missing Values with `dropna(thresh=2)`

The `thresh` parameter in the `dropna()` method sets a minimum threshold for the number of non-missing values a row must have to be retained. In this case, `thresh=2` means that rows with fewer than 2 non-NaN values will be dropped.

**Key features:**
- `thresh=2`: Keeps rows that have at least 2 non-missing values.
- Useful for keeping rows that contain a sufficient amount of data, while discarding those with too many missing values.

In [16]:
# Drops rows that contain fewer than 2 non-missing (non-NaN) values
new_df = df.dropna(thresh=2)
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-07,32.0,,Rain
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny
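
The `thresh` rule amounts to counting the non-missing values in each row. The sketch below, on a hypothetical frame `demo` whose rows hold 3, 2, 1, and 0 valid values, checks `dropna(thresh=2)` against that count.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows contain 3, 2, 1 and 0 non-missing values
demo = pd.DataFrame({
    "temperature": [32.0, np.nan, np.nan, np.nan],
    "windspeed": [6.0, 9.0, 7.0, np.nan],
    "event": ["Rain", "Sunny", np.nan, np.nan],
})

kept = demo.dropna(thresh=2)

# Equivalent rule: keep rows with at least 2 non-missing values
valid_counts = demo.notna().sum(axis=1)
same = demo[valid_counts >= 2].equals(kept)

print(valid_counts.tolist())
print(kept.index.tolist())
```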


### Reindexing a DataFrame with a New Datetime Index

The `reindex()` method is used to align a DataFrame to a new index. In this case, the DataFrame is reindexed to cover the full date range from "01-01-2017" to "01-11-2017" (January 1 to January 11, 2017). Dates that were absent from the original DataFrame appear as new rows filled with `NaN` by default.

**Key features:**
- `pd.date_range()`: Creates a range of dates to be used as the new index.
- `reindex()`: Aligns the DataFrame to a new index, potentially introducing NaN values for missing rows.

In [17]:
# Creates a daily date range from January 1, 2017 to January 11, 2017
dt = pd.date_range("01-01-2017", "01-11-2017")
# date_range() already returns a DatetimeIndex, so this wrap is optional
idx = pd.DatetimeIndex(dt)
# Reindexes df to the full date range, adding NaN rows for dates that were missing
df = df.reindex(idx)
df

,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-02,,,
2017-01-03,,,
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
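
A smaller version of the reindexing step can be sketched as follows. It assumes a hypothetical two-row frame `small` and a four-day range `full_range`, written with ISO date strings to avoid any day/month ambiguity; these names are illustrative only.

```python
import pandas as pd

# Daily range covering four days; date_range returns a DatetimeIndex
full_range = pd.date_range("2017-01-01", "2017-01-04")

# Hypothetical frame that only has readings for the first and last day
small = pd.DataFrame({"temperature": [32.0, 28.0]}, index=full_range[[0, 3]])

# Reindexing adds rows for the two missing days, filled with NaN
reindexed = small.reindex(full_range)

print(reindexed)
```

Once the gaps are explicit rows, any of the filling or interpolation methods covered earlier can be applied to them.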
