## Pandas Tutorial 6: Handling Missing Data - `replace()` Function

In the previous tutorial, we covered methods like `fillna()`, `interpolate()`, and `dropna()` for handling missing data. Now, we focus on the `replace()` function, which offers precise control for transforming specific values. It allows for replacing single values, lists, or even patterns using regular expressions.

**Topics covered:**
- Managing missing data with `replace()`
- Handling special values in datasets
- Replacing values using a dictionary
- Using regular expressions ("regex") with `replace()`
- Replacing values with regex using a dictionary
- Substituting lists of values

This tutorial will enhance your data cleaning skills, adding flexibility to the transformation methods from the previous lessons.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\weather_data (2).csv")
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32 F,6 mph,Rain
1,1/2/2017,-99999,7 mph,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,No Event
4,1/5/2017,32C,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,No Event


### Replacing Specific Values with `replace()`

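The `replace()` method substitutes specific values in a DataFrame. Here, the goal is to replace every `-99999` with `np.nan` so that placeholder readings are treated as missing data. In the output shown, the placeholders are unchanged: the `temperature` and `windspeed` columns also contain text such as `32 F` and `6 mph`, so they were most likely read as strings, and the integer `-99999` does not match the string `'-99999'`.

**Key features:**
- `replace(-99999, np.nan)`: Replaces all `-99999` values with `NaN`. (`np.NaN` is an older alias that NumPy 2.0 removed.)
- Ideal for cleaning datasets where placeholder values indicate missing data.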

In [3]:
# Replace every occurrence of -99999 in df with NaN (np.nan works on all NumPy versions; np.NaN was removed in NumPy 2.0)
new_df = df.replace(-99999, np.nan)
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32 F,6 mph,Rain
1,1/2/2017,-99999,7 mph,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,No Event
4,1/5/2017,32C,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,No Event
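
The unchanged output above can be reproduced without the CSV. The following is a minimal sketch, assuming the columns hold strings; the `sample` frame is hypothetical stand-in data, not the tutorial's file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: every value is stored as a Python string
sample = pd.DataFrame({
    "temperature": ["32 F", "-99999", "28"],
    "windspeed": ["6 mph", "7 mph", "-99999"],
}, dtype=object)

# The integer -99999 never equals the string "-99999", so nothing changes
unchanged = sample.replace(-99999, np.nan)

# Matching the string form of the placeholder does replace it
cleaned = sample.replace("-99999", np.nan)
print(cleaned)
```

If the placeholders in a real file are numeric, the integer form used in the cell above is the one that matches.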


### Replacing Multiple Values with `replace()`

Passing a list to `replace()` substitutes several values at once. Here, both `-99999` and `-88888` are replaced with `np.nan`, treating them as placeholders for missing data.

**Key features:**
- `replace([-99999, -88888], np.nan)`: Replaces multiple values with `NaN`.
- Useful for handling multiple placeholder values in the dataset.

In [21]:
# Replace every occurrence of -99999 and -88888 in df with NaN
new_df = df.replace([-99999, -88888], np.nan)
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32 F,6 mph,Rain
1,1/2/2017,-99999,7 mph,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,No Event
4,1/5/2017,32C,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,No Event
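
To see the list form take effect, here is a small sketch on purely numeric columns. The `readings` frame is made-up illustration data, chosen so that both placeholders actually occur.

```python
import numpy as np
import pandas as pd

# Made-up numeric readings containing two different placeholder values
readings = pd.DataFrame({
    "temperature": [32, -99999, 28, -88888],
    "windspeed": [6, 7, -99999, 2],
})

# Every value in the list is replaced by the same scalar, NaN
cleaned = readings.replace([-99999, -88888], np.nan)
print(cleaned)
```

Because `NaN` is a floating-point value, the integer columns become float columns after the replacement.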


### Replacing Values in Specific Columns Using a Dictionary with `replace()`

The `replace()` method can use a dictionary to target specific columns. Here, `-99999` in the `temperature` and `windspeed` columns and `'0'` in the `event` column are replaced with `NaN`.

**Key features:**
- `replace({column_name: value}, np.nan)`: Replaces the given value only in the named column.
- Useful for cleaning multiple columns with different placeholder values.

In [22]:
new_df = df.replace({
    'temperature': -99999,  # Replace -99999 in the 'temperature' column with NaN
    'windspeed': -99999,  # Replace -99999 in the 'windspeed' column with NaN
    'event': '0'  # Replace '0' in the 'event' column with NaN
    }, np.nan)
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32 F,6 mph,Rain
1,1/2/2017,-99999,7 mph,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,No Event
4,1/5/2017,32C,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,No Event
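
The value of the per-column form shows up when the same value is a placeholder in one column but legitimate in another. A minimal sketch with made-up data, where a windspeed of `0` is a real reading but `'0'` in `event` is a placeholder:

```python
import numpy as np
import pandas as pd

# Made-up data: 0 is a valid windspeed, while "0" marks a missing event
readings = pd.DataFrame({
    "temperature": [32, -99999, 28],
    "windspeed": [0, -99999, 7],
    "event": ["Rain", "0", "Snow"],
})

# Each key names a column; its value is what to look for in that column only
cleaned = readings.replace(
    {"temperature": -99999, "windspeed": -99999, "event": "0"},
    np.nan,
)
print(cleaned)
```

The calm-day windspeed of `0` survives because the dictionary only asks for `-99999` in that column.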


### Replacing Multiple Values Using a Dictionary with `replace()`

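The `replace()` method can use a dictionary to perform multiple replacements across the DataFrame, with each key mapped to its own replacement. For example, `-99999` is mapped to `NaN` and `'No Event'` is mapped to `'Sunny'`. In the output shown, only the `'No Event'` rows change; the `-99999` entries appear to be stored as strings in this dataset, so the integer key does not match them.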

**Key features:**
- `replace({old_value: new_value})`: Flexible, multi-value replacement across the DataFrame.
- Handles both numerical and string replacements in one step.

In [23]:
new_df = df.replace({
    -99999: np.nan,  # Replace every -99999 in the DataFrame with NaN
    'No Event': 'Sunny'  # Replace every 'No Event' in the DataFrame with 'Sunny'
})
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32 F,6 mph,Rain
1,1/2/2017,-99999,7 mph,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,Sunny
4,1/5/2017,32C,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,Sunny
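
As a self-contained sketch of the same mapping form, the example below uses made-up data in which the temperature column is numeric, so both entries of the dictionary have something to match.

```python
import numpy as np
import pandas as pd

# Made-up data with a numeric column and a text column
readings = pd.DataFrame({
    "temperature": [32, -99999, 28],
    "event": ["Rain", "No Event", "Snow"],
})

# Keys are the values to find anywhere in the frame; values are their replacements
cleaned = readings.replace({-99999: np.nan, "No Event": "Sunny"})
print(cleaned)
```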


### Replacing Using Regular Expressions with `replace()`

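The `replace()` method with `regex=True` enables pattern-based replacements. Here, every alphabetic character (`[A-Za-z]`) is replaced with an empty string `''`. The pattern is applied to all columns, so it strips the units from `temperature` and `windspeed` but also empties the `event` column, as the output shows.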

**Key features:**
- `regex=True`: Enable regular expression-based replacements.
- Ideal for cleaning text data by removing or modifying specific patterns.

In [25]:
# Replace every alphabetic character (A-Z, a-z) in the DataFrame with an empty string, using a regular expression
new_df = df.replace(r'[A-Za-z]', '', regex=True)
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,
1,1/2/2017,-99999,7,
2,1/3/2017,28,-99999,
3,1/4/2017,-99999,7,
4,1/5/2017,32,-99999,
5,1/6/2017,31,2,
6,1/6/2017,34,5,
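
The side effect of a frame-wide pattern is easy to check on a tiny example. This sketch uses made-up string data; note that removing only the letters from `'32 F'` leaves the space behind.

```python
import pandas as pd

# Made-up string data with units mixed into the readings
sample = pd.DataFrame({
    "temperature": ["32 F", "28", "32C"],
    "event": ["Rain", "Snow", "Sunny"],
})

# The pattern is applied to every string column, including 'event'
stripped = sample.replace(r"[A-Za-z]", "", regex=True)

print(stripped["temperature"].tolist())  # ['32 ', '28', '32']
print(stripped["event"].tolist())        # ['', '', '']
```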


### Using Regular Expressions with a Dictionary in `replace()`

The `replace()` method with a dictionary and `regex=True` targets specific columns. Here, `[A-Za-z]` removes all alphabetic characters from the 'temperature' and 'windspeed' columns by replacing them with `''`.

**Key features:**
- `regex=True`: Enables pattern-based replacements in specific columns.
- Ideal for cleaning columns based on patterns such as letters, digits, or symbols.

In [24]:
new_df = df.replace({
    'temperature': r'[A-Za-z]',  # Remove alphabetic characters from the 'temperature' column
    'windspeed': r'[A-Za-z]'  # Remove alphabetic characters from the 'windspeed' column
    }, '', regex=True)
new_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,-99999,7,Sunny
2,1/3/2017,28,-99999,Snow
3,1/4/2017,-99999,7,No Event
4,1/5/2017,32,-99999,Rain
5,1/6/2017,31,2,Sunny
6,1/6/2017,34,5,No Event
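
A possible follow-up, sketched on made-up data: widen the pattern so that it also removes the space before the unit, then convert the cleaned strings to numbers with `pd.to_numeric`. The pattern `\s*[A-Za-z]+` is an illustrative choice, not the one used in the cell above.

```python
import pandas as pd

# Made-up string data with units mixed into the readings
sample = pd.DataFrame({
    "temperature": ["32 F", "28", "32C"],
    "windspeed": ["6 mph", "7", "2"],
    "event": ["Rain", "Snow", "Sunny"],
})

# Optional whitespace followed by letters, removed only in the two named columns
unit_pattern = r"\s*[A-Za-z]+"
stripped = sample.replace(
    {"temperature": unit_pattern, "windspeed": unit_pattern},
    "",
    regex=True,
)

# Convert the cleaned strings to numeric columns; 'event' is left as text
numeric = stripped.copy()
numeric["temperature"] = pd.to_numeric(stripped["temperature"])
numeric["windspeed"] = pd.to_numeric(stripped["windspeed"])
print(numeric.dtypes)
```

Stripping the letters also discards the unit, so `32 F` and `32C` both end up as `32`; whether that is acceptable depends on the dataset.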


### Creating a DataFrame with Student Scores

This example creates a DataFrame with two columns:
- `score`: Descriptive labels like 'exceptional', 'average', 'good', and 'poor' representing student performance.
- `student`: Names of students.

This DataFrame can be used for further operations, such as converting labels to numeric values or performing data analysis.

In [26]:
df = pd.DataFrame({
    'score': ['exceptional','average','good','poor','average','exceptional'],  # A column representing student scores as descriptive labels
    'student': ['rob', 'maya', 'parthiv', 'tom', 'julian', 'erica']  # A column with student names
})
df

Unnamed: 0,score,student
0,exceptional,rob
1,average,maya
2,good,parthiv
3,poor,tom
4,average,julian
5,exceptional,erica


### Replacing a List of Values with Another List Using `replace()`

The `replace()` method is used to map descriptive labels in the `score` column to numeric values. The two lists are paired by position:
- 'poor' → `1`
- 'average' → `2`
- 'good' → `3`
- 'exceptional' → `4`

This is helpful for converting categorical data into numerical form for analysis or machine learning.

**Key features:**
- `replace([old_values], [new_values])`: Maps multiple values in one step.
- Ideal for transforming categorical data into numerical form.

In [27]:
# Replaces 'poor' with 1, 'average' with 2, 'good' with 3, and 'exceptional' with 4 in the 'score' column of the DataFrame
new_df = df.replace(['poor','average','good','exceptional'], [1,2,3,4])
new_df

Unnamed: 0,score,student
0,4,rob
1,2,maya
2,3,parthiv
3,1,tom
4,2,julian
5,4,erica
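
Since the two lists are matched by position, a mismatch in order silently produces the wrong numbers. One way to make each pair explicit is a nested dictionary that also limits the replacement to the `score` column. The sketch below assumes the same labels as above; the `levels` name is hypothetical.

```python
import pandas as pd

# Same kind of data as the tutorial's student DataFrame, stored as plain objects
scores = pd.DataFrame({
    "score": ["exceptional", "average", "good", "poor"],
    "student": ["rob", "maya", "parthiv", "tom"],
}, dtype=object)

# Hypothetical label-to-number mapping; each pair is written out explicitly
levels = {"poor": 1, "average": 2, "good": 3, "exceptional": 4}

# Outer key = column to search, inner dict = value -> replacement
encoded = scores.replace({"score": levels})
print(encoded)
```

The `student` column is never searched, so a name that happens to equal one of the labels would be left alone.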
