# Lesson 2: Handling Missing Values

## Handling Missing Values in Python with Pandas

## Introduction
Welcome to our **Handling Missing Values** lesson! Missing values in datasets can complicate data analysis. Incorrect handling may lead to inaccurate results. This guide will teach you how to manage these values using Python's Pandas library.

---

## Missing Data in Datasets
Missing data occurs when no value is recorded for a variable. This can:
- Bias results, especially when the gaps are not random.
- Make some calculations fail or return `NaN`.
- Obscure patterns in the data.

**Example dataset:**
```python
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
        'Score': [85, 88, None, 92, 90]}
df = pd.DataFrame(data)
print(df)
# Output:
#       Name  Score
# 0     Anna   85.0
# 1      Bob   88.0
# 2  Charlie    NaN
# 3    David   92.0
# 4     None   90.0
```
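Here, Charlie's score is missing, and so is the name in the last row. Both were entered as `None`; the missing score prints as `NaN` because `Score` is a numeric column.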

---

## Identifying Missing Values with Pandas
Before handling missing values, we must identify them. Pandas provides:
- `isnull()`: Returns `True` for missing values.
- `notnull()`: Returns `True` for non-missing values.

**Example:**
```python
print(df.isnull())
# Output:
#     Name  Score
# 0  False  False
# 1  False  False
# 2  False   True
# 3  False  False
# 4   True  False
```
Charlie's missing score (row 2) and the missing name (row 4) both return `True`.  
Similarly:
```python
print(df.notnull())
# Output:
#     Name  Score
# 0   True   True
# 1   True   True
# 2   True  False
# 3   True   True
# 4  False   True
```
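
Once the missing cells are flagged, a common next step is to count them. The sketch below is an illustration rather than part of the original lesson; it reuses the example data and relies on the fact that `True` counts as 1 when summed.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
    'Score': [85, 88, None, 92, 90],
})

# Summing the boolean mask gives the number of missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Adding the per-column counts gives the total number of missing cells
total_missing = int(missing_per_column.sum())
print(total_missing)
```

Each column has one gap here, so the per-column counts are both 1 and the total is 2.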

---

## Handling Missing Values

### 1. **Removal**
#### **Removing Rows with Any Missing Values**
```python
print(df.dropna())
# Output:
#     Name  Score
# 0   Anna   85.0
# 1    Bob   88.0
# 3  David   92.0
```
Rows 2 and 4 each contain a missing value, so both are removed.

#### **Removing Rows Based on Specific Columns**
```python
print(df.dropna(subset=['Score']))
# Output:
#     Name  Score
# 0   Anna   85.0
# 1    Bob   88.0
# 3  David   92.0
# 4   None   90.0
```
Here, only rows with a missing value in the `Score` column are removed, so row 4 stays even though its name is missing.
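
To see how much data each removal strategy costs, it can help to compare row counts. This is a minimal sketch on the same example data; the variable names are only illustrative.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
    'Score': [85, 88, None, 92, 90],
})

rows_total = len(df)                                 # all rows
rows_complete = len(df.dropna())                     # rows with no gaps at all
rows_with_score = len(df.dropna(subset=['Score']))   # rows that have a score

print(rows_total, rows_complete, rows_with_score)
```

Dropping on every column keeps 3 of the 5 rows, while dropping only on `Score` keeps 4.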

---

### 2. **Replacement**
#### **Replacing Missing Values with a Specific Value**
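```python
print(df.fillna(0))
# Output:
#       Name  Score
# 0     Anna   85.0
# 1      Bob   88.0
# 2  Charlie    0.0
# 3    David   92.0
# 4        0   90.0
```
"Charlie"'s score is replaced with `0`. Because `fillna(0)` applies to every column, the missing name in row 4 becomes `0` as well.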

#### **Forward/Backward Fill**
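```python
# Using backward fill
print(df.bfill())
# Output:
#       Name  Score
# 0     Anna   85.0
# 1      Bob   88.0
# 2  Charlie   92.0
# 3    David   92.0
# 4     None   90.0
```
Here, "Charlie"'s score is replaced with the next value (92.0). The missing name in row 4 stays missing because there is no later row to copy from. `df.bfill()` is the current spelling; the older `fillna(method='bfill')` is deprecated in recent Pandas releases.  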
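
Forward fill, the other half of this heading, works the same way but copies the previous row's value. The sketch below assumes the same example data and uses `DataFrame.ffill()`.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
    'Score': [85, 88, None, 92, 90],
})

# Forward fill: each gap takes the value from the row before it
filled = df.ffill()
print(filled)
```

Charlie's score becomes 88.0 (Bob's), and the missing name in row 4 becomes "David". Copying a neighbour's name is rarely meaningful, so forward and backward fill suit ordered data such as time series better than identifiers.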

#### **Replacing Missing Values with the Mean**
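```python
mean_score = df['Score'].mean()
# Pass a dict so that only the Score column is filled
print(df.fillna({'Score': mean_score}))
# Output:
#       Name  Score
# 0     Anna  85.00
# 1      Bob  88.00
# 2  Charlie  88.75
# 3    David  92.00
# 4     None  90.00
```
"Charlie"'s score is replaced with the average of the four recorded scores: (85 + 88 + 92 + 90) / 4 = 88.75. Passing a dictionary limits the fill to `Score`; a bare `df.fillna(mean_score)` would also write 88.75 into the missing name.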

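The mean used as a fill value is computed from the recorded scores only, because `mean()` skips missing entries by default. The short check below is a sketch on the same scores; `manual_mean` is just an illustrative name.

```python
import pandas as pd

# The Score column from the example, with Charlie's score missing
scores = pd.Series([85, 88, None, 92, 90])

# Average of the four recorded scores, worked out by hand
manual_mean = (85 + 88 + 92 + 90) / 4

print(scores.mean())              # missing entries are skipped by default
print(manual_mean)                # same value: 88.75
print(scores.mean(skipna=False))  # NaN when missing entries are not skipped
```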
---

## Summary
🎯 **Key Takeaways:**
1. Use `isnull()` and `notnull()` to identify missing values.  
2. Handle missing values by **removing rows** or **filling with specific/calculated values**.  
3. Choose a strategy based on the dataset's nature and analysis goals.

✨ **Practice Makes Perfect!**  
Try these techniques hands-on to better understand and master data analysis. You're doing great—keep going! 🚀


## Filling the Academic Gaps

Notice how missing values can skew our understanding of a dataset? Here we're examining students' Math and English grades, but some entries are missing. The provided code fills the missing Math grade with the mean Math score. Click Run to see how the data changes!

```python
import pandas as pd

# Create dataset
grades = pd.DataFrame({'Student': ['Alice', 'Bob', 'Eve', 'Jack'], 'Math': [92, None, 87, 78], 'English': [None, 85, 88, 90]})

# Fill the missing Math grades with the mean Math grade
grades['Math'] = grades['Math'].fillna(grades['Math'].mean())
print(grades)
```

### Filling Missing Values with the Mean: A Practical Example

#### Observing the Impact of Missing Values
Missing data can distort our interpretation of a dataset. In this example, we're analyzing students' Math and English grades, but some entries are missing, which could obscure trends.

Replacing the missing Math grade with the **mean Math score** gives a complete `Math` column whose average matches that of the recorded grades. The filled value is an estimate, not an observed grade.

---

#### Code Example: Filling Missing Math Grades
```python
import pandas as pd

# Create dataset
grades = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Eve', 'Jack'], 
    'Math': [92, None, 87, 78], 
    'English': [None, 85, 88, 90]
})

# Fill missing Math grades with the mean Math score
grades['Math'] = grades['Math'].fillna(grades['Math'].mean())
print(grades)
```

---

#### Explanation of the Code
1. **Dataset Creation:**
   - The `grades` DataFrame contains students' grades for Math and English, with some missing values (`None`).

2. **Handling Missing Values:**
   - The `fillna()` function replaces missing values in the `Math` column with the **mean Math score**.

---

#### Output
After filling the missing values:
```
  Student       Math  English
0   Alice  92.000000      NaN
1     Bob  85.666667     85.0
2     Eve  87.000000     88.0
3    Jack  78.000000     90.0
```

- Bob's missing Math grade is replaced with the mean: **85.67** (calculated as \( \frac{92 + 87 + 78}{3} \)).
- Alice's English grade is still `NaN`, because only the `Math` column was filled.

---

#### Takeaway
Filling gaps with the mean gives a complete column without shifting its average. Try running the code to observe how the data changes!

## Median Touch-Up in Gradebook Data

Astounding efforts, Space Explorer! For this operation, you're tasked with modifying the `starter_solution.py` code to handle `NaN` values differently. Currently, it fills missing scores with the mean; your objective is to replace that approach with a constant value, specifically 60. This small yet crucial change will enhance your understanding of different methods for managing missing data.

```python
import pandas as pd

grades = pd.DataFrame({'Student': ['Alex', 'Tina', 'John', 'Lily'],
                       'Math': [78, 82, None, 90],
                       'Science': [87, None, 93, 88]})

# TODO: Fill NaN values with a constant value, 60 in this case
# numeric_only=True skips the Student column, which has no mean
grades_filled = grades.fillna(grades.mean(numeric_only=True))

print(grades_filled)
```

Here’s the updated version of your code to replace missing values (`NaN`) with a constant value of **60**:

```python
import pandas as pd

# Create the dataset
grades = pd.DataFrame({
    'Student': ['Alex', 'Tina', 'John', 'Lily'],
    'Math': [78, 82, None, 90],
    'Science': [87, None, 93, 88]
})

# Fill NaN values with the constant value 60
grades_filled = grades.fillna(60)

print(grades_filled)
```

---

### Explanation of Changes
- The starter code filled each gap with that column's mean, computed by `DataFrame.mean()`.
- To meet the new requirement, `fillna()` now receives the constant `60` instead: `grades.fillna(60)`.

---

### Output
After running the modified code, the output will look like this:
```
  Student  Math  Science
0    Alex  78.0     87.0
1    Tina  82.0     60.0
2    John  60.0     93.0
3    Lily  90.0     88.0
```

- Missing values in **Math** and **Science** are replaced with **60**. Keep in mind that 60 is a placeholder rather than a recorded grade, so it pulls each column's average toward 60.

This adjustment highlights the flexibility of managing missing values using different strategies! 🚀
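
To make the difference between the two strategies concrete, the sketch below compares how each one affects the average of the `Math` column from this exercise. The variable names are illustrative only.

```python
import pandas as pd

# Math grades from the exercise, with John's grade missing
math = pd.Series([78, 82, None, 90], name='Math')

filled_constant = math.fillna(60)         # constant placeholder
filled_mean = math.fillna(math.mean())    # mean of the recorded grades

print(math.mean())             # average of the three recorded grades
print(filled_constant.mean())  # 77.5: pulled down by the placeholder 60
print(filled_mean.mean())      # same as the recorded average
```

Filling with the mean leaves the column average where it was, while a constant moves it toward that constant. Which behaviour is preferable depends on what the filled value is supposed to represent.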

## Filling in the Blanks: Average Grade Calculation

Excellent work keeping up, Space Voyager!

Here's a piece of code designed to handle missing data in a DataFrame of students and their grades. Unfortunately, the code doesn't seem to be functioning as intended. Can you identify the error and correct it? The objective is to fill in the missing grades with the class's average grade.

```python
import pandas as pd

grades = pd.DataFrame({
    'Student': ['Alex', 'Taylor', 'Jordan', 'Morgan'],
    'Grade': [90, None, 85, None]
})

average_grade = grades['Grade'].mean()
grades.fillna(average_grade)
print(grades)
```

The problem is that the result of `grades.fillna(average_grade)` is never assigned to anything, so `grades` is printed unchanged. In Pandas, most operations, `fillna()` included, return a new object rather than modifying the original, unless the `inplace=True` argument is specified.

Here's the corrected code:

```python
import pandas as pd

grades = pd.DataFrame({
    'Student': ['Alex', 'Taylor', 'Jordan', 'Morgan'],
    'Grade': [90, None, 85, None]
})

# Calculate the average grade
average_grade = grades['Grade'].mean()

# Fill missing grades with the average and assign back to the DataFrame
grades['Grade'] = grades['Grade'].fillna(average_grade)

print(grades)
```

---

### Explanation of Fixes
1. **Calculate the Mean:**
   - `grades['Grade'].mean()` computes the average of non-missing grades.

2. **Assign the Result Back:**
   - The `fillna()` method returns a new object and leaves the original unchanged unless the result is assigned back.
   - The line `grades['Grade'] = grades['Grade'].fillna(average_grade)` stores the filled column in the `grades` DataFrame.
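
The behaviour described above can be checked directly by counting the gaps before and after the assignment. This is a small sketch using the same data; `filled`, `missing_before`, and `missing_after` are illustrative names.

```python
import pandas as pd

grades = pd.DataFrame({
    'Student': ['Alex', 'Taylor', 'Jordan', 'Morgan'],
    'Grade': [90, None, 85, None]
})
average_grade = grades['Grade'].mean()

# fillna() returns a new Series; grades itself is not touched yet
filled = grades['Grade'].fillna(average_grade)
missing_before = int(grades['Grade'].isnull().sum())

# Assigning the result back is what actually updates the DataFrame
grades['Grade'] = filled
missing_after = int(grades['Grade'].isnull().sum())

print(missing_before, missing_after)
```

The count stays at 2 until the filled column is assigned back, and drops to 0 afterwards.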

---

### Correct Output
When the fixed code is executed, the output will be:

```
  Student  Grade
0    Alex   90.0
1  Taylor   87.5
2  Jordan   85.0
3  Morgan   87.5
```

- Missing grades (`None`) are replaced with the class's average grade: **87.5**.

## Identifying Missing Values in Student Scores Dataset

Hey there, Galactic Pioneer!

Let's pinpoint where our data is missing. Complete the necessary pieces of code to reveal where the values are missing in our dataset of student scores. Use one specific function that we learnt in this lesson. You're up for this challenge!

```python
import pandas as pd

data = {'Student': ['Eve', 'Frank', 'Grace', 'Hank'],
        'Math_Score': [90, None, 77, None],
        'Science_Score': [None, 85, 79, 88]}
df = pd.DataFrame(data)

# TODO: Show where the missing values are in the dataset
print(df.____())
```

Here's the completed code to identify where the missing values are in the dataset. It uses the `isnull()` function, which returns `True` for missing values and `False` otherwise:

```python
import pandas as pd

data = {'Student': ['Eve', 'Frank', 'Grace', 'Hank'],
        'Math_Score': [90, None, 77, None],
        'Science_Score': [None, 85, 79, 88]}
df = pd.DataFrame(data)

# Show where the missing values are in the dataset
print(df.isnull())
```

---

### Explanation
- **`isnull()` Function:** This function checks each cell in the DataFrame and returns `True` if the value is missing (`NaN`) and `False` otherwise.

---

### Output
Running the code will produce:
```
   Student  Math_Score  Science_Score
0    False       False           True
1    False        True          False
2    False       False          False
3    False        True          False
```

- Missing values in `Math_Score` and `Science_Score` are clearly indicated with `True`, helping us pinpoint gaps in the dataset.
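
The boolean table can also be used to pull out the affected rows. The following sketch goes one step beyond the exercise and assumes the same `df`; `rows_with_gaps` is an illustrative name.

```python
import pandas as pd

data = {'Student': ['Eve', 'Frank', 'Grace', 'Hank'],
        'Math_Score': [90, None, 77, None],
        'Science_Score': [None, 85, 79, 88]}
df = pd.DataFrame(data)

# any(axis=1) is True for a row if at least one of its cells is missing
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```

Only Grace has both scores recorded, so the result contains the rows for Eve, Frank, and Hank.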

## Filling the Void: Handling Missing Grades

Bravo, Space Voyager! You've learned to detect missing values. Now, can you replace them with the median score? To do this, use the `median()` method. It's time to calculate and fill in the gaps in our students' grades data.

```python
import pandas as pd

student_data = pd.DataFrame({
    'Student': ['Alice', 'Ben', 'Chris', 'Dora', 'Albert', 'Zoe'],
    'Grade': [90, None, 85, None, 90, 95]
})
# TODO: Calculate the median grade for the current student data.
# TODO: Fill in missing grade values using the calculated median.
print(student_data)
```

Here's the completed code to replace missing grades with the median score:

```python
import pandas as pd

student_data = pd.DataFrame({
    'Student': ['Alice', 'Ben', 'Chris', 'Dora', 'Albert', 'Zoe'],
    'Grade': [90, None, 85, None, 90, 95]
})

# Calculate the median grade for the current student data
median_grade = student_data['Grade'].median()

# Fill in missing grade values using the calculated median
student_data['Grade'] = student_data['Grade'].fillna(median_grade)

print(student_data)
```

---

### Explanation
1. **Calculate the Median:**
   - `student_data['Grade'].median()` computes the median of the existing grades, ignoring `None` values.

2. **Fill Missing Values:**
   - `fillna(median_grade)` replaces the `None` values in the `Grade` column with the calculated median.

---

### Output
When you run the code, the output will be:
```
  Student  Grade
0   Alice   90.0
1     Ben   90.0
2   Chris   85.0
3    Dora   90.0
4  Albert   90.0
5     Zoe   95.0
```

- The median grade (90.0) has replaced the missing values for Ben and Dora. 🎉
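
In this dataset the mean and the median of the recorded grades are both 90, so the choice makes no visible difference. The sketch below uses hypothetical grades, not taken from the lesson, to show where the two diverge: a single very low score drags the mean down but barely moves the median.

```python
import pandas as pd

# Hypothetical grades (not from the lesson): one outlier and one gap
grades = pd.Series([90, 85, None, 90, 20], name='Grade')

mean_grade = grades.mean()      # (90 + 85 + 90 + 20) / 4
median_grade = grades.median()  # middle of 20, 85, 90, 90

print(mean_grade, median_grade)

# Fill the gap with the median, which is less affected by the outlier
filled = grades.fillna(median_grade)
print(filled.tolist())
```

Here the mean is 71.25 while the median is 87.5, which is why the median is often preferred when a column contains extreme values.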

## Cleaning Up the Data Galaxy

Keep soaring, Stellar Navigator! Now, you have the opportunity to practice dealing with missing pieces in our data journey. Follow the TODOs to help create a clean dataset, free of gaps that could skew our analysis.

```python
import pandas as pd

classroom = pd.DataFrame({
    'Student': ['Alice', 'Brian', 'Caitlin', None, 'Eric'],
    'Grade': [90, 95, None, 88, 78]
})
# TODO: Remove all rows from the classroom DataFrame where data is missing.
```

Here's the completed code to remove rows with missing data from the `classroom` DataFrame:

```python
import pandas as pd

classroom = pd.DataFrame({
    'Student': ['Alice', 'Brian', 'Caitlin', None, 'Eric'],
    'Grade': [90, 95, None, 88, 78]
})

# Remove all rows from the classroom DataFrame where data is missing
classroom_cleaned = classroom.dropna()

print(classroom_cleaned)
```

---

### Explanation
1. **`dropna()` Function:**
   - This method removes any rows containing missing values (`None` or `NaN`) in the DataFrame.

2. **Result Assigned to `classroom_cleaned`:**
   - The result of `dropna()` is assigned to a new variable, ensuring the original DataFrame remains intact unless explicitly overwritten.

---

### Output
When you run the code, the output will be:
```
  Student  Grade
0   Alice   90.0
1   Brian   95.0
4    Eric   78.0
```

- Rows with missing values (e.g., for Caitlin and the unnamed student) have been removed, resulting in a clean dataset. 🚀
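
Notice that the remaining rows keep their original index labels (0, 1, 4). If a continuous index is preferred, one option is to chain `reset_index(drop=True)` after `dropna()`. This is an optional extra step, sketched on the same data.

```python
import pandas as pd

classroom = pd.DataFrame({
    'Student': ['Alice', 'Brian', 'Caitlin', None, 'Eric'],
    'Grade': [90, 95, None, 88, 78]
})

# Drop incomplete rows, then renumber the index from 0
classroom_cleaned = classroom.dropna().reset_index(drop=True)
print(classroom_cleaned)

# The original DataFrame still has all five rows
print(len(classroom))
```

`drop=True` discards the old index instead of keeping it as a new column.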