# Unit 1 Handling Missing Values

# Lesson Introduction
Hello there! Today, we're going to talk about handling **missing values** in datasets for machine learning. Why is this important? Imagine you are building a model to predict house prices, but some houses are missing information about their size or the number of bedrooms. These missing values can affect the performance of your model. In this lesson, you'll learn why data might be missing, different ways to handle it, and how to use Python libraries to do so.

By the end of this lesson, you'll know why handling missing values is crucial, understand different strategies to deal with them, and be able to use Python tools to handle missing data efficiently.

---

## Understanding Missing Values
Why does data go missing? There are many reasons:

* **Human Error**: Sometimes, people forget to fill in all the fields when entering data.
* **System Error**: Occasionally, the system that collects the data might have problems.
* **Other Reasons**: Data may be intentionally left out for privacy reasons.

There are three common types of missing data:

* **MCAR** (Missing Completely at Random): The data is missing randomly without any pattern.
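* **MAR** (Missing at Random): Whether a value is missing depends on other, observed columns, but not on the missing value itself. For example, older respondents may skip an income question more often, regardless of their income.
* **MNAR** (Missing Not at Random): Whether a value is missing depends on the unobserved value itself. For example, people with very high incomes may be less likely to report them.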
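
To make the difference concrete, here is a small simulation sketch. The income figures, the 20% missing rate, and the variable names are all made up for illustration. It hides values in two ways and compares the mean of what remains with the true mean.

```python
import numpy as np
import pandas as pd

# Synthetic incomes (illustrative numbers only)
rng = np.random.default_rng(0)
income = pd.Series(rng.normal(50_000, 10_000, size=1_000))

# MCAR: every value has the same 20% chance of going missing
mcar = income.mask(rng.random(income.size) < 0.2)

# MNAR: the highest 20% of incomes are the ones that go unreported
mnar = income.mask(income > income.quantile(0.8))

print(f"True mean: {income.mean():.0f}")
print(f"MCAR mean: {mcar.mean():.0f}")
print(f"MNAR mean: {mnar.mean():.0f}")
```

In this sketch the MCAR mean stays close to the true mean, while the MNAR mean is pulled down because the missing values are systematically the large ones.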

---

## Strategies for Handling Missing Values
Handling missing values can be done in several ways:

* If the missing data is a small percentage, you might just **delete those rows or columns**. But be careful: if you remove too much data, you might lose important information.
* You can also **replace the missing values** with a statistic computed from the column, such as the mean, median, or mode, or with a fixed constant. This is called imputation, and it is often preferable to deletion because it keeps every row and column in the dataset.
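
Before choosing between deletion and imputation, it helps to see how much is actually missing. The following is a minimal sketch on a made-up dataset (the column names and values are assumptions for illustration) that computes the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical house data with gaps
df = pd.DataFrame({
    'Size': [1500, np.nan, 1700, 1400, 1600],
    'Bedrooms': [3, 2, np.nan, np.nan, np.nan],
})

# isna() marks missing cells as True; the mean of booleans is the fraction missing
missing_share = df.isna().mean()
print(missing_share)
```

A column with only a small share of gaps is a reasonable candidate for dropping the affected rows, while a column that is mostly empty may be better removed or imputed with care.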

---

## Dropping Missing Values: Part 1
Dropping missing values is easy and straightforward with pandas dataframes. Let's recall it quickly.

Let's consider this simple dataset:

```python
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
        'Score': [85, 88, None, 92, 90]}
df = pd.DataFrame(data)
```

Let's remove rows with `None` values using the `dropna()` function:

```python
print(df.dropna())
```

The output is:

```
    Name  Score
0   Anna   85.0
1    Bob   88.0
3  David   92.0
```

Charlie's row is removed because its `Score` is missing, and the last row is removed because its `Name` is missing.

---

## Dropping Missing Values: Part 2
To scan only specific columns for missing values with `dropna()`, you can use the `subset` argument to specify which columns to check for missing values. Here's an example:

```python
# Drop rows where 'Score' column has missing values
print(df.dropna(subset=['Score']))
```

The output is:

```
    Name  Score
0   Anna   85.0
1    Bob   88.0
3  David   92.0
4   None   90.0
```

As you can see, the row with index 4 is not removed. Although it has a missing value in the `Name` column, this time we only remove rows with a missing `Score`.
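
`dropna()` can also remove columns instead of rows. As a hedged sketch, the example below adds a hypothetical, mostly empty `Notes` column to the dataset and uses the `axis` and `thresh` arguments to keep only columns with at least four non-missing values.

```python
import pandas as pd

# Same data as above, plus a hypothetical, mostly empty 'Notes' column
df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Charlie', 'David', None],
    'Score': [85, 88, None, 92, 90],
    'Notes': [None, None, None, None, 'late entry'],
})

# axis=1 drops columns; thresh=4 keeps columns with at least 4 non-missing values
trimmed = df.dropna(axis=1, thresh=4)
print(trimmed.columns.tolist())
```

Here `Name` and `Score` each have four non-missing values and are kept, while `Notes` has only one and is dropped.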

---

## Using `scikit-learn` to Impute Missing Values: Part 1
One of the easiest ways to handle missing values in Python is by using the `SimpleImputer` class from the `sklearn.impute` module. Let's break it down.

The `SimpleImputer` has a few strategies you can use:

* **mean**: Replaces missing values with the mean of each column.
* **median**: Replaces missing values with the median of each column.
* **most_frequent**: Replaces missing values with the most frequent value in each column.
* **constant**: Replaces missing values with a constant value you provide.

Let's walk through some code that handles missing values using the `SimpleImputer`.

First, we need a dataset. We'll use the `pandas` library to create one with some missing values.

```python
import numpy as np
import pandas as pd

# Create a sample dataset with missing values
data = {
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [7, 6, 5, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```

Output:

```
Original DataFrame:
   Feature1  Feature2
0       1.0       7.0
1       2.0       6.0
2       NaN       5.0
3       4.0       NaN
```

Note that we use `np.nan` here instead of `None`. `None` is Python's built-in singleton for "no value" and can stand in for a missing entry of any type, while `np.nan` is a floating-point "Not a Number" value from `numpy`, used for missing numeric data. `None` is not tied to any library, but arithmetic on it raises a `TypeError` unless you handle it explicitly. In contrast, `np.nan` propagates through vectorized operations in `numpy` and `pandas` without raising, which makes it the better fit for missing numerical values. When `pandas` builds a numeric column, it converts `None` to `NaN` automatically, which is why the `Score` column in the earlier example printed as floats.
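
A short sketch of how the two values behave in practice (the `scores` name is just for illustration):

```python
import numpy as np
import pandas as pd

# NaN is never equal to anything, including itself, so == cannot detect it
print(np.nan == np.nan)

# pd.isna() recognizes both None and np.nan as missing
print(pd.isna(None), pd.isna(np.nan))

# In a numeric Series, pandas stores None as NaN and uses a float dtype
scores = pd.Series([85, 88, None, 92])
print(scores.dtype)
print(scores.isna().sum())
```

Because of the equality quirk, use `pd.isna()` or `DataFrame.isna()` to look for missing values rather than comparing with `==`.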

---

## Using `scikit-learn` to Impute Missing Values: Part 2
Here, we use the `SimpleImputer` from `sklearn.impute` to handle the missing values. In this case, we'll use the **mean** strategy, meaning the missing values are replaced with the mean value of the corresponding column. Note that missing values won't be taken into account when calculating the mean.

```python
from sklearn.impute import SimpleImputer

# Handling missing values
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)
print("Imputed Data:")
print(imputed_data)
```

Output:

```
Imputed Data:
[[1.         7.        ]
 [2.         6.        ]
 [2.33333333 5.        ]
 [4.         6.        ]]
```
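
If you want to check which values the imputer will fill in, a fitted `SimpleImputer` stores them in its `statistics_` attribute. The sketch below rebuilds the same small dataset and compares those values with the column means that `pandas` computes (which also skip `NaN` by default).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [7, 6, 5, np.nan],
})

imputer = SimpleImputer(strategy='mean')
imputer.fit(df)

# One learned fill value per column, in column order
print(imputer.statistics_)

# pandas skips NaN when averaging, so these should match
print(df.mean().to_numpy())
```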

---

## Converting the `NumPy` Array Back to a `DataFrame`
The result of the imputation is a `NumPy` array. Let's convert it back to a `DataFrame` for better readability.

```python
# Convert the numpy array back to a DataFrame for better readability
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("DataFrame after handling missing values:")
print(imputed_df)
```

Output:

```
DataFrame after handling missing values:
   Feature1  Feature2
0  1.000000       7.0
1  2.000000       6.0
2  2.333333       5.0
3  4.000000       6.0
```

Notice how we use `df.columns` to assign the same column names we had before.
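
The row labels are lost in the same way as the column names. With the default `0, 1, 2, ...` index this goes unnoticed, but if your DataFrame has a custom index you may want to pass `index=df.index` as well. A minimal sketch, using made-up string labels:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same values as before, but with a custom (hypothetical) index
df = pd.DataFrame(
    {'Feature1': [1, 2, np.nan, 4], 'Feature2': [7, 6, 5, np.nan]},
    index=['a', 'b', 'c', 'd'],
)

imputed_data = SimpleImputer(strategy='mean').fit_transform(df)

# Restore both the column names and the row labels
imputed_df = pd.DataFrame(imputed_data, columns=df.columns, index=df.index)
print(imputed_df)
```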

---

## Using `scikit-learn` to Impute Missing Values for Specific Columns
Sometimes, you may want to impute only specific columns in your dataset. You can achieve this by selecting those columns and applying the `SimpleImputer` to them. Here's how you can do it.

Let's use the same dataset that we created earlier:

```python
from sklearn.impute import SimpleImputer

# Select the column to impute
feature1 = df[['Feature1']]

# Create the SimpleImputer instance
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
feature1_imputed = imputer.fit_transform(feature1)

# Update the DataFrame
df['Feature1'] = feature1_imputed
print("DataFrame after imputing Feature1:")
print(df)
```

Output:

```
DataFrame after imputing Feature1:
   Feature1  Feature2
0  1.000000       7.0
1  2.000000       6.0
2  2.333333       5.0
3  4.000000       NaN
```

In this example, the missing value in `Feature1` is replaced by the mean of the other values in that column. The `Feature2` column remains unchanged. This approach allows you to target specific columns that need imputation while leaving others untouched.

In the same manner, you can impute values into any subset of columns.
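
For a quick one-off fix on a single column, plain `pandas` can do the same job without scikit-learn. This is an alternative sketch, not a replacement for `SimpleImputer`: it uses `fillna()` with the column mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Feature1': [1, 2, np.nan, 4],
    'Feature2': [7, 6, 5, np.nan],
})

# Fill only Feature1 with its own mean; Feature2 is left untouched
df['Feature1'] = df['Feature1'].fillna(df['Feature1'].mean())
print(df)
```

The advantage of `SimpleImputer` is that it remembers the fitted value, so the same fill can later be applied to new data.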

---

## Lesson Summary
Great job! 🎉 You've learned why handling missing values is crucial, discovered different strategies to tackle missing data, and practiced using `SimpleImputer` to handle missing values in a sample dataset. Missing data is a common issue, but now you have the tools to manage it and improve the quality of your datasets.

Now that you've learned the theory, it's time to get hands-on practice! In the practice session, you'll handle missing values in various datasets, experimenting with different imputation strategies, and observing the outcomes. This practice will help solidify your understanding and make you more confident in managing missing data for your machine learning projects. Let's get started! 🚀

## Impute Missing Values Using Median Strategy

Great job so far, Space Explorer!

Let's keep going with our house prices dataset. Modify the code to change the `SimpleImputer` strategy from `mean` to `median` for handling missing values.

Let's code!

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample dataset for house prices with missing values
data = {
    'Size': [1500, 1600, np.nan, 1400],
    'Price': [300000, 340000, 320000, np.nan]
}
df = pd.DataFrame(data)

# Handling missing values using mean strategy
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)

# Convert the numpy array back to a DataFrame for readability
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print(imputed_df)
```

Solution, with the strategy changed to `median`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample dataset for house prices with missing values
data = {
    'Size': [1500, 1600, np.nan, 1400],
    'Price': [300000, 340000, 320000, np.nan]
}
df = pd.DataFrame(data)

# Handling missing values using median strategy
imputer = SimpleImputer(strategy='median') # Changed strategy to 'median'
imputed_data = imputer.fit_transform(df)

# Convert the numpy array back to a DataFrame for readability
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print(imputed_df)

```

## Fixing Missing Data in House Prices Dataset

Now that you've learned how to handle missing values, here's a small challenge for you.

The provided code contains an error that prevents it from running. Identify and resolve the issue so that the missing values in the given dataset are handled correctly.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Create a sample dataset for predicting house prices
data = {
    'Size': [1400, 1600, np.nan, 1800],
    'Bedrooms': [3, np.nan, 3, 4],
    'Price': [300000, 320000, 330000, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Impute missing values with the mean strategy
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.transform(df)

# Convert back to DataFrame for better readability
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("DataFrame after handling missing values:")
print(imputed_df)

```

The error in the provided code is that the `SimpleImputer`'s `transform` method is called without first calling `fit`. The `fit` method computes the statistic for the chosen strategy (e.g., the mean, median, or most frequent value of each column) from the data it is given. Without `fit`, `transform` has no values to impute with, so scikit-learn raises a `NotFittedError`.

Here's the corrected code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Create a sample dataset for predicting house prices
data = {
    'Size': [1400, 1600, np.nan, 1800],
    'Bedrooms': [3, np.nan, 3, 4],
    'Price': [300000, 320000, 330000, np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Impute missing values with the mean strategy
imputer = SimpleImputer(strategy='mean')
# The fix: Use fit_transform instead of just transform,
# or call fit() before transform(). fit_transform is more concise.
imputed_data = imputer.fit_transform(df)

# Convert back to DataFrame for better readability
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print("DataFrame after handling missing values:")
print(imputed_df)
```
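
Keeping `fit` and `transform` as separate steps is useful when you have separate training and test data: you fit on the training set and reuse the learned value on the test set. The sketch below uses two tiny made-up DataFrames (`train` and `test` are illustrative names) and also shows the error raised by an unfitted imputer.

```python
import numpy as np
import pandas as pd
from sklearn.exceptions import NotFittedError
from sklearn.impute import SimpleImputer

train = pd.DataFrame({'Size': [1400, 1600, np.nan, 1800]})
test = pd.DataFrame({'Size': [np.nan, 2000]})

# Calling transform on an unfitted imputer raises NotFittedError
try:
    SimpleImputer(strategy='mean').transform(train)
except NotFittedError as err:
    print("Not fitted yet:", type(err).__name__)

# Learn the mean from the training data only...
imputer = SimpleImputer(strategy='mean')
imputer.fit(train)

# ...and reuse it to fill the gap in the test data
test_imputed = imputer.transform(test)
print(test_imputed)
```

The missing test value is filled with the training mean (1600), not with a statistic computed from the test data.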

## Handle Missing Values in Environmental Measurements Dataset

Hey, Stellar Navigator!

Now, let's take it up a notch. Your task is to fill in the missing pieces of code to handle the missing values in the environmental measurements dataset.

Keep reaching for the stars!

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Environmental measurements dataset with missing values
data = {
    'Temperature': [15, 18, np.nan, 20, 22, np.nan],
    'Humidity': [45, np.nan, 50, np.nan, 55, 60]
}
df = pd.DataFrame(data)

# TODO: Create SimpleImputer instance with most_frequent strategy

# TODO: Impute the missing values in the dataframe

# Convert back to DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print(imputed_df)
```

One possible solution:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Environmental measurements dataset with missing values
data = {
    'Temperature': [15, 18, np.nan, 20, 22, np.nan],
    'Humidity': [45, np.nan, 50, np.nan, 55, 60]
}
df = pd.DataFrame(data)

# Create a SimpleImputer instance with the most_frequent strategy
imputer = SimpleImputer(strategy='most_frequent')

# Impute the missing values in the DataFrame
imputed_data = imputer.fit_transform(df)

# Convert back to DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print(imputed_df)
```
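
One detail worth checking in this dataset: every observed value appears exactly once, so no value is actually "most frequent". According to the scikit-learn documentation, ties are resolved by taking the smallest value. The sketch below inspects the learned fill values to see this.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Temperature': [15, 18, np.nan, 20, 22, np.nan],
    'Humidity': [45, np.nan, 50, np.nan, 55, 60],
})

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(df)

# Fill value learned for each column
print(dict(zip(df.columns, imputer.statistics_)))
```

With all values tied, the gaps are filled with the column minimums (15 and 45), which may not be what you want for continuous measurements. The `most_frequent` strategy is usually a better fit for categorical or repeated discrete values.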

## Handling Missing Values in House Prices Dataset

Alright, Space Voyager! It's time to handle missing values.

This time, we will use the constant strategy to fill the missing values with a constant value. Follow the steps provided in the comments to complete the code. Let's make sure our data is prepared for machine learning models!


```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create a sample dataset with missing values for house features
data = {
    'Bedrooms': [3, 4, np.nan, 2],
    'Bathrooms': [2, 3, 2, np.nan],
    'Garage': [1, 2, np.nan, 1]
}
df = pd.DataFrame(data)

# TODO: Use SimpleImputer with strategy='constant' and fill_value=1 to handle missing values

# TODO: Apply the imputer to the DataFrame to fill missing values

# TODO: Convert the result back to a DataFrame and print it
```

One possible solution:

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create a sample dataset with missing values for house features
data = {
    'Bedrooms': [3, 4, np.nan, 2],
    'Bathrooms': [2, 3, 2, np.nan],
    'Garage': [1, 2, np.nan, 1]
}
df = pd.DataFrame(data)

# Use SimpleImputer with strategy='constant' and fill_value=1 to handle missing values
imputer = SimpleImputer(strategy='constant', fill_value=1)

# Apply the imputer to the DataFrame to fill missing values
imputed_data = imputer.fit_transform(df)

# Convert the result back to a DataFrame and print it
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
print(imputed_df)
```
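
A single `fill_value` is applied to every column. If different columns need different constants, one option is the `pandas` `fillna()` method with a dictionary. In the sketch below, the specific constants (for example, treating a missing `Garage` as 0) are illustrative assumptions, not part of the exercise.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Bedrooms': [3, 4, np.nan, 2],
    'Bathrooms': [2, 3, 2, np.nan],
    'Garage': [1, 2, np.nan, 1],
})

# Hypothetical per-column constants: column name -> value used for its gaps
fill_values = {'Bedrooms': 1, 'Bathrooms': 1, 'Garage': 0}
filled_df = df.fillna(value=fill_values)
print(filled_df)
```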

## Handling Missing Values for House Prices and Features

Great progress so far! Let's handle missing values for house prices and features.

Your task is to fill in the missing pieces to impute values for the `HousePrice` and `NumRooms` columns, and then drop rows with missing values in the `HouseSize` column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Create a sample dataset with missing values
data = {
    'HousePrice': [300000, 450000, 600000, np.nan],
    'HouseSize': [1200, np.nan, 1500, 2000],
    'NumRooms': [np.nan, 3, 4, 5]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# TODO: Impute missing values for the columns HousePrice and NumRooms using corresponding mean
# TODO: Drop rows with missing values for HouseSize column

print("DataFrame after handling missing values:")
print(df)

```

Okay, Space Voyager! Let's get this done. To impute specific columns and then drop rows based on another column's missing values, we'll apply the `SimpleImputer` only to the desired columns and then use `dropna()` with the `subset` argument.

Here's the completed code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Create a sample dataset with missing values
data = {
    'HousePrice': [300000, 450000, 600000, np.nan],
    'HouseSize': [1200, np.nan, 1500, 2000],
    'NumRooms': [np.nan, 3, 4, 5]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Impute missing values for HousePrice and NumRooms using each column's mean.
# For several columns, pass the DataFrame subset directly; a single Series
# would first need .values.reshape(-1, 1) because the imputer expects 2D input.
imputer_mean = SimpleImputer(strategy='mean')
df[['HousePrice', 'NumRooms']] = imputer_mean.fit_transform(df[['HousePrice', 'NumRooms']])

# Drop rows with missing values in the HouseSize column,
# using the 'subset' argument to restrict the check to that column
df.dropna(subset=['HouseSize'], inplace=True)


print("\nDataFrame after handling missing values:")
print(df)
```
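
Note that the order of the two steps matters. Imputing first computes the means from all four rows, while dropping first removes a row before the means are computed. The following sketch runs both orders on the same data for comparison (`impute_first` and `drop_first` are illustrative names).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = {
    'HousePrice': [300000, 450000, 600000, np.nan],
    'HouseSize': [1200, np.nan, 1500, 2000],
    'NumRooms': [np.nan, 3, 4, 5],
}
cols = ['HousePrice', 'NumRooms']

# Order A: impute using all rows, then drop rows without HouseSize
impute_first = pd.DataFrame(data)
impute_first[cols] = SimpleImputer(strategy='mean').fit_transform(impute_first[cols])
impute_first = impute_first.dropna(subset=['HouseSize'])

# Order B: drop rows without HouseSize, then impute using the remaining rows
drop_first = pd.DataFrame(data).dropna(subset=['HouseSize']).copy()
drop_first[cols] = SimpleImputer(strategy='mean').fit_transform(drop_first[cols])

print(impute_first)
print(drop_first)
```

In this dataset the imputed `NumRooms` value in the first row is 4.0 when imputing first and 4.5 when dropping first, because the dropped row no longer contributes to the mean.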