
# **Day 32 — Handling Missing Values and Outliers**



### **Why Address Missing Values and Outliers?**

Missing values and outliers can skew analysis, lead to incorrect conclusions, and make it difficult to build reliable models.
By handling them properly, you can:
- **Improve Data Quality**: Ensure that your dataset is complete, clean, and ready for analysis.
- **Prevent Bias**: Avoid drawing conclusions from data that is incomplete in ways that may not be random.
- **Handle Edge Cases**: Address outliers that can distort results or introduce noise into the analysis.


### **Handling Missing Values**


Missing values occur when data points are absent from the dataset.
They can arise for various reasons, such as data entry errors, sensor failures, or incomplete records. Common strategies for handling missing values include:
- **Removing Missing Data**: Dropping rows or columns with missing values.
- **Imputation**: Filling in missing data with calculated values, such as the mean, median, or mode.
- **Flagging**: Creating a new column to indicate where data is missing.
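Before choosing one of these strategies, it usually helps to see how much data is missing in each column. The sketch below uses a hypothetical `readings` DataFrame (the column names and values are made up for illustration) to count missing values and then fill one column with its median, which is less sensitive to extreme values than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; the values are invented for illustration.
readings = pd.DataFrame({
    'temperature': [21.5, np.nan, 22.1, 95.0, np.nan],
    'humidity': [40.0, 42.0, np.nan, 41.0, 43.0],
})

# Count missing values per column before deciding how to handle them.
missing_counts = readings.isna().sum()
print(missing_counts)

# Fill only the temperature column, using its median rather than its mean,
# so the extreme reading of 95.0 does not pull the fill value upward.
temperature_median = readings['temperature'].median()
readings_filled = readings.fillna({'temperature': temperature_median})
print(readings_filled)
```

Passing a dictionary to `fillna` leaves the other columns untouched, so `humidity` still contains its missing value here.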


### **Handling Outliers**


Outliers are extreme values that differ significantly from the majority of the data.
They can occur due to errors, rare events, or natural variability. Common techniques for handling outliers include:
- **Removing Outliers**: Dropping data points that fall outside a defined range.
- **Transforming Outliers**: Applying log or square root transformations to reduce the impact of outliers.
- **Capping**: Limiting outlier values to a predefined maximum or minimum value (a technique closely related to "winsorizing").
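As a rough illustration of the transformation approach, the sketch below applies `np.log1p` to a hypothetical price series containing one extreme value. The numbers are assumptions chosen for the example, not taken from the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value (25000).
prices = pd.Series([50, 300, 800, 1000, 25000], name='Price')

# log1p computes log(1 + x), which compresses large values
# and is also defined when x is 0.
log_prices = np.log1p(prices)

# Compare how far the maximum sits from the median before and after.
ratio_before = prices.max() / prices.median()
ratio_after = log_prices.max() / log_prices.median()
print(f"max/median before: {ratio_before:.2f}")
print(f"max/median after:  {ratio_after:.2f}")
```

The transformation keeps every row and preserves the ordering of values, and `np.expm1` reverses it when the original scale is needed again.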


### **Techniques for Addressing Missing Data and Outliers**


Let's walk through practical techniques for handling missing values and outliers using Pandas.


In [1]:

import pandas as pd
import numpy as np


### **Loading an Incomplete Dataset**

In [2]:

# Creating a sample dataset of product reviews
data = pd.DataFrame({
    'Product': ['Laptop', 'Smartphone', 'Headphones', 'Monitor', 'Keyboard'],
    'Rating': [4.5, None, 3.0, None, 4.2],
    'Review Count': [120, 85, None, 95, 40],
    'Price': [1000, 800, 50, 300, None]
})

# Displaying the dataset
print(data)


      Product  Rating  Review Count   Price
0      Laptop     4.5         120.0  1000.0
1  Smartphone     NaN          85.0   800.0
2  Headphones     3.0           NaN    50.0
3     Monitor     NaN          95.0   300.0
4    Keyboard     4.2          40.0     NaN


### **Handling Missing Values**

In [3]:

# Technique 1: Removing rows with missing data
data_dropped = data.dropna()

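# Technique 2: Imputation (filling missing values with the mean)
# Assign the result back to the column instead of calling fillna(..., inplace=True)
# on a column selection, which is deprecated and will not work in pandas 3.0.
data_imputed = data.copy()
data_imputed['Rating'] = data_imputed['Rating'].fillna(data_imputed['Rating'].mean())
data_imputed['Review Count'] = data_imputed['Review Count'].fillna(data_imputed['Review Count'].mean())
data_imputed['Price'] = data_imputed['Price'].fillna(data_imputed['Price'].mean())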

# Technique 3: Flagging missing data
data_flagged = data.copy()
data_flagged['Price_Missing'] = data_flagged['Price'].isnull()

# Displaying the results
print("Dropped rows:")
print(data_dropped)
print("\nImputed data:")
print(data_imputed)
print("\nFlagged missing values:")
print(data_flagged)


Dropped rows:
  Product  Rating  Review Count   Price
0  Laptop     4.5         120.0  1000.0

Imputed data:
      Product  Rating  Review Count   Price
0      Laptop     4.5         120.0  1000.0
1  Smartphone     3.9          85.0   800.0
2  Headphones     3.0          85.0    50.0
3     Monitor     3.9          95.0   300.0
4    Keyboard     4.2          40.0   537.5

Flagged missing values:
      Product  Rating  Review Count   Price  Price_Missing
0      Laptop     4.5         120.0  1000.0          False
1  Smartphone     NaN          85.0   800.0          False
2  Headphones     3.0           NaN    50.0          False
3     Monitor     NaN          95.0   300.0          False
4    Keyboard     4.2          40.0     NaN           True
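Flagging and imputation are often combined, so that the fact a value was originally missing is not lost once it has been filled. A minimal sketch of that pattern, rebuilding the same sample dataset; the `numeric_cols` list and the `data_prepared` name are introduced here only for illustration:

```python
import pandas as pd

# Same sample dataset of product reviews as above.
data = pd.DataFrame({
    'Product': ['Laptop', 'Smartphone', 'Headphones', 'Monitor', 'Keyboard'],
    'Rating': [4.5, None, 3.0, None, 4.2],
    'Review Count': [120, 85, None, 95, 40],
    'Price': [1000, 800, 50, 300, None]
})

numeric_cols = ['Rating', 'Review Count', 'Price']
data_prepared = data.copy()

# Record where each value was missing before filling anything.
for col in numeric_cols:
    data_prepared[f'{col}_Missing'] = data_prepared[col].isna()

# Fill each numeric column with its own mean in a single call.
# fillna accepts a Series indexed by column name.
column_means = data_prepared[numeric_cols].mean()
data_prepared = data_prepared.fillna(column_means)

print(data_prepared)
```

Creating the flag columns first matters: once the values are filled, `isna()` can no longer tell which rows were imputed.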



### **Detecting and Handling Outliers**

In [4]:

# Identifying outliers using the IQR method
Q1 = data_imputed['Price'].quantile(0.25)
Q3 = data_imputed['Price'].quantile(0.75)
IQR = Q3 - Q1

# Defining outliers as values more than 1.5 * IQR below Q1 or above Q3
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Selecting the rows whose Price falls outside the bounds
outliers = data_imputed[(data_imputed['Price'] < lower_bound) | (data_imputed['Price'] > upper_bound)]
print("Outliers:")
print(outliers)

# Removing the outliers
data_no_outliers = data_imputed[(data_imputed['Price'] >= lower_bound) & (data_imputed['Price'] <= upper_bound)]
print("\nData without outliers:")
print(data_no_outliers)


Outliers:
Empty DataFrame
Columns: [Product, Rating, Review Count, Price]
Index: []

Data without outliers:
      Product  Rating  Review Count   Price
0      Laptop     4.5         120.0  1000.0
1  Smartphone     3.9          85.0   800.0
2  Headphones     3.0          85.0    50.0
3     Monitor     3.9          95.0   300.0
4    Keyboard     4.2          40.0   537.5
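The sample dataset happens to contain no outliers under the IQR rule, so the filter above returns an empty result. To see the rule flag a value, the sketch below adds one hypothetical extreme price (12000) to the same five prices; the `iqr_bounds` helper is a name introduced here for illustration and is not a pandas function.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# The five imputed prices plus one hypothetical extreme value.
prices = pd.Series([1000, 800, 50, 300, 537.5, 12000], name='Price')

lower_bound, upper_bound = iqr_bounds(prices)

# Keep only the values that fall outside the bounds.
outliers = prices[(prices < lower_bound) | (prices > upper_bound)]
print(f"Bounds: {lower_bound:.2f} to {upper_bound:.2f}")
print(outliers)
```

Only the value 12000 falls outside the bounds here. With so few rows the quartiles are sensitive to every value, so results on small samples like this should be read with caution.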


### **Capping Outliers**

In [6]:

# Capping outliers in the Price column
# clip() limits values to the range [lower_bound, upper_bound]
data_capped = data_imputed.copy()
data_capped['Price'] = data_capped['Price'].clip(lower=lower_bound, upper=upper_bound)

print("\nCapped outliers:")
print(data_capped)



Capped outliers:
      Product  Rating  Review Count   Price
0      Laptop     4.5         120.0  1000.0
1  Smartphone     3.9          85.0   800.0
2  Headphones     3.0          85.0    50.0
3     Monitor     3.9          95.0   300.0
4    Keyboard     4.2          40.0   537.5
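Because the sample data has no values outside the IQR bounds, the capped output is identical to the imputed data. The sketch below shows capping taking effect on a series that includes one hypothetical extreme price (12000), using `Series.clip` with bounds computed by the same 1.5 * IQR rule.

```python
import pandas as pd

# The five imputed prices plus one hypothetical extreme value.
prices = pd.Series([1000, 800, 50, 300, 537.5, 12000], name='Price')

# Compute the IQR bounds for this series.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values above upper_bound are set to upper_bound;
# values below lower_bound are set to lower_bound.
capped = prices.clip(lower=lower_bound, upper=upper_bound)
print(capped)
```

Unlike removal, capping keeps all six rows; only the extreme value is pulled back to the upper bound, and the other prices are unchanged.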
