# Day 13 - Handling Missing Data in Pandas


## Why is Handling Missing Data Important?

Missing data can introduce bias, reduce the representativeness of your sample, and lead to inaccurate conclusions. It therefore needs to be handled deliberately, either by filling the gaps with a sensible value or by dropping incomplete records. Pandas provides tools for detecting, filling, dropping, and interpolating missing values so that you can prepare your datasets for analysis.


## Tutorial: Filling and Dropping Missing Values

In [1]:
!pip install pandas



### Detecting Missing Data

In [2]:

import pandas as pd

# Example DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
}
df = pd.DataFrame(data)

# Detect missing values
print("Missing values in the DataFrame:")
print(df.isnull())


Missing values in the DataFrame:
    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False
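
A boolean mask like the one above gets hard to read on larger tables. As a minimal sketch on the same example data, the mask can be summed to count missing values per column, or used to select only the rows that contain a gap. The variable names `missing_per_column` and `rows_with_missing` are illustrative.

```python
import pandas as pd

# Same example data as in the cell above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago'],
})

# True counts as 1, so summing the mask gives the number of missing values per column
missing_per_column = df.isnull().sum()

# any(axis=1) flags every row that has at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```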


### Filling Missing Values

In [3]:

# Filling missing values column by column: the dict maps each column name to its fill value
df_filled = df.fillna({
    'Age': df['Age'].mean(),  # Fill missing ages with the mean age
    'City': 'Unknown'          # Fill missing cities with 'Unknown'
})

print("DataFrame after filling missing values:")
print(df_filled)

DataFrame after filling missing values:
      Name        Age         City
0    Alice  25.000000     New York
1      Bob  25.666667  Los Angeles
2  Charlie  30.000000      Unknown
3    David  22.000000      Chicago
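
The mean is only one possible fill value. The sketch below shows two alternatives on the same example data, assuming they suit the column in question: the median, which is less affected by outliers, and a forward fill, which copies the previous row's value. Whether either is appropriate depends on the data; forward fill in particular only makes sense when row order is meaningful.

```python
import pandas as pd

# Same example data as in the cells above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago'],
})

# Median of the known ages, used in place of the mean
age_median_filled = df['Age'].fillna(df['Age'].median())

# Forward fill: each gap takes the value from the row above it
city_forward_filled = df['City'].ffill()

print(age_median_filled)
print(city_forward_filled)
```

Both calls return new objects, so `df` itself still contains its original missing values.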


### Dropping Missing Values

In [4]:

# Dropping rows with any missing values
df_dropped = df.dropna()

print("DataFrame after dropping rows with missing values:")
print(df_dropped)


DataFrame after dropping rows with missing values:
    Name   Age      City
0  Alice  25.0  New York
3  David  22.0   Chicago
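
Calling `dropna()` with no arguments removes every row that has any gap, which can discard more data than necessary. As a sketch on the same example data, the `subset`, `thresh`, and `axis` parameters narrow or change what gets dropped. The result variable names are illustrative.

```python
import pandas as pd

# Same example data as in the cells above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago'],
})

# Drop a row only if 'Age' is missing; a missing 'City' is tolerated
only_age_required = df.dropna(subset=['Age'])

# Keep rows that have at least 3 non-missing values
at_least_three = df.dropna(thresh=3)

# Drop columns (instead of rows) that contain any missing value
complete_columns = df.dropna(axis=1)

print(only_age_required)
print(at_least_three)
print(complete_columns)
```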


### Interpolating Missing Values

In [5]:

# Example of interpolation in a time series
data = {
    'Date': pd.date_range(start='2023-01-01', periods=5),
    'Temperature': [30, None, 28, None, 27]
}
df_ts = pd.DataFrame(data)

# Interpolating missing values (linear by default: each gap is filled from its neighbouring values)
df_ts['Temperature'] = df_ts['Temperature'].interpolate()

print("DataFrame after interpolating missing values:")
print(df_ts)


DataFrame after interpolating missing values:
        Date  Temperature
0 2023-01-01         30.0
1 2023-01-02         29.0
2 2023-01-03         28.0
3 2023-01-04         27.5
4 2023-01-05         27.0
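
The default linear interpolation treats the rows as equally spaced, which fits the daily data above. When timestamps are unevenly spaced, `method='time'` weights each estimate by elapsed time instead; it requires a `DatetimeIndex`. The sketch below uses a small made-up series (not part of the tutorial data) to show the difference.

```python
import pandas as pd

# Hypothetical readings with an uneven gap: 1 day, then 3 days
s = pd.Series(
    [30.0, None, 26.0],
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05']),
)

# Default: the missing point is treated as halfway between its neighbours
by_position = s.interpolate()

# Time-aware: the missing point is 1 of 4 days along the way from 30 to 26
by_time = s.interpolate(method='time')

print(by_position)
print(by_time)
```

Here the position-based estimate for 2023-01-02 is 28.0, while the time-based estimate is 29.0, because only one quarter of the interval has elapsed.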


## Use Case: Cleaning a Dataset of Public Health Records

### Step 1: Loading the Dataset

In [9]:

# Loading the dataset
#url = 'https://raw.githubusercontent.com/ricardogr07/100-days-of-python-and-data-science/main/13%20-%20Handling%20Missing%20Data%20in%20Pandas/public_health_records.csv'
url = 'https://raw.githubusercontent.com/ricardogr07/100-days-of-python-and-data-science/main/13%E2%80%8A-%E2%80%8AHandling%20Missing%20Data%20in%C2%A0Pandas/public_health_records.csv'
health_df = pd.read_csv(url)

# Display the first few rows of the dataset
print("First few rows of the public health records dataset:")
print(health_df.head())


First few rows of the public health records dataset:
   PatientID   Age  Gender Diagnosis        Treatment
0          1  58.0  Female    Asthma       Medication
1          2  71.0  Female    Asthma             Rest
2          3  48.0    Male  COVID-19              NaN
3          4  34.0    Male    Asthma       Medication
4          5  62.0  Female    Asthma  Hospitalization


### Step 2: Detecting Missing Data

In [10]:
# Checking for missing values
print("Missing values in the public health records dataset:")
print(health_df.isnull().sum())

Missing values in the public health records dataset:
PatientID     0
Age          20
Gender       10
Diagnosis     0
Treatment    15
dtype: int64
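
Raw counts are easier to judge relative to the size of the dataset. A minimal sketch, using a small hypothetical frame named `sample` rather than the downloaded file, expresses missingness as a percentage per column by taking the mean of the boolean mask.

```python
import pandas as pd

# Hypothetical stand-in for the health records; values are made up for illustration
sample = pd.DataFrame({
    'PatientID': [1, 2, 3, 4],
    'Age': [58.0, None, 48.0, None],
    'Gender': ['Female', None, 'Male', 'Male'],
})

# The mean of a True/False mask is the fraction of True values
missing_pct = sample.isnull().mean() * 100

print(missing_pct)
```

The same expression applied to `health_df` would give the share of missing values in each of its columns.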


### Step 3: Filling Missing Values

In [11]:
# Filling missing numerical values with the mean
health_df['Age'] = health_df['Age'].fillna(health_df['Age'].mean())

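# Filling missing categorical values with the most frequent value
# (Diagnosis has no missing values in this dataset, so this line changes nothing;
# the gaps in Gender and Treatment are left for Step 4)
health_df['Diagnosis'] = health_df['Diagnosis'].fillna(health_df['Diagnosis'].mode()[0])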

print("Dataset after filling missing values:")
print(health_df.head())

Dataset after filling missing values:
   PatientID   Age  Gender Diagnosis        Treatment
0          1  58.0  Female    Asthma       Medication
1          2  71.0  Female    Asthma             Rest
2          3  48.0    Male  COVID-19              NaN
3          4  34.0    Male    Asthma       Medication
4          5  62.0  Female    Asthma  Hospitalization
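
The mode-based fill can be applied to any categorical column that actually contains gaps, such as `Gender` or `Treatment` in the counts from Step 2. The sketch below is one possible approach on a small hypothetical frame, not the notebook's own method: it fills `Gender` with its most frequent value and marks missing `Treatment` entries with an explicit `'Unknown'` label.

```python
import pandas as pd

# Hypothetical stand-in for the health records; values are made up for illustration
sample = pd.DataFrame({
    'Gender': ['Female', None, 'Male', 'Female'],
    'Treatment': ['Medication', None, 'Rest', None],
})

# mode() ignores missing values and returns a Series; [0] takes the first (most frequent) entry
sample['Gender'] = sample['Gender'].fillna(sample['Gender'].mode()[0])

# An explicit label keeps the fact that the value was missing visible downstream
sample['Treatment'] = sample['Treatment'].fillna('Unknown')

print(sample)
```

Filling with the mode inflates the most common category, so an explicit label is often the safer choice when many values are missing.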


### Step 4: Dropping Rows with Missing Values

In [12]:
# Dropping rows with any remaining missing values
health_df_cleaned = health_df.dropna()

print("Dataset after dropping rows with missing values:")
print(health_df_cleaned.head())

Dataset after dropping rows with missing values:
   PatientID   Age  Gender Diagnosis        Treatment
0          1  58.0  Female    Asthma       Medication
1          2  71.0  Female    Asthma             Rest
3          4  34.0    Male    Asthma       Medication
4          5  62.0  Female    Asthma  Hospitalization
5          6  50.8  Female  COVID-19             Rest
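
After dropping rows, the index keeps its original labels, which is why row 2 is absent from the output above. As a sketch on a small hypothetical frame, the lines below check that no missing values remain, count how many rows were removed, and renumber the index with `reset_index(drop=True)`.

```python
import pandas as pd

# Hypothetical stand-in for the health records; values are made up for illustration
sample = pd.DataFrame({
    'PatientID': [1, 2, 3, 4],
    'Age': [58.0, 71.0, 48.0, 34.0],
    'Treatment': ['Medication', 'Rest', None, 'Medication'],
})

cleaned = sample.dropna()

# Total number of missing values left across all columns
remaining_missing = cleaned.isnull().sum().sum()

# How many rows the cleaning step removed
rows_dropped = len(sample) - len(cleaned)

# Renumber rows 0..n-1 and discard the old index labels
cleaned_reindexed = cleaned.reset_index(drop=True)

print(remaining_missing, rows_dropped)
print(cleaned_reindexed)
```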
