# Day 13 - Handling Missing Data in Pandas


## Why is Handling Missing Data Important?

Missing data can introduce bias, make a sample less representative, and lead to inaccurate conclusions. It therefore needs to be handled deliberately, either by filling the gaps with meaningful values or by dropping incomplete records. Pandas provides tools for each step: `isnull()` to detect missing values, `fillna()` and `interpolate()` to fill them, and `dropna()` to remove them.


## Tutorial: Filling and Dropping Missing Values

In [None]:
!pip install pandas

### Detecting Missing Data

In [None]:

import pandas as pd

# Example DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
}
df = pd.DataFrame(data)

# Detect missing values
print("Missing values in the DataFrame:")
print(df.isnull())
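
The boolean mask printed above is hard to scan on larger tables. As a minimal sketch on the same example data, the mask can be summarized per column and used to select the affected rows. The variable names `missing_per_column` and `rows_with_missing` are illustrative only.

```python
import pandas as pd

# Same example data as in the cell above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
})

# True counts as 1 when summed, so this gives the number of missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) flags rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```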


### Filling Missing Values

In [None]:

# Filling missing values column by column (dict maps column name -> fill value)
df_filled = df.fillna({
    'Age': df['Age'].mean(),  # Fill missing ages with the mean age
    'City': 'Unknown'          # Fill missing cities with 'Unknown'
})

print("\nDataFrame after filling missing values:")
print(df_filled)
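
The mean is sensitive to outliers, so it is not always the best fill value. The sketch below uses made-up ages (not the tutorial's data) that include one extreme value, and compares a mean fill with a median fill as one possible alternative.

```python
import pandas as pd

# Hypothetical ages with one extreme value (95) and one missing entry
ages = pd.Series([25, None, 30, 22, 95], name='Age')

# Both statistics skip missing values by default
mean_fill = ages.mean()      # pulled upward by the outlier
median_fill = ages.median()  # middle of the observed values

filled_with_mean = ages.fillna(mean_fill)
filled_with_median = ages.fillna(median_fill)

print(f"Mean fill value:   {mean_fill}")
print(f"Median fill value: {median_fill}")
```

Which statistic is appropriate depends on the distribution of the column; this is only an illustration of the difference.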


### Dropping Missing Values

In [None]:

# Dropping rows with any missing values
df_dropped = df.dropna()

print("DataFrame after dropping rows with missing values:")
print(df_dropped)
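
Calling `dropna()` with no arguments removes every row that has any gap, which can discard more data than intended. As a sketch on the same example data, the `subset`, `thresh`, and `axis` parameters of `dropna()` give finer control. The result names are illustrative.

```python
import pandas as pd

# Same example data as in the cells above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
})

# Drop rows only when 'Age' is missing; a missing 'City' is tolerated
df_age_known = df.dropna(subset=['Age'])

# Keep rows with at least 3 non-missing values (here: fully complete rows)
df_complete = df.dropna(thresh=3)

# Drop columns, rather than rows, that contain any missing value
df_full_columns = df.dropna(axis=1)

print(df_age_known)
print(df_complete)
print(df_full_columns)
```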


### Interpolating Missing Values

In [None]:

# Example of interpolation in a time series
data = {
    'Date': pd.date_range(start='2023-01-01', periods=5),
    'Temperature': [30, None, 28, None, 27]
}
df_ts = pd.DataFrame(data)

# Interpolating missing values (linear by default: each gap is filled from its neighbors)
df_ts['Temperature'] = df_ts['Temperature'].interpolate()

print("DataFrame after interpolating missing values:")
print(df_ts)
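
By default, `interpolate()` uses the linear method, which treats the values as equally spaced and ignores the index. That suits the daily series above, but not irregularly spaced timestamps. The sketch below uses made-up readings with a three-day gap and compares the default with `method='time'`, which requires a `DatetimeIndex`.

```python
import pandas as pd

# Hypothetical readings: the gap between the 2nd and 3rd timestamps is three days
temps = pd.Series(
    [30.0, None, 26.0],
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05'])
)

# Default linear method: the missing point lands halfway between its neighbors
linear = temps.interpolate()

# Time-aware method: the missing point is weighted by elapsed time (1 of 4 days)
time_aware = temps.interpolate(method='time')

print(linear)
print(time_aware)
```

Here the linear fill gives 28 for January 2, while the time-aware fill gives a value close to 29, because only a quarter of the interval has elapsed.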


## Use Case: Cleaning a Dataset of Public Health Records

### Step 1: Loading the Dataset

In [None]:

# Loading the dataset
url = 'https://raw.githubusercontent.com/ricardogr07/100-days-of-python-and-data-science/main/13%20-%20Handling%20Missing%20Data%20in%20Pandas/public_health_records.csv'
health_df = pd.read_csv(url)

# Display the first few rows of the dataset
print("First few rows of the public health records dataset:")
print(health_df.head())


### Step 2: Detecting Missing Data

In [None]:

# Checking for missing values
print("Missing values in the public health records dataset:")
print(health_df.isnull().sum())
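
Raw counts are easier to interpret relative to the dataset size. The sketch below computes the percentage of missing values per column on a small stand-in DataFrame. The name `sample_df` and its values are hypothetical; only the `Age` and `Diagnosis` column names come from this tutorial.

```python
import pandas as pd

# Hypothetical stand-in for health_df (values are made up for illustration)
sample_df = pd.DataFrame({
    'Age': [34, None, 51, 47],
    'Diagnosis': ['Flu', None, None, 'Asthma']
})

# isnull() yields booleans, so the column mean is the fraction of missing entries
missing_pct = sample_df.isnull().mean() * 100

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```

The same expression should work on `health_df` once it has been loaded.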


### Step 3: Filling Missing Values

In [None]:

# Filling missing numerical values with the mean
health_df['Age'] = health_df['Age'].fillna(health_df['Age'].mean())

# Filling missing categorical values with the most frequent value
health_df['Diagnosis'] = health_df['Diagnosis'].fillna(health_df['Diagnosis'].mode()[0])

print("Dataset after filling missing values:")
print(health_df.head())
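
When a dataset has many columns, filling each one by hand gets repetitive. As a sketch, the helper below applies the same two rules as this step (mean for numeric columns, most frequent value otherwise) to every column. The function name `fill_by_dtype` and the sample data are hypothetical and not part of pandas or the dataset.

```python
import pandas as pd

def fill_by_dtype(frame):
    """Return a copy with numeric columns mean-filled and other columns mode-filled."""
    filled = frame.copy()
    for col in filled.columns:
        if not filled[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            modes = filled[col].mode()
            # mode() is empty when every value is missing; leave such columns as-is
            if not modes.empty:
                filled[col] = filled[col].fillna(modes.iloc[0])
    return filled

# Hypothetical sample data for illustration
sample_df = pd.DataFrame({
    'Age': [30, None, 50],
    'Diagnosis': ['Flu', 'Flu', None]
})
filled_df = fill_by_dtype(sample_df)
print(filled_df)
```

The helper works on a copy, so the original DataFrame keeps its gaps and can be compared with the filled result.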


### Step 4: Dropping Rows with Missing Values

In [None]:

# Dropping rows with any remaining missing values
health_df_cleaned = health_df.dropna()

print("Dataset after dropping rows with missing values:")
print(health_df_cleaned.head())
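
Before relying on a cleaned dataset, it helps to check how much data the drop removed. The sketch below shows one way to report this, using a hypothetical stand-in DataFrame (`sample_df`, with made-up values) rather than the real records.

```python
import pandas as pd

# Hypothetical stand-in for health_df (values are made up for illustration)
sample_df = pd.DataFrame({
    'Age': [34, 51, None, 47],
    'Diagnosis': ['Flu', None, 'Asthma', 'Flu']
})

cleaned_df = sample_df.dropna()

# Compare row counts before and after dropping
rows_dropped = len(sample_df) - len(cleaned_df)
share_dropped = rows_dropped / len(sample_df)
print(f"Dropped {rows_dropped} of {len(sample_df)} rows ({share_dropped:.0%})")
```

If a large share of rows disappears, filling or a narrower `dropna(subset=...)` may be the better choice.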
