In [1]:

#### Handling missing values ###
# Sriram Parthasarathy
# License: MIT

'''
One of the major challenges in most data science projects
is getting clean data.
By common estimates, 60 to 80 percent of total project time goes into
cleaning the data before you can make meaningful sense of it.
This holds for both BI and predictive analytics projects.

Additional Reading:
Please refer to my articles on Medium for more details:

Practical Strategies to Handle Missing Values
https://medium.com/data-science/practical-strategies-to-handle-missing-values-626f9c43870b?source=your_stories_page--------------------------------------------

The Shopping Cart Abandonment Problem: How Machine Learning Can Help!
https://medium.com/managing-digital-products/the-shopping-cart-abandonment-problem-how-machine-learning-can-help-eb690f1dc4f6?source=your_stories_page--------------------------------------------

How to Measure & Optimise Your Predictive Model for Prime Time?
https://medium.com/managing-digital-products/how-to-measure-optimise-your-predictive-model-for-prime-time-3b9f6072f85c?source=your_stories_page--------------------------------------------

Increasing The Accuracy of Predictive Models with Stacked Ensemble Techniques: Healthcare Example
https://medium.com/managing-digital-products/increasing-the-accuracy-of-predictive-model-with-stacked-ensemble-techniques-a-healthcare-example-135d36b9a2b7?source=your_stories_page--------------------------------------------

AI Powered Automatic Classification: The Challenges in Managing Data in Clinical Trials
https://medium.com/managing-digital-products/ai-powered-automatic-classification-the-challenges-in-managing-data-in-clinical-trials-6639e7aa1a7d?source=your_stories_page--------------------------------------------

How Do You Measure If Your Customer Churn Predictive Model Is Good?
https://medium.com/data-science/how-do-you-measure-if-your-customer-churn-predictive-model-is-good-187a49a9eee3?source=your_stories_page--------------------------------------------


Practical Data Augmentation Techniques for Predictive Models
https://medium.com/hackernoon/practical-data-augmentation-techniques-for-predictive-models-b51599253c30?source=your_stories_page--------------------------------------------

Machine Learning for Product Managers: Defining the business problem
https://medium.com/managing-digital-products/machine-learning-for-product-managers-defining-the-business-problem-f0e968d09ee7?source=your_stories_page--------------------------------------------

'''


import pandas as pd
import numpy as np

# ----------------------------
# STEP 1: Create sample DataFrame with missing values
# ----------------------------
data = {
    'name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eva'],
    'age': [25, np.nan, 35, 40, np.nan],
    'salary': [50000, 60000, np.nan, 52000, 58000],
    'department': ['HR', 'IT', np.nan, 'HR', 'Finance']
}

# For illustration, this uses a small sample dataset because I can't share a real customer dataset.
# Replace it with your own data.

df = pd.DataFrame(data)
print("🔹 Original DataFrame:\n", df)

# ----------------------------
# STEP 2: Detect missing values
# ----------------------------

# Count missing values per column
print("\n🔍 Missing count per column:\n", df.isnull().sum())

# Show rows with any missing values
print("\n🔍 Rows with any missing values:\n", df[df.isnull().any(axis=1)])
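
# Illustrative sketch (not part of the original walkthrough): the counts above
# can also be expressed as a share of rows, which is easier to compare across
# datasets of different sizes. `missing_share` is a hypothetical helper name.
import pandas as pd  # repeated so the sketch runs on its own

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)

# Possible usage: missing_share(df) gives 0.4 for 'age' in the sample data,
# since 2 of its 5 values are missing.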

# ----------------------------
# STEP 3: Drop missing data
# ----------------------------

# Drop rows with any missing values
df_drop_rows = df.dropna()
print("\n🚫 Drop rows with any NaN:\n", df_drop_rows)

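# Drop columns with more than 50% missing
# (thresh is the minimum number of non-missing values a column needs to be kept; it should be an integer)
df_drop_cols = df.dropna(thresh=int(np.ceil(len(df) * 0.5)), axis=1)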
print("\n🚫 Drop columns with >50% missing:\n", df_drop_cols)
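
# Illustrative sketch (not part of the original walkthrough): the same rule
# written as a reusable function with the cutoff as a parameter.
# `drop_sparse_columns` and `max_missing` are hypothetical names.
import pandas as pd  # repeated so the sketch runs on its own

def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at most max_missing."""
    keep = frame.isnull().mean() <= max_missing
    return frame.loc[:, keep]

# Possible usage: drop_sparse_columns(df, max_missing=0.3) would drop 'age'
# (40% missing in the sample data) and keep the other three columns.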

# ----------------------------
# STEP 4: Fill (Impute) missing values
# ----------------------------

# Fill all missing values with a constant
# (note: filling numeric columns with a string turns them into object dtype)
df_fill_constant = df.fillna("Unknown")
print("\n🧪 Fill all NaNs with 'Unknown':\n", df_fill_constant)

# Fill numeric columns with mean
df_mean_fill = df.copy()
df_mean_fill['age'] = df_mean_fill['age'].fillna(df_mean_fill['age'].mean())
df_mean_fill['salary'] = df_mean_fill['salary'].fillna(df_mean_fill['salary'].mean())
print("\n📈 Fill numeric NaNs with mean:\n", df_mean_fill)
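
# Illustrative sketch (not part of the original walkthrough): the median is a
# common alternative to the mean when a column has outliers, because a few
# extreme values do not pull it. `fill_numeric_with_median` is a hypothetical
# helper name.
import pandas as pd  # repeated so the sketch runs on its own

def fill_numeric_with_median(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs in numeric columns replaced by each column's median."""
    filled = frame.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    medians = filled[numeric_cols].median()  # one median per numeric column
    filled[numeric_cols] = filled[numeric_cols].fillna(medians)
    return filled

# Possible usage: fill_numeric_with_median(df) leaves 'name' and 'department'
# untouched and fills only 'age' and 'salary'.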

# Fill categorical columns with mode (most frequent value)
# mode() returns tied values in sorted order, so [0] picks the first when several values tie
df_mode_fill = df.copy()
df_mode_fill['department'] = df_mode_fill['department'].fillna(df_mode_fill['department'].mode()[0])
df_mode_fill['name'] = df_mode_fill['name'].fillna(df_mode_fill['name'].mode()[0])
print("\n🏷️ Fill categorical NaNs with mode:\n", df_mode_fill)

# ----------------------------
# STEP 5: Add missing flags (optional for ML)
# ----------------------------
df_flagged = df.copy()
df_flagged['age_missing'] = df_flagged['age'].isnull().astype(int)
df_flagged['department_missing'] = df_flagged['department'].isnull().astype(int)
print("\n🚩 Added missing flags:\n", df_flagged)
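
# Illustrative sketch (not part of the original walkthrough): the two flags
# above are written by hand; the same idea can be applied to every column that
# contains at least one missing value. `add_missing_flags` is a hypothetical
# helper name.
import pandas as pd  # repeated so the sketch runs on its own

def add_missing_flags(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 0/1 '<column>_missing' flag for each column that has NaNs."""
    flagged = frame.copy()
    cols_with_nan = [col for col in frame.columns if frame[col].isnull().any()]
    for col in cols_with_nan:
        flagged[f"{col}_missing"] = frame[col].isnull().astype(int)
    return flagged

# Possible usage: add_missing_flags(df) adds one flag per original column here,
# because every column in the sample data has at least one missing value.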


🔹 Original DataFrame:
       name   age   salary department
0    Alice  25.0  50000.0         HR
1      Bob   NaN  60000.0         IT
2  Charlie  35.0      NaN        NaN
3      NaN  40.0  52000.0         HR
4      Eva   NaN  58000.0    Finance

🔍 Missing count per column:
 name          1
age           2
salary        1
department    1
dtype: int64

🔍 Rows with any missing values:
       name   age   salary department
1      Bob   NaN  60000.0         IT
2  Charlie  35.0      NaN        NaN
3      NaN  40.0  52000.0         HR
4      Eva   NaN  58000.0    Finance

🚫 Drop rows with any NaN:
     name   age   salary department
0  Alice  25.0  50000.0         HR

🚫 Drop columns with >50% missing:
       name   age   salary department
0    Alice  25.0  50000.0         HR
1      Bob   NaN  60000.0         IT
2  Charlie  35.0      NaN        NaN
3      NaN  40.0  52000.0         HR
4      Eva   NaN  58000.0    Finance

🧪 Fill all NaNs with 'Unknown':
       name      age   salary department