# Mark and Remove Missing Data

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

In this tutorial, we will learn how to handle missing data in datasets, specifically focusing on marking and removing missing values. By applying these techniques, you'll be able to prepare high-quality datasets for machine learning, improving model reliability and accuracy.

We'll use the Pima Indians Diabetes dataset as an example to demonstrate these techniques.

## Learning Objectives

- Learn how to identify and mark invalid or corrupt values as missing in a dataset
- Understand how the presence of marked missing values affects machine learning algorithms
- Learn how to remove rows with missing data from a dataset
- Evaluate a learning algorithm on a dataset after removing rows with missing values

## Prerequisites

- Basic understanding of Python programming
- Familiarity with pandas, numpy, and scikit-learn libraries

## Get Started

To start, we install the required packages and import the necessary libraries.


### Install required packages


In [2]:
# Install necessary Python libraries using pip:
# numpy:  Fundamental package for numerical computation in Python. It provides support for arrays, matrices, and mathematical functions to operate on these structures efficiently.
# pandas:  Library providing high-performance, easy-to-use data structures and data analysis tools. It's particularly useful for working with tabular data (like CSV files, spreadsheets, SQL databases).
# scikit-learn:  Popular machine learning library in Python. It provides tools for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
%pip install numpy pandas scikit-learn

Collecting numpy
  Using cached numpy-1.26.4-cp39-cp39-macosx_10_9_x86_64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp39-cp39-macosx_10_9_x86_64.whl (20.6 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
featuretools 1.31.0 requires pandas>=2.0.0, but you have pandas 1.3.4 which is incompatible.
langchain 0.0.337 requires anyio<4.0, but you have anyio 4.2.0 which is incompatible.
scispacy 0.5.3 requires scipy<1.11, but you have scipy 1.11.4 which is incompatible.
statsmodels 0.14.4 requires pandas!=2.1.0,>=1.4, but you have pandas 1.3.4 which is incompatible.
woodwork 0.31.0 requires pandas>=2.0.0, but you have pandas 1.3.4 which is incompatible.
Success

### Import necessary libraries

In [3]:
# Import the NumPy library for numerical operations, often used for handling arrays and matrices.
import numpy as np
# Import the Pandas library for data manipulation and analysis, particularly for working with DataFrames.
import pandas as pd
# Import the LinearDiscriminantAnalysis class from scikit-learn's discriminant_analysis module.
# LDA is a classifier with linear decision boundaries, often used for dimensionality reduction and classification.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Import KFold and cross_val_score from scikit-learn's model_selection module.
# KFold is used for splitting data into k-folds for cross-validation.
# cross_val_score is used to evaluate a score for a model using cross-validation.
from sklearn.model_selection import KFold, cross_val_score

## Diabetes Dataset

The dataset records, for each patient, whether there was an onset of diabetes within five years.

```
Number of Instances: 768
Number of Attributes: 8 plus class
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```

You can learn more about the dataset here:

- Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
- Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

A description of the dataset is also available on [Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).


### Load and summarize the dataset


In [4]:
# Define the file path to the Pima Indians Diabetes dataset CSV file.
# This path assumes the CSV file is located in the "../../Data/" directory relative to the current script's location.
pima_indians_diabetes_csv = "../../Data/pima-indians-diabetes.csv"

# Define the column names for the dataset
columns = [
    'Pregnancies',               # Number of times pregnant
    'Glucose',                   # Plasma glucose concentration (mg/dL)
    'BloodPressure',             # Diastolic blood pressure (mm Hg)
    'SkinThickness',             # Triceps skinfold thickness (mm)
    'Insulin',                   # 2-Hour serum insulin (mu U/ml)
    'BMI',                       # Body mass index (weight in kg/(height in m)^2)
    'DiabetesPedigreeFunction',  # Diabetes pedigree function (genetic risk)
    'Age',                       # Age in years
    'Outcome'                    # Class variable (0: Non-diabetic, 1: Diabetic)
    ]
# Load the dataset from a CSV file defined by a variable 'pima_indians_diabetes_csv' into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
# 'names=columns' Assign column names defined above
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None, names=columns)

# Display the first 5 rows of the DataFrame to get a quick overview of the data.
# This helps to understand the structure and sample data of the dataset.
dataset.head()

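|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |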
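
The role of `header=None` and `names=columns` can be seen on a small in-memory sample. The sketch below is only an illustration: it reuses the five rows from the preview above through `io.StringIO` instead of reading the real file, and the variable names `raw_csv`, `named`, and `misread` are made up for this example.

```python
import io

import pandas as pd

# First five rows of the dataset, copied from the preview above.
# The raw file has no header row, so the first line is already data.
raw_csv = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# With header=None, pandas treats every line as data and applies our names.
named = pd.read_csv(io.StringIO(raw_csv), header=None, names=columns)

# Without header=None, the first data row is consumed as column labels.
misread = pd.read_csv(io.StringIO(raw_csv))

print(named.shape)    # (5, 9)
print(misread.shape)  # (4, 9)
```

Omitting `header=None` silently loses one patient record, which is why the loading cell sets it explicitly.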


In [5]:
# Print summary statistics of the dataset.
# For numeric columns (every column in this dataset), describe() reports
# count, mean, standard deviation, min, max, and quartiles.
print(dataset.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  

Several columns in the dataset contain minimum values of zero. However, for specific variables (e.g., blood pressure), a zero value is biologically implausible and likely indicates missing or corrupted data.

The following columns contain invalid zero values: Plasma glucose concentration, Diastolic blood pressure, Triceps skinfold thickness, 2-Hour serum insulin, and Body mass index. We can verify this by examining the raw data, such as by displaying the first 20 rows.
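
One way to shortlist such columns is to look for a minimum of zero and then apply domain knowledge. This is a minimal sketch on a toy DataFrame with illustrative values, not the real dataset; the names `toy`, `valid_zero`, and `suspect_columns` are assumptions made for the example.

```python
import pandas as pd

# Toy rows using four of the tutorial's column names (values are illustrative).
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 0, 3],
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Age": [50, 31, 32, 21],
})

# Columns whose minimum is zero are candidates for hidden missing values.
mins = toy.min()
zero_min_columns = mins[mins == 0].index.tolist()
print(zero_min_columns)

# Domain knowledge narrows the list: zero pregnancies is a valid value,
# while zero glucose or zero BMI is not.
valid_zero = {"Pregnancies"}
suspect_columns = [c for c in zero_min_columns if c not in valid_zero]
print(suspect_columns)
```

The minimum only flags candidates; deciding which zeros are implausible still depends on what each variable measures.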

### Review rows


In [6]:
# Print the first 20 rows of the loaded dataset.
# 'dataset.head(20)' accesses the first 20 rows of the DataFrame 'dataset'.
# 'print()' function displays these first 20 rows to the console,
# providing a summary of the initial data in the DataFrame.
print(dataset.head(20))

    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0             6      148             72             35        0  33.6   
1             1       85             66             29        0  26.6   
2             8      183             64              0        0  23.3   
3             1       89             66             23       94  28.1   
4             0      137             40             35      168  43.1   
5             5      116             74              0        0  25.6   
6             3       78             50             32       88  31.0   
7            10      115              0              0        0  35.3   
8             2      197             70             45      543  30.5   
9             8      125             96              0        0   0.0   
10            4      110             92              0        0  37.6   
11           10      168             74              0        0  38.0   
12           10      139             80            

The number of missing values can be quantified for each column.

### Summarizing the number of missing values for each variable


In [7]:
# Replace column names with their index values
dataset.columns = range(len(dataset.columns))

# Count the number of missing values (represented as 0 in this dataset) for specific columns.
# We are checking columns at index 1, 2, 3, 4, and 5 of the DataFrame.
# `(dataset[[1, 2, 3, 4, 5]] == 0)` creates a boolean DataFrame where True indicates a value is 0.
# `.sum()` then sums the True values along each column, effectively counting the zeros in each specified column.
num_missing = (dataset[[1, 2, 3, 4, 5]] == 0).sum()

# Print the results, which will show the count of zero values for each of the selected columns.
# These zero values are being treated as missing values in this specific context of the 'pima_indians_diabetes_csv' dataset.
print(num_missing)

1      5
2     35
3    227
4    374
5     11
dtype: int64


Columns 1, 2, and 5 contain relatively few zero values, while columns 3 and 4 contain many more: 227 and 374 of the 768 rows, or roughly 30% and nearly half, respectively. This disparity suggests that different columns may need different missing-value strategies if we want to keep enough samples for model training.
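
The share of affected rows follows directly from the counts printed above. A small sketch of the arithmetic, assuming the 768 instances stated in the dataset description (the names `zero_counts` and `percent_missing` are made up for this example):

```python
# Zero counts per column index, copied from the output above.
zero_counts = {1: 5, 2: 35, 3: 227, 4: 374, 5: 11}
n_rows = 768  # number of instances stated in the dataset description

# Share of rows affected in each column, as a percentage.
percent_missing = {col: 100 * count / n_rows for col, count in zero_counts.items()}

for col, pct in percent_missing.items():
    print(f"column {col}: {pct:.1f}%")
```

Removing every row with a zero in column 4 would by itself discard close to half of the data, which is the trade-off to keep in mind when rows are dropped later.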

In pandas, missing values are represented as NaN (Not a Number), a floating-point value provided by NumPy. pandas skips NaN values by default in operations such as sum() and count(). Plain NumPy reductions such as np.sum() propagate NaN instead, and most scikit-learn estimators reject it.

To handle missing data:
* Convert invalid values to NaN using DataFrame.replace() on specific columns
* Identify missing values with isnull(), which returns a boolean mask (True for NaN)
* Count missing values per column using isnull().sum()

This workflow enables systematic missing data analysis before applying imputation techniques.
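
The three steps above can be sketched on a toy column. The values are illustrative and the names `toy` and `marked` are assumptions for this example; it also shows how pandas and plain NumPy treat NaN differently.

```python
import numpy as np
import pandas as pd

# Toy column where 0 stands in for "not recorded" (values are illustrative).
toy = pd.DataFrame({"Insulin": [0, 94, 168, 0, 88]})

# Step 1: mark the invalid zeros as NaN.
marked = toy.replace(0, np.nan)

# Step 2: isnull() gives a boolean mask; step 3: sum() counts the True values.
print(marked.isnull().sum())

# pandas skips NaN by default, so the mean uses only the 3 recorded values.
print(toy["Insulin"].mean())     # zeros drag the mean down: 70.0
print(marked["Insulin"].mean())  # NaN skipped: about 116.67

# Plain NumPy reductions propagate NaN; np.nanmean skips it explicitly.
print(np.mean(marked["Insulin"].to_numpy()))     # nan
print(np.nanmean(marked["Insulin"].to_numpy()))  # about 116.67
```

Marking the zeros matters even before any rows are removed: left as zeros, they distort summary statistics such as the mean.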

### Marking missing values with nan values


In [None]:
# Replace '0' values with 'np.nan' (Not a Number) in specific columns of the DataFrame.
# The columns at index 1, 2, 3, 4, and 5 are selected for replacement.
# This is likely done because '0' in these columns might represent missing or invalid data in the context of the dataset.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Print the count of NaN (Not a Number) values for each column in the DataFrame.
# 'dataset.isnull()' creates a boolean DataFrame indicating missing values (True for NaN, False otherwise).
# '.sum()' then sums the True values along each column, effectively counting the number of NaN values per column.
print(dataset.isnull().sum())

Printing the first 20 rows confirms that the zeros in these columns are now marked as NaN.

### Review data with missing values marked with a nan


In [None]:
# Replace '0' values with 'np.nan' (Not a Number) in specific columns of the DataFrame.
# The columns at index 1, 2, 3, 4, and 5 are selected for replacement.
# This is likely done because '0' in these columns might represent missing or invalid data in the context of the dataset.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Print the first 20 rows of the DataFrame to display a sample of the data after loading and replacing values.
# 'dataset.head(20)' returns the first 20 rows of the DataFrame, which is then printed to the console.
print(dataset.head(20))

## Missing Values Cause Problems

Most scikit-learn estimators, including the LDA model used here, raise a ValueError when the input contains NaN, so missing values must be imputed or removed before modeling.

**Note!** You should see a message about an error occurring when you try to run the following code block.

In [None]:
# Replace '0' values with 'nan' in specific columns (columns at index 1, 2, 3, 4, 5).
# This is done because in the Pima Indians Diabetes Dataset, '0' in these columns represents missing values.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Split the dataset into input features (X) and the target variable (y).
# 'values' converts the DataFrame to a NumPy array.
values = dataset.values
# X is assigned the first 8 columns (index 0 to 7) which are the input features.
X = values[:, 0:8]
# y is assigned the 9th column (index 8) which is the target variable (diabetes outcome).
y = values[:, 8]

# Define the Linear Discriminant Analysis (LDA) model.
# LDA is a classifier that assumes a linear decision boundary and fits class conditional densities using Bayes' rule.
model = LinearDiscriminantAnalysis()

# Define the cross-validation procedure using K-Fold.
# KFold splits the dataset into k consecutive folds (here, n_splits=3).
# 'shuffle=True' shuffles the data before splitting to ensure folds are more representative.
# 'random_state=1' ensures reproducibility of the shuffling.
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# Evaluate the model's performance using cross-validation and the accuracy scoring metric.
# 'cross_val_score' performs cross-validation and returns an array of scores for each fold.
# The code is wrapped in a try-except block to handle potential ValueError exceptions.
try:
    result = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # If cross-validation is successful, print the mean accuracy across all folds, formatted to 3 decimal places.
    print("Accuracy: %.3f" % result.mean())
# Catch the ValueError that scikit-learn raises here because X contains NaN values.
except ValueError as e:
    # If a ValueError occurs during cross-validation, print an informative message indicating an expected error and the error details.
    print(f"********************* An (expected) error occurred *********************\n{e}")
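
The same failure can be reproduced without cross-validation. The sketch below is a reduced illustration on a tiny synthetic array (not the diabetes data); the names `fit_failed` and `complete_rows` are made up for this example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Tiny synthetic dataset (illustrative values) with one NaN in the features.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 1.0],
    [6.0, 5.0],
    [7.0, 8.0],
    [8.0, 6.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

model = LinearDiscriminantAnalysis()

try:
    model.fit(X, y)
    fit_failed = False
except ValueError as e:
    # scikit-learn's input validation rejects NaN before any fitting happens.
    fit_failed = True
    print(f"fit() raised ValueError: {e}")

# Dropping the incomplete row lets the same model fit without error.
complete_rows = ~np.isnan(X).any(axis=1)
model.fit(X[complete_rows], y[complete_rows])
print(model.predict([[2.0, 2.0]]))
```

A single NaN anywhere in the feature matrix is enough to stop the fit, which is why the rows or columns containing it have to be dealt with first.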

## Remove Rows With Missing Values

The simplest approach for handling missing values is to remove:
- Entire predictors (columns) containing missing values
- And/or samples (rows) with missing values
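
The difference between the two options can be sketched on a toy DataFrame. The values are illustrative and the names `rows_dropped`, `cols_dropped`, and `combined` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values): column "b" is mostly missing, "a" is complete.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 7.0],
    "c": [5.0, np.nan, 6.0, 8.0],
})

# Remove samples: keep only rows with no NaN in any column.
rows_dropped = toy.dropna(axis=0)

# Remove predictors: keep only columns with no NaN in any row.
cols_dropped = toy.dropna(axis=1)

# Combine both: drop the sparsest column first, then the remaining incomplete rows.
combined = toy.drop(columns=["b"]).dropna()

print(rows_dropped.shape)  # (1, 3)
print(cols_dropped.shape)  # (4, 1)
print(combined.shape)      # (3, 2)
```

Dropping a very sparse column before dropping rows can preserve many more samples than dropping rows alone, at the cost of losing that predictor.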

#### Implementation in Pandas:
```python
# `original_df` and 'important_column' are placeholders for your own data.

# Create new DataFrame with missing rows removed
cleaned_df = original_df.dropna()

# Alternative: Remove rows with missing values in specific columns
cleaned_df = original_df.dropna(subset=['important_column'])
```

### Example of removing rows that contain missing values


In [None]:
# Print the shape of the DataFrame to summarize the number of rows and columns in the raw data.
print(dataset.shape)

# Replace '0' values with 'np.nan' (Not a Number) in specific columns (index 1, 2, 3, 4, 5) of the DataFrame.
# This is done to represent missing values, as '0' might be used as a placeholder for missing data in this dataset.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Remove rows that contain any missing values (NaN) from the DataFrame.
# 'inplace=True' modifies the DataFrame directly; no new DataFrame is returned.
dataset.dropna(inplace=True)

# Print the shape of the DataFrame again to summarize the data after rows with missing values have been removed.
print(dataset.shape)

We now have a cleaned dataset suitable for evaluating algorithms that require complete cases, such as Linear Discriminant Analysis (LDA).

**Key Characteristics:**
- Contains no missing values (NaN)
- Has far fewer rows than the original 768: column 4 alone had 374 zeros, so at least 374 rows were dropped
- Keeps all nine columns, since only rows were removed

**Typical Algorithms Requiring Complete Data:**
- LDA (Linear Discriminant Analysis)
- Logistic Regression
- SVM (Support Vector Machines)
- Most scikit-learn estimators

**Verification Check:**
```python
print(f"Remaining samples: {len(dataset)}")
print(f"Missing values per column:\n{dataset.isnull().sum()}")
```

### Evaluate model on data after rows with missing data are removed


In [None]:
# Replace '0' values with 'nan' (Not a Number) in columns 1, 2, 3, 4, and 5 of the DataFrame.
# This is done to represent missing or invalid data, as '0' might not be a valid value for these columns in the context of the dataset.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Drop rows from the DataFrame that contain any missing values (NaN).
# 'inplace=True' modifies the DataFrame directly.
dataset.dropna(inplace=True)

# Extract the values from the DataFrame into a NumPy array.
values = dataset.values
# Assign the first 8 columns of the 'values' array to 'X' as input features.
X = values[:, 0:8]
# Assign the 9th column (index 8) of the 'values' array to 'y' as the target variable (output).
y = values[:, 8]

# Define the Linear Discriminant Analysis (LDA) model.
# LDA is a classifier with a linear decision boundary, often used for dimensionality reduction and classification.
model = LinearDiscriminantAnalysis()

# Define the cross-validation procedure using K-Fold.
# 'n_splits=3' specifies 3-fold cross-validation.
# 'shuffle=True' shuffles the data before splitting into folds.
# 'random_state=1' ensures reproducibility of the shuffling.
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# Evaluate the model's performance using cross-validation.
# 'cross_val_score' performs cross-validation and returns the scores for each fold.
# 'model' is the LDA model to be evaluated.
# 'X' is the input features.
# 'y' is the target variable.
# 'cv=cv' uses the K-Fold cross-validation strategy defined earlier.
# 'scoring="accuracy"' specifies that accuracy is the metric to be used for evaluation.
result = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Report the mean accuracy score across all folds from the cross-validation.
# 'result.mean()' calculates the average accuracy.
# "Accuracy: %.3f" formats the mean accuracy to 3 decimal places for printing.
print("Accuracy: %.3f" % result.mean())
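
The mark, drop, and evaluate steps can also be bundled into one function. The sketch below is only an illustration: `evaluate_complete_cases` is a hypothetical helper, not part of any library, and it runs on synthetic data from `make_classification` and not on the Pima dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score


def evaluate_complete_cases(df, zero_as_missing, target):
    """Hypothetical helper: mark zeros as NaN, drop incomplete rows, score LDA."""
    marked = df.copy()
    marked[zero_as_missing] = marked[zero_as_missing].replace(0, np.nan)
    complete = marked.dropna()
    X = complete.drop(columns=[target]).to_numpy()
    y = complete[target].to_numpy()
    cv = KFold(n_splits=3, shuffle=True, random_state=1)
    scores = cross_val_score(
        LinearDiscriminantAnalysis(), X, y, cv=cv, scoring="accuracy"
    )
    return len(complete), scores


# Synthetic stand-in for the real data (not the Pima dataset).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
demo = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
demo["label"] = y
# Inject zeros into every tenth row of one column to play the role of missing values.
demo.loc[demo.index[::10], "f1"] = 0.0

n_rows, scores = evaluate_complete_cases(demo, ["f1"], "label")
print(n_rows)
print("Accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

Returning the number of remaining rows alongside the scores makes it easy to see how much data the accuracy estimate is based on.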

## Conclusion

In this tutorial, we learned how to:

- Identify and mark missing values in a dataset
- Understand the impact of missing values on machine learning algorithms
- Remove rows with missing data
- Evaluate a machine learning model on a cleaned dataset

These skills are crucial for preparing real-world datasets for analysis and machine learning tasks.

## Clean up

Remember to shut down your Jupyter notebook and delete any unnecessary resources when you're finished with this tutorial.