# Mark and Remove Missing Data

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

In this tutorial, we will learn how to handle missing data in datasets, specifically focusing on marking and removing missing values. By applying these techniques, you'll be able to prepare high-quality datasets for machine learning, improving model reliability and accuracy.

We'll use the Pima Indians Diabetes dataset as an example to demonstrate these techniques.

## Learning Objectives

- Learn how to identify and mark invalid or corrupt values as missing in a dataset
- Understand how the presence of marked missing values affects machine learning algorithms
- Learn how to remove rows with missing data from a dataset
- Evaluate a learning algorithm on a dataset after removing rows with missing values

## Prerequisites

- Basic understanding of Python programming
- Familiarity with pandas, numpy, and scikit-learn libraries

## Get Started

To start, we install the required packages and import the necessary libraries.


### Install required packages


In [None]:
# Install necessary Python libraries using pip:
# numpy:  Fundamental package for numerical computation in Python. It provides support for arrays, matrices, and mathematical functions to operate on these structures efficiently.
# pandas:  Library providing high-performance, easy-to-use data structures and data analysis tools. It's particularly useful for working with tabular data (like CSV files, spreadsheets, SQL databases).
# scikit-learn:  Popular machine learning library in Python. It provides tools for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
%pip install numpy pandas scikit-learn

### Import necessary libraries

In [None]:
# Import the NumPy library for numerical operations, often used for handling arrays and matrices.
import numpy as np
# Import the Pandas library for data manipulation and analysis, particularly for working with DataFrames.
import pandas as pd
# Import the LinearDiscriminantAnalysis class from scikit-learn's discriminant_analysis module.
# LDA is a classifier with linear decision boundaries, often used for dimensionality reduction and classification.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Import KFold and cross_val_score from scikit-learn's model_selection module.
# KFold is used for splitting data into k-folds for cross-validation.
# cross_val_score is used to evaluate a score for a model using cross-validation.
from sklearn.model_selection import KFold, cross_val_score

# Define the file path to the Pima Indians Diabetes dataset CSV file.
# This relative path is resolved against the current working directory (normally the notebook's folder).
pima_indians_diabetes_csv = "../../Data/pima-indians-diabetes.csv"

## Diabetes Dataset

The dataset classifies each patient as either developing diabetes within five years or not.

```
Number of Instances: 768
Number of Attributes: 8 plus class
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```

You can learn more about the dataset here:

- Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
- Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

A description of the dataset is also available [here](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).


### Load and summarize the dataset


In [None]:
# Load the dataset from the CSV file at the path stored in 'pima_indians_diabetes_csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Display the first 5 rows of the DataFrame to get a quick overview of the data.
# This helps to understand the structure and sample data of the dataset.
dataset.head()

In [None]:
# Print summary statistics of the dataset.
# For numeric columns this includes the count, mean, standard deviation, min, max, and quartiles.
# All columns in this dataset are numeric, so every column is summarized.
print(dataset.describe())

We can see that several columns have a minimum value of zero (0).
In some columns, a value of zero does not make sense and indicates an invalid or missing value.


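Specifically, the following attributes have an invalid minimum value of zero. The attribute numbers refer to the list above; because the file is loaded with `header=None`, they correspond to column indices 1 to 5 in the DataFrame:

- Attribute 2 (column 1): Plasma glucose concentration
- Attribute 3 (column 2): Diastolic blood pressure
- Attribute 4 (column 3): Triceps skin fold thickness
- Attribute 5 (column 4): 2-Hour serum insulin
- Attribute 6 (column 5): Body mass index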
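
As a minimal sketch of this check, the snippet below lists the columns whose minimum is zero. It uses a small made-up DataFrame (the values are illustrative, not taken from the Pima dataset). Whether a zero is actually invalid still depends on domain knowledge: zero pregnancies, for example, is a valid value.

```python
import pandas as pd

# Hypothetical toy data: a zero is plausible in column 0 but not in columns 1 and 2.
toy = pd.DataFrame({
    0: [6, 1, 0, 3],
    1: [148, 85, 0, 89],
    2: [72, 66, 64, 0],
    3: [33.6, 26.6, 23.3, 28.1],
})

# Compute the per-column minimum, then keep only the columns where it equals zero.
col_min = toy.min()
zero_min_columns = col_min[col_min == 0].index.tolist()
print(zero_min_columns)
```

The result is only a list of candidates to review; each flagged column still has to be judged against what the attribute measures.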

We can confirm this by looking at the raw data and printing out the first 20 rows of data.


### Load the dataset and review rows


In [None]:
# Load the dataset from the CSV file with pandas' read_csv() function.
# 'pima_indians_diabetes_csv' holds the path to the CSV file defined earlier.
# 'header=None' specifies that the CSV file does not have a header row,
# so pandas will automatically assign column indices (0, 1, 2, ...).
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Print the first 20 rows of the loaded dataset.
# 'dataset.head(20)' accesses the first 20 rows of the DataFrame 'dataset'.
# 'print()' function displays these first 20 rows to the console,
# providing a summary of the initial data in the DataFrame.
print(dataset.head(20))

We can get a count of the number of missing values on each of these columns.


### Summarizing the number of missing values for each variable


In [None]:
# Load the dataset from the CSV file at the path stored in 'pima_indians_diabetes_csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Count the number of missing values (represented as 0 in this dataset) for specific columns.
# We are checking columns at index 1, 2, 3, 4, and 5 of the DataFrame.
# `(dataset[[1, 2, 3, 4, 5]] == 0)` creates a boolean DataFrame where True indicates a value is 0.
# `.sum()` then sums the True values along each column, effectively counting the zeros in each specified column.
num_missing = (dataset[[1, 2, 3, 4, 5]] == 0).sum()

# Print the results, which show the count of zero values for each of the selected columns.
# In these columns of the Pima Indians Diabetes dataset, a zero is treated as a missing value.
print(num_missing)

We can see that columns 1, 2, and 5 have just a few zero values, whereas columns 3 and 4
have many more, up to nearly half of the rows. This suggests that different missing value strategies
may be needed for different columns, e.g. to ensure that enough records remain to train
a predictive model.
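
To compare columns on a common scale, it can help to look at the fraction of zeros rather than the raw count. The sketch below shows one way to do this. The toy values are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Hypothetical toy data using the same integer column labels as the tutorial.
toy = pd.DataFrame({
    1: [148, 85, 0, 89, 137, 116, 78, 115],
    3: [35, 29, 0, 23, 35, 0, 32, 0],
    4: [0, 0, 0, 94, 168, 0, 88, 0],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the fraction of zeros in each column.
zero_fraction = (toy == 0).mean()
print(zero_fraction)
```

A column with a small fraction of zeros loses few rows when those rows are dropped, while a column with a large fraction may call for a different strategy.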


In pandas, NumPy, and scikit-learn, missing values are conventionally marked as NaN.
Pandas skips NaN values in operations such as `sum()` and `count()`. We can mark values
as NaN in a DataFrame by calling `replace()` on the subset of columns we are interested in.
After marking the missing values, we can use `isnull()` to flag every NaN in the dataset
as True, then sum the result to count the missing values in each column.
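
The following sketch illustrates these steps on a small made-up DataFrame before we apply them to the real data. The column choice and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: zero is valid in column 0 but not in columns 1 and 2.
toy = pd.DataFrame({
    0: [6, 1, 0, 3],
    1: [148, 0, 183, 89],
    2: [72, 66, 0, 0],
})

# Replace zeros with NaN only in the columns where zero is not a valid value.
cols = [1, 2]
toy[cols] = toy[cols].replace(0, np.nan)

# isnull() flags NaN cells as True; summing the flags counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# pandas skips NaN by default: count() counts non-missing values, sum() ignores NaN.
print(toy[2].count())
print(toy[2].sum())
```

Restricting `replace()` to selected columns matters here: applying it to the whole DataFrame would also turn the legitimate zeros in column 0 into NaN.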


### Marking missing values with NaN values


In [None]:
# Load the dataset from the CSV file at the path stored in 'pima_indians_diabetes_csv' into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Replace '0' values with 'np.nan' (Not a Number) in specific columns of the DataFrame.
# The columns at index 1, 2, 3, 4, and 5 are selected for replacement
# because a zero in these columns represents missing or invalid data.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Print the count of NaN (Not a Number) values for each column in the DataFrame.
# 'dataset.isnull()' creates a boolean DataFrame indicating missing values (True for NaN, False otherwise).
# '.sum()' then sums the True values along each column, effectively counting the number of NaN values per column.
print(dataset.isnull().sum())

We can confirm by printing out the first 20 rows of data.


### Review data with missing values marked with NaN


In [None]:
# Load the dataset from a CSV file specified by the variable 'pima_indians_diabetes_csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row, so pandas will automatically assign column indices.
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Replace '0' values with 'np.nan' (Not a Number) in specific columns of the DataFrame.
# The columns at index 1, 2, 3, 4, and 5 are selected for replacement
# because a zero in these columns represents missing or invalid data.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Print the first 20 rows of the DataFrame to display a sample of the data after loading and replacing values.
# 'dataset.head(20)' returns the first 20 rows of the DataFrame, which is then printed to the console.
print(dataset.head(20))

## Missing Values Cause Problems

Having missing values in a dataset can cause errors with some machine learning algorithms.
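
As a small illustration, the sketch below checks a made-up array for NaN values and then tries to fit LDA on it. It assumes scikit-learn's default input validation, which rejects NaN inputs for this estimator. The data is hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical toy inputs containing a single missing value.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 1.0],
    [4.0, 3.0],
    [5.0, 2.5],
    [6.0, 0.5],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Check for missing values before fitting.
has_missing = bool(np.isnan(X).any())
print("Contains NaN:", has_missing)

# LDA validates its inputs and rejects NaN with a ValueError.
try:
    LinearDiscriminantAnalysis().fit(X, y)
    fit_failed = False
except ValueError as e:
    fit_failed = True
    print("Fit failed:", e)
```

A quick `np.isnan(X).any()` check before modeling makes the problem visible early instead of surfacing as an error inside cross-validation.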

*Note!* You should see a message about an error occurring when you try to run the following code block.

In [None]:
# Load the dataset from the CSV file at the path stored in 'pima_indians_diabetes_csv' into a pandas DataFrame.
# 'header=None' argument indicates that the CSV file does not have a header row.
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Replace '0' values with 'nan' in specific columns (columns at index 1, 2, 3, 4, 5).
# This is done because in the Pima Indians Diabetes Dataset, '0' in these columns represents missing values.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Split the dataset into input features (X) and the target variable (y).
# 'values' converts the DataFrame to a NumPy array.
values = dataset.values
# X is assigned the first 8 columns (index 0 to 7) which are the input features.
X = values[:, 0:8]
# y is assigned the 9th column (index 8) which is the target variable (diabetes outcome).
y = values[:, 8]

# Define the Linear Discriminant Analysis (LDA) model.
# LDA is a classifier that assumes a linear decision boundary and fits class conditional densities using Bayes' rule.
model = LinearDiscriminantAnalysis()

# Define the cross-validation procedure using K-Fold.
# KFold splits the dataset into k consecutive folds (here, n_splits=3).
# 'shuffle=True' shuffles the data before splitting to ensure folds are more representative.
# 'random_state=1' ensures reproducibility of the shuffling.
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# Evaluate the model's performance using cross-validation and the accuracy scoring metric.
# 'cross_val_score' performs cross-validation and returns an array of scores for each fold.
# 'error_score="raise"' makes a failed fit raise immediately, so the behavior is the same across scikit-learn versions.
# The call is wrapped in a try-except block because the NaN values in X are expected to make the fit fail.
try:
    result = cross_val_score(model, X, y, cv=cv, scoring="accuracy", error_score="raise")
    # If cross-validation is successful, print the mean accuracy across all folds, formatted to 3 decimal places.
    print("Accuracy: %.3f" % result.mean())
# scikit-learn raises a ValueError here because LDA does not accept NaN values in its input.
except ValueError as e:
    # Print a message indicating that the error was expected, followed by the error details.
    print(f"********************* An (expected) error occurred *********************\n{e}")

## Remove Rows With Missing Values

The simplest approach for dealing with missing values is to remove entire predictor(s)
and/or sample(s) that contain missing values.

We can do this by creating a new Pandas DataFrame with the rows containing missing values
removed. Pandas provides the `dropna()` function that can be used to drop either columns or
rows with missing data. We can use `dropna()` to remove all rows with missing data.
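
To make the row and column options concrete, here is a brief sketch of `dropna()` on a made-up DataFrame. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NaN values in columns "a" and "c".
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, 6.0, 7.0, 8.0],
    "c": [np.nan, 1.0, 2.0, 3.0],
})

# Default (axis=0): drop every row that contains at least one NaN.
rows_dropped = toy.dropna()

# axis=1: drop every column that contains at least one NaN.
cols_dropped = toy.dropna(axis=1)

# subset: only look at column "a" when deciding which rows to drop.
subset_dropped = toy.dropna(subset=["a"])

print(rows_dropped.shape)
print(cols_dropped.shape)
print(subset_dropped.shape)
```

Each call returns a new DataFrame and leaves `toy` unchanged; passing `inplace=True` modifies the DataFrame directly instead.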


### Example of removing rows that contain missing values


In [None]:
# Load the dataset from the CSV file at the path stored in 'pima_indians_diabetes_csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Print the shape of the DataFrame to summarize the number of rows and columns in the raw data.
print(dataset.shape)

# Replace '0' values with 'np.nan' (Not a Number) in specific columns (index 1, 2, 3, 4, 5) of the DataFrame.
# In these columns a zero is a placeholder for missing data, so it is marked as NaN.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Remove rows that contain any missing values (NaN) from the DataFrame.
# 'inplace=True' modifies the DataFrame directly; no new DataFrame is returned.
dataset.dropna(inplace=True)

# Print the shape of the DataFrame again to summarize the data after rows with missing values have been removed.
print(dataset.shape)

We now have a dataset that we could use to evaluate an algorithm sensitive to missing values
like LDA.


### Evaluate model on data after rows with missing data are removed


In [None]:
# Load the dataset from the CSV file at the path stored in 'pima_indians_diabetes_csv' into a pandas DataFrame.
# 'header=None' indicates that the CSV file does not have a header row.
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Replace '0' values with 'np.nan' (Not a Number) in columns 1, 2, 3, 4, and 5 of the DataFrame.
# A zero is not a valid value for these columns, so it is marked as missing.
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Drop rows from the DataFrame that contain any missing values (NaN).
# 'inplace=True' modifies the DataFrame directly.
dataset.dropna(inplace=True)

# Extract the values from the DataFrame into a NumPy array.
values = dataset.values
# Assign the first 8 columns of the 'values' array to 'X' as input features.
X = values[:, 0:8]
# Assign the 9th column (index 8) of the 'values' array to 'y' as the target variable (output).
y = values[:, 8]

# Define the Linear Discriminant Analysis (LDA) model.
# LDA is a classifier with a linear decision boundary, often used for dimensionality reduction and classification.
model = LinearDiscriminantAnalysis()

# Define the cross-validation procedure using K-Fold.
# 'n_splits=3' specifies 3-fold cross-validation.
# 'shuffle=True' shuffles the data before splitting into folds.
# 'random_state=1' ensures reproducibility of the shuffling.
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# Evaluate the model's performance using cross-validation.
# 'cross_val_score' performs cross-validation and returns the scores for each fold.
# 'model' is the LDA model to be evaluated.
# 'X' is the input features.
# 'y' is the target variable.
# 'cv=cv' uses the K-Fold cross-validation strategy defined earlier.
# 'scoring="accuracy"' specifies that accuracy is the metric to be used for evaluation.
result = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Report the mean accuracy score across all folds from the cross-validation.
# 'result.mean()' calculates the average accuracy.
# "Accuracy: %.3f" formats the mean accuracy to 3 decimal places for printing.
print("Accuracy: %.3f" % result.mean())

## Conclusion

In this tutorial, we learned how to:

- Identify and mark missing values in a dataset
- Understand the impact of missing values on machine learning algorithms
- Remove rows with missing data
- Evaluate a machine learning model on a cleaned dataset

These skills are crucial for preparing real-world datasets for analysis and machine learning tasks.

## Clean up

Remember to shut down your Jupyter notebook and delete any unnecessary resources when you're finished with this tutorial.