# Mark and Remove Missing Data

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

## Overview

In this tutorial, we will learn how to handle missing data in datasets, specifically focusing on marking and removing missing values. By applying these techniques, you'll be able to prepare high-quality datasets for machine learning, improving model reliability and accuracy.

We'll use the Pima Indians Diabetes dataset as an example to demonstrate these techniques.

## Learning Objectives

- Learn how to identify and mark invalid or corrupt values as missing in a dataset
- Understand how the presence of marked missing values affects machine learning algorithms
- Learn how to remove rows with missing data from a dataset
- Evaluate a learning algorithm on a dataset after removing rows with missing values

## Prerequisites

- Basic understanding of Python programming
- Familiarity with pandas, numpy, and scikit-learn libraries

## Get Started

To start, we install the required packages and import the necessary libraries.


### Install required packages


In [None]:
%pip install numpy pandas scikit-learn

### Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

# Pima Indians Diabetes data set file
pima_indians_diabetes_csv = "../../Data/pima-indians-diabetes.csv"

## Diabetes Dataset

The dataset classifies each patient as either
showing an onset of diabetes within five years or not.

```
Number of Instances: 768
Number of Attributes: 8 plus class
For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for
   diabetes")
   Class Value  Number of instances
   0            500
   1            268
```

You can learn more about the dataset here:

- Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
- Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

A further description of the dataset is available on [Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).


### Load and summarize the dataset


In [None]:
# Load the dataset
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Peek into the top five rows
dataset.head()

In [None]:
# Summarize the dataset
print(dataset.describe())

We can see that several columns have a minimum value of zero (0).
In some columns, a value of zero does not make sense and indicates an invalid or missing value.


Specifically, the following columns have an invalid zero minimum value:

2. Plasma glucose concentration
3. Diastolic blood pressure
4. Triceps skinfold thickness
5. 2-Hour serum insulin
6. Body mass index
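As a minimal sketch of how such columns can be found programmatically, the snippet below builds a small made-up DataFrame (the values and the name `toy` are illustrative assumptions, not rows from the real file) and lists the columns whose minimum is zero. Note that a zero minimum is only a hint: a count column such as the number of pregnancies can legitimately be zero, so domain knowledge is still needed to decide which zeros are invalid.

```python
import pandas as pd

# Hypothetical toy data: columns 1 and 2 use 0 as a stand-in for "missing",
# while column 0 (a count) can legitimately be zero.
toy = pd.DataFrame({
    0: [6, 1, 0, 3],
    1: [148, 85, 0, 89],
    2: [72, 66, 64, 0],
    3: [33.6, 26.6, 23.3, 28.1],
})

# Column-wise minimum, the same statistic reported in the "min" row of describe()
col_min = toy.min()

# Columns whose minimum is exactly zero are candidates for closer inspection
zero_min_columns = col_min[col_min == 0].index.tolist()
print(zero_min_columns)
```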

We can confirm this by looking at the raw data and printing out the first 20 rows of data.


### Load the dataset and review rows


In [None]:
# Load the dataset
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Summarize the first 20 rows of data
print(dataset.head(20))

We can count the number of missing values in each of these columns by counting the zeros.


### Summarizing the number of missing values for each variable


In [None]:
# Load the dataset
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Count the number of missing values for each column
num_missing = (dataset[[1, 2, 3, 4, 5]] == 0).sum()

# Report the results
print(num_missing)

We can see that columns 1, 2 and 5 (zero-based column indices) have just a few zero values,
whereas columns 3 and 4 show far more, approaching half of the rows for column 4. This
highlights that different missing value strategies may be needed for different columns,
e.g. to ensure that enough records remain to train a predictive model.
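One way to make this comparison concrete is to look at the fraction of zeros per column rather than the raw count. The sketch below uses made-up values; the column names and the `threshold` cut-off are assumptions chosen only for illustration, not a recommended rule.

```python
import pandas as pd

# Hypothetical toy data where 0 stands in for a missing measurement
toy = pd.DataFrame({
    "glucose": [148, 85, 0, 89, 137, 116],
    "insulin": [0, 0, 0, 94, 168, 0],
})

# (toy == 0) is a boolean frame, so its column mean is the fraction of zeros
zero_fraction = (toy == 0).mean()

# Hypothetical cut-off separating "a few" from "a lot" of missing values
threshold = 0.3
few_missing = zero_fraction[zero_fraction <= threshold].index.tolist()
many_missing = zero_fraction[zero_fraction > threshold].index.tolist()

print(zero_fraction)
print("few missing:", few_missing)
print("many missing:", many_missing)
```

Columns in the second group are the ones where simply dropping rows would discard a large share of the data.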


In pandas, NumPy and scikit-learn, missing values are conventionally marked as NaN.
pandas operations such as sum() and count() skip NaN values by default. We can mark values
as NaN easily with the pandas DataFrame by using the replace() function on the subset of
columns we are interested in. After we have marked the missing values, we can use the
isnull() function to flag every NaN value in the dataset as True and then sum() to get a
count of the missing values for each column.
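The following sketch illustrates this behaviour on a tiny made-up DataFrame (the name `toy` and its values are assumptions). It also shows that the skipping applies to pandas reductions: plain `np.sum` on the underlying array propagates NaN, while `np.nansum` ignores it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where 0 stands in for a missing value
toy = pd.DataFrame({"a": [1.0, 0.0, 3.0], "b": [0.0, 5.0, 0.0]})

# Mark zeros as NaN; replace() returns a new DataFrame
marked = toy.replace(0, np.nan)

# isnull() gives a boolean frame; summing it counts the NaN values per column
nan_counts = marked.isnull().sum()

# count() reports the number of non-missing values per column
non_missing = marked.count()

# pandas skips NaN by default, NumPy's plain sum does not
pandas_sum = marked["a"].sum()
numpy_sum = np.sum(marked["a"].to_numpy())
nan_aware_sum = np.nansum(marked["a"].to_numpy())

print(nan_counts)
print(non_missing)
print(pandas_sum, numpy_sum, nan_aware_sum)
```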


### Marking missing values with nan values


In [None]:
# Load the dataset
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Replace '0' values with 'nan'
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Count the number of nan values in each column
print(dataset.isnull().sum())

We can confirm that the zeros were replaced by printing out the first 20 rows of data.


### Review data with missing values marked with a nan


In [None]:
# Load the dataset
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Replace '0' values with 'nan'
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Summarize the first 20 rows of data
print(dataset.head(20))

## Missing Values Cause Problems

Having missing values in a dataset can cause errors with some machine learning algorithms.
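Before handing data to a model, it can help to check whether the input array contains NaN at all. The snippet below is a minimal sketch of such a pre-check using NumPy only; the array `X` is made-up data, not taken from the diabetes file.

```python
import numpy as np

# Hypothetical input matrix with two missing entries
X = np.array([
    [148.0, 72.0, 35.0],
    [85.0, np.nan, 29.0],
    [183.0, 64.0, np.nan],
])

# True if any entry in the matrix is NaN
has_nan = bool(np.isnan(X).any())

# Indices of the rows that contain at least one NaN
rows_with_nan = np.flatnonzero(np.isnan(X).any(axis=1)).tolist()

print(has_nan)
print(rows_with_nan)
```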

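*Note!* The following code block is expected to fail: you should see an error message when you run it. Depending on your scikit-learn version, the failure may instead appear as warnings and an accuracy of `nan`.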

In [None]:
# Load the dataset
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Replace '0' values with 'nan'
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Split dataset into inputs and outputs
values = dataset.values
X = values[:, 0:8]
y = values[:, 8]

# Define the model
#
# A classifier with a linear decision boundary, generated by fitting class
# conditional densities to the data and using Bayes' rule.
model = LinearDiscriminantAnalysis()

# Define the model evaluation procedure using K fold cross-validation
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# Evaluate the model accuracy score, and report the mean performance if it succeeds.
try:
    result = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print("Accuracy: %.3f" % result.mean())
except ValueError as e:
    print(f"********************* An (expected) error occurred *********************\n{e}")

## Remove Rows With Missing Values

The simplest approach for dealing with missing values is to remove entire predictor(s)
and/or sample(s) that contain missing values.

We can do this by creating a new Pandas DataFrame with the rows containing missing values
removed. Pandas provides the `dropna()` function that can be used to drop either columns or
rows with missing data. We can use `dropna()` to remove all rows with missing data.
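As a brief sketch of the main options, the snippet below applies `dropna()` to a small made-up DataFrame (the column names and values are assumptions). It uses the `axis` and `subset` parameters of `DataFrame.dropna` to contrast dropping rows, dropping columns, and dropping rows based on selected columns only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NaN marking the missing values
toy = pd.DataFrame({
    "glucose": [148.0, np.nan, 183.0, 89.0],
    "insulin": [np.nan, np.nan, np.nan, 94.0],
    "age": [50, 31, 32, 21],
})

# Default (axis=0): drop every row that has at least one NaN
rows_dropped = toy.dropna()

# axis=1: drop every column that has at least one NaN
cols_dropped = toy.dropna(axis=1)

# subset: only consider the listed columns when deciding which rows to drop
subset_dropped = toy.dropna(subset=["glucose"])

print(rows_dropped.shape)
print(cols_dropped.shape)
print(subset_dropped.shape)
```

By default `dropna()` returns a new DataFrame and leaves the original untouched; passing `inplace=True` modifies the DataFrame directly instead.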


### Example of removing rows that contain missing values


In [None]:
# Load the dataset
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Summarize the shape of the raw data
print(dataset.shape)

# Replace '0' values with 'nan'
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Drop rows with missing values
dataset.dropna(inplace=True)

# Summarize the shape of the data with missing rows removed
print(dataset.shape)

We now have a dataset that we can use to evaluate an algorithm that is sensitive to missing
values, such as LDA.


### Evaluate model on data after rows with missing data are removed


In [None]:
# Load the dataset
dataset = pd.read_csv(pima_indians_diabetes_csv, header=None)

# Replace '0' values with 'nan'
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)

# Drop rows with missing values
dataset.dropna(inplace=True)

# Split dataset into inputs and outputs
values = dataset.values
X = values[:, 0:8]
y = values[:, 8]

# Define the model
model = LinearDiscriminantAnalysis()

# Define the model evaluation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# Evaluate the model accuracy score
result = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Report the mean performance
print("Accuracy: %.3f" % result.mean())

## Conclusion

In this tutorial, we learned how to:

- Identify and mark missing values in a dataset
- Understand the impact of missing values on machine learning algorithms
- Remove rows with missing data
- Evaluate a machine learning model on a cleaned dataset

These skills are crucial for preparing real-world datasets for analysis and machine learning tasks.

## Clean up

Remember to shut down your Jupyter notebook and delete any unnecessary resources when you're finished with this tutorial.