# Chapter 7: How to Mark and Remove Missing Data

## Objective
- Real-world datasets often contain **missing values**.
- Reasons for missing data include:
  - Unrecorded observations
  - Data corruption
- Many machine learning algorithms **do not support missing values**.
- In this chapter, you will learn:
  - How to identify and mark missing values
  - How missing values affect model performance
  - How to remove rows with missing values and evaluate models

---

## 7.1 Tutorial Overview
This tutorial is divided into 4 parts:
1. Diabetes Dataset
2. Mark Missing Values
3. Missing Values Cause Problems
4. Remove Rows With Missing Values

---

## 7.2 Diabetes Dataset
- The **Pima Indians Diabetes Dataset** is used.
- It contains 768 samples and 8 input variables.
- The task is a **binary classification** (onset of diabetes or not).
- A naive model achieves ~65% accuracy; good models reach ~77%.
- Some variables (e.g., BMI, blood pressure) contain **zero values**, which are **invalid** and considered **missing**.
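
The ~65% naive figure can be checked with simple arithmetic. The sketch below assumes the class split reported for this dataset (500 of the 768 rows are labeled 0) and computes the accuracy of a model that always predicts the majority class. The variable names are illustrative only.

```python
# Class counts for the Pima Indians Diabetes Dataset:
# 768 rows in total, 500 of them labeled 0 (no onset of diabetes).
n_rows = 768
n_negative = 500
n_positive = n_rows - n_negative

# A naive model that always predicts the majority class is correct
# exactly when the true label is that class.
baseline_accuracy = max(n_negative, n_positive) / n_rows
print('Baseline accuracy: %.1f%%' % (baseline_accuracy * 100))  # prints 65.1%
```

Any model worth keeping should beat this baseline.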

---

## 7.3 Mark Missing Values
- Summary statistics and visualizations can help detect missing or invalid values.
- Columns with invalid zero values:
  - Plasma glucose concentration
  - Diastolic blood pressure
  - Triceps skinfold thickness
  - 2-Hour serum insulin
  - Body mass index (BMI)
- Use domain knowledge to treat 0 as missing.
- Replace zeros with **NaN** using the pandas `DataFrame.replace()` method.
- Use `isnull().sum()` to count missing values per column.
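
The marking and counting steps listed above can be sketched on a handful of rows. This is a minimal illustration, not the full workflow: the five rows are copied from the first 20 rows of the dataset, and the names `sample` and `invalid_zero_cols` are assumptions made for this example.

```python
from numpy import nan
from pandas import DataFrame

# Five rows taken from the head of the dataset (columns 0-8).
sample = DataFrame([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [10, 115, 0, 0, 0, 35.3, 0.134, 29, 0],
    [8, 125, 96, 0, 0, 0.0, 0.232, 54, 1],
])

# Columns where a zero is physically impossible and therefore missing.
invalid_zero_cols = [1, 2, 3, 4, 5]

# Mark the invalid zeros as NaN; other columns keep their zeros.
sample[invalid_zero_cols] = sample[invalid_zero_cols].replace(0, nan)

# Count the NaN values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Column 8 (the class label) still contains zeros after this step, because only the columns named in `invalid_zero_cols` are touched.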

---

## 7.4 Missing Values Cause Problems
- Many ML algorithms (e.g., LDA, SVM, neural networks) **cannot handle NaN** values.
- Attempting to train a model with missing values raises errors.
- Therefore, missing data **must be handled before modeling**.
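
One way to catch the problem before it reaches the model is a small guard that fails early with a readable message. The helper `assert_no_missing` and the column names below are hypothetical, written for this sketch; they are not part of pandas or scikit-learn.

```python
from numpy import nan
from pandas import DataFrame

def assert_no_missing(frame):
    """Raise ValueError naming the columns that still contain NaN."""
    counts = frame.isnull().sum()
    bad = counts[counts > 0]
    if len(bad) > 0:
        raise ValueError('NaN found in columns: %s' % list(bad.index))

# Two tiny frames with made-up column names: one clean, one with a NaN.
clean = DataFrame({'glucose': [148.0, 85.0], 'bmi': [33.6, 26.6]})
marked = DataFrame({'glucose': [148.0, nan], 'bmi': [33.6, 26.6]})

assert_no_missing(clean)  # passes silently

try:
    assert_no_missing(marked)
except ValueError as err:
    message = str(err)
    print(message)
```

Calling a check like this just before fitting makes the failure point obvious, rather than surfacing deep inside a cross-validation run.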

---

## 7.5 Remove Rows With Missing Values
- The simplest strategy is to **drop rows** containing missing values using `dropna()`.
- This may result in significant data loss.
  - Original rows: 768
  - After removing missing rows: 392
- Once rows are removed, algorithms like **LDA** can be applied successfully.
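
The effect of `dropna()` and the size of the data loss can be sketched as follows. The toy frame and its column names are assumptions for illustration; the row counts 768 and 392 are the ones reported in this chapter.

```python
from numpy import nan
from pandas import DataFrame

# Toy frame: rows 1 and 2 each contain one NaN.
toy = DataFrame({
    'a': [1.0, nan, 3.0, 4.0],
    'b': [5.0, 6.0, nan, 8.0],
})

# dropna() removes every row that has at least one NaN.
kept = toy.dropna()
print(toy.shape, kept.shape)  # (4, 2) (2, 2)

# Data loss for the diabetes dataset, using the reported row counts.
rows_before = 768
rows_after = 392
fraction_lost = (rows_before - rows_after) / rows_before
print('Rows lost: %d (%.1f%%)' % (rows_before - rows_after, fraction_lost * 100))
```

Losing roughly half of the rows is the main cost of this strategy.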

---

## Summary
- Identifying and marking missing values is crucial for data preprocessing.
- Missing values must be handled to avoid errors during model training.
- One basic strategy is to remove rows with missing values, though more advanced techniques like **imputation** may be preferred later.



## Mark Missing Values

In [15]:
# load and summarize the dataset
from pandas import read_csv
filename = 'pima-indians-diabetes.csv'
dataset = read_csv(filename, header=None)
dataset.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [17]:
# load the dataset and review rows
from pandas import read_csv
filename = 'pima-indians-diabetes.csv'
dataset = read_csv(filename, header=None)
print(dataset.head(20))

     0    1   2   3    4     5      6   7  8
0    6  148  72  35    0  33.6  0.627  50  1
1    1   85  66  29    0  26.6  0.351  31  0
2    8  183  64   0    0  23.3  0.672  32  1
3    1   89  66  23   94  28.1  0.167  21  0
4    0  137  40  35  168  43.1  2.288  33  1
5    5  116  74   0    0  25.6  0.201  30  0
6    3   78  50  32   88  31.0  0.248  26  1
7   10  115   0   0    0  35.3  0.134  29  0
8    2  197  70  45  543  30.5  0.158  53  1
9    8  125  96   0    0   0.0  0.232  54  1
10   4  110  92   0    0  37.6  0.191  30  0
11  10  168  74   0    0  38.0  0.537  34  1
12  10  139  80   0    0  27.1  1.441  57  0
13   1  189  60  23  846  30.1  0.398  59  1
14   5  166  72  19  175  25.8  0.587  51  1
15   7  100   0   0    0  30.0  0.484  32  1
16   0  118  84  47  230  45.8  0.551  31  1
17   7  107  74   0    0  29.6  0.254  31  1
18   1  103  30  38   83  43.3  0.183  33  0
19   1  115  70  30   96  34.6  0.529  32  1


In [25]:
# summarize the number of zero values in each column
from pandas import read_csv
filename = 'pima-indians-diabetes.csv'
dataset = read_csv(filename, header=None)

# count the zeros in columns 1-8; a zero marks a missing value only in columns 1-5
# (column 8 is the class label, so its zeros are valid class-0 rows)
num_missing = (dataset[[1,2,3,4,5,6,7,8]] == 0).sum()

# report the results
print(num_missing)

1      5
2     35
3    227
4    374
5     11
6      0
7      0
8    500
dtype: int64
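
The raw counts are easier to compare as a share of the 768 rows. The sketch below takes the zero counts printed above for columns 1 to 5 and converts them to percentages. The dictionary `zero_counts` is simply a hand-copied version of that output.

```python
# Zero counts for columns 1-5, copied from the output above.
zero_counts = {1: 5, 2: 35, 3: 227, 4: 374, 5: 11}
n_rows = 768

# Fraction of rows in which each column holds an invalid zero.
zero_share = {col: count / n_rows for col, count in zero_counts.items()}

for col, share in zero_share.items():
    print('column %d: %.1f%% zeros' % (col, share * 100))
```

Columns 3 and 4 (triceps skinfold thickness and 2-hour serum insulin) stand out: close to 30% and almost 49% of their values are zeros.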


In [27]:
# Listing 7.10: review the data with missing values marked as NaN
from numpy import nan
from pandas import read_csv
filename = 'pima-indians-diabetes.csv'
dataset = read_csv(filename, header=None)
print(dataset.head(20))

# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# summarize the first 20 rows of data
print(dataset.head(20))

     0    1   2   3    4     5      6   7  8
0    6  148  72  35    0  33.6  0.627  50  1
1    1   85  66  29    0  26.6  0.351  31  0
2    8  183  64   0    0  23.3  0.672  32  1
3    1   89  66  23   94  28.1  0.167  21  0
4    0  137  40  35  168  43.1  2.288  33  1
5    5  116  74   0    0  25.6  0.201  30  0
6    3   78  50  32   88  31.0  0.248  26  1
7   10  115   0   0    0  35.3  0.134  29  0
8    2  197  70  45  543  30.5  0.158  53  1
9    8  125  96   0    0   0.0  0.232  54  1
10   4  110  92   0    0  37.6  0.191  30  0
11  10  168  74   0    0  38.0  0.537  34  1
12  10  139  80   0    0  27.1  1.441  57  0
13   1  189  60  23  846  30.1  0.398  59  1
14   5  166  72  19  175  25.8  0.587  51  1
15   7  100   0   0    0  30.0  0.484  32  1
16   0  118  84  47  230  45.8  0.551  31  1
17   7  107  74   0    0  29.6  0.254  31  1
18   1  103  30  38   83  43.3  0.183  33  0
19   1  115  70  30   96  34.6  0.529  32  1
     0      1     2     3      4     5      6   7  8
0 

## Missing Values Cause Problems

In [33]:
# where missing values cause errors
from numpy import nan
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

dataset = read_csv('pima-indians-diabetes.csv', header=None)

# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]

# define the model
model = LinearDiscriminantAnalysis()

# define the model evaluation procedure
cv = KFold(n_splits=3, shuffle=True)

# evaluate the model
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

# report the mean performance
print('Accuracy: %.3f' % result.mean())

ValueError: 
All the 3 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\HP\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\HP\anaconda3\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HP\anaconda3\Lib\site-packages\sklearn\discriminant_analysis.py", line 589, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HP\anaconda3\Lib\site-packages\sklearn\base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\HP\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 1301, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "C:\Users\HP\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 1064, in check_array
    _assert_all_finite(
  File "C:\Users\HP\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 123, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "C:\Users\HP\anaconda3\Lib\site-packages\sklearn\utils\validation.py", line 172, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
LinearDiscriminantAnalysis does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values


In this section, we try to evaluate the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values. LDA does not work when there are missing values in the dataset. The example above marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross-validation and print the mean accuracy. Instead, every fit fails with `ValueError: Input X contains NaN.`

## Remove Rows With Missing Values

In [41]:
# removing rows that contain missing values
from numpy import nan
from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)

# summarize the shape of the raw data
print(dataset.shape)

# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# drop rows with missing values
dataset.dropna(inplace=True)

# summarize the shape of the data with missing rows removed
print(dataset.shape)

(768, 9)
(392, 9)


Running this example, we can see that the number of rows has been aggressively cut from **768** in the original dataset to **392** once all rows containing a NaN are removed.

We now have a dataset that we can use to evaluate an algorithm that is sensitive to missing values, such as **LDA**.

In [46]:
# evaluate model on data after rows with missing data are removed
from numpy import nan
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

dataset = read_csv('pima-indians-diabetes.csv', header=None)

# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)

# drop rows with missing values
dataset.dropna(inplace=True)

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]

# define the model
model = LinearDiscriminantAnalysis()

# define the model evaluation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

# report the mean performance
print('Accuracy:', result.mean()*100)

Accuracy: 78.05832844000783
