## Identifying missing values in tabular data

Before we discuss several techniques for dealing with missing values, let's create a simple example data frame to get a better grasp of the problem. The outputs below are shown in Comma-separated Values (CSV) form, where an empty field marks a missing value:

In [1]:
A <- c(1.0, 5.0, 10.0)
B <- c(2.0, 6.0, 11.0)
C <- c(3.0, NaN, 12.0)
D <- c(4.0, 8.0, NaN)

df <- data.frame(A, B, C, D)
df

A,B,C,D
1,2,3.0,4.0
5,6,,8.0
10,11,12.0,


Now let's check for missing values in the dataset. `is.na()` returns `TRUE` for every cell that holds `NA` or `NaN`:

In [2]:
is.na(df)

A,B,C,D
FALSE,FALSE,FALSE,FALSE
FALSE,FALSE,TRUE,FALSE
FALSE,FALSE,FALSE,TRUE


A convenient way to count the missing values in each column is to apply `colSums()` to the result of `is.na()`:

In [3]:
colSums(is.na(df))
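
The same idea extends to rows and to proportions. The following is a minimal sketch, not a definitive recipe: it rebuilds the example data frame so that it runs on its own, and the variable names `na_per_column`, `na_per_row`, and `na_share` are chosen here purely for illustration.

```r
# Rebuild the example data frame so the snippet is self-contained
df <- data.frame(
  A = c(1.0, 5.0, 10.0),
  B = c(2.0, 6.0, 11.0),
  C = c(3.0, NaN, 12.0),
  D = c(4.0, 8.0, NaN)
)

# is.na() flags both NA and NaN, so NaN cells are counted as missing
na_per_column <- colSums(is.na(df))   # number of missing values per column
na_per_row    <- rowSums(is.na(df))   # number of missing values per row
na_share      <- colMeans(is.na(df))  # fraction of missing values per column

print(na_per_column)
print(na_per_row)
print(na_share)
```

Row counts are useful for spotting samples that are mostly empty, while the per-column fraction helps decide whether a feature is worth keeping.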

## Eliminating samples or features with missing values

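The simplest approach is to drop every row (sample) that contains at least one missing value, which is what `na.omit()` does:

In [4]: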
na.omit(df)

A,B,C,D
1,2,3,4
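
`na.omit()` only removes rows. As a hedged sketch of the other options named in the heading, the snippet below also drops columns (features) that contain missing values and drops rows based on a single column. It rebuilds the example data frame to stay self-contained, and the result names `rows_complete`, `cols_complete`, and `rows_with_C` are illustrative.

```r
# Rebuild the example data frame so the snippet is self-contained
df <- data.frame(
  A = c(1.0, 5.0, 10.0),
  B = c(2.0, 6.0, 11.0),
  C = c(3.0, NaN, 12.0),
  D = c(4.0, 8.0, NaN)
)

# Keep only rows (samples) without any missing value
rows_complete <- df[complete.cases(df), ]

# Keep only columns (features) without any missing value;
# drop = FALSE keeps the result a data frame even if one column remains
cols_complete <- df[, colSums(is.na(df)) == 0, drop = FALSE]

# Drop a row only when a specific column (here C) is missing
rows_with_C <- df[!is.na(df$C), ]

print(rows_complete)
print(cols_complete)
print(rows_with_C)
```

Removing rows or columns is easy, but on a small dataset it can discard most of the data: here only one of three samples, or two of four features, survive.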


## Imputing missing values

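A common alternative to removing data is mean imputation: each missing value is replaced with the mean of the observed values in the same column. Here we do this for column `D`:

In [5]: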
df$D[is.na(df$D)] <- mean(df$D, na.rm = TRUE)
df

A,B,C,D
1,2,3.0,4
5,6,,8
10,11,12.0,6
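
To apply mean imputation to every column rather than to `D` alone, one option is a small helper applied column by column. This is a minimal sketch under the assumption that all columns are numeric; `impute_mean` and `df_imputed` are hypothetical names introduced here, not part of base R.

```r
# Rebuild the example data frame so the snippet is self-contained
df <- data.frame(
  A = c(1.0, 5.0, 10.0),
  B = c(2.0, 6.0, 11.0),
  C = c(3.0, NaN, 12.0),
  D = c(4.0, 8.0, NaN)
)

# Hypothetical helper: replace missing entries of a numeric vector
# with the mean of its observed (non-missing) values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to every column and rebuild the data frame;
# the original df is left unchanged
df_imputed <- as.data.frame(lapply(df, impute_mean))

print(df_imputed)
```

With this data, the missing entry in `C` becomes 7.5 (the mean of 3 and 12) and the missing entry in `D` becomes 6 (the mean of 4 and 8).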


## Filling in missing values automatically

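Alternatively, every remaining missing value can be replaced with a constant such as 0. Indexing with `is.na(df)` catches both `NA` and `NaN`, which is more robust than comparing against the string `"NaN"`:

In [6]:
df[is.na(df)] <- 0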
df

A,B,C,D
1,2,3,4
5,6,0,8
10,11,12,6
