# Missing Data

Imputation fills in missing values in a dataset. A missing value is an entry that is absent or null, for example because the data were never collected, were mistakenly not recorded, or were recorded incorrectly.

**Note**: A zero data value is not necessarily a missing value. Zero can be a useful measure, depending on the variable.

Missing values can cause issues in ML because many ML algorithms are not equipped to handle missing data.
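
Before choosing a strategy, it helps to see how much is actually missing. A minimal sketch, assuming a small hypothetical DataFrame `toy_df` (its column names are made up and it is not part of the example further down), counts missing values per column with pandas `isna()`. Note that the zero in `rainfall_mm` is not counted as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: np.nan / None mark missing values; 0.0 is a real measurement
toy_df = pd.DataFrame({
    "rainfall_mm": [0.0, 12.5, np.nan, 3.1],
    "station": ["north", None, "south", "south"],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()

# Fraction of missing values in each column
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share)
```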

## Simple Strategies

There are a few simple strategies for handling missing data:

1) **Drop rows with null values.** Dropping rows is reasonable when the values are missing at random and the number of affected rows is small relative to the total number of rows in the dataset. We have to be careful because dropping too much data can bias our model.

2) **Drop columns that mostly contain null values.** We can drop a column if most of its values are null. However, we have to be cautious because we don't want to drop a column that could have been useful for our predictions.
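
Both strategies can be expressed with pandas. The sketch below uses a hypothetical DataFrame `toy_df` and an arbitrary 50% cutoff for "mostly null"; the right threshold depends on the dataset. It drops the sparse column first so that its nulls do not cause nearly every row to be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "y" is mostly null, column "x" has one null
toy_df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "z": [10, 20, 30, 40, 50],
})

# Strategy 2: keep only columns where at most half of the values are null
null_share = toy_df.isna().mean()
cols_to_keep = null_share[null_share <= 0.5].index
fewer_cols = toy_df[cols_to_keep]

# Strategy 1: drop the remaining rows that contain any null value
complete_rows = fewer_cols.dropna(axis=0, how="any")

print(complete_rows)
```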

## Imputation

Imputation involves replacing a null value with a different value. There are numerous ways that a null value can be imputed, including:

1. Using the mean or median of the variable (for continuous data).
2. Using the mode of the variable (for categorical data).
3. Using a constant value (e.g., 0).
4. Using a machine learning model to predict the missing value from the other variables.
5. Many other ways that are beyond the scope of the class.
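
The first three options can be sketched with pandas alone. The DataFrame `toy_df` and its column names are hypothetical, and the choice of median over mean for `income` is only an illustration of how an outlier might influence that decision.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy_df = pd.DataFrame({
    "income": [40.0, 55.0, np.nan, 300.0],
    "color": ["red", "blue", "red", None],
    "visits": [3.0, np.nan, 1.0, 2.0],
})

filled = toy_df.copy()

# 1. Continuous data: median (less sensitive to the 300.0 outlier than the mean)
filled["income"] = filled["income"].fillna(filled["income"].median())

# 2. Categorical data: mode, i.e. the most frequent category
filled["color"] = filled["color"].fillna(filled["color"].mode()[0])

# 3. Constant value
filled["visits"] = filled["visits"].fillna(0)

print(filled)
```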

## Implementation

There are three commonly used ways to perform imputation in Python.

1. pandas `fillna()` - fills in missing values in a Series or DataFrame using a constant or a summary statistic that you compute yourself. This is particularly useful if you need to fill NA values differently for different columns.
1. sklearn `SimpleImputer` - fills in missing values using a constant or summary statistic. Useful if you want to fill in NA values in different columns using the same constant or the same summary statistic.
1. sklearn `IterativeImputer` - fills in missing values in each column as a function of the other variables. It is still experimental in scikit-learn and has to be enabled explicitly before it can be imported.
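
The difference between the first two can be sketched side by side. The DataFrame `toy_df` is hypothetical; passing a dict to `fillna()` gives each column its own fill value, while `SimpleImputer` applies one strategy to every column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy_df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 5.0],
    "B": [np.nan, 0.2, 0.4, 0.6],
})

# pandas: the dict maps each column to its own fill value
per_column = toy_df.fillna({"A": toy_df["A"].median(), "B": 0})

# sklearn: one strategy (here the median) is applied to every column
imputer = SimpleImputer(strategy="median")
same_strategy = pd.DataFrame(imputer.fit_transform(toy_df), columns=toy_df.columns)

print(per_column)
print(same_strategy)
```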

## Example
https://medium.com/analytics-vidhya/a-quick-guide-on-missing-data-imputation-techniques-in-python-2020-5410f3df1c1e

In [None]:
# Imports
import pandas as pd
import numpy as np

In [None]:
# Generate Data
np.random.seed(42)
my_df = pd.DataFrame()
my_df["A"] = np.random.randint( 1, 10, 10 )
my_df["B"] = np.random.random( 10 )
my_df["C"] = np.random.randint( 1, 20, 10 )
my_df["D"] = np.random.normal( 0, 10, 10 )

my_df

,A,B,C,D
0,7,0.142867,17,-5.251229
1,4,0.650888,10,19.127713
2,8,0.056412,16,-20.267196
3,5,0.721999,15,11.194236
4,7,0.938553,15,7.791926
5,3,0.000779,19,-11.010978
6,7,0.992212,12,11.302282
7,8,0.617482,3,3.731189
8,5,0.611653,5,-3.86473
9,4,0.007066,19,-11.587702


In [None]:
my_df.shape

(10, 4)

In [None]:
my_df.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
A,10.0,5.8,1.813529,3.0,4.25,6.0,7.0,8.0
B,10.0,0.473991,0.386246,0.000779,0.078025,0.614567,0.704221,0.992212
C,10.0,13.1,5.566766,3.0,10.5,15.0,16.75,19.0
D,10.0,0.116551,12.482028,-20.267196,-9.57104,-0.06677,10.343659,19.127713


In [None]:
my_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       10 non-null     int64  
 1   B       10 non-null     float64
 2   C       10 non-null     int64  
 3   D       10 non-null     float64
dtypes: float64(2), int64(2)
memory usage: 448.0 bytes


In [None]:
# Randomly change some values to NA
np.random.seed(11)
rand_Index_A = np.random.randint( 0, 9, 3 )
rand_Index_B = np.random.randint( 0, 9, 4 )

( rand_Index_A, rand_Index_B )

(array([0, 1, 7]), array([1, 7, 2, 8]))

In [None]:
# Assign with a single .loc call (row indexer, column label) to avoid chained assignment
my_df.loc[rand_Index_A, 'A'] = np.nan
my_df.loc[rand_Index_B, 'B'] = np.nan

my_df

,A,B,C,D
0,,0.142867,17,-5.251229
1,,,10,19.127713
2,8.0,,16,-20.267196
3,5.0,0.721999,15,11.194236
4,7.0,0.938553,15,7.791926
5,3.0,0.000779,19,-11.010978
6,7.0,0.992212,12,11.302282
7,,,3,3.731189
8,5.0,,5,-3.86473
9,4.0,0.007066,19,-11.587702


In [None]:
# Calculate the mode of Column A
# mode() returns all tied modes in sorted order; [0] takes the smallest one
my_df_clean_1 = my_df.copy()
mode_A = my_df_clean_1['A'].mode(dropna = True)[0]
# mode_A = my_df_clean_1['A'].dropna().mode()[0]  # Alternate way to calc mode
mode_A

5.0

In [None]:
# Fill in A using the mode of Column A
my_df_clean_1['A'] = my_df_clean_1['A'].fillna(mode_A)
my_df_clean_1

,A,B,C,D
0,5.0,0.142867,17,-5.251229
1,5.0,,10,19.127713
2,8.0,,16,-20.267196
3,5.0,0.721999,15,11.194236
4,7.0,0.938553,15,7.791926
5,3.0,0.000779,19,-11.010978
6,7.0,0.992212,12,11.302282
7,5.0,,3,3.731189
8,5.0,,5,-3.86473
9,4.0,0.007066,19,-11.587702


In [None]:
# Fill in B using 0
my_df_clean_1['B'] = my_df_clean_1['B'].fillna(0)
my_df_clean_1

,A,B,C,D
0,5.0,0.142867,17,-5.251229
1,5.0,0.0,10,19.127713
2,8.0,0.0,16,-20.267196
3,5.0,0.721999,15,11.194236
4,7.0,0.938553,15,7.791926
5,3.0,0.000779,19,-11.010978
6,7.0,0.992212,12,11.302282
7,5.0,0.0,3,3.731189
8,5.0,0.0,5,-3.86473
9,4.0,0.007066,19,-11.587702


In [None]:
my_df

,A,B,C,D
0,,0.142867,17,-5.251229
1,,,10,19.127713
2,8.0,,16,-20.267196
3,5.0,0.721999,15,11.194236
4,7.0,0.938553,15,7.791926
5,3.0,0.000779,19,-11.010978
6,7.0,0.992212,12,11.302282
7,,,3,3.731189
8,5.0,,5,-3.86473
9,4.0,0.007066,19,-11.587702


In [None]:
# Simple Imputer
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values = np.nan, strategy = 'mean')
my_df_simple_impute = imp.fit_transform(my_df)
pd.DataFrame( my_df_simple_impute, columns = my_df.columns )

,A,B,C,D
0,5.571429,0.142867,17.0,-5.251229
1,5.571429,0.467246,10.0,19.127713
2,8.0,0.467246,16.0,-20.267196
3,5.0,0.721999,15.0,11.194236
4,7.0,0.938553,15.0,7.791926
5,3.0,0.000779,19.0,-11.010978
6,7.0,0.992212,12.0,11.302282
7,5.571429,0.467246,3.0,3.731189
8,5.0,0.467246,5.0,-3.86473
9,4.0,0.007066,19.0,-11.587702


In [None]:
# Iterative Imputer
# IterativeImputer is experimental, so it must be enabled before it can be imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, random_state = 4)
my_df_iter_impute = imp.fit_transform(my_df)
pd.DataFrame( my_df_iter_impute, columns = my_df.columns )

,A,B,C,D
0,5.569097,0.142867,17.0,-5.251229
1,5.583886,1.21586,10.0,19.127713
2,8.0,-0.171245,16.0,-20.267196
3,5.0,0.721999,15.0,11.194236
4,7.0,0.938553,15.0,7.791926
5,3.0,0.000779,19.0,-11.010978
6,7.0,0.992212,12.0,11.302282
7,5.578266,0.80286,3.0,3.731189
8,5.0,0.496488,5.0,-3.86473
9,4.0,0.007066,19.0,-11.587702
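
Note that the imputed values for column B in rows 1 and 2 (1.21586 and -0.171245) fall outside the [0, 1) range that `np.random.random` produces. One possible remedy is the `min_value` and `max_value` parameters of `IterativeImputer`, which clip imputed values to a range. The sketch below uses hypothetical data (`toy_df`, columns `u` and `v`, and the seeds are all assumptions) and sets the bounds per column.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: "u" lies in [0, 1), "v" is unbounded
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "u": rng.random(20),
    "v": rng.normal(0, 10, 20),
})
toy_df.loc[[2, 5, 11], "u"] = np.nan

# One bound per column: clip imputed "u" to [0, 1], leave "v" unbounded
bounded = IterativeImputer(
    max_iter=10,
    random_state=4,
    min_value=[0.0, -np.inf],
    max_value=[1.0, np.inf],
)
toy_filled = pd.DataFrame(bounded.fit_transform(toy_df), columns=toy_df.columns)

print(toy_filled.loc[[2, 5, 11]])
```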


In [None]:
# Extract the imputed column A (column 0) as a 2-D column vector
my_df_iter_impute[:,0].reshape(-1,1)

array([[5.56909701],
       [5.58388612],
       [8.        ],
       [5.        ],
       [7.        ],
       [3.        ],
       [7.        ],
       [5.57826587],
       [5.        ],
       [4.        ]])

In [None]:
my_df

,A,B,C,D
0,,0.142867,17,-5.251229
1,,,10,19.127713
2,8.0,,16,-20.267196
3,5.0,0.721999,15,11.194236
4,7.0,0.938553,15,7.791926
5,3.0,0.000779,19,-11.010978
6,7.0,0.992212,12,11.302282
7,,,3,3.731189
8,5.0,,5,-3.86473
9,4.0,0.007066,19,-11.587702


In [None]:
# Mean and standard deviation of column A (NaN values are skipped by default)
m, s = (my_df["A"].mean(), my_df["A"].std())
m,s

(5.571428571428571, 1.8126539343499315)

In [None]:
# Draw 3 random values from a normal distribution with that mean and standard deviation
np.random.normal(m, s, 3 )

array([5.55641141, 4.99204752, 4.59870525])
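
The last two cells point toward one more option: filling each missing value with a random draw from a normal distribution that has the column's mean and standard deviation, rather than with a single constant. The notebook does not show that final step, so the following is only a sketch. The Series `col` mirrors column A from above, and the seed and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Same values as column A above, with NaN at rows 0, 1 and 7
col = pd.Series([np.nan, np.nan, 8.0, 5.0, 7.0, 3.0, 7.0, np.nan, 5.0, 4.0], name="A")

# Mean and standard deviation of the observed values (NaN is skipped)
m, s = col.mean(), col.std()

# Boolean mask of the missing positions
missing = col.isna()

# One random draw per missing value
rng = np.random.default_rng(0)
draws = rng.normal(m, s, int(missing.sum()))

# Fill only the missing positions; observed values stay unchanged
col_filled = col.copy()
col_filled.loc[missing] = draws

print(col_filled)
```

Because the draws are random, the filled values change with the seed, and for a count-like column such as A they will generally not be integers.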