Skip to content
Missing Data Imputation Python Library
Branch: master
Clone or download
Latest commit f35eac4 Nov 15, 2018
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
data
images
.gitignore
LICENSE
README.md
bayesian_parameter_optimization.py
create_folders.py
draw_network.py
example_adult.py
example_adult_mcar.py
example_votes.py
include_data.csv
include_votes.csv
missing_data_imputation.py
neural_networks.py
nnet_bin_scaled.py
nnet_full_bin_scaled.py
nnet_lasagne.py
nnet_utils.py
parameter_search.py
params.py
plot_errors_boxplot.py
plot_parameters_tried.py
plotting.py
predict_with_all_models.py
predict_with_best_dt_and_rf.py
predict_with_best_model.py
predict_with_dt_and_rf.py
preprocess_data.py
preprocess_test_data.py
preprocess_votes.py
processing.py

README.md

MDI

This repository is associated with the paper "Missing Data Imputation for Supervised Learning" (arXiv), which empirically evaluates methods for imputing missing categorical data for supervised learning tasks.

Please cite the paper if you use this code for academic research:

@article{poulos2018missing,
  title={Missing data imputation for supervised learning},
  author={Poulos, Jason and Valle, Rafael},
  journal={Applied Artificial Intelligence},
  volume={32},
  number={2},
  pages={186--196},
  year={2018},
  publisher={Taylor \& Francis}
}

Techniques for handling categorical missing data

We categorize proposed imputation methods into six groups listed below:

Case substitution One observation with missing data is replaced with another non-sampled obser- vation.

Summary statistic Replace the missing data with the mean, median, or mode of the feature vec- tor. Using a numerical approach directly is not appropriate for nonordinal categorical data.

One-hot Create a binary variable to indicate whether or not a specific feature is missing.

Hot deck and cold deck Compute the K-Nearest Neighbors of the observation with missing data and assign the mode of the K-neighbors to the missing data. algorithm.

Prediction Model Train a prediction model (e.g., random forests) to predict the missing value.

Factor analysis Perform factor analysis (e.g., principal component analysis (PCA)) on the design matrix, project the design matrix onto the first N eigenvectors and replace the missing values by the values that might be given by the projected design matrix.

Adult Dataset example

The figure below shows frequency of job category in the Adult dataset before and after the imputation techniques above were used.
Code can be found here

Adult dataset Imputation

Congresssional voting records dataset example

Code can be found here

Congresssional voting records dataset imputation

You can’t perform that action at this time.