Missing Data Imputation Python Library



This repository is associated with the paper "Missing Data Imputation for Supervised Learning", which compares methods for imputing missing categorical data for supervised learning tasks.

Please cite the paper if you use this code for academic research:

  Poulos, Jason and Valle, Rafael. "Missing Data Imputation for Supervised Learning." arXiv preprint arXiv:1610.09075.

Techniques for handling categorical missing data

We categorize proposed imputation methods into six groups listed below:

Case substitution: One observation with missing data is replaced with another non-sampled observation.

Summary statistic: Replace the missing data with the mean, median, or mode of the feature vector. The mean and median are not appropriate for non-ordinal categorical data, for which only the mode is meaningful.

One-hot: Create a binary variable to indicate whether or not a specific feature is missing.

Hot deck and cold deck: Compute the K-Nearest Neighbors of the observation with missing data and assign the mode of the K neighbors to the missing data.

Prediction model: Train a prediction model (e.g., random forests) to predict the missing value.

Factor analysis: Perform factor analysis (e.g., principal component analysis (PCA)) on the design matrix, project the design matrix onto the first N eigenvectors, and replace the missing values with the corresponding values of the projected design matrix.
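
As a minimal sketch of the summary-statistic approach listed above, the snippet below fills the missing entries of one categorical column with its mode. The function name `impute_mode`, the use of `None` as the missing marker, and the toy column are assumptions made for illustration; they are not taken from this repository's code.

```python
from collections import Counter


def impute_mode(column):
    """Return a copy of `column` with None entries replaced by the mode.

    Assumption: missing values are encoded as None.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to take a mode from")
    # most_common(1) returns [(value, count)] for the most frequent value
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


# Toy job-category column with two missing entries (illustrative data only)
workclass = ["Private", None, "State-gov", "Private", None, "Self-emp"]
imputed = impute_mode(workclass)
```

Mode imputation is cheap, but it inflates the frequency of the most common category, which is the kind of distortion a before/after frequency plot makes visible.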
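
The one-hot approach listed above could look roughly like the following: a categorical column is expanded into binary columns, with one extra column flagging missing entries. The helper name `one_hot_with_missing`, the `"missing"` column label, and the `None` marker are hypothetical choices for this sketch.

```python
def one_hot_with_missing(column):
    """One-hot encode a categorical column and add a 'missing' indicator.

    Assumption: missing values are encoded as None.
    Returns (header, rows), where each row is a list of 0/1 integers.
    """
    categories = sorted({v for v in column if v is not None})
    header = categories + ["missing"]
    rows = []
    for value in column:
        row = [int(value == c) for c in categories]
        row.append(int(value is None))  # 1 only when the entry is missing
        rows.append(row)
    return header, rows


# Toy vote column with one missing entry (illustrative data only)
header, rows = one_hot_with_missing(["y", None, "n", "y"])
```

With this encoding nothing is guessed: the downstream classifier sees "missing" as a category of its own.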
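
For the K-Nearest Neighbors approach listed above, a dependency-free sketch might measure similarity with a Hamming distance over the observed features and take the mode among the K closest donors. The distance choice, the names `hamming` and `knn_impute`, and the `None` marker are assumptions of this sketch rather than the repository's implementation.

```python
from collections import Counter


def hamming(a, b, skip):
    """Count mismatching features, ignoring column `skip` and missing entries."""
    return sum(
        1
        for j, (x, y) in enumerate(zip(a, b))
        if j != skip and x is not None and y is not None and x != y
    )


def knn_impute(rows, k=3):
    """Fill each None with the mode of that column among the k nearest rows."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows that have column j observed
            donors = [r for n, r in enumerate(rows) if n != i and r[j] is not None]
            if not donors:
                continue  # nothing to borrow from; leave the entry missing
            donors.sort(key=lambda r: hamming(row, r, j))
            votes = Counter(r[j] for r in donors[:k])
            imputed[i][j] = votes.most_common(1)[0][0]
    return imputed


# Toy voting records with two missing entries (illustrative data only)
records = [
    ["y", "y", "n"],
    ["y", "y", "n"],
    ["n", "n", "y"],
    ["n", "n", "y"],
    ["y", "y", None],
    [None, "n", "y"],
]
filled = knn_impute(records, k=2)
```

Distances are always computed on the original rows, so one imputed value never influences another.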
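
The prediction-model approach listed above can be sketched with scikit-learn's `RandomForestClassifier`: the column with missing values becomes the target, and the remaining columns become the features. This sketch assumes integer-coded categories, `-1` as the missing marker, and complete feature columns; the function name `impute_with_forest` is hypothetical. Integer codes impose an arbitrary order on nominal features, so a real pipeline would probably one-hot encode them first.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def impute_with_forest(X, target_col, missing=-1, seed=0):
    """Predict entries equal to `missing` in `target_col` from the other columns.

    Assumptions: categories are integer-coded and the other columns are complete.
    """
    X = np.asarray(X)
    is_missing = X[:, target_col] == missing
    if not is_missing.any():
        return X.copy()
    features = np.delete(X, target_col, axis=1)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    # Train only on rows where the target column is observed
    model.fit(features[~is_missing], X[~is_missing, target_col])
    filled = X.copy()
    filled[is_missing, target_col] = model.predict(features[is_missing])
    return filled


# Toy integer-coded data; the last column has two missing entries (-1)
data = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, -1],
    [1, 0, -1],
])
completed = impute_with_forest(data, target_col=2)
```

When several columns have missing values, the same routine can be applied to each column in turn, provided the feature columns are filled by some simpler method first.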
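
The factor-analysis approach listed above can be illustrated with a truncated SVD in NumPy. The sketch assumes a numeric design matrix with `NaN` marking missing cells, a single pass rather than an iterative refinement, and column means as the initial fill; the function name `svd_impute` is hypothetical. For one-hot encoded categorical features, the reconstructed values would still need to be mapped back to a category, for example by taking the largest value within each one-hot group.

```python
import numpy as np


def svd_impute(X, rank):
    """Fill NaN cells using a rank-`rank` reconstruction of the design matrix.

    Assumption: every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    # Initial fill with column means so that the SVD can be computed
    filled = np.where(missing, col_means, X)
    # Centring the columns makes the SVD directions match the PCA components
    centred = filled - col_means
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + col_means
    # Keep observed cells untouched; take reconstructed values elsewhere
    return np.where(missing, low_rank, X)


# Toy matrix with one missing cell (illustrative data only)
matrix = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
])
reconstructed = svd_impute(matrix, rank=1)
```

A smaller `rank` leans more heavily on the dominant correlations between columns. At full rank the reconstruction simply returns the initial mean fill.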

Adult Dataset example

The figure below shows the frequency of each job category in the Adult dataset before and after applying the imputation techniques above.
Code can be found here

Adult dataset Imputation

Congressional voting records dataset example

Code can be found here

Congressional voting records dataset imputation