MissForest

This project is a Python implementation of the MissForest algorithm for imputing missing values in tabular datasets. Its primary goal is to give users a more accurate way to impute missing data than simple methods such as mean or mode filling.

MissForest takes longer to process a dataset than simpler imputation methods, but it typically yields more accurate results.

This is a deliberate trade-off: MissForest is intended for users who prioritize imputation accuracy over processing speed, which makes it a good fit for projects where data quality is paramount.
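
For contrast, a "simpler imputation method" of the kind mentioned above can be as small as filling each column with its mean or mode. The sketch below is illustrative only and is not part of this package; the helper name `simple_impute` and the toy values are assumptions made for the example.

```python
from statistics import mean, mode


def simple_impute(values, categorical=False):
    """Fill None entries with the mode (categorical) or the mean (numeric)."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if categorical else mean(observed)
    return [fill if v is None else v for v in values]


# Toy columns with one missing entry each (made-up values).
ages = [19, None, 28, 33]
smoker = ["yes", "no", None, "no"]

ages_filled = simple_impute(ages)
smoker_filled = simple_impute(smoker, categorical=True)
```

Every missing entry in a column receives the same fill value, whatever the other columns say. That is fast, but it ignores relationships between variables, which is what estimator-based imputation tries to exploit.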

How MissForest Handles Categorical Variables

Columns listed in the 'categorical' argument are label encoded so that the estimators can work with them.
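
As a rough illustration of what label encoding means here, the sketch below maps each distinct category to an integer code and leaves missing entries untouched. It is a minimal sketch, not the package's internal code; the function name `label_encode` and the choice to sort the categories are assumptions.

```python
def label_encode(values):
    """Map each distinct non-missing category to an integer code."""
    categories = sorted({v for v in values if v is not None})
    mapping = {cat: code for code, cat in enumerate(categories)}
    encoded = [None if v is None else mapping[v] for v in values]
    return encoded, mapping


# Toy column with one missing entry (made-up values).
region = ["southwest", "northeast", None, "southwest", "northwest"]
encoded, mapping = label_encode(region)

# Inverting the mapping turns integer codes back into the original labels.
inverse = {code: cat for cat, code in mapping.items()}
decoded = [None if c is None else inverse[c] for c in encoded]
```

Keeping the mapping around matters because values imputed as integer codes have to be translated back into labels before the result is returned to the user.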

Example

Install MissForest using pip:

pip install MissForest

Imputing a dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from missforest.missforest import MissForest

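df = pd.read_csv("insurance.csv")
# Hold out 30% of the rows as a test set.
train, test = train_test_split(df, test_size=.3, shuffle=True, random_state=42)

# Default estimators are the LightGBM classifier and regressor.
mf = MissForest()
# Fit on the training set only, naming the categorical columns to be label encoded.
mf.fit(
    X=train,
    categorical=["sex", "smoker", "region"]
)
# Impute both sets with the fitted imputer.
train_imputed = mf.transform(X=train)
test_imputed = mf.transform(X=test)
print(test_imputed)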

# Or fit and impute the training set in one step with the 'fit_transform' method.
mf = MissForest()
train_imputed = mf.fit_transform(
    X=train,
    categorical=["sex", "smoker", "region"]
)
test_imputed = mf.transform(X=test)
print(test_imputed)

Imputing with other estimators

from missforest.missforest import MissForest
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

df = pd.read_csv("insurance.csv")

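# Blank out a random sample of rows in every column to simulate missing data
# (sampling is with replacement, so fewer than 100 distinct rows may be hit).
for c in df.columns:
    random_index = np.random.choice(df.index, size=100)
    df.loc[random_index, c] = np.nan

# Use random forests in place of the default estimators.
clf = RandomForestClassifier(n_jobs=-1)
rgr = RandomForestRegressor(n_jobs=-1)

mf = MissForest(clf, rgr)
# Categorical columns are passed as in the previous example
# (assumed to be needed because these columns hold strings).
df_imputed = mf.fit_transform(
    X=df,
    categorical=["sex", "smoker", "region"]
)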

Benchmark

Mean Absolute Percentage Error

Variable    missForest    mean/mode    Difference
charges     2.65%         9.72%        -7.07%
age         1.16%         2.77%        -1.61%
bmi         1.18%         1.25%        -0.07%
sex         21.21         31.82        -10.61
smoker      4.24          9.90         -5.66
region      46.67         38.96        +7.71
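
For reference, mean absolute percentage error on imputed numeric values can be computed along the following lines. This is a sketch of the standard MAPE formula, not the benchmark script used to produce the table; the function name `mape` and the sample numbers are made up for illustration.

```python
def mape(actual, imputed):
    """Mean absolute percentage error, in percent, over paired values."""
    if len(actual) != len(imputed) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumes no true value is zero, since each error is divided by it.
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, imputed)]
    return 100 * sum(errors) / len(errors)


# Made-up true values and their imputed counterparts.
true_values = [100.0, 200.0, 400.0]
imputed_values = [110.0, 190.0, 400.0]
score = mape(true_values, imputed_values)
```

Lower is better. In the table, Difference is the missForest error minus the mean/mode error, so a negative value means missForest had the lower error.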
