# `clean_ml()`: Clean a dataset for downstream machine learning tasks

## Introduction

The function `clean_ml()` cleans a dataset for downstream machine learning tasks using commonly used operators. It handles categorical columns and numerical columns separately. The default cleaning pipeline is modeled on existing tools.

Currently, the supported components and operators are listed below:

* `cat_encoding`: encoding categorical columns
    * no_encoding
    * one_hot
* `cat_imputation`: imputing missing values in categorical columns
    * constant
    * most_frequent
    * drop
* `num_imputation`: imputing missing values in numerical columns
    * mean
    * median
    * most_frequent
    * drop
* `num_scaling`: scaling numerical columns
    * standardize
    * minmax
    * maxabs
* `variance_threshold`: dropping numerical columns with low variance
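
As a rough illustration of what the two range-based scalers and the variance filter compute, here is a minimal plain-Python sketch. The helper names (`minmax_scale`, `maxabs_scale`, `variance`) are made up for this example and are not part of `dataprep`; the formulas are the usual textbook definitions and the threshold of 0.0 is an assumption, not something taken from the library's source.

```python
def minmax_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has no range; map it to zeros to avoid dividing by zero.
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def maxabs_scale(values):
    """Divide by the largest absolute value so results fall in [-1, 1]."""
    biggest = max(abs(v) for v in values)
    return [0.0 if biggest == 0 else v / biggest for v in values]


def variance(values):
    """Population variance (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical toy columns, not taken from the adult dataset.
columns = {
    "hours": [10.0, 20.0, 40.0],
    "constant": [5.0, 5.0, 5.0],
}

# Keep only columns whose variance exceeds the threshold (assumed 0.0 here).
threshold = 0.0
kept = {name: vals for name, vals in columns.items() if variance(vals) > threshold}

minmax_hours = minmax_scale(columns["hours"])
maxabs_hours = maxabs_scale(columns["hours"])
print(sorted(kept))
print(minmax_hours)
print(maxabs_hours)
```

In this toy input the `constant` column carries no information, so it is the only one the variance filter removes.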

Users can also specify `include_operators` and `exclude_operators` to include or exclude specific operators from the list above, and can customize the pipeline with user-defined operators.

### An example dataset
The example dataset is the classic [adult](https://archive.ics.uci.edu/ml/datasets/adult) dataset. It has 48842 rows and 15 columns. In this dataset, '?' marks a missing value.

In [4]:
import pandas as pd
pd.set_option('display.min_rows', 30)
df = pd.read_csv('adult.csv')
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,2,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,1,0,2,United-States,<=50K
1,3,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,0,United-States,<=50K
2,2,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,2,United-States,<=50K
3,3,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,2,United-States,<=50K
4,1,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,2,Cuba,<=50K
5,2,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,2,United-States,<=50K
6,3,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,0,Jamaica,<=50K
7,3,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,2,United-States,>50K
8,1,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,4,0,3,United-States,>50K
9,2,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,2,0,2,United-States,>50K


### Split the dataset as training dataframe and test dataframe

In [5]:
training_rate = 0.7
# Positional split: the first 70% of rows form the training set, the rest the test set.
split_point = int(training_rate * len(df))
training_df = df.iloc[:split_point, :]
test_df = df.iloc[split_point:, :]
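
To make the split point concrete, the arithmetic can be checked by hand. The sketch below assumes only the 48842-row count stated earlier; it does not load the data, and the variable names are chosen for this illustration.

```python
training_rate = 0.7
number_of_rows = 48842  # row count stated for the adult dataset

# int() truncates toward zero, so the fractional part of 34189.4 is discarded.
split_point = int(training_rate * number_of_rows)
training_rows = split_point
test_rows = number_of_rows - split_point

print(split_point, training_rows, test_rows)
```

This agrees with the test dataframe shown further down, whose first row has index 34189. Because the split is positional, rows are not shuffled and keep their original order.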

In [6]:
training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,2,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,1,0,2,United-States,<=50K
1,3,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,0,United-States,<=50K
2,2,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,2,United-States,<=50K
3,3,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,2,United-States,<=50K
4,1,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,2,Cuba,<=50K
5,2,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,2,United-States,<=50K
6,3,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,0,Jamaica,<=50K
7,3,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,2,United-States,>50K
8,1,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,4,0,3,United-States,>50K
9,2,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,2,0,2,United-States,>50K


In [7]:
test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,2,Self-emp-not-inc,263871,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,3,United-States,<=50K
34190,2,State-gov,55294,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,2,United-States,>50K
34191,0,Private,174063,Assoc-voc,11,Never-married,Other-service,Own-child,White,Female,0,0,0,United-States,<=50K
34192,3,State-gov,258735,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,2,United-States,>50K
34193,3,Private,275867,HS-grad,9,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,2,United-States,<=50K
34194,0,Private,154235,Some-college,10,Never-married,Sales,Own-child,White,Female,0,0,1,United-States,<=50K
34195,1,Local-gov,210448,Some-college,10,Married-civ-spouse,Craft-repair,Other-relative,White,Male,0,0,2,United-States,<=50K
34196,1,Private,337908,Some-college,10,Divorced,Adm-clerical,Unmarried,Black,Female,0,0,1,United-States,<=50K
34197,1,State-gov,205333,Bachelors,13,Never-married,Prof-specialty,Not-in-family,White,Female,0,0,0,United-States,<=50K
34198,0,Private,187447,Some-college,10,Separated,Other-service,Own-child,White,Male,0,0,2,United-States,<=50K


## 1. Default `clean_ml()`

By default, `clean_ml()` applies the following cleaning pipeline:
* For categorical columns: `constant imputation -> one-hot encoding`
* For numerical columns: `mean imputation -> standardization`

The default NULL values are: `{np.nan, float("NaN"), "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null", "", None}`

The default fill value for categorical columns is `"missing_value"`.
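
The numerical branch of the default pipeline can be sketched in plain Python as follows. This is an illustration under stated assumptions, not `dataprep`'s implementation: `None` stands in for any recognised null value, the standard deviation is the population one (dividing by n), and the statistics are assumed to be learned from the training data and then reused on the test data. The function names are hypothetical.

```python
import math


def fit_numeric(train_values):
    """Learn the fill value, mean and standard deviation from training data."""
    observed = [v for v in train_values if v is not None]
    fill = sum(observed) / len(observed)  # mean of the non-missing values
    filled = [fill if v is None else v for v in train_values]
    # The scaler is fitted after imputation, i.e. on the filled column.
    mean = sum(filled) / len(filled)
    std = math.sqrt(sum((v - mean) ** 2 for v in filled) / len(filled))
    return fill, mean, std


def transform_numeric(values, fill, mean, std):
    """Mean imputation followed by standardization."""
    filled = [fill if v is None else v for v in values]
    return [0.0 if std == 0 else (v - mean) / std for v in filled]


# Hypothetical toy columns with one missing entry each.
train_age = [2.0, None, 4.0]
test_age = [3.0, None]

fill, mean, std = fit_numeric(train_age)
train_scaled = transform_numeric(train_age, fill, mean, std)
test_scaled = transform_numeric(test_age, fill, mean, std)
print(train_scaled)
print(test_scaled)
```

In the outputs of this section, a raw `age` of 2 maps to 0.181564 in both the training and the test dataframe, which is consistent with the statistics being computed once and applied to both.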

In [8]:
from dataprep.clean import clean_ml
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class")

In [9]:
cleaned_training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,0.181564,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.064247,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",1.054765,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
1,0.955953,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.009237,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,-2.185441,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
2,0.181564,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.246964,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
3,0.955953,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.428035,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-1.193092,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
4,-0.592825,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",1.412302,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,0.053470,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
5,0.181564,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.901345,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.520184,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
6,0.955953,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.279485,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",-1.968313,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-2.185441,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
7,0.955953,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.189970,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
8,-0.592825,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.365494,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.520184,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",5.032415,-0.206016,1.172925,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
9,0.181564,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.286491,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",2.380648,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K


In [10]:
cleaned_test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,0.181564,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.704744,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,1.172925,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34190,0.181564,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.275191,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
34191,-1.367214,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.147766,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...",0.357352,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-2.185441,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34192,0.955953,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.655990,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
34193,0.955953,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.818617,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34194,-1.367214,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.335985,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-1.065986,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34195,-0.592825,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]",0.197621,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34196,-0.592825,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",1.407546,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-1.065986,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34197,-0.592825,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.149067,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-2.185441,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34198,-1.367214,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.020718,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K


## 2. `cat_imputation` and `cat_null_value` parameters
There are three choices for the `cat_imputation` parameter:
* `constant`: fill missing values with a constant value. The default is `"missing_value"`.
* `most_frequent`: fill missing values with the most frequent value of the column.
* `drop`: drop the column if it contains missing values.
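
The three strategies above can be sketched on a toy table in plain Python. This is a simplified illustration, not the library's code: columns are stored as a dict of lists, `None` marks a missing value, and the function name `impute_categorical` is hypothetical.

```python
from collections import Counter


def impute_categorical(columns, strategy="constant", fill_val="missing_value"):
    """Apply one imputation strategy to every column; None marks a missing value."""
    result = {}
    for name, values in columns.items():
        if all(v is not None for v in values):
            result[name] = list(values)  # nothing to impute
        elif strategy == "constant":
            result[name] = [fill_val if v is None else v for v in values]
        elif strategy == "most_frequent":
            observed = [v for v in values if v is not None]
            mode = Counter(observed).most_common(1)[0][0]
            result[name] = [mode if v is None else v for v in values]
        elif strategy == "drop":
            continue  # the whole column is left out of the result
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return result


# Hypothetical toy table: only "workclass" has a missing entry.
columns = {
    "workclass": ["Private", None, "Private", "State-gov"],
    "sex": ["Male", "Female", "Male", "Female"],
}

by_constant = impute_categorical(columns, "constant")
by_mode = impute_categorical(columns, "most_frequent")
by_drop = impute_categorical(columns, "drop")
print(by_constant["workclass"])
print(by_mode["workclass"])
print(list(by_drop))
```

Note that `drop` removes columns, not rows, so every remaining column keeps its full length.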

The `cat_null_value` parameter is a list of user-specified null values. Its elements can be of any type. For example:
* `['?']`
* `['abc', np.nan, '?', 1265]`

By default (`cat_imputation="constant"`), the specified null values are replaced with `"missing_value"`.
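
One way to picture a mixed-type null list is as a matching step that runs before imputation. The sketch below is an assumption about the general idea, not a description of `dataprep` internals: `mark_nulls` is a hypothetical helper, `float("nan")` stands in for `np.nan`, and NaN is handled separately because it never compares equal to itself.

```python
import math


def is_nan(value):
    """True only for float NaN; strings and integers are never NaN."""
    return isinstance(value, float) and math.isnan(value)


def mark_nulls(values, null_values):
    """Replace every value that matches a user-specified null marker with None."""
    nan_is_null = any(is_nan(n) for n in null_values)
    markers = [n for n in null_values if not is_nan(n)]
    marked = []
    for v in values:
        if is_nan(v):
            # NaN != NaN, so equality matching would miss it.
            marked.append(None if nan_is_null else v)
        elif v in markers:
            marked.append(None)
        else:
            marked.append(v)
    return marked


# Hypothetical column containing several kinds of null markers.
raw = ["Private", "?", float("nan"), 1265, "abc", "State-gov"]

only_question_mark = mark_nulls(["Private", "?", "Cuba"], ["?"])
mixed = mark_nulls(raw, ["abc", float("nan"), "?", 1265])
print(only_question_mark)
print(mixed)
```

Once the markers are unified into a single missing-value representation, any of the imputation strategies can be applied uniformly.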

#### cat_imputation = "constant"

In [18]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", 
                                                cat_imputation="constant", 
                                                cat_encoding="no_encoding", cat_null_value=['?'])

In [19]:
cleaned_training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,0.181564,State-gov,-1.064247,Bachelors,1.132573,Never-married,Adm-clerical,Not-in-family,White,Male,1.054765,-0.206016,0.053470,United-States,<=50K
1,0.955953,Self-emp-not-inc,-1.009237,Bachelors,1.132573,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,-2.185441,United-States,<=50K
2,0.181564,Private,0.246964,HS-grad,-0.417870,Divorced,Handlers-cleaners,Not-in-family,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
3,0.955953,Private,0.428035,11th,-1.193092,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
4,-0.592825,Private,1.412302,Bachelors,1.132573,Married-civ-spouse,Prof-specialty,Wife,Black,Female,-0.271118,-0.206016,0.053470,Cuba,<=50K
5,0.181564,Private,0.901345,Masters,1.520184,Married-civ-spouse,Exec-managerial,Wife,White,Female,-0.271118,-0.206016,0.053470,United-States,<=50K
6,0.955953,Private,-0.279485,9th,-1.968313,Married-spouse-absent,Other-service,Not-in-family,Black,Female,-0.271118,-0.206016,-2.185441,Jamaica,<=50K
7,0.955953,Self-emp-not-inc,0.189970,HS-grad,-0.417870,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
8,-0.592825,Private,-1.365494,Masters,1.520184,Never-married,Prof-specialty,Not-in-family,White,Female,5.032415,-0.206016,1.172925,United-States,>50K
9,0.181564,Private,-0.286491,Bachelors,1.132573,Married-civ-spouse,Exec-managerial,Husband,White,Male,2.380648,-0.206016,0.053470,United-States,>50K


In [20]:
cleaned_test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,0.181564,Self-emp-not-inc,0.704744,Some-college,-0.030259,Married-civ-spouse,Craft-repair,Husband,White,Male,-0.271118,-0.206016,1.172925,United-States,<=50K
34190,0.181564,State-gov,-1.275191,Bachelors,1.132573,Married-civ-spouse,Prof-specialty,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
34191,-1.367214,Private,-0.147766,Assoc-voc,0.357352,Never-married,Other-service,Own-child,White,Female,-0.271118,-0.206016,-2.185441,United-States,<=50K
34192,0.955953,State-gov,0.655990,Some-college,-0.030259,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
34193,0.955953,Private,0.818617,HS-grad,-0.417870,Married-civ-spouse,Prof-specialty,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
34194,-1.367214,Private,-0.335985,Some-college,-0.030259,Never-married,Sales,Own-child,White,Female,-0.271118,-0.206016,-1.065986,United-States,<=50K
34195,-0.592825,Local-gov,0.197621,Some-college,-0.030259,Married-civ-spouse,Craft-repair,Other-relative,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
34196,-0.592825,Private,1.407546,Some-college,-0.030259,Divorced,Adm-clerical,Unmarried,Black,Female,-0.271118,-0.206016,-1.065986,United-States,<=50K
34197,-0.592825,State-gov,0.149067,Bachelors,1.132573,Never-married,Prof-specialty,Not-in-family,White,Female,-0.271118,-0.206016,-2.185441,United-States,<=50K
34198,-1.367214,Private,-0.020718,Some-college,-0.030259,Separated,Other-service,Own-child,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K


#### cat_imputation="most_frequent"

In [21]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", 
                                                cat_imputation="most_frequent", 
                                                cat_encoding="no_encoding", cat_null_value=['?'])

In [22]:
cleaned_training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,0.181564,State-gov,-1.064247,Bachelors,1.132573,Never-married,Adm-clerical,Not-in-family,White,Male,1.054765,-0.206016,0.053470,United-States,<=50K
1,0.955953,Self-emp-not-inc,-1.009237,Bachelors,1.132573,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,-2.185441,United-States,<=50K
2,0.181564,Private,0.246964,HS-grad,-0.417870,Divorced,Handlers-cleaners,Not-in-family,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
3,0.955953,Private,0.428035,11th,-1.193092,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
4,-0.592825,Private,1.412302,Bachelors,1.132573,Married-civ-spouse,Prof-specialty,Wife,Black,Female,-0.271118,-0.206016,0.053470,Cuba,<=50K
5,0.181564,Private,0.901345,Masters,1.520184,Married-civ-spouse,Exec-managerial,Wife,White,Female,-0.271118,-0.206016,0.053470,United-States,<=50K
6,0.955953,Private,-0.279485,9th,-1.968313,Married-spouse-absent,Other-service,Not-in-family,Black,Female,-0.271118,-0.206016,-2.185441,Jamaica,<=50K
7,0.955953,Self-emp-not-inc,0.189970,HS-grad,-0.417870,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
8,-0.592825,Private,-1.365494,Masters,1.520184,Never-married,Prof-specialty,Not-in-family,White,Female,5.032415,-0.206016,1.172925,United-States,>50K
9,0.181564,Private,-0.286491,Bachelors,1.132573,Married-civ-spouse,Exec-managerial,Husband,White,Male,2.380648,-0.206016,0.053470,United-States,>50K


In [23]:
cleaned_test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,0.181564,Self-emp-not-inc,0.704744,Some-college,-0.030259,Married-civ-spouse,Craft-repair,Husband,White,Male,-0.271118,-0.206016,1.172925,United-States,<=50K
34190,0.181564,State-gov,-1.275191,Bachelors,1.132573,Married-civ-spouse,Prof-specialty,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
34191,-1.367214,Private,-0.147766,Assoc-voc,0.357352,Never-married,Other-service,Own-child,White,Female,-0.271118,-0.206016,-2.185441,United-States,<=50K
34192,0.955953,State-gov,0.655990,Some-college,-0.030259,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
34193,0.955953,Private,0.818617,HS-grad,-0.417870,Married-civ-spouse,Prof-specialty,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
34194,-1.367214,Private,-0.335985,Some-college,-0.030259,Never-married,Sales,Own-child,White,Female,-0.271118,-0.206016,-1.065986,United-States,<=50K
34195,-0.592825,Local-gov,0.197621,Some-college,-0.030259,Married-civ-spouse,Craft-repair,Other-relative,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
34196,-0.592825,Private,1.407546,Some-college,-0.030259,Divorced,Adm-clerical,Unmarried,Black,Female,-0.271118,-0.206016,-1.065986,United-States,<=50K
34197,-0.592825,State-gov,0.149067,Bachelors,1.132573,Never-married,Prof-specialty,Not-in-family,White,Female,-0.271118,-0.206016,-2.185441,United-States,<=50K
34198,-1.367214,Private,-0.020718,Some-college,-0.030259,Separated,Other-service,Own-child,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K


#### cat_imputation="drop"

In [24]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", 
                                                cat_imputation="drop", 
                                                cat_encoding="no_encoding", cat_null_value=['?'])

In [25]:
cleaned_training_df

Unnamed: 0,age,fnlwgt,education,education-num,marital-status,relationship,race,sex,capitalgain,capitalloss,hoursperweek,class
0,0.181564,-1.064247,Bachelors,1.132573,Never-married,Not-in-family,White,Male,1.054765,-0.206016,0.053470,<=50K
1,0.955953,-1.009237,Bachelors,1.132573,Married-civ-spouse,Husband,White,Male,-0.271118,-0.206016,-2.185441,<=50K
2,0.181564,0.246964,HS-grad,-0.417870,Divorced,Not-in-family,White,Male,-0.271118,-0.206016,0.053470,<=50K
3,0.955953,0.428035,11th,-1.193092,Married-civ-spouse,Husband,Black,Male,-0.271118,-0.206016,0.053470,<=50K
4,-0.592825,1.412302,Bachelors,1.132573,Married-civ-spouse,Wife,Black,Female,-0.271118,-0.206016,0.053470,<=50K
5,0.181564,0.901345,Masters,1.520184,Married-civ-spouse,Wife,White,Female,-0.271118,-0.206016,0.053470,<=50K
6,0.955953,-0.279485,9th,-1.968313,Married-spouse-absent,Not-in-family,Black,Female,-0.271118,-0.206016,-2.185441,<=50K
7,0.955953,0.189970,HS-grad,-0.417870,Married-civ-spouse,Husband,White,Male,-0.271118,-0.206016,0.053470,>50K
8,-0.592825,-1.365494,Masters,1.520184,Never-married,Not-in-family,White,Female,5.032415,-0.206016,1.172925,>50K
9,0.181564,-0.286491,Bachelors,1.132573,Married-civ-spouse,Husband,White,Male,2.380648,-0.206016,0.053470,>50K


In [26]:
cleaned_test_df

Unnamed: 0,age,fnlwgt,education,education-num,marital-status,relationship,race,sex,capitalgain,capitalloss,hoursperweek,class
34189,0.181564,0.704744,Some-college,-0.030259,Married-civ-spouse,Husband,White,Male,-0.271118,-0.206016,1.172925,<=50K
34190,0.181564,-1.275191,Bachelors,1.132573,Married-civ-spouse,Husband,White,Male,-0.271118,-0.206016,0.053470,>50K
34191,-1.367214,-0.147766,Assoc-voc,0.357352,Never-married,Own-child,White,Female,-0.271118,-0.206016,-2.185441,<=50K
34192,0.955953,0.655990,Some-college,-0.030259,Married-civ-spouse,Husband,White,Male,-0.271118,-0.206016,0.053470,>50K
34193,0.955953,0.818617,HS-grad,-0.417870,Married-civ-spouse,Husband,White,Male,-0.271118,-0.206016,0.053470,<=50K
34194,-1.367214,-0.335985,Some-college,-0.030259,Never-married,Own-child,White,Female,-0.271118,-0.206016,-1.065986,<=50K
34195,-0.592825,0.197621,Some-college,-0.030259,Married-civ-spouse,Other-relative,White,Male,-0.271118,-0.206016,0.053470,<=50K
34196,-0.592825,1.407546,Some-college,-0.030259,Divorced,Unmarried,Black,Female,-0.271118,-0.206016,-1.065986,<=50K
34197,-0.592825,0.149067,Bachelors,1.132573,Never-married,Not-in-family,White,Female,-0.271118,-0.206016,-2.185441,<=50K
34198,-1.367214,-0.020718,Some-college,-0.030259,Separated,Own-child,White,Male,-0.271118,-0.206016,0.053470,<=50K


## 3. `fill_val` parameter

By default, missing values in categorical columns are filled with `"missing_value"`. However, users can replace this with any string they like, such as `"missing"`, `"NaN"`, `"I'm a cat."`, `"Fyodor Dostoyevsky"`.

In [30]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", 
                                                cat_null_value=['?'], cat_encoding="no_encoding",
                                                fill_val="AHAHAHAHAHA!!!")

In [31]:
cleaned_training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,0.181564,State-gov,-1.064247,Bachelors,1.132573,Never-married,Adm-clerical,Not-in-family,White,Male,1.054765,-0.206016,0.053470,United-States,<=50K
1,0.955953,Self-emp-not-inc,-1.009237,Bachelors,1.132573,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,-2.185441,United-States,<=50K
2,0.181564,Private,0.246964,HS-grad,-0.417870,Divorced,Handlers-cleaners,Not-in-family,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
3,0.955953,Private,0.428035,11th,-1.193092,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
4,-0.592825,Private,1.412302,Bachelors,1.132573,Married-civ-spouse,Prof-specialty,Wife,Black,Female,-0.271118,-0.206016,0.053470,Cuba,<=50K
5,0.181564,Private,0.901345,Masters,1.520184,Married-civ-spouse,Exec-managerial,Wife,White,Female,-0.271118,-0.206016,0.053470,United-States,<=50K
6,0.955953,Private,-0.279485,9th,-1.968313,Married-spouse-absent,Other-service,Not-in-family,Black,Female,-0.271118,-0.206016,-2.185441,Jamaica,<=50K
7,0.955953,Self-emp-not-inc,0.189970,HS-grad,-0.417870,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
8,-0.592825,Private,-1.365494,Masters,1.520184,Never-married,Prof-specialty,Not-in-family,White,Female,5.032415,-0.206016,1.172925,United-States,>50K
9,0.181564,Private,-0.286491,Bachelors,1.132573,Married-civ-spouse,Exec-managerial,Husband,White,Male,2.380648,-0.206016,0.053470,United-States,>50K


In [32]:
cleaned_test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,0.181564,Self-emp-not-inc,0.704744,Some-college,-0.030259,Married-civ-spouse,Craft-repair,Husband,White,Male,-0.271118,-0.206016,1.172925,United-States,<=50K
34190,0.181564,State-gov,-1.275191,Bachelors,1.132573,Married-civ-spouse,Prof-specialty,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
34191,-1.367214,Private,-0.147766,Assoc-voc,0.357352,Never-married,Other-service,Own-child,White,Female,-0.271118,-0.206016,-2.185441,United-States,<=50K
34192,0.955953,State-gov,0.655990,Some-college,-0.030259,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
34193,0.955953,Private,0.818617,HS-grad,-0.417870,Married-civ-spouse,Prof-specialty,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
34194,-1.367214,Private,-0.335985,Some-college,-0.030259,Never-married,Sales,Own-child,White,Female,-0.271118,-0.206016,-1.065986,United-States,<=50K
34195,-0.592825,Local-gov,0.197621,Some-college,-0.030259,Married-civ-spouse,Craft-repair,Other-relative,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
34196,-0.592825,Private,1.407546,Some-college,-0.030259,Divorced,Adm-clerical,Unmarried,Black,Female,-0.271118,-0.206016,-1.065986,United-States,<=50K
34197,-0.592825,State-gov,0.149067,Bachelors,1.132573,Never-married,Prof-specialty,Not-in-family,White,Female,-0.271118,-0.206016,-2.185441,United-States,<=50K
34198,-1.367214,Private,-0.020718,Some-college,-0.030259,Separated,Other-service,Own-child,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K


## 4. `num_imputation` and `num_null_value` parameters
There are four choices for the `num_imputation` parameter:
* `mean`: fill missing values with the mean of the column.
* `median`: fill missing values with the median of the column.
* `most_frequent`: fill missing values with the most frequent value of the column.
* `drop`: drop the column if it contains missing values.
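
The four numerical strategies differ only in which statistic supplies the fill value. A minimal sketch using Python's `statistics` module is shown below; `impute_numeric` is a hypothetical helper written for this illustration, `None` marks a missing value, and returning `None` for a dropped column is a convention of the sketch only.

```python
import statistics


def impute_numeric(values, strategy="mean"):
    """Return the imputed column, or None when the column should be dropped."""
    observed = [v for v in values if v is not None]
    if len(observed) == len(values):
        return list(values)  # nothing to impute
    if strategy == "drop":
        return None
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical skewed column with one missing entry.
hours = [1.0, 1.0, 4.0, None, 10.0]

by_mean = impute_numeric(hours, "mean")
by_median = impute_numeric(hours, "median")
by_mode = impute_numeric(hours, "most_frequent")
dropped = impute_numeric(hours, "drop")
print(by_mean[3], by_median[3], by_mode[3], dropped)
```

On a skewed column like this one the three fill values differ noticeably, which is the main reason to choose between them.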

The default null values are the same as those described for the `cat_imputation` parameter.

The imputation process is very similar to the one shown in the `cat_imputation` section, so we do not repeat the examples here.

The `num_null_value` parameter is a list of user-specified null values. Its elements can be of any type. For example:
* `['?']`
* `['abc', np.nan, '?', 1265]`

The usage of `num_null_value` is the same as that of `cat_null_value`, so we do not repeat the examples here.

## 5. `cat_encoding` parameter

There are two choices for the `cat_encoding` parameter:
* `no_encoding`: leave categorical columns unencoded.
* `one_hot`: apply one-hot encoding to categorical columns.

The default value is `one_hot`.
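
For readers unfamiliar with one-hot encoding, here is a minimal plain-Python sketch. It is not `dataprep`'s implementation, and the function names are hypothetical. Two choices in it are assumptions: categories are ordered by first appearance in the training data (a guess based on the default output shown earlier, where the first training row has its 1.0 in the first position of every vector), and a value unseen during training is encoded as all zeros.

```python
def fit_one_hot(train_values):
    """Collect the distinct categories in order of first appearance."""
    categories = []
    for value in train_values:
        if value not in categories:
            categories.append(value)
    return categories


def one_hot(values, categories):
    """Encode each value as a list with a single 1.0 at its category's position."""
    return [[1.0 if value == c else 0.0 for c in categories] for value in values]


# Hypothetical toy columns using workclass values from the example dataset.
train_workclass = ["State-gov", "Self-emp-not-inc", "Private", "Private"]
test_workclass = ["Private", "Local-gov"]  # "Local-gov" is unseen in training

categories = fit_one_hot(train_workclass)
train_encoded = one_hot(train_workclass, categories)
test_encoded = one_hot(test_workclass, categories)
print(categories)
print(train_encoded)
print(test_encoded)
```

Each encoded vector has one slot per category, which is why the vectors in the default output have different lengths for different columns.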

#### cat_encoding = "no_encoding"

In [36]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", cat_encoding="no_encoding")

In [37]:
cleaned_training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,0.181564,State-gov,-1.064247,Bachelors,1.132573,Never-married,Adm-clerical,Not-in-family,White,Male,1.054765,-0.206016,0.053470,United-States,<=50K
1,0.955953,Self-emp-not-inc,-1.009237,Bachelors,1.132573,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,-2.185441,United-States,<=50K
2,0.181564,Private,0.246964,HS-grad,-0.417870,Divorced,Handlers-cleaners,Not-in-family,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
3,0.955953,Private,0.428035,11th,-1.193092,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
4,-0.592825,Private,1.412302,Bachelors,1.132573,Married-civ-spouse,Prof-specialty,Wife,Black,Female,-0.271118,-0.206016,0.053470,Cuba,<=50K
5,0.181564,Private,0.901345,Masters,1.520184,Married-civ-spouse,Exec-managerial,Wife,White,Female,-0.271118,-0.206016,0.053470,United-States,<=50K
6,0.955953,Private,-0.279485,9th,-1.968313,Married-spouse-absent,Other-service,Not-in-family,Black,Female,-0.271118,-0.206016,-2.185441,Jamaica,<=50K
7,0.955953,Self-emp-not-inc,0.189970,HS-grad,-0.417870,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
8,-0.592825,Private,-1.365494,Masters,1.520184,Never-married,Prof-specialty,Not-in-family,White,Female,5.032415,-0.206016,1.172925,United-States,>50K
9,0.181564,Private,-0.286491,Bachelors,1.132573,Married-civ-spouse,Exec-managerial,Husband,White,Male,2.380648,-0.206016,0.053470,United-States,>50K


In [38]:
cleaned_test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,0.181564,Self-emp-not-inc,0.704744,Some-college,-0.030259,Married-civ-spouse,Craft-repair,Husband,White,Male,-0.271118,-0.206016,1.172925,United-States,<=50K
34190,0.181564,State-gov,-1.275191,Bachelors,1.132573,Married-civ-spouse,Prof-specialty,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
34191,-1.367214,Private,-0.147766,Assoc-voc,0.357352,Never-married,Other-service,Own-child,White,Female,-0.271118,-0.206016,-2.185441,United-States,<=50K
34192,0.955953,State-gov,0.655990,Some-college,-0.030259,Married-civ-spouse,Exec-managerial,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,>50K
34193,0.955953,Private,0.818617,HS-grad,-0.417870,Married-civ-spouse,Prof-specialty,Husband,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
34194,-1.367214,Private,-0.335985,Some-college,-0.030259,Never-married,Sales,Own-child,White,Female,-0.271118,-0.206016,-1.065986,United-States,<=50K
34195,-0.592825,Local-gov,0.197621,Some-college,-0.030259,Married-civ-spouse,Craft-repair,Other-relative,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K
34196,-0.592825,Private,1.407546,Some-college,-0.030259,Divorced,Adm-clerical,Unmarried,Black,Female,-0.271118,-0.206016,-1.065986,United-States,<=50K
34197,-0.592825,State-gov,0.149067,Bachelors,1.132573,Never-married,Prof-specialty,Not-in-family,White,Female,-0.271118,-0.206016,-2.185441,United-States,<=50K
34198,-1.367214,Private,-0.020718,Some-college,-0.030259,Separated,Other-service,Own-child,White,Male,-0.271118,-0.206016,0.053470,United-States,<=50K


#### cat_encoding="one_hot"

In [39]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", cat_encoding="one_hot")

In [40]:
cleaned_training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,0.181564,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.064247,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",1.054765,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
1,0.955953,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.009237,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,-2.185441,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
2,0.181564,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.246964,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
3,0.955953,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.428035,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-1.193092,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
4,-0.592825,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",1.412302,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,0.053470,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
5,0.181564,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.901345,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.520184,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
6,0.955953,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.279485,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",-1.968313,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-2.185441,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
7,0.955953,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.189970,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
8,-0.592825,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.365494,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.520184,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",5.032415,-0.206016,1.172925,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
9,0.181564,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.286491,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",2.380648,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K


In [41]:
cleaned_test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,0.181564,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.704744,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,1.172925,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34190,0.181564,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.275191,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
34191,-1.367214,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.147766,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...",0.357352,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-2.185441,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34192,0.955953,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.655990,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
34193,0.955953,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.818617,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34194,-1.367214,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.335985,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-1.065986,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34195,-0.592825,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]",0.197621,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34196,-0.592825,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",1.407546,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-1.065986,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34197,-0.592825,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.149067,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]",-0.271118,-0.206016,-2.185441,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34198,-1.367214,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.020718,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]",-0.271118,-0.206016,0.053470,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K


## 6. `variance_threshold` and `variance` parameters
There are two choices for the `variance_threshold` parameter:
* `True`: drop numerical columns whose variance is less than the `variance` value.
* `False`: do nothing.

The default `variance_threshold` is `False`.

The default `variance` is 0.0.
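The filtering rule can be sketched in plain pandas. This is an illustrative sketch, not the implementation inside `clean_ml()`: the helper name `low_variance_columns` is hypothetical, and both the strict "less than" comparison (taken from the wording above) and the use of the population variance (`ddof=0`) are assumptions.

```python
import pandas as pd


def low_variance_columns(train: pd.DataFrame, variance: float = 0.0) -> list:
    """Return the numerical columns whose training variance is below `variance`."""
    num_cols = train.select_dtypes(include="number").columns
    variances = train[num_cols].var(ddof=0)  # population variance (an assumption)
    return [col for col in num_cols if variances[col] < variance]


# Toy data: "a" is constant, "b" varies a lot, "c" is categorical and is never considered.
train = pd.DataFrame({"a": [1.0, 1.0, 1.0, 1.0],
                      "b": [0.0, 10.0, 0.0, 10.0],
                      "c": ["x", "y", "x", "y"]})
test = pd.DataFrame({"a": [1.0, 2.0], "b": [5.0, 5.0], "c": ["y", "x"]})

# Decide which columns to drop from the training data, then drop them from both frames.
dropped = low_variance_columns(train, variance=6.0)
train_kept = train.drop(columns=dropped)
test_kept = test.drop(columns=dropped)

print(dropped)
print(list(train_kept.columns))
```

The decision is made on the training data alone and then applied to the test data, so both frames end up with the same columns.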

In [42]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", 
                                                variance_threshold=True, variance=6.0)

In [43]:
cleaned_training_df

Unnamed: 0,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,native-country,class
0,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.064247,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
1,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.009237,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
2,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.246964,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
3,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.428035,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-1.193092,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
4,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",1.412302,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
5,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.901345,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.520184,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
6,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.279485,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",-1.968313,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
7,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.189970,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
8,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.365494,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.520184,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
9,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.286491,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K


In [44]:
cleaned_test_df 

Unnamed: 0,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,native-country,class
34189,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.704744,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34190,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-1.275191,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
34191,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.147766,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...",0.357352,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34192,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.655990,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",>50K
34193,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.818617,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",-0.417870,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34194,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.335985,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34195,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]",0.197621,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34196,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",1.407546,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0]","[0.0, 1.0, 0.0, 0.0, 0.0]","[0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34197,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.149067,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.132573,"[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 1.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K
34198,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",-0.020718,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",-0.030259,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0]","[1.0, 0.0]","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",<=50K


## 7. `num_scaling` parameter
There are three choices for the `num_scaling` parameter:
* `standardize`: standardize each numerical column using its mean and standard deviation. The transformation is (x - mean) / std.
* `minmax`: scale each numerical column using its minimum and maximum values. The transformation is (x - min) / (max - min).
* `maxabs`: scale each numerical column by its maximum absolute value. The transformation is x / maxabs.

The default `num_scaling` is `standardize`.
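The three formulas above can be written out for a single column as follows. This is a minimal sketch with hypothetical helper names (`fit_scaler`, `apply_scaler`). It assumes the statistics are computed on the training column and reused for the test column, uses the population standard deviation (`ddof=0`) as an assumption, and does not handle constant columns, where the scale would be zero.

```python
import pandas as pd


def fit_scaler(train_col: pd.Series, method: str = "standardize") -> dict:
    """Compute the shift and scale for one column from the training data."""
    if method == "standardize":
        return {"shift": train_col.mean(), "scale": train_col.std(ddof=0)}
    if method == "minmax":
        return {"shift": train_col.min(), "scale": train_col.max() - train_col.min()}
    if method == "maxabs":
        return {"shift": 0.0, "scale": train_col.abs().max()}
    raise ValueError(f"unknown scaling method: {method}")


def apply_scaler(col: pd.Series, params: dict) -> pd.Series:
    """Apply (x - shift) / scale with previously fitted parameters."""
    return (col - params["shift"]) / params["scale"]


train = pd.Series([-4.0, 0.0, 4.0])
test = pd.Series([2.0, 8.0])

std_train = apply_scaler(train, fit_scaler(train, "standardize"))
minmax_params = fit_scaler(train, "minmax")
minmax_train = apply_scaler(train, minmax_params)
minmax_test = apply_scaler(test, minmax_params)
maxabs_test = apply_scaler(test, fit_scaler(train, "maxabs"))

print(minmax_train.tolist())  # training values fall in [0, 1]
print(minmax_test.tolist())   # test values may fall outside [0, 1]
```

Because the parameters come from the training column, a test value larger than the training maximum maps to a number above 1 under `minmax` or `maxabs`.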

In [55]:
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df, target="class", 
                                                cat_encoding='no_encoding',
                                                num_scaling="minmax")

In [56]:
cleaned_training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,0.50,State-gov,0.044302,Bachelors,0.800000,Never-married,Adm-clerical,Not-in-family,White,Male,0.25,0.00,0.50,United-States,<=50K
1,0.75,Self-emp-not-inc,0.048238,Bachelors,0.800000,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.00,0.00,0.00,United-States,<=50K
2,0.50,Private,0.138113,HS-grad,0.533333,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.00,0.00,0.50,United-States,<=50K
3,0.75,Private,0.151068,11th,0.400000,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.00,0.00,0.50,United-States,<=50K
4,0.25,Private,0.221488,Bachelors,0.800000,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.00,0.00,0.50,Cuba,<=50K
5,0.50,Private,0.184932,Masters,0.866667,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.00,0.00,0.50,United-States,<=50K
6,0.75,Private,0.100448,9th,0.266667,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.00,0.00,0.00,Jamaica,<=50K
7,0.75,Self-emp-not-inc,0.134036,HS-grad,0.533333,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.00,0.00,0.50,United-States,>50K
8,0.25,Private,0.022749,Masters,0.866667,Never-married,Prof-specialty,Not-in-family,White,Female,1.00,0.00,0.75,United-States,>50K
9,0.50,Private,0.099947,Bachelors,0.800000,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.50,0.00,0.50,United-States,>50K


In [57]:
cleaned_test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,0.50,Self-emp-not-inc,0.170866,Some-college,0.600000,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,0.0,0.75,United-States,<=50K
34190,0.50,State-gov,0.029210,Bachelors,0.800000,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,0.50,United-States,>50K
34191,0.00,Private,0.109872,Assoc-voc,0.666667,Never-married,Other-service,Own-child,White,Female,0.0,0.0,0.00,United-States,<=50K
34192,0.75,State-gov,0.167378,Some-college,0.600000,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,0.50,United-States,>50K
34193,0.75,Private,0.179013,HS-grad,0.533333,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,0.50,United-States,<=50K
34194,0.00,Private,0.096406,Some-college,0.600000,Never-married,Sales,Own-child,White,Female,0.0,0.0,0.25,United-States,<=50K
34195,0.25,Local-gov,0.134583,Some-college,0.600000,Married-civ-spouse,Craft-repair,Other-relative,White,Male,0.0,0.0,0.50,United-States,<=50K
34196,0.25,Private,0.221148,Some-college,0.600000,Divorced,Adm-clerical,Unmarried,Black,Female,0.0,0.0,0.25,United-States,<=50K
34197,0.25,State-gov,0.131109,Bachelors,0.800000,Never-married,Prof-specialty,Not-in-family,White,Female,0.0,0.0,0.00,United-States,<=50K
34198,0.00,Private,0.118962,Some-college,0.600000,Separated,Other-service,Own-child,White,Male,0.0,0.0,0.50,United-States,<=50K


## 8. `include_operators` and `exclude_operators` parameters
The `include_operators` parameter specifies which operators must be included in the cleaning pipeline. It is a list, for example:
* `['one_hot', 'minmax', 'median', 'most_frequent']`

The `exclude_operators` parameter specifies which operators must be excluded from the cleaning pipeline. It has the same format as `include_operators`.

The valid choices for `include_operators` and `exclude_operators` are:
* `one_hot`
* `constant`
* `most_frequent`
* `drop`
* `mean`
* `median`
* `standardize`
* `minmax`
* `maxabs`
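Since both parameters are plain lists of operator names, it can be handy to check them before calling `clean_ml()`. The helper below is a hypothetical sketch and not part of `clean_ml()`: it only compares the names against the valid choices listed above and rejects a name that appears in both lists. How `clean_ml()` itself reacts to invalid or conflicting names is not covered here.

```python
# Valid operator names, copied from the list above.
VALID_OPERATORS = {
    "one_hot", "constant", "most_frequent", "drop",
    "mean", "median", "standardize", "minmax", "maxabs",
}


def check_operators(include=None, exclude=None):
    """Validate two operator lists and return them as plain lists."""
    include = list(include or [])
    exclude = list(exclude or [])
    unknown = sorted(set(include + exclude) - VALID_OPERATORS)
    if unknown:
        raise ValueError(f"unknown operators: {unknown}")
    conflict = sorted(set(include) & set(exclude))
    if conflict:
        raise ValueError(f"operators both included and excluded: {conflict}")
    return include, exclude


include_ops, exclude_ops = check_operators(
    include=["one_hot", "minmax", "median", "most_frequent"],
    exclude=["drop"],
)
print(include_ops, exclude_ops)
```

The checked lists could then be passed as `include_operators=include_ops` and `exclude_operators=exclude_ops`.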

## 9. `customized_cat_pipeline` and `customized_num_pipeline` parameters
Experienced users can specify their own `customized_cat_pipeline` and `customized_num_pipeline`. Each of the two parameters is a list of dictionaries, one per component. Each dictionary maps the component name to another dictionary that holds the name of the chosen operator and its related parameters. For example:
* `[{"cat_imputation": {"operator": 'constant', "cat_null_value": ['?'], "fill_val": "Hahahaha!!!!!"}}]`

Users can also specify their own operators. An operator is an ordinary class that defines an `__init__` function and the `fit`, `transform` and `fit_transform` functions. To use it, put the class in the `operator` position of the corresponding component.
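Before handing a user-defined operator to `clean_ml()`, it can be exercised on its own with small Series. The sketch below uses a hypothetical `MedianCenterer` class that follows the same `fit` / `transform` / `fit_transform` interface. It assumes that an operator is fitted on a training column and then applied to the matching test column, which is an assumption about how `clean_ml()` uses it.

```python
import pandas as pd


class MedianCenterer:
    """Hypothetical user-defined operator: subtract the median of the fitted column."""

    def __init__(self) -> None:
        self.median = None

    def fit(self, col: pd.Series) -> "MedianCenterer":
        # Learn the statistic from the column passed to fit().
        self.median = col.median()
        return self

    def transform(self, col: pd.Series) -> pd.Series:
        # Reuse the learned statistic on any column.
        return col - self.median

    def fit_transform(self, col: pd.Series) -> pd.Series:
        return self.fit(col).transform(col)


train_col = pd.Series([1.0, 2.0, 9.0])
test_col = pd.Series([2.0, 4.0])

op = MedianCenterer()
train_out = op.fit_transform(train_col)  # median 2.0 is learned here
test_out = op.transform(test_col)        # the same median is reused

print(train_out.tolist())
print(test_out.tolist())
```

Such a class could then be placed in a pipeline entry of the form `{"num_scaling": {"operator": MedianCenterer}}`, mirroring the `operator` entries used in this section.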

In [58]:
from typing import Any, Union
import dask.dataframe as dd
import pandas as pd
import numpy as np

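class MaxAbsScaler:
    """User-defined operator: scale a column by its maximum absolute value."""

    def __init__(self) -> None:
        self.name = "maxabsScaler"

    def fit(self,
            df: pd.Series) -> Any:
        # Learn the maximum absolute value from the fitted column.
        self.maxabs = df.abs().max()
        return self

    def transform(self,
            df: pd.Series) -> pd.Series:
        # Divide every value by the maximum absolute value learned in fit().
        result = df.map(self.compute_val)
        return result

    def fit_transform(self,
            df: pd.Series) -> pd.Series:
        return self.fit(df).transform(df)

    def compute_val(self, val):
        return val / self.maxabs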

customized_cat_pipeline = [
    {"cat_imputation": {"operator": 'constant', "cat_null_value": ['?'], "fill_val": "Hahahaha!!!!!"}},
]
customized_num_pipeline = [
    {"num_scaling": {"operator": MaxAbsScaler}},
]
cleaned_training_df, cleaned_test_df = clean_ml(training_df, test_df,
                                                customized_cat_pipeline=customized_cat_pipeline,
                                                customized_num_pipeline=customized_num_pipeline)

In [59]:
cleaned_training_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
0,0.50,State-gov,0.052210,Bachelors,0.8125,Never-married,Adm-clerical,Not-in-family,White,Male,0.25,0.00,0.50,United-States,<=50K
1,0.75,Self-emp-not-inc,0.056113,Bachelors,0.8125,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.00,0.00,0.00,United-States,<=50K
2,0.50,Private,0.145245,HS-grad,0.5625,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.00,0.00,0.50,United-States,<=50K
3,0.75,Private,0.158093,11th,0.4375,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.00,0.00,0.50,United-States,<=50K
4,0.25,Private,0.227930,Bachelors,0.8125,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.00,0.00,0.50,Cuba,<=50K
5,0.50,Private,0.191676,Masters,0.8750,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.00,0.00,0.50,United-States,<=50K
6,0.75,Private,0.107891,9th,0.3125,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.00,0.00,0.00,Jamaica,<=50K
7,0.75,Self-emp-not-inc,0.141201,HS-grad,0.5625,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.00,0.00,0.50,United-States,>50K
8,0.25,Private,0.030835,Masters,0.8750,Never-married,Prof-specialty,Not-in-family,White,Female,1.00,0.00,0.75,United-States,>50K
9,0.50,Private,0.107394,Bachelors,0.8125,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.50,0.00,0.50,United-States,>50K


In [60]:
cleaned_test_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capitalgain,capitalloss,hoursperweek,native-country,class
34189,0.50,Self-emp-not-inc,0.177726,Some-college,0.6250,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,0.0,0.75,United-States,<=50K
34190,0.50,State-gov,0.037242,Bachelors,0.8125,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,0.50,United-States,>50K
34191,0.00,Private,0.117237,Assoc-voc,0.6875,Never-married,Other-service,Own-child,White,Female,0.0,0.0,0.00,United-States,<=50K
34192,0.75,State-gov,0.174267,Some-college,0.6250,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,0.50,United-States,>50K
34193,0.75,Private,0.185806,HS-grad,0.5625,Married-civ-spouse,Prof-specialty,Husband,White,Male,0.0,0.0,0.50,United-States,<=50K
34194,0.00,Private,0.103883,Some-college,0.6250,Never-married,Sales,Own-child,White,Female,0.0,0.0,0.25,United-States,<=50K
34195,0.25,Local-gov,0.141744,Some-college,0.6250,Married-civ-spouse,Craft-repair,Other-relative,White,Male,0.0,0.0,0.50,United-States,<=50K
34196,0.25,Private,0.227593,Some-college,0.6250,Divorced,Adm-clerical,Unmarried,Black,Female,0.0,0.0,0.25,United-States,<=50K
34197,0.25,State-gov,0.138299,Bachelors,0.8125,Never-married,Prof-specialty,Not-in-family,White,Female,0.0,0.0,0.00,United-States,<=50K
34198,0.00,Private,0.126252,Some-college,0.6250,Separated,Other-service,Own-child,White,Male,0.0,0.0,0.50,United-States,<=50K
