<a href="https://colab.research.google.com/github/ramdhanhdy/Ramdhan_Portfolio/blob/main/Missing_Value_Imputation_using_Datawig.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## About DataWig:

From Amazon's official documentation:
> DataWig learns Machine Learning models to impute missing values in tables

In [None]:
# Install datawig into the directory held in nb_path (nb_path must be defined beforehand)
!pip3 install --target=$nb_path datawig

## Running DataWig

The DataWig API expects a pandas DataFrame as its input.

Amazon's official tutorial suggests trying `SimpleImputer` as a starting point for getting familiar with DataWig. One of the functions belonging to the `SimpleImputer` class is `SimpleImputer.complete`, which fits an imputation model for each column of the input DataFrame that has missing values.

The code below is an example of how `SimpleImputer.complete` works in practice:

In [1]:
import datawig, numpy

# generate synthetic data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric()

# hide 10% of the values 
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)

2021-03-14 10:28:28,337 [INFO]  NumExpr defaulting to 2 threads.


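The `NumExpr defaulting to 2 threads` message is an informational log line from NumExpr, a library that pandas can use to speed up numerical expressions; it appears when the import chain loads it. It reports that NumExpr has set its thread count to the number of CPU cores it detected (2 in this environment).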
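
To see what the masking step in the cell above does on its own, the following minimal sketch applies the same `DataFrame.mask` call to a small stand-in frame. The frame contents, the column names `x` and `y`, and the fixed seed are assumptions made for illustration; they are not what `generate_df_numeric` produces.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 1000 rows, two numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000), "y": rng.normal(size=1000)})

# mask() replaces a cell with NaN wherever the condition is True,
# so uniform draws above 0.9 hide roughly 10% of the cells
hide = rng.random(df.shape) > 0.9
df_with_missing = df.mask(hide)

# Share of missing cells per column
missing_share = df_with_missing.isna().mean()
print(missing_share)
```

Because the mask is random, the share of hidden cells is only approximately 10% and varies from column to column.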

### Imputing numerical columns

In [10]:
# generate synthetic data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric(num_samples=200,
                                       data_column_name='x',
                                       label_column_name='y')
# split into train and test data
df_train, df_test = datawig.utils.random_split(df)
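
For readers who want to see what a random train/test split amounts to, here is a rough pandas-only sketch. It is not DataWig's implementation: the helper name `split_frame`, the 80/20 ratio, the seed, and the stand-in data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def split_frame(frame, train_fraction=0.8, seed=0):
    """Hypothetical helper: shuffle the rows, then cut them into two parts."""
    shuffled = frame.sample(frac=1.0, random_state=seed)
    cut = int(len(shuffled) * train_fraction)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

# Stand-in data (assumption): 200 rows with a simple nonlinear dependency
x = np.linspace(-3, 3, 200)
df = pd.DataFrame({"x": x, "y": x ** 2})

df_train, df_test = split_frame(df)
print(len(df_train), len(df_test))
```

Shuffling before cutting matters here because the stand-in rows are ordered by `x`; without it the test set would cover only the largest values.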

In [13]:
# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['x'],          # column(s) containing information about the column we want to impute
    output_column='y',            # the column we'd like to impute values for
    output_path='imputer_model'   # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
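
One way to check the quality of a numerical imputation is to compare the true values with the predictions. The sketch below computes the mean squared error on a small hand-written stand-in frame, so it runs without DataWig. It assumes the predictions sit in a column named `y_imputed` (the output column name with an `_imputed` suffix); verify the column name on the frame you actually get back.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by imputer.predict(df_test) (assumption):
# true values in 'y', predictions in 'y_imputed'
imputed = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [0.0, 1.0, 4.0, 9.0],
    "y_imputed": [0.5, 1.0, 3.5, 9.0],
})

# Mean squared error and its square root between truth and prediction
errors = imputed["y"] - imputed["y_imputed"]
mse = float(np.mean(errors ** 2))
rmse = float(np.sqrt(mse))
print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}")
```

The RMSE is in the same units as `y`, which usually makes it easier to interpret than the raw MSE.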

### Imputing categorical columns

In [2]:
df = datawig.utils.generate_df_string(num_samples=200, 
                                      data_column_name='sentences',
                                      label_column_name='label')

df_train, df_test = datawig.utils.random_split(df)

In [8]:
df.head()

                             sentences  label
0  lThEn qJu8x iWgNZ PtwBl lThEn UuT3e  iWgNZ
1  6nGis sCfMZ yuKps rhTox JMWa4 n9JEC  6nGis
2  z3SKA JsThJ iWgNZ 1qeoe w5UE1 Soc0K  iWgNZ
3  RebhO FQPS6 3HqdZ IripW 6nGis 2V9SM  6nGis
4  2V9SM 56KkS PfaMm O67IM YQOaN 6nGis  6nGis

In [7]:
list(df.columns.values)

['sentences', 'label']

In [9]:
# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['sentences'],
    output_column='label',
    output_path='imputer_model'
)

# Fit an imputer model on the train data
imputer.fit(train_df = df_train)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
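
For a categorical target, a simple check is the share of rows where the imputed label matches the true one. The sketch below uses a small hand-written stand-in frame so it runs without DataWig. It assumes the predictions sit in a column named `label_imputed`; verify the column name on the frame you actually get back.

```python
import pandas as pd

# Stand-in for the frame returned by imputer.predict(df_test) (assumption):
# true values in 'label', predictions in 'label_imputed'
imputed = pd.DataFrame({
    "label": ["iWgNZ", "6nGis", "iWgNZ", "6nGis", "6nGis"],
    "label_imputed": ["iWgNZ", "6nGis", "6nGis", "6nGis", "6nGis"],
})

# Share of rows where the prediction equals the true label
accuracy = float((imputed["label"] == imputed["label_imputed"]).mean())

# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(imputed["label"], imputed["label_imputed"])

print(f"accuracy: {accuracy:.2f}")
print(confusion)
```

The cross-tabulation shows which labels get confused with each other, which a single accuracy figure hides.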

## Applying DataWig to a Real Dataset