# Optimise Speed of Filling-NaN Function

In this notebook, I compare the running time of different NaN-filling function implementations from the discussion thread [Comparison between different fillna methods][1].

Here, I try filling the NaN values with an array containing all the mean values pre-calculated from the train set.

[1]: https://www.kaggle.com/c/jane-street-market-prediction/discussion/201302

In [None]:
import warnings
warnings.filterwarnings('ignore')

import torch 
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

import os, gc, random
if device.type == 'cuda':  # compare the device type; a torch.device is not a plain string
    import cudf
    import cupy as cp
import datatable as dtable
import pandas as pd
import numpy as np
import janestreet
from numba import njit
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from joblib import dump, load

In [None]:
def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        
seed_everything(42)

# Loading Example Test

`f_mean` contains the mean values that we would like to fill into the example test set. They are pre-calculated from the train set.

In [None]:
test = pd.read_csv('../input/jane-street-market-prediction/example_test.csv')
features = [c for c in test.columns if 'feature' in c]
test = test[features].to_numpy()
f_mean = np.load('../input/js-nn-models/f_mean.npy')
print(f"test.shape={test.shape}, f_mean.shape={f_mean.shape}")

In [None]:
print(f"f_mean={f_mean}")

# Filling-NaN Functions

First, we compare two basic fillna functions. One is from [Jane Street: How to deal with Timeout error][1] and the other is my vectorised version.

[1]: https://www.kaggle.com/markmipt/jane-street-how-to-deal-with-timeout-error

In [None]:
# https://www.kaggle.com/markmipt/jane-street-how-to-deal-with-timeout-error
def fillna_minus_plus(array, values):
    array -= values
    array = np.nan_to_num(array)
    array += values
    return array

def fillna_vectorised(array, values):
    array = np.nan_to_num(array) + np.isnan(array) * values
    return array
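
As a quick sanity check, here is a minimal sketch on a toy row (the arrays `row` and `values` are made up for illustration and stand in for one test sample and `f_mean`). It suggests that both functions give the same filled result, and that `fillna_minus_plus` also shifts the array it receives, because `array -= values` works in place.

```python
import numpy as np

def fillna_minus_plus(array, values):
    array -= values               # in place: the caller's array is shifted too
    array = np.nan_to_num(array)  # new array, NaN -> 0
    array += values
    return array

def fillna_vectorised(array, values):
    return np.nan_to_num(array) + np.isnan(array) * values

values = np.array([10.0, 20.0, 30.0])  # toy stand-in for f_mean
row = np.array([1.0, np.nan, 3.0])     # toy stand-in for one sample

out_a = fillna_minus_plus(row.copy(), values)
out_b = fillna_vectorised(row.copy(), values)
print(out_a, out_b)

# Side effect: the input passed to fillna_minus_plus is left as (row - values).
shifted = row.copy()
fillna_minus_plus(shifted, values)
print(shifted)
```

Inside `for_loop` the side effect is harmless because the returned row is written back, but it matters if the function is called on an array you still need.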

Since not all samples in the test set contain NaN values, it is better to first check whether a sample has any NaN and call the fillna function only if it does. This should save time on clean samples. The check is from [Fast check for NaN in NumPy][1].

[1]: https://stackoverflow.com/questions/6736590/fast-check-for-nan-in-numpy

In [None]:
def fillna_minus_plus_with_check(array, values):
    if np.isnan(array.sum()):
        array -= values
        array = np.nan_to_num(array)
        array += values
    return array

def fillna_vectorised_with_check(array, values):
    if np.isnan(array.sum()):
        array = np.nan_to_num(array) + np.isnan(array) * values
    return array
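
The check relies on NaN propagating through a sum. A small sketch on toy arrays (all names here are illustrative) shows the idea, together with one caveat I am aware of: a row holding both `+inf` and `-inf` also sums to NaN, so the check can fire without any NaN being present. In that case the fill simply runs and changes nothing.

```python
import numpy as np

clean = np.array([1.0, 2.0, 3.0])
dirty = np.array([1.0, np.nan, 3.0])

# One reduction to a scalar instead of building a boolean mask per row.
has_nan_clean = bool(np.isnan(clean.sum()))
has_nan_dirty = bool(np.isnan(dirty.sum()))
print(has_nan_clean, has_nan_dirty)

# Caveat: inf + (-inf) is NaN, so the sum-based check reports a NaN here
# although the array contains none.
mixed_inf = np.array([np.inf, -np.inf])
with np.errstate(invalid='ignore'):
    false_positive = bool(np.isnan(mixed_inf.sum()))
print(false_positive, bool(np.isnan(mixed_inf).any()))
```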

I also came up with the idea of using `numpy.where`, with and without `numba.njit`. Let's try both!

In [None]:
def fillna_npwhere(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array

@njit
def fillna_npwhere_njit(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array
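
One property of the `numpy.where` variant worth noting: it copies the non-NaN entries unchanged, whereas subtracting and re-adding the fill values can lose low-order bits. The sketch below uses a deliberately exaggerated fill value (`1e16`, an assumption chosen only to make the effect visible) and plain NumPy without numba.

```python
import numpy as np

def fillna_npwhere(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array

values = np.array([1e16, 20.0, 30.0])  # exaggerated toy fill values
row = np.array([0.1, np.nan, 3.0])

filled = fillna_npwhere(row, values)
print(filled)        # non-NaN entries are copied as they are
print(row)           # the input row is not modified

# Subtract-then-add round trip on the same data: 0.1 is lost next to 1e16.
roundtrip = (row - values) + values
print(roundtrip[0])
```

With realistic feature means the difference is far smaller, so this is about exactness rather than a practical accuracy problem.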

The baseline method, `pandas.DataFrame.fillna`, is timed at the end of the next section.

# Time-Consumption Comparison

First, we create a for-loop function, since during inference we fill NaN values one sample at a time.

In [None]:
def for_loop(method, matrix, values):
    for i in range(matrix.shape[0]):
        matrix[i] = method(matrix[i], values)
    return matrix
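
Note that `for_loop` writes the filled rows back into `matrix`. When `matrix` is a slice such as `test[:, 1:]`, that slice is a view, so the underlying array loses its NaN values after the first pass. The following sketch on a toy matrix (names and numbers are illustrative) shows the effect and how passing a copy avoids it.

```python
import numpy as np

def fillna_npwhere(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array

def for_loop(method, matrix, values):
    for i in range(matrix.shape[0]):
        matrix[i] = method(matrix[i], values)
    return matrix

data = np.array([[0.0, 1.0, np.nan],
                 [0.0, np.nan, 3.0]])
backup = data.copy()
values = np.array([10.0, 20.0])  # toy fill values for columns 1 and 2

# Basic slicing returns a view: filling it also fills `data`.
for_loop(fillna_npwhere, data[:, 1:], values)
nans_after_view = int(np.isnan(data).sum())

# Filling a copy leaves the source array untouched.
filled_copy = for_loop(fillna_npwhere, backup[:, 1:].copy(), values)
nans_after_copy = int(np.isnan(backup).sum())
print(nans_after_view, nans_after_copy)
```

This matters for benchmarking: if the same view is passed repeatedly, only the very first pass sees any NaN.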

Let's compare the time consumption of different methods.

In [None]:
# for_loop writes into the array it is given, and test[:, 1:] is a view of test.
# Each run therefore gets a fresh copy so that every method sees the same NaN values;
# the cost of the copy is included in the timing but is the same for all methods.
print('fillna_minus_plus:')
%timeit for_loop(fillna_minus_plus, test[:, 1:].copy(), f_mean)
print('-' * 65)

print('fillna_vectorised:')
%timeit for_loop(fillna_vectorised, test[:, 1:].copy(), f_mean)
print('-' * 65)

print('fillna_minus_plus_with_check:')
%timeit for_loop(fillna_minus_plus_with_check, test[:, 1:].copy(), f_mean)
print('-' * 65)

print('fillna_vectorised_with_check:')
%timeit for_loop(fillna_vectorised_with_check, test[:, 1:].copy(), f_mean)
print('-' * 65)

print('fillna_npwhere:')
%timeit for_loop(fillna_npwhere, test[:, 1:].copy(), f_mean)
print('-' * 65)

print('fillna_npwhere_njit:')
%timeit for_loop(fillna_npwhere_njit, test[:, 1:].copy(), f_mean)

Finally, let's also check the pandas `fillna` function which is widely used in many notebooks.

In [None]:
test = pd.read_csv('../input/jane-street-market-prediction/example_test.csv', usecols = features[1:])
f_mean_dict = dict(zip(features[1:], f_mean))

In [None]:
def pandas_fillna(df, values):
    return df.fillna(values)

def for_loop_pandas(method, df, values):
    for i in range(df.shape[0]):
        df.loc[i] = method(df.loc[i], values)
    return df

In [None]:
print('pandas fillna:')
%timeit for_loop_pandas(pandas_fillna, test, f_mean_dict)

Oooops, it is too slow!

# Forward-Filling Example
Here is an example of using forward-filling (i.e., filling with the last seen valid value instead of filling with a constant array).

In [None]:
def for_loop_ffill(method, matrix):
    tmp = np.zeros(matrix.shape[1])
    for i in range(matrix.shape[0]):
        matrix[i] = method(matrix[i], tmp)
        tmp = matrix[i]
    return matrix
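
To make the behaviour concrete, here is a small sketch on a toy matrix (the data is made up, and the plain NumPy `fillna_npwhere` is used in place of the numba version). Each NaN takes the value from the previous, already filled row, and NaN values in the first row fall back to the initial zeros.

```python
import numpy as np

def fillna_npwhere(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array

def for_loop_ffill(method, matrix):
    tmp = np.zeros(matrix.shape[1])   # fallback for NaN in the first row
    for i in range(matrix.shape[0]):
        matrix[i] = method(matrix[i], tmp)
        tmp = matrix[i]               # last filled row becomes the next fill source
    return matrix

data = np.array([[np.nan, 1.0],
                 [2.0, np.nan],
                 [np.nan, np.nan],
                 [5.0, 6.0]])

out = for_loop_ffill(fillna_npwhere, data.copy())
print(out)
```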

We compare the running time of mean-filling and forward-filling.

In [None]:
test = pd.read_csv('../input/jane-street-market-prediction/example_test.csv')
test = test[features].values

In [None]:
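# Both runs use a fresh copy of the same columns, so the first benchmark
# cannot fill the NaN values that the second one is supposed to see.
print('fillna_npwhere_njit (mean-filling):')
%timeit for_loop(fillna_npwhere_njit, test[:, 1:].copy(), f_mean)
print('-' * 65)

print('fillna_npwhere_njit (forward-filling):')
%timeit for_loop_ffill(fillna_npwhere_njit, test[:, 1:].copy())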

Forward-filling is slightly slower than mean-filling because it has to update the temporary array on every iteration.

# Conclusion

So far, the fastest version is the numba version!

In [None]:
import numpy as np
from numba import njit

@njit
def fillna_npwhere_njit(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array