# Intro

The sample code provided by the competition iterates over the entire submission dataframe to fill in the values. There has already been some effort to optimize this, as discussed in [this kernel](https://www.kaggle.com/code/hasanbasriakcay/tpsjun22-10xfastersubmissionfunction/notebook?scriptVersionId=97216338), but the function below improves readability and cuts the run time roughly in **HALF**, i.e. ~50s (vs. ~100s in [the kernel](https://www.kaggle.com/code/hasanbasriakcay/tpsjun22-10xfastersubmissionfunction/notebook?scriptVersionId=97216338)).
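
For contrast, here is a rough sketch of what a row-by-row fill looks like. The competition's sample code is not reproduced in this notebook, so this is only an assumed approximation of it, and `toy_output` and `toy_submission` are made-up stand-ins for the real frames.

```python
import pandas as pd

# Hypothetical toy data standing in for the imputed frame and the sample submission
toy_output = pd.DataFrame({'F_1': [1.0, 2.0], 'F_2': [5.5, 5.0]},
                          index=pd.Index([0, 1], name='row_id'))
toy_submission = pd.DataFrame({'value': [0.0, 0.0]},
                              index=pd.Index(['0-F_2', '1-F_1'], name='row-col'))

# Row-by-row fill: one Python-level lookup and assignment per submission entry
for key in toy_submission.index:
    row, col = key.split('-', 1)  # '0-F_2' -> ('0', 'F_2')
    toy_submission.loc[key, 'value'] = toy_output.loc[int(row), col]

print(toy_submission)
```

Each iteration pays the overhead of a separate `.loc` lookup, which is presumably why this style gets slow on a submission with many entries.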


In [None]:
import numpy as np
import pandas as pd
from pathlib import Path


input_path = Path('/kaggle/input/tabular-playground-series-jun-2022/')

data = pd.read_csv(input_path / 'data.csv', index_col='row_id')
print('Input shape :', data.shape)
print('Input NaN count :', data.isna().sum().sum())

In [None]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(
        missing_values=np.nan,
        strategy='mean')

output = data.copy()
output[:] = imp.fit_transform(data)
print('Output shape :', output.shape)
print('Output NaN count :', output.isna().sum().sum())

# Optimized Submissions

The function filters the output dataframe based on the location of NaN values in the input/source dataframe, using a MultiIndex.

Key things:
* Based on my `%%timeit` results, melting the `isna()` mask is faster than melting the original dataframe and then filtering on NaN
* `DataFrame.query()` is a bit slower than simple dataframe indexing (`df[df['isNull'] == True]`) but it provides better readability & function chaining capabilities
* Filtering with a MultiIndex (row_id, col) is faster than filtering on plain columns (row_id, col)
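
To make the idea concrete, here is a minimal sketch of the same steps on a tiny frame. `toy_source` and `toy_output` are hypothetical stand-ins for `data` and `output`, and the mean-fill is only there to produce some imputed values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-ins for the notebook's `data` (with NaNs) and `output` (imputed)
toy_source = pd.DataFrame(
    {'F_1': [1.0, np.nan, 3.0], 'F_2': [np.nan, 5.0, 6.0]},
    index=pd.Index([0, 1, 2], name='row_id'))
toy_output = toy_source.fillna(toy_source.mean())

# Boolean mask -> long format -> keep only NaN cells -> MultiIndex on (row_id, col)
nan_only = (toy_source
            .isna()
            .melt(ignore_index=False, var_name='col', value_name='isNull')
            .query('isNull == True')
            .set_index('col', append=True))

# Imputed values in long format, indexed the same way
long_output = (toy_output
               .melt(ignore_index=False, var_name='col')
               .set_index('col', append=True))

# Keep only the cells that were NaN in the source
picked = long_output.loc[nan_only.index].sort_index()
print(picked)
```

Only the two originally missing cells, (0, F_2) and (1, F_1), survive the final selection, each carrying its imputed value.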

In [None]:
%%time

def generate_submission(source_df: pd.DataFrame, output_df: pd.DataFrame) -> pd.DataFrame:
    # Melt source dataframe filtered on NaN values to form [row_id, col, isNull] ...
    # ... with MultiIndex on (row_id, col)
    nan_only = (source_df
                .isna()
                .melt(ignore_index=False, var_name='col', value_name='isNull')
                .query('isNull == True')
                .set_index(['col'], append=True))

    # Melt output dataframe to form [row_id, col, value] with MultiIndex on (row_id, col)
    out = (output_df
           .melt(ignore_index=False, var_name='col')
           .set_index(['col'], append=True))

    # Filter output's MultiIndex on nan_only's MultiIndex
    out = (out.loc[nan_only.index]
           .sort_index())

    # Flatten MultiIndex to Index & rename to desired column
    out.index = [f'{r}-{c}' for r, c in out.index]
    out.index.name = 'row-col'
    return out

result = generate_submission(data, output)
result

# Verify Output

A quick check that the number of output rows equals the number of NaNs in the input, and that our output index is the same as the `sample_submission` index.

In [None]:
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='row-col')
print('Output matches #NaNs      : ', data.isna().sum().sum() == len(result.index))
print('Submission index matches  : ', submission.index.equals(result.index))
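
The check above only prints booleans. If a hard failure is preferred, one possible variant is `pd.testing.assert_index_equal`, sketched here on made-up indexes (`expected` and `actual` are hypothetical stand-ins for `submission.index` and `result.index`).

```python
import pandas as pd

# Hypothetical toy indexes standing in for `submission.index` and `result.index`
expected = pd.Index(['0-F_2', '1-F_1'], name='row-col')
actual = pd.Index(['0-F_2', '1-F_1'], name='row-col')

# Raises an AssertionError describing the difference if labels or order differ
pd.testing.assert_index_equal(actual, expected)
print('Indexes match')
```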