Audit the NullTransformer #192

Closed
amontanez24 opened this issue Jul 19, 2021 · 0 comments · Fixed by #193
amontanez24 commented Jul 19, 2021

## Summary of the Audit

### Fit

When the fill value is provided, ~99% of the time in `fit` is spent on the line

```python
self.nulls = data.isnull().any()
```

It appears to be slightly faster to do

```python
self.nulls = data.isnull().values.any()
```

When the fill value is set to `mean` or `mode`, most of the time is spent on the lines

```python
self._fill_value = data.mean() if pd.notnull(data).any() else 0
```

or

```python
self._fill_value = data.mode(dropna=True)[0] if pd.notnull(data).any() else 0
```

We already have to compute the null values later, so I suggest computing the null mask once at the top and then using that array both to set `self.nulls` and to derive a boolean telling us whether any values are non-null. This roughly halves the runtime.
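As a rough illustration, the single-mask approach could look like the sketch below. This is hypothetical code, not the actual `NullTransformer`; the class shell and the `'mean'`/`'mode'` dispatch are assumptions based on the snippets above.

```python
import numpy as np
import pandas as pd


class NullTransformer:
    """Hypothetical sketch of the suggested `fit` refactor."""

    def __init__(self, fill_value='mean'):
        self._fill_value = fill_value

    def fit(self, data):
        # Compute the null mask exactly once and reuse it.
        null_values = data.isnull().values
        self.nulls = null_values.any()
        any_not_null = not null_values.all()  # True if at least one non-null value

        if self._fill_value == 'mean':
            self._fill_value = data.mean() if any_not_null else 0
        elif self._fill_value == 'mode':
            self._fill_value = data.mode(dropna=True)[0] if any_not_null else 0
```

The key point is that `data.isnull()` is evaluated a single time, instead of once for `self.nulls` and again inside the `pd.notnull(data).any()` guards.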

### Transform

After analyzing the different scenarios, the slowest line in the `transform` function is

```python
data[isnull] = self._fill_value
```

which executes when `copy` is `False`. I don't really see a faster way to implement this. The next most taxing line is

```python
return pd.concat([data, isnull.astype('int')], axis=1).values
```

which happens when `self._null_column` is `True`. Again, I don't see a better way to do this, since the columns need to be combined.

One problem with the code, however, is that the docstring says `data` can be a numpy array, while the first line of `transform` is `isnull = data.isnull()`. Numpy arrays don't have an `isnull()` method, so this will crash. We should probably convert to a pandas Series first.
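A minimal fix could coerce the input before calling `.isnull()`. The helper name `to_series` is an assumption for illustration, not part of the actual code:

```python
import numpy as np
import pandas as pd


def to_series(data):
    """Accept either a numpy array or a pandas Series (hypothetical helper)."""
    if isinstance(data, np.ndarray):
        data = pd.Series(data)
    return data


# With the guard in place, `.isnull()` works for both input types.
isnull = to_series(np.array([1.0, np.nan])).isnull()
```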

### Reverse Transform

If `null_column` is `True`, then about 43% of the time is spent on

```python
data = pd.Series(data[:, 0])
```

If it is `False`, then about 40% of the time is spent on

```python
pd.Series(data)
```

So in both cases a good chunk of the time goes just to converting to a Series. Computing `isnull` takes about the same time in both branches. I don't think there is a more efficient way to do this.

The next most time-consuming line is

```python
data.iloc[isnull] = np.nan
```

Again, I don't know of any way to do this more efficiently.
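For context, the two branches described above can be sketched as follows. This is a simplified, hypothetical reconstruction; in particular, passing `isnull` in explicitly and the defensive `copy()` are assumptions, not the actual implementation:

```python
import numpy as np
import pandas as pd


def reverse_transform(data, null_column, isnull):
    """Sketch of the branches discussed above.

    `isnull` is assumed to be the positions (or boolean mask) of rows
    whose values should be restored to NaN.
    """
    if null_column:
        data = pd.Series(data[:, 0])  # first column holds the values
    else:
        data = pd.Series(data)
    data = data.copy()                # avoid mutating the caller's array
    data.iloc[isnull] = np.nan        # restore the original nulls
    return data
```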
