Audit the NullTransformer #192

Closed
amontanez24 opened this issue Jul 19, 2021 · 0 comments · Fixed by #193
amontanez24 commented Jul 19, 2021

## Summary of the Audit

### Fit

When the fill value is provided, ~99% of the time in `fit` is spent on the line

```python
self.nulls = data.isnull().any()
```

It appears to be slightly faster to do

```python
self.nulls = data.isnull().values.any()
```

When the fill value is set to `mean` or `mode`, most of the time is spent on the lines

```python
self._fill_value = data.mean() if pd.notnull(data).any() else 0
```

or

```python
self._fill_value = data.mode(dropna=True)[0] if pd.notnull(data).any() else 0
```

We already have to compute the null values later, so I suggest computing the null mask once at the top and then using that array both to set `self.nulls` and to derive a boolean telling us whether any values are non-null. This roughly halves the runtime.
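As a rough illustration, the single-mask approach could look like the sketch below. This is hypothetical code, not the actual `NullTransformer`; the class shell and the `'mean'`/`'mode'` dispatch are assumptions based on the snippets above.

```python
import numpy as np
import pandas as pd


class NullTransformer:
    """Hypothetical sketch of the suggested `fit` refactor."""

    def __init__(self, fill_value='mean'):
        self._fill_value = fill_value

    def fit(self, data):
        # Compute the null mask exactly once and reuse it.
        null_values = data.isnull().values
        self.nulls = null_values.any()
        any_not_null = not null_values.all()  # True if at least one non-null value

        if self._fill_value == 'mean':
            self._fill_value = data.mean() if any_not_null else 0
        elif self._fill_value == 'mode':
            self._fill_value = data.mode(dropna=True)[0] if any_not_null else 0
```

The key point is that `data.isnull()` is evaluated a single time, instead of once for `self.nulls` and again inside the `pd.notnull(data).any()` guards.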

### Transform

After analyzing the different scenarios, the slowest line in the `transform` function is

```python
data[isnull] = self._fill_value
```

which executes when `copy` is `False`. I don't really see a faster way to implement this. The next most taxing line is

```python
return pd.concat([data, isnull.astype('int')], axis=1).values
```

which happens when `self._null_column` is `True`. Again, I don't see a better way to do this, since the columns need to be combined.

One problem with the code, however, is that the docstring says `data` can be a numpy array, while the first line of `transform` is `isnull = data.isnull()`. Numpy arrays don't have an `isnull()` method, so this will crash. We should probably convert to a pandas Series first.
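A minimal fix could coerce the input before calling `.isnull()`. The helper name `to_series` is an assumption for illustration, not part of the actual code:

```python
import numpy as np
import pandas as pd


def to_series(data):
    """Accept either a numpy array or a pandas Series (hypothetical helper)."""
    if isinstance(data, np.ndarray):
        data = pd.Series(data)
    return data


# With the guard in place, `.isnull()` works for both input types.
isnull = to_series(np.array([1.0, np.nan])).isnull()
```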

### Reverse Transform

If `null_column` is `True`, then about 43% of the time is spent on

```python
data = pd.Series(data[:, 0])
```

If it is `False`, then about 40% of the time is spent on

```python
pd.Series(data)
```

So in both cases a good chunk of the time goes just to converting to a Series. Computing `isnull` takes about the same time in both branches. I don't think there is a more efficient way to do this.

The next most time-consuming line is

```python
data.iloc[isnull] = np.nan
```

Again, I don't know of any way to do this more efficiently.
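For context, the two branches described above can be sketched as follows. This is a simplified, hypothetical reconstruction; in particular, passing `isnull` in explicitly and the defensive `copy()` are assumptions, not the actual implementation:

```python
import numpy as np
import pandas as pd


def reverse_transform(data, null_column, isnull):
    """Sketch of the branches discussed above.

    `isnull` is assumed to be the positions (or boolean mask) of rows
    whose values should be restored to NaN.
    """
    if null_column:
        data = pd.Series(data[:, 0])  # first column holds the values
    else:
        data = pd.Series(data)
    data = data.copy()                # avoid mutating the caller's array
    data.iloc[isnull] = np.nan        # restore the original nulls
    return data
```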
