Summary of the Audit

Fit

When the fill value is provided, 99% of the time in `fit` is spent on the line `self.nulls = data.isnull().any()`. It appears to be slightly faster to do `self.nulls = data.isnull().values.any()`.

When the fill value is set to the mean or the mode, most of the time is spent on the lines `self._fill_value = data.mean() if pd.notnull(data).any() else 0` or `self._fill_value = data.mode(dropna=True)[0] if pd.notnull(data).any() else 0`.

We already have to compute the null values later, so I suggest computing the null mask once at the top and then using that array to derive both `self.nulls` and another boolean that tells us whether any values are non-null. This cuts the time roughly in half.
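A minimal sketch of the suggested change, assuming `data` is a pandas Series (the surrounding class, the `fill_value` argument, and the `'mean'`/`'mode'` sentinel strings are assumptions based on the lines quoted above):

```python
import pandas as pd


def fit(self, data, fill_value=None):
    """Sketch of the proposed fit: compute the null mask only once."""
    # Single pass over the data: one boolean ndarray reused everywhere.
    null_mask = data.isnull().values

    # Calling .any() on the raw ndarray skips the pandas reduction overhead.
    self.nulls = null_mask.any()
    any_not_null = not null_mask.all()

    if fill_value == 'mean':
        self._fill_value = data.mean() if any_not_null else 0
    elif fill_value == 'mode':
        self._fill_value = data.mode(dropna=True)[0] if any_not_null else 0
    else:
        # An explicit fill value was provided by the user.
        self._fill_value = fill_value
```

With this shape, `data.isnull()` is evaluated once instead of once for `self.nulls` and again inside the `pd.notnull(data).any()` checks.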
Transform
After analyzing the different scenarios, the slowest line in the `transform` function is `data[isnull] = self._fill_value`, which executes when `copy` is `False`. I don't really see a faster way to implement this. The next most taxing line is `return pd.concat([data, isnull.astype('int')], axis=1).values`, which runs when `self._null_column` is `True`. Again, I don't see a better way to do this, since the columns need to be combined.

One problem with the code, however, is that the docstring says `data` can be a numpy array, but the first line of `transform` is `isnull = data.isnull()`. Numpy arrays don't have an `isnull()` method, so this will crash. We should probably convert to a pandas Series first.
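One way the guard could look, sketched around the lines quoted above (the `copy` and `_null_column` attributes are assumed from the audit; the class context is hypothetical):

```python
import numpy as np
import pandas as pd


def transform(self, data):
    """Sketch of transform with a guard for numpy input."""
    # Numpy arrays have no isnull() method, so coerce to a Series first.
    if not isinstance(data, pd.Series):
        data = pd.Series(data)

    isnull = data.isnull()
    if isnull.any():
        if self.copy:
            data = data.copy()
        data[isnull] = self._fill_value

    if self._null_column:
        # Append the null indicator as a second column, as in the
        # pd.concat line above.
        return pd.concat([data, isnull.astype('int')], axis=1).values

    return data.values
```

The `isinstance` check is cheap relative to the concat and assignment, so accepting both input types this way shouldn't change the profile.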
Reverse Transform
If `null_column` is `True`, about 43% of the time is spent on `data = pd.Series(data[:, 0])`. If it's `False`, about 40% of the time is spent on `pd.Series(data)`. So in both cases a good chunk of the time goes to just converting to a Series. The calculation of `isnull` takes about the same time in both branches, and I don't think there is a more efficient way to do it.

The next most time-consuming line is `data.iloc[isnull] = np.nan`. Again, I don't know of any way to do this more efficiently.
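Putting the profiled lines together as a sketch, assuming the attribute names from the audit; how `isnull` is derived in the `null_column=False` branch is not shown in the issue, so the fill-value comparison below is only a guess:

```python
import numpy as np
import pandas as pd


def reverse_transform(self, data):
    """Sketch of reverse_transform as profiled above."""
    # data arrives as a numpy array; the Series construction below
    # dominates the runtime in both branches, per the profile.
    if self._null_column:
        # Second column is the null indicator, first column the values.
        isnull = data[:, 1].round().astype(bool)
        data = pd.Series(data[:, 0])
    else:
        data = pd.Series(data)
        # Assumption: rows equal to the fill value are treated as nulls.
        isnull = (data == self._fill_value).values

    if isnull.any():
        # The other hot line from the profile.
        data.iloc[isnull] = np.nan

    return data
```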