Feature thought: Add generic imputation #106
Aloha @eli-s-goldberg! 😸 Thanks for pinging in with the feature request. Love it - it means we've got users who are engaged! Before I go on and try an implementation, I have a few questions. "Imputation" can be a bit nebulous - would you be open to providing some details? One question I have is: how does this differ from …? Also, would you be able to describe a bit more about a desired API? You don't have to worry about the implementation; we can try to figure that out. Those two specifics would be helpful for me to work out a proper implementation!
I gave it some thought, and here are my ideas on a generic imputation function:

```python
import numpy as np
import pandas_flavor as pf


@pf.register_dataframe_method
def impute(df, column: str, value=None, statistic=None):
    """
    Method-chainable imputation of values in a column.

    Underneath the hood, this function calls the `.fillna()` method available
    to every pandas.Series object.

    Method-chaining example:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            # Impute null values with 0
            .impute(column='sales', value=0.0)
            # Impute null values with median
            .impute(column='score', statistic='median')
        )

    Either one of ``value`` or ``statistic`` should be provided.

    If ``value`` is provided, then all null values in the selected column
    will take on the value provided.

    If ``statistic`` is provided, then all null values in the selected column
    will take on the summary statistic value of the other non-null values.

    Currently supported values of ``statistic`` include:

    - ``mean`` (also aliased by ``average``)
    - ``median``
    - ``mode``
    - ``minimum`` (also aliased by ``min``)
    - ``maximum`` (also aliased by ``max``)

    :param df: A pandas DataFrame.
    :param column: The name of the column on which to impute values.
    :param value: (optional) The value to impute.
    :param statistic: (optional) The column statistic to impute.
    """
    # First, check that only one of `value` or `statistic` is provided.
    if value is not None and statistic is not None:
        raise ValueError(
            'Only one of `value` or `statistic` should be provided.'
        )

    # If `statistic` is provided, compute the relevant summary statistic
    # from the non-null data.
    funcs = {
        'mean': np.mean,
        'average': np.mean,  # aliased
        'median': np.median,
        # NumPy has no `np.mode`; use pandas' Series.mode and take the
        # first modal value.
        'mode': lambda x: x.mode().iloc[0],
        # `np.minimum`/`np.maximum` are elementwise binary ufuncs, not
        # reductions, so use `np.min`/`np.max` here.
        'minimum': np.min,
        'min': np.min,  # aliased
        'maximum': np.max,
        'max': np.max,  # aliased
    }
    if statistic is not None:
        # Check that the `statistic` keyword argument is one of the
        # approved set.
        if statistic not in funcs:
            raise KeyError(f'`statistic` must be one of {set(funcs)}')
        value = funcs[statistic](df[column].dropna())

    if value is not None:
        df[column] = df[column].fillna(value)
    return df
```

What are your thoughts on this?
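For comparison, the same two calls from the docstring example can be expressed with plain pandas `fillna` today; a minimal sketch (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sales": [1.0, np.nan, 3.0],
    "score": [10.0, 20.0, np.nan],
})

# Equivalent of impute(column='sales', value=0.0):
# fill nulls with a fixed value.
df["sales"] = df["sales"].fillna(0.0)

# Equivalent of impute(column='score', statistic='median'):
# fill nulls with the column median of the non-null values.
df["score"] = df["score"].fillna(df["score"].median())

print(df["sales"].tolist())  # [1.0, 0.0, 3.0]
print(df["score"].tolist())  # [10.0, 20.0, 15.0]
```

The proposed `impute` method is essentially sugar over these two patterns, with validation of the `value`/`statistic` arguments.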
@ericmjl - Thanks for keying me in, and I like what's written. That said, I actually need to check this out quickly and use it, which is something I'm unable to do until tomorrow afternoon. A more complex imputation task that I find myself doing is a `fillna` with age, gender, and disease mean imputation. Here's a description of the hack, as my code is a bit too specific at this point to share or be useful. First, I iterate through the data to create a dict that links each category combination with its mean value. Next, I loop through the unique labels, genders, and diseases, using … With a bit of your patented …
@eli-s-goldberg now that you've described this, I think the grammar and ontology look something like this. Potential function signature (ontology):

```python
def grouped_impute(df, columns, mapping=None, statistic=None):
    pass
```

Another potential function signature, if we wanted to sound more academic:

```python
def stratified_impute(df, columns, mapping, statistic):
    pass
```

The grammar can actually be quite generic. The way you've shown it is pretty good, actually. I'll use that as a jumping-off point for this.
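For a sense of what either signature would do under the hood: a grouped/stratified mean impute is close to a one-liner with pandas groupby-transform. A hypothetical sketch (column names invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "age_band": ["40s", "40s", "40s", "40s"],
    "pack_year": [10.0, np.nan, 20.0, 40.0],
})

# Fill each null with the mean of its (gender, age_band) stratum.
df["pack_year"] = df.groupby(["gender", "age_band"])["pack_year"].transform(
    lambda s: s.fillna(s.mean())
)

print(df["pack_year"].tolist())  # [10.0, 10.0, 20.0, 40.0]
```

Swapping `s.mean()` for `s.median()` etc. would give the `statistic` dimension of the grammar.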
A bit of an update. This doesn't handle variable statistics or textual data, but it's getting there. I've been using it generically for the past weekend or so. It's not the quickest thing (millions of rows take a few minutes per column). This will only work with a single column.

```python
def stratified_impute(df, mapping, columns):
    """
    Perform a stratified impute match. Note: ``mapping`` cannot contain
    ``columns``.

    Method chaining usage:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            .stratified_impute(
                mapping=['gender', 'race', 'ethnicity', 'disease'],
                columns=['pack_year'],
            )
        )

    :param df: A pandas DataFrame.
    :param mapping: Column(s) on which to group for the stratified impute.
    :param columns: Column(s) on which to perform stratified imputation.
    """
    if set(columns).issubset(mapping):
        raise ValueError("{} must not include {}".format(mapping, columns))

    filtered_groupby = df.groupby(mapping)

    # Build a dict linking each (column, group) pair with the group's mean.
    strat_dict = dict()
    for column in columns:
        for name, group in filtered_groupby:
            group = group.dropna()
            strat_dict.update(
                {str(column + str(name)): group[column].mean()}
            )

    group_nan_list = []
    group_nonnan_list = []
    for column in columns:
        for name, group in filtered_groupby:
            # nan_data to be filled
            nan_data = group[pd.isna(group[column])]
            # Replace nan_data with backfilled data from the dict.
            nan_data = nan_data.fillna(
                {column: strat_dict.get(str(column + str(name)))}
            )
            group_nan_list.append(nan_data)
            # Non-nan data to be passed along unchanged.
            nonnan_data = group[~pd.isna(group[column])]
            group_nonnan_list.append(nonnan_data)

    group_nan_list.extend(group_nonnan_list)
    df = pd.concat(group_nan_list)
    return df
```
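On the speed concern: the per-group dict-and-concat loop can likely be replaced by a vectorized groupby-transform, which pandas computes without a Python-level loop over groups. A sketch under that assumption (`stratified_impute_fast` is a hypothetical name; one behavioral difference is that the mean here skips NaNs only in the target column, rather than dropping rows with NaNs in any column):

```python
import numpy as np
import pandas as pd

def stratified_impute_fast(df, mapping, columns):
    """Vectorized sketch: fill nulls in each of `columns` with the
    mean of the stratum defined by `mapping`."""
    if set(columns) & set(mapping):
        raise ValueError("{} must not include {}".format(mapping, columns))
    df = df.copy()
    # Per-row stratum means, aligned to df's index and columns.
    means = df.groupby(mapping)[columns].transform("mean")
    df[columns] = df[columns].fillna(means)
    return df

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "disease": ["a", "a", "a", "a"],
    "pack_year": [10.0, np.nan, 30.0, np.nan],
})
out = stratified_impute_fast(df, ["gender", "disease"], ["pack_year"])
print(out["pack_year"].tolist())  # [10.0, 10.0, 30.0, 30.0]
```

This also preserves the original row order, avoiding the reordering that `pd.concat` over group slices introduces.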
Hi Eric,
Just a thought -- I find myself doing a lot of hand imputation. It would be nice if you could add imputation as a chainable function.
I'm under the gun and can't submit a PR, but I think this would be a great feature.