Feature thought: Add generic imputation #106

Open
eli-s-goldberg opened this issue Jan 22, 2019 · 5 comments

Labels: available for hacking · enhancement · good first issue

eli-s-goldberg commented Jan 22, 2019
Hi Eric,

Just a thought -- I find myself doing a lot of hand imputation. It would be nice if you could add imputation as a chainable function.

I'm under the gun and can't submit a PR, but I think this would be a great feature.


ericmjl commented Jan 22, 2019

Aloha @eli-s-goldberg! 😸

Thanks for pinging in with the feature request. Love it - it means we've got users who are engaged!

Before I go on and try an implementation, I have a few questions. "Imputation" can be a bit nebulous - would you be open to providing some details?

One question I have is - how does this differ from df.fillna()?

Also, would you be able to describe the desired API a bit more? You don't have to worry about the implementation; we can try to figure that out.

Those two specifics would be helpful for me to work out a proper implementation!
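
To make the comparison concrete, here's roughly what I mean -- a minimal sketch only; the `impute` name and signature below are hypothetical, nothing is implemented yet:

import pandas as pd

df = pd.DataFrame({"sales": [1.0, None, 3.0]})

# Plain pandas: fillna is usually written as an assignment, which breaks a method chain.
df["sales"] = df["sales"].fillna(0.0)

# A chainable version (hypothetical) would slot straight into a pipeline instead:
# df = pd.DataFrame(...).clean_names().impute(column="sales", value=0.0)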


ericmjl commented Jan 23, 2019

I gave it some thought, and here are my ideas for a generic imputation function:

import numpy as np
import pandas_flavor as pf


@pf.register_dataframe_method
def impute(df, column: str, value=None, statistic=None):
    """
    Method-chainable imputation of values in a column.

    Underneath the hood, this function calls the `.fillna()` method available
    to every pandas.Series object.

    Method-chaining example:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            # Impute null values with 0
            .impute(column='sales', value=0.0)
            # Impute null values with median
            .impute(column='score', statistic='median')
        )

    Either one of ``value`` or ``statistic`` should be provided.

    If ``value`` is provided, then all null values in the selected column will
    take on the value provided.

    If ``statistic`` is provided, then all null values in the selected column
    will take on the summary statistic value of other non-null values.

    Currently supported ``statistic``s include:

    - ``mean`` (also aliased by ``average``)
    - ``median``
    - ``mode``
    - ``minimum`` (also aliased by ``min``)
    - ``maximum`` (also aliased by ``max``)

    :param df: A pandas DataFrame
    :param column: The name of the column on which to impute values.
    :param value: (optional) The value to impute.
    :param statistic: (optional) The column statistic to impute.
    """

    # First, check that only one of `value` or `statistic` is provided.
    if value is not None and statistic is not None:
        raise ValueError(
            'Only one of `value` or `statistic` should be provided'
        )

    # If statistic is provided, then we compute the relevant summary statistic
    # from the other data.
    funcs = {
        'mean': np.mean,
        'average': np.mean,  # aliased
        'median': np.median,
        'mode': lambda s: s.mode().iloc[0],  # numpy has no mode; use pandas' Series.mode
        'minimum': np.min,   # np.minimum is elementwise; np.min reduces to a scalar
        'min': np.min,       # aliased
        'maximum': np.max,   # likewise, np.max rather than elementwise np.maximum
        'max': np.max,       # aliased
    }
    if statistic is not None:
        # Check that the statistic keyword argument is one of the approved.
        if statistic not in funcs.keys():
            raise KeyError(f'`statistic` must be one of {funcs.keys()}')
        value = funcs[statistic](df[column].dropna())

    if value is not None:
        df[column] = df[column].fillna(value)
    return df

What are your thoughts on this?
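
For a concrete picture of the intended behaviour, here's a toy example, assuming the sketch above has been run so that `impute` is registered on DataFrames:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sales": [1.0, np.nan, 3.0],
    "score": [10.0, 20.0, np.nan],
})

df = (
    df
    .impute(column="sales", value=0.0)            # the NaN in "sales" becomes 0.0
    .impute(column="score", statistic="median")   # the NaN in "score" becomes 15.0 (median of 10 and 20)
)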

ericmjl added the enhancement and good first issue labels Jan 24, 2019
ericmjl mentioned this issue Jan 24, 2019
@eli-s-goldberg (Author)

@ericmjl - Thanks for keying me in and I like what's written. That said, I actually need to check this out quickly and use it, which is something that I'm unable to do until tomorrow afternoon.

A more complex imputation task I find myself doing is filling NaNs with group means by age, gender, and disease. Here's a description of the hack, as my code is a bit too specific at this point to share or be useful.

First, I iterate through the data to create a dict that links each category combination with its mean value.
Here's some pseudocode:

df = pd.DataFrame(MedicalData)  # MedicalData is a placeholder for the real dataset
filtered_groupby = df.groupby(['gender', 'disease'])
ageGenderMatchAverageDict = dict()
valueLabels = ['cat1', 'cat2', 'cat3']
for label in valueLabels:
    for name, group in filtered_groupby:
        # Key is '<label>_<gender>_<disease>'; value is that group's mean for the column.
        ageGenderMatchAverageDict.update(
            {str(label + '_' + '_'.join(name)): group[label].mean()}
        )

Next, I loop through the unique labels, genders, and diseases, using .get to select the fill value based on the combination of gender/disease. I use a little helper function iffillnaval to make sure that I'm only imputing NaNs and not real data. I know, I know. Hacky.

def iffillnaval(x, ageGenderMatchAverageDict, label, gender, disease):
    # Only replace NaNs; pass real data through untouched.
    if np.isnan(x):
        return ageGenderMatchAverageDict.get(str(label + '_' + gender + '_' + disease))
    else:
        return x

for label in valueLabels:
    for gender in df['gender'].unique():
        for disease in df['disease'].unique():
            # Restrict to the rows that belong to this gender/disease group,
            # so each group only gets filled with its own mean.
            mask = (df['gender'] == gender) & (df['disease'] == disease)
            df.loc[mask, label] = [
                iffillnaval(x, ageGenderMatchAverageDict, label, gender, disease)
                for x in df.loc[mask, label].values
            ]

With a bit of your patented ma-gic, I'm sure you can turn this hack into something powerful and generic. Thanks again and thanks for pyjanitor!


ericmjl commented Jan 24, 2019

@eli-s-goldberg now that you've described this, I think the grammar and ontology look something like this:

Potential function signature (ontology):

def grouped_impute(df, columns, mapping=None, statistic=None):
    pass

Another potential function signature, if we wanted to sound more academic, is:

def stratified_impute(df, columns, mapping, statistic):
    pass

The grammar can actually be quite generic.

  1. Groupby on the columns keyword.
  2. If a mapping (i.e. dictionary) is provided, use the mapping. Mapping should have the groupby keys that are provided by df.groupby(columns). Naturally, this is not the easiest thing that end-users will commonly use, but I think it's useful to provide the option.
  3. If a statistic is provided, then we can easily map your mean imputation (or mode or median or minimum or maximum - all the M&Ms) to the groupby keys.

The way you've shown it is pretty good, actually. I'll use that as a jumping board for this.
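
To make that grammar concrete, here's a minimal sketch of the statistic branch only. The `target` parameter and the `statistic='mean'` default are my additions, since the proposal above doesn't pin down which column gets imputed, and `groupby().transform()` is just one possible way to broadcast each group's statistic back onto its own rows; the mapping branch is left out:

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def grouped_impute(df, columns, target, statistic='mean'):
    # `columns` holds the groupby keys (e.g. ['gender', 'disease']); `target` is
    # the column whose nulls get filled. transform() returns a value aligned to
    # every row, so each row is only ever filled with its own group's statistic.
    fill_values = df.groupby(columns)[target].transform(statistic)
    df[target] = df[target].fillna(fill_values)
    return df


# Hypothetical usage:
# df = df.grouped_impute(columns=['gender', 'disease'], target='pack_year', statistic='mean')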

ericmjl changed the title from "Feature though: Add generic imputation" to "Feature thought: Add generic imputation" Jan 24, 2019

eli-s-goldberg commented Feb 5, 2019

A bit of an update. This doesn't handle variable statistics or textual data, but it's getting there. I've been using it generically for the past weekend or so. It's not the quickest thing (millions of rows take a few minutes per column), and it only works with a single column.

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method  # needed for the method-chaining usage shown below
def stratified_impute(df, mapping, columns):
    """
    Perform a stratified (group-wise mean) imputation. Note: ``mapping`` must
    not contain any of ``columns``.

    Method chaining usage:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            .stratified_impute(mapping=['gender', 'race', 'ethnicity', 'disease'], columns=['pack_year'])
        )

    :param df: pandas DataFrame.
    :param mapping: Column(s) to group by (stratify on) for the imputation.
    :param columns: Column(s) whose missing values will be imputed.
    """
    
    # The groupby keys and the imputed columns must not overlap.
    if set(columns) & set(mapping):
        raise ValueError("{} must not include {}".format(mapping, columns))
        
    # Build a lookup of per-group means, keyed by column name + groupby key.
    filtered_groupby = df.groupby(mapping)
    strat_dict = dict()
    for column in columns:
        for name, group in filtered_groupby:
            group = group.dropna()
            strat_dict.update(
                {str(column + str(name)): group[column].mean()}
            )
            
    group_nan_list = []
    group_nonnan_list = []
    for column in columns:
        for name, group in filtered_groupby:
            # nan_data to be filled
            nan_data = group[pd.isna(group[column])]
            
            # replace nan_data with backfilled data from dict
            nan_data = nan_data.fillna({column:strat_dict.get(str(column + str(name)))})
            group_nan_list.append(nan_data)
            
            # non nan_data to be passed along
            nonnan_data = group[~pd.isna(group[column])]
            group_nonnan_list.append(nonnan_data)
            
    group_nan_list.extend(group_nonnan_list)
    df = pd.concat(group_nan_list)
    
    return df
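
A quick sanity check on a toy frame (hypothetical column names), assuming the registration above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["f", "f", "m", "m"],
    "disease": ["a", "a", "a", "a"],
    "pack_year": [10.0, np.nan, 30.0, 50.0],
})

out = df.stratified_impute(mapping=["gender", "disease"], columns=["pack_year"])
# The missing ("f", "a") row is filled with that group's mean (10.0).
# Note that the final concat changes the row order, so sort afterwards if order matters.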

ericmjl added this to New Function Contributions in Sprint Tasks Apr 7, 2019
ericmjl added the available for hacking label May 6, 2019