
[ENH] Provision generating missing values and add parameters to control the same #6284

Open · raghavrv opened this issue Feb 5, 2016 · 25 comments

@raghavrv commented Feb 5, 2016

  1. Would it be worthwhile to add parameters to control missingness in dataset generators?

    I need this for benchmarking #5974. I thought this might come in handy for teaching too.

    Typically I would like to add missing_rate, target_correlation_rate, feature_correlation_rate and missing_values.

    The target_correlation_rate would control the extent to which the dataset is MNAR* and feature_correlation_rate would control the extent to which the dataset is MAR†.

    target_correlation_rate + feature_correlation_rate <= 1

    1 - (target_correlation_rate + feature_correlation_rate) would control the extent to which the dataset is MCAR‡.

    Does this sound good? (A rough sketch of what this could look like follows the footnotes below.)

  2. Either as an addition or as an alternative to 1, could we have missing-value transformers with the above described params?


* - Missing Not At Random (missingness is correlated with the target)

† - Missing At Random (missingness is correlated with the other feature values)

‡ - Missing Completely At Random (no correlation with either the target or the other features)
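
For concreteness, a minimal sketch of what such a helper could look like (hypothetical, not an existing scikit-learn API; only the MCAR part is implemented, purely at random — the target_correlation_rate / feature_correlation_rate parts would bias which entries are dropped):

import numpy as np
from sklearn.datasets import make_classification

def make_missing_mcar(missing_rate=0.1, missing_values=np.nan, random_state=0,
                      **make_classification_kwargs):
    # hypothetical helper: generate a dataset, then set a missing_rate
    # fraction of its entries to missing_values uniformly at random (MCAR)
    rng = np.random.RandomState(random_state)
    X, y = make_classification(random_state=random_state,
                               **make_classification_kwargs)
    mask = rng.uniform(size=X.shape) < missing_rate
    X_missing = X.copy()
    X_missing[mask] = missing_values
    return X_missing, y

X_missing, y = make_missing_mcar(missing_rate=0.2, n_samples=200, n_features=5)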

Ping @agramfort @glouppe @GaelVaroquaux


@raghavrv commented Feb 25, 2016

I've written a basic drop_value function which helps in (successively) removing an exact fraction of values according to a preset label_correlation value.

MNAR --> label_correlation is 1 and MCAR --> label_correlation is 0

This was needed for my benchmarking at #5974

Suggestions welcome. (Note that this first draft is probably slow and unoptimized, but seems to work well for my benchmarking...)
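
For reference, a rough sketch of the idea (not the actual drop_value used in #5974, and assuming a numeric label y): rank the samples by a score that blends the label with uniform noise according to label_correlation, then remove one value from each of the top missing_rate fraction of samples. With label_correlation=1 the drop is fully label-driven (MNAR with respect to the target); with label_correlation=0 it is purely random (MCAR).

import numpy as np

def drop_values(X, y, missing_rate=0.2, label_correlation=0.0, random_state=0):
    rng = np.random.RandomState(random_state)
    X = np.array(X, dtype=float, copy=True)
    # blend the label with noise; the blend controls how label-driven the drop is
    score = (label_correlation * y +
             (1 - label_correlation) * rng.uniform(size=X.shape[0]))
    n_missing = int(round(missing_rate * X.shape[0]))
    rows = np.argsort(-score)[:n_missing]              # samples that lose a value
    cols = rng.randint(0, X.shape[1], size=n_missing)  # one random feature per row
    X[rows, cols] = np.nan
    return X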


@jnothman commented Jun 29, 2018

@RianneSchouten's ampute to produce MAR, MCAR or MNAR data is described here. I think the basic idea is that it assigns to each x a missingness pattern (False to drop, True to keep) and a missingness probability:

X_pat = rng.choice(pattern.shape[0], size=X.shape[0], p=pattern_proba)
X_missing_proba = score_func(np.sum(X * feature_weights.take(X_pat, axis=0), axis=1))

We then want to have missing values in k% of samples according to a random draw with respect to their missingness probability. I think this can be done with:

score = rng.uniform(size=X.shape[0]) - X_missing_proba
n_complete = int((1 - k / 100) * X.shape[0])
complete_sample_idx = np.argpartition(score, n_complete)[:n_complete]
observed_mask = pattern.take(X_pat, axis=0)
observed_mask[complete_sample_idx] = True

I'm not sure if this (particularly the first line) is correct.

If feature_weights is nonzero where the corresponding pattern is 0, it is MNAR. For MCAR, all scores are 0.

The tricky part here is determining good defaults, but I suppose some are defined in ampute.
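
For context, a tiny illustration of the inputs assumed above (hypothetical values): three patterns over four features, where False marks the entries to drop, plus per-pattern probabilities and feature weights:

import numpy as np

rng = np.random.RandomState(0)
# three candidate missingness patterns over four features (False = drop)
pattern = np.array([[False, True, True, True],
                    [True, False, True, True],
                    [True, True, False, False]])
pattern_proba = np.array([0.5, 0.3, 0.2])          # how often each pattern is assigned
feature_weights = rng.uniform(size=pattern.shape)  # nonzero weights drive MAR/MNAR; all zeros -> MCAR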

@jnothman changed the title from "[RFC][ENH] Provision generating missing values and add parameters to control the same" to "[ENH] Provision generating missing values and add parameters to control the same" on Sep 27, 2018
@jnothman commented Sep 27, 2018

@sergeyf, this is the appropriate place to discuss ampute / missing value generation. I'd be keen to see another contributor implement it with @RianneSchouten's advice, but if she wants to implement it herself, she may.

@jnothman commented Sep 27, 2018

I have marked this "help wanted"

@jnothman commented Sep 27, 2018

Note to the potential contributor: we cannot port the code directly from the R mice package, as its software licence is incompatible with ours.

@RianneSchouten commented Sep 27, 2018

I am enthusiastic about the idea but do not have the time to implement it myself. I would be willing to advise, review, check, and think along.

These two papers should help the implementer to make a start:
https://www.tandfonline.com/doi/full/10.1080/00949655.2018.1491577
https://rianneschouten.github.io/mice_ampute/vignette/ampute.html

@RianneSchouten commented Sep 27, 2018

Now that I am thinking about it: ampute has two aspects, continuous amputation and discrete amputation. We could skip the latter, which would make it a whole lot easier.

What I can do is set up some simple code that shows the idea. After that, someone can check that I've written the Python code well enough and improve it. And then we can send it for review.

Maybe @sergeyf is willing to improve my code?
@sergeyf after that I can definitely put it in the MI example.

@sergeyf commented Sep 27, 2018

@RianneSchouten What is the difference between continuous and discrete amputation? Is the latter for categorical variables? So far IterativeImputer is only for continuous variables, so it would be in line with our current limitations (although hopefully someone will want to extend all of this to categorical variables as well).

Your suggestion that you set up simple code sounds good to me. It would be great to have a correct kernel to work from.

@jnothman commented Sep 27, 2018

@RianneSchouten, is my summary above along the right lines?

@RianneSchouten commented Sep 28, 2018

@sergeyf At the moment ampute cannot deal with categorical variables in the MAR and MNAR cases; it converts categorical variables into numerical values. The distinction between continuous and discrete is about whether the user wants to use the logit functions to assign probabilities (continuous) or whether the user wants to specify probabilities him/herself (for instance by dividing the scores into three groups and saying the odds for one group should be twice as large as for the other two groups).

@jnothman Not entirely, but close. The algorithm would have these steps:

  1. Split the data into as many parts as there are patterns (each pattern occurring with a certain probability; I think this corresponds to your first line).
  2. Calculate weighted sum scores per part, defined by a weights matrix. Indeed, if the missing variables are weighted, it is MNAR. If the observed variables are weighted, it is MAR. If no variables are weighted, it is MCAR. And you could have combinations, of course.
  3. Use the weighted sum scores to specify a probability for each case. In ampute we use 4 types of logit functions for this (in the continuous case).
    This is where the trick is: you have to shift the logit distribution along the x-axis to obtain the desired missingness percentage. In ampute we use binary search for this (see the toy sketch at the end of this comment).
  4. Apply the probabilities and see which cases become missing according to the pattern.

The weights can be different per pattern. The logit function can be different per pattern.

The missingness percentage is specified by the number of rows.
In ampute there is also the option to specify it by number of cells.

I will set up some code in the next 2 weeks.
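
As an illustration of step 3 only (a toy sketch covering just the RIGHT-type logit; ampute itself supports four logit types, and the full draft implementation appears later in this thread): shift a logistic function along the x-axis by binary search until the mean missingness probability matches the desired proportion.

import numpy as np

def shift_for_proportion(wss_standardized, prop, lower=-3.0, upper=3.0, tol=1e-3):
    # find b such that mean(sigmoid(wss + b)) is approximately prop
    for _ in range(100):
        b = (lower + upper) / 2
        probs = 1 / (1 + np.exp(-(wss_standardized + b)))  # RIGHT-type logit
        if abs(probs.mean() - prop) < tol:
            break
        if probs.mean() > prop:
            upper = b   # too much expected missingness: shift left
        else:
            lower = b   # too little: shift right
    return b

wss = np.random.RandomState(0).normal(size=1000)
b = shift_for_proportion(wss, prop=0.3)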

@sergeyf commented Sep 28, 2018

Thanks @RianneSchouten, that is helpful. In my opinion, if it makes the code a lot simpler and still provides the functionality to do more general amputation, we should go for the simplest version. Logit functions are plenty for the use-cases I've seen so far. @jnothman do you concur?

I also think we can just specify the percentage by number of rows, and that will also be sufficient.


@RianneSchouten commented Oct 12, 2018

I wrote the setup but before I make a pull request, can someone answer my question in #11370? I am not sure whether I did something wrong.

In addition I have these questions:

  • The defaults of the amputation function depend on the dataset X; for instance, the patterns matrix has a number of columns equal to that of X. How can I set such a default in __init__?
  • I am wondering whether we really need the binary search I used in ampute in R. I will try to figure out whether we can use a formula instead, but we also have the option to run the binary search separately once and store the outcomes in some matrices. If you assume the data have a normal distribution, you would need 4 matrices of 100 by 100. Is this a common/acceptable thing to do?

Many thanks in advance.

@sergeyf commented Oct 12, 2018

Re: merging. My git-fu is not good. I'll let @jnothman chime in.

Re: defaults. A common way to specify contingent defaults is to specify None by default and then in the code check

if self.param is None:
    do_some_stuff()
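
Concretely, following that pattern with the data-dependent patterns default discussed above (a minimal sketch; scikit-learn convention is to leave the constructor argument untouched and put the resolved value in a local variable or a trailing-underscore attribute):

# e.g. inside fit / fit_transform, once X and its shape are known
if self.patterns is None:
    patterns = 1 - np.identity(X.shape[1])  # default: one pattern per feature
else:
    patterns = self.patterns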

Re: binary search. I don't think I understand the second question. Can you provide more details about it? Why is it sufficient to only run the binary search once?


@RianneSchouten commented Oct 20, 2018

@jnothman a possible reason to make a transformer would be to use ampute to create missing data to test the effect of different mechanisms / imputation methods on the outcome of your model. But no research has been done on that topic yet, so we do not know its advantages yet.


@RianneSchouten commented Dec 12, 2018

Hi guys,

I am sorry, but it seems I don't have enough time to finish it. I hope someone else can take it up?
I will post what I have so far in the next message. The function itself is done; that is, the methodology is implemented with the same arguments as in the R function ampute.

What needs to be finished: the binary search should be improved such that it works in all cases, and a proper test script has to be written.

I hope someone else wants to finish it.

Kind regards,

Rianne Schouten

@RianneSchouten commented Dec 12, 2018

"""Transformer for generating multivariate missingness in complete datasets"""
# Author: Rianne Schouten <riannemargarethaschouten@gmail.com>

import numpy as np
from sklearn.base import TransformerMixin
from scipy import stats


class MultivariateAmputation(TransformerMixin):
    """Generating multivariate missingness patterns in complete datasets

    The complete data (a matrix with shape (A, B), i.e. n_samples by n_features)
    are passed to ``fit_transform``; the parameters below control the missingness
    patterns, their frequencies, the mechanisms and the logit types used to
    assign missingness probabilities.

    Parameters
    ----------
    prop : float
        Proportion of cases that become incomplete.

    patterns : matrix with shape (C, B), one row per missingness pattern

    freqs : array of length C, relative frequency of each pattern

    weights : matrix with shape (C, B)

    std : boolean, whether to standardize the data before computing sum scores

    mechanisms : array of length C, 'MCAR', 'MAR' or 'MNAR' per pattern

    types : array of length C, logit type ('RIGHT', 'LEFT', 'MID' or 'TAIL')

    lower_range : float

    upper_range : float

    max_dif_with_target : float

    max_iter : integer

    Attributes
    ----------
    incomplete_data : matrix with shape (A, B), the amputed data

    Notes
    -----
    Something on the differences between ampute in R and this Python version.

    References
    ----------
    .. [1] Rianne Margaretha Schouten, Peter Lugtig & Gerko Vink (2018).
       Generating missing values for simulation purposes: A multivariate
       amputation procedure. Journal of Statistical Computation and
       Simulation, DOI: 10.1080/00949655.2018.1491577
    """


    def __init__(self,
                 prop=0.5, 
                 patterns=None, 
                 freqs=None, 
                 weights=None, 
                 std=True, 
                 mechanisms=None,
                 types=None,
                 lower_range=-3,
                 upper_range=3,
                 max_dif_with_target=0.001,
                 max_iter=100):

        self.prop = prop
        self.patterns = patterns
        self.freqs = freqs
        self.weights = weights
        self.std = std
        self.mechanisms = mechanisms
        self.types = types
        self.lower_range = lower_range
        self.upper_range = upper_range
        self.max_dif_with_target = max_dif_with_target
        self.max_iter = max_iter


    def _binary_search(self, wss_standardized, i): 
  
        b = 0
        counter = 0
        lower_range = self.lower_range
        upper_range = self.upper_range

        # start binary search with a maximum amount of tries of max_iter
        while counter < self.max_iter:
            counter += 1
          
            # in every iteration, the new b is the mid of the lower and upper range
            # the lower and upper range are updated at the end of each iteration
            b = lower_range + (upper_range - lower_range)/2
            if counter == self.max_iter: break 
     
            # calculate the expected missingness proportion
            # depends on the logit type, the sum scores and b
            x = np.zeros(len(wss_standardized))
            if self.types[i] == 'RIGHT': 
                x = wss_standardized + b
            elif self.types[i] == 'LEFT':
                x = -wss_standardized + b
            elif self.types[i] == 'MID':
                x = -np.absolute(wss_standardized) + 0.75 + b
            elif self.types[i] == 'TAIL':
                x = np.absolute(wss_standardized) - 0.75 + b
            probs = 1 / (1 + np.exp(-x))
            current_prop = np.sum(probs) / len(x)

            # if the expected proportion is close to the target, break
            # the maximum difference can be specified
            # if max_dif_with_target is 0.001, the proportion differs with max 0.1%
            if np.absolute(current_prop - self.prop) < self.max_dif_with_target: break

            # if we have not reached the desired proportion
            # we adjust either the upper or lower range
            # this way of adjusting works for self.types[i] = 'RIGHT'
            # need to check for the other types
            # in the next iteration, a new b is then calculated and used
            if (current_prop - self.prop) > 0:
                upper_range = b
            else:
                lower_range = b

        return b    


    def _choose_probabilities(self, wss, i):

        # when wss contains only zeros, the mechanism is MCAR
        # then each case has an equal probability (prop) of becoming missing
        if np.all(wss == 0):
            probs = np.repeat(self.prop, len(wss))
        # else we calculate the probabilities based on the wss
        else:
            # standardize wss
            wss_standardized = stats.zscore(wss)
            # calculate the size of b for the desired missingness proportion
            b = self._binary_search(wss_standardized, i)
            # apply the right logistic function
            x = np.zeros(len(wss))
            if self.types[i] == 'RIGHT':
                x = wss_standardized + b
            elif self.types[i] == 'LEFT':
                x = -wss_standardized + b
            elif self.types[i] == 'MID':
                x = -np.absolute(wss_standardized) + 0.75 + b
            elif self.types[i] == 'TAIL':
                x = np.absolute(wss_standardized) - 0.75 + b
            # calculate probability to be missing 
            probs_matrix = 1 / (1 + np.exp(-x))
            probs = np.squeeze(np.asarray(probs_matrix))
        
        return probs


    def _calculate_sumscores(self, data_group, i):

        # standardize the data if requested
        # (categorical data would first need to be encoded as numerical values)
        if self.std:
            data_group = stats.zscore(data_group)

        # calculate weighted sum scores
        # in case of MCAR, weights[i, ] contains only zeros and the wss are all zero
        # in case of MAR or MNAR, the mechanism is determined by the weights
        wss = np.dot(data_group, self.weights[i, ].T)

        return wss


    def _validate_input(self, X):

        # default patterns is missingness on each variable
        if self.patterns is None:
            self.patterns = 1 - np.identity(n=X.shape[1])
        #else: 
            #check if self.patterns.shape[1] = X.shape[1]
            #check if self.patterns does not contain values other than 0 and 1
            #check if there are no patterns with merely 0 and merely 1

        # default freqs is every pattern occurs with the same frequency
        if self.freqs is None:
            self.freqs = np.repeat(1 / self.patterns.shape[0], self.patterns.shape[0])
        #else: 
            #check if freq has length equal to self.patterns.shape[0]  
            #check if freq does not have values < 0 or > 1

        # default mechanisms is MAR mechanism for each pattern 
        if self.mechanisms is None:
            self.mechanisms = np.repeat('MAR', len(self.freqs))
        elif len(self.mechanisms) == 1:
            self.mechanisms = np.repeat(self.mechanisms[0], len(self.freqs))
        #else: 
            #check if mechanism has length equal to self.patterns.shape[0]     
            #check if mechanisms has no other values than 'MAR', 'MNAR' or 'MCAR'

        # default in case of MAR: all observed variables have weight 1
        # default in case of MNAR: all non observed variables have weight 1
        # with MCAR: all variables should have weight 0
        if self.weights is None:
            self.weights = np.zeros(shape=self.patterns.shape)
            self.weights[self.mechanisms == 'MAR', ] = self.patterns[self.mechanisms == 'MAR', ]
            self.weights[self.mechanisms == 'MNAR', ] = 1 - self.patterns[self.mechanisms == 'MNAR', ]
        #else:
            #check if self.weights.shape equals self.patterns.shape
            #check if the weights for MCAR patterns contain only zeros

        if self.types is None: 
            self.types = np.repeat('RIGHT', len(self.mechanisms))
        #else:
            # check if len equals len mechanisms and len freqs
            # check if types contains no other values than 'RIGHT', 'LEFT', 'MID' and 'TAIL'

        return X

        
    def fit_transform(self, X):
        """Fits amputer on complete data X and returns the incomplete data X
    
        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Complete input data, where "n_samples" is the number of samples and
            "n_features" is the number of features.

        Returns
        -------
        X_incomplete : array-like, shape (n_samples, n_features)
        """

        # some check functions
        # including specification of defaults
        X = self._validate_input(X)

        # split complete_data in groups
        # the number of groups is defined by the number of patterns
        # we know the number of patterns by the number of rows of self.patterns
        X_incomplete = np.array(X, dtype=float, copy=True)
        X_indices = np.arange(X.shape[0])
        assigned_group_number = np.random.choice(a=self.patterns.shape[0],
                                                 size=X.shape[0], p=self.freqs)
        
        # start a loop over each pattern
        for i in range(self.patterns.shape[0]):
            # assign cases to the group
            group_indices = X_indices[assigned_group_number == i]
            pattern = np.squeeze(np.asarray(self.patterns[i, ]))
            data_group = X[group_indices]
            # calculate weighted sum scores for each group
            wss = self._calculate_sumscores(data_group, i)
            # define candidate probabilities in group
            probs = self._choose_probabilities(wss, i)
            # apply probabilities and choose cases
            chosen_candidates = np.random.binomial(n=1,
                                                   size=data_group.shape[0],
                                                   p=probs)
            # apply missing data pattern
            chosen_indices = group_indices[chosen_candidates == 1]
            X_incomplete[np.ix_(chosen_indices, pattern == 0)] = np.nan

        return X_incomplete
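
A minimal usage sketch of the draft class above, assuming it runs as intended with its defaults (one pattern per feature, MAR mechanism, RIGHT logit type):

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))
amputer = MultivariateAmputation(prop=0.3)
X_incomplete = amputer.fit_transform(X)
# fraction of rows with at least one missing value, roughly equal to prop
print(np.isnan(X_incomplete).any(axis=1).mean())
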
@jnothman commented Dec 12, 2018

Thanks a lot for offering your algorithm and code, @RianneSchouten! I agree it would be good to find someone to finish this work.

@sergeyf commented Dec 12, 2018

Thanks @RianneSchouten. Can you say a bit more about what you mean here: "the binary search should be improved such that it works in all cases". In what cases does it work vs not work? Why?
