REF: Expose missing data handling as classmethod #1402

Merged
merged 4 commits into from Feb 21, 2014

Projects

None yet

2 participants

@jseabold
Member

A little refactoring so that we can do missing data handling for different types without everything else that it involves.

Makes it easier for models where the data you get on the front end from the user might be different than the data you do estimation on. E.g., collapsed groups in duration analysis.

@josef-pkt
Member

apropos refactoring: can we return and attach the (valid) mask instead of the index?
I think in most cases that would be easier to work with.

also MaskedArrays uses mask=None as a shortcut to indicate there are no missing values / masked elements. That would be useful if users want to check by default, and don't have any missing values most of the time.

@jseabold
Member

I'd like to think about this a bit more outside of this PR since this is a simple fix that will help with the survival code I'm working on right now.

My reasoning, I'm a bit worried memory-wise that we're starting to carry around too many arrays. I was thinking to just have a sparse-like index for missing values and create the mask on the fly with a property, but I'd need to play around a bit. Something like

class Data(object):
    def __init__(self, X, missing_rows):
        self.missing_rows = missing_rows
        self.nobs = len(X)

    @property
    def missing_mask(self):
        bool_missing_mask = np.ones(self.nobs, dtype=bool) 
        bool_missing_mask[self.missing_rows] = False
        return bool_missing_mask

X = np.random.randn(100)
data = Data(X, [12, 37, 84])
X[data.missing_mask]
@josef-pkt
Member

Fine with me to wait until we need it.

I was also thinking about memory consumption as we keep adding arrays.
Using a helper function to create the mask on the fly sounds fine, there will not be many cases where we will need it.
In this case it is easy for users to avoid, if they clean their data first, especially if we also have a None or empty list to quickly check that there were no nans.

So this PR is essentially to make the class method callable from the outside without any side effects?

@jseabold
Member

Provided I fixed the test failures, I'm going to go ahead and rebase and merge this, so I can try to use it in other branch. Can refactor again later if needed.

@josef-pkt
Member

Fine with me, I don't think this is used yet outside of the base datahandling.

@jseabold
Member

Eh, now that I'm actually thinking about it, it might need a few more changes.

@jseabold jseabold merged commit 8662c9d into statsmodels:master Feb 21, 2014

1 check passed

default The Travis CI build passed
Details
@jseabold jseabold deleted the jseabold:handle-missing-classmethod branch Feb 21, 2014
@jseabold
Member

FYI, we are carrying around already the missing_row_idx in the data attribute. I didn't see it.

@josef-pkt josef-pkt added the PR label Mar 11, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment