Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Expose missing data handling as classmethod #1402

Merged
merged 4 commits into from Feb 21, 2014

Conversation

Projects
None yet
2 participants
@jseabold
Copy link
Member

commented Feb 20, 2014

A little refactoring so that we can do missing data handling for different types without everything else that it involves.

Makes it easier for models where the data you get on the front end from the user might be different than the data you do estimation on. E.g., collapsed groups in duration analysis.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Feb 20, 2014

apropos refactoring: can we return and attach the (valid) mask instead of the index?
I think in most cases that would be easier to work with.

also MaskedArrays uses mask=None as a shortcut to indicate there are no missing values / masked elements. That would be useful if users want to check by default, and don't have any missing values most of the time.

@jseabold

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2014

I'd like to think about this a bit more outside of this PR since this is a simple fix that will help with the survival code I'm working on right now.

My reasoning, I'm a bit worried memory-wise that we're starting to carry around too many arrays. I was thinking to just have a sparse-like index for missing values and create the mask on the fly with a property, but I'd need to play around a bit. Something like

class Data(object):
    def __init__(self, X, missing_rows):
        self.missing_rows = missing_rows
        self.nobs = len(X)

    @property
    def missing_mask(self):
        bool_missing_mask = np.ones(self.nobs, dtype=bool) 
        bool_missing_mask[self.missing_rows] = False
        return bool_missing_mask

X = np.random.randn(100)
data = Data(X, [12, 37, 84])
X[data.missing_mask]
@josef-pkt

This comment has been minimized.

Copy link
Member

commented Feb 20, 2014

Fine with me to wait until we need it.

I was also thinking about memory consumption as we keep adding arrays.
Using a helper function to create the mask on the fly sounds fine, there will not be many cases where we will need it.
In this case it is easy for users to avoid, if they clean their data first, especially if we also have a None or empty list to quickly check that there were no nans.

So this PR is essentially to make the class method callable from the outside without any side effects?

@jseabold

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2014

Provided I fixed the test failures, I'm going to go ahead and rebase and merge this, so I can try to use it in other branch. Can refactor again later if needed.

@josef-pkt

This comment has been minimized.

Copy link
Member

commented Feb 20, 2014

Fine with me, I don't think this is used yet outside of the base datahandling.

@jseabold

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2014

Eh, now that I'm actually thinking about it, it might need a few more changes.

jseabold added a commit that referenced this pull request Feb 21, 2014

Merge pull request #1402 from jseabold/handle-missing-classmethod
REF/ENH: Expose missing data handling as function.

@jseabold jseabold merged commit 8662c9d into statsmodels:master Feb 21, 2014

1 check passed

default The Travis CI build passed
Details

@jseabold jseabold deleted the jseabold:handle-missing-classmethod branch Feb 21, 2014

@jseabold

This comment has been minimized.

Copy link
Member Author

commented Feb 21, 2014

FYI, we are carrying around already the missing_row_idx in the data attribute. I didn't see it.

@josef-pkt josef-pkt added the PR label Mar 11, 2014

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014

Merge pull request statsmodels#1402 from jseabold/handle-missing-clas…
…smethod

REF/ENH: Expose missing data handling as function.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.