A little refactoring so that we can do missing data handling for different types without everything else that it involves.
Makes it easier for models where the data you get on the front end from the user might be different than the data you do estimation on. E.g., collapsed groups in duration analysis.
Apropos refactoring: can we return and attach the (valid) mask instead of the index?
I think in most cases that would be easier to work with.
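To illustrate the difference, here is a minimal sketch in plain NumPy. The arrays and the names `valid_mask` and `missing_idx` are made up for the example and are not the code in this PR. It applies a valid mask and a missing-row index to the same pair of arrays:

```python
import numpy as np

endog = np.arange(6, dtype=float)
exog = np.arange(12, dtype=float).reshape(6, 2)
missing_idx = np.array([1, 4])  # hypothetical rows with missing values

# Build the valid mask: True where the row is kept.
valid_mask = np.ones(len(endog), dtype=bool)
valid_mask[missing_idx] = False

# With a mask, plain boolean indexing works for 1-D and 2-D arrays alike.
endog_m = endog[valid_mask]
exog_m = exog[valid_mask]

# With an index of missing rows, np.delete is needed (plus axis=0 for 2-D).
endog_i = np.delete(endog, missing_idx)
exog_i = np.delete(exog, missing_idx, axis=0)

# Both routes give the same cleaned arrays.
assert np.array_equal(endog_m, endog_i)
assert np.array_equal(exog_m, exog_i)
```

The mask route uses the same expression for every array, which is probably what makes it easier to work with in most cases.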
Also, MaskedArrays uses mask=nomask (np.ma.nomask) as a shortcut to indicate there are no missing values / masked elements. That would be useful if users want to check by default and don't have any missing values most of the time.
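For reference, a small sketch of that shortcut as it behaves in `numpy.ma`, where the sentinel is the `nomask` constant. The variable names are only for illustration:

```python
import numpy as np
import numpy.ma as ma

no_missing = ma.array([1.0, 2.0, 3.0])  # no mask given
with_missing = ma.masked_invalid([1.0, np.nan, 3.0])

# Identity check against the sentinel: no scan over the data is needed.
has_no_mask = ma.getmask(no_missing) is ma.nomask
has_mask = ma.getmask(with_missing) is not ma.nomask
```

The identity check costs the same regardless of array size, which is the appeal for users whose data is clean most of the time.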
I'd like to think about this a bit more outside of this PR since this is a simple fix that will help with the survival code I'm working on right now.
My reasoning: I'm a bit worried, memory-wise, that we're starting to carry around too many arrays. I was thinking of just keeping a sparse-like index for missing values and creating the mask on the fly with a property, but I'd need to play around a bit. Something like:
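import numpy as np

class Data(object):
    def __init__(self, X, missing_rows):
        self.missing_rows = missing_rows  # sparse-like index of missing rows
        self.nobs = len(X)
    @property
    def valid_mask(self):  # built on the fly; True = row is not missing
        bool_missing_mask = np.ones(self.nobs, dtype=bool)
        bool_missing_mask[self.missing_rows] = False
        return bool_missing_mask

X = np.random.randn(100)
data = Data(X, [12, 37, 84])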
Fine with me to wait until we need it.
I was also thinking about memory consumption as we keep adding arrays.
Using a helper function to create the mask on the fly sounds fine; there will not be many cases where we need it.
In this case it is easy for users to avoid if they clean their data first, especially if we also have a None or empty list to quickly check that there were no NaNs.
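A rough sketch of what such a helper could look like. The function name `mask_from_missing_rows` and its signature are hypothetical and not part of this PR. It returns None when nothing is missing, so callers get a cheap check:

```python
import numpy as np

def mask_from_missing_rows(nobs, missing_rows=None):
    """Return a boolean mask (True = row kept), or None if nothing is missing.

    Hypothetical helper for illustration only.
    """
    if missing_rows is None or len(missing_rows) == 0:
        return None  # quick "no NaNs" signal, no array allocated
    mask = np.ones(nobs, dtype=bool)
    mask[np.asarray(missing_rows)] = False
    return mask

# Typical use: only index when there is actually something to drop.
X = np.arange(5, dtype=float)
mask = mask_from_missing_rows(len(X), [1, 3])
X_clean = X if mask is None else X[mask]
```

Only the short index is stored; the full-length mask exists just for as long as the caller needs it.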
So this PR is essentially to make the class method callable from the outside without any side effects?
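To make the question concrete, here is a minimal sketch of that pattern. `SketchData` is a hypothetical stand-in, not the statsmodels data class, and its signature is an assumption. The classmethod can be called without creating an instance and does not modify its inputs:

```python
import numpy as np

class SketchData:
    """Hypothetical stand-in for illustration, not the statsmodels class."""

    def __init__(self, endog, exog):
        # The constructor reuses the same routine it exposes publicly.
        cleaned, self.missing_row_idx = self.handle_missing(endog, exog)
        self.endog, self.exog = cleaned

    @classmethod
    def handle_missing(cls, *arrays):
        """Drop rows that have a NaN in any array; inputs are left untouched."""
        arrays = [np.asarray(a, dtype=float) for a in arrays]
        nobs = len(arrays[0])
        nan_rows = np.zeros(nobs, dtype=bool)
        for a in arrays:
            # Flatten trailing dimensions so 1-D and 2-D inputs are handled alike.
            nan_rows |= np.isnan(a.reshape(nobs, -1)).any(axis=1)
        cleaned = [a[~nan_rows] for a in arrays]
        return cleaned, np.flatnonzero(nan_rows)

# Called from the outside, with no instance and no side effects.
endog = np.array([1.0, np.nan, 3.0, 4.0])
exog = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [7.0, 8.0]])
(endog_c, exog_c), idx = SketchData.handle_missing(endog, exog)
```

Another model, such as one that collapses groups before estimation, could then call the routine on its own arrays without going through the full data setup.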
REF: Expose missing data handling as classmethod
STY: Expose handle_missing as a public method.
Provided I've fixed the test failures, I'm going to go ahead and rebase and merge this so I can try to use it in another branch. We can refactor again later if needed.
Fine with me, I don't think this is used yet outside of the base datahandling.
Eh, now that I'm actually thinking about it, it might need a few more changes.
ENH: Higher level handle_missing function.
TST: Test handle_missing function
FYI, we are already carrying around the missing_row_idx in the data attribute. I didn't see it.