[air - preprocessor] Add BatchMapper. #23700

xwjiang2010 · 2022-04-04T21:46:41Z

Why are these changes needed?

Add BatchMapper preprocessor.
Update the semantics of preprocessor.fit() to allow fitting more than once, following the scikit-learn convention.
Introduce FitStatus to explicitly handle the Chain case.

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

gjoliver · 2022-04-04T22:58:56Z

python/ray/ml/preprocessor.py

            raise PreprocessorNotFittedException(
                "`fit` must be called before `transform_batch`."
            )
        return self._transform_batch(df)

+    def should_fit(self):


sounds more like can_fit(self) or fittable(self) to me.
btw why is it check_is_fitted() and not is_fitted() ...

Hmmm, got it. We can decide on the naming. But semantics are basically:

fittable is an inherent attribute of the "type" of the preprocessor. It also implies whether a fit method is meaningful at all throughout the entire lifetime of this preprocessor.

should_fit/can_fit depends on the state a preprocessor is currently in (assuming it's fittable).

Exposing check_is_fitted alone is not enough. As you can see in trainer.py, the current implementation only checks check_is_fitted, which leads to a crash for non-fittable preprocessors. That's why the proposal is to add should_fit.

check_is_fitted vs. is_fitted, or can vs. should: I don't have much preference. @clarkzinzow @matthewdeng maybe as original authors of the API?

agree with the change. I am just nit-picking the naming.
hope to get things named consistently. thanks :)

I opted for should_fit over can_fit since it's not indicating an optional operation for fittable preprocessors, it's a necessary operation: if a fittable preprocessor is not fit before calling .transform(), it will fail. An argument could even be made for needs_fit.

We can change check_is_fitted to is_fitted.

should_fit functionally is a bit strange to me, at least as a public API. In particular, I want to avoid the case where the user does something like:

if (preprocessor.should_fit()): preprocessor.fit()

It's not clear how to differentiate the case where the preprocessor is fitted from the case where the preprocessor was already fitted before.

@matthewdeng hmm, mind elaborating a bit more?
So there are 3 cases:

1. not fittable
2. fittable and fitted
3. fittable and not fitted yet

should_fit == case 3
is_fitted == case 2, conditioned on being in case 2 or 3

Another way that I can see working is to enforce "at most once" fitting semantics internally, so that the caller doesn't have to call should_fit before fit. Which one do you prefer? Or are you proposing an alternative?
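
The three cases above can be sketched as follows. This is only an illustration: `State` and `SketchPreprocessor` are hypothetical names, not the Ray classes, and the `is_fitted` behavior for a non-fittable preprocessor is an assumption made for the example.

```python
from enum import Enum


class State(Enum):
    NOT_FITTABLE = "not fittable"            # case 1
    FITTED = "fittable and fitted"           # case 2
    NOT_FITTED = "fittable and not fitted"   # case 3


class SketchPreprocessor:
    """Hypothetical stand-in for a preprocessor; not the Ray class."""

    def __init__(self, fittable: bool):
        self._is_fittable = fittable  # fixed for the object's lifetime
        self._fitted = False          # changes once fit() has run

    def state(self) -> State:
        if not self._is_fittable:
            return State.NOT_FITTABLE
        return State.FITTED if self._fitted else State.NOT_FITTED

    def should_fit(self) -> bool:
        # True only in case 3: fittable and not fitted yet.
        return self.state() is State.NOT_FITTED

    def is_fitted(self) -> bool:
        # Only meaningful in cases 2 and 3 (assumption: error otherwise).
        if not self._is_fittable:
            raise ValueError("is_fitted is undefined for a non-fittable preprocessor")
        return self._fitted

    def fit(self) -> "SketchPreprocessor":
        # Marks the preprocessor as fitted; a non-fittable one is left untouched.
        if self._is_fittable:
            self._fitted = True
        return self
```

Note that `should_fit()` returns False in both case 1 and case 2, which is the ambiguity raised earlier in the thread.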

I think I lack the context why you want to bundle these 2 things together in the first place.
but in my mind, the most intuitive way is to:

if calling fit() or fit_transform(), and not fittable, throw exception.
if calling fit(), and already fitted, print warning msg, and no-op.
if calling fit_transform(), and already fitted, print warning msg, then proceed to transform().

Can we raise an exception in fit()/fit_transform() when already fitted instead? Logging is better than no logging, but I worry the behavior here isn't clear for users (I can see users thinking it should re-fit).
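
The two options being debated for a repeated fit() (warn and no-op, or raise) can be compared in a small sketch. Everything here is hypothetical: the exception names, the `strict` flag, and the `fit_calls` counter exist only for illustration and are not part of the PR.

```python
import warnings


class NotFittableError(Exception):
    """Hypothetical exception name, used only in this sketch."""


class AlreadyFittedError(Exception):
    """Hypothetical exception name, used only in this sketch."""


class SketchPreprocessor:
    def __init__(self, fittable: bool = True, strict: bool = False):
        self._is_fittable = fittable
        self._strict = strict   # True: raise on re-fit; False: warn and no-op
        self._fitted = False
        self.fit_calls = 0      # counts how many times fitting actually ran

    def fit(self, data):
        if not self._is_fittable:
            # Calling fit() on a non-fittable preprocessor is always an error.
            raise NotFittableError("`fit` is not supported by this preprocessor.")
        if self._fitted:
            if self._strict:
                # Explicit failure, so nobody assumes a silent re-fit happened.
                raise AlreadyFittedError("`fit` was already called.")
            # Lenient option: tell the user and leave the fitted state alone.
            warnings.warn("Already fitted; `fit` is a no-op.")
            return self
        self.fit_calls += 1
        self._fitted = True
        return self
```

With `strict=False` a second `fit()` leaves `fit_calls` at 1 and emits one warning; with `strict=True` it raises `AlreadyFittedError`.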

clarkzinzow

Looking good, the biggest remaining things are:

We need to modify Chain to work correctly with the new should_fit() API.
I think that BatchMapper.fit() should be a no-op in order for Chain to be able to naively call .fit() and .fit_transform() on all of its preprocessors, which should be cleaner.

ray/python/ray/ml/preprocessors/chain.py

Lines 22 to 47 in 8b8afd5

    
    def _fit(self, ds: Dataset) -> Preprocessor:
        for preprocessor in self.preprocessors[:-1]:
            ds = preprocessor.fit_transform(ds)
        self.preprocessors[-1].fit(ds)
        return self

    def fit_transform(self, ds: Dataset) -> Dataset:
        for preprocessor in self.preprocessors:
            ds = preprocessor.fit_transform(ds)
        return ds

    def _transform(self, ds: Dataset) -> Dataset:
        for preprocessor in self.preprocessors:
            ds = preprocessor.transform(ds)
        return ds

    def _transform_batch(self, df: DataBatchType) -> DataBatchType:
        for preprocessor in self.preprocessors:
            df = preprocessor.transform_batch(df)
        return df

    def check_is_fitted(self) -> bool:
        return all(p.check_is_fitted() for p in self.preprocessors)

    def __repr__(self):
        return f"<Chain preprocessors={self.preprocessors}>"
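
To illustrate why a no-op fit() keeps the chain logic simple, here is a toy sketch that uses plain Python lists in place of Datasets. `Scaler`, `StatelessMapper`, and `SketchChain` are hypothetical stand-ins, not the Ray classes; `StatelessMapper` plays the role of a BatchMapper-like step with nothing to fit.

```python
class Scaler:
    """Toy fittable step: divides values by the maximum seen during fit."""

    def __init__(self):
        self.max_ = None

    def fit(self, rows):
        self.max_ = max(rows)
        return self

    def transform(self, rows):
        return [r / self.max_ for r in rows]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


class StatelessMapper:
    """Toy stand-in for a BatchMapper-like step: wraps a function, nothing to fit."""

    def __init__(self, fn):
        self.fn = fn

    def fit(self, rows):
        return self  # no-op, so a chain can call it without special-casing

    def transform(self, rows):
        return [self.fn(r) for r in rows]

    def fit_transform(self, rows):
        return self.transform(rows)


class SketchChain:
    """Calls fit/fit_transform on every step without checking fittability."""

    def __init__(self, *steps):
        self.steps = steps

    def fit(self, rows):
        # Same shape as the quoted _fit: fit_transform all but the last, fit the last.
        for step in self.steps[:-1]:
            rows = step.fit_transform(rows)
        self.steps[-1].fit(rows)
        return self

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows


chain = SketchChain(StatelessMapper(lambda x: x + 1), Scaler())
result = chain.fit([1, 3]).transform([1, 3])
print(result)  # [0.5, 1.0]
```

The chain never asks whether a step is fittable; the stateless step simply ignores the call.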

clarkzinzow · 2022-04-05T00:01:16Z

python/ray/ml/preprocessor.py

        Args:
            dataset: Input dataset.

        Returns:
            Preprocessor: The fitted Preprocessor with state attributes.
        """
+        assert self._is_fittable, "One is expected to call `should_fit` before `fit`."


Could also make this a no-op when not self._is_fittable, which would be more friendly to e.g. chain preprocessors.

See comment above.

@clarkzinzow I see.
Looking at the Chain preprocessor, _is_fittable is set to False. Are users supposed to override this when constructing their Chain preprocessor?

gjoliver · 2022-04-05T17:36:21Z

python/ray/ml/preprocessor.py

@@ -60,6 +64,8 @@ def fit_transform(self, dataset: Dataset) -> Dataset:
        Returns:
            ray.data.Dataset: The transformed Dataset.
        """
+        assert self._is_fittable, "One is expected to call `should_fit` before `fit`."


this error message looks weird. why don't you check:
assert self.should_fit() here as well?

gjoliver · 2022-04-05T17:48:33Z

python/ray/ml/preprocessor.py

            raise PreprocessorNotFittedException(
                "`fit` must be called before `transform_batch`."
            )
        return self._transform_batch(df)

+    def should_fit(self):


I think I lack the context why you want to bundle these 2 things together in the first place.
but in my mind, the most intuitive way is to:

if calling fit() or fit_transform(), and not fittable, throw exception.
if calling fit(), and already fitted, print warning msg, and no-op.
if calling fit_transform(), and already fitted, print warning msg, then proceed to transform().

xwjiang2010 · 2022-04-06T17:06:38Z

@gjoliver @matthewdeng @clarkzinzow
A few updates:

introduces a FitStatus and fit_status() to incorporate some of the nuances for chained preprocessors.
throws explicit exceptions
check_is_fitted is now private

python/ray/ml/preprocessor.py

clarkzinzow · 2022-04-06T17:53:13Z

python/ray/ml/preprocessors/chain.py

+            elif fitted_count > 0:
+                return Preprocessor.FitStatus.PARTIALLY_FITTED


Is this a valid state, and when would this happen? Is this just when a chain is created that contains some fitted and some unfitted preprocessors? Is that even a valid use case that we should allow?

Correct. I don't think this is necessarily a valid state to be in, but one may construct a chain preprocessor incorrectly and end up in this mixed state.
Trying to be defensive and explicit here.
I am also open to adding another error to warn explicitly about this mixed state, which should not happen.
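
A rough sketch of how a chain could derive its status from its children, including the mixed state discussed here. Only `FitStatus`, `PARTIALLY_FITTED`, and `fitted_count` appear in the thread; the other enum members and the `chain_fit_status` helper are assumptions made for illustration.

```python
from enum import Enum


class FitStatus(Enum):
    # Only PARTIALLY_FITTED is visible in the diff; the other members are assumed.
    NOT_FITTABLE = "NOT_FITTABLE"
    NOT_FITTED = "NOT_FITTED"
    PARTIALLY_FITTED = "PARTIALLY_FITTED"
    FITTED = "FITTED"


def chain_fit_status(statuses):
    """Combine the statuses of a chain's preprocessors (hypothetical helper)."""
    # Steps that can never be fitted do not affect the chain's status.
    fittable = [s for s in statuses if s is not FitStatus.NOT_FITTABLE]
    if not fittable:
        return FitStatus.NOT_FITTABLE
    # A nested, partially fitted chain makes the outer chain partially fitted too.
    if any(s is FitStatus.PARTIALLY_FITTED for s in fittable):
        return FitStatus.PARTIALLY_FITTED
    fitted_count = sum(s is FitStatus.FITTED for s in fittable)
    if fitted_count == len(fittable):
        return FitStatus.FITTED
    elif fitted_count > 0:
        return FitStatus.PARTIALLY_FITTED
    return FitStatus.NOT_FITTED
```

For example, a chain built from one fitted and one unfitted preprocessor would report PARTIALLY_FITTED, which a caller could then reject explicitly.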

clarkzinzow · 2022-04-06T17:53:59Z

python/ray/ml/trainer.py

@@ -192,7 +192,7 @@ def preprocess_datasets(self) -> None:

        if self.preprocessor:
            train_dataset = self.datasets.get(TRAIN_DATASET_KEY, None)
-            if train_dataset and not self.preprocessor.check_is_fitted():
+            if train_dataset:


Are there valid use cases in which an already-fitted preprocessor may be passed and we'd rather no-op than error here?

See @matthewdeng's preference about wanting explicit exception. :)
let's make a decision and stick to it.

I think we should allow fitted dataset, and basically no-op here.
why do we want to require unfitted dataset? what if the entire dataset is not fittable?

we could do that. It's just @matthewdeng has this concern to not silently no-op (even with a warning msg):

Can we raise an exception in fit()/fit_transform() when already fitted instead? Logging is better than no logging, but I worry the behavior here isn't clear for users (I can see users thinking it should re-fit).

printing an info or warning msg sounds good.

So I think that Preprocessor itself should error if .fit() is called on an already fitted preprocessor, but I was less sure about whether Train as a user of Preprocessor should let these exceptions happen. I think that @matthewdeng is right, we should error here to ensure that the user doesn't think that an overwriting or incremental fit is happening.
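
The difference between the two lines in the trainer.py diff can be sketched like this. `ToyPreprocessor`, `AlreadyFittedError`, and the two helper functions are hypothetical and only mirror the shape of the changed condition; they are not the Trainer code.

```python
class AlreadyFittedError(Exception):
    """Hypothetical exception name, used only in this sketch."""


class ToyPreprocessor:
    def __init__(self):
        self.fitted = False
        self.fit_calls = 0

    def fit(self, data):
        # The preprocessor itself refuses a second fit.
        if self.fitted:
            raise AlreadyFittedError("`fit` was already called.")
        self.fitted = True
        self.fit_calls += 1
        return self


def preprocess_old(preprocessor, train_dataset):
    """Shape of the removed line: silently skip fitting when already fitted."""
    if train_dataset and not preprocessor.fitted:
        preprocessor.fit(train_dataset)


def preprocess_new(preprocessor, train_dataset):
    """Shape of the added line: always call fit and let its error surface."""
    if train_dataset:
        preprocessor.fit(train_dataset)
```

With the old shape a second call does nothing, so a user may wrongly assume a re-fit happened; with the new shape the same call raises.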

what about a partially fitted chain? what are a user's options here?

Synced offline.
@matthewdeng @gjoliver @clarkzinzow PTAL.

gjoliver · 2022-04-06T19:13:02Z

python/ray/ml/trainer.py

@@ -192,7 +192,7 @@ def preprocess_datasets(self) -> None:

        if self.preprocessor:
            train_dataset = self.datasets.get(TRAIN_DATASET_KEY, None)
-            if train_dataset and not self.preprocessor.check_is_fitted():
+            if train_dataset:


I think we should allow fitted dataset, and basically no-op here.
why do we want to require unfitted dataset? what if the entire dataset is not fittable?

clarkzinzow

LGTM, only nits! IMO good to merge after one other ML team reviewer approval.

matthewdeng

New functionality looks great, thanks for iterating on this!

matthewdeng

LGTM - can you update the PR summary and add a description (including the fit status changes)?

[air - preprocessor] add BatchMapper.

8afa0ba

xwjiang2010 assigned matthewdeng and clarkzinzow Apr 4, 2022

xwjiang2010 added 2 commits April 4, 2022 15:43

introduce should_fit().

4023c24

add to preprocessor.__init__

8b8afd5

gjoliver reviewed Apr 4, 2022

clarkzinzow reviewed Apr 5, 2022

Add doc string

dcb3c4d

gjoliver reviewed Apr 5, 2022

Address comments.

6af526d

update exception msg.

f2ace76

clarkzinzow reviewed Apr 6, 2022

address comments

ba0ec52

gjoliver reviewed Apr 6, 2022

xwjiang2010 added 2 commits April 6, 2022 13:50

address comments

4df6390

allow fitting more than once. print a warning.

56c594c

clarkzinzow approved these changes Apr 7, 2022

richardliaw added this to the Ray AIR milestone Apr 8, 2022

Merge branch 'ray-project:master' into add_column

ccf8822

matthewdeng reviewed Apr 8, 2022

xwjiang2010 added 4 commits April 8, 2022 13:41

comments

deab13c

Merge branch 'add_column' of https://github.com/xwjiang2010/ray into …

f495403

…add_column

address comments

c07e305

up

624b308

matthewdeng reviewed Apr 13, 2022

matthewdeng approved these changes Apr 13, 2022

amogkam merged commit 06a57b2 into ray-project:master Apr 14, 2022

matthewdeng mentioned this pull request May 30, 2022

[AIR] Support fitted preprocessors #25299

Closed

xwjiang2010 deleted the add_column branch July 26, 2023 19:49
