Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application context objects should not be kept by default + add way to supply fit context #86

Closed
shaypal5 opened this issue Feb 4, 2022 · 9 comments · Fixed by #89
Closed

Comments

@shaypal5
Copy link
Collaborator

shaypal5 commented Feb 4, 2022

pdpipe uses PdpApplicationContext objects in two ways:

  1. As the fit_context that should be kept as-is after a fit, and used by stages to pass to one another parameters that should also be used on transform time.
  2. As the application_context that should be discarded after a specific application is done, and is used by stages to feed consecutive stages with context. It can be added to by supplying apply(context={}), fit_transform(context={}) or transform(context={}) with a dict that will be used to update the application context.

Two changes are required:

  1. At the moment there is a single context parameter to application functions that is used to update both the fit and the application context. I think they should be two, one for each type of context.
  2. At the moment the application_context is not discarded when the application is done. It's as simple as self.application_context = None expression added at the PdPipeline level in the couple of right cases.
@shaypal5
Copy link
Collaborator Author

shaypal5 commented Feb 4, 2022

@carbonleakage does this sound interesting?

@carbonleakage
Copy link
Contributor

@shaypal5 sounds pretty interesting, ill start working on this!

@shaypal5
Copy link
Collaborator Author

shaypal5 commented Feb 5, 2022

Great. I'll share code with the latest use case I had for application context, so you have some context yourself. :)

@shaypal5
Copy link
Collaborator Author

shaypal5 commented Feb 6, 2022

See the pipeline stage here:
https://github.com/shaypal5/mba_ds_project/blob/main/mba/pipeline.py#L20

It gets a product-review dataframe as context, used to enrich some of the rows in the input dataframe with product sentiment features.

In another file, on fit-transform the pipeline is provided with the train review dataframe:
https://github.com/shaypal5/mba_ds_project/blob/main/mba/buyer.py#L112

And on transform it is provided with the rollout/holdout review dataframe:
https://github.com/shaypal5/mba_ds_project/blob/main/mba/buyer.py#L283

Now here I did something sensible and provided it as a path, buy a very intuitive thing to do is to provide the dataframe object itself, in which case it will be kept(!) inside the pipeline object (because the context parameter currently also updates the fit_context attribute). What worse, if the pipeline is serialized to be deployed to production somewhere, the entire train context dataframe is serialized with it, even though it is never used in transform!

@shaypal5
Copy link
Collaborator Author

Hey @carbonleakage :)

Had a chance to take a look at this yet?

@carbonleakage
Copy link
Contributor

Hey @shaypal5 I started looking into this last week. I have not made much progress due to other commitments. I plan to do some pull requests in the coming weeks.

@shaypal5
Copy link
Collaborator Author

Great. :)

Just wanted to touch base.

@carbonleakage
Copy link
Contributor

Hello @shaypal5 I just did a pull request #89, let me know what you think.

@shaypal5
Copy link
Collaborator Author

Released in v0.2.1: 🎊 🎉
https://github.com/pdpipe/pdpipe/releases/tag/v0.2.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants