Skip to content

[CHIP-9] fetcher refactor + support derivations in chaining #992

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: haozhen--model-transform
Choose a base branch
from

Conversation

hzding621
Copy link
Collaborator

@hzding621 hzding621 commented May 15, 2025

Summary

Refactor fetch join calls and support derivations in chaining.

  • Refactor fetchJoin into multiple sub-functions: fetchBaseJoin, fetchDerivations, fetchModelTransforms and instrumentAndLog
  • Update the call of fetchJoin from JoinSourceRunner into calling each sub-functions separately
    • for derivations, instead of calling fetchDerivations, use spark session to apply derivations
    • the new sequence is:
      • 1st mapPartitions: invoke fetchBaseJoin
      • run derivations using the main session
      • 2nd mapPartitions: invoke fetchModelTransforms and instrumentAndLog
  • Also, support UDFs loading (via setups) in JoinSourceRunner

Why / Goal

Currently, if a join contains derivations, it cannot be used as a joinSource for chaining, because in the chaining streaming job (JoinSourceRunner), spark executor will call fetcher.fetchJoin which invokes its own local spark session, and Spark doesn't like this - results in Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1

Before model transforms chaining this is fine, because the same derivations can be expressed using GroupBy's query. But with model transforms chaining, running derivations as part of the join is a mandatory step to prepare model inputs. This is to be done together with UDF usage in derivations.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

Checklist

  • Documentation update

Reviewers

@hzding621 hzding621 changed the base branch from main to haozhen--model-transform May 15, 2025 03:04
@hzding621 hzding621 changed the title fetcher refactor + support derivations in chaining [CHIP-9] fetcher refactor + support derivations in chaining May 15, 2025
@hzding621 hzding621 force-pushed the haozhen--model-transform-derivations branch from d6da2e2 to ce03746 Compare May 15, 2025 07:52
@nikhil-zlai
Copy link
Collaborator

should we discuss the direction a bit more as a larger group - before we go deep?

Lets also ask users if they prefer join.inferenceSpec or a standalone model object in the API.

@hzding621 hzding621 force-pushed the haozhen--model-transform branch from 574695c to cc942e1 Compare June 14, 2025 00:06
@hzding621 hzding621 force-pushed the haozhen--model-transform-derivations branch from ce03746 to 6b44f06 Compare June 14, 2025 00:06
@hzding621 hzding621 force-pushed the haozhen--model-transform branch from cc942e1 to 1abcf86 Compare June 20, 2025 19:22
@hzding621 hzding621 force-pushed the haozhen--model-transform-derivations branch from 6b44f06 to 8335f3d Compare June 20, 2025 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants