
Execution Manager #84

Closed
5 of 11 tasks
Okanmercan99 opened this issue Jul 25, 2023 · 0 comments
Okanmercan99 commented Jul 25, 2023

This issue describes how each execution of mapping jobs will be managed.

  • Jobs should be run asynchronously. If the job to be executed is valid, the execution should start in the background and the job submission result should be returned to the client immediately. If the submitted job is not valid, an appropriate error message should be returned.
  • An ExecutionManager component should keep track of the active (i.e. running) jobs. This component should provide an API to stop running executions.
  • It should also be possible to start and stop individual mapping tasks. This is required when a mapping is updated and needs to be restarted.
  • There should be at most one execution of a given mapping task at any time.
  • For file system streaming, processed files should be archived, and mappings with errors should be aggregated in a separate file. This behavior should be configurable.
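The requirements above can be sketched as a minimal in-memory ExecutionManager. This is an illustrative sketch with hypothetical names; the real component would stop Spark streaming queries rather than signal plain threads:

```python
import threading

class ExecutionManager:
    """In-memory registry of running mapping-task executions (illustrative sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        # (job_id, mapping_task) -> threading.Event used to request a stop
        self._running = {}

    def start(self, job_id, mapping_task, run_fn):
        """Start a mapping task in the background and return immediately.
        Rejects the submission if the same mapping task is already running."""
        key = (job_id, mapping_task)
        with self._lock:
            if key in self._running:
                raise RuntimeError(f"{mapping_task} of {job_id} is already running")
            stop_event = threading.Event()
            self._running[key] = stop_event
            worker = threading.Thread(
                target=self._run, args=(key, run_fn, stop_event), daemon=True)
            worker.start()
        return "submitted"  # the client gets the submission result immediately

    def _run(self, key, run_fn, stop_event):
        try:
            run_fn(stop_event)          # run_fn should poll/wait on stop_event
        finally:
            with self._lock:            # deregister when the execution ends
                self._running.pop(key, None)

    def stop(self, job_id, mapping_task):
        """Signal a single running mapping task to stop."""
        with self._lock:
            event = self._running.get((job_id, mapping_task))
        if event is not None:
            event.set()

    def list_running(self):
        """Snapshot of the currently active (job, mapping task) pairs."""
        with self._lock:
            return sorted(self._running)
```

Keeping the registry keyed by (job, mapping task) is what enforces "at most one execution of a mapping task" and lets a single task be stopped without touching the rest of the job.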

Scenarios:

  1. System restart / crash
  • Kafka stream
    • Clear checkpoints config
      • true -> The existing checkpoint directory for the job will be deleted. The earliest/latest offset configs will apply as expected.
      • false -> Records will be read starting from the last checkpointed offset.
  • File system stream
    • Clear checkpoints config
      • true -> The existing checkpoint directory for the job will be deleted. Existing files in the data source directory will be reprocessed. Users will need to put already-processed files back into the data source directory monitored by Spark.
      • false -> Processing will continue from the last read point.
  2. Mapping update (only the updated mapping task will be restarted)
  • Kafka stream
    • Clear checkpoints config
      • true -> The existing checkpoint directory for the mapping will be deleted. The earliest/latest offset configs will apply as expected. (Note that if the config is set to latest, old records won't be affected by the mapping updates.)
      • false -> Records will be read starting from the last checkpointed offset. Old records won't be affected by the mapping updates.
  • File system stream
    • Clear checkpoints config
      • true -> The existing checkpoint directory for the mapping will be deleted. Existing files in the data source directory will be reprocessed. Users will need to put already-processed files back into the data source directory monitored by Spark.
      • false -> Processing will continue from the last read point. Already-processed records won't be affected by the mapping updates.
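The four cases above boil down to a small decision table. A sketch encoding it (the returned strings paraphrase the behavior; they are not actual configuration values):

```python
def restart_behavior(source_type: str, clear_checkpoints: bool) -> str:
    """What a restarted job or mapping task will read, per the scenarios above."""
    if source_type == "kafka":
        # With a cleared checkpoint, Kafka's earliest/latest offset setting applies;
        # otherwise the stream resumes from the last checkpointed offset.
        return ("apply earliest/latest offset config" if clear_checkpoints
                else "resume from last checkpointed offset")
    if source_type == "file":
        # With a cleared checkpoint, files placed back in the monitored directory
        # are reprocessed; otherwise processing continues from the last read point.
        return ("reprocess files in the source directory" if clear_checkpoints
                else "continue from the last read point")
    raise ValueError(f"unknown source type: {source_type}")
```

The same table applies to both scenarios; the only difference is the scope of the checkpoint directory being cleared (whole job on restart/crash vs. a single mapping on update).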

Technical specs:

  • Checkpoints should be kept per job (not per execution, so that a job's execution can resume after a crash) and per mapping task included in the job.
  • There should be a configuration to clear Spark's checkpoint directory for a job as a whole and for individual mappings.
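A sketch of the implied checkpoint layout and the clear operation. The `<root>/<jobId>/<mappingTask>` directory layout is an assumption for illustration, not the actual toFHIR layout:

```python
import shutil
from pathlib import Path
from typing import Optional

def checkpoint_dir(root: str, job_id: str, mapping_task: Optional[str] = None) -> Path:
    """Checkpoints are kept per job, and per mapping task within the job."""
    path = Path(root) / job_id
    return path / mapping_task if mapping_task else path

def clear_checkpoints(root: str, job_id: str, mapping_task: Optional[str] = None) -> None:
    """Delete the checkpoint directory for a whole job or, when a mapping task
    is given, only for that task (so other tasks keep their progress)."""
    shutil.rmtree(checkpoint_dir(root, job_id, mapping_task), ignore_errors=True)
```

Clearing only one mapping task's directory is what allows a mapping update to restart that task from scratch while the job's other tasks continue from their checkpoints.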

Sub-issues:

  • Interactive CLI should be able to run, stop and list streaming queries
  • Running status information should be included in the execution summary
  • Simultaneous execution of the same mapping should be prevented
  • Checkpoint resetting
  • Long-running batch should be tracked
  • The services returning logs do not have specific models. A new model needs to be built to cover all types of logs.
  • Currently, startTime (when streaming started) and endTime (when streaming ended) are calculated in the frontend for streaming mapping tasks. We need to calculate them in the backend instead and return a well-formatted response.
  • If there is no log in batch mappings, we cannot show the mapping URL on the execution detail page. Adding the started log may be the solution for this.
  • If there is no log for streaming mappings, we do not show the mapping URL. Again, adding a mapping started log for that mapping task could be a solution.
  • While running a file-streaming mapping task, if a data source (CSV) produces an invalid FHIR resource error while writing to the onFhir server, that execution stops reading new data source files, i.e. the streaming execution stops.

Available Bugs

  • When running a file streaming job, if a new data source file is put into the configured streaming folder, the mapping task count is increased for that execution even though the same mapping task is used.
suatgonul added a commit that referenced this issue Sep 1, 2023
Implemented REST endpoints to stop running streaming queries
Relates to #84
suatgonul added a commit that referenced this issue Sep 2, 2023
Implemented REST endpoints to stop running streaming queries
Relates to #84
suatgonul added a commit that referenced this issue Sep 11, 2023
Implemented REST endpoints to stop running streaming queries
Relates to #84