Initial pass at metrics collection for Central Scheduler #2044
Conversation
@chrispalmer, thanks for your PR! By analyzing the history of the files in this pull request, we identified @daveFNbuck, @erikbern and @DAVW to be potential reviewers.
@@ -354,7 +356,7 @@ def __init__(self, worker_id, last_active=None):
         self.disabled = False
         self.rpc_messages = []

-    def add_info(self, info):
+    def update_info(self, info):
I changed this method name, because I feel like it fits much better with what is actually happening.
Yes! Those edits are great actually. :)
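As a hedged illustration of why "update" fits the behavior better: the method most likely merges incoming key/value pairs into the worker's stored info dict rather than appending to it (assumed body, not necessarily the PR's exact code):

    def update_info(self, info):
        # merge the incoming key/value pairs into the worker's existing
        # info dict, overwriting stale values -- an update, not an addition
        self.info.update(info)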
        self._update_metrics_worker_status_change(worker, 'disabled')

    def _update_metrics_task_status_change(self, task, status):
        if status != UNKNOWN:
Not sure this is the correct place for this check, but tasks can get added to the state before we actually know anything about them. I wish there was a better way to identify when new tasks are being added.
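A sketch of what this hook guards against (the collector method name below is an assumption, not the PR's exact API):

    def _update_metrics_task_status_change(self, task, status):
        # tasks can enter the scheduler state as bare placeholders
        # (status UNKNOWN), e.g. when first referenced as a dependency,
        # so only report genuine status changes to the collector
        if status != UNKNOWN:
            # handle_task_status_change is an assumed collector hook
            self._metrics_collector.handle_task_status_change(task, status)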
I just came back from computer-less vacation. I'll try to look at this soon :)
This looks cool. But don't you also have screenshots from the Prometheus web interface? Would look cooler I think. :)
Can you move all the files related to the scheduler that you've created into luigi.scheduler.*? It's about time. Also don't forget to move the batch-notifier while you're at it.
    def get_worker_ids(self):
        return self._active_workers.keys()  # only used for unit tests

    def get_worker(self, worker_id):
        return self._active_workers.setdefault(worker_id, Worker(worker_id))

    def inactivate_workers(self, delete_workers):
        ...

    def remove_worker(self, worker_id):
        self.inactivate_workers([worker_id], reason='disconnected')
Maybe "gracefully_stopped"
is clearer than disconnected
?
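For context, the worker-side counterpart mentioned in the PR summary might look roughly like this (a sketch; the attribute names are assumptions):

    class Worker(object):
        def _remove_worker(self):
            # tell the scheduler this worker is disconnecting on purpose,
            # so the disconnect is not mistaken for a worker failure
            self._scheduler.remove_worker(self._id)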
@@ -0,0 +1,14 @@
I think you need to copy-paste the license blah-blah from the other files, updating the year.
@@ -147,6 +147,8 @@ class scheduler(Config):

     prune_on_get_work = parameter.BoolParameter(default=False)

+    metrics_collection = parameter.Parameter(default=None)
Maybe add a description?
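For illustration, the suggested change might look like this (the description text below is invented, not from the PR; luigi's Parameter does accept a description keyword):

    metrics_collection = parameter.Parameter(
        default=None,
        description='Metrics collector to use for the scheduler, '
                    'e.g. "prometheus" or "datadog"')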
We're currently integrating our solution with DataDog. We wanted the ability to send metrics to that service from our pipeline. Doing so will allow us to monitor the status of our pipeline by looking at statistics based on the metrics it sends. At the moment only one event is supported, but as the feature progresses it's easy to see that we could support many more. I had a fairly basic implementation at first, but after browsing the existing PRs against the official Luigi repo I discovered an ongoing implementation of exactly what we were trying to achieve, just targeting a similar service. Thanks to chrispalmer, I've been able to re-use his original work to implement ours. spotify#2044
Datadog is a tool that allows you to send metrics, create dashboards from them, and add alerting on specific behaviors. Adding this contrib will allow users of this tool to log their pipeline information to Datadog. On each task status change, we log that information to Datadog along with the parameters that were used to run that specific task. This allows us to easily create dashboards to visualize pipeline health. For example, we can be notified via Datadog if a task has failed, or we can graph the execution time of a specific task over a period of time. The implementation idea was strongly based on the stale PR spotify#2044.
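A rough sketch of the idea (not the contrib's actual code; the helper function and placeholder credentials are invented, and it assumes the scheduler task exposes family, params, and id):

    from datadog import api, initialize

    # placeholder credentials -- real keys would come from configuration
    initialize(api_key='<api_key>', app_key='<app_key>')

    def send_task_status_event(task, status):
        # tag the event with the task family and the parameters used to
        # run it, so Datadog dashboards can filter and group on them
        tags = ['task_name:%s' % task.family]
        tags += ['%s:%s' % (k, v) for k, v in task.params.items()]
        api.Event.create(
            title='Luigi task %s is %s' % (task.id, status),
            text='Task status change reported by the Luigi pipeline.',
            tags=tags)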
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.
* Add Datadog contrib for monitoring purposes

Datadog is a tool that allows you to send metrics, create dashboards from them, and add alerting on specific behaviors. Adding this contrib will allow users of this tool to log their pipeline information to Datadog. On each task status change, we log that information to Datadog along with the parameters that were used to run that specific task. This allows us to easily create dashboards to visualize pipeline health. For example, we can be notified via Datadog if a task has failed, or we can graph the execution time of a specific task over a period of time. The implementation idea was strongly based on the stale PR #2044.

* Refactor MetricsCollectors in Scheduler

I've also added a few tests to ensure that the implementation works well.

* Add polish + tests around metrics on task state

This ensures that the proper metrics collection calls are made when they are expected to happen. We've also removed a few @rpc_method decorators that weren't actually being used and weren't required.

* Add tests related to the Datadog contrib

This makes sure that we're properly dispatching API and statsd calls with the proper parameter values to Datadog. This doesn't test all the possible parameter configurations.

* Improve configuration documentation with new Datadog contrib

This adds a few extra documentation lines for the configuration, allowing users to find all the settings they can tweak for each individual contrib instead of having to go through each contrib's files.

* Update DataDog dep. to the most recent release

The original implementation was made when 0.16.0 was the latest version. Since then there have been a few improvements and bug fixes made to the library that we should be using. Reading through the release log there shouldn't be any feature-breaking changes, so we should be good to update it!

* Change metrics collection getter to class method

Previously, the getter wasn't a class method and wouldn't work as expected. To ensure that the output is what we expect, we've added more tests.

* Fix spec issues related to new tests

There were multiple problems that needed to be solved in order to get the specs green again. Each individual spec passed when run individually, but when run as a group under tox, some would pass and others would fail depending on the tox environment. It came to my attention that the time function in this file was creating an issue with other specs because we were not tearing it down as expected. Also, using setTime within the setUp group had unexpected side effects. Then, the way the task_id and task_family were named was also causing problems with the same specs that were failing before. I'm unsure why this would be the case, but changing either alone fails, while changing both makes the specs green. Finally, the last spec would always fail because setTime was set AFTER the task was actually run, which would always cause the execution time to be greater than 0. My understanding of all of this is still a bit fuzzy, but hey, now the spec suite passes.

* Refactor the datadog_metric tests

* Abstract MetricsCollector class

This will force people to implement the methods of this class when they subclass it.

* Kwargs on the _send_event call

This allows for less strict function calls.

* Fix metrics collector

* Change DataDog scheduler_api_tests

* Change default_tags of DatadogMetricsCollector to a property

The underlying configuration of the Datadog metrics collector is a property, so it makes more sense that it's also a property when used within the class itself.
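A minimal sketch of the "Abstract MetricsCollector class" idea (the hook names are assumptions, not the merged code):

    import abc

    class MetricsCollector(abc.ABC):
        """Subclasses must implement every hook before they can be instantiated."""

        @abc.abstractmethod
        def handle_task_started(self, task):
            pass

        @abc.abstractmethod
        def handle_task_failed(self, task):
            pass

        @abc.abstractmethod
        def handle_task_done(self, task):
            pass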
Description
This is an initial pass at adding metric collection functionality to the Central Scheduler, and is still a work in progress. I'm looking for feedback on the general approach.
Summary of changes:
- MetricsCollector class (luigi/metrics.py) with no-op methods, which can be subclassed for a specific metric collection tool.
- PrometheusMetricsCollector class (luigi/contrib/prometheus.py) that subclasses the above, as an example implementation.
- scheduler config option/parameter to select the metric collection tool (see the sketch after this list).
- Changes to SimpleTaskState to identify and trigger metric updates.
- Scheduler.remove_worker rpc_method and a related Worker._remove_worker method to allow workers to tell the scheduler when they are finishing and disconnecting. This allows us to distinguish between deliberate disconnects and worker failures.
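A hedged sketch of how the config option could select the collector (the factory function is an assumption, not the PR's exact code; the no-op base class and the Prometheus subclass are the ones named above):

    from luigi.metrics import MetricsCollector

    def create_metrics_collector(name):
        # map the scheduler's metrics_collection setting to a collector;
        # unknown or unset values fall back to the no-op base class
        if name == 'prometheus':
            from luigi.contrib.prometheus import PrometheusMetricsCollector
            return PrometheusMetricsCollector()
        return MetricsCollector()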
Motivation and Context

This is an attempt to provide functionality desired in issue 2306.
Have you tested this? If so, how?
Does not yet include any tests.
I have run it against some very simple pipelines and it seems to behave as expected.