
Initial pass at metrics collection for Central Scheduler #2044

Conversation

chrispalmer

Description

This is an initial pass at adding metric collection functionality to the Central Scheduler, and is still a work in progress. I'm looking for feedback on the general approach.

Summary of changes:

  • A stub MetricsCollector class (luigi/metrics.py) with no-op methods, intended to be subclassed for a specific metrics collection tool.
  • A PrometheusMetricsCollector class (luigi/contrib/prometheus.py) that subclasses the above, as an example implementation.
  • A new scheduler config option/parameter to select the metrics collection tool.
  • Changes to SimpleTaskState to identify and trigger metric updates.
  • A Scheduler.remove_worker rpc_method, related to the Worker._remove_worker method, so that workers can tell the scheduler when they are finishing and disconnecting. This lets us distinguish deliberate disconnects from worker failures.
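The stub-plus-subclass design described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's exact API: the class and method names here are made up for the example.

```python
# Hypothetical sketch of a no-op metrics collector base class.
# Names are illustrative; they are not the exact API from this PR.

class MetricsCollector:
    """Base collector: every hook is a no-op, so the scheduler can call
    these unconditionally and a backend only overrides what it needs."""

    def handle_task_status_change(self, task_id, status):
        pass  # no-op by default

    def handle_worker_status_change(self, worker_id, status):
        pass  # no-op by default


class InMemoryMetricsCollector(MetricsCollector):
    """Toy subclass that just records events, standing in for a real
    backend such as Prometheus."""

    def __init__(self):
        self.events = []

    def handle_task_status_change(self, task_id, status):
        self.events.append(("task", task_id, status))
```

A scheduler holding a MetricsCollector can then fire these hooks on every status change without caring whether a real backend is configured.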

Motivation and Context

This is an attempt to provide the functionality requested in issue 2306.

Have you tested this? If so, how?

It does not yet include any tests.
I have run it against some very simple pipelines and it behaves as expected.

@mention-bot

@chrispalmer, thanks for your PR! By analyzing the history of the files in this pull request, we identified @daveFNbuck, @erikbern and @DAVW to be potential reviewers.

@@ -354,7 +356,7 @@ def __init__(self, worker_id, last_active=None):
self.disabled = False
self.rpc_messages = []

def add_info(self, info):
def update_info(self, info):
Author

I changed this method name because I feel it fits much better with what is actually happening.

Contributor

Yes! Those edits are great actually. :)

self._update_metrics_worker_status_change(worker, 'disabled')

def _update_metrics_task_status_change(self, task, status):
if status != UNKNOWN:
Author

I'm not sure this is the correct place for this check, but tasks can be added to the state before we actually know anything about them. I wish there were a better way to identify when new tasks are being added.

@Tarrasch
Contributor

Tarrasch commented Mar 5, 2017

I just came back from a computer-less vacation. I'll try to look at this soon :)

Contributor

@Tarrasch left a comment

This looks cool. But don't you also have screenshots from the Prometheus web interface? That would look cooler, I think. :)

Can you move all the scheduler-related files you've created into luigi.scheduler.*? It's about time. Also don't forget to move the batch-notifier while you're at it.

def get_worker_ids(self):
return self._active_workers.keys() # only used for unit tests

def get_worker(self, worker_id):
return self._active_workers.setdefault(worker_id, Worker(worker_id))

def inactivate_workers(self, delete_workers):
def remove_worker(self, worker_id):
self.inactivate_workers([worker_id], reason='disconnected')
Contributor

Maybe "gracefully_stopped" is clearer than "disconnected"?

@@ -0,0 +1,14 @@

Contributor

I think you need to copy-paste the license boilerplate from the other files, updating the year.

@@ -147,6 +147,8 @@ class scheduler(Config):

prune_on_get_work = parameter.BoolParameter(default=False)

metrics_collection = parameter.Parameter(default=None)
Contributor

Maybe add a description?
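For illustration, selecting a collector might look like this in the Luigi config. The option name comes from the diff above; the `prometheus` value is an assumption about how a final implementation might spell it:

```ini
[scheduler]
# Pick the metrics backend; unset (the default) disables collection.
metrics_collection = prometheus
```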


thisiscab added a commit to glossier/luigi that referenced this pull request Oct 24, 2017
We're currently integrating our solutions with DataDog. We wanted the
ability to send metrics to that service from our pipeline.

Doing so will allow us to monitor the status of our pipeline by
looking at statistics based on metrics the pipeline sends.

At the moment only one event is supported, but as the feature
progresses it's easy to see that we could support many more.

I had a fairly basic implementation at first, but after browsing the
existing PRs against the official Luigi repo I discovered an ongoing
implementation of exactly what we were trying to achieve, just with a
similar service. Thanks to chrispalmer, I've been able to re-use his
original work to implement ours.

spotify#2044
thisiscab added a commit to glossier/luigi that referenced this pull request Apr 19, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request May 14, 2018
Datadog is a tool that lets you send metrics, create dashboards from
them, and add alerting on specific behaviors.

Adding this contrib will allow users of this tool to log their pipeline
information to Datadog.

On each task status change, we log that information to Datadog along
with the parameters that were used to run that specific task.

This allows us to easily create dashboards to visualize pipeline
health. For example, we can be notified via Datadog if a task has
failed, or we can graph the execution time of a specific task over a
period of time.

The implementation idea was strongly based on the stale PR
spotify#2044.
thisiscab added a commit to glossier/luigi that referenced this pull request May 30, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Jun 4, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Jul 9, 2018
@stale

stale bot commented Jul 31, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.

@stale stale bot added the wontfix label Jul 31, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Aug 2, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Aug 8, 2018
@stale stale bot closed this Aug 14, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Oct 15, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Dec 13, 2018
dlstadther pushed a commit that referenced this pull request Dec 17, 2018
* Add Datadog contrib for monitoring purpose


* Refactor MetricsCollectors in Scheduler

I've also added a few tests to ensure that the implementation works
correctly.

* Add polish + tests around metrics on task state

This takes care of ensuring that the proper metrics collection calls
are made when they are expected to happen.

We've also removed a few `@rpc_method` decorators that weren't actually
being used and weren't required.

* Add tests related to the Datadog contrib

This makes sure that we're properly dispatching API and statsd calls
with the proper parameter values to Datadog.

This doesn't test all the possible parameter configurations.

* Improve configuration documentation with new Datadog contrib

This adds a few extra documentation lines for the configuration, so
users can find all the settings they can tweak for each individual
contrib instead of having to go through each contrib's files.

* Update Datadog dep. to the most recent release

The original implementation was made when 0.16.0 was the latest
version. Since then, there have been a few improvements and bug fixes
made to the library that we should be using.

Reading through the release log, there shouldn't be any
feature-breaking changes, so we should be good to update it!

* Change metrics collection getter to class method

Previously, the getter wasn't a class method and wouldn't work as
expected. To ensure that the output is what we expect, we've added
more tests.

* Fix spec issues related to new tests

There were multiple problems that needed to be solved to get the specs
green again. Each individual spec passed when run on its own, but when
run as a group under tox, some of them would pass and others would fail
depending on the tox environment.

It came to my attention that this file's time function was creating an
issue with other specs because we were not tearing it down as expected.
Also, using setTime within the setUp group had side effects with
unexpected behaviors.

Then, the way the task_id and task_family were named was also causing
problems with the same specs that were failing before. I'm unsure why
this would be the case, but changing either alone still fails, while
changing both makes the specs green.

Finally, the last spec would always fail because setTime was set AFTER
the task was actually run, which would always cause the execution time
to be greater than 0.

My understanding of all of this is still a bit fuzzy, but hey, now the
spec suite passes.

* Refactor the datadog_metric tests

* Abstract MetricsCollector class

This will force people to implement the methods of this class when they
subclass it.

* Kwargs on the _send_event call

This allows for less strict function calls.
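The two preceding changes (an abstract base class, plus kwargs on the event-sending call) can be sketched together as follows. All names and signatures here are hypothetical illustrations, not the contrib's real API:

```python
from abc import ABC, abstractmethod


class MetricsCollector(ABC):
    """Abstract base: subclasses must implement the hooks, which is
    what forces implementers to provide them."""

    @abstractmethod
    def send_event(self, title, **kwargs):
        """Accepting **kwargs keeps call sites loose: each backend can
        accept whatever extra fields it understands."""


class RecordingCollector(MetricsCollector):
    """Example backend that simply records every event it is sent."""

    def __init__(self):
        self.sent = []

    def send_event(self, title, **kwargs):
        self.sent.append((title, kwargs))
```

Instantiating MetricsCollector directly raises a TypeError, which is the enforcement the commit message describes.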

* Fix metrics collector

* Change DataDog scheduler_api_tests

* Change default_tags of DatadogMetricsCollector to a property

The underlying configuration of the Datadog metrics collector is a
property, so it makes more sense for it to also be a property when used
within the class itself.
thisiscab added a commit to glossier/luigi that referenced this pull request Jan 28, 2019