
Initial pass at metrics collection for Central Scheduler #2044

Conversation

chrispalmer

Description

This is an initial pass at adding metric collection functionality to the Central Scheduler, and is still a work in progress. I'm looking for feedback on the general approach.

Summary of changes:

  • A stub MetricsCollector class (luigi/metrics.py) with no-op methods, intended to be subclassed for a specific metrics collection tool.
  • A PrometheusMetricsCollector class (luigi/contrib/prometheus.py) that subclasses the above, as an example implementation.
  • A new scheduler config option/parameter to select the metrics collection tool.
  • Changes to SimpleTaskState to identify and trigger metric updates.
  • A Scheduler.remove_worker rpc_method, related to the Worker._remove_worker method, so that workers can tell the scheduler when they are finishing and disconnecting. This lets us distinguish deliberate disconnects from worker failures.
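The stub-plus-subclass design described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's exact API: the class and method names here are made up for the example.

```python
# Hypothetical sketch of a no-op metrics collector base class.
# Names are illustrative; they are not the exact API from this PR.

class MetricsCollector:
    """Base collector: every hook is a no-op, so the scheduler can call
    these unconditionally and a backend only overrides what it needs."""

    def handle_task_status_change(self, task_id, status):
        pass  # no-op by default

    def handle_worker_status_change(self, worker_id, status):
        pass  # no-op by default


class InMemoryMetricsCollector(MetricsCollector):
    """Toy subclass that just records events, standing in for a real
    backend such as Prometheus."""

    def __init__(self):
        self.events = []

    def handle_task_status_change(self, task_id, status):
        self.events.append(("task", task_id, status))
```

A scheduler holding a MetricsCollector can then fire these hooks on every status change without caring whether a real backend is configured.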

Motivation and Context

This is an attempt to provide the functionality requested in issue 2306.

Have you tested this? If so, how?

It does not yet include any tests.
I have run it against some very simple pipelines and it behaves as expected.

@mention-bot

@chrispalmer, thanks for your PR! By analyzing the history of the files in this pull request, we identified @daveFNbuck, @erikbern and @DAVW to be potential reviewers.

@@ -354,7 +356,7 @@ def __init__(self, worker_id, last_active=None):
self.disabled = False
self.rpc_messages = []

def add_info(self, info):
def update_info(self, info):
Author

I changed this method name because I feel it fits much better with what is actually happening.

Contributor

Yes! Those edits are great actually. :)

self._update_metrics_worker_status_change(worker, 'disabled')

def _update_metrics_task_status_change(self, task, status):
if status != UNKNOWN:
Author

I'm not sure this is the correct place for this check, but tasks can be added to the state before we actually know anything about them. I wish there were a better way to identify when new tasks are being added.

@Tarrasch
Contributor

Tarrasch commented Mar 5, 2017

I just came back from a computer-less vacation. I'll try to look at this soon :)

Contributor

@Tarrasch left a comment

This looks cool. But don't you also have screenshots from the Prometheus web interface? That would look cooler, I think. :)

Can you move all the scheduler-related files you've created into luigi.scheduler.*? It's about time. Also don't forget to move the batch-notifier while you're at it.

def get_worker_ids(self):
return self._active_workers.keys() # only used for unit tests

def get_worker(self, worker_id):
return self._active_workers.setdefault(worker_id, Worker(worker_id))

def inactivate_workers(self, delete_workers):
def remove_worker(self, worker_id):
self.inactivate_workers([worker_id], reason='disconnected')
Contributor

Maybe "gracefully_stopped" is clearer than "disconnected"?

@@ -0,0 +1,14 @@

Contributor

I think you need to copy-paste the license boilerplate from the other files, updating the year.

@@ -147,6 +147,8 @@ class scheduler(Config):

prune_on_get_work = parameter.BoolParameter(default=False)

metrics_collection = parameter.Parameter(default=None)
Contributor

Maybe add a description?
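For illustration, selecting a collector might look like this in the Luigi config. The option name comes from the diff above; the `prometheus` value is an assumption about how a final implementation might spell it:

```ini
[scheduler]
# Pick the metrics backend; unset (the default) disables collection.
metrics_collection = prometheus
```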


thisiscab added a commit to glossier/luigi that referenced this pull request Oct 24, 2017
We're currently integrating our solutions with DataDog. We wanted the
ability to send metrics to that service from our pipeline.

Doing so will allow us to monitor the status of our pipeline by
looking at statistics based on metrics the pipeline sends.

At the moment only one event is supported, but as the feature
progresses it's easy to see that we could support many more.

I had a fairly basic implementation at first, but after browsing the
existing PRs against the official Luigi repo I discovered an ongoing
implementation of exactly what we were trying to achieve, just with a
similar service. Thanks to chrispalmer, I've been able to re-use his
original work to implement ours.

spotify#2044
thisiscab added a commit to glossier/luigi that referenced this pull request Apr 19, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request May 14, 2018
Datadog is a tool that lets you send metrics, create dashboards from
them, and add alerting on specific behaviors.

Adding this contrib will allow users of this tool to log their pipeline
information to Datadog.

On each task status change, we log that information to Datadog along
with the parameters that were used to run that specific task.

This allows us to easily create dashboards to visualize pipeline
health. For example, we can be notified via Datadog if a task has
failed, or we can graph the execution time of a specific task over a
period of time.

The implementation idea was strongly based on the stale PR
spotify#2044.
thisiscab added a commit to glossier/luigi that referenced this pull request May 30, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Jun 4, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Jul 9, 2018
@stale

stale bot commented Jul 31, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.

@stale stale bot added the wontfix label Jul 31, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Aug 2, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Aug 8, 2018
@stale stale bot closed this Aug 14, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Oct 15, 2018
thisiscab added a commit to glossier/luigi that referenced this pull request Dec 13, 2018
dlstadther pushed a commit that referenced this pull request Dec 17, 2018
* Add Datadog contrib for monitoring purpose


* Refactor MetricsCollectors in Scheduler

I've also added a few tests to ensure that the implementation works
correctly.

* Add polish + tests around metrics on task state

This takes care of ensuring that the proper metrics collection calls
are made when they are expected to happen.

We've also removed a few `@rpc_method` decorators that weren't actually
being used and weren't required.

* Add tests related to the Datadog contrib

This makes sure that we're properly dispatching API and statsd calls
with the proper parameter values to Datadog.

This doesn't test all the possible parameter configurations.

* Improve configuration documentation with new Datadog contrib

This adds a few extra documentation lines for the configuration, so
users can find all the settings they can tweak for each individual
contrib instead of having to go through each contrib's files.

* Update Datadog dep. to the most recent release

The original implementation was made when 0.16.0 was the latest
version. Since then, there have been a few improvements and bug fixes
made to the library that we should be using.

Reading through the release log, there shouldn't be any
feature-breaking changes, so we should be good to update it!

* Change metrics collection getter to class method

Previously, the getter wasn't a class method and wouldn't work as
expected. To ensure that the output is what we expect, we've added
more tests.

* Fix spec issues related to new tests

There were multiple problems that needed to be solved to get the specs
green again. Each individual spec passed when run on its own, but when
run as a group under tox, some of them would pass and others would fail
depending on the tox environment.

It came to my attention that this file's time function was creating an
issue with other specs because we were not tearing it down as expected.
Also, using setTime within the setUp group had side effects with
unexpected behaviors.

Then, the way the task_id and task_family were named was also causing
problems with the same specs that were failing before. I'm unsure why
this would be the case, but changing either alone still fails, while
changing both makes the specs green.

Finally, the last spec would always fail because setTime was set AFTER
the task was actually run, which would always cause the execution time
to be greater than 0.

My understanding of all of this is still a bit fuzzy, but hey, now the
spec suite passes.

* Refactor the datadog_metric tests

* Abstract MetricsCollector class

This will force people to implement the methods of this class when they
subclass it.

* Kwargs on the _send_event call

This allows for less strict function calls.
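The two preceding changes (an abstract base class, plus kwargs on the event-sending call) can be sketched together as follows. All names and signatures here are hypothetical illustrations, not the contrib's real API:

```python
from abc import ABC, abstractmethod


class MetricsCollector(ABC):
    """Abstract base: subclasses must implement the hooks, which is
    what forces implementers to provide them."""

    @abstractmethod
    def send_event(self, title, **kwargs):
        """Accepting **kwargs keeps call sites loose: each backend can
        accept whatever extra fields it understands."""


class RecordingCollector(MetricsCollector):
    """Example backend that simply records every event it is sent."""

    def __init__(self):
        self.sent = []

    def send_event(self, title, **kwargs):
        self.sent.append((title, kwargs))
```

Instantiating MetricsCollector directly raises a TypeError, which is the enforcement the commit message describes.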

* Fix metrics collector

* Change DataDog scheduler_api_tests

* Change default_tags of DatadogMetricsCollector to a property

The underlying configuration of the Datadog metrics collector is a
property, so it makes more sense for it to also be a property when used
within the class itself.
thisiscab added a commit to glossier/luigi that referenced this pull request Jan 28, 2019