
Add Datadog contrib for monitoring purpose #2434

Merged
merged 14 commits into spotify:master from feature/add-datadog-contrib on Dec 17, 2018

Conversation

thisiscab
Contributor

Description

Datadog is a tool that lets you send metrics, from which you can create
dashboards and add alerting on specific behaviors.

Adding this contrib will allow users of this tool to log their pipeline
information to Datadog.

Motivation and Context

Based on the status change of a task, we log that information to Datadog
with the parameters that were used to run that specific task.

This allows us to easily create dashboards to visualize the health of our
pipeline. For example, we can be notified via Datadog if a task has failed,
or we can graph the execution time of a specific task over a period of time.
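As a rough illustration of the kind of call involved (a sketch only, not the contrib's actual code), a failure hook using the `datadog` client library might look like the following; the hook name, metric name, and tags are assumptions:

```python
# Hedged sketch: the hook name, metric name, and tag layout are illustrative
# assumptions, not the contrib's real API. Assumes the `datadog` package is
# installed and a local DogStatsD agent is listening on the default UDP port.
from datadog import statsd


def on_task_failed(task_family, param_string):
    """Record a task failure, tagged with the task's family and parameters."""
    statsd.increment(
        'luigi.task.failure',
        tags=[
            'task_name:{}'.format(task_family),
            'parameters:{}'.format(param_string),
        ],
    )


on_task_failed('MyExampleTask', 'date=2018-12-17')
```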

The implementation idea was strongly based on the stale PR
#2044.

Have you tested this? If so, how?

We've been using this contrib at Glossier for multiple months now (maybe a year?). It is our main point of reference for the health of our pipeline.

@thisiscab
Contributor Author

Interesting, I was certain that I had run the spec suite locally to make sure that all was well.
I'll make sure that the problems are resolved ASAP!

@thisiscab force-pushed the feature/add-datadog-contrib branch from 5ed031c to 3d4f9d2 on May 30, 2018 at 18:23
Collaborator

@dlstadther left a comment


Is there any way to add tests of this contrib module?

@thisiscab
Contributor Author

@dlstadther I'll take a look at how we could test this contrib without it being painful. Thanks for your input!

@thisiscab force-pushed the feature/add-datadog-contrib branch 4 times, most recently from a87b877 to b5de03b on June 4, 2018 at 20:11
@thisiscab
Contributor Author

@dlstadther Hey! I've added a bunch of tests to make sure everything is working, but I'm having trouble running the spec suite locally. When I run each new test individually everything works, but I want to make sure that running the whole suite works fine too.

  1. Travis seems to always fail on the docs job, and the error being displayed is not something I know how to fix. Do you have suggestions?
  2. Travis seems to have timed out on some specs, and I don't have permission to re-run the suite. What should I do in that situation? Would you like me to rebase to trigger a rebuild, or shall I ask you to explicitly re-run the spec suite?

Let me know how to proceed further! :)

@dlstadther
Collaborator

I've not seen that doc failure before.

I've restarted the build and will review again after the build completes. (Thanks for your efforts!)

@dlstadther
Collaborator

Docs tests are still failing with the same error.

Are you able to build the docs locally? tox -e docs

@thisiscab force-pushed the feature/add-datadog-contrib branch from 0ff975f to 700172d on June 7, 2018 at 15:49
@thisiscab
Contributor Author

thisiscab commented Jun 7, 2018

@dlstadther Docs are running fine locally:

tox -e docs yields

_____________________________________________________ summary _________________________________________________________
  docs: commands succeeded
  congratulations :)

Also, when I run locally the tests that seem to hang on Travis, they all succeed. Not sure what this is all about :/

I think it's worth investigating deeper.

@dlstadther
Collaborator

@Tarrasch Have you experienced this sort of Travis doc failure or test exit before?

@Tarrasch
Contributor

No idea, I restarted an older build to know if it's due to changes in this PR or not. Maybe Travis just had a bad day?

@dlstadther
Collaborator

@cabouffard Could you resolve the conflicts, and let's see if the build passes!

@thisiscab force-pushed the feature/add-datadog-contrib branch 2 times, most recently from e4b6c0c to 0d30791 on July 9, 2018 at 16:39
@thisiscab
Contributor Author

@dlstadther I've tried many different things, but I can't figure out why the specs keep hanging on Travis with the addition of this PR.

I also can't figure out why the docs keep failing on Travis while I can build them on my local machine without any problems.

When I try to run tox -e py27-nonhdfs on my local machine, I have a bunch of different specs that fail due to many different kinds of errors. Here is a gist that shows what is failing locally: https://gist.github.com/cabouffard/5e20a4cf787752cb0caebdbc467e9ac0.

Fortunately, those specs aren't failing on Travis, but I can't reproduce Travis' behavior locally. I would need help on this.

@dlstadther
Collaborator

This is bizarre... Nothing is standing out to me as an issue, and yet there is clearly a problem.

@thisiscab
Contributor Author

@dlstadther What would you suggest the next step be? Could you investigate the issue on your side, or could another collaborator jump in and give us their point of view on the matter?

Thanks! :)

@thisiscab
Contributor Author

@dlstadther Hey, what's the latest on this one? :)

@dlstadther
Collaborator

So sorry @cabouffard - I've been kinda busy and unintentionally neglected this. Hoping to have some more time next week - I'll set myself a reminder! Thanks for your patience :)

@dlstadther
Collaborator

(Found a little time today to look into this briefly)

@cabouffard I've downloaded this branch locally and tried building docs (tox -e docs) on 2.7.15 and 3.7.0.

2.7.15 failed with the import error.
3.7.0 was successful.

(Spitballing here a bit...) I'm inclined to suspect the `from enum import Enum` import as the source of the issue, since the enum module only exists in the standard library on Python 3.4+, and the error is trickling down to fail other imports.

I've had personal experience with unittest where I've received misleading errors due to package import errors.

Mind looking into a pre-3.4 solution for Enum and seeing if that resolves the issues here?

Thanks!
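For reference, a minimal sketch of one 2.7-friendly workaround (not the PR's actual fix); the class and member names below are assumptions:

```python
# Minimal sketch: fall back to plain class attributes when neither the stdlib
# enum (Python 3.4+) nor the enum34 backport is installed.
try:
    from enum import Enum  # stdlib on Python 3.4+, or the enum34 backport on 2.7
except ImportError:
    Enum = object  # degrade to plain class attributes, losing enum semantics


class MetricsCollectors(Enum):
    """Illustrative collector identifiers (names are assumptions)."""
    default = 1
    none = 1
    datadog = 2
```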

@thisiscab
Contributor Author

It has been my turn to be busy! Very clever find about the Enum; I would have expected a better error message. Thanks for taking the time to help me out on this! I'll make the additional changes right now! :)

I've also added a few tests to ensure that the implementation works well.
This takes care of ensuring that the proper metrics-collection calls are
made when they are expected to happen.
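A sketch of the style of test described above (the collector class, hook name, and metric are illustrative stand-ins, not the PR's actual test code):

```python
# Hedged sketch: patch the statsd client and assert that a status-change hook
# makes exactly one metrics call with the expected name and tags. Assumes the
# `datadog` package is installed; on Python 2.7 the external `mock` package
# would replace unittest.mock.
import unittest
from unittest import mock

from datadog import statsd


class FakeDatadogCollector(object):
    """Illustrative stand-in for the contrib's collector."""

    def handle_task_failed(self, task_name):
        statsd.increment('luigi.task.failure', tags=['task_name:' + task_name])


class DatadogMetricsTest(unittest.TestCase):
    def test_task_failure_emits_increment(self):
        with mock.patch('datadog.statsd.increment') as increment:
            FakeDatadogCollector().handle_task_failed('MyExampleTask')
            increment.assert_called_once_with(
                'luigi.task.failure', tags=['task_name:MyExampleTask'])


if __name__ == '__main__':
    unittest.main()
```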

We've also removed a few `@RPC_METHOD` decorators that weren't actually being
used and weren't required.
This makes sure that we're properly dispatching API and StatsD calls to
Datadog with the proper parameter values.

This doesn't test all the different possible parameter configurations.
This adds a few extra documentation lines for the configuration, so that
users can find all the settings they can tweak for each individual contrib
instead of having to go through each individual contrib's files.
The original implementation was made when 0.16.0 was the latest version of
the library. Since then, there have been a few improvements and bug fixes
that we should be using.

Reading through the release log, there shouldn't be any feature-breaking
changes, so we should be good to update it!
Previously, the getter wasn't a class method and wouldn't work as
expected. In order to ensure that the output is what we expect, we've
added more tests.
There were multiple problems that needed to be solved in order to get the
specs green again. Each individual spec passed when run on its own, but when
run as a group under tox, some of them would pass and others would fail
depending on the tox environment.

It came to my attention that the time function in this file was creating an
issue with other specs because we were not tearing it down as expected.
Also, using setTime within setUp had side effects and unexpected behaviors
(a minimal illustration follows below).

Then, the way that the task_id and task_family were named was also causing
problems with the same specs that were failing before. I'm unsure why this
is the case, but changing only one of them still fails, while changing both
makes the specs green.

Finally, the last spec would always fail because setTime was called AFTER
the task had actually run, which would always cause the measured execution
time to be greater than 0.

My understanding of all of this is still a bit fuzzy, but hey, now the
spec suite passes.
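A minimal illustration of the tearDown issue mentioned above (assumed shape, not the actual spec code): if setUp starts a patch of the clock, tearDown has to stop it, otherwise the frozen time leaks into unrelated specs in the same run.

```python
import time
import unittest
from unittest import mock  # external `mock` package on Python 2.7


class TimedTest(unittest.TestCase):
    def setUp(self):
        # Freeze the clock for this test only.
        self._time_patcher = mock.patch('time.time', return_value=1545000000.0)
        self._time_patcher.start()

    def tearDown(self):
        # Without this, the patched clock would bleed into every later test.
        self._time_patcher.stop()

    def test_clock_is_frozen(self):
        self.assertEqual(time.time(), 1545000000.0)


if __name__ == '__main__':
    unittest.main()
```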
This will force people to implement the methods of this class when they
refer to it.
This allows for less-strict function calls.
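A sketch of the "force people to implement" pattern mentioned above (class and method names are assumptions, not the PR's actual interface): a base collector whose methods raise NotImplementedError, so any concrete collector has to implement them.

```python
class MetricsCollector(object):
    """Illustrative abstract base for metrics collectors."""

    def handle_task_started(self, task):
        raise NotImplementedError

    def handle_task_failed(self, task):
        raise NotImplementedError


class NoOpMetricsCollector(MetricsCollector):
    """Concrete collector that satisfies the interface but does nothing."""

    def handle_task_started(self, task):
        pass

    def handle_task_failed(self, task):
        pass
```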
@thisiscab
Contributor Author

@dlstadther Sorry for all the force pushes; I've had trouble with my local environment, so I decided to test my commits directly on Travis :)

Now that the specs are all passing, I've tested this branch on our development system and everything seems to work as expected. I think we can finally 🚢 it!

Thank you so much for the help!

The underlying configuration of the Datadog metrics collector is a
property, so it makes more sense that it's also a property when used
within the class itself.
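A small sketch of that property convention (names are assumptions, not the contrib's actual attributes):

```python
class DatadogMetricsCollector(object):
    """Illustrative collector that exposes its configuration as a property."""

    def __init__(self, environment='development'):
        self._environment = environment

    @property
    def default_tags(self):
        """Tags attached to every metric sent to Datadog."""
        return ['environment:{}'.format(self._environment), 'application:luigi']

    def tags_for(self, task_family):
        # Use the property instead of rebuilding the tag list here.
        return self.default_tags + ['task_name:{}'.format(task_family)]
```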
@dfeldstarsky
Contributor

Hey yall - just wanna say that we're really excited to be able to use this soon! Good luck getting it past the finish line, can't wait to try to measure our pipeline w/ this integration.

@dlstadther merged commit b2a5759 into spotify:master on Dec 17, 2018
@dlstadther
Collaborator

Thanks for the long, hard work @cabouffard !

@thisiscab
Contributor Author

🎉
