Service stops when an exception occurs in table.py:ensure_provisioning #134

Closed
doapp-jeremy opened this Issue Mar 25, 2014 · 20 comments

Projects

None yet

2 participants

@doapp-jeremy

I noticed this exception in my log. It looks like my reads of cloud watch metrics were throttled, I'm not sure why, I'm not aware of a read limit on cloud watch metric. The service is checking every 5 minutes, and I don't have any other code that's reading a bunch of cloud watch metrics either.

I think the correct behavior would be to catch all exceptions, send the exception to the SNS topic, and then restart the service.

2014-03-24 22:16:28,799 - dynamic-dynamodb - ERROR - BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>Throttling</Code>
    <Message>Rate exceeded</Message>
  </Error>
  <RequestId>f2532ac6-b3a1-11e3-9764-e376b04113f7</RequestId>
</ErrorResponse>
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/__init__.py", line 50, in run
    table.ensure_provisioning(table_name, table_key)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/table.py", line 29, in ensure_provisioning
    table_name, key_name)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/table.py", line 64, in __ensure_provisioning_reads
    get_consumed_read_units_percent(table_name)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/statistics/table.py", line 23, in get_consumed_read_units_percent
    table_name, time_frame, 'ConsumedReadCapacityUnits')
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/statistics/table.py", line 140, in __get_aws_metric
    unit='Count')
  File "/usr/lib/python2.6/site-packages/boto/ec2/cloudwatch/__init__.py", line 240, in get_metric_statistics
    [('member', Datapoint)])
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 1157, in get_list
    raise self.ResponseError(response.status, response.reason, body)
BotoServerError: BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>Throttling</Code>
    <Message>Rate exceeded</Message>
  </Error>
  <RequestId>f2532ac6-b3a1-11e3-9764-e376b04113f7</RequestId>
</ErrorResponse>
@sebdah sebdah added this to the 1.10.x milestone Mar 28, 2014
@sebdah sebdah self-assigned this Mar 28, 2014
@sebdah
Owner
sebdah commented Mar 28, 2014

Thanks for the report. I have looked into it and the exception should now be catched and the daemon shouldn't die, but rather retry 3 times before failing. I have released a test version for this

pip install dynamic-dynamodb==1.10.2b3
@sebdah sebdah added a commit that referenced this issue Mar 28, 2014
@sebdah Added release notes for #134 and #135 9c97b7f
@sebdah
Owner
sebdah commented Mar 29, 2014

This has now been released in version 1.10.2. Thanks for the bug report!

@sebdah sebdah closed this Mar 29, 2014
@doapp-jeremy

I installed 1.10.2b3 and overnight, I still ran into the same issue.

2014-03-28 18:49:25,999 - dynamic-dynamodb - ERROR - BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>Throttling</Code>
    <Message>Rate exceeded</Message>
  </Error>
  <RequestId>af69ee04-b6a9-11e3-afc4-15ebe919aca3</RequestId>
</ErrorResponse>
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/__init__.py", line 52, in run
    table.ensure_provisioning(table_name, table_key)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/table.py", line 29, in ensure_provisioning
    table_name, key_name)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/table.py", line 109, in __ensure_provisioning_reads
    table_stats.get_consumed_read_units_percent(table_name)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/statistics/table.py", line 24, in get_consumed_read_units_percent
    table_name, time_frame, 'ConsumedReadCapacityUnits')
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/statistics/table.py", line 157, in __get_aws_metric
    unit='Count')
  File "/usr/lib/python2.6/site-packages/boto/ec2/cloudwatch/__init__.py", line 240, in get_metric_statistics
    [('member', Datapoint)])
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 1157, in get_list
    raise self.ResponseError(response.status, response.reason, body)
BotoServerError: BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>Throttling</Code>
    <Message>Rate exceeded</Message>
  </Error>
  <RequestId>af69ee04-b6a9-11e3-afc4-15ebe919aca3</RequestId>
</ErrorResponse>

[ec2-user@domU-12-31-39-0A-36-9E ~]$ dynamic-dynamodb --version
Dynamic DynamoDB version: 1.10.2b3
@sebdah sebdah reopened this Mar 29, 2014
@sebdah
Owner
sebdah commented Mar 29, 2014

Thanks for getting back on this. I think there are two problems here. Firstly that the service stopped when this error occurred and secondly that we seem to run into some limit in number of requests (?) to CloudWatch.

I'll contact AWS to get more information and a suggested approach to manage the exceeded rate limit. Currently Dynamic DynamoDB will contact CloudWatch 4 times per table or GSI that we are looking at. Shouldn't be too much in my opinion, but I'll see how we can handle that.

Have the first problem been fixed (it will catch the error three times and then crash if the problem persists)?

@sebdah
Owner
sebdah commented Mar 29, 2014

Could you give me a hint about how many tables + GSIs you have Dynamic DynamoDB configure to monitor?

@doapp-jeremy

Yeah, it doesn't seem like it should be anywhere near a cloudwatch rate limit. It also only happens about once a day.

I don't see anything in the log file that indicates it was attempted more than once. Instead of crashing, could a message be sent to the sns topic, otherwise I have to continue to check if it's still running.

@sebdah
Owner
sebdah commented Mar 29, 2014

Currently SNS topics are configured on a per table / GSI basis. I have planned, and now opened an issue (#136), to implement SNS topic support on a global level. That would make it possible to get a notification when an error occurs. I will not be able to implement that right now, but maybe during next week.

@doapp-jeremy

I'm currently monitoring 3 tables and 1 gsi, but once I get all my variables worked out, I intend to monitor quite a bit more.

Also, I've got a couple quick questtons for you.

  1. Is it possible to only monitor writes on a table. I've got a couple tables where I'm writing daily and aggregating at night. So I have a job that increases the read throughput, runs, and then decreases it. So I'd really only want to monitor writes on this table. It seems the lowest values I can enter are 1 for increase/decrease and this causes problems.
    1a) Along this same line, is it possible to only increase when throttling occurs?

  2. Is the maintenance window for all monitoring? Is it possible to only do decrease during certain hours?

@sebdah
Owner
sebdah commented Mar 29, 2014

Thanks, I'll contact AWS to get more info about the CloudWatch issue.

  1. No. But definitely interesting.
    1a) No, only the other way around. Haven't heard about this use case (although I understand it) before. The throttling support is quite new.
  2. Yes, it is. I have actually been thinking about this lately. But I got confused now, do you mean that you only want to make decreases and no increases within the window? Or do you want to allow increases at any time and decreases only within certain windows?

Please file issues for the separate features if you'd like them to get on the road map, I think that it is interesting ideas.

@doapp-jeremy

1a) I think this is the way I'd most want use increasing tables, just want to increase when throttling occurs.
2) Yeah, since there is no limit on the # of increases, I'd want to be able to do that any time, but limit decreases to certain hours of the day.

Another option I was thinking about. It would be nice to decrease to a percentage of consumed capacity. For example, if my current provisioned throughput is 200, and the last cloud watch reading was 100, it would be nice to set to 125% of consumed, so in this case, set to 125 provisioned. I think this would maximize the effectiveness of decreases.

@sebdah
Owner
sebdah commented Mar 30, 2014

I have opened #137 to address question 1.

@sebdah
Owner
sebdah commented Mar 30, 2014

The CloudWatch team will be looking into the reason behind the exceeded rate limit in your request above. To make Dynamic DynamoDB handle this as good as possible, I will implement a retry mechanism with an exponential backoff strategy.

@sebdah
Owner
sebdah commented Apr 1, 2014

I have now implemented an exponential backoff strategy using the retrying package. I published an alpha version of it to PyPI just now. Install it using pip install dynamic-dynamodb==1.10.6a1.

I cannot reproduce the error in my environment so I would be happy if you could test this and see how it works for you.

All the best!

@doapp-jeremy

Did the config file format change? I'm getting this error when trying to start the service after installing 1.10.6a1

2014-04-01T13:07:28.104109778: do_start:Starting dynamicdynamodb
Traceback (most recent call last):
File "/usr/bin/dynamic-dynamodb", line 22, in
import dynamic_dynamodb
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/init.py", line 29, in
from dynamic_dynamodb.core import dynamodb, gsi, table
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/dynamodb.py", line 12, in
from dynamic_dynamodb.log_handler import LOGGER as logger
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/log_handler.py", line 25, in
import config_handler
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/config_handler.py", line 4, in
CONFIGURATION = config.get_configuration()
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/config/init.py", line 118, in get_configuration
configuration['tables'] = get_config_table_options(conf_file_options)
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/config/__init
.py", line 191, in __get_config_table_options
options[table_name]['gsis'][gsi_name][option] = opt
KeyError: 'gsis'
2014-04-01T13:07:28.271445028: Failed in starting dynamicdynamodb service

@sebdah
Owner
sebdah commented Apr 1, 2014

Nah, this issue was smashed in a separate branch. Test pip install cumulus==1.10.7a1 instead.

@doapp-jeremy

Ok, I got it running, I'll let you know if I hit the cloud watch throttling. It usually happened within a day.

@sebdah
Owner
sebdah commented Apr 1, 2014

Ok, great, thanks!

@sebdah
Owner
sebdah commented Apr 5, 2014

@doapp-jeremy did you see this exception in the latest version?

Have a great weekend!

@doapp-jeremy

Nope, haven't seen it. Thanks!

@sebdah
Owner
sebdah commented Apr 7, 2014

I have now released 1.10.7, containing the fix for this issue. Thanks for the report and testing.

@sebdah sebdah closed this Apr 7, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment