Service stops when an exception occurs in table.py:ensure_provisioning #134

doapp-jeremy · 2014-03-25T12:57:40Z

I noticed this exception in my log. It looks like my reads of cloud watch metrics were throttled, I'm not sure why, I'm not aware of a read limit on cloud watch metric. The service is checking every 5 minutes, and I don't have any other code that's reading a bunch of cloud watch metrics either.

I think the correct behavior would be to catch all exceptions, send the exception to the SNS topic, and then restart the service.

2014-03-24 22:16:28,799 - dynamic-dynamodb - ERROR - BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>Throttling</Code>
    <Message>Rate exceeded</Message>
  </Error>
  <RequestId>f2532ac6-b3a1-11e3-9764-e376b04113f7</RequestId>
</ErrorResponse>
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/__init__.py", line 50, in run
    table.ensure_provisioning(table_name, table_key)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/table.py", line 29, in ensure_provisioning
    table_name, key_name)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/table.py", line 64, in __ensure_provisioning_reads
    get_consumed_read_units_percent(table_name)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/statistics/table.py", line 23, in get_consumed_read_units_percent
    table_name, time_frame, 'ConsumedReadCapacityUnits')
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/statistics/table.py", line 140, in __get_aws_metric
    unit='Count')
  File "/usr/lib/python2.6/site-packages/boto/ec2/cloudwatch/__init__.py", line 240, in get_metric_statistics
    [('member', Datapoint)])
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 1157, in get_list
    raise self.ResponseError(response.status, response.reason, body)
BotoServerError: BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>Throttling</Code>
    <Message>Rate exceeded</Message>
  </Error>
  <RequestId>f2532ac6-b3a1-11e3-9764-e376b04113f7</RequestId>
</ErrorResponse>

The text was updated successfully, but these errors were encountered:

sebdah · 2014-03-28T06:34:02Z

Thanks for the report. I have looked into it and the exception should now be catched and the daemon shouldn't die, but rather retry 3 times before failing. I have released a test version for this

pip install dynamic-dynamodb==1.10.2b3

sebdah · 2014-03-29T10:51:06Z

This has now been released in version 1.10.2. Thanks for the bug report!

doapp-jeremy · 2014-03-29T12:37:51Z

I installed 1.10.2b3 and overnight, I still ran into the same issue.

2014-03-28 18:49:25,999 - dynamic-dynamodb - ERROR - BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>Throttling</Code>
    <Message>Rate exceeded</Message>
  </Error>
  <RequestId>af69ee04-b6a9-11e3-afc4-15ebe919aca3</RequestId>
</ErrorResponse>
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/__init__.py", line 52, in run
    table.ensure_provisioning(table_name, table_key)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/table.py", line 29, in ensure_provisioning
    table_name, key_name)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/table.py", line 109, in __ensure_provisioning_reads
    table_stats.get_consumed_read_units_percent(table_name)
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/statistics/table.py", line 24, in get_consumed_read_units_percent
    table_name, time_frame, 'ConsumedReadCapacityUnits')
  File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/statistics/table.py", line 157, in __get_aws_metric
    unit='Count')
  File "/usr/lib/python2.6/site-packages/boto/ec2/cloudwatch/__init__.py", line 240, in get_metric_statistics
    [('member', Datapoint)])
  File "/usr/lib/python2.6/site-packages/boto/connection.py", line 1157, in get_list
    raise self.ResponseError(response.status, response.reason, body)
BotoServerError: BotoServerError: 400 Bad Request
<ErrorResponse xmlns="http://monitoring.amazonaws.com/doc/2010-08-01/">
  <Error>
    <Type>Sender</Type>
    <Code>Throttling</Code>
    <Message>Rate exceeded</Message>
  </Error>
  <RequestId>af69ee04-b6a9-11e3-afc4-15ebe919aca3</RequestId>
</ErrorResponse>

[ec2-user@domU-12-31-39-0A-36-9E ~]$ dynamic-dynamodb --version
Dynamic DynamoDB version: 1.10.2b3

sebdah · 2014-03-29T15:15:45Z

Thanks for getting back on this. I think there are two problems here. Firstly that the service stopped when this error occurred and secondly that we seem to run into some limit in number of requests (?) to CloudWatch.

I'll contact AWS to get more information and a suggested approach to manage the exceeded rate limit. Currently Dynamic DynamoDB will contact CloudWatch 4 times per table or GSI that we are looking at. Shouldn't be too much in my opinion, but I'll see how we can handle that.

Have the first problem been fixed (it will catch the error three times and then crash if the problem persists)?

sebdah · 2014-03-29T15:21:44Z

Could you give me a hint about how many tables + GSIs you have Dynamic DynamoDB configure to monitor?

doapp-jeremy · 2014-03-29T15:22:26Z

Yeah, it doesn't seem like it should be anywhere near a cloudwatch rate limit. It also only happens about once a day.

I don't see anything in the log file that indicates it was attempted more than once. Instead of crashing, could a message be sent to the sns topic, otherwise I have to continue to check if it's still running.

sebdah · 2014-03-29T15:26:52Z

Currently SNS topics are configured on a per table / GSI basis. I have planned, and now opened an issue (#136), to implement SNS topic support on a global level. That would make it possible to get a notification when an error occurs. I will not be able to implement that right now, but maybe during next week.

doapp-jeremy · 2014-03-29T15:27:36Z

I'm currently monitoring 3 tables and 1 gsi, but once I get all my variables worked out, I intend to monitor quite a bit more.

Also, I've got a couple quick questtons for you.

Is it possible to only monitor writes on a table. I've got a couple tables where I'm writing daily and aggregating at night. So I have a job that increases the read throughput, runs, and then decreases it. So I'd really only want to monitor writes on this table. It seems the lowest values I can enter are 1 for increase/decrease and this causes problems.
1a) Along this same line, is it possible to only increase when throttling occurs?
Is the maintenance window for all monitoring? Is it possible to only do decrease during certain hours?

sebdah · 2014-03-29T15:34:57Z

Thanks, I'll contact AWS to get more info about the CloudWatch issue.

No. But definitely interesting.
1a) No, only the other way around. Haven't heard about this use case (although I understand it) before. The throttling support is quite new.
Yes, it is. I have actually been thinking about this lately. But I got confused now, do you mean that you only want to make decreases and no increases within the window? Or do you want to allow increases at any time and decreases only within certain windows?

Please file issues for the separate features if you'd like them to get on the road map, I think that it is interesting ideas.

doapp-jeremy · 2014-03-29T15:41:28Z

1a) I think this is the way I'd most want use increasing tables, just want to increase when throttling occurs.
2) Yeah, since there is no limit on the # of increases, I'd want to be able to do that any time, but limit decreases to certain hours of the day.

Another option I was thinking about. It would be nice to decrease to a percentage of consumed capacity. For example, if my current provisioned throughput is 200, and the last cloud watch reading was 100, it would be nice to set to 125% of consumed, so in this case, set to 125 provisioned. I think this would maximize the effectiveness of decreases.

sebdah · 2014-03-30T05:47:09Z

I have opened #137 to address question 1.

sebdah · 2014-03-30T07:35:42Z

The CloudWatch team will be looking into the reason behind the exceeded rate limit in your request above. To make Dynamic DynamoDB handle this as good as possible, I will implement a retry mechanism with an exponential backoff strategy.

sebdah · 2014-04-01T10:01:12Z

I have now implemented an exponential backoff strategy using the retrying package. I published an alpha version of it to PyPI just now. Install it using pip install dynamic-dynamodb==1.10.6a1.

I cannot reproduce the error in my environment so I would be happy if you could test this and see how it works for you.

All the best!

doapp-jeremy · 2014-04-01T13:09:23Z

Did the config file format change? I'm getting this error when trying to start the service after installing 1.10.6a1

2014-04-01T13:07:28.104109778: do_start:Starting dynamicdynamodb
Traceback (most recent call last):
File "/usr/bin/dynamic-dynamodb", line 22, in
import dynamic_dynamodb
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/init.py", line 29, in
from dynamic_dynamodb.core import dynamodb, gsi, table
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/core/dynamodb.py", line 12, in
from dynamic_dynamodb.log_handler import LOGGER as logger
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/log_handler.py", line 25, in
import config_handler
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/config_handler.py", line 4, in
CONFIGURATION = config.get_configuration()
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/config/init.py", line 118, in get_configuration
configuration['tables'] = get_config_table_options(conf_file_options)
File "/usr/lib/python2.6/site-packages/dynamic_dynamodb/config/__init.py", line 191, in __get_config_table_options
options[table_name]['gsis'][gsi_name][option] = opt
KeyError: 'gsis'
2014-04-01T13:07:28.271445028: Failed in starting dynamicdynamodb service

sebdah · 2014-04-01T13:15:15Z

Nah, this issue was smashed in a separate branch. Test pip install cumulus==1.10.7a1 instead.

doapp-jeremy · 2014-04-01T13:16:34Z

Ok, I got it running, I'll let you know if I hit the cloud watch throttling. It usually happened within a day.

sebdah · 2014-04-01T13:17:05Z

Ok, great, thanks!

sebdah · 2014-04-05T07:45:04Z

@doapp-jeremy did you see this exception in the latest version?

Have a great weekend!

doapp-jeremy · 2014-04-05T12:17:23Z

Nope, haven't seen it. Thanks!

sebdah · 2014-04-07T07:03:03Z

I have now released 1.10.7, containing the fix for this issue. Thanks for the report and testing.

sebdah added type: enhancement and removed type: enhancement labels Mar 28, 2014

sebdah added this to the 1.10.x milestone Mar 28, 2014

sebdah self-assigned this Mar 28, 2014

sebdah added a commit that referenced this issue Mar 28, 2014

Implemented fix (but blinded by bad exception) for #134

2e85482

sebdah added the status: in progress label Mar 28, 2014

sebdah added a commit that referenced this issue Mar 28, 2014

Added release notes for #134 and #135

9c97b7f

sebdah added status: ready and removed status: in progress labels Mar 29, 2014

sebdah closed this as completed Mar 29, 2014

sebdah reopened this Mar 29, 2014

sebdah added status: in progress and removed status: ready labels Mar 30, 2014

sebdah added a commit that referenced this issue Apr 1, 2014

Added exponential backoff strategy #134

cd172b1

furio mentioned this issue Apr 5, 2014

Service stops with exception with BOTO and error 500 #144

Closed

sebdah closed this as completed Apr 7, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Service stops when an exception occurs in table.py:ensure_provisioning #134

Service stops when an exception occurs in table.py:ensure_provisioning #134

doapp-jeremy commented Mar 25, 2014

sebdah commented Mar 28, 2014

sebdah commented Mar 29, 2014

doapp-jeremy commented Mar 29, 2014

sebdah commented Mar 29, 2014

sebdah commented Mar 29, 2014

doapp-jeremy commented Mar 29, 2014

sebdah commented Mar 29, 2014

doapp-jeremy commented Mar 29, 2014

sebdah commented Mar 29, 2014

doapp-jeremy commented Mar 29, 2014

sebdah commented Mar 30, 2014

sebdah commented Mar 30, 2014

sebdah commented Apr 1, 2014

doapp-jeremy commented Apr 1, 2014

sebdah commented Apr 1, 2014

doapp-jeremy commented Apr 1, 2014

sebdah commented Apr 1, 2014

sebdah commented Apr 5, 2014

doapp-jeremy commented Apr 5, 2014

sebdah commented Apr 7, 2014

Service stops when an exception occurs in table.py:ensure_provisioning #134

Service stops when an exception occurs in table.py:ensure_provisioning #134

Comments

doapp-jeremy commented Mar 25, 2014

sebdah commented Mar 28, 2014

sebdah commented Mar 29, 2014

doapp-jeremy commented Mar 29, 2014

sebdah commented Mar 29, 2014

sebdah commented Mar 29, 2014

doapp-jeremy commented Mar 29, 2014

sebdah commented Mar 29, 2014

doapp-jeremy commented Mar 29, 2014

sebdah commented Mar 29, 2014

doapp-jeremy commented Mar 29, 2014

sebdah commented Mar 30, 2014

sebdah commented Mar 30, 2014

sebdah commented Apr 1, 2014

doapp-jeremy commented Apr 1, 2014

sebdah commented Apr 1, 2014

doapp-jeremy commented Apr 1, 2014

sebdah commented Apr 1, 2014

sebdah commented Apr 5, 2014

doapp-jeremy commented Apr 5, 2014

sebdah commented Apr 7, 2014