read_capacity_pct unused? #9
Also, since it is related, how does this […]?
Hey @traviscrawford, are you sure your […]?
I did a bit more digging and didn't see anything provided by AWS for explicitly throttling consumption of read units. I did stumble across an article discussing a technique for doing this at the application level, but I don't see anything like this in […]. I assume the […]. Anyway, thanks for looking at this with me. This is blocking a production deploy.
Only other thought I had: is this throttling enforced per Spark executor? If not, how is it enforced? If it's per executor, a distributed Spark job would obviously consume capacity differently depending on the cluster size.
Hi @findchris, thanks for reporting this. You're correct - […]. Great to hear you find this useful.
Here we add support for the `rate_limit_per_segment` option. We use a Guava RateLimiter for each scan segment, treating DynamoDB consumed read capacity units as RateLimiter permits. Connects to #9
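A minimal sketch of the technique this PR describes, assuming the AWS Java SDK v1 scan API; the table name and per-segment rate are illustrative, and this is not the project's actual code:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, ReturnConsumedCapacity, ScanRequest}
import com.google.common.util.concurrent.RateLimiter

// One RateLimiter per scan segment; permits stand in for DynamoDB
// consumed read capacity units.
val rateLimitPerSegment = 25.0 // hypothetical permits (read units) per second
val limiter = RateLimiter.create(rateLimitPerSegment)
val client = AmazonDynamoDBClientBuilder.defaultClient()

var startKey: java.util.Map[String, AttributeValue] = null
do {
  val request = new ScanRequest()
    .withTableName("users") // hypothetical table
    .withSegment(0)
    .withTotalSegments(1)
    .withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL)
    .withExclusiveStartKey(startKey)
  val result = client.scan(request)
  // Pay for the page we just read: block until its consumed read units
  // are available as permits, smoothing throughput to the target rate.
  val consumed = result.getConsumedCapacity.getCapacityUnits
  limiter.acquire(math.max(1, math.ceil(consumed).toInt))
  startKey = result.getLastEvaluatedKey
} while (startKey != null)
```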
Hi @findchris - what do you think about the approach in https://github.com/traviscrawford/spark-dynamodb/compare/travis/ratelimit ? Are you able to try this in your environment prior to publishing a new release?
Thanks for chiming in here @traviscrawford. The pull request looks solid and straightforward. 👍 Can you help me understand the semantics of `rate_limit_per_segment`? I'd be happy to test this in QA, but I'm still relatively new to the Scala world. I'm currently using this project like so: […]
Besides a version bump (your call), got an easy way for me to test this out? Out of curiosity, how do you implement your internal DynamoDB scanner's rate limiting?
Take a look at http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScanGuidelines.html for an overview of how best to scan DynamoDB tables. When tuning scans, there are a number of variables: the number of scan segments, the rate limit per segment, the table's provisioned read capacity units, and the table size.
I generally scan tables as part of an ETL process, and start with a single segment that uses about 20% of the provisioned read capacity units. If that's fast enough I don't look much further. Sometimes the scan is slow because a small table has very low provisioned read capacity, so I increase that. Sometimes the table is large and adding scan segments greatly speeds up the scan. Given the above variables that affect table scans, what do your tables look like?
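As a rough illustration of how those variables combine, one way to derive a per-segment rate (the numbers here are made up, not from the project):

```scala
// Back-of-the-envelope sizing: cap the whole scan at a fraction of the
// table's provisioned read capacity, split evenly across segments.
val provisionedReadCapacity = 1000.0 // table's provisioned read units (example)
val targetFraction = 0.20            // consume ~20% of provisioned capacity
val totalSegments = 8                // parallel scan segments (example)

val rateLimitPerSegment = provisionedReadCapacity * targetFraction / totalSegments
// => 25.0 read units per second for each segment's RateLimiter
```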
Thanks for the link and thoughts @traviscrawford. I haven't had to do much […]. Like you mention, I'll just have to experiment with different values for the `rate_limit_per_segment` option. I noticed that you just added […].
I just published version […].
👍 I'll keep you posted on my testing.
@traviscrawford - I just tried this out (sorry for the delay), but it looks like there might be a […].
@traviscrawford - I'm still trying to resolve this. It appears that Spark depends on an older version of Guava than this library does. I'm looking into "shading" the dependency, but don't have much experience there.
Still no luck. I've tried: […]
This StackOverflow answer would seem to point to the solution, but I've had no luck using […]. To reiterate, I think the answer lies with shading, but I have no experience using this technique, and my attempts above failed. @traviscrawford, can you offer any thoughts?
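For reference, the shading setup such answers usually describe looks something like this in build.sbt, assuming sbt-assembly 0.14+ (the rename target is arbitrary):

```scala
// build.sbt: rename Guava's packages inside the assembled fat jar so the
// connector's Guava copy cannot collide with the one Spark puts on the classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)
```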
@findchris are you marking your Spark dependencies as `provided`?
Thanks for chiming in @timchan-lumoslabs. Here is what I have in my build.sbt's `libraryDependencies`: […]
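A typical arrangement for this kind of job marks Spark as provided so the cluster's jars (and its bundled Guava) win at runtime; a sketch with illustrative versions and assumed artifact coordinates:

```scala
// Illustrative only - versions and the connector coordinates are assumptions.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided",
  "com.github.traviscrawford" % "spark-dynamodb" % "0.0.5"
)
```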
See anything obvious for me to change?
@traviscrawford / @timchan-lumoslabs - Any insights as to what might be going on? I just need to make sure […].
Hi @findchris - my hunch is we could downgrade to the version of Guava used by Spark and make this issue go away for you. Will take a look...
@traviscrawford - That's always an option. I have to imagine there is a solution to this problem, and I believe it involves shading - I'm just not sure how to do it. Maybe you can shade Guava? Either way, I appreciate the help!
@findchris @timchan-lumoslabs Question for you - internally at Medium we have tried a few different approaches to integrating DynamoDB with Spark, and the approach we're planning on using going forward is: using a Spark job to back up the DynamoDB table as JSON on S3. We chose JSON over a binary format such as Parquet because DynamoDB tables do not have a schema, so we can avoid the schema issue during the backup phase. Then we simply read the DynamoDB backups like you normally would read JSON from a Spark job. At this point DynamoDB is not in the picture at all.

Would the above approach work for you? If so, I could see if we can open source the Spark-based DynamoDB scanner. If we're all using the same approach and code in production, it would be easier to make sure it handles all the corner cases.
@traviscrawford - I appreciate the collaboration and interest in sharing more useful code. Your suggestion sounds interesting. Let me share my use case and we can see if they're compatible. I need to scan an entire DynamoDB table (optimally I'd use a filter expression to eliminate some records lacking certain attributes), project out a subset of the returned attributes, do some light transformation of the data, and then write the results to S3 as CSV. So for my use case, I'm not doing a straight backup to S3. I suppose what you describe would work, with the disadvantage that I'd need two Spark jobs: one to back up the table to S3 as JSON, and another to read the JSON and operate on it. Does that help shed light on my usage?
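That second job might look roughly like this sketch, assuming Spark 2.x's DataFrame API and made-up S3 paths and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dynamodb-backup-etl").getOrCreate()

// Read the JSON backup written by the scan/backup job (path is made up).
val users = spark.read.json("s3a://my-bucket/backups/users/")

// Drop records lacking a required attribute, project a subset of columns.
val out = users
  .filter(users("email").isNotNull)
  .select("user_id", "email", "created_at")

// Write the result as CSV (built-in writer in Spark 2.x).
out.write.option("header", "true").csv("s3a://my-bucket/exports/users-csv/")
```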
@traviscrawford Our use case is somewhat similar to yours. We are basically forklifting the data in a DynamoDB table attribute to Redshift. Item attribute values are JSON.
@findchris We have some use cases like yours too, where we need to filter & project some records in the DynamoDB table before processing. DynamoDB filters are interesting: you actually scan the full table, and filters are applied before returning rows to the client. Since behind the scenes we're scanning the full table, we simply write the whole thing out to S3. Then a separate job processes the backups. Would y'all find it useful if we published the backup job we're using?
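To illustrate the point about filters, a sketch against the AWS Java SDK v1 (table and attribute names are made up):

```scala
import com.amazonaws.services.dynamodbv2.model.{ReturnConsumedCapacity, ScanRequest}

// The filter runs server-side *after* items are read, so this scan still
// consumes read capacity for every item examined, not just those returned.
val request = new ScanRequest()
  .withTableName("users") // made-up table
  .withFilterExpression("attribute_exists(email)") // made-up attribute
  .withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL)
```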
Yes. AWS has their Data Pipeline stuff, but it feels more consistent to have it all in Spark. Does this backup job happen to have a throttling mechanism built in, @traviscrawford? ;-)
@findchris I just switched to using the Guava dependency that Spark provides, so you shouldn't have the issue anymore. Can you […]?
@traviscrawford - A bit embarrassed to ask, but I'll need more hand-holding to do what you suggest. I use […]. Regardless, I appreciate the work. Looking forward to testing this all out.
@findchris You're building the […]
@timchan-lumoslabs I appreciate the tips! I did what you said, and it compiled OK as […]. However, @traviscrawford, when I run my job I'm seeing a new stacktrace: […]
I was concerned […]. Any idea what's up? I hope to investigate more tomorrow.
A follow-up: […]
However, when I explicitly depend on Guava 14.0.1 in my sbt file ([…]), things work. This was using the manual steps provided by @timchan-lumoslabs to build […]. @traviscrawford, not sure why I explicitly had to specify the Guava dependency, but it seems good now. Merge?
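Presumably the explicit pin was along these lines in build.sbt (14.0.1 being the Guava version Spark bundled at the time):

```scala
// Force the Guava version that matches Spark's bundled copy, so the
// connector resolves the same RateLimiter implementation at runtime.
libraryDependencies += "com.google.guava" % "guava" % "14.0.1"
```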
Thanks for testing this! It's been published to Maven Central.
Is the current version then […]?
Latest version is […].
Thanks for all the correspondence! Closing this out now.
Greetings!

First run and this library works as advertised (thanks @traviscrawford for the open source contribution 👍), except the `readCapacityPct` option doesn't appear to be respected.

My snippet: […]

The issue I'm seeing: I watched the read unit consumption jump to ~85%, which won't fly in a production environment. Am I configuring the `read_capacity_pct` option correctly?

From what I see, `readCapacityPct` gets declared but is not used elsewhere.

Cheers.
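A hypothetical reconstruction of the reported setup (the DataSource name is assumed; region and table are placeholders):

```scala
// A full-table scan that should be capped at 20% of provisioned read capacity.
val users = sqlContext.read
  .format("com.github.traviscrawford.spark.dynamodb") // assumed DataSource name
  .option("region", "us-east-1")     // placeholder region
  .option("read_capacity_pct", "20") // the option the report says is ignored
  .load("users")                     // placeholder table name
```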