Skip to content

Tdl 25859/handle s3 files race condition#67

Merged
rdeshmukh15 merged 18 commits intomasterfrom
TDL-25859/handle-s3-files-race-condition
Nov 20, 2024
Merged

Tdl 25859/handle s3 files race condition#67
rdeshmukh15 merged 18 commits intomasterfrom
TDL-25859/handle-s3-files-race-condition

Conversation

@rdeshmukh15
Copy link
Copy Markdown
Member

@rdeshmukh15 rdeshmukh15 commented Aug 9, 2024

Description of change

This PR addresses the issue of S3 file race conditions where the last_modified timestamps of S3 files are being updated while extractions are in progress, causing the bookmark time to advance beyond the current execution time.

Resolution:

  • Record the sync_start_time at the beginning of the extraction.
  • Check if the file's last_modified is greater than the sync_start_time.
    • If so, store the sync_start_time as the bookmark in the state.
    • Otherwise, store the file's last_modified timestamp in the state file.

Manual QA steps

  • tested on the client connection

Risks

Rollback steps

  • revert this branch

Comment thread tap_s3_csv/sync.py
Comment on lines +43 to +46
if s3_file['last_modified'] < sync_start_time:
state = singer.write_bookmark(state, table_name, 'modified_since', s3_file['last_modified'].isoformat())
else:
state = singer.write_bookmark(state, table_name, 'modified_since', sync_start_time.isoformat())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add unit test for this change.

Copy link
Copy Markdown
Contributor

@RushiT0122 RushiT0122 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some suggestions in-line.

Comment thread tests/unittests/test_sync_stream.py Outdated
Comment on lines +35 to +38
# Case when file is newer than sync_start_time
{
"file_last_modified": datetime(2024, 8, 14, 12, 0, 0),
"sync_start_time": datetime(2024, 8, 14, 12, 0, 0),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the comment and indentation.

Suggested change
# Case when file is newer than sync_start_time
{
"file_last_modified": datetime(2024, 8, 14, 12, 0, 0),
"sync_start_time": datetime(2024, 8, 14, 12, 0, 0),
# Case when file is the same as sync_start_time
{
"file_last_modified": datetime(2024, 8, 14, 12, 0, 0),
"sync_start_time": datetime(2024, 8, 14, 12, 0, 0),

Comment thread tap_s3_csv/__init__.py Outdated
Comment on lines +78 to +79
now = datetime.now()
sync_start_time = singer_utils.strptime_with_tz(now.strftime("%Y-%m-%dT%H:%M:%SZ"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below code will better readability.

Suggested change
now = datetime.now()
sync_start_time = singer_utils.strptime_with_tz(now.strftime("%Y-%m-%dT%H:%M:%SZ"))
now_str = datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
sync_start_time = singer_utils.strptime_with_tz(now_str)

Comment thread tests/unittests/test_sync_stream.py Outdated
Depending on whether the last_modified date is earlier or later than sync_start_time,
the bookmark will either be updated to the file's last_modified or the sync_start_time.
"""
test_cases = [
Copy link
Copy Markdown
Contributor

@RushiT0122 RushiT0122 Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate out test scenarios rather than combining all scenarios in one test. Optionally you can parameterize it if there is a repetition of code. This makes it easier to identify which specific scenario fails if a test does not pass.

@rdeshmukh15 rdeshmukh15 merged commit 10b9798 into master Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants