
chore: remove google cloud storage dependency for gcs datalake test using fake-gcs-server #3576

Merged
fracasula merged 3 commits into master from chore.gcs-fake-datalake-test
Jul 13, 2023

Conversation

@achettyiitr (Member) commented Jul 3, 2023

Description

  • Remove the Google Cloud Storage dependency for running the GCS Datalake test by using fake-gcs-server (see the sketch below)
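
As a rough illustration of the approach (a minimal sketch, not the PR's actual test code): the official Go storage client honors the STORAGE_EMULATOR_HOST environment variable, so a test can run fsouza/fake-gcs-server in a container (e.g. `docker run -p 4443:4443 fsouza/fake-gcs-server -scheme http`; the port, flags, bucket, and project names below are illustrative) and exercise the same bucket/object code paths locally:

```go
package gcsfake_test

import (
	"context"
	"io"
	"strings"
	"testing"

	"cloud.google.com/go/storage"
)

func TestUploadAgainstFakeGCS(t *testing.T) {
	// The official client honors STORAGE_EMULATOR_HOST and skips real
	// authentication when it is set, sending all requests to the emulator.
	t.Setenv("STORAGE_EMULATOR_HOST", "localhost:4443")

	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		t.Fatal(err)
	}
	defer client.Close()

	// Create a bucket and upload an object, exactly as the code under test
	// would do against the real service. Names here are illustrative.
	bucket := client.Bucket("gcs-datalake-test")
	if err := bucket.Create(ctx, "test-project", nil); err != nil {
		t.Fatal(err)
	}
	w := bucket.Object("rudder/load.json").NewWriter(ctx)
	if _, err := io.Copy(w, strings.NewReader(`{"event":"test"}`)); err != nil {
		t.Fatal(err)
	}
	if err := w.Close(); err != nil {
		t.Fatal(err)
	}
}
```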

Notion Ticket

https://www.notion.so/rudderstacks/Remove-GCS-dependency-using-fake-gcs-server-7321148ad63a44daa2d967bfa02dfaf8?pvs=4

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@achettyiitr changed the title from "chore: remove gcs dependency for gcs datalake test using fake-gcs-server" to "chore: remove google cloud storage dependency for gcs datalake test using fake-gcs-server" on Jul 3, 2023
@achettyiitr force-pushed the chore.gcs-fake-datalake-test branch from c9e50a5 to 3582644 on July 3, 2023 09:41
@codecov bot commented Jul 3, 2023

Codecov Report

Patch coverage has no change and project coverage change: +0.07% 🎉

Comparison is base (45955e2) 67.94% compared to head (21b6edc) 68.02%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3576      +/-   ##
==========================================
+ Coverage   67.94%   68.02%   +0.07%     
==========================================
  Files         318      318              
  Lines       50262    50262              
==========================================
+ Hits        34153    34193      +40     
+ Misses      13877    13845      -32     
+ Partials     2232     2224       -8     

see 9 files with indirect coverage changes


@fracasula (Collaborator) left a comment


@achettyiitr This is nice, and it seems like the project is well maintained too. However, I'm concerned about not being able to detect regressions as promptly as we would if we were using the real thing.

Considering that we are already using the real GCS, what do you think we would gain by using this emulator in its place (besides faster tests)?

Review thread on warehouse/integrations/datalake/datalake_test.go (outdated, resolved)
@achettyiitr (Member, Author) commented Jul 4, 2023

> @achettyiitr This is nice, and it seems like the project is well maintained too. However, I'm concerned about not being able to detect regressions as promptly as we would if we were using the real thing.
>
> Considering that we are already using the real GCS, what do you think we would gain by using this emulator in its place (besides faster tests)?

  1. Since we already cover these flows against the real services elsewhere (GCS for BQ, S3 for Snowflake and RS, and Azure for Deltalake), I thought the real GCS wouldn't be required here, at least for the Datalakes.
  2. If we still want a real one, we can probably move it to the e2e tests. WDYT?

@fracasula (Collaborator)

> Since we already cover these flows against the real services elsewhere (GCS for BQ, S3 for Snowflake and RS, and Azure for Deltalake), I thought the real GCS wouldn't be required here, at least for the Datalakes.

So you're saying that when handling data in BQ, the underlying API does the same operations we do for the Datalakes, i.e. creating buckets and uploading files to GCS?

> If we still want a real one, we can probably move it to the e2e tests. WDYT?

Before doing that we would need to sync with @pChondros so that we stay covered until we're ready there.

@achettyiitr (Member, Author) commented Jul 4, 2023

> So you're saying that when handling data in BQ, the underlying API does the same operations we do for the Datalakes, i.e. creating buckets and uploading files to GCS?

Yeah, that's the reason: for the S3 Datalake I am using a MinIO container, and for the Azure Datalake an Azurite container (a sketch of the MinIO pattern follows).
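
For context, the MinIO pattern works because MinIO speaks the S3 wire protocol: the standard S3 client is simply pointed at the container's endpoint. This is an illustrative sketch using aws-sdk-go with MinIO's common default endpoint and credentials, not values taken from this repository:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Point the regular S3 client at the local MinIO container instead of AWS.
	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String("http://localhost:9000"), // MinIO's default port
		Region:           aws.String("us-east-1"),
		Credentials:      credentials.NewStaticCredentials("minioadmin", "minioadmin", ""),
		S3ForcePathStyle: aws.Bool(true), // MinIO expects path-style bucket addressing
	})
	if err != nil {
		log.Fatal(err)
	}

	// The same calls the production code makes against S3 now hit MinIO.
	svc := s3.New(sess)
	if _, err := svc.CreateBucket(&s3.CreateBucketInput{
		Bucket: aws.String("s3-datalake-test"),
	}); err != nil {
		log.Fatal(err)
	}
}
```

Azurite plays the same role for Azure Blob Storage, exposing the Blob endpoints from a local container.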

@fracasula (Collaborator)

> So you're saying that when handling data in BQ, the underlying API does the same operations we do for the Datalakes, i.e. creating buckets and uploading files to GCS?

> Yeah, that's the reason: for the S3 Datalake I am using a MinIO container, and for the Azure Datalake an Azurite container.

I checked the bigquery client and, even though it is calling the Google API, it is hitting different endpoints; it's not actually creating buckets. What am I missing?

Anyway, considering that Google Cloud APIs are properly versioned, I am OK with the change. As long as we don't upgrade the client, we should be fine.

However, we'll have to keep in mind that we would no longer be able to just upgrade the client, run the tests, and release. We would need to check manually, and manual checks mean that people need to remember to do them. So I'm keen to have some regression tests in the e2e suite that would run less often than these. Thoughts? (cc @pChondros)

@pChondros (Contributor)

> However, we'll have to keep in mind that we would no longer be able to just upgrade the client, run the tests, and release. We would need to check manually, and manual checks mean that people need to remember to do them. So I'm keen to have some regression tests in the e2e suite that would run less often than these. Thoughts? (cc @pChondros)

Since we already have the S3 bucket in e2e, introducing GCS should just mean one more bucket name and one more test. I guess for the integration tests we can use the change suggested by Akash, since these tests run on each push and PR, accepting the doubt introduced by the external mock. We just need to add the e2e test as well, so that we can have the final go-ahead on each release. Delegating this "completeness" to e2e means we need to be extra vigilant that the e2e tests are checked before each release (and maybe before patch releases too, if they affect this functionality).

@achettyiitr (Member, Author) commented Jul 4, 2023

> I checked the bigquery client and, even though it is calling the Google API, it is hitting different endpoints; it's not actually creating buckets. What am I missing?

  • I was referring to load-file generation, since that is all the Datalakes do. It is the same for BQ as well, because BQ also uses the GCSManager to generate the load files and upload them.
  • Once the load files are generated, BQ has an extra step that loads the data into the tables.
  • Creating the bucket/container is handled by the Filemanager during Upload, and only for MinIO and Azure (see the sketch after this list).
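
To illustrate the "create the bucket/container during Upload" behavior, here is a hypothetical sketch using the minio-go client; rudder-server's actual Filemanager implementation may differ:

```go
package fileuploader

import (
	"context"

	"github.com/minio/minio-go/v7"
)

// upload ensures the target bucket exists before uploading, creating it on
// first use. Hypothetical sketch, not rudder-server's actual Filemanager code.
func upload(ctx context.Context, client *minio.Client, bucket, object, path string) error {
	exists, err := client.BucketExists(ctx, bucket)
	if err != nil {
		return err
	}
	if !exists {
		if err := client.MakeBucket(ctx, bucket, minio.MakeBucketOptions{}); err != nil {
			return err
		}
	}
	_, err = client.FPutObject(ctx, bucket, object, path, minio.PutObjectOptions{})
	return err
}
```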

@fracasula (Collaborator)

> • I was referring to load-file generation, since that is all the Datalakes do. It is the same for BQ as well, because BQ also uses the GCSManager to generate the load files and upload them.
> • Once the load files are generated, BQ has an extra step that loads the data into the tables.
> • Creating the bucket/container is handled by the Filemanager during Upload, and only for MinIO and Azure.

It makes sense now. So I think we should have an integration test for GCS, one for S3, and one for Azure in the layer that uses them directly (e.g. the filemanager).

This way we can mock the storage layer in the other tests (BQ, Datalakes, Redshift, etc.), for instance with a small in-memory fake like the sketch below.
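
As an illustration of that mocking idea (the interface below is hypothetical, not rudder-server's actual filemanager contract), a tiny in-memory fake could stand in for GCS/S3/Azure in the higher-level tests:

```go
package storagefake

import (
	"context"
	"fmt"
	"sync"
)

// ObjectStore is an illustrative storage-layer interface.
type ObjectStore interface {
	Upload(ctx context.Context, key string, body []byte) error
	Download(ctx context.Context, key string) ([]byte, error)
}

// InMemoryStore is a thread-safe fake that tests can inject in place of a
// real GCS/S3/Azure client.
type InMemoryStore struct {
	mu      sync.Mutex
	objects map[string][]byte
}

func NewInMemoryStore() *InMemoryStore {
	return &InMemoryStore{objects: map[string][]byte{}}
}

func (s *InMemoryStore) Upload(_ context.Context, key string, body []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.objects[key] = append([]byte(nil), body...) // copy to avoid aliasing
	return nil
}

func (s *InMemoryStore) Download(_ context.Context, key string) ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.objects[key]
	if !ok {
		return nil, fmt.Errorf("object %q not found", key)
	}
	return append([]byte(nil), b...), nil
}
```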

@fracasula merged commit ccb3ed2 into master on Jul 13, 2023
37 checks passed
@fracasula deleted the chore.gcs-fake-datalake-test branch on July 13, 2023 08:36