
How to properly configure for a dataset on S3? #194

Closed
WeatherGod opened this issue Jan 21, 2022 · 43 comments

@WeatherGod
Contributor

I am having difficulty figuring out the proper configuration for a dataset stored on S3. There is no documented example of this, and the only thing I can find to work off of is https://github.com/Unidata/tds/blob/main/tds/src/test/content/thredds/tds-s3.xml. But it isn't clear to me how to adapt that example to our own S3 bucket, especially since the file appears to use two dataset roots but only defines one.

I am using the latest tds docker container, and rather than focusing on my own particular catalog.xml, I am hoping to get the ball rolling on producing a well-explained, clean example to add to the documentation.

@haileyajohnson haileyajohnson added the documentation Improvements or additions to documentation label Jan 21, 2022
@haileyajohnson haileyajohnson self-assigned this Jan 21, 2022
@haileyajohnson

Agreed that documentation is needed, but generally speaking, you should be able to use datasets on S3 as drop-in replacements for local datasets by using the "cdms3" prefix. In the tds-s3 test catalog that you've linked, I'm not sure what you're referring to as the two dataset roots; I only see the one: "cdms3:noaa-goes16". Am I missing something?

We certainly need to (and will) add some object store examples to our documentation, but that isn't an immediate answer, and there's no guarantee the documentation will address your particular issue. If you would like help sorting out your configuration, I'd be happy to look at it.

@dopplershift
Member

@haileyajohnson It took me a bit to find it too, but this tag:

  <dataset name="Test GOES-16 S3 Aggregation (November 2nd, 2020)"
    ID="2020-11-02_OR-ABI-L1b-RadC-M6C01-G16"
    urlPath="s3-agg/ABI-L1b-RadC/2020/11/02/OR_ABI-L1b-RadC-M6C01_G16.nc">

seems to be using an "s3-agg" dataset root that's not defined in the file. It's probably not used, though, because the <scan> tag explicitly gives the cdms3 prefix:

        <scan location="cdms3:noaa-goes16?ABI-L1b-RadC/2020/307/#delimiter=/"
          dateFormatMark="OR_ABI-L1b-RadC-M3C01_G16_s#yyyyDDDHHmmss"
          regExp=".*OR_ABI-L1b-RadC-M6C01_G16_s.*" />

@WeatherGod
Contributor Author

Yes, that's what I meant by two dataset roots.

I think an important viewpoint to consider when drafting the documentation is the minimum amount of information the user needs to know. Initially, the user is going to think, "I have a dataset located at s3://myexamplebucket/foo/bar." How should a user specify that in catalog.xml, and what other information does the user need to provide? A narrative walking through that would go a long way.

@haileyajohnson

haileyajohnson commented Jan 21, 2022

Ah ok, I see. Yes, @dopplershift is correct, that second dataset doesn't need a <datasetRoot> mapping because the location is provided via the <scan> element.

@haileyajohnson

@WeatherGod this is the current documentation for catalog configuration.

It definitely does need a lot of work, but as far as serving S3 datasets goes, the only thing users need to do differently is provide locations that start with the cdms3: prefix, followed by the bucket name and object path, instead of a local path.
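
For example (the bucket and path names here are just placeholders), a dataset root that would otherwise point at a local directory can point at a bucket instead:

  <!-- local disk -->
  <datasetRoot path="mydata" location="/data/mydata/" />

  <!-- object store: same idea, but the location uses the cdms3: prefix -->
  <datasetRoot path="mydata" location="cdms3:my-example-bucket" />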

@WeatherGod
Contributor Author

Should it end with a / or not? Digging through the code, I see that there is a check for that, but it doesn't explain what it is for. Also, the code checks for both "cdms3" and "s3". Is there a difference?

https://github.com/Unidata/tds/blob/f0a6297f6fd8b9e4cb3ee2b29312ce64eda94458/tdcommon/src/main/java/thredds/server/catalog/DataRoot.java
https://github.com/Unidata/tds/blob/15d02c350d3c411f899c4befae694b492243911b/tds/src/main/java/thredds/core/DatasetManager.java

@lesserwhirls
Collaborator

Just a quick chime-in here - this should be helpful:

https://docs.unidata.ucar.edu/netcdf-java/current/userguide/dataset_urls.html#object-stores

as well as https://github.com/lesserwhirls/tds-s3-jpl-test (I had planned on using this to bootstrap TDS specific docs).

@haileyajohnson

Thanks, @lesserwhirls !

I'm not sure what you're trying to configure, but for most situations you shouldn't need the location to end with a /. It looks like the code you're referencing just adds a trailing '/' to the root if it's not already there.

Either cdms3: or s3: will work as a prefix, but cdms3: is recommended. I'm guessing the fact that both exist and work is an artifact of evolving decisions mid-development...

@WeatherGod
Contributor Author

cdms3: implies something different from S3. And doing a search for "cdms3" yields not much. The closest I get is "Climate Data Management System, v3", which does not make me think "S3".

@dopplershift
Member

@WeatherGod Let's focus on what it takes to get something working rather than delving off into design critiques for code that's already been merged. 😉

@WeatherGod
Contributor Author

I can see how that came across as a design critique; sorry for not being clearer. My point was more that any such documentation should make clear what difference, if any, there is between "s3:" and "cdms3:", and if there isn't a difference, it should state that they are equivalent, because an end user who isn't familiar with one or the other will get confused on this point.

@WeatherGod
Contributor Author

And to note, I am currently applying some of these notes back to our system to see if they resolve the problems in our bigger, more complicated setup. There are so many moving parts to this, and root causes are so hard to nail down, that I am just trying to eliminate sources of ambiguity/confusion.

Once our CI/CD cycle completes, I'll let you know how that goes.

@lesserwhirls
Collaborator

cdms3: implies something different from S3. And doing a search for "cdms3" yields not much. The closest I get is "Climate Data Management System, v3", which does not make me think "S3".

Yes, because it is different (very similar, but different). cdms3 is a URI scheme specific to netCDF-Java (a.k.a. the CDM) and was created to provide a more generic and flexible way to interface with object storage systems. s3: was used in the first implementation of object store support for netCDF-Java, which focused on AWS S3. Once we started to support aggregations, non-AWS object stores (including on-premises object stores), etc., we needed more flexibility, and thus cdms3: (support for the s3: scheme is deprecated).
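
Roughly speaking (the profile, bucket, and object names in the second line below are illustrative; the netCDF-Java page linked above has the full syntax), the short form names just the bucket and object key, while the full form can also carry a credentials profile, host, and extra config such as a delimiter:

  cdms3:noaa-goes16?ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc
  cdms3://profile_name@aws/bucket-name?path/to/object.nc#delimiter=/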

@WeatherGod
Contributor Author

cdms3: implies something different from S3. And doing a search for "cdms3" yields not much. The closest I get is "Climate Data Management System, v3", which does not make me think "S3".

Yes, because it is different (very similar, but different). cdms3 is a URI scheme specific to netCDF-Java (a.k.a. the CDM) and was created to provide a more generic and flexible way to interface with object storage systems. s3: was used in the first implementation of object store support for netCDF-Java, which focused on AWS S3. Once we started to support aggregations, non-AWS object stores (including on-premises object stores), etc., we needed more flexibility, and thus cdms3: (support for the s3: scheme is deprecated).

I <3 this explanation so much, thank you! This sort of information in the documentation would go so far in helping people understand. From a discoverability perspective, in light of the deprecation of "s3:", the documentation should make it clear that "cdms3:" is for object stores like AWS S3 (and maybe name some others). Then I would include examples that expressly describe an example object store and then translate that into a dataset entry. Linking to the netcdf-java page would be helpful, but looking through that, I start to wonder whether I have the right mental model or not. Clear examples would be helpful in reducing ambiguity.

@dopplershift
Member

@WeatherGod Pull Requests welcome...

Just kidding. We definitely need to capture a lot of this discussion (esp. the context from @lesserwhirls who wrote the stuff) in the docs. I sense the object store support is going to become increasingly important to a large swath of the community, and we definitely want it to be easy for them to set up.

@WeatherGod
Contributor Author

Alright, some progress, I guess. So, here is a snippet of our catalog.xml:

  <dataset name="ERA5 Daily Precipitation and Temperatures"
           ID="era5-daily-summary"
           urlPath="cdms3:aer-awi-era5/temp_precip/era5_pnt_daily_2000_2020">

"aer-awi-era5" is a private S3 bucket of ours. The output from the relevant log:

# cat /usr/local/tomcat/content/thredds/logs/catalogInit.log 
You are currently running TDS version 4.6.19 - 2021-12-20T10:32:09-0500
Latest Available TDS Version Info:
    latest stable version = 5.1
    latest maintenance version = 4.6.17

initCatalogs(): initializing 1 root catalogs.

**************************************
Catalog init catalog.xml
[2022-01-21T21:17:13.123Z]
initCatalog catalog.xml -> /usr/local/tomcat/content/thredds/catalog.xml

-------readCatalog(): full path=/usr/local/tomcat/content/thredds/catalog.xml; path=catalog.xml
----Catalog Validation

*** ERROR DataRootConfig path =temp_precip directory= <cdms3:aer-awi-era5/> does not exist
  add static catalog to hash=catalog.xml

And to confirm that the bucket is accessible from the server running thredds, I installed awscli and did:

# aws s3 ls s3://aer-awi-era5/
                           PRE temp_precip/

So, any ideas what is wrong? Could it be subtle differences between the Python and Java AWS SDKs? Authentication is being done implicitly via the server instance having assumed the appropriate IAM role.

@WeatherGod
Contributor Author

waitaminute... that says TDS version 4.6. I need to go back and double-check which docker image we are using

@haileyajohnson

haileyajohnson commented Jan 21, 2022

I think we're crossing wires a bit on what is a urlPath and what is a location. The cdms3: prefix should be included in the location field of the <datasetRoot> element to let the TDS know that the dataset maps to your S3 bucket. It tells the TDS where to look for the stored data.

The urlPath is a url path for users to get the data via the TDS. For example, this:
urlPath="s3-test/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc"
means the dataset can be accessed using ncss with this url: <hostname>/thredds/ncss/grid/s3-test/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc/dataset.html

So I think the TDS is trying to parse what you've put in the urlPath as a url, and it doesn't know what to do with the ":". It should be something like: urlPath="aer-awi-era5/temp_precip/era5_pnt_daily_2000_2020"
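
So a sketch of the fix might look something like this (keeping your names and assuming the path id matches the bucket name; other attributes and child elements of the <dataset> are omitted here):

  <datasetRoot path="aer-awi-era5" location="cdms3:aer-awi-era5" />

  <dataset name="ERA5 Daily Precipitation and Temperatures"
           ID="era5-daily-summary"
           urlPath="aer-awi-era5/temp_precip/era5_pnt_daily_2000_2020" />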

...or all of that will be true if you have a 5.x TDS running :)

@WeatherGod
Contributor Author

That's actually a question I had, and maybe this is just general knowledge for thredds catalog.xml configs (I am a complete noob to thredds configs). In your example, you have urlPath starting with "aer-awi-era5". Would my datasetRoot then be <datasetRoot path="aer-awi-era5" location="cdms3:aer-awi-era5" />? In other words, the "path" is just an id, and it doesn't appear in the final URL? That is, could I have a datasetRoot of <datasetRoot path="foobar" location="cdms3:aer-awi-era5" />, and then my urlPath could have been "foobar/temp_precip/era5_pnt_daily_2000_2020"?

@haileyajohnson

Yes, exactly. It's an id you set that the TDS maps to a storage location, and it can be anything.
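
For example, using the hypothetical id from your question, this mapping would work just as well, and the dataset would then be served under a "foobar/..." urlPath instead:

  <datasetRoot path="foobar" location="cdms3:aer-awi-era5" />
  <!-- urlPath="foobar/temp_precip/era5_pnt_daily_2000_2020" in the <dataset> element -->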

@WeatherGod
Contributor Author

Alright, fixing our config, as well as fixing our docker mirror (somehow, our mirror had mixed up the 4.6 and 5.3 tags). We'll know how well this turns out in about an hour.

I think the fact that path="foobar" is just an identifier is another potential stumbling block that would be made clear by an example that clearly defines the layout of the object store. The reader could then see that the name doesn't appear in the store layout and would (hopefully!) not try to treat it as a subpath name or something.

@WeatherGod
Contributor Author

Is the <scan location="..."> relative to urlPath? So, if my urlPath already pointed to the directory with my files, should the scan location just be "/", or something?

@WeatherGod
Contributor Author

(the tds-s3.xml example had a different schema, which confused me)

@haileyajohnson

Your scan location should be absolute, e.g. <scan location="cdms3:bucketname/path/to/objects_shared_location" ... />

urlPath is used by a client/user interfacing with the TDS; location is used by the TDS to find the data, whether on disk or in a cloud store.

I'll follow this up with some annotations on the tds-s3.xml file that might clarify things, but it's taking me a bit to type it up.
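
In the meantime, here's a rough sketch of the aggregation pattern from tds-s3.xml (the dataset names, aggregation dimension, suffix, and bucket/path below are all placeholders, and attributes such as serviceName are omitted):

  <dataset name="Example S3 Aggregation" ID="example-s3-agg"
      urlPath="s3-agg/example_aggregation.nc">
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <aggregation dimName="time" type="joinExisting">
        <scan location="cdms3:bucket-name?path/to/objects/#delimiter=/"
              suffix=".nc" />
      </aggregation>
    </netcdf>
  </dataset>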

@WeatherGod
Contributor Author

"urlPath is used by a client/user interfacing with the TDS". So, would siphon count as a client interfacing with TDS? Because siphon, as far as I can tell, doesn't support s3 or cdms3 stuff, so maybe this whole time, I shouldn't have been setting urlPath to a cdms3: address? An error in another part of our system that uses siphon is what triggered this entire investigation of wondering how we were supposed to configure catalog.xml.

@haileyajohnson

haileyajohnson commented Jan 21, 2022

There are two datasets being served from the tds-s3 catalog. The first is a single dataset, i.e. one .nc file stored in an AWS bucket:

<datasetRoot path="s3-test" location="cdms3:noaa-goes16" />

  <dataset name="Test Single GOES-16 S3 (December 29th, 2019 21:01) " ID="2020-11-02_2101-OR-ABI-L1b-RadC-M6C16-G16"
    urlPath="s3-test/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc"
    dataType="Grid"
    serviceName="obstoreGrid"/>

The <datasetRoot> element here is telling the TDS that anytime a client provides a URL with "s3-test", it should go look in the "cdms3:noaa-goes16" location. That's used in the following <dataset> element by the urlPath field: we can't put "cdms3:noaa-goes16" in directly, because that wouldn't be a valid, parsable url, but the TDS knows what "s3-test" is referring to because we mapped it in the <datasetRoot> above. Everything after "s3-test/" is the path to the dataset object within the s3 bucket (i.e. path relative to dataRoot).

The second dataset is an aggregation:

<dataset name="Test GOES-16 S3 Aggregation (November 2nd, 2020)"
    ID="2020-11-02_OR-ABI-L1b-RadC-M6C01-G16"
    urlPath="s3-agg/ABI-L1b-RadC/2020/11/02/OR_ABI-L1b-RadC-M6C01_G16.nc">
    ...
        <scan location="cdms3:noaa-goes16?ABI-L1b-RadC/2020/307/#delimiter=/"
    ...
  </dataset>

Here our urlPath is, again, the URL that will be used by a client to access the data through the TDS, but we didn't need to define a <datasetRoot> because we tell the TDS where the data lives in the <scan> element. That's why the <scan> needs the full URI: the cdms3 prefix and path.

@haileyajohnson

So, would siphon count as a client interfacing with TDS? Because siphon, as far as I can tell, doesn't support s3 or cdms3 stuff, so maybe this whole time, I shouldn't have been setting urlPath to a cdms3: address?

Yep! Siphon counts as a client, and urlPath should not have cdms3 in it. The TDS should be handling the mapping from a url like <hostname>/thredds/ncss/grid/s3-test/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc/ to a storage location like cdms3:noaa-goes16/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc. Siphon shouldn't even need to know whether the dataset lives on s3 or on local disk.

@WeatherGod
Contributor Author

Oh, that really clears things up. And that scan location is using the URI scheme defined in the netcdf-java docs linked earlier today. Just curious, could the scan location use a datasetRoot if one was defined, or does it always have to be a full URI?

@WeatherGod
Contributor Author

I might also have a bug report to file next week. Need to confirm, but it looks like thredds will report "something" rather than return an error when an invalid config is used.

This has been very helpful and we can continue this next week. Have a good weekend!

@haileyajohnson
Copy link

I'm fairly certain a scan location overrides a datasetRoot and can't rely on one. Since scan location isn't user-facing and is only used internally by the TDS, it doesn't go through the datasetRoot map.

Glad this was helpful!

@WeatherGod
Contributor Author

Just a bit of an update on my end. Due to problems with other parts of the system unrelated to thredds and s3, we had to table working on using this feature. I do have my eye on utilizing TDS for another project that would utilize data off of an S3 bucket, but I haven't been allocated time to pursue that. It may be some time before I can contribute more to this discussion. If you all do move ahead and develop some documentation, I would be interested in reviewing it.

@dblodgett-usgs
Contributor

dblodgett-usgs commented Mar 17, 2022

I'm looking into this now for some basic examples and failing to get anything working. I'm getting various errors that make it clear I have the right TDS and have at least gotten it to try, so I know I'm making some progress. Can someone point me to a dead simple working example?

I've never used the dataset root pattern, and I don't think I have enough brain cells to hold all the indirection in my head at once. Are there examples of using a netcdf element within a dataset element to point to a single netcdf file on an object store? I'd like to inch my way into this and get to the point that I know how to work with our requester pays bucket (s3://nhgf-development). I put our copy of prism there and have been using the path cdms3://default@aws/nhgf-development?thredds/prism_v2/prism_2020.nc to try to get one file working.

Thanks for any examples you can point me to.

UPDATE: I have succeeded in getting an unauthenticated goes16 example to work and an authenticated (with a default profile) example against our requester pays bucket to work.

I am unable to get credentials to work by specifying them in the URL. Only a format like cdms3:noaa-goes16 is working. For example, if I create a:

[region-only-profile]
region=us-east-1

and set my dataset root to:

<datasetRoot path="s3-test" location="cdms3://region-only-profile@aws/noaa-goes16" />

I get 500, NULL from THREDDS with a Service: S3, Status Code: 403 ... in the body. These public datasets don't seem to work at all with authenticated requests though, so maybe this is expected?

However, for my requester pays bucket, I can set up credentials like...

[default]
aws_access_key_id=enter_your_key
aws_secret_access_key=enter_your_secret
region=us-west-2

I can get this datasetRoot location to work: cdms3:nhgf-development or cdms3://default@aws/nhgf-development

minimal catalog that works:

<?xml version="1.0" encoding="UTF-8"?>
<catalog name="S3 Test Catalog" xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" xmlns:ncml="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" version="1.2">

  <service name="all" serviceType="OpenDAP" base="/thredds/dodsC/"/>

  <datasetRoot path="nhgf" location="cdms3:nhgf-development" />

  <dataset name="test prism" ID="prism" urlPath="nhgf/thredds/prism_v2/prism_2020.nc"    dataType="Grid" serviceName="all"/>

  <datasetRoot path="nhgf-test" location="cdms3://default@aws/nhgf-development" />

  <dataset name="test prism 2" ID="prism-test" urlPath="nhgf-test/thredds/prism_v2/prism_2020.nc" dataType="Grid" serviceName="all"/>

</catalog>

I think I'm off and running with this pattern. I'll do some more testing and we'll see where we get.

@dblodgett-usgs
Contributor

IMHO, this issue could be closed. The dataset root pattern, or using a cdms3 string as I show above, pretty much answers the question.

@haileyajohnson

I'm going to leave it open for the time being, just as a reference until we revisit the S3 documentation :)

@WeatherGod
Contributor Author

Picking this issue back up at work. I can get thredds to attempt to access my S3 bucket, but I get an access denied error, even though the default credentials and configs stored in /usr/local/tomcat/.aws/ are all good. I am trying to debug this, but I can't figure out how to turn on the logging for the "com.amazonaws" logger (looks like they are still using log4j?).

It might be a good idea to update the log4j2.xml file to automatically include WARN and higher entries in the threddsServlet.log file. I tried doing it myself, but I can't seem to make it work, and I don't have much experience with Java logging.

@WeatherGod
Contributor Author

Any guidance on how to activate the logs from the Amazon SDK? I'm still having difficulty getting this to work and I'm shooting in the dark, trying to figure out what the code is choking on (either I get "null" errors, or "Access Denied" errors).

@tdrwenski
Member

tdrwenski commented Oct 6, 2022

I just quickly tried this out, based on what I read here, and it logs a whole bunch of stuff from awssdk in a new file called awssdk.log. You can edit this WEB-INF/classes/log4j2.xml file before starting up TDS to change these log settings.
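
Something along these lines is what I added (the appender name and log file path below are just illustrative; match whatever your other appenders use for the log directory):

  <!-- inside the <Appenders> section of WEB-INF/classes/log4j2.xml -->
  <File name="awssdk" fileName="/usr/local/tomcat/content/thredds/logs/awssdk.log">
    <PatternLayout pattern="%d %-5p [%c] %m%n"/>
  </File>

  <!-- inside the <Loggers> section -->
  <Logger name="software.amazon.awssdk" level="debug" additivity="false">
    <AppenderRef ref="awssdk"/>
  </Logger>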

I hope that helps! If you still can't figure out the errors you are getting, let us know.

@WeatherGod
Contributor Author

WeatherGod commented Oct 6, 2022

Oh! Thank you so much! I am such a complete noob with Java, and the logging documentation for AWS SDK and TDS read like ancient Babylonian, and looked like they were completely incompatible with each other. This is a huge leg up!

@WeatherGod
Contributor Author

Alright, thanks to the logging info, I have been able to determine one of the reasons why my group has been having such difficulty figuring out the configs. If I am understanding this correctly, it turns out that TDS is missing a dependency from the awssdk package (I don't know the exact terminology). Since we were trying to run TDS as a docker container image on an EKS cluster, authorization is provided by a Web Identity Token rather than a more usual method such as an AWS role or ~/.aws/credentials. Doing that requires the sts module to be included in the classpath.

Looks like the following line needs to be added to the gradle files somewhere?

implementation 'software.amazon.awssdk:sts'

Not sure if something like that should go into netcdf-java directly, or into TDS?

@tdrwenski
Member

I will have a look!

@lesserwhirls
Collaborator

The best way to test if this is enough would be to include STS as a runtimeOnly dependency for the TDS. I can make a PR for that here in a bit. If that does the trick, then we would want to add it as a runtimeOnly dependency for netCDF-Java's cdm-s3 subproject.

@tdrwenski
Member

Thanks Sean, that sounds good! 😄

@tdrwenski
Member

@WeatherGod, the newest docker image (unidata/thredds-docker:5.5-SNAPSHOT) should contain the STS dependency. Let us know if you still have problems!
