How to properly configure for a dataset on S3? #194
Agreed that documentation is needed, but generally speaking, you should be able to use datasets on S3 as drop-in replacements for local datasets by using the "cdms3" prefix. In the tds-s3 test catalog that you've linked, I'm not sure what you're referring to as the two dataset roots; I only see the one: "cdms3:noaa-goes16". Am I missing something? We certainly need to and will add some object store examples to our documentation, but that's not an immediate answer and there's no guarantee that the provided documentation will address your particular issue, so if you would like help sorting your configuration out, I'd be happy to look at it. |
@haileyajohnson It took me a bit to find it too, but this tag: <dataset name="Test GOES-16 S3 Aggregation (November 2nd, 2020)"
ID="2020-11-02_OR-ABI-L1b-RadC-M6C01-G16"
urlPath="s3-agg/ABI-L1b-RadC/2020/11/02/OR_ABI-L1b-RadC-M6C01_G16.nc"> seems to be using a "s3-agg" dataset root that's not defined in the file. It's probably not used, though, because the <scan location="cdms3:noaa-goes16?ABI-L1b-RadC/2020/307/#delimiter=/"
dateFormatMark="OR_ABI-L1b-RadC-M3C01_G16_s#yyyyDDDHHmmss"
regExp=".*OR_ABI-L1b-RadC-M6C01_G16_s.*" /> |
Yes, that's what I meant by two dataset roots. I think an important view point to consider when drafting the documentation is what is the minimum amount of information the user needs to know. Initially, the user is going to think, "I have a dataset located at s3://myexamplebucket/foo/bar", how should a user specify that in catalog.xml, and what other information does the user need to provide? A narrative walking through that would go a long way. |
Ah ok, I see. Yes, @dopplershift is correct, that second dataset doesn't need a |
@WeatherGod this is the current documentation for catalog configuration. It definitely does need a lot of work - but as far as serving S3 datasets goes, the only things users need to do differently is provide |
should it end with a https://github.com/Unidata/tds/blob/f0a6297f6fd8b9e4cb3ee2b29312ce64eda94458/tdcommon/src/main/java/thredds/server/catalog/DataRoot.java |
Just a quick chime-in here - this should be helpful: https://docs.unidata.ucar.edu/netcdf-java/current/userguide/dataset_urls.html#object-stores as well as https://github.com/lesserwhirls/tds-s3-jpl-test (I had planned on using this to bootstrap TDS specific docs). |
Thanks, @lesserwhirls! I'm not sure what you're trying to configure, but for most situations you shouldn't need the location to end with a Either |
cdms3: implies something different from S3. And doing a search for "cdms3" yields not much. The closest I get is "Climate Data Management System, v3", which does not make me think "S3". |
@WeatherGod Let's focus on what it takes to get something working rather than delving off into design critiques for code that's already been merged. 😉 |
I can see how that was a design critique and sorry for not being clearer. This was more of me raising a point that in any such documentation it should make clear what, if any, difference there is between "s3:" and "cdms3:", and if there isn't a difference, it should state that they are equivalent, because an end-user who isn't familiar with one or the other will get confused on this point. |
And to note, I have currently taken some of these notes to apply back to our system to see if it resolves the problems in our bigger, more complicated system. There are so many moving parts to this, and difficulty in nailing down root causes that I am just trying to eliminate sources of ambiguity/confusion. Once our CI/CD cycle completes, I'll let you know how that goes. |
Yes, because it is different (very similar, but different). cdms3 is a URI scheme specific to netCDF-java (a.k.a. The CDM) and was created to provide a more generic and flexible way to interface with object storage systems. s3: was used in the first implementation of object store support for netCDF-java, which focused on AWS S3. Once we started to support aggregations, non-AWS object stores (including on-premises object stores), etc., we needed more flexibility, and thus cdms3: (support for the s3: scheme is deprecated). |
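For readers trying to build a mental model of the scheme, here is a minimal Python sketch that splits a cdms3 URI into its visible pieces. It is not netCDF-Java's parser: the function name `split_cdms3` and the field names are hypothetical, and the layout is assumed only from the two URI shapes quoted in this thread (`cdms3:bucket?key#delimiter=/` and `cdms3://profile@authority/bucket`). The netCDF-Java dataset URL docs linked above remain the authority on what each piece means.

```python
from typing import NamedTuple, Optional


class Cdms3Parts(NamedTuple):
    profile: Optional[str]    # credentials profile, e.g. "default" (assumed meaning)
    authority: Optional[str]  # host-like part after "@", e.g. "aws"
    bucket: str               # everything before "?" that is not profile/authority
    key: Optional[str]        # object key or key prefix after "?"
    delimiter: Optional[str]  # value of a "#delimiter=" fragment, if present


def split_cdms3(uri: str) -> Cdms3Parts:
    """Split a cdms3 URI into parts (illustrative sketch, not the real parser)."""
    scheme, sep, rest = uri.partition(":")
    if sep != ":" or scheme != "cdms3":
        raise ValueError(f"not a cdms3 URI: {uri!r}")

    rest, _, fragment = rest.partition("#")  # peel off "#delimiter=/"
    rest, _, key = rest.partition("?")       # peel off the object key

    profile = authority = None
    if rest.startswith("//"):
        # Long form: //profile@authority/bucket
        netloc, _, bucket = rest[2:].partition("/")
        if "@" in netloc:
            profile, _, authority = netloc.partition("@")
        else:
            authority = netloc
    else:
        # Short form: the bucket name directly follows "cdms3:"
        bucket = rest

    delimiter = None
    if fragment.startswith("delimiter="):
        delimiter = fragment[len("delimiter="):]

    return Cdms3Parts(profile or None, authority or None, bucket, key or None, delimiter)


print(split_cdms3("cdms3:noaa-goes16?ABI-L1b-RadC/2020/307/#delimiter=/"))
print(split_cdms3("cdms3://default@aws/nhgf-development"))
```

The point of the sketch is only that the bucket, the key, and the optional profile live in different parts of the string, which is the main way cdms3: differs visually from a plain s3://bucket/key address.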
I <3 this explanation soo much, thank you! This sort of information in the documentation would go so far in helping people understand. From a perspective of discoverability, in light of deprecation of "s3:", the documentation should make it clear that "cdms3:" is for object stores like AWS S3 (and maybe include some others). Then I would include examples that expressly describe an example object store and then translate that into a dataset entry. Linking to the netcdf-java page would be helpful, but looking through that, I start to wonder if I still got the right mental model or not. Clear examples would be helpful in reducing ambiguity. |
@WeatherGod Pull Requests welcome... Just kidding. We definitely need to capture a lot of this discussion (esp. the context from @lesserwhirls who wrote the stuff) in the docs. I sense the object store support is going to become increasingly important to a large swath of the community, and we definitely want it to be easy for them to set up. |
Alright, some progress, I guess. So, here is a snippet of our catalog.xml:
"aer-awi-era5" is a private S3 bucket of ours. The output from the relevant log:
And to confirm that the bucket is accessible from the server running thredds, I installed awscli and did:
So, any ideas what is wrong? Could it be subtle differences between the python and java AWS SDKs? Authentication is being implicitly done via the server instance having assumed the appropriate IAM role. |
waitaminute... that says TDS version 4.6. I need to go back and double-check which docker image we are using |
I think we're crossing wires a bit on what is a The So I think the TDS is trying to parse what you've put in the ...or all of that will be true if you have a 5.x TDS running :) |
that's actually a question I had, and maybe this is just general knowledge for thredds catalog.xml configs (I am a complete noob to thredds configs). In your example, you have urlPath starting with "aer-awi-era5". Would my datasetRoot then be |
Yes, exactly. It's an id that you're setting that the TDS maps to a storage location, but it can be anything. |
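To make the "it's an id that maps to a storage location" point concrete, here is a small Python sketch of the idea. It is not the TDS implementation: `match_dataset_root` is a hypothetical helper, the root id `foobar` is made up, the location string `cdms3:aer-awi-era5` is an assumed spelling for the bucket mentioned above, and choosing the longest matching prefix is an assumption of this sketch. How the TDS joins the location and the remainder into an actual object key is deliberately left out.

```python
from typing import Dict, Optional, Tuple


def match_dataset_root(url_path: str, roots: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (location, remainder) for the datasetRoot whose path prefixes url_path.

    `roots` maps a datasetRoot "path" (an arbitrary id) to its "location".
    Returns None when no declared root matches.
    """
    best = None
    for path in roots:
        # Match whole path segments only, so "nhgf" does not match "nhgf-test/...".
        if url_path == path or url_path.startswith(path + "/"):
            if best is None or len(path) > len(best):
                best = path
    if best is None:
        return None
    remainder = url_path[len(best):].lstrip("/")
    return roots[best], remainder


# Hypothetical catalog state: one root whose id is unrelated to the bucket name.
roots = {"foobar": "cdms3:aer-awi-era5"}

print(match_dataset_root("foobar/2020/01/example.nc", roots))
# ('cdms3:aer-awi-era5', '2020/01/example.nc')

print(match_dataset_root("aer-awi-era5/2020/01/example.nc", roots))
# None: the bucket name is not a root id unless you declare it as one
```

The second call is the stumbling block in miniature: a urlPath only resolves if its leading part matches a declared root id, regardless of what the bucket is called.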
Alright, fixing our config, as well as fixing our docker mirror (somehow, our mirror had mixed up the 4.6 and 5.3 tags). We'll know how well this turns out in about an hour. I think the aspect about the path="foobar" being an identifier is another potential stumbling block that would get made clear by having an example that clearly defines the layout of the object store. The reader can see that that name doesn't appear in the store layout and can then (hopefully!) not try and treat it as a subpath name or something. |
The |
(the tds-s3.xml example had a different schema, which confused me) |
Your scan location should be absolute, e.g. `<scan location="cdms3:bucketname/path/to/objects_shared_location">`
I'll follow this up with some annotations on the tds-s3.xml file that might clarify things, but it's taking me a bit to type it up. |
"urlPath is used by a client/user interfacing with the TDS". So, would siphon count as a client interfacing with TDS? Because siphon, as far as I can tell, doesn't support s3 or cdms3 stuff, so maybe this whole time, I shouldn't have been setting urlPath to a cdms3: address? An error in another part of our system that uses siphon is what triggered this entire investigation of wondering how we were supposed to configure catalog.xml. |
There are two datasets being served from the tds-s3 catalog, the first is a single dataset, i.e. one .nc file stored in an AWS bucket:
The The second dataset is an aggregation:
Here our urlPath is, again, the URL that will be used by a client to access the data through the TDS, but we didn't need to define a |
Yep! Siphon counts as client, and urlPath should not have cdms3 in it. The TDS should be handling the mapping from a url, like |
Oh, that really clears things up. And that scan location is utilizing the url schema defined in the netcdf-java docs that was linked earlier today. Just curious, could the scan location utilize datasetRoot if it was defined, or does it always have to have a full url? |
I might also have a bug report to file next week. Need to confirm, but it looks like thredds will report "something" rather than return an error when an invalid config is used. This has been very helpful and we can continue this next week. Have a good weekend! |
I'm fairly certain a scan location overrides a datasetRoot and can't rely on one. Since scan location isn't user-facing and is only used internally by the TDS, it doesn't go through the datasetRoot map. Glad this was helpful! |
Just a bit of an update on my end. Due to problems with other parts of the system unrelated to thredds and s3, we had to table working on using this feature. I do have my eye on utilizing TDS for another project that would utilize data off of an S3 bucket, but I haven't been allocated time to pursue that. It may be some time before I can contribute more to this discussion. If you all do move ahead and develop some documentation, I would be interested in reviewing it. |
I'm looking into this now for some basic examples and failing to get anything working. Various errors that make it clear I have the right TDS and have at least gotten TDS to try, so I know I'm making some progress. Can someone point me to a dead simple working example? I've never used the dataset root pattern, don't think I have enough brain cells to hold all the indirection in my head at once. Are there examples of using a netcdf element within a dataset element to point to a single netcdf file on an object store? I'd like to inch my way into this and get to the point that I know how to work with our requester pays bucket (s3://nhgf-development) -- I put our copy of prism there and have been using the path: Thanks for any examples you can point me to. UPDATE: I have succeeded in getting an unauthenticated goes16 example to work and an authenticated (with a default profile) example against our requester pays bucket to work. I am unable to get credentials to work by specifying them in the URL. Only a format like:
and set my dataset root to:
I get 500, NULL from THREDDS with a However, for my requester pays bucket, I can set up credentials like...
I can get this datasetRoot location to work: minimal catalog that works:
<?xml version="1.0" encoding="UTF-8"?>
<catalog name="S3 Test Catalog" xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" xmlns:ncml="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" version="1.2">
  <service name="all" serviceType="OpenDAP" base="/thredds/dodsC/"/>
  <datasetRoot path="nhgf" location="cdms3:nhgf-development" />
  <dataset name="test prism" ID="prism" urlPath="nhgf/thredds/prism_v2/prism_2020.nc" dataType="Grid" serviceName="all"/>
  <datasetRoot path="nhgf-test" location="cdms3://default@aws/nhgf-development" />
  <dataset name="test prism 2" ID="prism-test" urlPath="nhgf-test/thredds/prism_v2/prism_2020.nc" dataType="Grid" serviceName="all"/>
</catalog>
I think I'm off and running with this pattern. I'll do some more testing and we'll see where we get. |
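Since several of the false starts in this thread came from a urlPath whose leading segment had no matching datasetRoot, a quick offline check can help before restarting the server. The Python sketch below parses a copy of the catalog above (XML declaration dropped for brevity) and lists datasets whose first urlPath segment is not a declared root path. `undeclared_roots` is a hypothetical helper, not part of the TDS, and it assumes single-segment root paths. The earlier discussion suggests an aggregation dataset driven by a scan location may not need a root at all, so treat a hit as a prompt to look, not as a definite error.

```python
import xml.etree.ElementTree as ET

# Namespace of THREDDS catalog elements, copied from the catalog's xmlns attribute.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

CATALOG = """
<catalog name="S3 Test Catalog"
         xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         version="1.2">
  <service name="all" serviceType="OpenDAP" base="/thredds/dodsC/"/>
  <datasetRoot path="nhgf" location="cdms3:nhgf-development"/>
  <dataset name="test prism" ID="prism"
           urlPath="nhgf/thredds/prism_v2/prism_2020.nc"
           dataType="Grid" serviceName="all"/>
  <datasetRoot path="nhgf-test" location="cdms3://default@aws/nhgf-development"/>
  <dataset name="test prism 2" ID="prism-test"
           urlPath="nhgf-test/thredds/prism_v2/prism_2020.nc"
           dataType="Grid" serviceName="all"/>
</catalog>
""".strip()


def undeclared_roots(catalog_xml: str):
    """List (dataset ID, first urlPath segment) pairs with no matching datasetRoot."""
    root = ET.fromstring(catalog_xml)
    declared = {r.get("path") for r in root.iterfind(".//t:datasetRoot", NS)}
    missing = []
    for ds in root.iterfind(".//t:dataset[@urlPath]", NS):
        first_segment = ds.get("urlPath").split("/", 1)[0]
        if first_segment not in declared:
            missing.append((ds.get("ID"), first_segment))
    return missing


# An empty list means every urlPath starts with a declared root id.
print(undeclared_roots(CATALOG))
```

Swapping one urlPath to start with an id that is not declared (for example `s3-agg/`) makes that dataset show up in the returned list.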
IMHO, this issue could be closed. The dataset root pattern or using a cdms3 string as I show above is pretty much answering the question. |
I'm going to leave it open for the time being, just as a reference until we revisit the S3 documentation :) |
Picking this issue back up at work. I can get thredds to attempt to access my S3 bucket, but I get an access denied, even though my default credentials and configs stored in Might be a good idea to update the log4j2.xml file to automatically include WARN and higher entries in the threddsServlet.log file, maybe? I tried doing it myself, but I can't seem to make it work and I don't have much experience with java logging. |
Any guidance on how to activate the logs from the Amazon SDK? I'm still having difficulty getting this to work and I'm shooting in the dark, trying to figure out what the code is choking on (either I get "null" errors, or "Access Denied" errors). |
I just quickly tried this out, based on what I read here, and it logs a whole bunch of stuff from awssdk in a new file called I hope that helps! If you still can't figure out the errors you are getting, let us know. |
Oh! Thank you so much! I am such a complete noob with Java, and the logging documentation for AWS SDK and TDS read like ancient Babylonian, and looked like they were completely incompatible with each other. This is a huge leg up! |
Alright, thanks to the logging info, I have been able to determine one of the reasons why my group has been having such difficulty with figuring out the configs. If I am understanding this correctly, it turns out that TDS is missing a dependency from the awssdk package (I don't know exact terminology). Since we were trying to run TDS as a docker container image on an EKS cluster, the authorization is provided by a Web Identity Token rather than some other usual method such as AWS Role or Looks like the following line needs to be added to the gradle files somewhere?
Not sure if something like that should go into netcdf-java directly, or into TDS? |
I will have a look! |
The best way to test if this is enough would be to include STS as a |
Thanks Sean, that sounds good! 😄 |
@WeatherGod, the newest docker image( |
I am having difficulty figuring out the proper configuration for a dataset stored on S3. There is no documented example of this, and the only thing I can find to work off of is https://github.com/Unidata/tds/blob/main/tds/src/test/content/thredds/tds-s3.xml. But it isn't clear to me how to adapt that example to our own S3 bucket, let alone the fact that this file appears to use two dataset roots, but only has one defined.
I am using the latest tds docker container, and rather than focusing on my own particular catalogue.xml, I am hoping to get the ball rolling on producing a well-explained, clean example to add to the documentation.