Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nodefilter not working with Sentinel-3 products #573

Merged
merged 17 commits into from Mar 10, 2023

Conversation

valpamp
Copy link
Contributor

@valpamp valpamp commented Sep 23, 2022

Fixes issue #566, which was caused by the code not handling the product folder path and manifest filename based on the type of product. Since most Sentinel products have a folder path ending in .SAFE and a manifest file named manifest.safe, this was an issue only with products such as Sentinel-3, whose folder path ends in .SEN3 and whose manifest file is named xfdumanifest.xml.
By fetching the full odata metadata of the queried product, it is possible to get the correct folder path, and sets the manifest filename accordingly.

If there are products whose folder path does not end in '.SAFE' or '.SEN3', they should be handled accordingly.

Introduced the 'manifest_name' key in the odata dictionary in order to properly handle the product path and manifest filename depending on the type of product.
The _download_with_node_filter function now uses the 'Filename' and 'manifest_name' dictionary keys to create the product folder and manifest filename depending on the type of product queried by the user.
The _dataobj_to_node_info function sets the "node_path" dictionary key using the path variable rather than recalling the "href" key of the dataobj_info object, so that the initial './' characters are removed if they were present in the original object. Not doing so causes the failure of the nodefilter function for Sentinel-3 products. Partial downloads of other products work normally.
Now the product folder path and the manifest filename are set based on the 'title' key of the odata dictionary. This ensures that Sentinel-3 paths and manifest filenames are set correctly, allowing nodefilters to function also with these products. The previously introduced 'Filename' key was removed in favour of the newly introduced 'manifest_name', which is then used to build the paths throughout the code.
Build the product folder path without querying the full metadata through get_product_odata, making use of the new 'folder_path' key of the product_info dictionary.
Copy link
Contributor

@avalentino avalentino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@j08lue
Copy link
Contributor

j08lue commented Sep 26, 2022

I agree, looks good and is an improvement in terms of avoiding hard-coded names.

CI shows you need to Black-format, please, then we'll know how the rest of the tests go.

@valpamp
Copy link
Contributor Author

valpamp commented Sep 26, 2022

CI shows you need to Black-format, please, then we'll know how the rest of the tests go.

I used the black linter included in Spyder and the number of changes in the code is much higher than I expected. From what I can see basically all of them seem to be breaking down lines that are longer than 79 characters. Is this ok? Or did I use the wrong code formatter?

@avalentino
Copy link
Contributor

@valpamp the configuration for black is in the pyproject.toml file

[tool.black]
line-length = 100
target_version = ['py34'] # required for backwards compatability with py27 and trailing commas
include = '\.pyi?$'
exclude = '''
(
/(
\.cache
\.git
\.sentinelsat.egg-info
_build
docs
)/
)
'''

The line-length is supposed to be 100.

@valpamp
Copy link
Contributor Author

valpamp commented Sep 27, 2022

I finally managed to run black with using the proper configuration file, it should be ok now.

@j08lue
Copy link
Contributor

j08lue commented Sep 27, 2022

Great, now Black is happy, but one test is failing because the product info we expect (from recordings) is different from what we have now, where the folder_path and manifest_name properties are added. https://github.com/sentinelsat/sentinelsat/actions/runs/3134363937/jobs/5090260771#step:8:80

Would you try to re-record that cassette, @valpamp? There are some instructions on that here: https://github.com/sentinelsat/sentinelsat/#tests If it does not work for you, we can help.

Before you do, I am not super happy with the naming of folder_path. It is more like a filename - that, for unzipped products, is more like a folder name. You dub it filename yourself here:
https://github.com/sentinelsat/sentinelsat/pull/573/files#diff-8539d03b31a9a42eeab364c8eafc747a91bef19778c6cd4fc3630488c1fc7439R1075

So should we still consider renaming it to target_filename or product_name or something? I think folder_path is perhaps a bit too generic for what this property is.

@valpamp
Copy link
Contributor Author

valpamp commented Sep 28, 2022

product_name sounds good to me, I will make sure to rename it. Regarding the test, I have skimmed through the vcrpy documentation, and if I understand correctly I should just run pytest on the failing test using the --vcr-record all argument to force the cassette to be re-recorded, and if everything goes right I should see the associated yaml file change in the vcr_cassettes folder.

@j08lue
Copy link
Contributor

j08lue commented Sep 29, 2022

Great, thanks, @valpamp! Sounds right with --vcr-record all.

Removed the initial './' to match the actual content of the product_info set.
@valpamp
Copy link
Contributor Author

valpamp commented Nov 21, 2022

Hey everyone, sorry this took so long, I had to turn in my PhD thesis and I have been completely absorbed in that for a while.

Today I finally ran pytest in my conda environment, and the test failure seems to be caused by a mismatch between the content of the product_node set and the second string:

assert set(product_info["nodes"]) == {"./manifest.safe", "./preview/map-overlay.kml"}

The product_node set contains the string 'preview/map-overlay.kml', but it is being checked against './preview/map-overlay.kml'. Removing the initial './' from the control string and running pytest with --vcr-record all apparently solves the issues, and the cassettes are re-recorded successfully.

However, if I now run pytest without the --vcr-record all option, test_download_nodes fails because of an xml parsing error. The culprit is the manifest.SAFE file of the S1A_IW_GRDH_1SDV_20210722T182826_20210722T182851_038894_0496E2_5BD9 product being downloaded during the test: checking the manifest.SAFE file, I can clearly see that it ends abruptly at the line where the xml parsing library complains, but I have no idea why. Any suggestions?

@j08lue
Copy link
Contributor

j08lue commented Nov 21, 2022

I can see the test is using the first of these recent Sentinel-1 products:

products = api.query(date=("NOW-1MONTH", None), identifier="*IW_GRDH*", limit=3)

uuid = node_test_products[0]["id"]

Have you tried using a different one, e.g. the second from that query? Or try again tomorrow, when the query hopefully returns some different products. If the problem persists, it might be more systematic.

Thanks for your efforts - and good luck with the thesis!

@valpamp
Copy link
Contributor Author

valpamp commented Nov 23, 2022

I ran more tests with the other two products and I am getting different errors. If I run the test with --vcr-record all everything works fine in both cases, but then if I re-run the test without the option I get a CannotOverwriteExistingCassette error, which does not make much sense to me since I just re-recorded the cassette while pointing to the same product. But maybe it's just me having no idea how VCR works 😅
I also tried seeing if something changed with the first product, but the parsing error persists for now, the xml still appears to be truncated.

Finally, yesterday I found and solved another weird issue with the node filter on Sentinel-2 products from 2018. But I will open another issue/PR for that.

@j08lue
Copy link
Contributor

j08lue commented Nov 25, 2022

I can reproduce the issue with the truncated manifest.safe even on current main:

  1. Run pytest -v --record-vcr all tests/test_download.py::test_download_product_nodes --> test passes
  2. Run pytest -v tests/test_download.py::test_download_product_nodes --> test fails
______________________________________________________________________________________________________________________________________________________________________ test_download_product_nodes _______________________________________________________________________________________________________________________________________________________________________

api = <sentinelsat.sentinel.SentinelAPI object at 0x11012bdc0>, tmpdir = local('/private/var/folders/fj/h7p8zmrs5pv538jqx4b0z7c80000gn/T/pytest-of-jonassolvsteen/pytest-8/test_download_product_nodes0')
node_test_products = [{'Creation Date': datetime.datetime(2021, 7, 22, 19, 35, 3, 413000), 'Ingestion Date': datetime.datetime(2021, 7, 22,...ime(2021, 7, 22, 19, 24, 14, 449000), 'Online': True, 'date': datetime.datetime(2021, 7, 22, 17, 43, 56, 232000), ...}]

    @pytest.mark.vcr
    @pytest.mark.scihub
    def test_download_product_nodes(api, tmpdir, node_test_products):
        uuid = node_test_products[0]["id"]
        product_dir = node_test_products[0]["title"] + ".SAFE"
        expected_path = tmpdir.join(product_dir)
    
        nodefilter = make_path_filter("*preview/*.kml")
>       product_info = api.download(uuid, str(tmpdir), checksum=True, nodefilter=nodefilter)

tests/test_download.py:371: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sentinelsat/sentinel.py:590: in download
    return downloader.download(id, directory_path)
sentinelsat/download.py:133: in download
    return self._download_with_node_filter(id, directory, stop_event)
sentinelsat/download.py:164: in _download_with_node_filter
    node_infos = self._filter_nodes(manifest_path, product_info, self.node_filter)
sentinelsat/download.py:779: in _filter_nodes
    xmldoc = etree.parse(manifest)
/opt/homebrew/Cellar/python@3.10/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/xml/etree/ElementTree.py:1222: in parse
    tree.parse(source, parser)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xml.etree.ElementTree.ElementTree object at 0x1101293c0>, source = <_io.BufferedReader name='/private/var/folders/fj/h7p8zmrs5pv538jqx4b0z7c80000gn/T/pytest-of-jonassolvsteen/pytest-8/test_download_product_nodes0/S1A_IW_GRDH_1SDV_20210722T182826_20210722T182851_038894_0496E2_5BD9.SAFE/manifest.safe'>
parser = <xml.etree.ElementTree.XMLParser object at 0x10773f520>

    def parse(self, source, parser=None):
        """Load external XML document into element tree.
    
        *source* is a file name or file object, *parser* is an optional parser
        instance that defaults to XMLParser.
    
        ParseError is raised if the parser fails to parse the document.
    
        Returns the root element of the given source document.
    
        """
        close_source = False
        if not hasattr(source, "read"):
            source = open(source, "rb")
            close_source = True
        try:
            if parser is None:
                # If no parser was specified, create a default XMLParser
                parser = XMLParser()
                if hasattr(parser, '_parse_whole'):
                    # The default XMLParser, when it comes from an accelerator,
                    # can define an internal _parse_whole API for efficiency.
                    # It can be used to parse the whole source without feeding
                    # it with chunks.
>                   self._root = parser._parse_whole(source)
E                   xml.etree.ElementTree.ParseError: unclosed token: line 292, column 69

/opt/homebrew/Cellar/python@3.10/3.10.6_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/xml/etree/ElementTree.py:580: ParseError

The copy of the file S1A_IW_GRDH_1SDV_20210722T182826_20210722T182851_038894_0496E2_5BD9.SAFE/manifest.safe that pytest picks up from its temp dir is truncated after 292 lines.

@j08lue
Copy link
Contributor

j08lue commented Nov 25, 2022

The recorded fixture tests/fixtures/vcr_cassettes/data/manifest.safe appears to be complete (332 lines). It is the copy that pytest creates in its temporary directory that is truncated for some reason.

@j08lue
Copy link
Contributor

j08lue commented Nov 25, 2022

Btw, why is the manifest.safe even downloaded when the node filter pattern in the test does not seem to include it: "*preview/*.kml"?

nodefilter = make_path_filter("*preview/*.kml")
product_info = api.download(uuid, str(tmpdir), checksum=True, nodefilter=nodefilter)

@avalentino
Copy link
Contributor

@j08lue the manifest contains the full list of files included in the product. It is always downloaded and parsed to extract the file list (and associate metadata), then filters use this information to decide if a file shall be actually downloaded.

To keep things consistent since the rest of the code uses "directory" instead of "folder".
@codecov
Copy link

codecov bot commented Mar 10, 2023

Codecov Report

Merging #573 (4e13ae5) into main (7fd4122) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #573      +/-   ##
==========================================
+ Coverage   84.44%   84.51%   +0.07%     
==========================================
  Files           7        7              
  Lines        1292     1298       +6     
  Branches      335      322      -13     
==========================================
+ Hits         1091     1097       +6     
  Misses        145      145              
  Partials       56       56              
Impacted Files Coverage Δ
sentinelsat/download.py 73.15% <100.00%> (ø)
sentinelsat/sentinel.py 91.25% <100.00%> (+0.07%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

No longer supported by the GitHub action.
Windows and MacOS started to require a token for some reason. Might be a bug.
Copy link
Contributor

@j08lue j08lue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for completing this long-standing PR, @valgur.

@valgur valgur merged commit aa169a8 into sentinelsat:main Mar 10, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants