Allow to recursively search package #1270

mcoconnor · 2023-08-23T17:30:34Z

Description

Enables recursive mode for glob when processing the exclude_glob setting during package assembly. This allows the use of the ** operator in glob strings and brings the functionality in line with existing documentation.

GitHub Issues

Reported here: #1269

coveralls · 2023-08-28T00:18:30Z

coverage: 74.688%. remained the same when pulling ddd6b65 on CCITatBCM:allow_exclude_glob_recursive into f2f03ba on zappa:master.

javulticat

TL;DR: Please add happy path and some failure mode unit tests, and I think this PR will be ready to merge! Thanks, great work!

Hi @mcoconnor, thanks for pointing this issue out and taking the initiative to create a PR to fix the issue yourself!

This change itself looks pretty good, but I'd like to ask that you also add some test cases to this PR before we merge it - just to confirm the behavior is working as expected, especially since it will be touching a common setting currently being actively used by a large portion of our userbase.

So, unit tests that prove that your changes now allow the ** glob operator to work as expected would be a great place to start.

I also think we probably want to have some tests that would prove that any other behavior that users currently experience/expect when using the exclude_glob setting (besides the ** recursion that is currently not working, of course) isn't going to regress or negatively impact the user experience if we merge this change.

For example, to play devil's advocate... using a recursive glob search over very large directories can be very slow, can it not?

What could the consequences be if someone has massive_dir\**\*.pyi in exclude_glob and then this change goes live?
Could that kind of thing risk causing major delays or even timeouts on CI runners if folks are deploying Zappa via some kind of CI pipeline?
- Could CI timeouts prevent user applications from getting deployed in a timely manner (or at all)?
Could delays cause any open connection Zappa had with AWS to timeout?
- If so, does Zappa have logic in place to automatically try reconnecting to AWS if that happens?
Could delays cause impatient users to think that Zappa is stuck/hanging when it's really just slowly recursively searching over a large directory listed in exclude_blob, which ends up with them force-quitting the process (potentially leaving their AWS environment in a very ugly state due to their force-quit during their half-deployed Zappa application).
- Will those users then almost certainly come here to complain that "Zappa is broken because it just hangs whenever I try to use it"? 😄

All right - now, I'm not saying we need to solve for all of those potential issues before merging this PR, but if things can go wrong, we should at least know what those things could commonly be and how the most common problems will present themselves to the user (which is something tests come in very handy to help show us), which will then help teach us how we can ameliorate, or at least mitigate, these potential issues when, or ideally before, they happen to our users.

So, how can we solve for any of the potential issues that may arise from this rather classic example of a recursion performance dilemma?

Set a timeout for how long we'll recursively look through each directory before moving on to the next one? But what do we do if we actually hit that limit?
Enforce a size/file limit for directories that can be recursively searched using the ** operator? But how do we determine that limit? Especially when machine A might be able to recursively search a 500k file folder structure twice as fast as machine B.
Have documentation explaining the potential issues and any workarounds we've identified that can be attempted by the users if they encounter those issues.

So, yeah... As you can see, even with a simple two line change, messing with recursion can open up a whole can of worms. 🙂

My two cents: you are making a change that fixes behavior that is currently incorrect. I like to have correct behavior, and imo, in this case, that outweighs any unlikely performance edge cases we may have to deal with. The likelihood that users are going to be doing recursive glob searches on directories large enough to cause them or us too much pain is probably very low for the vast majority of our users, so I'm willing to risk letting it go and dealing with any performance issues if/when our users complain of them.

That said, I still would love to see a few tests included in this PR that test some of these worst-case failure scenarios and how Zappa would react in those cases. As I hope I've explained, that will be helpful in showing us what to expect from any potential user issues if/when things go wrong, and the sooner we have that information, the sooner we can begin thinking about ways we might be able to improve the performance of these recursive glob searches. Thanks!

javulticat · 2023-11-06T05:15:01Z

zappa/core.py

-                for path in glob.glob(os.path.join(temp_project_path, glob_path)):
+                # Use `recursive` to match paths deep in the directory tree
+                # https://github.com/zappa/Zappa/issues/1269
+                for path in glob.glob(os.path.join(temp_project_path, glob_path), recursive=True):


Do you think there's any downside to using glob.iglob() here instead of glob.glob(), as iglob() returns an iterator, which should be easier on memory usage if a user decides to put massive folders with tons of files in exclude_glob? Especially since this code may often running be running on a CI runner, and the free ones that Github/Gitlab provide aren't provisioned with very much RAM at all (IIRC, it's like 1GB, and most is taken up my the OS and CI runner software).

Thanks, yes, that's a great idea.

javulticat · 2023-11-06T05:17:07Z

zappa/core.py

-            for path in glob.glob(os.path.join(temp_project_path, glob_path)):
+            # Use `recursive` to match paths deep in the directory tree
+            # https://github.com/zappa/Zappa/issues/1269
+            for path in glob.glob(os.path.join(temp_project_path, glob_path), recursive=True):


Same question of using glob.iglob() instead of glob.glob(). (See other comment)

mcoconnor · 2023-11-15T04:14:40Z

Thanks for the very thoughtful and thorough review (especially for such an apparently minor change). I'll be happy to add some tests when I have time to swing back to this. You've asked some interesting questions about performance management and I agree it would be nice to characterize the failure modes a little more and consider how to handle or document them. I'm not sure about including those in the test suite though--seems like intentionally triggering performance issues in a unit test will be a pain for anyone running tests going forward, though I'm open to other opinions.

I'll have to think a little bit about what mitigations might look like in this scenario; to a degree, I feel that either preemptively bombing the process (e.g. a timeout) or executing a halfway correct solution (e.g. limiting the number of results processed) could lead to more frustrating bugs than an obvious failure with a long-running process. The other good thing is that this affects a relatively early part of the deploy/update process so if it does hang, there isn't a lot of garbage that will be left other than the temp file tree that was being operated on.

github-actions · 2024-04-03T18:56:43Z

Hi there! Unfortunately, this PR has not seen any activity for at least 90 days. If the PR is still relevant to the latest version of Zappa, please comment within the next 10 days if you wish to keep it open. Otherwise, it will be automatically closed.

github-actions · 2024-04-13T20:36:48Z

Hi there! Unfortunately, this PR was automatically closed as it had not seen any activity in at least 100 days. If the PR is still relevant to the latest version of Zappa, please open a new PR.

Allow to recursively search package

ddd6b65

javulticat self-requested a review November 6, 2023 03:27

javulticat requested changes Nov 6, 2023

View reviewed changes

github-actions bot added the no-activity [Bot] Closing soon if no new activity label Apr 3, 2024

github-actions bot added the auto-closed [Bot] Closed, details in comments label Apr 13, 2024

github-actions bot closed this Apr 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow to recursively search package #1270

Allow to recursively search package #1270

mcoconnor commented Aug 23, 2023

coveralls commented Aug 28, 2023

javulticat left a comment

javulticat Nov 6, 2023

mcoconnor Nov 15, 2023

javulticat Nov 6, 2023

mcoconnor commented Nov 15, 2023

github-actions bot commented Apr 3, 2024

github-actions bot commented Apr 13, 2024

Allow to recursively search package #1270

Allow to recursively search package #1270

Conversation

mcoconnor commented Aug 23, 2023

Description

GitHub Issues

coveralls commented Aug 28, 2023

javulticat left a comment

Choose a reason for hiding this comment

javulticat Nov 6, 2023

Choose a reason for hiding this comment

mcoconnor Nov 15, 2023

Choose a reason for hiding this comment

javulticat Nov 6, 2023

Choose a reason for hiding this comment

mcoconnor commented Nov 15, 2023

github-actions bot commented Apr 3, 2024

github-actions bot commented Apr 13, 2024