Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match filename rather than paths #73

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kimt33
Copy link
Contributor

@kimt33 kimt33 commented Nov 19, 2017

  1. Split the given path into its components and compare each component
    with the given pattern (re.split is used)
  2. Add exception for double backslash (would cause problems with split)
  3. Add tests and documentation
  4. Little rearrangement

1. Split the given path into its components and compare each component
   with the given pattern (re.split is used)
2. Add exception for double backslash (would cause problems with split)
3. Add tests and documentation
4. Little rearrangement
@codecov
Copy link

codecov bot commented Nov 19, 2017

Codecov Report

Merging #73 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #73      +/-   ##
==========================================
+ Coverage   98.13%   98.16%   +0.03%     
==========================================
  Files          17       17              
  Lines         642      655      +13     
  Branches       93       94       +1     
==========================================
+ Hits          630      643      +13     
  Misses         12       12
Impacted Files Coverage Δ
cardboardlint/tests/test_common.py 100% <100%> (ø) ⬆️
cardboardlint/common.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed36bcb...aa56061. Read the comment docs.

@tovrstra
Copy link
Member

Good idea but we need to work out a few more details.

Can you explain the purpose of the double backslash?

Some of the tests are not really in line with expected behavior on a bash terminal, e.g. I would expect
assert matches_filefilter('foo/a.py', ['+ *.py']) should fail when one tries to mimic the behavior of the glob function or bash. How do we want to deal with this? For example, it would be good to look at how rsync works because it has to handle to same problem, namely mimicking bash behavior, while also adding a few extra features where bash falls short for file filtering. See https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man1/rsync.1.html and scroll down to the section "INCLUDE/EXCLUDE PATTERN RULES".

The best way to implement this is to translate the pattern into one regular expression that works on the entire path name and gives a match or not. This is an example implementation of that idea: https://github.com/metagriffin/globre

@tovrstra
Copy link
Member

P.S. When you put "Fixes #72" in the first message of your PR, that issue will get closed automatically when the PR is merged. This is better than putting it in the PR title because that only adds confusion, namely having two numbers appearing on the title line.

@tovrstra
Copy link
Member

Oh, something we shouldn't forget: when we merge a change like this, we should do it at a moment where we can fix all the repositories that rely on the current behavior.

@kimt33
Copy link
Contributor Author

kimt33 commented Nov 20, 2017

Can you explain the purpose of the double backslash?

I thought the backslash will cause some problems later on for someone that would use a character with a backslash. I mainly thought of it for people that would use spaces in the filename, like file\ name.py, but on second thought, this would probably get passed as a string so it would look more like file name.py. Still, if we allow backslash to be used, then the case of '\' + os.sep should be considered, which should be treated differently than just os.sep.

So then the question is to support it or not because these crazy names violate some sort of PEP or law of nature. I thought it should be supported because these names should be caught (hopefully) by some linter anyways and the functionality of the method is to filter out names with the given pattern (and not to judge the user for their unconventional naming practices).

The complexity introduced to the code wasn't that terrible (i.e. using regular expression) in exchange for slight generality, so I thought it was okay.

@kimt33
Copy link
Contributor Author

kimt33 commented Nov 20, 2017

P.S. When you put "Fixes #72" in the first message of your PR, that issue will get closed automatically when the PR is merged. This is better than putting it in the PR title because that only adds confusion, namely having two numbers appearing on the title line.

Thank you, it does make things a bit ugly. I thought that adding the issue number will link the PR to the issue but that can be done somewhere in the text and not in the title.

@kimt33
Copy link
Contributor Author

kimt33 commented Nov 21, 2017

I like the *.py format because it doesn't assume anything about the package structure (like gitignore). Normally, I would need to check all python files, and maybe skip over some files within certain directories. If I go up to three/four levels of subdirectory, then I would need to include multiple patterns that more or less do the same thing. And I would need to change them everytime I move something around.

I just thought that since statements like + *.py, + */*.py, + */*/*.py are impractical, we should use *.py to describe them all. If we use *.py format to search the ends of the path, then we should also be able to use test_*.py.

I just checked the rsync documentation and I think they do the same thing (I think):

if the pattern starts with a / then it is anchored to a particular spot in the hierarchy of files, otherwise it is matched against the end of the pathname.

Even the unanchored "sub/foo" would match at any point in the hierarchy where a "foo" was found within a directory named "sub".

@kimt33 kimt33 changed the title Match filename rather than paths (Issue #72) Match filename rather than paths Nov 22, 2017
@kimt33
Copy link
Contributor Author

kimt33 commented Nov 30, 2017

@tovrstra I remembered the purpose of the double backslash. The code works by dividing up the given path (as a string) into the contributing directories (and subdirectories) and/or the filename. The path is "decomposed" by splitting it by the os.sep, which I'll denote / for the sake of argument. Then, /dir/sub/file should be divided into '', 'dir', 'sub', and 'file'.

However, if the directory name or the filename contains the os.sep, e.g. '2015/6/2.txt', then the code will misinterpret the name as a collection of subdirectories. Since this name will be passed in as 2015\/6\/2.txt, we use the regular expression to split using / only when it is not preceded by \.

However, if a directory name ends with a double backslash, e.g. why\\, then all the names that succeed it will be misinterpreted within this scheme. For example, the file not.txt in the directory why\\ will be described as why\\/not.txt. Since \/ are excluded, the code will interpret this as a single file rather than a file in a directory. To avoid these cases (which can be made even more complex with more backslashes), we raise an error if there are any double backslashes.

@tovrstra
Copy link
Member

ok. I understand. Thanks for explaining.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants