Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor filepath equality checks #197

Closed
dcastro opened this issue Oct 18, 2022 · 8 comments · Fixed by #230
Closed

Refactor filepath equality checks #197

dcastro opened this issue Oct 18, 2022 · 8 comments · Fixed by #230
Assignees
Milestone

Comments

@dcastro
Copy link
Member

dcastro commented Oct 18, 2022

Clarification and motivation

In the past, we've had many bugs related to path equality. A few examples (there are probably more):

We've solved them mainly by sprinkling the codebase with normalise, normaliseWithNoTrailing, expandIndirections, dropTrailingPathSeparator, canonizeLocalRef etc.

But this approach is extremely error-prone.

A more principled approach might be to canonicalize filepaths using canonicalizePath.

We should canonicalize:

  • All filepaths returned by git ls-files
  • All local references

Here's a suggestion (haven't actually tried it):

-newtype RepoInfo = RepoInfo (Map FilePath (Maybe FileInfo))
+newtype RepoInfo = RepoInfo (Map CanonicalPath (Maybe FileInfo))

data FileInfo = FileInfo
  { _fiReferences :: [Reference]
  , _fiAnchors    :: [Anchor]
  } 

data Reference = Reference
  { rName   :: Text
  , rLink   :: Text
  , rAnchor :: Maybe Text
  , rPos    :: Position
+ , rInfo   :: ReferenceInfo
  }


+ -- We probably won't need the type `LocationType` after this.
+data ReferenceInfo
+  = RICurrentFile
+  | RIOtherFile CanonicalPath
+  | RIExternal
+  | RIOtherProtocol

+newtype CanonicalPath = UnsafeCanonicalPath { unCanonicalPath :: FilePath }
+  deriving newtype (Eq, Show, Ord)
+ 
+canonicalizePath :: FilePath -> IO CanonicalPath
+canonicalizePath = UnsafeCanonicalPath . Directory.canonicalizePath

Then we should be able to simply use the Eq and Ord instances as usual for comparing CanonicalPaths, checking if a CanonicalPath exists in a Map/Set, etc.

Let's review the uses of normalise and similar functions, and try to get rid of as many as possible.

Acceptance criteria

  • We're canonicalizing paths at the boundaries (i.e. after reading filepaths from git ls-files and from scanning markdown files)
  • We're avoiding manually "massaging" paths using normalise/dropTrailingPathSeparator/etc throughout the codebase
@dcastro dcastro added this to the 0.3 milestone Oct 18, 2022
@aeqz aeqz self-assigned this Nov 29, 2022
aeqz added a commit that referenced this issue Nov 30, 2022
Problem: The current usage of filepaths is error-prone and can be simplified.

Solution: Canonicalize filepaths at the boundaries, so their management will be safer and will simplify the codebase.
@aeqz
Copy link
Contributor

aeqz commented Nov 30, 2022

I have been trying to tackle this refactor with the proposed approach and I think that the result looks promising. Just some details:

With this change, files in the output which used to be relative become absolute (canonical) unless we store also their non canonical representation or manage to get the relative path again before printing. Should we add this overhead in order to keep relative paths in the program output? Or let the output to show canonical paths from now on?

Also, currently References are constructed with no (pure) access to the repository root path, so there is no way to canonicalize absolute links at that point, which is required to safely check if the link corresponds to the same file or to other one. I have tried this and currently xrefcheck does not report a file as 'current file' when the reference is for example /README.md#anchor or ./README.md#anchor from inside the same README.md file. I can try to rearrange things in order to fix this with this refactor, or open a new issue. What do you think @Martoon-00?

@Martoon-00
Copy link
Member

Should we add this overhead in order to keep relative paths in the program output?

Good concern. I would go with printing the old, non-canonicalized filepaths.

Let me read through the code for a while, I'll respond to your last paragraph then.

@Martoon-00
Copy link
Member

I have tried this and currently xrefcheck does not report a file as 'current file' when the reference is for example /README.md#anchor or ./README.md#anchor from inside the same README.md file

You are referring to "(current file)" next to the link in the xrefcheck's output?

I guess something went wrong here. Initially it was named "local" and was intended to be a property related to the link's format (i.e. denoted whether the link contained the part with filepath or not). And in fd96731 it was renamed, and you well spotted that "current file" is likely to be interpreted differently now 🤔. It was not meant to be smart in understanding whether we point to the same file or not.

Let's probably take that commit as example and rename everything related to CurrentFileLoc again to "file-local" or something like that which would let us be more unambigous. Could you make a separate PR for this?

@aeqz
Copy link
Contributor

aeqz commented Dec 1, 2022

It makes sense, so that tag just classifies file links into relative (./file.md#example), absolute (/file.md#example) or file-local (#example). As you said, I misunderstood it because of the "current file" text.

I will change it before applying the refactor, "file-local" sounds good and less unambiguous to me.

aeqz added a commit that referenced this issue Dec 2, 2022
Problem: The current usage of filepaths is error-prone and can be simplified.

Solution: Canonicalize filepaths at the boundaries, so their management will be safer and will simplify the codebase.
aeqz added a commit that referenced this issue Dec 2, 2022
aeqz added a commit that referenced this issue Dec 2, 2022
aeqz added a commit that referenced this issue Dec 2, 2022
aeqz added a commit that referenced this issue Dec 2, 2022
@aeqz
Copy link
Contributor

aeqz commented Dec 2, 2022

I have been applying the refactor in a branch. My main concern regarding the result is currently that I think that we should try to specify how we want to show filepaths in the program output, because the current behaviour may be difficult to exactly replicate after the refactor. These are some details that I have noticed by observing the golden test expected outputs:

  • The output is different if we execute xrefcheck -r some-path compared with executing xrefcheck inside some-path.
  • When reporting errors, files are shown with a leading "./" when the link is absolute and with no leading "./" when the link is relative.

I will try to think about a possible strategy and propose it here.

@Martoon-00
Copy link
Member

The output is different if we execute xrefcheck -r some-path compared with executing xrefcheck inside some-path.

I think, let's not care about preserving this. Running from within the repository can produce the same output as if run from the repository root, this may be even considered as a benign behaviour.

Although to make sure I get it correctly, I would be glad to see the diff myself. Could you create a pull request, maybe in a draft state, so that we could continue the discussion there?

When reporting errors, files are shown with a leading "./" when the link is absolute and with no leading "./" when the link is relative.

Not vice versa?

@aeqz
Copy link
Contributor

aeqz commented Dec 2, 2022

I think, let's not care about preserving this

Could you create a pull request, maybe in a draft state

👍

Not vice versa?

Right, not vice versa. These are examples from the current expected outputs in golden tests. Look at how the "file that does not exist" is shown:

  ➥  In file dir1/dir2/d2f1.md
     bad reference (relative) at src:28:1-31:
       - text: "bad-casing-file-rel"
       - link: D2F2.md/
       - anchor: -

     File does not exist:
       dir1/dir2/D2F2.md/
  ➥  In file dir1/dir2/d2f1.md
     bad reference (absolute) at src:42:1-22:
       - text: "file-abs-2"
       - link: /d1f1.md
       - anchor: -

     File does not exist:
       ./d1f1.md

@aeqz aeqz mentioned this issue Dec 2, 2022
8 tasks
@Martoon-00
Copy link
Member

That's hilarious. I believe displaying it like this can be considered reasonable from several points of view, but anyway there is no need to stick to it, let's consider on a PR basis whether the change in the output format is worth it or not.

aeqz added a commit that referenced this issue Dec 2, 2022
aeqz added a commit that referenced this issue Dec 6, 2022
aeqz added a commit that referenced this issue Dec 7, 2022
Problem: The current usage of filepaths is error-prone and can be simplified.

Solution: Canonicalize filepaths at the boundaries, so their management will be safer and will simplify the codebase.
aeqz added a commit that referenced this issue Dec 7, 2022
aeqz added a commit that referenced this issue Dec 7, 2022
aeqz added a commit that referenced this issue Dec 7, 2022
aeqz added a commit that referenced this issue Dec 16, 2022
aeqz added a commit that referenced this issue Dec 16, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
Problem: The current usage of filepaths is error-prone
and can be simplified.

Solution: Canonicalize filepaths at the boundaries, so
their management will be safer and will simplify the
codebase.
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
aeqz added a commit that referenced this issue Dec 20, 2022
Problem: the current usage of filepaths is error-prone and can be
simplified.

Solution: canonicalize filepaths at the boundaries, so their management
will be safer and will simplify the codebase.
aeqz added a commit that referenced this issue Dec 22, 2022
Problem: the current usage of filepaths is error-prone and can be
simplified.

Solution: canonicalize filepaths at the boundaries, so their management
will be safer and will simplify the codebase.
@aeqz aeqz closed this as completed in #230 Dec 22, 2022
aeqz added a commit that referenced this issue Dec 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants