[CHANGE] Align the Data Model with Wikidata Items like "file system object" #30

Closed
dla-kramski opened this issue Feb 13, 2024 · 10 comments · Fixed by #53

@dla-kramski

Wikidata is playing an increasingly important role in digital preservation.

FileTrove should align its data model with this de facto standard (see https://www.wikidata.org/wiki/Q37787110 and related pages):

file system object (Q37787110)
    part of:
        file system (Q174989)
    has characteristic:
        path (Q817765)
            has part(s):
                path separator (Q64826685)
                filename (Q1144928)
                    has part(s):
                        filename extension (Q186157)

This could be implemented with the following changes:

Session table

  • New Column: filesystem
    • WD description: "concrete format or program for storing files and directories on a data storage device" (https://www.wikidata.org/wiki/Q174989)
    • Example value(s): "ntfs", "ext4", "fat32"
    • Remarks: The Wikidata description may not be the best possible... Need a controlled vocabulary for FS values.
  • New Column: pathseparator

Files and directories tables

  • New Column: filepath | dirpath
    • WD description: "general form of the name of a file or directory; resources can be represented by either absolute or relative paths" (https://www.wikidata.org/wiki/Q817765)
    • Example value(s): "logs/filetrove.log"
    • Remarks: This should be relative to the session's mount point. It is identical to the existing columns filename | dirname.
  • New Column: filename | dirname
    • WD description: "text string used to uniquely identify a computer file" (https://www.wikidata.org/wiki/Q1144928)
    • Example value(s): "filetrove.log"
    • Remarks: We should add to the description: "Without the leading path". This is different from the existing columns filename | dirname.
  • New Column: filenameextension | dirnameextension
    • WD description: "suffix to the name of a computer file" (https://www.wikidata.org/wiki/Q186157)
    • Example value(s): "log"
    • Remarks: Of course, the extension is only a weak indicator of the file format, but it does play a role. Less common for directories, but still may be useful.

This information could also be obtained afterwards by parsing the existing "filename" column, but ftrove has it all to hand at run time and can easily record it. The filename without the path may be particularly useful for tracking files with the same name across several sessions.
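As a rough illustration of the proposed columns, the sketch below derives filepath, filename and filenameextension from a full path and a mount point. The `Entry` type and `NewEntry` function are hypothetical names chosen for this example, not FileTrove's actual code, and the extension is stored without the dot to match the example value "log" above.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Entry holds the three proposed columns for one file.
// Hypothetical type, used only for this illustration.
type Entry struct {
	Path      string // path relative to the session's mount point
	Name      string // last path element, without the leading path
	Extension string // suffix after the final dot, stored without the dot
}

// NewEntry derives the proposed columns from a mount point and a full path.
// Note: filepath.Ext is used as-is here, so names that start with a dot
// would need extra handling.
func NewEntry(mountpoint, full string) (Entry, error) {
	rel, err := filepath.Rel(mountpoint, full)
	if err != nil {
		return Entry{}, err
	}
	name := filepath.Base(rel)
	ext := strings.TrimPrefix(filepath.Ext(name), ".")
	return Entry{Path: rel, Name: name, Extension: ext}, nil
}

func main() {
	e, err := NewEntry("/mnt/session", "/mnt/session/logs/filetrove.log")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", e)
}
```

Recording these values during the scan avoids re-parsing stored paths later with a separator that may differ from the one used on the scanning machine.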

@dla-kramski added the enhancement label Feb 13, 2024
@steffenfritz
Owner

This is a bigger change, but it makes sense and will be implemented. Added as a milestone for v1.0.0.

@steffenfritz
Owner

Regarding the filesystem, Ross adds a +1 in #43 (noting it here so it is not lost when that issue is closed).

@steffenfritz
Owner

steffenfritz commented Apr 13, 2024

@dla-kramski I am not sure about the extension of a directory. As there is no definition of it and it really is just part of the name, how would you work with it? How would you define the boundaries, e.g. would "." separate the dirname and the extension? How should we handle a dirname that contains more than one (arbitrarily chosen) separator? I understand that there are extensions like "SYSTEM" and such, but these carry semantics for the user and do not serve a technical purpose like MIME connotations.

So, I'd like to not add dirname extensions.

@steffenfritz
Owner

steffenfritz commented Apr 13, 2024

Added or aligned with Wikidata:

  • filepath | dirpath
  • filename | dirname
  • filenameextension
  • pathseparator (depends on the OS FileTrove is running on)
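For the pathseparator column, a minimal sketch of how the value could be obtained in Go is shown here. The function name `pathSeparator` is hypothetical; `os.PathSeparator` is the standard library constant for the OS the program runs on, which is not necessarily the separator native to the scanned volume.

```go
package main

import (
	"fmt"
	"os"
)

// pathSeparator returns the separator of the OS this program runs on,
// converted to a string so it can be stored in a text column.
// Hypothetical helper, used only for this illustration.
func pathSeparator() string {
	return string(os.PathSeparator)
}

func main() {
	fmt.Printf("pathseparator=%q\n", pathSeparator())
}
```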

ToDo / To be discussed

  • dirnameextension: see above
  • filesystem: As this has to be done separately for every supported OS that FT runs on, it will take some time and testing.

@steffenfritz
Owner

Go's filepath.Ext() is not working as expected:

filename: .hiddenfile
filepath: ../../testdata/.hiddenfile
filenameextension: .hiddenfile

The doc (https://pkg.go.dev/path/filepath#Ext) says:
"Ext returns the file name extension used by path. The extension is the suffix beginning at the final dot in the final element of path; it is empty if there is no dot."

As the example shows, this approach is not perfect: every hidden file on Linux/Unix without a dot extension gets its own name as its extension. An additional condition is needed: the dot must not be the first character of the name.
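One way to express that extra condition is sketched below. This is an assumption about how such a check could look, not the workaround that was committed to FileTrove; the helper name `extension` is hypothetical, and it returns the extension without the leading dot.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extension returns the extension of the last element of path, without the dot.
// A name whose only dot is the leading one (e.g. ".hiddenfile") is treated as
// having no extension. Hypothetical helper, used only for this illustration.
func extension(path string) string {
	name := filepath.Base(path)
	ext := filepath.Ext(name)
	if ext == name {
		// The final dot is the first character, so the whole name
		// would be reported as the extension.
		return ""
	}
	return strings.TrimPrefix(ext, ".")
}

func main() {
	fmt.Printf("%q\n", extension("../../testdata/.hiddenfile")) // ""
	fmt.Printf("%q\n", extension("logs/filetrove.log"))         // "log"
	fmt.Printf("%q\n", extension(".config.yaml"))               // "yaml"
}
```

Comparing the result of `filepath.Ext` with the whole name keeps the standard library's definition for every other case and only overrides the dotfile case.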

@steffenfritz
Owner

steffenfritz commented Apr 14, 2024

golang/go#66814

As Golang will not change filepath.Ext(), I added a workaround.

@steffenfritz
Owner

Regarding filesystem:

Significant changes would have to be made to automatically determine the file system. In addition, the execution might have to be carried out as root/SYSTEM, which is not desirable. I am considering introducing a flag that allows users to add the file system manually.
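A manual flag of that kind might look roughly like the following sketch. The flag name `-filesystem`, the function `parseFilesystem` and the three-entry vocabulary are all assumptions made for illustration (the vocabulary simply reuses the example values from the proposal); none of them are taken from FileTrove.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// knownFilesystems is a hypothetical controlled vocabulary for the column.
var knownFilesystems = map[string]bool{"ntfs": true, "ext4": true, "fat32": true}

// parseFilesystem reads a hypothetical -filesystem flag from args and checks
// the value against the vocabulary. An empty result means "not recorded".
func parseFilesystem(args []string) (string, error) {
	fs := flag.NewFlagSet("ftrove-sketch", flag.ContinueOnError)
	name := fs.String("filesystem", "", "file system of the scanned volume, e.g. ext4")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	v := strings.ToLower(strings.TrimSpace(*name))
	if v == "" {
		return "", nil
	}
	if !knownFilesystems[v] {
		return "", fmt.Errorf("unknown file system %q", v)
	}
	return v, nil
}

func main() {
	v, err := parseFilesystem(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("filesystem=%q\n", v)
}
```

Leaving the value to the user avoids both the per-OS detection code and the need to run with elevated privileges, at the cost of relying on the user to supply a correct value.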

@dla-kramski
Author

> @dla-kramski I am not sure about the extension of a directory. As there is no definition of it and it really is just part of the name, how would you work with it? How would you define the boundaries, e.g. would "." separate the dirname and the extension? How should we handle a dirname that contains more than one (arbitrarily chosen) separator? I understand that there are extensions like "SYSTEM" and such, but these carry semantics for the user and do not serve a technical purpose like MIME connotations.
>
> So, I'd like to not add dirname extensions.

On second thought, I'm inclined to agree.

@steffenfritz
Owner

  • dirnameextension dropped
  • For v1.0.0 there will also be no filesystem detection. Automatic detection will be discussed again for a later version. I am not sure whether filesystem detection is in the scope of FileTrove, as it works on files independently of the filesystem.

steffenfritz added a commit that referenced this issue Apr 20, 2024
…el-with-wikidata-items-like-file-system-object

Close #30: Change align the data model with wikidata items like file system object