New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change allowed filename characters to whitelist #2070

Closed
wants to merge 5 commits into
base: master
from

Conversation

Projects
None yet
10 participants
@AI0867
Member

AI0867 commented Oct 2, 2017

No description provided.

@AI0867

This comment has been minimized.

Show comment
Hide comment
@AI0867

AI0867 Oct 2, 2017

Member

786c419 introduces one implementation. 143ae3d replaces it with another. If the latter is preferred, I can squash the two commits together.

Member

AI0867 commented Oct 2, 2017

786c419 introduces one implementation. 143ae3d replaces it with another. If the latter is preferred, I can squash the two commits together.

@gfgtdf

This comment has been minimized.

Show comment
Hide comment
@gfgtdf

gfgtdf Oct 2, 2017

Contributor

I'm not convinced that the second implementation is better, in particular since isalnum somehow depends on locales (setlocale). So while the first implementation is obvious to understand the second one might require advanced knowdedge of locales to veryfy that it does what one would expect.

Contributor

gfgtdf commented Oct 2, 2017

I'm not convinced that the second implementation is better, in particular since isalnum somehow depends on locales (setlocale). So while the first implementation is obvious to understand the second one might require advanced knowdedge of locales to veryfy that it does what one would expect.

@AI0867

This comment has been minimized.

Show comment
Hide comment
@AI0867

AI0867 Oct 2, 2017

Member

I switched to that implementation as that was what addon_name_legal already used. I could switch that one to the previous implementation instead.

Member

AI0867 commented Oct 2, 2017

I switched to that implementation as that was what addon_name_legal already used. I could switch that one to the previous implementation instead.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 3, 2017

Contributor
  1. Implementation with switch is longer and harder to understand. Does it have any advantages compared to simple check whenever character is in set from older commits?
  2. Check whenever there are two periods in any position is not same as check whenever filename ends with period. The difference is in filenames such as file...png, which were disallowed before. Whenever names such as ..- or ..@ can cause problems in some systems I do not know.
  3. If filenames with trailing periods are disallowed, the message should say that, such as:
    " File or directory names may only contain Latin letters, digits, or any of the following characters: '- _ . + @' and may not end with a dot."
    Also, it's better to use simple words such as "Latin", "letter" and "digit", rather than special words such as "alphanumeric", for this message could occur to non-English speakers. It is not translated currently.
Contributor

Arcanister commented Oct 3, 2017

  1. Implementation with switch is longer and harder to understand. Does it have any advantages compared to simple check whenever character is in set from older commits?
  2. Check whenever there are two periods in any position is not same as check whenever filename ends with period. The difference is in filenames such as file...png, which were disallowed before. Whenever names such as ..- or ..@ can cause problems in some systems I do not know.
  3. If filenames with trailing periods are disallowed, the message should say that, such as:
    " File or directory names may only contain Latin letters, digits, or any of the following characters: '- _ . + @' and may not end with a dot."
    Also, it's better to use simple words such as "Latin", "letter" and "digit", rather than special words such as "alphanumeric", for this message could occur to non-English speakers. It is not translated currently.
@AI0867

This comment has been minimized.

Show comment
Hide comment
@AI0867

AI0867 Oct 3, 2017

Member
  1. It's almost certainly faster, though I haven't profiled anything.
  2. I've removed the period-related checks from addon_name_legal only, which does not permit periods. They remain in place in addon_filename_legal.
  3. It also doesn't mention the "..", but I can add that too.
Member

AI0867 commented Oct 3, 2017

  1. It's almost certainly faster, though I haven't profiled anything.
  2. I've removed the period-related checks from addon_name_legal only, which does not permit periods. They remain in place in addon_filename_legal.
  3. It also doesn't mention the "..", but I can add that too.
@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 4, 2017

Member

I'm not convinced a whitelist is the correct choice here. What's the reason for excluding Unicode characters such as é or β? As far as I know, such characters are permitted on all major operating systems. I don't see a reason to place more restrictions than Windows does (other than forbidding tildes since Wesnoth uses them for image path functions).

Member

CelticMinstrel commented Oct 4, 2017

I'm not convinced a whitelist is the correct choice here. What's the reason for excluding Unicode characters such as é or β? As far as I know, such characters are permitted on all major operating systems. I don't see a reason to place more restrictions than Windows does (other than forbidding tildes since Wesnoth uses them for image path functions).

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 4, 2017

Contributor

Non-ASCII characters cause problems in many circumstances. For example:

  • If filename is longer than 255 bytes, when encoded as UTF-8, but shorter than 255 codepoints in UTF-16, then it will be acceptable for NTFS, but not for Ext 2-3-4. If filename consists only of ASCII characters, max filename length will be same for both filesystems.
  • If files are put on FAT filesystem, then non-ASCII characters will have to be encoded in a certain codepage. If the filesystem is opened in different codepage, filenames will be screwed up.
  • If files are put into ZIP archive using Windows built-in archiver, then non-ASCII chars in filenames will be unreadable elsewhere, and if they are put into ZIP with properly configured archiver, they will be unreadable in Windows built-in archiver, but will be fine in 7-zip, and other programs on any OS.
  • On GNU/Linux with non-UTF-8 locale, non-ASCII filenames will be unreadable if they are coming from UTF-8 tar archive. Non-UTF-8 locales are uncommon on GNU/Linux, but are still used in some BSD systems.
  • Characters with codepoints above 65535 will suffer, depending on Windows (or MacOS X?) version, because their filesystems are limited to UTF-16, which has limited support of non-BMP characters via surrogate pairs. They are recognized as two characters in some contexts and as one character in other contexts.
  • Some Unicode characters look same or very similar, which can cause hard to spot errors.

In addition to that, ASCII control and special characters can cause problems in various contexts. Windows allows some control characters in filenames, namely \177 (DEL), for example. Special characters make it harder to work with files from shell and scripts.

Contributor

Arcanister commented Oct 4, 2017

Non-ASCII characters cause problems in many circumstances. For example:

  • If filename is longer than 255 bytes, when encoded as UTF-8, but shorter than 255 codepoints in UTF-16, then it will be acceptable for NTFS, but not for Ext 2-3-4. If filename consists only of ASCII characters, max filename length will be same for both filesystems.
  • If files are put on FAT filesystem, then non-ASCII characters will have to be encoded in a certain codepage. If the filesystem is opened in different codepage, filenames will be screwed up.
  • If files are put into ZIP archive using Windows built-in archiver, then non-ASCII chars in filenames will be unreadable elsewhere, and if they are put into ZIP with properly configured archiver, they will be unreadable in Windows built-in archiver, but will be fine in 7-zip, and other programs on any OS.
  • On GNU/Linux with non-UTF-8 locale, non-ASCII filenames will be unreadable if they are coming from UTF-8 tar archive. Non-UTF-8 locales are uncommon on GNU/Linux, but are still used in some BSD systems.
  • Characters with codepoints above 65535 will suffer, depending on Windows (or MacOS X?) version, because their filesystems are limited to UTF-16, which has limited support of non-BMP characters via surrogate pairs. They are recognized as two characters in some contexts and as one character in other contexts.
  • Some Unicode characters look same or very similar, which can cause hard to spot errors.

In addition to that, ASCII control and special characters can cause problems in various contexts. Windows allows some control characters in filenames, namely \177 (DEL), for example. Special characters make it harder to work with files from shell and scripts.

@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 5, 2017

Member
  • On length, it seems to me that this isn't too difficult to solve - limit the names to 127 code points in BMP. The worst-case such name would be 254 bytes in UTF-8. Rarely are filenames anywhere near this long, anyway.
  • Assuming you're describing a problem in FAT32, this does sound a little troublesome; while very few people would still be using FAT32 on their computer, it seems to be used for various kinds of removable media such as memory keys or (possibly) SD cards, due to being supported on both Mac and Windows. I don't have a solution here, though I don't see it as a major problem.
  • The zip problem seems minor, but I suppose it could bite people who decide to download from the web interface instead of through the client...
  • Again, the tar problem seems minor, even more so than the zip problem (unless it's also relevant for zip files on such systems).
  • I'm pretty sure that 99% of the code points people would want to use in filenames are in the BMP and thus less than 65536. (By the way, I'm fairly sure that MacOS/HFS+ uses UTF-8, but I don't have a reference handy so I could be wrong.)
  • I understand the thing about similar characters, but (depending on the font) this can be a problem even in just ASCII, so it seems silly to use this as an argument against allowing Unicode.
  • I'd say control characters shouldn't be allowed, anyway, so this final point is kind of moot.

Other than the issue of length, most of these problems just don't seem all that severe to me. I do wonder if any addon currently uses non-ASCII characters in filenames, though. While I think it's silly to disallow non-ASCII characters and remain against the idea, it's still a limitation I could live with, and I imagine this is true of most people.

Member

CelticMinstrel commented Oct 5, 2017

  • On length, it seems to me that this isn't too difficult to solve - limit the names to 127 code points in BMP. The worst-case such name would be 254 bytes in UTF-8. Rarely are filenames anywhere near this long, anyway.
  • Assuming you're describing a problem in FAT32, this does sound a little troublesome; while very few people would still be using FAT32 on their computer, it seems to be used for various kinds of removable media such as memory keys or (possibly) SD cards, due to being supported on both Mac and Windows. I don't have a solution here, though I don't see it as a major problem.
  • The zip problem seems minor, but I suppose it could bite people who decide to download from the web interface instead of through the client...
  • Again, the tar problem seems minor, even more so than the zip problem (unless it's also relevant for zip files on such systems).
  • I'm pretty sure that 99% of the code points people would want to use in filenames are in the BMP and thus less than 65536. (By the way, I'm fairly sure that MacOS/HFS+ uses UTF-8, but I don't have a reference handy so I could be wrong.)
  • I understand the thing about similar characters, but (depending on the font) this can be a problem even in just ASCII, so it seems silly to use this as an argument against allowing Unicode.
  • I'd say control characters shouldn't be allowed, anyway, so this final point is kind of moot.

Other than the issue of length, most of these problems just don't seem all that severe to me. I do wonder if any addon currently uses non-ASCII characters in filenames, though. While I think it's silly to disallow non-ASCII characters and remain against the idea, it's still a limitation I could live with, and I imagine this is true of most people.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 5, 2017

Contributor

This post has outdated information, refer to newer post
So, out of 606 add-ons from 1.12 and 1.13 servers, 37 would need to correct filenames:
17 use apostrophe '
12 use round parentheses ()
11 add-ons use other disallowed characters, including:
4 have non-ascii characters ’éèäþ and [Оригиналы]
3 have square brackets []
3 have ampersand &
2 have comma ,
2 have exclamation mark !
characters $ and ? are used in one add-on each.
About case conflicts (#2071):
One add-on has case conflict filename which would make it break if installed on Windows.
Another addon has case-conflict, but case-conflicting files are duplicates and removing one of them won't hurt.

Contributor

Arcanister commented Oct 5, 2017

This post has outdated information, refer to newer post
So, out of 606 add-ons from 1.12 and 1.13 servers, 37 would need to correct filenames:
17 use apostrophe '
12 use round parentheses ()
11 add-ons use other disallowed characters, including:
4 have non-ascii characters ’éèäþ and [Оригиналы]
3 have square brackets []
3 have ampersand &
2 have comma ,
2 have exclamation mark !
characters $ and ? are used in one add-on each.
About case conflicts (#2071):
One add-on has case conflict filename which would make it break if installed on Windows.
Another addon has case-conflict, but case-conflicting files are duplicates and removing one of them won't hurt.

@jyrkive

This comment has been minimized.

Show comment
Hide comment
@jyrkive

jyrkive Oct 5, 2017

Member

IMO, breaking 37 add-ons is just horrible and we shouldn't do it. Instead, we should whitelist all the characters above (' ( ) [ ] & , ! $, and BMP Unicode characters), with the exception of ? which isn't allowed on Windows and is therefore already a huge problem.

Member

jyrkive commented Oct 5, 2017

IMO, breaking 37 add-ons is just horrible and we shouldn't do it. Instead, we should whitelist all the characters above (' ( ) [ ] & , ! $, and BMP Unicode characters), with the exception of ? which isn't allowed on Windows and is therefore already a huge problem.

@GregoryLundberg

This comment has been minimized.

Show comment
Hide comment
@GregoryLundberg

GregoryLundberg Oct 5, 2017

Contributor

I'm with @jyrkive on the breakage.

But I'm nervous of those punctuation characters. I guess they're OK so long as nothing Wesnothian ever throws them at a shell script or web browser without carefully escaping them. But, in the back of my head, there's a little fear that, someday, one of those characters is gonna come back to bite us.

Contributor

GregoryLundberg commented Oct 5, 2017

I'm with @jyrkive on the breakage.

But I'm nervous of those punctuation characters. I guess they're OK so long as nothing Wesnothian ever throws them at a shell script or web browser without carefully escaping them. But, in the back of my head, there's a little fear that, someday, one of those characters is gonna come back to bite us.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 5, 2017

Contributor
  • Top half of BMP is encoded with 3 UTF-8 bytes, though, so if anything, limit should be 85 characters. Such limit won't break anything currently, since all filenames are shorter than 75
  • Even more so: over 99.98% of the filenames people did actually use are fully within ASCII.
  • Removing whitespace from allowed characters has affected 24% of add-ons. Compared to that, 37 out of 606 is tiny.
  • Most of affecting filenames are not referenced anywhere and files can simply be renamed.
  • Apostrophe seems to have more or less justified use in filenames such as:
    Blemyr's_Ravine.map, images/portraits/M'brin.png (probably copy of portraits/mal-mbrin.png from TSG)
  • Round parentheses are mostly used unnecessarily, even if rather widely, such as in set of pictures rain-(1).png to rain-(5).png which were copied over several add-ons.
Contributor

Arcanister commented Oct 5, 2017

  • Top half of BMP is encoded with 3 UTF-8 bytes, though, so if anything, limit should be 85 characters. Such limit won't break anything currently, since all filenames are shorter than 75
  • Even more so: over 99.98% of the filenames people did actually use are fully within ASCII.
  • Removing whitespace from allowed characters has affected 24% of add-ons. Compared to that, 37 out of 606 is tiny.
  • Most of affecting filenames are not referenced anywhere and files can simply be renamed.
  • Apostrophe seems to have more or less justified use in filenames such as:
    Blemyr's_Ravine.map, images/portraits/M'brin.png (probably copy of portraits/mal-mbrin.png from TSG)
  • Round parentheses are mostly used unnecessarily, even if rather widely, such as in set of pictures rain-(1).png to rain-(5).png which were copied over several add-ons.
@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 5, 2017

Member

I'm also with @jyrkive on not breaking existing addons (apart from that one using ?). So in summary:

  • The obvious characters should be allowed - alphanumeric characters and the most basic punctuation, _ - .. Also + @ because a number of core Wesnoth filenames use them.
  • We should definitely disallow < > : " / \ | ? * since Windows does not allow them.
  • We should probably also disallow ~ since Wesnoth has a special use for it. (Do any addons use this character in filenames?)
  • We should allow ' ( ) [ ] & , ! $ since they're already used in some addons.
  • Obviously disallow any control characters (ASCII 127 and <32) and I guess spaces (which were already disallowed, so whatever).
  • That leaves # % ; = ^ ` { }. Personally I don't see any reason to disallow any of these, and in particular # % make sense to me for use in filenames. On the other hand, I probably wouldn't miss them that much if they weren't allowed. I guess do whichever is simpler (it might be simpler to restrict them, or it might be simpler to allow them, depending on how you're doing the check).
  • As for non-ASCII BMP characters, that's a bit difficult. Allowing all valid printable BMP characters may be rather complicated, unless you use something like libicu (does Boost.Locale provide wrappers for this sort of thing?) - there are non-characters haphazardly mixed in amongst the valid characters, and if I recall correctly there are also additional control characters beyond the ASCII ones. The private use area starting at 0xE000 can probably also be omitted. I suppose non-ASCII characters could still be forbidden though - breaking 5 addons is nowhere near as bad as breaking 37 addons. I don't think they should be, but chances are I wouldn't really notice that much if they were. However, non-English speakers might notice a lot more.
Member

CelticMinstrel commented Oct 5, 2017

I'm also with @jyrkive on not breaking existing addons (apart from that one using ?). So in summary:

  • The obvious characters should be allowed - alphanumeric characters and the most basic punctuation, _ - .. Also + @ because a number of core Wesnoth filenames use them.
  • We should definitely disallow < > : " / \ | ? * since Windows does not allow them.
  • We should probably also disallow ~ since Wesnoth has a special use for it. (Do any addons use this character in filenames?)
  • We should allow ' ( ) [ ] & , ! $ since they're already used in some addons.
  • Obviously disallow any control characters (ASCII 127 and <32) and I guess spaces (which were already disallowed, so whatever).
  • That leaves # % ; = ^ ` { }. Personally I don't see any reason to disallow any of these, and in particular # % make sense to me for use in filenames. On the other hand, I probably wouldn't miss them that much if they weren't allowed. I guess do whichever is simpler (it might be simpler to restrict them, or it might be simpler to allow them, depending on how you're doing the check).
  • As for non-ASCII BMP characters, that's a bit difficult. Allowing all valid printable BMP characters may be rather complicated, unless you use something like libicu (does Boost.Locale provide wrappers for this sort of thing?) - there are non-characters haphazardly mixed in amongst the valid characters, and if I recall correctly there are also additional control characters beyond the ASCII ones. The private use area starting at 0xE000 can probably also be omitted. I suppose non-ASCII characters could still be forbidden though - breaking 5 addons is nowhere near as bad as breaking 37 addons. I don't think they should be, but chances are I wouldn't really notice that much if they were. However, non-English speakers might notice a lot more.
@jyrkive

This comment has been minimized.

Show comment
Hide comment
@jyrkive

jyrkive Oct 5, 2017

Member

Regarding # % ; = ^ { }, because we have the option to ban them (since they aren't used), I think we should.

Regarding non-ASCII BMP characters, if we can't easily tell which characters are alphanumeric, printable, or even allocated, I think we should just go ahead and continue to allow them all, even non-printable and nonexistent characters. IMO, the problems @Arcanister pointed out for Unicode are grasping at straws (except the situation with supplementary Unicode planes, because those characters are extremely rare and easy to detect).

Member

jyrkive commented Oct 5, 2017

Regarding # % ; = ^ { }, because we have the option to ban them (since they aren't used), I think we should.

Regarding non-ASCII BMP characters, if we can't easily tell which characters are alphanumeric, printable, or even allocated, I think we should just go ahead and continue to allow them all, even non-printable and nonexistent characters. IMO, the problems @Arcanister pointed out for Unicode are grasping at straws (except the situation with supplementary Unicode planes, because those characters are extremely rare and easy to detect).

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 5, 2017

Contributor

Regarding individual characters from [ ] & , ! $ ? ~ set:

  • ~ was already disallowed, so it's not used anywhere.

  • $ is used only in filename name from In "Defence of Gaer Derole" from 1.12:
    core$images$portraits$merfolk$transparent$priestess.png
    Referenced in scenarios/surrounded!_02.cfg:
    profile=portraits/core$images$portraits$merfolk$transparent$priestess.png
    Changing reference to profile=portraits/merfolk/priestess.png and removing offending file would fix the problem.
    Since it's used for variable substitution both in WML and in shell, I would not whitelist it.

  • ? is used in Potato Era (1.12):
    "level/12???.png" referenced in factions/level.cfg and units/level/???.cfg
    "units/level/???.cfg" is not referenced by name.

  • [] are used in Altered and Altered_New_Sprites (1.12) in name main-[+units].cfg, which is referenced in _main.cfg.
    They are also used in "A Whim of Fate", in one filename:
    "images/[Оригиналы]/swamp.jpg", which is higher-res version of "images/swamp.jpg". Hi-Res version is not referenced anywhere and can be safely renamed or removed.
    [] are special characters in AnimationWML and are used in same context as filenames. IMO should be disallowed.

  • , is used in same two addons Altered and Altered_New_Sprites (1.12):
    "ability-heals(4,8).cfg", "ability-regrows(4,8).cfg", "special-circular(1,2,3).cfg", "special-shield(10,20).cfg". None of them are referenced by name in code, so can safely be renamed.
    One of them mentioned in comment: "# (implemented as macro in ability-heals(4,8).cfg)"
    Comma is used as special character in AnimationWML, should be disallowed.

  • ! is used in already mentioned Defence of Gaer Derole (1.12):
    "maps/surrounded!_02.map" referenced in "scenarios/surrounded!_02.cfg"
    ! is also used in PC_Campaign (The Strange Men, 1.12), which also broken on Windows because of case conflict:
    "scenarios/02_Humans!.cfg", referenced by its id 02_Humans! in previous scenario.

  • & used in Everfell Faction (1.12):
    "macros/races&movetypes.cfg", not referenced by filename, can be renamed safely.
    It is also used in Fingerbone Of Destiny (1.12):
    "COPYING&disclaimer.txt".
    And in Battleground Europe (1.12):
    "maps/Intense_River_1v2&3.map" referenced in "scenarios/Intense_River_1v2&3.cfg"
    "maps/Sedan_1&3DE_2FRr_4FR.map" referenced in "scenarios/Intense_River_1v2&3.cfg"

Is it used in WML or Lua as special character?

Regarding non-ascii characters:

  • [Оригиналы] was already covered above. Removing the file won't break anything.
  • þ is used in "CnF arena" (1.12), in multiple names MiddEarþ. Should be easy to bulk-replace to "MiddEarth".
  • è and é are used in "skyraven arena maps":
    "maps/Rumble_lumière_skyraven.map"
    "maps/Rumble_marécage_skyraven.map"
    None of them are referenced, thus can be safely renamed
  • ä is used in "Battle for Westfront" (1.12):
    "images/units/Deutsches_Reich/Panzerjäger/BFWPanzerjäger.png", referenced in "units/BFW-Panzerknacker.cfg"
  • is used in "The Golden Age" (1.12) in one filename:
    "units/monsters/Chak’kso_Ney’yks.cfg". Not referenced anywhere, can be renamed safely.

Regarding '():

  • ' crept in many places. Perhaps should be allowed. Though even if disallowed, shouldn't break anything too hard.

  • () is used in "Era of Four Moons" (1.12) and "Era of Restoration" in one place each:
    "macros/abilities(extended).cfg"

  • () is used in A New Land Era (both 1.12 and 1.13):
    "images/units/drakes/worker-fly-1-(2).png"
    Not referenced anywhere. Probably WIP animation for Drake Worker, which is visually identical to core Drake Fighter.
    The add-on is already broken both on 1.12 and 1.13.

  • () in "images/terrain/rain-(1).png" to "rain-(5).png" are found in following add-ons
    Aria_of_the_Dragon_Slayer
    A_Song_of_Fire
    COTS
    Hero_of_Irdya_I
    Soldier_of_Wesnoth
    War_of_the_Jewel
    Water_Era
    Referenced in a cfg file to create rain effect.

  • () are used in Altered and Altered_New_Sprites (1.12), which were mentioned earlier:
    Many *.cfg files with (media) in name. Not referenced anywhere by name.
    "traits/racial-trait-(stub).cfg". Not referenced anywhere by name either.
    In aforementioned names which also contain comma ,. Not referenced anywhere.

Contributor

Arcanister commented Oct 5, 2017

Regarding individual characters from [ ] & , ! $ ? ~ set:

  • ~ was already disallowed, so it's not used anywhere.

  • $ is used only in filename name from In "Defence of Gaer Derole" from 1.12:
    core$images$portraits$merfolk$transparent$priestess.png
    Referenced in scenarios/surrounded!_02.cfg:
    profile=portraits/core$images$portraits$merfolk$transparent$priestess.png
    Changing reference to profile=portraits/merfolk/priestess.png and removing offending file would fix the problem.
    Since it's used for variable substitution both in WML and in shell, I would not whitelist it.

  • ? is used in Potato Era (1.12):
    "level/12???.png" referenced in factions/level.cfg and units/level/???.cfg
    "units/level/???.cfg" is not referenced by name.

  • [] are used in Altered and Altered_New_Sprites (1.12) in name main-[+units].cfg, which is referenced in _main.cfg.
    They are also used in "A Whim of Fate", in one filename:
    "images/[Оригиналы]/swamp.jpg", which is higher-res version of "images/swamp.jpg". Hi-Res version is not referenced anywhere and can be safely renamed or removed.
    [] are special characters in AnimationWML and are used in same context as filenames. IMO should be disallowed.

  • , is used in same two addons Altered and Altered_New_Sprites (1.12):
    "ability-heals(4,8).cfg", "ability-regrows(4,8).cfg", "special-circular(1,2,3).cfg", "special-shield(10,20).cfg". None of them are referenced by name in code, so can safely be renamed.
    One of them mentioned in comment: "# (implemented as macro in ability-heals(4,8).cfg)"
    Comma is used as special character in AnimationWML, should be disallowed.

  • ! is used in already mentioned Defence of Gaer Derole (1.12):
    "maps/surrounded!_02.map" referenced in "scenarios/surrounded!_02.cfg"
    ! is also used in PC_Campaign (The Strange Men, 1.12), which also broken on Windows because of case conflict:
    "scenarios/02_Humans!.cfg", referenced by its id 02_Humans! in previous scenario.

  • & used in Everfell Faction (1.12):
    "macros/races&movetypes.cfg", not referenced by filename, can be renamed safely.
    It is also used in Fingerbone Of Destiny (1.12):
    "COPYING&disclaimer.txt".
    And in Battleground Europe (1.12):
    "maps/Intense_River_1v2&3.map" referenced in "scenarios/Intense_River_1v2&3.cfg"
    "maps/Sedan_1&3DE_2FRr_4FR.map" referenced in "scenarios/Intense_River_1v2&3.cfg"

Is it used in WML or Lua as special character?

Regarding non-ascii characters:

  • [Оригиналы] was already covered above. Removing the file won't break anything.
  • þ is used in "CnF arena" (1.12), in multiple names MiddEarþ. Should be easy to bulk-replace to "MiddEarth".
  • è and é are used in "skyraven arena maps":
    "maps/Rumble_lumière_skyraven.map"
    "maps/Rumble_marécage_skyraven.map"
    None of them are referenced, thus can be safely renamed
  • ä is used in "Battle for Westfront" (1.12):
    "images/units/Deutsches_Reich/Panzerjäger/BFWPanzerjäger.png", referenced in "units/BFW-Panzerknacker.cfg"
  • is used in "The Golden Age" (1.12) in one filename:
    "units/monsters/Chak’kso_Ney’yks.cfg". Not referenced anywhere, can be renamed safely.

Regarding '():

  • ' crept in many places. Perhaps should be allowed. Though even if disallowed, shouldn't break anything too hard.

  • () is used in "Era of Four Moons" (1.12) and "Era of Restoration" in one place each:
    "macros/abilities(extended).cfg"

  • () is used in A New Land Era (both 1.12 and 1.13):
    "images/units/drakes/worker-fly-1-(2).png"
    Not referenced anywhere. Probably WIP animation for Drake Worker, which is visually identical to core Drake Fighter.
    The add-on is already broken both on 1.12 and 1.13.

  • () in "images/terrain/rain-(1).png" to "rain-(5).png" are found in following add-ons
    Aria_of_the_Dragon_Slayer
    A_Song_of_Fire
    COTS
    Hero_of_Irdya_I
    Soldier_of_Wesnoth
    War_of_the_Jewel
    Water_Era
    Referenced in a cfg file to create rain effect.

  • () are used in Altered and Altered_New_Sprites (1.12), which were mentioned earlier:
    Many *.cfg files with (media) in name. Not referenced anywhere by name.
    "traits/racial-trait-(stub).cfg". Not referenced anywhere by name either.
    In aforementioned names which also contain comma ,. Not referenced anywhere.

@jyrkive

This comment has been minimized.

Show comment
Hide comment
@jyrkive

jyrkive Oct 5, 2017

Member

@Arcanister Sure, the affected files "can be safely renamed", but the problem is that the add-on author needs to do it. We can't do it ourselves, the author would fail to upload the add-on if he/she attempted to update it later.

We would be telling them to rename/remove files in their add-ons, and only because "this kind of file names might cause problems in add-ons and doesn't even cause any in yours". It would be unacceptable treatment of UMC authors in my opinion.

Member

jyrkive commented Oct 5, 2017

@Arcanister Sure, the affected files "can be safely renamed", but the problem is that the add-on author needs to do it. We can't do it ourselves, the author would fail to upload the add-on if he/she attempted to update it later.

We would be telling them to rename/remove files in their add-ons, and only because "this kind of file names might cause problems in add-ons and doesn't even cause any in yours". It would be unacceptable treatment of UMC authors in my opinion.

@AI0867

This comment has been minimized.

Show comment
Hide comment
@AI0867

AI0867 Oct 5, 2017

Member

Reference regarding FS limitations: https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations

The addon manager uses boost-filesystem, so it will handle unicode properly, though the question of whether FAT filesystems will properly handle non-BMP characters remains.

The webinterface distributes tarballs, so I don't see a problem there.

That means the main question is: do we want a restricted, known-to-work ASCII-based whitelist, or do we want to keep a blacklist system?
If we want a blacklist system, I suggest disallowing pretty much anything that appears in the "reserved" column on that wikipedia table, and requiring that the filename is valid UTF-8.

Member

AI0867 commented Oct 5, 2017

Reference regarding FS limitations: https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations

The addon manager uses boost-filesystem, so it will handle unicode properly, though the question of whether FAT filesystems will properly handle non-BMP characters remains.

The webinterface distributes tarballs, so I don't see a problem there.

That means the main question is: do we want a restricted, known-to-work ASCII-based whitelist, or do we want to keep a blacklist system?
If we want a blacklist system, I suggest disallowing pretty much anything that appears in the "reserved" column on that wikipedia table, and requiring that the filename is valid UTF-8.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 5, 2017

Contributor
  • I think that use of non-ascii characters is not significant enough to justify introduction of complex code to check it. So I suggest a white list.
  • Perhaps it makes sense to also whitelist '(), since they did creep into 1.13 server already, though if I did use them, I wouldn't mind renaming a few files if they were banned.
  • Other special characters are only found on 1.12 server. No strong reason to whitelist them.
  • I suggest blocking characters which have special meaning in AnimationWML, because they create ambiguity, which could lead to non-obvious errors, which include [,] and maybe something else.
  • Disallowing special characters in filenames, has added benefit to make introduction of new special syntax in future easier.
Contributor

Arcanister commented Oct 5, 2017

  • I think that use of non-ascii characters is not significant enough to justify introduction of complex code to check it. So I suggest a white list.
  • Perhaps it makes sense to also whitelist '(), since they did creep into 1.13 server already, though if I did use them, I wouldn't mind renaming a few files if they were banned.
  • Other special characters are only found on 1.12 server. No strong reason to whitelist them.
  • I suggest blocking characters which have special meaning in AnimationWML, because they create ambiguity, which could lead to non-obvious errors, which include [,] and maybe something else.
  • Disallowing special characters in filenames, has added benefit to make introduction of new special syntax in future easier.
@jyrkive

This comment has been minimized.

Show comment
Hide comment
@jyrkive

jyrkive Oct 5, 2017

Member

List of all characters which appear in the reserved column in that table in Wikipedia (excluding space and control characters): " * / : < > ? \ | + , . ; = [ ] $ # @ -

Out of those characters, [ ] , $ are already used in add-ons and I strongly want to keep them allowed.

Square brackets are only disallowed in FAT without LFN. Comma is only disallowed in FAT without LFN, Commodore DOS and HP 250. Dollar sign is only disallowed in z/OS. All of those three are horribly obsolete and we don't need to care about them.

Everything in the "reserved characters" column, excluding FAT without LFN, z/OS, Commodore DOS and HP 250: " * / : < > ? \ |. This is equal to the list of characters disallowed on Windows.

Member

jyrkive commented Oct 5, 2017

List of all characters which appear in the reserved column in that table in Wikipedia (excluding space and control characters): " * / : < > ? \ | + , . ; = [ ] $ # @ -

Out of those characters, [ ] , $ are already used in add-ons and I strongly want to keep them allowed.

Square brackets are only disallowed in FAT without LFN. Comma is only disallowed in FAT without LFN, Commodore DOS and HP 250. Dollar sign is only disallowed in z/OS. All of those three are horribly obsolete and we don't need to care about them.

Everything in the "reserved characters" column, excluding FAT without LFN, z/OS, Commodore DOS and HP 250: " * / : < > ? \ |. This is equal to the list of characters disallowed on Windows.

@jyrkive

This comment has been minimized.

Show comment
Hide comment
@jyrkive

jyrkive Oct 5, 2017

Member

So, a summary of my opinion:

  • Disallow control characters
  • Disallow supplementary Unicode planes
  • Disallow " * / : < > ? \ |
  • I don't care much about # % ; = ^ `` { }, but I'm in favor of banning them
  • Allow everything else
Member

jyrkive commented Oct 5, 2017

So, a summary of my opinion:

  • Disallow control characters
  • Disallow supplementary Unicode planes
  • Disallow " * / : < > ? \ |
  • I don't care much about # % ; = ^ `` { }, but I'm in favor of banning them
  • Allow everything else
@AI0867

This comment has been minimized.

Show comment
Hide comment
@AI0867

AI0867 Oct 5, 2017

Member

Out of those characters, [ ] , $ are already used in add-ons and I strongly want to keep them allowed.

Those characters conflict with various types of notation in WML. I especially want to ban these.

Member

AI0867 commented Oct 5, 2017

Out of those characters, [ ] , $ are already used in add-ons and I strongly want to keep them allowed.

Those characters conflict with various types of notation in WML. I especially want to ban these.

@vn971

This comment has been minimized.

Show comment
Hide comment
@vn971

vn971 Oct 5, 2017

Contributor

One additional argument in favor of non-ASCII: those file names still break various backups or badly written software sometimes. For example, you want to privately send your add-on to a friend and you get problems while using zip.

Contributor

vn971 commented Oct 5, 2017

One additional argument in favor of non-ASCII: those file names still break various backups or badly written software sometimes. For example, you want to privately send your add-on to a friend and you get problems while using zip.

@jyrkive

This comment has been minimized.

Show comment
Hide comment
@jyrkive

jyrkive Oct 5, 2017

Member

@vn971 Your example only concerns addons which "you want to privately send to your friend". This discussion is about file name validation in the addon server and only affects addons uploaded to the server. I don't see why anyone would want to share such an addon privately given that you can simply download it from the server.

Member

jyrkive commented Oct 5, 2017

@vn971 Your example only concerns addons which "you want to privately send to your friend". This discussion is about file name validation in the addon server and only affects addons uploaded to the server. I don't see why anyone would want to share such an addon privately given that you can simply download it from the server.

@vn971

This comment has been minimized.

Show comment
Hide comment
@vn971

vn971 Oct 5, 2017

Contributor

@jyrkive some people are shy and don't want to publish something until it's "done" (by their definition). I ofthen see such people sharing their work for limited audiences first.

As to the private/server question. That's actually a good question. Are we going the way "allow some addons locally, but disallow them from being published"? Or we rather want to go the path "make restrictions to addons themselves"?

Contributor

vn971 commented Oct 5, 2017

@jyrkive some people are shy and don't want to publish something until it's "done" (by their definition). I ofthen see such people sharing their work for limited audiences first.

As to the private/server question. That's actually a good question. Are we going the way "allow some addons locally, but disallow them from being published"? Or we rather want to go the path "make restrictions to addons themselves"?

@AI0867

This comment has been minimized.

Show comment
Hide comment
@AI0867

AI0867 Oct 5, 2017

Member

One possible issue with allowing full or partial (BMP-only) unicode is unicode equivalence and normalization. This has the potential of creating problems similar to case conflicts. Similar to case sensitivity, some platforms normalize and others do not, and they may even normalize differently.

Member

AI0867 commented Oct 5, 2017

One possible issue with allowing full or partial (BMP-only) unicode is unicode equivalence and normalization. This has the potential of creating problems similar to case conflicts. Similar to case sensitivity, some platforms normalize and others do not, and they may even normalize differently.

@sevu

This comment has been minimized.

Show comment
Hide comment
@sevu

sevu Oct 7, 2017

Member

(It does already exclude them by default, but not if you provide a custom server.ign unless that explicitly lists them.)

That's, despite the wiki claiming so, not the case. It was impossible for me to upload a file with .sh extension with a custom _server.ign.

Member

sevu commented Oct 7, 2017

(It does already exclude them by default, but not if you provide a custom server.ign unless that explicitly lists them.)

That's, despite the wiki claiming so, not the case. It was impossible for me to upload a file with .sh extension with a custom _server.ign.

@Pentarctagon

This comment has been minimized.

Show comment
Hide comment
@Pentarctagon

Pentarctagon Oct 7, 2017

Member

That's, despite the wiki claiming so, not the case. It was impossible for me to upload a file with .sh extension with a custom _server.ign.

Which, honestly, is probably a good thing. I can't think of a good reason for giving your own _server.ign list to replace the defaults, especially when that (also according to the wiki) would mean no longer excluding things like *.pbl and __MACOSX/. It seems like a perfect way for people to accidentally upload files as part of their add-on that should never be uploaded.

Member

Pentarctagon commented Oct 7, 2017

That's, despite the wiki claiming so, not the case. It was impossible for me to upload a file with .sh extension with a custom _server.ign.

Which, honestly, is probably a good thing. I can't think of a good reason for giving your own _server.ign list to replace the defaults, especially when that (also according to the wiki) would mean no longer excluding things like *.pbl and __MACOSX/. It seems like a perfect way for people to accidentally upload files as part of their add-on that should never be uploaded.

@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 7, 2017

Member

Agreed. Hidden files should probably always be excluded, like I said. The __MACOSX folder would never actually exist though if I understand correctly (it's just a way of encoding HFS data on non-HFS filesystems or in zip archives). The pbl should always be excluded because it contains sensitive information. Shell scripts wouldn't be useful for anyone who downloads them.

Member

CelticMinstrel commented Oct 7, 2017

Agreed. Hidden files should probably always be excluded, like I said. The __MACOSX folder would never actually exist though if I understand correctly (it's just a way of encoding HFS data on non-HFS filesystems or in zip archives). The pbl should always be excluded because it contains sensitive information. Shell scripts wouldn't be useful for anyone who downloads them.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 10, 2017

Contributor

I agree that _server.pbl should never be uploaded, unless you somehow explicitly specify you want to upload it, since it contains private information.

Custom _server.ign is definitely useful for some cases. Perhaps there should also be a counterpart of _server.ign which specifies files you want to upload anyway, _server.upl or something? So default rules (like no hidden files, no *.pbl etc) apply, whenever files exist or not, but they could be modified in both ways.

Contributor

Arcanister commented Oct 10, 2017

I agree that _server.pbl should never be uploaded, unless you somehow explicitly specify you want to upload it, since it contains private information.

Custom _server.ign is definitely useful for some cases. Perhaps there should also be a counterpart of _server.ign which specifies files you want to upload anyway, _server.upl or something? So default rules (like no hidden files, no *.pbl etc) apply, whenever files exist or not, but they could be modified in both ways.

@galegosimpatico

This comment has been minimized.

Show comment
Hide comment
@galegosimpatico

galegosimpatico Oct 11, 2017

Contributor

If there is no harm in putting % on hold, maybe you should, because of having it for integer expansion is common in the UNIX, C, and ffmpeg traditions. Maybe useful in the future.

Contributor

galegosimpatico commented Oct 11, 2017

If there is no harm in putting % on hold, maybe you should, because of having it for integer expansion is common in the UNIX, C, and ffmpeg traditions. Maybe useful in the future.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 12, 2017

Contributor

You mean, blacklist it? My current suggestion is add ' and maybe () to whitelist. Number of add-ons which still have other characters has been reduced since.

Contributor

Arcanister commented Oct 12, 2017

You mean, blacklist it? My current suggestion is add ' and maybe () to whitelist. Number of add-ons which still have other characters has been reduced since.

@galegosimpatico

This comment has been minimized.

Show comment
Hide comment
@galegosimpatico

galegosimpatico Oct 12, 2017

Contributor

Yes, blacklisting %. There was some discussion about it, without expliciting that is kind of common using the character in integer placeholder expressions in the form of:

%[PLACEHOLDER_FILL_CHAR]<PLACEHOLDER_WIDTH>d

For instance when some edit has to be done to video but there is no compiled proper ffmpeg device for the intended recoding, the video can be broken down to images, for images to be edited, and get another video built back.

In that case, is cool the program can use an integer placeholder expression to define the naming of output images. The awesome part of it is with that same syntax the program can gather a large list of input files with which to build a video.

It may be present in many other software, it all comes from old tradition (http://stackoverflow.com/questions/2778785/).

I failed to recognize this reason in the discussion, and I thought it was worth stating.

Contributor

galegosimpatico commented Oct 12, 2017

Yes, blacklisting %. There was some discussion about it, without expliciting that is kind of common using the character in integer placeholder expressions in the form of:

%[PLACEHOLDER_FILL_CHAR]<PLACEHOLDER_WIDTH>d

For instance when some edit has to be done to video but there is no compiled proper ffmpeg device for the intended recoding, the video can be broken down to images, for images to be edited, and get another video built back.

In that case, is cool the program can use an integer placeholder expression to define the naming of output images. The awesome part of it is with that same syntax the program can gather a large list of input files with which to build a video.

It may be present in many other software, it all comes from old tradition (http://stackoverflow.com/questions/2778785/).

I failed to recognize this reason in the discussion, and I thought it was worth stating.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 13, 2017

Contributor

If my understanding of HFS+ is correct, any add-ons uploaded from a Mac and having unicode characters with diacritic marks in filenames, such as äéè will be broken on all other OSes, since, when file is saved in Mac, characters will be decomposed into base character and combining diacritic mark. In same time, a unicode-aware text editor is most likely to save those characters using single code-point. MacOS will load desired file, recognizing two representations of same character, but other OSes will not.

If unicode is allowed in filenames, special precautions should be taken to handle this and likely others edge cases, which I didn't think about yet.

Contributor

Arcanister commented Oct 13, 2017

If my understanding of HFS+ is correct, any add-ons uploaded from a Mac and having unicode characters with diacritic marks in filenames, such as äéè will be broken on all other OSes, since, when file is saved in Mac, characters will be decomposed into base character and combining diacritic mark. In same time, a unicode-aware text editor is most likely to save those characters using single code-point. MacOS will load desired file, recognizing two representations of same character, but other OSes will not.

If unicode is allowed in filenames, special precautions should be taken to handle this and likely others edge cases, which I didn't think about yet.

@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 13, 2017

Member

The proper solution for that would be the same as the solution for the case check - compare it to the on-disk version of the name and warn if it differs. This would only need to be done on Mac, I think, since Windows stores the name in UTF-16.

I guess you'd also need to normalize all names when uploading or downloading an addon (NFD on upload, NFC on download except on Windows). There could still be some edge-cases if a Linux editor then proceeds to use the longer normalization form (as Mac does), but I think that's an edge case no worth worrying about.

Also, about %, I'm not sure why that stands as a reason for blacklisting it. I mean, I'm not even sure what it has to do with Wesnoth addons.

Member

CelticMinstrel commented Oct 13, 2017

The proper solution for that would be the same as the solution for the case check - compare it to the on-disk version of the name and warn if it differs. This would only need to be done on Mac, I think, since Windows stores the name in UTF-16.

I guess you'd also need to normalize all names when uploading or downloading an addon (NFD on upload, NFC on download except on Windows). There could still be some edge-cases if a Linux editor then proceeds to use the longer normalization form (as Mac does), but I think that's an edge case no worth worrying about.

Also, about %, I'm not sure why that stands as a reason for blacklisting it. I mean, I'm not even sure what it has to do with Wesnoth addons.

@galegosimpatico

This comment has been minimized.

Show comment
Hide comment
@galegosimpatico

galegosimpatico Oct 14, 2017

Contributor

We have been treating the topic privately and I have decided to withdraw the provided information as a reason for blacklisting %, if not for good, at least until I have a better understanding of the add-ons system; very unlikely to achieve before this gets closed.

Contributor

galegosimpatico commented Oct 14, 2017

We have been treating the topic privately and I have decided to withdraw the provided information as a reason for blacklisting %, if not for good, at least until I have a better understanding of the add-ons system; very unlikely to achieve before this gets closed.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 14, 2017

Contributor

This would only need to be done on Mac, I think, since Windows stores the name in UTF-16.

Mac also stores file names as UTF-16, but I do not see how is it relevant here.

Your suggestion sounds overly complicated. Especially, it's not clear why normalize filenames two times and why you suggest different behavior on Windows and GNU/Linux, while the problem is with MacOS X file system driver.

Contributor

Arcanister commented Oct 14, 2017

This would only need to be done on Mac, I think, since Windows stores the name in UTF-16.

Mac also stores file names as UTF-16, but I do not see how is it relevant here.

Your suggestion sounds overly complicated. Especially, it's not clear why normalize filenames two times and why you suggest different behavior on Windows and GNU/Linux, while the problem is with MacOS X file system driver.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 15, 2017

Contributor

I have tested my assumptions about macOS X and yes, actually add-ons with unicode names created in macOS are broken on all other OSes and working fine in macOS

Contributor

Arcanister commented Oct 15, 2017

I have tested my assumptions about macOS X and yes, actually add-ons with unicode names created in macOS are broken on all other OSes and working fine in macOS

@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 15, 2017

Member

Your suggestion sounds overly complicated. Especially, it's not clear why normalize filenames two times and why you suggest different behavior on Windows and GNU/Linux, while the problem is with MacOS X file system driver.

The idea was to encode them in the way they would typically be encoded for that system. This may not be generically possible on Linux, but NFC-normalized UTF-8 seemed like a decent guess, at least.

I have tested my assumptions about macOS X and yes, actually add-ons with unicode names created in macOS are broken on all other OSes and working fine in macOS

Is it possible to make an addon that works on both OSX and Windows? (Or even on all three systems.) For example, by carefully ensuring that references to files use NFD-normalization.

Member

CelticMinstrel commented Oct 15, 2017

Your suggestion sounds overly complicated. Especially, it's not clear why normalize filenames two times and why you suggest different behavior on Windows and GNU/Linux, while the problem is with MacOS X file system driver.

The idea was to encode them in the way they would typically be encoded for that system. This may not be generically possible on Linux, but NFC-normalized UTF-8 seemed like a decent guess, at least.

I have tested my assumptions about macOS X and yes, actually add-ons with unicode names created in macOS are broken on all other OSes and working fine in macOS

Is it possible to make an addon that works on both OSX and Windows? (Or even on all three systems.) For example, by carefully ensuring that references to files use NFD-normalization.

@vn971

This comment has been minimized.

Show comment
Hide comment
@vn971

vn971 Oct 15, 2017

Contributor

Just to clear things out, are we discussing whether work-around is possible, or work-around (and arbitrary UTF) is desireable?
Because I would (personally) certainly prefer a whitelist rule for names, just for sake of readability. After all, all addons are licensed under GPLv2, they are intended to be editable.

Contributor

vn971 commented Oct 15, 2017

Just to clear things out, are we discussing whether work-around is possible, or work-around (and arbitrary UTF) is desireable?
Because I would (personally) certainly prefer a whitelist rule for names, just for sake of readability. After all, all addons are licensed under GPLv2, they are intended to be editable.

@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 15, 2017

Member

After all, all addons are licensed under GPLv2, they are intended to be editable.

Being editable isn't really an intention as much as a side-effect of the format. Anyway, what's most important is that they're editable by the designer, and if having the filenames be in their native language helps, then why should we stop them?

Member

CelticMinstrel commented Oct 15, 2017

After all, all addons are licensed under GPLv2, they are intended to be editable.

Being editable isn't really an intention as much as a side-effect of the format. Anyway, what's most important is that they're editable by the designer, and if having the filenames be in their native language helps, then why should we stop them?

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 16, 2017

Contributor

Is it possible to make an addon that works on both OSX and Windows?

If Unicode filenames originate from and work on Windows or GNU/Linux, then they will automatically work on OS X as well.

However if add-on originated or was repackaged on OS X, it will be broken elsewhere. In other words, OS X corrupts all Unicode filenames immediately when they are stored in its filesystem, but can recognize those names even after corruption.

If you are on OS X, then you can use a text editor such as Emacs or Vim to paste output of console command ls into current document, bypassing the system clipboard to make sure no normalization is performed.

However, if file is edited later in a text editor which performs normalization on saving, then add-on will stop working on other OSes again.

The idea was to encode them in the way they would typically be encoded for that system. This may not be generically possible on Linux, but NFC-normalized UTF-8 seemed like a decent guess, at least.

Neither Linux, nor Windows do anything with Unicode names, they just preserve codes which application provided. Filenames there could be stored in any way: If files came from OS X, they are likely to be in NFD, but if names have never been corrupted by OS X, they are likely to be in NFC, since NFC is easier to produce with system default input methods. It would be incorrect to normalize filenames if they come from them.

Being editable isn't really an intention as much as a side-effect of the format.

I believe that Wesnoth add-ons were initially meant to be as easy to edit and remix as mainline content.

and if having the filenames be in their native language helps, then why should we stop them

But does it really help?

  • if UMC authors are able to write WML, then they already have at least basic understanding of English, because all WML tags are in English.

  • if they use non-ASCII filenames, they will get numerous troubles. While a really determined technical user will be able to resolve them, they are less likely to even attempt using them, precisely because they are aware of those troubles.

  • Even if filenames are usually not visible for regular players, some resources are referenced only by filename in game interface, such as maps. An 1.12 add-on with unicode names is the case.

  • Filenames are likely to match unit/scenario ids or names. Unit and scenario names are visible to player. If Wesnoth asks authors to fix filenames they are more likely to also fix in-game names.

  • Simply applying a whitelist will be the most straightforward way to ensure inoperability. Other ways increase support burden on Wesnoth developers without significant benefit for the community.

Contributor

Arcanister commented Oct 16, 2017

Is it possible to make an addon that works on both OSX and Windows?

If Unicode filenames originate from and work on Windows or GNU/Linux, then they will automatically work on OS X as well.

However if add-on originated or was repackaged on OS X, it will be broken elsewhere. In other words, OS X corrupts all Unicode filenames immediately when they are stored in its filesystem, but can recognize those names even after corruption.

If you are on OS X, then you can use a text editor such as Emacs or Vim to paste output of console command ls into current document, bypassing the system clipboard to make sure no normalization is performed.

However, if file is edited later in a text editor which performs normalization on saving, then add-on will stop working on other OSes again.

The idea was to encode them in the way they would typically be encoded for that system. This may not be generically possible on Linux, but NFC-normalized UTF-8 seemed like a decent guess, at least.

Neither Linux, nor Windows do anything with Unicode names, they just preserve codes which application provided. Filenames there could be stored in any way: If files came from OS X, they are likely to be in NFD, but if names have never been corrupted by OS X, they are likely to be in NFC, since NFC is easier to produce with system default input methods. It would be incorrect to normalize filenames if they come from them.

Being editable isn't really an intention as much as a side-effect of the format.

I believe that Wesnoth add-ons were initially meant to be as easy to edit and remix as mainline content.

and if having the filenames be in their native language helps, then why should we stop them

But does it really help?

  • if UMC authors are able to write WML, then they already have at least basic understanding of English, because all WML tags are in English.

  • if they use non-ASCII filenames, they will get numerous troubles. While a really determined technical user will be able to resolve them, they are less likely to even attempt using them, precisely because they are aware of those troubles.

  • Even if filenames are usually not visible for regular players, some resources are referenced only by filename in game interface, such as maps. An 1.12 add-on with unicode names is the case.

  • Filenames are likely to match unit/scenario ids or names. Unit and scenario names are visible to player. If Wesnoth asks authors to fix filenames they are more likely to also fix in-game names.

  • Simply applying a whitelist will be the most straightforward way to ensure inoperability. Other ways increase support burden on Wesnoth developers without significant benefit for the community.

@galegosimpatico

This comment has been minimized.

Show comment
Hide comment
@galegosimpatico

galegosimpatico Oct 16, 2017

Contributor

I am on GNU/Linux, let's touch ñ.

$ ls ./ | hexdump -C
00000000  c3 b1 0a                                          |...|
00000003
$ _

I am on macOS, let's do the same. Well, the same happened.

I keep being on macOS, let's have another doc renamed to ñ using the Finder.

$ ls ./ | hexdump -C
00000000  6e cc 83 0a                                       |n...|
00000004
$ _

Back to GNU/Linux, let's have another doc renamed to ñ using dolphin.

$ ls ./ | hexdump -C
00000000  c3 b1 0a                                          |...|
00000003
$ _
Contributor

galegosimpatico commented Oct 16, 2017

I am on GNU/Linux, let's touch ñ.

$ ls ./ | hexdump -C
00000000  c3 b1 0a                                          |...|
00000003
$ _

I am on macOS, let's do the same. Well, the same happened.

I keep being on macOS, let's have another doc renamed to ñ using the Finder.

$ ls ./ | hexdump -C
00000000  6e cc 83 0a                                       |n...|
00000004
$ _

Back to GNU/Linux, let's have another doc renamed to ñ using dolphin.

$ ls ./ | hexdump -C
00000000  c3 b1 0a                                          |...|
00000003
$ _
@jyrkive

This comment has been minimized.

Show comment
Hide comment
@jyrkive

jyrkive Oct 16, 2017

Member

if UMC authors are able to write WML, then they already have at least basic understanding of English, because all WML tags are in English.

Well, some UMC authors are quite bad at English regardless.

if they use non-ASCII filenames, they will get numerous troubles.

Since when has "numerous" meant "a couple"? You have only pointed out two significant problems (this issue, and Windows being unable to create a ZIP where filenames contains 3- or 4-byte characters).

Filenames are likely to match unit/scenario ids or names. Unit and scenario names are visible to player. If Wesnoth asks authors to fix filenames they are more likely to also fix in-game names.

Uh, what? Now you're trying to say that unit and scenario names should also be limited to ASCII?!?!

Simply applying a whitelist will be the most straightforward way to ensure inoperability. Other ways increase support burden on Wesnoth developers without significant benefit for the community.

This "increase in support burden" is negligible.

Member

jyrkive commented Oct 16, 2017

if UMC authors are able to write WML, then they already have at least basic understanding of English, because all WML tags are in English.

Well, some UMC authors are quite bad at English regardless.

if they use non-ASCII filenames, they will get numerous troubles.

Since when has "numerous" meant "a couple"? You have only pointed out two significant problems (this issue, and Windows being unable to create a ZIP where filenames contains 3- or 4-byte characters).

Filenames are likely to match unit/scenario ids or names. Unit and scenario names are visible to player. If Wesnoth asks authors to fix filenames they are more likely to also fix in-game names.

Uh, what? Now you're trying to say that unit and scenario names should also be limited to ASCII?!?!

Simply applying a whitelist will be the most straightforward way to ensure inoperability. Other ways increase support burden on Wesnoth developers without significant benefit for the community.

This "increase in support burden" is negligible.

@galegosimpatico

This comment has been minimized.

Show comment
Hide comment
@galegosimpatico

galegosimpatico Oct 16, 2017

Contributor

Filenames are likely to match unit/scenario ids or names. Unit and scenario names are visible to player. If Wesnoth asks authors to fix filenames they are more likely to also fix in-game names.

Uh, what? Now you're trying to say that unit and scenario names should also be limited to ASCII?!?!

I understand s/he is pointing out s/he prefers imposing a temporary ban on non-ASCII while it is behaving buggy. In that case s/he is right to do so. Of course, that doesn't necessarily mean the best action to take has to be acknowledging and embracing the way s/he prefers things to be.

Contributor

galegosimpatico commented Oct 16, 2017

Filenames are likely to match unit/scenario ids or names. Unit and scenario names are visible to player. If Wesnoth asks authors to fix filenames they are more likely to also fix in-game names.

Uh, what? Now you're trying to say that unit and scenario names should also be limited to ASCII?!?!

I understand s/he is pointing out s/he prefers imposing a temporary ban on non-ASCII while it is behaving buggy. In that case s/he is right to do so. Of course, that doesn't necessarily mean the best action to take has to be acknowledging and embracing the way s/he prefers things to be.

@AI0867

This comment has been minimized.

Show comment
Hide comment
@AI0867

AI0867 Oct 16, 2017

Member

On linux, one can do stuff like echo -e '\xFF\xFF\xFF'|xargs touch, which will successfully create a file that dolphin, at least, does not understand how to deal with. When trying to publish such a file, wesnoth fails on the "check if the filename looks like a pbl" step, because it's not valid utf8.

To summarize:

  • macOS normalizes to NFD
  • linux stores opaque bytestreams that do not contain 0x00 (NUL) or 0x2F (/)
  • windows stores opaque UCS-2 streams that do not contain various reserved codepoints
  • macOS may (but may also not) perform case insensitive matching (according to which rules?)
  • linux simply does binary matching of the bytestream
  • windows does limited case insensitive matching, depending on the filesystem instance

There are two possible approaches here:

  1. Do case-checks for ASCII, hope for the best for unicode (similar to what we do now)
  2. Do case-checks, switch to a whitelist and only permit a limited set of unicode (if anything, beyond ASCII).

Whitelist: this PR
Blacklist: PR #2077
Case checks: PR #2071

Another option, which I don't think is a good idea, is normalizing filenames to NFD upload/download, and when opening. This would however, mean that users on windows and linux would be unable to open their own files until they normalized.

Member

AI0867 commented Oct 16, 2017

On linux, one can do stuff like echo -e '\xFF\xFF\xFF'|xargs touch, which will successfully create a file that dolphin, at least, does not understand how to deal with. When trying to publish such a file, wesnoth fails on the "check if the filename looks like a pbl" step, because it's not valid utf8.

To summarize:

  • macOS normalizes to NFD
  • linux stores opaque bytestreams that do not contain 0x00 (NUL) or 0x2F (/)
  • windows stores opaque UCS-2 streams that do not contain various reserved codepoints
  • macOS may (but may also not) perform case insensitive matching (according to which rules?)
  • linux simply does binary matching of the bytestream
  • windows does limited case insensitive matching, depending on the filesystem instance

There are two possible approaches here:

  1. Do case-checks for ASCII, hope for the best for unicode (similar to what we do now)
  2. Do case-checks, switch to a whitelist and only permit a limited set of unicode (if anything, beyond ASCII).

Whitelist: this PR
Blacklist: PR #2077
Case checks: PR #2071

Another option, which I don't think is a good idea, is normalizing filenames to NFD upload/download, and when opening. This would however, mean that users on windows and linux would be unable to open their own files until they normalized.

@Arcanister

This comment has been minimized.

Show comment
Hide comment
@Arcanister

Arcanister Oct 16, 2017

Contributor

MacOS X is only OS affected by this issue, so it best to keep the solution of this problem local for OS X build.

Now you're trying to say that unit and scenario names should also be limited to ASCII?!?!

Not really, but they should be in English when English locale is selected. I do not suggest to enforce it in any way, however.

Contributor

Arcanister commented Oct 16, 2017

MacOS X is only OS affected by this issue, so it best to keep the solution of this problem local for OS X build.

Now you're trying to say that unit and scenario names should also be limited to ASCII?!?!

Not really, but they should be in English when English locale is selected. I do not suggest to enforce it in any way, however.

@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 16, 2017

Member

You didn't even answer my question...

Member

CelticMinstrel commented Oct 16, 2017

You didn't even answer my question...

@vn971

This comment has been minimized.

Show comment
Hide comment
@vn971

vn971 Oct 16, 2017

Contributor

@CelticMinstrel I personally would focus on both the initial author and later developers.
Meaning, the difficulty to name a file in English for the initial author is less than the difficulty for other addon developers to write arbitrary unicode.

Contributor

vn971 commented Oct 16, 2017

@CelticMinstrel I personally would focus on both the initial author and later developers.
Meaning, the difficulty to name a file in English for the initial author is less than the difficulty for other addon developers to write arbitrary unicode.

@CelticMinstrel

This comment has been minimized.

Show comment
Hide comment
@CelticMinstrel

CelticMinstrel Oct 16, 2017

Member

Some other person editing the files wouldn't need to write arbitrary Unicode. If they need a filename they can copy-paste it. Using Unicode names only makes it slightly harder for people other than the designer to edit it, so I don't see the problem here.

Member

CelticMinstrel commented Oct 16, 2017

Some other person editing the files wouldn't need to write arbitrary Unicode. If they need a filename they can copy-paste it. Using Unicode names only makes it slightly harder for people other than the designer to edit it, so I don't see the problem here.

@vn971

This comment has been minimized.

Show comment
Hide comment
@vn971

vn971 Oct 16, 2017

Contributor

@CelticMinstrel that is subjective (what's hard and what's not). So I'd leave it as is.

BTW, looking from another perspective: if I read the statistics correctly, very few add-on developers actually used non-English "alphabetical" letters, even though there were no restrictions for that in wesnoth engine and the addon server. My personal interpretation of this statistics is that people aren't really that interested in writing arbitrary unicode in file names.

EDIT: This post says that it's 4 addons out of 606: #2070 (comment)

Contributor

vn971 commented Oct 16, 2017

@CelticMinstrel that is subjective (what's hard and what's not). So I'd leave it as is.

BTW, looking from another perspective: if I read the statistics correctly, very few add-on developers actually used non-English "alphabetical" letters, even though there were no restrictions for that in wesnoth engine and the addon server. My personal interpretation of this statistics is that people aren't really that interested in writing arbitrary unicode in file names.

EDIT: This post says that it's 4 addons out of 606: #2070 (comment)

@jyrkive

This comment has been minimized.

Show comment
Hide comment
@jyrkive

jyrkive Oct 16, 2017

Member

The low number of add-ons that use Unicode in filenames also means that Unicode is unlikely to cause problems in practice.

Member

jyrkive commented Oct 16, 2017

The low number of add-ons that use Unicode in filenames also means that Unicode is unlikely to cause problems in practice.

@vn971

This comment has been minimized.

Show comment
Hide comment
@vn971

vn971 Oct 16, 2017

Contributor

@jyrkive I don't follow that. If something is not popular, it means it is not problematic?

Contributor

vn971 commented Oct 16, 2017

@jyrkive I don't follow that. If something is not popular, it means it is not problematic?

@GregoryLundberg

This comment has been minimized.

Show comment
Hide comment
@GregoryLundberg

GregoryLundberg Oct 16, 2017

Contributor

What's presently done is fine as a gauge of how much breakage will be caused.

But that it is rarely used, at present, is not an argument for not fixing it.

We want the add-ons to be cross-platform. That means allowing only the common points. Our job, here, is to prevent FUTURE authors shooting themselves in the foot.

Knowing what the change breaks is more knowing who to contact and what they need to fix than anything else.

Contributor

GregoryLundberg commented Oct 16, 2017

What's presently done is fine as a gauge of how much breakage will be caused.

But that it is rarely used, at present, is not an argument for not fixing it.

We want the add-ons to be cross-platform. That means allowing only the common points. Our job, here, is to prevent FUTURE authors shooting themselves in the foot.

Knowing what the change breaks is more knowing who to contact and what they need to fix than anything else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment