Migration to Python 3 with file paths as bytes instead str #53

ericzolf · 2019-08-10T20:30:04Z

This is the promised adaptation for rdiff-backup to work with bytes to handle paths so that there shouldn't be a dependency on the charset used.

The branch is based on #40, so either it can be merged directly into master and PR #40 closed, or I can merge the branch of this PR into #40 and close this one, so that the review continues in #40. I'm not sure which is best for the reviewers.

ericzolf · 2019-08-10T20:32:13Z

Another thought: should we stash the branch?

ericzolf · 2019-08-10T20:34:52Z

I almost forgot, if you want to test, it depends on https://github.com/ericzolf/rdiff-backup/releases/tag/Testfiles2019-08-10

ottok · 2019-08-12T08:01:13Z

Could you please rebase on master so that the relevant commits are easier to review? Thanks

ericzolf · 2019-08-13T04:30:00Z

Branch re-based without conflict, can be merged

ottok · 2019-08-13T05:47:34Z

Oh, so this change really is 18 commits. I am surprised that so many changes were needed.

Was the removal of src/memoryleak.c in the first commit intentional? The commit does not reference that change or the motivation of it, so it leaves me guessing.

I am no expert on binary vs unicode. I am surprised that b needs to be sprinkled all over and also that the regexp string no_compression_regexp_string is set as binary. I would have assumed the rules are defined as strings and then regex compiles it into bytecode with bytes or something.

I also see lines like os.system(b"cp -a – do system calls really need to be in bytes as well?

Somehow this does not feel like a pythonic solution to me. I understand that os.fsencode might be needed in a few places when interacting with the filenames, but putting b in front of almost every variable in the code base makes me wonder...

ericzolf · 2019-08-13T19:34:53Z

Honestly, I went through the same "wonder" as I attacked this new topic, and searched also for a less intrusive approach. The problem is that as soon as you accept the fact that any filename can have any encoding, also a broken one, or multiple ones because of cross-platform remote backup etc, you need to work solely with bytes and can't work any more with str.

One needs to understand that at this point any "encoding" is possible but any "decoding" can lead to a conversion error. This means that all files (metadata & co) containing paths need to be written as binary/bytes, any command which might contain a path must be expressed as bytes, and the same applies to regexes.

There is perhaps an alternative but it would be way above my pay grade (so to say).
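
A minimal sketch of the point above, not taken from the PR: a file name containing a byte that is not valid UTF-8 cannot be decoded strictly, but a bytes regex can still match it without any decoding. The file name and the pattern are hypothetical stand-ins (the real no_compression_regexp_string may look different).

```python
import re

# Hypothetical file name containing a byte (0xb1) that is not valid UTF-8.
raw_name = b"backup-\xb1.gz"

# Strict decoding fails, so the name cannot safely be handled as str.
try:
    raw_name.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# A bytes pattern matches bytes input directly, no decoding involved.
# (Illustrative pattern only, not the one used in rdiff-backup.)
no_compression = re.compile(rb"(?i).*\.(gz|z|bz|bz2|tgz|zip)$")
matched = no_compression.match(raw_name) is not None

print(decoded_ok, matched)
```

The rule could also be kept as str and encoded once before compiling, but the compiled pattern still has to be a bytes pattern as long as the paths it is matched against are bytes.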

ericzolf · 2019-08-13T20:06:22Z

Sorry, forgot: removing src/memoryleak.c was indeed unclean, not really related to the branch topic, but it was clear that it had become useless and my first "grep's" to find places to fix for bytes vs. str had a hit on the file, so I just decided to remove it and not try to fix it.

ottok · 2019-08-16T09:39:12Z

Incidentally I came across the same discussion at https://lwn.net/Articles/796344/ but I can't find any info on what is the correct way to do this.. My love for Python has got a big dent if one really needs to sprinkle b' everywhere to do something with file paths..

ericzolf · 2019-08-16T15:25:09Z

The linked thread is a good summary of what I had to learn the hard way.

ericzolf · 2019-08-17T07:42:44Z

Sorry, one additional commit to fix also rdiff-backup-statistics in regards to bytes and some "lazy" remains.

ottok · 2019-08-17T11:31:41Z

I am trying to reconstruct the logic here to better understand what is going on. First of all, I don't really understand what exact problem these changes try to fix. There is no issue, and the commits don't explain why they were made in terms of functionality:

What exactly are the paths or filenames that didn't work? What were the error messages Python threw?

ericzolf · 2019-08-17T14:55:46Z

OK, basically, everything started with the thread https://lists.nongnu.org/archive/html/rdiff-backup-users/2019-08/msg00010.html - based on the feedback that I couldn't assume only UTF-8 paths but had to assume any codeset even mixed and broken ones.

The first commit in the PR was about removing the ignoring file %s with wrong encoding sanity check in src/rdiff_backup/selection.py that I had introduced in PR #40 - after that any test involving the files found by find ../rdiff-backup_testfiles -name ث\* would fail. I had to re-create a new testfiles archive to re-introduce the strange file into the restoretest3 repository.

The rest of the PR is solely about making sure that rdiff-backup and all the tests work properly, once I had started to switch from str/unicode to bytes.

It is even worse because my earlier version wouldn't have been able to use a repo containing any file with a broken name. Take the rdiff-backup version at the end of PR #40 and try to restore the repo under ../rdiff-backup_testfiles/restoretest3 from the archive I uploaded on the 10th of August:

$ git checkout master
$ ./setup.py build
$ PATH=$PWD/build/scripts-3.7:$PATH PYTHONPATH=$PWD/build/lib.linux-x86_64-3.7 python3 rdiff-backup -r0 ../rdiff-backup_testfiles/restoretest3 /tmp/dummyrestore
[...]
  File "/home/ericl/Public/rdiff-backup/build/lib.linux-x86_64-3.7/rdiff_backup/rpath.py", line 1445, in read
    data = data.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 14908: invalid start byte

I didn't go into the details in my commit messages because it was always the same thing, either an error that the broken filename couldn't be decoded, or an error that I can't mix bytes and str (in many variations), until I had rdiff-backup(-statistics) and the testsuite running properly again.

ottok · 2019-08-17T20:14:37Z

I found out that in Python 3 the file path is not of type str but it has its own type inherited from pathlib. Maybe the original problem was that some code mixed str and path objects? And maybe the solution is to consistently use the path object correctly?

See
https://realpython.com/python-pathlib/
https://snarky.ca/why-pathlib-path-doesn-t-inherit-from-str/

ericzolf · 2019-08-17T20:53:30Z

Perhaps true, but this would mean IMHO a major rewrite, and more difficult than just changing everything from str to bytes or vice-versa, and I'm not absolutely convinced that it solves all our problems: the statement from the 2nd link that "all path can be represented by string" is quite wrong:

import pathlib
for p in pathlib.Path('.../rdiff-backup_testfiles/various_file_types').iterdir():
    print(p)

ends with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcb1' in position 62: surrogates not allowed

So, we might end up in a dead end because of cases like this. It's something we might want to look at to get cleaner code, based on a new issue, but I don't see the topic as relevant for now.
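
For readers wondering where the '\udcb1' in the traceback comes from, here is a small sketch (my illustration, not code from the PR). Python turns undecodable bytes in file names into lone surrogates using the "surrogateescape" error handler; those round-trip back to the original bytes, but a strict UTF-8 encode, as done by print(), rejects them. The byte string is assumed to be the start of the odd test file name shown by ls.

```python
# First bytes of the odd test file name (assumption based on the ls output):
# 0xd8 0xab is valid UTF-8, 0xb1 on its own is not.
raw = b"\xd8\xab\xb1Wb"

# What the OS layer hands to str-based APIs: 0xb1 becomes U+DCB1.
text = raw.decode("utf-8", "surrogateescape")

# A strict encode (what print() needs on a UTF-8 terminal) fails.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", "surrogateescape")

# For display only, the lone surrogate can be escaped instead.
printable = text.encode("utf-8", "backslashreplace").decode("utf-8")
print(strict_failed, roundtrip == raw, printable)
```

So the name survives inside a str, but every place that writes it out (terminal, metadata file, log) has to pick an error handler on purpose.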

ericzolf · 2019-08-21T19:31:47Z

So, how do we progress? I'm happy to accept/merge some of the other pending pull requests, but they would introduce conflicts with this pull request.

ikus060 · 2019-08-21T19:35:59Z

I would merge stuff according to the priorities. In my view, supporting python3 is the priority, since python2 end-of-life is planned for 2020. So, my recommendation is to merge your python3 changes first, and slowly merge the other changes after that, resolving the conflicts as they come up.

ottok · 2019-08-21T20:12:46Z

I would lean towards prioritizing Python 3 (and that PR was already merged) and a CI / automatic testing pipeline to ensure the following changes don't break anything. A good testing system can also be used here to create files with UTF-8, UTF-16, UTF-32 etc. unicode characters, and then it is easy to see if fixing all str cases with Python Path objects is enough, or if we really need to have everything in binary all the time.
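
A rough sketch of what such a test fixture could look like, assuming a POSIX filesystem that accepts arbitrary bytes in file names (some filesystems, e.g. on macOS or Windows, may refuse the undecodable ones). The names are made up for illustration. UTF-16 and UTF-32 encoded names are left out because they contain NUL bytes, which are not allowed in POSIX file names.

```python
import os
import tempfile

# Hypothetical test names: same text in two encodings, plus a broken one.
names = [
    "åäö".encode("utf-8"),
    "åäö".encode("latin-1"),
    b"\xb1broken",
]

with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)  # work with a bytes directory path
    for name in names:
        # Create an empty file under each raw byte name.
        with open(os.path.join(tmp_b, name), "wb"):
            pass
    # Passing bytes to os.listdir() returns bytes names, undecoded.
    listed = sorted(os.listdir(tmp_b))

print(listed == sorted(names))
```

A test like this only shows that the bytes API is lossless; checking the str or pathlib route would additionally need the names to be written out or compared after decoding.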

ericzolf · 2019-08-22T04:50:37Z

This is a huge patch again so I'm not keen to let it rot and do endless rebases, especially not if Travis CI has a PEP8 cleanup of the code as prerequisite. Should this happen, it would be a complete rewrite of the patch.

This PR might be ugly but it works the way it should and the tox tests prove it, so:

It's better this way than the other way around, nice but non functional, as on master currently
Introducing pathlib might be a good idea, actually I like it (*) BUT it's a huge change because bytes and strings act more or less alike but pathlib.Path has a completely different interface
Once the code works correctly and the tests work as they should, refactoring/redesigning becomes easier, which is currently not the case on master.

So, my strong preference: merge this PR as ugly as it is, merge the other smaller stuff, including CI, clean up the code for PEP8, and in parallel continue the discussion about pathlib in a new issue.

(*) I'm thinking that the right way to do it would be to somehow derive RPath from pathlib.Path and/or RORPath from PurePath, with the potential issue that our code ought to be cross-platform, hence potentially mixing Windows and Linux paths and transferring them remotely, so there are a lot of things to consider, as pathlib internally uses different classes for both types of path.
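
To illustrate the footnote (a sketch with made-up paths, not a proposal for RPath): the pure pathlib classes come in a POSIX and a Windows flavour that parse the same kind of string differently, and neither accepts bytes, which is part of why moving to pathlib is a bigger change than str vs. bytes.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical example paths, one per flavour.
posix = PurePosixPath("backup/2019/file.txt")
win = PureWindowsPath(r"C:\backup\2019\file.txt")

print(posix.parts)     # components split on "/"
print(win.parts)       # drive and root kept together as the first part
print(win.as_posix())  # same Windows path rendered with forward slashes

# pathlib only takes str (or os.PathLike returning str), never bytes.
try:
    PurePosixPath(b"backup")
    rejected_bytes = False
except TypeError:
    rejected_bytes = True
print(rejected_bytes)
```

Both pure classes can be instantiated on any platform, which would help with remote transfers, but the bytes restriction means undecodable names would still have to go through surrogateescape.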

ottok · 2019-08-22T07:56:59Z

I made a couple of extra test files:

ls testfiles/various_file_types/
 ไฟล์ทดสอบ
 aaaテストファイル
 aaaaaaåäöåäöÅÄÖ
'aaaملف الاختبار'
 aaa測試文件
 åäöåäöÅÄÖ
 changeable_permission
 executable
 executable2
 fifo
'Ø«±Wb®Å]'$'\302\212''»'$'\025''v*ô'$'\017''!ù>âY'$'\302\206''»«Ûp°'$'\302\204\023''k'$'\035''Âñõe¥U'$'\302\202\302\232''UV ôß4ºýX'$'\003\302\202\a''sÎ'$'\302\236\302\213''³4'$'\004\302\237\027'' ô'$'\302\217''¦ú'$'\302\227''«Ø¬Ú'$'\302\205''ÜKvCú#'$'\302\224\302\222\302\236''É·Ã_'$'\017\302\204''g'$'\302\232''B'$'\021''<=^ÛM'$'\023\302\226''c'$'\302\213''§|*"\'\''^$@#!(){}?+ ~` '
 regular_file
 regular_file.sig
 subdir
 symbolic_link
 test
 two_hardlinked_files1
 two_hardlinked_files2
'ث'$'\261''Wb'$'\256\305'']'$'\212\273\025''v*'$'\364\017''!'$'\371''>'$'\342''Y'$'\206\273\253\333''p'$'\260\204\023''k'$'\035\302\361\365''e'$'\245''U'$'\202\232''UV'$'\240\364\337''4'$'\272\375''X'$'\003\202\a''sΞ'$'\213\263''4'$'\004\237\027'' '$'\364\217\246\372\227\253''جڅ'$'\334''KvC'$'\372''#'$'\224\222\236''ɷ'$'\303''_'$'\017\204''g'$'\232''B'$'\021''<=^'$'\333''M'$'\023\226''c'$'\213\247''|*"\'\''^$@#!(){}?+ ~` '

When I run the test script I get this result:

$ sudo python3 pathlib-test.py 
testfiles/various_file_types/aaaaaaåäöåäöÅÄÖ
testfiles/various_file_types/aaaملف الاختبار
testfiles/various_file_types/aaaテストファイル
testfiles/various_file_types/aaa測試文件
testfiles/various_file_types/changeable_permission
testfiles/various_file_types/executable
testfiles/various_file_types/executable2
testfiles/various_file_types/fifo
testfiles/various_file_types/regular_file
testfiles/various_file_types/regular_file.sig
testfiles/various_file_types/subdir
testfiles/various_file_types/subdir/subdir_file
testfiles/various_file_types/symbolic_link
testfiles/various_file_types/test
testfiles/various_file_types/two_hardlinked_files1
testfiles/various_file_types/two_hardlinked_files2
testfiles/various_file_types/Ø«±Wb®Å]�»�v*ô!ù>âY»«Ûp°
                                                     �k�Âñõe¥U�UV ôß4ºýX��sÎ��³4��� ô¦ú«Ø¬Ú
ÜKvCú#���É·Ã_
             gB�<=^ÛM�c�§|*"\'^$@#!(){}?+ ~` 
testfiles/various_file_types/åäöåäöÅÄÖ
^[[?62;c^[[?62;cTraceback (most recent call last):
  File "pathlib-test.py", line 3, in <module>
    print(p)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcb1' in position 30: surrogates not allowed
$ 62;c62;c

(The 62.. on the terminal prompt are real and intentional.)

My test code:

import pathlib
for p in sorted(pathlib.Path('testfiles/various_file_types').iterdir()):
    print(p)

So it seems that pathlib handles reading and sorting all paths just fine, but when it is time to print, it fails on the unprintable characters. The way ls (or bash) handles this is by printing octal escapes, e.g. \123, for the bytes it cannot print.
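
A possible way to get ls-like output in Python, sketched under the assumption that the names are available as bytes (for example from os.fsencode(p)). The helper name shell_style is made up for this example, and it only approximates what ls does.

```python
def shell_style(name: bytes) -> str:
    """Render a raw file name roughly like ls: octal escapes for bad bytes."""
    out = []
    # surrogateescape maps each undecodable byte 0xNN to U+DC00 + 0xNN.
    for ch in name.decode("utf-8", "surrogateescape"):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Recover the original byte and print it as \ooo.
            out.append("\\%03o" % (code - 0xDC00))
        elif code < 0x20 or code == 0x7F:
            # ASCII control characters are escaped the same way.
            out.append("\\%03o" % code)
        else:
            out.append(ch)
    return "".join(out)


# Start of the odd test file name (assumed from the ls listing).
print(shell_style(b"\xd8\xab\xb1Wb"))
```

The result is a plain str without surrogates, so print() no longer raises, at the cost of the output not being convertible back to the original name in all cases.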

I also tried this which is used in the PR:

    x = b'\n'
    print(x)

Result: b'\n'

Your commit 'First try at using filenames handled as bytes' still removes C source files and makes changes unrelated to the commit title. I understand that you want this change in and agree that it has its ugliness, and that you think this is a better base to further improve and refactor the code than the current master branch, so I guess we just need to accept that stance and merge this. I just hope that introducing "ugliness" does not later on add to the work required to clean up and replace b'' with f'' or remove it.

This was referenced Aug 11, 2019

UnicodeDecodeError with WinACLs (locally executed; Windows) #31

Closed

Exception ''ascii' codec can't decode byte 0xe2 #27

Closed

Regression of failed backup with long file names can result in assertion error #9

Closed

ericzolf and others added 18 commits August 13, 2019 06:26

First try at using filenames handled as bytes

765b0af

Based on discussion on the mailing list, still more to do. Single backup and incremental backup basically works. Still need to make all test cases work again (untried).

Adapt a bunch of test scripts to move from str to bytes path

569fa69

This commit is more of a generic let's check everything in a big sweep. There are surely things left to do, but all paths have been set as bytes, where I could find them with some search pattern.

Fix no_compression_regexp_string to be bytes type

f215ac5

Make longname test again working with bytes paths

63a73c3

Make hardlinktest and incrementest work for bytes paths

0ce6bb6

Just simple bytes vs. string fixes.

Make test hardlink work really with bytes

2ebf80c

Make setconnection test work with bytes instead of str.

98059cd

Make the EA/ACL tests work with bytes instead of str

d6611b4

Make FilenameMapping test work with bytes

207b86c

Small byte fix to call server.py in hash test

453bde7

Fix librsync test and selection test

5eee800

Somehow the librsync test failed after the selection was fixed, it was due to files not being cleaned up, not sure why it didn't appear before. The selection fixes were just the usual str vs. bytes thingy.

Just metadata test fixes (no change to actual code)

8081f93

Make rpath and rpath test work

1b13f0f

The tests were again mostly about bytes vs. str. rpath's dirsplit had to be reverted, the expected result didn't fit os.path.

Some more bytes vs. str fixes to make to get security tests running

e29ecbe

Make compare tests work in regard to bytes

5743c44

The usual bytes vs. str stuff and replacing StringIO with BytesIO in compare.py and restore.py

Make restoretest and cmdlinetest work with bytes

0b3d5f1

Also improve the ability to create a new restoretest repo.

Make root tests work with bytes paths

cb9d344

Fix bytes path for the benchmark in tox_slow

c29a5b8

ericzolf force-pushed the ericzolf-py2to3-bytes branch from 7054dde to c29a5b8 Compare August 13, 2019 04:26

ericzolf added this to the 2.0.0 milestone Aug 14, 2019

Fix rdiff-backup-statistics to work with bytes paths and without lazy

1b5c352

Remove lazy references forgotten and replace with standard filter iterator Tested on ../rdiff-backup_testfiles/restoretest3 and works

ericzolf force-pushed the ericzolf-py2to3-bytes branch from 11fb247 to 1b5c352 Compare August 17, 2019 07:43

ottok approved these changes Aug 22, 2019

ottok merged commit 0b35777 into rdiff-backup:master Aug 22, 2019

ottok mentioned this pull request Aug 25, 2019

Recent change to support bytes and python3 is breaking Windows build #106

Closed
