Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migration to Python 3 with file paths as bytes instead str #53

Merged
merged 19 commits into from Aug 22, 2019

Conversation

ericzolf
Copy link
Member

This is the promised adaptation for rdiff-backup to work with bytes to handle paths so that there shouldn't be a dependency on the charset used.

The branch is based upon #40 so that it can either be merged directly into master and the PR#40 been closed, or I can merge the branch of this PR and close it so that the review happens further on 40. Not sure what is best for the reviewers.

@ericzolf
Copy link
Member Author

Another thought: should we stash the branch?

@ericzolf
Copy link
Member Author

I almost forgot, if you want to test, it depends on https://github.com/ericzolf/rdiff-backup/releases/tag/Testfiles2019-08-10

@ottok
Copy link
Contributor

ottok commented Aug 12, 2019

Could you please rebase on master so that the relevant commits are easier to review? Thanks

ericzolf and others added 18 commits August 13, 2019 06:26
Based on discussion on the mailing list, still more to do.
Single backup and incremental backup basically works.
Still need to make all test cases work again (untried).
This commit is more of a generic let's check everything in a big swipe.
There are surely things let to do, but all paths have bin set as bytes,
where I could find them with some search pattern.
Just simple bytes vs. string fixes.
Somehow the librsync test failed after the selection was fixed,
it was due to files not being cleaned up, not sure why it didn't
appear before.
The selection fixes were just the usual str vs. bytes thingy.
The tests were again mostly about bytes vs. str.
rpath's dirsplit had to be reverted, the expected result didn't fit os.path.
The usual bytes vs. str stuff and replacing StringIO with BytesIO in compare.py and restore.py
Also improve the ability to create a new restoretest repo.
@ericzolf
Copy link
Member Author

Branch re-based without conflict, can be merged

@ottok
Copy link
Contributor

ottok commented Aug 13, 2019

Oh, so this change really is 18 commits.. I am surprised that many changes were needed.

Was the removal of src/memoryleak.c in the first commit intentional? The commit does not reference that change or the motivation of it, so it leaves me guessing.

I am no expert on binary vs unicode. I am surprised that b needs to be sprinkled all over and also that the regexp string no_compression_regexp_string is set as binary. I would have assumed the rules are defined as strings and then regex compiles it into bytecode with bytes or something.

I also see lines like os.system(b"cp -a – do system calls really be in byes as well?

Somehow this does not feel like a pythonic solution to me. I understand that os.fsencode might be needed in a few places when interacting with the filenames, but putting b in front of almost every variable in the code base makes me wonder...

@ericzolf
Copy link
Member Author

Honestly, I went through the same "wonder" as I attacked this new topic, and searched also for a less intrusive approach. The problem is that as soon as you accept the fact that any filename can have any encoding, also a broken one, or multiple ones because of cross-platform remote backup etc, you need to work solely with bytes and can't work any more with str.

One need to understand that at this point any "encoding" is possible but any "decoding" can lead to a conversion error, which means all files (metadata & co) containing paths need to be written as binary/byte, any command which might contain a path must be expressed as bytes, and the same applies to regex.

There is perhaps an alternative but it would be way above my pay range (so to say).

@ericzolf
Copy link
Member Author

Sorry, forgot: removing src/memoryleak.c was indeed unclean, not really related to the branch topic, but it was clear that it had become useless and my first "grep's" to find places to fix for bytes vs. str had a hit on the file, so I just decided to remove it and not try to fix it.

@ericzolf ericzolf added this to the 2.0.0 milestone Aug 14, 2019
@ottok
Copy link
Contributor

ottok commented Aug 16, 2019

Incidentally I came across the same discussion at https://lwn.net/Articles/796344/ but I can't find any info on what is the correct way to do this.. My love for Python has got a big dent if one really needs to sprinkle b' everywhere to do something with file paths..

@ericzolf
Copy link
Member Author

ericzolf commented Aug 16, 2019 via email

@ericzolf
Copy link
Member Author

Sorry, one additional commit to fix also rdiff-backup-statistics in regards to bytes and some "lazy" remains.

Remove lazy references forgotten and replace with standard filter iterator
Tested on ../rdiff-backup_testfiles/restoretest3 and works
@ottok
Copy link
Contributor

ottok commented Aug 17, 2019

I am trying to reconstruct the logic here to better understand what is going on. First of all, I don't really understand deeply what is the exact problem these changes try to fix. There is no issue and the commits are don't explain why they are made in regards to functionality:
image

What exactly are the paths or filenames that didn't work? What was the error messages Python threw?

@ericzolf
Copy link
Member Author

OK, basically, everything started with the thread https://lists.nongnu.org/archive/html/rdiff-backup-users/2019-08/msg00010.html - based on the feedback that I couldn't assume only UTF-8 paths but had to assume any codeset even mixed and broken ones.

The first commit in the PR was about removing the ignoring file %s with wrong encoding sanity check in src/rdiff_backup/selection.py that I had introduced in PR #40 - after that any test involving the files found by find ../rdiff-backup_testfiles -name ث\* would fail. I had to re-create a new testfiles archive to re-introduce the strange file into the restoretest3 repository.

The rest of the MR is solely about making sure that rdiff-backup and all the tests work properly, once I had started to switch from str/unicode to bytes.

It is even worse because my earlier version wouldn't have been able to use a repo containing any file with broken repo. Take the rdiff-backup version at the end of PR #40 and try to restore the repo under ../rdiff-backup_testfiles/restoretest3 from the archive I uploaded on the 10th of August:

$ git checkout master
$ ./setup.py build
$ PATH=$PWD/build/scripts-3.7:$PATH PYTHONPATH=$PWD/build/lib.linux-x86_64-3.7 python3 rdiff-backup -r0 ../rdiff-backup_testfiles/restoretest3 /tmp/dummyrestore
[...]
  File "/home/ericl/Public/rdiff-backup/build/lib.linux-x86_64-3.7/rdiff_backup/rpath.py", line 1445, in read
    data = data.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 14908: invalid start byte

I didn't go into the details in my commit messages because it was always the same thing, either an error that the broken filename couldn't be decoded, or an error that I can't mix bytes and str (in many variations), until I had rdiff-backup(-statistics) and the testsuite running properly again.

@ottok
Copy link
Contributor

ottok commented Aug 17, 2019

I found out that in Python 3 the file path is not of type str but it has its own type inherited from pathlib. Maybe the original problem was that some code mixed str and path objects? And maybe the solution is to consistently use the path object correctly?

See
https://realpython.com/python-pathlib/
https://snarky.ca/why-pathlib-path-doesn-t-inherit-from-str/

@ericzolf
Copy link
Member Author

Perhaps true, but this would mean IMHO a major rewrite, and more difficult than just changing everything from str to bytes or vice-versa, and I'm not absolutely convinced that it solves all our problems: the statement from the 2nd link that "all path can be represented by string" is quite wrong:

import pathlib
for p in pathlib.Path('.../rdiff-backup_testfiles/various_file_types').iterdir():
    print(p)

ends with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcb1' in position 62: surrogates not allowed

So, we might end in dead-end because of cases like this, so it's something we might want to look at to get cleaner code, base on a new issue, but I don't see the topic relevant for now.

@ericzolf
Copy link
Member Author

So, how do we progress? I'm happy to accept/merge some of the other pending pull requests, but they would introduce conflicts with this pull request.

@ikus060
Copy link
Contributor

ikus060 commented Aug 21, 2019 via email

@ottok
Copy link
Contributor

ottok commented Aug 21, 2019

I would lean towards prioritizing Python 3 (and tha tPR was already merged) and CI / automatic testing pipeline to ensure the following changes don't break anything. A good testing system can also be utilized here to make files with UTF-8, UTF-16, UTF-32 etc unicode characters and then it is easy to see if fixing all str cases with Python Path objects is enough, or if we really need to have everything in binary all the time.

@ericzolf
Copy link
Member Author

ericzolf commented Aug 22, 2019

This is a huge patch again so I'm not keen to let it rot and do endless rebases, especially not if Travis CI has a PEP8 cleanup of the code as prerequisite. Should this happen, it would be a complete rewrite of the patch.

This PR might be ugly but it works the way it should and the tox tests prove it, so:

  1. It's better this way than the other way around, nice but non functional, as on master currently
  2. Introducing pathlib might be a good idea, actually I like it (*) BUT it's a huge change because bytes and strings act more or less alike but pathlib.Path has a completely different interface
  3. Once the code works correctly and the tests work as they should, refactoring/redesigning becomes easier, which is currently not the case on master.

So, my strong preference, merge this PR as ugly as it is, merge the other smaller stuff, including CI, clean the code PEP8, and in parallel continue the discussion about pathlib in a new issue.

(*) I'm thinking that the right way to do it would be to somehow derive RPath from pathlib.Path and/or RORPath from PurePath, with the potential issue that our code ought to be cross platform hence potentially mix windows and Linux path and transfer them remotely, so that there are a lot of thing to consider as pathlib uses internally different classes for both types of path.

@ottok
Copy link
Contributor

ottok commented Aug 22, 2019

I made a couple of extra test files:

ls testfiles/various_file_types/
 ไฟล์ทดสอบ
 aaaテストファイル
 aaaaaaåäöåäöÅÄÖ
'aaaملف الاختبار'
 aaa測試文件
 åäöåäöÅÄÖ
 changeable_permission
 executable
 executable2
 fifo
'Ø«±Wb®Å]'$'\302\212''»'$'\025''v*ô'$'\017''!ù>âY'$'\302\206''»«Ûp°'$'\302\204\023''k'$'\035''Âñõe¥U'$'\302\202\302\232''UV ôß4ºýX'$'\003\302\202\a''sÎ'$'\302\236\302\213''³4'$'\004\302\237\027'' ô'$'\302\217''¦ú'$'\302\227''«Ø¬Ú'$'\302\205''ÜKvCú#'$'\302\224\302\222\302\236''É·Ã_'$'\017\302\204''g'$'\302\232''B'$'\021''<=^ÛM'$'\023\302\226''c'$'\302\213''§|*"\'\''^$@#!(){}?+ ~` '
 regular_file
 regular_file.sig
 subdir
 symbolic_link
 test
 two_hardlinked_files1
 two_hardlinked_files2
'ث'$'\261''Wb'$'\256\305'']'$'\212\273\025''v*'$'\364\017''!'$'\371''>'$'\342''Y'$'\206\273\253\333''p'$'\260\204\023''k'$'\035\302\361\365''e'$'\245''U'$'\202\232''UV'$'\240\364\337''4'$'\272\375''X'$'\003\202\a''sΞ'$'\213\263''4'$'\004\237\027'' '$'\364\217\246\372\227\253''جڅ'$'\334''KvC'$'\372''#'$'\224\222\236''ɷ'$'\303''_'$'\017\204''g'$'\232''B'$'\021''<=^'$'\333''M'$'\023\226''c'$'\213\247''|*"\'\''^$@#!(){}?+ ~` '

When I run the test script I get this result:

$ sudo python3 pathlib-test.py 
testfiles/various_file_types/aaaaaaåäöåäöÅÄÖ
testfiles/various_file_types/aaaملف الاختبار
testfiles/various_file_types/aaaテストファイル
testfiles/various_file_types/aaa測試文件
testfiles/various_file_types/changeable_permission
testfiles/various_file_types/executable
testfiles/various_file_types/executable2
testfiles/various_file_types/fifo
testfiles/various_file_types/regular_file
testfiles/various_file_types/regular_file.sig
testfiles/various_file_types/subdir
testfiles/various_file_types/subdir/subdir_file
testfiles/various_file_types/symbolic_link
testfiles/various_file_types/test
testfiles/various_file_types/two_hardlinked_files1
testfiles/various_file_types/two_hardlinked_files2
testfiles/various_file_types/Ø«±Wb®Å]�»�v*ô!ù>âY»«Ûp°
                                                     �k�Âñõe¥U�UV ôß4ºýX��sÎ��³4��� ô¦ú«Ø¬Ú
ÜKvCú#���É·Ã_
             gB�<=^ÛM�c�§|*"\'^$@#!(){}?+ ~` 
testfiles/various_file_types/åäöåäöÅÄÖ
^[[?62;c^[[?62;cTraceback (most recent call last):
  File "pathlib-test.py", line 3, in <module>
    print(p)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcb1' in position 30: surrogates not allowed
$ 62;c62;c

(The 62.. on the terminal prompt are real and intentional.)

My test code:

import pathlib
for p in sorted(pathlib.Path('testfiles/various_file_types').iterdir()):
    print(p)

So it seems that Pathlib handles just fine reading and sorting all paths but when it is time to print, it fails to print the unprintable characters. The way ls(or bash) handles this is by printing the ordinal values, eg. \123 of characters it cannot print.

I also tried this which is used in the PR:

    x = b'\n'
    print(x)

Result: b'\n'

Your commit 'First try at using filenames handled as bytes' still removes C source files and does changes unrelated to the commit title. I understand that you want this change in and agree that it has its uglinesses, and that you think this is a better base to further improve and refactor the code than the current master branch, so I guess we just need to accept that stanze and merge this. I just hope that introducing "ugliness" does not later on add the work required to clean up and replace b'' with f'' or remove it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants