Migration to Python 3 with file paths as bytes instead of str #53
Another thought: should we stash the branch?
I almost forgot, if you want to test, it depends on https://github.com/ericzolf/rdiff-backup/releases/tag/Testfiles2019-08-10
Could you please rebase on master so that the relevant commits are easier to review? Thanks
Based on discussion on the mailing list, still more to do. Single backup and incremental backup basically work. Still need to make all test cases work again (untried).
This commit is more of a generic "let's check everything" in one big sweep. There are surely things left to do, but all paths have been set as bytes where I could find them with some search pattern.
Just simple bytes vs. string fixes.
Somehow the librsync test failed after the selection was fixed, it was due to files not being cleaned up, not sure why it didn't appear before. The selection fixes were just the usual str vs. bytes thingy.
The tests were again mostly about bytes vs. str. rpath's dirsplit had to be reverted, the expected result didn't fit os.path.
The usual bytes vs. str stuff and replacing StringIO with BytesIO in compare.py and restore.py
Also improve the ability to create a new restoretest repo.
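As a rough illustration of why a StringIO to BytesIO swap goes along with byte paths (a sketch only; the record content below is hypothetical and is not taken from compare.py or restore.py): a text buffer refuses bytes outright, while a byte buffer keeps an undecodable name intact.

```python
import io

# Hypothetical record containing a path that is not valid UTF-8.
record = b"File bad\xffname\n"

# A byte buffer stores the record unchanged.
byte_buffer = io.BytesIO()
byte_buffer.write(record)
stored = byte_buffer.getvalue()

# A text buffer only accepts str, so writing bytes raises TypeError.
try:
    io.StringIO().write(record)
    text_buffer_accepts_bytes = True
except TypeError:
    text_buffer_accepts_bytes = False

print(stored, text_buffer_accepts_bytes)
```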
7054dde to c29a5b8
Branch rebased without conflict, can be merged.
Oh, so this change really is 18 commits.. I am surprised that many changes were needed. Was the removal of src/memoryleak.c in the first commit intentional? The commit does not reference that change or the motivation of it, so it leaves me guessing. I am no expert on binary vs unicode. I am surprised that I also see lines like Somehow this does not feel like a pythonic solution to me. I understand that |
Honestly, I went through the same "wonder" when I attacked this topic, and also searched for a less intrusive approach. The problem is that as soon as you accept that any filename can have any encoding, including a broken one, or several at once because of cross-platform remote backups etc., you need to work solely with bytes and can't work with str any more. One needs to understand that at this point any "encoding" is possible but any "decoding" can lead to a conversion error, which means all files (metadata & co) containing paths need to be written as binary/bytes, any command which might contain a path must be expressed as bytes, and the same applies to regexes. There is perhaps an alternative, but it would be way above my pay grade (so to say).
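A minimal sketch of that argument, assuming a POSIX-style path layout; the file name and the `/backup/repo` prefix are made up for illustration and do not come from rdiff-backup. Strict decoding of the name fails, while `os.path` functions and regexes keep working as long as everything stays bytes.

```python
import os
import re

# Hypothetical name: valid Latin-1, but not valid UTF-8
# (0xE9 starts a multi-byte sequence that is never completed).
raw_name = b"caf\xe9.txt"

def try_decode(name: bytes, encoding: str = "utf-8"):
    """Return the decoded name, or None if the bytes are invalid in `encoding`."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None

# Strict decoding fails, so the name cannot be held as a clean str.
decoded = try_decode(raw_name)

# Bytes in, bytes out: os.path functions accept and return bytes.
joined = os.path.join(b"/backup/repo", raw_name)
base = os.path.basename(joined)

# A regex applied to a bytes path has to be a bytes pattern as well.
pattern = re.compile(rb".*\.txt$")
matches = pattern.match(base) is not None

print(decoded, base, matches)
```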
Sorry, forgot: removing |
Incidentally I came across the same discussion at https://lwn.net/Articles/796344/ but I can't find any info on what is the correct way to do this. My love for Python has got a big dent if one really needs to sprinkle `b'` everywhere to do something with file paths.
The linked thread is a good summary of what I had to learn the hard way.
…On August 16, 2019 9:39:13 AM UTC, "Otto Kekäläinen" ***@***.***> wrote:
Incidentally I came across the same discussion at
https://lwn.net/Articles/796344/ but I can't find any info on what is
the correct way to do this.. My love for Python has got a big dent if
one really needs to sprinkle `b'` everywhere to do something with file
paths..
Sorry, one additional commit to also fix rdiff-backup-statistics with regard to bytes, plus some "lazy" remains.
Remove forgotten lazy references and replace them with the standard filter iterator. Tested on ../rdiff-backup_testfiles/restoretest3 and works.
11fb247 to 1b5c352
OK, basically, everything started with the thread https://lists.nongnu.org/archive/html/rdiff-backup-users/2019-08/msg00010.html - based on the feedback that I couldn't assume only UTF-8 paths but had to assume any codeset even mixed and broken ones. The first commit in the PR was about removing the The rest of the MR is solely about making sure that rdiff-backup and all the tests work properly, once I had started to switch from str/unicode to bytes. It is even worse because my earlier version wouldn't have been able to use a repo containing any file with broken repo. Take the rdiff-backup version at the end of PR #40 and try to restore the repo under
I didn't go into the details in my commit messages because it was always the same thing, either an error that the broken filename couldn't be decoded, or an error that I can't mix bytes and str (in many variations), until I had rdiff-backup(-statistics) and the testsuite running properly again. |
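The two error families mentioned above can be reproduced in a few lines. This is only a sketch with a made-up name (`broken`) and a made-up path component, not code from the PR.

```python
import os

# Hypothetical file name containing a byte that is invalid in UTF-8.
broken = b"bad\xffname"

def classify(action):
    """Run `action` and report which kind of error, if any, it raises."""
    try:
        action()
    except UnicodeDecodeError:
        return "decode error"
    except TypeError:
        return "bytes/str mix"
    return "ok"

results = [
    # Decoding the broken name strictly fails.
    classify(lambda: broken.decode("utf-8")),
    # Mixing bytes and str path components fails.
    classify(lambda: os.path.join(broken, "increments")),
    # Staying in bytes throughout works.
    classify(lambda: os.path.join(broken, b"increments")),
]
print(results)
```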
I found out that in Python 3 the file path is not of type See |
Perhaps true, but this would mean IMHO a major rewrite, more difficult than just changing everything from str to bytes or vice versa, and I'm not absolutely convinced that it solves all our problems. The statement from the 2nd link that "all paths can be represented by string" is quite wrong:

    import pathlib
    for p in pathlib.Path('.../rdiff-backup_testfiles/various_file_types').iterdir():
        print(p)

ends with:
So we might end up in a dead end because of cases like this. It's something we might want to look at to get cleaner code, based on a new issue, but I don't see the topic as relevant for now.
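One possible explanation for this kind of failure, sketched under the assumption that the undecodable bytes reach pathlib as lone surrogates (the `surrogateescape` convention that `os.fsdecode` and `os.fsencode` apply on POSIX). The name below is hypothetical, and the handler is spelled out explicitly so the sketch does not depend on the local filesystem encoding.

```python
import pathlib

# Hypothetical file name containing a byte that is invalid in UTF-8.
raw = b"bad\xffname"

# The undecodable byte becomes a lone surrogate in the str form.
as_str = raw.decode("utf-8", "surrogateescape")
p = pathlib.PurePosixPath("repo") / as_str

def can_encode_strictly(text: str) -> bool:
    """True if `text` can be written out as strict UTF-8 (e.g. by print)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# The path object exists and can be compared or sorted, but not printed.
printable = can_encode_strictly(str(p))

# The original bytes are still recoverable with the same error handler.
recovered = str(p).encode("utf-8", "surrogateescape")

print(printable, recovered)
```

So the path can be held in a str-based object, yet any strict conversion to text output still fails, which is the case that matters here.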
So, how do we progress? I'm happy to accept/merge some of the other pending pull requests, but they would introduce conflicts with this pull request.
I would merge stuff according to priorities. In my view, supporting Python 3 is the priority, since Python 2 end-of-life is planned for 2020.
So, my recommendation is to merge your Python 3 changes first, and then slowly merge the other changes after that, resolving the conflicts as they come up.
…--
Patrik Dufresne Service Logiciel inc.
I would lean towards prioritizing Python 3 (and that PR was already merged) and a CI / automatic testing pipeline to ensure the following changes don't break anything. A good testing system can also be utilized here to make files with UTF-8, UTF-16, UTF-32 etc. Unicode characters and then it is easy to see if fixing all |
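A rough sketch of how such test files could be generated, assuming a POSIX filesystem that accepts arbitrary byte names; the sample names and the helper `create_sample_files` are hypothetical. Note that UTF-16 and UTF-32 encoded names contain NUL bytes, which POSIX file names cannot hold, so the sketch sticks to UTF-8, Latin-1 and a deliberately invalid sequence.

```python
import os
import tempfile

# Hypothetical sample names: the same text in two encodings,
# plus a byte sequence that is invalid in UTF-8.
SAMPLE_NAMES = [
    "käläinen.txt".encode("utf-8"),
    "käläinen.txt".encode("latin-1"),
    b"broken\xff\xfe.txt",
]

def create_sample_files(directory: bytes, names):
    """Create empty files with raw byte names; return the names that were accepted."""
    created = []
    for name in names:
        try:
            with open(os.path.join(directory, name), "wb"):
                pass
        except (OSError, ValueError):
            # Some platforms or filesystems reject names that are not valid text.
            continue
        created.append(name)
    return created

with tempfile.TemporaryDirectory() as tmp:
    directory = os.fsencode(tmp)
    created = create_sample_files(directory, SAMPLE_NAMES)
    # Passing a bytes directory makes os.listdir return bytes names.
    listed = os.listdir(directory)

print(sorted(listed) == sorted(created))
```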
This is a huge patch again so I'm not keen to let it rot and do endless rebases, especially not if Travis CI has a PEP8 cleanup of the code as prerequisite. Should this happen, it would be a complete rewrite of the patch. This PR might be ugly but it works the way it should and the tox tests prove it, so:
So, my strong preference: merge this PR as ugly as it is, merge the other smaller stuff, including CI, clean up the code for PEP8, and in parallel continue the discussion about pathlib in a new issue. (*) I'm thinking that the right way to do it would be to somehow derive RPath from pathlib.Path and/or RORPath from PurePath, with the potential issue that our code ought to be cross-platform, hence potentially mixing Windows and Linux paths and transferring them remotely, so there are a lot of things to consider, as pathlib internally uses different classes for the two types of path.
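To make the cross-platform concern concrete, here is a small sketch of how pathlib's two pure path flavours treat the same text differently. The path strings are illustrative only and have nothing to do with how RPath or RORPath behave today.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Illustrative path text only.
text_with_backslash = r"backup\data"

# On the POSIX flavour a backslash is an ordinary character in a name...
posix_parts = PurePosixPath(text_with_backslash).parts
# ...while the Windows flavour treats it as a separator.
windows_parts = PureWindowsPath(text_with_backslash).parts

# Case sensitivity differs between the two flavours as well.
posix_same = PurePosixPath("Data") == PurePosixPath("data")
windows_same = PureWindowsPath("Data") == PureWindowsPath("data")

print(posix_parts, windows_parts, posix_same, windows_same)
```

A path received from a remote side would therefore have to carry its flavour along with it, which is part of what would need to be considered in a pathlib-based design.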
I made a couple of extra test files:
When I run the test script I get this result:
(The 62.. on the terminal prompt are real and intentional.) My test code:
So it seems that Pathlib handles just fine reading and sorting all paths but when it is time to print, it fails to print the unprintable characters. The way I also tried this which is used in the PR:
Result: Your commit 'First try at using filenames handled as bytes' still removes C source files and makes changes unrelated to the commit title. I understand that you want this change in and agree that it has its ugliness, and that you think this is a better base to further improve and refactor the code than the current master branch, so I guess we just need to accept that stance and merge this. I just hope that introducing "ugliness" does not later on add to the work required to clean up and replace |
This is the promised adaptation for rdiff-backup to work with bytes to handle paths so that there shouldn't be a dependency on the charset used.
The branch is based upon #40, so either it can be merged directly into master and PR #40 closed, or I can merge the branch of this PR and close it so that the review continues on #40. Not sure what is best for the reviewers.