Future maintenance #15

Open
stefan6419846 opened this issue Feb 6, 2023 · 35 comments

@stefan6419846

Thanks for your work on this fork, which seems to be the most active and up-to-date one.

Unfortunately, GitHub makes it hard to work with forks or even discover them as they usually are hidden in the search results and in-repository search for forks is not available. Additionally, while there is a package on PyPI, it is out-of-date and does not correspond to this repository directly.

What are your plans for the future of your fork? I considered working on my own fork to keep this package available for my use cases, but with your existing work this could become easier. What I am currently thinking of:

  • Move development into the pdfrw organization.
  • Perform a fresh import of your fork into this organization for future maintenance.
  • Try to keep maintaining the original pdfrw package on PyPI.
@federicobond

federicobond commented Mar 27, 2023

Hi Stefan, I am interested in volunteering to get this done too. The outline you propose looks like a good plan. Are you the owner of the pdfrw organization?

cc/ @sarnold

@federicobond

We should aim to get the test suite running in CI too.

@stefan6419846
Author

Are you the owner of the pdfrw organization?

Yes, I am. For the time being, it just is a placeholder.

We should aim to get the test suite running in CI too.

It mostly does in this repository. Some tests have been disabled, though, even if they should be safe to enable, as there is no visual difference (only internal changes in reportlab, for example).

@federicobond

Pinging @t-houssian and @Lucas-C who might be interested in this too.

@Lucas-C

Lucas-C commented Mar 27, 2023

Thanks for the ping :)

@MartinThoma may also be interested in the subject:
he is the maintainer of https://github.com/py-pdf/pypdf (formerly PyPDF2), and "currently helping to clean up the Python PDF ecosystem", to quote one of his recent emails 😊

Maybe the best option would be to maintain this package inside the https://github.com/py-pdf organization, which is already active?

@federicobond

Maybe the best option would be to maintain this package inside the https://github.com/py-pdf organization, which is already active?

I think that's a great idea. It could bring greater visibility and increased collaboration between projects.

@t-houssian

t-houssian commented Mar 27, 2023

@federicobond Thanks for the ping as well! I sadly don't have the time currently to help out much on this but do think that what @Lucas-C has mentioned is a great idea. I used this fork in a project of mine called fillpdf (https://github.com/t-houssian/fillpdf). I released this fork because it was the best I could find to use in my project.

I created that fillpdf project because of how hard it was to work with the current PDF filling libraries, so I think any clean-up and making things more user-friendly would be awesome. Feel free as well to add fillpdf to the ecosystem and use any of the code from it.

Best of luck y'all!

@MartinThoma

Thanks for pinging me 🤗 Yes, I've spent quite some time since April 2022 merging PyPDF2 back into pypdf + setting up CI/tests + docs + merging over 100 PRs + fixing several hundred issues. Now we have at least one other super active pypdf developer again and I hope that PyPDF3 / PyPDF4 developers and users will move back to pypdf 🤞

It seems to me that pdfrw is solving a sub-set of the problems that pypdf is solving. For this reason I would love if the two projects (and especially the developers around them) could converge. I approached Patrick Maupin in April 2022. Sadly I don't know pdfrw well enough to really judge if merging the two would be reasonably possible.

I was thinking that we might be able to define a "pypdf-core" which is similar to pdfrw, but nobody did any work in that direction so far. I'm also uncertain about which use-cases current pdfrw users actually have. Looking at SO, I'd rather recommend them to use pypdf.

Another activity besides blog posts + answering questions is nudging the fpdf / fpdf2 people to make their relationship clear to the community. @Lucas-C and I recently received a super nice e-mail from the original author; I'm hopeful here 🎉

pdfrw in the py-pdf GitHub organization

I'd be open to moving pdfrw into the py-pdf GitHub organization. I would love an exchange between PDF-related projects / developers and sharing of issues/solutions/test cases.

The official git seems to be https://github.com/pmaupin/pdfrw (1700 stars) whereas https://github.com/sarnold/pdfrw only has 24 stars. I'm interested in bringing the Python-PDF communities closer together, not in fracturing the communities even more. So I'd rather not move https://github.com/sarnold/pdfrw into py-pdf at this stage.

The pdfrw GitHub organization

What does https://github.com/pdfrw do? I don't see anything in there.

@federicobond

Thank you for your input @MartinThoma, much appreciated!

It seems to me that pdfrw is solving a sub-set of the problems that pypdf is solving. For this reason I would love if the two projects (and especially the developers around them) could converge.

I agree this would be very desirable for the ecosystem.

I'm also uncertain about which use-cases current pdfrw users actually have. Looking at SO, I'd rather recommend them to use pypdf.

I can add my 2 cents here: we began using PyPDF a few years ago at our company to include a stamp on each page of some files that are uploaded to our system. Its performance was pretty bad: it took whole seconds and consumed quite a bit of memory to process moderately long files. We ended up switching to pdfrw and saw a huge improvement. This may no longer hold, but pdfrw worked well enough for us and was easy enough to debug that we have stayed with it since.

I would love an exchange between PDF-related projects / developers and sharing of issues/solutions/test cases.

That would be awesome! Also increasing the bus factor for these projects.

The official git seems to be https://github.com/pmaupin/pdfrw (1700 stars) whereas https://github.com/sarnold/pdfrw only has 24 stars.

I believe sarnold's fork is just pmaupin master + some small fixes/improvements, most of which we would need to land into master eventually (someone correct me if I'm wrong). Other than that, the projects haven't really diverged.

What does https://github.com/pdfrw do? I don't see anything in there.

I believe it's just @stefan6419846 squatting the name in case it was going to be used.

@Lucas-C

Lucas-C commented Mar 28, 2023

I would love an exchange between PDF-related projects / developers and sharing of issues/solutions/test cases.

That would be awesome! Also increasing the bus factor for these projects.

I totally agree! 😊

In fact, maybe we could consider merging https://github.com/PyFPDF (which is mostly fpdf2) into https://github.com/py-pdf?
I'm all for joining efforts, and I'd be happy to help on other PDF libraries!

Would you be open to this @MartinThoma?
This is not the main topic here, but I use the opportunity to drop this idea 😋

Also, maybe at some point the org should have a code of conduct & some projects management guidelines?
I'm thinking about some basic directions on how to handle issues, reviews, releases, etc.

(edit:) I see that the only public member of the py-pdf org is Matthew Peveler: https://github.com/orgs/py-pdf/people
You are not a member of the org @MartinThoma?
Having public org membership, and being able to know clearly who has the rights to release new versions seems important to me 😊

@stefan6419846
Author

What does https://github.com/pdfrw do? I don't see anything in there.

I believe it's just @stefan6419846 squatting the name in case it was going to be used.

This is correct. I just created this organization to block the name when thinking about the future of the project and creating this issue as well. As responses have been quite sparse until yesterday (with my e-mails to Patrick and Steve being unanswered for nearly two months now as well), I did not yet take this further.

I am open to moving this to the aforementioned py-pdf organization nevertheless.

I was thinking that we might be able to define a "pypdf-core" which is similar to pdfrw, but nobody did any work in that direction so far. I'm also uncertain about which use-cases current pdfrw users actually have. Looking at SO, I'd rather recommend them to use pypdf.

Speaking of my use-case: I mostly use pdfrw for working with PDF forms.

@MartinThoma

In fact, maybe we could consider merging https://github.com/PyFPDF (which is mostly fpdf2) into https://github.com/py-pdf?
I'm all for joining efforts, and I'd be happy to help on other PDF libraries!

Sounds awesome to me! We should talk about permissions/expectations beforehand, though. I would suggest that you open an issue/discussion in https://github.com/PyFPDF/fpdf2 to discuss this :-)

The two roles I can give are:

[image: screenshot of the available py-pdf organization roles]

I would make you @Lucas-C an owner of py-pdf, but would appreciate it if we had a discussion before adding new owners (for members, I don't care too much)

Although owners have all permissions on all repositories, I would expect them/me not to interfere with them except if the repository's maintainer(s) are inactive for a long time (e.g. 3 months?) or if something security-critical happens (e.g. a dependency was introduced that is malicious/typo-squatting). As both pypdf and fpdf are pretty big, we should write such things down within py-pdf (maybe make a GitHub page at https://py-pdf.github.io/ )

@MartinThoma

[pypdf's] performance was pretty bad: it took whole seconds and consumed quite a bit of memory to process moderately long files. We ended up switching to pdfrw and saw a huge improvement.

I've heard that before 🤔 When I have some time I need to create benchmarks + investigate that 🕵️

@Lucas-C

Lucas-C commented Mar 28, 2023

I would make you @Lucas-C an owner of py-pdf, but would appreciate it if we had a discussion before adding new owners (for members, I don't care too much)

Although owners have all permissions on all repositories, I would expect them/me not to interfere with them except if the repository's maintainer(s) are inactive for a long time (e.g. 3 months?) or if something security-critical happens (e.g. a dependency was introduced that is malicious/typo-squatting). As both pypdf and fpdf are pretty big, we should write such things down within py-pdf (maybe make a GitHub page at py-pdf.github.io )

Sounds great to me! 😊
I'll open this issue during the week, when I have some time available.

@brokenshield

Hey, I'm just a user, but I know how hard it is to keep a project going, so from a user perspective: do what you've got to do! Also: thank you for your continued work. It is appreciated.

@federicobond

I'm so happy this is moving along! 😄

As for pdfrw, should we wait until @Lucas-C becomes a py-pdf owner to discuss next steps?

@Lucas-C

Lucas-C commented Apr 3, 2023

Hi!

I described how I plan for fpdf2 to migrate to @py-pdf in this announcement:
py-pdf/fpdf2#752

I'd be happy to get feedback from you all 😊

@abubelinha

I am not a developer, just a pypdf user.
I compared pdfrw and pypdf for extracting pages from a big pdf into smaller files.
pdfrw was the clear winner (much less time used; also better output file size optimization when stuff was repeated, I think).
Unfortunately I just posted my test output but later on modified my script and didn't keep my comparison code.
So certainly there might be errors in the way I coded my pypdf extraction test, but I think you guys might look further into this.

Thanks for your amazing job
@abubelinha

@MartinThoma

MartinThoma commented Jun 30, 2023

This month we discovered+fixed a couple of issues that affect file size ( py-pdf/pypdf#1926 , py-pdf/pypdf#1906 ). If you can come up with a nice comparison script or a good test scenario, I could add it to https://github.com/py-pdf/benchmarks

I'm all for an open and fair assessment of the qualities of different libraries. This benchmark allowed us to improve the text extraction quality of pypdf a lot. Maybe we can do something similar for other workflows / operations.

edit: Recently I'm spending less time on open source. If you make a PR to https://github.com/py-pdf/benchmarks that might help 😅

@abubelinha

abubelinha commented Jul 1, 2023

EDIT: @MartinThoma I am not sure if your last post was an answer to mine or a general comment

As I said, I am not a developer. I do not use git, so PRs are pretty unknown to me.
But I was able to remember and reproduce that test and posted the code here:
py-pdf/benchmarks#7

@MartinThoma

Thank you for clarifying and for sharing your benchmarking code. I will eventually add the idea to https://github.com/py-pdf/benchmarks . It might just take some time (and I will list you as a co-author of that PR, so you get credit for it :-) )

@baslr

baslr commented Jul 25, 2023

@t-houssian @Lucas-C @MartinThoma what is the status of moving this repo to the py-pdf org?
I found a fix for one of the bugs in #17 and would like to add it to the project. I do not want to fragment pdfrw further by adding another unmaintained fork.

@sarnold is this project still maintained or archived?

@Lucas-C

Lucas-C commented Aug 4, 2023

@t-houssian @Lucas-C @MartinThoma what is the status of moving this repo to the py-pdf org?

Good question.

I have just moved fpdf2 to the py-pdf GitHub org.
I'll be away for a few days, but when I'm back, I volunteer to set up a py-pdf/pdfrw repository,
based on this fork, with maybe extra commits from https://github.com/PyFPDF/pdfrw,
and a GitHub Actions pipeline running tests.

Would you agree with this suggestion @MartinThoma & @MasterOdin?

@MasterOdin

Makes sense, happy to help get the GH Actions pipelines set up.

@stefan6419846
Author

This fork already has GitHub actions set up, so this part should be relatively easy in theory.

Nevertheless, some of the tests have apparently been disabled for now and might need further evaluation: https://github.com/sarnold/pdfrw/commits/master/tests/expected.txt I did some research some months ago about the actual differences on the PDF files (related to more recent reportlab package etc.) and as far as I remember, most of the (visual) results were rather identical (I am currently on vacation and thus have no access to my experimental code).

Just for the record: Valid reference files generated by Python 3.5 (and partly Python 3.6) might be downloaded from the artifacts at https://github.com/stefan6419846/pdfrw_reference_python36/actions/

@MartinThoma

MartinThoma commented Aug 5, 2023

Would you agree with [having pdfrw in the py-pdf organization]?

I want a healthy Python / PDF ecosystem and I want to avoid having lots of small projects with tons of overlap.

Maintenance:

Unique Selling Point: pdfrw can make modifications to PDF files, similar to pypdf. However, pdfrw is a lot faster. Besides the speed, I don't know of a single feature that pdfrw supports which pypdf does not.

Community: As it has big overlaps with pypdf, I take it as a comparison

  • StackOverflow: 79 questions vs 1439 questions to pypdf
  • PyPI downloads:
    • pdfrw: 159,461 downloads (rank 3689). pdfrw2 is not in the top 5000
    • pypdf: 3,511,042 downloads for PyPDF2 (rank 771), pypdf has 991,756 downloads (rank 1481), pypdf3 has 114,458 downloads (rank 4291)
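
To put the community figures above side by side, here is a small sketch that computes the ratios. The numbers are copied from this comment; the choice to sum PyPDF2, pypdf and pypdf3 into one "pypdf family" total is an assumption made only for this comparison, and the variable names are illustrative.

```python
# Figures quoted in the comment above (StackOverflow questions, PyPI downloads).
so_questions = {"pdfrw": 79, "pypdf": 1439}
downloads = {
    "pdfrw": 159_461,
    "PyPDF2": 3_511_042,
    "pypdf": 991_756,
    "pypdf3": 114_458,
}

# Assumption: treat PyPDF2, pypdf and pypdf3 as one family for comparison.
pypdf_family = sum(count for name, count in downloads.items() if name != "pdfrw")

download_ratio = pypdf_family / downloads["pdfrw"]
question_ratio = so_questions["pypdf"] / so_questions["pdfrw"]

print(f"pypdf family downloads: {pypdf_family:,}")
print(f"downloads ratio (pypdf family / pdfrw): {download_ratio:.1f}")
print(f"StackOverflow questions ratio (pypdf / pdfrw): {question_ratio:.1f}")
```

With these inputs, the pypdf family has roughly 29 times the downloads and roughly 18 times the StackOverflow questions of pdfrw.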

Maintainer support for project transfer:

  • pdfrw: Did anybody write to pmaupin (e.g. via e-mail)? What does he think about transferring pdfrw (GitHub) into py-pdf?
  • pdfrw2: @sarnold would you be open to transferring pdfrw2 into the py-pdf GitHub organization + adding a GitHub Action so that others can help to maintain the library as well?

Summary

I'm uncertain. I think pdfrw must have some very good ideas regarding parsing of PDFs built-in. However, I don't see a single feature that pdfrw supports and pypdf doesn't. I'm also not certain how good the community support of pdfrw/ pdfrw2 is and if we could maintain it well.

Given those first impressions, I think I'd rather try to improve pypdf with ideas from pdfrw + help the community make a switch than move pdfrw to py-pdf.

@MartinThoma

@Lucas-C Does fpdf2 use pdfrw2? If that is the case, I can see that you would have an inherent interest in taking care of pdfrw. If you want to take care of it, I'd be ok with it :-)

However, we should try to get some option to release a new version on PyPI. I'm currently observing how this does not work well with camelot-py 🥲

@MartinThoma

I completely forgot this: pmaupin#232

If pdfrw is the basis of many other projects, I'd also say it would fit well into py-pdf.

@MartinThoma

More download stats:

https://pypistats.org/packages/pdfrw - 7% still use python 2 😱

https://pypistats.org/packages/pdfrw about 4% of python 2 users

@stefan6419846
Author

Maintainer support for project transfer

I wrote to both Patrick and Steve in February when I initially opened this issue to get their opinion about an organization-based approach and the future maintenance in general, but never received any public or private response from them. There might be different reasons for this.

@MartinThoma

I tried to contact Patrick Maupin 5 days ago, but haven't received a response so far. I would wait 2 weeks in total.

If somebody wants to take the work of a maintainer of pdfrw, we could do the following:

  1. Create a fork of pdfrw / pdfrw2 into py-pdf
  2. Rename the fork (pdfrw3 🙈 ... maybe something completely different?)
  3. Register that name at PyPI + add the CI parts into the repository to ensure that we can push new versions
  4. Announce it so that people can switch

@Lucas-C

Lucas-C commented Aug 13, 2023

@Lucas-C Does fpdf2 use pdfrw2? If that is the case, I can see that you would have an inherent interest in taking care of pdfrw. If you want to take care of it, I'd be ok with it :-)

No, fpdf2 does not rely on pdfrw.
fpdf2 just has a documentation page on how to combine both libs:
https://py-pdf.github.io/fpdf2/CombineWithPdfrw.html

All things considered, I'm not particularly interested in maintaining pdfrw
and agree that it is probably better to focus on pypdf as a replacement.

I think I'm even going to get rid of that Combine with pdfrw page in fpdf2 documentation.

@Lucas-C

Lucas-C commented Aug 13, 2023

I made a quick performance comparison between pdfrw & pypdf for 2 specific use cases that we have in fpdf2 documentation:

Those are the execution times of running those scripts on my computer, using a 4.8MB base PDF document with 47 pages:

$ time ./add_on_page_with_pdfrw.py
real    0m1,649s
$ time ./add_on_page_with_pypdf.py
real    0m32,082s

$ time ./add_new_page_with_pdfrw.py
real    0m2,769s
$ time ./add_new_page_with_pypdf.py
real    0m47,247s

Based on those results, pdfrw is roughly 17 to 19 times faster than pypdf for those use cases!
To me, this seems like a severe limitation of pypdf 😢
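
For reference, the speed-up factors implied by the timings above can be worked out with a few lines of Python. This is only a sketch over the four `real` readings quoted in this comment; the `parse_real` helper and the dictionary layout are illustrative names, not part of either library.

```python
def parse_real(value: str) -> float:
    """Convert a `time` reading such as '0m32,082s' (comma decimal) to seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds.replace(",", "."))


# "real" timings copied from the shell output above.
timings = {
    "add_on_page": {"pdfrw": "0m1,649s", "pypdf": "0m32,082s"},
    "add_new_page": {"pdfrw": "0m2,769s", "pypdf": "0m47,247s"},
}

# Speed-up of pdfrw over pypdf for each use case.
speedups = {
    case: parse_real(runs["pypdf"]) / parse_real(runs["pdfrw"])
    for case, runs in timings.items()
}

for case, factor in speedups.items():
    print(f"{case}: pdfrw is about {factor:.1f}x faster")
```

This gives about 19.5x for adding content on a page and about 17.1x for adding a new page, on a single machine and a single 47-page input, so the figures are indicative rather than a full benchmark.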

@MartinThoma: what do you think is the bottleneck here for pypdf?

Edit: the scripts I used can be found there: https://github.com/py-pdf/fpdf2/tree/master/tutorial (they require the source & destination PDF files to be specified as arguments)

@MartinThoma

I am aware of the speed difference. I've actually already created a benchmark for it: https://github.com/py-pdf/benchmarks#watermarking-speed

Sadly, I cannot pinpoint a single simple reason for that difference. I think a part of the reason is that we represent floats with FloatObject which (I think) might be more heavy-weight than it needs to be.

@Lucas-C

Lucas-C commented Aug 13, 2023

I think a part of the reason is that we represent floats with FloatObject which (I think) might be more heavy-weight than it needs to be.

I spent an hour investigating the performance of pdf_benchmark.library_code.pypdf_watermarking,
and I think the issue is probably more that pypdf ALWAYS decodes/parses content streams, and that objects are cloned "all the time". For example, PageObject._merge_page() makes repeated calls to ContentStream.__init__(), which itself calls EncodedStreamObject.get_data().

Whereas pdfrw does not bother to clone objects (cf. PageMerge.add()) and it does not parse streams (cf. https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pdfreader.py#L7).

Maybe pypdf should lazy-parse content streams? That is, only parse them if it has to access / alter those streams.
What do you think of this @MartinThoma?
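
To illustrate the lazy-parsing idea, here is a minimal, standard-library-only sketch. `LazyContentStream` is a hypothetical class invented for this example; it is not pypdf's or pdfrw's API, and the whitespace tokenizer is a deliberate simplification of real content-stream parsing. The point is only that an untouched stream is written back verbatim and never decoded.

```python
import zlib
from functools import cached_property


class LazyContentStream:
    """Hypothetical wrapper: keeps compressed bytes, decodes only on first access."""

    def __init__(self, raw: bytes):
        self.raw = raw          # compressed bytes as read from the file
        self.decode_calls = 0   # instrumentation for this demo only

    @cached_property
    def operations(self) -> list:
        # Runs at most once per instance; the result is cached afterwards.
        self.decode_calls += 1
        data = zlib.decompress(self.raw)
        # Naive whitespace tokenizer; real content-stream parsing is more involved.
        return data.split()

    def serialize(self) -> bytes:
        # Untouched streams are copied verbatim, with no decode or re-encode.
        if "operations" not in self.__dict__:
            return self.raw
        return zlib.compress(b" ".join(self.operations))


untouched = LazyContentStream(zlib.compress(b"q 1 0 0 1 0 0 cm Q"))
edited = LazyContentStream(zlib.compress(b"BT ET"))

edited.operations.append(b"q")  # first access triggers the decode

out_untouched = untouched.serialize()
out_edited = edited.serialize()
print(untouched.decode_calls, edited.decode_calls)
```

Only the stream that was actually modified pays the decoding cost, which is the behaviour proposed above for operations such as merging pages.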


9 participants