Rebooting PyPDF2 Maintenance #385

mstamy2 · 2017-12-22T21:11:06Z

As I have had numerous other occupations keeping me busy, it has been some time since PyPDF2 has been actively maintained. That's absolutely unfair to those of you who have taken the time to submit pull requests and issues.

Meanwhile, the popularity of this library has continued to grow quite a lot. I would like to revitalize maintenance, but I'll need some help.

Would anyone be interested in being made a collaborator? We would prefer frequent contributors who are familiar with the source and have interest in helping with maintenance.

Again, I apologize for the lapse in maintenance. This is truly a really useful library, and it has potential to do a lot more as well!

SalilVishnuKapur · 2018-01-13T18:55:17Z

@mstamy2 Yes, I would like to take on any kind of responsibility for this project. We should probably start by forming a small team of collaborators and begin with the small pending tasks. Eventually, as the piled-up requests get resolved, we can add better functionality to PyPDF2, such as a better way to handle interactive PDF scraping, especially when there is a fillable box whose values are stored as annotated data.

jcampbell05 · 2018-01-16T23:19:46Z

We use this quite a lot in my company so would be happy to donate time to maintain it as it's critical for us.

We would love to help with optimizing the library, as we are currently having issues with RAM consumption and speed.

cnicodeme · 2018-03-02T13:07:47Z

I agree that it should be restarted. I came upon this library while searching for one in Python. There are currently two top contenders for the #1 spot: PyPDF2 and pdfrw (I consider PyPDF2 to be an evolution of PyPDF, so PyPDF is out of the equation).

The sad thing is, there is NO library in Python that can do up-to-date work with PDF. All the libraries I've found work with PDF v1.3, whereas the current version is 2.0.

So there is a lot of work, but a lot of great reward too. You can work to provide a free and a commercial version, for instance, or offer paid support (I saw that with another library: http://pyfpdf.readthedocs.io/en/latest/#support ).

There are ways to make money out of an open source project, and more importantly, there is room for a good Python library, as none is available today.

So yeah, kick off PyPDF :) I'll see if I can do anything to help!

claird · 2018-03-02T15:50:22Z

Phaseit originally sponsored PyPDF2, and, after a year of dealing with family medical needs, I'm getting ready to return to PyPDF2. I'll likely concentrate first on "housekeeping": preparation of tutorials, ensuring that the source works correctly across different releases of Python, and so on.

I was ready to let PyPDF2 slip into oblivion, and would do so except for the expressions here of enthusiasm and support. Let's see what we can make of PyPDF2.

Cyril, you closed with "... kick off PyPDF". My thought is that we move forward with the sources Matthew has tended at https://github.com/mstamy2/PyPDF2 . When you write "PyPDF", did you have something different in mind?

Cameron Laird, vice president
We make computers work for people.

jcampbell05 · 2018-03-02T15:53:44Z

I've recently switched away from this library, as I found PDFium is the only open source library capable of the speed we need. However, I would be happy to somehow port some of that work back.

I think it would be great if PyPDF could somehow be built on top of PDFium, as it already handles a lot of the parsing.

cnicodeme · 2018-03-03T08:29:38Z

Hi @claird, yes, sorry about the confusion regarding "kick off PyPDF". For me, the original PyPDF is done, so when I mention PyPDF, I mean PyPDF2. What I meant was "Make PyPDF2 the best Python library for managing PDFs" :)

As an indicator of how much potential there is, I recently stumbled upon this question on SO ( https://stackoverflow.com/questions/46849733/change-metadata-of-pdf-file-with-pypdf2/49053629#49053629 ), asking how to change the metadata in PyPDF2.
To my great surprise, there was an answer for pdfrw, and none for PyPDF2.
I added mine, and in less than a day, I already had two upvotes.

It's clear there is a lot of interest in PyPDF2, so if your current family/work situation allows it, I think it would be awesome to work on it.

I don't have great knowledge of how PDF works, and honestly, I don't want to dig into it, as it appears to be pretty awful. But I can help with writing the docs, tutorials, etc.

We can work toward increasing the name recognition of PyPDF2 via https://stackoverflow.com/questions/tagged/pypdf2 mostly, and work on the website to show more examples (I think it's the key feature).

Finally, I'm happy to help because I plan to use PyPDF2 as the core tool to manipulate PDFs, but this means implementing a few missing features. I can try to help with some, but I don't know if I'll be able to tolerate the PDF format ;)

Among the features I think would be great to have are:

- Encryption in a stronger format (AES 256)
- Support for rules like no printing, no modifying, no copying, etc. QPDF provides a wide range of options when encrypting a document
- Support for versions higher than the current 1.3 (PDF is currently at version 2.0!). I don't know what the risk is of "just" changing the header version in the document from 1.3 to 1.5, for instance?
- Greater watermark support. This could be open to discussion, but if we can have a class dedicated to adding an image to a PDF page, with rotation, opacity, etc., that would be perfect. A lot of questions on SO are related to watermarks. The PyFPDF library does it pretty well; it would be possible to work on a fork and integrate it into PyPDF2.
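
To make the header-version question concrete, here is a minimal sketch using only the standard library. It assumes the version is declared as `%PDF-M.N` within the first 1024 bytes of the file; `read_pdf_version` and `rewrite_header_version` are hypothetical helper names, not part of PyPDF2. It only rewrites the declared version string and says nothing about how a given reader reacts to the change.

```python
import re
from typing import Optional

# The header is expected near the start of the file, e.g. b"%PDF-1.3".
_HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def read_pdf_version(data: bytes) -> Optional[str]:
    """Return the version declared in the file header, or None if absent."""
    # Only the first 1024 bytes are scanned (an assumption of this sketch).
    match = _HEADER_RE.search(data[:1024])
    if match is None:
        return None
    major = match.group(1).decode("ascii")
    minor = match.group(2).decode("ascii")
    return f"{major}.{minor}"


def rewrite_header_version(data: bytes, new_version: str) -> bytes:
    """Return a copy of data with only the header version string replaced.

    Nothing else in the file is touched, so no feature of the newer
    version is added or validated.
    """
    if not re.fullmatch(r"\d\.\d", new_version):
        raise ValueError("expected a version like '1.5'")
    head, tail = data[:1024], data[1024:]
    replacement = b"%PDF-" + new_version.encode("ascii")
    return _HEADER_RE.sub(replacement, head, count=1) + tail
```

Because the old and new version strings have the same length, the byte offsets of everything after the header stay the same.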

As less important, but still interesting, items to implement, I would say:

- Rewrite the code to be PEP8 compliant (this is not the case right now)
- Take this opportunity to rewrite the code in Python 3 (method signatures with expected types, for instance)

The rewrite could also help reduce the amount of code required for some tasks. Take a look at my PyPDF2 example on how to write metadata, compared to pdfrw. PyPDF2 requires twice the amount of code. This is open to debate since, I think, the issue is that we need a reader and a writer, whereas pdfrw does everything in one.

The last two points would raise the question: do we keep PyPDF2 as the name, or go to PyPDF3?

Sorry for this long post, but I wanted to share my whole opinion on the current matter. I can help with some parts (importing FPDF, rewriting the code for PEP8 compliance, making it Python 3 only (not required, just a thought)). I can also help with the documentation, the examples, and the website (I'm not a designer, though).

Maybe one interesting thing would be to set up a roadmap of what is planned, in what order, etc. In case you are not aware of it, GitHub lets you write "to-do list" style issues. You can open one for the roadmap and check off what is done, what is left to do, etc.

sekrause · 2018-03-03T13:42:41Z

I think the highest priority should be to go through the open pull requests, merge some and prepare a quick release before thinking about long-term development.

Over a year ago I reported issue #329, which is a serious denial-of-service security issue because you can easily force PyPDF2 into an infinite loop. My really simple pull request #331 has been waiting for a merge for over a year, but nothing has happened so far.

claird · 2018-03-03T15:42:20Z

The internals of PDF are ... well, surprises abound for those who come from other domains of computing.

I agree that watermark enhancements make sense. I strongly favor PEP8ification. While I've worked plenty on encryption in other domains, I have nearly no familiarity with PDF's approach and practices.

I mildly incline toward the launch of PyPDF3. I'll think about that for a few more days.

In principle, I'm a good one to lead roadmap work. Again, I want to think for a few days.

Cameron Laird, vice president
We make computers work for people.

LightningMan711 · 2018-03-09T14:16:08Z

I have to do some seriously complex PDF manipulation for my current company. I love what this can do enough to basically solve my own question. I don't really know enough to help with issues (my solution was 80% someone else's code and 20% happy accident), but I would love to encourage maintenance or development of PyPDF3 as claird offered.

mstamy2 · 2018-03-12T19:51:01Z

Thanks to everyone for their interest and suggestions!

I agree that the primary need is to take care of the existing bugs/performance issues, and then fully implementing some of the features that are only weakly supported such as text extraction, reading/writing to forms, encryption, watermarking, etc. before moving on to new features.

We would definitely need at least a new major version (i.e. PyPDF2 2.x.x), but I agree with @claird and @cnicodeme that a new project (PyPDF3) would be appropriate considering the number of backwards-incompatible changes taking place.

Kwilliams15 · 2018-03-17T05:11:03Z

I've just recently found a very large use case for this module at work and it's been fantastic. If this module weren't around, I don't know how long it would have taken to solve some of the issues we are solving now. I'd be happy to help however I can to resurrect either PyPDF2 or a potential PyPDF3.

mwhit74 · 2018-03-30T18:29:16Z

A reboot would be amazing!

I have implemented this library on a large project in my office as it was the only library I could find to do what I wanted to do. I was frustrated to find that some parts of it work amazingly well and others like #355 go unanswered for months and really hinder what can be done with the library.

I want to continue to use this library for other similar projects as they come along.

xilopaint · 2018-04-12T00:34:08Z

Thanks to everyone for their interest and suggestions!

I agree that the primary need is to take care of the existing bugs/performance issues, and then fully implementing some of the features that are only weakly supported such as text extraction, reading/writing to forms, encryption, watermarking, etc. before moving on to new features.

We would definitely need at least a new major version (i.e. PyPDF2 2.x.x), but I agree with @claird and @cnicodeme that a new project (PyPDF3) would be appropriate considering the number of backwards-incompatible changes taking place.

When will this start?

mstamy2 · 2018-04-14T22:37:15Z

@xilopaint This week I'm coordinating with others who have interest in the project. Should also have a PyPDF3 repository created this week and new collaborators added.

Sorry (again) for slow progress, but we're definitely looking to move forward this week. Also developing a road map of the most important issues/features that need to be addressed.

ryanchesler · 2018-04-17T17:02:11Z

I am looking forward to any progress on this project. Thank you for trying to revive it. I will look to contribute if I can. Highest on my list right now is being able to handle layers within PDFs. Right now, if I split a 43MB, 147-page PDF, each resulting page is roughly 42.9MB. I believe this is because it grabs the image for the single page but carries the layer info from all of the other pages as well. I will have to study up on the new standards and poke around with the files I have to see if my hypothesis is true.

xilopaint · 2018-04-17T17:12:23Z

@ryanchesler could you reach out to me if you make some progress on this issue? I have noticed a lot of weirdness in file sizes when trying to split some PDF files, and a patch for these issues is at the top of my priorities.

ryanchesler · 2018-04-17T17:28:55Z

@xilopaint I just poked around and confirmed my belief. I don't know a ton about the PDF standards, but I can see that annotation and layers from pages 2+ pop up when I ctrl+f on the page 1 file. A new system will have to address all of the additional data that has been added in the more recent PDF versions. I think a lot of the new standards are going to be a big headache.

mstamy2 · 2018-04-21T00:43:57Z

I've taken the initial steps for a PyPDF3 project and a roadmap (work in progress).

I won't be able to do much else for a couple weeks, but by then I will have finished my degree and have much more time on my hands! I'll also be reaching out to and adding those who have expressed interest in being made collaborators.

zdenop · 2018-04-21T08:23:04Z

You can create an organization at github.com and then create teams (with different rights), so the project will not depend on you and your time, while you still keep control over it. We were in a similar situation in the tesseract project.

14vv1A0516 · 2019-09-08T12:42:27Z

Sir, may I know how you are encrypting the PDF with the user or owner password we enter when encrypting? Please discuss the mechanism. I have tried to understand the source code. So far I have only worked out that you are using the RC4 algorithm to encrypt the PDF.

I think you are padding our password with some hardcoded string. Am I right?
Please discuss.
Thank you, Sir.
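
For context on the padding question: the PDF specification's standard security handler does pad or truncate passwords to 32 bytes with a fixed padding string before deriving the key. The following is a minimal sketch of that padding step and of plain RC4, written from the specification and the textbook algorithm, not copied from PyPDF2's source. The real key derivation also hashes in further document fields, which this sketch leaves out; `pad_password` and `rc4` are illustrative names only.

```python
# Fixed 32-byte padding string defined for the standard security handler
# in the PDF specification.
PDF_PASSWORD_PADDING = bytes.fromhex(
    "28bf4e5e4e758a4164004e56fffa0108"
    "2e2e00b6d0683e802f0ca9fe6453697a"
)


def pad_password(password: bytes) -> bytes:
    """Truncate or pad a password to exactly 32 bytes."""
    return (password + PDF_PASSWORD_PADDING)[:32]


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4: key scheduling followed by keystream XOR."""
    if not key:
        raise ValueError("key must not be empty")
    # Key-scheduling algorithm: permute the 256-entry state with the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    # Pseudo-random generation: XOR each input byte with a keystream byte.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)
```

RC4 is symmetric, so calling `rc4` twice with the same key returns the original bytes.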

MartinThoma · 2022-04-06T20:55:33Z

The last comment here was a while ago, and I'm now maintaining PyPDF2. Let's close this.

mstamy2 added the Meta label Dec 22, 2017

mstamy2 mentioned this issue Dec 22, 2017

Is this project still maintained? #373

Closed

soudegesu mentioned this issue Dec 2, 2018

python pdf soudegesu/blog#192

Closed

RussellLuo mentioned this issue Apr 11, 2019

new PDF file has proper bookmarks but blank content RussellLuo/pdfbookmarker#6

Closed

eamanu mentioned this issue Aug 8, 2019

Is this project active? #510

Closed

MartinThoma closed this as completed Apr 6, 2022
