Rebooting PyPDF2 Maintenance #385

Closed
mstamy2 opened this issue Dec 22, 2017 · 21 comments

@mstamy2
Collaborator

mstamy2 commented Dec 22, 2017

As I have had numerous other occupations keeping me busy, it has been some time since PyPDF2 has been actively maintained. That's absolutely unfair to those of you who have taken the time to submit pull requests and issues.

Meanwhile, the popularity of this library has continued to grow quite a lot. I would like to revitalize maintenance, but I'll need some help.

Would anyone be interested in being made a collaborator? We would prefer frequent contributors who are familiar with the source and have interest in helping with maintenance.

Again, I apologize for the lapse in maintenance. This is truly a really useful library, and it has potential to do a lot more as well!

@SalilVishnuKapur

@mstamy2 Yes, I would like to take on any kind of responsibility for this project. We should probably start by forming a small team of collaborators and begin with the small pending tasks. Eventually, as the piled-up requests get worked through, we can add better functionality to PyPDF2, such as a better way to handle interactive PDF scraping, especially when there is a fillable box with values stored as annotated data.

@jcampbell05

We use this quite a lot in my company so would be happy to donate time to maintain it as it's critical for us.

We would love to help with optimizing the library, as we are currently having issues with RAM consumption and speed.

@cnicodeme

I agree that it should be restarted. I came upon this library while searching for one in Python. There are currently two top contenders fighting for the #1 place: PyPDF2 and pdfrw (I consider PyPDF2 to be an evolution of PyPDF, so PyPDF is out of the equation).

The sad thing is, there is NO library in Python that does up-to-date work with PDF. All the libraries I've found work with PDF v1.3, whereas the current version is 2.0.

So there is a lot of work, but a lot of great reward too. You can work to provide a free and a commercial version, for instance, or offer paid support (I saw that on another library: http://pyfpdf.readthedocs.io/en/latest/#support ).

There are ways to make money out of an open source project, and more importantly, there is room for a good Python library, as none are available today.

So yeah, kick off PyPDF :) I'll see if I can do anything to help!

@claird
Contributor

claird commented Mar 2, 2018 via email

@jcampbell05

I've recently switched away from this library, as I found PDFium is the only open source library capable of the speed we need. However, I would be happy to somehow port some of that work back.

I think it would be great if PyPDF could somehow be built on top of PDFium, as it handles a lot of the parsing already.

@cnicodeme

Hi @claird, yes, sorry about the confusion regarding "kick off PyPDF". For me, the original PyPDF is done, so when I mention PyPDF, I mean PyPDF2. What I meant was "Make PyPDF2 the best Python library for managing PDF" :)

As an indicator of how much potential there is, I recently stumbled upon this question on SO, asking how to change the metadata in PyPDF2.
To my great surprise, there was an answer for pdfrw, and none for PyPDF2.
I added mine, and in less than a day, I already had two upvotes.

It's clear there is a lot of interest in PyPDF2, so if your current family/work situation allows it, I think it would be awesome to work on it.

I don't have great knowledge of how PDF works, and honestly, I don't want to dig into it, as it appears to be pretty awful. But I can help with writing the docs, tutorials, etc.

We can work toward increasing the name recognition of PyPDF2, mostly via https://stackoverflow.com/questions/tagged/pypdf2, and work on the website to show more examples (I think that's the key feature).

Finally, I'm happy to help because I plan to use PyPDF2 as my core tool for manipulating PDFs, but this means implementing a few missing features. I can try to help with some, but I don't know if I'll be able to tolerate the PDF format ;)

Among the features I think would be great to have are:

  • Encryption with stronger algorithms (AES 256)
  • Support for permission rules like no printing, no modifying, no copying, etc. QPDF provides a wide range of options when encrypting a document
  • Support for versions higher than the current 1.3 (PDF is currently at version 2.0!). I don't know what the risk is of "just" changing the header version in the document from 1.3 to 1.5, for instance.
  • Greater watermark support. This could be open to discussion, but if we can have a class dedicated to adding an image to a PDF page, with rotation, opacity, etc., that would be perfect. A lot of questions on SO are related to watermarks. The PyFPDF library does it pretty well; it would be possible to work on a fork and integrate it into PyPDF2.
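
On the header-version question in the list above: the version is declared in the first line of the file as `%PDF-x.y`, so "just" changing it is a byte-level edit. Below is a minimal sketch using only the standard library; the function names are made up for illustration and are not part of PyPDF2. It keeps the header the same length so the byte offsets stored in the cross-reference table stay valid. Note that this only changes what the file declares: it adds no new features, and since PDF 1.4 a `/Version` entry in the document catalog can override the header, which this sketch does not touch.

```python
import re
from typing import Tuple

# Header line required at the start of a PDF file, e.g. b"%PDF-1.3".
_HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def read_header_version(data: bytes) -> Tuple[int, int]:
    """Return (major, minor) from the %PDF-x.y header at the start of data."""
    match = _HEADER_RE.match(data)
    if match is None:
        raise ValueError("no %PDF-x.y header found at start of data")
    return int(match.group(1)), int(match.group(2))


def set_header_version(data: bytes, major: int, minor: int) -> bytes:
    """Return a copy of data with only the declared header version changed.

    Single-digit versions keep the header at 8 bytes, so no other byte
    in the file moves.
    """
    if not (0 <= major <= 9 and 0 <= minor <= 9):
        raise ValueError("major and minor must be single digits")
    read_header_version(data)  # raises if the header is missing
    new_header = b"%PDF-" + f"{major}.{minor}".encode("ascii")
    return new_header + data[len(new_header):]
```

For example, `set_header_version(open("in.pdf", "rb").read(), 1, 5)` (with a hypothetical `in.pdf`) returns bytes of identical length whose first line declares version 1.5. Whether every reader accepts such a file is not something this sketch can guarantee.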

Less important, but still interesting to implement, I would say:

  • Rewrite the code to be PEP8 compliant (this is not the case right now)
  • Take this opportunity to rewrite the code for Python 3 (method signatures with expected types, for instance)

The rewrite could also help reduce the amount of code required for some tasks. Take a look at my PyPDF2 example on how to write metadata, compared to pdfrw. PyPDF2 requires twice the amount of code. This is open to debate since, I think, the issue is that we need a reader and a writer, whereas pdfrw does everything in one.

The last two points would raise the question of, do we keep PyPDF2 as the name, or go to PyPDF3?

Sorry for this long post, but I wanted to share my whole opinion on the matter. I can help with some parts (importing FPDF, rewriting the code for PEP8 compliance, making it Python 3 only (not required, just a thought)). I can also help with the documentation, the examples, and the website (not a designer though).

Maybe one interesting thing would be to set up a roadmap of what is planned, in what order, etc. In case you are not aware of it, GitHub allows you to write "todo list" style issues. You can open one for the roadmap, and check off what is done, what is left to do, etc.

@sekrause
Contributor

sekrause commented Mar 3, 2018

I think the highest priority should be to go through the open pull requests, merge some and prepare a quick release before thinking about long-term development.

Over a year ago I reported issue #329 which is a serious denial-of-service security issue because you can easily force PyPDF2 into an infinite loop. So far my really simple pull request #331 has been waiting for a merge for over a year, but nothing has happened so far.

@claird
Contributor

claird commented Mar 3, 2018 via email

@LightningMan711

I have to do some seriously complex PDF manipulation for my current company. I love what this can do enough to basically solve my own question. I don't really know enough to help with issues (my solution was 80% someone else's code and 20% happy accident), but I would love to encourage maintenance or development of PyPDF3 as claird offered.

@mstamy2
Collaborator Author

mstamy2 commented Mar 12, 2018

Thanks to everyone for their interest and suggestions!

I agree that the primary need is to take care of the existing bugs/performance issues, and then fully implementing some of the features that are only weakly supported such as text extraction, reading/writing to forms, encryption, watermarking, etc. before moving on to new features.

We would definitely need at least a new major version (i.e. PyPDF2 2.x.x), but I agree with @claird and @cnicodeme that a new project (PyPDF3) would be appropriate considering the number of backwards-incompatible changes taking place.

@Kwilliams15

I've just recently found a very large use case for this module at work and it's been fantastic. If this module weren't around, I don't know how long it would have taken to solve some of the issues we are solving now. I'd be happy to help however I can to resurrect either PyPDF2 or a potential PyPDF3.

@mwhit74
Contributor

mwhit74 commented Mar 30, 2018

A reboot would be amazing!

I have implemented this library on a large project in my office as it was the only library I could find to do what I wanted to do. I was frustrated to find that some parts of it work amazingly well and others like #355 go unanswered for months and really hinder what can be done with the library.

I want to continue to use this library for other similar projects as they come along.

@xilopaint
Contributor

> Thanks to everyone for their interest and suggestions!
>
> I agree that the primary need is to take care of the existing bugs/performance issues, and then fully implementing some of the features that are only weakly supported such as text extraction, reading/writing to forms, encryption, watermarking, etc. before moving on to new features.
>
> We would definitely need at least a new major version (i.e. PyPDF2 2.x.x), but I agree with @claird and @cnicodeme that a new project (PyPDF3) would be appropriate considering the number of backwards-incompatible changes taking place.

When will this start?

@mstamy2
Collaborator Author

mstamy2 commented Apr 14, 2018

@xilopaint This week I'm coordinating with others who have interest in the project. Should also have a PyPDF3 repository created this week and new collaborators added.

Sorry (again) for slow progress, but we're definitely looking to move forward this week. Also developing a road map of the most important issues/features that need to be addressed.

@ryanchesler

I am looking forward to any progress on this project. Thank you for trying to revive it. I will look to contribute if I can. Highest on my list right now is being able to handle layers within PDFs. Right now, if I split a 43MB, 147-page PDF, each resulting page is roughly 42.9MB. I believe this is because it grabs the image for the single page but carries the layer info from all of the other pages as well. I will have to study up on the new standards and poke around with the files I have to see if my hypothesis is true.

@xilopaint
Contributor

xilopaint commented Apr 17, 2018

@ryanchesler could you reach out to me if you make some progress on this issue? I have noticed a lot of weirdness in file sizes when trying to split some PDF files, and a patch for these issues is at the top of my priorities.

@ryanchesler

@xilopaint I just poked around and confirmed my belief. I don't know a ton about the PDF standards, but I can see that annotation and layers from pages 2+ pop up when I ctrl+f on the page 1 file. A new system will have to address all of the additional data that has been added in the more recent PDF versions. I think a lot of the new standards are going to be a big headache.

@mstamy2
Collaborator Author

mstamy2 commented Apr 21, 2018

I've taken the initial steps for a PyPDF3 project and a roadmap (work in progress).

I won't be able to do much else for a couple weeks, but by then I will have finished my degree and have much more time on my hands! I'll also be reaching out to and adding those who have expressed interest in being made collaborators.

@zdenop

zdenop commented Apr 21, 2018

You can create an organization on github.com and then create teams (with different rights), so the project will not depend on you and your time. At the same time, you will still have control over it. We were in a similar situation in the tesseract project.

@14vv1A0516

Sir, may I know how you are encrypting the PDF with the user or owner password we enter as the encryption password? Please discuss the mechanism. I have tried to understand the source code. I only got to know that you are using the RC4 algorithm to encrypt the PDF.

I think you are padding our password with some hardcoded string. Am I right?
Please discuss.
Thank you Sir.
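
As background to the question above: the PDF specification's standard security handler does define a fixed 32-byte padding string, and passwords are truncated or padded with it to exactly 32 bytes before key derivation. The sketch below assumes PyPDF2 follows the specification here; the names are illustrative and not taken from PyPDF2's source. It shows only the padding step and a plain RC4 cipher. The real key derivation additionally hashes the padded password with MD5 together with other fields from the file (owner entry, permissions, file ID), which is omitted.

```python
# Fixed 32-byte padding string from the PDF specification
# (standard security handler, password-based key computation).
PDF_PASSWORD_PADDING = bytes.fromhex(
    "28bf4e5e4e758a4164004e56fffa0108"
    "2e2e00b6d0683e802f0ca9fe6453697a"
)


def pad_password(password: bytes) -> bytes:
    """Truncate or pad a password to exactly 32 bytes."""
    return (password + PDF_PASSWORD_PADDING)[:32]


def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with RC4 (the operation is symmetric)."""
    # Key-scheduling algorithm: permute the state using the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    # Pseudo-random generation: XOR each data byte with a keystream byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)
```

`pad_password(b"secret")` keeps the six password bytes and fills the remaining 26 from the start of the padding string; an empty password yields the padding string itself. Applying `rc4` twice with the same key returns the original data, which is why the same routine serves for both encryption and decryption.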

@MartinThoma
Member

The last comment here was a while ago, and I'm now maintaining PyPDF2. Let's close this.
