Find a way to work with PDF data without having to store it locally on the machine #8

Closed
will-afs opened this issue Dec 5, 2021 · 2 comments

@will-afs
Owner

will-afs commented Dec 5, 2021

No description provided.

@will-afs
Owner Author

will-afs commented Dec 7, 2021

from io import BytesIO
from typing import BinaryIO
import urllib.request

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser


def extract_pdf_metadata(pdf_file: BinaryIO) -> dict:
    """Returns the metadata of a PDF

    Parameters:
    pdf_file (BinaryIO) : a seekable binary file-like object holding the PDF
        (for example a BytesIO), so nothing has to be written to disk

    Returns:
    dict: metadata of the PDF, with values decoded to strings, for example:
        {
            "Author": "AURORE",
            "CreationDate": "D:20200325185329+01'00'",
            "Creator": "Microsoft Office Word 2007",
            "ModDate": "D:20210311153835+01'00'",
            "Producer": "Microsoft Office Word 2007",
            "Title": "DOSSIER COUP DE POUCE 2020"
        }
    """
    pdf_parser = PDFParser(pdf_file)
    doc = PDFDocument(pdf_parser)
    metadata = {}
    # doc.info is a list of info dictionaries and may be empty
    if doc.info:
        for key, value in doc.info[0].items():
            # Values are usually bytestrings: decode those, stringify anything else
            if isinstance(value, bytes):
                metadata[key] = value.decode("utf-8", errors="ignore")
            else:
                metadata[key] = str(value)
    return metadata
        

if __name__ == '__main__':
    # Download the PDF into memory instead of saving it to disk
    with urllib.request.urlopen('http://arxiv.org/pdf/cs/9308101v1') as response:
        pdf_bytes = response.read()
    # BytesIO(pdf_bytes) is positioned at offset 0, ready to be parsed
    metadata = extract_pdf_metadata(BytesIO(pdf_bytes))
    print(metadata)
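
The download step can also be isolated so it is testable without network access. What follows is only a minimal sketch using the standard library: the name `fetch_pdf_to_memory` and its `opener` parameter are hypothetical and not part of the commit, and the `%PDF-` check assumes a well-formed file whose header sits at the very first byte (some real-world PDFs have leading bytes and would be rejected here).

```python
from io import BytesIO
import urllib.request


def fetch_pdf_to_memory(url, opener=urllib.request.urlopen):
    """Download a PDF and return it as an in-memory buffer at offset 0.

    Hypothetical helper: `opener` is any callable that takes a URL and
    returns a context manager exposing read(), so a fake can be injected
    in tests instead of hitting the network.
    """
    with opener(url) as response:
        data = response.read()
    # Assumption: a well-formed PDF begins with the "%PDF-" header.
    if not data.startswith(b"%PDF-"):
        raise ValueError("response does not look like a PDF")
    # Building the buffer from the bytes leaves its position at 0,
    # unlike an empty BytesIO that has just been written to.
    return BytesIO(data)
```

The returned buffer can be passed directly to `extract_pdf_metadata`, so the PDF never touches the local disk.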

@will-afs
Owner Author

Closed with commit #cebc9a24d06933fd9b969901ba1fd8b38ad51861

@will-afs will-afs self-assigned this Dec 23, 2021