# How to Dynamically Clean Up Corrupt PDFs

This recipe shows how PyMuPDF can be combined with another Python PDF library; the excellent pure Python package pdfrw serves as the example here.

If a clean, uncorrupted, decompressed PDF is needed, PyMuPDF can be invoked on the fly to recover from many problems, like so:

        import sys
        from io import BytesIO
        from pdfrw import PdfReader
        import pymupdf

        #---------------------------------------
        # 'Tolerant' PDF reader
        #---------------------------------------
        def reader(fname, password = None):
            with open(fname, "rb") as f:  # read the PDF into memory and
                idata = f.read()  # close the file right away
            ibuffer = BytesIO(idata)  # convert to stream
            if password is None:
                try:
                    return PdfReader(ibuffer)  # if this works: fine!
                except Exception:  # pdfrw could not parse it: try a repair below
                    pass

            # either we need a password or it is a problem-PDF
            # create a repaired / decompressed / decrypted version
            doc = pymupdf.open("pdf", ibuffer)
            if password is not None:  # decrypt if password provided
                rc = doc.authenticate(password)
                if not rc > 0:
                    raise ValueError("wrong password")
            c = doc.tobytes(garbage=3, deflate=True)
            del doc  # close & delete doc
            return PdfReader(BytesIO(c))  # let pdfrw retry
        #---------------------------------------
        # Main program
        #---------------------------------------
        pdf = reader("pymupdf.pdf", password = None) # include a password if necessary
        print(pdf.Info)
        # do further processing
        
A similar result can be achieved with the command line utility pdftk (available for Windows only, but reported to also run under Wine). However, it must be invoked as a separate process via subprocess.Popen, using stdin and stdout to exchange the data.
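
The following is a minimal sketch of that separate-process approach, not a tested pdftk recipe. It uses `subprocess.run`, which wraps `subprocess.Popen`, to send bytes to a command's stdin and collect its stdout. The helper names `pipe_through` and `pdftk_uncompress` are hypothetical, and the pdftk argument list assumes that pdftk accepts `-` for "read from stdin" and "write to stdout" and that its `uncompress` option produces a file the other library can read.

```python
import subprocess


def pipe_through(cmd, data, timeout=60):
    """Send 'data' (bytes) to the stdin of 'cmd' and return its stdout bytes.

    Raises RuntimeError if the command exits with a non-zero return code.
    """
    proc = subprocess.run(
        cmd,
        input=data,  # written to the child's stdin
        stdout=subprocess.PIPE,  # captured as bytes
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if proc.returncode != 0:
        message = proc.stderr.decode(errors="replace")
        raise RuntimeError("%s failed: %s" % (cmd[0], message))
    return proc.stdout


def pdftk_uncompress(data, executable="pdftk"):
    """Pipe PDF bytes through pdftk (hypothetical helper).

    Assumption: pdftk treats "-" as stdin for input and as stdout for output.
    """
    return pipe_through([executable, "-", "output", "-", "uncompress"], data)
```

With pdftk installed and on the search path, the bytes returned by `pdftk_uncompress(idata)` could be wrapped in `BytesIO` and handed to `PdfReader`, in the same way as the PyMuPDF output above.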

# How to Convert Any Document to PDF

Here is a script that converts any PyMuPDF supported document to a PDF. These include XPS, EPUB, FB2, CBZ and image formats, including multi-page TIFF images.

It preserves any metadata, table of contents and links contained in the source document:


        """
        Demo script: Convert input file to a PDF
        -----------------------------------------
        Intended for multi-page input files like XPS, EPUB etc.

        Features:
        ---------
        Recovery of table of contents and links of input file.
        While this works well for bookmarks (outlines, table of contents),
        links will only work if they are not of type "LINK_NAMED".
        This link type is skipped by the script.

        For XPS and EPUB input, internal links however **are** of type "LINK_NAMED".
        Base library MuPDF does not resolve them to page numbers.

        So anyone expert enough to know the internal structure of these
        document types can further interpret and resolve these link types.

        Dependencies
        --------------
        PyMuPDF v1.14.0+
        """
        import sys
        import pymupdf
        if not (list(map(int, pymupdf.VersionBind.split("."))) >= [1,14,0]):
            raise SystemExit("need PyMuPDF v1.14.0+")
        fn = sys.argv[1]

        print("Converting '%s' to '%s.pdf'" % (fn, fn))

        doc = pymupdf.open(fn)

        b = doc.convert_to_pdf()  # convert to pdf
        pdf = pymupdf.open("pdf", b)  # open as pdf

        toc = doc.get_toc()  # table of contents of input
        pdf.set_toc(toc)  # simply set it for output
        meta = doc.metadata  # read and set metadata
        if not meta["producer"]:
            meta["producer"] = "PyMuPDF v" + pymupdf.VersionBind

        if not meta["creator"]:
            meta["creator"] = "PyMuPDF PDF converter"
        meta["modDate"] = pymupdf.get_pdf_now()
        meta["creationDate"] = meta["modDate"]
        pdf.set_metadata(meta)

        # now process the links
        link_cnti = 0
        link_skip = 0
        for pinput in doc:  # iterate through input pages
            links = pinput.get_links()  # get list of links
            link_cnti += len(links)  # count how many
            pout = pdf[pinput.number]  # read corresp. output page
            for l in links:  # iterate through the links
                if l["kind"] == pymupdf.LINK_NAMED:  # we do not handle named links
                    print("named link page", pinput.number, l)
                    link_skip += 1  # count them
                    continue
                pout.insert_link(l)  # simply output the others

        # save the conversion result
        pdf.save(fn + ".pdf", garbage=4, deflate=True)
        # say how many named links we skipped
        if link_cnti > 0:
            print("Skipped %i named links of a total of %i in input." % (link_skip, link_cnti))
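
The link handling in the script can also be factored into a small pure function, which makes the "skip named links" rule easy to check without opening a document. This is only a sketch: `split_links` is a hypothetical helper, and the link dictionaries in the usage note are made-up stand-ins for what `get_links()` returns. In the script, `named_kind` would be `pymupdf.LINK_NAMED`.

```python
def split_links(links, named_kind):
    """Partition link dictionaries into (copyable, named).

    'links' is a list of dictionaries with a "kind" key.
    'named_kind' is the value that marks a named link.
    """
    copyable = []
    named = []
    for link in links:
        if link["kind"] == named_kind:
            named.append(link)  # cannot be resolved to a page number
        else:
            copyable.append(link)  # can be passed on to the output page
    return copyable, named
```

For example, with the placeholder value `4` standing in for the named-link kind, `split_links([{"kind": 2}, {"kind": 4}], named_kind=4)` puts the first dictionary in the copyable list and the second in the named list.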


