Change Log

Changes in Version 1.21.0

  • This release uses MuPDF-1.21.0.
  • New feature: Stories.
  • Added wheels for Python-3.11.
  • Bug fixes:
    • Fixed #1701: Broken custom image insertion.
    • Fixed #1854: Document.delete_pages() declines keyword arguments.
    • Fixed #1868: Access Violation Error at page.apply_redactions().
    • Fixed #1909: Adding text with fontname="Helvetica" can silently fail.
    • Fixed #1913: `draw_rect()`: does not respect width if color is not specified.
    • Fixed #1917: `subset_fonts()`: make it possible to silence the stdout.
    • Fixed #1936: Rectangle detection can be incorrect producing wrong output.
    • Fixed #1945: Segmentation fault when saving with clean=True.
    • Fixed #1965: pdfocr_save() Hard Crash.
    • Fixed #1971: Segmentation fault when using get_drawings().
    • Fixed #1946: block_no and block_type switched in get_text() docs.
    • Fixed #2013: AttributeError: 'Widget' object has no attribute '_annot' in delete widget.
  • Misc changes to core code:
    • Fixed various compiler warnings and a sequence-point bug.
    • Added support for Memento builds.
    • Fixed leaks detected by Memento in test suite.
    • Fixed handling of exceptions in set_name() and set_rect().
    • Allow build with latest MuPDF, for regular testing of PyMuPDF master.
    • Cope with new MuPDF exceptions when setting rect for some Annot types.
    • Reduced cosmetic differences between MuPDF's config.h and PyMuPDF's _config.h.
    • Cope with various changes to MuPDF API.
  • Other:
    • Fixed various broken links and typos in docs.
    • Mention install of swig-python on MacOS for #875.
    • Added (untested) wheels for macos-arm64.

Changes in Version 1.20.2

  • This release uses MuPDF-1.20.3.
  • Fixed #1787. Fix linking issues on Unix systems.
  • Fixed #1824. SegFault when applying redactions overlapping a transparent image. (Fixed in MuPDF-1.20.3.)
  • Improvements to documentation:
    • Improved information about building from source in docs/installation.rst.
    • Clarified memory allocation setting JM_MEMORY in docs/tools.rst.
    • Fixed link to PDF Reference manual in docs/app3.rst.
    • Fixed building of html documentation on OpenBSD.
    • Moved old docs/faq.rst into separate docs/recipes-* files.
  • Removed some unused files and directories:
    • installation/
    • docs/wheelnames.txt

Changes in Version 1.20.1

  • Fixed #1724. Fix for building on FreeBSD.
  • Fixed #1771. linkDest() had a broken call to re.match(), introduced in 1.20.0.
  • Fixed #1751. get_drawings() and get_cdrawings() previously always returned with closePath=False.
  • Fixed #1645. Default FreeText annotation text color is now black.
  • Improvements to sphinx-generated documentation:
    • Use readthedocs theme with enhancements.
    • Renamed the .txt files to have .rst suffixes.

Changes in Version 1.20.0

This release uses MuPDF-1.20.0, released 2022-06-15.

  • Cope with new MuPDF link uri format, changed from #<int>,<int>,<int> to #page=<int>&zoom=<float>,<float>,<float>.
  • In tests/test_insertpdf.py, use new reference output joined-1.20.pdf. We also check that new output values are approximately the same as the old ones.
  • Fixed #1738. Leak of pdf_graft_map. Also fixed a SEGV issue that this seemed to expose, caused by incorrect freeing of underlying fz_document.
  • Fixed #1733. Fixed ownership of Annotation.get_pixmap().
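
The link URI format change listed above can be illustrated with a small parser. This is a minimal sketch using only the standard library; the helper name parse_link_uri and the regular expressions are assumptions made for illustration and are not part of PyMuPDF. The sketch does not interpret the numbers beyond what the two format strings name.

```python
import re

# Old style: "#<int>,<int>,<int>"
OLD_LINK = re.compile(r"^#(-?\d+),(-?\d+),(-?\d+)$")
# New style: "#page=<int>&zoom=<float>,<float>,<float>"
_FLOAT = r"(-?\d+(?:\.\d+)?)"
NEW_LINK = re.compile(rf"^#page=(\d+)&zoom={_FLOAT},{_FLOAT},{_FLOAT}$")


def parse_link_uri(uri):
    """Return the numbers found in a link URI, or None if no format matches.

    Hypothetical helper: it only recognizes the two textual formats
    named in the change log entry.
    """
    m = NEW_LINK.match(uri)
    if m:
        page, *zoom = m.groups()
        return {
            "format": "new",
            "page": int(page),
            "zoom": tuple(float(z) for z in zoom),
        }
    m = OLD_LINK.match(uri)
    if m:
        return {"format": "old", "values": tuple(int(v) for v in m.groups())}
    return None


print(parse_link_uri("#page=3&zoom=100,0,0"))
print(parse_link_uri("#3,0,0"))
```

Code that stored or compared raw link URIs from earlier versions may need this kind of dual handling.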

Changes to build/release process:

  • If pip builds from source because an appropriate wheel is not available, we no longer require MuPDF to be pre-installed. Instead the required MuPDF source is embedded in the sdist and automatically built into PyMuPDF.
  • Various changes to setup.py to download the required MuPDF release as required. See comments at start of setup.py for details.
  • Added .github/workflows/build_wheels.yml to control building of wheels on Github.

Changes in Version 1.19.6

  • Fixed #1620. The TextPage created by Page.get_textpage will now be freed correctly (removed memory leak).
  • Fixed #1601. Document open errors should now be more concise and easier to interpret. In the course of this, two PyMuPDF-specific Python exceptions have been added:

    • EmptyFileError -- raised when trying to create a Document (fitz.open()) from an empty file or zero-length memory.
    • FileDataError -- raised when MuPDF encounters irrecoverable document structure issues.
  • Added Page.load_widget given a PDF field's xref.
  • Added dictionary pdfcolor, which provides about 500 colors defined as PDF color values, with the lower-case color name as key.
  • Added algebra functionality to the Quad class. These objects can now also be added and subtracted among themselves, and be multiplied by numbers and matrices.
  • Added new constants defining the default text extraction flags for more comfortable handling. Their naming convention is like TEXTFLAGS_WORDS for page.get_text("words"). See text_extraction_flags.
  • Changed Page.annots and Page.widgets to detect and prevent reloading the page (illegally) inside the iterator loops via Document.reload_page. Doing this brings down the interpreter. Documented clean ways to do annotation and widget mass updates within properly designed loops.
  • Changed several internal utility functions to become standalone ("SWIG inline") as opposed to be part of the Tools class. This, among other things, increases the performance of geometry object creation.
  • Changed Document.update_stream to always accept stream updates, whether or not the dictionary object behind the xref already is a stream. Thus the former parameter new is now ignored and will be removed in v1.20.0.
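
To illustrate what "PDF color values with the lower case color name as key" means for the pdfcolor dictionary above: PDF color components are floats between 0 and 1. The following is a minimal sketch of building such a mapping from 8-bit RGB values. The table RGB_8BIT and the helper to_pdf_color are hypothetical and do not reproduce the library's data or code.

```python
# Hypothetical source table: color names with 8-bit RGB components.
RGB_8BIT = {
    "Red": (255, 0, 0),
    "White": (255, 255, 255),
    "Navy": (0, 0, 128),
}


def to_pdf_color(rgb):
    """Scale 8-bit components (0..255) to PDF color floats (0..1)."""
    return tuple(component / 255 for component in rgb)


# Keys are lower-cased, mirroring the convention described in the entry.
pdf_colors = {name.lower(): to_pdf_color(rgb) for name, rgb in RGB_8BIT.items()}

print(pdf_colors["red"])
```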

Changes in Version 1.19.5

  • Fixed #1518. A limited "fix": in some cases, rectangles and quadruples were not correctly encoded to support re-drawing by Shape.
  • Fixed #1521. This had the same root cause as issue #1510.
  • Fixed #1513. Some Optional Content functions did not support non-ASCII characters.
  • Fixed #1510. Support more soft-mask image subtypes.
  • Fixed #1507. Immunize against items in the outlines chain that are "null" objects.
  • Fixed re-opened #1417 ("too many open files"). This was due to insufficient calls to MuPDF's fz_drop_document(). This also fixes #1550.
  • Fixed several undocumented issues in relation to incorrectly setting the text span origin point_like.
  • Fixed undocumented error computing the character bbox in method Page.get_texttrace when text is flipped (as opposed to just rotated).
  • Added items to the dictionary returned by image_properties: orientation and transform report the natural image orientation (EXIF data).
  • Added method Document.xref_copy. It will make a given target PDF object an exact copy of a source object.

Changes in Version 1.19.4

  • Fixed #1505. Immunize against circular outline items.
  • Fixed #1484. Correct CropBox coordinates are now returned in all situations.
  • Fixed #1479.
  • Fixed #1474. TextPage objects are now properly deleted again.
  • Added Page methods and attributes for PDF /ArtBox, /BleedBox, /TrimBox.
  • Added global attribute TESSDATA_PREFIX for easy checking of OCR support.
  • Changed Document.xref_set_key such that dictionary keys will physically be removed if set to value "null".
  • Changed Document.extract_font to optionally return a dictionary (instead of a tuple).

Changes in Version 1.19.3

This patch version implements minor improvements for Pixmap and also some important fixes.

  • Fixed #1351. Reverted code that introduced the memory growth in v1.18.15.
  • Fixed #1417. Developed circumvention for growth of open file handles using Document.insert_pdf.
  • Fixed #1418. Developed circumvention for memory growth using Document.insert_pdf.
  • Fixed #1430. Developed circumvention for mass pixmap generations of document pages.
  • Fixed #1433. Solves a bbox error for some Type 3 font in PyMuPDF text processing.
  • Added Pixmap.color_topusage to determine the share of the most frequently used color. Solves #1397.
  • Added Pixmap.warp which makes a new pixmap from a given arbitrary convex quad inside the pixmap.
  • Added Annot.irt_xref and Annot.set_irt_xref to inquire or set the /IRT ("In Response To") property of an annotation. Implements #1450.
  • Added Rect.torect and IRect.torect which compute a matrix that transforms to a given other rectangle.
  • Changed Pixmap.color_count to also return the count of each color.
  • Changed Page.get_texttrace to also return correct span and character bboxes if span["dir"] != (1, 0).
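
The idea behind Rect.torect and IRect.torect, a matrix that maps one rectangle onto another, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: rectangles are plain (x0, y0, x1, y1) tuples, the matrix is a six-tuple (a, b, c, d, e, f), and the helper names torect_matrix and apply are hypothetical.

```python
def torect_matrix(src, dst):
    """Return (a, b, c, d, e, f) mapping rectangle src onto rectangle dst.

    Both rectangles are (x0, y0, x1, y1) tuples; src must have
    non-zero width and height.
    """
    sx = (dst[2] - dst[0]) / (src[2] - src[0])  # horizontal scale
    sy = (dst[3] - dst[1]) / (src[3] - src[1])  # vertical scale
    # Scale first, then shift so the top-left corners coincide.
    return (sx, 0.0, 0.0, sy, dst[0] - src[0] * sx, dst[1] - src[1] * sy)


def apply(matrix, point):
    """Transform a point (x, y) with the six-element matrix."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


src = (0, 0, 100, 50)
dst = (10, 20, 210, 120)
m = torect_matrix(src, dst)
print(apply(m, (src[0], src[1])), apply(m, (src[2], src[3])))
```

Both corners of src land on the corresponding corners of dst, which is the property such a matrix needs to have.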

Changes in Version 1.19.2

This patch version implements minor improvements for Page.get_drawings and also some important fixes.

  • Fixed #1388. Fixed intermittent memory corruption when inserting or updating annotations.
  • Fixed #1375. Inconsistencies between line numbers as returned by the "words" and the "dict" options of Page.get_text have been corrected.
  • Fixed #1364. The check for being a "rawdict" span in recover_span_quad now works correctly.
  • Fixed #1342. Corrected the check for rectangle infiniteness in Page.show_pdf_page.
  • Changed Page.get_drawings, Page.get_cdrawings to return an indicator on the area orientation covered by a rectangle. This implements #1355. Also, the recognition rate for rectangles and quads has been significantly improved.
  • Changed all text search and extraction methods to set the new flags option TEXT_MEDIABOX_CLIP to ON by default. That bit causes the automatic suppression of all characters that are completely outside a page's mediabox (in as far as that notion is supported for a document type). This eliminates the need for using clip=page.rect or similar for omitting text outside the visible area.
  • Added parameter "dpi" to Page.get_pixmap and Annot.get_pixmap. When given, parameter "matrix" is ignored, and a Pixmap with the desired dots per inch is created.
  • Added attributes Pixmap.is_monochrome and Pixmap.is_unicolor allowing fast checks of pixmap properties. Addresses #1397.
  • Added method Pixmap.color_count to determine the unique colors in the pixmap.
  • Added boolean parameter "compress" to PDF document method Document.update_stream. Addresses / enables solution for #1408.
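
The relationship between a "dpi" value and a zoom factor can be sketched with simple arithmetic, given that PDF page dimensions are measured in points of 1/72 inch. The helper names below are hypothetical, and the rounding used by the library for the final pixmap size may differ from this sketch.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are given in points (1/72 inch)


def zoom_for_dpi(dpi):
    """Zoom factor that renders one inch of page as `dpi` pixels."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at `dpi`."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# A US Letter page is 612 x 792 points.
print(zoom_for_dpi(144))
print(pixel_size(612, 792, 150))
```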

Changes in Version 1.19.1

This is the first patch version to support MuPDF v1.19.0. Apart from one bug fix, it includes important improvements for OCR support and the option to sort extracted text to the standard reading order "from top-left to bottom-right".

  • Fixed #1328. "words" text extraction again returns correct (x0, y0) coordinates.
  • Changed Page.get_textpage_ocr: it now supports parameter dpi to control OCR quality. It is also possible to choose whether the full page should be OCRed or only the images displayed by the page.
  • Changed Page.get_drawings and Page.get_cdrawings to automatically convert colors to RGB color tuples. Implements #1332. Similar change was applied to Page.get_texttrace.
  • Changed Page.get_text to support a parameter sort. If set to True the output is conveniently sorted.
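
Sorting into the reading order "from top-left to bottom-right" can be pictured as ordering items by their vertical and then their horizontal coordinate. The sketch below uses hypothetical word tuples of the form (x0, y0, x1, y1, text); the exact sort key used by the library is an assumption here.

```python
# Hypothetical extraction result: (x0, y0, x1, y1, text), in storage order.
words = [
    (200.0, 50.0, 240.0, 62.0, "world"),
    (72.0, 80.0, 100.0, 92.0, "again"),
    (72.0, 50.0, 110.0, 62.0, "Hello"),
]


def reading_order(items):
    """Sort by bottom coordinate first, then by left coordinate."""
    return sorted(items, key=lambda w: (w[3], w[0]))


print([w[4] for w in reading_order(words)])
```

Without sorting, text comes back in the order in which it is stored in the document, which need not match what a reader sees on the page.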

Changes in Version 1.19.0

This is the first version supporting MuPDF 1.19.*, published 2021-10-05. It introduces many new features compared to the previous version 1.18.*.

PyMuPDF has now picked up integrated Tesseract OCR support, which was already present in MuPDF v1.18.0.

  • Supported images can be OCRed via their Pixmap which results in a 1-page PDF with a text layer.
  • All supported document pages (i.e. not only PDFs), can be OCRed using specialized text extraction methods. The result is a mixture of standard and OCR text (depending on which part of the page was deemed to require OCRing) that can be searched and extracted without restrictions.
  • All this requires an independent installation of Tesseract. MuPDF actually (only) needs the location of Tesseract's "tessdata" folder, where its language support data are stored. This location must be available as environment variable TESSDATA_PREFIX.

A new MuPDF feature is journalling PDF updates, which is also supported by this PyMuPDF version. Changes may be logged, rolled back or replayed, making it possible to implement a whole new level of control over PDF document integrity -- similar to functions present in modern database systems.

A third feature (unrelated to the new MuPDF version) includes the ability to detect when page objects cover or hide each other. It is now e.g. possible to see that text is covered by a drawing or an image.

  • Changed terminology and meaning of important geometry concepts: Rectangles are now characterized as finite, valid or empty, while the definitions of these terms have also changed. Rectangles specifically are now thought of as being "open": not all corners and sides are considered part of the rectangle. Please do read the Rect section for details.
  • Added new parameter "no_new_id" to Document.save / Document.tobytes methods. Use it to suppress updating the second item of the document /ID which in PDF indicates that the original file has been updated. If the PDF has no /ID at all yet, then no new one will be created either.
  • Added a journalling facility for PDF updates. This allows logging changes, undoing or redoing them, or saving the journal for later use. Refer to Document.journal_enable and friends.
  • Added new Pixmap methods Pixmap.pdfocr_save and Pixmap.pdfocr_tobytes, which generate a 1-page PDF containing the pixmap as PNG image with OCR text layer.
  • Added Page.get_textpage_ocr which executes optical character recognition for the page, then extracts the results and stores them together with "normal" page content in a TextPage. Use or reuse this object in subsequent text extractions and text searches to avoid multiple efforts. The existing text search and text extraction methods have been extended to support a separately created textpage -- see next item.
  • Added a new parameter textpage to text extraction and text search methods. This allows reuse of a previously created TextPage and thus achieves significant runtime benefits -- which is especially important for the new OCR features. But "normal" text extractions can definitely also benefit.
  • Added Page.get_texttrace, a technical method delivering low-level text character properties. It was present before as a private method, but the author felt it now is mature enough to be officially available. It specifically includes a "sequence number" which indicates the page appearance build operation that painted the text.
  • Added Page.get_bboxlog which delivers the list of rectangles of page objects like text, images or drawings. Its significance lies in its sequence: rectangles intersecting areas with a lower index are covering or hiding them.
  • Changed methods Page.get_drawings and Page.get_cdrawings to include a "sequence number" indicating the page appearance build operation that created the drawing.
  • Fixed #1311. Field values in comboboxes should now be handled correctly.
  • Fixed #1290. Error was caused by incorrect rectangle emptiness check, which is fixed due to new geometry logic of this version.
  • Fixed #1286. Text alignment for redact annotations is working again.
  • Fixed #1287. Infinite loop issue for non-Windows systems when applying some redactions has been resolved.
  • Fixed #1284. Text layout destruction after applying redactions in some cases has been resolved.
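
The covering rule described for Page.get_bboxlog above (rectangles that intersect areas with a lower index are covering or hiding them) can be sketched as follows. The sample list is hypothetical data shaped as (type, (x0, y0, x1, y1)) pairs, and the helper names are assumptions. An intersection only signals a possible cover and says nothing about transparency.

```python
def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) share a non-empty area."""
    return max(a[0], b[0]) < min(a[2], b[2]) and max(a[1], b[1]) < min(a[3], b[3])


def covered_items(bboxlog):
    """Yield (lower_index, upper_index) for every later item that overlaps an earlier one."""
    for i, (_, lower) in enumerate(bboxlog):
        for j in range(i + 1, len(bboxlog)):
            if intersects(lower, bboxlog[j][1]):
                yield i, j


# Hypothetical log: text painted first, then an image, then a drawing.
log = [
    ("fill-text", (72, 100, 300, 115)),
    ("fill-image", (50, 90, 200, 200)),
    ("fill-path", (400, 400, 450, 450)),
]

print(list(covered_items(log)))
```

Here the image (index 1) overlaps the text (index 0) and was painted later, so part of the text may be hidden.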

Changes in Version 1.18.18 / 1.18.19

  • Fixed issue #1266. Failure to set Pixmap.samples in important cases was hotfixed in a new version 1.18.19.
  • Fixed issue #1257. Removing the read-only flag from PDF fields is now possible.
  • Fixed issue #1252. Now correctly specifying the zoom value for PDF link annotations.
  • Fixed issue #1244. Now correctly computing the transform matrix in Page.get_image_bbox.
  • Fixed issue #1241. Prevent returning artifact characters in Page.get_textbox, which happened in certain constellations.
  • Fixed issue #1234. Avoid creating infinite rectangles in corner cases -- Page.get_drawings, Page.get_cdrawings.
  • Added test data and test scripts to the PyPI source distribution.

Changes in Version 1.18.17

The focus of this version is on major performance improvements of selected functions.

  • Fixed issue #1199. Using a non-existing page number in Document.get_page_images and friends will no longer lead to segfaults.
  • Changed Page.get_drawings to now differentiate between "stroke", "fill" and combined paths. Paths containing more than one rectangle (i.e. "re" items) are now supported. Extracting "clipped" paths is now available as an option.
  • Added Page.get_cdrawings, performance-optimized version of Page.get_drawings.
  • Added Pixmap.samples_mv, memoryview of a pixmap's pixel area. Does not copy and thus always accesses the current state of that area.
  • Added Pixmap.samples_ptr, Python "pointer" to a pixmap's pixel area. Allows much faster creation (factor 800+) of Qt images.

Changes in Version 1.18.16

  • Fixed issue #1184. Existing PDF widget fonts in a PDF are now accepted (i.e. not forcibly changed to a Base-14 font).
  • Fixed issue #1154. Text search hits should now be correct when clip is specified.
  • Fixed issue #1152.
  • Fixed issue #1146.
  • Added Link.flags and Link.set_flags to the Link class. Implements enhancement request #1187.
  • Added option to simulate TextWriter.fill_textbox output for predicting the number of lines that a given text would occupy in the textbox.
  • Added text output support as subcommand gettext to the fitz CLI module. Most importantly, original physical text layout reproduction is now supported.

Changes in Version 1.18.15

  • Fixed issue #1088. Removing an annotation's fill color should now work again both ways, using the fill_color=[] argument in Annot.update as well as fill=[] in Annot.set_colors.
  • Fixed issue #1081. Document.subset_fonts: fixed an error which created wrong character widths for some fonts.
  • Fixed issue #1078. Page.get_text and other methods related to text extraction: changed the default value of the TextPage flags parameter. All whitespace and ligatures are now preserved.
  • Fixed issue #1085. The old snake_cased alias of fitz.getTextlength is now defined correctly.
  • Changed Document.subset_fonts to correctly prefix font subsets with an appropriate six-letter uppercase tag, complying with the PDF specification.
  • Added new method Widget.button_states which returns the possible values that a button-type field can have when being set to "on" or "off".
  • Added support of text with Small Capital letters to the Font and TextWriter classes. This is reflected by an additional bool parameter small_caps in various of their methods.
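
The six-letter uppercase tag that Document.subset_fonts now writes follows the PDF naming convention for font subsets: the tag is followed by a plus sign and then the base font name. A minimal sketch for recognizing such names is shown below; the helper split_subset_name is hypothetical and not part of PyMuPDF.

```python
import re

# Subset font names look like "ABCDEF+BaseFontName":
# exactly six uppercase letters, a plus sign, then the base name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(fontname):
    """Return (tag, base_name); tag is None if the name has no subset tag."""
    m = SUBSET_TAG.match(fontname)
    if m:
        return m.group(1), m.group(2)
    return None, fontname


print(split_subset_name("ABCDEF+Helvetica"))
print(split_subset_name("Helvetica"))
```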

Changes in Version 1.18.14

  • Finished implementing new, "snake_cased" names for methods and properties that were "camelCased" and awkward in many aspects. At the end of this documentation, there is a section Deprecated with more background and a mapping of old to new names.
  • Fixed issue #1053. Page.insert_image: when given, include image mask in the hash computation.
  • Fixed issue #1043. Added Pixmap.getPNGdata to the aliases of Pixmap.tobytes.
  • Fixed an internal error when computing the enveloping rectangle of drawn paths as returned by Page.get_drawings.
  • Fixed an internal error occasionally causing loops when outputting text via TextWriter.fill_textbox.
  • Added Font.char_lengths, which returns a tuple of character widths of a string.
  • Added more ways to specify pages in Document.delete_pages. Now a sequence (list, tuple or range) can be specified, and the Python del statement can be used. In the latter case, Python slices are also accepted.
  • Changed Document.del_toc_item, which disables a single item of the TOC: previously, the title text was removed. Instead, now the complete item will be shown grayed-out by supporting viewers.
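
Accepting an integer, a sequence or a slice as a page specification, as described for Document.delete_pages and the del statement above, can be sketched with a small stand-in class. PageList and normalize_pages are hypothetical names used for illustration only; they do not show how PyMuPDF implements this.

```python
def normalize_pages(spec, page_count):
    """Return a sorted list of unique page numbers selected by spec.

    spec may be an int, a slice, or a sequence (list, tuple or range).
    """
    if isinstance(spec, int):
        numbers = [spec]
    elif isinstance(spec, slice):
        numbers = range(*spec.indices(page_count))
    else:
        numbers = spec
    result = set()
    for n in numbers:
        if not 0 <= n < page_count:
            raise ValueError(f"bad page number: {n}")
        result.add(n)
    return sorted(result)


class PageList:
    """Hypothetical stand-in for a document, holding page numbers only."""

    def __init__(self, count):
        self.pages = list(range(count))

    def __delitem__(self, spec):
        # Called for `del obj[3]`, `del obj[1:3]` and `del obj[[0, 2]]`.
        doomed = set(normalize_pages(spec, len(self.pages)))
        self.pages = [p for i, p in enumerate(self.pages) if i not in doomed]


doc = PageList(6)
del doc[1:3]
print(doc.pages)
```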

Changes in Version 1.18.13

  • Fixed issue #1014.
  • Fixed an internal memory leak when computing image bboxes -- Page.get_image_bbox.
  • Added support for low-level access and modification of the PDF trailer. Applies to Document.xref_get_keys, Document.xref_get_key, and Document.xref_set_key.
  • Added documentation for maintaining private entries in PDF metadata.
  • Added documentation for handling transparent image insertions, Page.insert_image.
  • Added Page.get_image_rects, an improved version of Page.get_image_bbox.
  • Changed Document.delete_pages to support various ways of specifying pages to delete. Implements #1042.
  • Changed Page.insert_image to also accept the xref of an existing image in the file. This allows "copying" images between pages, and extremely fast multiple insertions.
  • Changed Page.insert_image to also accept the integer parameter alpha. To be used for performance improvements.
  • Changed Pixmap.set_alpha to support new parameters for pre-multiplying colors with their alpha values and setting a specific color to fully transparent (e.g. white).
  • Changed Document.embfile_add to automatically set creation and modification date-time. Correspondingly, Document.embfile_upd automatically maintains modification date-time (/ModDate PDF key), and Document.embfile_info correspondingly reports these data. In addition, the embedded file's associated "collection item" is included via its xref. This supports the development of PDF portfolio applications.

Changes in Version 1.18.11 / 1.18.12

  • Fixed issue #972. Improved layout of source distribution material.
  • Fixed issue #962. Stabilized Linux distribution detection for generating PyMuPDF from sources.
  • Added: Page.get_xobjects delivers the result of Document.get_page_xobjects.
  • Added: Page.get_image_info delivers meta information for all images shown on the page.
  • Added: Tools.mupdf_display_warnings allows setting on / off the display of MuPDF-generated warnings. The default is off.
  • Added: Document.ez_save convenience alias of Document.save with some different defaults.
  • Changed: Image extractions of document pages now also contain the image's transformation matrix. This concerns Page.get_image_bbox and the DICT, JSON, RAWDICT, and RAWJSON variants of Page.get_text.

Changes in Version 1.18.10

  • Fixed issue #941. Added old aliases for DisplayList.get_pixmap and DisplayList.get_textpage.
  • Fixed issue #929. Stabilized removal of JavaScript objects with Document.scrub.
  • Fixed issue #927. Removed a loop in the reworked TextWriter.fill_textbox.
  • Changed Document.xref_get_keys and Document.xref_get_key to also allow accessing the PDF trailer dictionary. This can be done by using -1 as the xref number argument.
  • Added a number of functions for reconstructing the quads for text lines, spans and characters extracted by Page.get_text options "dict" and "rawdict". See recover_quad and friends.
  • Added Tools.unset_quad_corrections to suppress character quad corrections (occasionally required for erroneous fonts).

Changes in Version 1.18.9

  • Fixed issue #888. Removed ambiguous statements concerning PyMuPDF's license, which is now clearly stated to be GNU AGPL V3.
  • Fixed issue #895.
  • Fixed issue #896. Since v1.17.6 PyMuPDF suppresses the font subset tags and only reports the base fontname in text extraction outputs "dict" / "json" / "rawdict" / "rawjson". Now a new global parameter can request the old behaviour, Tools.set_subset_fontnames.
  • Fixed issue #885. Pixmap creation now also works with filenames given as pathlib.Paths.
  • Changed Document.subset_fonts: Text is not rewritten any more and should therefore retain all its original properties -- like being hidden or being controlled by Optional Content mechanisms.
  • Changed TextWriter output to also accept text in right to left mode (Arabic, Hebrew): TextWriter.fill_textbox, TextWriter.append. These methods now accept a new boolean parameter right_to_left, which is False by default. Implements #897.
  • Changed TextWriter.fill_textbox to return all lines of text that did not fit in the given rectangle. Also changed the default of the warn parameter to no longer print a warning message in overflow situations.
  • Added a utility function recover_quad, which computes the quadrilateral of a span. This function can be used for correctly marking text extracted with the "dict" or "rawdict" options of Page.get_text.

Changes in Version 1.18.8

This is a bug fix version only. We are publishing early because the affected functions are potentially widely used.

  • Fixed issue #881. Fixed a memory leak in Page.insert_image when inserting images from files or memory.
  • Fixed issue #878. pathlib.Path objects should now correctly handle file path hierarchies.

Changes in Version 1.18.7

  • Added an experimental Document.subset_fonts which reduces the size of eligible fonts based on their use by text in the PDF. Implements #855.
  • Implemented request #870: Document.convert_to_pdf now also supports PDF documents.
  • Renamed Document.write to Document.tobytes for greater clarity. But the deprecated name remains available for some time.
  • Implemented request #843: Document.tobytes now supports linearized PDF output. Document.save now also supports writing to Python file objects. In addition, the open function now also supports Python file objects.
  • Fixed issue #844.
  • Fixed issue #838.
  • Fixed issue #823. More logic for better support of OCRed text output (Tesseract, ABBYY).
  • Fixed issue #818.
  • Fixed issue #814.
  • Added Document.get_page_labels which returns a list of page label definitions of a PDF.
  • Added Document.has_annots and Document.has_links to check whether these object types are present anywhere in a PDF.
  • Added expert low-level functions to simplify inquiry and modification of PDF object sources: Document.xref_get_keys lists the keys of object xref, Document.xref_get_key returns type and content of a key, and Document.xref_set_key modifies the key's value.
  • Added parameter thumbnails to Document.scrub to also allow removing page thumbnail images.
  • Improved documentation for how to add valid text marker annotations for non-horizontal text.

We continued the process of renaming methods and properties from "mixedCase" to "snake_case". Documentation usually mentions the new names only, but old, deprecated names remain available for some time.


Changes in Version 1.18.6

  • Fixed issue #812.
  • Fixed issue #793. Invalid document metadata previously prevented opening some documents at all. This error has been removed.
  • Fixed issue #792. Text search and text extraction will make no rectangle containment checks at all if the default clip=None is used.
  • Fixed issue #785.
  • Fixed issue #780. Corrected a parameter check error.
  • Fixed issue #779. Fixed typo.
  • Added an option to set the desired line height for text boxes. Implements #804.
  • Changed text position retrieval to better cope with Tesseract's glyphless font. Implements #803.
  • Added an option to choose the prefix of new annotations, fields and links for providing unique annotation ids. Implements request #807.
  • Added getting and setting color and text properties for Table of Contents items for PDFs. Implements #779.
  • Added PDF page label handling: Page.get_label() returns the page label, Document.get_page_numbers returns all page numbers having a specified label, and Document.set_page_labels adds or updates a PDF's page label definition.

Note

This version introduces Python type hinting. The goal is to provide each parameter and the return value of all functions and methods with type information. This still is work in progress although the majority of functions has already been handled.


Changes in Version 1.18.5

Apart from several fixes, this version also focusses on several minor, but important feature improvements. Among the latter is a more precise computation of proper line heights and insertion points for writing / inserting text. As opposed to using font-agnostic constants, these values are now taken from the font's properties.

Also note that this is the first version that no longer provides pregenerated wheels for Python versions older than 3.6. pip also discontinues support for these at the end of 2020.

  • Fixed issue #771. By using "small glyph heights" option, the full page text can be extracted.
  • Fixed issue #768.
  • Fixed issue #750.
  • Fixed issue #739. The "dict", "rawdict" and corresponding JSON output variants now have two new span keys: "ascender" and "descender". These floats represent special font properties which can be used to compute bboxes of spans or characters of exactly fontsize height (as opposed to the default line height). An example algorithm is shown in section "Span Dictionary" here. Also improved the detection and correction of ill-specified ascender / descender values encountered in some fonts.
  • Added a new, experimental Tools.set_small_glyph_heights -- also in response to issue #739. This method sets or unsets a global parameter to always compute bboxes with fontsize height. If "on", text searching and all text extractions will return rectangles, bboxes and quads with a smaller height.
  • Fixed issue #728.
  • Changed fill color logic of 'Polyline' annotations: this parameter now only pertains to line end symbols -- the annotation itself can no longer have a fill color. Also addresses issue #727.
  • Changed Page.getImageBbox to also compute the bbox if the image is contained in an XObject.
  • Changed Shape.insertTextbox, resp. Page.insertTextbox, resp. TextWriter.fillTextbox to respect font's properties "ascender" / "descender" when computing line height and insertion point. This should no longer lead to line overlaps for multi-line output. These methods used to ignore font specifics and used constant values instead.
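
The use of the new "ascender" and "descender" span keys for computing a bbox of exactly fontsize height can be sketched as below. The formula is an assumption based on the description in the #739 entry: the span origin's y value is taken as the baseline, and the descender (a negative number) gives the share of the font size that lies below it. The span dictionary is hypothetical sample data and fontsize_bbox is a hypothetical helper.

```python
def fontsize_bbox(span):
    """Return (x0, y0, x1, y1) with a height equal to the span's font size."""
    asc = span["ascender"]
    desc = span["descender"]  # negative for glyph parts below the baseline
    x0, _, x1, _ = span["bbox"]
    baseline_y = span["origin"][1]
    # Bottom edge: baseline shifted down by the descender's share of the size.
    y1 = baseline_y - span["size"] * desc / (asc - desc)
    y0 = y1 - span["size"]
    return (x0, y0, x1, y1)


# Hypothetical span as it might appear in "dict" output.
span = {
    "size": 10.0,
    "ascender": 1.0,
    "descender": -0.25,
    "origin": (72.0, 100.0),
    "bbox": (72.0, 88.0, 150.0, 104.0),
}

print(fontsize_bbox(span))
```

The resulting rectangle keeps the horizontal extent of the original bbox and only tightens its top and bottom.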

Changes in Version 1.18.4

This version adds several features to support PDF Optional Content. Among other things, this includes OCMDs (Optional Content Membership Dictionaries) with the full scope of "visibility expressions" (PDF key /VE), text insertions (including the TextWriter class) and drawings.

  • Fixed issue #727. Freetext annotations now support an uncolored rectangle when fill_color=None.
  • Fixed issue #726. UTF-8 encoding errors are now handled for HTML / XML Page.getText output.
  • Fixed issue #724. Empty values are no longer stored in the PDF /Info metadata dictionary.
  • Added new methods Document.set_oc and Document.get_oc to set or get optional content references for existing image and form XObjects. These methods are similar to the same-named methods of Annot.
  • Added Document.set_ocmd, Document.get_ocmd for handling OCMDs.
  • Added Optional Content support for text insertion and drawing.
  • Added new method Page.deleteWidget, which deletes a form field from a page. This is analogous to deleting annotations.
  • Added support for Popup annotations. This includes defining the Popup rectangle and setting the Popup to open or closed. Methods / attributes Annot.set_popup, Annot.set_open, Annot.has_popup, Annot.is_open, Annot.popup_rect, Annot.popup_xref.

Other changes:

  • The naming of methods and attributes in PyMuPDF is far from being satisfactory: we have CamelCases, mixedCases and lower_case_with_underscores all over the place. With the Annot as the first candidate, we have started an activity to clean this up step by step, converting to lower case with underscores for methods and attributes while keeping UPPERCASE for the constants.

    • Old names will remain available to prevent code breaks, but they will no longer be mentioned in the documentation.
    • New methods and attributes of all classes will be named according to the new standard.

Changes in Version 1.18.3

As a major new feature, this version introduces support for PDF's Optional Content concept.

  • Fixed issue #714.
  • Fixed issue #711.
  • Fixed issue #707: if a PDF user password, but no owner password is supplied nor present, then the user password is also used as the owner password.
  • Fixed expand and deflate parameters of methods Document.save and Document.write. Individual image and font compression should now finally work. Addresses issue #713.
  • Added support for PDF optional content. This includes several new Document methods for inquiring and setting optional content status and adding optional content configurations and groups. In addition, images, form XObjects and annotations can now be bound to optional content specifications. Resolved issue #709.

Changes in Version 1.18.2

This version contains some interesting improvements for text searching: any number of search hits is now returned and the hit_max parameter was removed. In addition, the new clip parameter allows restricting the search area. Searching now detects hyphenations at line breaks and accordingly finds hyphenated words.

  • Fixed issue #575: if using quads=False in text searching, then overlapping rectangles on the same line are joined. Previously, parts of the search string, which belonged to different "marked content" items, each generated their own rectangle -- just as if occurring on separate lines.
  • Added Document.isRepaired, which is true if the PDF was repaired on open.
  • Added Document.setXmlMetadata which either updates or creates PDF XML metadata. Implements issue #691.
  • Added Document.getXmlMetadata which returns PDF XML metadata.
  • Changed creation of PDF documents: they will now always carry a PDF identification (/ID field) in the document trailer. Implements issue #691.
  • Changed Page.searchFor: a new parameter clip is accepted to restrict the search to this rectangle. Correspondingly, the attribute TextPage.rect is now respected by TextPage.search.
  • Changed: parameter hit_max in Page.searchFor and TextPage.search is now obsolete; the methods will return all hits.
  • Changed character selection criteria in Page.getText: a character is now considered to be part of a clip if its bbox is fully contained. Before this, a non-empty intersection was sufficient.
  • Changed Document.scrub to support a new option redact_images. This addresses issue #697.
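
The changed character selection rule for clipped text extraction can be illustrated with plain rectangle arithmetic. The following is a minimal sketch, not PyMuPDF's implementation: the `Box` type and the functions `intersects` and `contains` are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle given by its top-left and bottom-right corners."""
    x0: float
    y0: float
    x1: float
    y1: float


def intersects(clip: Box, char: Box) -> bool:
    """Old criterion: the character bbox and the clip share a non-empty area."""
    return (min(clip.x1, char.x1) > max(clip.x0, char.x0)
            and min(clip.y1, char.y1) > max(clip.y0, char.y0))


def contains(clip: Box, char: Box) -> bool:
    """New criterion: the character bbox lies fully inside the clip."""
    return (clip.x0 <= char.x0 and clip.y0 <= char.y0
            and char.x1 <= clip.x1 and char.y1 <= clip.y1)


clip = Box(0, 0, 100, 20)
inside = Box(10, 5, 18, 15)       # completely within the clip
straddling = Box(95, 5, 105, 15)  # crosses the right edge of the clip

# The straddling character passes the old test but fails the new one.
print(intersects(clip, straddling), contains(clip, straddling))
```

A character that only touches the border of the clip area was selected under the old rule and is skipped under the new one, so scripts relying on tight clip rectangles may need slightly larger rectangles.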

Changes in Version 1.18.1

  • Fixed issue #692. PyMuPDF now detects and recovers from more cyclic resource dependencies in PDF pages and for the first time reports them in the MuPDF warnings store.
  • Fixed issue #686.
  • Added opacity options for the Shape class: Stroke and fill colors can now be set to some transparency value. This means that all Page draw methods, methods Page.insertText, Page.insertTextbox, Shape.finish, Shape.insertText, and Shape.insertTextbox support two new parameters: stroke_opacity and fill_opacity.
  • Added new parameter mask to Page.insertImage for optionally providing an external image mask. Resolves issue #685.
  • Added Annot.soundGet for extracting the sound of an audio annotation.

Changes in Version 1.18.0

This is the first PyMuPDF version supporting MuPDF v1.18. The focus here is on extending PyMuPDF's own functionality -- apart from bug fixing. Subsequent PyMuPDF patches may address features new in MuPDF.

  • Fixed issue #519. This upstream bug occurred occasionally for some pages only and seems to be fixed now: page layout should no longer be ruined in these cases.
  • Fixed issue #675.
    • Unsuccessful storage allocations should now always lead to exceptions (circumvention of an upstream bug intermittently crashing the interpreter).
    • Pixmap size is now based on size_t instead of int in C and should be correct even for extremely large pixmaps.
  • Fixed issue #668. Specification of dashes for PDF drawing insertion should now correctly reflect the PDF spec.
  • Fixed issue #669. A major source of memory leakage in Page.insert_pdf has been removed.
  • Added keyword "images" to Page.apply_redactions for fine-controlling the handling of images.
  • Added Annot.getText and Annot.getTextbox, which offer the same functionality as the Page versions.
  • Added key "number" to the block dictionaries of Page.getText / Annot.getText for options "dict" and "rawdict".
  • Added glyph_name_to_unicode and unicode_to_glyph_name. Both functions do not really connect to a specific font and are now independently available, too. The data are now based on the Adobe Glyph List.
  • Added convenience functions adobe_glyph_names and adobe_glyph_unicodes which return the respective available data.
  • Added Page.getDrawings which returns details of drawing operations on a document page. Works for all document types.
  • Improved performance of Document.insert_pdf. Multiple object copies are now also suppressed across multiple separate insertions from the same source. This saves time, memory and target file size. Previously this mechanism was only active within each single method execution. The new bool parameter final controls this: its default value, final=1, suppresses the feature.
  • For PNG images created from pixmaps, the resolution (dpi) is now automatically set from the respective Pixmap.xres and Pixmap.yres values.

Changes in Version 1.17.7

  • Fixed issue #651. An upstream bug causing interpreter crashes in corner case redaction processings was fixed by backporting MuPDF changes from their development repo.
  • Fixed issue #645. Pixmap top-left coordinates can be set (again) by their own method, Pixmap.set_origin.
  • Fixed issue #622. Page.insertImage again accepts a rect_like parameter.
  • Added several new methods to improve and speed up table of contents (TOC) handling. Among other things, TOC items can now be changed or deleted individually -- without always replacing the complete TOC. Furthermore, access to some PDF page attributes is now possible without first loading the page. This has a very significant impact on the performance of TOC manipulation.
  • Added an option to Document.insert_pdf which allows displaying progress messages. Addresses #640.
  • Added Page.getTextbox which extracts text contained in a rectangle. In many cases, this should obsolete writing your own script for this type of thing.
  • Added new clip parameter to Page.getText to simplify and speed up text extraction of page sub areas.
  • Added TextWriter.appendv to add text in vertical write mode. Addresses issue #653

Changes in Version 1.17.6

  • Fixed issue #605
  • Fixed issue #600 -- text should now be correctly positioned also for pages with a CropBox smaller than MediaBox.
  • Added text span dictionary key origin which contains the lower left coordinate of the first character in that span.
  • Added attribute Font.buffer, a bytes copy of the font file.
  • Added parameter sanitize to Page.cleanContents. Allows switching off sanitization, so only syntax cleaning will be done.

Changes in Version 1.17.5

  • Fixed issue #561 -- second go: certain TextWriter usages with many alternating fonts did not work correctly.
  • Fixed issue #566.
  • Fixed issue #568.
  • Fixed -- opacity is now correctly taken from the TextWriter object, if not given in TextWriter.writeText.
  • Added a new global attribute fitz_fontdescriptors. Contains information about usable fonts from repository pymupdf-fonts.
  • Added Font.valid_codepoints which returns an array of unicode codepoints for which the font has a glyph.
  • Added option text_as_path to Page.getSVGimage. This implements #580. Generates much smaller SVG files with parseable text if set to False.

Changes in Version 1.17.4

  • Fixed issue #561. Handling of more than 10 Font objects on one page should now work correctly.
  • Fixed issue #562. Annotation pixmaps are no longer derived from the page pixmap, thus avoiding unintended inclusion of page content.
  • Fixed issue #559. This MuPDF bug is being temporarily fixed with a pre-version of MuPDF's next release.
  • Added utility function repair_mono_font for correcting displayed character spacing for some mono-spaced fonts.
  • Added utility method Document.need_appearances for fine-controlling Form PDF behavior. Addresses issue #563.
  • Added utility function sRGB_to_pdf to recover the PDF color triple for a given color integer in sRGB format.
  • Added utility function sRGB_to_rgb to recover the (R, G, B) color triple for a given color integer in sRGB format.
  • Added utility function make_table which delivers table cells for a given rectangle and desired numbers of columns and rows.
  • Added support for optional fonts in repository pymupdf-fonts.
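
The two sRGB helper functions mentioned above recover color components from a single integer. The sketch below shows the underlying bit arithmetic under the assumption that the integer is laid out as 0xRRGGBB; the function names `srgb_to_rgb` and `srgb_to_pdf` are local to this illustration and are not the library's code.

```python
from typing import Tuple


def srgb_to_rgb(srgb: int) -> Tuple[int, int, int]:
    """Split an integer, assumed to be 0xRRGGBB, into (R, G, B) with values 0..255."""
    r = (srgb >> 16) & 0xFF
    g = (srgb >> 8) & 0xFF
    b = srgb & 0xFF
    return r, g, b


def srgb_to_pdf(srgb: int) -> Tuple[float, float, float]:
    """Return the same color as a PDF-style triple of floats in the range 0..1."""
    r, g, b = srgb_to_rgb(srgb)
    return r / 255.0, g / 255.0, b / 255.0


print(srgb_to_rgb(0xFF8000))
print(srgb_to_pdf(0xFF0000))
```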

Changes in Version 1.17.3

  • Fixed an undocumented issue, which prevented fully cleaning a PDF page when using Page.cleanContents.
  • Fixed issue #540. Text extraction for EPUB should again work correctly.
  • Fixed issue #548. Documentation now includes LINK_NAMED.
  • Added new parameter to control start of text in TextWriter.fillTextbox. Implements #549.
  • Changed documentation of Page.add_redact_annot to explain the usage of non-builtin fonts.

Changes in Version 1.17.2

  • Fixed issue #533.
  • Added options to modify 'Redact' annotation appearance. Implements #535.

Changes in Version 1.17.1

  • Fixed issue #520.
  • Fixed issue #525. Vertices for 'Ink' annots should now be correct.
  • Fixed issue #524. It is now possible to query and set rotation for applicable annotation types.

Also significantly improved inline documentation for better support of interactive help.


Changes in Version 1.17.0

This version is based on MuPDF v1.17. Following are highlights of new and changed features:

  • Added extended language support for annotations and widgets: a mixture of Latin, Greek, Russian, Chinese, Japanese and Korean characters can now be used in 'FreeText' annotations and text widgets. No special arrangement is required to use it.
  • Faster page access is implemented for documents supporting a "chapter" structure. This applies to EPUB documents currently. This comes with several new Document methods and changes for Document.loadPage and the "indexed" page access doc[n]: In addition to specifying a page number as before, a tuple (chapter, pno) can be specified to identify the desired page.
  • Changed: Improved support of redaction annotations: images overlapped by redactions are permanently modified by erasing the overlap areas. Also links are removed if overlapped by redactions. This is now fully in sync with PDF specifications.

Other changes:

  • Changed TextWriter.writeText to support the "morph" parameter.
  • Added methods Rect.morph, IRect.morph, and Quad.morph, which return a new Quad.
  • Changed Page.add_freetext_annot to support text alignment via a new "align" parameter.
  • Fixed issue #508. Improved image rectangle calculation to hopefully deliver correct values in most if not all cases.
  • Fixed issue #502.
  • Fixed issue #500. Document.convertToPDF should no longer cause memory leaks.
  • Fixed issue #496. Annotations and widgets / fields are now added or modified using the coordinates of the unrotated page. This behavior is now in sync with other methods modifying PDF pages.
  • Added Page.rotationMatrix and Page.derotationMatrix to support coordinate transformations between the rotated and the original versions of a PDF page.

Potential code breaking changes:

  • The private method Page._getTransformation() has been removed. Use the public Page.transformationMatrix instead.

Changes in Version 1.16.18

This version introduces several new features around PDF text output. The motivation is to simplify this task, while at the same time offering extended features.

One major achievement is using MuPDF's capability to dynamically choose fallback fonts whenever a character cannot be found in the current one. This works seamlessly for Base-14 fonts in combination with CJK fonts (China, Japan, Korea). So a text may contain any combination of characters from the Latin, Greek, Russian, Chinese, Japanese and Korean languages.

  • Fixed issue #493. Pixmap(doc, xref) should now again correctly resemble the loaded image object.
  • Fixed issue #488. Widget names are now modifiable.
  • Added new class Font which represents a font.
  • Added new class TextWriter which serves as a container for text to be written on a page.
  • Added Page.writeText to write one or more TextWriter objects to the page.

Changes in Version 1.16.17

  • Fixed issue #479. PyMuPDF should now more correctly report image resolutions. This applies to both images (either from image files or extracted from PDF documents) and pixmaps created from images.
  • Added Pixmap.set_dpi which sets the image resolution in x and y directions.

Changes in Version 1.16.16

  • Fixed issue #477.
  • Fixed issue #476.
  • Changed annotation line end symbol coloring and fixed an error coloring the interior of 'Polyline' /'Polygon' annotations.

Changes in Version 1.16.14

  • Changed text marker annotations to accept parameters beyond just quadrilaterals such that now text lines between two given points can be marked.
  • Added Document.scrub which removes potentially sensitive data from a PDF. Implements #453.
  • Added Annot.blendMode which returns the blend mode of annotations.
  • Added Annot.setBlendMode to set the annotation's blend mode. This resolves issue #416.
  • Changed Annot.update to accept additional parameters for setting blend mode and opacity.
  • Added advanced graphics features to control the anti-aliasing values, Tools.set_aa_level. Resolves #467
  • Fixed issue #474.
  • Fixed issue #466.

Changes in Version 1.16.13

  • Added Document.getPageXObjectList which returns a list of Form XObjects of the page.
  • Added Page.setMediaBox for changing the physical PDF page size.
  • Added Page methods which have been internal before: Page.cleanContents (= Page._cleanContents), Page.getContents (= Page._getContents), Page.getTransformation (= Page._getTransformation).

Changes in Version 1.16.12

  • Fixed issue #447
  • Fixed issue #461.
  • Fixed issue #397.
  • Fixed issue #463.
  • Added JavaScript support to PDF form fields, thereby fixing #454.
  • Added a new annotation method Annot.delete_responses, which removes 'Popup' and response annotations referring to the current one. Mainly serves data protection purposes.
  • Added a new form field method Widget.reset, which resets the field value to its default.
  • Changed and extended handling of redactions: images and XObjects are removed if contained in a redaction rectangle. Any merely partial overlaps will just be covered by the redaction background color. Now an overlay text can be specified to be inserted in the rectangle area to take the place of the deleted original text. This resolves #434.

Changes in Version 1.16.11

  • Added support for redaction annotations via methods Page.add_redact_annot and Page.apply_redactions.
  • Fixed issue #426 ("PolygonAnnotation in 1.16.10 version").
  • Fixed documentation only issues #443 and #444.

Changes in Version 1.16.10

  • Fixed issue #421 ("annot.set_rect(rect) has no effect on text Annotation")
  • Fixed issue #417 ("Strange behavior for page.deleteAnnot on 1.16.9 compare to 1.13.20")
  • Fixed issue #415 ("Annot.setOpacity throws mupdf warnings")
  • Changed all "add annotation / widget" methods to store a unique name in the /NM PDF key.
  • Changed Annot.setInfo to also accept direct parameters in addition to a dictionary.
  • Changed Annot.info to now also show the annotation's unique id (/NM PDF key) if present.
  • Added Page.annot_names which returns a list of all annotation names (/NM keys).
  • Added Page.load_annot which loads an annotation given its unique id (/NM key).
  • Added Document.reload_page which provides a new copy of a page after finishing any pending updates to it.

Changes in Version 1.16.9

  • Fixed #412 ("Feature Request: Allow controlling whether TOC entries should be collapsed")
  • Fixed #411 ("Seg Fault with page.firstWidget")
  • Fixed #407 ("Annot.setOpacity trouble")
  • Changed methods Annot.setBorder, Annot.setColors, Link.setBorder, and Link.setColors to also accept direct parameters, and not just cumbersome dictionaries.

Changes in Version 1.16.8

  • Added several new methods to the Document class, which make dealing with PDF low-level structures easier. I also decided to provide them as "normal" methods (as opposed to private ones starting with an underscore "_"). These are Document.xrefObject, Document.xrefStream, Document.xrefStreamRaw, Document.PDFTrailer, Document.PDFCatalog, Document.metadataXML, Document.updateObject, Document.updateStream.
  • Added Tools.mupdf_display_errors which sets the display of mupdf errors on sys.stderr.
  • Added a commandline facility. This is a major new feature: you can now invoke several utility functions via "python -m fitz ...". It should obsolete the need for many of the most trivial scripts. Please refer to Module.

Changes in Version 1.16.7

Minor changes to better synchronize the binary image streams of TextPage image blocks and Document.extractImage images.

  • Fixed issue #394 ("PyMuPDF Segfaults when using TOOLS.mupdf_warnings()").
  • Changed redirection of MuPDF error messages: apart from writing them to Python sys.stderr, they are now also stored with the MuPDF warnings.
  • Changed Tools.mupdf_warnings to automatically empty the store (if not deactivated via a parameter).
  • Changed Page.getImageBbox to return an infinite rectangle if the image could not be located on the page -- instead of raising an exception.

Changes in Version 1.16.6

  • Fixed issue #390 ("Incomplete deletion of annotations").
  • Changed Page.searchFor / Document.searchPageFor to also support the flags parameter, which controls the data included in a TextPage.
  • Changed Document.getPageImageList, Document.getPageFontList and their Page counterparts to support a new parameter full. If true, the returned items will contain the xref of the Form XObject where the font or image is referenced.

Changes in Version 1.16.5

More performance improvements for text extraction.

  • Fixed second part of issue #381 (see item in v1.16.4).
  • Added Page.getTextPage, so it is no longer required to create an intermediate display list for text extractions. Page level wrappers for text extraction and text searching are now based on this, which should improve performance by ca. 5%.

Changes in Version 1.16.4

  • Fixed issue #381 ("TextPage.extractDICT ... failed ... after upgrading ... to 1.16.3")
  • Added method Document.pages which delivers a generator iterator over a page range.
  • Added method Page.links which delivers a generator iterator over the links of a page.
  • Added method Page.annots which delivers a generator iterator over the annotations of a page.
  • Added method Page.widgets which delivers a generator iterator over the form fields of a page.
  • Changed Document.is_form_pdf to now contain the number of widgets, and False if not a PDF or this number is zero.
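
The generator iterators added in this version replace manual loops over "first item / next item" chains. The following sketch shows that general pattern with a hypothetical `Node` class standing in for a chained object such as an annotation; it is an illustration of the idea only and does not reproduce PyMuPDF code.

```python
from typing import Iterator, Optional


class Node:
    """Hypothetical stand-in for an object that links to its successor via .next."""

    def __init__(self, name: str, next: Optional["Node"] = None) -> None:
        self.name = name
        self.next = next


def iterate_chain(first: Optional[Node]) -> Iterator[Node]:
    """Yield every item of a chain, starting at 'first', until .next is None."""
    item = first
    while item is not None:
        yield item
        item = item.next


# Build the chain a -> b -> c and walk it with a plain for loop.
chain = Node("a", Node("b", Node("c")))
for node in iterate_chain(chain):
    print(node.name)
```

With such a generator, calling code can use a for loop or a list comprehension instead of maintaining its own while loop and end-of-chain check.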

Changes in Version 1.16.3

Minor changes compared to version 1.16.2. The code of the "dict" and "rawdict" variants of Page.getText has been ported to C which has greatly improved their performance. This improvement is mostly noticeable with text-oriented documents, where they now should execute almost two times faster.

  • Fixed issue #369 ("mupdf: cmsCreateTransform failed") by removing ICC colorspace support.
  • Changed Page.getText to accept additional keywords "blocks" and "words". These will deliver the results of Page.getTextBlocks and Page.getTextWords, respectively. So all text extraction methods are now available via a uniform API. Correspondingly, there are now new methods TextPage.extractBLOCKS and TextPage.extractWords.
  • Changed Page.getText to default bit indicator TEXT_INHIBIT_SPACES to off. Insertion of additional spaces is not suppressed by default.

Changes in Version 1.16.2

  • Changed text extraction methods of Page to allow detail control of the amount of extracted data.
  • Added planish_line which maps a given line (defined as a pair of points) to the x-axis.
  • Fixed an issue (w/o GitHub number) which brought down the interpreter when encountering certain non-UTF-8 encodable characters while using Page.getText with the "dict" option.
  • Fixed issue #362 ("Memory Leak with getText('rawDICT')").
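
Mapping a line to the x-axis, as planish_line does, amounts to a translation followed by a rotation. The sketch below computes such a transformation as a six-number matrix (a, b, c, d, e, f) applied to row vectors. It assumes that the first point should land on the origin; that choice, as well as the names `planish` and `transform`, belong to this illustration and are not taken from the library.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Matrix = Tuple[float, float, float, float, float, float]


def planish(p1: Point, p2: Point) -> Matrix:
    """Return a matrix moving p1 to (0, 0) and p2 onto the positive x-axis."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("the two points must be different")
    cos_t, sin_t = dx / length, dy / length
    # Rotation by the negative line angle (row-vector convention).
    a, b, c, d = cos_t, -sin_t, sin_t, cos_t
    # Translation chosen so that p1 ends up at the origin.
    e = -(p1[0] * a + p1[1] * c)
    f = -(p1[0] * b + p1[1] * d)
    return (a, b, c, d, e, f)


def transform(p: Point, m: Matrix) -> Point:
    """Apply matrix m to point p."""
    a, b, c, d, e, f = m
    return (p[0] * a + p[1] * c + e, p[0] * b + p[1] * d + f)


m = planish((1.0, 1.0), (4.0, 5.0))
start = transform((1.0, 1.0), m)
end = transform((4.0, 5.0), m)
print(start, end)
```

The distance between the two points is preserved, so the image of the second point has an x-coordinate equal to the length of the original line.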

Changes in Version 1.16.1

  • Added property Quad.is_convex which checks whether a line is contained in the quad if it connects two points of it.
  • Changed Document.insert_pdf to now allow dropping or including links and annotations independently during the copy. Fixes issue #352 ("Corrupt PDF data and ..."), which seemed to intermittently occur when using the method for some problematic PDF files.
  • Fixed a bug which, in matrix division using the syntax "m1/m2", caused matrix "m1" to be replaced by the result instead of delivering a new matrix.
  • Fixed issue #354 ("SyntaxWarning with Python 3.8"). We now always use "==" for literals (instead of the "is" Python keyword).
  • Fixed issue #353 ("mupdf version check"), to no longer refuse the import when there are only patch level deviations from MuPDF.
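
The matrix division fix above concerns a general rule for operator overloading: a binary operator such as "/" should return a new object and leave its operands unchanged. Here is a minimal sketch of that rule with a hypothetical immutable `Matrix2D` class; it is not PyMuPDF's Matrix implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Matrix2D:
    """Hypothetical immutable 2D matrix (a, b, c, d, e, f) for row vectors."""
    a: float = 1.0
    b: float = 0.0
    c: float = 0.0
    d: float = 1.0
    e: float = 0.0
    f: float = 0.0

    def inverted(self) -> "Matrix2D":
        det = self.a * self.d - self.b * self.c
        if det == 0:
            raise ZeroDivisionError("matrix is not invertible")
        ia, ib = self.d / det, -self.b / det
        ic, id_ = -self.c / det, self.a / det
        ie = -(self.e * ia + self.f * ic)
        if_ = -(self.e * ib + self.f * id_)
        return Matrix2D(ia, ib, ic, id_, ie, if_)

    def __mul__(self, o: "Matrix2D") -> "Matrix2D":
        return Matrix2D(
            self.a * o.a + self.b * o.c,
            self.a * o.b + self.b * o.d,
            self.c * o.a + self.d * o.c,
            self.c * o.b + self.d * o.d,
            self.e * o.a + self.f * o.c + o.e,
            self.e * o.b + self.f * o.d + o.f,
        )

    def __truediv__(self, o: "Matrix2D") -> "Matrix2D":
        # Division is multiplication by the inverse; a new object is returned.
        return self * o.inverted()


m1 = Matrix2D(2, 0, 0, 2, 10, 20)
m2 = Matrix2D(2, 0, 0, 2, 0, 0)
quotient = m1 / m2
print(quotient)
print(m1)  # unchanged by the division
```

Making the class frozen rules out the accidental in-place update that the bug fix addresses, because any attempt to assign to a field raises an exception.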

Changes in Version 1.16.0

This major new version of MuPDF comes with several nice new or changed features. Some of them imply programming API changes, however. This is a synopsis of what has changed:

  • PDF document encryption and decryption is now fully supported. This includes setting permissions, passwords (user and owner passwords) and the desired encryption method.
  • In response to the new encryption features, PyMuPDF returns an integer (i.e. a combination of bits) for document permissions, and no longer a dictionary.
  • Redirection of MuPDF errors and warnings is now natively supported. PyMuPDF redirects error messages from MuPDF to sys.stderr and no longer buffers them. Warnings continue to be buffered and will not be displayed. Functions exist to access and reset the warnings buffer.
  • Annotations are now only supported for PDF.
  • Annotations and widgets (form fields) are now separate object chains on a page (although widgets technically still are PDF annotations). This means, that you will never encounter widgets when using Page.firstAnnot or Annot.next. You must use Page.firstWidget and Widget.next to access form fields.
  • As part of MuPDF's changes regarding widgets, only the following four fonts are supported, when adding or changing form fields: Courier, Helvetica, Times-Roman and ZapfDingBats.
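
Because document permissions are now an integer of bit indicators, individual rights are tested with bitwise operations instead of dictionary lookups. The sketch below uses a hypothetical `Perm` flag class whose values follow the bit positions of the PDF specification for printing, modifying, copying and annotating; the class and the `allows` helper are illustration names, not PyMuPDF constants.

```python
from enum import IntFlag


class Perm(IntFlag):
    """Hypothetical names for four PDF user access permission bits."""
    PRINT = 4       # bit 3
    MODIFY = 8      # bit 4
    COPY = 16       # bit 5
    ANNOTATE = 32   # bit 6


def allows(permissions: int, flag: Perm) -> bool:
    """Return True if the given permission bit is set in the integer."""
    return bool(permissions & flag)


# Combine rights with "|" and test a single right with "&".
permissions = Perm.PRINT | Perm.COPY
print(int(permissions))
print(allows(permissions, Perm.PRINT), allows(permissions, Perm.MODIFY))
```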

List of change details:

  • Added Document.can_save_incrementally which checks conditions that are preventing use of option incremental=True of Document.save.
  • Added Page.firstWidget which points to the first field on a page.
  • Added Page.getImageBbox which returns the rectangle occupied by an image shown on the page.
  • Added Annot.setName which lets you change the (icon) name field.
  • Added outputting the text color in Page.getText: the "dict", "rawdict" and "xml" options now also show the color in sRGB format.
  • Changed Document.permissions to now contain an integer of bool indicators -- was a dictionary before.
  • Changed Document.save, Document.write, which now fully support password-based decryption and encryption of PDF files.
  • Changed the names of all Python constants related to annotations and widgets. Please make sure to consult the Constants and Enumerations chapter if your script is dealing with these two classes. This decision goes back to the dropped support for non-PDF annotations. The old names (starting with "ANNOT" or "WIDGET_") will be available as deprecated synonyms.
  • Changed font support for widgets: only Cour (Courier), Helv (Helvetica, default), TiRo (Times-Roman) and ZaDb (ZapfDingBats) are accepted when adding or changing form fields. Only the plain versions are possible -- not their italic or bold variations. Reading a widget, however, will show its original font.
  • Changed the name of the warnings buffer to Tools.mupdf_warnings and the function to empty this buffer is now called Tools.reset_mupdf_warnings.
  • Changed Page.getPixmap, Document.get_page_pixmap: a new bool argument annots can now be used to suppress the rendering of annotations on the page.
  • Changed Page.add_file_annot and Page.add_text_annot to enable setting an icon.
  • Removed widget-related methods and attributes from the Annot object.
  • Removed Document attributes openErrCode, openErrMsg, and Tools attributes / methods stderr, reset_stderr, stdout, and reset_stdout.
  • Removed thirdparty zlib dependency in PyMuPDF: there are now compression functions available in MuPDF. Source installers of PyMuPDF may now omit this extra installation step.

No version published for MuPDF v1.15.0


Changes in Version 1.14.20 / 1.14.21

  • Changed text marker annotations to support multiple rectangles / quadrilaterals. This fixes issue #341 ("Question : How to addhighlight so that a string spread across more than a line is covered by one highlight?") and similar (#285).
  • Fixed issue #331 ("Importing PyMuPDF changes warning filtering behaviour globally").

Changes in Version 1.14.19

  • Fixed issue #319 ("InsertText function error when use custom font").
  • Added new method Document.get_sigflags which returns information on whether a PDF is signed. Resolves issue #326 ("How to detect signature in a form pdf?").

Changes in Version 1.14.17

  • Added Document.fullcopyPage to make full page copies within a PDF (not just copied references as Document.copyPage does).
  • Changed: Page.getPixmap and Document.get_page_pixmap now use alpha=False as default.
  • Changed text extraction: the span dictionary now (again) contains its rectangle under the bbox key.
  • Changed Document.movePage and Document.copyPage to use direct functions instead of wrapping Document.select -- similar to Document.delete_page in v1.14.16.

Changes in Version 1.14.16

  • Changed Document methods around PDF /EmbeddedFiles to no longer use MuPDF's "portfolio" functions. That support will be dropped in MuPDF v1.15 -- therefore another solution was required.
  • Changed Document.embfile_Count to be a function (was an attribute).
  • Added new method Document.embfile_Names which returns a list of names of embedded files.
  • Changed Document.delete_page and Document.delete_pages to internally no longer use Document.select, but instead use functions to perform the deletion directly. As it has turned out, the Document.select method yields invalid outline trees (tables of content) for very complex PDFs and sophisticated use of annotations.

Changes in Version 1.14.15

  • Fixed issues #301 ("Line cap and Line join"), #300 ("How to draw a shape without outlines") and #298 ("utils.updateRect exception"). These bugs pertain to drawing shapes with PyMuPDF. Drawing shapes without any border is fully supported. Line cap and line join styles are now differentiated and support all possible PDF values (0, 1, 2) instead of just being a bool. The previous parameter roundCap is deprecated in favor of lineCap and lineJoin and will be deleted in the next release.
  • Fixed issue #290 ("Memory Leak with getText('rawDICT')"). This bug caused memory not to be (completely) freed after invoking the "dict", "rawdict" and "json" versions of Page.getText.

Changes in Version 1.14.14

  • Added new low-level function ImageProperties to determine a number of characteristics for an image.
  • Added new low-level function Document.is_stream, which checks whether an object is of stream type.
  • Changed: low-level functions Document._getXrefString and Document._getTrailerString now by default return object definitions in a formatted form which makes parsing easy.

Changes in Version 1.14.13

  • Changed methods working with binary input: in addition to the bytes and bytearray objects they have always supported, they now also accept io.BytesIO input, using its getvalue() method. This pertains to document creation, embedded files, FileAttachment annotations, pixmap creation and others. Fixes issue #274 ("Segfault when using BytesIO as a stream for insertImage").
  • Fixed issue #278 ("Is insertImage(keep_proportion=True) broken?"). Images are now correctly presented when keeping aspect ratio.
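
Accepting io.BytesIO next to bytes and bytearray can be done by normalizing the input once, at the start of a method. The helper below is a minimal sketch of that idea; the name `to_bytes` is hypothetical and the code is not taken from PyMuPDF.

```python
import io
from typing import Union

BinaryInput = Union[bytes, bytearray, io.BytesIO]


def to_bytes(data: BinaryInput) -> bytes:
    """Return the content of any supported binary input as a bytes object."""
    if isinstance(data, (bytes, bytearray)):
        return bytes(data)
    if isinstance(data, io.BytesIO):
        # getvalue() returns the whole buffer, whatever the current position is.
        return data.getvalue()
    raise TypeError("expected bytes, bytearray or io.BytesIO")


buffer = io.BytesIO(b"%PDF-1.7")
buffer.read(4)  # move the read position; getvalue() is not affected by this
print(to_bytes(buffer))
```

Using getvalue() rather than read() means that callers do not have to rewind the buffer before handing it over.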

Changes in Version 1.14.12

  • Changed the draw methods of Page and Shape to support not only RGB, but also GRAY and CMYK colorspaces. This solves issue #270 ("Is there a way to use CMYK color to draw shapes?"). This change also applies to text insertion methods of Shape, resp. Page.
  • Fixed issue #269 ("AttributeError in Document.insert_page()"), which occurred when using Document.insert_page with text insertion.

Changes in Version 1.14.11

  • Changed Page.show_pdf_page to always position the source rectangle centered in the target. This method now also supports rotation by arbitrary angles. The argument reuse_xref has been deprecated: prevention of duplicates is now handled internally.
  • Changed Page.insertImage to support rotated display of the image and keeping the aspect ratio. Only rotations by multiples of 90 degrees are supported here.
  • Fixed issue #265 ("TypeError: insertText() got an unexpected keyword argument 'idx'"). This issue only occurred when using Document.insert_page with also inserting text.

Changes in Version 1.14.10

  • Changed Page.show_pdf_page to support rotation of the source rectangle. Fixes #261 ("Cannot rotate insterted pages").
  • Fixed a bug in Page.insertImage which prevented insertion of multiple images provided as streams.

Changes in Version 1.14.9

  • Added new low-level method Document._getTrailerString, which returns the trailer object of a PDF. This is much like Document._getXrefString except that the PDF trailer has no / needs no xref to identify it.
  • Added new parameters for text insertion methods. You can now set stroke and fill colors of glyphs (text characters) independently, as well as the thickness of the glyph border. A new parameter render_mode controls the use of these colors, and whether the text should be visible at all.
  • Fixed issue #258 ("Copying image streams to new PDF without size increase"): For JPX images embedded in a PDF, Document.extractImage will now return them in their original format. Previously, the MuPDF base library was used, which returns them in PNG format (entailing a massive size increase).
  • Fixed issue #259 ("Morphing text to fit inside rect"). Clarified use of get_text_length and removed extra line breaks for long words.

Changes in Version 1.14.8

  • Added Pixmap.set_rect to change the pixel values in a rectangle. This is also an alternative to setting the color of a complete pixmap (Pixmap.clear_with).
  • Fixed an image extraction issue with JBIG2 (monochrome) encoded PDF images. The issue occurred in Page.getText (parameters "dict" and "rawdict") and in Document.extractImage methods.
  • Fixed an issue with not correctly clearing a non-alpha Pixmap (Pixmap.clear_with).
  • Fixed an issue with not correctly inverting colors of a non-alpha Pixmap (Pixmap.invert_irect).

Changes in Version 1.14.7

  • Added Pixmap.set_pixel to change one pixel value.
  • Added documentation for image conversion in the FAQ.
  • Added new function get_text_length to determine the string length for a given font.
  • Added Postscript image output (changed Pixmap.save and Pixmap.tobytes).
  • Changed Pixmap.save and Pixmap.tobytes to ensure valid combinations of colorspace, alpha and output format.
  • Changed Pixmap.save: the desired format is now inferred from the filename.
  • Changed: FreeText annotations can now have a transparent background - see Annot.update.

Changes in Version 1.14.5

  • Changed: Shape methods now strictly use the transformation matrix of the Page -- instead of "manually" calculating locations.
  • Added method Pixmap.pixel which returns the pixel value (a list) for given pixel coordinates.
  • Added method Pixmap.tobytes which returns a bytes object representing the pixmap in a variety of formats. Previously, this could be done for PNG outputs only (Pixmap.tobytes).
  • Changed: output of methods Pixmap.save and (the new) Pixmap.tobytes may now also be PSD (Adobe Photoshop Document).
  • Added method Shape.drawQuad which draws a Quad. This actually is a shorthand for a Shape.drawPolyline with the edges of the quad.
  • Changed method Shape.drawOval: the argument can now be either a rectangle (rect_like) or a quadrilateral (quad_like).

Changes in Version 1.14.4

  • Fixes issue #239 "Annotation coordinate consistency".

Changes in Version 1.14.3

This patch version contains minor bug fixes and CJK font output support.

  • Added support for the four CJK fonts as PyMuPDF generated text output. This pertains to methods Page.insertFont, Shape.insertText, Shape.insertTextbox, and corresponding Page methods. The new fonts are available under "reserved" fontnames "china-t" (traditional Chinese), "china-s" (simplified Chinese), "japan" (Japanese), and "korea" (Korean).
  • Added full support for the built-in fonts 'Symbol' and 'Zapfdingbats'.
  • Changed: The 14 standard fonts can now each be referenced by a 4-letter abbreviation.

Changes in Version 1.14.1

This patch version contains minor performance improvements.

  • Added support for Document filenames given as pathlib object by using the Python str() function.

Changes in Version 1.14.0

To support MuPDF v1.14.0, massive changes were required in PyMuPDF -- most of them purely technical, with little visibility to developers. But there are also quite a lot of interesting new and improved features. Following are the details:

  • Added "ink" annotation.
  • Added "rubber stamp" annotation.
  • Added "squiggly" text marker annotation.
  • Added new class Quad (quadrilateral or tetragon) -- which represents a general four-sided shape in the plane. The special subtype of rectangular, non-empty tetragons is used in text marker annotations and as returned objects in text search methods.
  • Added a new option "decrypt" to Document.save and Document.write. Now you can keep encryption when saving a password protected PDF.
  • Added suppression and redirection of unsolicited messages issued by the underlying C-library MuPDF. Consult RedirectMessages for details.
  • Changed: Changes to annotations now always require Annot.update to become effective.
  • Changed free text annotations to support the full Latin character set and range of appearance options.
  • Changed text searching, Page.searchFor, to optionally return Quad instead of Rect objects surrounding each search hit.
  • Changed plain text output: we now add a "\n" to each line if it does not itself end with this character.
  • Fixed issue 211 ("Something wrong in the doc").
  • Fixed issue 213 ("Rewritten outline is displayed only by mupdf-based applications").
  • Fixed issue 214 ("PDF decryption GONE!").
  • Fixed issue 215 ("Formatting of links added with pyMuPDF").
  • Fixed issue 217 ("extraction through json is failing for my pdf").
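
The change to plain text output listed above (every line ends with a newline character) amounts to the following rule. A minimal sketch, assuming the lines are already available as Python strings; ensure_trailing_newline is a hypothetical helper and not part of PyMuPDF.

```python
def ensure_trailing_newline(lines):
    """Append a newline to every line that does not already end with one."""
    return [line if line.endswith("\n") else line + "\n" for line in lines]


print(ensure_trailing_newline(["first line", "second line\n"]))
```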

Behind the curtain, we have changed the implementation of geometry objects: they now purely exist in Python and no longer have "shadow" twins on the C-level (in MuPDF). This has improved processing speed in that area by more than a factor of two.

For the same reason, most methods involving geometry parameters now also accept the corresponding Python sequence. For example, in method "page.show_pdf_page(rect, ...)" parameter rect may now be any rect_like sequence.
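
What "rect_like" means in practice can be sketched in plain Python: any sequence of four numbers is accepted and normalized. The helper as_rect_tuple below is hypothetical and only illustrates the idea; it is not how PyMuPDF validates its arguments.

```python
from numbers import Real


def as_rect_tuple(rect_like):
    """Normalize any sequence of four numbers to a tuple of floats (hypothetical helper)."""
    values = tuple(rect_like)
    if len(values) != 4 or not all(isinstance(v, Real) for v in values):
        raise ValueError("rect_like must be a sequence of 4 numbers")
    return tuple(float(v) for v in values)


# Lists and tuples are treated alike.
print(as_rect_tuple([0, 0, 612, 792]))
print(as_rect_tuple((36, 36, 576.5, 756)))
```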

We also invested considerable effort to further extend and improve the FAQ chapter.


Changes in Version 1.13.19

This version contains some technical / performance improvements and bug fixes.

  • Changed memory management: for Python 3 builds, Python memory management is exclusively used across all C-level code (i.e. no more native malloc() in MuPDF code or PyMuPDF interface code). This leads to improved memory usage profiles and also some runtime improvements: we have seen > 2% shorter runtimes for text extractions and pixmap creations (on Windows machines only to date).
  • Fixed an error occurring in Python 2.7, which crashed the interpreter when using TextPage.extractRAWDICT (= Page.getText("rawdict")).
  • Fixed an error occurring in Python 2.7, when creating link destinations.
  • Extended the FAQ chapter with more examples.

Changes in Version 1.13.18

  • Added method TextPage.extractRAWDICT, and a corresponding new string parameter "rawdict" to method Page.getText. It extracts text and images from a page in Python dict form like TextPage.extractDICT, but with the detail level of TextPage.extractXML, which is position information down to each single character.

Changes in Version 1.13.17

  • Fixed an error that intermittently caused an exception in Page.show_pdf_page, when pages from many different source PDFs were shown.
  • Changed method Document.extractImage to now return more meta information about the extracted image. Also, its performance has been greatly improved. Several demo scripts have been changed to make use of this method.
  • Changed method Document._getXrefStream to now return None if the object is not a stream, instead of raising an exception.
  • Added method Document._deleteObject which deletes a PDF object identified by its xref. Only to be used by the experienced PDF expert.
  • Added a method paper_rect which returns a Rect for a supplied paper format string. Example: fitz.paper_rect("letter") = fitz.Rect(0.0, 0.0, 612.0, 792.0).
  • Added a FAQ chapter to this document.
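
The paper_rect example above can be mimicked with a small lookup table. This is a sketch only: paper_rect_sketch and PAPER_SIZES_PT are hypothetical names, and only the "letter" dimensions come from the entry above; the "a4" entry is an added assumption.

```python
# Width and height in points (1/72 inch), portrait orientation.
# "letter" is taken from the change log entry; "a4" is an assumed extra entry.
PAPER_SIZES_PT = {
    "letter": (612, 792),
    "a4": (595, 842),
}


def paper_rect_sketch(name):
    """Return (x0, y0, x1, y1) for a paper format name (hypothetical helper)."""
    width, height = PAPER_SIZES_PT[name.lower()]
    return (0.0, 0.0, float(width), float(height))


print(paper_rect_sketch("letter"))
```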

Changes in Version 1.13.16

  • Added support for correctly setting transparency (opacity) for certain annotation types.
  • Added a tool property (Tools.fitz_config) showing the configuration of this PyMuPDF version.
  • Fixed issue #193 ('insertText(overlay=False) gives "cannot resize a buffer with shared storage" error') by avoiding read-only buffers.

Changes in Version 1.13.15

  • Fixed issue #189 ("cannot find builtin CJK font"), so we are supporting builtin CJK fonts now (CJK = China, Japan, Korea). This should lead to correctly generated pixmaps for documents using these languages. This change has consequences for our binary file size: it will now range between 8 and 10 MB, depending on the OS.
  • Fixed issue #191 ("Jupyter notebook kernel dies after ca. 40 pages"), which occurred when modifying the contents of an annotation.

Changes in Version 1.13.14

This patch version contains several improvements, mainly for annotations.

  • Changed: Annot.lineEnds is now a list of two integers representing the line end symbols. Previously, it was a dict of strings.
  • Added support of line end symbols for applicable annotations. PyMuPDF now can generate these annotations including the line end symbols.
  • Added Annot.setLineEnds, which adds line end symbols to applicable annotation types ('Line', 'PolyLine', 'Polygon').
  • Changed technical implementation of Page.insertImage and Page.show_pdf_page: they now create their own contents objects, thereby avoiding changes of potentially large streams with consequential compression / decompression efforts and high change volumes with incremental updates.

Changes in Version 1.13.13

This patch version contains several improvements for embedded files and file attachment annotations.

  • Added Document.embfile_Upd which allows changing file content and metadata of an embedded file. It supersedes the old method Document.embfile_SetInfo (which will be deleted in a future version). Content is automatically compressed and metadata may be unicode.
  • Changed Document.embfile_Add to now automatically compress file content. Accompanying metadata can now be unicode (had to be ASCII in the past).
  • Changed Document.embfile_Del to now automatically delete all entries having the supplied identifying name. The return code is now an integer count of the removed entries (was None previously).
  • Changed embedded file methods to now also accept or show the PDF unicode filename as additional parameter ufilename.
  • Added Page.add_file_annot which adds a new file attachment annotation.
  • Changed Annot.fileUpd (file attachment annot) to now also accept the PDF unicode ufilename parameter. The description parameter desc correctly works with unicode. Furthermore, all parameters are optional, so metadata may be changed without also replacing the file content.
  • Changed Annot.fileInfo (file attachment annot) to now also show the PDF unicode filename as parameter ufilename.
  • Fixed issue #180 ("page.getText(output='dict') return invalid bbox") to now also work for vertical text.
  • Fixed issue #185 ("Can't render the annotations created by PyMuPDF"). The issue's cause was the minimalistic MuPDF approach when creating annotations. Several annotation types have no /AP ("appearance") object when created by MuPDF functions. MuPDF, SumatraPDF and hence also PyMuPDF cannot render annotations without such an object. This fix now ensures that an appearance object is always created together with the annotation itself. We still do not support line end styles.

Changes in Version 1.13.12

  • Fixed issue #180 ("page.getText(output='dict') return invalid bbox"). Note that this is a circumvention of an MuPDF error, which generates zero-height character rectangles in some cases. When this happens, this fix ensures a bbox height of at least fontsize.
  • Changed: for ListBox and ComboBox widgets, the attribute list of selectable values has been renamed to Widget.choice_values.
  • Changed: when adding widgets, any missing Base-14 font is automatically added to the PDF. Widget text fonts can now also be chosen from existing widget fonts. Any specified field values are now honored and lead to a field with a preset value.
  • Added Annot.updateWidget which allows changing existing form fields -- including the field value.

Changes in Version 1.13.11

While the preceding patch subversions only contained various fixes, this version again introduces major new features:

  • Added basic support for PDF widget annotations. You can now add PDF form fields of types Text, CheckBox, ListBox and ComboBox. Where necessary, the PDF is transformed to a Form PDF with the first added widget.
  • Fixed issues #176 ("wrong file embedding"), #177 ("segment fault when invoking page.getText()") and #179 ("Segmentation fault using page.getLinks() on encrypted PDF").

Changes in Version 1.13.7

  • Added support of variable page sizes for reflowable documents (e-books, HTML, etc.): new parameters rect and fontsize in Document creation (open), and as a separate method Document.layout.
  • Added Annot creation of many annotation types: sticky notes, free text, circle, rectangle, line, polygon, polyline and text markers.
  • Added support of annotation transparency (Annot.opacity, Annot.setOpacity).
  • Changed Annot.vertices: point coordinates are now grouped as pairs of floats (no longer as separate floats).
  • Changed annotation colors dictionary: the two keys are now named "stroke" (formerly "common") and "fill".
  • Added Document.isDirty which is True if a PDF has been changed in this session. Reset to False on each Document.save or Document.write.
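
The new shape of Annot.vertices (coordinates grouped as pairs instead of a flat run of floats) can be illustrated as follows. A minimal sketch: pair_up is a hypothetical helper, and returning a list of tuples is an assumption about the container type.

```python
def pair_up(coords):
    """Group a flat sequence x0, y0, x1, y1, ... into (x, y) pairs (hypothetical helper)."""
    if len(coords) % 2:
        raise ValueError("expected an even number of coordinates")
    return [(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)]


old_style = [100.0, 200.0, 150.0, 250.0, 100.0, 250.0]
print(pair_up(old_style))
```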

Changes in Version 1.13.6

  • Fix #173: for memory-resident documents, ensure the stream object will not be garbage-collected by Python before document is closed.

Changes in Version 1.13.5

  • New low-level method Page._setContents defines an object given by its xref to serve as the contents object.
  • Changed and extended PDF form field support: the attribute widget_text has been renamed to Annot.widget_value. Values of all form field types (except signatures) are now supported. A new attribute Annot.widget_choices contains the selectable values of listboxes and comboboxes. All these attributes now contain None if no value is present.

Changes in Version 1.13.4

  • Document.convertToPDF now supports page ranges, reversed page sequences and page rotation. If the document already is a PDF, an exception is raised.
  • Fixed a bug (introduced with v1.13.0) that prevented Page.insertImage for transparent images.

Changes in Version 1.13.3

Introduces a way to convert any MuPDF supported document to a PDF. If you ever wanted PDF versions of your XPS, EPUB, CBZ or FB2 files -- here is a way to do this.

  • Document.convertToPDF returns a Python bytes object in PDF format. Can be opened like normal in PyMuPDF, or be written to disk with the ".pdf" extension.

Changes in Version 1.13.2

The major enhancement is PDF form field support. Form fields are annotations of type (19, 'Widget'). There is a new document method to check whether a PDF is a form. The Annot class has new properties describing field details.

  • Document.is_form_pdf is true if object type /AcroForm and at least one form field exists.
  • Annot.widget_type, Annot.widget_text and Annot.widget_name contain the details of a form field (i.e. a "Widget" annotation).

Changes in Version 1.13.1

  • TextPage.extractDICT is a new method to extract the contents of a document page (text and images). All document types are supported as with the other TextPage extract*() methods. The returned object is a dictionary of nested lists and other dictionaries, and exactly equal to the JSON-deserialization of the old TextPage.extractJSON. The difference is that the result is created directly -- no JSON module is used. Because the user needs no JSON module to interpret the information, it should be easier to use, and also have a better performance, because it contains images in their original binary format -- they need not be base64-decoded.
  • Page.getText correspondingly supports the new parameter value "dict" to invoke the above method.
  • TextPage.extractJSON (resp. Page.getText("json")) is still supported for convenience, but its use is expected to decline.
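
Why the dict form is easier to consume than the JSON form can be shown with the standard library alone. The image block below is a hypothetical, heavily simplified stand-in (the key names "type" and "image" are assumptions); the point is only that JSON cannot carry raw bytes, so images must be base64-encoded and decoded again.

```python
import base64
import json

# Hypothetical, simplified image block as a dict-style result might hold it:
# the image data is kept as raw bytes.
image_block = {"type": 1, "image": b"\x89PNG\r\n\x1a\n"}

# A JSON-style result cannot hold bytes, so the image must be base64-encoded ...
as_json = json.dumps(
    {"type": 1, "image": base64.b64encode(image_block["image"]).decode("ascii")}
)

# ... and the consumer needs a JSON parser plus a base64 decoder to get it back.
decoded = base64.b64decode(json.loads(as_json)["image"])

print(decoded == image_block["image"])
```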

Changes in Version 1.13.0

This version is based on MuPDF v1.13.0. This release is "primarily a bug fix release".

In PyMuPDF, we are also doing some bug fixes while introducing minor enhancements. There are only very minimal changes to the user's API.

  • Document construction is more flexible: the new filetype parameter allows setting the document type. If specified, any extension in the filename will be ignored. More completely addresses issue #156. As part of this, the documentation has been reworked.
  • Changes to Pixmap constructors:
    • Colorspace conversion no longer allows dropping the alpha channel: source and target alpha will now always be the same. We have seen exceptions and even interpreter crashes when using alpha = 0.
    • As a replacement, the simple pixmap copy lets you choose the target alpha.
  • Document.save again offers the full garbage collection range 0 thru 4. Because of a bug in xref maintenance, we had to temporarily enforce garbage > 1. Finally resolves issue #148.
  • Document.save now offers to "prettify" PDF source via an additional argument.
  • Page.insertImage has the additional stream parameter, specifying a memory area holding an image.
  • Issue with garbled PNGs on Linux systems has been resolved ("Problem writing PNG" #133).

Changes in Version 1.12.4

This is an extension of 1.12.3.

  • Fix of issue #147: methods Document.getPageFontlist and Document.getPageImagelist now also show fonts and images contained in resources nested via "Form XObjects".
  • Temporary fix of issue #148: Saving to new PDF files will now automatically use garbage = 2 if a lower value is given. Final fix is to be expected with MuPDF's next version. At that point we will remove this circumvention.
  • Preventive fix of illegally using stencil / image mask pixmaps in some methods.
  • Method Document.getPageFontlist now includes the encoding name for each font in the list.
  • Method Document.getPageImagelist now includes the decode method name for each image in the list.

Changes in Version 1.12.3

This is an extension of 1.12.2.

  • Many functions now return None instead of 0, if the result has no other meaning than just indicating successful execution (Document.close, Document.save, Document.select, Pixmap.save and many others).

Changes in Version 1.12.2

This is an extension of 1.12.1.

  • Method Page.show_pdf_page now accepts the new clip argument. This specifies an area of the source page to which the display should be restricted.
  • New Page.CropBox and Page.MediaBox have been included for convenience.

Changes in Version 1.12.1

This is an extension of version 1.12.0.

  • New method Page.show_pdf_page displays another PDF's page. This is a vector image and therefore remains precise across zooming. Both involved documents must be PDF.
  • New method Page.getSVGimage creates an SVG image from the page. In contrast to the raster image of a pixmap, this is a vector image format. The return is a unicode text string, which can be saved in a .svg file.
  • Method Page.getTextBlocks now accepts an additional bool parameter "images". If set to true (default is false), image blocks (metadata only) are included in the produced list and thus allow detecting areas with rendered images.
  • Minor bug fixes.
  • "text" result of Page.getText concatenates all lines within a block using a single space character. MuPDF's original uses "\n" instead, producing a rather ragged output.
  • New properties of Page objects Page.MediaBoxSize and Page.CropBoxPosition provide more information about a page's dimensions. For non-PDF files (and for most PDF files, too) these will be equal to Page.rect.bottom_right, resp. Page.rect.top_left. For example, class Shape makes use of them to correctly position its items.

Changes in Version 1.12.0

This version is based on and requires MuPDF v1.12.0. The new MuPDF version contains quite a number of changes -- most of them around text extraction. Some of the changes impact the programmer's API.

  • Outline.saveText and Outline.saveXML have been deleted without replacement. You probably haven't used them much anyway. But if you are looking for a replacement: the output of Document.get_toc can easily be used to produce something equivalent.
  • Class TextSheet no longer exists.
  • Text "spans" (one of the hierarchy levels of TextPage) no longer contain positioning information (i.e. no "bbox" key). Instead, spans now provide the font information for their text. This impacts our JSON output variant.
  • HTML output has improved very much: it now creates valid documents which can be displayed by browsers to produce a similar view as the original document.
  • There is a new output format XHTML, which provides text and images in a browser-readable format. The difference to HTML output is that no effort is made to reproduce the original layout.
  • All output formats of Page.getText now support creating complete, valid documents, by wrapping them with appropriate header and trailer information. If you are interested in using the HTML output, please make sure to read HTMLQuality.
  • To support finding text positions, we have added special methods that don't need detours like TextPage.extractJSON or TextPage.extractXML: use Page.getTextBlocks or resp. Page.getTextWords to create lists of text blocks or resp. words, which are accompanied by their rectangles. This should be much faster than the standard text extraction methods and also avoids using additional packages for interpreting their output.

Changes in Version 1.11.2

This is an extension of v1.11.1.

  • New Page.insertFont creates a PDF /Font object and returns its object number.
  • New Document.extractFont extracts the content of an embedded font given its object number.
  • Methods FontList(...) items no longer contain the PDF generation number. This value never had any significance. Instead, the font file extension is included (e.g. "pfa" for a "PostScript Font for ASCII"), which is more valuable information.
  • Fonts other than "simple fonts" (Type1) are now also supported.
  • New options to change Pixmap size:

    • Method Pixmap.shrink reduces the pixmap proportionally in place.
    • A new Pixmap copy constructor allows scaling via setting target width and height.

Changes in Version 1.11.1

This is an extension of v1.11.0.

  • New class Shape. It facilitates and extends the creation of image shapes on PDF pages. It contains multiple methods for creating elementary shapes like lines, rectangles or circles, which can be combined into more complex ones and be given common properties like line width or colors. Combined shapes are handled as a unit and can e.g. be "morphed" together. The class can accumulate multiple complex shapes and put them all in the page's foreground or background -- thus also reducing the number of updates to the page's contents object.
  • All Page draw methods now use the new Shape class.
  • Text insertion methods insertText() and insertTextBox() now support morphing in addition to text rotation. They have become part of the Shape class and thus allow text to be freely combined with graphics.
  • A new Pixmap constructor allows creating pixmap copies with an added alpha channel. A new method also allows directly manipulating alpha values.
  • Binary algebraic operations with geometry objects (matrices, rectangles and points) now generally also support lists or tuples as the second operand. You can add a tuple (x, y) of numbers to a Point. In this context, such sequences are called "point_like" (resp. matrix_like, rect_like).
  • Geometry objects now fully support in-place operators. For example, p /= m replaces point p with p * 1/m for a number m, or p * ~m for a matrix_like object m. Similarly, if r is a rectangle, then r |= (3, 4) is the new rectangle that also includes fitz.Point(3, 4), and r &= (1, 2, 3, 4) is its intersection with fitz.Rect(1, 2, 3, 4).
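
The two rectangle examples above, r |= (3, 4) and r &= (1, 2, 3, 4), can be traced with plain tuples. This is a sketch of the underlying arithmetic only: include_point and intersect are hypothetical helpers, and the special handling of empty or infinite rectangles is deliberately left out.

```python
def include_point(rect, point):
    """Smallest rectangle containing both rect and point (hypothetical helper)."""
    x0, y0, x1, y1 = rect
    x, y = point
    return (min(x0, x), min(y0, y), max(x1, x), max(y1, y))


def intersect(rect, other):
    """Common area of two overlapping rectangles (hypothetical helper)."""
    x0, y0, x1, y1 = rect
    a0, b0, a1, b1 = other
    return (max(x0, a0), max(y0, b0), min(x1, a1), min(y1, b1))


r = (0, 0, 2, 2)
r = include_point(r, (3, 4))    # what r |= (3, 4) expresses
print(r)
r = intersect(r, (1, 2, 3, 4))  # what r &= (1, 2, 3, 4) expresses
print(r)
```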

Changes in Version 1.11.0

This version is based on and requires MuPDF v1.11.

Though MuPDF has declared it as being mostly a bug fix version, one major new feature is indeed contained: support of embedded files -- also called portfolios or collections. We have extended PyMuPDF functionality to embrace this up to an extent just a little beyond the mutool utility as follows.

  • The Document class now supports embedded files with several new methods and one new property:

    • embfile_Info() returns metadata information about an entry in the list of embedded files. This is more than mutool currently provides: it shows all the information that was used to embed the file (not just the entry's name).
    • embfile_Get() retrieves the (decompressed) content of an entry into a bytes buffer.
    • embfile_Add(...) inserts new content into the PDF portfolio. We (in contrast to mutool) restrict this to entries with a new name (no duplicate names allowed).
    • embfile_Del(...) deletes an entry from the portfolio (function not offered in MuPDF).
    • embfile_SetInfo() -- changes filename or description of an embedded file.
    • embfile_Count -- contains the number of embedded files.
  • Several enhancements deal with streamlining geometry objects. These are not connected to the new MuPDF version and most of them are also reflected in PyMuPDF v1.10.0. Among them are new properties to identify the corners of rectangles by name (e.g. Rect.bottom_right) and new methods to deal with set-theoretic questions like Rect.contains(x) or IRect.intersects(x). Special effort focussed on supporting more "Pythonic" language constructs: if x in rect ... is equivalent to rect.contains(x).
  • The Rect chapter now has more background on empty and infinite rectangles and how we handle them. The handling itself was also updated for more consistency in this area.
  • We have started basic support for generation of PDF content:

    • Document.insert_page() adds a new page into a PDF, optionally containing some text.
    • Page.insertImage() places a new image on a PDF page.
    • Page.insertText() puts new text on an existing page
  • For FileAttachment annotations, content and name of the attached file can be extracted and changed.
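
The "Pythonic" construct mentioned above, if x in rect ... being equivalent to rect.contains(x), relies on Python's __contains__ hook. The class below is a hypothetical stand-in that handles points only; treating the border as part of the rectangle is an assumption of this sketch.

```python
class RectSketch:
    """Hypothetical stand-in for a rectangle; not the PyMuPDF Rect class."""

    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1

    def contains(self, point):
        # Assumption: points on the border count as contained.
        x, y = point
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

    def __contains__(self, point):
        # Makes "point in rect" call contains().
        return self.contains(point)


rect = RectSketch(0, 0, 2, 2)
print((1, 1) in rect, rect.contains((1, 1)))
print((3, 1) in rect, rect.contains((3, 1)))
```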

Changes in Version 1.10.0

MuPDF v1.10 Impact

MuPDF version 1.10 has a significant impact on our bindings. Some of the changes also affect the API -- in other words, you as a PyMuPDF user.

  • Link destination information has been reduced. Several properties of the linkDest class no longer contain valuable information. In fact, this class as a whole has been deleted from MuPDF's library and we in PyMuPDF only maintain it to provide compatibility to existing code.
  • In an effort to minimize memory requirements, several improvements have been built into MuPDF v1.10:

    • A new config.h file can be used to de-select unwanted features in the C base code. Using this feature we have been able to reduce the size of our binary _fitz.o / _fitz.pyd by about 50% (from 9 MB to 4.5 MB). When UPX-ing this, the size goes even further down to a very handy 2.3 MB.
    • The alpha (transparency) channel for pixmaps is now optional. Letting alpha default to False significantly reduces pixmap sizes (by 20% -- CMYK, 25% -- RGB, 50% -- GRAY). Many Pixmap constructors therefore now accept an alpha boolean to control inclusion of this channel. Other pixmap constructors (e.g. those for file and image input) create pixmaps with no alpha altogether. On the downside, save methods for pixmaps no longer accept a savealpha option: this channel will always be saved when present. To minimize code breaks, we have left this parameter in the call patterns -- it will just be ignored.
  • DisplayList and TextPage class constructors now require the mediabox of the page they are referring to (i.e. the page.bound() rectangle). There is no way to construct this information from other sources, therefore a source code change cannot be avoided in these cases. We assume however, that not many users are actually employing these rather low level classes explicitly. So the impact of that change should be minor.
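
The size reductions quoted above for pixmaps without alpha (20% for CMYK, 25% for RGB, 50% for GRAY) follow directly from the bytes stored per pixel. A short worked calculation, assuming one byte per color component and one byte for alpha; the names COMPONENTS and alpha_saving are chosen for this sketch only.

```python
# Number of color components per pixel, one byte each (assumption of this sketch).
COMPONENTS = {"GRAY": 1, "RGB": 3, "CMYK": 4}


def alpha_saving(colorspace):
    """Fraction of the pixel size saved by dropping the one-byte alpha channel."""
    n = COMPONENTS[colorspace]
    # With alpha a pixel takes n + 1 bytes, without alpha n bytes,
    # so the saving is 1 byte out of n + 1.
    return 1 / (n + 1)


for name in ("CMYK", "RGB", "GRAY"):
    print(name, "%d%%" % round(100 * alpha_saving(name)))
```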

Other Changes compared to Version 1.9.3

  • The new Document method write() writes an opened PDF to memory (as opposed to a file, like save() does).
  • An annotation can now be scaled and moved around on its page. This is done by modifying its rectangle.
  • Annotations can now be deleted. Page contains the new method deleteAnnot().
  • Various annotation attributes can now be modified, e.g. content, dates, title (= author), border, colors.
  • Method Document.insert_pdf() now also copies annotations of source pages.
  • The Pages class has been deleted. As documents can now be accessed with page numbers as indices (like doc[n] = doc.loadPage(n)), and document objects can be used as iterators, the benefit of this class was too low to maintain it. See the following comments.
  • loadPage(n) / doc[n] now accept arbitrary integers to specify a page number, as long as n < pageCount. So, e.g. doc[-500] is always valid and will load page (-500) % pageCount.
  • A document can now also be used as an iterator like this: for page in doc: ...<do something with "page"> .... This will yield all pages of doc as page.
  • The Pixmap method getSize() has been replaced with property size. As before Pixmap.size == len(Pixmap) is true.
  • In response to transparency (alpha) being optional, several new parameters and properties have been added to Pixmap and Colorspace classes to support determining their characteristics.
  • The Page class now contains new properties firstAnnot and firstLink to provide starting points to the respective class chains, where firstLink is just a mnemonic synonym to method loadLinks() which continues to exist. Similarly, the new property rect is a synonym for method bound(), which also continues to exist.
  • Pixmap methods samplesRGB() and samplesAlpha() have been deleted because pixmaps can now be created without transparency.
  • Rect now has a property irect which is a synonym of method round(). Likewise, IRect now has property rect to deliver a Rect which has the same coordinates as float values.
  • Document has the new method searchPageFor() to search for a text string. It works exactly like the corresponding Page.searchFor() with page number as additional parameter.
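
The page number rule described above (any integer n < pageCount is accepted and page n % pageCount is loaded) can be sketched as follows. normalize_page_number is a hypothetical helper that mirrors the rule as stated; it is not the PyMuPDF implementation, and raising IndexError for out-of-range values is an assumption.

```python
def normalize_page_number(n, page_count):
    """Map an arbitrary integer n < page_count to a valid 0-based page number."""
    if n >= page_count:
        raise IndexError("page number out of range")
    # Python's % always yields a result in range(page_count) for a positive divisor.
    return n % page_count


page_count = 7
print(normalize_page_number(-1, page_count))    # last page
print(normalize_page_number(-500, page_count))  # (-500) % 7
```

With 7 pages, -500 maps to page 4, because -500 + 72 * 7 = 4.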

Changes in Version 1.9.3

This version is also based on MuPDF v1.9a. Changes compared to version 1.9.2:

  • As a major enhancement, annotations are now supported in a similar way as links. Annotations can be displayed (as pixmaps) and their properties can be accessed.
  • In addition to the document select() method, some simpler methods can now be used to manipulate a PDF:

    • copyPage() copies a page within a document.
    • movePage() is similar, but deletes the original.
    • delete_page() deletes a page
    • delete_pages() deletes a page range
  • rotation or setRotation() access or change a PDF page's rotation, respectively.
  • Available but undocumented before, IRect, Rect, Point and Matrix support the len() method and their coordinate properties can be accessed via indices, e.g. IRect.x1 == IRect[2].
  • For convenience, documents now support simple indexing: doc.loadPage(n) == doc[n]. The index may however be in range -pageCount < n < pageCount, such that doc[-1] is the last page of the document.
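
Index access to coordinates, as in IRect.x1 == IRect[2] above, is what Python's sequence protocol provides. The class below is a hypothetical stand-in showing the two methods involved; it is not the PyMuPDF IRect class.

```python
class IRectSketch:
    """Hypothetical stand-in for an integer rectangle with index access."""

    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1

    def __len__(self):
        # A rectangle always has four coordinates.
        return 4

    def __getitem__(self, i):
        # Index order follows the attribute order x0, y0, x1, y1.
        return (self.x0, self.y0, self.x1, self.y1)[i]


r = IRectSketch(10, 20, 30, 40)
print(len(r), r.x1 == r[2])
```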

Changes in Version 1.9.2

This version is also based on MuPDF v1.9a. Changes compared to version 1.9.1:

  • fitz.open() (no parameters) creates a new empty PDF document, i.e. if saved afterwards, it must be given a .pdf extension.
  • Document now accepts all of the following formats (Document and open are synonyms):

    • open(),
    • open(filename) (equivalent to open(filename, None)),
    • open(filetype, area) (equivalent to open(filetype, stream = area)).

    Type of memory area stream may be bytes or bytearray. Thus, e.g. area = open("file.pdf", "rb").read() may be used directly (without first converting it to bytearray).

  • New method Document.insert_pdf() (PDFs only) inserts a range of pages from another PDF.
  • Document objects doc now support the len() function: len(doc) == doc.pageCount.
  • New method Document.getPageImageList() creates a list of images used on a page.
  • New method Document.getPageFontList() creates a list of fonts referenced by a page.
  • New pixmap constructor fitz.Pixmap(doc, xref) creates a pixmap based on an opened PDF document and an xref number of the image.
  • New pixmap constructor fitz.Pixmap(cspace, spix) creates a pixmap as a copy of another one spix with the colorspace converted to cspace. This works for all colorspace combinations.
  • Pixmap constructor fitz.Pixmap(colorspace, width, height, samples) now allows samples to also be bytes, not only bytearray.

Changes in Version 1.9.1

This version of PyMuPDF is based on MuPDF library source code version 1.9a published on April 21, 2016.

Please have a look at MuPDF's website to see which changes and enhancements are contained herein.

Changes in version 1.9.1 compared to version 1.8.0 are the following:

  • New methods get_area() for both fitz.Rect and fitz.IRect
  • Pixmaps can now be created directly from files using the new constructor fitz.Pixmap(filename).
  • The Pixmap constructor fitz.Pixmap(image) has been extended accordingly.
  • fitz.Rect can now be created with all possible combinations of points and coordinates.
  • PyMuPDF classes and methods now all contain __doc__ strings, most of them created by SWIG automatically. While the PyMuPDF documentation certainly is more detailed, this feature should help a lot when programming in Python-aware IDEs.
  • A new document method getPermits() returns the permissions associated with the current access to the document (print, edit, annotate, copy), as a Python dictionary.
  • The identity matrix fitz.Identity is now immutable.
  • The new document method select(list) removes all pages from a document that are not contained in the list. Pages can also be duplicated and re-arranged.
  • Various improvements and new members in our demo and examples collections. Perhaps most prominently: PDF_display now supports scrolling with the mouse wheel, and there is a new example program wxTableExtract which allows graphically identifying and extracting table data in documents.
  • fitz.open() is now an alias of fitz.Document().
  • New pixmap method tobytes() which will return a bytearray formatted as a PNG image of the pixmap.
  • New pixmap method samplesRGB() providing a samples version with alpha bytes stripped off (RGB colorspaces only).
  • New pixmap method samplesAlpha() providing the alpha bytes only of the samples area.
  • New iterator fitz.Pages(doc) over a document's set of pages.
  • New matrix methods invert() (calculate inverted matrix), concat() (calculate matrix product), pretranslate() (perform a shift operation).
  • New IRect methods intersect() (intersection with another rectangle), translate() (perform a shift operation).
  • New Rect methods intersect() (intersection with another rectangle), transform() (transformation with a matrix), include_point() (enlarge rectangle to also contain a point), include_rect() (enlarge rectangle to also contain another one).
  • Documented Point.transform() (transform a point with a matrix).
  • Matrix, IRect, Rect and Point classes now support compact, algebraic formulations for manipulating such objects.
  • Incremental saves for changes are possible now using the call pattern doc.save(doc.name, incremental=True).
  • A PDF's metadata can now be deleted, set or changed by document method set_metadata(). Supports incremental saves.
  • A PDF's bookmarks (or table of contents) can now be deleted, set or changed with the entries of a list using document method set_toc(list). Supports incremental saves.
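
The effect of select(list) described above (pages not in the list are removed, and pages can be duplicated and re-arranged) can be modelled on an ordinary Python list. A minimal sketch: select_pages is a hypothetical helper and the strings stand in for pages.

```python
def select_pages(pages, wanted):
    """Return the pages in the order given by wanted (hypothetical helper).

    Page numbers missing from wanted are dropped; repeated numbers duplicate a page.
    """
    return [pages[i] for i in wanted]


pages = ["cover", "intro", "chapter 1", "appendix"]
# Keep pages 2 and 0, show page 0 twice, drop pages 1 and 3.
print(select_pages(pages, [2, 0, 0]))
```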