# Accessing Standard Text Files as PyMuPDF Documents

This notebook demonstrates the basic features and specifics of the TXT Document type.

In [55]:
import fitz

# Open a (non-ASCII) text file as a Document from memory data.
# The following text is from the Wikipedia website https://en.wikipedia.org/wiki/Beijing
text = """Over the past 3,000 years, the city of Beijing has had numerous other names. The name Beijing, which means "Northern Capital" (from the Chinese characters 北 běi for north and 京 jīng for capital), was applied to the city in 1403 during the Ming dynasty to distinguish the city from Nanjing (the "Southern Capital").[31] The English spelling Beijing is based on the government's official romanization (adopted in the 1980s) of the two characters as they are pronounced in Standard Mandarin."""

doc = fitz.open("txt", text.encode())  # open a Document from memory data
print(f"page count: {doc.page_count}")
print(f"Document metadata 'format' value: '{doc.metadata['format']}'")
page = doc[0]
print(f"page dimension ({page.rect.width} x {page.rect.height})\n")
print(page.get_text())


page count: 1
Document metadata 'format' value: 'Tex'
page dimension (400.0 x 600.0)

Over the past 3,000 years, the city of Beijing has
had numerous other names. The name Beijing, which
means "Northern Capital" (from the Chinese characters
北 běi for north and 京 jīng for capital), was applied
to the city in 1403 during the Ming dynasty to
distinguish the city from Nanjing (the "Southern
Capital").[31] The English spelling Beijing is based
on the government's official romanization (adopted in
the 1980s) of the two characters as they are
pronounced in Standard Mandarin.



The Document was created with default values: a page size of 400 x 600 points and a font size of 11 points. The left and right margins are fixed at twice the font size (22 points each), so line breaks are generated after approximately 53 characters.
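
The figure of roughly 53 characters can be reproduced with a little arithmetic. The sketch below is only an estimate: it assumes every character is a monospaced Latin glyph whose width is `0.6 * fontsize` (the value measured at the end of this notebook). The function name `chars_per_line` and the parameter `glyph_factor` are made up for this illustration and are not part of PyMuPDF.

```python
import math


def chars_per_line(page_width: float, fontsize: float, glyph_factor: float = 0.6) -> int:
    """Estimate how many monospaced Latin characters fit on one line.

    Assumes left and right margins of 2 * fontsize each and a glyph
    width of glyph_factor * fontsize (hypothetical helper, not PyMuPDF API).
    """
    margin = 2 * fontsize
    usable_width = page_width - 2 * margin
    return math.floor(usable_width / (glyph_factor * fontsize))


default_estimate = chars_per_line(400, 11)  # default page width and font size
print(default_estimate)  # 53
```

Lines in the output above are often shorter than this estimate because breaks happen at word boundaries.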

We now lay out the document again, using the ISO A4 page format and a font size of 10 points.

Then we print the page text again to see where the line breaks occur now. Note that the margins can only be changed indirectly, by changing the font size.

In [56]:
# alter the page layout
doc.layout(rect=fitz.paper_rect("a4"), fontsize=10)

# we must create the page again after this
page = doc[0]
print(f"page dimension ({page.rect.width} x {page.rect.height})\n")
print(page.get_text())

page dimension (595.0 x 842.0)

Over the past 3,000 years, the city of Beijing has had numerous other names. The name
Beijing, which means "Northern Capital" (from the Chinese characters 北 běi for north and 京
jīng for capital), was applied to the city in 1403 during the Ming dynasty to distinguish
the city from Nanjing (the "Southern Capital").[31] The English spelling Beijing is based on
the government's official romanization (adopted in the 1980s) of the two characters as they
are pronounced in Standard Mandarin.
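
The page dimension of 595 x 842 reported above is consistent with the ISO A4 size of 210 mm x 297 mm converted to points (1 point = 1/72 inch). The following sketch shows that conversion; the helper `mm_to_points` is a name chosen for this illustration, and rounding to whole points is an assumption based on the values printed above.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72


def mm_to_points(mm: float) -> float:
    """Convert millimetres to points (hypothetical helper, not PyMuPDF API)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


a4_width_pt = mm_to_points(210)   # about 595.28
a4_height_pt = mm_to_points(297)  # about 841.89
print(round(a4_width_pt), round(a4_height_pt))  # 595 842
```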



The following code extracts the text with all details down to text line level. It demonstrates that each line's bounding box starts at x = 20 (2 * font size) and ends no later than x = 575 (page width 595 - 20).

Each line's text is printed between its left and right x-coordinates (in points), which are shown in parentheses.

In [57]:
# Text lines start at 2 * font size and end no later than page width - 2 * font size:
print(f"Left margin 20, right margin {page.rect.width-20}.\n")
text_items = page.get_text("dict")
for block in text_items["blocks"]:
    for line in block["lines"]:
        text = "".join([s["text"] for s in line["spans"]])
        print(f"({line['bbox'][0]}) {text} ({line['bbox'][2]})")

Left margin 20, right margin 575.0.

(20.0) Over the past 3,000 years, the city of Beijing has had numerous other names. The name (530.0)
(20.0) Beijing, which means "Northern Capital" (from the Chinese characters 北 běi for north and 京 (568.0)
(20.0) jīng for capital), was applied to the city in 1403 during the Ming dynasty to distinguish (554.0)
(20.0) the city from Nanjing (the "Southern Capital").[31] The English spelling Beijing is based on (572.0)
(20.0) the government's official romanization (adopted in the 1980s) of the two characters as they (566.0)
(20.0) are pronounced in Standard Mandarin. (236.0)
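
The margin rule can also be checked programmatically rather than by eye. The sketch below uses the six `(x0, x1)` pairs copied from the output above as plain data, so it runs without PyMuPDF; the function name `within_margins` is hypothetical.

```python
fontsize = 10
page_width = 595.0

# (left, right) x-coordinates of each text line, copied from the output above
line_extents = [
    (20.0, 530.0),
    (20.0, 568.0),
    (20.0, 554.0),
    (20.0, 572.0),
    (20.0, 566.0),
    (20.0, 236.0),
]


def within_margins(extents, page_width, fontsize):
    """Return True if every line lies between the two margins of 2 * fontsize."""
    left = 2 * fontsize
    right = page_width - 2 * fontsize
    return all(x0 >= left and x1 <= right for x0, x1 in extents)


all_inside = within_margins(line_extents, page_width, fontsize)
print(all_inside)  # True
```

In a live session, the same pairs could be collected from `line["bbox"][0]` and `line["bbox"][2]` in the loop shown above.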


As a quick check, the Chinese character for "capital" ("京", at the end of the second line above) indeed ends where its line ends: the right edge of the rectangle found below equals the line's right boundary of 568.

In [58]:
page.search_for("京")  # search for Chinese "capital"

[Rect(558.0, 40.5703125, 568.0, 53.65625)]

Depending on the Unicode code point of each character, MuPDF selects an appropriate font to represent it. In our case, all characters are either extended Latin (for which the Courier equivalent "Nimbus Mono PS Regular" is taken) or Chinese (for which "Droid Sans Fallback Regular" is chosen).

In [59]:
# MuPDF has automatically selected the necessary fonts
fonts = set()
for block in text_items["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            fonts.add(span["font"])
print(f"Used fonts: {fonts}.")

Used fonts: {'Droid Sans Fallback Regular', 'Nimbus Mono PS Regular'}.


Finally, let us check which character widths we encounter per font. Text file Documents should mainly be rendered using monospaced fonts.

In our example two fonts are used, so we should expect no more than two different character widths.

In [60]:
widths = set()  # store pairs of (character width, font name) here
for block in page.get_text("rawdict")["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            font = span["font"]
            for char in span["chars"]:
                bbox = fitz.Rect(char["bbox"])
                widths.add((bbox.width, font))
print(f"character widths per font: {widths}")

character widths per font: {(10.0, 'Droid Sans Fallback Regular'), (6.0, 'Nimbus Mono PS Regular')}
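
With these two widths, the right edge of a line can be predicted from its text alone. The sketch below does this for the second line of the A4 layout. It is an approximation under stated assumptions: characters whose Unicode East Asian Width is "W" or "F" are treated as full-width (`fontsize`), and all others as `0.6 * fontsize`. This classification is a stand-in for MuPDF's actual font selection, which this notebook does not describe, and `predicted_line_end` is a hypothetical helper.

```python
import unicodedata


def predicted_line_end(text: str, fontsize: float) -> float:
    """Predict the x-coordinate where a line ends (hypothetical helper).

    Assumes a left margin of 2 * fontsize, full-width glyphs of width
    fontsize, and all other glyphs of width 0.6 * fontsize.
    """
    x = 2 * fontsize  # the line starts at the left margin
    for ch in unicodedata.normalize("NFC", text):
        is_wide = unicodedata.east_asian_width(ch) in ("W", "F")
        x += fontsize if is_wide else 0.6 * fontsize
    return x


second_line = 'Beijing, which means "Northern Capital" (from the Chinese characters 北 běi for north and 京'
end_x = predicted_line_end(second_line, 10)
print(round(end_x, 1))  # 568.0
```

The result agrees with the right boundary of 568 reported for that line: 88 Latin characters at 6 points plus 2 Chinese characters at 10 points, starting at x = 20.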


## Conclusions

* Similar to e-books, TXT Documents are **_reflowable_**. Therefore, page dimension and font size can be changed at any time.
* All characters are regarded as having the same font size, which is set when the Document is opened or via `doc.layout()` (in points, 1 point = 1/72 inch).
* Pages have fixed left and right margins of twice the font size.
* Text may have been written in ASCII, UTF-8 or UTF-16 encoding. Therefore, any language is acceptable and a wide range of files may be accessed as TXT documents.
* Depending on the Unicode encountered, a suitable font is chosen from the [Noto](https://notofonts.github.io/) fonts.
    - In our example, Latin characters use the 'Nimbus Mono PS Regular' (Courier) font and have a width of `0.6 * fontsize`. Chinese (also Japanese and Korean) characters are represented by the 'Droid Sans Fallback Regular' font and have a width of `fontsize`.
* When TXT Documents are created from files, file name extensions **".txt"** and **".text"** are automatically recognized. In other cases, the `filetype` parameter must be used.
* Like any other document type, TXT Documents may also be **_opened from memory_** via `fitz.open("txt", binary_data)`, where `binary_data` is a `bytes` object containing the encoded text.
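
As a small illustration of the encoding point, the sketch below prepares the same string as UTF-8 and as UTF-16 bytes using only the Python standard library. Either result is the kind of `bytes` object that the memory-based open expects. The sample string is invented for this example, and whether MuPDF depends on the byte-order mark to recognize UTF-16 is an assumption that this notebook does not confirm.

```python
sample = "北京 Beijing"  # illustrative text, not taken from a real file

utf8_bytes = sample.encode("utf-8")    # same as sample.encode(), as used at the top
utf16_bytes = sample.encode("utf-16")  # Python prepends a byte-order mark (BOM)

# Both payloads decode back to the identical text.
roundtrip_ok = (
    utf8_bytes.decode("utf-8") == sample
    and utf16_bytes.decode("utf-16") == sample
)
print(roundtrip_ok, len(utf8_bytes), len(utf16_bytes))

# Assumption: either payload could then be passed on as
# fitz.open("txt", utf8_bytes), following the pattern used earlier.
```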