DOC: CMaps #811

Merged
merged 2 commits into from
Apr 24, 2022
52 changes: 52 additions & 0 deletions docs/dev/cmaps.md
@@ -0,0 +1,52 @@
# CMaps

To look at the CMap of "crazyones", first uncompress the PDF:

```bash
pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
```

The uncompressed file then contains this CMap:

```text
begincmap
/CMapName /T1Encoding-UTF16 def
/CMapType 2 def
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<1B> <FB00>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
```

## codespacerange

A codespacerange declares which byte sequences are valid character codes.
The mapping to Unicode itself comes from `bfchar` entries such as:

```text
1 beginbfchar
<1B> <FB00>
```

That means that `1B` (hex for 27) maps to the Unicode character [`FB00`](https://unicode-table.com/en/FB00/) - the ligature ff (two lowercase f's).

The two numbers in `begincodespacerange` are the lowest and the highest valid
character code: `<00> <FF>` means that every code is exactly one byte long and
lies between 00 and FF (dec: 0 to 255). They are bounds on the input codes, not
offsets that are added to `1B`.
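
As a rough illustration, the two sections of the CMap can be read with a few lines of Python. This is a minimal sketch, not PyPDF2's implementation: the function name `parse_simple_cmap` is hypothetical, and the sketch only handles `codespacerange` and `bfchar` entries written as `<hex> <hex>` pairs.

```python
import re


def parse_simple_cmap(cmap_text):
    """Hypothetical helper: return (codespace ranges, bfchar mapping)."""
    hex_pair = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

    def section(name):
        # Text between every "begin<name>" / "end<name>" pair.
        pattern = rf"begin{name}(.*?)end{name}"
        return "".join(re.findall(pattern, cmap_text, flags=re.DOTALL))

    # Codespace ranges are bounds on the input codes.
    ranges = [
        (int(low, 16), int(high, 16))
        for low, high in hex_pair.findall(section("codespacerange"))
    ]
    # bfchar entries map one input code to a UTF-16BE encoded string.
    bfchar = {
        int(src, 16): bytes.fromhex(dst).decode("utf-16-be")
        for src, dst in hex_pair.findall(section("bfchar"))
    }
    return ranges, bfchar


CMAP = """\
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<1B> <FB00>
endbfchar
"""

ranges, bfchar = parse_simple_cmap(CMAP)
print(ranges)  # [(0, 255)]
print(bfchar)  # code 27 (hex 1B) maps to the ff ligature U+FB00
```

Keeping the ranges and the mapping in separate structures mirrors the CMap itself: the first says which codes may occur, the second says what they stand for.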

Within the text stream, there is

```text
(The)-342(mis\034ts.)
```

`\034` is octal for 28 decimal (hex `1C`), so this byte is the character code that gets looked up in the CMap.
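
The octal escape can be checked with a short Python sketch. The helper name `decode_octal_escapes` is hypothetical, and the sketch deliberately ignores the other escapes a PDF literal string may contain (such as `\n` or `\(`).

```python
import re


def decode_octal_escapes(literal):
    """Hypothetical helper: turn \\ddd octal escapes into the bytes they denote."""

    def to_char(match):
        # Interpret the digits as base 8; keep only the low byte.
        return chr(int(match.group(1), 8) & 0xFF)

    return re.sub(r"\\([0-7]{1,3})", to_char, literal).encode("latin-1")


raw = decode_octal_escapes(r"mis\034ts.")
print(raw)          # b'mis\x1cts.'
print(raw[3])       # 28
print(hex(raw[3]))  # 0x1c
```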
14 changes: 13 additions & 1 deletion docs/dev/pdf-format.md
@@ -84,7 +84,7 @@ startxref 1234

Let's go through it:

- * `trailer <<` indicates that the *trailer dictionary` starts. It ends with `>>`.
+ * `trailer <<` indicates that the *trailer dictionary* starts. It ends with `>>`.
* `startxref` is a keyword followed by the byte-location of the `xref` keyword.
As the trailer is always at the bottom of the file, this allows readers to
quickly find the xref table.
@@ -99,3 +99,15 @@ Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required
* `R` is the keyword that indicates that the object is a reference to the
catalog dictionary.
* `/Size` (integer) contains the total number of entries in the file's xref table.
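
To make the role of `startxref` concrete, here is a minimal Python sketch of how a reader might locate the xref offset by looking only at the end of the file. The function name `find_startxref` and the `tail_size` parameter are hypothetical, and the sample bytes are a made-up file tail, not a valid PDF.

```python
import re


def find_startxref(pdf_bytes, tail_size=1024):
    """Hypothetical helper: return the byte offset that follows `startxref`."""
    # The trailer sits at the bottom, so only the tail is inspected.
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found near the end of the file")
    # If the keyword occurs more than once, the last one is used.
    return int(matches[-1])


# Made-up file tail for illustration only.
sample = b"trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"
offset = find_startxref(sample)
print(offset)  # 1234
```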


## Reading PDF files

Most PDF files are compressed. If you want to read them, first uncompress them:

```bash
pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
```

Then rename `crazyones-uncomp.pdf` to `crazyones-uncomp.txt` and open it in
your favorite IDE / text editor.
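
"Compressed" usually refers to the stream objects inside the file. As a hedged illustration, assuming a stream that uses the `/FlateDecode` filter (other filters exist and are not covered here), its data is zlib/deflate compressed and can be inflated with Python's `zlib` module. The helper name `inflate` and the sample content are made up for this sketch.

```python
import zlib


def inflate(stream_data):
    """Hypothetical helper: decompress the data of a /FlateDecode stream."""
    return zlib.decompress(stream_data)


# Made-up content stream, compressed here only to have something to inflate.
content = b"BT /F1 12 Tf (The) Tj ET"
compressed = zlib.compress(content)

print(compressed == content)           # False: the stored bytes are not readable
print(inflate(compressed) == content)  # True: inflating restores the text
```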
1 change: 1 addition & 0 deletions docs/index.rst
@@ -54,6 +54,7 @@ You can contribute to `PyPDF2 on Github <https://github.com/py-pdf/PyPDF2>`_.

   dev/intro
   dev/pdf-format
   dev/cmaps

.. toctree::
   :caption: About PyPDF2