DOC: Minor improvements #2542

Merged
merged 3 commits into from
Mar 26, 2024
Changes from 2 commits
16 changes: 8 additions & 8 deletions docs/dev/pdf-format.md
@@ -1,6 +1,6 @@
# The PDF Format

It's recommended to look in the PDF specification for details and clarifications.
It is recommended to look in the PDF specification for details and clarifications.
This is only intended to give a very rough overview of the format.

## Overall Structure
@@ -32,7 +32,7 @@ Let's go through it step-by-step:

* `xref` is just a keyword that specifies the start of the xref table.
* `42` is the numerical ID of the first object in this xref section; `5` is the number of entries in the xref table.
* Now every object has 3 entries `nnnnnnnnnn ggggg n`: The 10-digit byte offset,
* Now every object has 3 entries `nnnnnnnnnn ggggg n`: a 10-digit byte offset,
a 5-digit generation number, and a literal keyword which is either `n` or `f`.
* `nnnnnnnnnn` is the byte offset of the object. It tells the reader where
the object is in the file.
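
As a rough illustration of the entry layout described above, the following sketch splits a single xref entry into its three fields. `XrefEntry` and `parse_xref_entry` are hypothetical names invented for this example and are not part of pypdf; the sample entry is made up.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int      # 10-digit field: byte offset of the object (for `n` entries)
    generation: int  # 5-digit field: generation number
    in_use: bool     # True for the keyword `n`, False for `f`


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one `nnnnnnnnnn ggggg n` entry (hypothetical helper)."""
    offset, generation, keyword = line.split()
    if len(offset) != 10 or len(generation) != 5 or keyword not in ("n", "f"):
        raise ValueError(f"not a valid xref entry: {line!r}")
    return XrefEntry(int(offset), int(generation), keyword == "n")


entry = parse_xref_entry("0000000016 00000 n")
print(entry.offset, entry.generation, entry.in_use)
```

A reader would seek to `entry.offset` in the file to find the object, which is why the fixed-width layout matters.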
@@ -49,10 +49,10 @@ Let's go through it step-by-step:

The body is a sequence of indirect objects:

`counter generationnumber << the_object >> endobj`
`counter generation_number << the_object >> endobj`

* `counter` (integer) is a unique identifier for the object.
* `generationnumber` (integer) is the generation number of the object.
* `generation_number` (integer) is the generation number of the object.
* `the_object` is the object itself. It can be empty. Starts with `/Keyword` to
specify which kind of object it is.
* `endobj` marks the end of the object.
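
To make the pattern above more concrete, here is a minimal sketch that scans raw bytes for indirect objects. It relies on the fact that in an actual file the `obj` keyword follows the generation number (for example `1 0 obj`). `find_indirect_objects` is a hypothetical helper for illustration, not how pypdf locates objects, and the naive search would be confused by binary stream data that happens to contain these keywords.

```python
import re

# Matches the header of an indirect object, e.g. b"12 0 obj".
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def find_indirect_objects(data: bytes) -> list[tuple[int, int, bytes]]:
    """Return (counter, generation_number, body) for each object found."""
    found = []
    for match in OBJ_HEADER.finditer(data):
        end = data.find(b"endobj", match.end())
        if end == -1:
            break  # truncated object; stop scanning
        body = data[match.end():end].strip()
        found.append((int(match.group(1)), int(match.group(2)), body))
    return found


# Made-up sample body with two objects.
sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n"
)
for counter, generation, body in find_indirect_objects(sample):
    print(counter, generation, body)
```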
@@ -91,11 +91,11 @@ Let's go through it:
* `%%EOF` is the end-of-file marker.

The trailer dictionary is a key-value list. The keys are specified in
Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).
Table 15 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).

* `/Root` (dictionary) contains the document catalog.
* The `5` is the object number of the catalog dictionary
* `0` is the generation number of the catalog dictionary
* The `5` is the object number of the catalog dictionary.
* `0` is the generation number of the catalog dictionary.
* `R` is the keyword that indicates that the object is a reference to the
catalog dictionary.
* `/Size` (integer) contains the total number of entries in the file's xref table.
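
The reference syntax described above can be sketched with a small parser. This is only an illustration under simplifying assumptions: `parse_reference` is a hypothetical helper, the trailer string is made up, and real trailer dictionaries need a proper tokenizer rather than regular expressions.

```python
import re


def parse_reference(token: str) -> tuple[int, int]:
    """Split an indirect reference such as "5 0 R" into (object number, generation)."""
    match = re.fullmatch(r"(\d+)\s+(\d+)\s+R", token.strip())
    if match is None:
        raise ValueError(f"not an indirect reference: {token!r}")
    return int(match.group(1)), int(match.group(2))


# Made-up trailer dictionary containing only the two required keys.
trailer = "<< /Root 5 0 R /Size 6 >>"

root = re.search(r"/Root\s+(\d+\s+\d+\s+R)", trailer)
size = re.search(r"/Size\s+(\d+)", trailer)

print(parse_reference(root.group(1)))  # object number and generation of the catalog
print(int(size.group(1)))              # number of entries in the xref table
```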
@@ -110,4 +110,4 @@ pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
```

Then rename `crazyones-uncomp.pdf` to `crazyones-uncomp.txt` and open it in
our favorite IDE / text editor.
your favorite IDE / text editor.
8 changes: 4 additions & 4 deletions docs/dev/pypdf-parsing.md
@@ -13,14 +13,14 @@ structure of parsing:
proceeds to parse the objects in the PDF. Objects in a PDF can be of various
types such as dictionaries, arrays, streams, and simple data types (e.g.,
integers, strings). pypdf parses these objects and stores them in
{py:meth}`PdfReader.resolved_objects <pypdf.PdfReader.resolved_objects>`
via {py:meth}`cache_indirect_object <pypdf.PdfReader.cache_indirect_object>`.
{py:meth}`PdfReader.resolved_objects <pypdf.PdfReader.resolved_objects>`,
populated by {py:meth}`cache_indirect_object <pypdf.PdfReader.cache_indirect_object>`.
3. **Decoding content streams**: The content of a PDF is typically stored in
content streams, which are sequences of PDF operators and operands. pypdf
decodes these content streams by applying filters (e.g., `FlateDecode`,
`LZWDecode`) specified in the stream's dictionary. This is only done when the
object is requested via {py:meth}`PdfReader.get_object
<pypdf.PdfReader.get_object>` in the `PdfReader._get_object_from_stream` method.
object is requested by {py:meth}`PdfReader.get_object
<pypdf.PdfReader.get_object>`, which uses the `PdfReader._get_object_from_stream` method.
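
As a minimal sketch of what applying a filter means in step 3: `FlateDecode` data is zlib/deflate-compressed, so the core of decoding can be shown with the standard library alone. The content stream below is made up and `flate_decode` is a hypothetical helper; real decoders also have to handle additional decode parameters and damaged data.

```python
import zlib

# A tiny, made-up content stream: operators and operands that show the text "Hello".
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# What a writer could store in the file when the stream's /Filter is /FlateDecode.
encoded = zlib.compress(content)


def flate_decode(data: bytes) -> bytes:
    """Undo the FlateDecode filter for the simplest case (hypothetical helper)."""
    return zlib.decompress(data)


decoded = flate_decode(encoded)
print(decoded == content)
```

Deferring this step until an object is requested avoids decompressing streams that are never read.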

## References

30 changes: 15 additions & 15 deletions docs/user/extract-text.md
@@ -1,6 +1,6 @@
# Extract Text from a PDF

You can extract text from a PDF like this:
You can extract text from a PDF:

```python
from pypdf import PdfReader
@@ -10,7 +10,7 @@ page = reader.pages[0]
print(page.extract_text())
```

You can also choose to limit the text orientation you want to extract, e.g:
You can also choose to limit the text orientation you want to extract:

```python
# extract only text oriented up
@@ -42,7 +42,7 @@ Refer to [extract\_text](../modules/PageObject.html#pypdf._page.PageObject.extra

## Using a visitor

You can use visitor-functions to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.
You can use visitor functions to control which part of a page you want to process and extract. The visitor functions you provide will get called for each operator or for each text fragment.

The function provided in argument visitor_text of function extract_text has five arguments:
* text: the current text (as long as possible, can be up to a full line)
@@ -51,19 +51,19 @@ The function provided in argument visitor_text of function extract_text has five
* font-dictionary: full font dictionary
* font-size: the size (in text coordinate space)

The matrix stores 6 parameters. The first 4 provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical)
The matrix stores six parameters. The first four provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical).
It is recommended to use the user_matrix as it takes all transformations into account.

Notes:

- as indicated in the PDF 1.7 reference, page 204 the user matrix applies to text space/image space/form space/pattern space.
- if you want to get the full transformation from text to user space, you can use the `mult` function (availalbe in global import) as follows:
`txt2user = mult(tm, cm))`
The font-size is the raw text size, that is affected by the `user_matrix`
- As indicated in §8.3.3 of the PDF 1.7 or PDF 2.0 specification, the user matrix applies to text space/image space/form space/pattern space.
- If you want to get the full transformation from text to user space, you can use the `mult` function (available in global import) as follows:
`txt2user = mult(tm, cm)`.
The font-size is the raw text size, that is affected by the `user_matrix`.
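
To illustrate how the six matrix parameters act on a point and how two matrices combine, here is a minimal pure-Python sketch. It assumes the usual PDF convention that `[a, b, c, d, e, f]` maps `(x, y)` to `(a*x + c*y + e, b*x + d*y + f)`. The helpers `compose` and `apply` are hypothetical stand-ins written for this example (not pypdf functions), and the matrix values are made up.

```python
from typing import List, Tuple

Matrix = List[float]  # [a, b, c, d, e, f]


def compose(m: Matrix, n: Matrix) -> Matrix:
    """Return the matrix that applies m first and n second (hypothetical helper)."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def apply(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Transform the point (x, y): the first four values rotate/scale, the last two translate."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


tm = [1, 0, 0, 1, 10, 20]  # made-up text matrix: translate by (10, 20)
cm = [2, 0, 0, 2, 0, 0]    # made-up current transformation matrix: scale by 2

txt2user = compose(tm, cm)
print(txt2user)
print(apply(txt2user, 1, 1))
```

Applying the combined matrix gives the same result as applying `tm` and then `cm` one after the other, which is the point of combining them once up front.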


The font-dictionary may be None in case of unknown fonts.
If not None it may e.g. contain key "/BaseFont" with value "/Arial,Bold".
If not None it could contain something like key "/BaseFont" with value "/Arial,Bold".

**Caveat**: In complicated documents the calculated positions may be difficult to determine (if you move from multiple forms to page user space, for example).

@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.

### Example 1: Ignore header and footer

The following example reads the text of page 4 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores header (y < 720) and footer (y > 50).
The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y > 720) and footer (y < 50).

```python
from pypdf import PdfReader
@@ -97,10 +97,10 @@ print(text_body)

### Example 2: Extract rectangles and texts into an SVG file

The following example converts page 3 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into a
The following example converts page three of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into an
[SVG file](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics).

Such a SVG export may help to understand whats going on in a page.
Such an SVG export may help to understand what is going on in a page.

```python
from pypdf import PdfReader
@@ -131,13 +131,13 @@ dwg.save()

The SVG generated here is bottom-up because the coordinate systems of PDF and SVG differ.

Unfortunately in complicated PDF documents the coordinates given to the visitor-functions may be wrong.
Unfortunately, in complicated PDF documents, the coordinates given to the visitor functions may be wrong.

## Why Text Extraction is hard

### Unclear Objective

Extracting text from a PDF can be pretty tricky. In several cases there is no
Extracting text from a PDF can be tricky. In several cases there is no
clear answer as to what the expected result should look like:

1. **Paragraphs**: Should the text of a paragraph have line breaks at the same places
@@ -191,7 +191,7 @@ printing. It was not created for parsing the content. PDF files don't contain a
semantic layer.

Specifically, there is no information about what the header, footer, page numbers,
tables, and paragraphs are. The visual appearence is there and people might
tables, and paragraphs are. The visual appearance is there and people might
find heuristics to make educated guesses, but there is no way of being certain.

This is a shortcoming of the PDF file format, not of pypdf.
34 changes: 16 additions & 18 deletions docs/user/post-processing-in-text-extraction.md
@@ -1,15 +1,13 @@
# Post-Processing in Text Extraction
# Post-Processing of Text Extraction

Post-processing can recognizably improve the results of text extraction.
It is, however, outside of the scope of pypdf itself. Hence the library will
not give any direct support for it. It is a natural language processing (NLP)
task.
Post-processing can recognizably improve the results of text extraction. It is,
however, outside of the scope of pypdf itself. Hence the library will not give
any direct support for it. It is a natural language processing (NLP) task.

This page lists a few examples what can be done as well as a community
recipie that can be used as a best-practice general purpose post processing
step. If you know more about the specific domain of your documents, e.g. the
language, it is likely that you can find custom solutions that work better in
your context
This page lists a few examples of what can be done as well as a community recipe
that can be used as a general-purpose post-processing step. If you know more
about the specific domain of your documents, e.g. the language, it is likely
that you can find custom solutions that work better in your context.

## Ligature Replacement

@@ -32,7 +30,7 @@ def replace_ligatures(text: str) -> str:
    return text
```

## De-Hyphenation
## Dehyphenation

Hyphens are used to break words up so that the appearance of the page is nicer.

@@ -77,11 +75,11 @@ def dehyphenate(lines: List[str], line_no: int) -> List[str]:

The following header/footer removal has several drawbacks:

* False-positives, e.g. for the first page when there is a date like 2021.
* False-positives, e.g. for the first page when there is a date like 2024.
* False-negatives in many cases:
* Dynamic part, e.g. page label is in the header
* Even/odd pages have different headers
* Some pages, e.g. the first one or chapter pages, don't have a header
* Dynamic part, e.g. page label is in the header.
* Even/odd pages have different headers.
* Some pages, e.g. the first one or chapter pages, do not have a header.

```python
def remove_footer(extracted_texts: list[str], page_labels: list[str]):
@@ -105,9 +103,9 @@ def remove_footer(extracted_texts: list[str], page_labels: list[str]):

## Other ideas

* Whitespaces between Units: Between a number and it's unit should be a space
* Whitespaces in units: There should be a space between a number and its unit
([source](https://tex.stackexchange.com/questions/20962/should-i-put-a-space-between-a-number-and-its-unit)).
That means: 42 ms, 42 GHz, 42 GB.
* Percent: English style guides prescribe writing the percent sign following the number without any space between (e.g. 50%).
* Whitespaces before dots: Should typically be removed
* Whitespaces after dots: Should typically be added
* Whitespaces before dots: Should typically be removed.
* Whitespaces after dots: Should typically be added.
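
The ideas above could be sketched with a few regular expressions. This is a rough heuristic rather than a recommended recipe: `normalize_whitespace` is a hypothetical helper, the unit list is a tiny hard-coded assumption, and abbreviations such as "U.S.A" would be handled incorrectly.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Apply the whitespace heuristics listed above (hypothetical helper)."""
    # Insert a space between a number and a unit from a small, hard-coded list.
    text = re.sub(r"(\d)(ms|GHz|GB)\b", r"\1 \2", text)
    # Remove the space between a number and the percent sign.
    text = re.sub(r"(\d)\s+%", r"\1%", text)
    # Remove whitespace before dots.
    text = re.sub(r"\s+\.", ".", text)
    # Add a space after a dot, but only before an uppercase letter,
    # so that decimals like "3.5" stay untouched.
    text = re.sub(r"\.(?=[A-Z])", ". ", text)
    return text


raw = "It took 42ms .The CPU runs at 3.5GHz and is 50 % idle ."
print(normalize_whitespace(raw))
```

The order of the substitutions matters: whitespace before dots is removed first so that the missing space after the dot can be detected afterwards.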
2 changes: 1 addition & 1 deletion docs/user/streaming-data.md
@@ -73,4 +73,4 @@ obj = s3.get_object(Body=csv_buffer.getvalue(), Bucket="my-bucket", Key="my/doc.
reader = PdfReader(BytesIO(obj["Body"].read()))
```

It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769))
It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769)).