ENH: Addition of optional visitor-functions in extract_text() #1252

Merged
merged 28 commits into from Sep 25, 2022
Changes from 25 commits
76801d7
ENH: Added visitor-callbacks in PageObject.extract_text(...).
srogmann Aug 18, 2022
39a9f08
TST: Test of visitor-callbacks in extract_text().
srogmann Aug 18, 2022
92c0cf8
STY: Executed black to format code (spaces, line-breaks, ...).
srogmann Aug 19, 2022
c320ea8
Fetch main-Updates (_utils.py).
srogmann Aug 19, 2022
177fea2
TST: Added function extractTable(...) to read text in cells of a table.
srogmann Aug 20, 2022
4389590
STY: Updated some comments in test-code.
srogmann Aug 22, 2022
eccc779
ENH: Added visitor-callbacks in PageObject.extract_text(...).
srogmann Aug 18, 2022
8297b13
TST: Test of visitor-callbacks in extract_text().
srogmann Aug 18, 2022
165b686
STY: Executed black to format code (spaces, line-breaks, ...).
srogmann Aug 19, 2022
ed784e9
TST: Added function extractTable(...) to read text in cells of a table.
srogmann Aug 20, 2022
9922f1c
STY: Updated some comments in test-code.
srogmann Aug 22, 2022
ae7c993
ENH: visitor_text additionally gets font-dictionary and font-size.
srogmann Aug 22, 2022
4afa052
Merge remote branch 'extract_text_visitors' into extract_text_visitors
srogmann Aug 22, 2022
f83ae31
TST: Added function get_base_font() to get the BaseFont.
srogmann Aug 23, 2022
18d2f4a
Merge branch 'main' into extract_text_visitors
srogmann Sep 14, 2022
19003b3
BUG: Merged output-changes into visitor-calls.
srogmann Sep 14, 2022
a5b8b44
TST: Updated text_visitor-test (line-break disappeared)
srogmann Sep 14, 2022
17f2d61
flake8 fixes
MartinThoma Sep 17, 2022
72e51be
Fix type annotations
MartinThoma Sep 18, 2022
fe11b54
Merge branch 'main' into extract_text_visitors
MartinThoma Sep 24, 2022
ab5d118
Missed a bracket
MartinThoma Sep 24, 2022
c5733f5
another bracket
MartinThoma Sep 24, 2022
9aad439
Remove unused functions
MartinThoma Sep 24, 2022
3809522
Type annotations
MartinThoma Sep 24, 2022
5b87ecc
Fix type
MartinThoma Sep 24, 2022
e47e16c
Fix type:ignore comment
MartinThoma Sep 24, 2022
fb7807c
Merge branch 'py-pdf:main' into extract_text_visitors
srogmann Sep 24, 2022
1969c9f
MAINT: Replaced DictionaryObject by Dict[str, str]] in cmaps.
srogmann Sep 24, 2022
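
A rough sketch of how a `visitor_text` callback with the five-argument signature added in this PR might collect positioned text, for instance to rebuild table rows. The calls at the bottom are hand-written stand-ins for what `extract_text(visitor_text=...)` would do while walking a page; the sample matrices, the `fragments` list and the row grouping are assumptions for illustration, not part of the PR. The sketch also treats elements 4 and 5 of the text matrix as x and y and ignores the current transformation matrix.

```python
from typing import Any, Dict, List, Optional, Tuple

# Collected fragments as (x, y, font_size, text).
fragments: List[Tuple[float, float, float, str]] = []


def visitor_text(
    text: str,
    cm_matrix: List[float],
    tm_matrix: List[float],
    font_dict: Optional[Dict[str, Any]],
    font_size: float,
) -> None:
    """Record non-empty text with the translation part of the text matrix."""
    # The callback may be invoked with empty text (e.g. at the start of a
    # text object), so such calls are skipped here.
    if text.strip() == "":
        return
    fragments.append((tm_matrix[4], tm_matrix[5], font_size, text))


# Hypothetical stand-in for: page.extract_text(visitor_text=visitor_text)
identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
visitor_text("", identity, identity, None, 0.0)
visitor_text(
    "Name", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 700.0],
    {"/BaseFont": "/Arial,Bold"}, 12.0,
)
visitor_text("Value\n", identity, [1.0, 0.0, 0.0, 1.0, 200.0, 700.0], None, 12.0)

# Group fragments into rows: top to bottom by y, left to right by x.
rows: Dict[float, List[str]] = {}
for x, y, _size, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
    rows.setdefault(y, []).append(text.strip())
```

Grouping on the exact y value only works when all cells of a row share one baseline; real documents would likely need a tolerance.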
8 changes: 7 additions & 1 deletion PyPDF2/_cmap.py
@@ -12,8 +12,13 @@
def build_char_map(
font_name: str, space_width: float, obj: DictionaryObject
) -> Tuple[
str, float, Union[str, Dict[int, str]], Dict
str, float, Union[str, Dict[int, str]], Dict, DictionaryObject
]: # font_type,space_width /2, encoding, cmap
"""Determine information about a font.

This function returns a tuple consisting of:
font sub-type, space_width/2, encoding, character-map, font-dictionary.
The font-dictionary itself is returned so that callers can inspect further font details."""
ft: DictionaryObject = obj["/Resources"]["/Font"][font_name] # type: ignore
font_type: str = cast(str, ft["/Subtype"])

@@ -58,6 +63,7 @@ def build_char_map(
encoding,
# https://github.com/python/mypy/issues/4374
map_dict,
ft,
)


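
A small sketch of what the extra font-dictionary element makes possible: reading the `/BaseFont` entry from the dictionary that `build_char_map` now returns as its fifth element. The helper name `get_base_font_name` is hypothetical, and a plain `dict` stands in for the `DictionaryObject`, on the assumption that it behaves like a mapping.

```python
from typing import Any, Dict, Optional


def get_base_font_name(font_dict: Optional[Dict[str, Any]]) -> Optional[str]:
    """Return the /BaseFont value without its leading slash, or None."""
    # The font dictionary is None for unknown fonts.
    if font_dict is None:
        return None
    base_font = font_dict.get("/BaseFont")
    if base_font is None:
        return None
    return str(base_font).lstrip("/")


# Plain dicts used as stand-ins for real font dictionaries.
known_font = {"/Subtype": "/TrueType", "/BaseFont": "/Arial,Bold"}
name_known = get_base_font_name(known_font)
name_unknown = get_base_font_name(None)
```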
144 changes: 134 additions & 10 deletions PyPDF2/_page.py
@@ -1261,6 +1261,9 @@ def _extract_text(
orientations: Tuple[int, ...] = (0, 90, 180, 270),
space_width: float = 200.0,
content_key: Optional[str] = PG.CONTENTS,
visitor_operand_before: Optional[Callable[[Any, Any, Any, Any], None]] = None,
visitor_operand_after: Optional[Callable[[Any, Any, Any, Any], None]] = None,
visitor_text: Optional[Callable[[Any, Any, Any, Any, Any], None]] = None,
) -> str:
"""
Locate all text drawing commands, in the order they are provided in the
@@ -1273,6 +1276,9 @@ def _extract_text(
Arabic, Hebrew,... are extracted in the correct order. If required, a custom RTL range of characters
can be defined; see function set_custom_rtl

Additionally you can provide visitor-methods to get informed on all operations and all text-objects.
For example in some PDF files this can be useful to parse tables.

:param Tuple[int, ...] orientations: list of orientations text_extraction will look for
default = (0, 90, 180, 270)
note: currently only 0(Up),90(turned Left), 180(upside Down), 270 (turned Right)
@@ -1281,6 +1287,17 @@ def _extract_text(
:param Optional[str] content_key: indicate the default key where to extract data
None = the object; this allow to reuse the function on XObject
default = "/Content"
:param Optional[Function] visitor_operand_before: function to be called before processing an operation.
It has four arguments: operator, operand-arguments,
current transformation matrix and text matrix.
:param Optional[Function] visitor_operand_after: function to be called after processing an operation.
It has four arguments: operator, operand-arguments,
current transformation matrix and text matrix.
:param Optional[Function] visitor_text: function to be called when extracting some text at some position.
It has five arguments: text,
current transformation matrix, text matrix, font-dictionary and font-size.
The font-dictionary may be None in case of unknown fonts.
If not None it may e.g. contain key "/BaseFont" with value "/Arial,Bold".
:return: a string object.
"""
text: str = ""
@@ -1301,11 +1318,14 @@ def _extract_text(
if "/Font" in resources_dict:
for f in cast(DictionaryObject, resources_dict["/Font"]):
cmaps[f] = build_char_map(f, space_width, obj)
cmap: Tuple[Union[str, Dict[int, str]], Dict[str, str], str] = (
cmap: Tuple[
Union[str, Dict[int, str]], Dict[str, str], str, Optional[DictionaryObject]
] = (
"charmap",
{},
"NotInitialized",
) # (encoding,CMAP,font_name)
None,
) # (encoding,CMAP,font resource name,dictionary-object of font)
try:
content = (
obj[content_key].get_object() if isinstance(content_key, str) else obj
@@ -1360,7 +1380,7 @@ def current_spacewidth() -> float:
return _space_width / 1000.0

def process_operation(operator: bytes, operands: List) -> None:
nonlocal cm_matrix, cm_stack, tm_matrix, tm_prev, output, text, char_scale, space_scale, _space_width, TL, font_size, cmap, orientations, rtl_dir
nonlocal cm_matrix, cm_stack, tm_matrix, tm_prev, output, text, char_scale, space_scale, _space_width, TL, font_size, cmap, orientations, rtl_dir, visitor_text
global CUSTOM_RTL_MIN, CUSTOM_RTL_MAX, CUSTOM_RTL_SPECIAL_CHARS

check_crlf_space: bool = False
@@ -1369,15 +1389,19 @@ def process_operation(operator: bytes, operands: List) -> None:
tm_matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
# tm_prev = tm_matrix
output += text
if visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
# based
# if output != "" and output[-1]!="\n":
# output += "\n"
text = ""
return None
elif operator == b"ET":
output += text
if visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
text = ""
# table 4.7, page 219
# table 4.7 "Graphics state operators", page 219
# cm_matrix calculation is reserved for the moment
elif operator == b"q":
cm_stack.append(
@@ -1407,6 +1431,8 @@ def process_operation(operator: bytes, operands: List) -> None:
# rtl_dir = False
elif operator == b"cm":
output += text
if visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
text = ""
cm_matrix = mult(
[
@@ -1430,21 +1456,29 @@ def process_operation(operator: bytes, operands: List) -> None:
elif operator == b"Tf":
if text != "":
output += text # .translate(cmap)
if visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
text = ""
# rtl_dir = False
try:
_space_width = cmaps[operands[0]][1]
# charMapTuple: font_type, float(sp_width / 2), encoding, map_dict, font-dictionary
charMapTuple = cmaps[operands[0]]
_space_width = charMapTuple[1]
# current cmap: encoding, map_dict, font resource name (internal name, not the real font-name),
# font-dictionary. The font-dictionary describes the font.
cmap = (
cmaps[operands[0]][2],
cmaps[operands[0]][3],
charMapTuple[2],
charMapTuple[3],
operands[0],
charMapTuple[4],
)
except KeyError: # font not found
_space_width = unknown_char_map[1]
cmap = (
unknown_char_map[2],
unknown_char_map[3],
"???" + operands[0],
None,
)
try:
font_size = float(operands[1])
@@ -1525,6 +1559,8 @@ def process_operation(operator: bytes, operands: List) -> None:
rtl_dir = True
# print("RTL",text,"*")
output += text
if visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
text = ""
text = x + text
else: # left-to-right
@@ -1533,6 +1569,8 @@ def process_operation(operator: bytes, operands: List) -> None:
rtl_dir = False
# print("LTR",text,"*")
output += text
if visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
text = ""
text = text + x
# fmt: on
@@ -1553,6 +1591,14 @@ def process_operation(operator: bytes, operands: List) -> None:
if deltaY < -0.8 * f:
if (output + text)[-1] != "\n":
output += text + "\n"
if visitor_text is not None:
visitor_text(
text + "\n",
cm_matrix,
tm_matrix,
cmap[3],
font_size,
)
text = ""
elif (
abs(deltaY) < f * 0.3
@@ -1564,6 +1610,14 @@ def process_operation(operator: bytes, operands: List) -> None:
if deltaY > 0.8 * f:
if (output + text)[-1] != "\n":
output += text + "\n"
if visitor_text is not None:
visitor_text(
text + "\n",
cm_matrix,
tm_matrix,
cmap[3],
font_size,
)
text = ""
elif (
abs(deltaY) < f * 0.3
@@ -1575,6 +1629,14 @@ def process_operation(operator: bytes, operands: List) -> None:
if deltaX > 0.8 * f:
if (output + text)[-1] != "\n":
output += text + "\n"
if visitor_text is not None:
visitor_text(
text + "\n",
cm_matrix,
tm_matrix,
cmap[3],
font_size,
)
text = ""
elif (
abs(deltaX) < f * 0.3
@@ -1586,6 +1648,14 @@ def process_operation(operator: bytes, operands: List) -> None:
if deltaX < -0.8 * f:
if (output + text)[-1] != "\n":
output += text + "\n"
if visitor_text is not None:
visitor_text(
text + "\n",
cm_matrix,
tm_matrix,
cmap[3],
font_size,
)
text = ""
elif (
abs(deltaX) < f * 0.3
@@ -1597,6 +1667,8 @@ def process_operation(operator: bytes, operands: List) -> None:
pass

for operands, operator in content.operations:
if visitor_operand_before is not None:
visitor_operand_before(operator, operands, cm_matrix, tm_matrix)
# multiple operators are defined in here ####
if operator == b"'":
process_operation(b"T*", [])
@@ -1622,17 +1694,30 @@ def process_operation(operator: bytes, operands: List) -> None:
process_operation(b"Tj", [" "])
elif operator == b"Do":
output += text
if visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
try:
if output[-1] != "\n":
output += "\n"
if visitor_text is not None:
visitor_text("\n", cm_matrix, tm_matrix, cmap[3], font_size)
except IndexError:
pass
try:
xobj = resources_dict["/XObject"]
if xobj[operands[0]]["/Subtype"] != "/Image": # type: ignore
# output += text
text = self.extract_xform_text(xobj[operands[0]], orientations, space_width) # type: ignore
text = self.extract_xform_text(
xobj[operands[0]],
orientations,
space_width,
visitor_operand_before,
visitor_operand_after,
visitor_text,
) # type: ignore
output += text
if visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
except Exception:
logger_warning(
f" impossible to decode XFormObject {operands[0]}",
@@ -1642,7 +1727,11 @@ def process_operation(operator: bytes, operands: List) -> None:
text = ""
else:
process_operation(operator, operands)
if visitor_operand_after is not None:
visitor_operand_after(operator, operands, cm_matrix, tm_matrix)
output += text # just in case of
if text != "" and visitor_text is not None:
visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)
return output

def extract_text(
@@ -1652,6 +1741,9 @@ def extract_text(
TJ_sep: str = None,
orientations: Union[int, Tuple[int, ...]] = (0, 90, 180, 270),
space_width: float = 200.0,
visitor_operand_before: Optional[Callable[[Any, Any, Any, Any], None]] = None,
visitor_operand_after: Optional[Callable[[Any, Any, Any, Any], None]] = None,
visitor_text: Optional[Callable[[Any, Any, Any, Any, Any], None]] = None,
) -> str:
"""
Locate all text drawing commands, in the order they are provided in the
@@ -1663,12 +1755,25 @@ def extract_text(
Do not rely on the order of text coming out of this function, as it
will change if this function is made more sophisticated.

Additionally you can provide visitor-methods to get informed on
all operations and all text-objects.
For example in some PDF files this can be useful to parse tables.

:param Tj_sep: Deprecated. Kept for compatibility until PyPDF2==4.0.0
:param TJ_sep: Deprecated. Kept for compatibility until PyPDF2==4.0.0
:param orientations: (list of) orientations (of the characters) (default: (0, 90, 180, 270))
single int is equivalent to a singleton ( 0 == (0,) )
note: currently only 0(Up),90(turned Left), 180(upside Down),270 (turned Right)
:param float space_width: force default space width (if not extracted from font) (default: 200)
:param Optional[Function] visitor_operand_before: function to be called before processing an operation.
It has four arguments: operator, operand-arguments,
current transformation matrix and text matrix.
:param Optional[Function] visitor_operand_after: function to be called after processing an operation.
It has four arguments: operator, operand-arguments,
current transformation matrix and text matrix.
:param Optional[Function] visitor_text: function to be called when extracting some text at some position.
It has five arguments: text,
current transformation matrix, text matrix, font-dictionary and font-size.
:return: The extracted text
"""
if len(args) >= 1:
@@ -1708,14 +1813,24 @@ def extract_text(
orientations = (orientations,)

return self._extract_text(
self, self.pdf, orientations, space_width, PG.CONTENTS
self,
self.pdf,
orientations,
space_width,
PG.CONTENTS,
visitor_operand_before,
visitor_operand_after,
visitor_text,
)

def extract_xform_text(
self,
xform: EncodedStreamObject,
orientations: Tuple[int, ...] = (0, 90, 270, 360),
space_width: float = 200.0,
visitor_operand_before: Optional[Callable[[Any, Any, Any, Any], None]] = None,
visitor_operand_after: Optional[Callable[[Any, Any, Any, Any], None]] = None,
visitor_text: Optional[Callable[[Any, Any, Any, Any, Any], None]] = None,
) -> str:
"""
Extract text from an XObject.
@@ -1724,7 +1839,16 @@ def extract_xform_text(

:return: The extracted text
"""
return self._extract_text(xform, self.pdf, orientations, space_width, None)
return self._extract_text(
xform,
self.pdf,
orientations,
space_width,
None,
visitor_operand_before,
visitor_operand_after,
visitor_text,
)

def extractText(
self, Tj_sep: str = "", TJ_sep: str = ""
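
To illustrate the operator visitors, here is a minimal sketch of a before/after pair with the four-argument signature from the diff above. The loop at the bottom only imitates the `for operands, operator in content.operations` loop with a hand-made operation list; the list, the identity matrices and the two collectors are assumptions for illustration, and no real text processing happens between the two calls.

```python
from collections import Counter
from typing import Any, List, Tuple

operator_counts: Counter = Counter()
text_showing: List[bytes] = []


def visitor_operand_before(
    operator: bytes, operands: List[Any], cm_matrix: List[float], tm_matrix: List[float]
) -> None:
    """Count every operator before it is processed."""
    operator_counts[operator] += 1


def visitor_operand_after(
    operator: bytes, operands: List[Any], cm_matrix: List[float], tm_matrix: List[float]
) -> None:
    """Remember text-showing operators after they were processed."""
    if operator in (b"Tj", b"TJ"):
        text_showing.append(operator)


# Hand-made stand-in for content.operations: (operands, operator) pairs.
operations: List[Tuple[List[Any], bytes]] = [
    ([], b"BT"),
    (["/F1", 12], b"Tf"),
    ([72, 700], b"Td"),
    (["Hello"], b"Tj"),
    ([], b"ET"),
]
identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

for operands, operator in operations:
    visitor_operand_before(operator, operands, identity, identity)
    # The real extraction code would process the operation here.
    visitor_operand_after(operator, operands, identity, identity)
```

With a real page, the same two functions would be passed as `visitor_operand_before` and `visitor_operand_after` to `extract_text`.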