Some fix for image @600 DPI
sbrunner committed Jul 19, 2021
1 parent 66cdc43 commit 9a82e06
Showing 21 changed files with 105 additions and 39 deletions.
10 changes: 9 additions & 1 deletion Makefile
@@ -15,4 +15,12 @@ prospector: build-test

.PHONY: pytest
pytest: build-test
docker run --rm --env=PYTHONPATH=/opt/ --volume=$$(pwd)/results:/results --volume=$$(pwd)/tests:/tests tests bash -c 'cd /tests && pytest --durations=0 --verbose --color=yes'
docker run --rm --env=PYTHONPATH=/opt/ --volume=$$(pwd)/results:/results --volume=$$(pwd)/tests:/tests --volume=$$(pwd)/scan_to_paperless:/opt/scan_to_paperless tests bash -c 'cd /tests && pytest --durations=0 --verbose --color=yes'

.PHONY: pytest-last-failed
pytest-last-failed:
docker run --rm --env=PYTHONPATH=/opt/ --volume=$$(pwd)/results:/results --volume=$$(pwd)/tests:/tests --volume=$$(pwd)/scan_to_paperless:/opt/scan_to_paperless tests bash -c 'cd /tests && pytest --durations=0 --verbose --color=yes --last-failed'

.PHONY: pytest-exitfirst
pytest-exitfirst:
docker run --rm --env=PYTHONPATH=/opt/ --volume=$$(pwd)/results:/results --volume=$$(pwd)/tests:/tests --volume=$$(pwd)/scan_to_paperless:/opt/scan_to_paperless tests bash -c 'cd /tests && pytest --durations=0 --verbose --color=yes --exitfirst'
3 changes: 2 additions & 1 deletion config.md
@@ -30,5 +30,6 @@
- **`min_box_black_crop`** *(number)*: The minimum black in a box on content find on witch one we will crop [%]. Default: `2`.
- **`min_box_black_limit`** *(number)*: The minimum black in a box on content find the limits based on content [%]. Default: `2`.
- **`min_box_black_empty`** *(number)*: The minimum black in a box on content find to determine if the page is empty [%]. Default: `2`.
- **`box_kernel_size`** *(number)*: The block size used in a box on content find [mm]. Default: `1.5`.
- **`box_block_size`** *(number)*: The block size used in a box on content find [mm]. Default: `1.5`.
- **`box_threshold_value_c`** *(number)*: A variable of double type representing the constant used in the both methods (subtracted from the mean or weighted mean, used in a box on content find. Default: `25`.
- **`box_threshold_value_c`** *(number)*: A variable used on threshold, should be low on low contrast image, used in a box on content find. Default: `70`.
5 changes: 3 additions & 2 deletions process.md
@@ -57,5 +57,6 @@
- **`min_box_black_crop`** *(number)*: The minimum black in a box on content find on witch one we will crop [%]. Default: `2`.
- **`min_box_black_limit`** *(number)*: The minimum black in a box on content find the limits based on content [%]. Default: `2`.
- **`min_box_black_empty`** *(number)*: The minimum black in a box on content find to determine if the page is empty [%]. Default: `2`.
- **`box_block_size`** *(number)*: The block size used in a box on content find [mm]. Default: `1.5`.
- **`box_threshold_value_c`** *(number)*: A variable of double type representing the constant used in the both methods (subtracted from the mean or weighted mean, used in a box on content find. Default: `25`.
- **`box_kernel_size`** *(number)*: The block size used in a box on content find [mm]. Default: `1.5`.
- **`box_block_size`** *(number)*: The block size used in a box on threshold for content find [mm]. Default: `1.5`.
- **`box_threshold_value_c`** *(number)*: A variable used on threshold, should be low on low contrast image, used in a box on content find. Default: `70`.
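
The `box_kernel_size` and `box_block_size` options above are given in millimetres, while the image operations work in pixels. The conversion helper used by the project (`context.get_px_value`) is not part of this diff, so the sketch below only assumes a plain mm-to-pixel conversion driven by the scan resolution; `mm_to_px` and `MM_PER_INCH` are hypothetical names. It illustrates why a size expressed in mm covers twice as many pixels at 600 DPI as at 300 DPI.

```python
MM_PER_INCH = 25.4  # one inch is exactly 25.4 mm


def mm_to_px(value_mm: float, dpi: float) -> float:
    """Convert a length in millimetres to pixels at the given resolution.

    Hypothetical helper: the real project reads the DPI from its own context.
    """
    return value_mm / MM_PER_INCH * dpi


if __name__ == "__main__":
    # The documented default of 1.5 mm at two common scan resolutions.
    for dpi in (300, 600):
        print(f"{dpi} DPI: 1.5 mm = {mm_to_px(1.5, dpi):.2f} px")
```

Under that assumption, the default of 1.5 mm corresponds to roughly 17.72 px at 300 DPI and 35.43 px at 600 DPI.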
8 changes: 6 additions & 2 deletions scan_to_paperless/config.py
@@ -81,10 +81,14 @@
# The block size used in a box on content find [mm]
#
# default: 1.5
"box_kernel_size": Union[int, float],
# The block size used in a box on content find [mm]
#
# default: 1.5
"box_block_size": Union[int, float],
# A variable of double type representing the constant used in the both methods (subtracted from the mean or weighted mean, used in a box on content find
# A variable used on threshold, should be low on low contrast image, used in a box on content find
#
# default: 25
# default: 70
"box_threshold_value_c": Union[int, float],
},
total=False,
9 changes: 7 additions & 2 deletions scan_to_paperless/config_schema.json
@@ -104,15 +104,20 @@
"default": 2,
"description": "The minimum black in a box on content find to determine if the page is empty [%]"
},
"box_kernel_size": {
"type": "number",
"default": 1.5,
"description": "The block size used in a box on content find [mm]"
},
"box_block_size": {
"type": "number",
"default": 1.5,
"description": "The block size used in a box on content find [mm]"
},
"box_threshold_value_c": {
"type": "number",
"default": 25,
"description": "A variable of double type representing the constant used in the both methods (subtracted from the mean or weighted mean, used in a box on content find"
"default": 70,
"description": "A variable used on threshold, should be low on low contrast image, used in a box on content find"
}
}
}
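
The description of `box_threshold_value_c` says it should be low on low-contrast images. A simplified, pure-Python sketch of mean adaptive thresholding with inverted binary output may help show why: a pixel is marked as content only when it is darker than the local mean minus the constant. This is not the OpenCV implementation (which handles rounding and performance differently); the function name `adaptive_mean_threshold_inv` and the replicated-border handling are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of 0..255 values


def adaptive_mean_threshold_inv(
    gray: Image, block_size: int, c: float, max_value: int = 255
) -> Image:
    """Mark pixels that are at least `c` darker than their local mean.

    Illustrative only: borders are handled by clamping coordinates.
    """
    if block_size % 2 != 1 or block_size < 3:
        raise ValueError("block_size must be odd and at least 3")
    height, width = len(gray), len(gray[0])
    half = block_size // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    # Clamp to the image so edge pixels reuse the border value.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            local_threshold = total / (block_size * block_size) - c
            if gray[y][x] <= local_threshold:
                result[y][x] = max_value
    return result


if __name__ == "__main__":
    # A white page with one faint (low-contrast) mark in the middle.
    page = [[255] * 5 for _ in range(5)]
    page[2][2] = 200
    for c in (70, 25):
        found = adaptive_mean_threshold_inv(page, 3, c)[2][2] == 255
        print(f"c={c}: faint mark detected = {found}")
```

In this toy case the faint mark is missed with a constant of 70 and detected with 25, which matches the advice to lower the value for low-contrast scans.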
49 changes: 35 additions & 14 deletions scan_to_paperless/process.py
@@ -326,12 +326,15 @@ def crop(context: Context, margin_horizontal: int, margin_vertical: int) -> None
Margin in px
"""
image = context.get_masked()
process_count = context.get_process_count()
contours = find_contours(
image,
f"{process_count}-crop",
context.get_px_value("min_box_size_crop", 3),
context.config["args"].get("min_box_black_crop", 2),
context.get_px_value("box_kernel_size", 1.5),
context.get_px_value("box_block_size", 1.5),
context.config["args"].get("box_threshold_value_c", 25),
context.config["args"].get("box_threshold_value_c", 70),
)

if contours:
@@ -341,7 +344,7 @@ def crop(context: Context, margin_horizontal: int, margin_vertical: int) -> None
save_image(
image,
context.root_folder,
"{}-crop".format(context.get_process_count()),
"{}-crop".format(process_count),
context.image_name,
True,
)
@@ -547,13 +550,17 @@ def zero_ranges(values: np_ndarray_int) -> np_ndarray_int:

def find_limit_contour(
image: np_ndarray_int,
name: str,
vertical: bool,
min_box_size: float,
min_box_black: Union[int, float],
block_size: Union[float, int] = 17,
threshold_value_c: Union[float, int] = 25,
kernel_size: Union[float, int] = 16,
block_size: Union[float, int] = 16,
threshold_value_c: Union[float, int] = 100,
) -> Tuple[List[int], List[Tuple[int, int, int, int]]]:
contours = find_contours(image, min_box_size, min_box_black, block_size, threshold_value_c)
contours = find_contours(
image, name, min_box_size, min_box_black, kernel_size, block_size, threshold_value_c
)
image_size = image.shape[1 if vertical else 0]

values = np.zeros(image_size)
@@ -578,11 +585,13 @@ def fill_limits(
peaks, properties = find_lines(image, vertical)
contours_limits, contours = find_limit_contour(
image,
f"{context.get_process_count()}-limits",
vertical,
context.get_px_value("min_box_size_limit", 10),
context.config["args"].get("min_box_black_limit", 2),
context.get_px_value("box_kernel_size", 1.5),
context.get_px_value("box_block_size", 1.5),
context.config["args"].get("box_threshold_value_c", 25),
context.config["args"].get("box_threshold_value_c", 70),
)
for contour_limit in contours:
draw_rectangle(image, contour_limit)
@@ -613,21 +622,26 @@ def fill_limits(

def find_contours(
image: np_ndarray_int,
name: str,
min_size: Union[float, int],
min_black: Union[float, int],
kernel_size: Union[float, int] = 16,
block_size: Union[float, int] = 16,
threshold_value_c: Union[float, int] = 25,
threshold_value_c: Union[float, int] = 100,
) -> List[Tuple[int, int, int, int]]:
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
block_size = int(round(block_size / 2) * 2)
kernel_size = int(round(kernel_size / 2))

# Clean the image using otsu method with the inversed binarized image
thresh = cv2.adaptiveThreshold(
gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, block_size + 1, threshold_value_c
)
if os.environ.get("PROGRESS", "FALSE") == "TRUE":
cv2.imwrite(os.path.join(name, "threshold.png"), thresh)

# Assign a rectangle kernel size
kernel = np.ones((5, 5), "uint8")
kernel = np.ones((kernel_size, kernel_size), "uint8")
par_img = cv2.dilate(thresh, kernel, iterations=5)

contours, _ = cv2.findContours(par_img.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
@@ -640,7 +654,12 @@ def find_contours(
contour_image = rgb2gray(contour_image)
if ((1 - np.mean(contour_image)) * 100) > min_black:
result.append(
(x + block_size / 2, y + block_size / 2, width - block_size, height - block_size)
(
x + kernel_size * 2,
y + kernel_size * 2,
width - kernel_size * 4,
height - kernel_size * 4,
)
)

return result
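
The size handling in `find_contours` can be followed with plain arithmetic. The sketch below mirrors the expressions visible in this diff (rounding the block size to an even number and adding 1 so the value passed to `cv2.adaptiveThreshold` is odd, halving the kernel size, and shrinking each detected rectangle by twice the kernel size on every side). The function names `normalise_sizes` and `shrink_box` are hypothetical, and the pixel inputs assume a 1.5 mm setting at 600 DPI converted elsewhere.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height


def normalise_sizes(kernel_size_px: float, block_size_px: float) -> Tuple[int, int]:
    """Return (kernel_size, odd_block_size) following the diff's expressions."""
    # Round to the nearest even integer, then add 1 to obtain an odd block size.
    block_size = int(round(block_size_px / 2) * 2) + 1
    # The dilation kernel uses half of the configured size.
    kernel_size = int(round(kernel_size_px / 2))
    return kernel_size, block_size


def shrink_box(x: int, y: int, width: int, height: int, kernel_size: int) -> Box:
    """Compensate for the growth caused by dilation, as done in the new code."""
    return (
        x + kernel_size * 2,
        y + kernel_size * 2,
        width - kernel_size * 4,
        height - kernel_size * 4,
    )


if __name__ == "__main__":
    kernel, block = normalise_sizes(35.43, 35.43)  # about 1.5 mm at 600 DPI
    print("kernel:", kernel, "block:", block)
    print("box:", shrink_box(100, 100, 200, 150, kernel))
```

Because the kernel now scales with the configured size instead of the former fixed 5x5 array, the dilation and the compensating shrink both follow the scan resolution.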
@@ -665,7 +684,7 @@ def transform(
images = []
process_count = 0

if config["args"]["assisted_split"]:
if config["args"].get("assisted_split", False):
config["assisted_split"] = []

for index, img in enumerate(step["sources"]):
@@ -691,18 +710,20 @@ def transform(
# Is empty ?
contours = find_contours(
context.get_masked(),
f"{context.get_process_count()}-is-empty",
context.get_px_value("min_box_size_empty", 20),
context.config["args"].get("min_box_black_crop", 2),
context.get_px_value("box_kernel_size", 1.5),
context.get_px_value("box_block_size", 1.5),
context.config["args"].get("box_threshold_value_c", 25),
context.config["args"].get("box_threshold_value_c", 70),
)
if not contours:
print("Ignore image with no content: {}".format(img))
continue

tesseract(context)

if config["args"]["assisted_split"]:
if config["args"].get("assisted_split", False):
assisted_split: scan_to_paperless.process_schema.AssistedSplit = {}
name = os.path.join(root_folder, context.image_name)
assert context.image is not None
@@ -739,7 +760,7 @@ def transform(

return {
"sources": images,
"name": "split" if config["args"]["assisted_split"] else "finalise",
"name": "split" if config["args"].get("assisted_split", False) else "finalise",
"process_count": process_count,
}
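
The switch from `config["args"]["assisted_split"]` to `config["args"].get("assisted_split", False)` matters because the argument dictionaries in this commit are declared with `total=False`, so any key may be absent. A minimal sketch of the difference, using a hypothetical `Arguments` type and `step_name` helper that are not part of the project:

```python
from typing import TypedDict


class Arguments(TypedDict, total=False):
    """Hypothetical subset of the arguments; every key is optional."""

    assisted_split: bool
    append_credit_card: bool


def step_name(args: Arguments) -> str:
    # .get() falls back to False when the key is missing instead of raising.
    return "split" if args.get("assisted_split", False) else "finalise"


if __name__ == "__main__":
    empty: Arguments = {}
    print(step_name(empty))
    print(step_name({"assisted_split": True}))
    try:
        empty["assisted_split"]  # direct indexing fails on a missing key
    except KeyError as error:
        print("KeyError:", error)
```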

@@ -925,7 +946,7 @@ def finalise(

images = step["sources"]

if config["args"]["append_credit_card"]:
if config["args"].get("append_credit_card", False):
images2 = []
for img in images:
if os.path.exists(img):
11 changes: 8 additions & 3 deletions scan_to_paperless/process_schema.json
@@ -108,15 +108,20 @@
"default": 2,
"description": "The minimum black in a box on content find to determine if the page is empty [%]"
},
"box_block_size": {
"box_kernel_size": {
"type": "number",
"default": 1.5,
"description": "The block size used in a box on content find [mm]"
},
"box_block_size": {
"type": "number",
"default": 1.5,
"description": "The block size used in a box on threshold for content find [mm]"
},
"box_threshold_value_c": {
"type": "number",
"default": 25,
"description": "A variable of double type representing the constant used in the both methods (subtracted from the mean or weighted mean, used in a box on content find"
"default": 70,
"description": "A variable used on threshold, should be low on low contrast image, used in a box on content find"
}
}
}
8 changes: 6 additions & 2 deletions scan_to_paperless/process_schema.py
@@ -85,10 +85,14 @@
# The block size used in a box on content find [mm]
#
# default: 1.5
"box_kernel_size": Union[int, float],
# The block size used in a box on threshold for content find [mm]
#
# default: 1.5
"box_block_size": Union[int, float],
# A variable of double type representing the constant used in the both methods (subtracted from the mean or weighted mean, used in a box on content find
# A variable used on threshold, should be low on low contrast image, used in a box on content find
#
# default: 25
# default: 70
"box_threshold_value_c": Union[int, float],
},
total=False,
Binary file added tests/600.expected.png
Binary file added tests/600.png
Binary file modified tests/assisted-split-contour-1.expected.png
Binary file modified tests/assisted-split-contour-3.expected.png
Binary file modified tests/assisted-split-contour-5.expected.png
Binary file modified tests/assisted-split-join-1.expected.png
Binary file modified tests/assisted-split-join-2.expected.png
Binary file modified tests/assisted-split-lines-1.expected.png
Binary file modified tests/assisted-split-lines-3.expected.png
Binary file modified tests/assisted-split-lines-4.expected.png
Binary file modified tests/assisted-split-lines-5.expected.png
Binary file modified tests/credit-card-1.expected.png
