Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[do not merge] refactoring #2

Merged
merged 47 commits into from
Dec 16, 2020
Merged

[do not merge] refactoring #2

merged 47 commits into from
Dec 16, 2020

Conversation

kba
Copy link
Contributor

@kba kba commented Nov 20, 2020

Extends #1 with some initial refactorings:

  • reorganize imports
  • move CLI to dedicated module
  • introduce uniform spacing for readability
  • start extracting methods that can be functions (do not access self) to utils.py

I'll keep it at that for now because I don't want to inadvertently break anything, so further refactoring must wait until tests/CI are set up.

@kba kba changed the title Refactor [do not merge] refactoring Nov 20, 2020
Copy link
Member

@mikegerber mikegerber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I absolutely like that you

  • (re)moved unused code
  • used a code formatter
  • moved a lot of utility functions out of the main code

The few remarks I made are maybe out of scope for this refactoring, because these refactoring are mostly pure refactoring, i.e. reformatting and moving code.

We should have a deeper look after this. Let's talk about it on Friday?

Comment on lines +27 to +61
@click.option(
"--save_layout",
"-sl",
help="if a directory is given, plot of layout will be saved there",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--save_deskewed",
"-sd",
help="if a directory is given, deskewed image will be saved there",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--save_all",
"-sa",
help="if a directory is given, all plots needed for documentation will be saved there",
type=click.Path(exists=True, file_okay=False),
)
@click.option(
"--allow_enhancement",
"-ae",
is_flag=True,
help="if this parameter set to true, this tool would check that input image need resizing and enhancement or not. If so output of resized and enhanced image and corresponding layout data will be written in out directory",
)
@click.option(
"--curved_line",
"-cl",
is_flag=True,
help="if this parameter set to true, this tool will try to return contoure of textlines instead of rectabgle bounding box of textline. This should be taken into account that with this option the tool need more time to do process.",
)
@click.option(
"--full_layout",
"-fl",
is_flag=True,
help="if this parameter set to true, this tool will try to return all elements of layout.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if that was in there before, but I like that these are proper options!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't, I fixed this in 97bc57b @vahidrezanezhad you can define command line options as "flags", so they represent boolean values which makes the code cleaner because you don't need to compare strings "true", "TRUE" etc. if defined like this.

Comment on lines +38 to +50
from .utils.contour import (
contours_in_same_horizon,
filter_contours_area_of_image_interiors,
filter_contours_area_of_image_tables,
filter_contours_area_of_image,
find_contours_mean_y_diff,
find_features_of_contours,
find_new_features_of_contoures,
get_text_region_boxes_by_given_contours,
get_textregion_contours_in_org_image,
return_bonding_box_of_contours,
return_contours_of_image,
return_contours_of_interested_region,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that these (and other) utility functions are now moved out of the main code!

full_layout=False,
allow_scaling=False,
headers_off=False
):
self.image_dir = image_dir # XXX This does not seem to be a directory as the name suggests, but a file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed the problem that image_dir is not a directory, in my initial review of sbb-textline-detector. It would be nice if that problem could be resolved too, i.e. renaming it to image_file or similar. (That XXX is from me and is now in multiple variants of the code.)

That is, if it is still the case that image_dir is not a directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The code would benefit a lot from more consistent naming of variables and functions in general. But best to do that in a separate PR.

if self.f_name is None:
try:
self.f_name = image_dir.split('/')[len(image_dir.split('/')) - 1]
self.f_name = self.f_name.split('.')[0]
self.f_name = image_dir.split("/")[len(image_dir.split("/")) - 1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be clearer as:

self.f_name = image_dir.split("/")[-1]

or even:

self.f_name = os.path.basename(image_dir)

(Here again the misleading name of image_dir, I think it should be image_file or image_path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, also I would use os.path.join or Path() for concatenating paths throughout.

Comment on lines 160 to 171
self.model_dir_of_enhancemnet = dir_models + "/model_enhancement.h5"
self.model_dir_of_col_classifier = dir_models + "/model_scale_classifier.h5"
self.model_region_dir_p = dir_models + "/model_main_covid19_lr5-5_scale_1_1_great.h5" # dir_models +'/model_main_covid_19_many_scalin_down_lr5-5_the_best.h5'#'/model_main_covid19_lr5-5_scale_1_1_great.h5'#'/model_main_scale_1_1und_1_2_corona_great.h5'
# self.model_region_dir_p_ens = dir_models +'/model_ensemble_s.h5'#'/model_main_covid19_lr5-5_scale_1_1_great.h5'#'/model_main_scale_1_1und_1_2_corona_great.h5'
self.model_region_dir_p2 = dir_models + "/model_main_home_corona3_rot.h5"

self.model_region_dir_fully_np = dir_models + "/model_no_patches_class0_30eopch.h5"
self.model_region_dir_fully = dir_models + "/model_3up_new_good_no_augmentation.h5" # "model_3col_p_soft_10_less_aug_binarization_only.h5"

self.model_page_dir = dir_models + "/model_page_mixed_best.h5"
self.model_region_dir_p_ens = dir_models + "/model_ensemble_s.h5" # dir_models +'/model_main_covid_19_many_scalin_down_lr5-5_the_best.h5' #dir_models +'/model_ensemble_s.h5'
###self.model_region_dir_p = dir_models +'/model_layout_newspapers.h5'#'/model_ensemble_s.h5'#'/model_layout_newspapers.h5'#'/model_ensemble_s.h5'#'/model_main_home_5_soft_new.h5'#'/model_home_soft_5_all_data.h5' #'/model_main_office_long_soft.h5'#'/model_20_cat_main.h5'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe dir is misleading here too. Also, os.path.join could be used to join a directory name with a file name.

Copy link
Member

@cneud cneud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the effort @kba, makes it a joy (almost) to review ;)

Mostly minor remarks and suggestions from my side, possibly also to target in a separate PR, or for discussion in the next meeting.

sbb_newspapers_org_image/eynollah.py Outdated Show resolved Hide resolved
Makefile Show resolved Hide resolved

__version__ = '1.0'
"""
tool to extract table form data from alto xml data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DocString should be updated and start with a capital letter and end with a fullstop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 and offer more details. And we should reuse that in the docstring for the CLI and upcoming OCR-D CLI.

sbb_newspapers_org_image/eynollah.py Show resolved Hide resolved
self.model_textline_dir = dir_models + "/model_textline_newspapers.h5" #'/model_hor_ver_home_trextline_very_good.h5'# '/model_hor_ver_1_great.h5'#'/model_curved_office_works_great.h5'

def predict_enhancement(self, img):
model_enhancement, session_enhancemnet = self.start_new_session_and_model(self.model_dir_of_enhancemnet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can consider renaming some of the models or making the model names more consistent (good/best/very good/great) @vahidrezanezhad

sbb_newspapers_org_image/eynollah.py Show resolved Hide resolved

else:
peaks_new_tot=peaks_e[:]
data.set("xmlns", "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update this to PAGE-XML 2019 (http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15)?
(also L1459)

else:
point_up = peaks[jj] + first_nonzero - int(1.23 * dis_to_next_up) ##+int(dis_to_next_up*1./4.0)
point_down = peaks[jj] + first_nonzero + int(1.33 * dis_to_next_down) ###-int(dis_to_next_down*1./4.0)
data.set("xmlns", "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again update to PAGE-XML 2019 when possible (same for L1966)

sbb_newspapers_org_image/eynollah.py Show resolved Hide resolved
sbb_newspapers_org_image/eynollah.py Show resolved Hide resolved
@cneud cneud mentioned this pull request Dec 4, 2020
6 tasks
@kba kba merged commit 044ff0c into main Dec 16, 2020
@kba kba deleted the refactor branch January 8, 2021 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants