Page order issue #29

Closed
jayantanth opened this issue Jan 14, 2016 · 19 comments

@jayantanth
Contributor

Please look at the screenshot. I skipped page 2 by pressing Ctrl+C.

text_for_page_00002.txt was created from the content of page 3
text_for_page_00003.txt was created from the content of page 4
text_for_page_00004.txt was created from the content of page 5
text_for_page_00005.txt was not created

screenshot from 2016-01-14 09 26 46

do_ocr_2016-01-14-09-21-48_log.txt

@jayantanth
Contributor Author

Proposal: we need a script to rename the files.

@jayantanth
Contributor Author

I found a script:

j=99;for i in *.txt; do mv "$i" text_for_page_000"$j".txt; let j=j+1;done

This assumes the start page is 99.
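
The one-liner above hard-codes the "000" prefix, so from page 100 onwards it would produce six-digit names such as text_for_page_000100.txt instead of the five-digit pattern used elsewhere in this thread. A minimal Python sketch of the same idea with fixed-width padding follows. The function name `renumber_text_files` and its parameters are hypothetical and are not part of do_ocr.py.

```python
from pathlib import Path


def renumber_text_files(folder, start_page, width=5):
    """Rename every .txt file in `folder` to text_for_page_NNNNN.txt.

    Files are taken in sorted name order and numbered from `start_page`.
    Returns the list of new paths. (Hypothetical helper, not from do_ocr.py.)
    """
    folder = Path(folder)
    sources = sorted(folder.glob("*.txt"))

    # Phase 1: move everything to temporary names, so that a new name can
    # never overwrite a file that has not been renamed yet.
    staged = []
    for index, src in enumerate(sources):
        tmp = src.with_name(f".renumber_tmp_{index}")
        src.rename(tmp)
        staged.append(tmp)

    # Phase 2: give each file its final zero-padded name.
    renamed = []
    for offset, tmp in enumerate(staged):
        target = folder / f"text_for_page_{start_page + offset:0{width}d}.txt"
        tmp.rename(target)
        renamed.append(target)
    return renamed
```

Calling `renumber_text_files(".", 99)` would number the files 00099, 00100, 00101 and so on, in sorted name order. It only fixes numbering if the sorted order of the existing names already matches the true page order.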

@tshrinivasan
Owner

I will work on this from Monday.

@jayantanth
Contributor Author

1

@jayantanth
Contributor Author

I am sharing all my test results for this issue. On the first run, if the user interrupts with Ctrl+C, the pages come out misnumbered (1, 2, 3, 4). On the next run, only the remaining pages are uploaded to GD, and the output matches the correct order (1, 2, 3, 4, 5).

But if I leave the machine to run the script unattended, there is no way to know when an upload gets stuck. I usually run do_ocr overnight, and in the morning I find it stuck at page 55 or page 250. The whole run is lost unless I interrupt it with Ctrl+C, so I have to stay awake to watch for a stuck upload. It would be very helpful if the script automatically skipped to the next page when an upload to GD gets stuck.

Finally, to fix the issue shown in the screenshot above: the PDF, log, and .upload files should be moved to the temp folder only after the final run of do_ocr.py has created all the text files (i.e. 1, 2, 3, 4, 5).
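
One way to get the "skip when stuck" behaviour requested here is to run each upload as a child process with a time limit. The sketch below assumes the upload can be started as an external command. `run_uploads`, `build_command`, and the timeout value are hypothetical, and this is not how do_ocr.py is known to work.

```python
import subprocess


def run_uploads(pdf_paths, build_command, timeout_seconds=120):
    """Run one upload command per PDF, skipping any that hangs or fails.

    `build_command(pdf)` must return the argument list for the (assumed)
    upload command. Returns (done, skipped) lists of paths.
    """
    done, skipped = [], []
    for pdf in pdf_paths:
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the time limit is exceeded.
            subprocess.run(build_command(pdf), check=True, timeout=timeout_seconds)
            done.append(pdf)
        except subprocess.TimeoutExpired:
            skipped.append(pdf)  # upload was stuck: move on to the next page
        except subprocess.CalledProcessError:
            skipped.append(pdf)  # upload exited with an error code
    return done, skipped
```

The `skipped` list can then be retried on a later run, so an overnight job keeps moving past a stuck page instead of waiting for Ctrl+C.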

@jayantanth
Contributor Author

I have tested about 25 books and now feel that fixing this issue is the top priority. As I mentioned, a file for page 2 should not be present after the first run, because the next run sometimes fails to reorder the pages properly.

@jayantanth
Contributor Author

Hi, I have observed that the page_0001.txt, page_0002.txt, ... files are created in the proper order. That is, if page 2 is skipped or not processed for any reason, the following files are created:

page_0001.txt
page_0003.txt
page_0004.txt
page_0005.txt

@tshrinivasan
Owner

tshrinivasan commented Jan 25, 2016 via email

@tshrinivasan
Owner

Fixed the skipped-uploads problem in version 1.38.

The number of individual pages should equal the number of corresponding text files.

If an upload is skipped, manually or automatically, the script won't proceed further.

We have to rerun the script to upload the pending files.

It processes the text files only after all the PDF files have been uploaded and their text content has been received.

Please check this and share the results.
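
The rule described here (every page PDF must have a matching text file before processing continues) can be sketched as a small check. The page_NNNNN.pdf / page_NNNNN.txt naming is taken from the log output in this thread. The function name `find_missing_text_files` is hypothetical and this is not the actual do_ocr.py code.

```python
from pathlib import Path


def find_missing_text_files(folder):
    """Return the names of page_*.pdf files that have no matching .txt file.

    An empty list means every PDF has been OCRed and the text files can be
    processed. (Hypothetical helper, not taken from do_ocr.py.)
    """
    folder = Path(folder)
    missing = []
    for pdf in sorted(folder.glob("page_*.pdf")):
        # page_00064.pdf is expected to have page_00064.txt next to it.
        if not pdf.with_suffix(".txt").exists():
            missing.append(pdf.name)
    return missing
```

Reporting the missing names, rather than only comparing the two counts, tells the user exactly which PDFs still need to be uploaded on the next run.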

@jayantanth
Contributor Author

do_ocr_2016-02-03-22-45-51_log.txt
do_ocr_2016-02-04-09-08-59_log.txt
do_ocr_2016-02-04-09-09-52_log.txt

I ran it two more times, but every time it said:

=========ERROR===========

INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

@tshrinivasan
Owner

tshrinivasan commented Feb 4, 2016 via email

@jayantanth
Contributor Author

Sorry Shrini, how many times do I have to re-run it? I have re-run it about 5 times and nothing happened;
no file was uploaded to GD.

@jayantanth
Contributor Author

I have checked manually, and only five files are missing. This is a 1167-page book, and only 1162 pages were OCRed.

@tshrinivasan
Owner

tshrinivasan commented Feb 4, 2016 via email

@jayantanth
Contributor Author

So I did a few things manually: I copied all the "page_00001.txt" files to a new folder, batch-renamed them to text_for_page_00001.txt, and then ran "python mediawiki_uploader.py" to upload to Wikisource. The remaining 5 files will be done manually. :-(
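
The manual copy-and-rename step described here could be scripted roughly as follows. This is only a sketch of the workaround, assuming the page_NNNNN.txt files sit in one folder. `copy_with_prefix` and its parameters are hypothetical names, not part of the project.

```python
import shutil
from pathlib import Path


def copy_with_prefix(src_folder, dest_folder, prefix="text_for_"):
    """Copy page_*.txt files into dest_folder as text_for_page_*.txt.

    The originals are left untouched. Returns the list of new paths.
    (Hypothetical helper illustrating the manual workaround.)
    """
    src_folder, dest_folder = Path(src_folder), Path(dest_folder)
    dest_folder.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_folder.glob("page_*.txt")):
        # Keep the page number from the source name, only add the prefix.
        target = dest_folder / f"{prefix}{src.name}"
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

Because the page number is kept from the source name, a missing page stays a gap in the copied set rather than shifting the numbers of later pages.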

screenshot from 2016-02-04 21 47 40

@jayantanth
Contributor Author

OK, I have tried again about seven times and nothing happened; the script should have tried to upload the remaining 5 files to GD. My internet connection was fine during that time.

do_ocr_2016-02-04-21-30-08_log.txt
do_ocr_2016-02-04-21-54-27_log.txt
do_ocr_2016-02-04-21-54-52_log.txt
do_ocr_2016-02-04-21-55-15_log.txt
do_ocr_2016-02-04-21-55-37_log.txt
do_ocr_2016-02-04-21-56-04_log.txt
do_ocr_2016-02-04-21-57-24_log.txt
do_ocr_2016-02-04-21-58-31_log.txt

but every time it said:

=========ERROR===========

INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

@tshrinivasan
Owner

tshrinivasan commented Feb 4, 2016 via email

@jayantanth
Contributor Author

Using v1.42, after running a 708-page book, this message appeared:

=========ERROR===========

INFO:main:Missing page_00064.txt
INFO:main:page_00064.pdf should be reuploaded
INFO:main:Missing page_00420.txt
INFO:main:page_00420.pdf should be reuploaded
INFO:main:Missing page_00493.txt
INFO:main:page_00493.pdf should be reuploaded
INFO:main:Missing page_00544.txt
INFO:main:page_00544.pdf should be reuploaded
INFO:main:Missing page_00627.txt
INFO:main:page_00627.pdf should be reuploaded
INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

"THIS IS GREAT" THAT WAS MY WISH 👍

After the second run, only the remaining files were uploaded and OCRed.

Moving all temp files to OCR-স্ত্রী-রোগ.djvu-temp-2016-02-07-00-33-32

INFO:main:Running mv folder_.log currentfile.pdf doc_data.txt pg_.pdf page* txt* 'OCR-স্ত্রী-রোগ.djvu-temp-2016-02-07-00-33-32'
INFO:main:

Done. Check the text files start with text_for_page_
INFO:main:

The PDF files and result text files are equval. Now run the mediawiki_uploader.py script

"THIS IS GREAT" MY WISH FULFILL 👍 👍 👍
do_ocr_2016-02-06-00-38-44_log.txt
do_ocr_2016-02-07-00-33-32_log.txt

@tshrinivasan
Owner

tshrinivasan commented Feb 6, 2016 via email
