Page order issue #29

jayantanth · 2016-01-14T04:03:30Z

Please look at the screenshot. I skipped page 2 by pressing Ctrl+C.

text_for_page_00002.txt created from the content of Page No 3
text_for_page_00003.txt created from the content of Page No 4
text_for_page_00004.txt created from the content of Page No 5
text_for_page_00005.txt not created

do_ocr_2016-01-14-09-21-48_log.txt

jayantanth · 2016-01-16T16:18:59Z

Proposal: we need a script to rename the files.

jayantanth · 2016-01-16T16:36:07Z

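Found one script:

j=99; for i in *.txt; do mv -- "$i" "$(printf 'text_for_page_%05d.txt' "$j")"; j=$((j+1)); done

Set j to the start page (99 here). printf pads the number to five digits, so the names stay the same width after page 99.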
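
For reference, the same renumbering can be sketched in Python. This is only an illustration, not part of do_ocr.py: the function name `renumber_text_files` and its parameters are made up here. It pads the page number to five digits and renames in two passes, so a file that already has one of the target names is never overwritten.

```python
from pathlib import Path


def renumber_text_files(folder, start_page, prefix="text_for_page_", width=5):
    """Rename every .txt file in `folder` to prefix + zero-padded page number.

    Files are taken in sorted name order and numbered from `start_page`.
    Returns the list of new file names.
    """
    folder = Path(folder)
    sources = sorted(folder.glob("*.txt"))

    # Pass 1: move everything to temporary names so that no target name
    # can collide with a file that has not been renamed yet.
    temp_paths = []
    for index, source in enumerate(sources):
        temp = source.with_name(f".renumber_tmp_{index}")
        source.rename(temp)
        temp_paths.append(temp)

    # Pass 2: give each file its final, zero-padded name.
    new_names = []
    for offset, temp in enumerate(temp_paths):
        target = folder / f"{prefix}{start_page + offset:0{width}d}.txt"
        temp.rename(target)
        new_names.append(target.name)
    return new_names
```

For the case in the first comment, `renumber_text_files(".", 3)` would shift text_for_page_00002.txt, 00003 and 00004 to 00003, 00004 and 00005.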

tshrinivasan · 2016-01-16T16:40:45Z

I will work on this from Monday.

jayantanth · 2016-01-16T20:00:52Z

jayantanth · 2016-01-19T17:08:13Z

I am sharing all my test results for this issue. On the first run, the pages come out in the wrong order (1,2,3,4) if the user interrupts with Ctrl+C. On the next run, only the remaining pages are uploaded to GD, and the order is then correct (1,2,3,4,5).

But if I leave the machine to run the script unattended, I never know when an upload gets stuck. I often start do_ocr at night and leave it, and in the morning I find it stuck at page 55 or page 250. The whole run is lost unless I interrupt it with Ctrl+C, so I have to stay awake and watch for a stuck upload. It would be very helpful if the script automatically skipped to the next page when an upload to GD gets stuck.
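
One way to get that behaviour is to run each upload as a separate process with a time limit. The sketch below is an assumption about how this could look, not how do_ocr.py works: `upload_with_timeout`, `upload_all` and the `build_command` callback are hypothetical names, and the real upload command is not shown in this thread.

```python
import subprocess


def upload_with_timeout(command, timeout_seconds):
    """Run one upload command.

    Returns True if it exits with status 0, False if it fails or
    does not finish within `timeout_seconds`.
    """
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def upload_all(pages, build_command, timeout_seconds):
    """Try every page once and return the pages that were skipped.

    `build_command(page)` is a hypothetical callback that returns the
    argument list of the upload command for that page.
    """
    skipped = []
    for page in pages:
        if not upload_with_timeout(build_command(page), timeout_seconds):
            skipped.append(page)  # move on instead of hanging here
    return skipped
```

The returned list tells a later run which pages still need to be uploaded, so a stuck page costs one timeout rather than the whole night.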

And finally, if the issue in the screenshot above is going to be fixed: the pdf, log, and .upload files should be moved to the temp folder only after the final run of do_ocr.py has created all the txt files (i.e. 1,2,3,4,5).
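
That last request could be sketched roughly as follows. Everything here is illustrative: the function name `archive_when_complete`, the `expected_pages` argument, the default patterns, and the temp folder name are assumptions, not the actual do_ocr.py behaviour.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_when_complete(folder, expected_pages,
                          patterns=("*.pdf", "*.log", "*.upload")):
    """Move working files to a temp folder only if every page has its text.

    Returns the temp folder path, or None if some
    text_for_page_NNNNN.txt file is still missing.
    """
    folder = Path(folder)
    missing = [
        page for page in expected_pages
        if not (folder / f"text_for_page_{page:05d}.txt").exists()
    ]
    if missing:
        return None  # leave everything in place for the next run

    stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    temp_dir = folder / f"temp-{stamp}"  # hypothetical folder name
    temp_dir.mkdir(exist_ok=True)
    for pattern in patterns:
        for path in sorted(folder.glob(pattern)):
            if path.is_file():
                shutil.move(str(path), str(temp_dir / path.name))
    return temp_dir
```

With this ordering, an interrupted run leaves the PDF and .upload files where the next run can still find them.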

jayantanth · 2016-01-21T02:34:35Z

I have tested about 25 books, and I now feel that fixing this issue is what we need most. As I mentioned, page 2 should not be present after the first run, because the next run sometimes does not reorder the pages properly.

jayantanth · 2016-01-25T16:18:55Z

Hi, I have observed that the page_0001.txt, page_0002.txt, ... files are created in the proper order. That is, if page 2 is skipped or not processed for any reason, the following files are created:

page_0001.txt
page_0003.txt
page_0004.txt
page_0005.txt

tshrinivasan · 2016-01-25T16:21:18Z

Sorry for the long delay on this project. I have resumed work on fixing these issues.

tshrinivasan · 2016-02-03T01:44:38Z

Fixed the skipped uploads in version 1.38.

The number of individual pages must equal the number of corresponding text files.

If an upload is skipped, manually or automatically, the script won't proceed further.

We have to rerun the script to upload the pending files.

It will process the text files only after all the PDF files have been uploaded and their text content has been received.

Please check this and share the results.
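
The check described above amounts to comparing the PDF pages with the text files received. Here is a minimal sketch, assuming each page_NNNNN.pdf gets a page_NNNNN.txt next to it (the naming that appears in the v1.42 log later in this thread); the function name `find_unprocessed_pdfs` is made up for illustration and is not taken from do_ocr.py.

```python
from pathlib import Path


def find_unprocessed_pdfs(folder):
    """Return the names of page PDFs that have no matching text file yet."""
    folder = Path(folder)
    return [
        pdf.name
        for pdf in sorted(folder.glob("page_*.pdf"))
        if not pdf.with_suffix(".txt").exists()
    ]
```

An empty list means every page has its text and the text files can be processed; otherwise the list names exactly the PDFs that still need to be uploaded.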

jayantanth · 2016-02-04T03:42:04Z

do_ocr_2016-02-03-22-45-51_log.txt
do_ocr_2016-02-04-09-08-59_log.txt
do_ocr_2016-02-04-09-09-52_log.txt

I have run it again two times, but every time it said:

=========ERROR===========

INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

tshrinivasan · 2016-02-04T08:07:45Z

Yes. It means a few PDF files were not uploaded, so their text files were not received. Rerun the script until the error is gone.

jayantanth · 2016-02-04T16:10:40Z

Sorry Shrini, how many times do I have to rerun it? I have rerun it about 5 times and nothing happened; it did not even try to upload any file to GD.

jayantanth · 2016-02-04T16:15:00Z

I checked manually: only five files are missing. This is a 1167-page book, and only 1162 pages were OCRed.

tshrinivasan · 2016-02-04T16:21:39Z

Does rerunning upload the missing 5 files? Text splitting won't run until the number of PDFs equals the number of text files received. This is just to make sure that no page is missed by OCR. Rerun a few more times and watch whether the missing files are being uploaded.

jayantanth · 2016-02-04T16:22:20Z

So I have done a few things manually: I copied all the "page_00001.txt"-style files to a new folder, batch-renamed them to the text_for_page_00001.txt pattern, and then ran "python mediawiki_uploader.py" to upload them to Wikisource. The remaining 5 files will be done manually. :-(

jayantanth · 2016-02-04T16:31:45Z

OK, I have tried again about seven times and nothing happened. The script should at least try to upload the remaining 5 files to GD. My internet connection was fine during that time.

do_ocr_2016-02-04-21-30-08_log.txt
do_ocr_2016-02-04-21-54-27_log.txt
do_ocr_2016-02-04-21-54-52_log.txt
do_ocr_2016-02-04-21-55-15_log.txt
do_ocr_2016-02-04-21-55-37_log.txt
do_ocr_2016-02-04-21-56-04_log.txt
do_ocr_2016-02-04-21-57-24_log.txt
do_ocr_2016-02-04-21-58-31_log.txt

But every time it said:

=========ERROR===========

INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

tshrinivasan · 2016-02-04T19:12:24Z

If you are online now, can you show me this issue by screen sharing?

jayantanth · 2016-02-06T19:02:52Z

Using v1.42, after a run on a 708-page book, this message appeared:

=========ERROR===========

INFO:main:Missing page_00064.txt
INFO:main:page_00064.pdf should be reuploaded
INFO:main:Missing page_00420.txt
INFO:main:page_00420.pdf should be reuploaded
INFO:main:Missing page_00493.txt
INFO:main:page_00493.pdf should be reuploaded
INFO:main:Missing page_00544.txt
INFO:main:page_00544.pdf should be reuploaded
INFO:main:Missing page_00627.txt
INFO:main:page_00627.pdf should be reuploaded
INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

"THIS IS GREAT" THAT WAS MY WISH 👍

After the second run, only the remaining files were uploaded and OCRed.

Moving all temp files to OCR-স্ত্রী-রোগ.djvu-temp-2016-02-07-00-33-32

INFO:main:Running mv folder_.log currentfile.pdf doc_data.txt pg_.pdf page* txt* 'OCR-স্ত্রী-রোগ.djvu-temp-2016-02-07-00-33-32'
INFO:main:

Done. Check the text files start with text_for_page_
INFO:main:

The PDF files and result text files are equval. Now run the mediawiki_uploader.py script

"THIS IS GREAT" MY WISH FULFILL 👍 👍 👍
do_ocr_2016-02-06-00-38-44_log.txt
do_ocr_2016-02-07-00-33-32_log.txt

tshrinivasan · 2016-02-06T22:18:15Z

Shall we close this issue? Can you also check the other related issues and close them too?

jayantanth mentioned this issue Jan 21, 2016

Resuming do_ocr.py based on page number #22

Open

jayantanth closed this as completed Feb 7, 2016

jayantanth mentioned this issue Feb 7, 2016

Handling network disconnection #21

Closed
