Page order issue #29
Comments
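Proposal: we need a script to rename the text files. |
I found one that works when the first page is 99. The counter is zero-padded to five digits so that names stay the same width after page 99: j=99; for i in *.txt; do mv "$i" "text_for_page_$(printf '%05d' "$j").txt"; j=$((j+1)); done |
I will work on this from Monday.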
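The same renaming idea can be sketched in Python. This is only an illustration, not code from do_ocr.py: the function name `rename_sequentially` and its parameters are hypothetical, and it assumes the files should be numbered in natural (numeric-aware) filename order.

```python
import os
import re


def rename_sequentially(folder, start_page, width=5):
    """Rename every .txt file in `folder` to text_for_page_<NNNNN>.txt.

    Hypothetical helper: files are taken in natural order (page_2 before
    page_10) and numbered upwards from `start_page`.
    """
    def natural_key(name):
        # Split into text and digit runs so digit runs compare as integers.
        return [int(part) if part.isdigit() else part
                for part in re.split(r"(\d+)", name)]

    names = sorted((n for n in os.listdir(folder) if n.endswith(".txt")),
                   key=natural_key)
    renamed = []
    for offset, name in enumerate(names):
        target = f"text_for_page_{start_page + offset:0{width}d}.txt"
        target_path = os.path.join(folder, target)
        # Refuse to overwrite a different file that already has the target name.
        if target != name and os.path.exists(target_path):
            raise FileExistsError(target_path)
        os.rename(os.path.join(folder, name), target_path)
        renamed.append((name, target))
    return renamed
```

Calling `rename_sequentially("some_folder", 99)` would name the first file text_for_page_00099.txt and the next text_for_page_00100.txt, keeping the width at five digits.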
|
I am sharing all my test results for this issue. On the first run, the pages come out in the wrong order (1, 2, 3, 4) if the user interrupts an upload with Ctrl+C. On the next run, only the remaining pages are uploaded to GD, and the order is then correct (1, 2, 3, 4, 5). But if I leave the machine to run the script unattended, I never know when an upload gets stuck. I usually run do_ocr overnight, and in the morning I find it stuck at page 55 or page 250. The whole run is lost unless I interrupt it with Ctrl+C, so I have to stay awake and watch for a stuck upload. It would be very helpful if the script automatically skipped to the next page when an upload to GD gets stuck. Finally, if the issue in the screenshot above is fixed, then after the final run of do_ocr.py has created all the text files (i.e. 1, 2, 3, 4, 5), all PDF, log, and .upload files should be moved to a temp folder. |
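One possible way to skip a stuck upload automatically is to run each upload as a child process with a time limit. The sketch below is an assumption about how that could look, not how do_ocr.py actually uploads: `run_with_timeout`, `upload_all`, and `build_command` are hypothetical names, and the upload is assumed to be an external command.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run `command`; return True on success, False on failure or timeout."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


def upload_all(pages, build_command, timeout_seconds=120):
    """Try each page once and return the pages that still need a retry.

    `build_command(page)` is a hypothetical callback that returns the
    argument list of the upload command for one page.
    """
    pending = []
    for page in pages:
        if not run_with_timeout(build_command(page), timeout_seconds):
            pending.append(page)  # skipped for now; retry on the next run
    return pending
```

With this shape, an unattended overnight run moves on after the time limit and ends with a list of pages to retry, instead of waiting forever on one page.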
I have tested about 25 books and now feel that this fix is the one most needed. As I mentioned, page 2 should not be present after the first run, because on the next run the pages are sometimes not re-ordered properly. |
Hi, I have observed that page_0001.txt, page_0002.txt, and so on are created in the proper order. That is, if page 2 is skipped or not processed for any reason, the following pages are created: page_0001.txt, |
Sorry for the long delay on this project.
I have resumed work on fixing these issues.
|
Fixed the skipped-uploads problem in version 1.38. The number of individual page PDFs must equal the number of corresponding text files. If an upload is skipped, manually or automatically, the script will not proceed further, and we have to rerun it to upload the pending files. It processes the text files only after all the PDF files have been uploaded and their text content has been received. Please check this and share the results. |
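The check described above can be sketched roughly as follows. This is not the code from version 1.38: the function name `missing_text_files` is hypothetical, and the sketch assumes that each page exists as page_<N>.pdf and that its OCR result is stored as page_<N>.txt in the same folder.

```python
import os
import re


def missing_text_files(folder):
    """Return the page numbers that have a PDF but no text file yet.

    Assumed naming (not confirmed by the thread): page_<N>.pdf / page_<N>.txt.
    """
    pdf_pages, txt_pages = set(), set()
    for name in os.listdir(folder):
        match = re.fullmatch(r"page_(\d+)\.(pdf|txt)", name)
        if not match:
            continue  # ignore logs and other files
        number, extension = int(match.group(1)), match.group(2)
        (pdf_pages if extension == "pdf" else txt_pages).add(number)
    return sorted(pdf_pages - txt_pages)
```

An empty result means every PDF has its text file and the text processing can start. A non-empty result lists exactly which pages still have to be uploaded on the next run.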
do_ocr_2016-02-03-22-45-51_log.txt
I have run it two more times, but every time it says:
=========ERROR===========
INFO:main: Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files |
Yes.
It means a few PDF files were not uploaded and their text files were not received.
Run the script again until the error is gone.
|
Sorry Shrini, how many times do I have to re-run it? I have re-run it about 5 times and nothing happened. |
I checked manually and only five files are missing. This is a 1167-page book, and only 1162 pages were OCRed. |
Does rerunning upload the 5 missing files?
Text splitting won't run until the number of PDF files equals the number of text files received.
This is just to make sure that no page is missed by OCR.
Rerun a few more times and watch whether the missing files are being uploaded.
|
OK, I have tried again about seven times and nothing happened. The script should be trying to upload the remaining 5 files to GD, and my internet connection was fine during that time.
do_ocr_2016-02-04-21-30-08_log.txt
Every time it says:
=========ERROR===========
INFO:main: Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files |
If you are online now, can you show me this issue by screen sharing?
|
Using v1.42, after running a 708-page book, this message appeared:
=========ERROR===========
INFO:main:Missing page_00064.txt
Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files
"THIS IS GREAT", THAT WAS MY WISH 👍
After the second run, only the remaining file was uploaded and OCRed:
Moving all temp files to OCR-স্ত্রী-রোগ.djvu-temp-2016-02-07-00-33-32
INFO:main:Running mv folder_.log currentfile.pdf doc_data.txt pg_.pdf page* txt* 'OCR-স্ত্রী-রোগ.djvu-temp-2016-02-07-00-33-32'
Done. Check the text files start with text_for_page_
The PDF files and result text files are equval. Now run the mediawiki_uploader.py script
"THIS IS GREAT", MY WISH IS FULFILLED 👍 👍 👍 |
Shall we close this issue?
Can you check for other related reported issues for closing them too?
|
Please look at the screenshot. I skipped page 2 by pressing Ctrl+C, and the text files were then numbered one page off:
text_for_page_00002.txt was created from the content of page 3
text_for_page_00003.txt was created from the content of page 4
text_for_page_00004.txt was created from the content of page 5
text_for_page_00005.txt was not created
do_ocr_2016-01-14-09-21-48_log.txt
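The shift reported here is what happens when results are numbered by their position in the list rather than by the page they came from. The following sketch only illustrates that difference and is not taken from do_ocr.py: both function names and the `ocr_results` dictionary (source page number mapped to OCR text) are hypothetical.

```python
def number_sequentially(ocr_results):
    """Name files 1, 2, 3, ... in order, ignoring the source page number."""
    return {
        f"text_for_page_{position:05d}.txt": page
        for position, page in enumerate(sorted(ocr_results), start=1)
    }


def number_by_source_page(ocr_results):
    """Name each file after the page its text actually came from."""
    return {f"text_for_page_{page:05d}.txt": page for page in sorted(ocr_results)}


# Page 2 was skipped, so only pages 1, 3, 4 and 5 have OCR text.
ocr_results = {1: "text of page 1", 3: "text of page 3",
               4: "text of page 4", 5: "text of page 5"}

shifted = number_sequentially(ocr_results)
stable = number_by_source_page(ocr_results)
```

In `shifted`, text_for_page_00002.txt holds page 3 and text_for_page_00005.txt is missing, which matches the report. In `stable`, only text_for_page_00002.txt is missing, so a later run can fill the gap without renumbering anything.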