The release is corrupted #161

Closed
V-kto opened this issue Jun 15, 2022 · 9 comments

V-kto commented Jun 15, 2022

Description

I downloaded the following release: https://github.com/tesseract-ocr/tessdata/releases/tag/4.1.0

I get an error when I try to unzip it.

Tested Environments

  • macOS 12.4
  • Amazon Linux 2 (Docker container)

Reproduce

Run the following commands

wget https://github.com/tesseract-ocr/tessdata/archive/refs/tags/4.1.0.zip
unzip ./4.1.0.zip

You will get this

Archive:  ./4.1.0.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of ./4.1.0.zip or
        ./4.1.0.zip.zip, and cannot find ./4.1.0.zip.ZIP, period.
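
The "End-of-central-directory signature not found" message means unzip could not find the record that closes every ZIP archive, which is what a file that was cut short looks like. A minimal sketch for checking this before extracting, assuming a POSIX shell with tail and grep; the function name has_zip_eocd is hypothetical and not part of any tool mentioned here:

```shell
# Hypothetical helper: succeed if FILE contains the ZIP end-of-central-directory
# signature (the bytes "PK\005\006") within its last 65557 bytes, i.e. the
# 22-byte record plus the maximum 65535-byte archive comment.
has_zip_eocd() {
    sig=$(printf 'PK\005\006')
    tail -c 65557 "$1" | LC_ALL=C grep -aFq "$sig"
}

# Usage: only extract when the signature is present.
# has_zip_eocd ./4.1.0.zip && unzip ./4.1.0.zip
```

Finding the signature is a necessary condition, not a proof of integrity; unzip -tq ./4.1.0.zip performs a full test of the archive.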

V-kto changed the title from "Release is corrupted" to "The release is corrupted" on Jun 15, 2022

Contributor

stweil commented Jun 15, 2022

I can confirm that problem. The release zip files are provided by GitHub, so there is nothing we can do to fix it. But the tar.gz file is fine, so there is a working alternative.

Contributor

zdenop commented Jun 15, 2022

The tessdata repository is huge, and:

ZIP format had a 4 GB limit on various things (uncompressed size of a file, compressed size of a file, and total size of the archive), as well as a limit of 65,535 entries in a ZIP archive.

So it seems you chose the wrong technology. Use git clone --depth 1 https://github.com/tesseract-ocr/tessdata.git to download all the data.
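
The command above clones the default branch. A hedged variant that pins the release tag from the original report is sketched here; git's --branch option accepts a tag name as well as a branch name. shallow_clone is a hypothetical wrapper, and pinning to 4.1.0 is an assumption about what the reader wants:

```shell
# Hypothetical wrapper around a shallow clone of a single tag or branch.
# $1 = repository URL, $2 = tag or branch name, $3 = target directory
shallow_clone() {
    git clone --quiet --depth 1 --branch "$2" "$1" "$3"
}

# Usage (the tag name 4.1.0 is taken from the release URL in this issue):
# shallow_clone https://github.com/tesseract-ocr/tessdata.git 4.1.0 tessdata
```

With --depth 1 only the most recent commit is fetched, so the history of the large data files is not downloaded.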

Contributor

stweil commented Jun 15, 2022

The tar.gz file is about 565 MiB. The zip file is 543 MiB, far below the limit, but that size seems too small.
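
To make the comparison concrete, the sizes can be converted to bytes with shell arithmetic. A small sketch, assuming the quoted 4 GB limit means 2^32 bytes (strictly 2^32 - 1) and a shell with 64-bit arithmetic; the variable names are illustrative only:

```shell
mib=$((1024 * 1024))
zip_limit=$((4 * 1024 * mib))    # 4 GiB expressed in bytes
zip_size=$((543 * mib))          # observed zip download
targz_size=$((565 * mib))        # observed tar.gz download

# Integer percentage of the classic ZIP size limit used by the zip download.
echo "zip: $zip_size bytes, $((zip_size * 100 / zip_limit))% of the limit"
echo "tar.gz: $targz_size bytes"
```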

git clone is a good alternative which has several advantages compared to downloading a zip or tar.gz file.

mabrydozier commented

I am also seeing the same issue with 4.0.0. I opened an issue earlier, but don't see it now.

Is the root problem understood? I don't think the zip file we have been downloading has been updated in some time.

Our build system downloads the file quite often, so the issue must have appeared very recently, within the last few days.

Also, we had similar issues with the tar.gz file.

Thanks,

Mabry

Contributor

stweil commented Jun 15, 2022

> Is the root problem understood?

Ask Microsoft, the owners of GitHub. The zip files are created automatically by GitHub.

> Also, we had similar issues with the tar.gz file.

In my test tar.gz worked fine.
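
Since reports on the tar.gz differ, it may help to test each downloaded file before relying on it. A minimal sketch, assuming gzip is installed; targz_ok is a hypothetical helper name. gzip -t checks the compressed stream, and a download that was cut short fails that check:

```shell
# Hypothetical helper: succeed only if the gzip stream in FILE is complete.
targz_ok() {
    gzip -t "$1" 2>/dev/null
}

# Usage (the file name is an assumption based on the tag name):
# targz_ok ./4.1.0.tar.gz && tar -xzf ./4.1.0.tar.gz
```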

Contributor

stweil commented Jun 15, 2022

I have now tried several downloads of the zip file using wget. The resulting files had different sizes, although there was no download error. All downloads finished after 1:40, so maybe the GitHub servers have a fixed time limit.

Contributor

stweil commented Jun 15, 2022

More tests confirmed that the download terminates after exactly 100 seconds, resulting in a partial file.
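
The timing observation can be reproduced with a small wrapper that reports how long a download ran and how many bytes arrived. A sketch, assuming a POSIX shell; timed_download is a hypothetical helper, and the download command is passed in as arguments so that wget can be swapped for another tool:

```shell
# Hypothetical helper: run a command that writes OUTFILE, then report the
# command's exit status, elapsed wall-clock seconds, and the file size.
# $1 = output file, remaining arguments = command to run
timed_download() {
    out=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    size=$(wc -c < "$out" | tr -d ' ')
    echo "exit=$status seconds=$((end - start)) bytes=$size"
}

# Usage: repeated runs that all stop near seconds=100 with differing byte
# counts would match the behaviour described in this thread.
# timed_download 4.1.0.zip wget -q -O 4.1.0.zip https://github.com/tesseract-ocr/tessdata/archive/refs/tags/4.1.0.zip
```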

Contributor

stweil commented Jun 15, 2022

I am closing this issue because we cannot solve it, but git clone is a good alternative way to get the data (see above).

stweil closed this as not planned on Jun 15, 2022

Author

V-kto commented Jun 17, 2022

Hi!

Thank you for your reply. I will use git clone instead. As for the tar.gz, as far as I remember it did not work either, but I'm not sure.

Anyway, the main thing is that I can use tessdata somehow!

Thank you for your explanations and your time :)
