Resources returned extremely slowly for large collection. #573

Open
jswrenn opened this issue Jul 1, 2020 · 9 comments

jswrenn commented Jul 1, 2020

Describe the bug

Resources are returned extremely slowly (~3 minutes) for a large collection (34 GB, >1M records). While the page is loading, exactly one core of the server's CPU goes to 100% utilization.

Steps to reproduce the bug

Unfortunately, I'm not permitted to share the archive as it includes sensitive personal information.

Expected behavior

Resources are returned quickly.

Screenshots

Here's a py-spy flamegraph of wayback handling a single request initiated by curl: https://jswrenn.com/misc/pywb_573-profile.svg

Environment

  • OS: Ubuntu 18.04
  • HW: DigitalOcean VPS with 6 cores, 16GB of memory, SSD.
  • pywb version: 2.4.1

jswrenn commented Jul 1, 2020

I've updated the issue to include a flamegraph of wayback handling a single request, initiated by curl. A substantial amount of time appears to be spent searching the index.
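
For context on why index search time matters: a CDXJ index is a sorted plain-text file, so a lookup can in principle be a binary search over byte offsets instead of a linear scan. The following is a minimal sketch of that idea only, not pywb's actual implementation; the function name `find_first_line` is hypothetical.

```python
import io
from typing import BinaryIO, Optional


def find_first_line(f: BinaryIO, prefix: bytes) -> Optional[bytes]:
    """Return the first line of a bytewise-sorted file that starts with prefix."""
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell()
    # Find the smallest offset whose *following* line sorts at or after prefix.
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()            # skip the (possibly partial) line containing mid
        line = f.readline()     # first complete line after mid, b"" at EOF
        if line and line < prefix:
            lo = mid + 1
        else:
            hi = mid
    f.seek(lo)
    first = f.readline()
    # At offset 0 the line we just read is complete and may itself be the answer.
    candidate = first if (lo == 0 and first >= prefix) else f.readline()
    return candidate if candidate.startswith(prefix) else None
```

Each probe reads at most two lines, so a lookup touches only a logarithmic number of positions in the file regardless of its size.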

ikreymer commented Jul 1, 2020

Thanks for including this! What does the indexes directory look like? Are there multiple cdxj files in there, or a single index?

jswrenn commented Jul 1, 2020

Multiple index files:

indexes/
├── [596M]  autoindex.cdxj
├── [118M]  autoindex.cdxj.tmp.20200629011258909273
├── [   0]  autoindex.cdxj.tmp.20200629152048161219
├── [   0]  autoindex.cdxj.tmp.20200630020638761945
└── [611M]  index.cdxj
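
The listing above includes several `.tmp.` files, two of them empty. Whether those files affect lookups is not established in this thread, but a small helper makes them easy to spot. This is a minimal sketch; `summarize_indexes` and its labels are hypothetical and not part of pywb.

```python
from pathlib import Path


def summarize_indexes(index_dir):
    """Return (name, size_in_bytes, note) for each file in index_dir, sorted by name."""
    rows = []
    for path in sorted(Path(index_dir).iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if ".tmp." in path.name:
            note = "leftover temp file"   # assumption: name pattern as in the listing
        elif size == 0:
            note = "empty"
        else:
            note = ""
        rows.append((path.name, size, note))
    return rows
```

Calling `summarize_indexes("indexes/")` on the collection directory would return one tuple per file, with a non-empty note for anything worth a second look.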

jswrenn commented Jul 2, 2020

I deleted the existing indices and re-indexed the collection, but there was no improvement.

ikreymer commented Jul 2, 2020

Hm, it seems like it should still work at that size in pywb, and it'll definitely work with a compressed index..
If there's any way you can share the example privately, I can try to debug further.. but a couple of things you can try:

You can make a compressed index as explained here: https://github.com/ikreymer/webarchive-indexing#building-a-local-cluster.

It's a bit old (I'm trying to build new tools to generate the compressed index), but essentially you can run:
python build_local_zipnum.py -s 1 -l 300 ./zip/ ./cdx/path/to
and then copy the contents of ./zip/ into the indexes directory (and remove the uncompressed index). This requires Python 2.7 at the moment.
I'll let you know when there's an updated tool to create this compressed index.
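
To show roughly what a compressed (ZipNum-style) index buys you, here is a minimal sketch of the general idea: sorted lines are grouped into fixed-size blocks, each block is gzip-compressed, and a small summary of first keys is searched to pick which block to decompress. It is not the format written by build_local_zipnum.py; `build_blocks` and `lookup` are hypothetical names, and the default of 300 lines per block assumes that is what `-l 300` means.

```python
import bisect
import gzip


def build_blocks(lines, lines_per_block=300):
    """Compress sorted lines into concatenated gzip blocks plus a summary."""
    data = bytearray()
    summary = []  # one (first_line, offset, length) entry per block
    for i in range(0, len(lines), lines_per_block):
        chunk = lines[i:i + lines_per_block]
        block = gzip.compress(b"".join(chunk))
        summary.append((chunk[0], len(data), len(block)))
        data += block
    return bytes(data), summary


def lookup(data, summary, prefix):
    """Return lines starting with prefix, decompressing only candidate blocks."""
    firsts = [first for first, _, _ in summary]
    # Matches can begin in the last block whose first line sorts <= prefix.
    i = max(bisect.bisect_right(firsts, prefix) - 1, 0)
    matches = []
    while i < len(summary):
        first, offset, length = summary[i]
        if first > prefix and not first.startswith(prefix):
            break  # this block and all later ones sort past the prefix range
        block = gzip.decompress(data[offset:offset + length])
        matches += [ln for ln in block.splitlines(keepends=True)
                    if ln.startswith(prefix)]
        i += 1
    return matches
```

The summary stays small enough to search quickly, and a typical lookup decompresses a single block rather than scanning the whole index.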

Another option is to use OutbackCDX, which many folks have been using with pywb:
https://github.com/nla/outbackcdx

jswrenn commented Jul 2, 2020

> If there's any way you can share the example privately, I can try to debug further..

Would sharing the index be sufficient? (There's a bunch of FERPA-restricted and NDA-restricted material in the actual archive, so I'm not easily able to share that.)

ikreymer commented Jul 2, 2020

Sure, I can see if I can find something, or at least compress it, and you can try the compressed version also.
You can send me a link via email instead of attaching here..

> If there's any way you can share the example privately, I can try to debug further..
>
> Would sharing the index be sufficient? (There's a bunch of FERPA-restricted and NDA-restricted material in the actual archive, so I'm not easily able to share that.)

Yes, that will help with debugging.. I can also compress it and then you can try out the compressed version too.

jswrenn commented Jul 2, 2020

Thanks!!! I'll send that email momentarily.


wdcs-nikhilvibhani commented Jul 15, 2020

Hello,

I am using pywb to handle http.mydomain.com/https:google.com. It is working fine, but those websites take a long time to load.

Can anyone please help me to make it faster?

Thank you
