--urlFile option errors #55

Closed
MattoElGato opened this issue Jun 16, 2021 · 3 comments
@MattoElGato

When I try to use the --urlFile option, I get the error: Missing required argument: url.

Command:
docker run -v $PWD/urlFile.txt:/app/urlFile.txt -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --urlFile urlFile.txt

It looks like running the command above creates urlFile.txt as a directory, so there is no file for the crawler to use as a seed list, hence the error (see second paragraph here).
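A minimal sketch of a pre-flight check for the situation described above. When the host path given to -v does not exist, Docker creates it as a directory, so it can help to confirm that the seed list is a regular, non-empty file before starting the container. The function name check_seed_file is hypothetical and not part of browsertrix-crawler or Docker.

```shell
# Hypothetical helper (not part of browsertrix-crawler): verify that the
# seed list is a regular, non-empty file before bind-mounting it.
check_seed_file() {
  seed="$1"
  if [ -d "$seed" ]; then
    # A missing host path in a -v bind mount is created as a directory.
    echo "error: $seed is a directory, not a file" >&2
    return 1
  fi
  if [ ! -f "$seed" ]; then
    echo "error: $seed does not exist" >&2
    return 1
  fi
  if [ ! -s "$seed" ]; then
    echo "error: $seed is empty" >&2
    return 1
  fi
  return 0
}

# Example use (commented out so the sketch runs without Docker):
# check_seed_file "$PWD/urlFile.txt" && docker run -v "$PWD/urlFile.txt:/app/urlFile.txt" ...
```

If the check reports a directory, removing it and creating the file on the host before the next run avoids the problem without changing the Dockerfile.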

I added the line RUN touch urlFile.txt to a local copy of the Dockerfile, which appears to solve that issue, but then I run into another error: pages/pages.jsonl creation failed [Error: ENOENT: no such file or directory, mkdir '/crawls/collections/testcollection/pages'].

Am I missing something obvious?
Thanks

@emmadickson
Contributor

emmadickson commented Jun 20, 2021

Could you try running

docker run -v $PWD/urlFile.txt:/app/urlFile.txt -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler:0.4.0-beta.1 crawl --urlFile /app/urlFile.txt --scope ".*"

I'm going to try to improve our documentation on the urlFile option so that it's clearer.

@MattoElGato
Author

That worked perfectly, thank you. I initially ran the command with --scope and it started trying to archive lots of other sites, so I ran it without and the crawler stuck to my list.
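A small illustration of why a pattern like ".*" is so broad: as a regular expression it matches any string, including URLs on unrelated sites. The sketch below only demonstrates regex matching with grep -E; the helper url_matches and the example URLs are hypothetical, and the crawler's own scope handling (and its regex dialect) may differ.

```shell
# Hypothetical helper: succeeds if the URL ($2) matches the extended
# regular expression ($1) anywhere in the string.
url_matches() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# ".*" matches every URL, including ones on unrelated sites.
url_matches '.*' 'https://unrelated.example.org/page' && echo "in scope"

# A pattern tied to one host rejects URLs on other hosts.
url_matches '^https://example\.com/' 'https://unrelated.example.org/page' \
  || echo "out of scope"
```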

Also, I ran the 0.4.0-beta.0 version (I couldn't pull 0.4.0-beta.1, is it private atm?).

Command I ran:
docker run -v $PWD/urlFile.txt:/app/urlFile.txt -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler:0.4.0-beta.0 crawl --urlFile /app/urlFile.txt

@ikreymer
Member

This should be fixed now, and we've also updated the docs on using urlFile. Beta 2 has been released and the 0.4.0 release is coming soon. We now also have a YAML config for specifying seed URLs. Feel free to open a new issue if things are unclear or not working as expected!
