Crawler Execution Failed #207
Comments
For the first problem, you were probably configuring the Docker volume at the wrong directory, but you seem to have already fixed that. For the second screenshot, the crawler ignores non-English pages by default. You can disable this feature by adding the following to the ache.yml file:
# Store only pages that contain english text using language detector
target_storage.english_language_detection_enabled: false
The sample config file at https://github.com/VIDA-NYU/ache/blob/master/config/sample_config/ache.yml has other configurations that may be useful.
The crawler also ignores non-HTML content by default (e.g., the JPG images seen in the log). To allow other types of content, you need to add the following config to ache.yml (including any other MIME types that you need):
crawler_manager.downloader.valid_mime_types:
- text/xml
- text/html
- text/plain
- application/x-asp
- application/xhtml+xml
- application/vnd.wap.xhtml+xml
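For reference, here is a rough sketch of how the two settings mentioned in this thread might sit together in a single ache.yml. The `image/jpeg` entry is only an illustrative assumption (the log showed JPG images being skipped); list whichever MIME types your crawl actually needs, and check the sample config for the exact options your ACHE version supports.

```yaml
# ache.yml (sketch combining the two settings discussed in this thread)

# Keep pages in any language, not only those detected as English
target_storage.english_language_detection_enabled: false

# MIME types the downloader is allowed to fetch
crawler_manager.downloader.valid_mime_types:
  - text/xml
  - text/html
  - text/plain
  - application/x-asp
  - application/xhtml+xml
  - application/vnd.wap.xhtml+xml
  - image/jpeg   # assumption: example of an extra type, not part of the list above
```

After editing the file, the crawler has to be restarted with the updated configuration directory for the changes to take effect.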
Dear Aecio, thank you for the response. Let me try this and I will come back to this thread, so please don't close it.
Closing this issue. Feel free to open another issue if you find other problems.
Hi,
I'm new to the ACHE crawler and am trying to run the sample to see how the crawler collects data, so that I can then run my own crawl, but it is giving me the error below. I'm running on CentOS 7 with Docker.
Please help.