Crawler Execution Failed #207

Closed
fatimasadiq opened this issue Jul 6, 2021 · 5 comments
@fatimasadiq

Hi,
I'm new to the ACHE crawler and am trying to run the sample to see how the crawler collects data, so that I can then run my own crawl, but it is giving me the error below. I'm running on CentOS 7 with Docker.

Please help.

image

@fatimasadiq
Author

Hi, now I'm getting the attached output while running the crawler, and nothing is downloaded.
Screenshot 2021-07-06 at 14 49 03

@aecio
Member

aecio commented Jul 6, 2021

For the first problem, you were probably configuring the Docker volume at the wrong directory, but you seem to have already fixed it.

For the second screenshot, the crawler ignores non-English pages by default. You can disable this feature by adding the following to the ache.yml file:

# Store only pages that contain english text using language detector
target_storage.english_language_detection_enabled: false

The sample config file at https://github.com/VIDA-NYU/ache/blob/master/config/sample_config/ache.yml has other configurations that may be useful.

@aecio aecio added the question label Jul 6, 2021
@aecio
Member

aecio commented Jul 6, 2021

The crawler also ignores non-HTML content by default (e.g., jpg images as seen in the log). To allow other types of content, you need to add the following config on ache.yml (including other mime-types that you need):

crawler_manager.downloader.valid_mime_types:
 - text/xml
 - text/html
 - text/plain
 - application/x-asp
 - application/xhtml+xml
 - application/vnd.wap.xhtml+xml
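
For reference, the two settings mentioned in this thread could be combined in a single ache.yml roughly as sketched below. This is only an illustration: the `image/jpeg` entry is an assumed example of an extra MIME type (chosen because jpg images were reported as ignored), and you would list whichever types your crawl actually needs.

```yaml
# Sketch of an ache.yml fragment combining both settings from this thread.

# Do not restrict stored pages to those detected as English
target_storage.english_language_detection_enabled: false

# MIME types the downloader accepts
crawler_manager.downloader.valid_mime_types:
 - text/xml
 - text/html
 - text/plain
 - application/x-asp
 - application/xhtml+xml
 - application/vnd.wap.xhtml+xml
 - image/jpeg   # assumption: added here only as an example of an extra type
```

The sample config file linked above remains the place to check the exact keys and defaults.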

@fatimasadiq
Author

Dear Aecio,

Thank you for the response. Let me try this and I will come back to this thread, so please don't close it.

@aecio
Member

aecio commented May 22, 2022

Closing this issue. Feel free to open another issue if you find other problems.

@aecio aecio closed this as completed May 22, 2022