Encoding problems #21

Closed
wuiscmc opened this issue Nov 8, 2013 · 5 comments

Comments

@wuiscmc

wuiscmc commented Nov 8, 2013

Regardless of whether I use Sidekiq or Resque, I always get this error:

crawl_id: fdc9cd1655a54b3d303e2f38a916cc114c9be2c7
url: https://github.com/stewartmckee/cobweb/blob/master/.ruby-version
processing_queue: CrawlerResqueJob
crawl_finished_queue: CrawlerFinishedJob
internal_urls:
- https://github.com/stewartmckee/cobweb/blob/master/*
debug: true
raise_exceptions: true
redis_options:
  host: localhost
  port: '6379'
use_encoding_safe_process_job: false
follow_redirects: true
redirect_limit: 10
queue_system: resque
quiet: true
cache: 300
cache_type: crawl_based
timeout: 10
external_urls: []
seed_urls: []
first_page_redirect_internal: true
text_mime_types:
- text/*
- application/xhtml+xml
obey_robots: false
user_agent: cobweb/1.0.18 (ruby/1.9.3 nokogiri/1.6.0)
valid_mime_types:
- ! '*/*'
store_inbound_links: false
crawl_limit_by_page: false
parent: https://github.com/stewartmckee/cobweb/blob/master/
Exception
Encoding::UndefinedConversionError
Error
"\xC2" from ASCII-8BIT to UTF-8

The only workaround that makes this crawler work is to run it from inside Rails, which is a pity since I planned to build a service, without Rails, that integrates this crawler into my project.

Sidekiq doesn't work from inside Rails either...

On the other hand, with Resque this error does not occur when the encoding flag is set, but then the process job is not executed.

@wuiscmc
Author

wuiscmc commented Nov 9, 2013

Seems to be working fine in
https://github.com/stewartmckee/cobweb/pulls

wuiscmc closed this as completed Nov 10, 2013

@hallmatt

I'm having the same issue, but within Rails. The error reads:

"\xEF" from ASCII-8BIT to UTF-8

The exception is:

Encoding::UndefinedConversionError

Any thoughts?

@stewartmckee
Owner

This is usually because the character encoding specified by the server (either in the headers or in the content itself) does not match the content that is actually on the page. We should probably add a bit more logic around how this state is handled. Is the URL you are requesting public? If so, could you post it for me to have a look at?
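
The mismatch described above can be handled defensively before the body is passed on. The following is a minimal sketch, not cobweb's actual code: `to_utf8` is a hypothetical helper, and trying UTF-8 before the declared charset is an assumed heuristic (UTF-8 is self-validating, whereas single-byte charsets accept almost any byte sequence). It relies on `String#scrub`, so it assumes Ruby 2.1 or later.

```ruby
# Hypothetical helper (not part of cobweb): return the body as valid UTF-8.
# `declared_charset` stands for whatever the server announced, if anything.
def to_utf8(body, declared_charset = nil)
  # Work on a copy so the caller's string keeps its original encoding tag.
  text = body.dup

  # Assumed heuristic: try UTF-8 first, then the declared charset.
  candidates = ["UTF-8", declared_charset].compact.uniq

  candidates.each do |name|
    begin
      encoding = Encoding.find(name)
    rescue ArgumentError
      next # unknown charset label; try the next candidate
    end
    next if encoding.dummy? # e.g. UTF-16 without byte order; not convertible here

    # Relabel the raw bytes without changing them, then check they are valid.
    text.force_encoding(encoding)
    if text.valid_encoding?
      return text.encode("UTF-8", invalid: :replace, undef: :replace)
    end
  end

  # Last resort: treat the bytes as UTF-8 and replace invalid ones with U+FFFD.
  text.force_encoding("UTF-8").scrub
end

puts to_utf8("caf\xC3\xA9".b, "ISO-8859-1") # bytes are really UTF-8
puts to_utf8("caf\xE9".b, "ISO-8859-1")     # bytes are really Latin-1
```

Both calls return the same valid UTF-8 string even though only one of them matches the declared charset. The last-resort branch replaces bytes and therefore loses information, so it trades fidelity for not raising.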

@hallmatt

Sounds great. I'll send you a link via email. It is a public site.

@SimonBirrell

I'm getting this too:

"\xC3" from ASCII-8BIT to UTF-8

for the URL

http://www.segurocontraroubo.com.br/wp-content/themes/segurocontraroubo/javascripts/add.js
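
The bytes reported in this thread (`\xC2`, `\xEF`, `\xC3`) are all valid lead bytes of multi-byte UTF-8 sequences, which is consistent with, though not proof of, UTF-8 content that is tagged as binary (ASCII-8BIT). The sketch below reproduces the same class of error in plain Ruby, without cobweb; the sample string is a hypothetical stand-in for a fetched body, not taken from the URL above.

```ruby
# Hypothetical sample: UTF-8 bytes tagged as binary, as raw socket reads
# often are. \xC3\xA7 is the UTF-8 encoding of "ç".
raw = "seguran\xC3\xA7a".b

error = nil
begin
  # Transcoding binary to UTF-8 fails on the first byte above 0x7F,
  # because binary has no defined mapping to Unicode characters.
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  error = e
end
puts error.class

# Relabeling (not transcoding) keeps the bytes and only changes the tag.
relabeled = raw.dup.force_encoding("UTF-8")
puts relabeled.valid_encoding?
puts relabeled
```

The distinction is that `encode` converts bytes from one encoding to another, while `force_encoding` only changes how the existing bytes are interpreted. Relabeling is only safe when the bytes really are in the target encoding, which is why the `valid_encoding?` check matters.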
