URL allowed domain issue blocking onsite links #31

tyler-foxworthy · 2014-04-30T16:47:21Z

Hi. I'm trying to gather basic description and address information for some business pages on Yahoo Finance. I was able to use the Portia interface to pull metadata successfully from pages such as http://biz.yahoo.com/ic/42/42034.html. However, when I go to the main page that lists links to all businesses in the same domain, http://biz.yahoo.com/ic/774_cl_all.html, all of the business links are highlighted in red. I believe this is because they are listed under another domain, for example http://us.rd.yahoo.com/finance/industry/front/industrynav/423/*http://biz.yahoo.com/ic/423.html. I've tried writing a regex for the allowed URLs, but I believe the fact that the links are not on the biz.yahoo.com domain is preventing them from being picked up by the link extractor. Any thoughts on a fix?

duendex · 2014-04-30T17:44:30Z

You are right: they are in red because only in-domain links will be followed. Please try this:
in the spider configuration, go to Start URLs and add http://us.rd.yahoo.com so that links on that host are treated as in-domain.
That should work for you; let me know how it goes.

Cheers
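
For anyone who wants to see why the links count as off-domain, here is a minimal standard-library sketch. It is not Portia's actual code: the helper names (`host_of`, `is_in_domain`, `unwrap_redirect`) are hypothetical, and the matching rule (exact host or subdomain of a start URL's host) is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_in_domain(link, allowed_hosts):
    """True if the link's host equals, or is a subdomain of, an allowed host.

    Assumed rule for illustration only; Portia's real check may differ.
    """
    host = host_of(link)
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


def unwrap_redirect(link):
    """Return the URL embedded after '*' in a redirector link, else the link."""
    _, sep, target = link.partition("*")
    if sep and target.startswith(("http://", "https://")):
        return target
    return link


start_url = "http://biz.yahoo.com/ic/774_cl_all.html"
link = (
    "http://us.rd.yahoo.com/finance/industry/front/industrynav/423/"
    "*http://biz.yahoo.com/ic/423.html"
)

# With only the original start URL, the redirector host is off-domain.
allowed = {host_of(start_url)}
blocked_before = not is_in_domain(link, allowed)

# Adding the redirector as a start URL makes its host allowed as well.
allowed.add(host_of("http://us.rd.yahoo.com"))
followed_after = is_in_domain(link, allowed)

# The real target is embedded after the '*' and is already on biz.yahoo.com.
target = unwrap_redirect(link)
```

The link's hostname is `us.rd.yahoo.com`, even though the page it eventually leads to lives on `biz.yahoo.com`, which is why a URL regex alone does not help if the domain check runs first.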

tpeng added wontfix and removed wontfix labels Jul 9, 2014

jpswade mentioned this issue Jul 29, 2014

Links with query strings appear as red links when configuring follow patterns #82

Closed

tpeng closed this as completed Oct 26, 2014
