Hi. I'm trying to gather basic description and address information for some business pages on Yahoo Finance. I was able to use the Portia interface to pull metadata successfully for pages such as http://biz.yahoo.com/ic/42/42034.html. However, when I go to the main page that lists links to all businesses in the same domain, http://biz.yahoo.com/ic/774_cl_all.html, all of the business links are highlighted in red. I believe this is because they are listed under another domain, for example http://us.rd.yahoo.com/finance/industry/front/industrynav/423/*http://biz.yahoo.com/ic/423.html. I've tried writing a regex for the allowed URLs, but I believe the fact that the links are not on the biz.yahoo.com domain is preventing the link extractor from picking them up. Any thoughts on a fix?
You are right: the links are shown in red because only in-domain links are followed. Please try this:
In the spider configuration, go to Start URLs and add http://us.rd.yahoo.com
That should work for you, let me know how it goes.
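To illustrate why the links show up as off-domain, here is a minimal Python sketch of a host-based check. It is not Portia's actual code: `host_of`, `is_in_domain`, and `unwrap_redirect` are hypothetical helpers written for this example, and the assumption that the real target is whatever follows the `*` is based only on the link quoted in the question.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase hostname of a URL (empty string if none)."""
    return (urlparse(url).hostname or "").lower()


def is_in_domain(url, allowed_hosts):
    """True if the URL's host equals, or is a subdomain of, an allowed host."""
    host = host_of(url)
    return any(host == h or host.endswith("." + h) for h in allowed_hosts)


def unwrap_redirect(url):
    """Return the part after '*' if it looks like a URL, else the input unchanged.

    Assumption: the redirect link embeds its target after a '*', as in the
    example link from the question.
    """
    _, sep, target = url.partition("*")
    return target if sep and target.startswith("http") else url


link = ("http://us.rd.yahoo.com/finance/industry/front/industrynav/423/"
        "*http://biz.yahoo.com/ic/423.html")

# The link's host is us.rd.yahoo.com, so a biz.yahoo.com-only filter rejects it.
print(host_of(link))
print(is_in_domain(link, {"biz.yahoo.com"}))
# Allowing us.rd.yahoo.com as well lets the same link through.
print(is_in_domain(link, {"biz.yahoo.com", "us.rd.yahoo.com"}))
# The embedded target is back on biz.yahoo.com.
print(unwrap_redirect(link))
```

The check looks only at the host of the link itself, not at the URL embedded after the `*`, which is why a regex on the path alone would not help if the domain is filtered first.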