
Next page navigation #439

Closed
agramesh opened this issue Mar 29, 2016 · 16 comments
@agramesh

Does the new Portia scrape next-page data?
I built a spider in Portia with the next-page recorded option, running in Docker, but when I deployed it to scrapyd no data was scraped. I got output like:

2016-03-29 15:57:55 [scrapy] INFO: Spider opened
2016-03-29 15:57:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-29 15:57:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-29 15:58:32 [scrapy] DEBUG: Retrying <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (failed 1 times): 504 Gateway Time-out
2016-03-29 15:58:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-29 15:59:07 [scrapy] DEBUG: Retrying <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (failed 2 times): 504 Gateway Time-out
2016-03-29 15:59:38 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:39 [scrapy] DEBUG: Filtered offsite request to 'www.successfactors.com': <GET http://www.successfactors.com/>
2016-03-29 15:59:51 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:51 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:52 [scrapy] DEBUG: Filtered offsite request to 'www.cpchem.com': <GET http://www.cpchem.com/>
2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:55 [scrapy] INFO: Crawled 7 pages (at 7 pages/min), scraped 0 items (at 0 items/min)
2016-03-29 15:59:57 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:58 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:59 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:00 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:00 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:03 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:04 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:04 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)

Scrapyd and Splash are both running locally. Any idea what is causing this issue?

Thanks in advance.

@ruairif
Contributor

ruairif commented Jun 7, 2016

Did you use a JavaScript action for the next-page link?

@agramesh
Author

agramesh commented Jun 7, 2016

I'm using a page action for the next-page link.
(screenshot: pageaction)
How do I use this page action to navigate to the next page? Is it working correctly? How will it execute when the spider is deployed to scrapyd?

@ruairif
Contributor

ruairif commented Jun 7, 2016

You need to configure a Splash instance. You will also need to add a setting to your project's settings.py file that points to the Splash instance:

SPLASH_URL = 'http://127.0.0.1:8050'

You also need to make sure that slybot.pageactions.PageActionsMiddleware is enabled for your spider.
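Putting both pieces together, a minimal settings.py might look like the sketch below. The middleware priority value (950) is an assumption for illustration, not a value taken from this thread; adjust it to fit your project's middleware stack.

```python
# settings.py -- sketch combining the two settings mentioned above.
# The Splash instance is assumed to be listening locally on port 8050.
SPLASH_URL = 'http://127.0.0.1:8050'

# Enable the slybot page-actions downloader middleware.
# The priority (950) is an assumed value, not taken from this thread.
DOWNLOADER_MIDDLEWARES = {
    'slybot.pageactions.PageActionsMiddleware': 950,
}
```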

@agramesh
Author

agramesh commented Jun 7, 2016

Yes, I configured the Splash instance in the spider settings file and it's working:

(screenshot: pageaction1)

But the page-action middleware is not triggered, even though I added it to the settings file.

(screenshot: page)

@agramesh
Author

agramesh commented Jun 9, 2016

I receive an error like this while running the spider with a page action:

Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 32, in process_request
response = method(request=request, spider=spider)
File "/home/portia/portia/slybot/slybot/pageactions.py", line 44, in process_request
events = spider.page_actions
AttributeError: 'SlybotSpider' object has no attribute 'page_actions'
2016-06-09 17:51:09 [scrapy] INFO: Closing spider (finished)
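For what it's worth, the traceback shows the middleware reading `spider.page_actions` unconditionally, which raises AttributeError for spiders deployed without recorded actions. A defensive lookup (a hypothetical sketch, not the actual slybot code) would degrade to "no actions" instead of killing the crawl:

```python
class SpiderWithoutActions:
    """Stand-in for a SlybotSpider deployed without recorded page actions."""
    pass

def get_page_actions(spider):
    # slybot/pageactions.py line 44 does `events = spider.page_actions`,
    # which raises AttributeError for spiders like the one above.
    # Falling back to an empty list keeps the middleware a no-op instead.
    return getattr(spider, 'page_actions', [])

print(get_page_actions(SpiderWithoutActions()))  # → []
```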

@agramesh
Author

Hi, any updates on this?

@agramesh
Author

Hi,
when I added the page action to the spider, the page-action middleware was enabled, but only the first page's data was scraped in scrapyd. Is there any way to go to the next page with a click action? Or how can we deploy the spider with the page action? I tried portiacrawl as well, but got no response from the next page.
(screenshots: click, pageaction)

@AlexTan-b-z

My problem is the same as yours:
only the first page's data is scraped in scrapyd.
Have you solved the problem?

@agramesh
Author

Hi AlexTan-b-z,
I'm still analyzing this page action. I'm trying to implement a Selenium script middleware for navigation. Do you have any ideas on this?

@AlexTan-b-z

Hi agramesh,
I know of a pagination feature:
if the spider is run with the setting AUTO_PAGINATION=1, the algorithm kicks in and follows pagination links, but the spider runs very slowly once it is set.
The page action, however, doesn't seem to work. I have tried every function in the page action and none of them work either. I even doubt whether the 'click' in the page action can be used for page turning.
Have you run the page action? What is displayed on the console?
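Based on the comment above, enabling the fallback pagination algorithm would look like the fragment below. Treat this as an illustration of the described behaviour, not verified slybot documentation.

```python
# slybot/settings.py -- sketch. With AUTO_PAGINATION enabled, the
# pagination algorithm described above follows next-page links
# automatically, at the cost of a noticeably slower crawl.
AUTO_PAGINATION = 1
```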

@agramesh
Author

Hi AlexTan-b-z,

If I run the spider with the settings file configured as above, the page-action middleware is merely triggered, nothing more. I haven't checked with AUTO_PAGINATION=1.
Where did you configure it? In the settings.py file? Let me check what happens.
Thanks for the update.

@AlexTan-b-z

The slybot/settings.py file.

@AlexTan-b-z

Hi agramesh,

Have you found anything?

@matthieu637

+1

@AlexTan-b-z

Hi!
I have configured the Splash instance and set the SPLASH_URL.
I find that Splash makes "Enable JS" work.
But when I set a page action to crawl pages, the page action doesn't seem to work.
I know page actions are meant to handle AJAX pages, but they don't appear to work either.
I set a page action that scrolls and waits: because of AJAX, the page must be scrolled before its data loads. Using Selenium with PhantomJS I can find the data.
Did I miss something? Should I set something else in settings.py, or something for Splash?

@ruairif
Contributor

ruairif commented Jan 19, 2017

You need to set the SPLASH_URL in your settings to point to your Splash instance.

@ruairif ruairif closed this as completed Jan 19, 2017