
Next page navigation #439

Closed
agramesh opened this issue Mar 29, 2016 · 16 comments
@agramesh

Does the new Portia scrape next-page data?
I built a spider in Portia with the next-page recorded option, running in Docker, but when I deployed it to scrapyd no data was scraped. I got output like:

2016-03-29 15:57:55 [scrapy] INFO: Spider opened
2016-03-29 15:57:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-29 15:57:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-29 15:58:32 [scrapy] DEBUG: Retrying <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (failed 1 times): 504 Gateway Time-out
2016-03-29 15:58:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-29 15:59:07 [scrapy] DEBUG: Retrying <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (failed 2 times): 504 Gateway Time-out
2016-03-29 15:59:38 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:39 [scrapy] DEBUG: Filtered offsite request to 'www.successfactors.com': <GET http://www.successfactors.com/>
2016-03-29 15:59:51 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:51 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:52 [scrapy] DEBUG: Filtered offsite request to 'www.cpchem.com': <GET http://www.cpchem.com/>
2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:55 [scrapy] INFO: Crawled 7 pages (at 7 pages/min), scraped 0 items (at 0 items/min)
2016-03-29 15:59:57 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:58 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 15:59:59 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:00 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:00 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:03 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:04 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
2016-03-29 16:00:04 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)

Scrapyd and Splash are both running locally. Any idea what is causing this issue?

Thanks in advance.

@ruairif
Contributor

ruairif commented Jun 7, 2016

Did you use a JavaScript action for the next-page link?

@agramesh
Author

agramesh commented Jun 7, 2016

I'm using a page action for the next-page link.
(screenshot: pageaction)
How do I use this page action to navigate to the next page? Is it working correctly? How will it execute when the spider is deployed to scrapyd?

@ruairif
Contributor

ruairif commented Jun 7, 2016

You need to configure a Splash instance. You will also need to add a setting to your project's settings.py file that points to the Splash instance:

SPLASH_URL = 'http://127.0.0.1:8050'

You also need to make sure that slybot.pageactions.PageActionsMiddleware is enabled for your spider.
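Putting both pieces together, a minimal settings.py might look like the sketch below. The middleware priority value (950) is an assumption for illustration, not a value taken from this thread; adjust it to fit your project's middleware stack.

```python
# settings.py -- sketch combining the two settings mentioned above.
# The Splash instance is assumed to be listening locally on port 8050.
SPLASH_URL = 'http://127.0.0.1:8050'

# Enable the slybot page-actions downloader middleware.
# The priority (950) is an assumed value, not taken from this thread.
DOWNLOADER_MIDDLEWARES = {
    'slybot.pageactions.PageActionsMiddleware': 950,
}
```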

@agramesh
Author

agramesh commented Jun 7, 2016

Yes, I configured the Splash instance in the spider settings file and it's working:

(screenshot: pageaction1)

But the page-action middleware is not triggered, even though I added it to the settings file.

(screenshot: page)

@agramesh
Author

agramesh commented Jun 9, 2016

I receive an error like this while running the spider with a page action:

Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 32, in process_request
response = method(request=request, spider=spider)
File "/home/portia/portia/slybot/slybot/pageactions.py", line 44, in process_request
events = spider.page_actions
AttributeError: 'SlybotSpider' object has no attribute 'page_actions'
2016-06-09 17:51:09 [scrapy] INFO: Closing spider (finished)
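For what it's worth, the traceback shows the middleware reading `spider.page_actions` unconditionally, which raises AttributeError for spiders deployed without recorded actions. A defensive lookup (a hypothetical sketch, not the actual slybot code) would degrade to "no actions" instead of killing the crawl:

```python
class SpiderWithoutActions:
    """Stand-in for a SlybotSpider deployed without recorded page actions."""
    pass

def get_page_actions(spider):
    # slybot/pageactions.py line 44 does `events = spider.page_actions`,
    # which raises AttributeError for spiders like the one above.
    # Falling back to an empty list keeps the middleware a no-op instead.
    return getattr(spider, 'page_actions', [])

print(get_page_actions(SpiderWithoutActions()))  # → []
```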

@agramesh
Author

Hi, any updates on this?

@agramesh
Author

Hi,
when I added the page action to the spider, the page-action middleware was enabled, but only the first page's data was scraped in scrapyd. Is there any way to go to the next page with a click action? Or how can we deploy the spider with the page action? I tried portiacrawl as well, but got no response from the next page.
(screenshots: click, pageaction)

@AlexTan-b-z

My problem is the same as yours:
only the first page's data is scraped in scrapyd.
Have you solved the problem?

@agramesh
Author

Hi AlexTan-b-z,
I'm still analyzing this page action. I'm trying to implement a Selenium script middleware for navigation. Do you have any ideas on this?

@AlexTan-b-z

Hi agramesh,
I know of a pagination feature:
if the spider is run with the setting AUTO_PAGINATION=1, the algorithm kicks in and follows pagination links, but the spider runs very slowly once it is set.
The page action, however, doesn't seem to work. I have tried every function in the page action and none of them work either. I even doubt whether the 'click' in the page action can be used for page turning.
Have you run the page action? What is displayed on the console?
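Based on the comment above, enabling the fallback pagination algorithm would look like the fragment below. Treat this as an illustration of the described behaviour, not verified slybot documentation.

```python
# slybot/settings.py -- sketch. With AUTO_PAGINATION enabled, the
# pagination algorithm described above follows next-page links
# automatically, at the cost of a noticeably slower crawl.
AUTO_PAGINATION = 1
```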

@agramesh
Author

Hi AlexTan-b-z,

If I run the spider with the settings file configured as above, the page-action middleware is merely triggered, nothing more. I haven't checked with AUTO_PAGINATION=1.
Where did you configure it? In the settings.py file? Let me check what happens.
Thanks for the update.

@AlexTan-b-z

The slybot/settings.py file.

@AlexTan-b-z

Hi agramesh,

Have you found anything?

@matthieu637

+1

@AlexTan-b-z

Hi!
I have configured the Splash instance and set the SPLASH_URL.
I find that Splash makes "Enable JS" work.
But when I set a page action to crawl pages, the page action doesn't seem to work.
I know page actions are meant to handle AJAX pages, but they don't appear to work either.
I set a page action that scrolls and waits: because of AJAX, the page must be scrolled before its data loads. Using Selenium with PhantomJS I can find the data.
Did I miss something? Should I set something else in settings.py, or something for Splash?

@ruairif
Contributor

ruairif commented Jan 19, 2017

You need to set the SPLASH_URL in your settings to point to your Splash instance.

@ruairif ruairif closed this as completed Jan 19, 2017