Passes meta information from previous request #145

Closed
yvmarques opened this issue Mar 3, 2018 · 6 comments

@yvmarques

I am not sure how I can achieve this, but my requirement is that I need to know the order of the links on the initial page and pass it to the next request, so that I can save it together with the data from each link's request.

Let's say the page from my initial request contains three links, in this order:

foo.html -> 1st link in the HTML
bar.html -> 2nd link in the HTML
baz.html -> etc.

When I request foo.html (because I configured the crawler with depth: 2), I would like to know that this page was the 1st link on the previous page.

Is that possible?

Thanks,

@BubuAnabelas

I'm not sure this works as expected because of the asynchronicity, but every time onSuccess(response) is called, response contains an array of links. Those links are the ones the crawler will go on to crawl, up to the configured depth. If the crawler processes them sequentially, you would have an ordered list of the pages it will follow.
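As a rough illustration of reading the order off that array, here is a minimal sketch. `indexLinks` is a hypothetical helper, not part of the crawler, and it assumes the array handed to onSuccess lists the links in document order, which this thread suggests but does not confirm.

```javascript
// Hypothetical helper: map each link to its 1-based position in the array.
// `links` is assumed to be the array of links the crawler passes to onSuccess.
function indexLinks(links) {
  const positions = new Map();
  links.forEach((url, i) => {
    // A link that appears twice keeps the position of its first occurrence.
    if (!positions.has(url)) positions.set(url, i + 1);
  });
  return positions;
}

const positions = indexLinks(['foo.html', 'bar.html', 'baz.html']);
console.log(positions.get('foo.html')); // position of foo.html among the page's links
```

Whether the crawler then requests the pages in that same order is a separate question, so the position is best recorded when the links are discovered rather than inferred from request order.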

@yvmarques
Author

I noticed that as well. My best guess so far is that we could store this list in a global variable, because the order is correct, and then in preRequest match the upcoming request against this global variable.
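A minimal sketch of that workaround, written as plain functions so it runs without a browser. The names `linkMeta`, `rememberLinks`, and `lookupMeta` are made up for illustration, and the shape of the object passed to onSuccess (`options.url` plus a `links` array) is an assumption based on this thread.

```javascript
// Module-level store (the "global variable" idea):
// child URL -> { parentUrl, position }.
const linkMeta = new Map();

// Meant to be called from onSuccess. Assumes `result.options.url` is the
// crawled page and `result.links` is the ordered list of links found on it.
function rememberLinks(result) {
  result.links.forEach((url, i) => {
    // If several pages link to the same URL, the first one seen wins.
    if (!linkMeta.has(url)) {
      linkMeta.set(url, { parentUrl: result.options.url, position: i + 1 });
    }
  });
}

// Meant to be called from preRequest with the options of the upcoming request.
// Returns null for a URL no crawled page linked to (for example the seed URL).
function lookupMeta(options) {
  return linkMeta.get(options.url) || null;
}

// Simulated crawl of one page, no browser involved.
rememberLinks({
  options: { url: 'https://example.com/' },
  links: ['https://example.com/foo.html', 'https://example.com/bar.html'],
});
console.log(lookupMeta({ url: 'https://example.com/bar.html' }));
```

The lookup only works if the URL in the upcoming request is spelled exactly like the stored link, so any normalization (trailing slashes, fragments) would have to be applied on both sides.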

But I also think this option could be useful for other things, for example setting the referrer for the next request. As far as I understand, currently none of the requests have a referrer, which can set off a few alarms and get the crawler blocked.

@yujiosaka
Owner

yujiosaka commented Mar 10, 2018

@yvmarques
Is your use case satisfied if the previous page's information (like document.referrer) is passed to onSuccess's result?

@yvmarques
Author

@yujiosaka I am not sure: can you change the headers for the upcoming request in onSuccess? Wouldn't that previous page's information be more useful on the preRequest method?

The idea would be to have something similar to what Scrapy has for the Request and Response.

https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta

@yujiosaka
Owner

@yvmarques

Wouldn't that previous page's information be more useful on the preRequest method?

Yes, it would be. I just thought you only wanted to know where the request was coming from.
If the referrer is passed to preRequest, you can even modify the headers via the extraHeaders option.

If that's what you wanted, I can probably add the feature quickly.
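To make the header idea concrete, here is a hedged sketch of what a preRequest hook could do once it knows the referring page. `withReferer` is a hypothetical helper, and how the referrer URL reaches preRequest is exactly the open question in this thread, so it is passed in as a plain argument here.

```javascript
// Hypothetical preRequest-style helper: add a Referer header to the options
// of the upcoming request. `referrerUrl` is supplied by the caller; the
// crawler passing it in automatically is an assumption, not a confirmed feature.
function withReferer(options, referrerUrl) {
  if (!referrerUrl) return options; // seed request: nothing to add
  // Merge instead of overwrite so headers set elsewhere are preserved.
  options.extraHeaders = Object.assign({}, options.extraHeaders, {
    Referer: referrerUrl,
  });
  return options;
}

const next = withReferer(
  { url: 'https://example.com/foo.html' },
  'https://example.com/'
);
console.log(next.extraHeaders);
```

Until the referrer is passed in by the crawler itself, the URL could come from a lookup table filled in onSuccess.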

@yvmarques
Author

yvmarques commented Mar 13, 2018

I don't know how hard it would be to, for example, have the result of a previous request passed to preRequest, and the executed request passed to onSuccess.
