
feed: stockxapi rate limit #1

Open
zhehaowang opened this issue Jun 12, 2019 · 8 comments
zhehaowang (Owner) commented Jun 12, 2019

stockx appears to be one of those sites that constantly upgrade their anti-bot mechanisms.
On 06/02/19 my auth requests got through as long as they had User-Agent set.
On 06/09/19 I had to add Referer, Origin, and Content-Type as well.
On 06/12/19 I had to add these headers to get_details requests too, and I still get 403 after the first few requests. As a short-term solution, a rate limit or multiple sources might do.

The goal is to be able to keep scraping stockx uninterrupted. I can think of:

  • adding a rate limit on our side, or
  • finding out whether additional header fields can just let our requests through, or
  • switching to a different framework with such support built in

I believe they ultimately want people to use their API, but what I'm doing now is probably too aggressive.

@djian618
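
A minimal sketch of the rate-limit option, assuming a Python requests-based feed; the header values, endpoint usage, and the one-request-per-second budget are illustrative placeholders, not anything StockX documents:

```python
# Hedged sketch: client-side throttling plus the headers that currently get
# requests through. Header values and the 1 req/s budget are assumptions to
# be tuned, not StockX's documented requirements.
import time
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
    "Referer": "https://stockx.com/",
    "Origin": "https://stockx.com",
    "Content-Type": "application/json",
}

MIN_INTERVAL = 1.0  # seconds between requests; tune empirically
_last_request = 0.0

def throttled_get(url, session=None, **kwargs):
    """GET with a minimum interval between calls."""
    global _last_request
    wait = MIN_INTERVAL - (time.time() - _last_request)
    if wait > 0:
        time.sleep(wait)
    sess = session or requests
    resp = sess.get(url, headers=HEADERS, **kwargs)
    _last_request = time.time()
    return resp
```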

zhehaowang added the enhancement (New feature or request), feed (feed from exchanges), and urgent (urgent issues) labels on Jun 12, 2019
zhehaowang added this to the Sprint 1 milestone on Jun 13, 2019
zhehaowang self-assigned this on Jun 14, 2019
zhehaowang (Owner, Author) commented Jun 14, 2019

b51c779 should address this.
Added per-query and per-page voluntary throttling, rotation across multiple accounts, and sending the auth cookie.

It appears that after one get_details is blocked, subsequent auth attempts keep failing indefinitely without human intervention. Needs more investigation.

With aggressive throttling we can mostly get through the current list. When we eventually get stuck after all the AJs, manually reloading cookies appears to help.

  • Which particular cookies help?
  • Does setting the cookie on every get_details help?
  • Would an adaptive sleep based on the query response help?

They use a third-party anti-bot solution called PerimeterX. Needs some targeted research.
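
A rough sketch of what account rotation plus an adaptive sleep on 403 could look like; `login()`, the account list, and the backoff constants are hypothetical placeholders, not the code in b51c779:

```python
# Hedged sketch of multi-account rotation with adaptive backoff on 403.
# The account list and login() are placeholders; delays are guesses to tune.
import itertools
import time
import requests

ACCOUNTS = [("user1@example.com", "pw1"), ("user2@example.com", "pw2")]
_account_cycle = itertools.cycle(ACCOUNTS)

def login(session, email, password):
    # Placeholder: perform the auth request; the session keeps the cookies.
    raise NotImplementedError

def get_details_with_rotation(url, max_retries=3):
    delay = 5.0
    for _ in range(max_retries):
        session = requests.Session()
        login(session, *next(_account_cycle))
        resp = session.get(url)
        if resp.status_code != 403:
            return resp
        time.sleep(delay)   # adaptive sleep: back off further after each 403
        delay *= 2
    return None
```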

zhehaowang (Owner, Author) commented:

There does not appear to be an easy fix for this PerimeterX thing.
We added significant self-throttling but still weren't able to get the entire search list through: after an extended time one get_details would 403, and all subsequent get_details would 403 as well, until manual intervention in a browser to click "I'm a human". The fact that our browser requests get blocked too seems to indicate this is IP-based blocking.

Selenium was not able to help click that button: when simulating the click with Selenium, more reCAPTCHA checks popped up.

As a start, we should make sure not to duplicate queries. Then we should consider spreading our requests out more.
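
A minimal sketch of those two mitigations (deduplicating keywords and spreading queries out with jitter); the function names and gap values are illustrative:

```python
# Hedged sketch: skip keywords already queried in this run, and jitter the
# gap between queries so requests are spread out rather than bursty.
import random
import time

def run_queries(keywords, query_fn, base_gap=20.0, jitter=10.0):
    seen = set()
    for kw in keywords:
        if kw in seen:          # never issue duplicate queries
            continue
        seen.add(kw)
        query_fn(kw)
        # spread requests out: base gap plus random jitter per keyword
        time.sleep(base_gap + random.uniform(0, jitter))
```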

zhehaowang (Owner, Author) commented Jul 8, 2019

Note that the API endpoint we are using is not the one in their official repo (https://github.com/stockx/PublicAPI).
It appears the API endpoint there requires an API key, which is only available to lv4 sellers.
Also, the API listed there seems incomplete for our use case: transaction history, for example, is not available.

Without knowing PerimeterX's mechanism, the best thing to try now could be a fleet of IP addresses, activated at different times of day.
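
A hedged sketch of the IP-fleet idea using rotating proxies with requests; the proxy URLs are placeholders and the rotate-on-403 policy is an assumption:

```python
# Hedged sketch: route each request through one of several proxies,
# moving to the next proxy whenever one gets a 403.
import itertools
import requests

PROXIES = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

def get_via_fleet(url, headers=None):
    resp = None
    for _ in range(len(PROXIES)):
        proxy = next(_proxy_cycle)
        resp = requests.get(url, headers=headers,
                            proxies={"http": proxy, "https": proxy},
                            timeout=30)
        if resp.status_code != 403:
            return resp
    return resp  # all proxies blocked; caller decides what to do
```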

zhehaowang (Owner, Author) commented Jul 16, 2019

It would appear that throttle times and different logins don't help.
With the standard query keyword setup:

  • 0.1s per-request throttle, 20s per-keyword throttle, 1 account
    • 1st 403 at 30 items / 118 requests / halfway through AJ1
    • 2nd 403 at 227 items / 910 requests / halfway through AJ6
    • 3rd 403 at 225 items / 905 requests / AJ11
    • 4th 403 at 255 items / 1019 requests / AJ19
    • 5th 403 at 176 items / 705 requests / AJ29
    • 6th 403 at 174 items / 699 requests / adidas ultraboost
    • 7th 403 at 245 items / 979 requests / nike kobe
    • 8th 403 right at the end

Each item right now is 4 requests.

We could try:

  • a fleet of IP addresses (which requires a deployment-ready version of the script), or
  • recording when the hiccups happen under different setups, to study PerimeterX's behavior.

This appears to be an IP-based block: once we are 403'ed, other devices behind the same NAT also have to go through the captcha.
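
For the second option, a small sketch of how the 403 hiccups could be recorded; the log path and fields are made up for illustration:

```python
# Hedged sketch: append one JSON line per 403, with the request count,
# item count, keyword, and throttle settings in effect, so PerimeterX's
# thresholds can be compared across setups.
import json
import time

LOG_PATH = "perimeterx_403_log.jsonl"  # illustrative file name

def record_403(request_count, item_count, keyword, throttle_cfg):
    entry = {
        "ts": time.time(),
        "requests": request_count,
        "items": item_count,
        "keyword": keyword,
        "throttle": throttle_cfg,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```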

zhehaowang (Owner, Author) commented:

We were not blocked in the last scrape on 07/27. Presumably the block was lifted? Closing for now.

zhehaowang (Owner, Author) commented:

This has been observed again since feedv2 on 2019-12-22.
Presumably the new architecture could help. Needs implementation and testing.

zhehaowang reopened this on Dec 22, 2019
zhehaowang (Owner, Author) commented Dec 26, 2019

This is observed in both update and query modes. The current workaround is shell scripts that limit how many items we update each time.

If we breach the limit we become temporarily blocked for about 30 minutes, with no human intervention needed.
If not, it seems we can just sleep for 60s and keep going.
This is not as harsh as the previous iteration.

One problem is that a script may never finish updating everything, because of how the limit interacts with requests that did not error out with 403.
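
For reference, a Python sketch of this workaround (cap items per run, sleep 60s on an isolated 403, wait out roughly 30 minutes if the 403s persist); the cap and the retry policy are assumptions beyond the numbers above:

```python
# Hedged sketch of the current workaround. MAX_ITEMS_PER_RUN is an assumed
# cap; the 60s and 30min waits mirror the observations above.
import time

MAX_ITEMS_PER_RUN = 200
SOFT_WAIT = 60        # seconds: a single 403 usually clears after this
HARD_WAIT = 30 * 60   # seconds: sustained 403s mean a temporary block

def update_batch(items, update_fn):
    updated = 0
    for item in items:
        if updated >= MAX_ITEMS_PER_RUN:
            break
        resp = update_fn(item)
        if resp is not None and resp.status_code == 403:
            time.sleep(SOFT_WAIT)
            resp = update_fn(item)        # one retry after the soft wait
            if resp is not None and resp.status_code == 403:
                time.sleep(HARD_WAIT)     # wait out the temporary block
                continue
        updated += 1
    return updated
```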

zhehaowang (Owner, Author) commented:

The problem has since been addressed and 403 on stockx no longer seems to be a major blocker.
