
feed: stockxapi rate limit #1

Open
zhehaowang opened this issue Jun 12, 2019 · 8 comments
zhehaowang (Owner) commented Jun 12, 2019

stockx appears to be one of those sites that constantly upgrade their anti-bot mechanisms.
On 06/02/19 my auth requests got through as long as they had User-Agent set.
On 06/09/19 I had to add Referer, Origin, and Content-Type as well.
On 06/12/19 I had to add these headers to get_details requests too, and I still get 403 after the first few requests. As a short-term solution, a rate limit or multiple sources might do.

The goal is to be able to keep scraping stockx uninterrupted. I can think of:

  • adding a rate limit on our side, or
  • finding out whether additional header fields can just let our requests through, or
  • switching to a different framework with such support built in

I believe they ultimately want people to use their API, but what I'm doing now is probably too aggressive.

@djian618
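
A minimal sketch of the rate-limit option, assuming a Python requests-based feed; the header values, endpoint usage, and the one-request-per-second budget are illustrative placeholders, not anything StockX documents:

```python
# Hedged sketch: client-side throttling plus the headers that currently get
# requests through. Header values and the 1 req/s budget are assumptions to
# be tuned, not StockX's documented requirements.
import time
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
    "Referer": "https://stockx.com/",
    "Origin": "https://stockx.com",
    "Content-Type": "application/json",
}

MIN_INTERVAL = 1.0  # seconds between requests; tune empirically
_last_request = 0.0

def throttled_get(url, session=None, **kwargs):
    """GET with a minimum interval between calls."""
    global _last_request
    wait = MIN_INTERVAL - (time.time() - _last_request)
    if wait > 0:
        time.sleep(wait)
    sess = session or requests
    resp = sess.get(url, headers=HEADERS, **kwargs)
    _last_request = time.time()
    return resp
```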

zhehaowang added the enhancement (New feature or request), feed (feed from exchanges), and urgent (urgent issues) labels on Jun 12, 2019
zhehaowang added this to the Sprint 1 milestone on Jun 13, 2019
zhehaowang self-assigned this on Jun 14, 2019
zhehaowang (Owner, Author) commented Jun 14, 2019

b51c779 should address this.
Added per-query and per-page voluntary throttling, rotation across multiple accounts, and sending the auth cookie.

It appears that after one get_details is blocked, subsequent auth attempts keep failing indefinitely without human intervention. Needs more investigation.

With aggressive throttling we can mostly get through the current list. When we eventually get stuck after all the AJs, manually reloading cookies appears to help.

  • Which particular cookies help?
  • Does setting the cookie on every get_details help?
  • Would an adaptive sleep based on the query response help?

They use a third-party anti-bot solution called PerimeterX. Needs some targeted research.
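
A rough sketch of what account rotation plus an adaptive sleep on 403 could look like; `login()`, the account list, and the backoff constants are hypothetical placeholders, not the code in b51c779:

```python
# Hedged sketch of multi-account rotation with adaptive backoff on 403.
# The account list and login() are placeholders; delays are guesses to tune.
import itertools
import time
import requests

ACCOUNTS = [("user1@example.com", "pw1"), ("user2@example.com", "pw2")]
_account_cycle = itertools.cycle(ACCOUNTS)

def login(session, email, password):
    # Placeholder: perform the auth request; the session keeps the cookies.
    raise NotImplementedError

def get_details_with_rotation(url, max_retries=3):
    delay = 5.0
    for _ in range(max_retries):
        session = requests.Session()
        login(session, *next(_account_cycle))
        resp = session.get(url)
        if resp.status_code != 403:
            return resp
        time.sleep(delay)   # adaptive sleep: back off further after each 403
        delay *= 2
    return None
```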

zhehaowang (Owner, Author) commented:

There does not appear to be an easy fix for this PerimeterX thing.
We added significant self-throttling but still weren't able to get the entire search list through: after an extended time one get_details would 403, and all subsequent get_details would 403 as well, until manual intervention in a browser to click "I'm a human". The fact that our browser requests get blocked too seems to indicate this is IP-based blocking.

Selenium was not able to help click that button: when simulating the click with Selenium, more reCAPTCHA checks popped up.

As a start, we should make sure not to duplicate queries. Then we should consider spreading our requests out more.
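
A minimal sketch of those two mitigations (deduplicating keywords and spreading queries out with jitter); the function names and gap values are illustrative:

```python
# Hedged sketch: skip keywords already queried in this run, and jitter the
# gap between queries so requests are spread out rather than bursty.
import random
import time

def run_queries(keywords, query_fn, base_gap=20.0, jitter=10.0):
    seen = set()
    for kw in keywords:
        if kw in seen:          # never issue duplicate queries
            continue
        seen.add(kw)
        query_fn(kw)
        # spread requests out: base gap plus random jitter per keyword
        time.sleep(base_gap + random.uniform(0, jitter))
```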

zhehaowang (Owner, Author) commented Jul 8, 2019

Note that the API endpoint we are using is not the one in their official repo (https://github.com/stockx/PublicAPI).
It appears the API endpoint there requires an API key, which is only available to lv4 sellers.
Also, the API listed there seems incomplete for our use case: transaction history, for example, is not available.

Without knowing PerimeterX's mechanism, the best thing to try now could be a fleet of IP addresses, activated at different times of day.
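
A hedged sketch of the IP-fleet idea using rotating proxies with requests; the proxy URLs are placeholders and the rotate-on-403 policy is an assumption:

```python
# Hedged sketch: route each request through one of several proxies,
# moving to the next proxy whenever one gets a 403.
import itertools
import requests

PROXIES = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

def get_via_fleet(url, headers=None):
    resp = None
    for _ in range(len(PROXIES)):
        proxy = next(_proxy_cycle)
        resp = requests.get(url, headers=headers,
                            proxies={"http": proxy, "https": proxy},
                            timeout=30)
        if resp.status_code != 403:
            return resp
    return resp  # all proxies blocked; caller decides what to do
```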

zhehaowang (Owner, Author) commented Jul 16, 2019

It would appear that throttle times and different logins don't help.
With the standard query keyword setup:

  • 0.1s per-request throttle, 20s per-keyword throttle, 1 account
    • 1st 403 at 30 items / 118 requests / halfway through AJ1
    • 2nd 403 at 227 items / 910 requests / halfway through AJ6
    • 3rd 403 at 225 items / 905 requests / AJ11
    • 4th 403 at 255 items / 1019 requests / AJ19
    • 5th 403 at 176 items / 705 requests / AJ29
    • 6th 403 at 174 items / 699 requests / adidas ultraboost
    • 7th 403 at 245 items / 979 requests / nike kobe
    • 8th 403 right at the end

Each item right now is 4 requests.

We could try:

  • a fleet of IP addresses (which requires a deployment-ready version of the script), or
  • recording when the hiccups happen under different setups, to study PerimeterX's behavior.

This appears to be an IP-based block: once we are 403'ed, other devices behind the same NAT also have to go through the captcha.
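
For the second option, a small sketch of how the 403 hiccups could be recorded; the log path and fields are made up for illustration:

```python
# Hedged sketch: append one JSON line per 403, with the request count,
# item count, keyword, and throttle settings in effect, so PerimeterX's
# thresholds can be compared across setups.
import json
import time

LOG_PATH = "perimeterx_403_log.jsonl"  # illustrative file name

def record_403(request_count, item_count, keyword, throttle_cfg):
    entry = {
        "ts": time.time(),
        "requests": request_count,
        "items": item_count,
        "keyword": keyword,
        "throttle": throttle_cfg,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```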

zhehaowang (Owner, Author) commented:

We were not blocked in the last scrape on 07/27. Presumably the block was lifted? Closing for now.

zhehaowang (Owner, Author) commented:

This has been observed again since feedv2 on 2019-12-22.
Presumably the new architecture could help. Needs implementation and testing.

zhehaowang reopened this on Dec 22, 2019
zhehaowang (Owner, Author) commented Dec 26, 2019

This is observed in both update and query modes. The current workaround is shell scripts that limit how many items we update each time.

If we breach the limit we become temporarily blocked for about 30 minutes, with no human intervention needed.
If not, it seems we can just sleep for 60s and keep going.
This is not as harsh as the previous iteration.

One problem is that a script may never finish updating everything, because of how the limit interacts with requests that did not error out with 403.
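
For reference, a Python sketch of this workaround (cap items per run, sleep 60s on an isolated 403, wait out roughly 30 minutes if the 403s persist); the cap and the retry policy are assumptions beyond the numbers above:

```python
# Hedged sketch of the current workaround. MAX_ITEMS_PER_RUN is an assumed
# cap; the 60s and 30min waits mirror the observations above.
import time

MAX_ITEMS_PER_RUN = 200
SOFT_WAIT = 60        # seconds: a single 403 usually clears after this
HARD_WAIT = 30 * 60   # seconds: sustained 403s mean a temporary block

def update_batch(items, update_fn):
    updated = 0
    for item in items:
        if updated >= MAX_ITEMS_PER_RUN:
            break
        resp = update_fn(item)
        if resp is not None and resp.status_code == 403:
            time.sleep(SOFT_WAIT)
            resp = update_fn(item)        # one retry after the soft wait
            if resp is not None and resp.status_code == 403:
                time.sleep(HARD_WAIT)     # wait out the temporary block
                continue
        updated += 1
    return updated
```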

zhehaowang (Owner, Author) commented:

The problem has since been addressed and 403 on stockx no longer seems to be a major blocker.
