Instagram is blocking our scraping #665

Open
snarfed opened this issue Apr 30, 2016 · 23 comments

@snarfed snarfed commented Apr 30, 2016

... by returning empty 429s to our profile page HTTP requests. seems like it started under 36h ago. may have happened before too though. eg https://brid.gy/log?start_time=1461974780&key=aglzfmJyaWQtZ3lyFgsSCUluc3RhZ3JhbSIHc25hcmZlZAw

instagram-atom isn't having this problem, and it's using the same IPs, so maybe changing user agent might fix it.

@snarfed snarfed commented Apr 30, 2016

changed our user agent to a normal browser string, which seemed to fix this....but i doubt it'll stay fixed for long. app engine still appends our app id to the user agent, so instagram will still be able to identify us. we'll see.
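
A minimal sketch of the user agent change, assuming Python 3's standard library rather than App Engine's urlfetch. The browser string and the `build_profile_request` helper are hypothetical; the thread doesn't give the exact string that was used. It only builds the request, and it can't stop App Engine from appending the app id.

```python
import urllib.request

# Hypothetical browser-style User-Agent, not the one bridgy actually used.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
)


def build_profile_request(username, user_agent=BROWSER_USER_AGENT):
    """Build (but don't send) a request for a public Instagram profile page."""
    url = "https://www.instagram.com/%s/" % username
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_profile_request("snarfed")
# urllib stores header names capitalized as "User-agent".
print(req.full_url, req.get_header("User-agent"))
```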

@snarfed snarfed commented Apr 30, 2016

didn't work :(

snarfed added a commit to snarfed/granary that referenced this issue Apr 30, 2016

@snarfed snarfed commented Apr 30, 2016

the 429 body:

[screenshot of the 429 response body]

snarfed added a commit that referenced this issue Apr 30, 2016

@snarfed snarfed commented Apr 30, 2016

i also dropped instagram max poll freq down to 2h. with that and the new user agent, we're back in business. >75% of active instagram accounts have polled successfully in the last few hrs. eg https://brid.gy/instagram/aaronpk

not sure which of the two changes did the trick. we'll see.
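
A rough sketch of what capping the poll frequency at 2h could look like. `MIN_POLL_INTERVAL` and `next_poll_time` are hypothetical names for illustration, not bridgy's actual code.

```python
from datetime import datetime, timedelta

# Assumption: "max poll freq down to 2h" means at least 2 hours between polls.
MIN_POLL_INTERVAL = timedelta(hours=2)


def next_poll_time(last_polled, requested_interval):
    """Return when a source may next be polled, never sooner than the minimum."""
    interval = max(requested_interval, MIN_POLL_INTERVAL)
    return last_polled + interval


last = datetime(2016, 4, 30, 9, 0)
# A 30 minute request is stretched to 2 hours; a 1 day request is left alone.
print(next_poll_time(last, timedelta(minutes=30)))
print(next_poll_time(last, timedelta(days=1)))
```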

@snarfed snarfed commented May 1, 2016

we're still mostly blocked after all. :/ a few fetches went through ok, but they were the exception.

i'm going to disable instagram entirely for a day or two to see if that resets anything on their end.

@snarfed snarfed commented May 1, 2016

i wonder if this is all of app engine's (slash google's) IP block, not just bridgy. eg granary-demo sees the same problem: https://granary-demo.appspot.com/?site=instagram

@snarfed snarfed commented May 1, 2016

evidence for that: scraping instagram with bridgy's user agent works fine on my local machine.

@snarfed snarfed commented May 2, 2016

tried switching to sockets instead of urlfetch in the hopes that it used a different IP block, but no luck. one request made it through out of five, but the other four were 429ed. :/

@snarfed snarfed commented May 2, 2016

@snarfed snarfed commented May 2, 2016

i set up a reverse proxy to get around the IP block.

snarfed added a commit that referenced this issue May 2, 2016

@snarfed snarfed commented May 3, 2016

this has been working ok for a couple days now, yay. we'll see how long it lasts. :P closing.

@snarfed snarfed closed this May 3, 2016

@gerbz gerbz commented May 24, 2016

I noticed you're scraping the profile page - you should check out /username/media/. No auth needed.
https://www.instagram.com/snarfed/media/

Discovering this blew my mind.

Are you using a single IP? How often are you polling? Been working for 21+ days since your fix?

@snarfed snarfed commented May 24, 2016

@gerbz sadly that only works if you're logged in. http://stackoverflow.com/questions/17373886/33783840#comment61481772_33783840

the proxy was a single IP, yes, but instagram actually stopped blocking app engine recently, so i switched back to fetching directly instead.

we're polling ~1k users between once a day and once an hr, depending on how active they are. each poll may also fetch up to N individual media pages too though. in practice it looks like we average <1qpm right now, slightly bursty.
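
For a sense of scale, here is a back-of-the-envelope sketch of the profile poll rate those numbers imply. It only counts one profile fetch per poll and ignores the extra media page fetches; `profile_polls_per_minute` is a hypothetical helper.

```python
def profile_polls_per_minute(num_users, polls_per_user_per_day):
    """Average profile fetches per minute if every user polls at the same rate."""
    return num_users * polls_per_user_per_day / (24 * 60)


slowest = profile_polls_per_minute(1000, 1)   # everyone polled once a day
fastest = profile_polls_per_minute(1000, 24)  # everyone polled once an hour
print(round(slowest, 2), round(fastest, 2))   # roughly 0.69 and 16.67
```

An observed average under 1qpm would be consistent with most accounts sitting near the once-a-day end of that range.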

@gerbz gerbz commented May 24, 2016

@snarfed that comment is incorrect - try for yourself. I've even hit it unauthed using Tor. Works fine. Haven't polled it excessively but should work.

Thanks for the info.

@snarfed snarfed commented May 24, 2016

good point! you're right. thanks! i just realized i was testing on a private account. public accounts work fine.

@shafikhaan shafikhaan commented Jun 30, 2018

@snarfed what's the current status of your scraping? Is the project still up?
p.s. 👍 Thanks for the comments, it's really helping

@snarfed snarfed commented Jun 30, 2018

@shafikhaan yup! https://brid.gy/ , https://granary.io , and https://instagram-atom.appspot.com are still happily scraping Instagram.

@shafikhaan shafikhaan commented Jun 30, 2018

@snarfed Which one would you pick from the above?

@snarfed snarfed commented Jun 30, 2018

@shafikhaan sorry? i don't follow the question.

they all share this scraping code, if that helps:

https://github.com/snarfed/granary/blob/master/granary/instagram.py#L758-L975

@snarfed snarfed reopened this Aug 26, 2019

@snarfed snarfed commented Aug 26, 2019

happening again. started 8/21, probably due to an ongoing flood of https://granary.io/ instagram fetches for individual profiles via subscriptions in Aperture-based news readers. ugh. i've disabled instagram in granary entirely for now.

for the record, and since i might need to use it again, when i proxied requests last time, i used Apache 2.4's mod_proxy and mod_ssl with this config:

```
LoadModule proxy_module /usr/lib64/httpd/modules/mod_proxy.so
LoadModule ssl_module /usr/lib64/httpd/modules/mod_ssl.so
SSLProxyEngine on
ProxyPass /instagram/ https://www.instagram.com/
```
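
On the client side, using a proxy like this means rewriting each instagram.com URL to point at the `/instagram/` path on the proxy host. A minimal sketch, assuming a placeholder host `proxy.example.com` (the thread doesn't name the real one) and a hypothetical `via_proxy` helper:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical proxy host; only the /instagram/ path comes from the config above.
PROXY_BASE = "https://proxy.example.com/instagram/"


def via_proxy(url, proxy_base=PROXY_BASE):
    """Rewrite a www.instagram.com URL to go through the ProxyPass mapping."""
    parts = urlsplit(url)
    if parts.netloc != "www.instagram.com":
        return url  # leave other hosts untouched
    proxy = urlsplit(proxy_base)
    path = proxy.path.rstrip("/") + parts.path
    return urlunsplit((proxy.scheme, proxy.netloc, path, parts.query, parts.fragment))


print(via_proxy("https://www.instagram.com/snarfed/"))
```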

@snarfed snarfed commented Aug 26, 2019

interestingly, the symptom this time is different. when it happened originally, back in 2016, we got 429s with a nice "Sorry, too many requests." HTML body. now, it's 401s with an empty body. example log.
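
A small sketch of how the two symptoms might be told apart from ordinary failures. `looks_blocked` is a hypothetical helper, and treating an empty 401 as blocking is an assumption based only on this thread; a 401 could mean something else in other contexts.

```python
def looks_blocked(status, body):
    """Guess whether a response matches either blocking symptom seen here."""
    if status == 429:
        # 2016 symptom: 429, with or without a "too many requests" HTML body.
        return True
    if status == 401 and not body.strip():
        # 2019 symptom: 401 with an empty body.
        return True
    return False


print(looks_blocked(429, "Sorry, too many requests."))
print(looks_blocked(401, ""))
print(looks_blocked(200, "<html>...</html>"))
```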

@snarfed snarfed commented Aug 26, 2019

back to proxying. working for now. i've re-enabled all affected IG accounts.

@snarfed snarfed commented Aug 27, 2019

instagram blocked my proxy's IP. whee.

snarfed added a commit to snarfed/granary that referenced this issue Aug 27, 2019
snarfed added a commit that referenced this issue Aug 27, 2019
for #665
snarfed added a commit to snarfed/granary that referenced this issue Aug 29, 2019
snarfed added a commit to snarfed/granary that referenced this issue Aug 29, 2019
trying to discourage people from using granary for social feeds, esp due to eg IG's recent blocking, snarfed/bridgy#665 (comment)
snarfed added a commit to snarfed/instagram-atom that referenced this issue Aug 30, 2019
...since i had to block instagram in granary due to their rate limiting/blocking. snarfed/bridgy#665 (comment)
snarfed added a commit to snarfed/twitter-atom that referenced this issue Aug 30, 2019
inspired by snarfed/instagram-atom@856575b, since i had to block instagram in granary due to their rate limiting/blocking. snarfed/bridgy#665 (comment)

UI next!