Avoid YouTube Captcha #12229

Open · hugogameiro wants to merge 14 commits into master

@hugogameiro
Contributor

hugogameiro commented Oct 27, 2019

Over the past week, some of my servers gradually started running into an issue with YouTube: the preview_cards for videos were failing.

After some debugging, I noticed that the HTML fetched in fetch_link_card_service.rb was a YouTube reCAPTCHA page instead of the video page. That page has no link[@type="application/json+oembed"] or link[@type="text/xml+oembed"] element, so the oEmbed link is never found.
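
To make the failure mode concrete, here is a minimal sketch of oEmbed discovery on a fetched page. It is not the code in fetch_link_card_service.rb: the helper name discover_oembed_href is hypothetical, and it scans tags with a regex (which is fragile and does not decode HTML entities) only so the example runs with no gems installed.

```ruby
# Hypothetical helper, not Mastodon's implementation.
# Returns the href of the first oEmbed <link> tag, or nil when none exists.
OEMBED_TYPES = ['application/json+oembed', 'text/xml+oembed'].freeze

def discover_oembed_href(html)
  # Look at every <link ...> tag and keep the first one with an oEmbed type.
  html.scan(/<link\b[^>]*>/i).each do |tag|
    type = tag[/\btype=["']([^"']+)["']/i, 1]
    next unless OEMBED_TYPES.include?(type)

    href = tag[/\bhref=["']([^"']+)["']/i, 1]
    return href if href
  end
  nil # e.g. a captcha page: no oEmbed link, so no preview card
end
```

With a regular video page the helper returns the endpoint URL; with a captcha page it returns nil, which matches the symptom described above.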

I think this is related to the number of requests my servers make to YouTube and to what Google calls automated traffic.

This is a workaround to bypass the problem. I know it's not ideal and that it hardcodes an exception for YouTube, but it was all I could think of to solve the issue.

I am not a Rails coder, so please double-check what I have done and suggest corrections.

For both of these reasons, I completely understand if this is never merged to master, but I decided to share it in case someone else runs into the same issue, or to see if we can find a better solution.

hugogameiro added 5 commits Oct 27, 2019
@Gargron

Member

Gargron commented Oct 28, 2019

I wonder if instead of hardcoding YouTube URLs we could long-term cache discovered OEmbed endpoints. One question is how to map discovered OEmbed endpoints back onto URL schemes like https://youtube.com/watch?v= automatically. Perhaps storing the host would be enough though.

@hugogameiro

Contributor Author

hugogameiro commented Oct 28, 2019

I have been thinking about that since yesterday. One way to do it would be to run the initial oEmbed discovery as it is done now and, if an endpoint is found, apply two regex matches: one mandatory, the other optional. When the mandatory one does not match, continue the way it works right now. The matches would be something like:

Mandatory: match https%3A and cache the string before that match
Optional: match format=json or format=xml and store it if present

According to the oEmbed Spec:

format (optional)
The required response format. When not specified, the provider can return any valid response format. When specified, the provider must return data in the request format, else return an error...
Note: Providers may choose to have the format specified as part of the endpoint URL itself, rather than as a query string parameter.

So, for a URL like: https://www.youtube.com/oembed?url=https%3A//youtube.com/watch%3Fv%3DM3r2XDceM6A&format=json

The cache entry would be a hash like:

{
  "www.youtube.com": {
    "endpoint": "https://www.youtube.com/oembed?url=",
    "format": "json"
  }
}

For a URL like: https://masto.pt/api/oembed.json?url=https%3A%2F%2Fmasto.pt%2F%40hugo%2F102593582113984855

The cache entry would be a hash like:

{
  "masto.pt": {
    "endpoint": "https://masto.pt/api/oembed.json?url="
  }
}
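
The two matches and the resulting cache entries can be sketched roughly as follows. This is an illustration of the idea in this comment, not the code in the PR; the method name split_oembed_url is hypothetical, and matching http%3A as well as https%3A (case-insensitively) is an assumption added here.

```ruby
require 'uri'

# Hypothetical helper: turns a discovered oEmbed URL into [host, entry],
# or returns nil when the mandatory match fails (caller falls back to
# the existing discovery flow).
def split_oembed_url(discovered_url)
  # Mandatory match: the start of the percent-encoded target URL.
  marker = discovered_url =~ /https?%3A/i
  return nil if marker.nil?

  # Everything before the match is the reusable endpoint prefix.
  entry = { endpoint: discovered_url[0...marker] }

  # Optional match: an explicit format query parameter.
  format = discovered_url[/[?&]format=(json|xml)\b/, 1]
  entry[:format] = format if format

  [URI.parse(discovered_url).host, entry]
end

host, entry = split_oembed_url(
  'https://masto.pt/api/oembed.json?url=https%3A%2F%2Fmasto.pt%2F%40hugo%2F102593582113984855'
)
# host  is "masto.pt"
# entry is { endpoint: "https://masto.pt/api/oembed.json?url=" }
```

Note that this relies on the url parameter appearing before format in the discovered link, as it does in both examples above; a provider that orders the parameters differently would need extra handling.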

The cache could last 24 hours or so. When a new preview_card needs to be generated, check whether a cached endpoint exists for the domain and, if so, just append the encoded URL to the endpoint (plus the format, if present).
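
A rough sketch of that lookup, assuming an in-memory store so the example is self-contained. The class OEmbedEndpointCache and its methods are hypothetical; in a Rails app the same idea would more likely sit on top of the application's cache store with an expiry.

```ruby
require 'uri'

# Hypothetical per-host cache of oEmbed endpoints with a 24-hour lifetime.
class OEmbedEndpointCache
  TTL = 24 * 60 * 60 # seconds

  # The clock is injectable so expiry can be exercised without waiting.
  def initialize(clock: -> { Time.now })
    @clock = clock
    @entries = {}
  end

  def store(host, endpoint:, format: nil)
    @entries[host] = { endpoint: endpoint, format: format, expires_at: @clock.call + TTL }
  end

  # Returns the oEmbed request URL for page_url, or nil when the host has
  # no live cache entry (caller should run normal discovery instead).
  def request_url_for(page_url)
    host = URI.parse(page_url).host
    entry = @entries[host]
    return nil if entry.nil?

    if @clock.call >= entry[:expires_at]
      @entries.delete(host)
      return nil
    end

    # Append the encoded page URL, then the format if one was stored.
    url = entry[:endpoint] + URI.encode_www_form_component(page_url)
    url += "&format=#{entry[:format]}" if entry[:format]
    url
  end
end

cache = OEmbedEndpointCache.new
cache.store('www.youtube.com', endpoint: 'https://www.youtube.com/oembed?url=', format: 'json')
cache.request_url_for('https://www.youtube.com/watch?v=M3r2XDceM6A')
```

Keying on the exact host means a link to youtube.com or youtu.be would miss an entry stored under www.youtube.com, so the host-to-endpoint mapping raised earlier in the thread is still an open question in this sketch.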

hugogameiro added 3 commits Oct 28, 2019
@hugogameiro

Contributor Author

hugogameiro commented Oct 29, 2019

OK, so I updated the code to do what I mentioned in the previous comment.

This is the first time I have attempted some sort of "serious" coding in Ruby, and I am relying on coding experience from other languages. So I'm pretty sure it could be cleaner (safer?), but in my testing it works, and it's pretty fast when the cache is used.

Feel free not to use it, or to copy/adapt/change it if you think something is useful.

hugogameiro added 6 commits Oct 29, 2019