Auto-set a Chinese extractor proxy. #1063

Open
wants to merge 2 commits into
base: develop

xuehuichao

This pull request adds the "-C" and "--china" options, which automatically set a proxy for Chinese video websites.

A common use case of you-get is downloading content from Chinese websites. One inconvenience is that we always need a proxy, which usually means looking one up on the web. This patch automates that process: it parses a proxy listing website and randomly chooses a proxy from that list.

This is my first pull request to this repo. Please don't hesitate to share your comments on this pull request. Thanks!


I used BeautifulSoup to parse a proxy listing website, http://www.proxynova.com/proxy-server-list/country-cn/, and then picked a proxy from the results. I tested it on my local machine and it worked.
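
For readers who want a concrete picture of the pick-and-apply step described in this comment, here is a minimal sketch using only the standard library. It is not the code from the patch: the function names `pick_random_proxy` and `build_proxy_opener` are hypothetical, and the addresses are placeholders from the documentation range, not real proxies.

```python
import random
import urllib.request


def pick_random_proxy(candidates, rng=random):
    """Return one "host:port" string chosen at random from candidates."""
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no proxy candidates available")
    return rng.choice(candidates)


def build_proxy_opener(proxy):
    """Build a urllib opener that routes HTTP requests through proxy."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    return urllib.request.build_opener(handler)


# Placeholder addresses (RFC 5737 documentation range), not real proxies.
candidates = ["192.0.2.10:8080", "192.0.2.11:3128"]
proxy = pick_random_proxy(candidates)
opener = build_proxy_opener(proxy)
```

Building the opener does not contact the proxy; a request is only sent once something like `opener.open(url)` is called.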
@soimort-bot
Collaborator

Hello @xuehuichao,
Thanks for the Pull Request. We ❤️ our contributors!
Please wait for one of our human maintainers to review your patches. This may take a few days to weeks. Also, please understand that although your Pull Request may or may not be eventually merged, we value all contributions equally.

Wishing you good health!

@cnbeining
Collaborator

Hi there,

Thanks for your PR.

Bypassing the geographic restrictions from outside China is always a pain. That's a very good observation.

The only point I am not 100% sure about is whether a 3rd-party service is stable enough to build some core functionality on.

@soimort What's your opinion?

Yours,
Beining

@soimort
Owner

soimort commented Apr 15, 2016

The code itself is ok, but I'm skeptical about the reliability of using 3rd-party services (especially when they are randomly selected).

What if the auto-selected proxy is malicious, or just behaves strangely without forwarding data correctly? We have no guarantee about this, and I don't want to mislead users to think that this is a you-get issue.

@xuehuichao
Author

Thanks for your comments!

I guess we can split our discussions into two layers: (1) fundamentally, do we want this feature in you-get? (2) technically, how can we improve the current implementation, if we decide to have this feature.

  • Is the proxy website reliable (e.g. will it be down, or will its format change)?
  • Are the listed proxies valid (e.g. malicious, or down)?

I guess the technical issues can all be resolved, but the most fundamental question is whether you folks would want this feature included in you-get. I personally think this feature would be worthwhile, as a lot of users want to download from Chinese video websites, and it's a lot of pain finding proxies manually. With that said, I would totally understand if you think the proxy picker should be implemented in a separate tool.

Regarding the technical details:

  • Proxy listing website's reliability: in the long run, I think we could (1) avoid putting all our eggs in one basket by implementing proxy pickers for other proxy listing websites, and (2) implement some tests to check for format changes, etc.
  • The listed proxies may not be valid. This is a very valid concern. In the current implementation, the proxy isn't working for around 30% of the time. This is still much better than picking proxies on my own, but it definitely has made you-get look less reliable. I think one way to mitigate it is to double check the proxy's validity before we initiate the actual download. For example, we may first try using the proxy to download a cnn.com page, and stop using it if the result doesn't look reasonable. We may try a couple of proxies in parallel first, and then pick a random one only from the ones that are known to work. Does that make sense?
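
The check-then-pick idea in the last bullet could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patch: `proxy_works` and `pick_working_proxy` are hypothetical names, the test URL is a placeholder, and a successful fetch only shows that a proxy forwards traffic, not that it is trustworthy.

```python
import http.client
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def proxy_works(proxy, test_url="http://example.com/", timeout=5):
    """Return True if a small page can be fetched through proxy.

    test_url is a placeholder; any small, stable page would do.
    """
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200 and len(response.read(1024)) > 0
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors, timeouts, and malformed replies all count as "not working".
        return False


def pick_working_proxy(candidates, check=proxy_works, max_workers=8, rng=random):
    """Check candidates in parallel; return a random working one, or None."""
    candidates = list(candidates)
    if not candidates:
        return None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, candidates))
    working = [p for p, ok in zip(candidates, results) if ok]
    return rng.choice(working) if working else None
```

Passing the check as a parameter keeps the selection logic testable without network access, for example `pick_working_proxy(["a:1", "b:2"], check=lambda p: p == "b:2")`.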

Please let me know what you think.

@soimort
Owner

soimort commented Apr 23, 2016

Sorry for the late response.

> In the current implementation, the proxy isn't working for around 30% of the time. This is still much better than picking proxies on my own

True. I don't know anything about ProxyNova, or how it operates and provides these free proxy services, but I can imagine that they are not as reliable as you may think. It claims to enable anonymity, but you could achieve the same thing via Tor as well. Rather than having us pick an arbitrary proxy for them, users must be aware of the transparency and the risk of using one (there's very little risk if you're only going to download videos, though), and if they really are, they would just pick their own. Thus I'm reluctant to let you-get do this for users.

As for now, "picking a freely available proxy" was never my intent; I want to use just my own proxy, so it's not really a priority for me.

> ...and stop using it if the result doesn't look reasonable. We may try a couple of proxies in parallel first, and then pick a random one only from the ones that are known to work

How? IMHO we should not handle that level of complexity in this simple command-line tool that uses regexps to scrape the web. We are not going to make it fault-tolerant server middleware anyway.

As a follow-up on the technical side: BeautifulSoup is cool, but we are not going to introduce any external dependency into this simple tool at this stage.
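
To illustrate the dependency point, table scraping of this kind can be done with the standard library's `html.parser` alone. The sketch below is hypothetical: `ProxyTableParser` and `extract_proxies` are illustrative names, and the assumed layout (IP address in the first cell of a row, port in the second) is a guess for demonstration, not the actual markup of any proxy listing site.

```python
from html.parser import HTMLParser


class ProxyTableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_proxies(html):
    """Return "host:port" strings from rows shaped like <td>IP</td><td>port</td>."""
    parser = ProxyTableParser()
    parser.feed(html)
    parser.close()
    return [
        row[0] + ":" + row[1]
        for row in parser.rows
        if len(row) >= 2 and row[1].isdigit()
    ]


# Made-up sample markup with placeholder addresses.
sample = """
<table>
  <tr><th>IP</th><th>Port</th></tr>
  <tr><td>192.0.2.10</td><td>8080</td></tr>
  <tr><td>192.0.2.11</td><td> 3128 </td></tr>
</table>
"""
proxies = extract_proxies(sample)
```

If the page layout changes, `extract_proxies` returns an empty list, which a caller could treat as a signal that the scraper needs updating.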

@cnbeining
Collaborator

It could be risky to pass login credentials through a random service provided by god-knows-who. I cannot see any good scenario in which a well-informed user would need this function; I am also concerned about the moral burden of providing a risky service to end users.

BTW, BS4 is good, but from my point of view I cannot see the necessity of adding it as a dependency.

@asurinsaka

I do think this is a very important function. It may be good to provide the option and let users decide whether to take the risk.

Also, for some services, credentials are not necessary. It does not matter whether the proxies are malicious. I couldn't care less whether they are trying to peek at the video I am watching.

5 participants