Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

option for change User-Agent in http headers #154

Closed
kjonca opened this issue Nov 6, 2020 · 20 comments · Fixed by #216
Closed

option for change User-Agent in http headers #154

kjonca opened this issue Nov 6, 2020 · 20 comments · Fixed by #216

Comments

@kjonca
Copy link

kjonca commented Nov 6, 2020

Some sites refuses fetching rss, when user-agent is not known to them. I prepared simple change which allow define "http-user-agent-option" and pass it to parser (diff is against debian version but I think this is no issue)

@kjonca
Copy link
Author

kjonca commented Nov 6, 2020


--- rss2email-3.12.2.orig/rss2email/config.py
+++ rss2email-3.12.2/rss2email/config.py
@@ -100,6 +100,7 @@ CONFIG['DEFAULT'] = _collections.Ordered
         ### Fetching
         # Set an HTTP proxy (e.g. 'http://your.proxy.here:8080/')
         ('proxy', ''),
+        ('http-user-agent', ''),
         # Set the timeout (in seconds) for feed server response
         ('feed-timeout', str(60)),
         # Set the time (in seconds) to sleep between fetches from the same server
--- rss2email-3.12.2.orig/rss2email/feed.py
+++ rss2email-3.12.2/rss2email/feed.py
@@ -75,8 +75,9 @@ from . import email as _email
 from . import error as _error
 from . import util as _util
 
-
 _urllib_request.install_opener(_urllib_request.build_opener())
+
+ 
 _SOCKET_ERRORS = []
 for e in ['error', 'herror', 'gaierror']:
     if hasattr(_socket, e):
@@ -370,6 +371,11 @@ class Feed (object):
             config = self.config['DEFAULT']
         proxy = config['proxy']
         timeout = config.getint('feed-timeout')
+
+        http_user_agent = config ['http-user-agent']
+        if http_user_agent:
+            _feedparser.USER_AGENT= http_user_agent
+
         kwargs = {}
         if proxy:
             kwargs['handlers'] = [_urllib_request.ProxyHandler({'http':proxy})]

@auouymous
Copy link
Contributor

Is this different from the existing user-agent setting? The patch in #120 fixes it.

@kjonca
Copy link
Author

kjonca commented Nov 6, 2020

According to documentation "user-agent" is used in outgoing mails, not http headers. Even if that patch, make it used in http headers, I am not sure if this value is proper for both purposes (so I used different option name).
In other words:
I want to have "User-Agent: rss2email' in mails and 'User-Agent: Mozilla/Linux" in http headers. How can it be achieved with #120 ?

@auouymous
Copy link
Contributor

#120 was waiting for someone to write a regression test, but you are correct about the documentation saying user-agent is for outgoing mails. So #120 is probably a new feature and not a bug fix.

I don't think there should be separate user-agent settings. The current setting would need to be renamed to mail-user-agent to avoid confusion. There would also need to be migration code, otherwise the next release would silently remove any user-agent the user had set.

Using separate values is only needed in order to spoof. You should instead contact the faulty feed and let them know that whitelisting is lazy and prevents access from valid clients.

@kjonca
Copy link
Author

kjonca commented Nov 6, 2020

Using separate values is only needed in order to spoof

Sorry, could you explain?

@auouymous
Copy link
Contributor

Setting UA to 'Mozilla/Linux' is lying (spoofing) about which client you are using. Anyone changing user-agent for security reasons would change both mail and http to the same obscure value.

@kjonca
Copy link
Author

kjonca commented Nov 6, 2020

Sorry, if you use, #120 for changing user-agent in http headers is not "lying" about clientu you are using? Still not understand.

@auouymous
Copy link
Contributor

Setting user-agent to nothing, 'none of your business' or simply 'rss2email' without a version isn't lying, it is just sharing less information with others who don't need to know which version or even which client you use.

@kjonca
Copy link
Author

kjonca commented Nov 6, 2020

Ok I want to use:
http User-Agent "none of your businness"
mail User-Agent "rss2email" (for filtering for example or sth)
does this change anything?

@Ekleog
Copy link
Member

Ekleog commented Nov 6, 2020

Re-reading #120 and #74, it sounds like the existing user-agent is mail-user-agent indeed, so there is a place for http-user-agent as this patch suggests.

@kjonca Would it be possible for you to add documentation for this new setting (in r2e.1), and to add a test for this new feature so we don't break it unknowingly?

@Ekleog
Copy link
Member

Ekleog commented Nov 6, 2020

Oh, I missed one comment! @auouymous Feeds requiring spoofing are unfortunately numerous, and trying to fight them all is way more work than adding an HTTP User-Agent parameter. This being said, maybe the documentation for user-agent should mention that people should contact the feed owner before using it? This way, people could read feeds, and at the same time broken feeds would hopefully be notified (even though most feed owners probably have literally no idea how their feed is being operated)

@kjonca kjonca changed the title option for change User-Agent in http headerss option for change User-Agent in http headers Nov 6, 2020
@auouymous
Copy link
Contributor

http-user-agent should default to the same value as user-agent so it advertises as rss2email and not feedparser.

-        ('http-user-agent', ''),
+        ('http-user-agent', 'rss2email/__VERSION__ (__URL__)'),                                                                                                                                            

Blocking clients with an empty UA is common for web servers so it is good to not allow the user to set a blank UA. But the man page and config comment should mention that leaving this blank will advertise as feedparser with feedparser's version.

+        http_user_agent = config ['http-user-agent']
+        if http_user_agent:
+            _feedparser.USER_AGENT= http_user_agent

@kjonca
Copy link
Author

kjonca commented Nov 7, 2020

--- rss2email-3.12.2.orig/r2e.1
+++ rss2email-3.12.2/r2e.1
@@ -173,6 +173,10 @@ Set this to default To email addresses.
 Set an HTTP proxy (e.g. 'http://your.proxy.here:8080/')
 .IP feed-timeout
 Set the timeout (in seconds) for feed server response
+.IP http-user-agent
+String to use as User-Agent header in http requests. Defaults to 'rss2email/__VERSION__ (__URL__)'.
+This is disputable to change this parameter, as it can be recognized as a kind of spoofing.
+Think about to contact feed owner to stop blocking rss2email first.
 .RE
 .SS Processing
 .IP active

Sorry for my English. For test I need some time, as I am not a python expert.

@Ekleog
Copy link
Member

Ekleog commented Mar 19, 2021

@kjonca The documentation you added looks good to me, thank you :) Do you still plan on adding a test (which can be added using the procedure described in the top post of #159)? Also, would it be possible for you to either open a pull request with your code changes, or if not to consolidate all the patches and suggestions in a single patch, so it's possible to review without having to go through currently 3, hopefully 4, code blocks? :)

@auouymous
Copy link
Contributor

What exactly would a test for this do? Can tests setup a local web server and compare the headers sent to it?

@Ekleog
Copy link
Member

Ekleog commented Mar 19, 2021

Yes we do :) In test/test.py, the TestFetch task group spins up a webserver. Adjusting the NoLogHandler to return a special feed upon request with a certain user-agent, and then adding a test that uses said user-agent to fetch the feed at an otherwise non-existent URL, should be a reasonable way to make it I think

@kjonca
Copy link
Author

kjonca commented Mar 20, 2021

@kjonca The documentation you added looks good to me, thank you :) Do you still plan on adding a test (which can be added using the procedure described in the top post of #159)?

Sorry for delay. I forked this repository and tried to run tests (before making any changes) but they

  1. ends with error
  2. hang on "Waits before fetching repeatedly from the same server ..."
    I do not know what I am doing wrong.

@Ekleog
Copy link
Member

Ekleog commented Mar 20, 2021

Hmm this is weird… how are you attempting to test? Here is a transcript of a local run:

$ git checkout e2a6d9181639c2ab41159f876a2b2fb91f99c77d     # the latest master as of this writing
[...]
$ nix-shell --pure     # helps making sure I got the right things in the environment
$ python test/test.py -v
[... only “ok” tests]
Ran 29 tests in 6.283s

OK

@kjonca
Copy link
Author

kjonca commented Mar 20, 2021

(I did not use nix-shell, I have to check it out)
$git clone https://github.com/rss2email/rss2email.git
Cloning into 'rss2email'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 2507 (delta 12), reused 12 (delta 6), pack-reused 2477
Receiving objects: 100% (2507/2507), 967.00 KiB | 227.00 KiB/s, done.
Resolving deltas: 100% (1592/1592), done.
$cd rss2email
$git checkout e2a6d91
Note: switching to 'e2a6d9181639c2ab41159f876a2b2fb91f99c77d'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

git switch -c

Or undo this operation with:

git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at e2a6d91 Add wrap-links option to prevent links from wrapping over multiple lines (#172)
$python3 test/test.py -v
test_email_data/allthingsrss/1.config (main.TestEmails) ... ERROR
test_email_data/allthingsrss/2.config (main.TestEmails) ... ERROR
test_email_data/allthingsrss/3.config (main.TestEmails) ... ERROR
test_email_data/allthingsrss/4.config (main.TestEmails) ... ERROR
test_email_data/allthingsrss/5.config (main.TestEmails) ... ERROR
test_email_data/allthingsrss/6.config (main.TestEmails) ... ERROR
test_email_data/bbc-chinese/1.config (main.TestEmails) ... ERROR
test_email_data/bbc-chinese/2.config (main.TestEmails) ... ERROR
test_email_data/bbc-chinese/3.config (main.TestEmails) ... ERROR
test_email_data/disqus/1.config (main.TestEmails) ... ERROR
test_email_data/gmane/1.config (main.TestEmails) ... ERROR
test_email_data/gmane/2.config (main.TestEmails) ... ERROR
test_email_data/gmane/3.config (main.TestEmails) ... ERROR
test_email_data/gmane/4.config (main.TestEmails) ... ERROR
test_email_data/tails/1.config (main.TestEmails) ... ERROR
test_email_data/tails/2.config (main.TestEmails) ... ERROR
test_user_agent_sub_fixed (main.TestFeedConfig)
Badly substituted user agent from v3.11 is corrected ... ok
test_user_agent_substitutions (main.TestFeedConfig)
User agent with substitutions done is not written to config ... ok
test_verbose_setting_debug (main.TestFeedConfig)
Verbose setting set to debug in configuration should be respected ... ok
test_verbose_setting_info (main.TestFeedConfig)
Verbose setting set to info in configuration should be respected ... ok
test_delay (main.TestFetch)
Waits before fetching repeatedly from the same server ... (hangs)

$python3 --version
Python 3.9.2
(debian sid box)

@Ekleog
Copy link
Member

Ekleog commented Mar 20, 2021

Hmm… do you have all the development packages installed? If you have nix installed nix-shell does it for you, but if not I think you have to use poetry to install the things, potentially in a virtualenv — it's been a very long time I haven't developed without nix, so I'm a bit fuzzy on how things work nowadays, especially as IIRC poetry wasn't a thing last I did python dev pre-nix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants