This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

Don't send any user's browser info upstream (Accept-Language header, etc) #648

Closed
ghost opened this issue Jul 26, 2016 · 5 comments

@ghost

ghost commented Jul 26, 2016

The developer said the "Accept-Language" header is sent to the upstream service when you use searx.
This is bad because the value is sometimes unique, so an upstream service such as Google could use it to track or profile users.

Accept-Language Example:
en-US, en
en
en, en-US
en-GB etc...

Many combinations can be found in the wild.
It would be nice if searx did NOT use the browser's data at all and used the "options" cookie instead
(and, if the user blocks cookies, returned English results by default).
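
The behaviour requested above could be sketched roughly as follows. This is only an illustration, not searx's actual code: the function name, the "language" cookie key, and the list of supported languages are all assumptions made for the example.

```python
# Hypothetical sketch: choose the search language from a preference cookie
# only, never from the browser's Accept-Language header.

DEFAULT_LANGUAGE = "en"
# Assumed set of supported language codes, for illustration only.
SUPPORTED_LANGUAGES = {"en", "de", "fr", "es"}


def resolve_search_language(cookies, supported=SUPPORTED_LANGUAGES,
                            default=DEFAULT_LANGUAGE):
    """Return the language stored in the cookie, or the default.

    `cookies` is a plain dict of cookie name -> value. If the user blocks
    cookies the dict is empty and the default (English) is returned.
    """
    value = cookies.get("language", "")  # "language" is an assumed cookie key
    if value in supported:
        return value
    return default


print(resolve_search_language({"language": "de"}))
print(resolve_search_language({}))
```

Note that the request headers are never passed to this function, so there is no way for the browser's Accept-Language value to influence the result.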

@MarcAbonce
Contributor

But it doesn't.
I think there was some misinterpretation in #641. Nowhere in the code are the user's browser's HTTP headers actually consumed.
When I print the outgoing search requests generated here for Google and Bing, I get:

[
    (
        <function get at 0x7f008dee00c8>,
        'https://www.google.com/search?q=this+is+my+query&start=0&gws_rd=cr&gbv=1&lr=&ei=x',
        {
            'headers': {
                'Accept-Language': 'en',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86; rv:44.0) Gecko/20100101 Firefox/44.0'
            },
            'cookies': {},
            'hooks': {
                'response': <function process_callback at 0x7f008d369140>
            },
            'timeout': 2.0,
            'verify': True
        },
        u'google'
    ),
    (
        <function get at 0x7f008dee00c8>,
        'https://www.bing.com/search?q=this+is+my+query&setmkt=en-US&first=1',
        {
            'headers': {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86; rv:44.0) Gecko/20100101 Firefox/44.0'
            },
            'cookies': {'SRCHHPGUSR': 'NEWWND=0&NRSLT=-1&SRCHLANG=en'},
            'hooks': {
                'response': <function process_callback at 0x7f008d3692a8>
            },
            'timeout': 2.0,
            'verify': True
        },
        u'bing'
    )
]

The only outgoing request with an Accept-Language header here is for Google, which is set here (my actual browser's Accept-Language header is not en) based on the language parameter set here (self.lang is the language set on the cookie or in the query).
Other engines might set this header, but never based on the user's Accept-Language header.
If you find an engine request that actually leaks a user's data, though, please point it out so it can be fixed.
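
For anyone who wants to check an engine for such a leak, a rough sketch of the idea is below. It is not part of searx; both function names are hypothetical. It builds the outgoing headers from the configured search language only, then reports any outgoing header whose value is identical to what the browser sent.

```python
# Hypothetical sketch: build outgoing headers from the chosen language only,
# and flag any outgoing header that carries the browser's own value.


def build_outgoing_headers(search_lang, user_agent):
    """Headers for the upstream request, derived from settings only."""
    return {
        "Accept-Language": search_lang,
        "User-Agent": user_agent,
    }


def leaked_headers(browser_headers, outgoing_headers):
    """Return the names of outgoing headers whose value equals the
    value the browser sent for the same header (case-insensitive names)."""
    browser = {name.lower(): value for name, value in browser_headers.items()}
    return sorted(
        name
        for name, value in outgoing_headers.items()
        if browser.get(name.lower()) == value
    )


browser = {"Accept-Language": "es-MX,es;q=0.8", "User-Agent": "RealBrowser/1.0"}
outgoing = build_outgoing_headers(
    "en", "Mozilla/5.0 (X11; Linux x86; rv:44.0) Gecko/20100101 Firefox/44.0"
)
print(leaked_headers(browser, outgoing))
```

A match is only a hint, not proof: if the browser itself happens to send "en", the values will coincide even though nothing was copied from the browser.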

@asciimoo
Member

@logouthere Sorry, I was wrong: searx doesn't send this option to the services yet. It is just a planned feature.

@ghost
Author

ghost commented Jul 28, 2016

Yep, sorry for the misinformation. Unlike @a01200356, I just don't have time to analyze the source code at the moment.

> send this option to the services 'yet'
> It is just a planned feature

OK, but I hope searx never sends any user information (yes, anything) to an upstream service like Google.

/master/searx/search.py
Line 266: user_agent = request.headers.get('User-Agent', '') (disabled by # mark)
Line 267: user_agent = gen_useragent()

  1. The user_agent sent to the upstream engine must be randomized.

https://github.com/dillbyrne/random-agent-spoofer/blob/master/data/json/useragents.json
Take a look at this for a good User-Agent collection. When gen_useragent() is called, it would return one randomly chosen
string, for example "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.75 Safari/537.36".
The upstream engine then can't identify the request as coming from searx, because the UA is NOT static (the same UA every time).

  2. "Accept" can be randomized as well.
  3. Is "Accept-Language" already 'normalized'?
    I mean, if I access searx from a UK computer (Accept-Language = en-GB, en | en-GB), will searx send ONLY "en"?
    If searx sends "en-GB", this is a problem (I'm expecting "en" for all English).
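
To make points 1 and 3 concrete, here is a minimal sketch. It is not searx's gen_useragent(); the function names and the tiny User-Agent list are assumptions for illustration (the first two strings are taken from this thread, the third is just an example).

```python
import random

# Illustrative pool only; a real deployment would use a larger, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/52.0.2743.75 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86; rv:44.0) Gecko/20100101 Firefox/44.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0",
]


def pick_random_useragent(rng=random):
    """Return one User-Agent string chosen at random from the pool."""
    return rng.choice(USER_AGENTS)


def normalize_language(tag, default="en"):
    """Reduce a language value such as 'en-GB, en' or 'en_GB' to its
    primary subtag ('en'), so every regional variant looks the same."""
    first = tag.split(",")[0].split(";")[0].strip()
    primary = first.replace("_", "-").split("-")[0].lower()
    # Empty strings and wildcards such as '*' fall back to the default.
    return primary if primary.isalpha() else default


print(pick_random_useragent())
print(normalize_language("en-GB, en"))
```

Collapsing every tag to its primary subtag throws away the region, so an engine that relies on it (for example to tell pt-BR from pt-PT) would return less specific results; that is the trade-off behind the question in point 3.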

@ghost
Author

ghost commented Jul 28, 2016

Except the search query, of course.

@dalf dalf added the security label Aug 27, 2016
@unixfox
Member

unixfox commented Feb 8, 2021

This depends heavily on the engine. For instance, the Google engine doesn't send the user's Accept-Language header but the language that the user chose to use.

@unixfox unixfox closed this as completed Feb 8, 2021