Fixes #3656 - Top sites priority change #3690

Merged
karlcow merged 1 commit into main from issue/3656/1 on May 2, 2022

Conversation

Contributor

@ksy36 ksy36 commented Apr 18, 2022

The new script fetches results from both Tranco (global) and Alexa (regional) and saves the results in two DBs. The regional DB only includes domains whose priority is higher than in the global DB, so it's much smaller. I decided to store the rankings in two DBs because once Alexa's API is deprecated, we can keep updating the global DB and preserve the regional one for some time.

The script can be run in this way:
python3 ./tools/fetch_topsites.py --retrieve-regional --ats_access_key=<access_key> --ats_secret_key=<secret_key>

and works as follows:

  1. Gets top 10k csv from Tranco and saves it to the "global sites" db and cache.
  2. Optionally fetches top 100 sites from specified regions if --retrieve-regional parameter is passed and saves the results in "regional sites" db.
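The regional-DB filtering idea described above (only keep domains whose regional priority beats their global one) can be sketched like this. This is a minimal illustration, not the PR's actual code; the function name and the dict-based ranking representation are hypothetical:

```python
# Sketch of the filtering described in the PR description: keep a regional
# entry only when it outranks (lower number = higher priority) the global
# entry, or when the domain is absent from the global top list entirely.

def filter_regional(global_ranks, regional_ranks):
    """Return the subset of regional_ranks worth storing separately."""
    kept = {}
    for domain, regional_priority in regional_ranks.items():
        global_priority = global_ranks.get(domain)
        if global_priority is None or regional_priority < global_priority:
            kept[domain] = regional_priority
    return kept


global_ranks = {'example.com': 1, 'mozilla.org': 3, 'wikipedia.org': 2}
regional_ranks = {'mozilla.org': 1, 'wikipedia.org': 2, 'local-news.fr': 3}
# mozilla.org outranks its global entry, local-news.fr has no global entry,
# wikipedia.org ties with its global rank and is dropped.
print(filter_regional(global_ranks, regional_ranks))
```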

@ksy36 ksy36 force-pushed the issue/3656/1 branch 2 times, most recently from b570c47 to 0ef8b4a Compare April 21, 2022 22:32
Comment thread tools/topsites/alexa.py
@@ -0,0 +1,156 @@
#!/usr/bin/python
Contributor Author

This class is based on the code from the original topsites.py

@ksy36 ksy36 marked this pull request as ready for review April 21, 2022 23:07
@ksy36 ksy36 requested a review from karlcow April 21, 2022 23:07
@ksy36
Contributor Author

ksy36 commented Apr 21, 2022

r? @karlcow

Member

@karlcow karlcow left a comment

Thanks @ksy36
Big update! Impressive.

I'm a bit concerned by the amount of mocking we do for testing the DB. It's inherited code, so it would probably be good to open a new issue on this for the future. That could help us test some weird scenarios.

There's also probably a worthwhile optimization: some queries sit in a for loop but will only ever return one result.

Requesting changes just to get the discussion rolling and hear your thoughts.

Comment thread webcompat/webhooks/helpers.py Outdated
Comment on lines 118 to 119
# No host_name in DB, find less-level domain (>2)
# If host_name is lv4.lv3.example.com, find lv3.example.com/example.com
Member

I wonder why we left host_name instead of hostname in there. Left over from a previous version?

Contributor Author

Yeah, I've left this part untouched from the previous version. Updated now though :)

Comment thread webcompat/webhooks/helpers.py Outdated
Comment on lines +113 to +114
for site in regional_site_db.query(SiteRegional).filter_by(url=hostname):  # noqa
    return f'priority-{priorities[site.priority - 1]}'
Member

I don't remember why the code is using a for loop. Maybe an opportunity to fix it.

Do we expect regional_site_db.query(SiteRegional).filter_by(url=hostname) to return more than one domain? Even if it did, only the first hit would be taken into consideration, because of the return.

Basically the code is currently the equivalent of

sites = ['A', 'B', 'C', 'A2', 'C']
def extract(sites):
    for site in sites:
        if site.startswith('A'):
            return site

which would return 'A' always.

Should we just have something like:

Suggested change
-for site in regional_site_db.query(SiteRegional).filter_by(url=hostname):  # noqa
-    return f'priority-{priorities[site.priority - 1]}'
+site = regional_site_db.query(SiteRegional).filter_by(url=hostname).first()
+if not site:
+    site = global_site_db.query(SiteGlobal).filter_by(url=hostname).first()
+if site:
+    return f'priority-{priorities[site.priority - 1]}'
+# … and so on for the subdomains.

.first() returns None if the row doesn't exist.

or something else.

Maybe there is a function to extract here, because the same code is repeated for each loop.

Or maybe I totally misunderstood the code 🍭 It's been a long time since I touched this part.
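The helper extraction hinted at above could look roughly like this. A minimal sketch: plain dicts stand in for the SQLAlchemy `session.query(Model).filter_by(url=...).first()` lookups, and the `PRIORITIES` label names are made up for illustration, not the project's actual labels:

```python
# Sketch of the "regional first, then global" lookup as a single helper.
# regional_db / global_db are dicts mapping url -> priority rank here,
# standing in for the two SQLAlchemy sessions in the PR.

PRIORITIES = ['critical', 'important', 'normal']  # hypothetical labels


def priority_label(hostname, regional_db, global_db):
    """Return a priority label for hostname, preferring the regional rank."""
    priority = regional_db.get(hostname)
    if priority is None:
        priority = global_db.get(hostname)
    if priority is None:
        return None
    return f'priority-{PRIORITIES[priority - 1]}'


# Regional rank 1 wins over global rank 3 for the same hostname.
print(priority_label('mozilla.org', {'mozilla.org': 1}, {'mozilla.org': 3}))
```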

Contributor Author

@ksy36 ksy36 Apr 26, 2022

The domain names are unique yes, so using .first() will work indeed. Thanks for the suggestion, I've changed the function.

Comment thread webcompat/webhooks/helpers.py Outdated
Comment on lines +125 to +128
for site in regional_site_db.query(SiteRegional).filter_by(url=domain):  # noqa
    return f'priority-{priorities[site.priority - 1]}'

for site in global_site_db.query(SiteGlobal).filter_by(url=domain):
Member

The fact that we do it here again makes me think there is an opportunity to extract a function, both to reduce the duplication and to make it testable.
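The subdomain fallback the code comments describe (lv4.lv3.example.com → lv3.example.com → example.com) could itself be a small, easily tested helper. A sketch, with a hypothetical function name not taken from the PR:

```python
def parent_domains(hostname, min_labels=2):
    """Yield hostname, then each successively shorter parent domain.

    Stops once only min_labels labels remain, so for
    'lv4.lv3.example.com' this yields 'lv4.lv3.example.com',
    'lv3.example.com' and finally 'example.com'.
    """
    labels = hostname.split('.')
    for i in range(len(labels) - min_labels + 1):
        yield '.'.join(labels[i:])


print(list(parent_domains('lv4.lv3.example.com')))
```

A lookup loop could then try each candidate domain in order and return on the first DB hit, instead of repeating the query block per level.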

# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

"""Tests for Siterank class."""
Member

For the future, maybe there should be two test DBs covering what we want instead of mocking, or a dictionary of data in the setup that creates the test DB. Something to think about. Nothing to do now.
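The suggestion above could be done along these lines, sketched with the stdlib sqlite3 module rather than the project's SQLAlchemy setup; the table and column names are hypothetical:

```python
import sqlite3


def make_test_db(rows):
    """Create an in-memory sites DB seeded from a dict of url -> priority."""
    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE topsites (url TEXT PRIMARY KEY, priority INTEGER)')
    db.executemany('INSERT INTO topsites VALUES (?, ?)', rows.items())
    return db


# Tests then query real rows instead of configuring mock return values.
db = make_test_db({'example.com': 1, 'mozilla.org': 2})
row = db.execute(
    'SELECT priority FROM topsites WHERE url = ?', ('mozilla.org',)).fetchone()
```

Seeding from a dictionary keeps each test's data visible at the call site, and an in-memory DB exercises real query behavior that mocks can't.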

Contributor Author

Yeah, good idea, I'll file an issue for that.

Comment thread tools/fetch_topsites.py
Comment on lines +23 to +25
REGIONS = ['US', 'FR', 'IN', 'DE', 'TW', 'ID', 'HK', 'SG', 'PL',
'GB', 'RU']
Member

Another thought: this was initially created like this, I guess, because of the priority markets for Firefox, but we can imagine that other browsers might want to extend this. Nothing to do now.

Comment thread tools/fetch_topsites.py Outdated
Comment on lines +50 to +51
if args.retrieve_regional:
print('Warning: Alexa APIs will be deprecated on December 15, 2022.')
Member

Should the code have a failsafe feature? Either based on the date ("Hey, it's past December 15, 2022. We will ignore the request.") or based on the request to Alexa failing.
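The date-based variant of that failsafe could be as small as this. A sketch of the idea, not the check the PR actually added; the names are hypothetical:

```python
from datetime import date

# Alexa APIs are deprecated after this date (stated in the script's warning).
ALEXA_DEPRECATION_DATE = date(2022, 12, 15)


def alexa_available(today=None):
    """Return False once the Alexa deprecation date has passed."""
    today = today or date.today()
    return today < ALEXA_DEPRECATION_DATE


# The fetch script would skip --retrieve-regional when this is False.
print(alexa_available(date(2022, 6, 1)))
```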

Contributor Author

Thanks, I've added the check 👍

@ksy36
Contributor Author

ksy36 commented Apr 26, 2022

Could you please take another look @karlcow? I've refactored extract_priority_label and added the check for Alexa deprecation date.

Member

@karlcow karlcow left a comment

Thanks a lot @ksy36 for addressing the changes. 🔝 🥳

@karlcow karlcow merged commit 9da9698 into main May 2, 2022
@karlcow karlcow deleted the issue/3656/1 branch May 2, 2022 04:34
@ksy36
Contributor Author

ksy36 commented May 2, 2022

Thanks!!
