Conversation
b570c47 to 0ef8b4a
| @@ -0,0 +1,156 @@ | |||
| #!/usr/bin/python | |||
This class is based on the code from the original topsites.py
|
r? @karlcow |
karlcow
left a comment
Thanks @ksy36
Big update! Impressive.
I'm a bit concerned by the amount of mocking we do for testing the DB. The code is inherited, so it would probably be good to open a new issue on this for the future. That could help us test some weird scenarios.
There's also probably a worthwhile optimization for the queries: a for loop is used where only one result will be returned.
Requesting changes just to get the discussion rolling and hear your thoughts.
| # No host_name in DB, find less-level domain (>2) | ||
| # If host_name is lv4.lv3.example.com, find lv3.example.com/example.com |
I wonder why we left host_name instead of hostname in there. Left over from a previous version?
Yeah, I've left this part untouched from the previous version. Updated now though :)
| for site in regional_site_db.query(SiteRegional).filter_by(url=hostname): # noqa | ||
| return f'priority-{priorities[site.priority - 1]}' |
I don't remember why the code is using a for loop. Maybe an opportunity to fix it.
Do we expect regional_site_db.query(SiteRegional).filter_by(url=hostname) to return more than one domain? Even if it did, only the first hit would be taken into consideration, because of the return.
Basically the code is currently the equivalent of:

sites = ['A', 'B', 'C', 'A2', 'C']

def extract(sites):
    for site in sites:
        if site.startswith('A'):
            return site

which would return 'A' always.
Should we just have something like:
| for site in regional_site_db.query(SiteRegional).filter_by(url=hostname): # noqa | |
| return f'priority-{priorities[site.priority - 1]}' | |
| site = regional_site_db.query(SiteRegional).filter_by(url=hostname).first() | |
| if not site: | |
| site = global_site_db.query(SiteGlobal).filter_by(url=hostname).first() | |
| if site: | |
| return f'priority-{priorities[site.priority - 1]}' | |
| #… and so on for the subdomains. |
.first() returns None if the row doesn't exist.
or something else.
Maybe there is a function to extract here, because the code repeats itself in a loop.
Or maybe I totally misunderstood the code 🍭 It's been a long time since I touched this part.
The domain names are unique yes, so using .first() will work indeed. Thanks for the suggestion, I've changed the function.
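For reference, the factored-out lookup could look something like this sketch. It uses plain dicts as hypothetical stand-ins for the `.first()` queries against the regional and global tables; the names (`lookup_priority`, `PRIORITIES`) are illustrative, not the actual patch:

```python
PRIORITIES = ['critical', 'important', 'normal']  # hypothetical labels

def lookup_priority(hostname, regional, global_sites):
    """Return a priority class for hostname, checking the regional DB first.

    `regional` and `global_sites` are hypothetical dicts mapping a
    domain to its stored 1-based priority, standing in for
    `db.query(...).filter_by(url=domain).first()`.
    """
    # Walk up the domain: lv4.lv3.example.com -> lv3.example.com -> example.com
    parts = hostname.split('.')
    for i in range(len(parts) - 1):
        domain = '.'.join(parts[i:])
        # Regional entry wins over the global one, mirroring the lookup order
        priority = regional.get(domain) or global_sites.get(domain)
        if priority:
            return f'priority-{PRIORITIES[priority - 1]}'
    return None
```

Because the walk over subdomain levels lives in one function, both the regional and the global lookup share a single code path instead of repeating the query in a loop.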
| for site in regional_site_db.query(SiteRegional).filter_by(url=domain): # noqa | ||
| return f'priority-{priorities[site.priority - 1]}' | ||
| for site in global_site_db.query(SiteGlobal).filter_by(url=domain): |
The fact that we do it here again makes me think there is an opportunity for a function to reduce it and test it.
| # License, v. 2.0. If a copy of the MPL was not distributed with this | ||
| # file, You can obtain one at http://mozilla.org/MPL/2.0/. | ||
| """Tests for Siterank class.""" |
For the future, maybe there should be two mock DBs covering what we want instead of mocking. Or, in the setup, a dictionary with data creating the mock DB. Something to think about. Nothing to do now.
Yeah, good idea, I'll file an issue for that.
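As a rough illustration of that idea, the test setup could seed a throwaway in-memory DB from a dictionary instead of mocking the queries. This sketch uses the stdlib sqlite3 module with a hypothetical table and column names, not the project's actual SQLAlchemy setup:

```python
import sqlite3

def make_test_db(rows):
    """Create a disposable in-memory DB seeded from a dict, instead of mocking.

    `rows` maps url -> priority; the table and column names are hypothetical.
    """
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE sites (url TEXT PRIMARY KEY, priority INTEGER)')
    conn.executemany('INSERT INTO sites VALUES (?, ?)', rows.items())
    return conn

# Usage in a test setup: seed the fixture, then query it like the real DB.
db = make_test_db({'example.com': 1, 'mozilla.org': 2})
row = db.execute('SELECT priority FROM sites WHERE url = ?',
                 ('example.com',)).fetchone()
```

Seeding real (if tiny) databases this way lets the tests exercise the actual query path, including the weird scenarios mentioned above.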
| REGIONS = ['US', 'FR', 'IN', 'DE', 'TW', 'ID', 'HK', 'SG', 'PL', | ||
| 'GB', 'RU'] |
Another thought: this was initially created like this, I guess, because of priority markets for Firefox, but we can imagine that other browsers could have a desire to extend this. Nothing to do now.
| if args.retrieve_regional: | ||
| print('Warning: Alexa APIs will be deprecated on December 15, 2022.') |
Should the code have a failsafe feature? Either based on the date ("Hey, it's past December 15, 2022. We will ignore the request.") or based on the request to Alexa failing.
Thanks, I've added the check 👍
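The date-based guard could be as simple as this sketch. `ALEXA_SUNSET` and the function name are hypothetical; only the December 15, 2022 date comes from the thread:

```python
from datetime import date

ALEXA_SUNSET = date(2022, 12, 15)  # announced Alexa API deprecation date

def alexa_available(today=None):
    """Return False once the Alexa API deprecation date has passed.

    `today` is injectable so tests don't depend on the wall clock.
    """
    today = today or date.today()
    return today < ALEXA_SUNSET
```

The script could check this before issuing the `--retrieve-regional` request and print the warning (or skip the request) when it returns False.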
Could you please take another look @karlcow? I've refactored
Thanks!!
The new script fetches results from both Tranco (global) and Alexa (regional) and saves them in two DBs. The regional DB only includes domains whose priority is higher than in the global DB, so it's much smaller. I decided to store the rankings in two DBs because once Alexa's API is deprecated, we can still update the global DB and preserve the regional one for some time.
The script can be run in this way:

python3 ./tools/fetch_topsites.py --retrieve-regional --ats_access_key=<access_key> --ats_secret_key=<secret_key>

and works as follows:
If the --retrieve-regional parameter is passed, the script saves the results in the "regional sites" DB.
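The "only keep regional domains ranked higher than globally" filtering described above could be sketched like this, assuming a lower number means a higher priority and using dicts as hypothetical stand-ins for the two DBs:

```python
def regional_overrides(regional_ranks, global_ranks):
    """Keep only regional entries whose priority beats the global one.

    Both arguments are hypothetical dicts mapping url -> priority,
    where a lower number means a higher priority. Entries that are
    no better than the global ranking are dropped, which is why the
    regional DB stays much smaller.
    """
    return {
        url: priority
        for url, priority in regional_ranks.items()
        if priority < global_ranks.get(url, float('inf'))
    }
```

Domains absent from the global DB are always kept (they compare against infinity), since any regional ranking is an improvement over no ranking at all.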