Support more Second-Level Domain Names in lib http.cookiejar #135528

Closed as not planned
@LamentXU123

Description

Feature or enhancement

Proposal:

At line 1023 of Lib/http/cookiejar.py:

        if cookie.domain_specified:
            req_host, erhn = eff_request_host(request)
            domain = cookie.domain
            if self.strict_domain and (domain.count(".") >= 2):
                # XXX This should probably be compared with the Konqueror
                # (kcookiejar.cpp) and Mozilla implementations, but it's a
                # losing battle.
                i = domain.rfind(".")
                j = domain.rfind(".", 0, i)
                if j == 0:  # domain like .foo.bar
                    tld = domain[i+1:]
                    sld = domain[j+1:i]
                    if sld.lower() in ("co", "ac", "com", "edu", "org", "net",
                       "gov", "mil", "int", "aero", "biz", "cat", "coop",
                       "info", "jobs", "mobi", "museum", "name", "pro",
                       "travel", "eu") and len(tld) == 2:
                        # domain like .co.uk
                        _debug("   country-code second level domain %s", domain)
                        return False
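To make the index arithmetic in this excerpt easier to follow, here is a minimal standalone sketch of the same check. The function name `is_country_code_sld` and the constant `KNOWN_SLDS` are hypothetical names introduced for illustration; they do not exist in `http.cookiejar`, and the sketch leaves out the `strict_domain` flag and the rest of the policy logic.

```python
# Tuple copied from the excerpt above; KNOWN_SLDS is a hypothetical name.
KNOWN_SLDS = ("co", "ac", "com", "edu", "org", "net",
              "gov", "mil", "int", "aero", "biz", "cat", "coop",
              "info", "jobs", "mobi", "museum", "name", "pro",
              "travel", "eu")


def is_country_code_sld(domain, known_slds=KNOWN_SLDS):
    """Hypothetical helper: True where the excerpt would reject the cookie."""
    if domain.count(".") < 2:
        return False
    i = domain.rfind(".")        # index of the last dot
    j = domain.rfind(".", 0, i)  # index of the dot before it
    if j != 0:                   # only domains shaped like ".foo.bar" qualify
        return False
    tld = domain[i + 1:]
    sld = domain[j + 1:i]
    # Rejected only when the label is listed AND the TLD has two letters.
    return sld.lower() in known_slds and len(tld) == 2


for d in (".co.uk", ".or.jp", ".example.com", "www.co.uk"):
    print(d, is_country_code_sld(d))
```

Of the four sample domains, only `.co.uk` is flagged. `.or.jp` passes because "or" is not in the tuple, `.example.com` passes because its TLD is longer than two characters, and `www.co.uk` passes because it does not start with a dot.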

The second-level domain tuple was written in 2006. The current Wikipedia page lists second-level domains that are missing from it.

Source: https://en.wikipedia.org/wiki/Second-level_domain

I dumped the second-level domains of all the countries mentioned on that page. Some of the data comes from https://github.com/derangeddk/cc2lds (based on Wikipedia, written in 2022); I checked each of those entries to make sure it is correct. I copied the newer entries manually.

The dumped data is attached:

second-level-country-domains.zip

Then I wrote a script to count how many countries use each second-level domain.

import os
import yaml
from collections import defaultdict

def count_domains_in_folder(folder_path):
    domain_counts = defaultdict(int)
    total_countries = 0
    for filename in os.listdir(folder_path):
        if filename.endswith('.yml') or filename.endswith('.yaml'):
            filepath = os.path.join(folder_path, filename)
            try:
                with open(filepath, 'r', encoding='utf-8') as file:
                    data = yaml.safe_load(file)
                    if data and isinstance(data, dict):
                        for country_code, domains in data.items():
                            if isinstance(domains, list):
                                total_countries += 1
                                for domain in domains:
                                    domain_counts[domain] += 1
            except Exception as e:
                print(f"Error processing file {filename}: {e}")

    sorted_domains = sorted(domain_counts.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_domains, total_countries


# Second-level domains already listed in http.cookiejar (copied from the excerpt above).
KNOWN_SLDS = ("co", "ac", "com", "edu", "org", "net",
              "gov", "mil", "int", "aero", "biz", "cat", "coop",
              "info", "jobs", "mobi", "museum", "name", "pro",
              "travel", "eu")

folder_path = './second-level-country-domains'
sorted_domains, total_countries = count_domains_in_folder(folder_path)

print("\nin list domains: ")
for domain, count in sorted_domains:
    if domain in KNOWN_SLDS:
        print(f"in-list {domain}: {count} times ({count/total_countries:.1%})")
# Only the 20 most frequent domains are scanned here, so fewer than 20
# unlisted domains may be printed.
print('\nfirst 20 domains not in list:')
for domain, count in sorted_domains[:20]:
    if domain not in KNOWN_SLDS:
        print(f"not-in-list {domain}: {count} times ({count/total_countries:.1%})")

The results:

in list domains: 
in-list org: 21 times (77.8%)
in-list net: 21 times (77.8%)
in-list gov: 19 times (70.4%)
in-list edu: 18 times (66.7%)
in-list com: 17 times (63.0%)
in-list ac: 14 times (51.9%)
in-list co: 13 times (48.1%)
in-list mil: 13 times (48.1%)
in-list info: 5 times (18.5%)
in-list biz: 4 times (14.8%)
in-list int: 3 times (11.1%)
in-list name: 3 times (11.1%)
in-list coop: 2 times (7.4%)
in-list pro: 2 times (7.4%)
in-list mobi: 2 times (7.4%)
in-list travel: 2 times (7.4%)
in-list museum: 1 times (3.7%)
in-list aero: 1 times (3.7%)

first 20 domains not in list:
not-in-list tv: 5 times (18.5%)
not-in-list or: 4 times (14.8%)
not-in-list nom: 4 times (14.8%)
not-in-list sch: 4 times (14.8%)
not-in-list web: 4 times (14.8%)
not-in-list tm: 3 times (11.1%)
not-in-list gen: 3 times (11.1%)
not-in-list go: 3 times (11.1%)
not-in-list ltd: 3 times (11.1%)

Based on these results, I think we can add .or, .tv, .nom, .sch, and .web to the list.
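A rough sketch of what the proposal would change, assuming the five labels are simply appended to the existing tuple. `CURRENT_SLDS`, `PROPOSED_SLDS`, and `blocked` are hypothetical names used only for this illustration, and the check is a simplified restatement of the library code using `split`, not the actual patch.

```python
# Labels currently listed in http.cookiejar (copied from the excerpt above).
CURRENT_SLDS = frozenset((
    "co", "ac", "com", "edu", "org", "net",
    "gov", "mil", "int", "aero", "biz", "cat", "coop",
    "info", "jobs", "mobi", "museum", "name", "pro",
    "travel", "eu"))

# Assumption: the proposal just adds the five labels named in this issue.
PROPOSED_SLDS = CURRENT_SLDS | {"or", "tv", "nom", "sch", "web"}


def blocked(domain, slds):
    """Hypothetical helper: True if a strict-domain check would reject domain."""
    parts = domain.lower().split(".")
    # ".or.jp" splits into ["", "or", "jp"]: a leading dot plus two labels.
    return (len(parts) == 3 and parts[0] == ""
            and parts[1] in slds and len(parts[2]) == 2)


for d in (".co.uk", ".or.jp", ".sch.uk"):
    print(d, blocked(d, CURRENT_SLDS), blocked(d, PROPOSED_SLDS))
```

With the current set, only `.co.uk` is rejected. With the proposed set, `.or.jp` and `.sch.uk` are rejected as well, so a site could no longer set a cookie scoped to every host under those suffixes when `strict_domain` is enabled.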

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Labels: stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)