Support unicode and punycode when validating TLDs #1182

jamescurtin · 2020-01-20T19:30:45Z

Feature Request

Output of python -c "import pydantic.utils; print(pydantic.utils.version_info())":

             pydantic version: 1.3
            pydantic compiled: True
                 install path: /usr/local/lib/python3.8/site-packages/pydantic
               python version: 3.8.1 (default, Jan  3 2020, 22:55:55)  [GCC 8.3.0]
                     platform: Linux-4.9.184-linuxkit-x86_64-with-glibc2.2.5
     optional deps. installed: ['typing-extensions']

The IANA has approved many TLDs that are not matched by the TLD domain ending regex used for HttpUrl validation. There are currently ~152 such TLDs: see the entries in the authoritative list of TLDs containing the ASCII Compatible Encoding prefix xn--.
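To make the `xn--` prefix concrete, here is a minimal sketch of how a Unicode TLD maps to its ASCII Compatible Encoding using only the standard library's `idna` codec (which implements the older IDNA 2003 rules). The helper names `to_ace` and `to_unicode` are hypothetical and exist only for illustration.

```python
def to_ace(host: str) -> str:
    """Encode each label of a host name to its ASCII (punycode) form."""
    return host.encode("idna").decode("ascii")


def to_unicode(host: str) -> str:
    """Decode any xn-- labels in a host name back to Unicode."""
    return host.encode("ascii").decode("idna")


# The Cyrillic TLD for Russia and its ACE form, as listed by IANA
ace_host = to_ace("example.рф")
unicode_host = to_unicode("example.xn--p1ai")
```

Both spellings refer to the same TLD, so a validator arguably needs to accept either one.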

One approach to adding compatibility for such TLDs would be to modify the domain ending regex pattern to allow for Unicode characters, as well as the corresponding internationalized ASCII strings. For example:

_domain_ending = r"(?P<tld>(\.[^\W\d_]{2,63})|(\.(?:xn--)[_0-9a-z-]{2,63}))?\.?"
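As a rough check of what this pattern accepts, the sketch below applies it on its own to bare host names, outside of pydantic. The `extract_tld` helper and the end-of-string anchor are assumptions added for this illustration; inside pydantic the pattern is only one fragment of a larger URL regex.

```python
import re

# Pattern copied from the proposal above
DOMAIN_ENDING = r"(?P<tld>(\.[^\W\d_]{2,63})|(\.(?:xn--)[_0-9a-z-]{2,63}))?\.?"

# Anchor at the end of the string so only the trailing label is examined
_tld_re = re.compile(DOMAIN_ENDING + r"\Z", re.IGNORECASE)


def extract_tld(host: str):
    """Return the matched TLD (with its leading dot), or None if there is none."""
    match = _tld_re.search(host)
    return match.group("tld") if match else None


for host in ["example.com", "example.xn--p1ai", "example.рф", "example.123", "example.ab34"]:
    print(host, "->", extract_tld(host))
```

The first alternative, `[^\W\d_]`, means "any word character that is not a digit or underscore", which for Python `str` patterns covers Unicode letters. The second alternative handles the `xn--` form, where digits and hyphens are allowed.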

Such a change would allow the following to run successfully:

from pydantic import BaseModel, HttpUrl, ValidationError 


class Domain(BaseModel):
    domain: HttpUrl
        

ascii_domains = ["https://example.com"]

idna_domains = [
    "https://example.xn--p1ai",
    "https://example.xn--vermgensberatung-pwb",
    "https://example.xn--zfr164b",
]

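# The same domains with the punycode TLD decoded to Unicode
unicode_domains = [domain.encode().decode("idna") for domain in idna_domains]

valid_domains = ascii_domains + idna_domains + unicode_domains

invalid_domains = ["https://example.123", "https://example.ab34"]

# Every valid domain should pass validation without raising
for domain in valid_domains:
    Domain(domain=domain)

# Every invalid domain should be rejected with a ValidationError
for invalid_domain in invalid_domains:
    try:
        Domain(domain=invalid_domain)
    except ValidationError:
        pass
    else:
        raise AssertionError(f"{invalid_domain} was unexpectedly accepted")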

Would you accept a PR for this?

samuelcolvin · 2020-01-20T21:19:21Z

Yes, I would.

One concern: you might need to wrap the regex in a function to avoid increasing module import time, same as we do for other regexes.
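One way to read this suggestion is to defer `re.compile` until the pattern is first needed, then reuse the compiled object. The sketch below uses `functools.lru_cache` for that; the function name `domain_ending_regex` is hypothetical, and this is not necessarily how pydantic structures its own regex helpers.

```python
import re
from functools import lru_cache
from typing import Pattern

# Pattern text taken from the proposal earlier in this thread
_DOMAIN_ENDING = r"(?P<tld>(\.[^\W\d_]{2,63})|(\.(?:xn--)[_0-9a-z-]{2,63}))?\.?"


@lru_cache(maxsize=None)
def domain_ending_regex() -> Pattern[str]:
    """Compile the pattern on the first call and return the cached object afterwards."""
    return re.compile(_DOMAIN_ENDING, re.IGNORECASE)


# Nothing is compiled at import time; the cost is paid here, on first use
match = domain_ending_regex().fullmatch(".xn--p1ai")
```

Importing the module then only defines a string and a function, so the compile cost is paid once by the first caller rather than by every import.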

jamescurtin added the feature request label Jan 20, 2020

samuelcolvin added the help wanted Pull Request welcome label Jan 20, 2020

jamescurtin mentioned this issue Jan 21, 2020

Support unicode and punycode when validating TLDs #1183

Merged

samuelcolvin closed this as completed in #1183 Jan 21, 2020