Avoid using the locale encoding for open() in tests #85235

Closed
serhiy-storchaka opened this issue Jun 21, 2020 · 3 comments
Labels
3.7 (EOL) end of life 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes tests Tests in the Lib/test dir type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Member

serhiy-storchaka commented Jun 21, 2020

BPO 41063
Nosy @vstinner, @methane, @serhiy-storchaka
Dependencies
  • bpo-41048: read_mime_types() should read the rule file using UTF-8, not the locale encoding
  • bpo-41055: Remove outdated tests for tp_print
  • bpo-41058: pdb reads source files using the locale encoding
  • bpo-41069: Use non-ascii file names in tests by default
  • #85308: argparse uses default encoding when read arguments from file
  • bpo-41137: pdb uses the locale encoding for .pdbrc
  • bpo-41138: trace CLI reads source files using the locale encoding
  • bpo-41139: cgi uses the locale encoding for log files
  • bpo-41140: cgitb uses the locale encoding for log files
  • bpo-41143: distutils uses the locale encoding for the .pypirc file
  • #85322: pipes uses text files and the locale encoding
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2020-06-21.11:04:15.601>
    labels = ['3.7', '3.8', '3.9', '3.10', 'type-feature', 'tests']
    title = 'Avoid using the locale encoding for open() in tests'
    updated_at = <Date 2020-06-28.16:23:08.646>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2020-06-28.16:23:08.646>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Tests']
    creation = <Date 2020-06-21.11:04:15.601>
    creator = 'serhiy.storchaka'
    dependencies = ['41048', '41055', '41058', '41069', '41136', '41137', '41138', '41139', '41140', '41143', '41150']
    files = []
    hgrepos = []
    issue_num = 41063
    keywords = []
    message_count = 1.0
    messages = ['371994']
    nosy_count = 3.0
    nosy_names = ['vstinner', 'methane', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue41063'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10']

    @serhiy-storchaka
    Member Author

    Many tests use open() with the locale encoding for writing or reading files. They pass because the data written and read is ASCII, and the file paths are ASCII. But they do not test the case of non-ASCII data and file paths. In general, most uses of the locale encoding should be changed.

    1. In some cases it is enough to open the file in binary mode, for example when creating an empty file or when using just the fileno of the opened file.

    2. In some cases the file should be opened in binary mode, for example when compiling the content of the file or parsing it as XML, because the correct encoding is determined by the content (BOM, encoding cookie, XML declaration).

    3. tokenize.open() or tokenize.detect_encoding() should be used when we read Python source as text.

    4. os.fsdecode() and os.fsencode() may be used if the test file contains file paths and is read by bash or another external program.

    5. encoding='ascii' should be specified if the test data is always ASCII-only.

    6. encoding='utf-8' should be specified if the test data can contain arbitrary Unicode characters.

    7. An encoding different from 'ascii', 'latin1' and 'utf-8' should be used if arbitrary encodings should be supported.

    8. The implicit locale encoding should only be used if the purpose of the test is to test the implicit encoding.
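
A rough sketch of cases 1, 2, 3 and 6 from the list above, run against a throwaway directory. The helper name `demo` and the file names are made up for illustration and are not taken from the actual patch:

```python
import os
import tempfile
import tokenize


def demo(tmpdir):
    # Case 1: binary mode is enough to create an empty file.
    empty = os.path.join(tmpdir, "empty")
    with open(empty, "wb"):
        pass

    # Write a Python source file whose encoding is declared by a coding cookie.
    source = os.path.join(tmpdir, "mod.py")
    with open(source, "wb") as f:
        f.write("# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n".encode("latin-1"))

    # Case 2: read bytes and let compile() pick the encoding from the content.
    namespace = {}
    with open(source, "rb") as f:
        exec(compile(f.read(), source, "exec"), namespace)

    # Case 3: tokenize.open() detects the BOM or coding cookie before decoding.
    with tokenize.open(source) as f:
        text = f.read()

    # Case 6: arbitrary Unicode test data gets an explicit encoding on both sides.
    data = os.path.join(tmpdir, "data.txt")
    with open(data, "w", encoding="utf-8") as f:
        f.write("snowman: \u2603\n")
    with open(data, encoding="utf-8") as f:
        roundtrip = f.read()

    return os.path.getsize(empty), namespace["name"], text, roundtrip


with tempfile.TemporaryDirectory() as tmpdir:
    size, name, text, roundtrip = demo(tmpdir)
```

None of these calls depends on the locale, so the results should be the same on a UTF-8 Linux box and on Windows with a legacy code page.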

    It is preferable to add non-ASCII characters in the test data.
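
To illustrate why ASCII-only test data hides the problem, here is a small in-memory sketch. The function `survives_mismatch` is hypothetical, and the UTF-8 and Latin-1 pair is only one example of a writer and reader disagreeing about the encoding:

```python
def survives_mismatch(text, write_enc="utf-8", read_enc="latin-1"):
    """Return True if text encoded as write_enc decodes unchanged as read_enc."""
    try:
        return text.encode(write_enc).decode(read_enc) == text
    except UnicodeError:
        # The reader could not decode the bytes at all.
        return False


ascii_ok = survives_mismatch("plain ascii")  # mismatch goes unnoticed
non_ascii_ok = survives_mismatch("caf\xe9")  # mismatch is exposed
```

ASCII text comes back unchanged even though the two sides disagree, so a test with such data passes for the wrong reason; a single non-ASCII character is enough to make the mismatch visible.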

    I am working on a large patch for this (>50% is ready). Some parts of it may be extracted as separate PRs, and the rest will be submitted as a large PR. If changes are required not only in tests, separate issues will be opened.

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes tests Tests in the Lib/test dir type-feature A feature request or enhancement labels Jun 21, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @vstinner
    Member

    You can use python -X warn_default_encoding to get a warning when open() is called without an explicit encoding.
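
A minimal sketch of checking this from a script, assuming Python 3.10 or later (where the option and EncodingWarning exist). The helper name `stderr_with_encoding_warnings` and the two snippets are made up for illustration:

```python
import subprocess
import sys

# Snippets run in a child interpreter; os.devnull is just a file that always exists.
IMPLICIT = "import os; open(os.devnull).close()"
EXPLICIT = "import os; open(os.devnull, encoding='utf-8').close()"


def stderr_with_encoding_warnings(snippet):
    """Run snippet under -X warn_default_encoding and return the child's stderr."""
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", snippet],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return result.stderr
```

On 3.10+, the output for `IMPLICIT` should mention EncodingWarning, while `EXPLICIT` should not add one of its own. The option has to be given at interpreter startup, which is why the sketch uses a child process.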

    @methane fixed many stdlib modules. I don't know the status of the test suite.

    @vstinner
    Member

    All issues listed in this meta-issue have been fixed. I am closing the issue.

    Projects
    Status: Doc issues
    Status: Done

    2 participants