Merge pull request #64 from LuminosoInsight/test-cli
Version 5.0: drop Python 2, add CLI tests
alin-luminoso committed Mar 9, 2017
2 parents 39bde07 + 1f665ac commit f88b40b
Showing 25 changed files with 176 additions and 296 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,16 @@
## Version 5.0 (February 17, 2017)

Breaking changes:

- Dropped support for Python 2. If you need Python 2 support, you should get
version 4.4, which has the same features as this version.

- The top-level functions require their arguments to be given as keyword
arguments.

Version 5.0 also now has tests for the command-line invocation of ftfy.
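The keyword-only requirement comes from a bare `*` in the function signatures. A minimal sketch of the new calling convention (a stand-in with the same shape as `fix_text`, not the real function):

```python
# Sketch of the ftfy 5.0 calling convention: everything after the bare
# `*` must be passed by keyword. (A stand-in, not the real fixer.)
def fix_text(text, *, fix_entities='auto', normalization='NFC'):
    return text

fix_text('some text', normalization='NFC')   # fine: keyword argument

try:
    fix_text('some text', 'auto')            # option passed positionally
except TypeError as err:
    print('rejected:', err)                  # Python raises TypeError
```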


## Version 4.4.0 (February 17, 2017)

Heuristic changes:
84 changes: 22 additions & 62 deletions docs/index.rst
@@ -6,8 +6,7 @@
ftfy: fixes text for you
========================

`ftfy` fixes Unicode that's broken in various ways. It works in Python 2.7,
Python 3.2, or later.
**ftfy** fixes Unicode that's broken in various ways.

The goal of ftfy is to **take in bad Unicode and output good Unicode**, for use
in your Unicode-aware code. This is different from taking in non-Unicode and
@@ -19,18 +18,15 @@ Of course you're better off if your input is decoded properly and has no
glitches. But you often don't have any control over your input; it's someone
else's mistake, but it's your problem now.

`ftfy` will do everything it can to fix the problem.
ftfy will do everything it can to fix the problem.
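For a concrete sense of the damage involved, here is how mojibake typically arises, using only stdlib codecs (detecting and reversing mistakes like this one, without being told the encodings, is the job of ftfy's `fix_encoding`):

```python
# How mojibake arises (stdlib only): UTF-8 bytes misread as Latin-1.
good = 'déjà vu ✔'
mojibake = good.encode('utf-8').decode('latin-1')

# Because Latin-1 maps every byte to a character, the mistaken decoding
# can be undone by re-encoding and then decoding correctly.
recovered = mojibake.encode('latin-1').decode('utf-8')
assert recovered == good
```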

.. note::

Time is marching on. ftfy 4.x supports Python 2.7 and 3.x, but when
ftfy 5.0 is released, it will probably only support Python 3.
This documentation is for ftfy 5, which runs on Python 3 only, following
the plan to drop Python 2 support that was announced in ftfy 3.3.

If you're running on Python 2, ftfy 4.x will keep working for you. You
don't have to upgrade to 5.0. You can save yourself a headache by adding
`ftfy < 5` to your requirements, making sure you stay on version 4.

See `Future versions of ftfy`_ for why this needs to happen.
If you're running on Python 2, ftfy 4.x will keep working for you. In that
case, you should add `ftfy < 5` to your requirements.


Mojibake
@@ -102,7 +98,7 @@ interacting with the erroneous decoding. The main function of ftfy,
parts of NFKC are implemented as separate, limited fixes.


There are other interesting things that `ftfy` can do that aren't part of
There are other interesting things that ftfy can do that aren't part of
the :func:`ftfy.fix_text` pipeline, such as:

* :func:`ftfy.explain_unicode`: show you what's going on in a string,
@@ -113,10 +109,10 @@ the :func:`ftfy.fix_text` pipeline, such as:
Encodings ftfy can handle
-------------------------

`ftfy` can't fix all possible mix-ups. Its goal is to cover the most common
ftfy can't fix all possible mix-ups. Its goal is to cover the most common
encoding mix-ups while keeping false positives to a very low rate.

`ftfy` can understand text that was decoded as any of these single-byte
ftfy can understand text that was decoded as any of these single-byte
encodings:

- Latin-1 (ISO-8859-1)
@@ -146,7 +142,7 @@ Korean, such as ``shift-jis`` and ``gb18030``. See `issue #34
<https://github.com/LuminosoInsight/python-ftfy/issues/34>`_ for why this is so
hard.

But remember that the input to `ftfy` is Unicode, so it handles actual
But remember that the input to ftfy is Unicode, so it handles actual
CJK *text* just fine. It just can't discover that a CJK *encoding* introduced
mojibake into the text.

@@ -179,7 +175,7 @@ If the only fix you need is to detect and repair decoding errors (mojibake), then
you should use :func:`ftfy.fix_encoding` directly.

.. versionchanged:: 4.0
The default normalization was changed from `'NFKC'` to `'NFC'`. The new options
The default normalization was changed from `'NFKC'` to `'NFC'`. The options
*fix_latin_ligatures* and *fix_character_width* were added to implement some
of the less lossy parts of NFKC normalization on top of NFC.
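The difference the changed default makes can be seen with the stdlib `unicodedata` module: NFKC folds compatibility characters such as ligatures and fullwidth letters, while NFC leaves them alone:

```python
import unicodedata

# NFKC folds compatibility characters; NFC leaves them alone.
s = '\ufb01le \uff57ide'   # 'ﬁle ｗide': fi-ligature + fullwidth w
print(unicodedata.normalize('NFC', s))    # unchanged
print(unicodedata.normalize('NFKC', s))   # 'file wide'
```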

@@ -194,30 +190,20 @@ you should use :func:`ftfy.fix_encoding` directly.
.. autofunction:: ftfy.explain_unicode


Non-Unicode strings
-------------------

When first using ftfy, you might be confused to find that you can't give it a
bytestring (the type of object called `str` in Python 2).

ftfy fixes text. Treating bytestrings as text is exactly the kind of thing that
causes the Unicode problems that ftfy has to fix. So if you don't give it a
Unicode string, ftfy will point you to the `Python Unicode HOWTO`_.

.. _`Python Unicode Howto`: http://docs.python.org/3/howto/unicode.html

Reasonable ways that you might exchange data, such as JSON or XML, already have
perfectly good ways of expressing Unicode strings. Given a Unicode string, ftfy
can apply fixes that are very likely to work without false positives.


A note on encoding detection
----------------------------

If your input is a mess of unmarked bytes, you might want a tool that can just
statistically analyze those bytes and predict what encoding they're in.
:func:`ftfy.fix_text` expects its input to be a Python 3 `str` (a Unicode
string). If you pass in `bytes` instead, ftfy will point you to the `Python
Unicode HOWTO`_.

.. _`Python Unicode HOWTO`: http://docs.python.org/3/howto/unicode.html

`ftfy` is not that tool. The :func:`ftfy.guess_bytes` function it contains will
Now, you may know that your input is a mess of bytes in an unknown encoding,
and you might want a tool that can just statistically analyze those bytes and
predict what encoding they're in.

ftfy is not that tool. The :func:`ftfy.guess_bytes` function it contains will
do this in very limited cases, but to support more encodings from around the
world, something more is needed.
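A simplified sketch of the kind of heuristics :func:`ftfy.guess_bytes` applies, shown with stdlib codecs only (the real function prefers ftfy's `sloppy-windows-1252` codec and checks more byte patterns; `guess_bytes_sketch` is a hypothetical name):

```python
def guess_bytes_sketch(bstring):
    """Simplified version of the heuristics in ftfy.guess_bytes."""
    if bstring.startswith((b'\xfe\xff', b'\xff\xfe')):
        # A UTF-16 byte-order mark.
        return bstring.decode('utf-16'), 'utf-16'
    try:
        return bstring.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass
    if 0x0d in bstring and 0x0a not in bstring:
        # CR without LF suggests classic Mac line endings.
        return bstring.decode('macroman'), 'macroman'
    return bstring.decode('windows-1252', 'replace'), 'windows-1252'

print(guess_bytes_sketch('résumé'.encode('utf-8')))  # ('résumé', 'utf-8')
```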

@@ -249,7 +235,7 @@ Here's the usage documentation for the `ftfy` command::
[--preserve-entities]
[filename]

ftfy (fixes text for you), version 4.0.0
ftfy (fixes text for you), version 5.0

positional arguments:
filename The file whose Unicode is to be fixed. Defaults to -,
@@ -323,29 +309,3 @@ that ftfy's behavior is consistent across versions.
:members:

.. autofunction:: ftfy.build_data.make_char_data_file


Future versions of ftfy
=======================

ftfy has full support for Python 2.7, even including a backport of Unicode 9
character classes to Python 2. But given the sweeping changes to Unicode in
Python, it's getting inconvenient to add new features to ftfy that work the
same on both versions.

ftfy 5.0, when it is released, will probably only support Python 3.

If you want to see examples of why ftfy is particularly difficult to maintain
on two versions of Python (which is more like three versions because of Python
2's "wide" and "narrow" builds), take a look at functions such as
:func:`ftfy.bad_codecs.utf8_variants.mangle_surrogates` and
:func:`ftfy.compatibility._narrow_unichr_workaround`.

This will happen soon, and we'll follow the plan that jQuery used years ago
when it dropped support for IE 6-8. We'll release the last Python 2 version and
the first Python-3-only version with the same feature set. ftfy 5.0 will
reduce the size and complexity of the code greatly, but ftfy 4.x will remain
there for those who need it.

If you're running on Python 2, please make sure that `ftfy < 5` is in your
requirements list, not just `ftfy`.
21 changes: 10 additions & 11 deletions ftfy/__init__.py
@@ -1,26 +1,24 @@
# -*- coding: utf-8 -*-
"""
ftfy: fixes text for you
This is a module for making text less broken. See the `fix_text` function
for more information.
"""

from __future__ import unicode_literals
import unicodedata
import ftfy.bad_codecs
from ftfy import fixes
from ftfy.formatting import display_ljust
from ftfy.compatibility import is_printable

__version__ = '4.4'
__version__ = '5.0'


# See the docstring for ftfy.bad_codecs to see what we're doing here.
ftfy.bad_codecs.ok()


def fix_text(text,
*,
fix_entities='auto',
remove_terminal_escapes=True,
fix_encoding=True,
@@ -195,6 +193,7 @@

def fix_file(input_file,
encoding=None,
*,
fix_entities='auto',
remove_terminal_escapes=True,
fix_encoding=True,
@@ -242,6 +241,7 @@


def fix_text_segment(text,
*,
fix_entities='auto',
remove_terminal_escapes=True,
fix_encoding=True,
@@ -330,7 +330,7 @@ def guess_bytes(bstring):
- "sloppy-windows-1252", the Latin-1-like encoding that is the most common
single-byte encoding
"""
if type(bstring) == type(''):
if isinstance(bstring, str):
raise UnicodeError(
"This string was already decoded as Unicode. You should pass "
"bytes to guess_bytes, not Unicode."
@@ -339,11 +339,9 @@
if bstring.startswith(b'\xfe\xff') or bstring.startswith(b'\xff\xfe'):
return bstring.decode('utf-16'), 'utf-16'

byteset = set(bytes(bstring))
byte_ed, byte_c0, byte_CR, byte_LF = b'\xed\xc0\r\n'

byteset = set(bstring)
try:
if byte_ed in byteset or byte_c0 in byteset:
if 0xed in byteset or 0xc0 in byteset:
# Byte 0xed can be used to encode a range of codepoints that
# are UTF-16 surrogates. UTF-8 does not use UTF-16 surrogates,
# so when we see 0xed, it's very likely we're being asked to
@@ -370,7 +368,8 @@ def guess_bytes(bstring):
except UnicodeDecodeError:
pass

if byte_CR in bstring and byte_LF not in bstring:
if 0x0d in byteset and 0x0a not in byteset:
# Files that contain CR and not LF are likely to be MacRoman.
return bstring.decode('macroman'), 'macroman'
else:
return bstring.decode('sloppy-windows-1252'), 'sloppy-windows-1252'
Expand Down Expand Up @@ -399,7 +398,7 @@ def explain_unicode(text):
U+253B ┻ [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL
"""
for char in text:
if is_printable(char):
if char.isprintable():
display = char
else:
display = char.encode('unicode-escape').decode('ascii')
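The loop above is easy to run as a standalone sketch (simplified: the real :func:`ftfy.explain_unicode` formats its output columns more carefully):

```python
import unicodedata

# Miniature of the explain_unicode loop: one line per character with
# codepoint, a printable display form, category, and name.
def explain_unicode_sketch(text):
    lines = []
    for char in text:
        if char.isprintable():
            display = char
        else:
            display = char.encode('unicode-escape').decode('ascii')
        lines.append('U+%04X %s [%s] %s' % (ord(char), display,
                     unicodedata.category(char),
                     unicodedata.name(char, '<unknown>')))
    return lines

print('\n'.join(explain_unicode_sketch('ノ\u0294')))
```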
2 changes: 0 additions & 2 deletions ftfy/bad_codecs/__init__.py
@@ -1,4 +1,3 @@
# coding: utf-8
r"""
Give Python the ability to decode some common, flawed encodings.
@@ -29,7 +28,6 @@
>>> print(b'\xed\xa0\xbd\xed\xb8\x8d'.decode('utf-8-variants'))
😍
"""
from __future__ import unicode_literals
from encodings import normalize_encoding
import codecs

7 changes: 3 additions & 4 deletions ftfy/bad_codecs/sloppy.py
@@ -1,4 +1,3 @@
# coding: utf-8
r"""
Decodes single-byte encodings, filling their "holes" in the same messy way that
everyone else does.
@@ -69,14 +68,14 @@
U+0081 \x81 [Cc] <unknown>
U+201A ‚ [Ps] SINGLE LOW-9 QUOTATION MARK
"""
from __future__ import unicode_literals
import codecs
from encodings import normalize_encoding
import sys

REPLACEMENT_CHAR = '\ufffd'
PY26 = sys.version_info[:2] == (2, 6)


def make_sloppy_codec(encoding):
"""
Take a codec name, and return a 'sloppy' version of that codec that can
@@ -87,8 +86,8 @@ def make_sloppy_codec(encoding):
`codecs.charmap_decode` and `charmap_encode`. This function, given an
encoding name, *defines* those boilerplate classes.
"""
# Make an array of all 256 possible bytes.
all_bytes = bytearray(range(256))
# Make a bytestring of all 256 possible bytes.
all_bytes = bytes(range(256))

# Get a list of what they would decode to in Latin-1.
sloppy_chars = list(all_bytes.decode('latin-1'))
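The "sloppy" idea in this module can be sketched with stdlib codecs alone: start from Latin-1, which maps every byte, then overlay the characters windows-1252 actually defines, so the undefined bytes fall back to their Latin-1 control characters instead of raising:

```python
# Build a decode table the way make_sloppy_codec does, but using the
# stdlib windows-1252 codec instead of registering a new codec.
all_bytes = bytes(range(256))
sloppy_chars = list(all_bytes.decode('latin-1'))  # every byte maps in Latin-1

for i in range(256):
    try:
        sloppy_chars[i] = bytes([i]).decode('windows-1252')
    except UnicodeDecodeError:
        pass                     # keep the Latin-1 fallback for the holes

decode_table = ''.join(sloppy_chars)
print(decode_table[0x93])        # windows-1252 curly quote
print(repr(decode_table[0x81]))  # a cp1252 hole, left as U+0081
```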
50 changes: 8 additions & 42 deletions ftfy/bad_codecs/utf8_variants.py
@@ -38,12 +38,10 @@
again, using UTF-8 as the codec every time.
"""

from __future__ import unicode_literals
import re
import codecs
from encodings.utf_8 import (IncrementalDecoder as UTF8IncrementalDecoder,
IncrementalEncoder as UTF8IncrementalEncoder)
from ftfy.compatibility import bytes_to_ints, unichr, PYTHON2

NAME = 'utf-8-variants'

@@ -190,11 +188,8 @@ def _buffer_decode_surrogates(sup, input, errors, final):
if final:
# We found 0xed near the end of the stream, and there aren't
# six bytes to decode. Delegate to the superclass method to
# handle it as an error.
if PYTHON2 and len(input) >= 3:
# We can't trust Python 2 to raise an error when it's
# asked to decode a surrogate, so let's force the issue.
input = mangle_surrogates(input)
# handle it as normal UTF-8. It might be a Hangul character
# or an error.
return sup(input, errors, final)
else:
# We found a surrogate, the stream isn't over yet, and we don't
@@ -205,50 +200,21 @@
if CESU8_RE.match(input):
# Given this is a CESU-8 sequence, do some math to pull out
# the intended 20-bit value, and consume six bytes.
bytenums = bytes_to_ints(input[:6])
codepoint = (
((bytenums[1] & 0x0f) << 16) +
((bytenums[2] & 0x3f) << 10) +
((bytenums[4] & 0x0f) << 6) +
(bytenums[5] & 0x3f) +
((input[1] & 0x0f) << 16) +
((input[2] & 0x3f) << 10) +
((input[4] & 0x0f) << 6) +
(input[5] & 0x3f) +
0x10000
)
return unichr(codepoint), 6
return chr(codepoint), 6
else:
# This looked like a CESU-8 sequence, but it wasn't one.
# 0xed indicates the start of a three-byte sequence, so give
# three bytes to the superclass to decode as usual -- except
# for working around the Python 2 discrepancy as before.
if PYTHON2:
input = mangle_surrogates(input)
# three bytes to the superclass to decode as usual.
return sup(input[:3], errors, False)


def mangle_surrogates(bytestring):
"""
When Python 3 sees the UTF-8 encoding of a surrogate codepoint, it treats
it as an error (which it is). In 'replace' mode, it will decode as three
replacement characters. But Python 2 will just output the surrogate
codepoint.
To ensure consistency between Python 2 and Python 3, and protect downstream
applications from malformed strings, we turn surrogate sequences at the
start of the string into the bytes `ff ff ff`, which we're *sure* won't
decode, and which turn into three replacement characters in 'replace' mode.
This function does nothing in Python 3, and it will be deprecated in ftfy
5.0.
"""
if PYTHON2:
if bytestring.startswith(b'\xed') and len(bytestring) >= 3:
decoded = bytestring[:3].decode('utf-8', 'replace')
if '\ud800' <= decoded <= '\udfff':
return b'\xff\xff\xff' + mangle_surrogates(bytestring[3:])
return bytestring
else:
# On Python 3, nothing needs to be done.
return bytestring

# The encoder is identical to UTF-8.
IncrementalEncoder = UTF8IncrementalEncoder

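The surrogate arithmetic in the hunk above can be checked by hand. A standalone version of the same computation (`decode_cesu8_sequence` is a hypothetical name), assuming a well-formed six-byte CESU-8 sequence:

```python
# The CESU-8 arithmetic from _buffer_decode_surrogates: two three-byte
# surrogate encodings become one astral codepoint.
def decode_cesu8_sequence(b):
    codepoint = (
        ((b[1] & 0x0f) << 16) +   # high surrogate, bits 19-16
        ((b[2] & 0x3f) << 10) +   # high surrogate, bits 15-10
        ((b[4] & 0x0f) << 6) +    # low surrogate, bits 9-6
        (b[5] & 0x3f) +           # low surrogate, bits 5-0
        0x10000
    )
    return chr(codepoint)

print(decode_cesu8_sequence(b'\xed\xa0\xbd\xed\xb8\x8d'))  # 😍 (U+1F60D)
```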
