Merge pull request #64 from LuminosoInsight/test-cli
Version 5.0: drop Python 2, add CLI tests
alin-luminoso committed Mar 9, 2017
2 parents 39bde07 + 1f665ac commit f88b40b
Showing 25 changed files with 176 additions and 296 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,16 @@
## Version 5.0 (February 17, 2017)

Breaking changes:

- Dropped support for Python 2. If you need Python 2 support, you should get
version 4.4, which has the same features as this version.

- The top-level functions require their arguments to be given as keyword
arguments.

Version 5.0 also now has tests for the command-line invocation of ftfy.
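The keyword-only requirement comes from a bare `*` in the function signatures. A minimal sketch of the new calling convention (a stand-in with the same shape as `fix_text`, not the real function):

```python
# Sketch of the ftfy 5.0 calling convention: everything after the bare
# `*` must be passed by keyword. (A stand-in, not the real fixer.)
def fix_text(text, *, fix_entities='auto', normalization='NFC'):
    return text

fix_text('some text', normalization='NFC')   # fine: keyword argument

try:
    fix_text('some text', 'auto')            # option passed positionally
except TypeError as err:
    print('rejected:', err)                  # Python raises TypeError
```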


## Version 4.4.0 (February 17, 2017)

Heuristic changes:
84 changes: 22 additions & 62 deletions docs/index.rst
@@ -6,8 +6,7 @@
ftfy: fixes text for you
========================

`ftfy` fixes Unicode that's broken in various ways. It works in Python 2.7,
Python 3.2, or later.
**ftfy** fixes Unicode that's broken in various ways.

The goal of ftfy is to **take in bad Unicode and output good Unicode**, for use
in your Unicode-aware code. This is different from taking in non-Unicode and
@@ -19,18 +18,15 @@ Of course you're better off if your input is decoded properly and has no
glitches. But you often don't have any control over your input; it's someone
else's mistake, but it's your problem now.

`ftfy` will do everything it can to fix the problem.
ftfy will do everything it can to fix the problem.
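For a concrete sense of the damage involved, here is how mojibake typically arises, using only stdlib codecs (detecting and reversing mistakes like this one, without being told the encodings, is the job of ftfy's `fix_encoding`):

```python
# How mojibake arises (stdlib only): UTF-8 bytes misread as Latin-1.
good = 'déjà vu ✔'
mojibake = good.encode('utf-8').decode('latin-1')

# Because Latin-1 maps every byte to a character, the mistaken decoding
# can be undone by re-encoding and then decoding correctly.
recovered = mojibake.encode('latin-1').decode('utf-8')
assert recovered == good
```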

.. note::

Time is marching on. ftfy 4.x supports Python 2.7 and 3.x, but when
ftfy 5.0 is released, it will probably only support Python 3.
This documentation is for ftfy 5, which runs on Python 3 only, following
the plan to drop Python 2 support that was announced in ftfy 3.3.

If you're running on Python 2, ftfy 4.x will keep working for you. You
don't have to upgrade to 5.0. You can save yourself a headache by adding
`ftfy < 5` to your requirements, making sure you stay on version 4.

See `Future versions of ftfy`_ for why this needs to happen.
If you're running on Python 2, ftfy 4.x will keep working for you. In that
case, you should add `ftfy < 5` to your requirements.


Mojibake
@@ -102,7 +98,7 @@ interacting with the erroneous decoding. The main function of ftfy,
parts of NFKC are implemented as separate, limited fixes.


There are other interesting things that `ftfy` can do that aren't part of
There are other interesting things that ftfy can do that aren't part of
the :func:`ftfy.fix_text` pipeline, such as:

* :func:`ftfy.explain_unicode`: show you what's going on in a string,
@@ -113,10 +109,10 @@ the :func:`ftfy.fix_text` pipeline, such as:
Encodings ftfy can handle
-------------------------

`ftfy` can't fix all possible mix-ups. Its goal is to cover the most common
ftfy can't fix all possible mix-ups. Its goal is to cover the most common
encoding mix-ups while keeping false positives to a very low rate.

`ftfy` can understand text that was decoded as any of these single-byte
ftfy can understand text that was decoded as any of these single-byte
encodings:

- Latin-1 (ISO-8859-1)
@@ -146,7 +142,7 @@ Korean, such as ``shift-jis`` and ``gb18030``. See `issue #34
<https://github.com/LuminosoInsight/python-ftfy/issues/34>`_ for why this is so
hard.

But remember that the input to `ftfy` is Unicode, so it handles actual
But remember that the input to ftfy is Unicode, so it handles actual
CJK *text* just fine. It just can't discover that a CJK *encoding* introduced
mojibake into the text.

@@ -179,7 +175,7 @@ If the only fix you need is to detect and repair decoding errors (mojibake), then
you should use :func:`ftfy.fix_encoding` directly.

.. versionchanged:: 4.0
The default normalization was changed from `'NFKC'` to `'NFC'`. The new options
The default normalization was changed from `'NFKC'` to `'NFC'`. The options
*fix_latin_ligatures* and *fix_character_width* were added to implement some
of the less lossy parts of NFKC normalization on top of NFC.
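The difference the changed default makes can be seen with the stdlib `unicodedata` module: NFKC folds compatibility characters such as ligatures and fullwidth letters, while NFC leaves them alone:

```python
import unicodedata

# NFKC folds compatibility characters; NFC leaves them alone.
s = '\ufb01le \uff57ide'   # 'ﬁle ｗide': fi-ligature + fullwidth w
print(unicodedata.normalize('NFC', s))    # unchanged
print(unicodedata.normalize('NFKC', s))   # 'file wide'
```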

@@ -194,30 +190,20 @@ you should use :func:`ftfy.fix_encoding` directly.
.. autofunction:: ftfy.explain_unicode


Non-Unicode strings
-------------------

When first using ftfy, you might be confused to find that you can't give it a
bytestring (the type of object called `str` in Python 2).

ftfy fixes text. Treating bytestrings as text is exactly the kind of thing that
causes the Unicode problems that ftfy has to fix. So if you don't give it a
Unicode string, ftfy will point you to the `Python Unicode HOWTO`_.

.. _`Python Unicode Howto`: http://docs.python.org/3/howto/unicode.html

Reasonable ways that you might exchange data, such as JSON or XML, already have
perfectly good ways of expressing Unicode strings. Given a Unicode string, ftfy
can apply fixes that are very likely to work without false positives.


A note on encoding detection
----------------------------

If your input is a mess of unmarked bytes, you might want a tool that can just
statistically analyze those bytes and predict what encoding they're in.
:func:`ftfy.fix_text` expects its input to be a Python 3 `str` (a Unicode
string). If you pass in `bytes` instead, ftfy will point you to the `Python
Unicode HOWTO`_.

.. _`Python Unicode HOWTO`: http://docs.python.org/3/howto/unicode.html

`ftfy` is not that tool. The :func:`ftfy.guess_bytes` function it contains will
Now, you may know that your input is a mess of bytes in an unknown encoding,
and you might want a tool that can just statistically analyze those bytes and
predict what encoding they're in.

ftfy is not that tool. The :func:`ftfy.guess_bytes` function it contains will
do this in very limited cases, but to support more encodings from around the
world, something more is needed.
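A simplified sketch of the kind of heuristics :func:`ftfy.guess_bytes` applies, shown with stdlib codecs only (the real function prefers ftfy's `sloppy-windows-1252` codec and checks more byte patterns; `guess_bytes_sketch` is a hypothetical name):

```python
def guess_bytes_sketch(bstring):
    """Simplified version of the heuristics in ftfy.guess_bytes."""
    if bstring.startswith((b'\xfe\xff', b'\xff\xfe')):
        # A UTF-16 byte-order mark.
        return bstring.decode('utf-16'), 'utf-16'
    try:
        return bstring.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass
    if 0x0d in bstring and 0x0a not in bstring:
        # CR without LF suggests classic Mac line endings.
        return bstring.decode('macroman'), 'macroman'
    return bstring.decode('windows-1252', 'replace'), 'windows-1252'

print(guess_bytes_sketch('résumé'.encode('utf-8')))  # ('résumé', 'utf-8')
```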

@@ -249,7 +235,7 @@ Here's the usage documentation for the `ftfy` command::
[--preserve-entities]
[filename]

ftfy (fixes text for you), version 4.0.0
ftfy (fixes text for you), version 5.0

positional arguments:
filename The file whose Unicode is to be fixed. Defaults to -,
@@ -323,29 +309,3 @@ that ftfy's behavior is consistent across versions.
:members:

.. autofunction:: ftfy.build_data.make_char_data_file


Future versions of ftfy
=======================

ftfy has full support for Python 2.7, even including a backport of Unicode 9
character classes to Python 2. But given the sweeping changes to Unicode in
Python, it's getting inconvenient to add new features to ftfy that work the
same on both versions.

ftfy 5.0, when it is released, will probably only support Python 3.

If you want to see examples of why ftfy is particularly difficult to maintain
on two versions of Python (which is more like three versions because of Python
2's "wide" and "narrow" builds), take a look at functions such as
:func:`ftfy.bad_codecs.utf8_variants.mangle_surrogates` and
:func:`ftfy.compatibility._narrow_unichr_workaround`.

This will happen soon, and we'll follow the plan that jQuery used years ago
when it dropped support for IE 6-8. We'll release the last Python 2 version and
the first Python-3-only version with the same feature set. ftfy 5.0 will
reduce the size and complexity of the code greatly, but ftfy 4.x will remain
there for those who need it.

If you're running on Python 2, please make sure that `ftfy < 5` is in your
requirements list, not just `ftfy`.
21 changes: 10 additions & 11 deletions ftfy/__init__.py
@@ -1,26 +1,24 @@
# -*- coding: utf-8 -*-
"""
ftfy: fixes text for you
This is a module for making text less broken. See the `fix_text` function
for more information.
"""

from __future__ import unicode_literals
import unicodedata
import ftfy.bad_codecs
from ftfy import fixes
from ftfy.formatting import display_ljust
from ftfy.compatibility import is_printable

__version__ = '4.4'
__version__ = '5.0'


# See the docstring for ftfy.bad_codecs to see what we're doing here.
ftfy.bad_codecs.ok()


def fix_text(text,
*,
fix_entities='auto',
remove_terminal_escapes=True,
fix_encoding=True,
@@ -195,6 +193,7 @@

def fix_file(input_file,
encoding=None,
*,
fix_entities='auto',
remove_terminal_escapes=True,
fix_encoding=True,
@@ -242,6 +241,7 @@


def fix_text_segment(text,
*,
fix_entities='auto',
remove_terminal_escapes=True,
fix_encoding=True,
@@ -330,7 +330,7 @@ def guess_bytes(bstring):
- "sloppy-windows-1252", the Latin-1-like encoding that is the most common
single-byte encoding
"""
if type(bstring) == type(''):
if isinstance(bstring, str):
raise UnicodeError(
"This string was already decoded as Unicode. You should pass "
"bytes to guess_bytes, not Unicode."
@@ -339,11 +339,9 @@
if bstring.startswith(b'\xfe\xff') or bstring.startswith(b'\xff\xfe'):
return bstring.decode('utf-16'), 'utf-16'

byteset = set(bytes(bstring))
byte_ed, byte_c0, byte_CR, byte_LF = b'\xed\xc0\r\n'

byteset = set(bstring)
try:
if byte_ed in byteset or byte_c0 in byteset:
if 0xed in byteset or 0xc0 in byteset:
# Byte 0xed can be used to encode a range of codepoints that
# are UTF-16 surrogates. UTF-8 does not use UTF-16 surrogates,
# so when we see 0xed, it's very likely we're being asked to
@@ -370,7 +368,8 @@ def guess_bytes(bstring):
except UnicodeDecodeError:
pass

if byte_CR in bstring and byte_LF not in bstring:
if 0x0d in byteset and 0x0a not in byteset:
# Files that contain CR and not LF are likely to be MacRoman.
return bstring.decode('macroman'), 'macroman'
else:
return bstring.decode('sloppy-windows-1252'), 'sloppy-windows-1252'
Expand Down Expand Up @@ -399,7 +398,7 @@ def explain_unicode(text):
U+253B ┻ [So] BOX DRAWINGS HEAVY UP AND HORIZONTAL
"""
for char in text:
if is_printable(char):
if char.isprintable():
display = char
else:
display = char.encode('unicode-escape').decode('ascii')
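The loop above is easy to run as a standalone sketch (simplified: the real :func:`ftfy.explain_unicode` formats its output columns more carefully):

```python
import unicodedata

# Miniature of the explain_unicode loop: one line per character with
# codepoint, a printable display form, category, and name.
def explain_unicode_sketch(text):
    lines = []
    for char in text:
        if char.isprintable():
            display = char
        else:
            display = char.encode('unicode-escape').decode('ascii')
        lines.append('U+%04X %s [%s] %s' % (ord(char), display,
                     unicodedata.category(char),
                     unicodedata.name(char, '<unknown>')))
    return lines

print('\n'.join(explain_unicode_sketch('ノ\u0294')))
```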
2 changes: 0 additions & 2 deletions ftfy/bad_codecs/__init__.py
@@ -1,4 +1,3 @@
# coding: utf-8
r"""
Give Python the ability to decode some common, flawed encodings.
@@ -29,7 +28,6 @@
>>> print(b'\xed\xa0\xbd\xed\xb8\x8d'.decode('utf-8-variants'))
😍
"""
from __future__ import unicode_literals
from encodings import normalize_encoding
import codecs

7 changes: 3 additions & 4 deletions ftfy/bad_codecs/sloppy.py
@@ -1,4 +1,3 @@
# coding: utf-8
r"""
Decodes single-byte encodings, filling their "holes" in the same messy way that
everyone else does.
@@ -69,14 +68,14 @@
U+0081 \x81 [Cc] <unknown>
U+201A ‚ [Ps] SINGLE LOW-9 QUOTATION MARK
"""
from __future__ import unicode_literals
import codecs
from encodings import normalize_encoding
import sys

REPLACEMENT_CHAR = '\ufffd'
PY26 = sys.version_info[:2] == (2, 6)


def make_sloppy_codec(encoding):
"""
Take a codec name, and return a 'sloppy' version of that codec that can
@@ -87,8 +86,8 @@ def make_sloppy_codec(encoding):
`codecs.charmap_decode` and `charmap_encode`. This function, given an
encoding name, *defines* those boilerplate classes.
"""
# Make an array of all 256 possible bytes.
all_bytes = bytearray(range(256))
# Make a bytestring of all 256 possible bytes.
all_bytes = bytes(range(256))

# Get a list of what they would decode to in Latin-1.
sloppy_chars = list(all_bytes.decode('latin-1'))
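The "sloppy" idea in this module can be sketched with stdlib codecs alone: start from Latin-1, which maps every byte, then overlay the characters windows-1252 actually defines, so the undefined bytes fall back to their Latin-1 control characters instead of raising:

```python
# Build a decode table the way make_sloppy_codec does, but using the
# stdlib windows-1252 codec instead of registering a new codec.
all_bytes = bytes(range(256))
sloppy_chars = list(all_bytes.decode('latin-1'))  # every byte maps in Latin-1

for i in range(256):
    try:
        sloppy_chars[i] = bytes([i]).decode('windows-1252')
    except UnicodeDecodeError:
        pass                     # keep the Latin-1 fallback for the holes

decode_table = ''.join(sloppy_chars)
print(decode_table[0x93])        # windows-1252 curly quote
print(repr(decode_table[0x81]))  # a cp1252 hole, left as U+0081
```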
50 changes: 8 additions & 42 deletions ftfy/bad_codecs/utf8_variants.py
@@ -38,12 +38,10 @@
again, using UTF-8 as the codec every time.
"""

from __future__ import unicode_literals
import re
import codecs
from encodings.utf_8 import (IncrementalDecoder as UTF8IncrementalDecoder,
IncrementalEncoder as UTF8IncrementalEncoder)
from ftfy.compatibility import bytes_to_ints, unichr, PYTHON2

NAME = 'utf-8-variants'

@@ -190,11 +188,8 @@ def _buffer_decode_surrogates(sup, input, errors, final):
if final:
# We found 0xed near the end of the stream, and there aren't
# six bytes to decode. Delegate to the superclass method to
# handle it as an error.
if PYTHON2 and len(input) >= 3:
# We can't trust Python 2 to raise an error when it's
# asked to decode a surrogate, so let's force the issue.
input = mangle_surrogates(input)
# handle it as normal UTF-8. It might be a Hangul character
# or an error.
return sup(input, errors, final)
else:
# We found a surrogate, the stream isn't over yet, and we don't
@@ -205,50 +200,21 @@
if CESU8_RE.match(input):
# Given this is a CESU-8 sequence, do some math to pull out
# the intended 20-bit value, and consume six bytes.
bytenums = bytes_to_ints(input[:6])
codepoint = (
((bytenums[1] & 0x0f) << 16) +
((bytenums[2] & 0x3f) << 10) +
((bytenums[4] & 0x0f) << 6) +
(bytenums[5] & 0x3f) +
((input[1] & 0x0f) << 16) +
((input[2] & 0x3f) << 10) +
((input[4] & 0x0f) << 6) +
(input[5] & 0x3f) +
0x10000
)
return unichr(codepoint), 6
return chr(codepoint), 6
else:
# This looked like a CESU-8 sequence, but it wasn't one.
# 0xed indicates the start of a three-byte sequence, so give
# three bytes to the superclass to decode as usual -- except
# for working around the Python 2 discrepancy as before.
if PYTHON2:
input = mangle_surrogates(input)
# three bytes to the superclass to decode as usual.
return sup(input[:3], errors, False)


def mangle_surrogates(bytestring):
"""
When Python 3 sees the UTF-8 encoding of a surrogate codepoint, it treats
it as an error (which it is). In 'replace' mode, it will decode as three
replacement characters. But Python 2 will just output the surrogate
codepoint.
To ensure consistency between Python 2 and Python 3, and protect downstream
applications from malformed strings, we turn surrogate sequences at the
start of the string into the bytes `ff ff ff`, which we're *sure* won't
decode, and which turn into three replacement characters in 'replace' mode.
This function does nothing in Python 3, and it will be deprecated in ftfy
5.0.
"""
if PYTHON2:
if bytestring.startswith(b'\xed') and len(bytestring) >= 3:
decoded = bytestring[:3].decode('utf-8', 'replace')
if '\ud800' <= decoded <= '\udfff':
return b'\xff\xff\xff' + mangle_surrogates(bytestring[3:])
return bytestring
else:
# On Python 3, nothing needs to be done.
return bytestring

# The encoder is identical to UTF-8.
IncrementalEncoder = UTF8IncrementalEncoder

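The surrogate arithmetic in the hunk above can be checked by hand. A standalone version of the same computation (`decode_cesu8_sequence` is a hypothetical name), assuming a well-formed six-byte CESU-8 sequence:

```python
# The CESU-8 arithmetic from _buffer_decode_surrogates: two three-byte
# surrogate encodings become one astral codepoint.
def decode_cesu8_sequence(b):
    codepoint = (
        ((b[1] & 0x0f) << 16) +   # high surrogate, bits 19-16
        ((b[2] & 0x3f) << 10) +   # high surrogate, bits 15-10
        ((b[4] & 0x0f) << 6) +    # low surrogate, bits 9-6
        (b[5] & 0x3f) +           # low surrogate, bits 5-0
        0x10000
    )
    return chr(codepoint)

print(decode_cesu8_sequence(b'\xed\xa0\xbd\xed\xb8\x8d'))  # 😍 (U+1F60D)
```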
