# JSON load: str instead of unicode

Python's `json` module is the first thing to use when reading/writing JSON in Python.

In [1]:
import json
json.dumps((0, 1, True, (4.2, 5.7), {'name': 'Jason'}))

'[0, 1, true, [4.2, 5.7], {"name": "Jason"}]'

In [2]:
json.dumps({'No': 9, 'Op': 95, 'B': 178})

'{"B": 178, "Op": 95, "No": 9}'

In [3]:
json.loads(json.dumps({'No': 9, 'Op': 95, 'B': 178}))

{u'B': 178, u'No': 9, u'Op': 95}

As you can see, `json.loads` gives you Python `unicode` objects for JSON `string` objects.

In Python 3, this wouldn't be an issue, but if you are still stuck with Python 2, you know `str` is different from `unicode`.  What if you know you are working with only ASCII strings, and you would like to continue using `str` everywhere in your code?

At first glance, `str` and `unicode` compare equal when they contain only ASCII characters, so the difference is easy to miss.

In [4]:
dict1 = {'No': 9, 'Op': 95, 'B': 178}
dict2 = json.loads(json.dumps(dict1))

In [5]:
dict1

{'B': 178, 'No': 9, 'Op': 95}

In [6]:
dict2

{u'B': 178, u'No': 9, u'Op': 95}

In [7]:
dict1 == dict2

True
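The equality check hides the type difference. Here is a minimal sketch that makes it visible. `text_type` is a name introduced for this illustration so that the snippet also runs on Python 3, where it is simply `str`.

```python
import json

# Assumption for portability: unicode on Python 2, str on Python 3.
text_type = type(u'')

decoded = json.loads(json.dumps({'No': 9, 'Op': 95, 'B': 178}))

# Collect the distinct types of the decoded keys.
# On Python 2 this set holds only `unicode`, even though the input keys were `str`.
key_types = set(type(key) for key in decoded)

# Every key came back as a text (unicode) object.
all_text = all(isinstance(key, text_type) for key in decoded)
```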

Still, the types differ, and sometimes you want real `str` objects.  Stack Overflow to the rescue, as always: [How to get string objects instead of unicode from JSON](https://stackoverflow.com/questions/956867/how-to-get-string-objects-instead-of-unicode-from-json).  It can't get more to the point than that.

## My solution

Adapted from the excellent solutions by Mirec Miskuf and Mark Amery in [How to get string objects instead of unicode from JSON](https://stackoverflow.com/questions/956867/how-to-get-string-objects-instead-of-unicode-from-json).

In [8]:
def json_loads_ensure_str(string):
    """Decodes JSON string to Python object ensuring 'str' not 'unicode'.

    The standard json.loads converts JSON string objects to Python unicode
    objects.  This simple wrapper around json.loads ensures that the result
    contains Python str objects instead of Python unicode objects.

    Args:
        string: the string containing JSON text.

    Returns:
        Python object decoded from the JSON string.
    """
    return unicode_to_str(json.loads(string, object_hook=unicode_to_str_object_hook),
                          process_dict=False, process_child_dict=False)


def json_load_ensure_str(file_object):
    """Decodes JSON file object to Python object ensuring 'str' not 'unicode'.

    The standard json.load converts JSON string objects to Python unicode
    objects.  This simple wrapper around json.load ensures that the result
    contains Python str objects instead of Python unicode objects.

    Args:
        file_object: the file-like object containing JSON text.

    Returns:
        Python object decoded from the JSON file object.
    """
    return unicode_to_str(json.load(file_object, object_hook=unicode_to_str_object_hook),
                          process_dict=False, process_child_dict=False)


def unicode_to_str_object_hook(dictionary):
    """Converts unicode to str in a dictionary to serve as JSON object_hook."""
    return unicode_to_str(dictionary, process_dict=True, process_child_dict=False)


def unicode_to_str(obj, process_dict=True, process_child_dict=True):
    """Converts unicode to str recursively given an object.

    Args:
        obj: the object to convert.
        process_dict: whether to process dictionaries.
        process_child_dict: whether to process dictionaries other than the
            root-level object itself.

    How it behaves on every combination of the two Boolean flags:

        False, _    : no dictionaries are processed.
        True , True : all dictionaries are processed.
        True , False: the root-level dictionary is processed, no lower-level
                      dictionaries are processed.

    Returns:
        New copy of the object with all unicode objects replaced by str objects.
    """
    process_child_dict = process_dict and process_child_dict

    if isinstance(obj, unicode):
        return obj.encode('utf-8')
    if isinstance(obj, list):
        return [unicode_to_str(item, process_child_dict, process_child_dict) for item in obj]
    if isinstance(obj, tuple):
        # Wrap in tuple(): a bare parenthesized comprehension would return a generator.
        return tuple(unicode_to_str(item, process_child_dict, process_child_dict) for item in obj)
    if process_dict and isinstance(obj, dict):
        return {unicode_to_str(key, process_child_dict, process_child_dict)
                : unicode_to_str(value, process_child_dict, process_child_dict)
                for key, value in obj.iteritems()}
    return obj

## How I arrived at my solution

Here's how I took Mark and Mirec's `byteify` and made my `unicode_to_str`.

In [9]:
example = [0, 1, 'two', True, [4.2, 5.7, 'six'], {'name': 'Jason'}]

Mark Amery's `byteify`.

In [10]:
def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input


def json_loads_byteify(string):
    return byteify(json.loads(string))


json_loads_byteify(json.dumps(example))

[0, 1, 'two', True, [4.2, 5.7, 'six'], {'name': 'Jason'}]

Mirec Miskuf's `byteify`, an improvement over Mark Amery's.

In [11]:
def byteify(data, ignore_dicts=False):
    if isinstance(data, unicode):
        return data.encode('utf-8')
    if isinstance(data, list):
        return [byteify(item, ignore_dicts=True) for item in data]
    if not ignore_dicts and isinstance(data, dict):
        return {byteify(key, ignore_dicts=True): byteify(value, ignore_dicts=True)
                for key, value in data.iteritems()}
    return data


def json_loads_byteify(string):
    return byteify(json.loads(string, object_hook=byteify),
                   ignore_dicts=True)


json_loads_byteify(json.dumps(example))

[0, 1, 'two', True, [4.2, 5.7, 'six'], {'name': 'Jason'}]

That `ignore_dicts` trick is a great bit of logic.  I wanted to understand it by transforming it.
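The trick leans on the order in which `json.loads` calls `object_hook`: innermost objects first, each one after its children have been decoded. A small sketch to see that order follows. `recording_hook` and `calls` are names made up for this illustration.

```python
import json

calls = []


def recording_hook(dictionary):
    # Record the (sorted) keys of each decoded JSON object, in the order
    # json.loads hands the objects to the hook.
    calls.append(sorted(dictionary))
    return dictionary


json.loads('{"outer": {"inner": {"leaf": 1}}, "sibling": {"x": 2}}',
           object_hook=recording_hook)

# calls is now [['leaf'], ['inner'], ['x'], ['outer', 'sibling']]
```

Because every nested dict has already passed through the hook by the time its parent arrives, the hook only needs to convert one level of dict and can skip any dicts it meets further down.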

First modification: break it into dict and non-dict parts.

In [12]:
def byteify_nondict(value):
    if isinstance(value, unicode):
        return value.encode('utf-8')
    if isinstance(value, list):
        return [byteify_nondict(item) for item in value]
    return value


def byteify_dict(value):
    if isinstance(value, dict):
        return {byteify_nondict(key): byteify_nondict(value)
                for key, value in value.iteritems()}
    return value


def json_loads_byteify(string):
    return byteify_nondict(json.loads(string, object_hook=byteify_dict))


json_loads_byteify(json.dumps(example))

[0, 1, 'two', True, [4.2, 5.7, 'six'], {'name': 'Jason'}]

The two functions are clearer, but why keep such a specific pair of functions just for loading JSON?  They have little general use.  Mark Amery's function had the nicety of also working as a standalone recursive `unicode`-to-`str` converter.

To get the best of both, it dawned on me that the real distinction is between the root-level dict and lower-level dicts.  We need two Boolean flags instead of Mirec Miskuf's one.

In [13]:
def unicode_to_str(value, process_dict=True, process_child_dict=True):
    if isinstance(value, unicode):
        return value.encode('utf-8')
    if isinstance(value, list):
        return [unicode_to_str(item, process_child_dict, process_child_dict) for item in value]
    if isinstance(value, tuple):
        return tuple(unicode_to_str(item, process_child_dict, process_child_dict) for item in value)
    if process_dict and isinstance(value, dict):
        return {unicode_to_str(key, process_child_dict, process_child_dict)
                : unicode_to_str(value, process_child_dict, process_child_dict)
                for key, value in value.iteritems()}
    return value


def unicode_to_str_object_hook(dictionary):
    return unicode_to_str(dictionary, process_dict=True, process_child_dict=False)


def json_loads_unicode_to_str(string):
    return unicode_to_str(json.loads(string, object_hook=unicode_to_str_object_hook),
                          process_dict=False, process_child_dict=False)


json_loads_unicode_to_str(json.dumps(example))

[0, 1, 'two', True, [4.2, 5.7, 'six'], {'name': 'Jason'}]

The two-flags function behaves like so:

1. True, True: all dicts are processed.
   Identical to Mark Amery's function.
2. False, False: no dicts are processed.
   Identical to Mirec Miskuf's function with `ignore_dicts=True`.
3. True, False: the root-level dict is processed, but no lower-level dicts are.
   Identical to Mirec Miskuf's function with `ignore_dicts=False`.
4. False, True: if the root-level object is a dict, it is not processed, and so neither are its lower-level dicts.
   Otherwise, if the root-level object is not a dict, all lower-level dicts are processed.
   This one is a funky mode, best banned from use.

It later occurred to me how to avoid case 4.  We say `process_dict` applies to all dictionaries, and thus it is a superset of `process_child_dict`, which applies only to non-root-level dictionaries.  So if the former is false, the latter is automatically false, i.e. its value has no effect.  We can enforce that at the start of the function.  Here's the final solution, same as the one in the "My solution" section.

In [14]:
def unicode_to_str(obj, process_dict=True, process_child_dict=True):
    process_child_dict = process_dict and process_child_dict
    if isinstance(obj, unicode):
        return obj.encode('utf-8')
    if isinstance(obj, list):
        return [unicode_to_str(item, process_child_dict, process_child_dict) for item in obj]
    if isinstance(obj, tuple):
        return tuple(unicode_to_str(item, process_child_dict, process_child_dict) for item in obj)
    if process_dict and isinstance(obj, dict):
        return {unicode_to_str(key, process_child_dict, process_child_dict)
                : unicode_to_str(value, process_child_dict, process_child_dict)
                for key, value in obj.iteritems()}
    return obj


def unicode_to_str_object_hook(dictionary):
    return unicode_to_str(dictionary, process_dict=True, process_child_dict=False)


def json_loads_ensure_str(string):
    return unicode_to_str(json.loads(string, object_hook=unicode_to_str_object_hook),
                          process_dict=False, process_child_dict=False)


json_loads_ensure_str(json.dumps(example))

[0, 1, 'two', True, [4.2, 5.7, 'six'], {'name': 'Jason'}]
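To see the flag combinations side by side, here is a hedged, self-contained sketch run on a small nested value. It is an adaptation rather than the cell above verbatim: `text_type` and `.items()` are substitutions made here so that it also runs on Python 3 (where the converted values show up as `bytes`), and `nested` is made-up sample data.

```python
# Assumption for portability: unicode on Python 2, str on Python 3.
text_type = type(u'')


def unicode_to_str(obj, process_dict=True, process_child_dict=True):
    # A false process_dict forces process_child_dict to false as well.
    process_child_dict = process_dict and process_child_dict
    if isinstance(obj, text_type):
        return obj.encode('utf-8')
    if isinstance(obj, list):
        return [unicode_to_str(item, process_child_dict, process_child_dict)
                for item in obj]
    if isinstance(obj, tuple):
        return tuple(unicode_to_str(item, process_child_dict, process_child_dict)
                     for item in obj)
    if process_dict and isinstance(obj, dict):
        return {unicode_to_str(key, process_child_dict, process_child_dict):
                unicode_to_str(value, process_child_dict, process_child_dict)
                for key, value in obj.items()}
    return obj


# Made-up sample: a root dict holding a child dict and a list with a dict in it.
nested = {u'a': {u'b': u'c'}, u'd': [u'e', {u'f': u'g'}]}

all_dicts = unicode_to_str(nested, True, True)     # every dict converted
root_only = unicode_to_str(nested, True, False)    # root dict only
no_dicts = unicode_to_str(nested, False, False)    # root dict returned untouched
normalized = unicode_to_str(nested, False, True)   # treated like False, False
```

With `True, False`, the root dict's keys and the plain string inside the list are converted, while the two lower-level dicts keep their `unicode` contents. With `False, True`, the normalization on the first line of the function makes the call behave exactly like `False, False`.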