# Deep dive on Dataclasses
_Guillaume Fidanza_

This notebook reviews one of the new features of python 3.7: **dataclasses**.

Keywords: _dataclasses, typing, python 3.7, side-effects, code inspection, inheritance, immutability_

---

This notebook was successfully run with the configuration below.

Set up a python 3.7 kernel:

```bash
# Install python 3.7 (if not already installed)
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.7-dev

# Create a directory for the project and change to it
mkdir /path/to/dataclass-notebook-dir && cd /path/to/dataclass-notebook-dir

# Create a virtual environment using Python 3.7 (required for dataclasses)
virtualenv -p $(which python3.7) py37

# Add a python 3.7 kernel to Jupyter
source py37/bin/activate
pip install ipykernel
python -m ipykernel install --user --name=py37
```

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#What-is-a-dataclass-?" data-toc-modified-id="What-is-a-dataclass-?-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>What is a dataclass ?</a></span><ul class="toc-item"><li><span><a href="#Dataclass-options" data-toc-modified-id="Dataclass-options-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Dataclass options</a></span><ul class="toc-item"><li><span><a href="#Dataclass-can-emulate-immutability" data-toc-modified-id="Dataclass-can-emulate-immutability-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Dataclass can <em>emulate</em> immutability</a></span></li><li><span><a href="#How-to-force-creation-of-__hash__-?-(advanced-level)" data-toc-modified-id="How-to-force-creation-of-__hash__-?-(advanced-level)-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>How to force creation of __hash__ ? (advanced level)</a></span></li></ul></li><li><span><a href="#What-dataclasses-also-give-for-free" data-toc-modified-id="What-dataclasses-also-give-for-free-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>What dataclasses also give for free</a></span><ul class="toc-item"><li><span><a href="#Some-helpers" data-toc-modified-id="Some-helpers-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Some helpers</a></span></li><li><span><a href="#Some-methods" data-toc-modified-id="Some-methods-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Some methods</a></span></li><li><span><a href="#How-to-inspect-the-generated-methods-?-(advanced-level)" data-toc-modified-id="How-to-inspect-the-generated-methods-?-(advanced-level)-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>How to inspect the generated methods ? (advanced level)</a></span></li></ul></li></ul></li><li><span><a href="#How-to-customize-__init__-?" 
data-toc-modified-id="How-to-customize-__init__-?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>How to customize __init__ ?</a></span></li><li><span><a href="#Remarks-about-inheritance" data-toc-modified-id="Remarks-about-inheritance-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Remarks about inheritance</a></span><ul class="toc-item"><li><span><a href="#Only-between-dataclasses" data-toc-modified-id="Only-between-dataclasses-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Only between dataclasses</a></span></li><li><span><a href="#Mixed-dataclasses-and-non-dataclasses" data-toc-modified-id="Mixed-dataclasses-and-non-dataclasses-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Mixed dataclasses and non-dataclasses</a></span></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

**An important reminder about side-effects in Python**

In [1]:
class PulseLesson:
    """Class to help Pulse lesson organization"""
    def __init__(self, date, members=[]):
        self.date = date                # Day of the lesson
        self.members = members          # List of the members attending the lesson
        self.members.append('Rudolphe') # Rudolphe always comes, he is the coach !
    
    def add_members(self, names):
        """Add members to the Pulse Lesson"""
        self.members.extend(names)
        return self.members

In [2]:
lesson_tuesday = PulseLesson(date='tuesday')
lesson_tuesday.add_members(['Alex', 'Isabelle'])

['Rudolphe', 'Alex', 'Isabelle']

In [3]:
lesson_tuesday.add_members(['Antoine'])

['Rudolphe', 'Alex', 'Isabelle', 'Antoine']

In [4]:
lesson_friday = PulseLesson(date='friday')
lesson_friday.add_members(['Seb'])

['Rudolphe', 'Alex', 'Isabelle', 'Antoine', 'Rudolphe', 'Seb']

Since we created a new object (`lesson_friday`), we expected a fresh `members` list containing only `'Rudolphe'` and `'Seb'` (see the default `members=[]` in the signature of `__init__`). Instead, we got the list already filled by the previous instance (`lesson_tuesday`).
This is an annoying **side-effect** of `PulseLesson`'s initialization.

Indeed, the default value of `members` is a mutable object created once, at _import time_: when the python interpreter reads the class definition, the defaults of its methods are _evaluated_, whereas the bodies of its methods are only _compiled_ and then run at each call. Every call that relies on the default therefore shares the same list.
To prevent such side-effects, we can rewrite `__init__` as below (moving the creation of the mutable from the method's defaults to its body):


```python
    def __init__(self, date, members=None):
        self.date = date                # Day of the lesson
        self.members = members or []    # List of the members attending the lesson
        self.members.append('Rudolphe') # Rudolphe always comes, he is the coach !
```
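
To make the difference concrete, here is a minimal, self-contained sketch. The class names `LeakyLesson` and `SafeLesson` are hypothetical stand-ins for the two versions of `PulseLesson`; the shared list can be observed through the standard `__defaults__` attribute of the function.

```python
class LeakyLesson:
    """Stand-in for the first PulseLesson: mutable default in the signature."""
    def __init__(self, date, members=[]):      # the list is created once, at definition time
        self.date = date
        self.members = members
        self.members.append('Rudolphe')


class SafeLesson:
    """Stand-in for the fixed PulseLesson: the list is created inside the body."""
    def __init__(self, date, members=None):
        self.date = date
        self.members = members or []           # a new list at each call
        self.members.append('Rudolphe')


LeakyLesson('tuesday')
LeakyLesson('friday')
# The default object lives on the function itself and keeps growing
leaky_default = LeakyLesson.__init__.__defaults__[0]
print(leaky_default)                                 # ['Rudolphe', 'Rudolphe']

safe_tuesday = SafeLesson('tuesday')
safe_friday = SafeLesson('friday')
print(safe_friday.members)                           # ['Rudolphe']
print(safe_tuesday.members is safe_friday.members)   # False
```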

We'll see that `dataclasses` help prevent such side-effects, which is one of their great features.




## What is a dataclass ?

> Dataclasses is a **new standard module** which provides a **decorator** to automate the creation of **data-structured classes**.

- New feature of `python 3.7`, backported to `python 3.6` [but the **backport is incompatible** with python 3.7](https://github.com/ericvsmith/dataclasses#compatibility)
- A dataclass is a standard python class
- It behaves like a namedtuple (similar to a _mutable namedtuple with defaults_)
- It generates a lot of boilerplate code for us

###### Before Dataclasses

In [5]:
class User:
    def __init__(self, id_, first_name='Barbara', last_name='Streisand', cities=None):
        self.id_        = id_
        self.first_name = first_name
        self.last_name  = last_name
        self.cities     = cities or []

In [6]:
User(69)

<__main__.User at 0x7f09d8593908>

###### With Dataclasses & Typing

The `typing` module (python 3.5+) lets us define abstract types, which are better for description.
They are therefore very convenient to **describe the data fields**.

Type annotations are not enforced at runtime: if they are not respected, no exception is raised (because of python's flexibility), but we can be stricter by running a static checker such as [mypy](https://github.com/python/mypy).
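
As a quick illustration of that last point, the sketch below uses a hypothetical `Point` dataclass (not part of this notebook) and passes a value that contradicts the annotation. The annotations are stored on the class, but nothing checks them when the instance is built.

```python
from dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int = 0


# The annotation says int, but no exception is raised at runtime
p = Point(x='not an int')
print(p)                       # Point(x='not an int', y=0)

# The annotations are only recorded as metadata on the class
print(Point.__annotations__)
```

A static checker such as mypy is expected to report the call above, since it reads the same annotations without running the code.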

In [7]:
from dataclasses import dataclass, field
from typing import Text, List

In [8]:
@dataclass
class User:
    id_        : int
    first_name : Text = 'Barbara'
    last_name  : Text = field(default='Streisand')
    cities     : List[Text] = field(default_factory=list)

In [9]:
User(42) # repr given for free

User(id_=42, first_name='Barbara', last_name='Streisand', cities=[])

As you can see:
- the class definition is **less verbose**
- using `field(default_factory=...)` **prevents side-effects** (mutable defaults)
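
The second point can be checked with a small sketch. The classes `BadLesson` and `Lesson` are hypothetical names used only for this illustration: a `list` default is rejected when the class is decorated, while `default_factory` builds a new list for each instance.

```python
from dataclasses import dataclass, field
from typing import List

bad_default_error = None
try:
    @dataclass
    class BadLesson:
        members: List[str] = []        # mutable default: rejected by @dataclass
except ValueError as error:
    bad_default_error = str(error)
    print(bad_default_error)


@dataclass
class Lesson:
    date: str
    members: List[str] = field(default_factory=list)   # list() is called per instance


tuesday = Lesson('tuesday')
friday = Lesson('friday')
tuesday.members.append('Alex')
print(tuesday.members)   # ['Alex']
print(friday.members)    # []
```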

###### How to define a field (dataclass attribute) ?

The `dataclasses` module gives us a helper (`field`) to define the attributes with several parameters.

`field`'s signature (to see parameters and their defaults, [detailed list](https://www.python.org/dev/peps/pep-0557/#id8)):

    def field(*, default=MISSING, default_factory=MISSING, repr=True, 
              hash=None, init=True, compare=True, metadata=None)

Some remarks:
- It is optional to use `field` to define a default value
  ```python
     @dataclass
     class NiceClass:
        age   : int                      # no default value
        width : int = 0                  # default value
        height: int = field(default=0)   # default value with field (completely equivalent to width)
  ```
- Attributes with defaults **must be defined** after the ones without (as usual for function arguments):
  ```python
     @dataclass
     class NiceClass:
        width : int = 0                  # default value
        age   : int                      # no default value
        height: int = field(default=0)   # default value
     # Will raise TypeError
  ```
- The `default_factory` parameter must be a **zero-argument callable** that is called each time a default value is required for the field. Setting both `default` and `default_factory` raises a `ValueError`.
- The `metadata` parameter is a mapping (`None` by default) that can hold arbitrary data or a description about the field. It is exposed on the field as a read-only mapping (empty if not set).
    - Example: if you want to associate a list of mirrors to download a dataset 

```python
from dataclasses import dataclass, field
from typing import ByteString

@dataclass
class NiceClass:
    dataset : ByteString = field(default=b'', metadata={'mirrors': ['http://...', 'http://...']})
```
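
The remarks above can be verified with the `fields` helper of the `dataclasses` module. This is only a sketch: the `Dataset` class and the mirror URLs are made up for the example, and the `'mirrors'` key has no special meaning for `dataclasses`.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Dataset:
    name: str
    content: bytes = field(
        default=b'',
        repr=False,                                   # hide the raw bytes in __repr__
        metadata={'mirrors': ['http://mirror-a.example', 'http://mirror-b.example']},
    )


fields_by_name = {f.name: f for f in fields(Dataset)}
mirrors = fields_by_name['content'].metadata['mirrors']
print(mirrors)
print(len(fields_by_name['name'].metadata))           # 0 (empty mapping when not set)

# The mapping exposed on the field is read-only
metadata_is_read_only = False
try:
    fields_by_name['content'].metadata['mirrors'] = []
except TypeError:
    metadata_is_read_only = True
print(metadata_is_read_only)                          # True

# default and default_factory are mutually exclusive
both_defaults_rejected = False
try:
    field(default=0, default_factory=int)
except ValueError:
    both_defaults_rejected = True
print(both_defaults_rejected)                         # True

print(Dataset('iris'))                                # Dataset(name='iris')
```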

[^ Go to the top!](#Table-of-Contents)

### Dataclass options

The `dataclass` decorator offers some options:
- to control which methods are automatically generated (see [section](#Some-methods))
- to control behaviour:
    - How to make the class immutable ? (`frozen` option)
    - How to force creation of `__hash__` ? (`unsafe_hash` option)

#### Dataclass can _emulate_ immutability

Let's use the `frozen` option to emulate immutability of a dataclass.
What's great about this feature is that we can then use instances of our dataclasses in sets or as dictionary keys!

In [10]:
@dataclass(frozen=True)
class FrozenActor:
    id_        : int
    first_name : Text = 'Barbara'
    last_name  : Text = field(default='Streisand')

In [11]:
frozen_roger = FrozenActor(666, 'Roger', 'Hanin')
frozen_roger

FrozenActor(id_=666, first_name='Roger', last_name='Hanin')

In [12]:
# We can use the instance as a unique key (because it was made hashable)
actor_films = {frozen_roger: 67}
actor_films

{FrozenActor(id_=666, first_name='Roger', last_name='Hanin'): 67}

Now, let's see what happens when a field holds a _mutable value_:

In [13]:
@dataclass(frozen=True)
class FrozenActorWithMutable:
    id_        : int
    first_name : Text = 'Barbara'
    last_name  : Text = field(default='Streisand')
    cities     : List[Text] = field(default_factory=list)

In [14]:
frozen_barbara = FrozenActorWithMutable(314, 'Barbara', 'Streisand', cities=['New York'])
frozen_barbara

FrozenActorWithMutable(id_=314, first_name='Barbara', last_name='Streisand', cities=['New York'])

In [15]:
# Let's try to use the instance as a unique key again
actor_films = {frozen_barbara: 26}
actor_films

TypeError: unhashable type: 'list'

Great news! Python can't be fooled so easily ;)

But be careful: attributes are frozen, but **not recursively**:

In [16]:
# Let's try to change a direct attribute
frozen_barbara.first_name = 'Mauricette'

FrozenInstanceError: cannot assign to field 'first_name'

In [17]:
# but !
frozen_barbara.cities.append('Lyon')
frozen_barbara

FrozenActorWithMutable(id_=314, first_name='Barbara', last_name='Streisand', cities=['New York', 'Lyon'])
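
One possible way around both issues (the unhashable `list` and the non-recursive freeze) is to store the cities in a `tuple`. The sketch below assumes that a tuple is acceptable for this field; `FrozenActorWithTuple` is a hypothetical variant of the class above, and `replace` is the standard helper of the `dataclasses` module that builds a modified copy.

```python
from dataclasses import dataclass, replace, FrozenInstanceError
from typing import Tuple


@dataclass(frozen=True)
class FrozenActorWithTuple:
    id_: int
    first_name: str = 'Barbara'
    last_name: str = 'Streisand'
    cities: Tuple[str, ...] = ()      # immutable default, no default_factory needed


frozen_barbara = FrozenActorWithTuple(314, cities=('New York',))

# Hashable this time: an equal instance finds the same dictionary entry
actor_films = {frozen_barbara: 26}
print(actor_films[FrozenActorWithTuple(314, cities=('New York',))])   # 26

# Direct assignment is still refused
assignment_refused = False
try:
    frozen_barbara.cities = ('Lyon',)
except FrozenInstanceError:
    assignment_refused = True

# "Changing" a frozen instance means building a new one
moved_barbara = replace(frozen_barbara, cities=frozen_barbara.cities + ('Lyon',))
print(moved_barbara.cities)     # ('New York', 'Lyon')
print(frozen_barbara.cities)    # ('New York',)
```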

[^ Go to the top!](#Table-of-Contents)

#### How to force creation of \_\_hash\_\_ ? (advanced level)

The default behaviour (`unsafe_hash=False`) lets python make the dataclass hashable only when it is logical to do so (see the table below, taken from the CPython source).

       +------------------- unsafe_hash= parameter
       |       +----------- eq= parameter
       |       |       +--- frozen= parameter
       |       |       |
       v       v       v    |        |        |
                            |   no   |  yes   |  <--- class has explicitly defined __hash__
    +=======+=======+=======+========+========+
    | False | False | False |        |        | No __eq__, use the base class __hash__
    +-------+-------+-------+--------+--------+
    | False | False | True  |        |        | No __eq__, use the base class __hash__
    +-------+-------+-------+--------+--------+
    | False | True  | False | None   |        | <-- the default, not hashable
    +-------+-------+-------+--------+--------+
    | False | True  | True  | add    |        | Frozen, so hashable, allows override
    +-------+-------+-------+--------+--------+
    | True  | False | False | add    | raise  | Has no __eq__, but hashable
    +-------+-------+-------+--------+--------+
    | True  | False | True  | add    | raise  | Has no __eq__, but hashable
    +-------+-------+-------+--------+--------+
    | True  | True  | False | add    | raise  | Not frozen, but hashable
    +-------+-------+-------+--------+--------+
    | True  | True  | True  | add    | raise  | Frozen, so hashable
    +=======+=======+=======+========+========+
    For boxes that are blank, __hash__ is untouched and therefore
    inherited from the base class.  If the base is object, then
    id-based hashing is used.

    Note that a class may already have __hash__=None if it specified an
    __eq__ method in the class body (not one that was created by
    @dataclass).

[source](https://github.com/python/cpython/blob/3.7/Lib/dataclasses.py#L118)
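
A few rows of this table can be observed directly. The sketch below uses hypothetical one-field classes; it also shows why the option is called _unsafe_: the hash of a mutable instance changes when a field changes, which breaks lookups in sets and dicts.

```python
from dataclasses import dataclass


@dataclass                       # eq=True, frozen=False: the default row
class DefaultPoint:
    x: int


@dataclass(frozen=True)          # eq=True, frozen=True: __hash__ is added
class FrozenPoint:
    x: int


@dataclass(eq=False)             # no __eq__: the base class __hash__ is kept
class IdentityPoint:
    x: int


@dataclass(unsafe_hash=True)     # not frozen, but __hash__ is added anyway
class UnsafePoint:
    x: int


print(DefaultPoint.__hash__ is None)                   # True
print(hash(FrozenPoint(1)) == hash(FrozenPoint(1)))    # True
print(IdentityPoint.__hash__ is object.__hash__)       # True

unsafe = UnsafePoint(1)
hash_before = hash(unsafe)
unsafe.x = 2                     # allowed, since the class is not frozen
hash_after = hash(unsafe)
print(hash_before == hash_after)                       # False
```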

[^ Go to the top!](#Table-of-Contents)

### What dataclasses also give for free

#### Some helpers
Full list [here](https://www.python.org/dev/peps/pep-0557/#module-level-helper-functions)

In [18]:
from dataclasses import asdict, astuple

user = User(42)

In [19]:
asdict(user)

{'id_': 42, 'first_name': 'Barbara', 'last_name': 'Streisand', 'cities': []}

In [20]:
astuple(user)

(42, 'Barbara', 'Streisand', [])
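
The same module also exposes `fields`, `is_dataclass`, `replace` and `make_dataclass`. The sketch below is self-contained and uses hypothetical `City` and `Traveller` classes to show, among other things, that `asdict` converts nested dataclasses recursively.

```python
from dataclasses import (dataclass, field, asdict, fields,
                         is_dataclass, make_dataclass, replace)
from typing import List


@dataclass
class City:
    name: str


@dataclass
class Traveller:
    id_: int
    cities: List[City] = field(default_factory=list)


traveller = Traveller(42, [City('Lyon')])

# asdict recurses into nested dataclasses and lists
print(asdict(traveller))                      # {'id_': 42, 'cities': [{'name': 'Lyon'}]}

# fields() lists the Field objects, is_dataclass() works on classes and instances
print([f.name for f in fields(traveller)])    # ['id_', 'cities']
print(is_dataclass(traveller), is_dataclass(dict))   # True False

# replace() builds a new instance with some fields changed
renumbered = replace(traveller, id_=43)
print(renumbered)                             # Traveller(id_=43, cities=[City(name='Lyon')])

# make_dataclass() creates a dataclass dynamically
Point = make_dataclass('Point', [('x', int), ('y', int, field(default=0))])
print(Point(1))                               # Point(x=1, y=0)
```

Note that `replace` passes the existing field values to the new instance, so `renumbered.cities` and `traveller.cities` are the same list object.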

[^ Go to the top!](#Table-of-Contents)

#### Some methods
Full list with examples [here](https://www.python.org/dev/peps/pep-0557/#id7)

`@dataclass` will automatically add to the class the following methods:
- `__init__` when using `@dataclass(init=True)`
- `__repr__` when using `@dataclass(repr=True)`
- `__eq__`   when using `@dataclass(eq=True)`
- `__lt__`, `__le__`, `__gt__`, and `__ge__`: when using `@dataclass(order=True)` (default is `False`)

**But**, note that _these methods_:
- **won't be generated** if said so in the `dataclass` decorator: `@dataclass(init=False, repr=False...)`
    
    Full signature: `def dataclass(*, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False)`


- **won't replace** a method that the class body already defines: the explicit definition is kept (_Explicit is better than implicit._). The exception is `order=True`, which raises a `TypeError` if the class already defines one of the ordering methods.
- **will override** inherited methods when the class body does not define the method.
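
These rules can be illustrated with a short sketch. The `Version`, `Base` and `Derived` classes are hypothetical: the first one combines `order=True` with a hand-written `__repr__`, the other two show a generated method taking precedence over an inherited one.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Version:
    major: int
    minor: int = 0

    def __repr__(self):                # defined in the class body: @dataclass keeps it
        return f'v{self.major}.{self.minor}'


# Ordering methods compare the fields as tuples, in definition order
print(sorted([Version(2), Version(1, 5), Version(1)]))   # [v1.0, v1.5, v2.0]
print(Version(1, 5) < Version(2))                        # True


class Base:
    def __repr__(self):
        return 'Base repr'


@dataclass
class Derived(Base):
    x: int = 0


# Derived does not define __repr__ itself, so the generated one hides Base.__repr__
print(repr(Derived()))                                   # Derived(x=0)
```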

#### How to inspect the generated methods ? (advanced level)
- Not possible to get source code with `inspect` (but signature can be inspected)
- Possible to disassemble the bytecode with `dis.dis`
- Possible to reassemble the code with external lib [DataClassInspector](https://github.com/DamlaAltun/DataclassInspector)

###### Inspect

We cannot inspect the generated code because inspection works either by reading the file containing the source (`__file__` attribute) or by reading the _cache_ of `linecache` but [not in _interactive sessions_](https://mail.python.org/pipermail/python-list/2014-August/677076.html). ([see source](https://github.com/python/cpython/blob/3.7/Lib/inspect.py#L680))

The code generated by `@dataclass` is written to a `str` and executed with `exec` ([see source](https://github.com/python/cpython/blob/3.7/Lib/dataclasses.py#L386)), so neither a `__file__` attribute nor a cache is available.

Note that, exceptionally, it is possible to get the source code of a _function_ (but [not of a class](https://bugs.python.org/issue33826?@ok_message=msg%20328824%20created%0Aissue%2033826%20message_count%2C%20messages%20edited%20ok&@template=item#msg319692)) in ipython (embedded in Jupyter), because [ipython writes the source of each code cell to `linecache.cache`](https://bugs.python.org/issue12920#msg245721).
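
The mechanism can be reproduced without `dataclasses`. The sketch below builds a function from a string with `exec`, in the same spirit as the generated `__init__`; the `source_text` string and the `generated_init` name are made up for the illustration. Whether `inspect.getsource` fails or not may depend on the environment, so both outcomes are handled.

```python
import inspect

# Hypothetical source text, similar in spirit to what @dataclass builds as a str
source_text = "def generated_init(self, country):\n    self.country = country\n"

namespace = {}
exec(source_text, namespace)
generated_init = namespace['generated_init']

# No real file backs the function: its code object only knows a placeholder name
print(generated_init.__code__.co_filename)    # <string>

try:
    print(inspect.getsource(generated_init))
except (OSError, TypeError):
    print("Source code not accessible")

# The signature is still available, since it is stored on the function object itself
print(inspect.signature(generated_init))      # (self, country)
```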

In [21]:
import inspect

In [22]:
class VolcanNoDataClass:
    def __init__(self, lava):
        print("lavaa")

In [23]:
@dataclass
class Volcan:
    country: str

In [24]:
print(inspect.getsource(VolcanNoDataClass.__init__))

    def __init__(self, lava):
        print("lavaa")



In [25]:
try: 
    print(inspect.getsource(Volcan.__init__))
except OSError:
    print("Source code not accessible")

Source code not accessible


But, we can still access the signatures of generated methods

In [26]:
str(inspect.signature(VolcanNoDataClass.__init__))

'(self, lava)'

In [27]:
str(inspect.signature(Volcan.__init__))

'(self, country: str) -> None'

In [28]:
inspect.getsourcefile(VolcanNoDataClass.__init__)

'<ipython-input-22-cd6d750458f6>'

In [29]:
inspect.getsourcefile(Volcan.__init__) is None

True

_Side-note about signature inspection in python 2 & 3_

```python
    str(inspect.signature(Volcan.__init__)) # python 3.3+
    inspect.getfullargspec(Volcan.__init__) # python 3.0+ (python 2.7 only had inspect.getargspec)
```

###### Disassembling the generated bytecode

In [30]:
import dis

In [31]:
dis.dis(VolcanNoDataClass.__init__)

  3           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('lavaa')
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE


In [32]:
dis.dis(Volcan.__init__)

  2           0 LOAD_FAST                1 (country)
              2 LOAD_FAST                0 (self)
              4 STORE_ATTR               0 (country)
              6 LOAD_CONST               0 (None)
              8 RETURN_VALUE


###### Reassembling the code with DataClassInspector

`pip install DataclassInspector`

In [33]:
from DataclassInspector.inspector import Inspector

VolcanInspected = Inspector(Volcan)

print(VolcanInspected.code)

from dataclasses import Field, _MISSING_TYPE, _DataclassParams


class Volcan:
    __dataclass_fields__ = {
        "country": "Field(name='country', type=str, default=_MISSING_TYPE, default_factory=_MISSING_TYPE, init=True, repr=True, hash=None, compare=True, metadata={}, _field_type=_FIELD)"
    }
    __dataclass_params__ = _DataclassParams(
        init=True,
        repr=True,
        eq=True,
        order=False,
        unsafe_hash=False,
        frozen=False,
    )
    country: str

    def __eq__(self, other):
        if other.__class__ is self.__class__:
            return (self.country,) == (other.country,)
        else:
            return NotImplemented

    __hash__ = None

    def __init__(self, country: str) -> None:
        self.country = country

    def __repr__(self):
        return (
            self.__class__.__qualname__ + f"""(country={(self.country)!r})"""
        )



[^ Go to the top!](#Table-of-Contents)

## How to customize \_\_init\_\_ ?

In [34]:
@dataclass
class Duration:
    years  : int
    months : int = field(init=False)
    
    def __post_init__(self):
        self.months = self.years * 12

In [35]:
Duration(20)

Duration(years=20, months=240)

[^ Go to the top!](#Table-of-Contents)

## Remarks about inheritance

### Only between dataclasses

In [36]:
@dataclass
class Mother:
    attr_mother: int = 69

@dataclass
class Child(Mother):
    attr_child: int = 91

In [37]:
Mother()

Mother(attr_mother=69)

In [38]:
Child()

Child(attr_mother=69, attr_child=91)

[^ Go to the top!](#Table-of-Contents)

### Mixed dataclasses and non-dataclasses

Since dataclasses are regular python classes, they are compatible with non-dataclasses, but we should keep in mind what `@dataclass` provides for free.

_Without dataclasses_

In [39]:
class Mother:
    def __init__(self, id_=1):
        self.id_ = id_

class Child(Mother):
    pass

child = Child()
assert child.id_ == 1

print("child has an attribute 'id_'\n"
      "because Mother.__init__ was called\n"
      "because Child does not have an __init__")

child has an attribute 'id_'
because Mother.__init__ was called
because Child does not have an __init__


_With only the Child as a dataclass_

In [40]:
class Mother:
    def __init__(self, id_=1):
        self.id_ = id_

@dataclass
class Child(Mother):
    pass

child = Child()
try:
    assert child.id_ == 1
except AttributeError:
    print("child has no attribute 'id_'\n"
      "because Mother.__init__ was not called\n"
      "because Child already has an __init__ (that does not call its super's)")

child has no attribute 'id_'
because Mother.__init__ was not called
because Child already has an __init__ (that does not call its super's)


_How to make it work ?_

In [41]:
@dataclass
class Child(Mother):
    def __post_init__(self):
        super().__init__()  # Mother's __init__ explicitly called

child = Child()
assert child.id_ == 1

Conclusion
- `@dataclass` adds an `__init__` method whether there are fields defined or not !
- Advice: If you use `@dataclass`, use what it provides for free (i.e.: fields and `__init__`)

[^ Go to the top!](#Table-of-Contents)

## Summary

Pros:
- **Less verbose**: avoid setting all the attributes in \__init\__
- **Prevent side-effects** at attribute initialization (at import time)
    - a `ValueError` is raised when setting a mutable default such as a `list`, `dict` or `set` (use `default_factory` for mutables)
- **Can be used in sets** (very easy to make hashable)
    - Setting `frozen=True` makes the fields immutable (similar to namedtuple)

Cons:
- Types are not checked natively: the annotations (built with the `typing` module) are only verified by an external checker such as mypy
- Not JSON-serializable by default (instances need to be converted to a dict first, with `asdict`)
- Not possible to freeze only some fields
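
The JSON limitation and its workaround can be sketched as follows. The `User` class is a reduced, self-contained copy of the one used earlier in this notebook, and rebuilding the instance with `User(**...)` is only one possible approach that works here because the fields are plain JSON types.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class User:
    id_: int
    first_name: str = 'Barbara'
    cities: List[str] = field(default_factory=list)


user = User(42, cities=['New York'])

# A dataclass instance is not accepted by json.dumps as is
direct_dump_failed = False
try:
    json.dumps(user)
except TypeError:
    direct_dump_failed = True
print(direct_dump_failed)     # True

# Converting to a dict first works
payload = json.dumps(asdict(user))
print(payload)                # {"id_": 42, "first_name": "Barbara", "cities": ["New York"]}

# Round trip: the generated __init__ and __eq__ make the comparison easy
restored = User(**json.loads(payload))
print(restored == user)       # True
```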

Resources:
- [PEP 557, Dataclasses](https://www.python.org/dev/peps/pep-0557/), [PEP 484, Type Hints](https://www.python.org/dev/peps/pep-0484/)
- Talks:
    - PyParis 2018: Pierre Alexandre Schembri - Unexpected Dataclasses [Slides](http://pyparis.org/static/slides/Pierre%20Alexandre%20Schembri-9cc74f5a.pdf), [Video](https://www.youtube.com/watch?v=Npsovzwcd-w&list=PLzjFI0G5nSsry3cm_k1tPOi9SRaAXsZAt&index=6)
    - PyCon 2018: Raymond Hettinger - Dataclasses: The code generator to end all code generators, [Video](https://www.youtube.com/watch?v=T-TwcmT6Rcw)

[^ Go to the top!](#Table-of-Contents)