## Dataclasses in Python

#### Simon Westerberg
##### Sartorius Stedim Data Analytics



### Chapter 1 
*The licensed cat owner*


Assume that we have some software in which we want to represent cat owners. 

There is some information tied to each cat owner, specifically:
* Name
* Whether they have a cat owner license
* A list of cats that they own




In Python there are of course many different ways to represent this data.
For instance, we could use a tuple like this:

In [None]:
a = ('Anna', True, ['Meowski'])
print(a)

However, a major drawback of tuples is that we don’t know what the data represent. For instance, `True` could mean anything. 

Another option is to use a key/value datatype, like a `dict`:

In [None]:
a = {'name': 'Anna', 'has_license': True, 'cats': ['Meowski']}
print(a)

This is a bit better since we now know what each property represents. However, we still don’t know what the dictionary itself represents. It could be a cat hunter, with a cat hunting license, who has currently killed one cat (Meowski).

We have no type specification, and we cannot tell whether some information is missing or whether there is too much of it.
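As a small sketch of that problem (the second owner and the misspelled key are made up for illustration), a dict accepts any shape without complaint:

```python
# Two "cat owners" stored as plain dicts; nothing checks their shape.
anna = {'name': 'Anna', 'has_license': True, 'cats': ['Meowski']}
bob = {'name': 'Bob', 'has_licence': True}  # misspelled key, and 'cats' is missing

# Both dicts are accepted; the problems only surface when the data is read.
print(bob.get('has_license'))  # None, because the key was misspelled
print('cats' in bob)           # False

# Comparing key sets is one way to spot what is missing.
missing = set(anna) - set(bob)
print(sorted(missing))
```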

The next datatype we can try is the named tuple. It exists in two versions. Let’s look at the old one first.

In [None]:
from collections import namedtuple
CatOwner = namedtuple('CatOwner', ['name', 'has_license', 'cats'])

a = CatOwner('Anna', True, ['Meowski'])
print(a)

Now we have a name for the type, and for the fields. Great. 
However, we have to specify the name twice, and if they mismatch, things get confusing…
Further, we cannot make use of Python type hints, e.g. specifying that the name should be a string. 
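To illustrate the name mismatch, here is a minimal sketch in which the typename string is deliberately wrong (`'DogOwner'` is a hypothetical slip, not part of the original example):

```python
from collections import namedtuple

# The variable is called CatOwner, but the typename string says 'DogOwner'.
CatOwner = namedtuple('DogOwner', ['name', 'has_license', 'cats'])

a = CatOwner('Anna', True, ['Meowski'])
print(a)                  # the repr uses the typename string, not the variable name
print(CatOwner.__name__)  # DogOwner
```

Python does not warn about this; the mismatch only shows up in the repr and in error messages.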

So in Python 3.6, we got the other version of the named tuple, the typed NamedTuple:

In [None]:
from typing import NamedTuple, List
class CatOwner(NamedTuple):
    name: str
    has_license: bool = True
    cats: List[str] = ["Meowski"]

This version also provides an easy way to specify default values. 

Let’s say most cat owners actually become cat owners when they get their cat license. They apply for a license and receive a diploma, as well as a cat called Meowski.

We can specify this in the class template and then instantiate two cat owners, and they both have their default values. 

In [None]:
a = CatOwner("Anna")
b = CatOwner("Bob")

print(a)
print(b)

Looks good, right? Well, it's not...

### Chapter 2 
*The tumble dryer incident*

After a while, Anna's cat accidentally ends up in the tumble dryer.
We want our data to reflect this.

First we remove the cat from her list of cats.

In [None]:
a.cats.remove('Meowski')
print(a.cats)

Now that list is empty.
However, Bob's list of cats is also empty…

In [None]:
print(b.cats)

This is because a list is a mutable object that is passed around by reference. The default list object is created once, when the class is defined, and all `CatOwner` instances that use the default get a reference to that same object.
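The sharing can be checked directly with an identity test. This is a self-contained sketch that repeats the class definition; the third owner, `Cleo`, is made up for illustration:

```python
from typing import List, NamedTuple

class CatOwner(NamedTuple):
    name: str
    has_license: bool = True
    cats: List[str] = ["Meowski"]  # one list, created when the class is defined

a = CatOwner("Anna")
b = CatOwner("Bob")
print(a.cats is b.cats)  # True: both fields refer to the same list object

# Passing a list explicitly avoids the sharing.
c = CatOwner("Cleo", cats=["Purrski"])
print(c.cats is a.cats)  # False
```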

We also want to revoke Anna’s cat owner license:

In [None]:
a.has_license = False

However, named tuples are immutable, and this results in an error.
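A minimal sketch of the error, and of the usual workaround for named tuples: `_replace` returns a modified copy instead of changing the original. The class is repeated here without defaults so that the block runs on its own:

```python
from typing import List, NamedTuple

class CatOwner(NamedTuple):
    name: str
    has_license: bool
    cats: List[str]

a = CatOwner("Anna", True, ["Meowski"])

try:
    a.has_license = False
except AttributeError as error:
    print("AttributeError:", error)

# _replace returns a new tuple; the original is left untouched.
revoked = a._replace(has_license=False)
print(a.has_license, revoked.has_license)  # True False
```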

Immutability is often desirable, so there is nothing inherently wrong with this.
But there are cases where mutability is desired, e.g. for performance reasons.
And we still need to fix the problem with default values for reference types.

### Chapter 3
*The* `class`*y solution*

How about using a good old regular class then? Let's try!

First we need a constructor. It would be tempting to specify the default values for the `cats` parameter directly in the parameter list, however that would lead to the same problem, where all instances would reference the same object. Instead we need to construct the list in the constructor body.

In [None]:
class CatOwner:
    def __init__(self, name, has_license=True, cats=None):
        self.name = name
        self.has_license = has_license
        if cats is None:
            # Build a fresh list per instance instead of sharing one default.
            cats = ['Meowski']
        self.cats = cats
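For comparison, here is a sketch of the tempting version with the default list in the parameter list. `NaiveCatOwner` is a hypothetical name used only to show the pitfall:

```python
class NaiveCatOwner:
    # The default list is created once, when the function is defined,
    # and reused for every call that omits `cats`.
    def __init__(self, name, has_license=True, cats=['Meowski']):
        self.name = name
        self.has_license = has_license
        self.cats = cats

a = NaiveCatOwner('Anna')
b = NaiveCatOwner('Bob')

a.cats.remove('Meowski')
print(b.cats)            # []: Bob lost his cat too
print(a.cats is b.cats)  # True
```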

We try again to create two cat owners, remove Meowski from one of them, then print the objects to see if it works…

In [None]:
a = CatOwner('Anna')
b = CatOwner('Bob')

a.cats.remove('Meowski')
print(a)
print(b)

Oh, we can't see that by just printing the object, because we don’t have any string representation of the class. Let’s add one:

In [None]:
class CatOwner:
    def __init__(self, name, has_license=True, cats=None):
        self.name = name
        self.has_license = has_license
        if cats is None:
            # Build a fresh list per instance instead of sharing one default.
            cats = ['Meowski']
        self.cats = cats
        
    def __repr__(self):
        return (self.__class__.__qualname__ +
            f"(name={self.name}, has_license={self.has_license}, cats={self.cats})")

In [None]:
a = CatOwner('Anna')
b = CatOwner('Bob')

a.cats.remove('Meowski')
print(a)
print(b)

Good, it seems to work.

Of course, we would also like to be able to compare different cat owners. By default, two instances are equal only if they are the same object, so we need to add an equality method:

In [None]:
twin_1 = CatOwner('Twin')
twin_2 = CatOwner('Twin')

print(twin_1)
print(twin_2)


In [None]:
print('Twin == Twin?', twin_1 == twin_2)

In [None]:
class CatOwner:
    def __init__(self, name, has_license=True, cats=None):
        self.name = name
        self.has_license = has_license
        if cats is None:
            # Build a fresh list per instance instead of sharing one default.
            cats = ['Meowski', 'Purrski']
        self.cats = cats
        
    def __repr__(self):
        return (self.__class__.__qualname__ +
            f"(name={self.name}, has_license={self.has_license}, cats={self.cats})")
    
    def __eq__(self, other):
        if not isinstance(other, CatOwner):
            return NotImplemented
        return \
            self.name == other.name and \
            self.has_license == other.has_license and \
            self.cats == other.cats

And honestly, our code doesn't look that nice any more. If we would also like to order objects, or use them as keys in a dict, we would need to implement additional special methods. This is not just a simple data representation anymore...

In [None]:
class CatOwner:
    def __init__(self, name, has_license=True, cats=None):
        self.name = name
        self.has_license = has_license
        if cats is None:
            # Build a fresh list per instance instead of sharing one default.
            cats = ['Meowski', 'Purrski']
        self.cats = cats
        
    def __repr__(self):
        return (self.__class__.__qualname__ +
            f"(name={self.name}, has_license={self.has_license}, cats={self.cats})")
    
    def __eq__(self, other):
        if not isinstance(other, CatOwner):
            return NotImplemented
        return \
            self.name == other.name and \
            self.has_license == other.has_license and \
            self.cats == other.cats
    
    # def __hash__(self):
        # ...
        
    # def __le__(self, other):
        # ...
        
    # def __lt__(self, other):
        # ...
        
    # def __gt__(self, other):
        # ...
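The need for `__hash__` is easy to miss: a class that defines `__eq__` without `__hash__` becomes unhashable. A stripped-down sketch (only the `name` field is kept, and the dict value is a made-up placeholder):

```python
class CatOwner:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, CatOwner):
            return NotImplemented
        return self.name == other.name

# Defining __eq__ without __hash__ sets __hash__ to None.
print(CatOwner.__hash__)  # None

try:
    owners = {CatOwner('Anna'): 'license #1'}
except TypeError as error:
    print('TypeError:', error)
```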

### Chapter 4
*The dataclass (finally)*

Now we are ready to look at the dataclass, introduced in Python 3.7.

The dataclass looks similar to the typed named tuple, e.g.:

In [None]:
from dataclasses import dataclass
from typing import List

@dataclass
class CatOwner:
    name: str
    has_license: bool
    cats: List[str]

vs.

In [None]:
from typing import List, NamedTuple

class CatOwner(NamedTuple):
    name: str
    has_license: bool
    cats: List[str]

The dataclass can have default values for immutable types like `bool`,
but if we try to do the same with a list, we get a `ValueError` when the class is defined.

In [None]:
from dataclasses import dataclass

@dataclass
class CatOwner:
    name: str
    has_license: bool = True
    cats: list = ["Meowski"]  # Not allowed
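The error is raised when the class is defined, not when an instance is created. A small sketch that catches it so the message can be inspected:

```python
from dataclasses import dataclass

try:
    @dataclass
    class CatOwner:
        name: str
        has_license: bool = True
        cats: list = ["Meowski"]  # rejected by the decorator
except ValueError as error:
    message = str(error)
    print("ValueError:", message)
```

The message itself points to `default_factory` as the fix.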

Instead we need to use the special `field` function, with a `default_factory` parameter that specifies a function that creates the default object.

In [None]:
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatOwner:
    name: str
    has_license: bool = True
    cats: List[str] = field(default_factory=lambda: ['Meowski'])


In [None]:
a = CatOwner('Anna')
print(a)

a.has_license = False
print(a)

A dataclass is mutable by default. However, by using the `frozen` argument, it becomes immutable. There are also a couple of other arguments that let us customize the behaviour of the dataclass.
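As a sketch of the `frozen` behaviour, the same `CatOwner` can be declared immutable. Assignment then raises `FrozenInstanceError`, and `dataclasses.replace` can be used to build a modified copy:

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace
from typing import List

@dataclass(frozen=True)
class CatOwner:
    name: str
    has_license: bool = True
    cats: List[str] = field(default_factory=lambda: ['Meowski'])

a = CatOwner('Anna')

try:
    a.has_license = False
except FrozenInstanceError as error:
    print('FrozenInstanceError:', error)

# replace() builds a new instance with the given fields changed.
revoked = replace(a, has_license=False)
print(a.has_license, revoked.has_license)  # True False
```

Note that `frozen` only blocks attribute assignment; the list stored in `cats` can still be mutated in place.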

### Generated special functions

```python
@dataclass
```
        __init__, __repr__, __eq__

```python
@dataclass(frozen=True)
```
        __setattr__, __delattr__, (__hash__)


```python
@dataclass(order=True)
```
        __lt__, __le__, __gt__, __ge__
        
```python
@dataclass(unsafe_hash=True)
```
        __hash__
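The overview above can be checked with a few throwaway classes. `Plain`, `Frozen` and `Ordered` are hypothetical names used only for this sketch:

```python
from dataclasses import dataclass

@dataclass
class Plain:
    x: int

@dataclass(frozen=True)
class Frozen:
    x: int

@dataclass(order=True)
class Ordered:
    x: int

# With the default eq=True and no frozen, instances are unhashable.
print(Plain.__hash__)                      # None
# frozen=True together with eq=True generates __hash__ from the fields.
print(hash(Frozen(1)) == hash(Frozen(1)))  # True
# order=True generates the ordering methods.
print(Ordered(1) < Ordered(2))             # True
print('__lt__' in vars(Plain))             # False
```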

### Customizability


In [None]:
from dataclasses import dataclass, field
from typing import Dict

@dataclass(order=True, unsafe_hash=True)
class Thing:
    # Don't use id in comparison or (string) representation
    id_: str = field(repr=False, compare=False, default="")
        
    # Size will be used for comparison, since order=True
    size: int = 0
    
    # Don't use content for hash calculation
    content: Dict[str, str] = field(default_factory=dict, hash=False)
    
t1 = Thing("xyz", size=4)
t2 = Thing("abc", size=6)
print(t1)
print(t1 < t2)
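A short sketch of what the `field` options on `Thing` do in practice. The class is repeated so the block runs on its own, and `t3` is an extra instance added for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass(order=True, unsafe_hash=True)
class Thing:
    id_: str = field(repr=False, compare=False, default="")
    size: int = 0
    content: Dict[str, str] = field(default_factory=dict, hash=False)

t1 = Thing("xyz", size=4)
t3 = Thing("abc", size=4)    # different id_, same size

print(repr(t1))              # id_ is left out of the repr
print(t1 == t3)              # True: id_ is excluded from comparison
print(hash(t1) == hash(t3))  # True: only size contributes to the hash
```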

Otherwise, a dataclass behaves exactly like a regular class. It is just a more compact way of writing one.
It is similar to the NamedTuple, but it is mutable, and allows default values for reference types. It is also more customizable.


### Takeaways

* `NamedTuple` is a great start

* Use `@dataclass` instead if you need 
    * mutability
    * customizability
    * default values for mutable types (via `default_factory`)

* Avoid mutable types as default values

### Thank you!

[`github.com/swiperii/python-dataclasses`](https://github.com/swiperii/python-dataclasses)

[simon.westerberg@sartorius.com](mailto:simon.westerberg@sartorius.com)