# 01_05: Data classes

In [128]:
import math
import collections
import dataclasses
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

Let us look at Python data structures from the perspective of a data scientist or data analyst. What are the options to store _tabular_ data, such as a table of famous people with their names and birthdays?

<table>
<tr><th>name</th><th>lastname</th><th>birthday</th></tr>
<tr><td>Michele</td><td>Vallisneri</td><td>July 15</td></tr>
<tr><td>Albert</td><td>Einstein</td><td>March 14</td></tr>
<tr><td>John</td><td>Lennon</td><td>October 9</td></tr>
<tr><td>Jocelyn</td><td>Bell Burnell</td><td>July 15</td></tr>
</table>

A list of Python dicts is certainly a possibility: it's easy to access the columns by their key, and to query the rows using comprehensions.

In [3]:
peopledict = [{"name": "Michele", "lastname": "Vallisneri",   "birthday": "July 15"},
              {"name": "Albert",  "lastname": "Einstein",     "birthday": "March 14"},
              {"name": "John",    "lastname": "Lennon",       "birthday": "October 9"},
              {"name": "Jocelyn", "lastname": "Bell Burnell", "birthday": "July 15"}]

In [4]:
[person for person in peopledict if person["birthday"] == "July 15"]

[{'name': 'Michele', 'lastname': 'Vallisneri', 'birthday': 'July 15'},
 {'name': 'Jocelyn', 'lastname': 'Bell Burnell', 'birthday': 'July 15'}]

However, dicts are fairly wasteful because the keys need to be repeated for every row.
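As a rough illustration of that overhead, we can compare the size of one row stored as a dict with the same row stored as a plain tuple. This is only a sketch: `sys.getsizeof` measures the container alone (not the strings it refers to), and the exact numbers depend on the Python version and implementation.

```python
import sys

row_dict = {"name": "John", "lastname": "Lennon", "birthday": "October 9"}
row_tuple = ("John", "Lennon", "October 9")

# size in bytes of the container itself, excluding the strings it points to
dict_size = sys.getsizeof(row_dict)
tuple_size = sys.getsizeof(row_tuple)

print("dict:", dict_size, "bytes; tuple:", tuple_size, "bytes")
```

In CPython the dict is noticeably larger, since every row carries its own table of keys.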

Another possibility is a `tuple`, or even better a `namedtuple` from the `collections` module in the Python standard library. With it, we create a specialized tuple that associates labels with columns.

In [43]:
Person = collections.namedtuple("Person", ["name", "lastname", "birthday"])

The syntax to create a `Person` is intuitive:

In [29]:
Person(name='Michele', lastname='Vallisneri', birthday='July 15')

Person(name='Michele', lastname='Vallisneri', birthday='July 15')

...although we can also omit the labels:

In [31]:
peopletuples = [Person("Michele", "Vallisneri", "July 15"),
                Person("Albert", "Einstein", "March 14"),
                Person("John", "Lennon", "October 9"),
                Person("Jocelyn", "Bell Burnell", "July 15")]

The columns can be accessed with the dot notation of Python object attributes, although regular tuple indices would also work.

In [32]:
[person for person in peopletuples if person.lastname == "Lennon"]

[Person(name='John', lastname='Lennon', birthday='October 9')]
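A minimal sketch of the access styles mentioned above: dot notation, regular indices, and unpacking. It also uses `_replace`, which returns a modified copy because named tuples, like all tuples, cannot be changed in place. The `"unknown"` value is just a placeholder for illustration.

```python
import collections

Person = collections.namedtuple("Person", ["name", "lastname", "birthday"])
john = Person("John", "Lennon", "October 9")

# three equivalent ways to get at the columns
by_attribute = john.lastname
by_index = john[1]
name, lastname, birthday = john

# namedtuples are immutable, so _replace builds a new record
john_updated = john._replace(birthday="unknown")

print(by_attribute, by_index, lastname)
print(john_updated)
```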

We can convert from and to a dictionary using double-star unpacking and the `namedtuple` method `_asdict`. The underscore is there to avoid confusion in case you _really_ want to use `asdict` as a label.

In [38]:
Person(**peopledict[3])

Person(name='Jocelyn', lastname='Bell Burnell', birthday='July 15')

In [39]:
peopletuples[3]._asdict()

{'name': 'Jocelyn', 'lastname': 'Bell Burnell', 'birthday': 'July 15'}
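The same two conversions can be applied to the whole table with comprehensions. The sketch below rebuilds the list of dicts so that it runs on its own, and checks that the round trip gives back the original rows.

```python
import collections

Person = collections.namedtuple("Person", ["name", "lastname", "birthday"])

peopledict = [{"name": "Michele", "lastname": "Vallisneri",   "birthday": "July 15"},
              {"name": "Albert",  "lastname": "Einstein",     "birthday": "March 14"},
              {"name": "John",    "lastname": "Lennon",       "birthday": "October 9"},
              {"name": "Jocelyn", "lastname": "Bell Burnell", "birthday": "July 15"}]

# dicts -> namedtuples, using double-star unpacking on every row
peopletuples = [Person(**row) for row in peopledict]

# namedtuples -> dicts, using _asdict on every row
roundtrip = [dict(person._asdict()) for person in peopletuples]

print(roundtrip == peopledict)
```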

Python 3.7 introduced an alternative to tuples and dicts for storing data records, _data classes_, in the module `dataclasses`.

This is how we would set up a `person` record with `name`, `lastname`, and `birthday`. We specify the Python type of each field, and we can also set a default value.

In [67]:
@dataclasses.dataclass
class Persondata:
    name: str
    lastname: str
    birthday: str = "unknown"

The syntax is that of a Python `class`, but we have added the _class decorator_ `@dataclasses.dataclass` at the top, and _type annotations_ for each field.

Again, the syntax to create a record is intuitive, with or without keywords:

In [91]:
peopledata = [Persondata(name="Michele", lastname="Vallisneri", birthday="July 15"),
              Persondata("Albert", "Einstein", "March 14"),
              Persondata("John", "Lennon", "October 9"),
              Persondata("Jocelyn", "Bell Burnell", "July 15")]

...and we access fields by name:

In [52]:
[person for person in peopledata if person.birthday != "July 15"]

[Persondata(name='Albert', lastname='Einstein', birthday='March 14'),
 Persondata(name='John', lastname='Lennon', birthday='October 9')]
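The default value declared for `birthday` comes into play when we omit that field. The short sketch below also shows two things not used above: unlike named tuples, data class instances can be modified in place, and `dataclasses.asdict` plays the role of `_asdict`.

```python
import dataclasses

@dataclasses.dataclass
class Persondata:
    name: str
    lastname: str
    birthday: str = "unknown"

# birthday is omitted, so the default value is used
john = Persondata("John", "Lennon")
default_birthday = john.birthday

# fields of a regular data class can be reassigned
john.birthday = "October 9"

# convert to a dict and back again
record = dataclasses.asdict(john)
restored = Persondata(**record)

print(default_birthday)
print(record)
print(restored == john)
```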

So far this is very similar to `namedtuple`. However, data classes are full Python classes, so we can define methods that operate on the fields: for instance, a method that returns a person's full name, or a prettier printout:

In [72]:
@dataclasses.dataclass
class Persondata:
    name: str
    lastname: str
    birthday: str = "unknown"
    
    # in instance methods, "self" refers to the instance
    def fullname(self):
        return self.name + " " + self.lastname

    # the special method __str__ overrides the standard printout
    def __str__(self):
        return self.lastname + ", " + self.name + ", born " + self.birthday

In [61]:
michele = Persondata('Michele', 'Vallisneri', 'July 15')

In [62]:
michele.fullname()

'Michele Vallisneri'

In [65]:
print(michele)

Vallisneri, Michele, born July 15


Data classes have a number of other useful features, such as freezing (fields cannot be changed), sorting (by comparing fields in order, or with a custom "less than" function), and computed fields.

I encourage you to stop the video here for a moment and experiment with these variants.

In [109]:
@dataclasses.dataclass(frozen = True)
class Persondata_frozen:
    name: str
    lastname: str
    birthday: str = "unknown"


@dataclasses.dataclass(order = True)
class Persondata_ordered:
    name: str
    lastname: str
    birthday: str = "unknown"


@dataclasses.dataclass
class Persondata_customorder:
    name: str
    lastname: str
    birthday: str = "unknown"

    # custom "less than" comparison
    def __lt__(self, other):       
        return (self.lastname, self.name, self.birthday) < (other.lastname, other.name, other.birthday)


@dataclasses.dataclass
class Persondata_computed:
    name: str
    lastname: str
    birthday: str = "unknown"
    fullname: str = dataclasses.field(init=False) # will compute it below

    def __post_init__(self):
        self.fullname = self.name + " " + self.lastname
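One possible way to experiment with these variants is sketched below. The four classes are repeated so that the cell runs on its own; the `rows` list and the result variables are names introduced here for illustration.

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Persondata_frozen:
    name: str
    lastname: str
    birthday: str = "unknown"

@dataclasses.dataclass(order=True)
class Persondata_ordered:
    name: str
    lastname: str
    birthday: str = "unknown"

@dataclasses.dataclass
class Persondata_customorder:
    name: str
    lastname: str
    birthday: str = "unknown"

    def __lt__(self, other):
        return (self.lastname, self.name, self.birthday) < (other.lastname, other.name, other.birthday)

@dataclasses.dataclass
class Persondata_computed:
    name: str
    lastname: str
    birthday: str = "unknown"
    fullname: str = dataclasses.field(init=False)

    def __post_init__(self):
        self.fullname = self.name + " " + self.lastname

rows = [("Michele", "Vallisneri", "July 15"),
        ("Albert", "Einstein", "March 14"),
        ("John", "Lennon", "October 9"),
        ("Jocelyn", "Bell Burnell", "July 15")]

# frozen: assigning to a field raises FrozenInstanceError
frozen = Persondata_frozen(*rows[0])
try:
    frozen.birthday = "unknown"
    frozen_error = None
except dataclasses.FrozenInstanceError as error:
    frozen_error = error

# order=True: fields are compared in declaration order, so name comes first
by_name = [p.name for p in sorted(Persondata_ordered(*row) for row in rows)]

# custom __lt__: lastname comes first
by_lastname = [p.lastname for p in sorted(Persondata_customorder(*row) for row in rows)]

# computed field: fullname is filled in by __post_init__
computed = Persondata_computed(*rows[3])

print(type(frozen_error).__name__)
print(by_name)
print(by_lastname)
print(computed.fullname)
```

Note that `sorted` only needs the "less than" comparison, which is why defining `__lt__` alone is enough for the custom ordering.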

One thing we haven't seen is how the _type_ of a field (such as `str`) is used with dataclasses. In fact, by default it is _not_ enforced at all. However, it is recorded and made available to third-party packages that validate data entry. An excellent package for that purpose is `pydantic`.
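To see that the standard library does not enforce the annotations, we can pass an integer where a string is expected. The sketch below also reads the declared types back with `dataclasses.fields`, which is the kind of information a validation package can rely on. The `isinstance` check is a simplification that works here only because every field is annotated with a plain class.

```python
import dataclasses

@dataclasses.dataclass
class Persondata:
    name: str
    lastname: str
    birthday: str = "unknown"

# no error here: the annotation "lastname: str" is not checked
odd = Persondata("Michele", 15)

# the declared types are still recorded on the class
declared = {f.name: f.type for f in dataclasses.fields(Persondata)}

# a naive manual check of each value against its declared type
mismatched = [name for name, expected in declared.items()
              if not isinstance(getattr(odd, name), expected)]

print(odd)
print(declared)
print(mismatched)
```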

In [118]:
import pydantic

We replace the standard `dataclasses.dataclass` decorator with its equivalent in `pydantic`. We also write a custom validator for the birthday, which tries to parse it as a Python `datetime` object and raises an exception if the parsing fails.

In [149]:
@pydantic.dataclasses.dataclass
class Persondata_pydantic:
    name: str
    lastname: str
    birthday: str = "unknown"

    @pydantic.field_validator("birthday")
    @classmethod
    def validate_date(cls, value): # a class method, so first argument is the class 
        
        # will fail if date is not "MONTHNAME DAYNUMBER" 
        datetime.datetime.strptime(value, "%B %d")
        
        return value

Now we get an error if we try to create this dataclass with a name or last name that is not a string, or with a date that does not match our template.

In [153]:
Persondata_pydantic("Michele", 15, "July 15")

ValidationError: 1 validation error for Persondata_pydantic
1
  Input should be a valid string [type=string_type, input_value=15, input_type=int]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type

In [154]:
Persondata_pydantic('Michele', "Vallisneri", "7/15")

ValidationError: 1 validation error for Persondata_pydantic
2
  Value error, time data '7/15' does not match format '%B %d' [type=value_error, input_value='7/15', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error
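For comparison, here is a rough standard-library sketch of the same two checks, written by hand in `__post_init__`. It is not how `pydantic` works internally, and it covers far less. The class name `Persondata_checked` and the helper `try_create` are hypothetical, and appending a leap year before parsing is an assumption made here so that "February 29" is accepted.

```python
import dataclasses
import datetime

@dataclasses.dataclass
class Persondata_checked:  # hypothetical name, not part of any library
    name: str
    lastname: str
    birthday: str = "unknown"

    def __post_init__(self):
        # check each value against its annotated type (plain classes only)
        for f in dataclasses.fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(f"{f.name} should be {f.type.__name__}, "
                                f"got {type(value).__name__}")

        # check the "MONTHNAME DAYNUMBER" template, skipping the default value
        if self.birthday != "unknown":
            # assumption: parse with a leap year appended so February 29 passes
            datetime.datetime.strptime(self.birthday + " 2000", "%B %d %Y")

def try_create(*args):
    """Return the new record, or the exception raised while creating it."""
    try:
        return Persondata_checked(*args)
    except (TypeError, ValueError) as error:
        return error

ok = try_create("Michele", "Vallisneri", "July 15")
bad_type = try_create("Michele", 15, "July 15")
bad_date = try_create("Michele", "Vallisneri", "7/15")

print(ok)
print(repr(bad_type))
print(repr(bad_date))
```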

`pydantic` is a very sophisticated and powerful package with many features. It is also compatible with many data analysis and data science packages.

If your project requires substantial data validation, it will pay to dig into `pydantic`.