In [1]:
from pathlib import Path
from pydantic import BaseModel, root_validator
from typing import Optional
import ray
from ray import tune


While creating larger projects, we will typically end up having a lot of parameters. While the fastest way might seem to just hardcode them somewhere, this is not a valid long-term strategy.

Especially when doing experiments with machine learning, we will want to have everything in one place, and ideally we want to have checks in place.

This document explores how you can build more advanced pydantic settings, even for more complex parameters like ray search spaces.

To start naively, we could just make a config like this:

In [2]:
config = {"input_size": 3, "output_size": 20, "data_dir": Path(".")}


In [3]:
config["input_size"]

3

While this will go a long way, there are some horrors hidden deep inside Python.

In [4]:
from dataclasses import dataclass

@dataclass
class MyClass:
    mutable_attr = []  # no type annotation, so this is a class attribute shared by all instances

# Create two instances
instance1 = MyClass()
instance2 = MyClass()

# Append to the list in instance1
instance1.mutable_attr.append('Hello')

print(instance1.mutable_attr)  # prints ['Hello']
print(instance2.mutable_attr)  # also prints ['Hello']. Wait, what?

['Hello']
['Hello']


Every programmer should get nightmares from this, because this is absolutely not what you would expect.
Luckily, pydantic is there to save the day.

![img](python.PNG)

In [10]:
from pydantic import BaseModel
from typing import List

class TrainerSettings(BaseModel):
    mutable_attr: List = []

# Create two settings instances
settings1 = TrainerSettings()
settings2 = TrainerSettings()

# Append to the list in settings1
settings1.mutable_attr.append("Hello")

print(settings1.mutable_attr)  # prints ["Hello"]
print(settings2.mutable_attr)  # prints []

['Hello']
[]
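
For comparison, a plain dataclass can give every instance its own list too, but only if you remember to ask for it. The sketch below uses only the standard library; the class names `SafeClass` and `Broken` are made up for this illustration. It also shows that a dataclass rejects an *annotated* mutable default outright, which is why the earlier example only "worked" because the annotation was missing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SafeClass:  # hypothetical name, for illustration only
    # default_factory is called once per instance, so each instance gets its own list
    mutable_attr: List[str] = field(default_factory=list)


instance1 = SafeClass()
instance2 = SafeClass()
instance1.mutable_attr.append("Hello")

print(instance1.mutable_attr)
print(instance2.mutable_attr)

# An annotated mutable default is refused when the class is defined
try:

    @dataclass
    class Broken:  # hypothetical name, for illustration only
        mutable_attr: List[str] = []

    rejected = False
except ValueError as e:
    rejected = True
    print(e)
```

With pydantic you get the per-instance copy without having to think about `default_factory`.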



But protection against shared mutable defaults is just one advantage. With pydantic we get a config on steroids without much extra effort:

In [11]:
class SearchSpace(BaseModel):
    input_size: int
    output_size: int
    tune_dir: Optional[Path]
    data_dir: Path

config = SearchSpace(input_size=3.0, output_size=20, data_dir=".")  # <- string goes in here
config  # <- and is automatically cast to a Path here

SearchSpace(input_size=3, output_size=20, tune_dir=None, data_dir=PosixPath('.'))

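Note how the `"."` data_dir automatically becomes a `PosixPath`, even though we provide the argument as a string. Likewise, `input_size=3.0` is stored as the integer `3`.

Note how `Optional` allows for leaving the argument out, in which case the value defaults to `None` (this is the behaviour of the pydantic version used here; newer major versions require an explicit `= None` default).

If possible, pydantic will cast all elements, e.g. even `input_size="3"` becomes an integer: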

In [19]:
config = SearchSpace(input_size="3", output_size=20, data_dir=".")
config.input_size


3

In [20]:
type(config.input_size) == int

True

And if you try to give `data_dir` something that can't be cast to a `Path`, you will get an error.
The advantage is that you get your errors at the place where you make them, and not 10 steps later when running the training loop.

In [21]:
try:
    config = SearchSpace(input_size="3", output_size=20, data_dir=3.4)
except ValueError as e:
    print(e)


1 validation error for SearchSpace
data_dir
  value is not a valid path (type=type_error.path)
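
To make the contrast concrete, here is a small standard-library sketch of what happens with the naive dict config from the start of this notebook. The helper `load_data` and the file name `train.csv` are hypothetical, invented only to stand in for "some code that runs much later".

```python
from pathlib import Path

# The bad value is accepted silently: a dict performs no checks at all
config = {"input_size": "3", "output_size": 20, "data_dir": 3.4}


def load_data(cfg: dict) -> Path:
    """Hypothetical helper that would run deep inside the training code."""
    return Path(cfg["data_dir"]) / "train.csv"


try:
    load_data(config)
    error_type = None
except TypeError as e:
    # The mistake only surfaces here, far away from where the config was written
    error_type = type(e).__name__
    print(f"late failure: {e}")
```

The pydantic model raises at construction time instead, so the traceback points at the line where the wrong value was entered.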


Let's try to add the ray.tune ranges. We will need these later on when hypertuning.
You don't have to understand this now, but it provides us with a range of possible parameters, in this case a uniform distribution of numbers between 0.0 and 10.0.

To find out what the type is, we simply call the `type()` function.

In [22]:
type(1.0)

float

In [23]:
type(tune.uniform(0.0, 10.0))


ray.tune.sample.Float

This is a uniform distribution, that Ray will use to search for optimal parameters.

But if we simply add that...

In [25]:
from typing import Union, Optional, Dict

SAMPLE_INT = ray.tune.sample.Integer

try:

    class SearchSpace(BaseModel):
        input_size: int
        hidden_size: Union[int, SAMPLE_INT] = tune.randint(16, 128)
        output_size: int
        tune_dir: Optional[Path]
        data_dir: Path

except RuntimeError as e:
    print(e)


no validator found for <class 'ray.tune.sample.Integer'>, see `arbitrary_types_allowed` in Config


Pydantic complains that it does not know how to validate the type. A simple solution is to set `arbitrary_types_allowed` in the model's `Config`:

In [26]:
class SearchSpace(BaseModel):
    input_size: int
    hidden_size: Union[int, SAMPLE_INT]
    output_size: int = 20
    tune_dir: Path = "."  # defaults are not validated here, so this stays the string "."
    data_dir: Path

    class Config:
        arbitrary_types_allowed = True


config = SearchSpace(input_size=3, hidden_size=32, data_dir=".")
config


SearchSpace(input_size=3, hidden_size=32, output_size=20, tune_dir='.', data_dir=PosixPath('.'))

Because of the `Union`, an integer will work

In [27]:
config = SearchSpace(input_size=3, hidden_size=tune.randint(16, 128), data_dir=".")
config


SearchSpace(input_size=3, hidden_size=<ray.tune.sample.Integer object at 0x125a12dc0>, output_size=20, tune_dir='.', data_dir=PosixPath('.'))

And a `tune.randint` will work.

But a `tune.uniform` fails! Exactly what we need!

In [28]:
try:
    config = SearchSpace(input_size=3, hidden_size=tune.uniform(0.0, 0.5), data_dir=".")
except Exception as e:
    print(e)


2 validation errors for SearchSpace
hidden_size
  value is not a valid integer (type=type_error.integer)
hidden_size
  instance of Integer expected (type=type_error.arbitrary_type; expected_arbitrary_type=Integer)
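
The two errors correspond to the two members of the `Union`: the value is first tried as an `int`, then as an instance of the arbitrary type, and the model only fails when both attempts fail. A rough standard-library sketch of that logic follows. It is an illustration under simplifying assumptions, not pydantic's actual implementation: the `Integer` and `Float` classes are empty stand-ins for the ray types, `check_hidden_size` is a made-up name, and the coercion of values like `"3"` is left out.

```python
class Integer:
    """Stand-in for ray.tune.sample.Integer (not the real class)."""


class Float:
    """Stand-in for ray.tune.sample.Float (not the real class)."""


def check_hidden_size(value):
    """Mimic Union[int, Integer]: try each member, collect one error per failure."""
    errors = []

    # First member of the Union: a plain integer
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    errors.append("value is not a valid integer")

    # Second member of the Union: an isinstance check on the arbitrary type
    if isinstance(value, Integer):
        return value
    errors.append("instance of Integer expected")

    raise ValueError("\n".join(errors))


print(check_hidden_size(32))

try:
    check_hidden_size(Float())
except ValueError as e:
    print(e)
```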


Without `arbitrary_types_allowed`, pydantic won't know how to check for `SAMPLE_INT`.
Alternatively, you can write your own validator for a class: implement a `__get_validators__` method
that yields one or more validators. You can find more on that in the [documentation](https://pydantic-docs.helpmanual.io/usage/types/#custom-data-types).


In [29]:
class SampleFloat:
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v):
        # we check if the value v is actually a ray.tune.sample.Float instance
        if not isinstance(v, ray.tune.sample.Float):
            raise TypeError(f"{ray.tune.sample.Float} required, found {type(v)}")
        return v


This is just a simple type check, but you can imagine more complex checks (e.g. for phone numbers).

In [30]:
class SearchSpace(BaseModel):
    dropout: SampleFloat


try:
    config = SearchSpace(dropout=tune.randint(16, 32))
except Exception as e:
    print(e)


1 validation error for SearchSpace
dropout
  <class 'ray.tune.sample.Float'> required, found <class 'ray.tune.sample.Integer'> (type=type_error)


However, in our case, it does not add anything more than we already had with arbitrary types.

In [31]:
SAMPLE_INT = ray.tune.sample.Integer
SAMPLE_FLOAT = ray.tune.sample.Float


class SearchSpace(BaseModel):
    input_size: int
    hidden_size: Union[int, SAMPLE_INT]
    dropout: Union[float, SAMPLE_FLOAT]
    num_layers: Union[int, SAMPLE_INT]
    output_size: int
    tune_dir: Optional[Path]
    data_dir: Path

    class Config:
        arbitrary_types_allowed = True


config = SearchSpace(
    input_size=3,
    hidden_size=tune.randint(16, 128),
    dropout=tune.uniform(0.0, 0.3),
    num_layers=2,
    output_size=20,
    data_dir=".",
)
config


SearchSpace(input_size=3, hidden_size=<ray.tune.sample.Integer object at 0x125a4ecd0>, dropout=<ray.tune.sample.Float object at 0x125a4e280>, num_layers=2, output_size=20, tune_dir=None, data_dir=PosixPath('.'))

But what if we want to protect against adding non-existing paths? Right now, a path that does not exist is accepted without complaint:

In [32]:
data_dir = Path("data/a/b").absolute()
data_dir.exists(), data_dir


(False, PosixPath('/Users/rgrouls/code/ML22/notebooks/0-baseline/data/a/b'))

In [33]:
config = SearchSpace(
    input_size=3,
    hidden_size=32,
    dropout=0.1,
    num_layers=2,
    output_size=20,
    data_dir=data_dir,
)
config


SearchSpace(input_size=3, hidden_size=32, dropout=0.1, num_layers=2, output_size=20, tune_dir=None, data_dir=PosixPath('/Users/rgrouls/code/ML22/notebooks/0-baseline/data/a/b'))

We can add a `root_validator` to run an additional check on all values whenever the model is created.

In [34]:
class SearchSpace(BaseModel):

    input_size: int
    hidden_size: Union[int, SAMPLE_INT] = tune.randint(16, 128)
    dropout: Union[float, SAMPLE_FLOAT] = tune.uniform(0.0, 0.3)
    num_layers: Union[int, SAMPLE_INT] = tune.randint(2, 5)
    output_size: int
    tune_dir: Optional[Path]
    data_dir: Path

    class Config:
        arbitrary_types_allowed = True

    @root_validator
    def check_path(cls, values: Dict) -> Dict:  # noqa: N805
        datadir = values.get("data_dir")
        # data_dir is missing from values if its own validation already failed
        if datadir is not None and not datadir.exists():
            raise FileNotFoundError(
                f"Make sure the datadir exists.\n Found {datadir} to be non-existing."
            )
        return values


try:
    config = SearchSpace(
        input_size=3,
        hidden_size=32,
        dropout=0.1,
        num_layers=2,
        output_size=20,
        data_dir=data_dir,
    )
except FileNotFoundError as e:
    print(e)


Make sure the datadir exists.
 Found /Users/rgrouls/code/ML22/notebooks/0-baseline/data/a/b to be non-existing.


This can really save you a lot of headaches!

A last trick is to use inheritance. We can make a base class, inherit all its fields and validators, and just add the additional stuff specific to our model.
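
The fail-early idea itself is not tied to pydantic. As a point of comparison, here is a minimal sketch of the same existence check using a standard-library dataclass and `__post_init__` (a different technique from `root_validator`; the class name `PathSettings` is hypothetical). A temporary directory is used so the example does not depend on any particular folder layout.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PathSettings:  # hypothetical name, for illustration only
    data_dir: Path

    def __post_init__(self) -> None:
        # Dataclasses do not cast for us, so convert to a Path by hand
        self.data_dir = Path(self.data_dir)
        if not self.data_dir.exists():
            raise FileNotFoundError(
                f"Make sure the datadir exists.\n Found {self.data_dir} to be non-existing."
            )


with tempfile.TemporaryDirectory() as tmp:
    # An existing directory is accepted, even when passed as a string
    ok = PathSettings(data_dir=tmp)
    accepted = ok.data_dir.exists()

    # A non-existing subdirectory is rejected at construction time
    try:
        PathSettings(data_dir=Path(tmp) / "a" / "b")
        failed = False
    except FileNotFoundError as e:
        failed = True
        print(e)
```

The pydantic version gives you the casting and the other field checks on top of this, which is why it is the approach used in the rest of this notebook.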

In [35]:
class BaseSearchSpace(BaseModel):

    input_size: int
    output_size: int
    tune_dir: Optional[Path]
    data_dir: Path

    class Config:
        arbitrary_types_allowed = True

    @root_validator
    def check_path(cls, values: Dict) -> Dict:  # noqa: N805
        datadir = values.get("data_dir")
        # data_dir is missing from values if its own validation already failed
        if datadir is not None and not datadir.exists():
            raise FileNotFoundError(
                f"Make sure the datadir exists.\n Found {datadir} to be non-existing."
            )
        return values


class SearchSpace(BaseSearchSpace):
    hidden_size: Union[int, SAMPLE_INT] = tune.randint(16, 128)
    dropout: Union[float, SAMPLE_FLOAT] = tune.uniform(0.0, 0.3)
    num_layers: Union[int, SAMPLE_INT] = tune.randint(2, 5)


In [36]:
data_dir = Path("../../data/external/gestures-dataset").resolve()
config = SearchSpace(
    input_size=3,
    hidden_size=tune.randint(16, 128),
    dropout=0.1,
    num_layers=2,
    output_size=20,
    data_dir=data_dir,
)
config


SearchSpace(input_size=3, output_size=20, tune_dir=None, data_dir=PosixPath('/Users/rgrouls/code/ML22/data/external/gestures-dataset'), hidden_size=<ray.tune.sample.Integer object at 0x125a3a370>, dropout=0.1, num_layers=2)

We can access items like this:

In [37]:
config.data_dir


PosixPath('/Users/rgrouls/code/ML22/data/external/gestures-dataset')

We also get transformation into a dictionary for free:

In [38]:
config.dict()


{'input_size': 3,
 'output_size': 20,
 'tune_dir': None,
 'data_dir': PosixPath('/Users/rgrouls/code/ML22/data/external/gestures-dataset'),
 'hidden_size': <ray.tune.sample.Integer at 0x125a3a370>,
 'dropout': 0.1,
 'num_layers': 2}
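
One caveat when you want to save such a dictionary next to your experiment results: values like `PosixPath` are not JSON-serializable out of the box. A minimal sketch, assuming a hand-written dictionary that stands in for the output of `config.dict()` and using `json.dumps` with `default=str` as a fallback. Note that this fallback would also turn ray search-space objects into plain strings, so it is only suitable for logging, not for restoring a search space.

```python
import json
from pathlib import Path

# Hand-written stand-in for config.dict(); the values are illustrative
settings = {
    "input_size": 3,
    "output_size": 20,
    "tune_dir": None,
    "data_dir": Path("data") / "external",
    "dropout": 0.1,
}

# default=str is called for every object json cannot handle itself (here: the Path)
serialized = json.dumps(settings, default=str, indent=2)
print(serialized)

# Reading it back yields a plain string, so the Path has to be rebuilt by hand
restored = json.loads(serialized)
restored["data_dir"] = Path(restored["data_dir"])
```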