# Some thoughts on data formats

## Scenario file

The scenario file consists of the following 'tables':

* one for each heterogeneous Markov chain (size currently around 9k, but rising exponentially with the number of activities); there will be one chain for each person type
* one for all dwellings with ~100k entries, including all thermal traits, a geolocation, and an ID linking to UKBuildings
* one for all people with ~250k entries, mainly a link to a dwelling and a link to a Markov chain
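
As a rough illustration of the dwelling and people tables, the following sketch lays them out as NumPy structured arrays. All field names and field types are assumptions made for illustration only; the notes above fix neither the thermal traits nor the ID types.

```python
import numpy as np

# Hypothetical layout: the notes only say that dwellings carry thermal traits,
# a geolocation and a UKBuildings ID, and that people link to a dwelling and
# to a Markov chain. Every field name and type below is an assumption.
dwelling_dtype = np.dtype([
    ('ukbuildings_id', np.int64),
    ('longitude', np.float64),
    ('latitude', np.float64),
    ('thermal_mass', np.float32),           # assumed thermal trait
    ('heat_loss_coefficient', np.float32),  # assumed thermal trait
])
person_dtype = np.dtype([
    ('dwelling_index', np.int32),      # row of the person's dwelling
    ('markov_chain_index', np.int16),  # row of the person type's chain
])

dwellings = np.zeros(100000, dtype=dwelling_dtype)
people = np.zeros(250000, dtype=person_dtype)

scenario_bytes = dwellings.nbytes + people.nbytes
print("{:.1f} MB for dwellings and people".format(scenario_bytes / 1024 / 1024))
```

Under these assumptions the two tables take only a few megabytes, so the scenario file is small compared to the simulation results discussed below.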

## Simulation results

Working assumptions:

* all simulation results fit in memory, which increases the number of possible solutions and is a simplification
* results aren't compressed in memory

In [None]:
from collections import namedtuple
from enum import Enum

import numpy as np

In [None]:
Variable = namedtuple('Variable', ['name', 'dtype', 'domain'])


class Domain(Enum):
    """Entity type a variable is defined for; the value is the number of entities."""
    DWELLING = 101955
    RESIDENT = 254926

    def __init__(self, number):
        self.number = number


def analyse_size_of_result_in_giga_bytes(variables, number_time_steps):
    """Print the uncompressed in-memory size of all results, split by domain."""
    print("{} steps".format(number_time_steps))
    print(", ".join("{} ({})".format(var.name, var.dtype) for var in variables))
    print("")
    # size of one variable = bytes per value * number of entities * number of time steps
    bytesize_dwellings = sum(var.dtype.itemsize * var.domain.number * number_time_steps
                             for var in variables if var.domain == Domain.DWELLING)
    bytesize_residents = sum(var.dtype.itemsize * var.domain.number * number_time_steps
                             for var in variables if var.domain == Domain.RESIDENT)
    print("{:.2f} GB necessary for residents".format(bytes_to_giga_bytes(bytesize_residents)))
    print("{:.2f} GB necessary for dwellings".format(bytes_to_giga_bytes(bytesize_dwellings)))
    print("{:.2f} GB necessary in total".format(bytes_to_giga_bytes(bytesize_dwellings + bytesize_residents)))


def bytes_to_giga_bytes(bytesize):
    """Convert bytes to gigabytes of 1024**3 bytes each."""
    return bytesize / 1024 / 1024 / 1024

In [None]:
variables = [
    Variable('temperature', np.dtype(np.float32), Domain.DWELLING),
    Variable('thermal_power', np.dtype(np.float32), Domain.DWELLING),
    Variable('activity', np.dtype(np.int8), Domain.RESIDENT),
]

analyse_size_of_result_in_giga_bytes(variables, 8760 * 6)

Storing all data at 10-minute resolution is too much: roughly 52 GB in total, of which about 40 GB are for dwellings and about 12 GB for residents. Not only would it be difficult to keep all results in memory, it is most probably also more than we need.

Reducing the temporal resolution to 1 h might be a solution, even though it might be difficult to downscale people's activity. In any case, that would lead to:

In [None]:
variables = [
    Variable('temperature', np.dtype(np.float32), Domain.DWELLING),
    Variable('thermal_power', np.dtype(np.float32), Domain.DWELLING),
    Variable('activity', np.dtype(np.int8), Domain.RESIDENT),
]

analyse_size_of_result_in_giga_bytes(variables, 8760)

Which is doable (about 8.7 GB in total), but still a lot. Would it be possible to use 16-bit floats for some of the variables?

In [None]:
print(np.finfo(np.dtype(np.float32)))
print(np.finfo(np.dtype(np.float16)))

The resolution of 16-bit floats (about three significant decimal digits) should be enough for temperature and thermal power. Storing thermal power in kW ensures that values stay within the 16-bit range, whose largest finite value is 65504.
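
To make this more concrete, the sketch below prints the quantisation step of `float16` at a few magnitudes. The sample temperatures (in °C) and powers (in kW) are illustrative values chosen for this example, not simulation output.

```python
import numpy as np

# Illustrative magnitudes only (assumptions): indoor temperatures in degrees
# Celsius and thermal powers in kW.
temperatures = np.array([5.0, 20.0, 30.0], dtype=np.float16)
powers_kw = np.array([0.5, 3.0, 15.0], dtype=np.float16)

# np.spacing returns the gap to the next representable value, i.e. the
# quantisation step at that magnitude.
temperature_steps = np.spacing(temperatures)
power_steps = np.spacing(powers_kw)
print("temperature steps:", temperature_steps)
print("power steps:", power_steps)

# The largest finite float16 is 65504, so a power stored in W overflows to inf
# above roughly 65.5 kW, while the same power in kW is far from the limit.
with np.errstate(over='ignore'):
    power_in_watt = np.array([70000.0]).astype(np.float16)
print("70 kW stored in W:", power_in_watt[0])
```

Around 20 °C the step is about 0.016 K, and around 15 kW it is about 8 W. Whether that is acceptable depends on how the results are used afterwards.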

In [None]:
variables = [
    Variable('temperature', np.dtype(np.float16), Domain.DWELLING),
    Variable('thermal_power', np.dtype(np.float16), Domain.DWELLING),
    Variable('activity', np.dtype(np.int8), Domain.RESIDENT),
]

analyse_size_of_result_in_giga_bytes(variables, 8760)

This looks better (about 5.4 GB in total), but we must make sure that it is precise enough.
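
One way to check the precision is to round-trip a series through `float16` and look at the error. The sketch below does this for a synthetic year of hourly temperatures; the uniformly random data and the 10 to 30 °C range are assumptions standing in for real simulation results.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic hourly indoor temperatures in degrees Celsius; made-up data, not
# simulation output.
temperature = rng.uniform(10.0, 30.0, size=8760).astype(np.float32)

# Store as float16, read back as float32, and compare with the original.
stored = temperature.astype(np.float16)
error = np.abs(stored.astype(np.float32) - temperature)
print("max abs error: {:.4f} K".format(error.max()))

# Aggregate in a wider type so the sum neither overflows nor loses precision.
annual_sum = stored.sum(dtype=np.float32)
print("sum over the year: {:.1f}".format(annual_sum))
```

For values below 32 the rounding error stays within half a quantisation step, i.e. at most 2⁻⁷ ≈ 0.008 K. When aggregating `float16` results, for example into annual sums, it is probably wise to accumulate in a wider type as done above, because `float16` cannot represent values above 65504.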