# Torch Implementation of Set Functions for Time Series


Base Model: Deep-Set Architecture

$$ f(\mathcal{S})=g\left(\frac{1}{|\mathcal{S}|} \sum_{s_{j} \in \mathcal{S}} h\left(s_{j}\right)\right) $$
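
The base model can be sketched in a few lines of PyTorch. This is a minimal illustration under assumptions, not the implementation developed further down: the layer sizes are hypothetical, and $h$ and $g$ are reduced to a single linear layer each. Because the mean does not depend on element order, permuting the set leaves the output unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
input_size, latent_size, output_size = 3, 8, 2

h = nn.Sequential(nn.Linear(input_size, latent_size), nn.ReLU())  # element encoder h
g = nn.Linear(latent_size, output_size)  # decoder g applied to the pooled embedding


def deep_set(S: torch.Tensor) -> torch.Tensor:
    """Map a set of shape (|S|, input_size) to a vector of shape (output_size,)."""
    return g(h(S).mean(dim=0))


S = torch.randn(5, input_size)
perm = torch.randperm(S.shape[0])

out = deep_set(S)
out_permuted = deep_set(S[perm])  # same set, different element order
```

Comparing `out` and `out_permuted` with `torch.allclose` is a quick sanity check for permutation invariance.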

Modification: Scaled Dot-Product Attention
Paper:

$$ K_{j, i}=\left[f^{\prime}(\mathcal{S}), s_{j}\right]^{T} W_{i}$$

where $f'$ is another deep-set model. 

$$ e_{j, i}=\frac{K_{j, i} \cdot Q_{i}}{\sqrt{d}} \quad \text { and } \quad a_{j, i}=\frac{\exp \left(e_{j, i}\right)}{\sum_{j} \exp \left(e_{j, i}\right)}$$

> For each head, we multiply the set element embeddings computed via the function $h$ with the attentions derived for the individual instances, i.e.

$$ r_{i}=\sum_{j} a_{j, i} h\left(s_{j}\right)$$

The final prediction is made by

$$  \hat{y} = g\Big(\sum_{s∈S} a(S, s) h(s)\Big) $$ 

#### Notes

- $g$ and $h$ are usually just MLPs, $f'$ is a DeepSet
- $m$ is the number of heads
- $W$, $Q$ are learnable. $Q$ is initialized with zeros
- $W_i$ has shape $(\dim(f')+\dim(s), d)$
- $Q$ has shape $(m, d)$
- $K$ has shape $(|S|, d)$
- $E$ has shape $(|S|, m)$
- $e_i$ is a vector of size $|S|$
- $a_i$ is a vector of size $|S|$
- $a(S, s)$ is $(|S|, m)$
- $h(s)$ is $(d,)$
- $r= [r_1, …, r_m] = \sum_{s∈S} a(S, s) h(s)$ is of shape $(m,d)$
- Open question: the authors do not seem to specify a separate latent dimension for $h(s)$.
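
The shapes listed above can be checked with a small sketch of the attention step for a single set. Everything here is an assumption made for illustration: the sizes are hypothetical, `f_S` and `h_S` are random stand-ins for $f'(S)$ and $h(s_j)$, and the embedding size `dim_h` is kept separate from the key dimension `d`. With $Q$ initialized to zeros, all scores are zero, so every head starts with uniform attention and $r_i$ equals the mean of the $h(s_j)$.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes: |S|, dim(s), dim(f'), key dim d, heads m, dim(h(s)).
n, dim_s, dim_f, d, m, dim_h = 6, 3, 4, 5, 2, 7

S = torch.randn(n, dim_s)    # set elements s_j
f_S = torch.randn(dim_f)     # stand-in for the DeepSet output f'(S)
h_S = torch.randn(n, dim_h)  # stand-in for the embeddings h(s_j)

W = nn.Parameter(torch.randn(m, dim_f + dim_s, d))  # one W_i per head
Q = nn.Parameter(torch.zeros(m, d))                 # zero-initialized queries

# [f'(S), s_j] for every element: shape (n, dim_f + dim_s).
combined = torch.cat([f_S.expand(n, -1), S], dim=-1)

K = torch.einsum("nc,mcd->nmd", combined, W)         # keys K_{j,i}: (n, m, d)
E = torch.einsum("nmd,md->nm", K, Q) / math.sqrt(d)  # scores e_{j,i}: (n, m)
A = torch.softmax(E, dim=0)                          # attention over the set: (n, m)
r = torch.einsum("nm,nk->mk", A, h_S)                # r_i = sum_j a_{j,i} h(s_j): (m, dim_h)
```

Once $Q$ moves away from zero during training, the heads can weight elements differently.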

## Simplified Equations

Rename: $h = ϕ$, $f' = ρ∘∑∘ψ$

$$ a_{i} = \operatorname{softmax}(e_i) = \sigma(e_i), \quad \text{i.e.} \quad a_{j,i} = \sigma(e_i)_j $$

$$ e_{j,i} = \frac{1}{\sqrt{d}}K_{j, i}\cdot Q_{i} = \frac{1}{\sqrt{d}}\left[ρ\Big(\sum_{s∈\mathcal{S}} ψ(s)\Big), s_{j}\right]^{T} W_{i}\cdot Q_{i} $$


In [3]:
import torch
from torch import Tensor, jit, nn
from typing import Final, Literal

from einops.layers.torch import Rearrange, Reduce
from torchinfo import summary

In [None]:
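import copy
from typing import Any, Dict


class Repeat(nn.Sequential):
    """Apply several copies of a module in sequence."""

    DEFAULT_HP: dict = {
        "__name__": __qualname__,  # type: ignore[name-defined]
        "__module__": __module__,  # type: ignore[name-defined]
        "module": None,
        "copies": 1,
        "independent": True,
    }

    HP: Dict[str, Any]
    """The hyperparameters of this module."""

    def __init__(self, **HP: Any) -> None:
        # Merge the user-supplied hyperparameters over the defaults.
        self.HP = self.DEFAULT_HP | HP
        HP = self.HP
        copies: list[nn.Module] = []

        for _ in range(HP["copies"]):
            if isinstance(HP["module"], nn.Module):
                module = HP["module"]
                if HP["independent"]:
                    # Deep-copy so that independent copies do not share parameters.
                    module = copy.deepcopy(module)
            else:
                # NOTE: `initialize_from_config` is not defined in this notebook.
                module = initialize_from_config(HP["module"])

            if HP["independent"]:
                copies.append(module)
            else:
                # Shared weights: reuse one instance at every position.
                copies = [module] * HP["copies"]
                break

        HP["module"] = str(HP["module"])
        super().__init__(*copies)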

In [51]:
class MLP(nn.Sequential):
    """MLP with `num_layers` hidden layers followed by a linear output layer."""

    def __init__(self, input_size: int, output_size: int, num_layers: int):
        layers: list[nn.Module] = []

        # Hidden layers keep the input width and are each followed by a ReLU.
        for _ in range(num_layers):
            layer = nn.Linear(input_size, input_size)
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            # `bias[None]` is a 2-D view of the bias, which the initializer requires.
            nn.init.kaiming_normal_(layer.bias[None], nonlinearity="relu")
            layers.append(layer)
            layers.append(nn.ReLU())

        # Output projection without an activation.
        layer = nn.Linear(input_size, output_size)
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.kaiming_normal_(layer.bias[None], nonlinearity="relu")
        layers.append(layer)
        super().__init__(*layers)


summary(MLP(3, 4, 2))

In [52]:
class DeepSet(nn.Module):
    """Signature: `[..., V, K] -> [..., D]`."""

    def __init__(
        self,
        input_size: int,
        latent_size: int,
        output_size: int,
        encoder_layers: int = 2,
        decoder_layers: int = 2,
        aggregation: Literal["min", "max", "sum", "mean", "prod"] = "sum",
    ):
        super().__init__()
        self.encoder = MLP(input_size, latent_size, encoder_layers)
        # Reduce over the set axis `j`, keeping the feature axis `k`.
        self.aggregation = Reduce("... j k -> ... k", aggregation)
        self.decoder = MLP(latent_size, output_size, decoder_layers)

    def forward(self, x: Tensor) -> Tensor:
        """Encode each element, aggregate over the set axis, then decode."""
        return self.decoder(self.aggregation(self.encoder(x)))


summary(DeepSet(3, 4, 5))

In [37]:
class SetAttention(nn.Module):
    """Placeholder: not implemented yet."""

In [None]:
DeepSet(3, 4, 5).forward

In [23]:
reduce?

In [6]:
from einops import reduce

In [10]:
x = torch.randn(3, 4, 5)
torch.einsum("ijk -> ij", x)  # sums over the last axis

# Reduce over the second-to-last axis `j`; the result has shape (3, 5).
reduce(x, "... j k -> ... k", "mean")

In [None]:
scale = torch.tensor(1.23)
num_dim = 10
# Note the precedence: this evaluates (scale ** -k) / num_dim for k = 0, 2, ..., num_dim.
scale ** -torch.arange(0, num_dim + 2, 2) / num_dim

In [None]:
a = [0, 2, 4, 6, 8]
b = [1, 3, 5, 7, 9]

In [None]:
c = sum(zip(a, b), ())

In [None]:
c

In [None]:
# `c` is already a flat tuple of ints, so flattening it again raises a TypeError:
# sum(c, ())

In [None]:
import torch
from torch import Tensor, jit, nn

In [None]:
from typing import List


class SetFuncTS(nn.Module):
    def __init__(self, num_dim: int):
        super().__init__()
        self.num_dim = num_dim
        # TODO: the encoder, decoder and aggregator are not implemented yet.

    def forward(self, x: List[Tensor]) -> Tensor:
        """Signature: `[..., <var>, ]`.

        Takes a list of triplet-encoded tensors, stacks them along a new
        last axis and sums over that axis.
        """
        t = torch.stack(x, dim=-1)
        return torch.sum(t, dim=-1)

In [None]:
jit.script(SetFuncTS(num_dim=10))

In [None]:
def call(self, inputs, segment_ids, lengths, training=None):
    if training is None:
        training = tf.keras.backend.learning_phase()

    def dropout_attn(input_tensor):
        if self.attn_dropout > 0:
            mask = tf.random.uniform(tf.shape(input_tensor)[:-1]) < self.attn_dropout
            return input_tensor + tf.expand_dims(tf.cast(mask, tf.float32), -1) * -1e9
        else:
            return tf.identity(input_tensor)

    encoded = self.psi(inputs)
    agg = self.psi_aggregation(encoded, segment_ids)
    agg = self.rho(agg)
    agg_scattered = tf.gather_nd(agg, tf.expand_dims(segment_ids, -1))
    combined = tf.concat([inputs, agg_scattered], axis=-1)
    keys = tf.matmul(combined, self.W_k)
    keys = tf.stack(tf.split(keys, self.n_heads, -1), 1)
    keys = tf.expand_dims(keys, axis=2)
    # should have shape (el, heads, 1, dot_prod_dim)
    queries = tf.expand_dims(tf.expand_dims(self.W_q, -1), 0)
    # should have shape (1, heads, dot_prod_dim, 1)
    preattn = tf.matmul(keys, queries) / tf.sqrt(float(self.dot_prod_dim))
    preattn = tf.squeeze(preattn, -1)
    preattn = smart_cond(
        training, lambda: dropout_attn(preattn), lambda: tf.identity(preattn)
    )

    per_head_preattn = tf.unstack(preattn, axis=1)
    attentions = []
    for pre_attn in per_head_preattn:
        attentions.append(segment_softmax(pre_attn, segment_ids))
    return attentions
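
The last step above normalizes the scores separately within each set, identified by `segment_ids`. A PyTorch counterpart could look like the sketch below. It is an assumption-laden illustration and not the helper called in the TensorFlow code: the function name, the 1-D input and the explicit `num_segments` argument are choices made here, and `scatter_reduce` requires a reasonably recent PyTorch release.

```python
import torch
from torch import Tensor


def segment_softmax(scores: Tensor, segment_ids: Tensor, num_segments: int) -> Tensor:
    """Softmax over the entries of `scores` that share a segment id.

    scores: shape (N,); segment_ids: shape (N,), integers in [0, num_segments).
    """
    # Per-segment maximum, subtracted for numerical stability.
    seg_max = torch.full(
        (num_segments,), float("-inf"), dtype=scores.dtype, device=scores.device
    )
    seg_max = seg_max.scatter_reduce(0, segment_ids, scores, reduce="amax")
    exp = torch.exp(scores - seg_max[segment_ids])

    # Per-segment sum of exponentials, gathered back to every element.
    seg_sum = torch.zeros(num_segments, dtype=scores.dtype, device=scores.device)
    seg_sum = seg_sum.index_add(0, segment_ids, exp)
    return exp / seg_sum[segment_ids]


# Two sets flattened into one vector: three elements, then two elements.
scores = torch.tensor([0.0, 1.0, 2.0, 0.5, 0.5])
segment_ids = torch.tensor([0, 0, 0, 1, 1])
attn = segment_softmax(scores, segment_ids, num_segments=2)
```

Each segment of `attn` sums to one, matching an ordinary softmax applied to that set alone.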

```
>>>>>> input_shapes:                    [(16, 8), (16, 15009, 1), (16, 15009, 1), (16, 15009), (16,)]
>>>>>> lengths:                         (16,)
>>>>>> max length |S|:                  15009
>>>>>> sum lengths ∑|S|:                238416
>>>>>> transformed_times:               (16, 15009, 4)
>>>>>> transformed_measurements:        (16, 15009, 24)
>>>>>> combined_values:                 (16, 15009, 29)
>>>>>> demo_encoded:                    (16, 29)
>>>>>> combined_with_demo:              (16, 15010, 29)
>>>>>> mask:                            (16, 15010)
>>>>>> collected_values S:              (238432, 29)
>>>>>> encoded ϕ = h(s):                (238432, 256)
>>>>>> encoded ψ = f'(S):               (238432, 128)
>>>>>> agg ψ:                           (16, 128)
>>>>>> agg ρ:                           (16, 128)
>>>>>> combined [f(S),s]:               (238432, 157)
>>>>>> keys [f(S),s]ᵀW:                 (238432, 4, 1, 128)
>>>>>> preattn eᵢⱼ= KQ/√d:              (238432, 4, 1, 128)
>>>>>> attentions a(S):                 (4, 238432, 1)
>>>>>> weighted_values:                 (4, 238432, 256)
>>>>>> weighted_values a(S,s)h(s):      (238432, 1024)
>>>>>> aggregated_values ∑a(S,s)h(s):   (16, 1024)
>>>>>> output_values g(∑a(S,s)h(s)):    (16, 1)
```
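
The step from `combined_with_demo` and `mask` to `collected_values` turns a padded batch into one flat list of set elements. The sketch below shows one way this could be done in PyTorch, assuming the flattening amounts to boolean masking. The tensor sizes are hypothetical and much smaller than those in the log.

```python
import torch

torch.manual_seed(0)

# Hypothetical small batch: 3 sets, padded to length 4, 2 features each.
lengths = torch.tensor([4, 2, 3])
padded = torch.randn(3, 4, 2)

# mask[b, j] is True where position j holds a real element of set b.
mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]

# Boolean indexing drops the padding and concatenates all sets,
# giving shape (sum(lengths), 2).
collected = padded[mask]

# segment_ids[n] records which set the n-th collected element came from.
segment_ids = torch.repeat_interleave(torch.arange(len(lengths)), lengths)
```

The resulting `segment_ids` is what a per-set aggregation or softmax needs to tell the sets apart after flattening.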