In [1]:
import mlflow
import os
import torch

from torch import nn
from torch import optim
from torch.nn.utils.rnn import pad_sequence

In [2]:
# create an mlflow.db file
! touch mlflow.db

In [3]:
os.environ['MLFLOW_TRACKING_URI'] = 'sqlite:///mlflow.db'

## Thoughts and assumptions
In this task I had to create a network that computes the $L1$ norm of a variable-length input sequence of real-valued numbers without using the $L1$ norm explicitly. I could use dense layers, ReLU activations, the negation operation, and the sum and multiplication operations. Manually initializing a specific weight or set of weights was not allowed.

I needed to split this problem into multiple 'sub-problems':
1. Handle the variable-length input sequence.
2. Handle values of different signs.
3. Build a trainable model which would predict/compute the $L1$ norm of the input.  

=====================================================================================================================================

1. The first point was relatively easy and was hinted at in the task description: use an RNN architecture.
2. Point number two needed a bit more thinking, not because it was hard to implement, but because I wanted to understand why such an operation is required.  
The output of the RNN cell can be described using the following equation:  
<center>$h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})$</center>  
If the input $x$ changes its sign, the term $W_{ih} x$ changes sign as well, so a negative input pulls the pre-activation value in the opposite direction to a positive input of the same magnitude. Negation is allowed, so I flip the sign wherever $x < 0$. The goal is to ensure that all inputs are non-negative before summing, which effectively simulates the absolute value operation.  
The other thing that requires attention is the activation function itself. The range of $\tanh$ is $(-1, 1)$, so it squashes the output, which is not the behaviour we need. I therefore replace it with ReLU, $\mathrm{ReLU}(x) = (x)^{+}$, which is linear for non-negative input and so preserves positive values as they are. The equation then becomes  
<center>$h' = \mathrm{ReLU}(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})$</center>  
3. Here I needed to establish what I wanted to achieve. The RNN cell outputs the scaled previous output plus the scaled current input, plus the biases. This led me to conclude that a feasible solution is an RNN cell whose weights are identity matrices (or just 1 in our scalar case: $W_{ih} = 1$ and $W_{hh} = 1$) and whose biases are 0 ($b_{ih} = 0$ and $b_{hh} = 0$). I cannot set these values manually, but at least I know what the training should converge to.
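
As a sanity check on this target, the recurrence can be unrolled by hand in plain Python. The names `relu` and `l1_via_recurrence` and the sample values are made up for illustration; this is a minimal sketch of the scalar case with zero bias, not the trained model.

```python
def relu(v: float) -> float:
    """ReLU for a single scalar."""
    return max(v, 0.0)


def l1_via_recurrence(xs, w_ih: float = 1.0, w_hh: float = 1.0) -> float:
    """Unroll h' = ReLU(w_ih * x + w_hh * h) with zero bias over a sequence."""
    h = 0.0  # initial hidden state
    for x in xs:
        x = -x if x < 0 else x  # negate negative inputs
        h = relu(w_ih * x + w_hh * h)
    return h


sample = [1.0, -2.0, 3.0, -0.5]
print(l1_via_recurrence(sample))            # 6.5, the L1 norm of the sample
print(l1_via_recurrence(sample, w_hh=0.5))  # 2.625, earlier inputs are discounted
```

With both weights equal to 1 the hidden state is a running sum of the non-negative inputs, which is the $L1$ norm; any other value of $W_{hh}$ rescales the earlier terms.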

In [4]:
# for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x13ee5ff70>

In [5]:
k = torch.tensor([1, -1, 2, -2, 3, -3])
print(f"Before: {k}\nAfter: {torch.neg(k)}")
print("Convert only negative values")
print(f"{torch.where(k < 0, torch.neg(k), k)}")
torch.sqrt(k**2)

Before: tensor([ 1, -1,  2, -2,  3, -3])
After: tensor([-1,  1, -2,  2, -3,  3])
Convert only negative values
tensor([1, 1, 2, 2, 3, 3])


tensor([1., 1., 2., 2., 3., 3.])
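
A third option, shown here only as a sketch, builds the absolute value from ReLU, negation and a sum alone, using the identity $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$. The variable name `abs_via_relu` is made up for this example, and the tensor is created as float so the result is easy to compare with the output above.

```python
import torch

k = torch.tensor([1.0, -1.0, 2.0, -2.0, 3.0, -3.0])

# For x >= 0 the first term is x and the second is 0;
# for x < 0 the first term is 0 and the second is -x.
abs_via_relu = torch.relu(k) + torch.relu(torch.neg(k))
print(abs_via_relu)
```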

In [6]:
class CustomRNNCell(nn.Module):
    """Custom RNN cell 
    
    Custom RNN cell which, for a given input, returns its positive value summed
    with the information carried along.
    """

    def __init__(self, input_size: int = 1, hidden_size: int = 1):
        """Initialization method

        :param input_size: input size; number of values in a single input
        :type input_size: int
        :param hidden_size: number of features in the hidden layer of our RNN cell
        :type hidden_size: int
        """
        super(CustomRNNCell, self).__init__()

        self._rnn_cell = nn.RNNCell(
            input_size, 
            hidden_size,
            bias=False,  # no bias: the target solution has zero bias anyway
            nonlinearity="relu"
        )

    def forward(self, x: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        """Forward pass

        :param x: input tensor
        :type x: torch.Tensor
        :param hidden: output from the previous iteration
        :type hidden: torch.Tensor
        :return: processed tensor
        :rtype: torch.Tensor
        """

        # transformation to input x
        x = torch.where(x < 0, torch.neg(x), x)
        # another option would be to square it and then take the square root
 
        # pass through the cell
        hidden = self._rnn_cell(x, hidden)

        return hidden

In [7]:
class RNN(nn.Module):
    """RNN computing the L1 norm of the input sequence"""

    def __init__(self, input_size: int = 1, hidden_size: int = 1):
        """Initialization

        :param input_size: input size - in our case it will be one
        but other options are also covered
        :type input_size: int
        :param hidden_size: number of features in the hidden state
        :type hidden_size: int
        """

        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = CustomRNNCell(input_size, hidden_size)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass

        :param x: input tensor
        :type x: torch.Tensor
        :return: computed norm
        :rtype: torch.Tensor
        """

        batch_size = x.size(0)
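        # start from a zero hidden state on the same device and dtype as the input
        hidden = torch.zeros(batch_size, self.hidden_size, dtype=x.dtype, device=x.device)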

        # Iterate over time steps
        for t in range(x.size(1)):
            current_input = x[:, t, :]
            hidden = self.rnn_cell(current_input, hidden)
        
        return hidden

In [8]:
def generate_data(batch_size: int, max_length: int) -> tuple[list[torch.Tensor], list[torch.Tensor]]:
    """Generate random sequences and their L1 norms

    :param batch_size: batch size
    :type batch_size: int
    :param max_length: max vector length
    :type max_length: int
    :return: random sequences of shape (length, 1) and their L1 norms of shape (1,)
    :rtype: tuple[list[torch.Tensor], list[torch.Tensor]]
    """

    sequences = []
    targets = []
    for _ in range(batch_size):
        length = torch.randint(1, max_length + 1, (1,)).item() # get random length
        seq = torch.randn(length, 1)  # Random sequence of 'length'
        l1_norm = torch.sum(torch.abs(seq))  # Compute the L1 norm
        sequences.append(seq)
        targets.append(torch.tensor([l1_norm], dtype=torch.float32))
    return sequences, targets
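
The sequences returned by `generate_data` have different lengths, so they need to be padded before they can be stacked into one batch. The sketch below uses two hand-written sequences (made up for this example) to show that zero padding with `pad_sequence` leaves the $L1$ targets unchanged.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two illustrative sequences of lengths 2 and 4, each of shape (length, 1)
seqs = [
    torch.tensor([[1.0], [-2.0]]),
    torch.tensor([[3.0], [-1.0], [0.5], [-4.0]]),
]

# Shorter sequences are filled with zeros at the end
padded = pad_sequence(seqs, batch_first=True)
print(padded.shape)  # torch.Size([2, 4, 1])

# Zeros contribute nothing to the L1 norm, so the targets do not change
l1_unpadded = torch.stack([s.abs().sum() for s in seqs])
l1_padded = padded.abs().sum(dim=(1, 2))
print(l1_unpadded, l1_padded)
```

One thing to keep in mind: with no bias, a padded zero step computes $\mathrm{ReLU}(W_{hh} h)$, so trailing padding leaves the hidden state untouched only when $W_{hh} = 1$.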

In [10]:
# Create the RNN model
input_size = 1  # Each element is a scalar
hidden_size = 1  # Output is a scalar
model = RNN(input_size=input_size, hidden_size=hidden_size)

In [11]:
print(f"Whh = {model.rnn_cell._rnn_cell.weight_hh}")
print(f"Wih = {model.rnn_cell._rnn_cell.weight_ih}")
print(f"bhh = {model.rnn_cell._rnn_cell.bias_hh}")
print(f"bih = {model.rnn_cell._rnn_cell.bias_ih}")

print(f"sum of coefficients: {sum(param.item() for param in model.parameters())}")

Whh = Parameter containing:
tensor([[0.8300]], requires_grad=True)
Wih = Parameter containing:
tensor([[0.7645]], requires_grad=True)
bhh = None
bih = None
sum of coefficients: 1.5945464372634888


In [12]:
# train till w's are 1

In [13]:
# Training setup
def train_model(model, batch_size, max_length, learning_rate=0.001):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.MSELoss()  # Mean Squared Error loss

    # Start an MLflow run
    with mlflow.start_run():
        # Log hyperparameters
        mlflow.log_param("batch_size", batch_size)
        mlflow.log_param("max_length", max_length)
        mlflow.log_param("learning_rate", learning_rate)
        
        epoch = 0
        do_train = True
        while do_train:
            # Generate a batch of random sequences and their L1 norms
            sequences, targets = generate_data(
                batch_size=batch_size,
                max_length=max_length
            )
            
            # Pad sequences to have a consistent batch size
            padded_sequences = pad_sequence(sequences, batch_first=True)
            targets = torch.cat(targets)
            
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(padded_sequences).squeeze(1)
            
            # Compute loss
            loss = criterion(outputs, targets)
            loss.backward()
            
            # Update weights
            optimizer.step()

            mlflow.log_metric("loss", loss.item(), step=epoch)

            if epoch % 100 == 0:
                print(f'Epoch {epoch}, Loss: {loss.item()}')

            # We know the target solution, so we stop training once both weights are exactly 1
            if model.rnn_cell._rnn_cell.weight_hh == 1 and model.rnn_cell._rnn_cell.weight_ih == 1:
                print(f"Finished training after {epoch} epochs")
                do_train = False
    
            if epoch > 5e4:
                do_train = False
                print("Could not converge")
            epoch += 1

        mlflow.log_param("epochs", epoch)

# Parameters
batch_size = 32
max_length = 10  # Maximum length of any sequence

# Track the experiment using MLflow
mlflow.set_experiment("RNN L1 Norm Calculation")

# Train the model
train_model(model, batch_size, max_length)

# Test the model on a new sequence
seq = torch.randn(1, 10000, 1)  # Random sequence of 'length'
l1_norm = torch.sum(torch.abs(seq))  # Compute the L1 norm

# Get the model's prediction
model.eval()
with torch.no_grad():
    predicted_sum = model(seq).item()

print(f"Predicted sum: {predicted_sum}, Actual sum: {l1_norm}")

2024/08/22 00:20:05 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2024/08/22 00:20:05 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

Epoch 0, Loss: 13.961002349853516
Epoch 100, Loss: 5.079987049102783
Epoch 200, Loss: 0.028426308184862137
Epoch 300, Loss: 0.0037581315264105797
Epoch 400, Loss: 0.003718046937137842
Epoch 500, Loss: 0.005024059675633907
Epoch 600, Loss: 0.0016038696048781276
Epoch 700, Loss: 0.0031264815479516983
Epoch 800, Loss: 0.0065759047865867615
Epoch 900, Loss: 0.0038020440842956305
Epoch 1000, Loss: 0.0029737164732068777
Epoch 1100, Loss: 0.003434132318943739
Epoch 1200, Loss: 0.0027860780246555805
Epoch 1300, Loss: 0.0015386532759293914
Epoch 1400, Loss: 0.0013363079633563757
Epoch 1500, Loss: 0.0021456400863826275
Epoch 1600, Loss: 0.001322242897003889
Epoch 1700, Loss: 0.0013623482082039118
Epoch 1800, Loss: 0.001252010464668274
Epoch 1900, Loss: 0.0009346791193820536
Epoch 2000, Loss: 0.0009686516132205725
Epoch 2100, Loss: 0.0007364969933405519
Epoch 2200, Loss: 0.0004843974602408707
Epoch 2300, Loss: 0.00021630508126690984
Epoch 2400, Loss: 0.00020896112255286425
Epoch 2500, Loss: 0.000

In [14]:
print(f"Whh = {model.rnn_cell._rnn_cell.weight_hh}")
print(f"Wih = {model.rnn_cell._rnn_cell.weight_ih}")
print(f"bhh = {model.rnn_cell._rnn_cell.bias_hh}")
print(f"bih = {model.rnn_cell._rnn_cell.bias_ih}")

Whh = Parameter containing:
tensor([[1.]], requires_grad=True)
Wih = Parameter containing:
tensor([[1.]], requires_grad=True)
bhh = None
bih = None
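
The stopping rule in `train_model` compares both weights to 1 with `==`, which worked in this run but relies on the floats landing exactly on 1. A tolerance-based check is a possible alternative. The helper name `weights_close_to_one` and the tolerance `atol=1e-4` are assumptions made for this sketch, and it assumes the $1 \times 1$ weights used in this notebook.

```python
import torch
from torch import nn


def weights_close_to_one(cell: nn.RNNCell, atol: float = 1e-4) -> bool:
    """Return True when both weights of a 1x1 RNN cell are within atol of 1."""
    w_hh = cell.weight_hh.detach()
    w_ih = cell.weight_ih.detach()
    return bool(
        torch.allclose(w_hh, torch.ones_like(w_hh), rtol=0.0, atol=atol)
        and torch.allclose(w_ih, torch.ones_like(w_ih), rtol=0.0, atol=atol)
    )


# Usage on a fresh cell with hand-set weights (for demonstration only)
demo_cell = nn.RNNCell(1, 1, bias=False, nonlinearity="relu")
with torch.no_grad():
    demo_cell.weight_hh.fill_(1.0)
    demo_cell.weight_ih.fill_(1.00001)
print(weights_close_to_one(demo_cell))
```

Inside the training loop this could replace the exact comparison, at the cost of stopping slightly before the weights reach exactly 1.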


## Observations

I ran multiple experiments to check how the signs of the initial parameters affect convergence.

### No bias, just weights
1. both weights are positive, then model converges
2. hh positive, ih negative - not converging
3. hh negative, ih positive - not converging
4. hh negative but very small, ih positive - converging
5. both weights negative - not converging


### Adding bias
1. wh negative, wih positive, bhh positive, bih positive - converges but after a lot of iterations
2. wh negative, wih positive, bhh positive, bih negative - sum of coeffs is negative, converges
3. wh negative, wih positive, bhh negative, bih negative - sum of coeffs is negative, converges

4. wh positive, wih positive (very small), bhh positive, bih negative - converges
5. wh positive, wih positive (very small), bhh negative, bih positive - sum of coeffs is positive, converges very fast
6. wh positive, wih positive (very small), bhh negative, bih negative - sum of coeffs is positive, converges

7. wh positive, wih negative, bhh positive, bih negative - converges
8. wh positive, wih negative, bhh negative, bih positive - not converging (sum of coefficients is negative)
9. wh positive, wih negative, bhh positive, bih positive - sum of coeffs is positive, converges

10. wh negative, wih negative (very small), bhh negative, bih negative - not converging
11. wh negative, wih negative (very small), bhh positive, bih negative - converges
12. wh negative, wih negative, bhh positive, bih positive - sum of coefficients is negative, converges

## Conclusion and final thoughts
The weights should ideally converge to 1 (the identity matrix in the general case), since we are essentially adding each input (or its negation) to the accumulated sum.
ReLU outputs zero for any input that is negative or zero. If the weights are initialized with negative values, the pre-activation can become negative even though the inputs are non-negative, and ReLU clamps it to zero. In that case we face information loss (ReLU effectively discards negative signals) and gradient issues (if ReLU outputs 0 in the forward pass, its gradient in the backward pass is also 0). One way to address this is to make sure that the initialized weights are positive, or to use an activation function that allows a small, non-zero output for negative inputs. Either option helps convergence by avoiding zero gradients.
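
The zero-gradient effect can be reproduced on a single step. The sketch below is not the notebook's model: the helper `grad_of_w_ih`, the weight values and the target of 2.0 are chosen for illustration, and leaky ReLU stands in for "an activation with a small non-zero output for negative inputs".

```python
import torch
import torch.nn.functional as F


def grad_of_w_ih(activation) -> float:
    """Gradient of a squared-error loss w.r.t. a negative input weight after one step."""
    x = torch.tensor([[2.0]])  # non-negative input, as after the negation step
    h = torch.zeros(1, 1)      # initial hidden state
    w_ih = torch.tensor([[-0.5]], requires_grad=True)  # negative initial weight
    w_hh = torch.tensor([[0.8]], requires_grad=True)

    out = activation(x @ w_ih.T + h @ w_hh.T)  # pre-activation is -1.0
    loss = (out - 2.0).pow(2).sum()
    loss.backward()
    return w_ih.grad.item()


relu_grad = grad_of_w_ih(torch.relu)     # ReLU clamps -1.0 to 0, so no gradient flows
leaky_grad = grad_of_w_ih(F.leaky_relu)  # default slope 0.01 lets a small gradient through
print(relu_grad, leaky_grad)
```

With ReLU the weight receives no update signal from this step, while the leaky variant yields a small negative gradient that pushes the weight upward.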