# **Project Overview**

**Objective**

Develop a neural network that:

* Learns to translate mathematical expressions from infix notation (e.g., `a + b * c`) to postfix notation (e.g., `a b c * +`).
* Handles syntactic ambiguity using a data-driven method rather than rule-based parsing.
* Operates on symbolic sequences using encoder-decoder or autoregressive modeling.

# **Constraints**
* Input: infix expressions (with parentheses and operators).
* Output: correctly disambiguated postfix expressions.
* Maximum syntactic depth of expressions: 3
* Vocabulary: limited to symbols, operators, parentheses, and variables a–e.
* Model: ≤ 2 million parameters
* No beam search; only greedy autoregressive decoding
* Evaluation: prefix accuracy, not exact match

# **Overall Architecture**

A sequence-to-sequence model is needed:

* Encoder: Encodes infix expression
* Decoder: Generates postfix expression step by step

Use teacher forcing during training and greedy autoregressive decoding during inference.

# **Project Structure and Steps**

# Step 1: Dataset Creation


1. Constants & Vocabulary
2. Generate Infix Expression (limited to depth 3)
3. Tokenization
4. Infix to Postfix Conversion
5. Encoding & Decoding
6. Dataset Generator
7. Shifted Decoder Input (for teacher forcing)

**1.1. Constants & Vocabulary**

In [1]:
import numpy as np
import random

OPERATORS = ['+', '-', '*', '/']
IDENTIFIERS = list('abcde')
SPECIAL_TOKENS = ['PAD', 'SOS', 'EOS']
SYMBOLS = ['(', ')', '+', '-', '*', '/']
VOCAB = SPECIAL_TOKENS + SYMBOLS + IDENTIFIERS + ['JUNK']

token_to_id = {tok: i for i, tok in enumerate(VOCAB)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

VOCAB_SIZE = len(VOCAB)
PAD_ID = token_to_id['PAD']
SOS_ID = token_to_id['SOS']
EOS_ID = token_to_id['EOS']

**1.2. Generate Infix Expression**

In [2]:
def generate_infix_expression(max_depth):
    if max_depth == 0:
        return random.choice(IDENTIFIERS)
    elif random.random() < 0.5:
        return generate_infix_expression(max_depth - 1)
    else:
        left = generate_infix_expression(max_depth - 1)
        right = generate_infix_expression(max_depth - 1)
        op = random.choice(OPERATORS)
        return f'({left} {op} {right})'
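
Because the generator wraps every binary operation in parentheses, the parenthesis nesting level of a generated string equals its syntactic depth. The helper below is a minimal sketch for checking that a string stays within the depth limit; `max_paren_depth` is a hypothetical name and is not part of the notebook.

```python
def max_paren_depth(expr: str) -> int:
    """Return the deepest level of parenthesis nesting in an expression string."""
    depth = deepest = 0
    for ch in expr:
        if ch == '(':
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ')':
            depth -= 1
    return deepest

example = "( ( ( b + d ) + a ) - ( ( c / b ) + d ) )"
print(max_paren_depth(example))  # 3
```

A bare identifier such as `a` has depth 0, which the generator can also return when the recursion repeatedly takes its first branch.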

**1.3. Tokenization**

In [3]:
def tokenize(expr):
    return [c for c in expr if c in token_to_id]
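
The tokenizer works character by character and keeps only characters found in the vocabulary, so spaces are dropped and anything unknown disappears silently. The sketch below restates the vocabulary from Step 1.1 so it can run on its own; the second input is an illustrative string, not something the generator would produce.

```python
VOCAB = ['PAD', 'SOS', 'EOS', '(', ')', '+', '-', '*', '/',
         'a', 'b', 'c', 'd', 'e', 'JUNK']
token_to_id = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(expr):
    # Keep only single characters that are vocabulary entries
    return [c for c in expr if c in token_to_id]

print(tokenize('(a + b)'))      # ['(', 'a', '+', 'b', ')']
print(tokenize('(a + x) * 2'))  # ['(', 'a', '+', ')', '*']
```

The multi-character special tokens (`PAD`, `SOS`, `EOS`, `JUNK`) can never come out of this function, since it only ever tests single characters.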

**1.4. Infix to Postfix Conversion**

In [4]:
def infix_to_postfix(tokens):
    precedence = {'+': 1, '-': 1, '*': 2, '/': 2}
    output, stack = [], []
    for token in tokens:
        if token in IDENTIFIERS:
            output.append(token)
        elif token in OPERATORS:
            while stack and stack[-1] in OPERATORS and precedence[stack[-1]] >= precedence[token]:
                output.append(stack.pop())
            stack.append(token)
        elif token == '(':
            stack.append(token)
        elif token == ')':
            while stack and stack[-1] != '(':
                output.append(stack.pop())
            stack.pop()
    while stack:
        output.append(stack.pop())
    return output
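
One way to gain confidence in the conversion is to evaluate both forms with concrete values and compare the results. The sketch below does this for the expression pair shown in the sanity check of Step 1.8. `eval_postfix` and the variable values are assumptions made for illustration; `Fraction` keeps division exact, and `eval` is applied only to a trusted literal string.

```python
from fractions import Fraction

def eval_postfix(tokens, env):
    """Evaluate a postfix token list with a stack; env maps variable names to values."""
    ops = {
        '+': lambda x, y: x + y,
        '-': lambda x, y: x - y,
        '*': lambda x, y: x * y,
        '/': lambda x, y: x / y,
    }
    stack = []
    for tok in tokens:
        if tok in ops:
            right = stack.pop()   # the right operand was pushed last
            left = stack.pop()
            stack.append(ops[tok](left, right))
        else:
            stack.append(env[tok])
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]

# Arbitrary non-zero values for the variables a..e
env = {name: Fraction(value) for name, value in zip('abcde', (2, 3, 5, 7, 11))}
infix = "(((b + d) + a) - ((c / b) + d))"
postfix = "b d + a + c b / d + -".split()

infix_value = eval(infix, {"__builtins__": {}}, env)
postfix_value = eval_postfix(postfix, env)
print(infix_value, postfix_value)  # 10/3 10/3
```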

**1.5. Encoding & Decoding**

In [5]:
MAX_DEPTH = 3
MAX_LEN = 4 * 2**MAX_DEPTH - 2  # Longest infix sequence at MAX_DEPTH, including EOS (postfix is shorter)

def encode(tokens, max_len=MAX_LEN):
    ids = [token_to_id[t] for t in tokens] + [EOS_ID]
    return ids + [PAD_ID] * (max_len - len(ids))

def decode_sequence(token_ids, id_to_token, pad_token='PAD', eos_token='EOS'):
    tokens = []
    for token_id in token_ids:
        token = id_to_token.get(token_id, '?')
        if token == eos_token:
            break
        if token != pad_token:
            tokens.append(token)
    return ' '.join(tokens)
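
The value of `MAX_LEN` can be checked by counting tokens in the largest possible expression, a full binary tree of depth `MAX_DEPTH`. The sketch below does that recursively; `max_token_counts` is a hypothetical helper, not part of the notebook.

```python
def max_token_counts(depth):
    """Return (infix_tokens, postfix_tokens) for a full binary expression tree."""
    if depth == 0:
        return 1, 1  # a single identifier
    infix, postfix = max_token_counts(depth - 1)
    # infix:   '(' left op right ')'  -> two subtrees plus 3 extra tokens
    # postfix: left right op          -> two subtrees plus 1 extra token
    return 2 * infix + 3, 2 * postfix + 1

infix_max, postfix_max = max_token_counts(3)
print(infix_max + 1, postfix_max + 1)  # 30 16 (each including EOS)
```

At depth 3 the longest infix sequence has 29 tokens, so 30 with `EOS`, which matches `4 * 2**MAX_DEPTH - 2`. The longest postfix sequence needs only 16 positions, so targets always fit within the same padded length.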

**1.6. Dataset Generator**

In [6]:
def generate_dataset(n, max_depth=MAX_DEPTH):
    X, Y = [], []
    for _ in range(n):
        expr = generate_infix_expression(max_depth)
        infix = tokenize(expr)
        postfix = infix_to_postfix(infix)
        X.append(encode(infix))
        Y.append(encode(postfix))
    return np.array(X), np.array(Y)

**1.7. Shifted Decoder Input (for teacher forcing)**

In [7]:
def shift_right(seqs):
    shifted = np.zeros_like(seqs)
    shifted[:, 1:] = seqs[:, :-1]
    shifted[:, 0] = SOS_ID
    return shifted
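
To make the shift concrete, here is a small worked example on a single row, written with plain lists rather than NumPy. The token IDs follow the vocabulary order from Step 1.1 (`PAD`=0, `SOS`=1, `EOS`=2, `+`=5, `b`=10, `d`=12); `shift_right_row` is a hypothetical single-row counterpart of `shift_right`.

```python
PAD_ID, SOS_ID, EOS_ID = 0, 1, 2

def shift_right_row(row, sos_id=SOS_ID):
    """Prepend SOS and drop the last element so the length stays the same."""
    return [sos_id] + list(row[:-1])

target = [10, 12, 5, EOS_ID, PAD_ID, PAD_ID]   # "b d + EOS PAD PAD"
decoder_input = shift_right_row(target)
print(decoder_input)  # [1, 10, 12, 5, 2, 0]
```

At position `t` the decoder receives the target token from position `t - 1`, so during training it always conditions on the correct previous token rather than on its own prediction.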

**1.8. Example Usage**

In [12]:
# Create training and validation data
X_train, Y_train = generate_dataset(10000)
decoder_input_train = shift_right(Y_train)

X_val, Y_val = generate_dataset(1000)
decoder_input_val = shift_right(Y_val)

# Sanity check
i = np.random.randint(10000)
print("Example", i)
print("Infix  :", decode_sequence(X_train[i], id_to_token))
print("Postfix:", decode_sequence(Y_train[i], id_to_token))
print("Shifted:", decode_sequence(decoder_input_train[i], id_to_token))

Example 2825
Infix  : ( ( ( b + d ) + a ) - ( ( c / b ) + d ) )
Postfix: b d + a + c b / d + -
Shifted: SOS b d + a + c b / d + -


# Step 2: LSTM Encoder-Decoder Architecture
We will implement a simple sequence-to-sequence model using:

* An encoder (LSTM) that processes the infix sequence
* A decoder (LSTM) that generates postfix tokens autoregressively
* A shared embedding layer
* Dot-product attention over the encoder outputs, combined with the decoder output before the final projection

This architecture respects the ≤ 2 million parameter constraint.
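
The parameter budget can be estimated by hand before building the model. The sketch below assumes the dimensions set in the architecture cell of this step (`EMBEDDING_DIM = 128`, `LATENT_DIM = 256`), a vocabulary of 15 tokens as defined in Step 1.1, and standard LSTM and dense layers with bias terms. The attention step used here is a plain dot product and adds no weights.

```python
def lstm_params(input_dim, units):
    """Weights of a standard LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)

VOCAB_SIZE, EMBEDDING_DIM, LATENT_DIM = 15, 128, 256

embedding = VOCAB_SIZE * EMBEDDING_DIM             # shared by encoder and decoder
encoder = lstm_params(EMBEDDING_DIM, LATENT_DIM)
decoder = lstm_params(EMBEDDING_DIM, LATENT_DIM)
# The output layer sees the decoder output concatenated with the attention context.
projection = (2 * LATENT_DIM) * VOCAB_SIZE + VOCAB_SIZE

total = embedding + encoder + decoder + projection
print(total)  # 798095
```

Under these assumptions the model has roughly 0.8 million parameters, well inside the limit. The figure reported by `model.summary()` is the authoritative one.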

**2.1. Imports**

In [14]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dot, Activation, Concatenate

**2.2. Architecture Definition**

In [19]:
# Dimensions
EMBEDDING_DIM = 128
LATENT_DIM = 256

# ---------------------- Encoder ----------------------
encoder_inputs = Input(shape=(MAX_LEN,), name="encoder_input")
embedding_layer = Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True, name="shared_embedding")
encoder_embedding = embedding_layer(encoder_inputs)

encoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True, name="encoder_lstm")
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)

# ---------------------- Decoder ----------------------
decoder_inputs = Input(shape=(MAX_LEN,), name="decoder_input")
decoder_embedding = embedding_layer(decoder_inputs)

decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True, name="decoder_lstm")
decoder_lstm_output, _, _ = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])

# ---------------------- Attention ----------------------
# Compute attention scores (dot product)
attention_scores = Dot(axes=[2, 2])([decoder_lstm_output, encoder_outputs])
attention_weights = Activation('softmax')(attention_scores)

# Weighted sum of encoder outputs
attention_output = Dot(axes=[2, 1])([attention_weights, encoder_outputs])


# Concatenate context + decoder output
decoder_combined_context = Concatenate(axis=-1)([decoder_lstm_output, attention_output])

# ---------------------- Output Projection ----------------------
output_layer = Dense(VOCAB_SIZE, activation="softmax", name="output_projection")
decoder_outputs = output_layer(decoder_combined_context)

# ---------------------- Full Model ----------------------
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
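
The attention block in this cell amounts to three operations per decoder step: score every encoder state by a dot product, normalise the scores with a softmax, and take the weighted sum of encoder states. The plain-Python sketch below shows the same computation for a single decoder step; `dot_attention` and the toy vectors are illustrative assumptions, not code from the notebook.

```python
import math

def dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step over all encoder steps."""
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    # Softmax over encoder positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = dot_attention([2.0, 0.0], encoder_states)
print([round(w, 3) for w in weights])
```

The first and third encoder states score equally against this decoder state and receive equal weight, while the second, which is orthogonal to it, gets the smallest share.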

**2.3. Prepare Targets for Training**

In [17]:
# Add a trailing axis so targets have shape (batch, MAX_LEN, 1); recent Keras versions also accept (batch, MAX_LEN) integer targets
Y_train_expanded = np.expand_dims(Y_train, axis=-1)
Y_val_expanded = np.expand_dims(Y_val, axis=-1)

**2.4. Train the Model**

In [27]:
history = model.fit(
    [X_train, decoder_input_train],
    Y_train_expanded,
    validation_data=([X_val, decoder_input_val], Y_val_expanded),
    epochs=30,
    batch_size=64
)

Epoch 1/30
157/157 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.7775 - loss: 0.7885 - val_accuracy: 0.8456 - val_loss: 0.3518
Epoch 2/30
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.8588 - loss: 0.3341 - val_accuracy: 0.8976 - val_loss: 0.2548
Epoch 3/30
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.9189 - loss: 0.2101 - val_accuracy: 0.9565 - val_loss: 0.1226
Epoch 4/30
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 11ms/step - accuracy: 0.9682 - loss: 0.0936 - val_accuracy: 0.9943 - val_loss: 0.0323
Epoch 5/30
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 14ms/step - accuracy: 0.9951 - loss: 0.0261 - val_accuracy: 0.9994 - val_loss: 0.0092
Epoch 6/30
157/157 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.9993 - loss: 0.0075 - val_accuracy: 0.9996 - val_loss: 0.0054
Epoch 7/30
157/157

# Step 3: Autoregressive Inference

**3.1. Define Inference Models**

In [28]:
# Encoder model (used for inference)
encoder_model = Model(encoder_inputs, [encoder_outputs, state_h, state_c])

# Inputs for inference step
decoder_input_single = Input(shape=(1,), name='decoder_input_single')  # single token input
decoder_state_input_h = Input(shape=(LATENT_DIM,), name='decoder_state_input_h')
decoder_state_input_c = Input(shape=(LATENT_DIM,), name='decoder_state_input_c')
encoder_outputs_input = Input(shape=(MAX_LEN, LATENT_DIM), name='encoder_outputs_input')  # full encoder sequence

# Embedding for current input token
decoder_embed_inf = embedding_layer(decoder_input_single)  # shape: (1, 1, EMBEDDING_DIM)

# Run decoder LSTM for one timestep
decoder_lstm_output, state_h_inf, state_c_inf = decoder_lstm(
    decoder_embed_inf, initial_state=[decoder_state_input_h, decoder_state_input_c]
)  # output shape: (1, 1, LATENT_DIM)

# Attention scores = dot product between decoder output and encoder outputs
attention_scores = Dot(axes=[2, 2], name='attention_scores')(
    [decoder_lstm_output, encoder_outputs_input]
)  # shape: (1, 1, T_enc)

# Normalize scores to probabilities
attention_weights = Activation('softmax', name='attention_weights')(attention_scores)  # shape: (1, 1, T_enc)

# Weighted sum of encoder outputs (context vector)
context_vector = Dot(axes=[2, 1], name='context_vector')(
    [attention_weights, encoder_outputs_input]
)  # shape: (1, 1, LATENT_DIM)

# Concatenate decoder output + context
decoder_context_combined = Concatenate(axis=-1, name='decoder_context_concat')(
    [decoder_lstm_output, context_vector]
)  # shape: (1, 1, 2*LATENT_DIM)

# Project to vocabulary
decoder_output_inf = output_layer(decoder_context_combined)  # shape: (1, 1, VOCAB_SIZE)

# Final decoder inference model
decoder_model = Model(
    inputs=[
        decoder_input_single,
        decoder_state_input_h,
        decoder_state_input_c,
        encoder_outputs_input
    ],
    outputs=[
        decoder_output_inf,
        state_h_inf,
        state_c_inf
    ]
)

**3.2. Autoregressive Decode Function**

In [29]:
def autoregressive_decode_with_attention(input_seq):
    # Encode the input once; the encoder outputs are reused at every decoding step
    enc_outputs, h, c = encoder_model.predict(input_seq.reshape(1, -1), verbose=0)

    # Start decoding from the SOS token
    target_seq = np.full((1, 1), SOS_ID, dtype=np.int32)

    decoded_ids = []

    for _ in range(MAX_LEN):
        output_tokens, h, c = decoder_model.predict(
            [target_seq, h, c, enc_outputs], verbose=0
        )
        # Greedy choice: take the most probable token (no beam search)
        sampled_id = int(np.argmax(output_tokens[0, 0]))

        if sampled_id == EOS_ID:
            break
        decoded_ids.append(sampled_id)
        # Feed the prediction back in as the next decoder input
        target_seq[0, 0] = sampled_id

    # Return the predicted token IDs (without SOS/EOS)
    return decoded_ids
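
Stripped of the Keras calls, the function above is a generic greedy loop: start from `SOS`, pick the most probable next token, feed it back, and stop at `EOS` or after a maximum number of steps. The sketch below isolates that control flow; `greedy_decode` and `scripted_step` are hypothetical names, and the scripted step function merely stands in for the real decoder model.

```python
SOS, EOS = 'SOS', 'EOS'

def greedy_decode(step_fn, state, max_len):
    """Generic greedy loop: feed back the arg-max token until EOS or max_len."""
    token, output = SOS, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        token = max(probs, key=probs.get)  # greedy choice, no beam search
        if token == EOS:
            break
        output.append(token)
    return output

def scripted_step(prev_token, state):
    """Hypothetical stand-in for the decoder: emits a fixed script one token at a time."""
    script = ['a', 'b', '+', EOS]
    probs = {tok: 0.0 for tok in script}
    probs[script[state]] = 1.0
    return probs, state + 1

print(greedy_decode(scripted_step, state=0, max_len=10))  # ['a', 'b', '+']
```

The `max_len` cap matters in practice: if the model never emits `EOS`, the loop still terminates after a bounded number of steps.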


**3.3. Try an Example**

In [37]:
idx = np.random.randint(len(X_val))
x_sample = X_val[idx]
y_true = Y_val[idx]
y_pred = autoregressive_decode_with_attention(x_sample)

print("Infix        :", decode_sequence(x_sample, id_to_token))
print("Target Postfix:", decode_sequence(y_true, id_to_token))
print("Predicted     :", decode_sequence(y_pred, id_to_token))


Infix        : ( ( d * e ) / ( a - ( e / a ) ) )
Target Postfix: d e * a e a / - /
Predicted     : d e * a e a / - /


# Step 4: Prefix Accuracy Evaluation

**4.1. Function: prefix_accuracy_single**

In [36]:
def prefix_accuracy_single(y_true, y_pred, id_to_token, eos_id=EOS_ID, verbose=False):
    # decode_sequence already stops at EOS and skips PAD, so only real tokens remain
    t_tokens = decode_sequence(y_true, id_to_token).split()
    p_tokens = decode_sequence(y_pred, id_to_token).split()

    max_len = max(len(t_tokens), len(p_tokens))
    # Count agreeing positions (compared position by position, not only the leading run)
    match_len = sum(x == y for x, y in zip(t_tokens, p_tokens))

    # Normalise by the longer sequence so extra or missing tokens are penalised
    score = match_len / max_len if max_len > 0 else 0

    if verbose:
        print("TARGET  :", ' '.join(t_tokens))
        print("PREDICT :", ' '.join(p_tokens))
        print(f"MATCH   : {match_len}/{max_len} → {score:.2f}")

    return score
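
Note that the function above counts every position where target and prediction agree, including matches that come after an earlier mistake. A stricter reading of "prefix accuracy" would count only the common leading run. The sketch below shows that variant for comparison; `longest_prefix_score` is a hypothetical alternative and is not the metric used for the results reported in this notebook.

```python
def longest_prefix_score(target_tokens, predicted_tokens):
    """Fraction of the longer sequence covered by the common leading run."""
    longest = max(len(target_tokens), len(predicted_tokens))
    if longest == 0:
        return 0.0
    prefix_len = 0
    for t, p in zip(target_tokens, predicted_tokens):
        if t != p:
            break  # stop at the first disagreement
        prefix_len += 1
    return prefix_len / longest

target = "a b + c *".split()
predicted = "a b - c *".split()
print(longest_prefix_score(target, predicted))  # 0.4
```

For this pair the position-wise metric would give 4/5 = 0.80, because the two tokens after the wrong operator still line up. The two definitions coincide whenever the prediction is exactly right.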

**4.2. Function: test()**

In [34]:
def test(n=20, rounds=5):
    results = []
    for r in range(rounds):
        print(f"Round {r+1}")
        X_test, Y_test = generate_dataset(n)
        scores = []
        for i in range(n):
            x = X_test[i]
            y_true = Y_test[i]
            y_pred = autoregressive_decode_with_attention(x)
            score = prefix_accuracy_single(y_true, y_pred, id_to_token)
            scores.append(score)
        avg = np.mean(scores)
        print(f"  Average prefix accuracy: {avg:.3f}")
        results.append(avg)
    return np.mean(results), np.std(results)

**4.3. Run Evaluation**

In [35]:
mean_score, std_dev = test(n=20, rounds=10)
print(f"\nFinal Prefix Accuracy: {mean_score:.3f} ± {std_dev:.3f}")

Round 1
  Average prefix accuracy: 1.000
Round 2
  Average prefix accuracy: 1.000
Round 3
  Average prefix accuracy: 1.000
Round 4
  Average prefix accuracy: 1.000
Round 5
  Average prefix accuracy: 1.000
Round 6
  Average prefix accuracy: 1.000
Round 7
  Average prefix accuracy: 1.000
Round 8
  Average prefix accuracy: 1.000
Round 9
  Average prefix accuracy: 1.000
Round 10
  Average prefix accuracy: 1.000

Final Prefix Accuracy: 1.000 ± 0.000


In [38]:
# Generate a fresh test dataset (sampled independently, so overlap with the training data is possible)
X_test_new, Y_test_new = generate_dataset(n=1000, max_depth=MAX_DEPTH)

def evaluate_on_dataset(X_data, Y_data, sample_count=20, verbose=False):
    scores = []
    for i in range(sample_count):
        x = X_data[i]
        y_true = Y_data[i]
        y_pred = autoregressive_decode_with_attention(x)
        score = prefix_accuracy_single(y_true, y_pred, id_to_token, verbose=verbose)
        scores.append(score)
    return np.mean(scores), np.std(scores)

mean_new, std_new = evaluate_on_dataset(X_test_new, Y_test_new, sample_count=100, verbose=False)
print(f"New Test Set Prefix Accuracy: {mean_new:.3f} ± {std_new:.3f}")

# Show a few random examples with their predictions
for _ in range(5):
    i = np.random.randint(len(X_test_new))
    print(f"\nExample {i}")
    print("Infix       :", decode_sequence(X_test_new[i], id_to_token))
    print("True Postfix:", decode_sequence(Y_test_new[i], id_to_token))
    print("Predicted   :", decode_sequence(autoregressive_decode_with_attention(X_test_new[i]), id_to_token))
    print('-' * 60)


New Test Set Prefix Accuracy: 0.998 ± 0.020

Example 402
Infix       : ( ( ( a + d ) / ( b / a ) ) / ( ( a - e ) + ( e + d ) ) )
True Postfix: a d + b a / / a e - e d + + /
Predicted   : a d + b a / / a e - e d + + /
------------------------------------------------------------

Example 51
Infix       : ( ( ( d + b ) + ( e + c ) ) + ( d / b ) )
True Postfix: d b + e c + + d b / +
Predicted   : d b + e c + + d b / +
------------------------------------------------------------

Example 646
Infix       : ( ( ( a / c ) + ( c - c ) ) * b )
True Postfix: a c / c c - + b *
Predicted   : a c / c c - + b *
------------------------------------------------------------

Example 820
Infix       : ( a + ( b + ( a + b ) ) )
True Postfix: a b a b + + +
Predicted   : a b a b + + +
------------------------------------------------------------

Example 854
Infix       : ( d / ( d - b ) )
True Postfix: d d b - /
Predicted   : d d b - /
------------------------------------------------------------
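
Every dataset here comes from the same random generator, so a freshly generated test set can share expressions with the training set, especially short ones. The sketch below measures that overlap on encoded rows; `overlap_fraction` is a hypothetical helper, and the toy rows stand in for arrays such as `X_train.tolist()` and `X_test_new.tolist()`.

```python
def overlap_fraction(train_rows, test_rows):
    """Fraction of test rows whose token-ID sequence also appears in the training rows."""
    seen = {tuple(row) for row in train_rows}
    if len(test_rows) == 0:
        return 0.0
    hits = sum(tuple(row) in seen for row in test_rows)
    return hits / len(test_rows)

# Toy encoded rows: "( a + b ) EOS", "( a * c ) EOS", "( d / e ) EOS"
train = [[3, 9, 5, 10, 4, 2], [3, 9, 7, 11, 4, 2]]
test = [[3, 9, 5, 10, 4, 2], [3, 12, 8, 13, 4, 2]]
print(overlap_fraction(train, test))  # 0.5
```

A high overlap would mean the reported accuracy partly reflects memorised examples. Filtering the test rows against `seen` before evaluation is one way to report a score on strictly unseen expressions.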
