<a href="https://colab.research.google.com/github/tukamilano/combinatory_logic/blob/main/integer_sequence_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Make the data-generation part lighter.

Fetch all of the data first, then sample from it at random (for better experiments).

Spend money.

## Introduction

In this example, we'll build a sequence-to-sequence Transformer model, which
we'll train to map integer sequences to the formulas that generate them,
using code adapted from an English-to-Spanish machine translation example.

You'll learn how to:

- Vectorize text using the Keras `TextVectorization` layer.
- Implement a `TransformerEncoder` layer, a `TransformerDecoder` layer,
and a `PositionalEmbedding` layer.
- Prepare data for training a sequence-to-sequence model.
- Use the trained model to generate translations of never-seen-before
input sentences (sequence-to-sequence inference).

The code featured here is adapted from the book
[Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition)
(chapter 11: Deep learning for text).
The present example is fairly barebones, so for detailed explanations of
how each building block works, as well as the theory behind Transformers,
I recommend reading the book.

In [1]:
!pip install transformers



## Setup

In [2]:
!pip install --upgrade keras

Collecting keras
  Downloading keras-3.3.3-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 5.8 MB/s eta 0:00:00
Collecting namex (from keras)
  Downloading namex-0.0.8-py3-none-any.whl (5.8 kB)
Collecting optree (from keras)
  Downloading optree-0.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (311 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 311.2/311.2 kB 16.5 MB/s eta 0:00:00
Installing collected packages: namex, optree, keras
  Attempting uninstall: keras
    Found existing installation: keras 2.15.0
    Uninstalling keras-2.15.0:
      Successfully uninstalled keras-2.15.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.3.3 which is incompatible.

In [3]:
# We set the backend to TensorFlow. The code works with
# both `tensorflow` and `torch`. It does not work with JAX
# due to the behavior of `jax.numpy.tile` in a jit scope
# (used in `TransformerDecoder.get_causal_attention_mask()`):
# `tile` in JAX does not support a dynamic `reps` argument.
# You can make the code work in JAX by wrapping the
# inside of the `get_causal_attention_mask` method in
# a context manager that forces compile-time evaluation:
# `with jax.ensure_compile_time_eval():`.
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import pathlib
import random
import string
import re
import numpy as np

import tensorflow.data as tf_data
import tensorflow.strings as tf_strings

import keras
from keras import layers
from keras import ops
from keras.layers import TextVectorization


In [11]:
import random

random.seed(42)

def generate_unique_tuples(n):
    tuples = set()
    while len(tuples) < n:
        A1 = random.randint(0, 8)
        A2 = random.randint(0, 9)
        A3 = random.randint(0, 16)
        A4 = random.randint(0, 1)
        A5 = random.randint(0, 1)

        B1 = random.randint(0, 8)
        B2 = random.randint(0, 9)
        B3 = random.randint(0, 16)
        B4 = random.randint(0, 1)
        B5 = random.randint(0, 1)

        C = random.randint(0, 1)

        tuples.add(((A1, A2, A3, A4, A5),(B1, B2, B3, B4, B5),C))
    return list(tuples)
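
The `while` loop above only terminates if `n` does not exceed the number of distinct tuples, so it can help to know how large the index space is. The following is a minimal sketch, assuming the same index ranges as `generate_unique_tuples` and the rule in `generate_dataset` that skips a term whose last four indices are all 0. The names `TERM_SIZES`, `terms_total`, `terms_valid`, `tuples_total`, and `tuples_valid` are illustrative and not part of the notebook.

```python
import math

# Number of choices for each index (A1..A5), matching random.randint(0, k).
TERM_SIZES = (9, 10, 17, 2, 2)

terms_total = math.prod(TERM_SIZES)        # all index combinations for one term
terms_valid = terms_total - TERM_SIZES[0]  # drop terms where (A2..A5) are all 0
tuples_total = terms_total ** 2 * 2        # two terms plus the sign index C
tuples_valid = terms_valid ** 2 * 2        # tuples that survive the skip rule

print(terms_total, terms_valid)
print(tuples_total, tuples_valid)
```

Under these assumptions there are roughly 75 million distinct tuples, so a request for one million samples stays well inside the space.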

In [9]:
import itertools
import json
import math
from operator import add, sub

# Using a generator might make this faster?
def generate_dataset(encoding, length=100):
    formula_list = []
    evaluated_result = []

    T5 = ["", "x!"]
    T4 = ["", "x**x"]
    T3 = ["", "9**x", "(-9)**x", "8**x", "(-8)**x", "7**x", "(-7)**x", "6**x", "(-6)**x", "5**x", "(-5)**x", "4**x", "(-4)**x", "3**x", "(-3)**x", "2**x", "(-2)**x"]
    T2 = ["", "x**9", "x**8", "x**7", "x**6", "x**5", "x**4", "x**3", "x**2", "x"]
    T1 = ["", "9", "8", "7", "6", "5", "4", "3", "2"]
    T0 = ["+", "-"]

    for term_pair in encoding:
        A1, A2, A3, A4, A5 = term_pair[0]
        B1, B2, B3, B4, B5 = term_pair[1]
        C = term_pair[2]

        if ((A2, A3, A4, A5) == (0, 0, 0, 0)) or ((B2, B3, B4, B5) == (0, 0, 0, 0)):
            continue

        a1 = T1[A1]
        b1 = T2[A2]
        c1 = T3[A3]
        d1 = T4[A4]
        e1 = T5[A5]

        a2 = T1[B1]
        b2 = T2[B2]
        c2 = T3[B3]
        d2 = T4[B4]
        e2 = T5[B5]

        a0 = T0[C]

        first_sequence = a1 + " " + b1 + " "+ c1 + " "+ d1 + " " + e1
        second_sequence = a2 + " " + b2 + " "+ c2 + " "+ d2 + " " + e2

        new_first_sequence = ' '.join(first_sequence.split())
        first_formula_term = (new_first_sequence.strip()).replace(" ", "*")
        first_integer_sequence_term = [eval((first_formula_term.replace("x!", "math.factorial(x)")).replace("x", str(i))) for i in range(int(length/2))]

        new_second_sequence = ' '.join(second_sequence.split())
        second_formula_term = (new_second_sequence.strip()).replace(" ", "*")
        second_integer_sequence_term = [eval((second_formula_term.replace("x!", "math.factorial(x)")).replace("x", str(i))) for i in range(int(length/2))]

        if a0 == "+":
            evaluated_expr_list = list(map(add, first_integer_sequence_term, second_integer_sequence_term))
        else:
            evaluated_expr_list = list(map(sub, first_integer_sequence_term, second_integer_sequence_term))

        evaluated_expr = (str(evaluated_expr_list)[1:-1].replace(', ', ','))[:length]

        formula_term = first_formula_term + a0 + second_formula_term

        formula_list.append(formula_term)
        evaluated_result.append(evaluated_expr)

    return formula_list, evaluated_result

encoding = generate_unique_tuples(1000)
formula_list, evaluated_result = generate_dataset(encoding)

text_pairs = list(zip(evaluated_result, list(map(lambda formula: 'S' + formula + 'E', formula_list))))

In [10]:
import random

for _ in range(5):
    print(random.choice(text_pairs))

('0,83,-33728,22476042,-18051366912,19375837125000,-26992713122979840,47665763265102630480,-1043634664', 'S5*x**4*4**x*x!-7*x**3*(-9)**x*x**x*x!E')
('0,-60,-4032,-157464,11695104,7695324000,3416187778560,1769737688147520,1135299937278689280,896474187', 'S4*x**2*3**x*x**x*x!-8*x**2*9**x*x!E')
('0,118,1546240,3726464292,6161734565888,11979707031250000,30971118787340009472,9865369481392962152425', 'S8*x**9*8**x*x**x+6*x**7*9**x*x**x*x!E')
('0,-64,266240,-244069200,236370001920,-306302625000000,463348852860026880,-784254623291454510720,1487', 'S2*x**8*(-4)**x*x**x*x!+8*x**8*(-7)**x*x!E')
('0,37,1688,69390,3159168,163551000,9573059520,626331308400,45304318402560,3589195696652160,3090358841', 'S3*x**2*7**x*x!+4*x**2*4**x*x!E')
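
As a sanity check, the last sample above can be reproduced by hand. This is a small sketch that evaluates the two terms of `3*x**2*7**x*x!+4*x**2*4**x*x!` directly instead of going through `eval`. The helper names `first_term` and `second_term` are illustrative only.

```python
import math

def first_term(x):
    # 3 * x**2 * 7**x * x!
    return 3 * x**2 * 7**x * math.factorial(x)

def second_term(x):
    # 4 * x**2 * 4**x * x!
    return 4 * x**2 * 4**x * math.factorial(x)

# The sign between the two terms is "+" in this sample.
values = [first_term(x) + second_term(x) for x in range(4)]
print(",".join(str(v) for v in values))  # prints 0,37,1688,69390
```

The printed prefix matches the start of the integer sequence paired with that formula.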


Now, let's split the sentence pairs into a training set, a validation set,
and a test set.

In [None]:
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

1000000 total pairs
700000 training pairs
150000 validation pairs
150000 test pairs


## Vectorizing the text data

We'll use two instances of the `TextVectorization` layer to vectorize the text
data (one for the integer sequences and one for the formulas),
that is to say, to turn the original strings into integer sequences
where each integer represents the index of a character in a vocabulary.

Both layers split on characters (`split='character'`) and apply no standardization
(`standardize=None`), so digits, commas, minus signs, and operators are all kept as tokens.
The variable names `eng_*` (integer sequences) and `spa_*` (formulas) are carried over
from the English-to-Spanish translation example this code is adapted from.

Note: in a production-grade machine translation model, I would not recommend
stripping the punctuation characters in either language. Instead, I would recommend turning
each punctuation character into its own token,
which you could achieve by providing a custom `split` function to the `TextVectorization` layer.
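
To make character-level vectorization concrete, here is a minimal plain-Python sketch of the idea: build a vocabulary of characters, map each character to its index, and pad to a fixed length. It is not the `TextVectorization` implementation. The helpers `build_vocab` and `encode` are hypothetical, and reserving index 0 for padding and index 1 for unknown characters is an assumption modeled on the usual vocabulary layout.

```python
from collections import Counter

def build_vocab(texts):
    # Reserve index 0 for padding and index 1 for unknown characters,
    # then list the observed characters, most frequent first.
    counts = Counter(ch for text in texts for ch in text)
    return ["", "[UNK]"] + [ch for ch, _ in counts.most_common()]

def encode(text, vocab, length):
    # Map characters to indices, truncate to `length`, then right-pad with 0.
    index = {ch: i for i, ch in enumerate(vocab)}
    ids = [index.get(ch, 1) for ch in text][:length]
    return ids + [0] * (length - len(ids))

vocab = build_vocab(["S2*x+x!E", "Sx**2-3*xE"])
encoded = encode("Sx+7E", vocab, length=8)
print(vocab)
print(encoded)
```

The character `7` never appears in the toy training strings, so it maps to the unknown index, and the three trailing zeros are padding.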

In [None]:
vocab_size = 19
sequence_length = 150
batch_size = 64  # adjust as appropriate

eng_vectorization = TextVectorization(
    max_tokens=vocab_size,
    split='character',
    standardize=None,
    output_mode="int",
    output_sequence_length=sequence_length,
)
spa_vectorization = TextVectorization(
    max_tokens=vocab_size,
    split='character',
    standardize=None,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
)
train_eng_texts = [pair[0] for pair in train_pairs]
train_spa_texts = [pair[1] for pair in train_pairs]
eng_vectorization.adapt(train_eng_texts)
spa_vectorization.adapt(train_spa_texts)

Next, we'll format our datasets.

At each training step, the model will seek to predict target words N+1 (and beyond)
using the source sentence and the target words 0 to N.

As such, the training dataset will yield a tuple `(inputs, targets)`, where:

- `inputs` is a dictionary with the keys `encoder_inputs` and `decoder_inputs`.
`encoder_inputs` is the vectorized source sentence and `decoder_inputs` is the target sentence "so far",
that is to say, the words 0 to N used to predict word N+1 (and beyond) in the target sentence.
- `targets` is the target sentence offset by one step:
it provides the next words in the target sentence -- what the model will try to predict.
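
The one-step offset can be illustrated on a single toy sequence. This is a plain-Python sketch with made-up token ids (0 standing in for padding). It mirrors the slicing `spa[:, :-1]` and `spa[:, 1:]` used in `format_dataset`, but on a flat list rather than a batched tensor.

```python
# A vectorized target of length sequence_length + 1 (toy values, 0 = padding).
spa = [6, 3, 2, 4, 7, 0, 0]

decoder_inputs = spa[:-1]  # what the decoder sees: tokens 0..N
targets = spa[1:]          # what it must predict: tokens 1..N+1

print(decoder_inputs)  # [6, 3, 2, 4, 7, 0]
print(targets)         # [3, 2, 4, 7, 0, 0]
```

At every position `k`, `targets[k]` is the token that follows `decoder_inputs[k]`, which is why the target vectorizer is configured with one extra step.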

In [None]:
batch_size = 64

def format_dataset(eng, spa):
    eng = eng_vectorization(eng)
    spa = spa_vectorization(spa)
    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": spa[:, :-1],
        },
        spa[:, 1:],
    )

def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.cache().shuffle(2048).prefetch(16)

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

Let's take a quick look at the sequence shapes
(we have batches of 64 pairs, and all sequences are 150 steps long):

In [None]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 150)
inputs["decoder_inputs"].shape: (64, 150)
targets.shape: (64, 150)


## Building the model

Our sequence-to-sequence Transformer consists of a `TransformerEncoder`
and a `TransformerDecoder` chained together. To make the model aware of word order,
we also use a `PositionalEmbedding` layer.

The source sequence will be passed to the `TransformerEncoder`,
which will produce a new representation of it.
This new representation will then be passed
to the `TransformerDecoder`, together with the target sequence so far (target words 0 to N).
The `TransformerDecoder` will then seek to predict the next words in the target sequence (N+1 and beyond).

A key detail that makes this possible is causal masking
(see method `get_causal_attention_mask()` on the `TransformerDecoder`).
The `TransformerDecoder` sees the entire sequence at once, and thus we must make
sure that it only uses information from target tokens 0 to N when predicting token N+1
(otherwise, it could use information from the future, which would
result in a model that cannot be used at inference time).
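
The shape of a causal mask is easy to see on a short sequence. Below is a minimal plain-Python sketch of the same `i >= j` rule that `get_causal_attention_mask()` applies, without the batch dimension. The standalone function `causal_mask` is illustrative and not part of the model code.

```python
def causal_mask(sequence_length):
    # mask[i][j] == 1 when position i may attend to position j (j <= i).
    return [
        [1 if i >= j else 0 for j in range(sequence_length)]
        for i in range(sequence_length)
    ]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row `i` is the set of positions visible when predicting the token after position `i`; everything above the diagonal (the future) is zeroed out.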

In [None]:
import keras.ops as ops


class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:
            padding_mask = ops.cast(mask[:, None, :], dtype="int32")
        else:
            padding_mask = None

        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = ops.shape(inputs)[-1]
        positions = ops.arange(0, length, 1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        if mask is None:
            return None
        else:
            return ops.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(latent_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = ops.cast(mask[:, None, :], dtype="int32")
            padding_mask = ops.minimum(padding_mask, causal_mask)
        else:
            padding_mask = None

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = ops.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = ops.arange(sequence_length)[:, None]
        j = ops.arange(sequence_length)
        mask = ops.cast(i >= j, dtype="int32")
        mask = ops.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = ops.concatenate(
            [ops.expand_dims(batch_size, -1), ops.convert_to_tensor([1, 1])],
            axis=0,
        )
        return ops.tile(mask, mult)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "latent_dim": self.latent_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


Next, we assemble the end-to-end model.

In [None]:
embed_dim = 256
latent_dim = 2048
num_heads = 8
num_layers = 1  # number of encoder and decoder layers

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)

for _ in range(num_layers):
    x = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)

encoder_outputs = x
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)

for _ in range(num_layers):
    x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)

x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)

decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])

transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)

## Training our model

We'll use accuracy as a quick way to monitor training progress on the validation data.
Note that machine translation typically uses BLEU scores as well as other metrics, rather than accuracy.

Here each call to `fit` runs a single epoch, and the call is repeated `epochs` times in a loop
so that a few test samples can be decoded and printed after every epoch.
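
One caveat when reading per-token accuracy: targets here are padded to a fixed length that is much longer than a typical formula, so if padded positions count toward the metric, accuracy can look high even when the decoded formulas are wrong. I have not verified how the metric treats padding in this notebook. The following is a plain-Python sketch of the effect, assuming padding id 0; `token_accuracy` is a hypothetical helper and not a Keras API.

```python
def token_accuracy(targets, predictions, ignore_padding=False, pad_id=0):
    # Fraction of positions where the predicted id equals the target id,
    # optionally skipping positions whose target is padding.
    pairs = [
        (t, p)
        for t_row, p_row in zip(targets, predictions)
        for t, p in zip(t_row, p_row)
        if not (ignore_padding and t == pad_id)
    ]
    return sum(t == p for t, p in pairs) / len(pairs)

# Toy ids: four real tokens followed by four padding positions.
targets = [[6, 3, 2, 7, 0, 0, 0, 0]]
predictions = [[6, 2, 2, 2, 0, 0, 0, 0]]

print(token_accuracy(targets, predictions))                       # 0.75
print(token_accuracy(targets, predictions, ignore_padding=True))  # 0.5
```

In this toy case half of the real tokens are wrong, yet the unmasked score is 0.75 because every padded position counts as correct.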

In [None]:
spa_vocab = spa_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 50

def decode_sequence(input_sentence):
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "S"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = spa_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

        # ops.argmax(predictions[0, i, :]) is not a concrete value for jax here
        sampled_token_index = ops.convert_to_numpy(
            ops.argmax(predictions[0, i, :])
        ).item(0)
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += sampled_token
        if sampled_token == "E":
            break
    return decoded_sentence

In [None]:
print(spa_vocab)

['', '[UNK]', '*', 'x', '9', '!', 'S', 'E', '-', '8', '5', '7', '4', '3', '2', '6', '+', ')', '(']


In [None]:
epochs = 100

transformer.summary()
transformer.compile(
    "adamW", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
for i in range(epochs):
    transformer.fit(train_ds, epochs=1, validation_data=val_ds)
    for _ in range(5):
        print("=====================")
        input_output_sentence = random.choice(test_pairs)
        translated = decode_sequence(input_output_sentence[0])
        print(input_output_sentence[1])
        print(translated)

10938/10938 ━━━━━━━━━━━━━━━━━━━━ 290s 25ms/step - accuracy: 0.9199 - loss: 0.2713 - val_accuracy: 0.8690 - val_loss: 0.4542
S3*x**2*9**x*x**x*x!-4*x**2*(-7)**x*x!E
S**************************************************
Sx**5*9**x*x**x*x!+2*x**3*(-8)**x*x**x*x!E
S**************************************************
S8*x**6*9**x*x**x*x!+9*x**9*6**x*x!E
S**************************************************
S7*x**2*9**x*x**x*x!-x**6*x!E
S**************************************************
S3*x**8*9**x*x**x*x!+6*x**3*5**x*x**xE
S**************************************************
10938/10938 ━━━━━━━━━━━━━━━━━━━━ 264s 24ms/step - accuracy: 0.8808 - loss: 0.4010 - val_accuracy: 0.9564 - val_loss: 0.1197
S6*x**2*9**x*x**x*x!-2*x**3*(-4)**xE
S9*x**9*9**x*x**x*x!-9*x**9*(-6)**x*x**x*x!E
S3*x**9*9**x*x**x*x!-x**2*6**xE
S9*x**9*9**x*x**x*x!-9*x**9*(-6)**x*x**x*x!E
S6*x*9**x*x**x*x!-9*x**4*(-2)**x*x!E
S9*x**9*9**x*x**x*x!-9*x**9*(-6)**x*x**