# XRD transformer
Standard seq2seq model architecture based on [language translation model](https://pytorch.org/tutorials/beginner/translation_transformer.html).

### Model setup
Concept: lattice parameters arise solely from peak positions, so learn relationship between peak positions (e.g. (100) and (200) peaks) via self attention. Positional encoding is intrinsic to peak positions already, so does not need to be learned. Output via seq2seq: lattice parameters arise from whole set of peaks.

Use [regression transformer](https://www.nature.com/articles/s42256-023-00639-z) to frame problem as a seq2seq problem. Numbers are tokenized based on digits and positions, and an embedding preserving distances between numbers is applied. (This works surprisingly well in the paper). Numbers can then be generated as a sequence of tokens before being decoded back to numbers.

Example: 12.34 --> ['\_1\_1\_', '\_2\_0\_', '\._', '\_3\_-1\_', '\_4\_-2\_']


Source
* Start with [q, I] data
* Peak positions (q) are binned to integers - initial tests used bin sizes of 0.0005 in q
* Convert bin positions to the regression transformer numerical encoding
* Different input lengths are dynamically padded during batching

Targets
* Start with [a, b, c] lattice data
* Tokenize data using the regression transformer numerical encoding
* Different output lengths are dynamically padded during batching

In [12]:
import pickle
import math
import matplotlib.pyplot as plt
import numpy as np
import re
from typing import Iterable, List
from timeit import default_timer as timer
import tqdm
import numerical_encodings # adapted from regression transformer paper

import torch
from torch import nn, Tensor
from torch.utils.data import DataLoader, Dataset, TensorDataset, random_split
from torch.nn.utils.rnn import pad_sequence
import torch.nn as nn
from torch.nn import Transformer
import torchtext

# Check if GPU is available
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)

cuda


### Load data from files

Load data from pickles

* Lattice parameters are sorted smallest to largest.
* Peaks are unsorted(?)
* Max q is 1.0 (0.99999 or something - maybe it's normalized to 1.0)?

In [2]:
def load_data(file1_path, file2_path):
    with open(file1_path, 'rb') as file1:
        list1 = pickle.load(file1)
    with open(file2_path, 'rb') as file2:
        list2 = pickle.load(file2)
    return list1, list2

xdata_filename = "./data/qI.pickle"
ydata_filename = "./data/lps.pickle"
X, y = load_data(xdata_filename, ydata_filename)

# Some entries are 0 length so limit X to entries with more than 10 elements
X = [x for x in X if len(x) > 10]

lens = [len(thing) for thing in X]

print(min(lens))
print(max(lens))
print(len(lens))

11
162
143804


### Helper functions

Detokenization routines

In [3]:
def get_keys_by_values(my_dict, values_list):
    keys_list = [key for value in values_list for key, val in my_dict.items() if val == value]
    return keys_list

def split_list(input_list, delimiters):
    '''
    Split a list into sublists based on a list of delimiters
    E.g. for the transfomer output which is separated by |
    '''
    result = []
    temp_list = []

    for item in input_list:
        if item not in delimiters:
            temp_list.append(item)
        else:
            if temp_list:  # Check if temp_list is not empty
                result.append(temp_list)
                temp_list = []

    # Add the last temp_list if not empty and not added yet
    if temp_list and temp_list not in result:
        result.append(temp_list)

    return result

def extract_floats(seq):
    out = []
    delimiters = ['|', '<eos>', '<bos>', '<a>', '<b>', '<c>']
    seq_split = split_list(seq, delimiters)
    for i in range(len(seq_split)):
        float_string = "".join([token.split("_")[1] for token in seq_split[i]])
        out.append(float(float_string))
    return out

### Tokenization

Split targets into numerical encoding

PropertyTokenizer from regression transformer paper.

Pass to model: vocab + list of token IDs

In [6]:
# Manually modified
class PropertyTokenizer:
    """Run a property tokenization."""

    def __init__(self) -> None:
        """Constructs a PropertyTokenizer."""
        self.regex = re.compile(r"\s*(\+|-)?(\d+)(\.)?(\d+)?\s*")

    def tokenize(self, text: str) -> List[str]:
        """Tokenization of a property.
        Args:
            text: text to tokenize.
        Returns:
            extracted tokens.
        """
        tokens = []
        matched = self.regex.match(text)
        if matched:
            sign, units, dot, decimals = matched.groups()
            if sign:
                tokens += [f"_{sign}_"]
            tokens += [
                f"_{number}_{position}_" for position, number in enumerate(units[::-1])
            ][::-1]
            if dot:
                tokens += [f"_{dot}_"]
            if decimals:
                tokens += [
                    f"_{number}_-{position}_"
                    for position, number in enumerate(decimals, 1)
                ]
        return tokens

    def tokenize_floats(self, floats: List[float]) -> List[List[str]]:
        """Tokenization of a list of floats.
        Args:
            floats: list of floats to tokenize.
        Returns:
            List of tokenized floats.
        """
        tokens_list = []
        for num in floats:
            tokens = self.tokenize(str(num))
            tokens_list.append("|")
            for i in tokens:
                tokens_list.append(i)
        return tokens_list
    
    from typing import List

    def tokenize_floats_tgt(self, floats: List[float]) -> List[str]:
        """Tokenization of a list of floats.
        Args:
            floats: list of floats to tokenize.
        Returns:
            List of tokenized floats.
        """
        if len(floats) != 3:
            raise ValueError("The input list must have exactly 3 elements")

        tokens_list = []
        prefix_list = ['<a>', '<b>', '<c>']
        
        for idx, num in enumerate(floats):
            tokens = self.tokenize(str(num))
            tokens_list.append(prefix_list[idx])
            for i in tokens:
                tokens_list.append(i)
        
        return tokens_list

max_q = 1

def bin_q(arr, max_q, n=0.0005):
    '''
    Bins the q values in an array of shape (n, 2) into bins of size `n`
    '''
    bin_edges = np.arange(0.0, max_q + n, n)

    # Use `np.digitize` to bin the q values into integers
    binned_q = np.digitize(arr[:,0], bin_edges) - 1 # subtract 1 to make the index 0-based
    
    return binned_q

class RaggedTensorDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

def generate_vocab(min_range, max_range, decimal_places=0, special_symbols=['<pad>', '<bos>', '<eos>', '|', '<a>', '<b>', '<c>']):
    token_set = set()
    
    max_digits_left = len(str(max_range).split(".")[0])
    max_digits_right = decimal_places

    for position in range(max_digits_left, -max_digits_right - 1, -1):
        for digit in range(10):
            token = f"_{digit}_{position}_"
            token_set.add(token)
    
    if decimal_places > 0:
        token_set.add("_._")

    tokens = special_symbols + sorted(list(token_set))

    # create a dictionary that maps each token to an index
    token_to_index = {}
    for i in range(len(tokens)):
        token_to_index[tokens[i]] = i

    return token_to_index

Full vocab generator

In [7]:
# find maximum number of decimal places in y
max_decimals = 0
for i in range(len(y)):
    for j in range(len(y[i])):
        if len(str(y[i][j]).split('.')[1]) > max_decimals:
            max_decimals = len(str(y[i][j]).split('.')[1])
print(max_decimals)

7


In [8]:
# Generate vocab_X
min_range = 0
max_range = 9999
vocab_X = generate_vocab(min_range, max_range, 0)
print(vocab_X)
print(len(vocab_X))

# Generate vocab_y
min_range = 0
max_range = 999
vocab_y = generate_vocab(min_range, max_range, max_decimals)
# print(vocab_y)
# print(len(vocab_y))

PAD_IDX, BOS_IDX, EOS_IDX, PIPE_IDX = 0, 1, 2, 3
A_IDX, B_IDX, C_IDX = 4, 5, 6

{'<pad>': 0, '<bos>': 1, '<eos>': 2, '|': 3, '<a>': 4, '<b>': 5, '<c>': 6, '_0_0_': 7, '_0_1_': 8, '_0_2_': 9, '_0_3_': 10, '_0_4_': 11, '_1_0_': 12, '_1_1_': 13, '_1_2_': 14, '_1_3_': 15, '_1_4_': 16, '_2_0_': 17, '_2_1_': 18, '_2_2_': 19, '_2_3_': 20, '_2_4_': 21, '_3_0_': 22, '_3_1_': 23, '_3_2_': 24, '_3_3_': 25, '_3_4_': 26, '_4_0_': 27, '_4_1_': 28, '_4_2_': 29, '_4_3_': 30, '_4_4_': 31, '_5_0_': 32, '_5_1_': 33, '_5_2_': 34, '_5_3_': 35, '_5_4_': 36, '_6_0_': 37, '_6_1_': 38, '_6_2_': 39, '_6_3_': 40, '_6_4_': 41, '_7_0_': 42, '_7_1_': 43, '_7_2_': 44, '_7_3_': 45, '_7_4_': 46, '_8_0_': 47, '_8_1_': 48, '_8_2_': 49, '_8_3_': 50, '_8_4_': 51, '_9_0_': 52, '_9_1_': 53, '_9_2_': 54, '_9_3_': 55, '_9_4_': 56}
57


Generate token IDs for each dataset
* The floatencoder converts tensors of IDs to embedding tensors

In [9]:
tokenizer = PropertyTokenizer()

tokenized_y = [tokenizer.tokenize_floats_tgt(y[i]) for i in range(len(y))]
token_ids_y = [[vocab_y[token] for token in tokenized_y[i]] for i in range(len(tokenized_y))]

binned_X = [bin_q(X[i], max_q) for i in range(len(X))]
tokenized_X = [tokenizer.tokenize_floats(binned_X[i]) for i in range(len(binned_X))]
token_ids_X = [[vocab_X[token] for token in tokenized_X[i]] for i in range(len(tokenized_X))]

Make ragged tensor dataset
* Padding and special token appending happens in the collate function

In [10]:
ragged_dataset = RaggedTensorDataset(token_ids_X, token_ids_y)

# Random seed for reproducibility
torch.manual_seed(42)

# Set the proportions for train, validation, and test splits
train_ratio = 0.90
val_ratio = 0.05
test_ratio = 0.05

# Calculate the sizes for each split
train_size = int(train_ratio * len(ragged_dataset))
val_size = int(val_ratio * len(ragged_dataset))
test_size = len(ragged_dataset) - train_size - val_size

# Perform the train-validation-test split
train_dataset, val_dataset, test_dataset = random_split(ragged_dataset, [train_size, val_size, test_size])

# print sizes
print(f"Train size: {len(train_dataset)}")
print(f"Validation size: {len(val_dataset)}")
print(f"Test size: {len(test_dataset)}")


Train size: 129423
Validation size: 7190
Test size: 7191


### Building a model
Using this as basis https://pytorch.org/tutorials/beginner/translation_transformer.html


In [13]:
# Seq2Seq Network
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 src_vocab: dict,
                 tgt_vocab: dict,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size) # outputs probability distribution over target vocabulary

        # Note - bug when "numerical_encodings." was omitted, causing model to try to call forward()
        self.src_numerical_encoding = numerical_encodings.FloatEncoding(
            num_embeddings = src_vocab_size,
            embedding_dim = emb_size,
            vocab = src_vocab)
        self.tgt_numerical_encoding = numerical_encodings.FloatEncoding(
            num_embeddings = tgt_vocab_size,
            embedding_dim = emb_size,
            vocab = tgt_vocab)

    def forward(self,
                src: Tensor,
                tgt: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.src_numerical_encoding(src) # src_tok_emb acts on the input tensor. Instead of positional, use numerical
        tgt_emb = self.tgt_numerical_encoding(tgt) # tgt_tok_emb acts on the output tensor
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask) # this is calling self.forward
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.src_numerical_encoding(src), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.tgt_numerical_encoding(tgt), memory, tgt_mask)

Define padding and causal masks

In [14]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

Define collation function

In [15]:
def collate_fn(batch):
    """
    This function takes in a list of samples and returns separate source and target batches.
    """
    # Get the inputs and labels separately
    # Inputs are lists and converted to tensors here
    src_batch = [torch.tensor(sample[0]) for sample in batch]
    tgt_batch = [torch.tensor(sample[1]) for sample in batch]
    
    # add BOS_IDX to end and EOS_IDX to start of tgt_batch
    tgt_batch = [torch.cat((torch.tensor([BOS_IDX]), tgt_batch[i], torch.tensor([EOS_IDX]))) for i in range(len(tgt_batch))]

    # # Reserve the right to pad later
    src_batch = pad_sequence(src_batch, batch_first=False)
    tgt_batch = pad_sequence(tgt_batch, batch_first=False)

    return src_batch, tgt_batch

Define training and evaluation loop

In [16]:
def train_epoch(model, optimizer):
    model.train()
    losses = 0
    train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_dataloader))

def evaluate(model):
    model.eval()
    losses = 0

    val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in val_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    return losses / len(list(val_dataloader))

Set model parameters

In [17]:
torch.manual_seed(42)

SRC_VOCAB_SIZE = len(vocab_X)
TGT_VOCAB_SIZE = len(vocab_y)
SRC_VOCAB = vocab_X
TGT_VOCAB = vocab_y
EMB_SIZE = 16
NHEAD = 8
FFN_HID_DIM = 64
BATCH_SIZE = 64
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, SRC_VOCAB, TGT_VOCAB, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)



Train the model

In [17]:
NUM_EPOCHS = 50

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))

    # Save model every 10 epochs
    if epoch % 10 == 0:
        torch.save(transformer.state_dict(), f"transformer_{epoch}.pt")



Epoch: 1, Train loss: 3.253, Val loss: 2.106, Epoch time = 175.985s
Epoch: 2, Train loss: 1.830, Val loss: 1.629, Epoch time = 167.281s
Epoch: 3, Train loss: 1.628, Val loss: 1.570, Epoch time = 167.637s
Epoch: 4, Train loss: 1.586, Val loss: 1.551, Epoch time = 167.469s
Epoch: 5, Train loss: 1.569, Val loss: 1.542, Epoch time = 167.768s
Epoch: 6, Train loss: 1.559, Val loss: 1.536, Epoch time = 170.318s
Epoch: 7, Train loss: 1.551, Val loss: 1.532, Epoch time = 167.076s
Epoch: 8, Train loss: 1.546, Val loss: 1.528, Epoch time = 167.536s
Epoch: 9, Train loss: 1.542, Val loss: 1.525, Epoch time = 167.810s
Epoch: 10, Train loss: 1.539, Val loss: 1.523, Epoch time = 168.182s
Epoch: 11, Train loss: 1.536, Val loss: 1.521, Epoch time = 168.280s
Epoch: 12, Train loss: 1.533, Val loss: 1.520, Epoch time = 167.012s
Epoch: 13, Train loss: 1.531, Val loss: 1.518, Epoch time = 166.768s
Epoch: 14, Train loss: 1.529, Val loss: 1.517, Epoch time = 166.881s
Epoch: 15, Train loss: 1.528, Val loss: 1.5

KeyboardInterrupt: 

### Testing model

Load saved model and test on examples


In [18]:
# Load a torch model from a .pth file
model_weights_path = "transformer_40.pt" 
transformer.load_state_dict(torch.load(model_weights_path))

<All keys matched successfully>

Generation routines

In [19]:
# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys

def greedy_decode_verbose(model, src, src_mask, max_len, start_symbol,top_k=3):
    def get_keys_by_values(vocab, token):
        return [k for k, v in vocab.items() if v == token]
    
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])

        # Get the top k tokens and their probabilities
        top_probs, top_tokens = torch.topk(prob, top_k, dim=1)
        top_probs = torch.softmax(top_probs, dim=-1)
        
        # Print the top k tokens and their probabilities
        print(f"Step {i}: Top {top_k} tokens and probabilities:")
        for j in range(top_k):
            token = top_tokens[0, j].item()
            token_str = get_keys_by_values(vocab_y, token)[0]
            print(f"Token '{token_str}': Probability {top_probs[0, j].item()}")

        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys

# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src, verbose=False):
    model.eval()
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    if verbose:
        tgt_tokens = greedy_decode_verbose(
            model,  src, src_mask, max_len=30, start_symbol=BOS_IDX).flatten()
    else:
        tgt_tokens = greedy_decode(
            model,  src, src_mask, max_len=30, start_symbol=BOS_IDX).flatten()
    return tgt_tokens

Check which tokens the model is predicting
* the structure of the numbers is being captured (P near 1 for tags and ".")

In [20]:
src = test_dataset[0][0]
src = torch.tensor(src).unsqueeze(1)
out = translate(transformer, src, verbose=True)

Step 0: Top 3 tokens and probabilities:
Token '<a>': Probability 0.9999988079071045
Token '_5_0_': Probability 6.563274723703216e-07
Token '_4_0_': Probability 4.707397351921827e-07
Step 1: Top 3 tokens and probabilities:
Token '_1_1_': Probability 0.6457002758979797
Token '_9_0_': Probability 0.1870238035917282
Token '_8_0_': Probability 0.16727592051029205
Step 2: Top 3 tokens and probabilities:
Token '_0_0_': Probability 0.4196949005126953
Token '_1_0_': Probability 0.3270508050918579
Token '_2_0_': Probability 0.25325435400009155
Step 3: Top 3 tokens and probabilities:
Token '_._': Probability 1.0
Token '<eos>': Probability 6.363818361165841e-10
Token '_1_1_': Probability 2.546460819985441e-10
Step 4: Top 3 tokens and probabilities:
Token '_0_-1_': Probability 0.3384811580181122
Token '_1_-1_': Probability 0.3347078561782837
Token '_2_-1_': Probability 0.32681092619895935
Step 5: Top 3 tokens and probabilities:
Token '_3_-2_': Probability 0.33856016397476196
Token '_8_-2_': Probabi

In [22]:
def compare_results(n, verbose=False):
    src = test_dataset[n][0]
    src = torch.tensor(src).unsqueeze(1)
    src = translate(transformer, src, verbose=False)
    src = get_keys_by_values(vocab_y, src.tolist())
    src = extract_floats(src)
    tgt = get_keys_by_values(vocab_y, test_dataset[n][1])
    tgt = extract_floats(tgt)
    # convert to numpy arrays
    src = np.array(src)
    tgt = np.array(tgt)
    # calculate the absolute error
    error = np.abs(src - tgt)
    # calculate the mean absolute error
    mae = np.mean(error)
    # calculate the mean squared error
    mse = np.mean(error**2)
    # calculate the root mean squared error
    rmse = np.sqrt(mse)
    # calculate the mean absolute percentage error
    mape = np.mean(np.abs(error / tgt)) * 100
    # calculate the coefficient of determination
    ss_res = np.sum((tgt - src)**2)
    ss_tot = np.sum((tgt - np.mean(tgt))**2)
    r2 = 1 - (ss_res / ss_tot)
    if verbose:
        # summarize
        print('MAE: %.3f' % mae)
        print('MSE: %.3f' % mse)
        print('RMSE: %.3f' % rmse)
        print('MAPE: %.3f' % mape)
        print('R2: %.3f' % r2)

    return src, tgt, mape

Compare the predicted and actual lattice parameters

In [25]:
for i in range(20):
    a, b, mape = compare_results(i)
    print(f'dataset length {len(test_dataset[i][0])}, predicted:{a}, actual:{b}, mape:{mape}')



dataset length 202, predicted:[10.034 12.634 14.634], actual:[ 6.5144 10.05   17.824 ], mape:32.54555317667957
dataset length 318, predicted:[10.034 12.634 14.634], actual:[ 5.6059  9.3515 19.8996], mape:46.85071557120319
dataset length 165, predicted:[10.034 12.634 14.634], actual:[ 8.202 14.803 19.783], mape:21.005282727796395
dataset length 167, predicted:[10.034 12.634 14.634], actual:[14.75  15.3   15.543], mape:18.415336597597452
dataset length 153, predicted:[10.034 12.634 14.634], actual:[ 7.458 22.887 24.652], mape:39.99204159316957
dataset length 91, predicted:[10.034 12.634 14.634], actual:[ 7.146 19.443 19.762], mape:33.7944413824394
dataset length 124, predicted:[10.034 12.434 14.434], actual:[ 8.1959 10.6113 12.9909], mape:16.904194707792872
dataset length 174, predicted:[10.034 12.634 14.634], actual:[ 9.2492 10.2439 13.9779], mape:12.170276808011689
dataset length 184, predicted:[10.034 12.634 14.634], actual:[ 8.229 14.784 28.713], mape:28.503636397869194
dataset lengt