<a href="https://colab.research.google.com/github/rprimi/colB5BERT/blob/main/python/Miscelanea.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Saving embeddings to disk is a common task when working with models like BERT, because it allows you to avoid recomputing the embeddings for the same input multiple times. There are several ways to store embeddings, but two popular methods are:

1. Saving as NumPy arrays using `np.save`, or `np.savez` for multiple arrays at once.
2. Saving as a pickle file, which is a Python-specific binary format.

In your case, since you have a dictionary where each entry is a list of PyTorch tensors, you can first convert your tensors to numpy arrays, and then use one of these methods to store them. Here is an example of how you can do this with numpy:

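```python
import numpy as np

def save_embeddings_to_disk(embeddings, filename):
    # np.savez needs string keys and one array per key, so flatten
    # {layer: [tensor, ...]} into entries named "<layer>_<index>".
    # detach().cpu() makes the conversion safe for GPU or grad-tracking tensors.
    numpy_embeddings = {
        f"{layer}_{i}": t.detach().cpu().numpy()
        for layer, tensors in embeddings.items()
        for i, t in enumerate(tensors)
    }

    # We use ** to unpack the dictionary into keyword arguments
    np.savez(filename, **numpy_embeddings)
```

Then, you can load your embeddings back with the function below. Note that `np.savez` appends `.npz` to the filename if it is missing, so pass the full name when loading.

```python
def load_embeddings_from_disk(filename):
    embeddings = {}
    with np.load(filename) as data:
        # Keys look like "<layer>_<index>"; visit them in index order
        for key in sorted(data.files, key=lambda k: int(k.rsplit("_", 1)[1])):
            layer = key.rsplit("_", 1)[0]
            embeddings.setdefault(layer, []).append(data[key])
    return embeddings
```

This will give you a dictionary where the keys are the layers (as strings, e.g. `"6"`), and the values are lists of NumPy arrays, one per example.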

Please note that the files can get large, depending on the size of your embeddings and the number of examples. You can manage this by saving in chunks (several files) or by using a compressed format such as `np.savez_compressed`.
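
The chunked, compressed option can be sketched as follows, assuming the embeddings have already been converted to NumPy arrays. The helper names `save_in_chunks` and `load_chunks`, the file naming pattern, and the chunk size are illustrative assumptions, not part of the original code.

```python
import os
import tempfile

import numpy as np


def save_in_chunks(arrays, prefix, chunk_size=1000):
    """Write a list of arrays to several compressed .npz files (hypothetical helper)."""
    paths = []
    for start in range(0, len(arrays), chunk_size):
        chunk = arrays[start:start + chunk_size]
        path = f"{prefix}_{start // chunk_size:04d}.npz"
        # Keys carry the global index so the order can be restored on load
        np.savez_compressed(path, **{f"emb_{start + i}": a for i, a in enumerate(chunk)})
        paths.append(path)
    return paths


def load_chunks(paths):
    """Read the chunk files back into one list, in the original order."""
    loaded = {}
    for path in paths:
        with np.load(path) as data:
            for key in data.files:
                loaded[int(key.split("_")[1])] = data[key]
    return [loaded[i] for i in sorted(loaded)]


# Example with arrays of different lengths (e.g. different token counts)
arrays = [np.random.rand(n, 8).astype(np.float32) for n in (3, 5, 2, 7, 4)]
with tempfile.TemporaryDirectory() as tmp:
    paths = save_in_chunks(arrays, os.path.join(tmp, "emb"), chunk_size=2)
    restored = load_chunks(paths)
```

Storing one array per key avoids the problem of stacking examples with different token counts into a single array.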

### Python equivalent of R's `str()`

Yes, there is a similar command in Python called `dir()`, which returns a list of the names of all attributes and methods of the object passed to it¹. Another similar command is `vars()`, which returns the `__dict__` attribute of an object¹. There is also a function called `inspect.getmembers()`, which returns all the members of an object as a list of (name, value) pairs sorted by name¹. I hope this helps!

Source: conversation with Bing, 05/06/2023
(1) Is there a Python equivalent of R's str (), returning only the .... https://stackoverflow.com/questions/27749573/is-there-a-python-equivalent-of-rs-str-returning-only-the-structure-of-an-ob.
(2) What are Python pandas equivalents for R functions like str(), summary .... https://stackoverflow.com/questions/27637281/what-are-python-pandas-equivalents-for-r-functions-like-str-summary-and-he.
(3) Qual é a diferença entre 'string' e r'string' em Python?. https://pt.stackoverflow.com/questions/80545/qual-%c3%a9-a-diferen%c3%a7a-entre-string-e-rstring-em-python.
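
As a quick illustration of the three functions mentioned in the answer, here is a minimal sketch using a made-up `Item` class (the class and its attributes are hypothetical, chosen only for the example).

```python
import inspect


class Item:
    """Hypothetical example class."""

    def __init__(self, text, score):
        self.text = text
        self.score = score

    def upper(self):
        return self.text.upper()


item = Item("hello", 0.5)

# dir(): names of all attributes and methods, including inherited ones
public_names = [name for name in dir(item) if not name.startswith("_")]
print(public_names)  # ['score', 'text', 'upper']

# vars(): the instance's __dict__, i.e. only the data stored on the object
print(vars(item))  # {'text': 'hello', 'score': 0.5}

# inspect.getmembers(): (name, value) pairs sorted by name; a predicate filters them
methods = [name for name, _ in inspect.getmembers(item, inspect.ismethod)]
print(methods)  # ['__init__', 'upper']
```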

Understanding complex data structures in Python code often requires carefully examining the code and using built-in Python functions that give insights about these structures. Here are a few steps that might help:

1. **Print Statements**: Use `print()` statements liberally to output variables and their types. This can give you an idea of what data structures are being used at various points in the code.

2. **Type Checking**: Use the `type()` function to check the type of data structures. For instance, `type(my_var)` would return the type of `my_var`.

3. **Introspection**: Use `dir()` to view the attributes and methods of an object. For example, `dir(my_var)` would list all the methods that can be used with `my_var`.

4. **Length and Structure**: Use `len()` to find the length of a data structure. For dictionaries, lists, tuples, etc., you can also print individual elements.

5. **Variable Explorer**: IDEs such as PyCharm and Spyder include a variable explorer for inspecting variables and data structures. In a Jupyter notebook, the `%whos` magic lists the variables defined in the session.

6. **Debugger**: A debugger can help you step through the code one line at a time and examine the changes in your data structures as the code executes. Python's built-in debugger is `pdb`; calling `breakpoint()` pauses execution and opens it at that line.

7. **Visualization Tools**: For nested dictionaries, `pprint.pprint()` prints an indented view. For dataframes, pandas methods such as `df.info()` and `df.describe()` summarize the structure, and matplotlib or seaborn can plot the data.

Remember, understanding complex data structures can be challenging, but it is often a matter of breaking down the structure into smaller, more manageable parts and understanding those individually.
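
For instance, peeling a nested structure one level at a time might look like the sketch below. The `embeddings` dictionary here is a toy stand-in built from plain lists, assumed only for illustration.

```python
# Hypothetical nested structure: {layer: [one list of token vectors per example]}
embeddings = {
    6: [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]],
    11: [[[0.7, 0.8]]],
}

# Level 1: the outer container
print(type(embeddings))          # <class 'dict'>
print(len(embeddings))           # 2
print(list(embeddings.keys()))   # [6, 11]

# Level 2: the value stored under one key
first = embeddings[6]
print(type(first), len(first))   # <class 'list'> 2

# Level 3: the size of each example (number of token vectors)
tokens_per_example = [len(example) for example in first]
print(tokens_per_example)        # [2, 1]
```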

In Python, lists and dictionaries don't have dimensions in the way that arrays in NumPy or dataframes in pandas do. Instead, each container has a length, and a nested container has its own length at every level. You can use the built-in `len()` function to find out the number of elements in a list or dictionary.

For a list:

```python
my_list = [1, 2, 3, 4, 5]
print(len(my_list))  # Output: 5
```

For a dictionary:

```python
my_dict = {'one': 1, 'two': 2, 'three': 3}
print(len(my_dict))  # Output: 3
```

For nested structures, you'd need to use additional `len()` calls or use a loop or comprehension to iterate over the elements.

For example, for a list of lists (a 2D list), you could use a list comprehension:

```python
my_list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
dimensions = [len(inner_list) for inner_list in my_list_of_lists]
print(dimensions)  # Output: [3, 3, 3]
```

This tells you that you have a "3x3" list. Note that this only describes a shape for regularly-shaped data; if your inner lists have differing lengths, the result will contain different numbers, such as `[3, 2, 4]`.
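
A small helper can make the regular-versus-ragged distinction explicit. This is a sketch; `list_shape` is a hypothetical name, not a built-in.

```python
def list_shape(rows):
    """Return (n_rows, n_cols) for a regular list of lists, or None if it is ragged."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        return None
    return (len(rows), lengths.pop())


print(list_shape([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # (3, 3)
print(list_shape([[1, 2, 3], [4, 5], [6]]))           # None
```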

For a nested dictionary, things can get more complex, and you may need a recursive function to fully explore the structure if the nesting can be more than one level deep.
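
One possible shape for such a recursive function is sketched below. The name `describe` and its output format are assumptions made for this example; it loosely imitates what R's `str()` prints and also reports `.shape` for objects that have one, such as NumPy arrays or PyTorch tensors.

```python
def describe(obj, name="obj", indent=0, max_items=3):
    """Return lines summarizing a nested structure, loosely like R's str()."""
    pad = "  " * indent
    lines = []
    if isinstance(obj, dict):
        lines.append(f"{pad}{name}: dict with {len(obj)} keys")
        for key in list(obj)[:max_items]:
            lines.extend(describe(obj[key], repr(key), indent + 1, max_items))
    elif isinstance(obj, (list, tuple)):
        lines.append(f"{pad}{name}: {type(obj).__name__} of length {len(obj)}")
        for i, value in enumerate(obj[:max_items]):
            lines.extend(describe(value, f"[{i}]", indent + 1, max_items))
    elif hasattr(obj, "shape"):
        # NumPy arrays and PyTorch tensors expose .shape
        lines.append(f"{pad}{name}: {type(obj).__name__} with shape {tuple(obj.shape)}")
    else:
        lines.append(f"{pad}{name}: {type(obj).__name__} = {obj!r}")
    return lines


data = {"a": [1, 2], "b": {"c": "x"}}
print("\n".join(describe(data, "data")))
```

Only the first `max_items` entries of each container are shown, which keeps the output readable for long lists of embeddings.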

### Old functions

In [None]:

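import torch
from torch.nn.functional import cosine_similarity


# First attempt: compares token j of A with token j of B only, so in general
# it needs both tensors to have the same number of tokens.
def calculate_cosine_similarity(list_A, list_B):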
    result = []
    for tensor_A in list_A:
        for tensor_B in list_B:
            similarity = cosine_similarity(tensor_A.unsqueeze(0), tensor_B.unsqueeze(0), dim=-1)
            result.append(similarity)
    return result


def calculate_cosine_similarity(list_A, list_B):
    result = []
    for tensor_A in list_A:
        for tensor_B in list_B:
            # Expand dimensions so that shapes are [1, n_tokens_A, 768] and [n_tokens_B, 1, 768]
            tensor_A_exp = tensor_A.unsqueeze(0)
            tensor_B_exp = tensor_B.unsqueeze(1)

            # Repeat tensors so that shapes are [n_tokens_B, n_tokens_A, 768] and [n_tokens_B, n_tokens_A, 768]
            tensor_A_rep = tensor_A_exp.repeat(tensor_B.shape[0], 1, 1)
            tensor_B_rep = tensor_B_exp.repeat(1, tensor_A.shape[0], 1)
            similarity = cosine_similarity(tensor_A_rep, tensor_B_rep, dim=-1)
            result.append(similarity)
    return result


### ChatGPT question

I have a list A of i elements. Each element is a torch tensor of n_tokens1 x 768 (the embedding dimension). Then I have a list B of p elements. Each element is a torch tensor of n_tokens2 x 768 (the embedding dimension). I want to calculate the cosine similarity between the n_tokens1 embedding vectors and the n_tokens2 embedding vectors. I want the result to be a list of i elements by p elements of n_tokens1 x n_tokens2 matrices containing the cosine similarities. Please create code in Python using efficient vectorized operations.

First, to compute cosine similarity between tensors, you can use the `torch.nn.functional.cosine_similarity` function provided by PyTorch.

Here's a function that computes cosine similarity between every pair of tokens in `A` and `B`, assuming that `A` and `B` are 3D PyTorch tensors of shape `(batch_size, num_of_tokens, embedding_dim)` with the same batch size.

```python
import torch
import torch.nn.functional as F

def compute_cosine_similarity(A, B):
    batch_size1, num_of_tokens1, embedding_dim1 = A.shape
    batch_size2, num_of_tokens2, embedding_dim2 = B.shape

    if embedding_dim1 != embedding_dim2:
        raise ValueError("Embedding dimensions must match!")

    if batch_size1 != batch_size2:
        raise ValueError("Batch sizes must match!")

    # We will compute cosine similarity for each example in the batch separately
    cosine_similarities = []
    for i in range(batch_size1):
        a = A[i]
        b = B[i]

        # Compute cosine similarity between all pairs of tokens.
        # The resulting matrix will have shape (num_of_tokens1, num_of_tokens2)
        similarity_matrix = torch.zeros((num_of_tokens1, num_of_tokens2))

        for j in range(num_of_tokens1):
            for k in range(num_of_tokens2):
                similarity_matrix[j, k] = F.cosine_similarity(a[j], b[k], dim=0)

        cosine_similarities.append(similarity_matrix)

    return cosine_similarities
```

This function will return a list of 2D tensors, each with shape `(num_of_tokens1, num_of_tokens2)`. Each tensor in the list corresponds to an example in the batch. The values in the 2D tensor represent the cosine similarity between the corresponding tokens in `A` and `B`. Note that this is quite a computationally expensive way to compute these similarities due to the nested for loops, and it might be worth looking into more efficient methods if this becomes a bottleneck in your code.

Can you do this using vector operations?

```python
import torch
import torch.nn.functional as F

def compute_cosine_similarity_vectorized(A, B):
    # Check for matching dimensions
    batch_size1, num_of_tokens1, embedding_dim1 = A.shape
    batch_size2, num_of_tokens2, embedding_dim2 = B.shape

    if embedding_dim1 != embedding_dim2:
        raise ValueError("Embedding dimensions must match!")

    if batch_size1 != batch_size2:
        raise ValueError("Batch sizes must match!")

    A_unsqueezed = A.unsqueeze(2)  # Shape becomes [batch_size, num_of_tokens1, 1, embedding_dim]
    B_unsqueezed = B.unsqueeze(1)  # Shape becomes [batch_size, 1, num_of_tokens2, embedding_dim]

    # Calculate cosine similarity. The result has shape [batch_size, num_of_tokens1, num_of_tokens2]
    similarity_matrix = F.cosine_similarity(A_unsqueezed, B_unsqueezed, dim=-1)

    return similarity_matrix

def get_bert_embeddings(bert_model, examples, layers):
    # Move model to GPU if available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    bert_model = bert_model.to(device)

    embeddings = {layer: [] for layer in layers}
    for ex_ids in examples:
        # Convert data to tensor and move to GPU
        ex_ids_tensor = torch.tensor([ex_ids]).to(device)
        with torch.no_grad():
            # Output includes 'last_hidden_state', 'pooler_output', 'hidden_states'
            output = bert_model(ex_ids_tensor, output_hidden_states=True)
            hidden_states = output.hidden_states
            for layer in layers:
                # Verify layer index is valid
                if layer < 0 or layer >= len(hidden_states):
                    print(f"Invalid layer {layer}")
                else:
                    # Hidden states is a tuple. Indexing into it gives a tensor of shape
                    # (batch_size, sequence_length, hidden_size). Since batch_size is 1,
                    # we remove the batch dimension.
                    layer_output = hidden_states[layer].squeeze(0)
                    # Convert back to CPU for further processing or storage
                    embeddings[layer].append(layer_output.to('cpu'))
    return embeddings
```
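
The original question asked about two lists of tensors with different token counts, while the batched functions above expect a single 3D tensor per side. A minimal sketch for the list case is shown below: it L2-normalizes each token vector once and then uses a matrix product, which is equivalent to cosine similarity. The function name `pairwise_token_similarity` and the toy data are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def pairwise_token_similarity(list_A, list_B):
    """Return result[i][p], a [n_tokens_A_i, n_tokens_B_p] cosine-similarity matrix."""
    # L2-normalize each token vector once; cosine similarity is then a dot product
    norm_A = [F.normalize(a, dim=-1) for a in list_A]
    norm_B = [F.normalize(b, dim=-1) for b in list_B]
    return [[a @ b.T for b in norm_B] for a in norm_A]


# Toy data: 2 and 3 "sentences" with different token counts, embedding size 768
list_A = [torch.randn(4, 768), torch.randn(6, 768)]
list_B = [torch.randn(5, 768), torch.randn(3, 768), torch.randn(7, 768)]
sims = pairwise_token_similarity(list_A, list_B)
print(len(sims), len(sims[0]), sims[1][2].shape)  # 2 3 torch.Size([6, 7])
```

Normalizing up front means each tensor is processed once instead of once per pair, and no padding is needed because every pair gets its own matrix.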


In [None]:

import torch
from tqdm import tqdm


def calculate_cosine_similarity(list_A, list_B, topk=5):
    result = []
    for tensor_A in list_A:
        similarities = []
        for tensor_B in list_B:
            # Expand dimensions so that shapes are [n_tokens_A, 1, 768] and [1, n_tokens_B, 768]
            tensor_A_exp = tensor_A.unsqueeze(1)
            tensor_B_exp = tensor_B.unsqueeze(0)
            # Compute cosine similarity
            similarity = torch.nn.functional.cosine_similarity(tensor_A_exp, tensor_B_exp, dim=-1)
            similarities.append(similarity)
        # Concatenate all similarity scores and find the topk
        similarities = torch.cat(similarities, dim=-1)
        topk_values, topk_indices = torch.topk(similarities, topk, dim=-1)
        result.append((topk_values, topk_indices))
    return result


def calculate_cosine_similarity(list_A, list_B, topk=5, batch_size=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    result = []

    # Move tensors to GPU
    list_A = [tensor.to(device) for tensor in list_A]
    list_B = [tensor.to(device) for tensor in list_B]

    # Process batch_size tensor_A at a time
    for i in range(0, len(list_A), batch_size):
        batch_A = list_A[i:i+batch_size]
        # torch.stack requires every tensor in the batch to have the same n_tokens_A
        batch_A = torch.stack(batch_A).unsqueeze(2)  # [batch_size, n_tokens_A, 1, 768]
        similarities = []

        for tensor_B in list_B:
            tensor_B = tensor_B.unsqueeze(0).unsqueeze(0)  # [1, 1, n_tokens_B, 768]
            similarity = torch.nn.functional.cosine_similarity(batch_A, tensor_B, dim=-1)  # [batch_size, n_tokens_A, n_tokens_B]
            similarities.append(similarity)

        similarities = torch.cat(similarities, dim=-1)  # [batch_size, n_tokens_A, n_tokens_B*len(list_B)]
        topk_values, topk_indices = torch.topk(similarities, topk, dim=-1)  # [batch_size, n_tokens_A, topk]
        result.extend(zip(topk_values, topk_indices))

    # Move tensors back to CPU
    result = [(values.to('cpu'), indices.to('cpu')) for values, indices in result]

    return result


def calculate_cosine_similarity(list_A, list_B, topk=5, batch_size=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    result = []

    # Move tensors to GPU
    list_A = [tensor.to(device) for tensor in list_A]
    list_B = [tensor.to(device) for tensor in list_B]

    # Process batch_size tensor_A at a time
    for i in tqdm(range(0, len(list_A), batch_size), desc='Processing', dynamic_ncols=True):
        batch_A = list_A[i:i+batch_size]
        batch_A = torch.nn.utils.rnn.pad_sequence(batch_A, batch_first=True).unsqueeze(2)  # [batch_size, max_n_tokens_A, 1, 768]
        similarities = []

        for tensor_B in list_B:
            tensor_B = tensor_B.unsqueeze(0).unsqueeze(0)  # [1, 1, n_tokens_B, 768]
            similarity = torch.nn.functional.cosine_similarity(batch_A, tensor_B, dim=-1)  # [batch_size, max_n_tokens_A, n_tokens_B]
            similarities.append(similarity)

        similarities = torch.cat(similarities, dim=-1)  # [batch_size, max_n_tokens_A, n_tokens_B*len(list_B)]
        topk_values, topk_indices = torch.topk(similarities, topk, dim=-1)  # [batch_size, max_n_tokens_A, topk]
        result.extend(zip(topk_values, topk_indices))

    # Move tensors back to CPU
    result = [(values.to('cpu'), indices.to('cpu')) for values, indices in result]

    return result
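
The top-k indices returned by the functions above point into the concatenated last dimension, so they do not say directly which element of `list_B` a match came from. The sketch below maps such an index back to an (element, token) pair. The helper `locate` is hypothetical, and `token_counts_B` is assumed to be `[t.shape[0] for t in list_B]`.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(flat_index, token_counts_B):
    """Map a column index of the concatenated matrix to (element of list_B, token)."""
    # Cumulative end offsets, e.g. [5, 3, 7] -> [5, 8, 15]
    ends = list(accumulate(token_counts_B))
    element = bisect_right(ends, flat_index)
    start = ends[element - 1] if element > 0 else 0
    return element, flat_index - start


token_counts_B = [5, 3, 7]
print(locate(4, token_counts_B))   # (0, 4)
print(locate(5, token_counts_B))   # (1, 0)
print(locate(14, token_counts_B))  # (2, 6)
```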



In [None]:
!rm /content/drive/MyDrive/*.npz


In [None]:
!pip install mysql-connector-python
import logging

import mysql.connector
import torch
from tqdm import tqdm



# Set up logging
logging.basicConfig(level=logging.INFO)  # you can change the level to DEBUG if you want more detailed logs
logger = logging.getLogger(__name__)

def calculate_cosine_similarity1_db(list_A, list_B, db_config, table_name, batch_size=1):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move list_A tensors to GPU
    list_A = [tensor.to(device) for tensor in list_A]

    # Connect to the MySQL server
    cnx = mysql.connector.connect(**db_config)
    cursor = cnx.cursor()

    # Create a new table
    cursor.execute(f"CREATE TABLE IF NOT EXISTS {table_name} (n_batch INT, num_element_listA INT, num_token_A INT, num_element_list_B INT, num_token_B INT, cos_sim FLOAT)")

    # Process batch_size tensor_A at a time
    for n_batch in tqdm(range(0, len(list_A), batch_size), desc='Processing', dynamic_ncols=True):
        batch_A = list_A[n_batch:n_batch+batch_size]
        batch_A = torch.nn.utils.rnn.pad_sequence(batch_A, batch_first=True).unsqueeze(2)  # [batch_size, max_n_tokens_A, 1, 768]
        batch_data = []

        for num_element_list_B, tensor_B in enumerate(list_B):
            tensor_B = tensor_B.to(device).unsqueeze(0).unsqueeze(0)  # [1, 1, n_tokens_B, 768]
            similarity = torch.nn.functional.cosine_similarity(batch_A, tensor_B, dim=-1)  # [batch_size, max_n_tokens_A, n_tokens_B]
            similarity = similarity.half()  # Reduce precision to float16
            similarity_np = similarity.cpu().numpy()  # Move similarity to CPU and convert to NumPy array

            # Iterate over the elements in the similarity matrix
            for num_element_listA in range(similarity_np.shape[0]):
                for num_token_A in range(similarity_np.shape[1]):
                    for num_token_B in range(similarity_np.shape[2]):
                        # Extract the corresponding cosine similarity
                        cos_sim = similarity_np[num_element_listA, num_token_A, num_token_B]

                        # Add the row to the batch_data list
                        batch_data.append((n_batch, num_element_listA, num_token_A, num_element_list_B, num_token_B, float(cos_sim)))

            del tensor_B  # Delete tensor_B from GPU memory

        # Insert all rows in batch_data into the database
        query = f"INSERT INTO {table_name} (n_batch, num_element_listA, num_token_A, num_element_list_B, num_token_B, cos_sim) VALUES (%s, %s, %s, %s, %s, %s)"
        cursor.executemany(query, batch_data)

        logger.info(f"Inserted {len(batch_data)} rows into the database")

    # Make sure data is committed to the database
    cnx.commit()
    cursor.close()
    cnx.close()

# Example usage
db_config = {'host': 'localhost', 'database': 'test_db', 'user': 'test_user', 'password': 'test_password'}
calculate_cosine_similarity1_db(list_A, list_B, db_config=db_config, table



In [None]:

# print(batch)
# print(element_A)
# print(batch_A.shape[1])
# print(element_B)
# print(tensor_B.shape[2])
# print(similarity[element_A])

calculate_cosine_similarity1_db(embeddings_itens[6], embeddings_posts[6], db_config=db_config,table_name="cos_sim_L6",  batch_size=1)


In [None]:
import logging

import h5py
import torch
from tqdm import tqdm

logging.basicConfig(level=logging.INFO)  # you can change the level to DEBUG if you want more detailed logs
logger = logging.getLogger(__name__)


def calculate_cosine_similarity1_hdf5b(list_A, list_B, filename, dataset_name, max_tokens_A, max_tokens_B, num_batches, num_list_A, num_list_B, batch_size=1):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move list_A tensors to GPU
    list_A = [tensor.to(device) for tensor in list_A]

    # Create a new HDF5 file
    with h5py.File(filename, "w") as f:
        # Create the dataset
        dset = f.create_dataset(dataset_name, (num_batches, num_list_A, num_list_B, max_tokens_A, max_tokens_B), dtype='float16')

        # Process batch_size tensor_A at a time
        for n_batch in tqdm(range(0, len(list_A), batch_size), desc='Processing', dynamic_ncols=True):
            batch_A = list_A[n_batch:n_batch+batch_size]
            batch_A = torch.nn.utils.rnn.pad_sequence(batch_A, batch_first=True).unsqueeze(2)  # [batch_size, max_n_tokens_A, 1, 768]

            for n_listA_elem, tensor_A in enumerate(batch_A):
                for n_listB_elem, tensor_B in enumerate(list_B):
                    tensor_B = tensor_B.to(device).unsqueeze(0).unsqueeze(0)  # [1, 1, n_tokens_B, 768]
                    similarity = torch.nn.functional.cosine_similarity(tensor_A.unsqueeze(0), tensor_B, dim=-1)  # [1, n_tokens_A, n_tokens_B]
                    similarity = similarity.half()  # Reduce precision to float16
                    similarity_np = similarity.cpu().numpy()  # Move similarity to CPU and convert to NumPy array

                    # Write the similarities to the HDF5 dataset
                    dset[n_batch, n_listA_elem, n_listB_elem, :similarity_np.shape[1], :similarity_np.shape[2]] = similarity_np

                    del tensor_B  # Delete tensor_B from GPU memory
                    logger.info(f"batch: {n_batch}, A elem: {n_listA_elem}, B elem: {n_listB_elem} , similarity shape: {similarity_np.shape[1]} X {similarity_np.shape[2]}")


In [None]:
#cosim_top5_L6 = calculate_cosine_similarity2(embeddings_itens[6][0:3], embeddings_posts[6][0:10], topk = 5, batch_size=2)
filename = "cosim_L11"
dataset_name = "db_cosim_L1"
max_tokens_A = 512
max_tokens_B = 23
num_batches = len(embeddings_posts[11])
num_list_A = len(embeddings_posts[11])
num_list_B = len(embeddings_itens[11])
batch_size = 1

calculate_cosine_similarity1_hdf5b(embeddings_posts[11], embeddings_itens[11], filename, dataset_name, max_tokens_A, max_tokens_B, num_batches, num_list_A, num_list_B, batch_size)

In [None]:
import h5py
import torch
from tqdm import tqdm

def calculate_cosine_similarity1_hdf5(list_A, list_B, filename, dataset_name, batch_size=1):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move list_A tensors to GPU
    list_A = [tensor.to(device) for tensor in list_A]

    # Create a new HDF5 file
    with h5py.File(filename, "w") as f:
        # Process batch_size tensor_A at a time
        for n_batch in tqdm(range(0, len(list_A), batch_size), desc='Processing', dynamic_ncols=True):
            batch_A = list_A[n_batch:n_batch+batch_size]
            batch_A = torch.nn.utils.rnn.pad_sequence(batch_A, batch_first=True).unsqueeze(2)  # [batch_size, max_n_tokens_A, 1, 768]

            for num_element_list_B, tensor_B in enumerate(list_B):
                tensor_B = tensor_B.to(device).unsqueeze(0).unsqueeze(0)  # [1, 1, n_tokens_B, 768]
                similarity = torch.nn.functional.cosine_similarity(batch_A, tensor_B, dim=-1)  # [batch_size, max_n_tokens_A, n_tokens_B]
                similarity = similarity.half()  # Reduce precision to float16
                similarity_np = similarity.cpu().numpy()  # Move similarity to CPU and convert to NumPy array

                # Create a new dataset for each batch, element of list_A, and element of list_B
                dset = f.create_dataset(f"{dataset_name}/batch_{n_batch}_A_{num_element_list_B}_B", data=similarity_np, dtype='float16')

                del tensor_B  # Delete tensor_B from GPU memory
