<a href="https://colab.research.google.com/github/rosafilgueira/PyCodeSearch/blob/main/Registry_search_multimodel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning Transformer models for code


## UniXcoder RepoSim

We are going to use two models:
- Lazyhope/unixcoder-nine-advtest
- Lazyhope/unixcoder-clone-detection

Both were trained following the UniXcoder code-search instructions: https://github.com/microsoft/CodeBERT/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Both models use a bi-encoder approach: the query and each registry entry are embedded independently and then compared. We use them for code similarity and text similarity.
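To make the bi-encoder idea concrete, here is a minimal, dependency-free sketch. It is not the UniXcoder model: `toy_encode` is a hypothetical stand-in that produces bag-of-words count vectors, used only to show that registry entries and the query are encoded separately and compared afterwards with cosine similarity.

```python
import math
import re
from collections import Counter

def toy_encode(text):
    """Hypothetical stand-in for a neural encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Registry side: each entry is encoded once, independently of any query.
docs = [
    "This function adds two numbers.",
    "This function multiplies two numbers.",
]
doc_vectors = [toy_encode(doc) for doc in docs]

# Query side: the query is encoded on its own, then compared to every entry.
query_vector = toy_encode("adds two numbers")
scores = [cosine(query_vector, vec) for vec in doc_vectors]
best_doc = docs[scores.index(max(scores))]
print(best_doc)
```

Because the registry vectors do not depend on the query, they can be computed once and reused for every search, which is what the notebook does when it stores embeddings as dataframe columns.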

In [44]:
!pip install tensorflow
!pip install -U accelerate
!pip install docarray
!pip install pandas
!pip install torch
!pip install transformers
!pip install tqdm
!pip install scikit-learn



## Add database example

In [90]:
import pandas as pd

# Define the codes and their corresponding docstrings
codes = [
    "def add(a, b):\n    return a + b",
    "def subtract(a, b):\n    return a - b",
    "def multiply(a, b):\n    return a * b",
    "def divide(a, b):\n    return a / b",
    "def power(a, b):\n    return a ** b",
    "def modulus(a, b):\n    return a % b"
]

docs = [
    "This function adds two numbers.",
    "This function subtracts the second number from the first.",
    "This function multiplies two numbers.",
    "This function divides the first number by the second.",
    "This function raises the first number to the power of the second.",
    "This function returns the remainder when the first number is divided by the second."
]

# Create the dataframe
registry = pd.DataFrame({
    'code': codes,
    'doc': docs
})

### dataframe
registry

|   | code | doc |
|---|------|-----|
| 0 | `def add(a, b):\n return a + b` | This function adds two numbers. |
| 1 | `def subtract(a, b):\n return a - b` | This function subtracts the second number from... |
| 2 | `def multiply(a, b):\n return a * b` | This function multiplies two numbers. |
| 3 | `def divide(a, b):\n return a / b` | This function divides the first number by the ... |
| 4 | `def power(a, b):\n return a ** b` | This function raises the first number to the p... |
| 5 | `def modulus(a, b):\n return a % b` | This function returns the remainder when the f... |


## Load the models:

- model_code_to_code -- for code-to-code search
- model_text_to_code -- for text-to-code search


In [91]:
from transformers import pipeline

model_code_to_code = pipeline(
    model="Lazyhope/unixcoder-clone-detection",
    trust_remote_code=True,
    device_map="auto")

model_text_to_code = pipeline(
    model="Lazyhope/RepoSim",
    trust_remote_code=True,
    device_map="auto")



[*] Consider setting GitHub token to avoid hitting rate limits. 
For more info, see: https://docs.github.com/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token


## Turn code and docstring into torch embeddings

In [92]:
import torch

def encode(string, model_type):
    """Return a 1-D embedding: model_type=1 uses the text-to-code model, any other value the code-to-code model."""
    if model_type == 1:
        with torch.no_grad():
            embedding = model_text_to_code.encode(string, 512)
        return embedding.squeeze()

    with torch.no_grad():
        embedding = model_code_to_code(string, truncation=True, max_length=512)
    if isinstance(embedding, list):
        embedding = torch.tensor(embedding)
    # Drop the batch dimension and keep the first row
    # (the first token's vector for a [1, seq_len, hidden] output)
    token_embeddings = embedding.squeeze()
    return token_embeddings[0]

# model_type=1: text-to-code model, applied to the docstrings
registry["doc_embeddings"] = registry["doc"].apply(encode, model_type=1)
# model_type=2: code-to-code model, applied to the code
registry["code_embeddings"] = registry["code"].apply(encode, model_type=2)


registry

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


|   | code | doc | doc_embeddings | code_embeddings |
|---|------|-----|----------------|-----------------|
| 0 | `def add(a, b):\n return a + b` | This function adds two numbers. | [tensor(-1.5903), tensor(-0.3973), tensor(3.99... | [tensor(2.1682), tensor(-0.6963), tensor(1.542... |
| 1 | `def subtract(a, b):\n return a - b` | This function subtracts the second number from... | [tensor(-2.4678), tensor(-2.2801), tensor(0.66... | [tensor(0.2957), tensor(0.9715), tensor(0.7843... |
| 2 | `def multiply(a, b):\n return a * b` | This function multiplies two numbers. | [tensor(-2.3068), tensor(1.3498), tensor(3.078... | [tensor(3.8874), tensor(-2.6447), tensor(-0.01... |
| 3 | `def divide(a, b):\n return a / b` | This function divides the first number by the ... | [tensor(-3.0391), tensor(-0.0748), tensor(1.79... | [tensor(2.7210), tensor(0.7029), tensor(1.5855... |
| 4 | `def power(a, b):\n return a ** b` | This function raises the first number to the p... | [tensor(-2.9461), tensor(-0.5175), tensor(2.37... | [tensor(3.9258), tensor(-2.2622), tensor(1.259... |
| 5 | `def modulus(a, b):\n return a % b` | This function returns the remainder when the f... | [tensor(-2.4903), tensor(-1.3584), tensor(1.60... | [tensor(1.8459), tensor(1.1119), tensor(1.2777... |
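The code-to-code branch of `encode` squeezes the model output and keeps index `0`. Assuming that output is shaped `[1, seq_len, hidden]` (one vector per token), this amounts to first-token pooling. The sketch below illustrates it on a plain nested list; `toy_features` and both helper functions are hypothetical names, and mean pooling is shown only as a contrasting alternative, not as something the notebook uses.

```python
def first_token_embedding(features):
    """Return the first token's vector from a [1][seq_len][hidden] nested list."""
    token_vectors = features[0]  # drop the batch dimension
    return token_vectors[0]      # keep only the first token

def mean_pooled_embedding(features):
    """Alternative pooling: average every token vector, dimension by dimension."""
    token_vectors = features[0]
    hidden_size = len(token_vectors[0])
    return [
        sum(token[i] for token in token_vectors) / len(token_vectors)
        for i in range(hidden_size)
    ]

# Toy output: batch of 1, three tokens, hidden size 2 (made-up numbers).
toy_features = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]

first = first_token_embedding(toy_features)
mean = mean_pooled_embedding(toy_features)
print(first, mean)
```

Either way the result is a single fixed-size vector per input, which is what the similarity steps later in the notebook expect.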


### Text-to-code Similarity

In [93]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [94]:
# Compute user query embeddings
user_query_docs = "Function that adds two numbers"
# model_type=1 selects the text-to-code model
user_query_docs_emb = encode(user_query_docs, model_type=1)


# Convert document embeddings to numpy arrays
registry["doc_embeddings"] = registry["doc_embeddings"].apply(lambda x: np.array(x))

# Compute cosine similarity
user_query_emb = np.array(user_query_docs_emb)
cos_similarities = cosine_similarity(user_query_emb.reshape(1, -1), np.vstack(registry["doc_embeddings"]))

# Add cosine similarity scores as a new column
registry_doc=registry.copy()
registry_doc["cosine_similarity_doc"] = cos_similarities[0]

# Sort the dataframe based on cosine similarity
sorted_df = registry_doc.sort_values(by="cosine_similarity_doc", ascending=False)

# Retrieve the top 5 most similar documents
top_5_similar_docs = sorted_df.head(5)

In [95]:
top_5_similar_docs

|   | code | doc | doc_embeddings | code_embeddings | cosine_similarity_doc |
|---|------|-----|----------------|-----------------|-----------------------|
| 0 | `def add(a, b):\n return a + b` | This function adds two numbers. | [-1.590325, -0.39731324, 3.997744, 2.6590736, ... | [tensor(2.1682), tensor(-0.6963), tensor(1.542... | 0.976803 |
| 2 | `def multiply(a, b):\n return a * b` | This function multiplies two numbers. | [-2.3068378, 1.3498034, 3.0785217, 1.3164998, ... | [tensor(3.8874), tensor(-2.6447), tensor(-0.01... | 0.704091 |
| 1 | `def subtract(a, b):\n return a - b` | This function subtracts the second number from... | [-2.4677699, -2.280132, 0.6681174, 1.4311584, ... | [tensor(0.2957), tensor(0.9715), tensor(0.7843... | 0.700667 |
| 3 | `def divide(a, b):\n return a / b` | This function divides the first number by the ... | [-3.0390859, -0.07484988, 1.7925602, 0.2812554... | [tensor(2.7210), tensor(0.7029), tensor(1.5855... | 0.57978 |
| 5 | `def modulus(a, b):\n return a % b` | This function returns the remainder when the f... | [-2.490343, -1.3584495, 1.6064605, -0.03004119... | [tensor(1.8459), tensor(1.1119), tensor(1.2777... | 0.545701 |
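The retrieval step above boils down to: score every registry entry against the query with cosine similarity, sort in descending order, and keep the first few. A minimal sketch of that logic without pandas or scikit-learn follows; the vectors are made-up toy values and `top_k` is a hypothetical helper, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, entries, k=5):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in entries]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy registry: (name, embedding) pairs with invented 3-dimensional vectors.
toy_registry = [
    ("multiply", [0.0, 1.0, 0.0]),
    ("add", [1.0, 0.0, 0.0]),
    ("subtract", [0.8, 0.6, 0.0]),
]

results = top_k([1.0, 0.0, 0.0], toy_registry, k=2)
for name, score in results:
    print(name, round(score, 3))
```

Here `k` plays the role of `head(5)` in the notebook cell, and the sort mirrors `sort_values(..., ascending=False)`.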


### Code-to-code Similarity

In [96]:
# Compute user query embeddings
user_query_code = "def add_numbers(a, b):\n return a +"
# model_type=2 selects the code-to-code model
user_query_code_emb = encode(user_query_code, model_type=2)
# Convert code embeddings to numpy arrays
registry["code_embeddings"] = registry["code_embeddings"].apply(lambda x: np.array(x))

# Compute cosine similarity
user_query_emb_c = np.array(user_query_code_emb)
cos_similarities = cosine_similarity(user_query_emb_c.reshape(1, -1), np.vstack(registry["code_embeddings"]))

# Add cosine similarity scores as a new column
registry_code=registry.copy()
registry_code["cosine_similarity_code"] = cos_similarities[0]

# Sort the dataframe based on cosine similarity
sorted_df_code = registry_code.sort_values(by="cosine_similarity_code", ascending=False)

# Retrieve the top 5 most similar code snippets
top_5_similar_code = sorted_df_code.head(5)




In [97]:
top_5_similar_code

|   | code | doc | doc_embeddings | code_embeddings | cosine_similarity_code |
|---|------|-----|----------------|-----------------|------------------------|
| 0 | `def add(a, b):\n return a + b` | This function adds two numbers. | [-1.590325, -0.39731324, 3.997744, 2.6590736, ... | [2.1682317, -0.6963131, 1.542846, 3.1269562, 1... | 0.797922 |
| 2 | `def multiply(a, b):\n return a * b` | This function multiplies two numbers. | [-2.3068378, 1.3498034, 3.0785217, 1.3164998, ... | [3.8874378, -2.6447327, -0.012378594, 2.499000... | 0.74938 |
| 4 | `def power(a, b):\n return a ** b` | This function raises the first number to the p... | [-2.946074, -0.51749504, 2.3797052, 2.3707116,... | [3.9257975, -2.2621982, 1.2599363, 4.382963, 0... | 0.688891 |
| 3 | `def divide(a, b):\n return a / b` | This function divides the first number by the ... | [-3.0390859, -0.07484988, 1.7925602, 0.2812554... | [2.7209783, 0.7028546, 1.585468, 1.0704353, 2.... | 0.556907 |
| 5 | `def modulus(a, b):\n return a % b` | This function returns the remainder when the f... | [-2.490343, -1.3584495, 1.6064605, -0.03004119... | [1.845908, 1.1119063, 1.2776934, 0.43036178, 2... | 0.406185 |
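Cosine similarity is the dot product of two vectors after each has been scaled to unit length. If the registry is queried many times, one option is to L2-normalize the stored embeddings once so that each search only needs dot products. The sketch below illustrates this with invented 2-dimensional vectors; `l2_normalize`, `dot`, and `toy_code_embeddings` are hypothetical names, and this is an optional variation rather than what the notebook does.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Invented embeddings, normalized once up front.
toy_code_embeddings = {
    "add": [3.0, 4.0],
    "multiply": [4.0, 3.0],
    "modulus": [0.0, 5.0],
}
unit_embeddings = {
    name: l2_normalize(vec) for name, vec in toy_code_embeddings.items()
}

# At query time: normalize the query, then a dot product gives the cosine.
query = l2_normalize([6.0, 8.0])
scores = {name: dot(query, vec) for name, vec in unit_embeddings.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The ranking is identical to the one obtained from `cosine_similarity` on the raw vectors, since normalization changes only the vector lengths, not the angles between them.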


# Code Summarization

In [87]:
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')


In [88]:
def generate_summary(text):
    """Generate a short natural-language summary (at most 20 tokens) of a code snippet."""
    input_ids = tokenizer.encode(text, return_tensors="pt")
    generated_ids = model.generate(input_ids, max_length=20)
    summary = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    return summary

In [89]:
# Work on a copy so the original registry dataframe stays unchanged
registry_summary = registry.copy()
registry_summary["summarization"] = registry_summary["code"].apply(generate_summary)
registry_summary


|   | code | doc | doc_embeddings | code_embeddings | summarization |
|---|------|-----|----------------|-----------------|---------------|
| 0 | `def add(a, b):\n return a + b` | This function adds two numbers. | [-1.590325, -0.39731324, 3.997744, 2.6590736, ... | [2.1682317, -0.6963131, 1.542846, 3.1269562, 1... | Add two vectors. |
| 1 | `def subtract(a, b):\n return a - b` | This function subtracts the second number from... | [-2.4677699, -2.280132, 0.6681174, 1.4311584, ... | [0.2956554, 0.97152746, 0.78425086, 3.7197409,... | Subtract two vectors. |
| 2 | `def multiply(a, b):\n return a * b` | This function multiplies two numbers. | [-2.3068378, 1.3498034, 3.0785217, 1.3164998, ... | [3.8874378, -2.6447327, -0.012378594, 2.499000... | Multiply two vectors. |
| 3 | `def divide(a, b):\n return a / b` | This function divides the first number by the ... | [-3.0390859, -0.07484988, 1.7925602, 0.2812554... | [2.7209783, 0.7028546, 1.585468, 1.0704353, 2.... | Divide two numbers. |
| 4 | `def power(a, b):\n return a ** b` | This function raises the first number to the p... | [-2.946074, -0.51749504, 2.3797052, 2.3707116,... | [3.9257975, -2.2621982, 1.2599363, 4.382963, 0... | Returns a power of b. |
| 5 | `def modulus(a, b):\n return a % b` | This function returns the remainder when the f... | [-2.490343, -1.3584495, 1.6064605, -0.03004119... | [1.845908, 1.1119063, 1.2776934, 0.43036178, 2... | Returns a % b. |
