# Embedding Bug Reports with OpenAI <code>text-embedding-3-small</code>

<u><b>Full Procedure</b></u>

1. Load Dependencies and Dataset
- The libraries needed for data handling, text pre-processing, and the OpenAI API are imported, and a custom text pre-processor and an OpenAI client are initialized
- The dataset of resolved Mozilla Firefox bug reports is loaded into a pandas DataFrame

2. Preprocess the Text
- The Summary and Description fields are concatenated into a single Concat field, which is the input to the embedding model in the following steps
- The Concat field is cleaned with the custom <code>PreProcessor</code> class to prepare the text for embedding

3. Tokenize and Embed the Text
- Each text entry is tokenized using the <code>tiktoken</code> library to count its tokens and confirm that it fits within the input limit of the OpenAI model
- Each preprocessed bug report is embedded using the OpenAI embedding API. Embeddings are stored along with their corresponding metadata

4. Save the Embeddings
- The computed embeddings and metadata are saved to a file for future use in tasks like classification, fine-tuning, clustering and semantic search

### Imports and Read Data

In [1]:
from typing import List
import pandas as pd
from text_preprocessing import PreProcessor

import tiktoken
from openai import OpenAI

client = OpenAI()
# Custom pre-processor class
pp = PreProcessor()

In [2]:
# Load the bug reports exported by the request_data notebook
data = pd.read_csv("bug_reports_mozilla_firefox_resolved_fixed_comments.csv")

In [3]:
data.head()

Unnamed: 0,Bug ID,Type,Summary,Product,Component,Status,Resolution,Priority,Severity,Description
0,1955715,enhancement,Update addonsInfo asrouter targeting to allow ...,Firefox,Messaging System,RESOLVED,FIXED,P1,--,"Currently, the addonsInfo targeting returns an..."
1,1953155,task,Enable expand on hover and remove coming soon ...,Firefox,Sidebar,RESOLVED,FIXED,P1,--,"When expand on hover is enabled, the message s..."
2,1953857,enhancement,Add support for picker style tiles in the Abou...,Firefox,Messaging System,RESOLVED,FIXED,P1,--,In bug 1910633 we added support for a single s...
3,1945526,task,[SPIKE] What’s New Notification: Windows Toast...,Firefox,Messaging System,RESOLVED,FIXED,P1,--,Spike to understand how the Windows Toast Noti...
4,1945564,enhancement,Add new callout for Create Tab Group action &&...,Firefox,Messaging System,RESOLVED,FIXED,P1,--,Scope is to update && add to the onboarding ca...


In [4]:
data.Summary[0]

'Update addonsInfo asrouter targeting to allow targeting on user-installed addons'

In [8]:
data.Description[0]

"Currently, the addonsInfo targeting returns an object of objects, each representing an addon with said addon's ID/Name being the key of the object. This makes it difficult to use in JEXL targeting expressions unless you are already aware of the ID/Name of the addon you with to gather information for. \nUpdating `addonsInfo`'s `addons` property to be an array of objects, each with a property containing the id/name (as was previously the key) will support the initial use case of getting a particular object by ID via JEXL, but also allow for further use cases such as evaluating if a user has any existing non-system addons active."

In [5]:
# Concatenate the Summary and Description fields in another column named Concat
data["Concat"] = ("Summary: " + data.Summary.str.strip() + "; Description: " + data.Description.str.strip())
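String concatenation in pandas propagates missing values: if either field is missing for a row, the whole Concat value becomes missing, which can break later cleaning or tokenization. The following is a minimal sketch of a more defensive variant, assuming that a missing field should be treated as an empty string (the notebook does not say how missing values are handled). The `build_concat` helper and the toy rows are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the bug report data (values are illustrative only)
toy = pd.DataFrame({
    "Summary": ["  Crash on startup ", "Sidebar flickers"],
    "Description": ["Happens after update.  ", None],  # second description is missing
})

def build_concat(df: pd.DataFrame) -> pd.Series:
    # Assumption: a missing field is treated as an empty string
    summary = df["Summary"].fillna("").str.strip()
    description = df["Description"].fillna("").str.strip()
    return "Summary: " + summary + "; Description: " + description

toy["Concat"] = build_concat(toy)
```

With this variant, a row without a description still produces a usable string instead of a missing value.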

In [7]:
data.head(2)

Unnamed: 0,Bug ID,Type,Summary,Product,Component,Status,Resolution,Priority,Severity,Description,Concat
0,1955715,enhancement,Update addonsInfo asrouter targeting to allow ...,Firefox,Messaging System,RESOLVED,FIXED,P1,--,"Currently, the addonsInfo targeting returns an...",Summary: Update addonsInfo asrouter targeting ...
1,1953155,task,Enable expand on hover and remove coming soon ...,Firefox,Sidebar,RESOLVED,FIXED,P1,--,"When expand on hover is enabled, the message s...",Summary: Enable expand on hover and remove com...


In [9]:
data.Concat[0]

"Summary: Update addonsInfo asrouter targeting to allow targeting on user-installed addons; Description: Currently, the addonsInfo targeting returns an object of objects, each representing an addon with said addon's ID/Name being the key of the object. This makes it difficult to use in JEXL targeting expressions unless you are already aware of the ID/Name of the addon you with to gather information for. \nUpdating `addonsInfo`'s `addons` property to be an array of objects, each with a property containing the id/name (as was previously the key) will support the initial use case of getting a particular object by ID via JEXL, but also allow for further use cases such as evaluating if a user has any existing non-system addons active."

### Pre-Process Text

In [10]:
# Pre-process the text using the custom class
data["Concat"] = data["Concat"].apply(pp.clean_text)

In [11]:
# Example of the cleaned text
data.Concat[0]

'summary update addonsinfo asrouter target allow target on user install addon description current addonsinfo target return object object represent addon said addon id name key object this make difficult use jexl target expression unless already aware id name addon gather information update addonsinfo addon property array object property contain id name support initial use case get particular object id via jexl also allow use case evaluate user existing non system addon active'
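The source of the custom `PreProcessor` class is not shown here. Judging only from the example output above, it appears to lowercase the text, strip punctuation, drop some stop words, and reduce words to a base form ("targeting" becomes "target"). The sketch below is a rough stand-in for readers who do not have that class: `simple_clean` and its stop-word list are hypothetical, and it does not lemmatize, so its output will differ from `pp.clean_text`.

```python
import re

# Tiny illustrative stop-word list (an assumption; the real list is not shown)
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are", "be", "it", "in", "and", "with", "for"}

def simple_clean(text: str) -> str:
    """Hypothetical stand-in for PreProcessor.clean_text (no lemmatization)."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace and newlines
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(simple_clean("Summary: Update the addonsInfo targeting; Description: It is difficult to use."))
```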

### Tokenize and Embed the Text

In [12]:
# Function to transform the text into embeddings using the model available in the OpenAI API
def get_embedding(text: str, model = "text-embedding-3-small", **kwargs) -> List[float]:
    # Replace newlines, which can negatively affect performance.
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model = model, **kwargs)

    return response.data[0].embedding
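`get_embedding` sends one request per text. The embeddings endpoint also accepts a list of inputs, so several reports can be sent in a single request, which may reduce the total run time. The sketch below is one possible way to do that and is not part of the original notebook: `embed_in_batches` is a hypothetical helper, and the batch size of 100 is an arbitrary choice.

```python
from typing import List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    client,  # an initialized openai.OpenAI client, as created earlier in the notebook
    model: str = "text-embedding-3-small",
    batch_size: int = 100,  # arbitrary choice for illustration
) -> List[List[float]]:
    """Hypothetical helper: embed texts in batches, preserving input order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by the index field so results line up with the inputs
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

A call such as `embed_in_batches(data.Concat.tolist(), client)` returns one vector per report in the original order; wrapping the result in `pd.Series(..., index=data.index)` is one way to store it as a DataFrame column.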

In [13]:
embedding_model = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 8000  # The maximum for text-embedding-3-small is 8191

In [14]:
# Encode the text to check the length (in tokens) of each bug report
encoding = tiktoken.get_encoding(embedding_encoding)
data["N_tokens"] = data.Concat.apply(lambda x: len(encoding.encode(x)))

In [15]:
# The longest report has 16,877 tokens, which exceeds the model's limit, so we will remove it from the dataset before creating the embeddings
max(data["N_tokens"])

16877
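Beyond the maximum, it can help to see how many reports exceed the limit and how many tokens the whole dataset will send to the API. The following is a small sketch on a toy `N_tokens` column; the three values are illustrative and not the real distribution of the dataset.

```python
import pandas as pd

# Toy token counts standing in for data["N_tokens"] (illustrative values)
toy = pd.DataFrame({"N_tokens": [120, 16877, 3268]})
max_tokens = 8000  # same threshold as defined above

# Total tokens that would be sent if every row were embedded
total_tokens = int(toy["N_tokens"].sum())

# Rows that are too long for the embedding model
too_long = toy[toy["N_tokens"] > max_tokens]

print(total_tokens, len(too_long))
```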

In [24]:
# Drop every report whose token count exceeds the limit defined above
data.drop(data[data["N_tokens"] > max_tokens].index, inplace = True)

In [28]:
data.reset_index(drop = True, inplace = True)

In [30]:
# Now the longest example has 3,268 tokens, which is fine for the embedding model
max(data["N_tokens"])

3268
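This notebook drops the over-long report. An alternative, not used here, is to truncate long texts to the token limit so that no report is lost. The sketch below assumes that keeping only the beginning of a report is acceptable; `truncate_to_tokens` is a hypothetical helper that works with any object exposing `encode` and `decode`, such as the `tiktoken` encoding created above.

```python
def truncate_to_tokens(text: str, encoding, max_tokens: int = 8000) -> str:
    """Hypothetical helper: keep at most max_tokens tokens of text."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Decode only the first max_tokens tokens back into a string
    return encoding.decode(tokens[:max_tokens])
```

It could be applied as `data.Concat.apply(lambda x: truncate_to_tokens(x, encoding, max_tokens))` before computing the embeddings. Truncation discards the tail of the report, so whether it is preferable to dropping depends on the downstream task.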

In [34]:
# Create a new column named Embeddings and compute the embedding for each bug report text
# This may take a few minutes
data["Embeddings"] = data.Concat.apply(lambda x: get_embedding(x, model = embedding_model))

### Save Embeddings

In [35]:
# Save data to CSV without the index (each embedding list is written as a string)
data.to_csv("bug_reports_mozilla_firefox_resolved_fixed_comments_embeddings.csv", index = False)
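Because CSV is a text format, each embedding is stored as the string form of a Python list and has to be parsed when the file is read back. The following is a minimal sketch of that round trip, using a toy DataFrame and an in-memory buffer in place of the real file; the values are illustrative only.

```python
import ast
import io

import pandas as pd

# Toy frame standing in for the saved data (values are illustrative only)
toy = pd.DataFrame({"Bug ID": [1, 2], "Embeddings": [[0.1, -0.2], [0.3, 0.4]]})

# Round-trip through an in-memory CSV instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# After reading, each embedding is a string such as "[0.1, -0.2]"
loaded["Embeddings"] = loaded["Embeddings"].apply(ast.literal_eval)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.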