The code in this notebook corresponds to [this blog post](https://shahadmahmud.com/en/ml/llm/retriever-augmented-generation-rag-with-llms). You can go through the blog post to understand the code in detail.

In [1]:
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from IPython.display import Markdown, display

load_dotenv()

True

In [2]:
class Configs:
    GPT_MODEL = "gpt-4o"

    SOURCE_FILE = "data/example/BERT.html"

settings = Configs()

## Prepare the knowledge base (KB)

We use the [BERT documentation](https://huggingface.co/docs/transformers/en/model_doc/bert) from Hugging Face as the KB for this task. The webpage has been pre-downloaded and is available in the `data/example/BERT.html` file. We use the `unstructured` library to parse the HTML file and extract its text content.

If you want to load the webpage from the internet, look at [this documentation](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/html/) from LangChain.

### Loading the webpage

We load the webpage and extract its text content using the `UnstructuredHTMLLoader`.

In [3]:
loader = UnstructuredHTMLLoader(settings.SOURCE_FILE)
data = loader.load()
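
To give a feel for what "extracting the text content" from HTML means, here is a minimal sketch built only on Python's standard `html.parser` module. It is not how `unstructured` works internally; the `TextExtractor` class and the `sample_html` snippet are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag.
        if text and self._skip_depth == 0:
            self.parts.append(text)


# Hypothetical HTML snippet standing in for the real BERT page.
sample_html = """
<html><head><title>BERT</title><style>p {color: red;}</style></head>
<body><h1>Overview</h1><p>BERT is a bidirectional transformer.</p>
<script>console.log("ignored");</script></body></html>
"""

extractor = TextExtractor()
extractor.feed(sample_html)
extracted_text = "\n".join(extractor.parts)
print(extracted_text)
```

The real loader does considerably more (for example, it groups the page into typed elements), so treat this only as a mental model of the step.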

### Splitting the text content

Passing the entire text content of the webpage to the model at once would be too much for it to handle. We could hit the context length limit, and the LLM may fail to pick out the right piece of information from the whole text.

That's why we split the text content into smaller, manageable chunks.

In [4]:
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
chunks = splitter.split_documents(data)
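
The roles of `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window over characters. This is a simplification and an assumption on our part: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which the hypothetical `split_with_overlap` helper below does not attempt.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


# Toy input: small numbers make the overlap easy to see.
sample_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample_text, chunk_size=10, chunk_overlap=3)
for piece in toy_chunks:
    print(repr(piece))
```

Each window repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one neighbouring chunk when the overlap is large enough.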

### Creating the vector database

We create a vector database from the text chunks. We use `OpenAIEmbeddings` to encode the text chunks into vectors and the `FAISS` library to create the vector database.

In [5]:
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(chunks, embeddings)
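
Conceptually, a vector database stores one vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The brute-force sketch below assumes Euclidean (L2) distance as the measure and uses made-up 3-dimensional vectors. Real embeddings have far more dimensions, and FAISS uses optimized index structures rather than a Python loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vector, indexed_vectors, k):
    """Return (distance, text) pairs for the k stored vectors closest to the query."""
    ranked = sorted(
        (l2_distance(query_vector, vector), text)
        for text, vector in indexed_vectors
    )
    return ranked[:k]


# Hypothetical index: the texts and vectors are invented for illustration.
toy_index = [
    ("BERT supports SDPA attention.", [0.9, 0.1, 0.0]),
    ("Pad inputs on the right.", [0.1, 0.8, 0.1]),
    ("BERT was trained with MLM and NSP.", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]

top_matches = nearest_chunks(toy_query, toy_index, k=2)
for distance, text in top_matches:
    print(f"{distance:.3f}  {text}")
```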

## Developing the Retrieval Pipeline

We develop a retrieval pipeline that takes a query and returns the most relevant text chunks from the knowledge base. We combine the retrieved texts and pass them to the LLM along with the original query to generate the answer.

For the pipeline, we use a prompt template and a chain consisting of a retriever, an LLM, and an output parser.

In [6]:
from langchain_core.prompts import PromptTemplate

template = """You are a helpful AI assistant that can answer questions from given contexts.

The given contexts are:
{contexts}

The question is:
{question}
"""

prompt = PromptTemplate.from_template(template)
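
By default, `PromptTemplate.from_template` treats `{contexts}` and `{question}` as f-string style placeholders, so filling the template behaves much like Python's `str.format`. The sketch below uses plain strings only; `toy_template` repeats the template above, and `toy_contexts` is a hypothetical stand-in for the retrieved chunks.

```python
toy_template = """You are a helpful AI assistant that can answer questions from given contexts.

The given contexts are:
{contexts}

The question is:
{question}
"""

# Hypothetical retrieved chunks; the real ones come from the retriever.
toy_contexts = ["BERT supports SDPA attention.", "Pad inputs on the right."]

filled_prompt = toy_template.format(
    contexts="\n\n".join(toy_contexts),
    question="How can I use SDPA with BERT in transformers?",
)
print(filled_prompt)
```

Printing the filled prompt like this is a quick way to check exactly what text the model would receive.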

In [7]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model=settings.GPT_MODEL, temperature=0.3)
# The number of chunks to return is passed through `search_kwargs`.
retriever = db.as_retriever(search_kwargs={"k": 10})

chain = (
    {
        "contexts": itemgetter("question")
        | retriever
        | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "question": itemgetter("question"),
    }
    | prompt
    | llm
    | StrOutputParser()
)
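
The data flow through the chain can be sketched with ordinary functions. Everything below is a stand-in: `fake_retriever` and `fake_llm` are hypothetical stubs that make no API calls, and the prompt is shortened. Only the order of the steps mirrors the `|` pipeline above.

```python
def fake_retriever(question):
    """Hypothetical retriever: returns canned chunks regardless of the query."""
    return ["BERT supports SDPA attention.", "Load the model in half-precision."]


def build_prompt(inputs):
    """Fill a shortened version of the prompt with contexts and question."""
    contexts = "\n\n".join(inputs["contexts"])
    return (
        f"The given contexts are:\n{contexts}\n\n"
        f"The question is:\n{inputs['question']}\n"
    )


def fake_llm(prompt_text):
    """Hypothetical model: reports the prompt size instead of calling an API."""
    return {"content": f"(stub answer based on {len(prompt_text)} prompt characters)"}


def parse_output(message):
    """Reduce the model's message object to a plain string."""
    return message["content"]


def run_chain(inputs):
    # Step 1: fan the input out into the two prompt variables.
    prompt_inputs = {
        "contexts": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Steps 2-4: prompt -> model -> plain string.
    return parse_output(fake_llm(build_prompt(prompt_inputs)))


toy_answer = run_chain({"question": "How can I use SDPA with BERT in transformers?"})
print(toy_answer)
```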

In [8]:
response = chain.invoke({
    "question": "How can I use SDPA with BERT in transformers?"
})

display(Markdown(response))

To use Scaled Dot Product Attention (SDPA) with BERT in the transformers library, you can follow these steps:

1. **Load the BERT Model with SDPA**: You need to load the BERT model with the `attn_implementation` parameter set to `"sdpa"`. This can be done using the `BertModel.from_pretrained` method.

2. **Use Half-Precision for Speedups**: For optimal performance, it is recommended to load the model in half-precision (e.g., `torch.float16` or `torch.bfloat16`).

Here is an example code snippet to illustrate this:

```python
from transformers import BertModel

# Load the BERT model with SDPA and in half-precision
model = BertModel.from_pretrained(
    "bert-base-uncased", 
    torch_dtype=torch.float16, 
    attn_implementation="sdpa"
)

# Now the model is ready to be used for training or inference
```

### Additional Tips:
- **Padding**: Since BERT uses absolute position embeddings, it is usually advised to pad the inputs on the right rather than the left.
- **Training and Inference**: You can use this model for tasks like masked language modeling (MLM) and next sentence prediction (NSP), which are the objectives BERT was trained on.

### Example Usage for Inference:
Here’s an example of how you might use the model for a question-answering task:

```python
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Define the question and text
question = "Who was Jim Henson?"
text = "Jim Henson was a nice puppet"

# Tokenize the inputs
inputs = tokenizer(question, text, return_tensors="pt")

# Perform inference
outputs = model(**inputs)

# Extract start and end scores for the answer span
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Process the scores to get the final answer (not shown here)
```

By following these steps, you can effectively use SDPA with BERT in the transformers library.