# Retrieval - Core of RAG

<img src='./images/IMG_0359.jpg' width="800">

## Context Tokenizer
A context tokenizer preprocesses raw text by splitting it into a sequence of tokens and mapping each token to the integer ID the model expects. Tokens are smaller units of text, such as words, subwords, or characters, depending on the tokenization strategy. The DPR context tokenizer used here follows BERT's WordPiece (subword) scheme.
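
To make the three granularities concrete, here is a minimal sketch that splits the same text at the word, character, and subword level. The subword splitter is a simplified greedy longest-match routine in the style of WordPiece; `wordpiece_tokenize` and `toy_vocab` are hypothetical names made up for this illustration, not part of the `transformers` API, and a real vocabulary has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (simplified WordPiece style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##izer", "##s", "play", "##ing"}

sentence = "tokenizers playing"
word_tokens = sentence.split()   # word-level tokens
char_tokens = list("play")       # character-level tokens
subword_tokens = [p for w in word_tokens for p in wordpiece_tokenize(w, toy_vocab)]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```

Subword splitting keeps the vocabulary small while still covering words it has never seen as a whole, which is why BERT-style models rely on it.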

In [None]:
import torch
from transformers import DPRContextEncoderTokenizer

In [None]:
model_name = "facebook/dpr-ctx_encoder-single-nq-base"
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained(model_name)

In [None]:
# Each tuple is encoded as a pair of text segments
text = [
    ("How are you?", "I am Good."),
    ("Hey Buddy. Whats Up?", "Nothing Man."),
    ("What Is Love", "It is a song recorded by the artist Haddaway"),
]

In [None]:
tokens_info = context_tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=256)

In [None]:
tokens_info
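
The `padding=True`, `truncation=True`, and `max_length` arguments control how sequences of different lengths are packed into one rectangular batch. The sketch below imitates that behaviour on plain Python lists. It is a simplified illustration under stated assumptions: `pad_and_truncate` and `toy_batch` are hypothetical names, the pad ID of 0 and the toy token IDs are illustrative, and the truncation here just cuts from the end, whereas the real tokenizer preserves special tokens.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest one left."""
    # Simplification: cut from the end (a real tokenizer keeps special tokens).
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs, for illustration only.
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 102]]
padded = pad_and_truncate(toy_batch, max_length=5)

for row, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(row, mask)
```

The attention mask is what lets the encoder ignore padding positions, so padded and unpadded versions of the same text yield the same representation.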

## Context Encoder
A context encoder is a model component that processes tokenized input to produce dense vector representations (embeddings) of the text. These embeddings capture semantic and syntactic relationships between tokens. In DPR, each input passage is summarized by a single pooled vector, exposed as `pooler_output`.

In [None]:
from transformers import DPRContextEncoder

In [None]:
model_name = "facebook/dpr-ctx_encoder-single-nq-base"
context_encoder = DPRContextEncoder.from_pretrained(model_name)

In [None]:
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = context_encoder(**tokens_info)

In [None]:
outputs

In [None]:
# Shape is (batch_size, embedding_dim): one vector per input
outputs.pooler_output.shape
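
In retrieval, pooled vectors like these are compared against a question embedding, and DPR scores relevance with a dot product. The following is a minimal sketch of that ranking step using hand-written 3-dimensional vectors instead of real model outputs. `rank_contexts`, `toy_contexts`, and `toy_question` are hypothetical names and values chosen for illustration; with real tensors the same scores could be computed with a matrix product.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_contexts(question_vec, context_vecs):
    """Return context indices sorted by descending dot-product score."""
    scores = [dot(question_vec, c) for c in context_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical embeddings, for illustration only (real ones are much longer).
toy_contexts = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
]
toy_question = [0.1, 0.1, 1.0]

order, scores = rank_contexts(toy_question, toy_contexts)
print(order)   # context indices, most relevant first
print(scores)  # raw dot-product scores
```

The top-ranked contexts are the ones a RAG pipeline would pass to the generator as supporting text.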

## Summary
A context tokenizer handles text preprocessing by breaking text into tokens, while a context encoder processes those tokens to produce representations that capture their relationships and context within the input. Both are essential, but they serve different roles in an NLP pipeline.