# Retrieval - Core of RAG

<img src='./images/IMG_0359.jpg' width="800">

## Context Tokenizer
A context tokenizer preprocesses raw text by converting it into a sequence of tokens, each mapped to an integer ID that the model can consume. Tokens are smaller units of text, such as words, subwords, or characters, depending on the tokenization strategy used.
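To make the three granularities concrete, here is a minimal pure-Python sketch. It is not the DPR tokenizer: the helper names (`word_tokens`, `char_tokens`, `wordpiece_tokens`) and the tiny `toy_vocab` are made up for illustration. The subword splitter uses greedy longest-match-first lookup, in the style of the WordPiece scheme used by BERT-family tokenizers.

```python
# Toy illustration of word, character, and subword tokenization.
# All names and the vocabulary here are hypothetical, for illustration only.

def word_tokens(text):
    """Split on whitespace after lowercasing."""
    return text.lower().split()

def char_tokens(text):
    """One token per non-space character."""
    return list(text.lower().replace(" ", ""))

def wordpiece_tokens(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces.

    Pieces that continue a word carry a '##' prefix. If no piece in the
    vocabulary matches at some position, the whole word becomes '[UNK]'.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"what", "##s", "token", "##ize", "##r"}
sentence = "Whats tokenizer"

print(word_tokens(sentence))
print(char_tokens("up"))
print([wordpiece_tokens(w, toy_vocab) for w in word_tokens(sentence)])
```

Subword splitting keeps the vocabulary small while still covering words it has never seen as a whole, which is why a single word can map to more than one token ID.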

In [6]:
import torch
from transformers import DPRContextEncoderTokenizer

In [27]:
model_name = "facebook/dpr-ctx_encoder-single-nq-base"
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained(model_name)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.


In [10]:
text = [
    ("How are you?", "I am Good."),
    ("Hey Buddy. Whats Up?", "Nothing Man."),
    ("What Is Love", "It is a song recorded by the artist Haddaway"),
]

In [11]:
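# Each tuple is encoded as one sequence pair; shorter rows are padded to the longest row in the batch.
tokens_info = context_tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=256)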

In [12]:
tokens_info

{'input_ids': tensor([[ 101, 2129, 2024, 2017, 1029,  102, 1045, 2572, 2204, 1012,  102,    0,
            0,    0,    0,    0,    0],
        [ 101, 4931, 8937, 1012, 2054, 2015, 2039, 1029,  102, 2498, 2158, 1012,
          102,    0,    0,    0,    0],
        [ 101, 2054, 2003, 2293,  102, 2009, 2003, 1037, 2299, 2680, 2011, 1996,
         3063, 2018, 2850, 4576,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
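The three tensors in the output follow a regular layout. The sketch below rebuilds the first row by hand in plain Python, assuming the BERT-style convention visible in the output (`101` opens the sequence, `102` closes each segment, `0` pads). `pack_pair` is a hypothetical helper written for this illustration, not a `transformers` function, and it does not handle truncation.

```python
# Hand-rolled reconstruction of one row of the tokenizer output.
# pack_pair is a hypothetical helper, not part of transformers.

CLS, SEP, PAD = 101, 102, 0  # special-token IDs as they appear in the output

def pack_pair(first_ids, second_ids, max_len):
    """Lay out a pair as [CLS] first [SEP] second [SEP], then pad to max_len."""
    input_ids = [CLS] + first_ids + [SEP] + second_ids + [SEP]
    # Segment 0 covers [CLS], the first text, and its [SEP];
    # segment 1 covers the second text and the final [SEP].
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [PAD] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }

# Token IDs copied from the first row of the output ("How are you?", "I am Good.").
row = pack_pair([2129, 2024, 2017, 1029], [1045, 2572, 2204, 1012], max_len=17)
for name, values in row.items():
    print(name, values)
```

The batch width of 17 comes from the longest pair (the third row), which is why the first two rows end in padding zeros and have matching zeros in `attention_mask`.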

## Context Encoder
A context encoder is the model component that takes tokenized input and produces dense vector representations (embeddings) of the text. These embeddings capture semantic and syntactic relationships in the input. In the example below, each input sequence is reduced to a single 768-dimensional vector.

In [13]:
from transformers import DPRContextEncoder

In [15]:
model_name = "facebook/dpr-ctx_encoder-single-nq-base"
context_encoder = DPRContextEncoder.from_pretrained(model_name)

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
outputs = context_encoder(**tokens_info)

In [17]:
outputs

DPRContextEncoderOutput(pooler_output=tensor([[ 0.2181,  0.5647, -0.2786,  ..., -0.3639,  0.7693, -0.0286],
        [ 0.4470,  0.3929,  0.2845,  ..., -0.1393,  0.6474,  0.2293],
        [ 0.1112,  0.0598,  0.1836,  ..., -0.4205, -0.0009,  0.3082]],
       grad_fn=<SliceBackward0>), hidden_states=None, attentions=None)

In [18]:
outputs.pooler_output.shape

torch.Size([3, 768])
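Each of the 3 rows is one 768-dimensional embedding. This section stops at producing those vectors, but as a rough sketch of how they are commonly used for retrieval, dense retrievers typically score each passage by the dot product between its embedding and a question embedding, then sort by score. The example below uses short made-up vectors in place of real 768-dimensional rows; `dot`, `rank_passages`, and all the numbers are hypothetical.

```python
# Dot-product ranking over toy embeddings (stand-ins for pooler_output rows).
# Function names and vector values are hypothetical, for illustration only.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_passages(question_vec, passage_vecs):
    """Return passage indices sorted from highest to lowest score, plus the scores."""
    scores = [dot(question_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

passage_vecs = [
    [0.2, 0.5, -0.3],
    [0.4, 0.4, 0.3],
    [0.1, 0.0, 0.2],
]
question_vec = [0.3, 0.1, 0.6]

order, scores = rank_passages(question_vec, passage_vecs)
print(order)  # passage indices, best match first
```

With real encoder output, the same idea would apply to the rows of `outputs.pooler_output`, with the question vector coming from a separate question encoder that this section does not cover.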

### Summary

A context tokenizer handles text preprocessing by breaking text into tokens, while a context encoder processes those tokens to produce representations that capture their relationships and context within the input. Both are essential, but they serve distinct roles in NLP pipelines.