Contextual compression works well with a quantized LLM (here Mistral-7B-OpenOrca-GPTQ; a Llama-2 checkpoint is left commented out below)

Contextual compression returns a condensed version of each retrieved page, keeping only the parts relevant to the query, instead of the full page content, which can be hard to digest.
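
To make the idea concrete before any model is loaded, here is a minimal sketch of "retrieve, then compress". It assumes a keyword-overlap filter in place of the LLM, and every name in it (`keywords`, `compress`, `compressed_retrieve`) is hypothetical; it only mirrors the concept, not LangChain's implementation.

```python
# Toy illustration of contextual compression (no LLM involved).
# All names are hypothetical and only mirror the idea, not LangChain's API.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in",
             "what", "how", "does", "it", "this"}

def keywords(text):
    """Return the lower-cased content words of a text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def compress(page_content, query):
    """Keep only the sentences that share at least one keyword with the query."""
    query_words = keywords(query)
    sentences = re.split(r"(?<=[.!?])\s+", page_content.strip())
    kept = [s for s in sentences if keywords(s) & query_words]
    return " ".join(kept)

def compressed_retrieve(pages, query):
    """Compress every retrieved page and drop pages with nothing relevant left."""
    compressed = (compress(p, query) for p in pages)
    return [c for c in compressed if c]

pages = [
    "Dropout randomly drops units during training. "
    "The paper was published in 2014. It reduces overfitting.",
    "COCO is a dataset for object recognition.",
]
print(compressed_retrieve(pages, "How does dropout reduce overfitting?"))
```

In the notebook itself the compression step is delegated to the LLM, which can paraphrase or extract rather than just filter sentences.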

In [1]:
import torch 
import time
import transformers # HF import
from langchain import HuggingFacePipeline # Wraps a HF pipeline so LangChain can use it as an LLM
from langchain import PromptTemplate,  LLMChain # To create PromptTemplate and LLMChain
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM , AutoModel  # For creating the model and tokenizer


In [2]:
from transformers import GPTQConfig

#mname = 'TheBloke/Llama-2-7B-Chat-GGUF'
mname = "TheBloke/Mistral-7B-OpenOrca-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(mname)
tokenizer.pad_token = tokenizer.eos_token

quantization_config_loading = GPTQConfig(bits=4, 
                                         disable_exllama=True, 
                                         use_cuda_fp16=True,
                                         tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(mname,
                                             quantization_config=quantization_config_loading,
                                             device_map="auto")

model.eval()

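# Text-generation pipeline around the already-loaded quantized model
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                max_new_tokens=128,  # cap on the length of each generated answer
                do_sample=True,
                top_k=1,  # only the single most likely token is a candidate
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id,
                repetition_penalty=1.2
                )

# Wrap the pipeline so LangChain can use it as an LLM
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature': 0})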


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.
You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
CUDA extension not installed.
CUDA extension not installed.
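
The pipeline above combines `do_sample=True` with `top_k=1`. With a single candidate token left after top-k filtering, sampling should always return the highest-scoring token, so decoding behaves like greedy search (ignoring exact ties). The sketch below illustrates this with a hand-written top-k sampler; `sample_top_k` and the logits are made up for illustration and are not how `transformers` implements generation.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample a token index from the k highest-scoring logits (softmax-weighted)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.5, 1.7, -0.3]          # hypothetical next-token scores
rng = random.Random(0)
greedy = max(range(len(logits)), key=lambda i: logits[i])

# With k=1 every draw lands on the greedy choice.
samples = {sample_top_k(logits, k=1, rng=rng) for _ in range(100)}
print(samples == {greedy})
```

Setting `do_sample=False` would likely express the same intent more directly, though that is a suggestion rather than something the original run used.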


#### Embedding Model

In [3]:
from langchain.schema import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma


In [4]:
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')


#### Document Preparation

Source: https://www.kdnuggets.com/2017/04/top-20-papers-machine-learning.html

In [5]:
docs = [
    Document(
        page_content="The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. This significantly reduces overfitting and gives major improvements over other regularization methods",
        metadata={"name":"Dropout: a simple way to prevent neural networks from overfitting", "year": 2014, 
                  "Authors": "Hinton, G.E., Krizhevsky, A., Srivastava, N., Sutskever, I., & Salakhutdinov, R.", "cited":2084 , 
                  "Field" : "'Neural Network','Regularization'"},
    ),
    Document(
        page_content="We present a residual learning framework to ease the training of deep neural networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.",
        metadata={"name":"Deep Residual Learning for Image Recognition", "year": 2016, 
                  "Authors": "He, K., Ren, S., Sun, J., & Zhang, X. (2016). CoRR", "cited":1436 , 
                  "Field" : "'Image Recognition','Computer Vision'"},
    ),
    Document(
        page_content="Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change.  We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.  Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.",
        metadata={"name":"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", "year": 2015, 
                  "Authors": "Sergey Ioffe, Christian Szegedy", "cited":946 , 
                  "Field" : "'Neural Network','Deep Learning','Speed up Training Process'"},
    ),
    Document(
        page_content="Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes",
        metadata={"name":"Large-Scale Video Classification with Convolutional Neural Networks ", "year": 2014, 
                  "Authors": "Fei-Fei, L., Karpathy, A., Leung, T., Shetty, S., Sukthankar, R., & Toderici, G.", "cited":865 , 
                  "Field" : "'Convolutional Neural Network','Deep Learning','Video Classfication'"},
    ),
    Document(
        page_content="We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.",
        metadata={"name":"Microsoft COCO: Common Objects in Context", "year": 2014, 
                  "Authors": "Belongie, S.J., Dollár, P., Hays, J., Lin, T., Maire, M., Perona, P., Ramanan, D., & Zitnick, C.L", "cited":830 , 
                  "Field" : "'Convolutional Neural Network','Object Detection','Dataset'"},
    ),
]


#### Vector Store

In [6]:
db = Chroma.from_documents(docs, embeddings) ### Vector Store Creation


#### Creating a ContextualCompressionRetriever

In [7]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor


In [8]:
compressor = LLMChainExtractor.from_llm(llm)


In [9]:
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, 
                                                       base_retriever=db.as_retriever())


#### Running a simple similarity search for a query

In [10]:
docs = db.similarity_search('Please let me know the details about batch normalization ?')


In [12]:
len(docs), docs[0]


(4,
 Document(page_content="Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change.  We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.  Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.", metadata={'Authors': 'Sergey Ioffe, Christian Szegedy', 'Field': "'Neural Network','Deep Learning','Speed up Training Process'", 'cited': 946, 'name': 'Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift', 'year': 2015}))
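
The search above returns 4 of the 5 stored documents, which matches a default of `k=4` for `similarity_search`. A rough sketch of what such a top-k lookup does is shown below. It assumes a bag-of-words `embed` function as a stand-in for the sentence-transformer model, and the helper names are hypothetical rather than Chroma's actual internals.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_search(texts, query, k=4):
    """Return the k texts whose embeddings are closest to the query embedding."""
    q = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(embed(t), q), reverse=True)
    return ranked[:k]

texts = [
    "dropout prevents overfitting",
    "residual learning for deep networks",
    "batch normalization speeds up training",
    "video classification with cnns",
    "coco dataset for object recognition",
]
hits = similarity_search(texts, "details about batch normalization")
print(len(hits), hits[0])
```

Note that a top-k search always returns k items when enough documents exist, even if most of them are unrelated to the query. That is the gap contextual compression is meant to narrow.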

#### Using compression_retriever to get results

In [None]:
results = compression_retriever.get_relevant_documents("Please let me know the details about Dropout?")


In [14]:
results


[Document(page_content='"randomly drop units", "preventing units from co-adapting too much", "reduces overfitting"', metadata={'Authors': 'Hinton, G.E., Krizhevsky, A., Srivastava, N., Sutskever, I., & Salakhutdinov, R.', 'Field': "'Neural Network','Regularization'", 'cited': 2084, 'name': 'Dropout: a simple way to prevent neural networks from overfitting', 'year': 2014}),
 Document(page_content='Residual Learning Framework, Easier Training, Deep Neural Networks', metadata={'Authors': 'He, K., Ren, S., Sun, J., & Zhang, X. (2016). CoRR', 'Field': "'Image Recognition','Computer Vision'", 'cited': 1436, 'name': 'Deep Residual Learning for Image Recognition', 'year': 2016}),
 Document(page_content='None\n```', metadata={'Authors': 'Fei-Fei, L., Karpathy, A., Leung, T., Shetty, S., Sukthankar, R., & Toderici, G.', 'Field': "'Convolutional Neural Network','Deep Learning','Video Classfication'", 'cited': 865, 'name': 'Large-Scale Video Classification with Convolutional Neural Networks ', 'ye

In [15]:
results[0].page_content


'"randomly drop units", "preventing units from co-adapting too much", "reduces overfitting"'

In [16]:
results[1].page_content


'Residual Learning Framework, Easier Training, Deep Neural Networks'
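
The third result above has `'None\n```'` as its `page_content`: the model signalled that nothing was relevant, but the document was still returned. One possible workaround, sketched below, is a small post-filter that drops such placeholder answers. The `Doc` class is a hypothetical stand-in for `langchain.schema.Document`, and the placeholder list is an assumption that would need tuning for a given model.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for langchain.schema.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Answers treated as "nothing relevant found" (an assumed list, not exhaustive).
PLACEHOLDERS = {"", "none", "no_output", "n/a"}

def is_placeholder(text):
    """True if the text is only a placeholder once whitespace and backticks are removed."""
    cleaned = text.strip().strip("`").strip().lower()
    return cleaned in PLACEHOLDERS

def drop_placeholders(results):
    """Keep only the documents whose compressed content carries real text."""
    return [d for d in results if not is_placeholder(d.page_content)]

results = [
    Doc('"randomly drop units", "reduces overfitting"'),
    Doc("None\n```"),
]
filtered = drop_placeholders(results)
print(len(filtered))
```

This only removes empty answers; it does not catch results such as the residual-learning entry, which is non-empty but unrelated to the Dropout query.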

#### Checking Contextual Compression on a Larger Document

In [17]:
# import the LangChain pdf document loader
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter


In [20]:
survey_of_llms = 'survey_of_large_lang_models.pdf'


In [21]:
# Load and create pages
loader = PyPDFLoader(file_path=survey_of_llms)
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
documents = text_splitter.split_documents(documents)
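
To show what `chunk_size=500` and `chunk_overlap=100` mean, here is a simplified fixed-width sliding window. This is an assumption-laden sketch: `CharacterTextSplitter` first splits on a separator and then merges the pieces up to `chunk_size`, so its real chunk boundaries will differ from the ones produced by the hypothetical `sliding_chunks` below.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(1200))  # 1200 dummy characters
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a boundary.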


#### Adding the embeddings to the existing Chroma DB

In [22]:
_ = db.add_documents(documents)


In [None]:
docs = db.similarity_search('Please let me know how the Large Language Models are trained?')


In [None]:
docs  # Large output: the whole page content of each match is retrieved


#### Trying Contextual Compression

In [23]:
results = compression_retriever.get_relevant_documents("Please let me know how the Large Language Models are trained?")




In [24]:
for i in results:
    print(i.page_content)
    print('---------------')


[121] A. Radford, R. Józefowicz, and I. Sutskever, “Learning to generate reviews and discovering sentiment,” CoRR, vol. abs/1704.01444, 2017.
[122] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
---------------
- Large Language Models (LLM): Pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks.
- Model Scaling: Researchers have found that model scaling can lead to an improved model capacity. When the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-context learning) that are not present in small-scale language models (e.g., B
---------------
[30] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown,
B. Chess, R. Child, S. Gray, A. 