In [11]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


In [12]:
device = "cuda"

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
# !pip install torch --index-url https://download.pytorch.org/whl/cu118
# !pip install -qU langchain cohere tiktoken pypdf sentence_transformers faiss-gpu peft bitsandbytes trl
# !pip install -q accelerate==0.21.0

In [54]:
import os
from getpass import getpass
os.environ["COHERE_API_KEY"] = getpass("Cohere API Key:")

Cohere API Key: ········································


In [7]:
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
pdf_folder_path = "/kaggle/input/nlp-book-set"
loader = PyPDFDirectoryLoader(pdf_folder_path)
docs = loader.load()

### SPLIT THE DOCUMENTS

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(docs)

In [10]:
print(len(texts))

5054
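
To make `chunk_size` and `chunk_overlap` concrete, the sketch below cuts a string into fixed-width windows that share a fixed number of characters. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, since that splitter also tries to break on separators such as paragraph and line boundaries. The function name `sliding_chunks` and the `sample` string are hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Split text into fixed-width windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


# Hypothetical 1200-character input, used only for illustration
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])  # the tail of one chunk is the head of the next
```

A larger overlap lowers the chance that a sentence is cut off from its context, at the cost of more chunks to embed and store.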


### DEFINE EMBEDDING MODEL

In [18]:
from langchain.embeddings import HuggingFaceBgeEmbeddings
model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': device}  # reuse the device string defined earlier ("cuda")
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
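
The comment on `normalize_embeddings` relies on a simple identity: once two vectors are scaled to unit length, their dot product equals their cosine similarity. The pure-Python sketch below checks this on two hypothetical 2-dimensional vectors. It does not call the embedding model, and the helper names are illustrative.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical toy vectors standing in for real embeddings
vec_a = [3.0, 4.0]
vec_b = [4.0, 3.0]
unit_a, unit_b = l2_normalize(vec_a), l2_normalize(vec_b)

print(cosine(vec_a, vec_b))  # cosine similarity of the raw vectors
print(dot(unit_a, unit_b))   # plain dot product of the normalized vectors
```

The two printed values agree up to floating-point error, which is why normalizing once at encoding time makes a cheap inner product sufficient at query time.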

### BASE VECTOR STORE (FAISS) RETRIEVER

In [19]:
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_documents(texts, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

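
The retriever returns the `k` nearest chunks to the query embedding. The brute-force sketch below shows what "nearest" means. It assumes the index ranks by Euclidean distance, which should be verified for the installed version. For unit-length vectors, squared Euclidean distance is `2 - 2 * dot(a, b)`, so that ranking matches a cosine ranking. The toy vectors and function names are hypothetical, and a real FAISS index is far more efficient than this linear scan.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_l2(query_vec, vectors, k):
    """Indices of the k vectors closest to query_vec (smallest distance first)."""
    return sorted(range(len(vectors)), key=lambda i: squared_l2(query_vec, vectors[i]))[:k]


def top_k_by_cosine(query_vec, vectors, k):
    """Indices of the k vectors most similar to query_vec (largest similarity first)."""
    return sorted(range(len(vectors)), key=lambda i: -inner(query_vec, vectors[i]))[:k]


# Hypothetical 2-d "embeddings", normalized like the BGE vectors above
toy_vectors = [normalize(v) for v in [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.2]]]
toy_query = normalize([0.9, 0.1])

print(top_k_by_l2(toy_query, toy_vectors, k=2))
print(top_k_by_cosine(toy_query, toy_vectors, k=2))
```

Both calls select the same indices, so with normalized embeddings the choice between the two metrics does not change which chunks are retrieved.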

### RERANKING WITH COHERE RERANK

In [55]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

In [56]:
compressor = CohereRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, 
    base_retriever=retriever
)
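
Conceptually, the compression retriever runs two stages: the base retriever fetches a broad candidate set (20 chunks here), then the reranker rescores each candidate against the query and keeps only the best few. The sketch below illustrates that flow with a deliberately crude word-overlap score. That score is a stand-in assumption, not how Cohere's hosted rerank model works, and `overlap_score`, `rerank`, and `top_n` are names of this sketch only.

```python
def overlap_score(query_text: str, passage: str) -> float:
    """Stand-in relevance score: fraction of query words that appear in the passage."""
    query_words = set(query_text.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(query_text, candidates, top_n=3, score_fn=overlap_score):
    """Score every first-stage candidate and keep the top_n highest-scoring ones."""
    scored = [(score_fn(query_text, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical first-stage results, in the order a vector search might return them
first_stage = [
    "Transformers use attention layers.",
    "Dense retrieval systems embed queries and documents.",
    "Speech recognition maps audio to text.",
    "Advanced retrieval systems combine dense retrieval with reranking.",
]

for score, passage in rerank("advanced retrieval systems", first_stage, top_n=2):
    print(round(score, 2), passage)
```

The second stage can afford a slower, more accurate scorer because it only sees the small candidate set, not the whole corpus.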

### OPENAI MODEL

In [49]:
#!pip install pydantic==1.10.9 openai==0.27.8

In [57]:
import openai
from langchain.chat_models import ChatOpenAI

In [58]:
# Prompt for the key instead of hardcoding a secret in the notebook
os.environ['OPENAI_API_KEY'] = getpass("OpenAI API Key:")
model_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=model_name)

### RAG PIPELINE WITH COMPRESSOR RETRIEVER

In [59]:
from langchain.chains import RetrievalQA

In [60]:
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
)
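
The `"stuff"` chain type places every retrieved passage into a single prompt and makes one LLM call. The sketch below shows the idea only. The prompt wording and the function `build_stuff_prompt` are assumptions for illustration and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question: str, passages) -> str:
    """Concatenate all passages into one context block, then append the question."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical passages standing in for the reranked chunks
stuff_prompt = build_stuff_prompt(
    "What is coreference resolution?",
    ["Passage A.", "Passage B."],
)
print(stuff_prompt)
```

Because everything goes into one prompt, this approach only works while the combined passages fit in the model's context window, which is one reason to rerank down to a few chunks first.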

In [61]:
query = "Tell me about various advanced Retrieval Systems"
print(rag_pipeline.run(query=query))

There are various advanced retrieval systems used in information retrieval and natural language processing. Some of these systems include:

1. Ranking-based Retrieval Systems: These systems rank the retrieved documents based on their relevance to the query. They use different ranking algorithms and techniques to determine the importance of each document.

2. Neural Network-based Retrieval Systems: These systems use artificial neural networks to learn the semantic relationships between queries and documents. They can capture complex patterns and representations in the data, leading to more accurate retrieval results.

3. Knowledge Graph-based Retrieval Systems: These systems utilize knowledge graphs, which are structured representations of knowledge, to enhance retrieval. They leverage the relationships and connections within the knowledge graph to improve the relevance of the retrieved documents.

4. Personalized Retrieval Systems: These systems take into account the user's preferences

In [62]:
query_set = [
    "How do LLMs learn?",
    "Write a short history of NLP from 2000 to 2023",
    "Write code that demonstrates simple speech recognition",
    "What is coreference resolution?"
]

In [63]:
for query in query_set:
    print(rag_pipeline.run(query))

LLMs (Language Models) learn by analyzing large amounts of text data and predicting the likelihood of a word or sequence of words occurring in a given context. They are trained using a process called Masked Language Modeling (MLM), where words in a sentence are randomly masked, and the model is tasked with predicting the masked words based on the surrounding context.

During training, the LLM is presented with a series of sentences where some words are masked. It then learns to predict the masked words by considering the context provided by the surrounding words. The model is trained to assign higher probabilities to words that are more likely to appear in a given context.

The training process involves adjusting the parameters of the LLM based on the error between the predicted probabilities and the actual words. This is done through backpropagation and gradient descent, where the model iteratively updates its parameters to improve its predictions.

By analyzing a large corpus of text

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

From 2000 to 2023, Natural Language Processing (NLP) has made significant advancements and witnessed remarkable growth. Various developments and breakthroughs have shaped the field during this period.

In the early 2000s, researchers focused on core areas of NLP, such as syntactic parsing, semantic analysis, and named entity recognition. They explored rule-based approaches and statistical models to improve the accuracy of these tasks. Researchers also worked on building large annotated datasets to train machine learning models.

Around 2006, there was a shift towards utilizing more data-driven approaches in NLP. This led to the rise of machine learning techniques, including supervised, unsupervised, and semi-supervised learning. Researchers started using large-scale datasets and more complex models, such as neural networks, to achieve better performance in various NLP tasks.

In the following years, there was a surge in research related to specific NLP tasks, such as sentiment analysis

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Certainly! Here's a simple example code that demonstrates speech recognition using the SpeechRecognition library in Python:

```python
import speech_recognition as sr

# Create a recognizer object
recognizer = sr.Recognizer()

# Define the audio source (e.g., microphone)
with sr.Microphone() as source:
    print("Listening...")

    # Adjust for ambient noise
    recognizer.adjust_for_ambient_noise(source)

    # Capture the audio input
    audio = recognizer.listen(source)

    try:
        # Use the recognizer to convert speech to text
        text = recognizer.recognize_google(audio)
        print("You said:", text)

    except sr.UnknownValueError:
        print("Oops! Unable to understand audio.")

    except sr.RequestError as e:
        print("Oops! An error occurred.", e)
```

In this code, we first import the `speech_recognition` library. We create a recognizer object and define the audio source as the microphone. We then use the `adjust_for_ambient_noise` method to account fo

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Coreference resolution is the task of linking together mentions in a text that refer to the same entity. It involves identifying and connecting pronouns, definite and indefinite noun phrases, and names that refer to the same entity in a discourse model. The goal is to create a set of coreference chains or clusters. Various approaches, such as machine learning, have been used to classify pairs of noun phrases as coreferring or not, based on annotated data. A separate clustering mechanism is then used to reconcile any conflicting classifications and create a partition of the noun phrases.
