<a href="https://colab.research.google.com/github/riphunter7001x/MultiModal_RAG/blob/main/Hybrid_Search_in_RAG_with_opensource_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [2]:
# Sample documents
documents = [
    "This is a list which containig sample documents.",
    "Keywords are important for keyword-based search.",
    "Document analysis involves extracting keywords.",
    "Keyword-based search relies on sparse embeddings."
]

In [3]:
query="keyword-based search"

In [4]:
import re
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    return text


In [5]:
preprocess_documents=[preprocess_text(doc) for doc in documents]

In [6]:
preprocess_documents

['this is a list which containig sample documents',
 'keywords are important for keywordbased search',
 'document analysis involves extracting keywords',
 'keywordbased search relies on sparse embeddings']

In [7]:
print("Preprocessed Documents:")
for doc in preprocess_documents:
    print(doc)

Preprocessed Documents:
this is a list which containig sample documents
keywords are important for keywordbased search
document analysis involves extracting keywords
keywordbased search relies on sparse embeddings


In [8]:
print("Original Query:")
print(query)

Original Query:
keyword-based search


In [13]:
preprocessed_query = preprocess_text(query)

In [14]:
preprocessed_query

'keywordbased search'

In [15]:
vector=TfidfVectorizer()

In [16]:
X=vector.fit_transform(preprocess_documents)

In [17]:
X.toarray()

array([[0.        , 0.        , 0.37796447, 0.        , 0.37796447,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.37796447, 0.        , 0.        , 0.37796447, 0.        ,
        0.        , 0.37796447, 0.        , 0.        , 0.37796447,
        0.37796447],
       [0.        , 0.4533864 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.4533864 , 0.4533864 , 0.        ,
        0.        , 0.35745504, 0.35745504, 0.        , 0.        ,
        0.        , 0.        , 0.35745504, 0.        , 0.        ,
        0.        ],
       [0.46516193, 0.        , 0.        , 0.46516193, 0.        ,
        0.        , 0.46516193, 0.        , 0.        , 0.46516193,
        0.        , 0.        , 0.36673901, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.43671931, 0.        , 0.        , 0.       

In [18]:
X.toarray()[0]

array([0.        , 0.        , 0.37796447, 0.        , 0.37796447,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.37796447, 0.        , 0.        , 0.37796447, 0.        ,
       0.        , 0.37796447, 0.        , 0.        , 0.37796447,
       0.37796447])

In [19]:
query_embedding=vector.transform([preprocessed_query])

In [20]:
query_embedding.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.        ]])

In [21]:
similarities = cosine_similarity(X, query_embedding)

In [22]:
similarities

array([[0.        ],
       [0.50551777],
       [0.        ],
       [0.48693426]])
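
The score for the second document can be checked by hand. `TfidfVectorizer` L2-normalizes each row by default, so the cosine similarity reduces to a dot product over the terms the query and document share. The sketch below copies the weights for `keywordbased` and `search` from the outputs above; the variable names are illustrative only.

```python
import numpy as np

# TF-IDF weights of the two shared terms ('keywordbased', 'search'),
# copied from the printed document matrix (second row) and query vector.
doc_weights = np.array([0.35745504, 0.35745504])
query_weights = np.array([0.70710678, 0.70710678])

# Both vectors are already L2-normalized, so cosine similarity is a dot
# product; terms absent from the query contribute nothing to the sum.
similarity = float(np.dot(doc_weights, query_weights))
print(round(similarity, 6))
```

The result agrees with the second entry of `similarities` up to rounding.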

In [23]:
np.argsort(similarities,axis=0)

array([[0],
       [2],
       [3],
       [1]])

In [25]:
#Ranking
ranked_indices=np.argsort(similarities,axis=0)[::-1].flatten()

In [26]:
ranked_documents = [documents[i] for i in ranked_indices]

In [27]:
ranked_indices


array([1, 3, 2, 0])

In [28]:
# Output the ranked documents
for i, doc in enumerate(ranked_documents):
    print(f"Rank {i+1}: {doc}")

Rank 1: Keywords are important for keyword-based search.
Rank 2: Keyword-based search relies on sparse embeddings.
Rank 3: Document analysis involves extracting keywords.
Rank 4: This is a list which containig sample documents.


In [29]:
query

'keyword-based search'

In [30]:
documents = [
    "This is a list which containig sample documents.",
    "Keywords are important for keyword-based search.",
    "Document analysis involves extracting keywords.",
    "Keyword-based search relies on sparse embeddings."
]

In [None]:
#https://huggingface.co/sentence-transformers

In [31]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.0-py3-none-any.whl (224 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.11.0->sentence-transform

In [39]:
from sentence_transformers import SentenceTransformer, util
# sentences = ["I'm happy", "I'm full of happiness"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')



In [40]:
document_embeddings = model.encode(documents)

In [41]:
document_embeddings

array([[-0.02178365,  0.05011964, -0.01590293, ...,  0.10129204,
         0.05670731,  0.04024563],
       [-0.00097261, -0.00410358,  0.01657118, ...,  0.02935942,
        -0.01249212,  0.06550964],
       [-0.02925701,  0.07428253, -0.01379713, ...,  0.04377958,
         0.06241399,  0.01498034],
       [-0.00896699, -0.07016608,  0.03400351, ..., -0.01310814,
         0.01019189,  0.06507178]], dtype=float32)

In [42]:
# Sample search query (represented as a dense vector)
query_embedding = model.encode(" keyword based search")

In [43]:
query_embedding

array([-2.78899074e-02,  7.60938879e-03, -3.34221087e-02, -1.35217011e-02,
       -2.85079423e-02,  4.12325412e-02,  1.29690930e-01,  1.11014908e-02,
       -3.57324220e-02,  1.29979169e-02,  4.28221934e-02,  3.97290327e-02,
        1.16726734e-01, -1.65527221e-02,  4.27451218e-03,  2.10218318e-03,
        4.43923883e-02,  6.35507777e-02, -5.19163720e-02, -5.69379218e-02,
        7.19835535e-02,  1.07145216e-02, -2.51452718e-02, -5.25703877e-02,
       -4.18839864e-02,  3.53007093e-02, -1.82282068e-02, -3.70108709e-02,
        2.36918032e-02,  1.98688172e-02, -3.63125764e-02,  1.62455495e-02,
        9.71726999e-02,  1.24289088e-01, -6.22852035e-02, -3.99765410e-02,
       -9.41596255e-02,  3.57973240e-02, -1.57044665e-03,  1.59695391e-02,
       -1.11757398e-01, -7.39829801e-03, -2.11200416e-02, -6.61235210e-03,
        9.32554007e-02,  7.93264806e-03, -9.71482843e-02,  4.77806702e-02,
        2.13952153e-03, -1.30355381e-03, -1.76959857e-01, -4.33786921e-02,
       -5.55116311e-02, -

In [44]:
# Calculate cosine similarity between query and documents

similarities = util.pytorch_cos_sim(document_embeddings, query_embedding)

In [45]:
similarities

tensor([[0.2485],
        [0.8069],
        [0.5307],
        [0.7478]])

In [50]:
# Sort by similarity in descending order so the best match comes first
ranked_indices = similarities.flatten().argsort(descending=True)

In [51]:
ranked_indices

tensor([1, 3, 2, 0])

In [52]:
# Output the ranked documents
for i, idx in enumerate(ranked_indices):
    print(f"Rank {i+1}: Document {idx+1}")

Rank 1: Document 2
Rank 2: Document 4
Rank 3: Document 3
Rank 4: Document 1
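
The sparse (TF-IDF) and dense (sentence-transformer) similarities computed so far can be blended into one score. The sketch below assumes a simple linear interpolation with a weight `alpha`; the helper `hybrid_scores` and the choice `alpha=0.5` are illustrative and not part of the notebook. The scores are copied from the cell outputs above.

```python
import numpy as np

def hybrid_scores(sparse_scores, dense_scores, alpha=0.5):
    """Linear interpolation: (1 - alpha) * sparse + alpha * dense."""
    sparse = np.asarray(sparse_scores, dtype=float)
    dense = np.asarray(dense_scores, dtype=float)
    return (1.0 - alpha) * sparse + alpha * dense

# Cosine similarities copied from the TF-IDF and dense-embedding cells.
sparse_scores = [0.0, 0.50551777, 0.0, 0.48693426]
dense_scores = [0.2485, 0.8069, 0.5307, 0.7478]

scores = hybrid_scores(sparse_scores, dense_scores, alpha=0.5)
ranking = np.argsort(-scores)  # negate so the highest score comes first
for rank, idx in enumerate(ranking, start=1):
    print(f"Rank {rank}: Document {idx + 1} (score={scores[idx]:.4f})")
```

Both inputs here are cosine similarities, so they share a scale. Scores on different scales, such as raw BM25 values, would likely need normalizing before blending.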


In [1]:
doc_path="/content/HADOOP ADMINISTRATION AND BDA NOTES.pdf"

In [2]:
!pip install pypdf



In [3]:
!pip install langchain_community



In [4]:
from langchain_community.document_loaders import PyPDFLoader

In [5]:
loader=PyPDFLoader(doc_path)

In [6]:
docs=loader.load()

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [8]:
splitter = RecursiveCharacterTextSplitter(chunk_size=200,chunk_overlap=30)

In [9]:
chunks = splitter.split_documents(docs)

In [10]:
chunks

[Document(page_content='What is a Hadoop Cluster?  \nA cluster is basically a collection. A computer cluster is a collection of computers', metadata={'source': '/content/HADOOP ADMINISTRATION AND BDA NOTES.pdf', 'page': 0}),
 Document(page_content='interconnected to each other over a network. Similarly, a  Hadoop Cluster  is a collection of \nextraordinary computational systems designed and deployed to store, optimise, and analyse', metadata={'source': '/content/HADOOP ADMINISTRATION AND BDA NOTES.pdf', 'page': 0}),
 Document(page_content='petabytes of Big Data with astonishing agility.   \nHere this Big Data Co urse will explain to you more about Hadoop Cluster with real -time', metadata={'source': '/content/HADOOP ADMINISTRATION AND BDA NOTES.pdf', 'page': 0}),
 Document(page_content='project experience, which was well designed by Top Industry working Experts.  \nFactors deciding the Hadoop Cluster Capacity', metadata={'source': '/content/HADOOP ADMINISTRATION AND BDA NOTES.pdf', 'pa

In [11]:
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings

In [12]:
import os
HF_TOKEN = os.environ["HF_TOKEN"]  # read the token from the environment; never hard-code secrets

In [13]:
embeddings = HuggingFaceInferenceAPIEmbeddings(api_key=HF_TOKEN, model_name="BAAI/bge-base-en-v1.5")

In [14]:
!pip install chromadb



In [15]:
from langchain.vectorstores import Chroma

In [16]:
vectorstore=Chroma.from_documents(chunks,embeddings)

In [17]:
vectorstore_retreiver = vectorstore.as_retriever(search_kwargs={"k": 3})

In [18]:
vectorstore_retreiver

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceInferenceAPIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7cd1ce5a72b0>, search_kwargs={'k': 3})

In [19]:
!pip install rank_bm25



In [20]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

In [21]:
keyword_retriever = BM25Retriever.from_documents(chunks)

In [22]:
keyword_retriever.k =  3

In [23]:
ensemble_retriever = EnsembleRetriever(retrievers=[vectorstore_retreiver,keyword_retriever],weights=[0.3, 0.7])

# Mixing vector search and keyword search for Hybrid search

## hybrid_score = (1 - alpha) * sparse_score + alpha * dense_score
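
The formula above interpolates raw scores. To my understanding, LangChain's `EnsembleRetriever` instead fuses the ranked lists with weighted Reciprocal Rank Fusion, so the `weights` apply to rank positions rather than to similarity values. The sketch below is a minimal illustration of that idea, not the library's code: `weighted_rrf`, the constant `c=60`, and the chunk ids are assumptions for the example.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each list adds weight / (rank + c) per document."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 results from each retriever (k = 3, as in the notebook).
dense_ranking = ["chunk_a", "chunk_b", "chunk_c"]
sparse_ranking = ["chunk_d", "chunk_a", "chunk_e"]

# Same weight order as the ensemble above: 0.3 for dense, 0.7 for keyword.
fused = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.3, 0.7])
print(fused)
```

A chunk returned by both retrievers accumulates two contributions, which is why `chunk_a` ends up ahead of chunks that only one retriever found.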

In [24]:
model_name = "HuggingFaceH4/zephyr-7b-beta"

In [25]:
!pip install bitsandbytes



In [26]:
!pip install accelerate



In [27]:
import torch
from transformers import ( AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline, )
from langchain_community.llms import HuggingFacePipeline

In [28]:
# function for loading 4-bit quantized model
def load_quantized_model(model_name: str):
    """
    model_name: Name or path of the model to be loaded.
    return: Loaded quantized model.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config,
    )
    return model

In [29]:
# initializing tokenizer
def initialize_tokenizer(model_name: str):
    """
    model_name: Name or path of the model for tokenizer initialization.
    return: Initialized tokenizer.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name, return_token_type_ids=False)
    tokenizer.bos_token_id = 1  # Set beginning of sentence token id
    return tokenizer

In [30]:
tokenizer = initialize_tokenizer(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [31]:
model = load_quantized_model(model_name)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [32]:
# Use a distinct name so the imported `pipeline` factory is not shadowed
text_generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_length=2048,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

In [33]:
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)


In [34]:
from langchain.chains import RetrievalQA

In [35]:
normal_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore_retreiver
)

In [36]:
hybrid_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=ensemble_retriever
)

In [40]:
response1 = normal_chain.invoke("what is hadoop?")

In [41]:
response1

{'query': 'what is hadoop?',
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nlearntek  \n \nApac he Hadoop  is an excellent software framework that allows the processing of big \ndata elements. It can use the power of commodity hardware by employing a modular\n\ninterconnected to each other over a network. Similarly, a  Hadoop Cluster  is a collection of \nextraordinary computational systems designed and deployed to store, optimise, and analyse\n\nan efficient Hadoop Cluster with optimum performance  \n\uf0b7 Volume of Data  \n \nIf you ever wonder how Hadoop even came into existence, it is because of  the huge volume\n\nQuestion: what is hadoop?\nHelpful Answer: Hadoop is a software framework designed to store and process large datasets in a distributed computing environment. It's built to handle the three V's of big data: volume, variety, and velocity. Had

In [42]:
print(response1.get("result"))

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

learntek  
 
Apac he Hadoop  is an excellent software framework that allows the processing of big 
data elements. It can use the power of commodity hardware by employing a modular

interconnected to each other over a network. Similarly, a  Hadoop Cluster  is a collection of 
extraordinary computational systems designed and deployed to store, optimise, and analyse

an efficient Hadoop Cluster with optimum performance  
 Volume of Data  
 
If you ever wonder how Hadoop even came into existence, it is because of  the huge volume

Question: what is hadoop?
Helpful Answer: Hadoop is a software framework designed to store and process large datasets in a distributed computing environment. It's built to handle the three V's of big data: volume, variety, and velocity. Hadoop is an open-source technology initially developed by Apache

In [43]:
response2 = hybrid_chain.invoke("what is hadoop?")

In [44]:
response2

{'query': 'what is hadoop?',
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nNow that we know what exactly a Hadoop Cluster is, let us now learn  why exactly we need \nto plan a Hadoop Cluster and what are various factors we need to look into, in order to plan\n\n\uf0b7  \uf0b7  \nBefore we start learning about the Hadoop cluster first thing we need to know is what \nactually  cluster  means. Cluster is a collection of something, a simple computer cluster\n\nwhat actually this scalable property means. Suppose an organization wants to analyze \nor maintain around 5PB of data for the upcoming 2 months so he used 10\n\nlearntek  \n \nApac he Hadoop  is an excellent software framework that allows the processing of big \ndata elements. It can use the power of commodity hardware by employing a modular\n\ninterconnected to each other over a network. Similarly, a  H

In [45]:
print(response2.get("result"))

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Now that we know what exactly a Hadoop Cluster is, let us now learn  why exactly we need 
to plan a Hadoop Cluster and what are various factors we need to look into, in order to plan

    
Before we start learning about the Hadoop cluster first thing we need to know is what 
actually  cluster  means. Cluster is a collection of something, a simple computer cluster

what actually this scalable property means. Suppose an organization wants to analyze 
or maintain around 5PB of data for the upcoming 2 months so he used 10

learntek  
 
Apac he Hadoop  is an excellent software framework that allows the processing of big 
data elements. It can use the power of commodity hardware by employing a modular

interconnected to each other over a network. Similarly, a  Hadoop Cluster  is a collection of 
extraordinary computational syste