## Importing Dependencies

In [1]:
import pandas as pd
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, DensePassageRetriever
from haystack.pipelines import DocumentSearchPipeline
from haystack import Document
from haystack.utils import print_documents

  from .autonotebook import tqdm as notebook_tqdm


## Loading Datasets

In [2]:
website_df = pd.read_csv('../data/plaksha website - Sheet2m.csv')
website_df.head()

Unnamed: 0,Crisp,Detailed
0,"Plaksha University, founded in 2019, emerged a...",Plaksha University is the culmination of a vis...
1,Plaksha University's framework rests upon thre...,Plaksha University's mission is underpinned by...
2,Plaksha University's founders represent a dive...,The driving force behind Plaksha University co...
3,"Back in 2017, Plaksha University formed an Aca...","In 2017, Plaksha University took a significant..."
4,Plaksha University has forged partnerships wit...,Plaksha University's commitment to fostering t...


## Creating a InMemory Data Store

In [3]:
document_store_inmemory = InMemoryDocumentStore(use_bm25=False, use_gpu=True, similarity="dot_product")

### Casting data into Document object

In [4]:
document_list = []

for i in website_df["Crisp"]:
    document = Document(content=i, content_type='text')
    document_list.append(document)

In [5]:
document_store_inmemory.write_documents(document_list)

## Initializing the Retriever (Embedding)



In [6]:
retriever_embedded = EmbeddingRetriever(
    document_store=document_store_inmemory,
   embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
   model_format="sentence_transformers"
)

  return self.fget.__get__(instance, owner)()


In [7]:
document_store_inmemory.update_embeddings(retriever_embedded)

Batches: 100%|██████████| 3/3 [00:02<00:00,  1.14it/s]docs/s]
Documents Processed: 10000 docs [00:03, 3288.14 docs/s]       


## Creating the Pipeline

In [8]:
search_pipeline = DocumentSearchPipeline(retriever_embedded)

In [14]:
result = search_pipeline.run(
    query = "btech program",
    params={"Retriever": {"top_k":3}}
)

print_documents(result)

Batches: 100%|██████████| 1/1 [00:00<00:00, 32.27it/s]


Query: btech program

{   'content': 'Plaksha University invites outstanding and inquisitive '
               'individuals to become part of its BTech program, designed to '
               'cultivate future technology leaders. The admission process for '
               'a limited cohort of 200 students encompasses four key steps: '
               'an online application, a virtual interaction phase, the option '
               'to apply for need-based scholarships, and the issuance of '
               'final decisions on a rolling basis. This meticulous process '
               'ensures that the university identifies and selects a talented '
               'and diverse group of students to embark on their educational '
               'journey in technology and innovation.',
    'name': None}

{   'content': "Plaksha University's BTech program fees are divided into two "
               'installments each year, with the semester fees detailed in a '
               'table format. Addition




## Initializing the Retriever (DPR)

Dense Passage Retrieval is a retrieval method that calculates relevance using dense representations. Key features:

- One BERT base model to encode documents
- One BERT base model to encode queries
- Ranking of Documents done by dot product similarity between query and document embeddings


Indexing using DPR is comparatively expensive in terms of required computation since all documents in the database need to be processed through the transformer. In order to keep query times low, 

In [10]:
retriever_dpr = DensePassageRetriever(
    document_store=document_store_inmemory,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)

## Creating the Pipeline

In [11]:
search_pipeline = DocumentSearchPipeline(retriever_embedded)

In [13]:
result = search_pipeline.run(
    query = "btech degrees",
    params={"Retriever": {"top_k":3}}
)

print_documents(result)

Batches: 100%|██████████| 1/1 [00:00<00:00, 30.14it/s]


Query: btech degrees

{   'content': 'Plaksha University invites outstanding and inquisitive '
               'individuals to become part of its BTech program, designed to '
               'cultivate future technology leaders. The admission process for '
               'a limited cohort of 200 students encompasses four key steps: '
               'an online application, a virtual interaction phase, the option '
               'to apply for need-based scholarships, and the issuance of '
               'final decisions on a rolling basis. This meticulous process '
               'ensures that the university identifies and selects a talented '
               'and diverse group of students to embark on their educational '
               'journey in technology and innovation.',
    'name': None}

{   'content': 'Plaksha University offers four unique and profoundly '
               'interdisciplinary undergraduate B.Tech degrees, representing '
               'the vanguard of 21st-century e


