## Importing Dependencies

In [1]:
import pandas as pd
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TfidfRetriever, BM25Retriever
from haystack.pipelines import DocumentSearchPipeline
from haystack import Document
from haystack.utils import print_documents


## Loading Dataset

In [2]:
website_df = pd.read_csv('../data/plaksha website - Sheet2m.csv')
website_df.head()

| Unnamed: 0 | Crisp | Detailed |
|---|---|---|
| 0 | Plaksha University, founded in 2019, emerged a... | Plaksha University is the culmination of a vis... |
| 1 | Plaksha University's framework rests upon thre... | Plaksha University's mission is underpinned by... |
| 2 | Plaksha University's founders represent a dive... | The driving force behind Plaksha University co... |
| 3 | Back in 2017, Plaksha University formed an Aca... | In 2017, Plaksha University took a significant... |
| 4 | Plaksha University has forged partnerships wit... | Plaksha University's commitment to fostering t... |


## Creating an In-Memory Document Store

In [3]:
document_store_instore = InMemoryDocumentStore(use_bm25=False, use_gpu=True)

### Casting Data into Document Objects

The `Document` class has the following structure:

```python
class Document:
    content: Union[str, pd.DataFrame]
    content_type: Literal["text", "table", "image"]
    id: str
    meta: Dict[str, Any]
    score: Optional[float] = None
    embedding: Optional[np.ndarray] = None
    id_hash_keys: Optional[List[str]] = None
```
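To make the defaults in that listing concrete, here is a minimal sketch using a stand-in dataclass. `SketchDocument` and the sample `rows` are hypothetical names invented for this illustration; the class mirrors only a subset of the fields listed above and is not Haystack's `Document`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in mirroring some of the fields above; not Haystack's class."""
    content: str
    content_type: str = "text"
    meta: Dict[str, Any] = field(default_factory=dict)
    score: Optional[float] = None
    embedding: Optional[List[float]] = None


# Made-up sample rows standing in for the "Crisp" column.
rows = ["First crisp summary.", "Second crisp summary."]

# Keep the row index in `meta` so a hit can be traced back to its source row.
docs = [SketchDocument(content=text, meta={"row": idx}) for idx, text in enumerate(rows)]

print(docs[0].embedding)  # None
```

Because `score` and `embedding` default to `None`, a freshly constructed document carries neither until a retriever or embedding step fills them in.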

In [4]:
# Wrap each "Crisp" summary in a Haystack Document
document_list = []

for text in website_df["Crisp"]:
    document = Document(content=text, content_type='text')
    document_list.append(document)

In [5]:
document_store_instore.write_documents(document_list)

In [6]:
document_list[0].content

"Plaksha University, founded in 2019, emerged as an innovative institution initiated by visionary entrepreneurs and industry leaders. The journey began in 2017 with the inception of the Reimagining Higher Education Foundation, dedicated to transforming technology education, both in India and worldwide. In February 2019, the Mohali campus was officially inaugurated, setting the stage for its inaugural student intake in August 2021. Plaksha University's mission is to redefine engineering and technology education, offering a cutting-edge approach to nurture the next generation of tech leaders, making a significant impact on the educational landscape and the future of technology-driven innovation."

In [7]:
document_list[0].embedding

## Initializing the Retriever (TF-IDF)

TF-IDF is a commonly used baseline for information retrieval that exploits two key intuitions:

- Documents that have more lexical overlap with the query are more likely to be relevant.
- Words that occur in fewer documents are more significant than words that occur in many documents.
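The two intuitions above can be sketched in plain Python. This is a toy scorer under stated assumptions (whitespace tokenisation, raw term counts, and a `log(N / df) + 1` inverse document frequency), not the implementation behind `TfidfRetriever`. The function names and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenisation is enough for a toy example.
    return text.lower().split()


def tfidf_scores(query, docs):
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                # Rarer terms get a larger weight (second intuition).
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                # More occurrences of a query term raise the score (first intuition).
                score += counts[term] * idf
        scores.append(score)
    return scores


toy_docs = [
    "btech fee details and btech fee deadlines",
    "phd fee structure",
    "campus life and student clubs",
]
scores = tfidf_scores("btech fee", toy_docs)
```

On these toy documents the first one scores highest because it matches both query terms, the second matches only the more common term `fee`, and the third has no overlap and scores zero.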


In [8]:
retriever_tfidf = TfidfRetriever(document_store_instore, top_k=3)

## Creating the Pipeline

In [9]:
search_pipeline = DocumentSearchPipeline(retriever_tfidf)

In [10]:
result = search_pipeline.run(
    query="btech degrees fee",
    params={"Retriever": {"top_k": 3}}
)

In [11]:
print_documents(result)


Query: btech degrees fee

{   'content': 'The fee structure for our Ph.D. program at Plaksha University '
               'is as follows:\r\n'
               '\r\n'
               'Admission Fee (One Time): ₹25,000\r\n'
               'Annual Registration Fee (Non-Refundable): ₹5,000\r\n'
               'Annual Tuition Fee: ₹7,00,000\r\n'
               'This provides a clear overview of the financial obligations '
               'for students enrolling in the program.',
    'name': None}

{   'content': "Plaksha University's BTech program fees are divided into two "
               'installments each year, with the semester fees detailed in a '
               'table format. Additionally, students should be aware of '
               'one-time payments, including a Security Deposit of Rs 50,000 '
               "and an Admission Fee of Rs 50,000. It's important to note that "
               'the deadline for the payment of Semester 1 fees varies '
               'depending on the specifi

## Creating an In-Memory Document Store with BM25

In [12]:
document_store_inmemory = InMemoryDocumentStore(use_bm25=True, use_gpu=True)

### Casting Data into Document Objects

In [13]:
# Rebuild the Document list for the BM25-enabled store
document_list = []

for text in website_df["Crisp"]:
    document = Document(content=text, content_type='text')
    document_list.append(document)

In [14]:
document_store_inmemory.write_documents(document_list)

Updating BM25 representation...: 100%|██████████| 78/78 [00:00<00:00, 20998.44 docs/s]


In [15]:
document_list[0].content

"Plaksha University, founded in 2019, emerged as an innovative institution initiated by visionary entrepreneurs and industry leaders. The journey began in 2017 with the inception of the Reimagining Higher Education Foundation, dedicated to transforming technology education, both in India and worldwide. In February 2019, the Mohali campus was officially inaugurated, setting the stage for its inaugural student intake in August 2021. Plaksha University's mission is to redefine engineering and technology education, offering a cutting-edge approach to nurture the next generation of tech leaders, making a significant impact on the educational landscape and the future of technology-driven innovation."

## Initializing the Retriever (BM25)

BM25 is a variant of TF-IDF. It improves on TF-IDF in two main respects:

- It saturates term frequency (tf): each further occurrence of a term in a document adds less to the score, so repeating a word many times cannot dominate the ranking.
- It normalises by document length, so that short documents are favoured over long documents when they have the same amount of word overlap with the query.
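Both effects can be seen in a small, self-contained sketch of the textbook BM25 formula. It assumes whitespace tokenisation, the commonly quoted defaults `k1=1.5` and `b=0.75`, and a smoothed IDF that stays positive. The helper names and toy documents are hypothetical, and this is not the code `BM25Retriever` runs internally.

```python
import math
from collections import Counter


def tf_component(tf, k1=1.5, length_norm=1.0):
    """Saturating term-frequency factor; it approaches k1 + 1 as tf grows."""
    return tf * (k1 + 1) / (tf + k1 * length_norm)


def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency of each term across the collection.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Length normalisation: above 1 for longer-than-average documents.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term in counts:
                idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
                score += idf * tf_component(counts[term], k1, length_norm)
        scores.append(score)
    return scores


# Same query overlap, very different lengths.
toy_docs = ["btech fee", "btech fee " + "filler " * 20]
short_score, long_score = bm25_scores("btech fee", toy_docs)

# The tf factor grows more and more slowly as the raw count increases.
saturation = [tf_component(tf) for tf in (1, 2, 10, 100)]
```

With these settings the short document outranks the padded one even though both contain each query term once, and the tf factor never reaches `k1 + 1 = 2.5` however often a term repeats.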


In [16]:
retriever_bm25 = BM25Retriever(document_store_inmemory, top_k=3)

## Creating the Pipeline

In [17]:
search_pipeline = DocumentSearchPipeline(retriever_bm25)

In [18]:
result = search_pipeline.run(
    query="btech degrees fee",
    params={"Retriever": {"top_k": 3}}
)

print_documents(result)


Query: btech degrees fee

{   'content': "Plaksha University's BTech program fees are divided into two "
               'installments each year, with the semester fees detailed in a '
               'table format. Additionally, students should be aware of '
               'one-time payments, including a Security Deposit of Rs 50,000 '
               "and an Admission Fee of Rs 50,000. It's important to note that "
               'the deadline for the payment of Semester 1 fees varies '
               'depending on the specific admission round. Furthermore, '
               'candidates should be aware that fee revisions are possible in '
               'subsequent years, typically ranging from 5% to 8% annually. '
               'This information provides a transparent overview of the '
               'financial aspects associated with the BTech program at Plaksha '
               'University.',
    'name': None}

{   'content': 'The fee structure for our Ph.D. program at Plaksha Unive