<a href="https://colab.research.google.com/github/viswanathanar/NLP/blob/master/SmartBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installing all the prerequisites**


In [1]:
%%bash
pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr,pdf]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 23.1 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack[colab,ocr,pdf]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-j3vswjfv/farm-haystack_612e74d4bb4e4ae09aaba0328ccbf1e4
  Resolved https://github.com/deepset-ai/haystack.git to commit 95a48c6c9d31826243385c9b0df1ca781ba51d45
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements

DEPRECATION: git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr,pdf] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-j3vswjfv/farm-haystack_612e74d4bb4e4ae09aaba0328ccbf1e4


In [2]:
import logging
logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

**Upload Input Data**

Uploading the POV data (P7-InfySmartbotWhitePaper-ContentIntelligenceusingICTP.docx) and the accompanying Q&A document (P7-InfySmartbotWhitePaper-QnA.docx).

In [3]:
from google.colab import files
upload = files.upload()

Saving P7-InfySmartbotWhitePaper-ContentIntelligenceusingICTP.docx to P7-InfySmartbotWhitePaper-ContentIntelligenceusingICTP.docx
Saving P7-InfySmartbotWhitePaper-QnA.docx to P7-InfySmartbotWhitePaper-QnA.docx


**Preprocessing the Data using Haystack**

Haystack includes a suite of tools to extract text from different file types, normalize whitespace, and split text into smaller pieces to optimize retrieval. These preprocessing steps can have a big impact on the system's performance, so handling the data effectively is key to getting the most out of Haystack.

In [5]:
from haystack.nodes import PreProcessor
from haystack.utils import convert_files_to_docs

# Convert each supported file found under doc_dir (here, the uploaded .docx files) into Documents
doc_dir = "./"
all_docs = convert_files_to_docs(dir_path=doc_dir)

# Clean the text and split it into passages of about 150 words without cutting sentences
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=150,
    split_respect_sentence_boundary=True,
)
docs = preprocessor.process(all_docs)

print(f"n_files_input: {len(all_docs)}\nn_docs_output: {len(docs)}")

INFO:haystack.utils.preprocessing:Converting P7-InfySmartbotWhitePaper-ContentIntelligenceusingICTP.docx
INFO:haystack.utils.preprocessing:Converting P7-InfySmartbotWhitePaper-QnA.docx
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Preprocessing:   0%|          | 0/2 [00:00<?, ?docs/s]

n_files_input: 2
n_docs_output: 19
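
The word-based splitting configured above can be illustrated with a minimal, standard-library sketch. This is not Haystack's implementation: the function name `split_by_words`, the naive regex sentence splitter, and the sample text are assumptions made for illustration only. It greedily packs whole sentences into chunks of at most `split_length` words, which is the idea behind combining `split_by="word"` with `split_respect_sentence_boundary=True`.

```python
import re
from typing import List


def split_by_words(text: str, split_length: int = 150) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `split_length` words.

    A single sentence longer than `split_length` is kept whole as its own chunk.
    """
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and current_words + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical sample text, split with a small limit so the effect is visible.
sample = "ICTP is an NLP based solution. It derives content intelligence. It eases annotation."
chunks = split_by_words(sample, split_length=10)
for chunk in chunks:
    print(len(chunk.split()), chunk)
```

Keeping sentences intact matters for extractive Q&A, because the reader can only return answer spans that lie entirely inside one passage.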


**Initializing the DocumentStore**

We will start creating our Q&A system by initializing a DocumentStore. A DocumentStore stores the Documents that our Q&A system uses to find answers to our questions.

In [6]:
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents(docs)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Updating BM25 representation...:   0%|          | 0/19 [00:00<?, ? docs/s]

**Building an indexing pipeline**

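Indexing our input documents for faster retrieval. Note that TextIndexingPipeline reads each file as plain text by default, so the .docx files are ingested here as raw ZIP bytes (the output below begins with PK\x03\x04). The properly converted documents were already written to the DocumentStore in the previous step.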

In [13]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

# Index only the uploaded .docx files; this skips Colab's '.config' and 'sample_data' entries
files_to_index = [f for f in os.listdir("./") if f.endswith(".docx")]
print(files_to_index)

indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)

INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.


['P7-InfySmartbotWhitePaper-ContentIntelligenceusingICTP.docx', 'P7-InfySmartbotWhitePaper-QnA.docx']


Converting files:   0%|          | 0/2 [00:00<?, ?it/s]

Preprocessing:   0%|          | 0/2 [00:00<?, ?docs/s]

Updating BM25 representation...:   0%|          | 0/323 [00:00<?, ? docs/s]

{'documents': [<Document: {'content': 'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00VuLǎ\x01\x00\x00\\\x07\x00\x00\x13\x00\x08\x02[Content_Types].xml \x04\x02(\x00\x02\x00\x00\x00...
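
The `PK\x03\x04` prefix in the output above is the signature of a ZIP archive: a .docx file is a ZIP container whose main text lives in `word/document.xml`. The following standard-library sketch shows how such a file can be recognized and its paragraph text extracted. It is an illustration under stated assumptions, not the converter Haystack uses: the helper names `is_zip_container` and `docx_paragraphs` and the tiny in-memory sample are hypothetical. Haystack 1.x also provides a DocxToTextConverter node for this purpose.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def is_zip_container(data: bytes) -> bool:
    """Return True if the bytes start with the ZIP local-file-header signature."""
    return data[:4] == b"PK\x03\x04"


def docx_paragraphs(data: bytes) -> List[str]:
    """Extract paragraph texts from the main document part of a .docx file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Concatenate all text runs (w:t) inside the paragraph (w:p).
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


# Build a tiny in-memory stand-in for a .docx file (hypothetical content).
xml_body = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>What is ICTP?</w:t></w:r></w:p></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", xml_body)
sample = buffer.getvalue()

print(is_zip_container(sample))
print(docx_paragraphs(sample))
```

A check like `is_zip_container` can be used before indexing to catch files that would otherwise be stored as unreadable binary content.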

**Building up a basic Q&A System**

Depending on the results, we will refine the Q&A embeddings further.

In [14]:
from haystack.nodes import BM25Retriever
retriever = BM25Retriever(document_store=document_store)
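
To give a feel for how a BM25 retriever ranks passages, here is a minimal sketch of the Okapi BM25 scoring formula in plain Python. It is not the code Haystack runs: the function name `bm25_scores`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant (with `1 +` inside the logarithm), and the sample passages are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight than terms found in most documents.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency is dampened and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical passages, lowercased and stripped of punctuation for simplicity.
passages = [
    "ictp is a natural language processing based solution",
    "the annotation process is used for training information extraction",
    "documents are stored in a document store",
]
scores = bm25_scores("what is ictp", passages)
best = max(range(len(passages)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Because BM25 relies on exact term overlap, it needs no model training or embeddings, which makes it a reasonable first retriever for a small document set like this one.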

In [15]:
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading pytorch_model.bin:   0%|          | 0.00/496M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


In [16]:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

In [20]:
# Retrieve the top 10 candidate documents, then extract the top 3 answers
prediction = pipe.run(
    query="What is ICTP?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 3}
    }
)

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

In [21]:
from pprint import pprint
pprint(prediction)

{'answers': [<Answer {'answer': 'eases the annotation process', 'type': 'extractive', 'score': 0.6453227400779724, 'context': 'How does ICTP reduce the extraction training time?\nAns: ICTP eases the annotation process, which is used for  training information extraction.  In ICT', 'offsets_in_document': [{'start': 546, 'end': 574}], 'offsets_in_context': [{'start': 61, 'end': 89}], 'document_ids': ['4c65decf6721d58a03f2cf05e0f7a1fa'], 'meta': {'name': 'P7-InfySmartbotWhitePaper-QnA.docx', '_split_id': 0}}>,
             <Answer {'answer': 'a Natural Language Processing (NLP) based solution', 'type': 'extractive', 'score': 0.6205132007598877, 'context': '“Infosys Cognitive Tagging Platform” or ICTP, a Natural Language Processing (NLP) based solution, can derive content intelligence from the documents o', 'offsets_in_document': [{'start': 46, 'end': 96}], 'offsets_in_context': [{'start': 46, 'end': 96}], 'document_ids': ['4dd8645e3f9bcc5898f7e525a1ce989a'], 'meta': {'name': 'P7-InfySmartb

In [22]:
prediction['answers'][0]

<Answer {'answer': 'eases the annotation process', 'type': 'extractive', 'score': 0.6453227400779724, 'context': 'How does ICTP reduce the extraction training time?\nAns: ICTP eases the annotation process, which is used for  training information extraction.  In ICT', 'offsets_in_document': [{'start': 546, 'end': 574}], 'offsets_in_context': [{'start': 61, 'end': 89}], 'document_ids': ['4c65decf6721d58a03f2cf05e0f7a1fa'], 'meta': {'name': 'P7-InfySmartbotWhitePaper-QnA.docx', '_split_id': 0}}>
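
Since each answer carries a confidence score, a common follow-up is to keep only the answers above a threshold. The sketch below shows that filtering step on its own. `SimpleAnswer` is a hypothetical stand-in that mirrors only the `answer` and `score` fields visible in the output above, and the threshold of 0.63 is an arbitrary illustrative value, not a recommendation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleAnswer:
    """Hypothetical stand-in exposing the 'answer' and 'score' fields seen in the output."""
    answer: str
    score: float


def confident_answers(answers: List[SimpleAnswer], min_score: float) -> List[SimpleAnswer]:
    """Keep answers scoring at least `min_score`, best first."""
    kept = [a for a in answers if a.score >= min_score]
    return sorted(kept, key=lambda a: a.score, reverse=True)


# Answer texts and scores copied from the prediction output above.
answers = [
    SimpleAnswer("a Natural Language Processing (NLP) based solution", 0.6205132007598877),
    SimpleAnswer("eases the annotation process", 0.6453227400779724),
]
for ans in confident_answers(answers, min_score=0.63):
    print(f"{ans.score:.3f}  {ans.answer}")
```

In this run the top-scoring span does not actually define ICTP, while the second one does, so a score threshold alone is not a substitute for inspecting the returned context.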