# Installing Haystack


This cell installs the latest version of Haystack using pip.

---



In [83]:
%%bash
pip install anvil-uplink
pip install --upgrade pip
pip install "farm-haystack[colab]"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


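Configure logging. The `haystack` logger is set to INFO, while the root logger is set to DEBUG, which is why DEBUG messages from other libraries such as urllib3 appear in the cell outputs below.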

In [84]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger("haystack").setLevel(logging.INFO)
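
As a rough illustration of why both INFO and DEBUG lines show up later, the sketch below uses only the standard library to show how a logger without its own level inherits one from its nearest configured ancestor. The two child logger names are taken from the log output in this notebook; nothing here calls Haystack itself.

```python
import logging

# Same level setup as the cell above: root at DEBUG, "haystack" at INFO.
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger("haystack").setLevel(logging.INFO)

# Child loggers with no level of their own inherit from the nearest ancestor.
haystack_child = logging.getLogger("haystack.modeling.utils")
other_library = logging.getLogger("urllib3.connectionpool")

haystack_level = logging.getLevelName(haystack_child.getEffectiveLevel())
other_level = logging.getLevelName(other_library.getEffectiveLevel())

print("haystack.modeling.utils ->", haystack_level)
print("urllib3.connectionpool  ->", other_level)
```

To silence the urllib3 DEBUG lines, one option is to leave the root logger at WARNING and raise only the `haystack` logger to INFO.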

# Initializing the Document Store


In [85]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
DEBUG:urllib3.connectionpool:Resetting dropped connection: tm.hs.deepset.ai
DEBUG:urllib3.connectionpool:https://tm.hs.deepset.ai:443 "POST /batch/ HTTP/1.1" 200 13


The DocumentStore is now ready. Now it's time to fill it with some Documents.

# Preparing Documents

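1. Get the Harry Potter book as a .txt file. In this notebook the file is already stored in /content/data/harrypotter, so `fetch_archive_from_http` finds the existing directory, skips the download, and returns False.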

In [86]:
from haystack.utils import fetch_archive_from_http

doc_dir = "/content/data/harrypotter"

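# NOTE: `url` should be the HTTP(S) address of an archive. The value below is
# a local path, which cannot be downloaded; the call returns False because
# doc_dir already contains data.
fetch_archive_from_http(
    url="/content/data/harrypotter/Harry Potter and The Sorcerers Stone.txt",
    output_dir=doc_dir
)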

INFO:haystack.utils.import_utils:Found data stored in '/content/data/harrypotter'. Delete this first if you really want to fetch new data.


False

2. Use `TextIndexingPipeline` to convert the text files in `doc_dir` into Haystack Document objects and write them into the DocumentStore:

In [87]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

# To index every file in doc_dir instead, use:
# files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
files_to_index = [os.path.join(doc_dir, "Harry Potter and The Sorcerers Stone.txt")]

# Convert the file to Documents, preprocess them, and write them to the store
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)

INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.


Converting files:   0%|          | 0/1 [00:00<?, ?it/s]

Preprocessing:   0%|          | 0/1 [00:00<?, ?docs/s]

DEBUG:urllib3.connectionpool:https://tm.hs.deepset.ai:443 "POST /batch/ HTTP/1.1" 200 13


Updating BM25 representation...:   0%|          | 0/435 [00:00<?, ? docs/s]

{'documents': [<Document: {'content': 'Harry Potter and the Sorcerer’s Stone\nBy J.K. Rowling\n\nCHAPTER ONE\n\nThe Boy Who Lived\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly\nnormal, thank you very much. They were the last people you’d expect to be involved in anything\nstrange or mysterious, because they just didn’t hold with such nonsense.\nMr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy\nman with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin\nand blonde and had nearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the neighbors. The Dursleys\nhad a small son called Dudley and in their opinion there was no finer boy anywhere.\nThe Dursleys had everything they wanted, but they also had a secret, and their greatest fear was\nthat somebody would discover it. ', 'content_

# Initializing the Retriever

The search system uses a Retriever, which sifts through all the Documents and returns only the ones relevant to the question. This notebook uses BM25, a keyword-based ranking algorithm.



In [88]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
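
To give a feel for how keyword ranking works, here is a minimal sketch of the general Okapi BM25 formula in plain Python. It is not Haystack's implementation: the function name `bm25_scores`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the three toy documents are all assumptions made for illustration, and `BM25Retriever` may tokenize and weight terms differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency is dampened and normalised by document length.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents (hypothetical, not the indexed book).
docs = [
    "Ginny Weasley is the younger sister of Ron",
    "Mr Dursley was the director of a firm called Grunnings",
    "Dumbledore is the headmaster of Hogwarts",
]
scores = bm25_scores("Who is Ginny", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The document that contains the rare term "ginny" outranks the one that only shares the common word "is", and a document with no query terms scores zero.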

# Initializing the Reader

In [89]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /deepset/roberta-base-squad2/resolve/main/config.json HTTP/1.1" 200 0
INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /deepset/roberta-base-squad2/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://tm.hs.deepset.ai:443 "POST /batch/ HTTP/1.1" 200 13
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
DEBUG:urllib3.connectionpool:Starting new HTTPS 

We've initialized all the components for our pipeline. We're now ready to create the pipeline.

# Creating the Retriever-Reader Pipeline

The `ExtractiveQAPipeline` connects the Reader and the Retriever. Combining the two speeds up processing because the Reader only processes the Documents that the Retriever passes on.

In [90]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

DEBUG:urllib3.connectionpool:https://tm.hs.deepset.ai:443 "POST /batch/ HTTP/1.1" 200 13
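
The two-stage idea can be sketched without Haystack. In the toy example below, `toy_retrieve` and `toy_read` are hypothetical stand-ins (word overlap instead of BM25, a trivial function instead of a transformer model), meant only to show that the expensive stage runs on `top_k` documents rather than on the whole collection.

```python
def toy_retrieve(query, docs, top_k):
    """Cheap stage: rank documents by how many query words they share."""
    terms = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def toy_read(query, doc):
    """Stand-in for the expensive stage; a real Reader runs a model here."""
    return {"query": query, "document": doc}


# Toy collection (hypothetical, not the indexed book).
docs = [
    "Ginny Weasley is the younger sister of Ron",
    "Mr Dursley was the director of Grunnings",
    "Dumbledore is the headmaster of Hogwarts",
    "Hagrid is the keeper of keys at Hogwarts",
    "Hedwig is a snowy owl",
]

query = "who is ginny"
candidates = toy_retrieve(query, docs, top_k=2)
answers = [toy_read(query, doc) for doc in candidates]

print(f"Reader ran on {len(answers)} of {len(docs)} documents")
```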


# Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. You can also set how many documents the Retriever returns and how many answers the Reader returns with the `top_k` parameter.

In [91]:
prediction = pipe.run(
    query="Who is Ginny?",
    params={
        "Retriever": {"top_k": 3},
        "Reader": {"top_k": 3}
    }
)



Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

2. Print out the answers the pipeline returned:
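
No cell accompanies this step. One way to do it, assuming `prediction` is the dictionary returned by `pipe.run()`, is the standard library's `pprint`. The stand-in dictionary below is hypothetical and far smaller than the real result, which holds Haystack objects rather than plain dictionaries; in the notebook you would call `pprint(prediction)` on the real value.

```python
from pprint import pprint

# Hypothetical stand-in for the value returned by pipe.run().
prediction = {
    "query": "Who is Ginny?",
    "answers": [{"answer": "Ron’s younger sister"}],
}

# Pretty-print the full structure.
pprint(prediction)
```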


3. Simplify the printed answers:

In [92]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum" 
)


Query: Who is Ginny?
Answers:
[   {   'answer': 'Ron’s younger sister',
        'context': 'er. “There he is, Mom, there he is,\n'
                   'look!”\n'
                   'It was Ginny Weasley, Ron’s younger sister, but she wasn’t '
                   'pointing at Ron.\n'
                   '“Harry Potter!” she squealed. “'},
    {   'answer': 'You’re not old enough',
        'context': ', also red-headed, who was holding her hand,\n'
                   '“Mom, can’t I go…”\n'
                   '“You’re not old enough, Ginny, now be quiet. All right, '
                   'Percy, you go first.”\n'
                   'What loo'},
    {   'answer': 'You-Know-Who',
        'context': 'onto the platform.”\n'
                   '“Never mind that, do you think he remembers what '
                   'You-Know-Who looks like?”\n'
                   'Their mother suddenly became very stern.\n'
                   '“I forbid you '}]


In [93]:
from haystack.nodes import TransformersSummarizer
from haystack import Document
from haystack.pipelines import SearchSummarizationPipeline

summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
pipeline = SearchSummarizationPipeline(summarizer=summarizer, retriever=retriever, generate_single_summary=True)



INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /google/pegasus-xsum/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://tm.hs.deepset.ai:443 "POST /batch/ HTTP/1.1" 200 13
DEBUG:urllib3.connectionpool:https://tm.hs.deepset.ai:443 "POST /batch/ HTTP/1.1" 200 13


# Querying for Summarization


In [96]:
result = pipeline.run(query="Describe Dumbledore ", params={"Retriever": {"top_k": 2}})



In [97]:
# Flatten each retrieved document to a single line
for doc in result['documents']:
    doc.content = doc.content.replace('\n', ' ')

# Join the documents and strip the promotional line from the source file
output = " ".join(doc.content for doc in result['documents'])
output = output.replace("Get free e-books and video tutorials at www.passuneb.com", "")

# Print the output in fixed-width chunks of 100 characters
width = 100
for start in range(0, len(output), width):
    print(output[start:start + width])

“Don’t tell me you’d never heard of Dumbledore!” said Ron. “Can I have a frog? I might get Agrippa —
 thanks —” Harry turned over his card and read: ALBUS DUMBLEDORE CURRENTLY HEADMASTER OF HOGWARTS Pa
ge 73 of 226  Considered by many the greatest wizard of modern times, Dumbledore is particularly fam
ous for his defeat of the dark wizard Grindelwald in 1945, for the discovery of the twelve uses of d
ragon’s blood, and his work on alchemy with his partner, Nicolas Flamel. Professor Dumbledore enjoys
 chamber music and tenpin bowling. Harry turned the card back over and saw, to his astonishment, tha
t Dumbledore’s face had disappeared. “He’s gone!” “Well, you can’t expect him to hang around all day
,” said Ron. “He’ll be back. No, I’ve got Morgana again and I’ve got about six of her… do you want i
t? You can start collecting.” Ron’s eyes strayed to the pile of Chocolate Frogs waiting to be unwrap
ped. “Help yourself,” said Harry. “But in, you know, the Muggle world, people just stay put
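
The fixed-width printing above cuts words in half at the line boundary ("Pa" / "ge", "fam" / "ous"). A possible alternative, sketched here with the standard library's `textwrap` module, breaks lines at word boundaries instead. The sample sentence is taken from the output above, and the width of 40 is an arbitrary choice for the demonstration.

```python
import textwrap

text = (
    "Considered by many the greatest wizard of modern times, Dumbledore is "
    "particularly famous for his defeat of the dark wizard Grindelwald in 1945."
)
width = 40

# Fixed-width slicing, as in the cell above: may split words across lines.
fixed = [text[start:start + width] for start in range(0, len(text), width)]

# textwrap.wrap breaks only at whitespace, so words stay intact.
wrapped = textwrap.wrap(text, width=width)

print("\n".join(wrapped))
```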