<a href="https://colab.research.google.com/github/xhiroga/til/blob/main/software-engineering/haystack/quickstart/haystack_01_basic_qa_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Build Your First Question Answering System | Haystack](https://haystack.deepset.ai/tutorials/01_basic_qa_pipeline)

In [15]:
%%bash

nvidia-smi

Sun Sep 24 03:15:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    48W / 400W |   2367MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [16]:
%%bash

pip install --upgrade pip
pip install "farm-haystack[colab,inference]"





In [17]:
from haystack.telemetry import tutorial_running

tutorial_running(1)

In [18]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

In [19]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


In [20]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_your_first_question_answering_system"

fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
    output_dir=doc_dir,
)

INFO:haystack.utils.import_utils:Found data stored in 'data/build_your_first_question_answering_system'. Delete this first if you really want to fetch new data.


False

In [None]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)

In [22]:
!ls data/build_your_first_question_answering_system

0_Game_of_Thrones__season_8_.txt
101_Titties_and_Dragons.txt
102_The_Princess_and_the_Queen.txt
10_Beyond_the_Wall__Game_of_Thrones_.txt
118_Dark_Wings__Dark_Words.txt
119_Walk_of_Punishment.txt
11_The_Dragon_and_the_Wolf.txt
120_And_Now_His_Watch_Is_Ended.txt
121_The_Bear_and_the_Maiden_Fair.txt
126_Kissed_by_Fire.txt
127_The_Climb__Game_of_Thrones_.txt
129_Second_Sons.txt
12_Fire.txt
130_Game_of_Thrones_title_sequence.txt
131_Mhysa.txt
133_Game_of_Thrones__Season_5__soundtrack_.txt
134_Game_of_Thrones__Season_6__soundtrack_.txt
135_Game_of_Thrones__Season_7__soundtrack_.txt
136_Game_of_Thrones__Season_8__soundtrack_.txt
145_Elio_M._García_Jr._and_Linda_Antonsson.txt
146_The_Sons_of_the_Dragon.txt
148_Game_of_Thrones__Winter_Is_Coming.txt
151_Ellaria_Sand.txt
154_Margaery_Tyrell.txt
160_Viserys_Targaryen.txt
191_Gendry.txt
193_Lord_Snow.txt
194_A_Song_of_Ice_and_Fire.txt
195_World_of_A_Song_of_Ice_and_Fire.txt
197_A_Game_of_Thrones.txt
198_A_Clash_of_Kings.txt
199_A_Storm_of_Swords.tx

## Initializing the Retriever

In [23]:
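# BM25 is a document-ranking algorithm. It can be viewed as an extension of
# TF-IDF (Term Frequency-Inverse Document Frequency), refined so that the
# match score rises more slowly once a term's frequency starts to saturate.
from haystack.nodes import BM25Retriever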

retriever = BM25Retriever(document_store=document_store)
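
The saturation behavior that sets BM25 apart from plain TF-IDF can be seen in a small standalone sketch. It implements only the term-frequency part of the standard Okapi BM25 formula (the full score also multiplies by an IDF factor). The function name `bm25_tf_weight` is hypothetical, and `k1=1.2`, `b=0.75` are commonly used illustrative values, not necessarily what Haystack uses internally.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of the Okapi BM25 score for one term in one document.

    k1 controls how quickly the weight saturates; b controls how strongly
    the document length is normalized. Both defaults are illustrative.
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# For a document of average length, the weight keeps growing with tf,
# but each extra occurrence adds less, and the weight never exceeds k1 + 1.
term_frequencies = (1, 2, 5, 20, 100)
weights = [bm25_tf_weight(tf, doc_len=100, avg_doc_len=100) for tf in term_frequencies]

for tf, weight in zip(term_frequencies, weights):
    print(f"tf={tf:>3}  weight={weight:.3f}")
```

A raw TF-IDF term count would grow linearly over the same range, so a document that repeats a query term 100 times would dominate the ranking; under BM25 it scores only slightly higher than one that mentions the term 20 times.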


## Initializing the Reader

In [24]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)


INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


## Creating the Retriever-Reader Pipeline

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)


## Asking a Question

In [None]:
prediction = pipe.run(
    query="Who is the father of Arya Stark?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)


Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00,  3.30s/ Batches]


In [None]:
from pprint import pprint

pprint(prediction)


{'answers': [<Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.993372917175293, 'context': "s Nymeria after a legendary warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the King. Before she leaves,", 'offsets_in_document': [{'start': 207, 'end': 213}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_ids': ['9e3c863097d66aeed9992e0b6bf1f2f4'], 'meta': {'_split_id': 3}}>,
             <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.9753613471984863, 'context': "k in the television series.\n\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya's h", 'offsets_in_document': [{'start': 630, 'end': 633}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_ids': ['7d3360fa29130e69ea6b2ba5c5a8f9c8'], 'meta': {'_split_id': 10}}>,
             <Answer {'answer': 'Lord Eddard Stark', 'type': 'extractive', 'score': 0.9177325367927551, 'context':