# HCL Question Answering Model

## Overview

The aim of this code is to build a question answering system using Haystack's DocumentStore, Retriever, and Reader.

The code is divided into two major sections. The first part uses the dataset provided by the client (the HCL dataset), which is fed to the model to obtain results. The second part uses OWASP data as the input for the model and obtains results in the same way.

Our system uses this data to answer prompts such as: "This is an instance of 'Injection.SQL', why would this be vulnerable and how would I fix it? (code context)".

The code below is used to access files from Google Drive within a Google Colab environment.

In [None]:
import os
import json

# loading a datafile from your Google Drive directory
from google.colab import drive
my_path = '/content/drive'
drive.mount(my_path)

# Replace 'your_directory_path' with the path to your directory containing JSON files
# directory = '/content/drive/MyDrive/Colab Notebooks/HCL 2 - Capstone Project'
directory = '/content/drive/Shared drives/HCLCapstone/Dataset'


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data collection

While exploring the dataset provided by the client, we found that most of the information was in the `api` folders of the dataset.

This section of code traverses the directory structure, identifies JSON files within directories named `api`, and loads their contents into a Python dictionary.

In [None]:
# Dictionary to hold all JSON data
all_data = {}

for root, dirs, files in os.walk(directory):
    if os.path.basename(root) == 'api':  # Only process directories named 'api'
        for filename in files:
            if filename.endswith('.json'):
                filepath = os.path.join(root, filename)
                with open(filepath, 'r', encoding='utf-8') as file:
                    # Load the JSON file and store its contents under the file name
                    all_data[filename] = json.load(file)

# Now 'all_data' contains the data from all JSON files in 'api' folders
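
Before deciding which fields to keep, it can help to see which top-level keys actually occur across the loaded files. The following is a minimal sketch of that check. The `sample_data` dictionary and the `count_top_level_keys` helper are made up for illustration and only mimic the shape of `all_data`.

```python
from collections import Counter

# Made-up stand-in for `all_data`: file name -> parsed JSON content.
sample_data = {
    "a.article.json": {"id": "a", "cause": [], "risks": [], "recommendations": []},
    "b.article.json": {"id": "b", "causes": [], "recommendations": []},
    "c.article.json": {"id": "c", "risks": []},
}


def count_top_level_keys(data):
    """Count how many files contain each top-level JSON key."""
    counts = Counter()
    for content in data.values():
        if isinstance(content, dict):
            counts.update(content.keys())
    return counts


key_counts = count_top_level_keys(sample_data)
for key, n in key_counts.most_common():
    print(f"{key}: {n} of {len(sample_data)} files")
```

Running the same helper on the real `all_data` would show, for instance, whether both `cause` and `causes` appear as keys.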

In [None]:
all_data

{'AccessControl.Bypass_javascript_beb9bd61.article.json': {'version': 1,
  'id': 'AccessControl.Bypass_javascript_beb9bd61',
  'apiName': 'MooTools Cookie Path Too General',
  'cause': [{'sortId': 0,
    'text': "The path for the cookie has been set in the `Cookie.write()` method allowing this cookie to be acccessed from any page. The path attribute determines the pages within that particular application path (and its sub-directories) to which the cookie is accessible. A value of '/' or false means that the cookie can be accessed within all pages in all paths throughout the application."},
   {'sortId': 1,
    'text': '{@code_bad:\n<script language="javascript">\n...\n   var c = Cookie.write(\'user\', value, {path: false});\n   // or\n   var c = Cookie.write(\'user\', value);   // Defaults to {path: false}\n...\n</script>\n:code}'},
   {'sortId': 2,
    'text': "Setting this attribute to its default value of '/' could allow cookies given to an authenticated user to leak out into areas 

## Data processing

The data, being in JSON format, requires conversion to text format for integration with our model. The following code achieves this conversion.

We extract only the 'causes', 'risks', and 'recommendations' sections from the data, as these contain the substantive information that the model can learn from.

In [None]:
# Assuming 'all_data' is your dictionary containing JSON objects
formatted_text_data = {}

def extract_text(items):
    texts = []
    for item in items:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and 'text' in item:
            texts.append(item['text'])
    return ', '.join(texts)

for file_name, json_content in all_data.items():
    formatted_text = ""

    # Extract 'cause' or 'causes' content
    for key in ['cause', 'causes']:
        if key in json_content:
            formatted_text += "The causes are: " + extract_text(json_content[key]) + "\n"

    # Extract 'risks' content
    if 'risks' in json_content:
        formatted_text += "The risks are: " + extract_text(json_content['risks']) + "\n"

    # Extract 'recommendations' content
    if 'recommendations' in json_content:
        formatted_text += "The recommendations are: " + extract_text(json_content['recommendations']) + "\n"

    formatted_text_data[file_name] = formatted_text

# 'formatted_text_data' now contains the formatted text representation of specific JSON content
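
To make the flattening concrete, here is a small self-contained sketch that applies the same logic to a single record. The `sample_record` contents and the `format_record` wrapper are hypothetical and exist only for this illustration; `extract_text` is repeated so that the block runs on its own.

```python
def extract_text(items):
    """Join plain strings and the 'text' field of dict items with ', '."""
    texts = []
    for item in items:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and "text" in item:
            texts.append(item["text"])
    return ", ".join(texts)


def format_record(record):
    """Build the labelled text block for one record (hypothetical wrapper)."""
    parts = []
    for key in ("cause", "causes"):
        if key in record:
            parts.append("The causes are: " + extract_text(record[key]))
    if "risks" in record:
        parts.append("The risks are: " + extract_text(record["risks"]))
    if "recommendations" in record:
        parts.append("The recommendations are: " + extract_text(record["recommendations"]))
    return "\n".join(parts) + "\n" if parts else ""


# Hypothetical record, shaped like the entries in `all_data`.
sample_record = {
    "cause": [
        {"sortId": 0, "text": "The cookie path is too general."},
        {"sortId": 1, "text": "Any page can read the cookie."},
    ],
    "recommendations": ["Set a specific path."],
}

sample_text = format_record(sample_record)
print(sample_text)
```

Note that a record without a `risks` key simply produces no "The risks are:" line, which matches the behaviour of the loop above.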


In [None]:
formatted_text_data

{'AccessControl.Bypass_javascript_beb9bd61.article.json': 'The causes are: The path for the cookie has been set in the `Cookie.write()` method allowing this cookie to be acccessed from any page. The path attribute determines the pages within that particular application path (and its sub-directories) to which the cookie is accessible. A value of \'/\' or false means that the cookie can be accessed within all pages in all paths throughout the application., {@code_bad:\n<script language="javascript">\n...\n   var c = Cookie.write(\'user\', value, {path: false});\n   // or\n   var c = Cookie.write(\'user\', value);   // Defaults to {path: false}\n...\n</script>\n:code}, Setting this attribute to its default value of \'/\' could allow cookies given to an authenticated user to leak out into areas of the application where authentication is not necessary. For example, if a user browses the public (unauthenticated) part of a website, it can usually be done without having to provide the user wit

## Installing Haystack

Next, we install Haystack 1.x (published on PyPI as `farm-haystack`) with `pip`:

In [None]:
!pip install --upgrade pip
!pip install farm-haystack[colab,inference]

### Enabling Telemetry
This cell tells Haystack that tutorial 1 is being run, which is reported through its usage telemetry. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details.

In [None]:
from haystack.telemetry import tutorial_running

tutorial_running(1)

Next we configure logging. Only warnings and more severe messages are logged globally, while the `haystack` logger also emits INFO-level messages, which gives more detailed output from the Haystack library specifically.

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the DocumentStore

We'll start creating our question answering system by initializing a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to our questions or prompts. Here we're using the `InMemoryDocumentStore`, which is the simplest DocumentStore to get started with. It requires no external dependencies and is a good option for smaller datasets and debugging. However, it doesn't scale well to larger Document collections, so it's not a good choice for production systems.

To learn more about the DocumentStore and the different types of external databases that Haystack supports, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

In [None]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


In [None]:
!pip install farm-haystack[elasticsearch]

The DocumentStore is now ready. Now let's fill it with some Documents.

This code creates a specified directory (if it doesn't already exist) and iterates through a dictionary of text data, saving each item in the dictionary to a separate text file in the specified directory.

In [None]:
import os

# Directory where text files will be saved
doc_dir = "data/build_your_first_question_answering_system"
os.makedirs(doc_dir, exist_ok=True)

# Since 'formatted_text_data' contains our data
for file_name, text in formatted_text_data.items():
    file_path = os.path.join(doc_dir, file_name + ".txt")
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text)
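
Some source files may carry identical text, and a DocumentStore typically reports such content as duplicates when it is indexed. The sketch below shows one way to spot identical texts beforehand by hashing them. The `sample_texts` dictionary and the `find_duplicate_texts` helper are hypothetical, and the check works on whole files, which is an assumption and not necessarily how Haystack derives its Document ids.

```python
import hashlib
from collections import defaultdict

# Made-up stand-in for `formatted_text_data`: file name -> formatted text.
sample_texts = {
    "a.json": "The causes are: X\n",
    "b.json": "The causes are: Y\n",
    "c.json": "The causes are: X\n",
}


def find_duplicate_texts(texts):
    """Group file names by a hash of their text and return groups of two or more."""
    groups = defaultdict(list)
    for name, text in texts.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


duplicates = find_duplicate_texts(sample_texts)
print(duplicates)
```

Whether duplicates should be dropped or kept is a design choice: dropping them keeps the index small, but the file name of the dropped copy is then lost as a source.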


Here we use `TextIndexingPipeline` to convert the files we just wrote into Haystack Document objects and write them into the DocumentStore.

In [None]:
from haystack.pipelines import TextIndexingPipeline

indexing_pipeline = TextIndexingPipeline(document_store)

files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.
Converting files: 100%|██████████| 143/143 [00:00<00:00, 349.48it/s]
Preprocessing: 100%|██████████| 143/143 [00:00<00:00, 172.13docs/s]
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '2b64fabaf9ccf6f9bad970a702045008' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id 'c440d91f9836d142f11e551c4ef28ca8' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '75e33d913a08a07f9b244f8242422f29' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '2b64fabaf9ccf6f9bad970a702045008' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Docu

{'documents': [<Document: {'content': "The causes are: The detected NodeJS code contains a setTimeout function containing one or more arguments with a variable and/or string concatenation. This could be user input that may contain malicious code from an attacker., {@code_bad:\nstring = 'boo';\n\nsetTimeout(function (str) { console.log(str) }, 100, string);\n:code}\nThe recommendations are: Validate variables and avoid string concatenation., {@code_bad:\nsetTimeout(function (str) { console.log(str) }, 100, 'boo')\n:code}", 'content_type': 'text', 'score': None, 'meta': {'_split_id': 0}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '84c18bb98800749d40b1a670ac97c16b'}>,
  <Document: {'content': 'The causes are: Content being passed into the `setTimeout()` method should be checked for containing tainted data. Since this data is further used to execute JavaScript inside the HyperText Markup Language (HTML) of a page after a set delay, steps should be taken towards validating it., T

## Initializing the Retriever

Our search system will use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only the ones relevant to the question. We have used the BM25 algorithm.

First we'll initialize a BM25Retriever and make it use the InMemoryDocumentStore we initialized earlier.

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
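
For intuition about what a BM25 ranking is based on, here is a minimal pure-Python sketch of the Okapi BM25 score. It is not Haystack's implementation: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the `bm25_scores` helper, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "cross-site scripting lets attackers inject scripts",
    "sql injection targets database queries",
    "cookies should set a narrow path",
]
scores = bm25_scores("sql injection", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Because BM25 matches on exact terms, a query word such as "injection" does not match "inject" in this sketch. That is one reason why the wording of a prompt can noticeably change which Documents are retrieved.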

## Initializing the Reader

A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. In this tutorial, we're using a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2).

Let's initialize the Reader.

In [None]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


## Creating the Retriever-Reader Pipeline

Here we're using a ready-made pipeline called `ExtractiveQAPipeline`. It connects the Reader and the Retriever. The combination of the two speeds up processing because the Reader only processes the Documents that the Retriever has passed on. To learn more about pipelines, see [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines).

Let's create the Pipeline.

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

## Asking a Question / Giving a Prompt

Here we use the pipeline's `run()` method to ask a question. The `query` argument holds the question. The `params` argument sets `top_k` for each node: how many Documents the Retriever passes on and how many answers the Reader returns.

In [None]:
prediction = pipe.run(
    # 1.   This is an instance of 'Injection.SQL', why would this be vulnerable and how would I fix it?
    #"SELECT (SELECT COUNT(*) FROM moz_places), " + "(SELECT SUBSTR(stat,1,LENGTH(stat)-2) FROM sqlite_stat1 " + "WHERE idx = 'moz_places_url_uniqueindex')
    # 2.   What happens after setting the `validateArguments` property to `false`?
    # 3.   What is cross site scripting?
    # 4.   How do I solve cross site scripting vulnerability?
    query="""

    What is cross site scripting?

    """, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

Inferencing Samples: 100%|██████████| 1/1 [00:14<00:00, 14.87s/ Batches]


Print out the answers that the pipeline returned.

In [None]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': 'XSS) attacks', 'type': 'extractive', 'score': 0.7377877235412598, 'context': 'bitrary HTML on a website can lead to dangerous Cross-Site Scripting(XSS) attacks. Attribute domPropsInnerHTML is used to render raw HTML contents dir', 'offsets_in_document': [{'start': 109, 'end': 121}], 'offsets_in_context': [{'start': 69, 'end': 81}], 'document_ids': ['c629399a80c2b692cf3cdfc8ce239770'], 'meta': {'_split_id': 0}}>,
             <Answer {'answer': 'XSS) attacks', 'type': 'extractive', 'score': 0.7116725444793701, 'context': 'using non-trusted content can lead to dangerous Cross-Site Scripting(XSS) attacks. Using a non-trusted template is equivalent to allowing arbitrary Ja', 'offsets_in_document': [{'start': 108, 'end': 120}], 'offsets_in_context': [{'start': 69, 'end': 81}], 'document_ids': ['e1c8d2b1dd61157f5c1b856980f33999'], 'meta': {'_split_id': 0}}>,
             <Answer {'answer': 'XSS) attacks', 'type': 'extractive', 'score': 0.7076241970062256, 'co

The `print_answers` utility prints a simplified view of the answers that the pipeline returned.

In [None]:
from haystack.utils import print_answers

print_answers(prediction, details="medium")  ## Choose from `minimum`, `medium`, and `all`

'Query: \n\n    What is cross site scripting?\n\n    '
'Answers:'
[   {   'answer': 'XSS) attacks',
        'context': 'bitrary HTML on a website can lead to dangerous Cross-Site '
                   'Scripting(XSS) attacks. Attribute domPropsInnerHTML is '
                   'used to render raw HTML contents dir',
        'score': 0.7377877235412598},
    {   'answer': 'XSS) attacks',
        'context': 'using non-trusted content can lead to dangerous Cross-Site '
                   'Scripting(XSS) attacks. Using a non-trusted template is '
                   'equivalent to allowing arbitrary Ja',
        'score': 0.7116725444793701},
    {   'answer': 'XSS) attacks',
        'context': 'bitrary HTML on a website can lead to dangerous Cross-Site '
                   'Scripting(XSS) attacks. Attribute InnerHTML is used to '
                   'render raw HTML contents directly in',
        'score': 0.7076241970062256},
    {   'answer': 'XSS) attack',
        'context': 'o be executed.
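
Several of the returned answers share the same text and differ only in their source Document. The following is a minimal sketch of collapsing such repeats and dropping low-scoring entries. It assumes the answers have already been converted to plain dictionaries like the simplified ones printed above; the `sample_answers` values, the `best_unique_answers` helper, and the `min_score` threshold are made up for illustration.

```python
# Made-up answers, shaped like the simplified dictionaries printed by print_answers.
sample_answers = [
    {"answer": "XSS) attacks", "score": 0.74},
    {"answer": "XSS) attacks", "score": 0.71},
    {"answer": "XSS) attack", "score": 0.55},
    {"answer": "an injection flaw", "score": 0.20},
]


def best_unique_answers(answers, min_score=0.5):
    """Keep the highest-scoring entry per answer text and drop low scores."""
    best = {}
    for item in answers:
        if item["score"] < min_score:
            continue
        key = item["answer"].strip().lower()
        if key not in best or item["score"] > best[key]["score"]:
            best[key] = item
    # Highest score first.
    return sorted(best.values(), key=lambda a: a["score"], reverse=True)


unique = best_unique_answers(sample_answers)
for item in unique:
    print(f"{item['score']:.2f}  {item['answer']}")
```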

# **Getting data from the OWASP website and querying the model**

## Data collection

For the second part, we use the data from the OWASP Cheat Sheet Series. We clone its GitHub repository and use that content.

In [None]:
# Clone the OWASP CheatSheetSeries repository

!git clone https://github.com/OWASP/CheatSheetSeries.git

Cloning into 'CheatSheetSeries'...
remote: Enumerating objects: 45345, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 45345 (delta 128), reused 180 (delta 106), pack-reused 45134
Receiving objects: 100% (45345/45345), 1.45 GiB | 34.21 MiB/s, done.
Resolving deltas: 100% (39486/39486), done.


In [None]:
!pip install markdown html2text

Collecting html2text
  Downloading html2text-2024.2.26.tar.gz (56 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: html2text
  Building wheel for html2text (setup.py) ... done
  Created wheel for html2text: filename=html2text-2024.2.26-py3-none-any.whl size=33111 sha256=d2bb81c5a670a19599ba8bf0174f0911c8df9d30ede3c7a54983958d9a076fb2
  Stored in directory: /root/.cache/pip/wheels/f3/96/6d/a7eba8f80d31cbd188a2787b81514d82fc5ae6943c44777659
Successfully built html2text
Installing collected packages: html2text
Successfully installed html2text-2024.2.26

In [None]:
# Import the required libraries
import os
import markdown
import html2text

## Data processing

This code processes all Markdown files in the 'cheatsheets' directory, converts them to plain text, and combines them into a single text file.

In [None]:
# Set the path to the cheatsheets directory within the cloned repository
cheatsheets_dir = 'CheatSheetSeries/cheatsheets'

# Initialize a variable to hold the combined text
combined_text = ''

# Walk through the directory, and read each Markdown file
for root, dirs, files in os.walk(cheatsheets_dir):
    for file in files:
        # Check if the file has a Markdown extension
        if file.endswith('.md'):
            # Construct the full file path
            file_path = os.path.join(root, file)
            # Open and read the Markdown file
            with open(file_path, 'r', encoding='utf-8') as md_file:
                md_content = md_file.read()
                # Convert Markdown to HTML using the markdown library
                html_content = markdown.markdown(md_content)
                # Initialize html2text converter
                text_maker = html2text.HTML2Text()
                # Set to ignore links in the conversion process
                text_maker.ignore_links = True
                # Convert HTML to plain text
                plain_text = text_maker.handle(html_content)
                # Add the plain text to the combined text, separating files with newlines
                combined_text += plain_text + '\n\n'

# Save the combined plain text to a single file
with open('combined_cheatsheets.txt', 'w', encoding='utf-8') as output_file:
    output_file.write(combined_text)

In [None]:
with open('combined_cheatsheets.txt', 'r', encoding='utf-8') as input_file:
    # Read the combined file back as a list of lines
    combined_text = input_file.readlines()

In [None]:
combined_text

['# Docker Security Cheat Sheet\n',
 '\n',
 '## Introduction\n',
 '\n',
 'Docker is the most popular containerization technology. When used correctly,\n',
 'it can enhance security compared to running applications directly on the host\n',
 'system. However, certain misconfigurations can reduce security levels or\n',
 'introduce new vulnerabilities.\n',
 '\n',
 'The aim of this cheat sheet is to provide a straightforward list of common\n',
 'security errors and best practices to assist in securing your Docker\n',
 'containers.\n',
 '\n',
 '## Rules\n',
 '\n',
 '### RULE #0 - Keep Host and Docker up to date\n',
 '\n',
 'To protect against known container escape vulnerabilities like Leaky Vessels,\n',
 "which typically result in the attacker gaining root access to the host, it's\n",
 'vital to keep both the host and Docker up to date. This includes regularly\n',
 'updating the host kernel as well as the Docker Engine.\n',
 '\n',
 "This is due to the fact that containers share the host's k

## Initializing the DocumentStore

We'll start creating our question answering system by initializing a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to our questions or prompts. Here we're using the `InMemoryDocumentStore`, which is the simplest DocumentStore to get started with. It requires no external dependencies and is a good option for smaller datasets and debugging. However, it doesn't scale well to larger Document collections, so it's not a good choice for production systems.

To learn more about the DocumentStore and the different types of external databases that Haystack supports, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

In [None]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


In [None]:
!pip install farm-haystack[elasticsearch]

After the previous cells, `combined_text` holds the plain text of all the Markdown files as a list of lines (it was a single string before being read back with `readlines()`). Unlike `formatted_text_data` in the first part, it is not a dictionary of file names and texts, so it cannot be iterated over in the same way.

The code below therefore saves the entire combined content into one file. Splitting the content into separate files would require logic for dividing `combined_text` and for assigning a file name to each part.

In [None]:
import os

# Directory where text files will be saved
doc_dir_owasp = "data/build_your_question_answering_system_using_owasp_data"
os.makedirs(doc_dir_owasp, exist_ok=True)

# Define the file name for the combined text
file_name = "combined_owasp_cheatsheets.txt"
file_path = os.path.join(doc_dir_owasp, file_name)

# Write the combined text to a single file
with open(file_path, "w", encoding='utf-8') as file:
    if isinstance(combined_text, list):
        # If combined_text is a list, join its elements into a single string
        file.write(''.join(combined_text))
    else:
        # If combined_text is already a string, write it directly
        file.write(combined_text)
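
If separate files per cheat sheet are wanted, one possible splitting rule is to start a new part at every top-level heading. The sketch below illustrates that rule under the assumption that each cheat sheet begins with a line starting with `# `; the `split_by_top_level_heading` helper and the `sample_lines` are hypothetical. A line inside a code sample that happens to start with `# ` at column 0 would be mistaken for a heading, so the result should be checked on the real data.

```python
def split_by_top_level_heading(lines):
    """Group lines into (title, text) pairs, starting a new group at each '# ' line."""
    sections = []
    title, body = None, []
    for line in lines:
        if line.startswith("# "):
            if title is not None:
                sections.append((title, "".join(body)))
            title, body = line[2:].strip(), [line]
        elif title is not None:
            body.append(line)
        # Lines before the first heading are ignored in this sketch.
    if title is not None:
        sections.append((title, "".join(body)))
    return sections


# Made-up lines, shaped like the list returned by readlines().
sample_lines = [
    "# Docker Security Cheat Sheet\n",
    "\n",
    "## Introduction\n",
    "Docker is a containerization technology.\n",
    "# XSS Prevention Cheat Sheet\n",
    "Encode output.\n",
]

sections = split_by_top_level_heading(sample_lines)
for title, text in sections:
    print(title, len(text))
```

Each `(title, text)` pair could then be written to its own file in `doc_dir_owasp`, which would let every indexed Document be traced back to the cheat sheet it came from.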

In [None]:
# from transformers import AutoModelForQuestionAnswering, AutoTokenizer, Trainer, TrainingArguments
# from datasets import load_dataset

# # Load your dataset
# datasets = load_dataset("path/to/your/dataset")

# # Load the pre-trained model and tokenizer
# model_name = "deepset/roberta-base-squad2"
# model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# tokenizer = AutoTokenizer.from_pretrained(model_name)

# # Define training arguments
# training_args = TrainingArguments(
#     output_dir="./models/roberta-finetuned",
#     num_train_epochs=3,
#     per_device_train_batch_size=16,
#     warmup_steps=500,
#     weight_decay=0.01,
#     logging_dir='./logs',
#     evaluation_strategy='epoch'
# )

# # Initialize the Trainer
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=datasets['train'],
#     eval_dataset=datasets['validation']
# )

# # Start training
# trainer.train()

# # Save the fine-tuned model
# model.save_pretrained("./models/roberta-finetuned")
# tokenizer.save_pretrained("./models/roberta-finetuned")

Here we use `TextIndexingPipeline` to convert the file we just wrote into Haystack Document objects and write them into the DocumentStore.

In [None]:
from haystack.pipelines import TextIndexingPipeline

# Index the file into the InMemoryDocumentStore initialized above.
# Adjust this if you use Elasticsearch or another DocumentStore instead.
indexing_pipeline = TextIndexingPipeline(document_store)

files_to_index = [os.path.join(doc_dir_owasp, f) for f in os.listdir(doc_dir_owasp)]

indexing_pipeline.run_batch(file_paths=files_to_index)

INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.
Converting files: 100%|██████████| 1/1 [00:00<00:00,  1.24it/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00,  1.02docs/s]
Updating BM25 representation...: 100%|██████████| 1292/1292 [00:00<00:00, 12056.27 docs/s]


{'documents': [<Document: {'content': "# Docker Security Cheat Sheet\n\n## Introduction\n\nDocker is the most popular containerization technology. When used correctly,\nit can enhance security compared to running applications directly on the host\nsystem. However, certain misconfigurations can reduce security levels or\nintroduce new vulnerabilities.\n\nThe aim of this cheat sheet is to provide a straightforward list of common\nsecurity errors and best practices to assist in securing your Docker\ncontainers.\n\n## Rules\n\n### RULE #0 - Keep Host and Docker up to date\n\nTo protect against known container escape vulnerabilities like Leaky Vessels,\nwhich typically result in the attacker gaining root access to the host, it's\nvital to keep both the host and Docker up to date. This includes regularly\nupdating the host kernel as well as the Docker Engine.\n\nThis is due to the fact that containers share the host's kernel. If the host's\nkernel is vulnerable, the containers are also vulne

In [None]:
files_to_index

['data/build_your_question_answering_system_using_owasp_data/combined_owasp_cheatsheets.txt']

## Initializing the Retriever

Our search system will use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only the ones relevant to the question. We have used the BM25 algorithm.

First we'll initialize a BM25Retriever and make it use the InMemoryDocumentStore we initialized earlier.

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

## Initializing the Reader

A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. In this tutorial, we're using a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2).

Let's initialize the Reader.

In [None]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


## Creating the Retriever-Reader Pipeline

Here we're using a ready-made pipeline called `ExtractiveQAPipeline`. It connects the Reader and the Retriever. The combination of the two speeds up processing because the Reader only processes the Documents that the Retriever has passed on. To learn more about pipelines, see [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines).

Let's create the Pipeline.

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

## Asking a Question / Giving a Prompt

Here we use the pipeline's `run()` method to ask a question. The `query` argument holds the question. The `params` argument sets `top_k` for each node: how many Documents the Retriever passes on and how many answers the Reader returns.

In [None]:
prediction = pipe.run(

    #What happens after setting the `validateArguments` property to `false`
    query="""

    what is cross site scripting?

    """, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

Inferencing Samples: 100%|██████████| 1/1 [00:22<00:00, 23.00s/ Batches]


Print out the answers that the pipeline returned.

In [None]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': 'Prevention', 'type': 'extractive', 'score': 0.7371946573257446, 'context': " the document. The very first OWASP Cheat Sheet, Cross Site\nScripting Prevention, was inspired by RSnake's work and we thank RSnake for\nthe inspiratio", 'offsets_in_document': [{'start': 762, 'end': 772}], 'offsets_in_context': [{'start': 70, 'end': 80}], 'document_ids': ['bd0e209d857f3b859a2e9b1bf025dbe3'], 'meta': {'_split_id': 445}}>,
             <Answer {'answer': 'a type of attack where malicious JavaScript code\nis injected into a displayed variable', 'type': 'extractive', 'score': 0.5444809198379517, 'context': '\n\nCross-Site Scripting (XSS) is a type of attack where malicious JavaScript code\nis injected into a displayed variable. For example, if the value of t', 'offsets_in_document': [{'start': 400, 'end': 486}], 'offsets_in_context': [{'start': 32, 'end': 118}], 'document_ids': ['715a538ed407bbdf6070cb4028719be8'], 'meta': {'_split_id': 1277}}>,
             <Answ

The `print_answers` utility prints a simplified view of the answers that the pipeline returned.

In [None]:
from haystack.utils import print_answers

print_answers(prediction, details="medium")  ## Choose from `minimum`, `medium`, and `all`

'Query: \n\n    what is cross site scripting?\n\n    '
'Answers:'
[   {   'answer': 'Prevention',
        'context': ' the document. The very first OWASP Cheat Sheet, Cross '
                   'Site\n'
                   "Scripting Prevention, was inspired by RSnake's work and we "
                   'thank RSnake for\n'
                   'the inspiratio',
        'score': 0.7371946573257446},
    {   'answer': 'a type of attack where malicious JavaScript code\n'
                  'is injected into a displayed variable',
        'context': '\n'
                   '\n'
                   'Cross-Site Scripting (XSS) is a type of attack where '
                   'malicious JavaScript code\n'
                   'is injected into a displayed variable. For example, if the '
                   'value of t',
        'score': 0.5444809198379517},
    {   'answer': 'testing for application\nsecurity professionals',
        'context': 'is article is a guide to Cross Site Scripting (XSS) '
    