# Web Crawl Q&A

This app lets you ask questions about a website's content and get a clear, summarized answer, as long as the information exists on the site. It is similar in structure to the File Q&A example, but it gathers its data by scraping webpages rather than reading files. For this example, we'll be scraping the [Writer docs](https://dev.writer.com/docs) so we can ask questions about using Writer models and APIs.

At a high level, it will:
* Use the Scrapy framework to gather text content from a specific domain
* Split the content into relatively small pieces
* Use a sentence similarity-capable language model to select the pieces most relevant to the query
* Use a generative language model to summarize the useful information in those pieces

### Dependencies

Make sure you have a virtual environment selected if you don't want to install dependencies globally.

In [1]:
%pip install scrapy tiktoken sentence_transformers writerai python-dotenv

Collecting scrapy
  Using cached Scrapy-2.8.0-py2.py3-none-any.whl (272 kB)
Collecting tiktoken
  Using cached tiktoken-0.3.3-cp310-cp310-macosx_11_0_arm64.whl (706 kB)
Collecting sentence_transformers
  Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting writerai
  Using cached writerai-0.4.0-py3-none-any.whl (85 kB)
Collecting python-dotenv
  Using cached python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting service-identity>=18.1.0
  Using cached service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting zope.interface>=5.1.0
  Using cached zope.interface-6.0-cp310-cp310-macosx_11_0_arm64.whl (202 kB)
Collecting w3lib>=1.17.0
  Using cached w3lib-2.1.1-py3-none-any.whl (21 kB)
Collecting itemadapter>=0.1.0
  Using cached itemadapter-0.8.0-py3-none-any.whl (11 kB)
Collecting PyDispatcher>=2.0.5
  Using cached PyDispatcher-2.0.7-py3-none-any.whl (12 kB)
Collecting tldextract
  Using cached tldextract-3.4.1-py3-

### Scraping a website with Scrapy

The Scrapy framework is a bit complicated at first, but it provides a lot of useful features for robust and customizable web scraping. It needs a specific project organization, and we can easily set that up with `scrapy startproject [name]`:

In [2]:
!scrapy startproject crawler

New Scrapy project 'crawler', using template directory '/Users/heathexer/writer/writer-cookbook/.venv/lib/python3.10/site-packages/scrapy/templates/project', created in:
    /Users/heathexer/writer/writer-cookbook/web_crawl_qa/crawler

You can start your first spider with:
    cd crawler
    scrapy genspider example example.com


This creates a few files and directories, but for this example all we need to do is create a basic spider:

In [3]:
%%writefile "./crawler/crawler/spiders/writer_spider.py"

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class WriterSpider(CrawlSpider):
    name = "dev.writer.com"
    allowed_domains = ["dev.writer.com"]
    start_urls = ["https://dev.writer.com/docs/"]

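    rules = [
        # Avoid naming the callback "parse": the Scrapy docs warn that
        # CrawlSpider relies on that method name for its own logic
        Rule(LinkExtractor(), callback='parse_page', follow=False)
    ]

    def parse_page(self, response):
        # Select all text that is contained in a <p> block
        text = "\n".join(response.xpath("//p//text()").getall()).strip()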
        return {
            'url': response.url,
            'text': text
        }

Writing ./crawler/crawler/spiders/writer_spider.py


This code is highly dependent on your target site. For example, all of the pages we want to search happen to be reachable in one 'click' from the `start_urls`, so we can set `follow=False`. If this were not the case, we would need to set `follow=True` and probably add rules to filter out pages we don't want. For a starter guide on Scrapy, see [this tutorial](https://www.scrapingbee.com/blog/web-scraping-with-scrapy/); for more detailed information and examples, see the [Scrapy docs](https://docs.scrapy.org/en/latest/index.html).
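
To make the idea of filtering rules concrete, here is a minimal sketch using only the standard library. It mimics how allow and deny regular expressions decide whether a URL is followed, with deny taking precedence. The patterns and the `should_follow` helper are hypothetical and chosen for illustration; in a real spider, patterns like these would be passed to `LinkExtractor(allow=..., deny=...)` on a rule with `follow=True`.

```python
import re

# Hypothetical patterns for illustration; adjust them to the target site.
ALLOW = [re.compile(r"^https://dev\.writer\.com/docs/")]
DENY = [re.compile(r"/changelog"), re.compile(r"\.pdf$")]


def should_follow(url: str) -> bool:
    """Return True if the URL matches an allow pattern and no deny pattern."""
    if any(pattern.search(url) for pattern in DENY):
        return False
    return any(pattern.search(url) for pattern in ALLOW)


urls = [
    "https://dev.writer.com/docs/quickstart",
    "https://dev.writer.com/docs/changelog",
    "https://dev.writer.com/reference/api.pdf",
    "https://example.com/docs/",
]
followed = [url for url in urls if should_follow(url)]
print(followed)
```

Only the quickstart URL survives here: the changelog and PDF links hit a deny pattern, and the `example.com` link matches no allow pattern.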

Now that we've created our spider, we can run it with `scrapy runspider [file]`, which takes the path to the spider file. The `-O` flag names the file that receives the gathered data, overwriting it if it already exists.

In [4]:
!scrapy runspider crawler/crawler/spiders/writer_spider.py -O site_text.json

2023-05-03 15:30:31 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-05-03 15:30:31 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.13, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.10 (main, Apr  5 2023, 14:58:08) [Clang 11.1.0 ], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform macOS-13.1-arm64-arm-64bit
2023-05-03 15:30:31 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-05-03 15:30:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-05-03 15:30:31 [scrapy.extensions.telnet] INFO: Telnet Password: 63286887eb7e9401
2023-05-03 15:30:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.ext

Now you can see exactly what data was gathered in `site_text.json`. If you look closely, you'll notice a few empty and duplicate pages. These could be filtered out with a couple of Scrapy rules, but they shouldn't affect query results, so there's no harm in leaving them in.
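
If you would rather clean the data after the crawl instead of adding Scrapy rules, a post-processing pass is one option. The sketch below assumes the scraped data is a list of objects with `url` and `text` keys, as produced by the spider above; `dedupe_pages` is a hypothetical helper and the sample pages are made up.

```python
import json


def dedupe_pages(pages: list[dict]) -> list[dict]:
    """Drop pages with no text and pages whose text was already seen."""
    seen = set()
    kept = []
    for page in pages:
        text = page.get("text", "").strip()
        if not text or text in seen:
            continue
        seen.add(text)
        kept.append(page)
    return kept


# Small in-memory sample in the same shape the spider emits (url + text).
sample = json.loads("""[
  {"url": "https://dev.writer.com/docs/a", "text": "Intro"},
  {"url": "https://dev.writer.com/docs/b", "text": ""},
  {"url": "https://dev.writer.com/docs/a#top", "text": "Intro"},
  {"url": "https://dev.writer.com/docs/c", "text": "Auth"}
]""")
cleaned = dedupe_pages(sample)
print([page["url"] for page in cleaned])
```

The empty page and the second copy of "Intro" are dropped, while the first occurrence of each distinct text is kept in its original order.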

Now that the scraping part is done, we can move on to using AI to query the information. The following code block is just to set up the Writer security object. Make sure you have a `.env` file in the parent directory with the following lines:
```
WRITER_ORG_ID=<your org ID>
WRITER_API_KEY=<your API key>
```

or just directly set the corresponding variables in the code block.

In [5]:
from writer import Writer
from writer.models import shared
from dotenv import load_dotenv
import os

# load_dotenv expects the path to the .env file itself, not to its directory
load_dotenv("../.env")
org_id = os.environ.get("WRITER_ORG_ID")
api_key = os.environ.get("WRITER_API_KEY")

writer = Writer(
    security=shared.Security(
        api_key=api_key
    ),
    organization_id=org_id
)
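
Because `os.environ.get` returns `None` for a variable that is not set, a missing or misplaced `.env` file would otherwise only show up later as a failed API call. A small check like the following sketch can surface the problem early. The `missing_vars` helper is hypothetical and not part of any library.

```python
import os

REQUIRED_VARS = ("WRITER_ORG_ID", "WRITER_API_KEY")


def missing_vars(env=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```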

First, we need a function to split extracted text into chunks that can be compared to the input query:

In [6]:
import tiktoken

CHUNK_TOKENS = 50

def chunk_text(text: str):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    # We can tokenize the text to count the number of tokens for chunking
    tokens = tokenizer.encode(text)

    chunks = [
        tokenizer.decode(tokens[i : i + CHUNK_TOKENS])
        for i in range(0, len(tokens), CHUNK_TOKENS)
    ]
    
    return chunks
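
To see the windowing logic in isolation, here is a small sketch that applies the same fixed-size slicing to whitespace-separated words. Words are only a stand-in for tokens here, on the assumption that it keeps the example readable; real `tiktoken` tokens often split words differently. `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens: list, size: int) -> list[list]:
    """Split a token list into consecutive windows of at most `size` items."""
    return [tokens[i : i + size] for i in range(0, len(tokens), size)]


# Whitespace-split words stand in for tokenizer output in this illustration.
words = "Scrapy gathers text and the text is split into small pieces".split()
word_chunks = [" ".join(chunk) for chunk in chunk_tokens(words, 4)]
print(word_chunks)
```

Note that the final chunk holds whatever is left over, so it can be shorter than the others.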

Next up, we need a function to actually compare those chunks. We use the `all-MiniLM-L6-v2` model from the `sentence_transformers` library for this. It checks the similarity in meaning between the input question and every chunk, and assigns a score to each one. Then we can rank them by this score to determine the most relevant pieces of text.

In [7]:
from sentence_transformers import SentenceTransformer, util

TOP_N = 10

def get_relevant_text(query: str, chunks: list[str]):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    text_embeddings = model.encode(chunks)
    query_embeddings = model.encode(query)

    cos_sims = util.cos_sim(query_embeddings, text_embeddings)

    # Pair each similarity score with its chunk index and sort ascending by score
    sorted_chunk_nums = sorted(zip(cos_sims.tolist()[0], range(len(chunks))))
    # We also take the chunk before and after each relevant chunk.
    # Since the text is split arbitrarily, this is just in case some important context got cut off.
    # Slicing clamps at the ends of the list, so the first and last chunks are handled safely.
    relevant_text = "\n".join(
        "".join(chunks[max(i - 1, 0) : i + 2])
        for _, i in sorted_chunk_nums[-TOP_N:]
    )

    return relevant_text
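
For intuition about the scoring step, the following standard-library sketch computes cosine similarity by hand and ranks a few vectors against a query. The 2-D vectors are made up for illustration; real sentence embeddings have hundreds of dimensions, and `cosine` and `top_n_indices` are hypothetical helpers rather than part of `sentence_transformers`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_indices(query_vec: list[float], chunk_vecs: list[list[float]], n: int) -> list[int]:
    """Return the indices of the n vectors most similar to the query, best first."""
    scores = [cosine(query_vec, vec) for vec in chunk_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Made-up 2-D "embeddings" standing in for model output.
query_vec = [1.0, 0.0]
chunk_vecs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [-1.0, 0.0]]
best = top_n_indices(query_vec, chunk_vecs, 2)
print(best)
```

The vector pointing almost the same way as the query ranks first, and the one pointing the opposite way ranks last, which is the behavior the ranking relies on.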


Now that we've picked out a small, relevant subset of our contextual data, we can feed it, along with our question, to a generative language model that turns it into an actual answer. For this purpose, we are using Writer's `palmyra-instruct` model.

In [8]:
from writer.models import operations, shared

def get_answer(query: str, text: str):
    prompt = (
        f"Given the following context information, try to answer the following question with as much detail as possible. "
        f"Use only the given context. Do not use outside information.\n"
        f"If no answer can be found in the given context, output \"Could not find an answer in your data.\"\n"
        f"Question: {query} \n\nContext:\n{text}\nAnswer: "
    )

    req = operations.CreateCompletionRequest(
        completion_request=shared.CompletionRequest(
            prompt=prompt, max_tokens=2000, temperature=1.1
        ),
        model_id="palmyra-instruct",
    )

    res = writer.completions.create(req)
    if res.completion_response is not None:
        return res.completion_response.choices[0].text
    else:
        print(res.fail_response)

Finally, we can combine everything we've written into a simple function that takes in a question and returns an answer.

In [9]:
import json

def run_query(query: str):
    # Load the scraped pages, closing the file once it has been read
    with open("site_text.json", "r") as site_data_file:
        site_data = json.load(site_data_file)
    chunks = []

    for page in site_data:
        chunks += chunk_text(page['text'])

    relevant_text = get_relevant_text(query, chunks)

    answer = get_answer(query, relevant_text)
    return answer

And ask it a question:

In [10]:
run_query("How do I use an API key?")

' To use an API key, you will need to log into your Writer account dashboard and obtain the API keys. Once you have the keys, you should include them in your API request. If the correct keys are not used or if the keys have become outdated, Writer will return an error.'