This is a Web Crawler for Vectara
The web crawler currently has 4 modes of operation:
- Single URL
- Sitemap
- RSS
- Recursive
In single-URL mode, provide the crawler with a URL and it will ingest that page into Vectara. In sitemap mode, provide the crawler with a root page, and it will retrieve the sitemap(s) and index all links from the sitemap.
This crawler has a minimal set of Python dependencies, as outlined in requirements.txt.
Install these requirements by running:
pip3 install -r requirements.txt
The crawler generates PDFs for each page to upload to Vectara's file upload API. The crawler relies on headless browsers both to extract links and to generate these PDFs, which allows for realistic text rendering, even of JavaScript-heavy websites. Chrome/Chromium is required for link extraction, and there are currently two supported headless browsers for PDF generation, each with its own tradeoffs (a short sketch of invoking each follows the list below):
- pyhtml2pdf, which in turn uses headless Chrome for rendering. You will either need to install Chrome locally or keep a copy of chromedriver in your PATH.
- wkhtmltopdf, which uses Qt WebKit for rendering. It's highly recommended that you download a precompiled wkhtmltopdf binary and add it to your PATH (as opposed to trying to install wkhtmltopdf via a package manager).
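For illustration only (this is not the crawler's actual code), each driver can be exercised from Python roughly as follows; the URL and output file names are placeholders, and both tools must be installed as described above:

```python
import subprocess

from pyhtml2pdf import converter  # renders via a local headless Chrome / chromedriver

url = "https://example.com"  # placeholder page to render

# Option 1: pyhtml2pdf (headless Chrome)
converter.convert(url, "page-chrome.pdf")

# Option 2: wkhtmltopdf, invoked as an external binary that must be on your PATH
subprocess.run(["wkhtmltopdf", url, "page-wkhtmltopdf.pdf"], check=True)
```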
Unfortunately, no website PDF rendering system is perfect, though for the purposes of neural search it generally doesn't need to be: you just need to make sure the right text is rendered in roughly the right order. wkhtmltopdf tends to do a pretty good job of this task but doesn't handle URL fragments (things after # in the URL), so crawls using wkhtmltopdf will remove any URL fragment from the document ID when submitted to Vectara. wkhtmltopdf can also be insecure, so either keep the process sandboxed or only run it on sites that you trust.
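As an illustration of that behavior (the crawler's exact ID logic may differ), stripping a fragment can be done with Python's standard library:

```python
from urllib.parse import urldefrag

# With the fragment dropped, https://example.com/docs#section-2 and
# https://example.com/docs map to the same document ID.
doc_id, fragment = urldefrag("https://example.com/docs#section-2")
print(doc_id)    # https://example.com/docs
print(fragment)  # section-2
```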
pyhtml2pdf (and Chrome) generally produce more accurate colors and positioning than wkhtmltopdf, though for the purposes of neural text search these generally do not matter. Unfortunately, the visual accuracy can sometimes yield programmatic inaccuracies where certain elements of the PDF blocks are located in the wrong place.
In general, if you have full access to the content and/or have the ability to do more bespoke content extraction, it will yield better results than a generic web crawler, and Vectara maintains a full text/metadata indexing API as well for those users.
python3 crawler.py [parameters]
Parameters are:
Parameter | Required? | Description | Default |
---|---|---|---|
url | Yes | The starting URL, domain, or homepage | N/A |
crawl-type | No | single-page, rss, sitemap, or recursive | single-page |
pdf-driver | No | Which driver to use to convert pages to PDFs: chrome or wkhtmltopdf | chrome |
(no-)install-chrome-driver | No | Whether or not to install the Chrome driver used for extracting links | --install-chrome-driver |
depth | No | Maximum depth to which links are discovered and crawled | 3 |
crawl-pattern | No | Optional regular expression to restrict the crawl to matching URLs | .* (all URLs) |
customer-id | Yes | Your Vectara customer ID | N/A |
corpus-id | Yes | Your Vectara corpus ID | N/A |
appclient-id | Yes | OAuth 2.0 client ID used to index content | N/A |
appclient-secret | Yes | OAuth 2.0 client secret used to index content | N/A |
auth-url | No | OAuth2 authentication URL | Defined by your account |
indexing-endpoint | No | The Vectara endpoint used to index content | api.vectara.com |
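As an example, a recursive crawl might be launched as follows, assuming the command-line flags mirror the parameter names in the table above; the URL, customer ID, corpus ID, and OAuth credentials shown are placeholders you would replace with your own values:

python3 crawler.py --url https://example.com --crawl-type recursive --depth 2 --pdf-driver chrome --customer-id 1234567890 --corpus-id 1 --appclient-id <your-appclient-id> --appclient-secret <your-appclient-secret>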
This code is licensed under Apache 2.0. For more details, see the license file.