Commit
indexing data
jeremyorr-hm committed Jul 25, 2023
1 parent 2653a86 commit 10fe4db
Showing 5 changed files with 13 additions and 3 deletions.
Binary file added docs/guide/img/index-job-complete.png
Binary file added docs/guide/img/index-job-progress.png
Binary file added docs/guide/img/start-url.png
8 changes: 5 additions & 3 deletions docs/guide/user-guide/03-configure-project.md
@@ -17,6 +17,8 @@ With a website search project the start url provides the root of the indexing jo
To configure a sitemap as a starting url, use the full path to the sitemap, e.g. www.example.com/sitemap.xml
:::

+![Start URL](../img/start-url.png)

#### Allowed Path Patterns
The allowed path pattern refines which content the indexer chooses to index. When it is set, the indexer evaluates the url path of each document being processed. If the url contains a match to the allowed path pattern, the content is extracted and indexed; if it does not, the content is ignored. If the allowed path is left blank, all content is extracted and indexed (unless it matches a blocked path pattern). Multiple allowed path patterns can be set, and content that matches any one of them is extracted.
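
The filtering rules described above can be sketched roughly as follows. This is an illustration only, not the indexer's actual code: the function name `should_index` is hypothetical, patterns are assumed to be regular expressions matched anywhere in the url path, and a blocked match is assumed to win over an allowed match.

```python
import re
from urllib.parse import urlparse


def should_index(url, allowed_patterns=(), blocked_patterns=()):
    """Return True if the url's path passes the allowed/blocked filters."""
    path = urlparse(url).path
    # Assumption: a blocked match always wins over an allowed match.
    if any(re.search(p, path) for p in blocked_patterns):
        return False
    # A blank allowed list means every non-blocked path is indexed.
    if not allowed_patterns:
        return True
    # Matching any one of the allowed patterns is enough.
    return any(re.search(p, path) for p in allowed_patterns)
```

For example, with `allowed_patterns=["/blog/"]` a url such as `https://www.example.com/blog/post-1` would be indexed, while `https://www.example.com/shop/item` would be ignored.
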
::: tip
@@ -31,13 +33,13 @@ The blocked path pattern can be used to explicitly ignore content matching a cer
![Blocked Path](../img/blocked-path.png)

#### XPath
-The indexer uses Machine Learning to infer what the most appropriate / primary content of a page or paragraph is, but sometimes this doesn't identify the content correctly. The XPath config parameter can be uesd to specifically instruct the indexer to extract content that matches the XPath.
+The indexer uses Machine Learning to infer what the most appropriate / primary content of a page or paragraph is, but sometimes this doesn't identify the content correctly. The XPath config parameter can be used to specifically instruct the indexer to extract content that matches the XPath.

#### Wait XPath
-Sometimes the content to be indexed is dynamically rendered at the point a page is loaded and so can be missed by the speed the indexer usually extracts content. Setting a waif XPath causes the indexer to wait until the content at a particular XPath has fully rendered before extraction.
+Sometimes the content to be indexed is dynamically rendered at the point a page is loaded and so can be missed by the speed the indexer usually extracts content. Setting a wait XPath causes the indexer to wait until the content at a particular XPath has fully rendered before extraction.
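
As a rough illustration of the wait behaviour, the sketch below polls a page snapshot until an XPath matches or a timeout expires. Everything here is an assumption made for the example: `wait_for_xpath` and `get_markup` are hypothetical names, the real indexer presumably drives a browser rather than re-parsing strings, and the standard-library `xml.etree.ElementTree` used here only handles well-formed markup and a limited XPath subset. It also checks only that the element is present, not that it has finished rendering.

```python
import time
import xml.etree.ElementTree as ET


def wait_for_xpath(get_markup, xpath, timeout=5.0, interval=0.1):
    """Poll get_markup() until xpath matches an element or the timeout passes.

    Returns the first matching element, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Re-read the page each time, since content may render late.
        root = ET.fromstring(get_markup())
        match = root.find(xpath)
        if match is not None:
            return match
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A caller would pass a function that returns the current page markup and an expression such as `.//div[@id='results']`, then extract content only once a match is returned.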

#### Follow Links
-The follow links option tells the indexer to crawl all links on a page being indexed to pull in any additional content. This is typically set to True for a simple indexing configuration that might use a home page as the index starting url. If the index is based on one or more sitemaps, it is possible to index only content that appears explicitly in the sitemap by setting the follow links parameter to False
+The follow links option tells the indexer to crawl all links on a page being indexed to pull in any additional content. This is typically turned on for a simple indexing configuration that might use a home page as the index starting url. If the index is based on one or more sitemaps, it is possible to index only content that appears explicitly in the sitemap by turning the follow links parameter off.
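
The effect of the follow links option can be sketched with a small breadth-first crawl over an in-memory link map. This is a simplified model, not the indexer's implementation: the `crawl` function and the `links` mapping (a stand-in for fetching and parsing real pages) are hypothetical.

```python
from collections import deque


def crawl(start_urls, links, follow_links):
    """Return the urls that would be indexed, in visit order.

    `links` maps each url to the urls found on that page.
    """
    unique_starts = list(dict.fromkeys(start_urls))  # drop duplicates, keep order
    seen = set(unique_starts)
    queue = deque(unique_starts)
    indexed = []
    while queue:
        url = queue.popleft()
        indexed.append(url)
        if not follow_links:
            continue  # only the urls supplied up front are indexed
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed
```

With follow links on and a home page as the only start url, every page reachable from it is visited. With follow links off and the urls from a sitemap as the start list, only those urls are indexed.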

### Automatic Document Classification
The Find service can automatically classify indexed content against a list of provided labels. This uses a process called Zero Shot classification and is unsupervised: as documents are indexed, a classifier determines which of the provided classifications best fits the content.
8 changes: 8 additions & 0 deletions docs/guide/user-guide/06-Index-and-deploy.md
@@ -3,6 +3,14 @@
## Start an Indexing Job

### Monitor Job Progress
+Clicking on the Index Version Id takes you through to the details page for the Index job. While the job is running, the job status is updated as urls are processed by the indexer. The
+status shows the total number of urls processed, a count of those processed successfully, and a count of any that failed to process.

+![Index Job Status](../img/index-job-progress.png)

+Once the Index job is complete, the status is updated to reflect this with the final counts of urls indexed. Links also become active for downloading the list of indexed urls for reference and the extracted content itself.

+![Indexing Complete](../img/index-job-complete.png)

### Review Index Data
