More documentation work
zgornel committed Feb 10, 2020
1 parent 6adc4ca commit dc06b31
Showing 6 changed files with 46 additions and 30 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -1,12 +1,12 @@
![Alt text](https://github.com/zgornel/Garamond.jl/blob/master/docs/src/assets/logo.png)

A small, flexible neural and data search engine, written in Julia. Batteries not included.

[![License](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](LICENSE.md)
<p align="center">
[![](https://img.shields.io/badge/docs-dev-blue.svg)](https://zgornel.github.io/Garamond.jl/dev)
[![Build Status (master)](https://travis-ci.com/zgornel/Garamond.jl.svg?token=8HcgFtAjpxwpdXiu8Fon&branch=master)](https://travis-ci.com/zgornel/Garamond.jl)
[![Coverage Status](https://coveralls.io/repos/github/zgornel/Garamond.jl/badge.svg?branch=master)](https://coveralls.io/github/zgornel/Garamond.jl?branch=master)
[![](https://img.shields.io/badge/docs-dev-blue.svg)](https://zgornel.github.io/Garamond.jl/dev)
</p>

### Garamond is a small, flexible neural and data search engine, written in Julia. Batteries not included.

## Installation

2 changes: 1 addition & 1 deletion docs/make.jl
@@ -19,7 +19,7 @@ makedocs(
"Configuration" => "configuration.md",
"Client/Server" => "clientserver.md",
"Building" => "build.md",
"Notes" => "notes.md",
"Remarks" => "remarks.md",
"API Reference" => "api.md",
]
)
36 changes: 23 additions & 13 deletions docs/src/clientserver.md
@@ -1,20 +1,20 @@
# Search server, clients and REST APIs

Garamond is designed as a [client-server architecture](http://catb.org/~esr/writings/taoup/html/ch11s06.html#id2958899) in which the server receives requests, performs the search, recommendation or ranking operations and returns the response i.e. results back to the client.
Garamond is designed as a [client-server architecture](http://catb.org/~esr/writings/taoup/html/ch11s06.html#id2958899) in which the server receives requests, performs the search, recommendation or ranking operations and returns a response containing the search results back to the client.

!!! note

- The clients do not depend on the Garamond package and are very lightweight.
- The preferred way of communicating with the server is through the [REST API](@ref rest-api-specification) using HTTP clients such as [curl](https://curl.haxx.se/), etc.

In the root directory of the package the search server utility and two thin clients can be found:
- **gars** - starts the search server. The operations performed by the search engine server at this point are indexing data according to a given configuration and serving requests coming from connections to sockets or HTTP ports.
- **garc** - command line client supporting Unix socket communication. Through it, a single search can be performed and many of the search request parameters can be specified. It supports printing search results in a human-readable way.
- **garw** - web client supporting Web socket communication (experimental and feature limited). The basic principle is that it starts an HTTP server which serves a page at a given HTTP port. If the web page is not specified, a default one is generated internally and served. The user connects with a web browser of choice at the local address and port (i.e. `127.0.0.1`) and performs the search queries from the page. It naturally supports multiple queries; however, the parameters of the search cannot be changed.
- `gars` - starts the search server. The operations performed by the search engine server at this point are indexing data according to a given configuration and serving requests coming from connections to sockets or HTTP ports.
- `garc` - command line client supporting Unix socket communication. Through it, a single search can be performed and many of the search request parameters can be specified. It supports printing search results in a human-readable way.
- `garw` - web client supporting Web socket communication (experimental and feature limited). The basic principle is that it starts an HTTP server which serves a page at a given HTTP port. If the web page is not specified, a default one is generated internally and served. The user connects with a web browser of choice at the local address and port (i.e. `127.0.0.1`) and performs the search queries from the page. It naturally supports multiple queries; however, the parameters of the search cannot be changed.


## Server
The search server listens on an IP and socket for incoming requests. Once one is received, it is processed and the response is sent back to the same socket. Looking at the `gars` command line help
The search server listens on an IP and/or socket for incoming requests. Once one is received, it is processed and the response is sent back to the same socket. Looking at the `gars` command line help
```
$ ./gars --help
Activating environment at `~/projects/Garamond.jl/Project.toml`
@@ -47,6 +47,7 @@ optional arguments:
-h, --help show this help message and exit
```
starting the server becomes quite straightforward.

For example, to start the server listening to a web socket at port 9100 and to a UNIX socket at `/tmp/some/socket`:
```
$ ./gars -d ./search_data_config.json -u /tmp/some/socket -w 9100 --log-level info
@@ -71,10 +72,12 @@ $ ./garc --help
usage: garc [--log-level LOG-LEVEL] [-u UNIX-SOCKET]
[--return-fields [RETURN-FIELDS...]] [--pretty]
[--max-matches MAX-MATCHES]
[--response-size RESPONSE-SIZE]
[--search-method SEARCH-METHOD]
[--max-suggestions MAX-SUGGESTIONS] [--id-key ID-KEY] [-k]
[--update-searcher UPDATE-SEARCHER] [--update-all]
[--rank] [-h] [query]
[--env-operation ENV-OPERATION ENV-OPERATION]
[--ranker RANKER] [--input-parser INPUT-PARSER] [-h]
[query]
positional arguments:
query the search query (default: "")
@@ -89,8 +92,11 @@ optional arguments:
List of fields to return (ignores wrong names)
--pretty output is a pretty print of the results
--max-matches MAX-MATCHES
maximum results to return (type: Int64,
default: 10)
maximum number of results for internal
neighbor searches (type: Int64, default: 10)
--response-size RESPONSE-SIZE
maximum number of results to return (type:
Int64, default: 10)
--search-method SEARCH-METHOD
type of match done during search (type:
Symbol, default: :exact)
@@ -101,10 +107,14 @@ optional arguments:
--id-key ID-KEY The linear ID key (default:
"garamond_linear_id")
-k, --kill Kill the search engine server
--update-searcher UPDATE-SEARCHER
Update a searcher (default: "")
--update-all Update all searchers
--rank Use ranker (if any)
--env-operation ENV-OPERATION ENV-OPERATION
Environment operation
--ranker RANKER The ranker to use; avalilable: noop_ranker
(default: "noop_ranker")
--input-parser INPUT-PARSER
The input parser to use; available:
noop_input_parser, base_input_parser (default:
"noop_input_parser")
-h, --help show this help message and exit
```
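
The client options listed above can be combined into a single call. The following is a minimal sketch, not a documented recipe: it uses only flags that appear in the `garc` help text, assumes the server was started with `-u /tmp/some/socket` as in the server example, and uses a made-up query string. It also assumes that `--kill` is sent over the same socket, which the help text does not state explicitly. The commands are only printed (dry run), so the snippet can be inspected without a running server.

```shell
#!/bin/sh
# Sketch only: SOCKET and QUERY are illustrative values, not defaults of the tool.
SOCKET="/tmp/some/socket"   # assumed to match the path passed to `gars -u`
QUERY="neural search"       # hypothetical query text

# Compose a garc call using only flags listed in the help text above.
garc_search_cmd() {
    printf '%s ' ./garc -u "$SOCKET" --pretty --response-size 5 --max-matches 20
    printf '"%s"\n' "$QUERY"
}

# Dry run: print the search command instead of executing it.
SEARCH_CMD=$(garc_search_cmd)
echo "$SEARCH_CMD"

# Stopping the server uses the -k/--kill flag (assumed to go through the same socket).
KILL_CMD="./garc -u $SOCKET --kill"
echo "$KILL_CMD"
```

To actually run the search, execute the printed command from the package root once the server is listening on the socket. Here `--max-matches` bounds the internal neighbor searches, while `--response-size` bounds what is returned to the client.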

21 changes: 13 additions & 8 deletions docs/src/getting_started.md
@@ -10,14 +10,19 @@ The engine uses a pluggable approach in which data loaders, parsers, recommender
!!! tip "Glossary"

Throughout the documentation, certain terms will appear when referring to the internals of the engine. Some of the most frequent ones are:
* **config** may refer to several configuration files or objects that the engine uses.
* **configuration** may refer to:
- searcher configuration, a `SearcherConfig` object which holds the configuration options for individual searchers.
- environment configuration, a `NamedTuple` that contains searcher configurations as well as other parameters.
- data configuration file, a JSON file which is parsed to generate an environment configuration.
* **search environment**, a `SearchEnv` object that holds the data and searchers, among others. It fully describes the state of the engine.
* **searcher** - object that is used to perform the actual search. It holds the indexed documents in some vectorial representation.
* **searcher**, a `Searcher` object that is used to perform the actual search. It holds the indexed documents in some vectorial representation.
* **index** - the data structure holding the vector representation of the documents.
* **request** - may refer to either a request from an outside system to the engine i.e. HTTP request or its internal representation in the API, of type `InternalRequest`.
* **request** - may refer to:
- a request from an outside system to the engine, e.g. an HTTP request.
- the internal representation of a request, of type `InternalRequest`.

### Engine configuration
The main configuration of the engine pertains to data loading, parsing and indexing. Its role is to provide all necessary details as well as the internal architecture of the engine. The recommended way for configuring the engine is to create a JSON file with all necessary options. Alternatively, the result of parsing the configuration file i.e. the configuration object can be created manually or programmatically; however, it is - at least at this point - a cumbersome operation.
## Engine configuration
The main configuration of the engine pertains to data loading, parsing and indexing. Its role is to provide all necessary details as well as the internal architecture of the engine. The recommended way for configuring the engine is to create a JSON file with all necessary options. Alternatively, the result of parsing the configuration file, i.e. the configuration object, can be created explicitly; however, this is, at least at this point, a cumbersome operation.

```@repl_index
using Logging, JSON, JuliaDB, Garamond
@@ -33,14 +38,14 @@ for field in fieldnames(typeof(cfg))
end
```

### The search environment
## The search environment
Building the search environment out of the configuration is straightforward. The environment holds the in-memory data in the form of an `IndexedTable` or `NDSparse` object, the searchers as well as other information such as primary db key and configuration paths.

```@repl_index
env = build_search_env(cfg)
```

### Engine operations
## Engine operations
The internal API is designed to be straightforward and uniform in the way it is called. First, one has to build a request which fully describes the operation to be performed and subsequently, call the operation desired. For example, to perform a search, one request would be:
```@repl_index
request = Garamond.InternalRequest(operation=:search,
@@ -62,7 +67,7 @@ Ranking the results using the ranker specified in the request is done with:
ranked = rank(env, request, search_results)
```

### Results and responses
## Results and responses

Once results are available, these can be printed
```@repl_index
7 changes: 4 additions & 3 deletions docs/src/index.md
@@ -8,7 +8,7 @@ CurrentModule=Garamond

# Introduction

Garamond is a small, flexible neural and data search engine. It can be used both as a Julia package i.e. search functionality available through API method calls or as a standalone search server i.e. search functionality accessible through clients that communicate with the server.
Garamond is a small, flexible neural and data search engine. It can be used either as a Julia package, with search functionality available through API method calls, or as a standalone search server, with search functionality accessible through clients that communicate with the server.

Internally, the engine's architecture is that of an ensemble of searchers, with an analytical database as data backend. Each searcher has its own characteristics, i.e. ways of embedding documents and searching through the vectors, and the search results from all searchers can be combined in a variety of ways. The engine supports runtime loading and use of custom data loaders, recommendation engines and result rankers.

@@ -34,17 +34,18 @@ downloads the `master` branch of the repository and adds `Garamond` to the curre
- Run-time batch re-indexing
- HTTP(REST)/Web-socket and UNIX socket connectivity
- Wordvectors support: [Word2Vec](https://en.wikipedia.org/wiki/Word2vec), [ConceptnetNumberbatch](https://github.com/commonsense/conceptnet-numberbatch), [GloVe](https://nlp.stanford.edu/projects/glove/)
- Classic search based on [term frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2), [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency%E2%80%93Inverse_document_frequency), [bm25](https://en.wikipedia.org/wiki/Okapi_BM25)
- Compressed vector support for low-memory footprint using [array quantization](https://github.com/zgornel/QuantizedArrays.jl)
- Classic search based on [term frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2), [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency%E2%80%93Inverse_document_frequency), [bm25](https://en.wikipedia.org/wiki/Okapi_BM25)
- Suggestion support using [BK Trees](https://en.wikipedia.org/wiki/BK-tree)
- Many state-of-the-art neural document and sentence embedding methods
- Multi-threading [supported](https://github.com/zgornel/Garamond.jl/tree/cc-multithreading)
- Caching mechanisms for fast resume
- Portable (and statically compilable) to many architectures
- Portable and statically compilable to many architectures

## Coming Soon
- Billion-scale search through [IVFADC](https://github.com/JuliaNeighbors/IVFADC.jl)
- Run-time indexing
- Architectural improvements i.e. pool of embedders

## Longer term plans
- Image/Video/Audio i.e. generic search
2 changes: 1 addition & 1 deletion docs/src/notes.md → docs/src/remarks.md
@@ -1,4 +1,4 @@
# Notes
# Remarks

## Multi-threading
If one chooses to use multi-threading, e.g. through the `Threads.@threads` or `Threads.@spawn` macros, export the following: `OPENBLAS_NUM_THREADS=1` and `JULIA_NUM_THREADS=<n>` where `n` is the number of threads desired.
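
As a minimal sketch, the two variables can be exported in the shell that later launches Julia or the search server. The thread count of 4 and the `N_THREADS` helper variable are illustrative assumptions, not recommendations from the package.

```shell
#!/bin/sh
# Sketch only: N_THREADS=4 is an arbitrary example value.
N_THREADS=4

# Keep OpenBLAS single-threaded so it does not spawn its own threads alongside Julia's.
export OPENBLAS_NUM_THREADS=1
# Number of threads available to Julia.
export JULIA_NUM_THREADS="$N_THREADS"

# Show the resulting settings before starting Julia or the server from this shell.
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS JULIA_NUM_THREADS=$JULIA_NUM_THREADS"
```

Both variables must be set before the Julia process starts, since they are read at startup.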