# indexd Demo

## What is indexd?

The name "indexd" follows the usual convention for "index daemon". Although indexd is not a daemon in the strict technical sense, the name summarizes its basic purpose. In a nutshell, indexd is a microservice that maintains URLs as pointers to stored data files. It adds a layer of abstraction over those files: the data can move between locations or live in several at once, while the unique identifier for each file, kept in indexd, lets us obtain the URLs (and some miscellaneous metadata) for the same stored data. Additionally, indexd tracks revisions of the same data file.

## Python Demo

Throughout this demo we're going to use direct API calls to indexd, just to get a sense for the API and what's going on "under the hood". For actually interfacing with indexd in our code we use another library called "indexclient" (can you guess what this does?). As we work through the demo we'll show the code both for making calls directly to indexd and for using indexclient.

### Setup

For this demo, make sure the `indexclient` package is installed so that it can be imported here in Jupyter. I used this to install it in this notebook:
```
import sys
!cd ~/cdis/indexclient; {sys.executable} setup.py develop --user
```

In [430]:
import json
from urllib.parse import urljoin

from indexclient.client import IndexClient
import requests

To start, we'll run indexd on `localhost:8080`. Probably the easiest way is with a docker container:
```bash
# Start from indexd directory
# Build the docker image if you don't have it yet
docker build -t indexd .
# Now run the image, and set it to forward to port 8080.
docker run -d --name indexd -p 8080:80 indexd
```

In order to use endpoints requiring admin authorization, set up a username and password in the indexd docker image:
```bash
docker exec indexd python /indexd/bin/index_admin.py create --username test --password test
```

(Here we set up a bit of code just to make printing out the API calls more concise and readable.)

In [431]:
base = 'http://localhost:8080'

# NOTE
# Fill in the auth with whatever username/password you set before.
request_auth = requests.auth.HTTPBasicAuth('test', 'test')

indexd = lambda path: urljoin(base, path)

def print_response(response):
    print(response)
    try:
        print(json.dumps(response.json(), indent=4))
    except ValueError:
        print(response.text)

Just for the purposes of re-using this demo with the same indexd instance, we'll clear out all the records from indexd. (For the sake of the tutorial, this shouldn't make sense yet—so ignore this, and move along!)

In [432]:
def wipe_indexd():
    """
    Delete all records from indexd.
    """
    records = requests.get(indexd('/index/')).json()['records']
    for record in records:
        path = indexd('/index/{}'.format(record['did']))
        params = {'rev': record['rev']}
        response = requests.delete(path, auth=request_auth, params=params)

In [433]:
wipe_indexd()

We'll set up an `IndexClient` as well, which is what our other code actually uses to interface with indexd.

In [434]:
client = IndexClient(baseurl=base, auth=request_auth)

Let's check that indexd is alive, using the status endpoint.

In [435]:
print('GET {}'.format(indexd('/_status')))
print()
print_response(requests.get(indexd('/_status')))

GET http://localhost:8080/_status

<Response [200]>
Healthy


We can also check the status through the client.

In [436]:
client.check_status()

(It doesn't return anything if indexd is working.)

So far so good. Let's get the list of records stored in indexd right now, by sending a `GET` to `/index/`.

In [437]:
print('GET {}'.format(indexd('/index/')))
print()
print_response(requests.get(indexd('/index/')))

GET http://localhost:8080/index/

<Response [200]>
{
    "version": null,
    "size": null,
    "file_name": null,
    "acl": [],
    "ids": null,
    "start": null,
    "metadata": {},
    "limit": 100,
    "hashes": null,
    "urls": [],
    "records": []
}


Listing records with the client (the return value will have just the records, and not the extra information returned from the endpoint):

In [438]:
list(client.list())

[]

There are no records registered yet, so let's create one.

### Creating a Record

Just below is some example data for a record. We `POST` this to the `/index/` endpoint on indexd to register the record.

The minimum information necessary to supply to indexd is the file size, the hash (in any of several common formats), a list of URLs pointing to where the data file is stored (which can be left empty), and the `form` of the record (we use `'object'` here). For this example we'll also give our imaginary file a name, and add `'*'` to the ACL list.
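
For a real file, the `size` and `hashes` values would normally be computed from the file's contents rather than typed in by hand. The sketch below shows one way to do that with the standard library. Note that `build_record_payload` is a hypothetical helper (it is not part of indexd or indexclient), and the field names simply mirror the example payload used in this demo.

```python
import hashlib


def build_record_payload(content, urls, file_name=None):
    """
    Build a dict shaped like the example payload in this demo.

    Hypothetical helper for illustration only; it is not part of indexd
    or indexclient.
    """
    payload = {
        # Size of the stored data, taken as the number of bytes.
        'size': len(content),
        # Hex-encoded MD5 digest of the data.
        'hashes': {'md5': hashlib.md5(content).hexdigest()},
        'urls': list(urls),
        'form': 'object',
    }
    if file_name is not None:
        payload['file_name'] = file_name
    return payload


payload = build_record_payload(
    b'example!',
    urls=['storage://file/path/example_file'],
    file_name='example_file',
)
print(json_ready := payload)
```

The resulting dict could then be sent as the JSON body of the `POST`, in the same way as the hand-written payload in the next cell.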

In [439]:
hashes = {'md5': 'e561f9248d7563d15dd93457b02ebbb6'}
size = 8
data_v_0 = {
    'hashes': hashes,
    'size': size,
    'urls': ["storage://file/path/example_file"],
    'form': 'object',
    'file_name': 'example_file',
    'acl': ['*'],
}

print('POST {}'.format(indexd('/index/')))
print()
response = requests.post(indexd('/index/'), json=data_v_0, auth=request_auth)
print_response(response)

POST http://localhost:8080/index/

<Response [200]>
{
    "baseid": "7b044aa0-1c65-4874-831b-9d69a602d6f4",
    "rev": "55633f08",
    "did": "testprefix:8511e34c-655c-4025-8d21-a4bf3bf2e5d3"
}


Success! We see in the response we have these three fields, `rev`, `did`, and `baseid`. These uniquely identify certain things about this record.

- `did` is the ID for this record specifically.
- `baseid` is a common identifier for all versions of the same record; we'll come back to versioning later.
- `rev` is the identifier for this version.

Let's repeat that, this time using the client. When it returns an index record, the `IndexClient` gives back a `Document` object containing all the information for that record.

In [440]:
wipe_indexd()
client_create_kwargs = dict(data_v_0)
client_create_kwargs.pop('form')

# Use the IndexClient to create a new record.
doc = client.create(**client_create_kwargs)

print('Document attributes and methods:')
print(json.dumps(
    list(attr for attr in dir(doc) if not attr.startswith('_')),
    indent=4,
))

# Save this stuff, we'll need to use it later.
v_0 = doc.to_json()

Document attributes and methods:
[
    "acl",
    "baseid",
    "client",
    "created_date",
    "delete",
    "did",
    "file_name",
    "form",
    "hashes",
    "metadata",
    "patch",
    "rev",
    "size",
    "to_json",
    "updated_date",
    "urls",
    "urls_metadata",
    "version"
]


We can convert the document into JSON, to get all the properties in the same format as they would be returned from the API.

In [441]:
print(json.dumps(doc.to_json(), indent=4))

{
    "updated_date": "2018-08-09T00:22:20.039646",
    "urls_metadata": {
        "storage://file/path/example_file": {}
    },
    "baseid": "91a1a213-b630-4ec8-88d4-e38fc9fae968",
    "hashes": {
        "md5": "e561f9248d7563d15dd93457b02ebbb6"
    },
    "urls": [
        "storage://file/path/example_file"
    ],
    "form": "object",
    "size": 8,
    "file_name": "example_file",
    "did": "testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c",
    "acl": [
        "*"
    ],
    "metadata": {},
    "created_date": "2018-08-09T00:22:20.039637",
    "rev": "861b0dab",
    "version": null
}


Great, so we made a new record with some basic information. Now, let's take a closer look at the fields that go into a record.

### About Records in indexd

A single record in indexd contains several fields; let's go through each field and explain what these are for.

#### `did` ("digital identifier")

A unique identifier (UUID4) for the file; indexd will make these for new records automatically. Notice that the one that indexd generated for us looks like this:
```
<prefix>:<UUID>
```
We're going to discuss these prefixes in more detail in a later section.
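
In the meantime, here is a minimal sketch of how client code might take such an identifier apart. `split_did` is a hypothetical helper, not something provided by indexd or indexclient, and it assumes that the last `:` separates the prefix from the UUID (an identifier without a `:` is treated as having no prefix).

```python
import uuid


def split_did(did):
    """
    Split a DID of the form `<prefix>:<UUID>` into (prefix, UUID).

    Hypothetical helper: assumes the last ':' is the separator and returns
    a prefix of None when the identifier contains no ':' at all.
    """
    prefix, sep, tail = did.rpartition(':')
    # uuid.UUID raises ValueError if `tail` is not a well-formed UUID.
    return (prefix if sep else None), uuid.UUID(tail)


prefix, guid = split_did('testprefix:8511e34c-655c-4025-8d21-a4bf3bf2e5d3')
print(prefix, guid.version)  # testprefix 4
```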

#### `baseid`

The `baseid` is a common identifier for all versions of one file, across revisions.

#### `rev`

The `rev` field identifies a particular version of a file with multiple versions.

#### `size`

This is just the file size that we gave indexd originally for this file.

#### `file_name`

Optional field recording the filename of the indexed file.

#### `created_date`

The time that this record was created.

#### `urls`

Like we mentioned above, this is the list of URLs which point to the real location of the stored data.

#### `acl`

"Access control list". Fence uses this list to control authorization when generating pre-signed URLs.

#### `hashes`

`hashes` is an object storing one or more hashes for the file itself. These can be any of:
- MD5
- SHA
- SHA256
- SHA512
- CRC
- ETag

For this demo we'll skip over a few record fields: `form`, `metadata`, `urls_metadata`, and `version`. These are either not used extensively or are specific to the GDC.

Now that we've seen some examples and know what the fields mean, we're going to trim the fields in the next examples to keep things concise.

In [442]:
def print_record(record):
    """
    Utility function to print subset of record fields.
    """
    print(record['file_name'])
    print('urls: {}'.format(record['urls']))
    print('size: {}'.format(record['size']))
    print('baseid: {}'.format(record['baseid']))
    print('rev: {}'.format(record['rev']))
    print('did: {}'.format(record['did']))


### Retrieving Records

Now the list of records returned from indexd should have our new entry—let's check, again using a `GET` to the `/index/` endpoint.

In [443]:
print('GET {}'.format(indexd('/index/')))
print()
response = requests.get(indexd('/index/'))
print_record(response.json()['records'][0])

GET http://localhost:8080/index/

example_file
urls: ['storage://file/path/example_file']
size: 8
baseid: 91a1a213-b630-4ec8-88d4-e38fc9fae968
rev: 861b0dab
did: testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c


We can look up this specific record using `GET` `/index/{UUID}`, where the UUID is the DID that indexd returned before when we created this record.

In [444]:
path = indexd('/index/{}'.format(v_0['did']))

print('GET {}'.format(path))
print()
print_record(requests.get(path).json())

GET http://localhost:8080/index/testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c

example_file
urls: ['storage://file/path/example_file']
size: 8
baseid: 91a1a213-b630-4ec8-88d4-e38fc9fae968
rev: 861b0dab
did: testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c


Let's also search for this record through the client.

In [445]:
doc = client.get(v_0['did'])
print_record(doc.to_json())

example_file
urls: ['storage://file/path/example_file']
size: 8
baseid: 91a1a213-b630-4ec8-88d4-e38fc9fae968
rev: 861b0dab
did: testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c


We can also search through all the records, but apply an argument to filter by hash, size, and/or URL.

Let's apply the `hash` argument in the query string, and give it the md5 hash for our file.

In [446]:
path = indexd('/index?hash=md5:{}'.format(v_0['hashes']['md5']))
records = requests.get(path).json()['records']

print('GET {}'.format(path))
print()
print('Returned {} records'.format(len(records)))
for record in records:
    print()
    print_record(record)

GET http://localhost:8080/index?hash=md5:e561f9248d7563d15dd93457b02ebbb6

Returned 1 records

example_file
urls: ['storage://file/path/example_file']
size: 8
baseid: 91a1a213-b630-4ec8-88d4-e38fc9fae968
rev: 861b0dab
did: testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c
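
The query string in the request above was assembled with string formatting. As a sketch of an alternative, the standard library can build and escape it for us. `hash_query_url` is a hypothetical helper name, and only the `hash` parameter already shown in this demo is used.

```python
from urllib.parse import urlencode, urljoin


def hash_query_url(base_url, hash_type, hash_value):
    """
    Build a URL that filters the `/index` listing by a single hash.

    Hypothetical helper for illustration only.
    """
    # safe=':' keeps the colon literal, matching the URL form used above;
    # any other special characters in the value are percent-encoded.
    query = urlencode(
        {'hash': '{}:{}'.format(hash_type, hash_value)},
        safe=':',
    )
    return '{}?{}'.format(urljoin(base_url, '/index'), query)


url = hash_query_url(
    'http://localhost:8080',
    'md5',
    'e561f9248d7563d15dd93457b02ebbb6',
)
print(url)
# http://localhost:8080/index?hash=md5:e561f9248d7563d15dd93457b02ebbb6
```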


And of course, we can accomplish the same thing using the `IndexClient`.

In [447]:
doc = client.get_with_params(params={'hashes': v_0['hashes']})
print_record(doc.to_json())

example_file
urls: ['storage://file/path/example_file']
size: 8
baseid: 91a1a213-b630-4ec8-88d4-e38fc9fae968
rev: 861b0dab
did: testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c


### Record Versions

Now that we've created a record, let's look at the process of updating this record with a new version. We're going to change the contents—and thus the size and the hash—of our imaginary file. Let's update indexd with the new information. To add a new version, we `POST` to `/index/{UUID}`, where the UUID is an identifier for the existing file.

In [448]:
# Here's the new data for the "file".
data_v_1 = dict(data_v_0)
data_v_1['size'] = 10
data_v_1['hashes'] = {'md5': 'f7952a9483fae0af6d41370d9333020b'}

# We saved the DID for this file before.
path = indexd('/index/{}'.format(v_0['did']))
print('POST {}'.format(path))
print()
response = requests.post(path, json=data_v_1, auth=request_auth)
# Also stash the return values from this response.
v_1 = response.json()
print_response(response)

POST http://localhost:8080/index/testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c

<Response [200]>
{
    "baseid": "91a1a213-b630-4ec8-88d4-e38fc9fae968",
    "rev": "f792a6b2",
    "did": "d7b0ad4e-8afe-4480-ae68-9ac6ea60a082"
}


Now, if we compare this `baseid` to the `baseid` that indexd returned when we created the record for the original file, we see that this `baseid` remains the same.

In [449]:
print('Same `baseid`? {}'.format(v_0['baseid'] == v_1['baseid']))

Same `baseid`? True


All revisions of the same file will share this `baseid`.

However, this record has a different `rev` and a different `did` than the original.

In [450]:
print('Same `did`? {}'.format(v_0['did'] == v_1['did']))
print('Same `rev`? {}'.format(v_0['rev'] == v_1['rev']))

Same `did`? False
Same `rev`? False
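
To make the relationship between the three identifiers concrete, here is a toy in-memory model of the behaviour we just observed. It is purely illustrative and is not how indexd is implemented; `ToyIndex` and its methods are made-up names.

```python
import uuid


class ToyIndex:
    """
    Minimal in-memory stand-in for the versioning behaviour shown above.
    Illustrative only; this is not indexd's implementation.
    """

    def __init__(self):
        self.records = {}  # maps did -> record dict

    def _add(self, baseid, data):
        record = dict(data)
        record['baseid'] = baseid
        # Every record, including each new version, gets its own did and rev.
        record['did'] = str(uuid.uuid4())
        record['rev'] = uuid.uuid4().hex[:8]
        self.records[record['did']] = record
        return record

    def create(self, data):
        # A brand-new file gets a fresh baseid.
        return self._add(str(uuid.uuid4()), data)

    def add_version(self, did, data):
        # A new version reuses the baseid of the record it derives from.
        return self._add(self.records[did]['baseid'], data)


index = ToyIndex()
first = index.create({'size': 8})
second = index.add_version(first['did'], {'size': 10})
print(first['baseid'] == second['baseid'])  # True
print(first['did'] == second['did'])        # False
```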


Having created the new version for this file, let's again make a request `GET` `/index/{UUID}`, using the shared `baseid`.

In [451]:
path = indexd('/index/{}'.format(v_0['baseid']))
print('GET {}'.format(path))
print()
response = requests.get(path)
print_record(response.json())

GET http://localhost:8080/index/91a1a213-b630-4ec8-88d4-e38fc9fae968

example_file
urls: ['storage://file/path/example_file']
size: 10
baseid: 91a1a213-b630-4ec8-88d4-e38fc9fae968
rev: f792a6b2
did: d7b0ad4e-8afe-4480-ae68-9ac6ea60a082


The information for this record reflects the new changes to the file: the size and `did` have changed. The `baseid` is the same.

In [452]:
print(response.json()['did'] == v_1['did'])

True


However, the original information still exists. We can make a request again using the DID of the original file, and see that this revision hasn't changed.

In [453]:
path = indexd('/index/{}'.format(v_0['did']))
print('GET {}'.format(path))
print()
print_record(requests.get(path).json())

GET http://localhost:8080/index/testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c

example_file
urls: ['storage://file/path/example_file']
size: 8
baseid: 91a1a213-b630-4ec8-88d4-e38fc9fae968
rev: 861b0dab
did: testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c


Finally, we can look at the whole list of versions for a single file, with `GET` `/index/{UUID}/versions`. The object in the response will contain the records for every version of this file as key-value pairs, where the keys are just numeric indexes (in string form) and the values are the records.
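
Because the keys are numeric indexes stored as strings, they should be converted to integers before comparing them (as strings, `'10'` would sort before `'2'`). A small sketch of picking out the newest record from a parsed response follows. `latest_version` is a hypothetical helper, and it assumes that a higher index means a newer version.

```python
def latest_version(versions):
    """
    Return the record stored under the highest numeric key.

    Hypothetical helper; assumes a higher index means a newer version.
    """
    newest_key = max(versions, key=int)
    return versions[newest_key]


# Trimmed-down stand-in for a parsed `/versions` response.
versions = {
    '1': {'size': 10, 'rev': 'f792a6b2'},
    '0': {'size': 8, 'rev': '861b0dab'},
}
print(latest_version(versions)['rev'])  # f792a6b2
```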

In [454]:
path = indexd('/index/{}/versions'.format(v_0['baseid']))
print('GET {}'.format(path))
print()
print_response(requests.get(path))

GET http://localhost:8080/index/91a1a213-b630-4ec8-88d4-e38fc9fae968/versions

<Response [200]>
{
    "1": {
        "updated_date": "2018-08-09T00:22:20.246230",
        "urls_metadata": {
            "storage://file/path/example_file": {}
        },
        "baseid": "91a1a213-b630-4ec8-88d4-e38fc9fae968",
        "hashes": {
            "md5": "f7952a9483fae0af6d41370d9333020b"
        },
        "urls": [
            "storage://file/path/example_file"
        ],
        "form": "object",
        "size": 10,
        "file_name": "example_file",
        "version": null,
        "acl": [
            "*"
        ],
        "metadata": {},
        "created_date": "2018-08-09T00:22:20.246219",
        "rev": "f792a6b2",
        "did": "d7b0ad4e-8afe-4480-ae68-9ac6ea60a082"
    },
    "0": {
        "updated_date": "2018-08-09T00:22:20.039646",
        "urls_metadata": {
            "storage://file/path/example_file": {}
        },
        "baseid": "91a1a213-b630-4ec8-88d4-e38fc9fae968",

As a final point on the versioning capabilities in indexd, note what happens when we try to update the "version 0" of this file. We can do this using a `PUT` to `/index/{did}?{rev}` using the `did` and `rev` values for the first version we created. Let's suppose we're going to try to move this file to a different storage location.

In [460]:
data_v_1_1 = {
    'urls': ['storage://different/file/path']
}

path = indexd('/index/{}?{}'.format(v_0['did'], v_0['rev']))

print('PUT {}'.format(path))
print()
response = requests.put(path, json=data_v_1_1, auth=request_auth)
print_response(response)

PUT http://localhost:8080/index/testprefix:a62d7817-f43a-4281-ac5d-98a8d7e5af1c?861b0dab

<Response [409]>
{
    "error": "revision mismatch"
}


This operation is not allowed because we tried to modify an older version of this record. The revision check prevents conflicting updates from being applied to the same record, since every update must operate on the latest version.
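
A revision check of this kind is a form of optimistic concurrency control. The toy model below sketches the general idea under simple assumptions: each successful write produces a new `rev`, and a write that names any other `rev` is refused. `ToyRecord` and `RevisionMismatch` are made-up names for illustration, not indexd's actual classes.

```python
import uuid


class RevisionMismatch(Exception):
    """Raised when an update names a revision other than the current one."""


class ToyRecord:
    """In-memory record guarded by a revision check (illustrative only)."""

    def __init__(self, urls):
        self.urls = list(urls)
        self.rev = uuid.uuid4().hex[:8]

    def update(self, rev, urls):
        # Refuse the write unless the caller has seen the current revision.
        if rev != self.rev:
            raise RevisionMismatch('revision mismatch')
        self.urls = list(urls)
        # Every successful write produces a new revision.
        self.rev = uuid.uuid4().hex[:8]
        return self.rev


record = ToyRecord(['storage://file/path/example_file'])
stale_rev = record.rev
record.update(stale_rev, ['storage://different/file/path'])  # succeeds

try:
    # A second writer still holding the old revision is rejected.
    record.update(stale_rev, ['storage://another/path'])
except RevisionMismatch as error:
    print(error)  # revision mismatch
```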

### Record Aliases

Keeping track of records with UUID4s works well for Python code but less so for humans: the identifiers are hard to read and carry no semantic meaning. To help the humans keep things straight, indexd supports aliases for its records. The endpoints for listing, creating, updating, and removing aliases are at the `/alias` endpoints in indexd.

To start, let's list the existing aliases.

In [456]:
print_response(requests.get(indexd('/alias/')))

<Response [200]>
{
    "size": null,
    "limit": 100,
    "aliases": [
        "foo"
    ],
    "hashes": null,
    "start": null
}


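On a fresh indexd instance this list would, unsurprisingly, be empty, since we haven't made any aliases yet. (The `foo` entry in the output above is presumably left over from an earlier run of this notebook, since `wipe_indexd` only deletes index records, not aliases.) Let's make one. To make an alias we're going to send a `PUT` to `/alias/{ALIAS_STRING}`, where `ALIAS_STRING` is the more human-readable name that we want to attach to a record. In the body, we send the information about the record we want to use.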

In [457]:
data = {
    'release': 'public',
    'size': data_v_1['size'],
    'hashes': data_v_1['hashes'],
}
alias = 'foo'
path = indexd('/alias/{}'.format(alias))
response = requests.put(path, json=data, auth=request_auth)
print_response(response)

<Response [200]>
{
    "name": "foo",
    "rev": "8ff8d788"
}


Now we have an alias for this record.

In [458]:
print_response(requests.get(indexd('/alias/')))

<Response [200]>
{
    "size": null,
    "limit": 100,
    "aliases": [
        "foo"
    ],
    "hashes": null,
    "start": null
}


We can get the information for this alias now using `GET` `/alias/{ALIAS_STRING}`.

In [459]:
path = indexd('/alias/foo')
print('GET {}'.format(path))
print()
print_response(requests.get(path))

GET http://localhost:8080/alias/foo

<Response [200]>
{
    "host_authorities": [],
    "name": "foo",
    "size": 10,
    "start": 0,
    "keeper_authority": null,
    "release": "public",
    "metadata": null,
    "limit": 100,
    "rev": "8ff8d788",
    "hashes": {
        "md5": "f7952a9483fae0af6d41370d9333020b"
    },
    "urls": [
        {
            "metadata": {},
            "url": "storage://file/path/example_file"
        }
    ]
}


## About Prefixes and Data GUIDs

## Indexd in other Gen3 Services

In sheepdog, creating metadata automatically registers an index with indexd; see [`FileUploadEntity._register_index`](https://github.com/uc-cdis/sheepdog/blob/0c2e9eec3d6c79d46cbf35d687958cfbadcb1ce1/sheepdog/transactions/upload/sub_entities.py#L273-L312).