# SequenceCollectionClient tutorial

## Introduction 

The `refget` Python package contains a class called `SequenceCollectionClient` that provides a simple Python API for interacting with a remote refget sequence collections server.



In [1]:
import refget
from refget import SequenceCollectionClient

First, start a local demo service by running:

```console
bash deployment/demo_up.sh
```

Then, you can create a `SequenceCollectionClient` to interact with the service from within Python:

In [2]:
seqcol_client = SequenceCollectionClient(urls=["http://127.0.0.1:8100"])

In [3]:
seqcol_client

<SequenceCollectionClient>
  Service ID: org.databio.seqcolapi
  Service Name: Sequence collections
  API URLs:    http://127.0.0.1:8100

Now we have a client connected to our server. You can use this object to call any of the API functions. First, check which collections are available:

In [4]:
seqcol_client.list_collections()

{'pagination': {'page': 0, 'page_size': 100, 'total': 6},
 'results': ['XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk',
  'QvT5tAQ0B8Vkxd-qFftlzEk2QyfPtgOv',
  'Tpdsg75D4GKCGEHtIiDSL9Zx-DSuX5V8',
  'UNGAdNDmBbQbHihecPPFxwTydTcdFKxL',
  'sv7GIP1K0qcskIKF3iaBmQpaum21vH74',
  'aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH']}

Retrieve one of these collections:

In [5]:
seqcol_client.get_collection("XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk")

{'lengths': [8, 4, 4],
 'names': ['chrX', 'chr1', 'chr2'],
 'sequences': ['SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw',
  'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj',
  'SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6'],
 'sorted_sequences': ['SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6',
  'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj',
  'SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw'],
 'name_length_pairs': [{'length': 8, 'name': 'chrX'},
  {'length': 4, 'name': 'chr1'},
  {'length': 4, 'name': 'chr2'}]}

This gives you the **level 2** representation of the sequence collection, which is the canonical, expanded representation. You can also request the more compact **level 1** representation, which gives you digests for each of the attributes:

In [6]:
seqcol_client.get_collection("XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk", level=1)

{'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX',
 'names': 'Fw1r9eRxfOZD98KKrhlYQNEdSRHoVxAG',
 'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr',
 'sorted_sequences': 'KgWo6TT1Lqw6vgkXU9sYtCU9xwXoDt6M',
 'name_length_pairs': 'B9MESWM8k-hK_OeQK8bZNAG74pLY0Ujq',
 'sorted_name_length_pairs': 'wwE4PUok50YyEF2Ne8BBA5__zk92CZH8'}

These attribute digests are useful because you can use them the same way you use a top-level digest: to look up the value of a specific attribute with the `get_attribute` function.
For example, here we use the `lengths` digest to retrieve just the value of that attribute.
You can see that it matches the expanded version retrieved above:

In [7]:
seqcol_client.get_attribute("lengths", "cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX")

[8, 4, 4]
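The relationship between an attribute value and its digest can be sketched in a few lines. This is a minimal illustration, not the `refget` implementation: it assumes the scheme described in the GA4GH sequence collections specification, where the value is serialized as canonical JSON and digested with `sha512t24u` (SHA-512, truncated to 24 bytes, base64url-encoded). The helper names `sha512t24u` and `attribute_digest` are hypothetical, and compact `json.dumps` output only approximates canonical JSON for simple values such as a list of integers.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH-style digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def attribute_digest(value) -> str:
    """Digest a simple attribute value (sketch under the assumptions above)."""
    # Compact separators and sorted keys approximate canonical JSON
    # for simple values such as lists of integers or ASCII strings.
    canonical = json.dumps(value, separators=(",", ":"), sort_keys=True)
    return sha512t24u(canonical.encode("utf-8"))


# Digest the expanded `lengths` value retrieved above.
lengths_digest = attribute_digest([8, 4, 4])
print(lengths_digest)
```

If the server follows the same scheme, the printed string should correspond to the `lengths` digest used in the `get_attribute` call above. Because the digest depends only on the value, any two collections with the same `lengths` array share the same `lengths` digest.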

We can also discover the available values of a specific attribute with the `list_attributes` function, which lists their digests:

In [8]:
seqcol_client.list_attributes("lengths", page_size=3)

{'pagination': {'page': 0, 'page_size': 3, 'total': 3},
 'results': ['cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX',
  'x5qpE4FtMkvlwpKIzvHs3a02Nex5tthp',
  '7-_HdxYiRf-AJLBKOTaJUdxXrUkIXs6T']}

One of the useful applications of attribute digests is that we can use them to discover other sequence collections that have the same values.
Here's how to get a list of collections that have a certain digest for an attribute:

In [9]:
seqcol_client.list_collections(page=1, 
                               page_size=2, 
                               attribute="lengths", 
                               attribute_digest="cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX")

{'pagination': {'page': 1, 'page_size': 2, 'total': 4},
 'results': ['UNGAdNDmBbQbHihecPPFxwTydTcdFKxL',
  'aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH']}

You can also compare two sequence collections:

In [10]:
seqcol_client.compare(
    "UNGAdNDmBbQbHihecPPFxwTydTcdFKxL",
    "aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH")

{'digests': {'a': 'UNGAdNDmBbQbHihecPPFxwTydTcdFKxL',
  'b': 'aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH'},
 'attributes': {'a_only': [],
  'b_only': [],
  'a_and_b': ['lengths',
   'name_length_pairs',
   'names',
   'sequences',
   'sorted_sequences']},
 'array_elements': {'a': {'lengths': 3,
   'name_length_pairs': 3,
   'names': 3,
   'sequences': 3,
   'sorted_sequences': 3},
  'b': {'lengths': 3,
   'name_length_pairs': 3,
   'names': 3,
   'sequences': 3,
   'sorted_sequences': 3},
  'a_and_b': {'lengths': 3,
   'name_length_pairs': 1,
   'names': 3,
   'sequences': 3,
   'sorted_sequences': 3},
  'a_and_b_same_order': {'lengths': True,
   'name_length_pairs': True,
   'names': False,
   'sequences': True,
   'sorted_sequences': True}}}
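The comparison result is easier to read once it is condensed to one label per attribute. The sketch below is a hypothetical helper, not part of `refget`. It assumes the dictionary layout shown in the output above, and the labels are an informal reading of the counts rather than terms from the specification. The sample input is a trimmed copy of that output, limited to three attributes.

```python
def summarize_comparison(comparison: dict) -> dict:
    """Label each shared attribute by how its elements overlap (hypothetical helper)."""
    elements = comparison["array_elements"]
    summary = {}
    for attr in comparison["attributes"]["a_and_b"]:
        n_a = elements["a"][attr]
        n_b = elements["b"][attr]
        n_shared = elements["a_and_b"][attr]
        same_order = elements["a_and_b_same_order"][attr]
        if n_shared == 0:
            summary[attr] = "no shared elements"
        elif n_shared == n_a == n_b:
            if same_order is True:
                summary[attr] = "identical"
            elif same_order is False:
                summary[attr] = "same elements, different order"
            else:
                # The order flag may be missing or null in some responses.
                summary[attr] = "same elements, order undetermined"
        else:
            summary[attr] = "partial overlap"
    return summary


# Trimmed copy of the comparison output shown above.
sample = {
    "attributes": {"a_only": [], "b_only": [],
                   "a_and_b": ["lengths", "name_length_pairs", "names"]},
    "array_elements": {
        "a": {"lengths": 3, "name_length_pairs": 3, "names": 3},
        "b": {"lengths": 3, "name_length_pairs": 3, "names": 3},
        "a_and_b": {"lengths": 3, "name_length_pairs": 1, "names": 3},
        "a_and_b_same_order": {"lengths": True, "name_length_pairs": True, "names": False},
    },
}

summary = summarize_comparison(sample)
print(summary)
```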

## Using pydantic models

The `refget` package also provides Pydantic models. The `SequenceCollection` model gives you convenient ways to work with sequence collections in Python. From a dictionary representation retrieved from an API, you can construct a Pydantic object like this:

In [11]:
seqcol_dict = seqcol_client.get_collection("XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk")
seqcol = refget.SequenceCollection.from_dict(seqcol_dict)
seqcol

SequenceCollection(digest='XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk', sorted_name_length_pairs_digest='wwE4PUok50YyEF2Ne8BBA5__zk92CZH8')

You can use this object to get the sequence collection in a variety of formats:

In [12]:
seqcol.level2()

{'lengths': [8, 4, 4],
 'names': ['chrX', 'chr1', 'chr2'],
 'sequences': ['SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw',
  'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj',
  'SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6'],
 'sorted_sequences': ['SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6',
  'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj',
  'SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw'],
 'name_length_pairs': [{'length': 8, 'name': 'chrX'},
  {'length': 4, 'name': 'chr1'},
  {'length': 4, 'name': 'chr2'}]}

In [13]:
seqcol.level1()

{'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX',
 'names': 'Fw1r9eRxfOZD98KKrhlYQNEdSRHoVxAG',
 'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr',
 'sorted_sequences': 'KgWo6TT1Lqw6vgkXU9sYtCU9xwXoDt6M',
 'name_length_pairs': 'B9MESWM8k-hK_OeQK8bZNAG74pLY0Ujq',
 'sorted_name_length_pairs': 'wwE4PUok50YyEF2Ne8BBA5__zk92CZH8'}

In [14]:
seqcol.lengths.digest

'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX'

In [15]:
seqcol.itemwise()

[{'name': 'chrX',
  'length': 8,
  'sequence': 'SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw'},
 {'name': 'chr1',
  'length': 4,
  'sequence': 'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj'},
 {'name': 'chr2',
  'length': 4,
  'sequence': 'SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6'}]
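The itemwise view is a transposition of the level 2 arrays: element `i` of each array describes the same sequence. The sketch below shows that relationship with plain dictionaries. The function name `to_itemwise` is hypothetical, and this is not how `refget` implements `itemwise()`. The input reuses the level 2 values shown earlier.

```python
def to_itemwise(level2: dict) -> list:
    """Transpose parallel level 2 arrays into one dict per sequence (sketch)."""
    names = level2["names"]
    lengths = level2["lengths"]
    sequences = level2["sequences"]
    # The arrays are parallel, so they must all have the same number of elements.
    if not (len(names) == len(lengths) == len(sequences)):
        raise ValueError("names, lengths and sequences must have equal length")
    return [
        {"name": name, "length": length, "sequence": sequence}
        for name, length, sequence in zip(names, lengths, sequences)
    ]


# Level 2 values copied from the collection retrieved above.
level2 = {
    "lengths": [8, 4, 4],
    "names": ["chrX", "chr1", "chr2"],
    "sequences": [
        "SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw",
        "SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj",
        "SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6",
    ],
}

items = to_itemwise(level2)
print(items[0])
```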

You can access individual attributes like this:

In [16]:
seqcol.name_length_pairs.digest

'B9MESWM8k-hK_OeQK8bZNAG74pLY0Ujq'

In [17]:
seqcol.name_length_pairs.value

[{'length': 8, 'name': 'chrX'},
 {'length': 4, 'name': 'chr1'},
 {'length': 4, 'name': 'chr2'}]

Because this is a `SQLModel` object, you can also use it to create and interact with a database. You can find reference documentation in the [models](../../models) section.