# Using Ceph from Jupyter notebooks

This notebook shows how to interact with Ceph, as provided by the DataHub team, directly from Jupyter notebooks.

To use Ceph, the `thoth-storages` package must be installed; it provides an adapter for interacting with Ceph. The package also implements other adapters that help you interact with other persistent stores, but we will focus strictly on Ceph in this notebook.

In [1]:
import os

from thoth.storages import CephStore

**Warning:** If you want to use Thoth directly, please use the adapters that encapsulate Ceph handling and ensure data consistency, such as `SolverResultsStore`, `BuildLogsStore` or `AnalysisResultsStore`. This notebook presents the low-level adapter API.

To see which methods the Ceph adapter provides, we can check its Python documentation.

In [2]:
help(CephStore)

Help on class CephStore in module thoth.storages.ceph:

class CephStore(thoth.storages.base.StorageBase)
 |  Adapter for storing and retrieving data from Ceph - low level API.
 |  
 |  Method resolution order:
 |      CephStore
 |      thoth.storages.base.StorageBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, prefix, *, host:str=None, key_id:str=None, secret_key:str=None, bucket:str=None, region:str=None)
 |      Initialize adapter to Ceph.
 |      
 |      Parameters not explicitly provided will be picked from env variables.
 |  
 |  check_connection(self) -> None
 |      Ceph Connection Check.
 |      
 |      Check whether the given connection to the Ceph is alive and healthy,
 |      raise an exception if not.
 |  
 |  connect(self) -> None
 |      Create a connection to the remote Ceph.
 |  
 |  document_exists(self, document_id:str) -> bool
 |      Check if the there is an object with the given key in bucket.
 |      
 |      This check does on

The constructor parameters can be supplied either explicitly when instantiating the adapter or through environment variables (preferred). Values passed to the constructor take priority over the environment. Let's check the constructor's code to see which environment variables apply:

In [3]:
import inspect

lines = inspect.getsourcelines(CephStore.__init__)
print("".join(lines[0]))

    def __init__(self, prefix, *,
                 host: str = None, key_id: str = None, secret_key: str = None,
                 bucket: str = None, region: str = None):
        """Initialize adapter to Ceph.

        Parameters not explicitly provided will be picked from env variables.
        """
        super().__init__()
        self.host = host or os.environ['THOTH_S3_ENDPOINT_URL']
        self.key_id = key_id or os.environ['THOTH_CEPH_KEY_ID']
        self.secret_key = secret_key or os.environ['THOTH_CEPH_SECRET_KEY']
        self.bucket = bucket or os.environ['THOTH_CEPH_BUCKET']
        self.region = region or os.getenv('THOTH_CEPH_REGION', None)
        self._s3 = None
        self.prefix = prefix

        if not self.prefix.endswith('/'):
            self.prefix += '/'
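
The precedence rule in the constructor above boils down to `explicit_value or os.environ[NAME]`. The following is a minimal sketch of that pattern, not the library's code: `resolve_setting`, `normalize_prefix` and the `fake_env` values are hypothetical names made up for illustration, and the environment is passed in as a plain mapping so the example runs without any credentials.

```python
import os
from typing import Mapping, Optional


def resolve_setting(explicit: Optional[str], env_name: str,
                    env: Mapping[str, str] = os.environ) -> str:
    """Return the explicit value if it is truthy, else the environment variable.

    Hypothetical helper that mirrors the `value or os.environ[NAME]` pattern.
    """
    # `or` falls through on None and also on an empty string.
    # A missing variable raises KeyError, just like os.environ[NAME] does.
    return explicit or env[env_name]


def normalize_prefix(prefix: str) -> str:
    """Make sure the prefix ends with a slash, as the constructor does."""
    return prefix if prefix.endswith('/') else prefix + '/'


# Made-up environment used only for this illustration.
fake_env = {'THOTH_CEPH_BUCKET': 'bucket-from-env'}

print(resolve_setting('explicit-bucket', 'THOTH_CEPH_BUCKET', fake_env))
print(resolve_setting(None, 'THOTH_CEPH_BUCKET', fake_env))
print(normalize_prefix('data/notebooks'))
```

Note that only `THOTH_CEPH_REGION` is read with `os.getenv` in the constructor, so it is the only setting that may be absent without raising an error.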



As we don't want to expose credentials in this publicly available notebook, we assume that the environment variables are already set in the running Jupyter environment. We can then simply instantiate the adapter and connect to Ceph:

In [4]:
adapter = CephStore(
    prefix=os.environ['THOTH_CEPH_BUCKET_PREFIX']
)  # prefix should either be provided or picked from environment
adapter.connect()
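
Because the constructor reads most settings with `os.environ[...]`, a missing variable shows up as a `KeyError` at instantiation time. A small pre-flight check can make that easier to diagnose. This is only a sketch: `missing_variables` and `REQUIRED_VARIABLES` are hypothetical names, the variable list is taken from the constructor source and the cell above, and the example values are made up.

```python
import os
from typing import List, Mapping

# Variables read without a default in the constructor, plus the prefix
# variable used when instantiating the adapter in this notebook.
REQUIRED_VARIABLES = (
    'THOTH_S3_ENDPOINT_URL',
    'THOTH_CEPH_KEY_ID',
    'THOTH_CEPH_SECRET_KEY',
    'THOTH_CEPH_BUCKET',
    'THOTH_CEPH_BUCKET_PREFIX',
)


def missing_variables(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


# Made-up environment for illustration; call missing_variables() without
# arguments to check the real environment of the running kernel.
example_env = {
    'THOTH_S3_ENDPOINT_URL': 'https://ceph.example.invalid',
    'THOTH_CEPH_BUCKET': 'example-bucket',
}
print(missing_variables(example_env))
```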

Let's check the connection status:

In [5]:
adapter.is_connected()

True

Let's check whether our document `foo` exists on Ceph:

In [6]:
adapter.document_exists('foo')

True

If the document is not already present, we can create it with some content:

In [7]:
adapter.store_document({'some': 'document'}, 'foo')

{'ResponseMetadata': {'RequestId': 'tx0000000000000000bc10d-005bec4b9b-136a6fe8-default',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"b7d144531216255307a634d8fe75361e"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx0000000000000000bc10d-005bec4b9b-136a6fe8-default',
   'date': 'Wed, 14 Nov 2018 16:21:47 GMT'},
  'RetryAttempts': 0},
 'ETag': '"b7d144531216255307a634d8fe75361e"'}

In [8]:
adapter.document_exists('foo')

True
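
If you want to try the calls shown above without access to a Ceph instance, an in-memory stand-in can help. The class below is a hypothetical sketch, not part of `thoth-storages`: it only mimics the method names used in this notebook, and it assumes that documents are serialized as JSON and that object keys are the prefix followed by the document ID. Raising `KeyError` for unknown IDs is a choice of this stand-in; the real adapter may behave differently.

```python
import json
from typing import Dict, Iterator


class InMemoryStore:
    """Hypothetical in-memory stand-in mimicking the adapter's method names."""

    def __init__(self, prefix: str) -> None:
        # Same trailing-slash normalization as in the constructor shown earlier.
        self.prefix = prefix if prefix.endswith('/') else prefix + '/'
        self._objects: Dict[str, bytes] = {}

    def store_blob(self, blob: bytes, document_id: str) -> None:
        # Assumption: the object key is the prefix followed by the document ID.
        self._objects[self.prefix + document_id] = blob

    def store_document(self, document: dict, document_id: str) -> None:
        # Assumption: dictionaries are stored as JSON-encoded bytes.
        self.store_blob(json.dumps(document).encode(), document_id)

    def document_exists(self, document_id: str) -> bool:
        return self.prefix + document_id in self._objects

    def retrieve_blob(self, document_id: str) -> bytes:
        return self._objects[self.prefix + document_id]

    def retrieve_document(self, document_id: str) -> dict:
        return json.loads(self.retrieve_blob(document_id))

    def get_document_listing(self) -> Iterator[str]:
        # Yield document IDs with the prefix stripped, in sorted key order.
        for key in sorted(self._objects):
            yield key[len(self.prefix):]


store = InMemoryStore(prefix='demo')
print(store.document_exists('foo'))
store.store_document({'some': 'document'}, 'foo')
print(store.document_exists('foo'))
print(store.retrieve_document('foo'))
```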

Now let's explore the documents stored in Ceph:

In [9]:
help(adapter.get_document_listing)

Help on method get_document_listing in module thoth.storages.ceph:

get_document_listing() -> Generator[str, NoneType, NoneType] method of thoth.storages.ceph.CephStore instance
    Get listing of documents stored on the Ceph.



In [10]:
document_count = 0

for _ in adapter.get_document_listing():
    document_count += 1

print(f"Number of documents stored on Ceph: {document_count}")

Number of documents stored on Ceph: 26550
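
Since the listing is a generator, it has no `len()`, which is why the cell above counts items in a loop. A more compact way to express the same count is `sum` over a generator expression. The sketch below uses a made-up `fake_listing` generator in place of `adapter.get_document_listing()` so that it runs without a Ceph connection; `count_items` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def count_items(items: Iterable[object]) -> int:
    """Count the items of any iterable without building a list in memory."""
    return sum(1 for _ in items)


def fake_listing() -> Iterator[str]:
    """Made-up stand-in for a generator that yields document IDs."""
    yield from ('bar', 'foo', 'baz')


print(count_items(fake_listing()))  # prints 3
```

Either way, counting consumes the generator, so ask for a fresh listing if you want to iterate over the IDs afterwards.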


In [11]:
it = adapter.get_document_listing()  # The generator returns document IDs 
document_id = next(it)

document_id  # last document we've inserted

'bar'

In [12]:
next(it)

'foo'
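
Calling `next()` repeatedly works, but when you want to peek at the first few IDs at once, `itertools.islice` is convenient. This is a sketch with a made-up `fake_listing` generator standing in for `adapter.get_document_listing()`; the IDs it yields are illustrative only.

```python
from itertools import islice
from typing import Iterator


def fake_listing() -> Iterator[str]:
    """Made-up stand-in for a generator that yields document IDs."""
    yield from ('bar', 'foo', 'baz')


it = fake_listing()

# Take at most two IDs; islice stops early if the generator runs out.
head = list(islice(it, 2))
print(head)

# The generator keeps its position, so the remaining IDs are still available.
rest = list(it)
print(rest)
```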

In [13]:
from json import JSONDecodeError
from pprint import pprint

try:
    pprint(adapter.retrieve_document(document_id))
except JSONDecodeError:  # the document might be a blob (i.e. raw bytes) rather than JSON
    pprint(adapter.retrieve_blob(document_id))

b'This is some text'


In [14]:
pprint(adapter.retrieve_document(next(it)))  # try next one, feel free to experiment

{'analysis_id': 'package-extract-w7hz7'}
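
The try/except pattern used above (parse as JSON first, fall back to raw bytes) can be shown in isolation. The helper below is a hypothetical sketch that works on bytes you already hold in memory; it does not call the adapter and is not how `retrieve_document` is implemented.

```python
import json
from typing import Any


def decode_payload(raw: bytes) -> Any:
    """Return parsed JSON if the payload is valid JSON, else the raw bytes."""
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        # Not JSON (plain text, an image, ...): hand back the bytes untouched.
        return raw


print(decode_payload(b'{"some": "document"}'))
print(decode_payload(b'This is some text'))
```

One caveat of this approach: a blob that happens to be valid JSON, such as `b'42'`, is returned as the parsed value rather than as bytes.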


As Ceph is an object store, the Ceph adapter also provides low-level operations that work directly on bytes, so you can easily store documents that are not dictionaries, such as text files, images or anything else:

In [15]:
adapter.store_blob('This is some text'.encode(), 'bar')

{'ResponseMetadata': {'RequestId': 'tx0000000000000000bc130-005bec4ba2-136a6fe8-default',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"97214f63224bc1e9cc4da377aadce7c7"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx0000000000000000bc130-005bec4ba2-136a6fe8-default',
   'date': 'Wed, 14 Nov 2018 16:21:54 GMT'},
  'RetryAttempts': 0},
 'ETag': '"97214f63224bc1e9cc4da377aadce7c7"'}
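
The response contains an `ETag`. On S3-compatible object stores, the ETag of an object uploaded in a single, unencrypted request is typically the MD5 hex digest of the stored bytes, wrapped in double quotes. Assuming that holds for this deployment (it is not something this notebook verifies), you could compare the returned value against a locally computed digest. `expected_etag` is a hypothetical helper name.

```python
import hashlib


def expected_etag(payload: bytes) -> str:
    """Return the MD5 hex digest of the payload in double quotes.

    Assumption: the object store reports MD5-based ETags for simple uploads.
    """
    return '"' + hashlib.md5(payload).hexdigest() + '"'


# Compare this with the 'ETag' value returned by the store call.
print(expected_etag('This is some text'.encode()))
```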

In [16]:
adapter.retrieve_blob('bar').decode()

'This is some text'