
Ptolemy (Beta release v0):

A data processing engine that is designed to help with data onboarding within the Filecoin network.

Ptolemy consists of a series of components designed to help process entire existing file structures and allow users to efficiently package data for delivery to Filecoin Storage Providers (from here on referred to as SPs).
This allows users to track and associate legacy filenames with CIDs, build car files of a configurable size, and even invoke deals with deal engines on the network.

To get started with Ptolemy in its current iteration you can use the HTTP interface to configure and run jobs in a mode known as blitz, which attempts to build all car files in parallel. A future version of Ptolemy (coming soon) will add a serial mode to support processing environments where staging capacity is limited.

To better understand how Ptolemy works, it helps to know the handful of components involved in its successful operation. Two existing ecosystem products are used in tandem with Ptolemy:

https://github.com/ipld/go-car - Used to package up the prepared sectors (in this case directories) and create the car file. Once this is installed the path to the go-car binary should be configured in worker.ini.

https://github.com/filecoin-project/go-fil-commp-hashhash - Used to calculate the commp for the newly generated car file. This must be installed prior to use and the path to the stream-commp binary must be configured in the worker.ini file.
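Both tools are written in Go. Assuming a working Go toolchain, they can typically be installed with go install (module paths are taken from each project's repo; verify against the upstream READMEs):

# install the go-car CLI (provides the car binary)
go install github.com/ipld/go-car/cmd/car@latest

# install stream-commp from go-fil-commp-hashhash
go install github.com/filecoin-project/go-fil-commp-hashhash/cmd/stream-commp@latest

# binaries land in $(go env GOPATH)/bin; note these paths for worker.ini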

INSTALL (for Ubuntu 22.04)

# Setup base environment
sudo apt-get update
sudo apt-get install git
sudo apt-get install python3
sudo apt-get install python3-pip

# add python specific libraries
sudo pip3 install fastapi
sudo apt-get -y install python3-psycopg2
sudo pip3 install uvicorn
    
cd /usr/local
sudo git clone https://github.com/schreck23/ptolemy

# database prep
sudo apt -y install postgresql postgresql-contrib
sudo systemctl start postgresql.service
sudo systemctl enable postgresql.service

# configure database for Ptolemy user
# (note) - you can create a db with any name and user account with any
# credentials you wish, however the database.ini will need to be updated
# based on those parameters
sudo -u postgres createuser -d -s ptolemy
sudo -u postgres psql -c "ALTER USER ptolemy PASSWORD 'ptolemy';"
sudo -u postgres createdb ptolemy

CONFIGURE 

To get started open the ptolemy.ini file found in /usr/local/ptolemy and 
set each flag properly as defined below:

ip_addr -> The IP address you wish the orchestrator to listen on. To
accept connections on all interfaces, use 0.0.0.0; the machine's external
IP will work as well.

port -> the port you wish to have the Ptolemy orchestrator listen on.

api_threads -> the number of threads you wish to allocate to listen for
incoming API requests

threads -> number of threads allocated to the scan process for the 
ptolemy orchestrator
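For reference, a minimal ptolemy.ini built from the flags above might look
like the following; the values are illustrative and the section header
name is an assumption (keep whatever the shipped file uses):

# assumed section name; keep whatever the shipped ptolemy.ini uses
[orchestrator]
# listen on all interfaces
ip_addr = 0.0.0.0
# illustrative port
port = 8000
# threads servicing API requests
api_threads = 4
# threads for the orchestrator scan process
threads = 8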

Upon completing the configuration for the orchestrator, the worker must be
configured. There are two sections in this config file
(/usr/local/ptolemy/worker.ini): the orchestrator section and the worker
section.

orchestrator section:
ip_addr -> the ip address of the Ptolemy orchestrator node.
port -> the listener port for the Ptolemy orchestrator

worker section:
ip_addr -> the IP address for the host node. Due to the heartbeat
mechanism the 0.0.0.0 address cannot be used here, and only one worker per
IP is allowed.

port -> port we wish to have the worker listen on.

threads -> number of threads we wish to use for creating our car files in
parallel

api_threads -> number of threads you wish to allocate to listen for http
requests

car_gen -> path to go-car binary to use for making carfiles

commp -> path to the commp binary to use for calculating commp on car
files
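Putting these together, a worker.ini sketch with illustrative values (the
two section names follow the description above; the binary paths assume
the go install step from earlier):

[orchestrator]
# where the Ptolemy orchestrator is listening
ip_addr = 192.168.1.100
port = 8000

[worker]
# a real host IP is required; 0.0.0.0 breaks the heartbeat
ip_addr = 192.168.1.101
port = 8001
# threads building car files in parallel
threads = 8
# threads servicing HTTP requests
api_threads = 4
# paths to the binaries installed earlier (illustrative locations)
car_gen = /root/go/bin/car
commp = /root/go/bin/stream-commp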

We must also set up our database credentials in the
/usr/local/ptolemy/database.ini file.

host -> the host where postgres is currently installed

db_name -> the name of the db where we want to place the tables for 
ptolemy (the default is ptolemy but if a different one was used during
install place it here)

db_user -> the user for all ptolemy interactions; if you created a
different user name above, use it here

pass -> the password for the user, by default this would be ptolemy but if
a different password was configured above use it here.
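A database.ini matching the defaults from the install steps would look
like this (the section header name is an assumption; the four keys are
those described above):

# assumed section name; keep whatever the shipped database.ini uses
[postgresql]
host = localhost
db_name = ptolemy
db_user = ptolemy
pass = ptolemy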

The last config file in /usr/local/ptolemy is the logging.ini file.  This
will configure the python logger for ptolemy and the worker which will 
share a common logfile.

log_level -> The desired logging level we wish to use, options include 
INFO, DEBUG, WARNING, ERROR & CRITICAL

format -> used to configure the format of the log message

datefmt -> can be used to customize the format of the time stamp 
associated with the log message

logfile -> path and filename for logging messages, the default is 
/tmp/ptolemy.log

It should be noted that all entries in this configuration file should
follow standard Python logging conventions. This includes doubling
percent signs (%%) where needed, since the config parser treats a single
% as an interpolation character that must be escaped. This allows users
to configure the logging service to produce logs that may be shared with
other applications like Splunk for monitoring.
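A logging.ini sketch using the four flags above (the section header and
flat key layout are assumptions; note the doubled percent signs):

# assumed section name; keep whatever the shipped logging.ini uses
[logging]
log_level = INFO
# percent signs doubled so the config parser does not interpolate them
format = %%(asctime)s %%(levelname)s %%(name)s: %%(message)s
datefmt = %%Y-%%m-%%d %%H:%%M:%%S
logfile = /tmp/ptolemy.log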

USE PTOLEMY

Now that we have Ptolemy configured it is time to create a project and
generate the respective car files. For this example we will use a host of
192.168.1.100 as our orchestrator IP. It should be noted that if the
worker is properly configured to communicate with the orchestrator, it
will auto-register to receive instructions.

The Ptolemy startup will be placed in a system startup script in the future
but for now the operating instructions are quite simple.

cd /usr/local/ptolemy
sudo python3 ptolemy.py

If we are running the worker on the same node then also do the following:
cd /usr/local/ptolemy
sudo python3 worker.py
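Both commands run in the foreground. Until the startup script lands, one
illustrative way to keep them running after you log out is to background
them with nohup (a sketch, not the project's documented approach):

cd /usr/local/ptolemy
sudo nohup python3 ptolemy.py > /dev/null 2>&1 &
sudo nohup python3 worker.py > /dev/null 2>&1 &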

Now that Ptolemy is running we will configure a project and launch it.
To do this we will use the API to configure the project, scan the file
system, align the files in containers for car files, and then execute the
build. It should be noted that each command issued to the API returns a
positive or negative response immediately to indicate whether the step
failed or a background process has been launched.

At any given point a project's status can be obtained by using the 
following RESTful call:

curl -X GET http://<Ptolemy ip>:<Ptolemy port>/v0/projects/

This will return a list of all the projects maintained by Ptolemy and 
their respective status.

To understand how to configure a project we need to understand the
metadata associated with the configuration:

shard_size - sets a threshold for the largest file size in GiB; it
accepts an integer between 1 and 31. Files larger than shard_size are
broken into chunks of this size. For example, with a shard_size of 1, a
10 GiB file will result in 10 chunks.

staging_dir - a directory set aside for local cache where files can be 
chopped and car files can be built and placed for extraction.

target_dir - the directory where our dataset lives (typically an
unstructured dataset in a filesystem) that we wish to convert to car
files.

car_size - an integer between 1 and 31 that will allow a user to determine 
how large the generated cars should be in GiBs.

encryption - currently not implemented but in a future ptolemy release 
users will be able to specify a key alias representing a specific key the 
user wishes to encrypt data with if data privacy is desired.

load_type - blitz vs serial, currently only blitz is supported but serial 
mode will be released soon.

Now that we understand our metadata requirements, let's create a
fictitious project named delta with the following metadata:

shard_size = 1 GiB
staging_dir = /srv/delta-staging/shrek-staging/ptolemy-test
target_dir = /srv/delta-staging/shrek-staging/radiant-data-to-delta
car_size = 17 GiBs
encryption = None
load_type = blitz

To create and prepare this project issue the following command (split
across lines with shell continuations so it can be pasted as-is):

curl -X POST http://<ptolemy ip address>:<ptolemy port>/v0/create/delta \
-H 'Content-Type: application/json' \
-d '{"shard_size":1,
"staging_dir":"/srv/delta-staging/shrek-staging/ptolemy-test",
"target_dir":"/srv/delta-staging/shrek-staging/radiant-data-to-delta",
"car_size":17,"encryption":"None","load_type":"blitz"}'

Now that we have defined our project we are now ready to scan the 
filesystem and capture the relevant metadata:

curl -X POST http://<ptolemy ip address>:<ptolemy port>/v0/scan/delta

When this background task is done the job state will be updated to
"scanning complete". Using the status command above we can check when
it's done, or we can look directly in the postgres DB if we choose; the
data is preserved in the ptolemy_projects table.
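For instance, with the default database settings from the install steps,
the table can be inspected directly (a sketch; the column layout will
depend on your Ptolemy version):

sudo -u postgres psql -d ptolemy -c "SELECT * FROM ptolemy_projects;"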

Once scanning is done we containerize and create our sector mapping by
using the following command:

curl -X POST http://<ptolemy ip address>:<ptolemy port>/v0/containerize/delta

When the background task is complete the job status will show
"containerization complete"; we can then begin making car files with our
new containers.

In the beta version only blitz mode is active, so once containerization
is complete, build your carfiles:

curl -X POST http://<ptolemy ip address>:<ptolemy port>/v0/blitz/delta

This will place your carfiles in your staging area, calculate the root
CID and commp and place that metadata in the database.  To obtain the
carfile metadata use the following command:

curl -X GET http://<ptolemy ip address>:<ptolemy port>/v0/carfile_meta/delta
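If you installed the go-car CLI earlier, you can also sanity-check a
generated car file straight from the staging area (a sketch; the
subcommand set depends on your go-car version, and the path is
illustrative):

# print the root CID recorded in a car file's header
car root /srv/delta-staging/shrek-staging/ptolemy-test/<carfile>.car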

ROADMAP

Ptolemy beta is currently in version v0 and all API calls are categorized 
using this nomenclature.  Any new major version will use a new set of 
URLs with a v<major version> tag as new features are added.  Deprecation
periods will be provided as new versions replace old versions so as not 
to cause significant API impacts.

Ptolemy V1 plans:

- Direct-from-S3 egress and car build.  Already in development; it will
be designed to support S3 open data with no credentials as well as
traditional AWS v2/v4 credentials.

- Preliminary UI release is scheduled for V1 (building off the delta and
delta-dm UI frameworks)

- File system listener that will capture changes to a scanned filesystem 
and track them for reconciliation.

- Storage protocol support for NAS like NFS/CIFS mounting

- Updates for logging and better state management and error pathing 
(still a WIP)

Ptolemy V2 plans:

- Other cloud targets (potentially Azure and GCP and others)

- Car v2 support

- Direct to delta integration for end to end automated pipeline flow
