# CDCS Data Management

This notebook details the functions and interactions for managing data in a CDCS instance. Managing data requires an account on the database so that you can log in to create and modify records.

In [10]:
from pathlib import Path

from cdcs import CDCS

## 1. Class initialization

A CDCS client manager can be initialized by passing it the host URL and authentication information.

Parameters

- __host__: (*str*) URL of the CDCS instance to connect to.
- __username__: (*str, optional*) Username to log in as. If not given, then a prompt will ask for it. An empty string '' username will access the client as an anonymous guest.
- __password__: (*str, optional*) The password associated with the username. If not given and username is not '', then a prompt will ask for the password.
- __certification__: (*str, optional*) The path to a certification file, if needed.
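For scripted use, where an interactive prompt is not wanted, the login parameters could be assembled from environment variables before the client is created. This is only a sketch: the variable names `CDCS_USERNAME` and `CDCS_PASSWORD` and the helper `cdcs_credentials` are hypothetical and not part of the cdcs package.

```python
import os


def cdcs_credentials(env=os.environ):
    """Build username/password keyword arguments from environment variables.

    CDCS_USERNAME and CDCS_PASSWORD are hypothetical names chosen for this
    sketch. A missing username becomes '' (anonymous guest access).
    """
    kwargs = {'username': env.get('CDCS_USERNAME', '')}
    password = env.get('CDCS_PASSWORD')
    # Only pass a password when logging in as a named user
    if kwargs['username'] and password is not None:
        kwargs['password'] = password
    return kwargs


print(cdcs_credentials({'CDCS_USERNAME': 'lmh1', 'CDCS_PASSWORD': 'secret'}))
```

The resulting dictionary could then be unpacked into the constructor, for example `CDCS('https://potentials.nist.gov/', **cdcs_credentials())`. If only the username is set, the password prompt described above would still appear.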

In [2]:
curator = CDCS('https://potentials.nist.gov/', username='lmh1')

Enter password for lmh1 @ https://potentials.nist.gov:········


## 2. Query data

The query() method works the same here as for an anonymous user. Note that the query will return *all* records that you have access to: public records as well as ones assigned to you.

Parameters
- __template__: (*list, str, pandas.Series or pandas.DataFrame, optional*) One or more templates or template titles to limit the search by.
- __title__: (*str, optional*) Record title to limit the search by.
- __keyword__: (*str or list, optional*) Keyword(s) to use for a string-based search of record content.  Only records containing all keywords will be returned. 
- __mongoquery__: (*str or dict, optional*) MongoDB find query to use in limiting searches by record element fields.  Note: only record parsing is supported, not field projection.

Returns
- (*pandas.DataFrame*) All records matching the search request
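As a rough illustration of the mongoquery parameter, a MongoDB find query can be written as a Python dictionary and, if a string is preferred, serialized to JSON. The field path `faq.question` is an assumption based on the FAQ record structure shown in this notebook; the path that a given database actually expects may differ.

```python
import json

# Assumed field path: a top-level "faq" element with a "question" child.
# "$regex" is a standard MongoDB query operator.
mongoquery = {'faq.question': {'$regex': 'LAMMPS'}}

# The parameter is documented as str or dict, so a JSON string of the
# same query is shown here as the alternative form.
mongoquery_str = json.dumps(mongoquery)
print(mongoquery_str)
```

Either form would then be passed as `curator.query(template=template, mongoquery=mongoquery)`.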

Specify a template in the database to interact with

In [6]:
# Note: template should be the name of a template in the database you are accessing!
template = 'FAQ'

Use query to fetch records

In [7]:
records = curator.query(template=template)
records

Unnamed: 0,id,template,workspace,user_id,title,xml_content,last_modification_date,template_title
0,5df2ad2290985100269d7bf5,5df2ad2290985100269d7bf1,5de9613a75d7d40014ffb6fc,2,lammps,"<faq xmlns:xsi=""http://www.w3.org/2001/XMLSch...",2019-12-17T19:58:26.591000Z,FAQ
1,5df2ad2290985100349d7c6f,5df2ad2290985100269d7bf1,5de9613a75d7d40014ffb6fc,2,faq,"<faq xmlns:xsi=""http://www.w3.org/2001/XMLSch...",2019-12-17T19:58:26.850000Z,FAQ
2,5df2ad2390985100269d7bf9,5df2ad2290985100269d7bf1,5de9613a75d7d40014ffb6fc,2,submit,"<faq xmlns:xsi=""http://www.w3.org/2001/XMLSch...",2019-12-17T19:58:27.099000Z,FAQ
3,5df2ad2390985100349d7c73,5df2ad2290985100269d7bf1,5de9613a75d7d40014ffb6fc,2,ref,"<faq xmlns:xsi=""http://www.w3.org/2001/XMLSch...",2019-12-17T19:58:27.329000Z,FAQ
4,5df2ad2490985100349d7c77,5df2ad2290985100269d7bf1,5de9613a75d7d40014ffb6fc,2,manuscript,"<faq xmlns:xsi=""http://www.w3.org/2001/XMLSch...",2019-12-17T19:58:27.558000Z,FAQ
5,5df2ad2490985100269d7bfd,5df2ad2290985100269d7bf1,5de9613a75d7d40014ffb6fc,2,formats,"<faq xmlns:xsi=""http://www.w3.org/2001/XMLSch...",2019-12-17T19:58:27.787000Z,FAQ
6,5df2ad2490985100269d7c01,5df2ad2290985100269d7bf1,5de9613a75d7d40014ffb6fc,2,graphs,"<faq xmlns:xsi=""http://www.w3.org/2001/XMLSch...",2019-12-17T19:58:28.042000Z,FAQ


Pick the first record and view its XML content.

In [8]:
record = records.iloc[0]
content = record.xml_content
print(content)

<faq  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ><question>Where can I download the LAMMPS molecular dynamics software package?</question><answer><![CDATA[<a href="http://lammps.sandia.gov" class="external">LAMMPS</a> (Large-scale Atomic/Molecular Massively Parallel Simulator) is developed and maintained at  <a href="http://www.sandia.gov" class="external">Sandia National Laboratories</a>.]]></answer></faq>
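Because xml_content is a plain XML string, individual fields can be pulled out with the standard library. The sketch below uses an abbreviated copy of the record printed above (the answer text is shortened) so that it runs without a database connection.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the FAQ record content printed above
content = (
    '<faq  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >'
    '<question>Where can I download the LAMMPS molecular dynamics '
    'software package?</question>'
    '<answer><![CDATA[<a href="http://lammps.sandia.gov" '
    'class="external">LAMMPS</a> is developed and maintained at '
    'Sandia National Laboratories.]]></answer></faq>'
)

root = ET.fromstring(content)

# findtext returns the text of the first matching child element;
# the CDATA section is returned as ordinary text
question = root.findtext('question')
answer = root.findtext('answer')

print(question)
print(answer)
```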


## 3. Manage data

A number of class methods have been defined to support retrieving/adding/modifying/deleting records that you own.

### 3.1 Upload a new record

New records can be uploaded to the database using the upload_record() method.

Parameters

- **template** (*str or pandas.Series*) The template or template title to associate with the record.
- **filename** (*str, optional*) Name of an XML file whose contents are to be uploaded.  Either filename or content is required.
- **content** (*str or bytes, optional*) String content to upload. Either filename or content is required.
- **title** (*str, optional*) Title to save the record as.  Optional if filename is given (the title will be taken as the file name without its extension).
- **duplicatecheck** (*bool, optional*) If True (default), then a ValueError will be raised if a record already exists in the database with the same template and title.  If False, no check is performed, which may allow multiple records with the same title to exist in the database.

Use the content from the record queried above to upload a "test" record.

In [9]:
title = 'testrecord1'
curator.upload_record(template=template, title=title, content=content)

record testrecord1 (5e1785f1ab2f7c00263cf484) successfully uploaded.


Alternatively, the names of any local XML files can be specified and the method will automatically read the contents.  Note that if title is not specified, the file name without path and extension will be used.

In [11]:
# Save content to local file
filename = 'testrecord2.xml'
with open(filename, 'w') as f:
    f.write(content)
    
# Upload from file    
curator.upload_record(template=template, filename=filename)

# Delete local file (keep working directory clean)
Path(filename).unlink()

record testrecord2 (5e17880cab2f7c002c3cf413) successfully uploaded.
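The default title described above (file name without path and extension) matches what `pathlib` calls the stem. The helper below is hypothetical and only illustrates that rule; how the library itself handles names with several dots is not confirmed here.

```python
from pathlib import Path


def default_title(filename):
    """Hypothetical helper: file name without directory and final extension."""
    return Path(filename).stem


print(default_title('records/testrecord2.xml'))
```

Passing an explicit title to upload_record() avoids relying on this default.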


### 3.2 Access data records assigned to you

The records that you own can be accessed using the get_records() and get_record() methods. get_records() fetches all records with a matching template and/or title. get_record() fetches a single record if exactly one record matches the given title and template; otherwise it raises an error.

In [12]:
curator.get_records(template=template)

Unnamed: 0,id,template,workspace,user_id,title,xml_content,last_modification_date
0,5df2ac9990985100269d7ad9,5df2ac9890985100349d7ad7,5de9613a75d7d40014ffb6fc,2,potential.1985--Foiles-S-M--Ni-Cu,"<?xml version=""1.0"" encoding=""utf-8""?>\n<inter...",2019-12-17T18:36:01.151000Z
1,5df2ac9990985100339d7ad9,5df2ac9890985100349d7ad7,5de9613a75d7d40014ffb6fc,2,potential.1987--Ackland-G-J-Thetford-R--Nb,"<?xml version=""1.0"" encoding=""utf-8""?>\n<inter...",2019-12-17T18:36:01.458000Z
2,5df2ac9a90985100329d7ad9,5df2ac9890985100349d7ad7,5de9613a75d7d40014ffb6fc,2,potential.1987--Ackland-G-J-Thetford-R--Ta,"<?xml version=""1.0"" encoding=""utf-8""?>\n<inter...",2019-12-17T18:36:01.732000Z
3,5df2ac9a90985100269d7add,5df2ac9890985100349d7ad7,5de9613a75d7d40014ffb6fc,2,potential.1987--Ackland-G-J-Thetford-R--V,"<?xml version=""1.0"" encoding=""utf-8""?>\n<inter...",2019-12-17T18:36:01.990000Z
4,5df2ac9b90985100269d7ae1,5df2ac9890985100349d7ad7,5de9613a75d7d40014ffb6fc,2,potential.1987--Ackland-G-J-Thetford-R--W,"<?xml version=""1.0"" encoding=""utf-8""?>\n<inter...",2019-12-17T18:36:02.327000Z
...,...,...,...,...,...,...,...
1624,5e0a4682ab2f7c00263cf436,5e0a34edab2f7c00263cf2c6,5de9613a75d7d40014ffb6fc,2,2018--S-A-Etesami-M-I-Baskes-M-Laradji-et-al--...,"<?xml version=""1.0"" encoding=""utf-8""?>\n<citat...",2019-12-30T18:48:34.421000Z
1625,5e0a4685ab2f7c002e3cf3a8,5e0a34edab2f7c00263cf2c6,5de9613a75d7d40014ffb6fc,2,2019--M-I-Mendelev--Cu-Zr,"<?xml version=""1.0"" encoding=""utf-8""?>\n<citat...",2019-12-30T18:48:36.987000Z
1626,5e0a4687ab2f7c00263cf43a,5e0a34edab2f7c00263cf2c6,5de9613a75d7d40014ffb6fc,2,2019--M-I-Mendelev--Fe-Ni-Cr,"<?xml version=""1.0"" encoding=""utf-8""?>\n<citat...",2019-12-30T18:48:39.535000Z
1627,5e1785f1ab2f7c00263cf484,5df2ad2290985100269d7bf1,,2,testrecord1,"<faq xmlns:xsi=""http://www.w3.org/2001/XMLSch...",2020-01-09T19:58:41.299000Z


In [13]:
record = curator.get_record(template=template, title='testrecord2')
print(record)

id                                                 5e17880cab2f7c002c3cf413
template                                           5df2ad2290985100269d7bf1
workspace                                                              None
user_id                                                                   2
title                                                           testrecord2
xml_content               <faq  xmlns:xsi="http://www.w3.org/2001/XMLSch...
last_modification_date                          2020-01-09T20:07:40.769000Z
Name: 0, dtype: object


In [14]:
print(record.xml_content)

<faq  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ><question>Where can I download the LAMMPS molecular dynamics software package?</question><answer><![CDATA[<a href="http://lammps.sandia.gov" class="external">LAMMPS</a> (Large-scale Atomic/Molecular Massively Parallel Simulator) is developed and maintained at  <a href="http://www.sandia.gov" class="external">Sandia National Laboratories</a>.]]></answer></faq>
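The "exactly one match" rule that separates get_record() from get_records() can be pictured with a small stand-in. This is not the library's implementation: `get_exactly_one` is a hypothetical helper, plain dictionaries replace DataFrame rows, and the choice of ValueError is an assumption.

```python
def get_exactly_one(records, **criteria):
    """Return the single record whose fields equal all given criteria."""
    matches = [
        rec for rec in records
        if all(rec.get(key) == value for key, value in criteria.items())
    ]
    # Zero matches and multiple matches are both treated as errors
    if len(matches) != 1:
        raise ValueError(f'expected exactly one match, found {len(matches)}')
    return matches[0]


# Plain dicts standing in for rows of the records table
records = [
    {'title': 'testrecord1', 'template': 'FAQ'},
    {'title': 'testrecord2', 'template': 'FAQ'},
]

print(get_exactly_one(records, title='testrecord2', template='FAQ'))
```

Searching by template alone would match both entries here and raise the error, which is why a title is normally needed to single out one record.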


### 3.3 Assign the workspace of record(s)

Initially, records are not part of any workspace. For others to query the records, they must be assigned to a workspace that those users have access to.

**Note:** This requires that your account has access to workspaces.

#### Selecting a workspace

- __get_workspaces__ retrieves the workspaces
- __get_workspace__ retrieves a single workspace based on title

In [15]:
curator.get_workspaces()

In [8]:
curator.get_workspace(title='Global Public Workspace')

id       5dcd581e98caa900137e14ee
title     Global Public Workspace
owner                        None
Name: 0, dtype: object

The attribute __global_workspace__ also retrieves the Global Public Workspace 

In [16]:
workspace = curator.global_workspace
print(workspace)

id       5de6e6165176c3001f4677f0
title     Global Public Workspace
owner                        None
Name: 0, dtype: object


#### Assign workspace

A workspace can be assigned to one or more records using __assign_record_workspace__.

In [17]:
record = curator.get_record(title=title)
curator.assign_record_workspace(record, workspace)

{'title': 'testing1'}
record testing1 (5de7cd925176c30030330863) assigned to workspace Global Public Workspace (5de6e6165176c3001f4677f0)


In [18]:
alluserrecords = curator.get_records()
curator.assign_record_workspace(alluserrecords, workspace)

{}
record testing1 (5de7cd925176c30030330863) assigned to workspace Global Public Workspace (5de6e6165176c3001f4677f0)
record testing2 (5de7ce7e5176c30037963b69) assigned to workspace Global Public Workspace (5de6e6165176c3001f4677f0)


### 3.4 Delete a record

__delete_record__ will delete a single user record. The record must be uniquely identified by either

- passing a record Series, or
- giving a title and template that together match exactly one record


In [16]:
# Delete by identifying with title + template
curator.delete_record(template=template, title='testrecord1')

record testrecord1 (5e1785f1ab2f7c00263cf484) has been deleted.


In [17]:
# Get record Series first
record = curator.get_record(title='testrecord2', template=template)

# Delete by passing Series
curator.delete_record(record)

record testrecord2 (5e17880cab2f7c002c3cf413) has been deleted.


## 4. Manage blobs (raw files)

Create a blob file for testing.

In [22]:
filename = 'test_blob.txt'

with open(filename, 'w') as f:
    f.write('This is my blob for testing')

### 4.1 Upload blob

Upload a blob file with __upload_blob__.

In [23]:
handle = curator.upload_blob(filename=filename)
print(handle)

File "test_blob.txt" uploaded as blob "test_blob.txt" (5de7d1ad5176c300335b7830)
https://test-potentials.nist.gov/rest/blob/download/5de7d1ad5176c300335b7830/


### 4.2 Access blobs

View the metadata for blobs you own with __get_blobs__ and __get_blob__.

__NOTE__: Blobs can be identified by either filename or id, but the behavior differs:

- Identifying by filename only works for user blobs
- Identifying by id works for all blobs available to you (user + allowed workspaces)

In [24]:
curator.get_blobs()

Unnamed: 0,id,user_id,filename,handle,upload_date
0,5de7d1ad5176c300335b7830,5,test_blob.txt,https://test-potentials.nist.gov/rest/blob/dow...,2019-12-04 15:33:01+00:00


In [26]:
blobdata = curator.get_blob(filename=filename)
print(blobdata)

id                                      5de7d1ad5176c300335b7830
user_id                                                        5
filename                                           test_blob.txt
handle         https://test-potentials.nist.gov/rest/blob/dow...
upload_date                            2019-12-04 15:33:01+00:00
Name: 0, dtype: object


In [27]:
curator.get_blob(id=blobdata.id)

id                                      5de7d1ad5176c300335b7830
user_id                                                        5
filename                                           test_blob.txt
handle         https://test-potentials.nist.gov/rest/blob/dow...
upload_date                            2019-12-04 15:33:01+00:00
dtype: object

### 4.3 Access blob contents

- __get_blob_contents__ returns the blob contents as bytes
- __download_blob__ saves the contents locally using the blob's filename (and optional directory)

In [28]:
print(curator.get_blob_contents(filename=filename))

b'This is my blob for testing'


In [29]:
curator.download_blob(filename=filename)
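
Since get_blob_contents returns bytes, text blobs need to be decoded before use, and saving them locally is a matter of writing those bytes to a file. The sketch below uses the byte string printed above and a temporary directory in place of a real download; UTF-8 is an assumption that holds for this test blob but not necessarily for others.

```python
import tempfile
from pathlib import Path

# Same bytes as returned by get_blob_contents in the cell above
raw = b'This is my blob for testing'

# Decode to str, assuming the blob holds UTF-8 text
text = raw.decode('utf-8')
print(text)

# Write the bytes unchanged to a file, roughly what a local download amounts to
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, 'test_blob.txt')
    target.write_bytes(raw)
    roundtrip = target.read_bytes()
```

Binary blobs such as images or archives should be kept as bytes and written with write_bytes rather than decoded.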

### 4.4 Assign blob to workspace

In [30]:
blob = curator.get_blob(filename=filename)
workspace = curator.global_workspace

In [31]:
curator.assign_blob_workspace(blob, workspace)

blob test_blob.txt (5de7d1ad5176c300335b7830) assigned to workspace Global Public Workspace (5de6e6165176c3001f4677f0)


### 4.5 Delete blob

In [29]:
curator.delete_blob(filename=filename)

Successfully deleted blob "blob" (5de5477398caa900282b8226)
