# Adding data

This notebook shows how to add a document to the database with the datalayer.

First, we create a synthetic dataframe that we want to save as a database record:

In [1]:
import pandas

df = pandas.DataFrame(dict(date = ["2016-11-10", "2016-11-10", "2016-11-11", "2016-11-11","2016-11-11","2016-11-11","2016-11-11", "2016-11-11" ],
                           time = ["22:00:00", "23:00:00", "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00", "05:00:00"],
                           value = [90, 91, 80, 87, 84,94, 91, 94]))
df['date_time'] = pandas.to_datetime(df['date'] + ' ' + df['time'])
df = df.set_index('date_time')
print(df)

                           date      time  value
date_time                                       
2016-11-10 22:00:00  2016-11-10  22:00:00     90
2016-11-10 23:00:00  2016-11-10  23:00:00     91
2016-11-11 00:00:00  2016-11-11  00:00:00     80
2016-11-11 01:00:00  2016-11-11  01:00:00     87
2016-11-11 02:00:00  2016-11-11  02:00:00     84
2016-11-11 03:00:00  2016-11-11  03:00:00     94
2016-11-11 04:00:00  2016-11-11  04:00:00     91
2016-11-11 05:00:00  2016-11-11  05:00:00     94


Every document requires several parameters, and these must be defined before new data can be added.
The required parameters are the ones shown in the next example.
In addition, any other parameters may be added to the document.

The data is added with the addDocument method:

In [2]:
from hera import datalayer

projectName = "addDataExample"  # must be a string
documentType = "ExampleData"  # must be a string
desc = {"description_A": "A", "description_B": "B"}  # must be a dictionary; contains descriptors of the data
dataFormat = datalayer.datatypes.JSON_PANDAS  # other types are given in the documentation
resource = df.to_json()  # a dynamic field: can point to a specific file in a folder (path) or contain the data itself

new_doc=datalayer.Measurements.addDocument(projectName=projectName, desc=desc, type=documentType, dataFormat=dataFormat, resource=resource)
print(new_doc)

Measurements object


Note that the desc dictionary must not contain a key named "type".
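
A minimal sketch of a check that enforces this constraint before calling addDocument. The helper name check_desc is hypothetical and is not part of hera; it only illustrates the rule stated above.

```python
def check_desc(desc):
    """Return desc unchanged if it is a dict without a "type" key.

    Hypothetical helper for illustration only; not part of hera.
    """
    if not isinstance(desc, dict):
        raise TypeError("desc must be a dictionary")
    if "type" in desc:
        raise ValueError('desc must not contain a key named "type"')
    return desc


# A valid description passes through unchanged.
ok = check_desc({"description_A": "A", "description_B": "B"})

# A description with a "type" key is rejected.
try:
    check_desc({"type": "ExampleData"})
    rejected = False
except ValueError:
    rejected = True
```
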
The allowed data formats are listed in hera.datalayer.datatypes:

-    STRING : any string.
-    TIME   : any date/time object.
-    HDF    : a dask or pandas dataframe in HDF file format.
-    NETCDF_XARRAY : an xarray netCDF.
-    JSON_DICT  : JSON as a Python dict.
-    JSON_PANDAS : JSON as a pandas.DataFrame.
-    GEOPANDAS   : a GIS file format; returned as a geopandas.GeoDataFrame.
-    PARQUET    : a dask or pandas dataframe in parquet format.
-    IMAGE      : any image data format, preferably PNG.

They indicate how to read the data, and therefore must correspond to the type of data located in the resource.
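
To illustrate why the format must match the resource, here is a small sketch that uses only pandas. It assumes that a JSON_PANDAS resource is simply the string produced by DataFrame.to_json(), as in the example above, and that reading it back is comparable to pandas.read_json. How hera reads the resource internally is not shown in this notebook.

```python
import io

import pandas

small = pandas.DataFrame(
    {"value": [90, 91, 80]},
    index=pandas.to_datetime([
        "2016-11-10 22:00:00",
        "2016-11-10 23:00:00",
        "2016-11-11 00:00:00",
    ]),
)

# The resource is a plain JSON string.
resource_text = small.to_json()

# Reading it back with the matching reader recovers the stored values.
restored = pandas.read_json(io.StringIO(resource_text))
restored_values = restored["value"].tolist()
```

A reader that expects a different format (for example, a path to a parquet file) would not be able to interpret this string, which is why dataFormat and resource must agree.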

The added document can be loaded as presented in the "Getting data" notebook.

# Getting data
This notebook shows how to get data with the datalayer.

Let's read the synthetic database record we saved in the "Adding data" example.

After importing the datalayer, you can get the data that fits your requirements.
The commented-out lines below show an example of getting the documents of the experimental data of the Haifa campaign at station Check_Post, instrument Sonic, height 9 (m). The active lines query the record added above by its project name and one of its descriptors.

In [3]:
# projectName = 'Haifa'
# station = 'Check_Post'
# instrument = 'Sonic'
# height = 9

# doc = datalayer.Measurements.getDocuments(projectName=projectName,
#                                       station=station,
#                                       instrument=instrument,
#                                       height=height)

projectName = "addDataExample"
desc=dict(description_A = "A")

docList = datalayer.Measurements.getDocuments(projectName=projectName,**desc)

The result obtained from the query is:

In [4]:
print(docList)

[<Measurements: Measurements object>]
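
The **desc syntax in the query above is standard Python keyword-argument unpacking: each key of the dictionary is passed as a separate keyword argument. The sketch below shows this with a stand-in function. The name fake_get_documents is hypothetical; it only echoes its arguments and does not query any database.

```python
def fake_get_documents(projectName, **query):
    """Hypothetical stand-in that returns the arguments it received."""
    return {"projectName": projectName, "query": query}


desc = dict(description_A="A")

# These two calls are equivalent.
unpacked = fake_get_documents(projectName="addDataExample", **desc)
explicit = fake_get_documents(projectName="addDataExample", description_A="A")
```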


You can now read the data from the returned document and select part of it (for example, a date range):

In [5]:
start = pandas.Timestamp('2016-11-10 23:00:00')
end = pandas.Timestamp('2016-11-11 02:00:00')
data=(docList[docList.count() - 1].getData())
data=data[start:end]

print(data)

                          date      time  value
2016-11-10 23:00:00 2016-11-10  23:00:00     91
2016-11-11 00:00:00 2016-11-11  00:00:00     80
2016-11-11 01:00:00 2016-11-11  01:00:00     87
2016-11-11 02:00:00 2016-11-11  02:00:00     84
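
Note that label-based slicing on a sorted pandas DatetimeIndex includes both endpoints, which is why the 02:00:00 row appears in the output above. A small sketch with pandas only, using a synthetic frame rather than the stored document:

```python
import pandas

frame = pandas.DataFrame(
    {"value": [90, 91, 80, 87, 84, 94]},
    index=pandas.to_datetime([
        "2016-11-10 22:00:00", "2016-11-10 23:00:00", "2016-11-11 00:00:00",
        "2016-11-11 01:00:00", "2016-11-11 02:00:00", "2016-11-11 03:00:00",
    ]),
)

start = pandas.Timestamp("2016-11-10 23:00:00")
end = pandas.Timestamp("2016-11-11 02:00:00")

# Label-based slicing keeps both the start row and the end row.
selected = frame.loc[start:end]
selected_values = selected["value"].tolist()
```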


If the data is returned as a dask dataframe, it can be converted to a pandas dataframe with the '.compute()' function, like this: data = data.compute()

Alternatively, you can pass the argument 'usePandas' with the value True to get the data directly as pandas rather than dask.
(**Should be used only when the data is small**)

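# Update data description

This section shows two ways to update the resource and the description of an existing document.

Before the update: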

In [6]:
print('The resource is: %s' %docList[0].resource)
print('The description is: %s' %docList[0].desc)

The resource is: {"date":{"1478815200000":"2016-11-10","1478818800000":"2016-11-10","1478822400000":"2016-11-11","1478826000000":"2016-11-11","1478829600000":"2016-11-11","1478833200000":"2016-11-11","1478836800000":"2016-11-11","1478840400000":"2016-11-11"},"time":{"1478815200000":"22:00:00","1478818800000":"23:00:00","1478822400000":"00:00:00","1478826000000":"01:00:00","1478829600000":"02:00:00","1478833200000":"03:00:00","1478836800000":"04:00:00","1478840400000":"05:00:00"},"value":{"1478815200000":90,"1478818800000":91,"1478822400000":80,"1478826000000":87,"1478829600000":84,"1478833200000":94,"1478836800000":91,"1478840400000":94}}
The description is: {'description_A': 'A', 'description_B': 'B'}


In [None]:
docobj = docList[0]
newdata1 = dict(docobj.desc)
newdata1['description_C'] = "C1"
resource1 = "resource1"


newdata2 = dict(docobj.desc)
newdata2['description_C'] = "C2"
resource2 = "resource2"
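
The cell above builds each new description with dict(docobj.desc), which makes a shallow copy so that adding a key does not modify the original description. A plain-Python sketch of the difference, with an ordinary dictionary standing in for docobj.desc (an assumption made for illustration):

```python
original = {"description_A": "A", "description_B": "B"}

# dict(...) creates a new dictionary with the same top-level keys.
copied = dict(original)
copied["description_C"] = "C1"

# Plain assignment only creates a second name for the same dictionary,
# so this change is visible through `original` as well.
alias = original
alias["description_D"] = "D"
```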

Method 1: set the new attributes in the object and save. 

In [7]:
docobj.resource = resource1
docobj.desc = newdata1
docobj.save()

Now we check that the database was updated. 

In [7]:
after_update_docList = datalayer.Measurements.getDocuments(projectName=projectName,**desc)
after_update_docobj = after_update_docList[0]
print('The resource is: %s' %after_update_docobj.resource)
print('The description is: %s' %after_update_docobj.desc)

The resource is: resource1
The description is: {'description_A': 'A', 'description_B': 'B', 'description_C': 'C1'}


Method 2: update the document with the update method.

In [8]:
docobj = docList[0]

docobj.update(resource=resource2, desc=newdata2)

Now we reload the object to fetch the current values from the database:

In [8]:
docobj.reload()
print('The resource is: %s' %docobj.resource)
print('The description is: %s' %docobj.desc)

The resource is: resource2
The description is: {'description_A': 'A', 'description_B': 'B', 'description_C': 'C2'}


# Using Project 

Using the Project class simplifies access to the different documents of a project.

Define the project with:

In [9]:
from hera.datalayer import Project 

p = Project(projectName="testProject")

p.simulations

<hera.datalayer.collection.Simulations_Collection at 0x7f45c0375c18>