# dbCamHD

This notebook is an attempt to create a metadata database using the pycamhd library.

#### Setup environment

In [None]:
%matplotlib inline
import pycamhd as camhd
import numpy as np
import pandas as pd

#### Get a list of files from the server
This operation can take 10 to 15 minutes and sometimes fails when the server has problems, so I don't recommend running it often, if at all. Now that the dbcamhd.json file exists, I have converted these cells to raw cells so they do not run automatically. The next few cells should not need to be rerun until CamHD comes back online and begins generating new video files.
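
One way to avoid rerunning the slow listing by accident is to guard it behind a check for the cached file. The sketch below is only an illustration: `load_or_build` and its `build` argument are hypothetical names, not part of pycamhd, and `build` stands in for whatever code produces the dataframe from the server listing.

```python
import os
import pandas as pd

def load_or_build(path, build):
    """Load the cached dataframe at `path`; call `build()` only if it is missing.

    `build` is a hypothetical zero-argument callable that returns a DataFrame,
    for example a wrapper around the slow server listing.
    """
    if os.path.exists(path):
        # Cache hit: read the JSON Lines file written on an earlier run.
        return pd.read_json(path, orient="records", lines=True)
    # Cache miss: do the slow work once and save the result for next time.
    df = build()
    df.to_json(path, orient="records", lines=True)
    return df
```

With this pattern the expensive call runs only when `dbcamhd.json` is absent, so deleting the file is the explicit way to force a refresh.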

#### Create a Pandas dataframe from these lists

#### Get some additional information about the files
This cell takes a couple of hours to run in a single thread, which is why it is commented out. How much faster would it go using Dask Delayed and a bunch of Dask workers?
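
As a rough way to explore that question, and assuming the per-file work is dominated by waiting on the server, the loop could be spread over several workers. The sketch below uses the standard library's `concurrent.futures.ThreadPoolExecutor` as a stand-in for Dask Delayed (it is not Dask), and `fetch_info` is a hypothetical placeholder for the real per-file metadata call. Any actual speedup would depend on how many concurrent requests the server tolerates.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_info(filename):
    """Hypothetical placeholder for the slow per-file metadata request."""
    return {"filename": filename, "name_length": len(filename)}

def fetch_all(filenames, max_workers=8):
    """Run fetch_info over all filenames in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input filenames.
        return list(pool.map(fetch_info, filenames))

results = fetch_all(["a.mov", "bb.mov", "ccc.mov"])
```

Because `pool.map` keeps the input order, the results can be added to the dataframe as columns without re-sorting.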

#### Add these to the dbcamhd dataframe

#### Save dataframe to JSON file

#### Load dataframe from JSON file (do this instead of camhd.get_file_list())
Here we load the dbcamhd JSON file that contains the results from the above cells.

In [None]:
dbcamhd = pd.read_json('dbcamhd.json', orient="records", lines=True)
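
One thing to watch when loading: pandas documents that `read_json` may, by default, try to parse columns whose label begins with `timestamp` as dates. The sketch below uses made-up rows (not real CamHD records) to show a JSON Lines round trip with `convert_dates=False`, which keeps such a column as integer epoch seconds. Whether the default conversion affects `dbcamhd.json` depends on the data and the pandas version, so treat this as a precaution rather than a required change.

```python
import io
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "filename": ["a.mov", "b.mov"],
    "timestamp": [1500000000, 1500000900],
})

# orient="records" with lines=True writes one JSON object per line (JSON Lines).
text = df.to_json(orient="records", lines=True)

# convert_dates=False leaves the integer epoch seconds untouched on load.
restored = pd.read_json(io.StringIO(text), orient="records", lines=True,
                        convert_dates=False)
```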

In [None]:
dbcamhd.tail()

#### Plot histograms of MOV sizes and frame counts

In [None]:
import holoviews as hv
hv.extension('bokeh')
from bokeh.plotting import figure, show

frequencies, edges = np.histogram(dbcamhd['filesize']/1024/1024/1024, bins=np.linspace(0,20,100))

p = figure(title="MOV Size Distribution")
p.quad(top=frequencies, bottom=0, left=edges[:-1], right=edges[1:], fill_color="blue", line_color="black")
p.xaxis.axis_label = 'Filesize (GB)'
p.yaxis.axis_label = 'N'
show(p)
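
The `quad` call relies on `np.histogram` returning one more edge than it has counts, so `edges[:-1]` and `edges[1:]` give the left and right side of each bar. A small sketch with made-up file sizes (in bytes) shows the shapes involved. Note that dividing by 1024 three times yields GiB (1024**3 bytes) rather than decimal GB.

```python
import numpy as np

# Made-up file sizes in bytes, for illustration only.
sizes_bytes = np.array([1, 2, 2, 13], dtype=np.int64) * 1024**3

# Convert bytes to GiB and count files in 1 GiB-wide bins from 0 to 20.
counts, edges = np.histogram(sizes_bytes / 1024**3, bins=np.linspace(0, 20, 21))

# Per-bar left and right edges, as passed to quad().
left, right = edges[:-1], edges[1:]
```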

In [None]:
frequencies, edges = np.histogram(dbcamhd['frame_count'], bins=np.linspace(0,50000,100))

p = figure(title="Frame Count Distribution")
p.quad(top=frequencies, bottom=0, left=edges[:-1], right=edges[1:], fill_color="blue", line_color="black")
p.xaxis.axis_label = 'Frame count'
p.yaxis.axis_label = 'N'
show(p)

#### Plot histogram of timestamp deltas

In [None]:

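# Keep rows with a plausible Unix timestamp (same cutoff as min_t below), then take successive differences
deltas = dbcamhd.timestamp.loc[dbcamhd.timestamp>1400000000].diff().dropna()
# One-second bin edges; `range=` is dropped because np.histogram uses the explicit edges when `bins` is a sequence
frequencies, edges = np.histogram(deltas, bins=range(10000))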

p = figure(title="Timestamp Delta Distribution", x_range=(0, 1000))
p.quad(top=frequencies, bottom=0, left=edges[:-1], right=edges[1:], fill_color="blue", line_color="black")
p.xaxis.axis_label = 'Delta (s)'
p.yaxis.axis_label = 'N'
show(p)
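
For reference, `diff()` subtracts each timestamp from the one after it, so the first entry has no predecessor and comes out as NaN. A minimal sketch with made-up timestamps:

```python
import numpy as np
import pandas as pd

# Made-up Unix timestamps (seconds), for illustration only.
ts = pd.Series([1500000000, 1500000900, 1500001800, 1500012600])

deltas = ts.diff()      # first element is NaN: nothing to subtract from
gaps = deltas.dropna()  # 900, 900 and 10800 seconds

# One-second-wide bins with edges 0..9999; values beyond the last edge
# (here 10800) are not counted.
counts, edges = np.histogram(gaps, bins=np.arange(10000))
```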

#### Plot subset of frame_counts

In [None]:
p = figure(x_axis_type='datetime', y_range=(0, 30000))
min_t = 1400000000

dates = pd.to_datetime(dbcamhd['timestamp'][dbcamhd.timestamp>min_t],unit='s')
frame_count = dbcamhd['frame_count'][dbcamhd.timestamp>min_t]

# scatter() draws circle markers by default; newer Bokeh releases deprecate circle() with size=
p.scatter(dates, frame_count, size=1)
p.xaxis.axis_label = 'Date'
p.yaxis.axis_label = 'Frame Count'
show(p)
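
The `min_t` cutoff is a Unix timestamp in seconds. A quick check of the date it corresponds to, using the same `pd.to_datetime(..., unit='s')` conversion as the cell above:

```python
import pandas as pd

min_t = 1400000000
cutoff = pd.to_datetime(min_t, unit='s')
print(cutoff)  # 2014-05-13 16:53:20
```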

#### Add deployment numbers to database

See the [asset management](https://github.com/ooi-integration/asset-management/blob/master/deployment/RS03ASHS_Deploy.csv) page for deployment information.

In [None]:
import pandas as pd

In [None]:
dbcamhd = pd.read_json('dbcamhd.json', orient="records", lines=True)
dbcamhd.filename[0]

In [None]:
dbcamhd.tail()

In [None]:
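# Convert Unix timestamps (seconds) to datetimes for comparison with deployment start times
dt = pd.to_datetime(dbcamhd.timestamp, unit='s')
# Initialize the column, then assign deployment numbers by date range
dbcamhd['deployment'] = 0
dbcamhd.loc[dt < '2016-07-26 21:18:00', 'deployment'] = 2
dbcamhd.loc[dt >= '2016-07-26 21:18:00', 'deployment'] = 3
# Later start time overrides deployment 3 for rows on or after it
dbcamhd.loc[dt >= '2017-08-14 06:00:00', 'deployment'] = 4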
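
The same thresholds can also be expressed as a sorted list of deployment start times. The sketch below is an alternative formulation, not the method used in this notebook: it converts the two start times used above to epoch seconds and uses `np.searchsorted` to count how many start times each timestamp has passed. The toy timestamps are made up to sit on either side of the boundaries, and treating the start times as UTC is an assumption.

```python
import numpy as np
import pandas as pd

# Start times of deployments 3 and 4, copied from the cell above (assumed UTC).
starts = pd.to_datetime(['2016-07-26 21:18:00', '2017-08-14 06:00:00'])
# Integer epoch seconds, directly comparable with the `timestamp` column.
start_s = np.asarray((starts - pd.Timestamp(0)) // pd.Timedelta(seconds=1))

# Toy timestamps: just before deployment 3, exactly at its start, and at the start of 4.
toy = pd.DataFrame({'timestamp': [start_s[0] - 1, start_s[0], start_s[1]]})

# side='right' makes a timestamp equal to a start time belong to the new deployment (>=).
toy['deployment'] = np.searchsorted(start_s, toy['timestamp'].to_numpy(), side='right') + 2
```

With this form, a further deployment would only need one more entry in `starts`.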

In [None]:
dbcamhd.tail()

In [None]:
dbcamhd.to_json('dbcamhd.json', orient="records", lines=True)

### References

https://github.com/tjcrone/pycamhd<br>
https://rawdata.oceanobservatories.org/files/RS03ASHS/PN03B/06-CAMHDA301/<br>
https://pandas.pydata.org/