# Data entry and management

In the database section I described the direct database upload process. This is intended only for the chosen few who are allowed to directly access the database. SQL is a cruel beast. It is very easy to blithely overwrite and destroy data if you have write permissions.  

As we were developing trials protocols, data were entered in what can best be described as a semi-structured form. Re-working this into a clean form for database upload was very time consuming for someone (me, primarily). As the trials program matured and protocols solidified, the data entry structure became more regimented. With the expansion of trials we will also be using a range of different Fisheries Observers, so making the data entry process as streamlined as possible assumes greater importance.  

In this section I describe the current approach which uses a combined Spreadsheet/Web upload interface. Fine details may change with each project, but the general approach will still apply. Field observers are still advised to do some basic old-school field data management:

- Keep a field notebook
- In your notebook keep an overview of samples collected (Date, which tow, treatment, net side etc plus comments)
- Manually enter data in the notebook from notes, recordings etc (don't just rely on the computer entry)
- Enter data as soon as possible
- Physically check data entry
- Check the data in the Shiny data upload interface


## The Shiny data upload interface
During my research career I have counted lots of fish. Lots and lots and lots of fish! There is no escaping the need for some sort of manual data entry and QA/QC or data checking procedure. The challenge is to make a data entry interface that is intuitive, avoids repetition as much as possible, yet has as seamless-as-possible upload to some sort of data storage system such as a database. At this stage there also have to be safeguards, so that data cannot be inadvertently over-written due to data entry errors.  

I thought long and hard about how to do this. Conventional online data entry interfaces such as PHP pages are fine for the easy stuff (Vessel, Observer, Date), but not so crash hot for entering 500 fish lengths for 10 species from one sample. Why? Because they take too long to enter in the interface, then upload. The trade-off solution is a validated/restricted spreadsheet, which is uploaded to an R Shiny page with some additional verification and checks. From there the data go into a sandbox database that holds them for the trip until they are manually checked again and uploaded to the Master database.  


````{margin}
```{warning}
I hate spreadsheets! There, I've said it!  
Well, dumb spreadsheets are kind of a necessary evil. They are good for typing lots of numbers in fast.  
However, as soon as you need to make many text entries, and - heaven forbid - use the FILL DOWN function, you need to be disciplined in your data entry and meticulous in your manual checking.  
Spreadsheets are also used by some as analytical tools. I avoid this like the plague. It is too easy for errors to creep in, in completely untraceable ways.
```
````
Notwithstanding my personal distrust of spreadsheets, they are useful in two ways. First, if you have to enter lots of numbers they are quick. Second, anybody with basic ecology/biology/fisheries training is familiar with them. When you are on a boat and have had nowhere near enough sleep to function properly, data entry needs to be as familiar as possible. So... we consort with the devil and use a spreadsheet data entry interface.  

Let's have a look at a working system. The pretty UX component is still a work in progress, but the engineering of it works pretty well.

## Spreadsheet entry interface
The pointy end of the data entry is a spreadsheet, which comes in either Libre/OpenOffice (ods) or Excel (xlsx) versions. For a planned trial, we would pre-populate the key values such as Vessel, Treatment and so on as much as possible. Each identifier has its own cell, and values such as catch weight and fish lengths have their own columns. These are fixed in position on the sheet, and validated as much as possible to reduce errors.  

```{figure} SpreadsheetEntry.png
:height: 400px
:name: SpreadsheetEntry

Data entry sheet. Identifier variables have a specific location on the sheet, with validated dropdown values. The Species lengths have no limit to the number of individuals that can be entered, and additional species columns are added to the end of the sheet as required.

```
There are a few things to note about the example sheet above. First, identifier entries such as Vessel, Observer, Date live in particular cells of the data sheet. *Don't move stuff around!* The program that reads the data in looks for cells by their row-column position, so if you want to change the aesthetics, you will break it. The positions should be locked so the data enterer can't insert/delete cells anyway.  
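
To see why moving cells breaks the reader, here is a minimal sketch in Python (the interface itself is written in R; this is only an illustration). The cell positions, the `read_identifiers` helper and the example values are all hypothetical, since the real layout is not listed here.

```python
# Hypothetical layout: field name -> (row, column), zero-based.
# The real sheet's positions are not documented here; these are made up.
IDENTIFIER_CELLS = {
    "Vessel": (0, 1),
    "Observer": (1, 1),
    "Date": (2, 1),
}


def read_identifiers(grid, layout=IDENTIFIER_CELLS):
    """Pull each identifier from its fixed position in a sheet grid."""
    return {name: grid[row][col] for name, (row, col) in layout.items()}


# A sheet represented as a plain list of rows (label, value).
sheet = [
    ["Vessel", "Example Vessel"],
    ["Observer", "A. Observer"],
    ["Date", "26 Feb 2022"],
]

identifiers = read_identifiers(sheet)

# Inserting a "cosmetic" blank row at the top shifts every value down by one,
# so the reader now picks up the wrong cells without raising any error.
shifted = [["", ""]] + sheet
broken = read_identifiers(shifted)
```

In the shifted sheet the Observer field silently receives the vessel name. A reader that looks up cells by position has no way of knowing the sheet was rearranged, which is why the positions are locked.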

The next thing to note is that, where possible, the permitted values are in dropdown lists. With the exception of the **Notes** column, which is free form text, you should not need to manually enter a text value. The reason for this is that all y'all aren't disciplined enough to maintain case sensitivity, and sometimes trailing or leading spaces get inserted, or maybe 0 (zero) and O (capital letter 'O') look close enough to be interchangeable for some folks. 
````{margin}
```{warning}
Don't laugh!  
I have seen data files where letter O has been used for zero before...
```
````
The other exceptions are the Date and Time cells. Date has to be entered in free form as dd mmm yyyy (e.g. 26 Feb 2022). Excel/LibreOffice will convert this to the correct date format. The reason for this can be traced to 1776. Americans, bless their souls, tend to enter dates as Month/Day/Year. This is a potential source of error because a date is ambiguous whenever the day is less than or equal to 12. So, I have to add this annoying element. Time is a bit more straightforward: HMS in 24 hour time.
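
The ambiguity is easy to demonstrate. The sketch below uses Python's standard `datetime` module purely as an illustration (it is not part of the upload interface), and `parse_sheet_date` is a hypothetical helper name.

```python
from datetime import datetime


def parse_sheet_date(text):
    """Parse a 'dd mmm yyyy' entry such as '26 Feb 2022'.

    %b matches abbreviated month names in the current locale,
    which is English under Python's default settings.
    """
    return datetime.strptime(text.strip(), "%d %b %Y").date()


unambiguous = parse_sheet_date("26 Feb 2022")

# The same all-numeric string read under the two conventions:
numeric = "03/04/2022"
as_day_first = datetime.strptime(numeric, "%d/%m/%Y").date()    # 3 April 2022
as_month_first = datetime.strptime(numeric, "%m/%d/%Y").date()  # 4 March 2022
```

Both numeric readings parse without complaint and land a month apart, whereas spelling out the month leaves only one possible reading.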


The file itself has multiple sheets - one for each Haul/net combination. For specific trips, we will pre-populate these according to the planned sample design. Near the end of the file there will also be a template sheet in case extra sheets need to be added, and a sheet with the dropdown options. This is where we can modify the file for use in different regions or applications. We simply change the species names, and the data entry sheets will read from this range.  

```{figure} DropdownValues.png
:height: 400px
:name: DropdownValues

Dropdown values. The values in each of the dropdown lists and their subsequent validation process are read from a range of cells in the Dropdown value sheet. This is password protected, but can be added to if required.

```

Numeric fields such as the Sample Weight and the fish lengths are validation-locked to numeric and integer values respectively. The species lengths columns are important to understand. The header of each column is a dropdown value from the species list. You do not need to scroll through pre-entered columns to find the species - just use enough columns for the species that were in the sample. This streamlines the data entry process no end.  
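
The same rules can be written down as code, which is handy if the checks are ever repeated on the upload side. This is a minimal Python sketch, not the spreadsheet's actual validation: the species names other than Cod, the function names, and the assumption that weights are non-negative and lengths positive are all mine.

```python
SPECIES_LIST = {"Cod", "Haddock", "Whiting"}  # hypothetical dropdown values


def check_species(header, allowed=SPECIES_LIST):
    """A column header is valid only if it matches a dropdown value exactly."""
    return header in allowed


def check_weight(value):
    """Sample weight: any non-negative number (assumed rule)."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and value >= 0


def check_length(value):
    """Fish length: a positive whole number (assumed rule)."""
    is_integer = isinstance(value, int) and not isinstance(value, bool)
    return is_integer and value > 0
```

With these, `check_species("cod ")` fails on both case and the trailing space, and `check_length("3O")` fails because a letter O has turned the entry into text. Those are exactly the slips the dropdowns and validation locks are there to prevent.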

The **Notes** column is freeform text. This is where you would put in any important information about weirdness such as torn cod ends, paint cans in the net (both of which have happened!) and so on.


## Data upload interface
Uploading data can only be done when there is internet access. The database lives on the Ionos server. The upload interface was designed in R ShinyDashboard, and the aim is to try and keep each section as uncluttered as possible, yet keep similar elements together. Unfortunately these are sometimes conflicting objectives. The interface is also constantly evolving as I add styling elements, so watch this space.  

We first start with a blank upload page:  

```{figure} BlankUploadPage.png
:height: 400px
:name: BlankUploadPage

The upload interface is navigated by a set of menu items in the sidebar. 

```

The first stage is to select your data entry worksheet, and the Sheet number. In a structured trial each sheet in the file will be pre-populated with its *intended* sample identifiers, but Sod's Law being what it is I force the data enterer to manually choose the sheet so that they have to effectively double check they are uploading the correct one. When the file and sheet are selected, the identifier and data fields are read and displayed: 

```{figure} UploadValidationPage.png
:height: 400px
:name: UploadValidationPage

The validation display shows the identifiers and the data fields so that the data enterer can scan and conduct a final check. 

```

I can't stress enough how important it is to scan this page for any hiccups. If there are any errors, go back to the Spreadsheet file and correct them there. When you are satisfied the data are indeed correct, click the *Submit to database* button. Give it a few seconds, and a popup window will confirm the data have been submitted.  

```{note}
This page will be restructured a bit in the near future. There is a bit too much information to scan, so I'll split the catch weight and fish length data into different tabs and bump up the font size a bit.

``` 
Well, the popup window said the data were submitted. How do we know it wasn't lying? To check, we navigate to the Data Uploaded window:

```{figure} UploadedData.png
:height: 400px
:name: UploadedData

The Uploaded Data page reads directly from the sandbox database to show what has really been uploaded. Refresh the page after every upload to check it was successful.

```

```{note}
**But wait!** I'm really tired, I made a mistake in the identifiers and unwittingly overwrote a data entry with gobbledegook! What do I do? 

``` 
One safeguard I have built in is that each upload has a date-time stamp identifier built into the database fields. If there are multiple uploads of the same sheet, or the wrong identifiers were entered in the sheet which could possibly overwrite the 'real' entries, we can identify the entry after the cruise and screen it. You can add a note in the 'real' corrected data to flag it. This is also why the old-school field notebook is important. If there's a screw up, write it down there and then. After 40 tows they all blur into one at the end of a cruise.  
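
One way to picture this safeguard is an append-only table in which every row carries its upload time. The sketch below uses Python's built-in `sqlite3` as a stand-in for the sandbox database. The table name, columns and `submit` helper are hypothetical, and the real schema may well differ.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the sandbox database
conn.execute(
    """CREATE TABLE catch_upload (
           vessel TEXT, haul INTEGER, net_side TEXT,
           sample_weight REAL, uploaded_at TEXT
       )"""
)


def submit(conn, vessel, haul, net_side, sample_weight, uploaded_at=None):
    """Append a row; never update or delete an earlier upload."""
    stamp = uploaded_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO catch_upload VALUES (?, ?, ?, ?, ?)",
        (vessel, haul, net_side, sample_weight, stamp),
    )


submit(conn, "Example Vessel", 7, "Port", 41.5, "2022-02-26T10:15:00+00:00")
# The same identifiers submitted again by mistake, with different data.
submit(conn, "Example Vessel", 7, "Port", 4.15, "2022-02-26T10:22:00+00:00")

# After the cruise: list identifier sets that were uploaded more than once.
duplicates = conn.execute(
    """SELECT vessel, haul, net_side, COUNT(*) AS n
       FROM catch_upload
       GROUP BY vessel, haul, net_side
       HAVING COUNT(*) > 1"""
).fetchall()
```

Because nothing is overwritten, both versions of haul 7 survive, and the time stamps together with the field notebook tell you which one to keep.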


## Photo upload interface
During the 2021 trials one of the biggest time-sucks from my point of view was photo curation. Typically photos of each haul may be taken at one of three stages: **Net**; **Hopper**; and **Sample**. These need to be uploaded to the server, and named with a set of identifiers so that they can be queried and retrieved for each of their respective hauls. This interface enables the user to upload photos from their camera, phone, iPad or whatever device they used and specify the identifiers for each photo. Once they are happy with it, they can upload the photo which will get written to the server, with a corresponding database field entry to allow retrieval.  

```{figure} ImageUpload.png
:height: 400px
:name: ImageUpload

The Image Upload page reads the photo from your device. The user manually selects the identifiers and checks all is correct, then submits the image for upload.

```
As with the data upload, there is an image upload validation page that reads which files are actually on the server. After uploading the image, refresh this page to verify that the image actually made it. If you make a mistake, it can be fixed at the end of the cruise. A date-time stamp is added to each file at this stage, so that if there is an error or duplication, we can locate and correct the problem.  
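
As an illustration of how identifiers and a time stamp can be folded into a file name, here is a small Python sketch. The naming pattern, the `photo_filename` helper and the example values are assumptions for the sake of the example. Only the three stages (Net, Hopper, Sample) come from the description above.

```python
from datetime import datetime

STAGES = {"Net", "Hopper", "Sample"}


def photo_filename(vessel, haul, stage, uploaded_at, extension="jpg"):
    """Build a queryable file name from the photo's identifiers (hypothetical pattern)."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    safe_vessel = vessel.strip().replace(" ", "-")
    stamp = uploaded_at.strftime("%Y%m%dT%H%M%S")
    return f"{safe_vessel}_H{haul:02d}_{stage}_{stamp}.{extension}"


name = photo_filename("Example Vessel", 7, "Hopper", datetime(2022, 2, 26, 10, 15, 0))
# name is "Example-Vessel_H07_Hopper_20220226T101500.jpg"
```

Two uploads of the same haul and stage differ only in the time stamp, so a duplicate or mislabelled photo can be picked out later without anything having been overwritten.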


## Data exploration and checking
The primary validation of the data prior to uploading currently exists in the spreadsheet file. However, graphical feedback is good both for the onboard observer, and as a quick product for the skipper to show that having the observer on board is not a complete waste of time.

```{note}
**TODO...** I will add some more validations prior to clicking the upload button. Feedback from the first deployment of the interface suggested that when the enterer is tired, it becomes a "blah, blah, whatever - Upload" process. I will add some interrogations back to the database to check for potential duplication, and screen for outlier values. 20m long Cod, for example! 

``` 
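
A rough idea of what the outlier screen might look like, sketched in Python. The upper bounds, the centimetre units and the `flag_outliers` helper are hypothetical placeholders; sensible limits would need to be set per species.

```python
# Hypothetical upper bounds in cm; real limits would be set per species.
MAX_LENGTH_CM = {"Cod": 200, "Haddock": 120}


def flag_outliers(lengths_by_species, limits=MAX_LENGTH_CM):
    """Return (species, value) pairs that look implausible."""
    flagged = []
    for species, lengths in lengths_by_species.items():
        limit = limits.get(species)  # None means no limit is known
        for value in lengths:
            if value <= 0 or (limit is not None and value > limit):
                flagged.append((species, value))
    return flagged


# A 20 m Cod entered as 2000 cm, and a zero-length Haddock.
sample = {"Cod": [34, 41, 2000, 38], "Haddock": [27, 0, 31]}
suspect = flag_outliers(sample)
# suspect is [("Cod", 2000), ("Haddock", 0)]
```
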
There are two key data products generated by the observer: The fish lengths per Species and light treatment; and the catch weight per category for each light treatment. Each of these plots will read from the submitted data in the sandbox database, and serve as a progressive report. 
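
Behind a size-frequency plot is a simple tally of lengths into bins for each species and treatment. The Python sketch below shows that tally only. The 5-unit bin width, the treatment labels and the `size_frequency` helper are assumptions, and the real plots are generated in the Shiny interface from the sandbox database.

```python
from collections import Counter


def size_frequency(records, bin_width=5):
    """Count fish per (species, treatment, length bin)."""
    counts = Counter()
    for species, treatment, length in records:
        bin_start = (length // bin_width) * bin_width  # lower edge of the bin
        counts[(species, treatment, bin_start)] += 1
    return counts


# Hypothetical records: (species, light treatment, length)
records = [
    ("Cod", "Light on", 34),
    ("Cod", "Light on", 36),
    ("Cod", "Light on", 38),
    ("Cod", "Light off", 41),
]
freq = size_frequency(records)
```

Each distinct species and treatment pair becomes its own panel, which is why the early plots look sparse until more hauls have been uploaded.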


```{figure} SizeFrequencyGraphs.png
:height: 400px
:name: SizeFrequencyGraphs

The size-frequency plots provide early feedback on the data collection process. They may also help identify any data errors. You won't be able to correct them on the cruise, but they can be noted in the field book. The graphs will facet on the page according to whatever treatments have been sampled during the cruise. So early on they will look a bit scraggly!

```

```{figure} CatchWeight.png
:height: 400px
:name: CatchWeight

Ditto for the catch weights...

```


## Final comments
That completes the introduction to the Trials data entry interface. The next steps will be:  

- CSS styling to make some of the text larger and more readable
- More automated data checks prior to uploading
- Making the spreadsheet more bomb-proof and difficult for all you click and point folks to screw up
- One data element that is *not* currently incorporated is the vessel tracking information from Followmee. This requires quite a bit of user interaction, so will require a bit of work to get it foolproof

The uploading process saves an inordinate amount of time from my side. I would have implemented it earlier, but we were still finalizing protocols. This engineering can be modified relatively easily for any trials we conduct in future.
