The Ushichka dataset consists of several file types spread across multiple hard disks. This notebook compares the disks and checks how many unique files there are.

Ushichka was split over multiple hard disks for two broad reasons: 1) to keep back-ups and 2) because of space limitations on the first hard disk. The dataset has two 'fieldseason' phases, 1 and 2: Phase 1 covers everything before the Lund visit, and Phase 2 everything after it.

The four disks are named:

1. Ushichka dataset harddisk #1
1. Ushichka dataset harddisk #2
1. Ushichka dataset harddisk #3
1. Ushichka dataset harddisk #4

...and are referred to below as #1-4 for short.

Here I discuss only the data collected in the 'main' recording volume (see Beleyur & Goerlitz 2021 (in prep) or Beleyur PhD Thesis, Uni. Konstanz), recorded on the following dates:

* 2018-06-19
* 2018-06-21
* 2018-06-22
* 2018-07-14
* 2018-07-21
* 2018-07-25
* 2018-07-28
* 2018-08-02
* 2018-08-14
* 2018-08-17
* 2018-08-18
* 2018-08-19

Aside from this data, there is actually a *lot* more data of free-flying bats, recorded with the audio-video array system at various field sites in and outside of caves.




The raw data on the drives is typically organised in the following folder pattern.

### Typical Audio and Video raw data organisation

```
MAIN DISK:
    * fieldwork_2018_001
        * actrackdata
            * wav
                # one folder per starting date and session number. 
                # the 'starting date' remains the same past midnight even though the actual date
                # has changed.
                * YYYY-MM-DD_sss
                    * FILE1.WAV
                    * FILE2.WAV
                    * ..........
            * weather
        * video
            * YYYY-MM-DD
                # one folder per camera
                * K1
                    # first folder made on switching is always P0000000
                    # all following folders are +1
                    * P0000000
                        # same folder numbering system . First recording triggered is always
                        # stored as 00000000.TMC and +1 for every new file triggered.
                        * 00000000.TMC
                        * 00000001.TMC
                        * .........
                    * P0000001
                    * ......
                    
                * K2
                * K3
        * notebook_scans
            * YYYY-MM-DD
                * notebook photos, scans, additional notes/observations
```
In addition to the *raw* data, some folders in ```video/yyyy-mm-dd/``` also hold the DLTdv wand calibrations, the exported .avi videos, and even a few bat trajectories.


### LiDAR data

In addition to the audio and video data, the LiDAR scans are stored on disk #2. The whole folder is 6.72 GB.

```
MAIN DISK:
    * Orlova_Chuka_LiDAR
        * Exports
            * various formats of the LiDAR scan - .ply, .dxf and a compressed folder 'OrlovaChukaTotalPointCloud.rar'
        # zipped folder containing a Word file explaining how the raw data was processed. 
        * Re _Orlova_Chuka_data.zip
```
Asparuh Kamburov also has the original data and can be contacted for a copy if necessary.


### Dates for ``` fieldwork_2018_001\```

```\actrackdata\wav```

|Disk #|  Disk #1  |  Disk #2   |   Disk #3    |   Disk #4    |
|------|-----------|------------|--------------|--------------|
|Start |2018-06-19 | 2018-06-19 | non-Ushichka |  2018-06-19  |
|  End |2018-07-25 | 2018-07-28 | non-Ushichka |  2018-07-28  |

```\video```

|Disk #|  Disk #1  |  Disk #2   | Disk #3   |   Disk #4    |
|------|-----------|------------|-----------|--------------|
|Start |2018-06-19 | 2018-04-09 |2018-04-09 |non-Ushichka     |
|  End | 2018-07-25| 2018-07-25 |2018-05-01 | non-Ushichka    | 



### Dates for ``` fieldwork_2018_002\```

```\actrackdata\wav```

|Disk #|  Disk #1  |  Disk #2   | Disk #3    |   Disk #4    |
|------|-----------|------------|------------|--------------|
|Start | No data   | 2018-07-28 | 2018-07-28 |  2018-07-28  |
|  End | No data   | 2018-08-17 | 2018-08-19 |  2018-08-19  | 

```\video```

|Disk #|   Disk #1 |  Disk #2  |    Disk #3   |   Disk #4    |
|------|-----------|-----------|--------------|--------------|
| Start|  No data  |2018-07-28 |  2018-07-28  |  2018-07-28  |
|  End |  No data  |2018-08-19 |  2018-08-19  |  2018-08-19  | 
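
The start and end dates in the tables above can in principle be read off the folder names. Below is a minimal sketch of that idea, assuming the folder names begin with a `YYYY-MM-DD` date as in the folder organisation described earlier. The function `date_range` and the example folder names are hypothetical and only for illustration.

```python
import datetime as dt
import re

# Assumption: session folders look like 'YYYY-MM-DD_sss' and video folders like 'YYYY-MM-DD'
DATE_PREFIX = re.compile(r'^(\d{4}-\d{2}-\d{2})')

def date_range(folder_names):
    """Return (earliest, latest) date found in the folder names, or None if no name has a date."""
    dates = []
    for name in folder_names:
        match = DATE_PREFIX.match(name)
        if match is None:
            continue  # skip folders that do not start with a date
        try:
            dates.append(dt.date.fromisoformat(match.group(1)))
        except ValueError:
            continue  # skip strings that look like a date but are not valid
    if not dates:
        return None
    return min(dates), max(dates)

# made-up folder names for illustration
example_folders = ['2018-06-19_001', '2018-07-25_003', 'weather', '2018-06-21_001']
print(date_range(example_folders))
```

On a real drive, the folder names could come from `os.listdir` on the relevant `wav` or `video` folder.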


In [1]:
import os
import glob
import sys 

import numpy as np

In [81]:
drives = ['D:/','E:/','F:/','G:/']

In [63]:
# split all paths into their session folder and the filename
# check the difference in the files
def only_sessionfolder_filename(full_path):
    restpath, filename = os.path.split(full_path)
    restpathm1, session = os.path.split(restpath)
    return os.path.join(session,filename)
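
The helper above relies on `os.path.split`, which only treats `\` as a separator when run on Windows. As a hedged alternative, the sketch below uses `pathlib.PureWindowsPath`, so the same session/filename key comes out on any operating system. The function name `session_and_filename` and the example path are made up for illustration.

```python
from pathlib import PureWindowsPath

def session_and_filename(full_path):
    """Return 'session/filename' from a full path (hypothetical helper)."""
    # PureWindowsPath accepts both '/' and '\\' as separators on any OS
    parts = PureWindowsPath(full_path).parts
    return '/'.join(parts[-2:])

# illustrative path with mixed separators, as glob may return on Windows
example = 'D:/fieldwork_2018_001/actrackdata/wav\\2018-06-19_001\\FILE1.WAV'
print(session_and_filename(example))
```

Joining with `/` keeps the keys identical across drives and platforms, which is all the set comparison needs.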

## What is where ? ```fieldwork_001``` audio and video data 

In [64]:
# fieldseason 01 all wav files on drives #1 and #2:
fieldseason1_path = 'fieldwork_2018_001/actrackdata/wav/*/*.wav'
all_wav_files = [glob.glob(drive+fieldseason1_path) for drive in drives[:2]]
# number of wav files found per drive
print([len(each) for each in all_wav_files])

[5276, 5437]


In [65]:
# session and file paths for #1 and #2
session_and_files = [ list(map(only_sessionfolder_filename, each))  for each in all_wav_files]

In [66]:
# which fieldwork_001 wav files in #2 are not in #1?
file_diffs_season1 = set(session_and_files[1]).difference(set(session_and_files[0]))

In [69]:
print('Number of unique files in Drive#2',len(file_diffs_season1))

Number of unique files in Drive#2 161


### ```fieldwork_001/actrackdata/wav``` summary:
Drive #2 has all the fieldwork_001 WAV files in #1 plus 161 more, especially from the later sessions after 2018-07-25. I also checked that #4 has the same total file size as #1, so #1 and #4 essentially hold the same data for fieldwork_001/actrackdata/wav.
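
The set difference computed earlier only lists files that are on #2 but not on #1. The file counts fit #1 being a subset of #2 (5437 - 5276 = 161), and looking at the difference in both directions would confirm this directly. A minimal sketch with made-up file names follows; `compare_file_sets` is a hypothetical helper.

```python
def compare_file_sets(files_a, files_b):
    """Return (only_in_a, only_in_b, in_both) as sets of 'session/filename' keys."""
    set_a, set_b = set(files_a), set(files_b)
    return set_a - set_b, set_b - set_a, set_a & set_b

# toy example: the second drive holds everything on the first, plus one later session
drive1_files = ['2018-06-19_001/A.WAV', '2018-06-19_001/B.WAV']
drive2_files = ['2018-06-19_001/A.WAV', '2018-06-19_001/B.WAV', '2018-07-28_001/C.WAV']

only_1, only_2, both = compare_file_sets(drive1_files, drive2_files)
print(len(only_1), len(only_2), len(both))
```

With the notebook's variables, the call would be `compare_file_sets(session_and_files[0], session_and_files[1])`. An empty first set would mean that every file on #1 is also on #2.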


## ```fieldwork_001/video``` file status

In [59]:
# use a separate name so the `drives` list of drive letters is not overwritten
video_dirs = ['D:/fieldwork_2018_001/video/','E:/fieldwork_2018_001/video/']
fieldseason1_path = '**/**/**/*.TMC'
all_tmc_files = [ glob.glob(each+fieldseason1_path, recursive=True) for each in video_dirs]
# number of TMC files found per drive
print([len(each) for each in all_tmc_files])

[154054, 154054]


### ```fieldwork_001/video``` summary:
Drives #1 and #2 have **the same** video data. I also checked the total disk space taken up by the video folders on each drive, and it is the same.
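
Equal file counts and equal total size are strong hints, but they do not show that the same files sit in the same places. As a sketch of a slightly stricter check, the code below compares every file's relative path and size between two folder trees. The helper names `relative_sizes` and `same_contents` are hypothetical, and matching sizes still do not prove the files are byte-identical.

```python
import os

def relative_sizes(root):
    """Map each file's path relative to root onto its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, root)
            sizes[rel_path] = os.path.getsize(full_path)
    return sizes

def same_contents(root_a, root_b):
    """True if both trees hold the same relative paths with the same file sizes."""
    return relative_sizes(root_a) == relative_sizes(root_b)

# example call, assuming both drives are mounted:
# same_contents('D:/fieldwork_2018_001/video/', 'E:/fieldwork_2018_001/video/')
```

Walking the whole video tree on both drives may take a while, so the call is left commented out.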


## What is where ? ```fieldwork_002``` audio and video data 

### ```fieldwork_002\actrackdata\wav``` 

In [84]:
# fieldseason 02 all wav files:
# note: the second pattern is spelled 'fieldword'; presumably this matches the
# folder name on #3, since the glob does find files there
fieldseason2_path = ['fieldwork_2018_002/actrackdata/wav/*/*.wav','fieldword_2018_002/actrackdata/wav/*/*.wav',
                     'fieldwork_2018_002/actrackdata/wav/*/*.wav']
# For drives 2-4
all_wav_files = [glob.glob(drive+folderp) for drive, folderp in zip(drives[1:],fieldseason2_path)]
# number of wav files found per drive
print([len(each) for each in all_wav_files])

[528, 1109, 1109]


Even without looking in much detail, we can see that drives #3 and #4 have the same number of wav files, and #2 has about half as many. Let's check once again that the files are all the same.