The Ushichka dataset consists of several file types spread across multiple hard disks. This notebook compares the disks and checks how many unique files there are.

Ushichka was split over multiple hard disks for two broad reasons: 1) back-up and 2) space limitations on the first hard disk. The dataset has two 'fieldseason' phases, 1 and 2. Phase 1 covers everything before the Lund visit, while Phase 2 covers everything after it.

The four disks are named:

1. Ushichka dataset harddisk #1
1. Ushichka dataset harddisk #2
1. Ushichka dataset harddisk #3
1. Ushichka dataset harddisk #4

...and are referred to below as #1-4.

Here I will only discuss the data collected in the 'main' recording volume on the following dates (see Beleyur & Goerlitz 2021 (in prep) or Beleyur PhD Thesis, Uni. Konstanz):

* 2018-06-19
* 2018-06-21
* 2018-06-22
* 2018-07-14
* 2018-07-21
* 2018-07-25
* 2018-07-28
* 2018-08-02
* 2018-08-14
* 2018-08-17
* 2018-08-18
* 2018-08-19

Aside from this data, there is actually a *lot* more data of free-flying bats, collected with the audio-video array system at various field sites in and outside of caves.




The folders containing raw data on the drives typically followed the pattern shown below.

### Typical Audio and Video raw data organisation

```
MAIN DISK:
    * fieldwork_2018_001
        * actrackdata
            * wav
                # one folder per starting date and session number. 
                # the 'starting date' remains the same past midnight even though the actual date
                # has changed.
                * YYYY-MM-DD_sss
                    * FILE1.WAV
                    * FILE2.WAV
                    * ..........
            * weather
        * video
            * YYYY-MM-DD
                # one folder per camera
                * K1
                    # first folder made on switching is always P0000000
                    # all following folders are +1
                    * P0000000
                        # same folder numbering system . First recording triggered is always
                        # stored as 00000000.TMC and +1 for every new file triggered.
                        * 00000000.TMC
                        * 00000001.TMC
                        * .........
                    * P0000001
                    * ......
                    
                * K2
                * K3
        * notebook_scans
            * YYYY-MM-DD
                * notebook photos, scans, additional notes/observations
```
In addition to the *raw* data, there are also some folders in ```actrackdata/video/yyyy-mm-dd/``` which have the DLTdv wand calibrations, the .avi exported videos, and even a few bat trajectories. 
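
As a rough illustration of the video layout above, a relative raw-video path can be split into its date, camera, folder and file parts. The helper name `parse_tmc_relpath` and the field names it returns are made up for this sketch and are not part of any Ushichka tooling. It assumes the `YYYY-MM-DD/K#/P#######/########.TMC` pattern holds.

```python
import re

# Hypothetical pattern for 'YYYY-MM-DD/K1/P0000000/00000000.TMC'.
# Both '/' and '\' are accepted as separators.
TMC_PATTERN = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2})[\\/]'
    r'(?P<camera>K\d)[\\/]'
    r'(?P<folder>P\d{7})[\\/]'
    r'(?P<filename>\d{8}\.TMC)$'
)

def parse_tmc_relpath(relpath):
    '''Returns a dict with date, camera, folder and filename,
    or None if the path does not follow the expected layout.'''
    match = TMC_PATTERN.search(relpath)
    if match is None:
        return None
    return match.groupdict()

print(parse_tmc_relpath('2018-06-19\\K1\\P0000000\\00000000.TMC'))
```

Paths that do not fit the pattern (for example exported .avi files or calibration folders) return `None`, which makes them easy to set aside.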


### LiDAR data

In addition to the audio and video data, the LiDAR scans are stored on disk #2. The whole folder is 6.72 GB.

```
MAIN DISK:
    * Orlova_Chuka_LiDAR
        * Exports
            * various formats of the LiDAR scan - .ply, .dxf and a compressed folder 'OrlovaChukaTotalPointCloud.rar'
        # zipped folder containing a Word file explaining how the raw data was processed. 
        * Re _Orlova_Chuka_data.zip
```
Asparuh Kamburov also has the original data with him - and can be contacted for a copy if necessary. 


### Dates for ``` fieldwork_2018_001\```

```\actrackdata\wav```

|Disk #|  Disk #1  |  Disk #2   |   Disk #3    |   Disk #4    |
|------|-----------|------------|--------------|--------------|
|Start |2018-06-19 | 2018-06-19 | non-Ushichka |  2018-06-19  |
|  End |2018-07-25 | 2018-07-28 | non-Ushichka |  2018-07-28  |

```\video```

|Disk #|  Disk #1  |  Disk #2   | Disk #3   |   Disk #4    |
|------|-----------|------------|-----------|--------------|
|Start |2018-06-19 | 2018-04-09 |2018-04-09 |non-Ushichka     |
|  End | 2018-07-25| 2018-07-25 |2018-05-01 | non-Ushichka    | 



### Dates for ``` fieldwork_2018_002\```

```\actrackdata\wav```

|Disk #|  Disk #1  |  Disk #2   | Disk #3    |   Disk #4    |
|------|-----------|------------|------------|--------------|
|Start | No data   | 2018-07-28 | 2018-07-28 |  2018-07-28  |
|  End | No data   | 2018-08-17 | 2018-08-19 |  2018-08-19  | 

```\video```

|Disk #|   Disk #1 |  Disk #2  |    Disk #3   |   Disk #4    |
|------|-----------|-----------|--------------|--------------|
| Start|  No data  |2018-07-28 |  2018-07-28  |  2018-07-28  |
|  End |  No data  |2018-08-19 |  2018-08-19  |  2018-08-19  | 


In [1]:
import os
import hashlib
import glob
import sys 
import tqdm
import joblib
from joblib import Parallel, delayed
import numpy as np
import pandas as pd
from pandas.core.common import flatten

In [2]:
drives = ['D:/','E:/','F:/','G:/']

In [3]:
# split all paths into their session folder and the filename
# check the difference in the files
def only_sessionfolder_filename(full_path):
    restpath, filename = os.path.split(full_path)
    restpathm1, session = os.path.split(restpath)
    return os.path.join(session,filename)

def only_camera_sessionfolder_filename(full_path):
    restpath, filename = os.path.split(full_path)
    restpathm1, folder_num = os.path.split(restpath)
    restpathm2 , kamera = os.path.split(restpathm1)
    restpathm3 , session_folder = os.path.split(restpathm2)
    return os.path.join(session_folder,kamera,folder_num,filename)
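
The two helpers above both keep the last few components of a path. A possible single-function equivalent is sketched below, under the assumption that only the trailing components matter. `last_n_parts` is a hypothetical name. `PureWindowsPath` is used because it accepts both `/` and `\` as separators on any operating system.

```python
from pathlib import PureWindowsPath

def last_n_parts(full_path, n):
    '''Keeps only the last n components of a path.
    n=2 gives session folder + file (WAV files),
    n=4 gives date + camera + folder + file (TMC files).'''
    parts = PureWindowsPath(full_path).parts[-n:]
    return str(PureWindowsPath(*parts))

wav_path = 'D:/fieldwork_2018_001/actrackdata/wav/2018-06-19_001/FILE1.WAV'
tmc_path = 'E:/fieldwork_2018_001/video/2018-06-19/K1/P0000000/00000000.TMC'
print(last_n_parts(wav_path, 2))
print(last_n_parts(tmc_path, 4))
```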

## What is where ? ```fieldwork_001``` audio and video data 

In [4]:
# fieldseason 01 all wav files: 
fieldseason1_path = 'fieldwork_2018_001/actrackdata/wav/*/*.wav'
all_wav_files = [glob.glob(drive+fieldseason1_path) for drive in drives[:2]]
all_wav_files
print([len(each) for i,each in enumerate(all_wav_files)])

[5276, 5437]


In [5]:
# session and file paths for #1 and #2
session_and_files = [ list(map(only_sessionfolder_filename, each))  for each in all_wav_files]

In [6]:
# are all wav files in #2 fieldwork_001 there in #1 
file_diffs_season1 = set(session_and_files[1]).difference(set(session_and_files[0]))

In [7]:
print('Number of unique files in Drive#2',len(file_diffs_season1))

Number of unique files in Drive#2 161


### ```fieldwork_001/actrackdata/wav``` summary:
Drive #2 has all the fieldwork_001 WAV files in #1 plus 161 more, mainly from the later sessions after 2018-07-25. I also checked that #4 has the same total file size as #1, so essentially #1 and #4 have the same data for fieldwork_001/actrackdata/wav.
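
The cell above only computes which files in #2 are missing from #1. The reverse direction is what confirms that #2 really contains everything in #1. A minimal sketch of a two-way comparison is given below, with toy lists standing in for `session_and_files[0]` and `session_and_files[1]`. `compare_file_sets` is a hypothetical helper.

```python
def compare_file_sets(files_a, files_b):
    '''Reports what is unique to each list and whether a is contained in b.'''
    set_a, set_b = set(files_a), set(files_b)
    return {
        'only_in_a': sorted(set_a - set_b),
        'only_in_b': sorted(set_b - set_a),
        'a_subset_of_b': set_a <= set_b,
    }

# toy stand-ins for the session+file names on two drives
drive1 = ['s1/A.WAV', 's1/B.WAV']
drive2 = ['s1/A.WAV', 's1/B.WAV', 's2/C.WAV']
report = compare_file_sets(drive1, drive2)
print(report['a_subset_of_b'], report['only_in_b'])
```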


## ```fieldwork_001/video``` file status

In [8]:
drives_folders = ['D:/fieldwork_2018_001/','E:/fieldwork_2018_001/']
fieldseason1_path = 'video/**/**/**/*.TMC'
all_tmc_files = [ glob.glob(each+fieldseason1_path) for each in drives_folders]
all_tmc_files
print([len(each) for each in all_tmc_files])

[11317, 11317]


In [9]:
f01_sessionfile =  [ list(map(only_camera_sessionfolder_filename, each)) for each in all_tmc_files]
f01_sessionfiles_sets = [set(each) for each in f01_sessionfile]

### ```fieldwork_001/video``` summary:
Drives #1 and #2 have **the same** video data. I also checked the total file space taken up by either, and they are also the same. 
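
The size check mentioned above was done outside of this notebook. One way it could be done in code is sketched here, assuming a list of full file paths per drive such as `all_tmc_files`. `total_size_bytes` is a hypothetical helper, and the demo uses throw-away files rather than real data. Note that equal total sizes do not prove equal content.

```python
import os
import tempfile

def total_size_bytes(filepaths):
    '''Sum of the on-disk sizes (in bytes) of all files in the list.'''
    return sum(os.path.getsize(each) for each in filepaths)

# Demo on two throw-away files standing in for the file lists of two drives
with tempfile.TemporaryDirectory() as tmpdir:
    copy_a = os.path.join(tmpdir, 'copy_a.TMC')
    copy_b = os.path.join(tmpdir, 'copy_b.TMC')
    for each in (copy_a, copy_b):
        with open(each, 'wb') as f:
            f.write(b'\x00' * 1024)
    size_a = total_size_bytes([copy_a])
    size_b = total_size_bytes([copy_b])
    sizes_match = size_a == size_b
print(size_a, size_b, sizes_match)
```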


## What is where ? ```fieldwork_002``` audio and video data 

### ```fieldwork_002\actrackdata\wav``` 

In [10]:
drives_wavfolders = ['E:/','F:/','G:/']

# fieldseason 02 all wav files: 
fieldseason2_path = ['fieldwork_2018_002/actrackdata/wav/*/*.wav','fieldword_2018_002/actrackdata/wav/*/*.wav',
                     'fieldwork_2018_002/actrackdata/wav/*/*.wav']
# For drives 2-4
all_wav_files2 = [glob.glob(drive+folderp) for drive, folderp in zip(drives_wavfolders,fieldseason2_path)]
all_wav_files2
print([len(each) for i,each in enumerate(all_wav_files2)])

[528, 1109, 1109]


Already without looking into much detail we can see:
* drives #3 and #4 have the same number of wav files
* #2 has about half as many. 

Even though the number of files is the same, let's check that the files are all the same between #3 and #4.

In [11]:
# select only the session folder+files in them 
f02_sessionfile2 =  [ list(map(only_sessionfolder_filename, each)) for each in all_wav_files2]
f02_sessionfiles_sets = [set(each) for each in f02_sessionfile2]

In [12]:
# common files between the three drives
all_common = f02_sessionfiles_sets[0].intersection(f02_sessionfiles_sets[1]).intersection(f02_sessionfiles_sets[2])
print(len(all_common))

528


In [13]:
# are files in #2 a subset of the files in #3 and #4 ?
all_common == f02_sessionfiles_sets[0]

True

So the files in #2 seem to be there in drives #3 and #4 too. 

Are the files in #3 and #4 the same though? 

In [14]:
drives34_same = f02_sessionfiles_sets[1] == f02_sessionfiles_sets[2]
print(drives34_same)

True


The files in #3 and #4 are indeed the same, which means the two drives are completely redundant for the fieldwork_002 WAV files.

### Summary ```fieldwork_002/actrackdata/wav``` :
* 1109 wav files in #3 and #4 are the same 
* 528 wav files in #2 are also in #3 and #4

## ```fieldwork_002\video``` 

In [15]:

# fieldseason 02 all TMC files: 
fieldseason2_path = ['fieldwork_2018_002/','fieldword_2018_002/',
                     'fieldwork_2018_002/']

sub2_path = 'video/**/**/**/*.TMC'
all_tmc_files2 = []
for each,drivep in zip(drives_wavfolders, fieldseason2_path):
    all_tmc_files2.append(glob.glob(each+drivep+sub2_path))

all_tmc_files2
print([len(each) for each in all_tmc_files2])

[1873, 4881, 4881]


In [16]:
len(set(all_tmc_files2[0]))

1873

In [17]:
drives

['D:/', 'E:/', 'F:/', 'G:/']

In [18]:
# now only extract the sessionpath downwards
f02_tmc_sessionfiles = [ list(map(only_camera_sessionfolder_filename, each)) for each in all_tmc_files2]
f02_tmc_files_set = [set(each) for each in f02_tmc_sessionfiles]

In [19]:
# get all files that are common to all 3 drives:
tmc_allcommon = f02_tmc_files_set[0].intersection(f02_tmc_files_set[1]).intersection(f02_tmc_files_set[2])
print(len(tmc_allcommon))

1873


In [20]:
# Do drives #3 and #4 have the same TMC?
tmc_34_same = f02_tmc_files_set[1]==f02_tmc_files_set[2]
print(tmc_34_same)

True


In [21]:
f02_tmc_files_set[2].issuperset(f02_tmc_files_set[0])

True

### Summary ```fieldwork_002/video``` for drives #2-4:

* Video TMC files in #2 are a subset of #3 and #4
* Video TMC files in #3 and #4 are the same

## What are all the unique audio and video files in the Ushichka dataset?

## All the unique WAV files in the dataset

In [22]:
# all the unique WAV files 
wav_combined = f02_sessionfiles_sets + session_and_files
wav_unique = set.union(*wav_combined)
len(wav_unique)
    

6390

## All the unique TMC files in the dataset


In [23]:
tmc_combined = f02_tmc_files_set + f01_sessionfiles_sets
tmc_unique = set.union(*tmc_combined)
print(len(tmc_unique))


16198


In [24]:
print(len(tmc_unique)/3)

5399.333333333333


## All WAV files in the Ushichka period


In [25]:
ushichka_dates=['2018-06-19', '2018-06-21','2018-06-22','2018-07-14','2018-07-21',
    '2018-07-25', '2018-07-28','2018-08-02','2018-08-14','2018-08-17','2018-08-18',
    '2018-08-19']
get_outer_folder = lambda path_X: os.path.split(path_X[::-1])[-1][::-1] 
def file_is_in_datelist(filepath, date_list):
    '''
    Parameters
    ----------
    filepath : str
    date_list : list
        With strings containing dates. 
    Returns
    -------
    boolean 
        True if the outer folder of the filepath matches part of the date_list. 
    
    Example
    -------
    >>> dates = ['2018-06-19','2018-06-21']
    >>> file_path = '2018-07-25_001\\MULTIWAV_2018-07-25_23-07-47_1532549267.WAV'
    >>> file_is_in_datelist(file_path, dates) # outputs False
    
    # instead with
    >>> file_path = '2018-06-19_001\\MULTIWAV_2018-07-25_23-07-47_1532549267.WAV'
    >>> file_is_in_datelist(file_path, dates) # outputs True
    '''
    datefolder = get_outer_folder(filepath)
    # True if any date in date_list is part of the outer folder name
    return any(each in datefolder for each in date_list)
def file_is_in_uschichka_dates(filepath):
    return file_is_in_datelist(filepath, ushichka_dates)
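
The filter above works on the outermost folder, so a recording made after midnight is still assigned to the date on which its session started. A small sketch of the same idea with a regular expression is given below. `session_start_date` is a hypothetical helper and the file name in the example is made up.

```python
import re

def session_start_date(relpath):
    '''Returns the 'YYYY-MM-DD' that the relative path starts with,
    i.e. the date of the session/date folder, or None if there is none.'''
    match = re.match(r'\d{4}-\d{2}-\d{2}', relpath)
    return match.group(0) if match else None

# Made-up file name: recorded on 08-20, but in a session started on 08-19
example_path = '2018-08-19_003\\MULTIWAV_2018-08-20_example.WAV'
start_date = session_start_date(example_path)
print(start_date, start_date in {'2018-08-19'})
```

Unlike the substring test, this compares the whole date, so a partial string in `date_list` would not match by accident.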

In [26]:
ushichka_wavfiles = sorted(list(filter(file_is_in_uschichka_dates, list(wav_unique))))
ushichka_tmcfiles = sorted(list(filter(file_is_in_uschichka_dates, list(tmc_unique))))

In [27]:
ushichka_wavfiles

['2018-06-19_001\\MULTIWAV_2018-06-19_20-56-21_1529430981.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_20-57-55_1529431075.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-00-04_1529431204.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-15-16_1529432116.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-15-23_1529432123.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-16-54_1529432214.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-26-24_1529432784.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-26-48_1529432808.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-27-10_1529432830.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-29-38_1529432978.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-31-02_1529433062.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-31-21_1529433081.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-32-12_1529433132.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-32-20_1529433140.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21-32-26_1529433146.WAV',
 '2018-06-19_001\\MULTIWAV_2018-06-19_21

In [28]:
print(len(ushichka_tmcfiles), len(ushichka_wavfiles))

8163 2095


## Why are the TMC and WAV file numbers so different? 
* TMC files outnumber WAV files because for every multichannel WAV file there should be 3 TMC files per trigger, one from each thermal camera. With 2095 WAV files this means there are effectively ~2100 camera trigger events. 
* In principle, $N_{camera\:files} = N_{wavfiles} \times 3$. The number of TMC files (8163) exceeds the expected ~6300 (~2100x3), possibly because a poor battery level sometimes meant the sync/trigger signal didn't get amplified sufficiently through the control box. This means that for a given audio file there could sometimes be > 3 camera files, or even multiple files per camera per trigger. This will indeed be the tricky part to check. 
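
The arithmetic behind this comparison, using the two counts printed above and assuming exactly three cameras per trigger:

```python
n_wav = 2095      # Ushichka WAV files (printed above)
n_tmc = 8163      # Ushichka TMC files (printed above)
n_cameras = 3     # K1, K2, K3

expected_tmc = n_wav * n_cameras     # one TMC file per camera per WAV file
excess_tmc = n_tmc - expected_tmc    # TMC files beyond the expectation
print(expected_tmc, excess_tmc)      # 6285 1878
```

So roughly 1900 TMC files cannot be explained by a simple one-WAV-to-three-TMC mapping.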

In [29]:
col_file_type = ['raw_audio']*len(ushichka_wavfiles)+['raw_video']*len(ushichka_tmcfiles)
raw_av_files = pd.DataFrame(data={'unique_filepath': ushichka_wavfiles+ushichka_tmcfiles,
                                   'file_type': col_file_type} )

In [30]:
raw_av_files

Unnamed: 0,unique_filepath,file_type
0,2018-06-19_001\MULTIWAV_2018-06-19_20-56-21_15...,raw_audio
1,2018-06-19_001\MULTIWAV_2018-06-19_20-57-55_15...,raw_audio
2,2018-06-19_001\MULTIWAV_2018-06-19_21-00-04_15...,raw_audio
3,2018-06-19_001\MULTIWAV_2018-06-19_21-15-16_15...,raw_audio
4,2018-06-19_001\MULTIWAV_2018-06-19_21-15-23_15...,raw_audio
...,...,...
10253,2018-08-19\K3\P0000004\00011000.TMC,raw_video
10254,2018-08-19\K3\P0000004\00012000.TMC,raw_video
10255,2018-08-19\K3\P0000004\00013000.TMC,raw_video
10256,2018-08-19\K3\P0000004\00014000.TMC,raw_video


In [31]:
drives

['D:/', 'E:/', 'F:/', 'G:/']

In [32]:

all_drives_wavfiles = list(flatten([glob.glob(drive+'/**/*.WAV',recursive=True) for drive in tqdm.tqdm(drives)]))
len(all_drives_wavfiles)


100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.76it/s]


31684

In [33]:

all_drives_tmc = list(flatten([glob.glob(drive+'/**/*.TMC',recursive=True) for drive in tqdm.tqdm(drives)]))
len(all_drives_tmc)

100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.72it/s]


43517

### Checking that the unique wav file copies across hard drives are the same
Most files are saved in multiple locations. Let us double check that the file copies are actually the same. This allows us to store the MD5 hash as a unique identifier for each raw audio/video file that we have in our records. 

In [34]:
def generate_md5sum_hexadecimal(filepath):
    md5_hash = hashlib.md5()
    with open(filepath,"rb") as f:
        # Read and update hash in chunks of 4K
        for byte_block in iter(lambda: f.read(4096),b""):
            md5_hash.update(byte_block)
        return md5_hash.hexdigest()

def get_filecopy_hexdigests(partial_filepath, all_contents):
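    '''
    Finds every path in all_contents that contains partial_filepath
    and returns a list with the MD5 hex digest of each copy.
    Returns the string 'NaN' if no copy is found.
    '''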
    file_copies = [ eachfile for eachfile in all_contents if partial_filepath in eachfile]
    if len(file_copies)>0:
        # generate hex digests for each file copy
        return [generate_md5sum_hexadecimal(each) for each in file_copies]
    else:
        return 'NaN'
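
`get_filecopy_hexdigests` scans the full list of paths for every file it looks up. Hashing is probably the slow step, but the lookup could be done in a single pass by grouping the paths once. A sketch is given below, assuming copies are identified by their last path components (2 for WAV files, 4 for TMC files). `build_copy_index` is a hypothetical helper, not what was run in this notebook.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

def build_copy_index(all_contents, n_parts):
    '''Groups full paths by their last n_parts components.
    Keys are tuples such as ('2018-06-19_001', 'FILE1.WAV').'''
    index = defaultdict(list)
    for full_path in all_contents:
        key = PureWindowsPath(full_path).parts[-n_parts:]
        index[key].append(full_path)
    return index

# toy stand-in for all_drives_wavfiles
contents = ['D:/x/wav/s1/F.WAV', 'E:/y/wav/s1/F.WAV', 'E:/y/wav/s2/F.WAV']
index = build_copy_index(contents, 2)
copies = index.get(PureWindowsPath('s1\\F.WAV').parts, [])
print(copies)
```

Matching on whole path components also avoids any accidental substring matches.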


In [35]:
tmc_copy_hexes = Parallel(n_jobs=4)(delayed(get_filecopy_hexdigests)(each, all_drives_tmc) for each in tqdm.tqdm(ushichka_tmcfiles))

100%|████████████████████████████████████████████████████████████████████████████| 8163/8163 [9:30:09<00:00,  4.19s/it]


In [36]:
def check_hexes_are_the_same(file_hexes):
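    '''
    file_hexes : list with sublists containing the MD5 hexdigest of 
    different copies of the same file.
    Returns a list with True/False per file. Files with only one 
    copy get the string 'NaN', which all() treats as True.
    '''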
    all_the_same = []
    for each in file_hexes:
        if len(each)>1:
            all_the_same.append(all(x==each[0] for x in each))
        else:
            all_the_same.append('NaN')
    return all_the_same
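
Because a non-empty string such as `'NaN'` counts as true inside `all()`, files with a single copy pass the check silently. If those cases need to be told apart, each file could be labelled explicitly. The sketch below is one way to do that. `classify_copies` and its four labels are hypothetical.

```python
def classify_copies(hex_list):
    '''Labels one file by how its copies compare.'''
    if not isinstance(hex_list, list) or len(hex_list) == 0:
        return 'no_copy_found'   # e.g. the 'NaN' string placeholder
    if len(hex_list) == 1:
        return 'single_copy'     # nothing to compare against
    if len(set(hex_list)) == 1:
        return 'identical'
    return 'mismatch'

# toy digests, not real MD5 values
examples = [['aa', 'aa', 'aa'], ['bb'], ['cc', 'cd'], 'NaN']
labels = [classify_copies(each) for each in examples]
print(labels)
```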

In [64]:
tmc_hexbase = pd.DataFrame(data={'unique_tmc_filepath':ushichka_tmcfiles, 'md5_hex_copies':tmc_copy_hexes})

In [38]:
wav_copy_hexes = Parallel(n_jobs=4)(delayed(get_filecopy_hexdigests)(each, all_drives_wavfiles) for each in tqdm.tqdm(ushichka_wavfiles))

100%|████████████████████████████████████████████████████████████████████████████| 2095/2095 [4:56:58<00:00,  8.51s/it]


In [65]:
wav_hexbase = pd.DataFrame(data={'unique_wav_filepath':ushichka_wavfiles, 'md5_hex_copies':wav_copy_hexes})

In [40]:
wav_hexbase

Unnamed: 0,unique_wav_filepath,md5_hex
0,2018-06-19_001\MULTIWAV_2018-06-19_20-56-21_15...,"[20a6d3771a3d81b0f56c39bf524a661d, 20a6d3771a3..."
1,2018-06-19_001\MULTIWAV_2018-06-19_20-57-55_15...,"[d7299c49536cbb7ec88a769dfa09d67e, d7299c49536..."
2,2018-06-19_001\MULTIWAV_2018-06-19_21-00-04_15...,"[d7f0c8f785ebe576a5dc18266154d488, d7f0c8f785e..."
3,2018-06-19_001\MULTIWAV_2018-06-19_21-15-16_15...,"[bb892a980991a628c8c97aab908384e7, bb892a98099..."
4,2018-06-19_001\MULTIWAV_2018-06-19_21-15-23_15...,"[f24aaa5cb683937319d983da212e2122, f24aaa5cb68..."
...,...,...
2090,2018-08-19_003\MULTIWAV_2018-08-20_09-33-18_15...,"[922ddd2f3cee29762e4fc66082f643e4, 922ddd2f3ce..."
2091,2018-08-19_003\MULTIWAV_2018-08-20_09-33-36_15...,"[19544f1a1e29eef5285a0f7a514c2abe, 19544f1a1e2..."
2092,2018-08-19_003\SPKRPLAYBACK_multichirp_2018-08...,"[37e46f82992ca3158afb00167dee82e0, 37e46f82992..."
2093,2018-08-19_003\SPKRPLAYBACK_multichirp_2018-08...,"[83e80747b53d4aaa039869c5ea73814f, 83e80747b53..."


In [66]:
# check that all md5 hashes are the same for all copies of a wav file.
all_wavcopies_same = all(check_hexes_are_the_same(wav_hexbase['md5_hex_copies']))
print(all_wavcopies_same)

True


In [67]:
# check that all md5 hashes are the same for all TMC file copies
all_tmccopies_same = all(check_hexes_are_the_same(tmc_hexbase['md5_hex_copies']))
print(all_tmccopies_same)

True


In [68]:
# check that all of the files could be hex-digested 
[each for every in tmc_hexbase['md5_hex_copies'] for each in every if 'NaN' in each]

[]

In [69]:
tmc_hexbase

Unnamed: 0,unique_tmc_filepath,md5_hex_copies
0,2018-06-19\K1\P0000000\00000000.TMC,"[1bc84705f1f9ccfa43b3888b90c68395, 1bc84705f1f..."
1,2018-06-19\K1\P0000000\00001000.TMC,"[537b2fc286b073b0e951e39fa73a0280, 537b2fc286b..."
2,2018-06-19\K1\P0000000\00002000.TMC,"[bc162074f92c83166774e70c08e94ce1, bc162074f92..."
3,2018-06-19\K1\P0000000\00003000.TMC,"[ae36194ff9d3e5ed78af7670746246d1, ae36194ff9d..."
4,2018-06-19\K1\P0000000\00004000.TMC,"[6de770c7d4df9960002d4bf768c15c66, 6de770c7d4d..."
...,...,...
8158,2018-08-19\K3\P0000004\00011000.TMC,"[999c06ad95ed9716770c13fb50e1c332, 999c06ad95e..."
8159,2018-08-19\K3\P0000004\00012000.TMC,"[d81d7333df2280b5a4d343d84e95ddce, d81d7333df2..."
8160,2018-08-19\K3\P0000004\00013000.TMC,"[cbb2910f76c6a4721ecd5de8b96fff29, cbb2910f76c..."
8161,2018-08-19\K3\P0000004\00014000.TMC,"[af0b56eb55d5497cda52562640c18eb5, af0b56eb55d..."


### All Ushichka audio and video file copies are identical content-wise. 
Now, let's save the details of these files in a parsable way

In [70]:
tmc_hexbase['md5_digest'] = tmc_hexbase['md5_hex_copies'].apply(lambda X: X[0])
wav_hexbase['md5_digest'] = wav_hexbase['md5_hex_copies'].apply(lambda X: X[0])

In [71]:
tmc_hexbase

Unnamed: 0,unique_tmc_filepath,md5_hex_copies,md5_digest
0,2018-06-19\K1\P0000000\00000000.TMC,"[1bc84705f1f9ccfa43b3888b90c68395, 1bc84705f1f...",1bc84705f1f9ccfa43b3888b90c68395
1,2018-06-19\K1\P0000000\00001000.TMC,"[537b2fc286b073b0e951e39fa73a0280, 537b2fc286b...",537b2fc286b073b0e951e39fa73a0280
2,2018-06-19\K1\P0000000\00002000.TMC,"[bc162074f92c83166774e70c08e94ce1, bc162074f92...",bc162074f92c83166774e70c08e94ce1
3,2018-06-19\K1\P0000000\00003000.TMC,"[ae36194ff9d3e5ed78af7670746246d1, ae36194ff9d...",ae36194ff9d3e5ed78af7670746246d1
4,2018-06-19\K1\P0000000\00004000.TMC,"[6de770c7d4df9960002d4bf768c15c66, 6de770c7d4d...",6de770c7d4df9960002d4bf768c15c66
...,...,...,...
8158,2018-08-19\K3\P0000004\00011000.TMC,"[999c06ad95ed9716770c13fb50e1c332, 999c06ad95e...",999c06ad95ed9716770c13fb50e1c332
8159,2018-08-19\K3\P0000004\00012000.TMC,"[d81d7333df2280b5a4d343d84e95ddce, d81d7333df2...",d81d7333df2280b5a4d343d84e95ddce
8160,2018-08-19\K3\P0000004\00013000.TMC,"[cbb2910f76c6a4721ecd5de8b96fff29, cbb2910f76c...",cbb2910f76c6a4721ecd5de8b96fff29
8161,2018-08-19\K3\P0000004\00014000.TMC,"[af0b56eb55d5497cda52562640c18eb5, af0b56eb55d...",af0b56eb55d5497cda52562640c18eb5


In [72]:
wav_hexbase

Unnamed: 0,unique_wav_filepath,md5_hex_copies,md5_digest
0,2018-06-19_001\MULTIWAV_2018-06-19_20-56-21_15...,"[20a6d3771a3d81b0f56c39bf524a661d, 20a6d3771a3...",20a6d3771a3d81b0f56c39bf524a661d
1,2018-06-19_001\MULTIWAV_2018-06-19_20-57-55_15...,"[d7299c49536cbb7ec88a769dfa09d67e, d7299c49536...",d7299c49536cbb7ec88a769dfa09d67e
2,2018-06-19_001\MULTIWAV_2018-06-19_21-00-04_15...,"[d7f0c8f785ebe576a5dc18266154d488, d7f0c8f785e...",d7f0c8f785ebe576a5dc18266154d488
3,2018-06-19_001\MULTIWAV_2018-06-19_21-15-16_15...,"[bb892a980991a628c8c97aab908384e7, bb892a98099...",bb892a980991a628c8c97aab908384e7
4,2018-06-19_001\MULTIWAV_2018-06-19_21-15-23_15...,"[f24aaa5cb683937319d983da212e2122, f24aaa5cb68...",f24aaa5cb683937319d983da212e2122
...,...,...,...
2090,2018-08-19_003\MULTIWAV_2018-08-20_09-33-18_15...,"[922ddd2f3cee29762e4fc66082f643e4, 922ddd2f3ce...",922ddd2f3cee29762e4fc66082f643e4
2091,2018-08-19_003\MULTIWAV_2018-08-20_09-33-36_15...,"[19544f1a1e29eef5285a0f7a514c2abe, 19544f1a1e2...",19544f1a1e29eef5285a0f7a514c2abe
2092,2018-08-19_003\SPKRPLAYBACK_multichirp_2018-08...,"[37e46f82992ca3158afb00167dee82e0, 37e46f82992...",37e46f82992ca3158afb00167dee82e0
2093,2018-08-19_003\SPKRPLAYBACK_multichirp_2018-08...,"[83e80747b53d4aaa039869c5ea73814f, 83e80747b53...",83e80747b53d4aaa039869c5ea73814f


In [73]:
# save the files into csv files 
wav_hexbase.to_csv('ushichka_wavfilelist_md5digest.csv')
tmc_hexbase.to_csv('ushichka_tmcfilelist_md5digest.csv')
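
When these CSV files are read back, the `md5_hex_copies` column arrives as text, since, to my understanding, `to_csv` writes each list as its string representation. A sketch of turning that text back into lists is given below. The two rows are made-up stand-ins with the same column names, and only the standard library is used here.

```python
import ast
import csv
import io

# Made-up stand-in for the contents of 'ushichka_wavfilelist_md5digest.csv'
csv_text = '''\
,unique_wav_filepath,md5_hex_copies,md5_digest
0,session_1/A.WAV,"['aaa', 'aaa']",aaa
1,session_1/B.WAV,"['bbb', 'bbb', 'bbb']",bbb
'''

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # safely evaluate "['aaa', 'aaa']" back into a Python list
    row['md5_hex_copies'] = ast.literal_eval(row['md5_hex_copies'])

print([len(row['md5_hex_copies']) for row in rows])
```

For most later uses the `md5_digest` column alone should be enough, as it holds one hash per unique file.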

In [74]:
import datetime as dt 
print(dt.datetime.now())

2021-03-31 10:19:12.459488
