# Move amlr07-20221204 GCS files

The code for processing amlr07-20221204 shadowgraph images (Cutter's code, adapted from the methods of Ohman et al.) wrote all processed images into a single directory. The purpose of this notebook is to copy these files into separate folders by image type, so they can be imported into VIAME-Web-AMLR.

Image types:
- `-ffPCG.png` images: Flatfielded images, with Pixel Gamma Correction
- `-imgff.png` images: Flatfielded images
- `.jpgorig-regions.jpg` images: Original jpg images, with red region bounding boxes drawn onto them

Note that both versions of flatfielded images have had other processing steps applied, such as masking.

Import modules, including 'sourcing' the `file_move` py file that defines the helper functions

In [1]:
from google.cloud import storage
from itertools import repeat
import subprocess
import pandas as pd
import multiprocessing as mp

%run -m file_move

## Variables and Prep

Set variables, and generate the list of files to copy

In [2]:
storage_client = storage.Client(project = "ggn-nmfs-usamlr-dev-7b99")
source_bucket_name    = "amlr-imagery-proc-dev"
file_prefix    = "gliders/2022/amlr07-20221204/shadowgraph/images/Dir011"

# file_substr    = "-ffPCG"
file_substr    = "-imgff"
# file_substr    = ".jpgorig-regions"

destination_bucket_name = "amlr-gliders-imagery-proc-dev"

In [3]:
file_list_orig = list_blobs_with_prefix(
    source_bucket_name, file_prefix, file_substr=file_substr)    
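
The implementation of `list_blobs_with_prefix` lives in `file_move` and is not shown in this notebook. As a rough sketch only, a helper of this kind could look like the following. The name `list_blob_names` is hypothetical, and `client` is assumed to behave like a `google.cloud.storage.Client`, whose `list_blobs(bucket_name, prefix=...)` method yields objects with a `.name` attribute.

```python
def list_blob_names(client, bucket_name, prefix, file_substr=None):
    """Return the names of blobs under `prefix`, optionally filtered by substring.

    Hypothetical helper: `client` is assumed to behave like
    google.cloud.storage.Client (list_blobs yields objects with a .name).
    """
    names = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        # Keep every blob if no substring is given, else only matching names
        if file_substr is None or file_substr in blob.name:
            names.append(blob.name)
    return names
```

Note that a prefix is matched against the start of the blob name, not against whole folder names, so a prefix ending in `Dir010` would also match `Dir0100`, `Dir0101`, and so on.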

In [4]:
print(f"there are {len(file_list_orig)} files with {file_substr} " +
      f"with the prefix {source_bucket_name}/{file_prefix}")
print("First 10 files:")
for i in file_list_orig[:10]:
    print(i)

# # Summarize by string length, if desired
# lengths_list = [len(i) for i in file_list_orig]
# pd.Series(lengths_list).value_counts()

there are 7715 files with -imgff with the prefix amlr-imagery-proc-dev/gliders/2022/amlr07-20221204/shadowgraph/images/Dir010
First 10 files:
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/amlr07 20221208-151532-001-imgff.png
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/amlr07 20221208-151538-002-imgff.png
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/amlr07 20221208-151544-003-imgff.png
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/amlr07 20221208-151550-004-imgff.png
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/amlr07 20221208-151556-005-imgff.png
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/amlr07 20221208-151602-006-imgff.png
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/amlr07 20221208-151608-007-imgff.png
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/amlr07 20221208-151614-008-imgff.png
gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/ou

Create new file list

In [5]:
file_list_destination = []
for i in file_list_orig:
    # Map each source blob name to its destination blob name
    i = i.replace("gliders/2022", "FREEBYRD/2023")
    i = i.replace("/shadowgraph/images", f"/images{file_substr}")
    # Drop the intermediate 'output' folder
    i = i.replace("/output", "")
    file_list_destination.append(i)
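
The same three replacements can be wrapped in a small function, which makes the mapping easy to check on a single path before any copy is run. This is a sketch only: the function name `destination_name` is hypothetical, and the example path is taken from the listing output above.

```python
def destination_name(source_name, file_substr):
    """Map a source blob name to its destination blob name."""
    name = source_name.replace("gliders/2022", "FREEBYRD/2023")
    name = name.replace("/shadowgraph/images", f"/images{file_substr}")
    # Drop the intermediate 'output' folder
    return name.replace("/output", "")


example_sources = [
    "gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/"
    "amlr07 20221208-151532-001-imgff.png",
    "gliders/2022/amlr07-20221204/shadowgraph/images/Dir0100/output/"
    "amlr07 20221208-151538-002-imgff.png",
]
example_destinations = [destination_name(s, "-imgff") for s in example_sources]

# Each source should map to a distinct destination, or copies would overwrite
assert len(set(example_destinations)) == len(example_destinations)
```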

In [6]:
from random import sample
for i in sample(file_list_destination, 5):
    print(i)

FREEBYRD/2023/amlr07-20221204/images-imgff/Dir0104/amlr07 20221209-002808-013-imgff.png
FREEBYRD/2023/amlr07-20221204/images-imgff/Dir0104/amlr07 20221209-005101-005-imgff.png
FREEBYRD/2023/amlr07-20221204/images-imgff/Dir0106/amlr07 20221209-061925-023-imgff.png
FREEBYRD/2023/amlr07-20221204/images-imgff/Dir0103/amlr07 20221208-211724-017-imgff.png
FREEBYRD/2023/amlr07-20221204/images-imgff/Dir0102/amlr07 20221208-205632-019-imgff.png


## Copy
Copy the blobs individually, rather than using rsync. Create the bucket objects outside the functions to minimize the work that each function call needs to do.

In [7]:
source_bucket = storage_client.bucket(source_bucket_name)
destination_bucket = storage_client.bucket(destination_bucket_name)
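
The copy helpers themselves are defined in `file_move` and are not shown here. As a minimal sketch of what a copy-if-new helper might do, assuming the bucket arguments behave like `google.cloud.storage.Bucket` objects (with `blob()` and `copy_blob()`) and that blobs provide `exists()`; the name `copy_if_missing` is hypothetical:

```python
def copy_if_missing(source_bucket, source_name, destination_bucket, destination_name):
    """Copy one blob unless the destination name already exists.

    Hypothetical helper: both buckets are assumed to behave like
    google.cloud.storage.Bucket objects.
    Returns True if a copy was made, False if it was skipped.
    """
    if destination_bucket.blob(destination_name).exists():
        return False
    source_blob = source_bucket.blob(source_name)
    # Server-side copy of the source blob to the new name in the destination
    source_bucket.copy_blob(source_blob, destination_bucket, destination_name)
    return True
```

Skipping existing blobs would make the notebook safe to re-run after an interruption, at the cost of one extra existence check per file.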

#### Testing cells

In [8]:
# i = 1

# # source_blob = source_bucket.blob(file_list_orig[i])
# # blob_copy = source_bucket.copy_blob(
# #     source_blob, destination_bucket, file_list_destination[i]
# #     #if_generation_match=destination_generation_match_precondition,
# # )

# copy_blob_if_new(
#     source_bucket, file_list_orig[i], 
#     destination_bucket, file_list_destination[i]
# )


#### Run the full thing

Run the full copy in parallel. Each worker is passed the bucket names rather than the bucket objects.

In [9]:
%%time
numcores = mp.cpu_count()
print(f"Running with {numcores} cores")

with mp.Pool(numcores) as pool:
    out_list = pool.starmap(
        copy_blob_client, 
        zip(repeat(source_bucket_name), file_list_orig, 
            repeat(destination_bucket_name), file_list_destination)
    )

Running with 48 cores
CPU times: user 120 ms, sys: 237 ms, total: 357 ms
Wall time: 29 s
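
To make the `starmap` call easier to follow, the snippet below shows the argument tuples that `zip` and `repeat` build. The bucket and file names here are placeholders, not values from this project. Passing names rather than client or bucket objects presumably reflects that those objects cannot be sent to worker processes, which is an assumption about `copy_blob_client` and not something this notebook shows.

```python
from itertools import repeat

sources = ["in/a.png", "in/b.png"]
destinations = ["out/a.png", "out/b.png"]

# repeat() supplies the same bucket name for every pair of blob names;
# zip() stops when the shortest finite list is exhausted.
arg_tuples = list(zip(repeat("source-bucket"), sources,
                      repeat("destination-bucket"), destinations))

for args in arg_tuples:
    print(args)
```

`pool.starmap` unpacks each tuple into the four positional arguments of the worker function, so the two file lists must be the same length and in the same order.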


Alternatively, run the full copy sequentially, which makes it possible to pass the bucket objects created above

In [10]:
# %%time
# for (i, j) in zip(file_list_orig, file_list_destination):
#     # if destination_bucket.blob(j).exists():
#     #     print(f"skipping {j}")
#     #     continue
#     print(j)
#     copy_blob(source_bucket, i, destination_bucket, j)

### Sanity checks

In [11]:
# prefix_dest = "FREEBYRD/2023/amlr07-20221204/regions/Dir"
# file_list_dest = list_blobs_with_prefix(destination_bucket_name, prefix_dest)
# print(f"there are {len(file_list_dest)} files " +
#       f"with the prefix {destination_bucket_name}/{prefix_dest}")

# lengths_list = [len(i) for i in file_list_dest]
# pd.Series(lengths_list).value_counts()
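
Beyond comparing counts, one possible check is to compare the expected destination names against what is actually listed in the destination bucket. The sketch below uses plain lists of names; the function name `compare_name_lists` is hypothetical.

```python
def compare_name_lists(expected, actual):
    """Return (missing, unexpected) blob names, each as a sorted list."""
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)      # expected but not found
    unexpected = sorted(actual_set - expected_set)   # found but not expected
    return missing, unexpected


missing, unexpected = compare_name_lists(
    ["out/a.png", "out/b.png"], ["out/b.png", "out/c.png"]
)
print(f"{len(missing)} missing, {len(unexpected)} unexpected")
```

In this notebook, `expected` would be `file_list_destination` and `actual` the list returned for the destination prefix; both lists empty would indicate that every file was copied.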