# Create .npz files from patchlets

```
#
# Copyright (c) Sinergise, 2019 -- 2021.
#
# This file belongs to subproject "field-delineation" of project NIVA (www.niva4cap.eu).
# All rights reserved.
#
# This source code is licensed under the MIT license found in the LICENSE
# file in the root directory of this source tree.
#
```

This notebook creates a series of `.npz` files that join the data and labels sampled into patchlets in the previous step. A dataframe keeps track of the origin of each patchlet, namely which eopatch it comes from and at which position it was sampled. This dataframe is later used for the cross-validation splits.
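
One plausible way to use such a dataframe for cross-validation, sketched here under assumptions rather than taken from the project code, is to assign folds per eopatch so that patchlets from the same eopatch never land on both sides of a split. The helper `assign_folds`, the fold count, and the sample names below are hypothetical.

```python
import random


def assign_folds(eopatch_names, n_folds=3, seed=42):
    """Map each unique eopatch name to a fold index in [0, n_folds)."""
    unique_names = sorted(set(eopatch_names))
    rng = random.Random(seed)
    rng.shuffle(unique_names)  # seeded shuffle, so the assignment is reproducible
    return {name: idx % n_folds for idx, name in enumerate(unique_names)}


# Hypothetical eopatch values, one entry per patchlet row.
rows = ["29TPE_8_0", "29TPE_8_0", "29TPE_8_1", "29TPE_9_0", "29TPE_9_0"]
fold_of = assign_folds(rows, n_folds=2)

# Every row inherits the fold of its eopatch.
row_folds = [fold_of[name] for name in rows]
```

If the dataframe has an `eopatch` column, the same mapping could be applied with `df['eopatch'].map(fold_of)`.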

In [1]:
%load_ext autoreload
%autoreload 2

In [17]:
import os

import numpy as np
import pandas as pd 
from tqdm.auto import tqdm

from functools import partial 
from concurrent.futures import ProcessPoolExecutor

from fd.utils import prepare_filesystem, multiprocess
from fd.create_npz_files import (
    CreateNpzConfig, 
    extract_npys, 
    concatenate_npys, 
    save_into_chunks
)

### Define filesystem and eopatches location 

In [3]:
config = CreateNpzConfig(
    bucket_name='bucket-name',
    aws_access_key_id='',
    aws_secret_access_key='',
    aws_region='eu-central-1', 
    patchlets_folder='data/Castilla/2020-04/patchlets',
    output_folder='data/Castilla/2020-04/patchlets_npz', 
    output_dataframe='metadata/Castilla/2020-04/patchlet-info.csv',
    chunk_size=50)

In [4]:
filesystem = prepare_filesystem(config)

In [5]:
patchlets = [os.path.join(config.patchlets_folder, eop_name)
             for eop_name in filesystem.listdir(config.patchlets_folder)]

In [6]:
len(patchlets)

7740

Extract the following numpy arrays from each patchlet eopatch:
* `X`
* `y_boundary`
* `y_extent`
* `y_distance`
* `timestamps`
* `eop_names`

In [7]:
partial_fn = partial(extract_npys, cfg=config)

In [8]:
npys = multiprocess(partial_fn, patchlets, max_workers=24)
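
The internals of `multiprocess` and `extract_npys` are not shown in this notebook. As a rough sketch of the pattern only, the snippet below binds a configuration with `functools.partial` and maps the resulting one-argument function over a list of paths. The names `run_parallel` and `extract_stub` and the dictionary config are hypothetical, and a thread pool is used instead of a process pool so that the snippet runs in any environment.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def extract_stub(patchlet_path, cfg):
    """Hypothetical stand-in for extract_npys: returns a small record per patchlet."""
    return {"patchlet": patchlet_path, "chunk_size": cfg["chunk_size"]}


def run_parallel(fn, items, max_workers=4):
    """Apply fn to every item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fn, items))


cfg = {"chunk_size": 50}  # hypothetical config, not CreateNpzConfig
paths = [f"patchlets/eop_{i}" for i in range(5)]

# partial fixes the cfg argument so the executor only has to supply the path.
results = run_parallel(partial(extract_stub, cfg=cfg), paths)
```

Swapping in `ProcessPoolExecutor` gives true multiprocessing, provided the mapped function and its arguments can be pickled.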


Concatenate the per-eopatch numpy arrays into a single array for each key.

In [9]:
npys_dict = concatenate_npys(npys)

In [10]:
npys_dict.keys()

dict_keys(['X', 'y_boundary', 'y_extent', 'y_distance', 'timestamps', 'eop_names'])
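
As an illustration of what such a concatenation might look like, and not the actual `concatenate_npys` implementation, the sketch below joins a list of per-patchlet dictionaries along the first axis. The function name `concatenate_per_key` and the small array shapes are assumptions chosen to keep the example light.

```python
import numpy as np


def concatenate_per_key(records):
    """Concatenate the arrays stored under each key along the first axis."""
    keys = records[0].keys()
    return {key: np.concatenate([rec[key] for rec in records], axis=0)
            for key in keys}


# Two hypothetical patchlets with 2 and 3 timestamps respectively.
records = [
    {"X": np.zeros((2, 8, 8, 4)),
     "timestamps": np.array(["2020-03-25", "2020-03-27"])},
    {"X": np.ones((3, 8, 8, 4)),
     "timestamps": np.array(["2020-03-25", "2020-03-27", "2020-03-30"])},
]

merged = concatenate_per_key(records)
```

After this step every key holds one array whose first dimension counts all samples across patchlets.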

Split the large arrays into smaller chunks of size `chunk_size` and save them as `.npz` files.

In [13]:
save_into_chunks(config, npys_dict)
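
The chunking itself can be pictured with the following minimal sketch, which assumes plain local files rather than the notebook's filesystem object. The function `save_chunks`, the file prefix, and the toy arrays are hypothetical and do not reproduce `save_into_chunks`.

```python
import os
import tempfile

import numpy as np


def save_chunks(arrays, chunk_size, out_dir, prefix="chunk"):
    """Write consecutive slices of every array into numbered .npz files."""
    n_rows = len(next(iter(arrays.values())))
    filenames = []
    for chunk_idx, start in enumerate(range(0, n_rows, chunk_size)):
        stop = start + chunk_size  # slicing past the end is safe in numpy
        path = os.path.join(out_dir, f"{prefix}_{chunk_idx}.npz")
        np.savez(path, **{key: arr[start:stop] for key, arr in arrays.items()})
        filenames.append(path)
    return filenames


# Toy data: 7 samples split into chunks of 3, so the last chunk holds 1 sample.
arrays = {"X": np.arange(14).reshape(7, 2), "y_extent": np.arange(7)}

with tempfile.TemporaryDirectory() as tmp_dir:
    files = save_chunks(arrays, chunk_size=3, out_dir=tmp_dir)
    with np.load(files[-1]) as last:
        last_shape = last["X"].shape
```

Only the final file can be smaller than `chunk_size`; all others hold exactly `chunk_size` samples.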

#### Check that results make sense 

In [14]:
npzs = filesystem.listdir(config.output_folder)

In [15]:
len(npzs)

375

In [18]:
test_npz = np.load(filesystem.open(os.path.join(config.output_folder, npzs[0]), 'rb'), 
                   allow_pickle=True)

In [19]:
test_npz['X'].shape, test_npz['y_extent'].shape, test_npz['timestamps'].shape

((50, 256, 256, 4), (50, 256, 256, 1), (50,))

In [20]:
df = pd.read_csv(filesystem.open(config.output_dataframe))

In [21]:
df.head()

| Unnamed: 0 | chunk | eopatch | patchlet | chunk_pos | timestamp |
|---|---|---|---|---|---|
| 0 | patchlets_field_delineation_0.npz | 29TPE_8_0 | data/Castilla/2020-04/patchlets/29TPE_8_0_0 | 0 | 2020-03-25 00:00:00+00:00 |
| 1 | patchlets_field_delineation_0.npz | 29TPE_8_0 | data/Castilla/2020-04/patchlets/29TPE_8_0_0 | 1 | 2020-03-27 00:00:00+00:00 |
| 2 | patchlets_field_delineation_0.npz | 29TPE_8_0 | data/Castilla/2020-04/patchlets/29TPE_8_0_1 | 2 | 2020-03-25 00:00:00+00:00 |
| 3 | patchlets_field_delineation_0.npz | 29TPE_8_0 | data/Castilla/2020-04/patchlets/29TPE_8_0_1 | 3 | 2020-03-27 00:00:00+00:00 |
| 4 | patchlets_field_delineation_0.npz | 29TPE_8_0 | data/Castilla/2020-04/patchlets/29TPE_8_0_2 | 4 | 2020-03-25 00:00:00+00:00 |


In [22]:
len(df)

18701
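
The reported counts can be cross-checked with a little arithmetic, assuming each dataframe row corresponds to one sample and chunks are filled consecutively. The numbers below are copied from the outputs above; the variable names are illustrative.

```python
import math

n_rows = 18701    # len(df) reported above
chunk_size = 50   # chunk_size set in the config
n_files = 375     # len(npzs) reported above

# 18701 / 50 = 374.02, so 375 files are needed to hold all rows.
expected_files = math.ceil(n_rows / chunk_size)

# All chunks but the last are full; the last one holds the remainder.
last_chunk_rows = n_rows - (expected_files - 1) * chunk_size

print(expected_files == n_files, last_chunk_rows)  # True 1
```

Under these assumptions the file count matches, and the last chunk would contain a single sample.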