# Session 11: Data Processing

In [12]:
import os

# For parallel processing
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

# helpers
from grouputils import initialize_rasterizer

## Background

![](https://raw.githubusercontent.com/PermafrostDiscoveryGateway/viz-raster/develop/docs/images/raster_tldr.png)

Today, we will create rasters from the regularly gridded staged data from yesterday (vector data, left side of diagram above). Each pixel of the raster will contain a statistic calculated from the underlying vector data for that pixel.

The two statistics we calculate are:

- number of IWP per pixel
- proportion of pixel covered by IWP

The vector data on the left are geopackages, which can be read in as geodataframes. The rasterizer takes the geometry column of each geodataframe, which contains simple polygons, and represents those polygons as pixels.
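To make the two statistics concrete, here is a minimal sketch in plain Python. It is not how `RasterTiler` works internally: axis-aligned rectangles stand in for IWP polygons, a polygon is counted in every pixel it overlaps, and polygons are assumed not to overlap each other. The function names `overlap_area` and `pixel_stats` are hypothetical.

```python
# Simplified illustration: rectangles (xmin, ymin, xmax, ymax), in pixel
# units, stand in for IWP polygons. Each pixel is a 1 x 1 square.

def overlap_area(rect, px, py):
    """Area of overlap between a rectangle and the unit pixel at column px, row py."""
    xmin, ymin, xmax, ymax = rect
    width = min(xmax, px + 1) - max(xmin, px)
    height = min(ymax, py + 1) - max(ymin, py)
    return max(width, 0.0) * max(height, 0.0)

def pixel_stats(rects, n_cols, n_rows):
    """Return two grids (lists of rows): polygon count and covered proportion per pixel."""
    counts = [[0] * n_cols for _ in range(n_rows)]
    coverage = [[0.0] * n_cols for _ in range(n_rows)]
    for py in range(n_rows):
        for px in range(n_cols):
            for rect in rects:
                area = overlap_area(rect, px, py)
                if area > 0:
                    counts[py][px] += 1
                    # Pixel area is 1, so the overlap area is the proportion
                    # (assumes the rectangles do not overlap each other).
                    coverage[py][px] += area
    return counts, coverage

# Two rectangles on a raster that is 2 pixels wide and 1 pixel tall
rects = [(0.0, 0.0, 1.5, 1.0), (1.5, 0.0, 2.0, 0.5)]
counts, coverage = pixel_stats(rects, n_cols=2, n_rows=1)
print(counts)    # [[1, 2]]
print(coverage)  # [[1.0, 0.75]]
```

Each of the two grids corresponds to one band of the output raster.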

Similar to the `TileStager` from the first step of the group project, in this step we will use a `RasterTiler` class, which we initialize using `initialize_rasterizer`.

In [13]:
# initialize the rasterizer
iwp_rasterizer = initialize_rasterizer('/home/shares/example-pdg-data/SCC-2023')

Just as the `iwp_stager` communicated configuration information to the stager, the `iwp_rasterizer` communicates configuration information to the rasterizer. This includes:

- where the staged files are located
- where to write the raster files
- size of the rasters (in pixels)
- what statistics to calculate (each becomes a band)


Let's have a look at the data we need to rasterize first.

In [14]:
staged_paths = iwp_rasterizer.tiles.get_filenames_from_dir('staged')
print(len(staged_paths))

454


We can use the `rasterize_vector` method on our `RasterTiler` class object, and pass it the first file in our list of staged files to rasterize just one file.

In [15]:
iwp_rasterizer.rasterize_vector(staged_paths[0])

Tile(x=901, y=1060, z=13)

This went pretty quickly, but with hundreds of files, it would still be beneficial to parallelize. Estimate the length of time it would take to process the files in series below:

In [16]:
# estimate duration of rasterization in series:
# (roughly 0.8 seconds per file) x (454 files)
0.8 * 454

363.20000000000005
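The same back-of-the-envelope arithmetic can be extended to an idealized parallel run. The sketch below assumes every file costs the same 0.8 seconds used above and ignores all startup and scheduling overhead, so it gives a lower bound rather than a prediction. The helper name `estimate_seconds` is hypothetical.

```python
import math

def estimate_seconds(n_files, seconds_per_file, n_workers=1):
    """Rough wall-clock estimate assuming equal-cost files and zero overhead."""
    # The slowest worker handles ceil(n_files / n_workers) files.
    files_per_worker = math.ceil(n_files / n_workers)
    return files_per_worker * seconds_per_file

serial = estimate_seconds(454, 0.8)                  # about 363 seconds
parallel = estimate_seconds(454, 0.8, n_workers=11)  # about 34 seconds
print(f"serial: {serial:.0f} s, 11 workers: {parallel:.0f} s")
```

Real runs will be slower than the parallel figure because workers take time to start and tasks compete for disk access.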

## Rasterize in Parallel

Given what you know about the limits of parallelization, discuss in your group whether you think this problem is:

- cpu-bound
- memory-bound
- I/O-bound
- network-bound

How does it compare to the parallelization task from yesterday?

Instead of parallelizing this step over all 454 individual files, we will process the data in batches of files. Below is a function we can use to make batches of files.

In [17]:
# Because rasterization is relatively quick, we want each parsl "task" to process a batch of tiles.
def make_batch(items, batch_size):
    # Create batches of a given size from a list of items.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

Use the `make_batch` function to split our staged data (`staged_paths`) into batches of 10 items each.

In [18]:
# make batches of 10 files
batches = make_batch(staged_paths, 10)
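As a quick check of what `make_batch` produces, the sketch below runs it on a list of 454 made-up file names (hypothetical stand-ins for `staged_paths`). Because 454 is not a multiple of 10, the last batch is shorter than the others.

```python
def make_batch(items, batch_size):
    # Create batches of a given size from a list of items.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical file names, used only to illustrate the batching
fake_paths = [f"tile_{i}.gpkg" for i in range(454)]
demo_batches = make_batch(fake_paths, 10)

print(len(demo_batches))      # 46
print(len(demo_batches[0]))   # 10
print(len(demo_batches[-1]))  # 4
```

Each of these 46 batches becomes one Parsl task.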

Now, set up your Parsl executor again, with `max_workers` set at 11 and `max_blocks` set at 1.

In [19]:
# TEMPLATE FOR PARSL CONFIG:

# activate_env = 'workon scomp'
# htex_config = Config(
#   executors=[
#       HighThroughputExecutor(
#           ..., 
#           provider = LocalProvider(
#             worker_init = activate_env,
#             ...
#           )
#       )
#   ]
# )

activate_env = 'workon scomp'
htex_config = Config(
  executors=[
      HighThroughputExecutor(
          max_workers = 11, 
          provider = LocalProvider(
            worker_init = activate_env,
            max_blocks = 1
          )
      )
  ]
)

parsl.clear()
parsl.load(htex_config)

<parsl.dataflow.dflow.DataFlowKernel at 0x7f82f72d9dc0>

Next, set up your Parsl app to run the `rasterize_vector` method in parallel. Remember that Parsl apps cannot rely on global variables or on packages imported outside the app, so make sure to pass the app every variable it needs.

In [24]:
# Make a Parsl app that uses the rasterize_vector method
@python_app
def rasterize_batch(batch, rasterizer):
    for file in batch:
        rasterizer.rasterize_vector(file)
    
    return 'done'


Now, execute the app in parallel over all of the batches of files you created previously.

While you wait for _all_ the geotiffs to be written, you can monitor the process with the command: `find group-project/geotiff -type f | wc -l` 

The total number of geotiffs for this step will equal the number of staged files.

In [25]:
# Rasterize the batches in parallel
app_futures = []
for batch in batches:
    app_future = rasterize_batch(batch, iwp_rasterizer)
    app_futures.append(app_future)

# Block until all tiles have been rasterized
[app_future.result() for app_future in app_futures]

['done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done',
 'done']
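As an alternative to counting files on disk, progress can also be tracked from Python as each future completes. The sketch below uses only the standard library: a `ThreadPoolExecutor` and the hypothetical function `fake_rasterize_batch` stand in for the Parsl executor and app, so it shows the pattern rather than the Parsl API. Parsl app futures are, to our understanding, compatible with `concurrent.futures.as_completed`, but treat that as an assumption to verify.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_rasterize_batch(batch):
    """Hypothetical stand-in for the Parsl app: pretend to process each file."""
    for _ in batch:
        time.sleep(0.001)
    return len(batch)  # report how many files this batch handled

# 45 made-up "files" split into batches of 10 (the last batch has 5)
demo_batches = [list(range(i, min(i + 10, 45))) for i in range(0, 45, 10)]

files_done = 0
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fake_rasterize_batch, b) for b in demo_batches]
    # as_completed yields each future as soon as it finishes,
    # regardless of submission order.
    for future in as_completed(futures):
        files_done += future.result()
        print(f"{files_done} of 45 files rasterized")
```

Returning the batch size instead of `'done'` is a design choice made here so that the caller can keep a running total.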

Don't forget to add lines to shut down the executor and clear `parsl`.

In [1]:
htex_config.executors[0].shutdown()
parsl.clear()

NameError: name 'htex_config' is not defined

Check to make sure that you have the same number of GeoTIFF files as staged vector tiles:

In [27]:
geotiff_paths = iwp_rasterizer.tiles.get_filenames_from_dir('geotiff')
len(geotiff_paths) == len(staged_paths)

True

We have a 1:1 ratio because we only processed the highest resolution files: those at zoom-level 13.

## Bonus

Thinking back to the group project yesterday, how would you test whether this problem is CPU-bound or I/O-bound? One option is to test whether adding more CPU workers decreases your computation time. Set up a test on a subset of the staged files (the first 100 or so is fine) and time the process at different values of `max_workers`. To make sure you don't overload the server, do not exceed 15 workers in your tests.
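One possible shape for such a timing test is sketched below. To keep it self-contained, it uses a standard-library thread pool and a hypothetical `fake_task` that just sleeps. In the real exercise, the body of `time_run` would instead load a Parsl config with the given `max_workers`, submit the batches, wait for the results, and shut the executor down.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_task(_):
    # Hypothetical stand-in for rasterizing one file
    time.sleep(0.005)

def time_run(n_workers, n_tasks=20):
    """Return the wall-clock seconds needed to run n_tasks with n_workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces all tasks to finish before the timer stops
        list(pool.map(fake_task, range(n_tasks)))
    return time.perf_counter() - start

timings = {n: time_run(n) for n in (1, 2, 4)}
for n, seconds in timings.items():
    print(f"{n} workers: {seconds:.3f} s")
```

If the run time stops improving as workers are added, something other than CPU (for example disk I/O) is likely the limiting factor.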