### Dask GUIs: Monitoring Workers, Tasks, and Memory

Here we'll revisit that lab solution one more time.

But this time, we'll focus less on getting the answers, and more on seeing what Dask is doing.

Specifically, we'll look at some of the elements of the Dask Dashboard GUI.

Once again, we'll start by creating a client with 2 workers, each with 1 thread and 1GB of RAM.

In [None]:
from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=1, memory_limit='1GB')

client

Although we have widgets integrated into JupyterLab, those are just for convenience. The "real" dashboard is at the URL shown above (though it may require some tweaking to work in Binder-hosted containers).
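
If the widget output is not available, the dashboard address can also be read from the client object. The following is a minimal, self-contained sketch rather than part of the lab: it assumes an in-process cluster (`processes=False`) so that it also runs as a plain script, and it uses `dashboard_address=":0"` to ask for any free port. The name `demo_client` is only illustrative.

```python
from dask.distributed import Client

# Assumption: in-process workers so the sketch also runs outside a notebook;
# ":0" lets the dashboard bind to any free port.
with Client(
    processes=False,
    n_workers=2,
    threads_per_worker=1,
    dashboard_address=":0",
) as demo_client:
    # URL of the full dashboard for this cluster
    dashboard_url = demo_client.dashboard_link
    print(dashboard_url)

    # Mapping of worker address -> number of threads on that worker
    threads_by_worker = demo_client.nthreads()
    print(threads_by_worker)
```

Opening the printed URL in a browser should show the same views that the JupyterLab widgets embed.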

When we __Read data__, you'll notice that nothing happens in the Dask GUI widgets, because these operations only build a task graph, which will be executed later:

In [None]:
import dask.dataframe

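# Lazily read the space-delimited file in partitions of roughly 10 MB
ddf = dask.dataframe.read_csv('data/pageviews_small.csv', sep=' ', blocksize='10MB')

ddf.columns = ['project', 'page', 'requests', 'x']

# Drop the unused trailing column
ddf2 = ddf.drop('x', axis=1)

# Keep only the rows for the English project
ddf3 = ddf2[ddf2.project == 'en']
ddf3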

When we __Count__ all the records (and call `.compute()`), we'll see tasks get scheduled. __Before__ running this command, note the memory, CPU, and other readings in the GUI.

In [None]:
ddf2.count().compute()  # all projects

The GUI tells us quite a lot about what happened. If you really want to see how Dask decomposed the computation, you can render a task graph before executing (although you won't normally need to do this, and it requires Graphviz to be installed):

In [None]:
ddf2.count().visualize()

That explains a bit about the hover labels in, e.g., the Task Stream display. It's pretty much what we would expect for a lazy, data-parallel count, where the partitions are Pandas dataframes.

How about the *English-only* count? That one is a bit more complicated because of the filtering:

In [None]:
ddf3.count().visualize()  # English only

In [None]:
ddf3.count().compute()  # English only

### Let's look at Dask's Profile View

Note: almost all of Dask's dashboard views update in real time. The Profile View __does not__. Although Dask collects profiling data behind the scenes, the profiler timeline doesn't update until you click the "Update" button.

At that point you can select a time period from the refreshed timeline, and Dask will render a flame graph from that selected period.

In [None]:
ddf2.set_index('project')
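
The profiling data behind this view can also be pulled into Python. A minimal sketch, assuming `Client.profile()` returns the merged worker samples as a nested dictionary; `busy` is a hypothetical stand-in workload, and the in-process client with the dashboard disabled is used only so the example runs as a plain script.

```python
from dask.distributed import Client


def busy(x):
    # Hypothetical CPU-bound task so the sampling profiler has something to record
    total = 0
    for i in range(200_000):
        total += i * x
    return total


# Assumption: in-process worker, no dashboard, purely for a standalone demo
with Client(
    processes=False,
    n_workers=1,
    threads_per_worker=1,
    dashboard_address=None,
) as demo_client:
    futures = demo_client.map(busy, range(20))
    results = demo_client.gather(futures)

    # Ask the scheduler for the profile data collected on the workers
    profile = demo_client.profile()
    print(type(profile).__name__)
```

Because the profiler samples periodically, a very short run may contain few samples; the dashboard's Profile View draws on the same kind of data once you click "Update".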

# Lab Exercise

With the Wikimedia data, find the total number of pageviews for each project, and create a report with the top 20. 

Hints:
* Use groupby / sum to aggregate
* Use nlargest to report the top 20

The goal is to explore some of these performance tools (GUI widgets and graph visualization), more so than to write code.

<!--
bigger hint: ddf2.groupby('project')['requests'].sum().nlargest(20).compute()
-->

In [None]:
# try it

In [None]:
client.close()