# Midas Demo

Welcome to Midas! Midas is a research prototype that helps us explore a _new programming medium_, in which we provide tools to help you move smoothly between coding and **interactions on visualizations**. That sounds rather abstract, so let's get started on the demo, which will give you concrete examples of when you might want to move from code to interactions and back.

Please follow the tutorial to learn the basics of Midas. We will walk through a real example and introduce each feature as it is motivated by a specific need. Until you are done with the tutorial, please do *not* click around the interface at random; do only what the tutorial asks. This will make each step easier to follow.

Should you have any questions, please feel free to ask Yifan, who will be present during the entire session.

<font color="gray">If you are using Ubuntu and do not see a blue dot button in the menu bar, that means emojis are not showing properly for you. Please try running `!sudo apt reinstall fonts-noto-color-emoji` in a cell.</font>

In [None]:
from midas import Midas
import numpy as np

# Initiate Midas environment
m = Midas()

No need to read through the text in the blue panel; we'll walk you through most of it.

In [None]:
# Load data
fires_df = m.from_file("./data/fire_earlier.csv")
fires_df

You will see a blue bar below the cell. It marks where cells generated by Midas will be placed. You will see what we mean in the next step.

### <font color="0080FF">Exploratory Analysis: Distribution of Fire Causes</font>

From the above table, the first question that comes to mind is the distribution of fire causes (`CAUSE_DESCR`). Conveniently, Midas has a shortcut for seeing column distributions: **please find `CAUSE_DESCR` in the panel to the right and click it**.

A cell with a yellow emoji may be created above, because we always create cells after the last executed cell. **Please move it here for organization** ⬇️. The cell describes the query used to derive the data for the chart, which is a `group` operation; by default, the aggregation is a `count`. You will also see that `.vis` is called at the end, which is useful for customizing your own visualizations. We'll talk more about this later.
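
For intuition, a `group` with the default `count` aggregation amounts to counting how often each distinct value occurs in a column. The following is a minimal plain-Python sketch of that idea, not Midas's implementation; the `causes` list and the `group_count` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical sample of a CAUSE_DESCR column (illustrative values only).
causes = ["Lightning", "Arson", "Lightning", "Debris Burning", "Lightning", "Arson"]

def group_count(values):
    """Return (value, count) pairs, one per distinct value, sorted by value."""
    return sorted(Counter(values).items())

cause_counts = group_count(causes)
for cause, count in cause_counts:
    print(f"{cause}: {count}")
```

Each pair corresponds to one bar in the distribution chart: the value is the bar's label and the count is its height.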

### <font color="0080FF">Exploratory Analysis: Distribution of State</font>

Another thing we can look at is how many fires there are by state (`STATE`). Again you can go ahead and **click on the `STATE` column in the yellow pane**.

Now you might want to see more of the charts. To do this, **drag the left edge of the panel to change its size**. If the panel becomes too large, you can **click on "toggle midas" and "toggle column pane" in the menu bar to hide/show the shelves**.

### <font color="0080FF">Investigation: Cause Distribution in California</font>

Let's get a sense of which fires happen in California: **please click on CA in the `STATE_fires_df_dist` chart**. You will see shorter blue bars appear in the `CAUSE_DESCR_fires_df_dist` chart; they show the distribution of `CAUSE_DESCR` filtered by `CA`.
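
Conceptually, a selection on one chart filters the underlying rows, and the other chart's counts are recomputed on the rows that remain. Here is a small plain-Python sketch of that idea under our own assumptions; the `rows` data and the `cause_distribution` helper are hypothetical and are not part of the Midas API.

```python
from collections import Counter

# Hypothetical (STATE, CAUSE_DESCR) rows; illustrative only.
rows = [
    ("CA", "Lightning"),
    ("CA", "Arson"),
    ("NY", "Debris Burning"),
    ("CA", "Lightning"),
    ("NY", "Arson"),
]

def cause_distribution(rows, selected_states=None):
    """Count causes, restricted to the selected states when a selection is given."""
    if selected_states:
        rows = [row for row in rows if row[0] in selected_states]
    return dict(Counter(cause for _, cause in rows))

all_states = cause_distribution(rows)       # the unfiltered distribution
ca_only = cause_distribution(rows, ["CA"])  # the distribution filtered by CA
print(all_states)
print(ca_only)
```

An empty selection leaves the rows unfiltered, so the filtered and unfiltered distributions coincide.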

You will also see cells created that are annotated with "🔵"; these serve as an executable log of the interactions. You can try executing one with a different value, e.g. change 'CA' to 'NY'. Or you can programmatically clear the selection using `m.sel([])`.

You will also notice that the chart which you interacted with is highlighted in red---this serves as an additional visual cue of active selections.

Since the filtered values are a small portion of the chart, they might be hard to compare. **Please click on the ellipsis icon next to the chart name and then the "show filtered data only" icon**. This will remove the paler blue bars in the background (fire counts across all states) and help you see the filtered bars better. To close the dropdown menu, click on the ellipsis button again.

We notice an interesting observation: California gets hit by lightning a lot. To record this insight, **please click on the 📷 icon in the menu bar**. You will see a cell created with a 🟠 emoji that contains both of the current charts, as well as the code used to derive the data on the charts, which may be helpful for reproducibility. <font color="gray">(The snapshotted charts are stored as SVG, an XML-based vector image format.)</font>

To manually record the top causes for California in text, we could run the following cell and copy the two values out: `'Miscellaneous', 'Equipment Use'`. This might seem heavy-handed for this example, but you may find it helpful in other scenarios.

In [None]:
CAUSE_DESCR_fires_df_dist.get_filtered_data()\
                         .sort('count')\
                         .head(2)['CAUSE_DESCR']

### <font color="0080FF">Investigation: Lightning and Land Area?</font>
Now let's dig a little further and see what other states are affected by lightning. **Please click on the Lightning bar**. Perhaps the count of lightning fires is mostly due to land area? From this [source](https://www.usgs.gov/special-topic/water-science-school/science/how-wet-your-state-water-area-each-state?qt-science_center_objects=0#qt-science_center_objects), we can get the area of each state to compare with our data.

In [None]:
# Load data
land_size_df = m.from_file("./data/state_land_sizes.csv")
land_size_df.head(3)

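The fire data identifies states by two-letter abbreviation (e.g. `CA`), while the land-size data uses full state names in its `state_name` column. We add a `STATE` column of abbreviations to `land_size_df` so that the two dataframes can be joined.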

In [None]:
state_dict = {"Alabama":"AL", "Alaska":"AK", "Arizona":"AZ", "Arkansas":"AR", "California":"CA", "Colorado":"CO", "Connecticut":"CT", "Delaware":"DE", "Florida":"FL", "Georgia":"GA", "Hawaii":"HI", "Idaho":"ID", "Illinois":"IL", "Indiana":"IN", "Iowa":"IA", "Kansas":"KS", "Kentucky":"KY", "Louisiana":"LA", "Maine":"ME", "Maryland":"MD", "Massachusetts":"MA", "Michigan":"MI", "Minnesota":"MN", "Mississippi":"MS", "Missouri":"MO", "Montana":"MT", "Nebraska":"NE", "Nevada":"NV", "New Hampshire":"NH", "New Jersey":"NJ", "New Mexico":"NM", "New York":"NY", "North Carolina":"NC", "North Dakota":"ND", "Ohio":"OH", "Oklahoma":"OK", "Oregon":"OR", "Pennsylvania":"PA", "Rhode Island":"RI", "South Carolina":"SC", "South Dakota":"SD", "Tennessee":"TN", "Texas":"TX", "Utah":"UT", "Vermont":"VT", "Virginia":"VA", "Washington":"WA", "West Virginia":"WV", "Wisconsin":"WI", "Wyoming":"WY", "District of Columbia": "DC"}
land_size_df['STATE'] = land_size_df.apply(lambda x: state_dict[x], 'state_name')
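
One thing to keep in mind: a direct dictionary lookup such as `state_dict[x]` raises a `KeyError` for any name that is not in the dictionary. A quick check like the sketch below can reveal such names beforehand. The `unmapped_names` helper and the sample values are hypothetical and only illustrate the idea.

```python
# Hypothetical subset of a name-to-abbreviation mapping.
state_abbrev = {"California": "CA", "New York": "NY", "Texas": "TX"}

def unmapped_names(names, mapping):
    """Return the names that have no entry in the mapping, in sorted order."""
    return sorted({name for name in names if name not in mapping})

# Illustrative state_name values, including one the mapping does not cover.
state_names = ["California", "Texas", "Puerto Rico", "New York"]
missing = unmapped_names(state_names, state_abbrev)
print(missing)
```

If the result is empty, every name can be translated and the lookup is safe to apply to the whole column.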


### <font color="0080FF">Analyzing distribution of lightning across states</font>

We can also access the data programmatically. In the cell below, we **use `get_filtered_data` to directly access the filtered result**. Tip: type "get_", then press the "Tab" key for auto-complete!

In [None]:
# now let's get the data filtered by lightning
lightning_df = STATE_fires_df_dist.get_filtered_data()
lightning_df

In [None]:
# join the lightning counts with each state's total area so we can plot one against the other
count_and_area_df = lightning_df.join('STATE', land_size_df.select(['STATE', 'area_sq_miles']), 'STATE')
count_area_scatter = count_and_area_df.select(['count', 'area_sq_miles'])
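
To make the join step concrete, here is a plain-Python sketch of matching two tables on a shared key. It assumes the join keeps only states that appear in both tables (an inner join), which we have not verified against Midas itself. All numbers are made up, and `inner_join` is a hypothetical helper.

```python
# Hypothetical lightning counts and areas keyed by state abbreviation (made-up numbers).
lightning_counts = {"CA": 120, "NY": 15, "TX": 60}
area_sq_miles = {"CA": 160, "TX": 270, "AK": 660}

def inner_join(left, right):
    """Pair up values for keys present in both dicts, ordered by key."""
    return [(key, left[key], right[key]) for key in sorted(left.keys() & right.keys())]

joined = inner_join(lightning_counts, area_sq_miles)
print(joined)
```

Under this assumption, states missing from either table (here `NY` and `AK`) drop out of the result, so it is worth comparing row counts before and after a join.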

**You can also use `static_vis` to quickly look at one-off visualizations.** The static visualization is generated right below the cell, and _not_ in the Midas chart area, because it does not reactively update based on interactions.

<font color="gray">If you don't see any output, try running `!jupyter nbextension install --sys-prefix --py vega` in the cell after.</font>

In [None]:
count_area_scatter.static_vis()

In [None]:
count_area_scatter.corr()
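
For readers who want to see what a correlation coefficient measures, the sketch below computes the Pearson correlation by hand. We assume here that `corr()` reports a Pearson coefficient; this notebook does not confirm that. The `pearson` helper and the sample numbers are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up counts that grow in exact proportion to made-up areas.
counts = [10, 20, 30, 40]
areas = [1.0, 2.0, 3.0, 4.0]
r = pearson(counts, areas)
print(round(r, 3))
```

A value near 1 means the counts rise almost linearly with area, a value near 0 means there is little linear relationship, and a value near -1 means the counts fall as area rises.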

**Perhaps we should compare against land area rather than total area.**

In [None]:
count_and_area_df = lightning_df.join('STATE', land_size_df.select(['STATE', 'land_sq_miles']), 'STATE')
count_land_scatter = count_and_area_df.select(['count', 'land_sq_miles'])
count_land_scatter.static_vis()

Now that we no longer need to look at `land_size_df`, **please click on the text in the yellow pane to hide the columns**.

### <font color="0080FF">Exploratory Analysis: Distribution of discovery time</font>

**Please go ahead and click on the `DISCOVERY_TIME` column.** For numeric values, the default selection is a brush (made by dragging). If you want to make it click-based, try running the following cell, which passes in `selection_type="multiclick"`.

In [None]:
DISCOVERY_TIME_fires_df_dist = fires_df.group('DISCOVERY_TIME_bin').vis(selection_type="multiclick")
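
The `_bin` suffix in `DISCOVERY_TIME_bin` suggests that numeric values are bucketed into bins before they are grouped. That is our reading of the name, not something this notebook confirms. A minimal sketch of fixed-width binning, with a hypothetical `bin_floor` helper and made-up values:

```python
from collections import Counter

def bin_floor(value, bin_width):
    """Map a number to the lower edge of its fixed-width bin."""
    return (value // bin_width) * bin_width

# Made-up numeric values; illustrative only.
times = [15, 130, 145, 1230, 1259, 2350]
binned_counts = sorted(Counter(bin_floor(t, 100) for t in times).items())
print(binned_counts)
```

Each pair is a bin's lower edge and the number of values that fall into it, which is the data a histogram needs.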

### <font color="0080FF">Exploratory Analysis: Distribution of fire size and discovery time</font>

In the cell below, we create our own **custom visualization** and add it to Midas interactions via `vis`.

In [None]:
size_time_df = fires_df.select(['DISCOVERY_TIME', 'FIRE_SIZE'])
size_time_df.vis()

### <font color="0080FF">Understanding Geo</font>

To get a sense of where the fires are distributed, we can use external visualization tools, like `folium`.

In [None]:
import folium
import folium.plugins

CENTER_COORD = (39, -98) # center of US
us_map = folium.Map(location=CENTER_COORD, zoom_start=2)
locs = fires_df.select(['LATITUDE', 'LONGITUDE']).to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius = 10)
us_map.add_child(heatmap)

We can use **reactive cells** to drive the filtering---first let's understand what reactive cells do with the following example.

The API `m.immediate_value` shows the current filtered value, and the `%%reactive` cell magic makes the cell run every time there is a UI-based selection on the dataframe `CAUSE_DESCR_fires_df_dist`.

In [None]:
%%reactive -df CAUSE_DESCR_fires_df_dist
# note that the df name is NOT in quotes

m.immediate_value
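
If the reactive behavior feels opaque, it may help to think of it as a callback that re-runs whenever the selection changes. The sketch below illustrates that pattern in plain Python. It is our own simplification, not how Midas implements `%%reactive`, and the `Selection` class is hypothetical.

```python
class Selection:
    """Holds the current selection and re-runs subscribers when it changes."""

    def __init__(self):
        self.value = []
        self._callbacks = []

    def subscribe(self, callback):
        """Register a function to call with the new value on every change."""
        self._callbacks.append(callback)

    def select(self, new_value):
        """Update the selection and notify every subscriber."""
        self.value = list(new_value)
        for callback in self._callbacks:
            callback(self.value)

seen = []  # every value the "reactive cell" was run with, in order
selection = Selection()
selection.subscribe(lambda value: seen.append(value))

selection.select(["Lightning"])
selection.select(["Lightning", "Arson"])
selection.select([])
print(seen)
```

In this analogy, the body of a reactive cell plays the role of the subscribed function, and `m.immediate_value` plays the role of the value passed to it.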

Now we can take it to a more complex case: let's redraw the map as we select different causes for the fire. By placing `m.immediate_value` into the predicate that filters the original dataframe, we can redraw the heatmap for only the selected causes.

In [None]:
%%reactive -df CAUSE_DESCR_fires_df_dist

CENTER_COORD = (39, -98)
us_map = folium.Map(location=CENTER_COORD, zoom_start=4)
locs = fires_df.where('CAUSE_DESCR', m.are.contained_in(m.immediate_value)).select(['LATITUDE', 'LONGITUDE']).to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius = 10)
us_map.add_child(heatmap)

### <font color="0080FF">More fires data</font>

After the initial analysis we now find another data source with the same schema but different data.

In [None]:
later_fires_df = m.from_file("./data/fire_earlier.csv")
later_fires_df

#### Reuse analysis

To reuse the visual analysis you have performed, you can look at the code/log generated by Midas, or directly copy the code used to derive charts you found interesting. Click the ellipsis button and then the "get code to clipboard" button to copy the code for a chart. Then modify the copied code to apply to the new dataframe; it will look something like the below:
```
later_fires_df.where('CAUSE_DESCR', m.are.contained_in(superstring=['Debris Burning']), None)\
              .group('STATE', None)\
              .static_vis()
```

### <font color="0080FF">Taking note of what you have looked at</font>

Often in data analysis, it's important to keep track of what data you have looked at and what data you have not. The `all_selections` attribute below returns all the interactions you have made. We have limited the output to the most recent 5, which you should feel free to change. An example observation based on the history is _"I have looked at `Lightning` and `Debris Burning` in more detail; it might also be interesting to look at fire sizes next time"_.

In [None]:
m.all_selections[-5:]