# Midas Tutorial

Hello! Please follow this tutorial to learn the basics of Midas. We will walk through a real example and introduce each feature as it is motivated by a specific need. Until you are done with the tutorial, please do *not* click around the interface at random; only do what is asked in the tutorial. This will help you follow along.

Should you have any questions, please feel free to ask Yifan, who will be present during the entire session.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from midas import Midas
import numpy as np

# Initiate Midas environment
m = Midas()

No need to read through the text in the blue panel---we'll walk you through most of it.

In [None]:
# Load data
fires_df = m.from_file("./data/fires.csv")
fires_df

You will see a blue bar below the cell---it indicates where "Midas generated cells" will be placed. You will see what we mean in the next step.

### <font color="0080FF">Exploratory Analysis: Distribution of Fire Causes</font>

From the above table, the first question that comes to mind is the distribution of the causes of fire (`CAUSE_DESCR`). Conveniently, Midas has a shortcut for seeing column distributions---**please find `CAUSE_DESCR` in the panel to the right and click it**.

A cell with a yellow emoji may be created above---that is because we always create cells after the last executed cell. **Please move it here for organization** ⬇️. The cell describes the query used to derive the data for the chart, which is a `group` operation; by default, the aggregation is a `count`. You will also see that `.vis` is called at the end---it's useful for customizing your own visualizations. We'll talk more about this later.
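
To make the `group` plus `count` idea concrete, here is a minimal plain-Python sketch of what such a query computes. It does not use Midas; the `rows` list and the `group_count` helper are hypothetical stand-ins with made-up values, not part of the Midas API.

```python
from collections import Counter

# Hypothetical sample rows standing in for fires_df; the values are made up.
rows = [
    {"STATE": "CA", "CAUSE_DESCR": "Lightning"},
    {"STATE": "CA", "CAUSE_DESCR": "Debris Burning"},
    {"STATE": "TX", "CAUSE_DESCR": "Lightning"},
    {"STATE": "GA", "CAUSE_DESCR": "Arson"},
]


def group_count(rows, column):
    """Count how many rows fall under each distinct value of `column`."""
    return Counter(row[column] for row in rows)


cause_counts = group_count(rows, "CAUSE_DESCR")
for cause, count in cause_counts.most_common():
    print(cause, count)
```

Each bar in the distribution chart corresponds to one entry of such a count table.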

### <font color="0080FF">Exploratory Analysis: Distribution of State</font>

Another thing we can look at is how many fires there are by state (`STATE`). Again, you can go ahead and click on the `STATE` column in the yellow pane.

### <font color="0080FF">Investigation: Cause Distribution in California</font>

Let's get a sense of what fires happen in California---**please click on CA in the `STATE_fires_df_dist` chart**. You will see some shorter blue bars appear in the `CAUSE_DESCR_fires_df_dist` chart---this is the distribution of `CAUSE_DESCR` filtered by `CA`.

Since the filtered values are a smaller portion of the chart, it might be hard to compare. **Please click on the 📌 icon next to the chart name**.  This will remove the paler blue bars in the background (fire counts across all states), and help you see better.
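
The relationship between the paler background bars (all states) and the filtered bars (one state) can be sketched in plain Python. This is only an illustration with made-up rows; `filtered_count` is a hypothetical helper, not a Midas function.

```python
from collections import Counter

# Hypothetical sample rows standing in for fires_df; the values are made up.
rows = [
    {"STATE": "CA", "CAUSE_DESCR": "Lightning"},
    {"STATE": "CA", "CAUSE_DESCR": "Lightning"},
    {"STATE": "CA", "CAUSE_DESCR": "Arson"},
    {"STATE": "TX", "CAUSE_DESCR": "Lightning"},
    {"STATE": "TX", "CAUSE_DESCR": "Debris Burning"},
]


def filtered_count(rows, filter_column, filter_value, group_column):
    """Count values of `group_column` among rows matching the filter."""
    return Counter(
        row[group_column] for row in rows if row[filter_column] == filter_value
    )


# Background bars: counts across all states.
all_counts = Counter(row["CAUSE_DESCR"] for row in rows)
# Foreground bars: counts for California only.
ca_counts = filtered_count(rows, "STATE", "CA", "CAUSE_DESCR")

for cause, total in all_counts.items():
    print(f"{cause}: {ca_counts[cause]} of {total} fires are in CA")
```

A filtered count can never exceed the overall count for the same cause, which is why the filtered bars are always the shorter ones.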

We notice the interesting observation that California gets hit by lightning a lot. To record this insight, **please click on the 📷 icon in the menu bar**. You will see a cell created with a 🟠 emoji that contains the SVGs of both of the current charts, as well as the code used to derive the data on the charts, which may be helpful for reproducibility.

Now you might find that you want to see more of the charts---to do this, you can **drag the left edge of the panel to change the size**. If this is too large, you can **click on "toggle midas" and "toggle column shelf" in the menu bar to hide/show the shelves**.

### <font color="0080FF">Investigation: Lightning</font>

Now let's dig a little further and see what other states are affected by lightning. **Please click on the Lightning bar**. Perhaps the count of lightning fires is mostly due to land area? From this [source](https://www.usgs.gov/special-topic/water-science-school/science/how-wet-your-state-water-area-each-state?qt-science_center_objects=0#qt-science_center_objects), we can get the area of each state to compare with our data.

In [None]:
# Load data
land_size_df = m.from_file("./data/state_land_sizes.csv")
land_size_df


In [None]:
state_dict = {"Alabama":"AL", "Alaska":"AK", "Arizona":"AZ", "Arkansas":"AR", "California":"CA", "Colorado":"CO", "Connecticut":"CT", "Delaware":"DE", "Florida":"FL", "Georgia":"GA", "Hawaii":"HI", "Idaho":"ID", "Illinois":"IL", "Indiana":"IN", "Iowa":"IA", "Kansas":"KS", "Kentucky":"KY", "Louisiana":"LA", "Maine":"ME", "Maryland":"MD", "Massachusetts":"MA", "Michigan":"MI", "Minnesota":"MN", "Mississippi":"MS", "Missouri":"MO", "Montana":"MT", "Nebraska":"NE", "Nevada":"NV", "New Hampshire":"NH", "New Jersey":"NJ", "New Mexico":"NM", "New York":"NY", "North Carolina":"NC", "North Dakota":"ND", "Ohio":"OH", "Oklahoma":"OK", "Oregon":"OR", "Pennsylvania":"PA", "Rhode Island":"RI", "South Carolina":"SC", "South Dakota":"SD", "Tennessee":"TN", "Texas":"TX", "Utah":"UT", "Vermont":"VT", "Virginia":"VA", "Washington":"WA", "West Virginia":"WV", "Wisconsin":"WI", "Wyoming":"WY", "District of Columbia": "DC"}
land_size_df['STATE'] = land_size_df.apply(lambda x: state_dict[x], 'state_name')
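
Note that a lookup like `state_dict[x]` raises a `KeyError` for any name that is missing from the dictionary. Whether the CSV contains such names is not known here, so the following is only a defensive sketch: `state_abbrev`, `names`, and `split_known_unknown` are hypothetical and shortened for illustration.

```python
# Hypothetical, shortened mapping; the tutorial's state_dict has all states.
state_abbrev = {"California": "CA", "Texas": "TX", "District of Columbia": "DC"}

# Hypothetical column values; "Puerto Rico" is deliberately not in the mapping.
names = ["California", "Texas", "Puerto Rico"]


def split_known_unknown(names, mapping):
    """Return (name -> abbreviation for mapped names, list of unmapped names)."""
    known = {name: mapping[name] for name in names if name in mapping}
    unknown = [name for name in names if name not in mapping]
    return known, unknown


known, unknown = split_known_unknown(names, state_abbrev)
print("mapped:", known)
print("not in mapping:", unknown)
```

Checking the unmapped names first makes it easier to decide whether to extend the dictionary or drop those rows.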


### <font color="0080FF">Analyzing distribution of lightning across states</font>

We can also access the data programmatically---in the cell below, we **use `get_filtered_data` to directly access the filtered result**. Tip: type "get_", then press the "Tab" key for auto-complete!

In [None]:
# now let's get the data filtered by lightning
lightening_df = STATE_fires_df_dist.get_filtered_data()
lightening_df

In [None]:
# now we prepare the data for a scatter plot of lightning fire counts against state area
count_and_area_df = lightening_df.join('STATE', land_size_df.select(['STATE', 'area_sq_miles']), 'STATE')
count_area_scatter = count_and_area_df.select(['count', 'area_sq_miles'])
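
As a rough picture of what joining on `STATE` does, here is a plain-Python sketch. It assumes the join keeps only states that appear in both tables (an inner join); the sample numbers are made up and `inner_join` is a hypothetical helper.

```python
# Hypothetical per-state lightning fire counts; the numbers are made up.
lightning_counts = [
    {"STATE": "CA", "count": 120},
    {"STATE": "TX", "count": 80},
    {"STATE": "ZZ", "count": 5},  # no matching row in land_sizes
]

# Hypothetical per-state areas; the numbers are made up.
land_sizes = [
    {"STATE": "CA", "area_sq_miles": 1000},
    {"STATE": "TX", "area_sq_miles": 2000},
]


def inner_join(left, right, key):
    """Combine rows from both tables that share the same value of `key`."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


count_and_area = inner_join(lightning_counts, land_sizes, "STATE")
for row in count_and_area:
    print(row)
```

Rows without a partner in the other table, such as `ZZ` here, drop out of the result under this assumption.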

**You can also use `static_vis` to quickly look at one-off visualizations.**

In [None]:
count_area_scatter.static_vis()

In [None]:
count_area_scatter.corr()
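
For intuition about the number reported above, the sketch below computes a Pearson correlation coefficient by hand. This assumes that `corr` reports a Pearson coefficient, which the tutorial does not state; `pearson` is a hypothetical helper and the inputs are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: counts grow in exact proportion to area.
counts = [10, 20, 30, 40]
areas = [100, 200, 300, 400]
r = pearson(counts, areas)
print(r)
```

A value near 1 would suggest that larger states simply have more lightning fires, while a value near 0 would suggest that area alone does not explain the counts.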

In [None]:

# repeat the comparison using the `land_sq_miles` column instead of `area_sq_miles`
count_and_area_df = lightening_df.join('STATE', land_size_df.select(['STATE', 'land_sq_miles']), 'STATE')
count_land_scatter = count_and_area_df.select(['count', 'land_sq_miles'])
count_land_scatter.static_vis()


Now that we no longer need to look at `land_size_df`, **please click on the text in the yellow pane to hide the columns**.

### <font color="0080FF">Exploratory Analysis: Distribution of fire size and discovery time</font>

In the cell below, we can create our own custom visualizations.

In [None]:
size_time_df = fires_df.select(['DISCOVERY_TIME', 'FIRE_SIZE'])
size_time_df.vis()

### <font color="0080FF">Understanding Geo</font>

To get a sense of where the fires are distributed, we can use external visualization tools, like `folium`.

In [None]:
import folium
import folium.plugins

# approximate geographic center of the contiguous US: 39.8283° N, 98.5795° W
CENTER_COORD = (39, -98)
us_map = folium.Map(location=CENTER_COORD, zoom_start=2)
locs = fires_df.select(['LATITUDE', 'LONGITUDE']).to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius = 10)
us_map.add_child(heatmap)

We can use **reactive cells** to drive the filtering---first let's understand what reactive cells do with the following example.

In [None]:
%%reactive -df CAUSE_DESCR_fires_df_dist
# note that the df name is NOT in quotes

m.current_selection
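
Conceptually, a reactive cell behaves like a callback that is re-run whenever the selection on the named chart changes. The sketch below illustrates that idea in plain Python; it is not how Midas implements reactive cells, and `SelectionSource` is a hypothetical class.

```python
class SelectionSource:
    """Hypothetical stand-in for a chart whose selection can change."""

    def __init__(self):
        self._callbacks = []
        self.current_selection = None

    def on_change(self, callback):
        """Register a function to re-run on every selection change."""
        self._callbacks.append(callback)

    def select(self, values):
        """Update the selection and re-run all registered callbacks."""
        self.current_selection = values
        for callback in self._callbacks:
            callback(values)


seen = []
chart = SelectionSource()
# The "reactive cell": here it just records each selection it is given.
chart.on_change(lambda values: seen.append(list(values)))

chart.select(["Lightning"])
chart.select(["Lightning", "Arson"])
print(seen)
```

Each call to `select` plays the role of a click on the chart, and the registered function plays the role of the cell body.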

Now we can take it to a more complex case---let's redraw the map as the selection changes. This way, you can facet the maps based on different cause descriptions.

In [None]:
%%reactive -df CAUSE_DESCR_fires_df_dist

CENTER_COORD = (39, -98)
us_map = folium.Map(location=CENTER_COORD, zoom_start=4)
locs = fires_df.where('CAUSE_DESCR', m.are.contained_in(m.immediate_value)).select(['LATITUDE', 'LONGITUDE']).to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius = 10)
us_map.add_child(heatmap)
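
The filtering step in the cell above keeps only the fires whose cause is among the selected values and then extracts their coordinates. A plain-Python sketch of that step follows; the rows are made up and `coords_for_causes` is a hypothetical helper, not a Midas or folium function.

```python
# Hypothetical sample rows standing in for fires_df; the values are made up.
rows = [
    {"CAUSE_DESCR": "Lightning", "LATITUDE": 40.0, "LONGITUDE": -120.0},
    {"CAUSE_DESCR": "Arson", "LATITUDE": 33.0, "LONGITUDE": -84.0},
    {"CAUSE_DESCR": "Lightning", "LATITUDE": 45.0, "LONGITUDE": -110.0},
]


def coords_for_causes(rows, selected_causes):
    """Return [latitude, longitude] pairs for rows with a selected cause."""
    selected = set(selected_causes)
    return [
        [row["LATITUDE"], row["LONGITUDE"]]
        for row in rows
        if row["CAUSE_DESCR"] in selected
    ]


locs = coords_for_causes(rows, ["Lightning"])
print(locs)
```

A list of coordinate pairs in this shape is what the heatmap layer receives in the cell above via `locs.tolist()`.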

### <font color="0080FF">Working with a dataset of the same schema</font>

Often, we analyze two datasets with the same schema but different content, for example because of sampling. To reuse the visual analysis you have performed, you can look at the code/log generated by Midas, or directly copy the code used to derive the charts you found interesting. You can do this by looking at the code stored in the snapshot, by clicking the ellipsis button and then the "get code to clipboard" button, or by using reactive cells.
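
Before reusing copied code on a second file, it can help to confirm that both files really share a schema. The sketch below compares CSV headers using only the standard library; the sample CSV strings and the `read_header` and `same_schema` helpers are hypothetical.

```python
import csv
import io


def read_header(csv_text):
    """Return the list of column names from the first line of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def same_schema(csv_text_a, csv_text_b):
    """True if both CSV texts have identical column names in the same order."""
    return read_header(csv_text_a) == read_header(csv_text_b)


# Made-up samples: a and b share a schema, c is missing a column.
sample_a = "STATE,CAUSE_DESCR,FIRE_SIZE\nCA,Lightning,1.5\n"
sample_b = "STATE,CAUSE_DESCR,FIRE_SIZE\nTX,Arson,0.2\n"
sample_c = "STATE,FIRE_SIZE\nTX,0.2\n"

print(same_schema(sample_a, sample_b))
print(same_schema(sample_a, sample_c))
```

With real files, the same check would read the first line of each file instead of an in-memory string.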

### <font color="0080FF">Taking note of what you have looked at</font>

Often in data analysis, it's important to understand what data you have looked at and what data you have not. The following attribute, `all_selections`, returns all the interactions you have made. We have limited the output to the most recent 5, which you should feel free to change. An example observation based on the history is _"I have looked at `Lightning` and `Debris Burning` in more detail; it might also be interesting to look at fire sizes next time"_.

In [None]:
m.all_selections[-5:]
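
One way to turn such a history into an observation is to list which columns have not been explored yet. The sketch below assumes a simple record shape for each selection; the actual structure returned by `all_selections` is not shown in this tutorial, so `history`, `columns_explored`, and `unexplored` are all hypothetical.

```python
from collections import Counter

# Hypothetical selection records; the real structure may differ.
history = [
    {"column": "STATE", "values": ["CA"]},
    {"column": "CAUSE_DESCR", "values": ["Lightning"]},
    {"column": "CAUSE_DESCR", "values": ["Debris Burning"]},
]


def columns_explored(history, last_n=5):
    """Count selections per column among the most recent `last_n` records."""
    return Counter(record["column"] for record in history[-last_n:])


def unexplored(all_columns, history):
    """Return the columns that never appear in the selection history."""
    seen = {record["column"] for record in history}
    return [column for column in all_columns if column not in seen]


print(columns_explored(history))
print(unexplored(["STATE", "CAUSE_DESCR", "FIRE_SIZE"], history))
```

Here the second result points at `FIRE_SIZE` as a column that has not been selected on yet.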