# Demo for Persist Jupyter Extension

#### You can follow the tutorial at the following URL:

<a href="https://tinyurl.com/44euyhtw" style="font-size: 2em"><code>https://tinyurl.com/44euyhtw</code></a>

<img src="thesis_demo_qr.png" alt="QR Code for https://tinyurl.com/44euyhtw" width="500"/>

# Installation

<div>
<ol style="font-size: 1.3em">
<li>
    Install using pip:<br/>
    <code>pip install persist_ext</code>
</li>
<li>
Restart the kernel
</li>
</ol>
</div>

# Analyzing Utah Avalanche Data

## Introduction

For this live demo, I will do some basic data cleanup tasks on avalanche data taken from the [Utah Avalanche Center](https://utahavalanchecenter.org/). The data provides insights into [avalanche occurrences](https://utahavalanchecenter.org/avalanches) in Utah.

In [3]:
import pandas as pd # For saving/loading data
import altair as alt # For creating Vega-Altair visualizations

import persist_ext as PR # Load Persist extension

## Column Manipulation Tasks

#### Delete columns `Coordinates` and `Comments 1`
#### Remove `;` from `;Region` and `;Trigger`

## Data Description

The table below describes the different columns in the dataset. Each row in the dataset represents a reported avalanche with location, trigger, and aspect details. The data spans multiple years, from __2004__ to __2023__.

| Column          | Description                                                    |
|-----------------|----------------------------------------------------------------|
| Date            | Date on which the avalanche was recorded                       |
| Region          | Region in Utah where the avalanche occurred                    |
| Place           | Exact location where the avalanche was recorded                |
| Trigger         | Cause of the avalanche                                         |
| Weak Layer      | Layer of snow that was weakest and likely to fail              |
| Depth_inches    | Depth of the avalanche in inches                               |
| Width_inches    | Width of the avalanche in inches                               |
| Vertical_inches | Vertical distance covered by the avalanche in inches           |
| Aspect          | Direction of the slope where the avalanche occurred            |
| Elevation_feet  | Elevation of the location in feet                              |
| Coordinates     | Approximate geographical coordinates of the avalanche location |
| Comments 1      | Additional comments provided by the reporter                   |

In [37]:
df = pd.read_csv('./avalanches_data.csv')
print("\n\n".join(df.columns.tolist()))

;Region

Month

Day

Year

;Trigger

;Weak Layer

Depth_inches

Vertical_inches

;Aspect

Elevation_feet

Coordinates

Comments 1


In [31]:
PR.PersistTable(df)

PersistWidget(data_values=[{'__id_column': '1', ';Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, '…

In [36]:
print("\n\n".join(avalanche_demo_df.columns.tolist()))

Region

Trigger

Depth_inches

Coordinates

Comments 1

Elevation_feet

Year

Month


In [14]:
# avalanche_demo_df.to_csv("avalanche_data_demo.csv", index=False)

In [None]:
df_task_1a.head()

#### Pandas

#### **Instructions**
1. **Column Removal:**
	- Remove the specified columns using Pandas commands.
2. **Generate dataframe:**
	- Assign the modified dataframe to variable `df_task_1a`
3. **Show Output:**
	- Print the head of `df_task_1a` to show the changes.

In [None]:
df.head()

In [None]:
df_task_1a = df.drop(columns=["Comments 1", "Coordinates"])

df_task_1a.head()

### **Task 1b: Fix Column Names**

#### **Objective**
For this subtask, we will focus on fixing column names to ensure consistency and clarity. We'll start by identifying the issues with the column names, specifically targeting those with a `;` prefix that needs removal.

#### Persist

#### **Instructions**
1. **Rename Columns:**
    - Use the interactive table in Persist to correct the column names by removing the leading `;` from their names:
        - _;Aspect_ → _Aspect_
        - _;Region_ → _Region_
        - _;Trigger_ → _Trigger_
        - _;Weak Layer_ → _Weak Layer_
2. **Generate dataframe:**
    - Assign the revised dataframe to the variable `df_task_1b`.
3. **Show Output:**
    - Display the head of `df_task_1b` to verify the changes.

In [None]:
PR.PersistTable(df_task_1a)

In [None]:
df_task_1b.head()

#### Pandas

#### **Instructions**
1. **Rename Columns:**
    - Employ Pandas commands to rename the columns, eliminating the leading ";" as specified:
        - _;Aspect_ → _Aspect_
        - _;Region_ → _Region_
        - _;Trigger_ → _Trigger_
        - _;Weak Layer_ → _Weak Layer_
2. **Generate dataframe:**
    - Assign the updated dataframe to variable `df_task_1b`.
3. **Show Output:**
    - Print the head of `df_task_1b` to confirm the updated column names.
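
When many columns share the same stray prefix, the rename can also be expressed as a function instead of an explicit mapping. The sketch below is one possible approach, not the demo's own solution; `toy` is a hypothetical empty frame that only mimics the `;`-prefixed headers.

```python
import pandas as pd

# Hypothetical frame that mimics the ';'-prefixed headers (no real data).
toy = pd.DataFrame(columns=[";Region", "Month", ";Weak Layer", "Depth_inches"])

# lstrip(";") drops any leading ';' and leaves all other names untouched.
renamed = toy.rename(columns=lambda name: name.lstrip(";"))

print(renamed.columns.tolist())
# ['Region', 'Month', 'Weak Layer', 'Depth_inches']
```

The explicit dictionary used in this task is easier to audit; the function form scales better when the list of affected columns is long or unknown.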

In [None]:
df_task_1a.head()

In [None]:
# Renaming columns to remove the leading ';'
df_task_1b = df_task_1a.rename(columns={
    ";Region": "Region",
    ";Trigger": "Trigger",
    ";Weak Layer": "Weak Layer",
    ";Aspect": "Aspect",
})

df_task_1b.head()

## **Task 1c: Correcting Data Type of 'Depth_inches'**

#### **Objective**
In this task, we will address a data type issue in the `Depth_inches` column of our dataframe. This column is incorrectly stored as an object (string) because some values contain the inches symbol `"`.

Remove any inches symbols `"` from the `Depth_inches` column and convert it to a float data type.
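
A minimal sketch of this cleanup on a toy series, assuming the only stray character is the `"` symbol. `toy_depth` is a hypothetical stand-in for the real column, and `pd.to_numeric` with `errors="coerce"` is swapped in here for a plain cast so that unparseable values become missing instead of raising.

```python
import pandas as pd

# Hypothetical stand-in for the Depth_inches column (not the real data).
toy_depth = pd.Series(['15"', "20", '8.5"', None], name="Depth_inches")

# Flag entries that still carry the inches symbol; missing values count as False.
has_symbol = toy_depth.str.contains('"', regex=False, na=False)

# Strip the symbol, then parse; errors="coerce" turns unparseable values into NaN.
cleaned = pd.to_numeric(
    toy_depth.str.replace('"', "", regex=False), errors="coerce"
)

print(has_symbol.sum(), "entries contained the inches symbol")
```

Checking `has_symbol` before converting gives a quick count of how many rows the fix will touch.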

In [None]:
df_task_1b.dtypes

### **Persist**

#### **Instructions**
1. **Identify Entries with Inches Symbol:**
    - Use the interactive table in Persist to look for rows with `"` in the `Depth_inches` column.
2. **Edit and Correct Entries:**
    - Edit the cells to remove the inches symbol from these entries (e.g. `15"` → `15`).
3. **Convert Data Type:**
    - Change the data type of the `Depth_inches` column from string to float.
4. **Generate Dataframe:**
    - Assign the modified dataframe to a variable `df_task_1c`.
5. **Show Output:**
    - Display the dtypes of `df_task_1c` to verify the data type correction.

In [None]:
PR.PersistTable(df_task_1b)

In [None]:
df_task_1c.head()

In [None]:
df_task_1c.dtypes

### **Pandas**

#### **Instructions**

1. **Remove Inches Symbol and Correct Format:**
    - Use Pandas to replace the inches symbol in the `Depth_inches` column.
2. **Convert Data Type:**
    - Convert the `Depth_inches` column to float.
3. **Generate Dataframe:**
    - Save the updated dataframe as `df_task_1c`.
4. **Show Output:**
    - Print the dtypes of `df_task_1c` to confirm the changes.

In [None]:
df_task_1b.dtypes

In [None]:
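# Work on a copy so that df_task_1b keeps its original values
df_task_1c = df_task_1b.copy()
# Strip the inches symbol, then cast the cleaned strings to a nullable float
df_task_1c["Depth_inches"] = df_task_1c["Depth_inches"].str.replace('"', '')
df_task_1c["Depth_inches"] = df_task_1c["Depth_inches"].astype("Float64")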

In [None]:
df_task_1c.dtypes

# Task 2: Filtering data

In Task 2, we make the data more consistent by removing outliers and dropping older, sparsely recorded years.

We will also take a brief look at the relationship between the cause of an avalanche (`Trigger`) and the snow layer that failed (`Weak Layer`).

## **Task 2a: Remove Outliers**

#### **Objective**
In this task, we address data accuracy by filtering out anomalies in elevation data. We observe some records with elevations outside the plausible range for Utah, suggesting recording errors.

Remove avalanche records with elevations below 2100 feet and above 13500 feet, which are outside the realistic range for Utah.
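
The range check can be sketched as a boolean mask, using the 2100 and 13500 foot bounds from the instructions for this task. The values in `toy` are hypothetical; only the column name comes from the dataset.

```python
import pandas as pd

# Hypothetical records; only the column name comes from the dataset.
toy = pd.DataFrame({"Elevation_feet": [150, 2100, 9800, 13500, 98000]})

# between() is inclusive on both ends by default, so 2100 and 13500 are kept.
in_range = toy["Elevation_feet"].between(2100, 13500)

print(f"{(~in_range).sum()} of {len(toy)} rows fall outside the range")
kept = toy[in_range]
```

Counting the rows that fail the mask before dropping them is a cheap sanity check that the filter removes only a handful of records.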

#### Persist

#### **Instructions**
1. **Identify and Remove Anomalies:**
    - Interactively select data points with elevations below 2100 feet and above 13500 feet in the Persist Scatterplot.
    - Use Persist's interactive features to remove these anomalous records.
2. **Generate Dataframe:**
    - Assign the cleaned dataframe to a variable `df_task_2a`.
3. **Show Output:**
    - Display the head of `df_task_2a`.

In [None]:
PR.plot.scatterplot(df_task_1c, "Elevation_feet:Q", "Vertical_inches:Q")

In [None]:
df_task_2a.head()

#### Pandas

#### **Instructions**
1. **Locate Anomalous Data:**
    - Refer to the _seaborn_ scatterplot for `Elevation_feet` vs `Vertical_inches`.
    - Write code to identify records where `Elevation_feet` is either below 2100 feet or above 13500 feet.
2. **Remove Anomalies:**
    - Use Pandas commands to filter out these anomalous records from the dataframe.
3. **Generate Dataframe:**
    - Save the cleaned dataframe as `df_task_2a`.
4. **Plot Output:**
    - Recreate the scatterplot from step 1 in a new cell using `df_task_2a`.
    - Print the head of `df_task_2a`.

In [None]:
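import matplotlib.pyplot as plt  # Figure setup and axis labels
import seaborn as sns  # Static plots for the Pandas workflow

df_task_2a = df_task_1c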

plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_task_2a, x='Elevation_feet', y='Vertical_inches')

plt.xlabel('Elevation (feet)')
plt.ylabel('Vertical Distance (inches)')

# Display the plot
plt.show()

In [None]:
df_task_2a = df_task_1c[(df_task_1c['Elevation_feet'] >= 2100) & (df_task_1c['Elevation_feet'] <= 13500)]
df_task_2a.head()

In [None]:
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_task_2a, x='Elevation_feet', y='Vertical_inches')

plt.xlabel('Elevation (feet)')
plt.ylabel('Vertical Distance (inches)')

# Display the plot
plt.show()

## **Task 2b: Filtering Out Old Data**

The interactive bar chart below shows the data aggregated by year. There are noticeably fewer records for the years before 2010.

In this subtask we will remove the older records, keeping only the records from 2010 onwards.

### Persist

#### **Instructions**
1. **Create and Analyze Bar Chart:**
    - Looking at an interactive bar chart in Persist showing the number of avalanches recorded each year, identify the bars showing data we want.
2. **Interactive Year Selection:**
    - Use a brush to interactively select and remove appropriate records.
3. **Generate Dataframe:**
    - Assign the refined dataframe to a variable `df_task_2b`.
4. **Show Output:**
    - Display the head of `df_task_2b` to verify the removal of earlier years.

In [None]:
PR.plot.barchart(df_task_2a, "utcyear(Date):O", "count()", selection_type="interval")

In [None]:
df_task_2b.head()

### Pandas

#### **Instructions**
1. **Create Bar Chart with Seaborn:**
    - Use a Seaborn bar chart to visualize the number of avalanches per year.
2. **Identify Sparse Years:**
    - Based on the bar chart, identify years before 2010 with fewer avalanche records.
3. **Code to Filter Out Sparse Years:**
    - Write Pandas code to exclude these years from the dataset.
4. **Show Output:**
    - Print the head of `df_task_2b` and optionally recreate the bar chart to show the dataset focusing on years 2010 and onwards.


In [None]:
df_task_2b = df_task_2a
df_task_2b = df_task_2b.convert_dtypes()

plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_2b["Date"].dt.year)

plt.xlabel('Year')
plt.ylabel('# of records')

# Display the plot
plt.show()

In [None]:
df_task_2b = df_task_2b[df_task_2b["Date"].dt.year >= 2010]
df_task_2b.head()

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(x=df_task_2b["Date"].dt.year)

plt.xlabel('Year')
plt.ylabel('# of records')

# Display the plot
plt.show()

In [None]:
df_task_2b.head()

## **Task 2c: Identifying frequently failing `Weak Layers` for avalanches triggered by _'snowboarders'_ and _'skiers'_**

### Persist

#### **Instructions**
1. **Linked Bar Charts:**
    - You will start with two linked interactive bar charts: one for `Trigger` and another for `Weak Layer`.
    - Both bar charts show `count` for their respective category.
    - You can click on a trigger in the `Trigger` bar chart, and the `Weak Layer` bar chart updates dynamically to show only the occurrences for the selected triggers.
2. **Interactive Selection:**
    - Interactively select triggers and read the updated `Weak Layer` bar chart.
3. **Identify the most frequent failure point:**
    - Analyze the filtered `Weak Layer` bar chart to determine the most frequently failing layer for each selected trigger, and note both the layer name and its frequency in a markdown cell.
4. **Generate Dataframe and Output:**
    - You will not generate a dataframe for this task. NOTE: Please save the notebook after you are finished with the interactions.

In [None]:
pts = alt.selection_point(name="selector", fields=['Trigger'])

base = alt.Chart(df_task_2b).encode(y="count()")

trigger = base.mark_bar().encode(
    x="Trigger:N",
    color=alt.condition(pts, "Trigger:N", alt.value("gray"))
).add_params(pts)

weak_layer = base.mark_bar().encode(
    x="Weak Layer:N",
    color="Weak Layer:N",
    tooltip="count()"
).transform_filter(
    pts
)

chart = alt.hconcat(
    trigger, weak_layer
).resolve_scale(
    color="independent",
)

PR.PersistChart(chart, data=df_task_2b)

**Task 2c Notes:**

- Snowboarder -> New snow/old snow interface (31)
- Skiers -> Facets (261)

### Pandas

#### **Instructions**
1. **Identify Predominant Weak Layer:**
    - Determine the most common weak layer for the `Snowboarder` and `Skier` triggers.
2. **Output:**
    - Note in a markdown cell both the name of the layer and its frequency for each of the above triggers.
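
As an alternative to filtering one trigger at a time, a cross-tabulation gives all trigger and layer counts in one table. This is a sketch on hypothetical records, not the demo's data; the labels in `toy` are illustrative only.

```python
import pandas as pd

# Hypothetical records; the counts are illustrative, not real results.
toy = pd.DataFrame({
    "Trigger": ["Skier", "Skier", "Skier",
                "Snowboarder", "Snowboarder", "Snowboarder", "Natural"],
    "Weak Layer": ["Facets", "Facets", "Depth Hoar",
                   "Depth Hoar", "Depth Hoar", "Facets", "Facets"],
})

# Rows are triggers, columns are weak layers, cells are record counts.
table = pd.crosstab(toy["Trigger"], toy["Weak Layer"])

# Keep the two triggers of interest and read off the largest cell per row.
picked = table.loc[["Skier", "Snowboarder"]]
summary = pd.DataFrame({
    "Weak Layer": picked.idxmax(axis=1),
    "count": picked.max(axis=1),
})
print(summary)
```

If two layers tie for a trigger, `idxmax` returns only the first one, so ties are worth checking in the full table.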

In [None]:
df_task_2b.head()

In [None]:
snowboarders = df_task_2b[df_task_2b["Trigger"] == "Snowboarder"]
snowboarders.head()

In [None]:
skiers = df_task_2b[df_task_2b["Trigger"] == "Skier"]
skiers.head()

In [None]:
snowboarders["Weak Layer"].value_counts().sort_values(ascending=False)

In [None]:
skiers["Weak Layer"].value_counts().sort_values(ascending=False)

- Snowboarders -> New Snow/Old Snow Interface (31)
- Skiers -> Facets (261)

## Task 3: Data Wrangling

### **Task 3a: Creating and assigning 'Avalanche Season'**

#### **Objective**

In this subtask, we'll introduce a new categorical variable named `Avalanche Season` into our dataset. This addition aims to classify each avalanche record into different parts of the avalanche season (Start, Middle, End) based on the month it occurred.

Create a new category `Avalanche Season` in the dataset and assign each record to `Start`, `Middle`, or `End` of the avalanche season based on its month.

### Persist

#### **Instructions**
1. **Visualization**
    - We will work with an interactive bar chart in Persist showing the count of avalanche instances aggregated by month.
2. **Define Season Categories:**
    - Based on typical avalanche seasons in Utah, you will first create a new category called `Avalanche Season` using the `Edit Categories` button in the header.
    - In the same menu you will add three options for this category -- `Start`, `Middle`, `End`.
3. **Interactive Assignment:**
    - Use Persist's interactive features to select each month and assign it to one of the `Avalanche Season` values (Start, Middle, End).
    - You should use the following ranges for assigning proper categories:
        - `Start` of Season: October, November, December, January, February
        - `Middle` of Season: March, April, May
        - `End` of Season: June, July, August, September
4. **Generate Dataframe:**
    - Assign the updated dataset to a new variable: `df_task_3a`.
5. **Show Output:**
    - Print the head of the dataframe.

In [None]:
select = alt.selection_interval(name="selector", encodings=["x"])

chart = alt.Chart(df_task_2b, height=400, width=500).mark_bar(tooltip=True).encode(
    x=alt.X("utcmonth(Date):N").sort([10]),
    y="count()",
    opacity=alt.condition(select, alt.value(1), alt.value(0.2)),
).add_params(select)

PR.PersistChart(chart, data=df_task_2b)

In [None]:
persist_df_10["M"] = persist_df_10["Date"].dt.month
persist_df_10.groupby("M").count()

In [None]:
df_task_3a.head()

### **Pandas**

#### **Instructions**
1. **Create New Variable:**
    - Add a new column `Avalanche Season` to the DataFrame.
2. **Assign Category:**
    - Using the `month` from the `Date` column assign proper values to the new category.
    - You should use the following ranges for assigning proper categories:
        - `Start` of Season: October, November, December, January, February
        - `Middle` of Season: March, April, May
        - `End` of Season: June, July, August, September
3. **Generate Dataframe:**
    - Save the modified DataFrame with the new `Avalanche Season` category to `df_task_3a`.
4. **Show Output:**
    - Display the head of `df_task_3a` to confirm the addition and categorization of the new variable.
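
One way to sketch the assignment is a month-to-season lookup applied with `Series.map`, following the ranges listed above. `SEASON_BY_MONTH` and the dates in `toy` are hypothetical names and values introduced for illustration.

```python
import pandas as pd

# Month number -> season label, following the ranges listed above.
SEASON_BY_MONTH = {
    **dict.fromkeys([10, 11, 12, 1, 2], "Start"),
    **dict.fromkeys([3, 4, 5], "Middle"),
    **dict.fromkeys([6, 7, 8, 9], "End"),
}

# Hypothetical dates standing in for the Date column.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-11-09", "2016-02-01", "2016-04-20", "2016-07-04"])
})

# Look up each record's month in the mapping.
toy["Avalanche Season"] = toy["Date"].dt.month.map(SEASON_BY_MONTH)
print(toy)
```

A lookup table keeps all twelve months visible in one place, which makes it easy to confirm that no month is left unassigned.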

In [None]:
df_task_3a = df_task_2b.copy()

In [None]:
df_task_3a["Avalanche Season"] = "End"

In [None]:
df_task_3a.head()

In [None]:
df_task_3a.loc[df_task_3a["Date"].dt.month.isin([10,11,12,1,2]), "Avalanche Season"] = "Start"
df_task_3a.loc[df_task_3a["Date"].dt.month.isin([3,4,5]), "Avalanche Season"] = "Middle"

df_task_3a.head()

### **Task 3b: Analyzing Top Avalanche Trigger by Season**

#### **Objective**
In this subtask, we'll analyze which trigger is most prevalent for avalanches in each season phase (Start, Middle, End) using the `Avalanche Season` category created in Task 3a.

### **Persist**

#### **Instructions**
1. **Visualization:**
    - We have two linked interactive bar charts: one for 'Avalanche Season' and another for 'Trigger'.
    - You can select a category to highlight using the legend of the `Avalanche Season` bar chart. The `Trigger` bar chart will dynamically update in response to your selections.
2. **Analyze Trigger Data:**
    - Observe the filtered `Trigger` bar chart to identify the top trigger for the selected season phase.
    - You can hover on the bars to get the exact frequency.
3. **Document Findings:**
    - Note down the most common trigger for each season phase based on your interactive analysis in a new markdown cell.

In [None]:
selection = alt.selection_point(name="selector", fields=["Avalanche Season"], bind="legend")
base = alt.Chart(df_task_3a)

seasons = base.mark_bar().encode(
    x=alt.X("Avalanche Season:N").sort(["Start", "Middle", "End"]),
    y="count()",
    opacity=alt.condition(selection, alt.value(1), alt.value(0.3)),
    color="Avalanche Season:N"
).add_params(
    selection
)

trigger = base.mark_bar().encode(
    x="Trigger:N",
    y="count()",
    color="Trigger:N",
    tooltip="count()"
).transform_filter(
    selection
)

chart = seasons | trigger

chart = chart.resolve_scale(
    color="independent"
)


PR.PersistChart(chart)

In [None]:
c = alt.Chart(df_task_3a).mark_bar().encode(
    x="Trigger:N",
    y="count()",
    color=alt.Color("Avalanche Season:N").sort(["Start", "Middle", "End"]),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.3)),
    order="selection_order:N",
    tooltip="count()"
).transform_calculate(
    selection_order="if(selector && selector['Avalanche Season'], if((datum['Avalanche Season'] === selector['Avalanche Season'][0]),0,1),0)"
).add_params(selection)


PR.PersistChart(c)

**Task 3b Notes:**

- Start of season -> Skiers (591)
- Middle of season -> Natural (260)
- End of season -> Skiers (2)

### **Pandas**

#### **Instructions**
1. **Analyze Triggers by Season:**
	- Determine the most common trigger for each season.
2. **Present Findings:**
	- Note in markdown cell both the name and frequency for each trigger.
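
To read off only the top trigger per season rather than scanning a sorted table, the grouped counts can be reduced to one row per season. This is a sketch on hypothetical records; the counts in `toy` are illustrative and unrelated to the real results.

```python
import pandas as pd

# Hypothetical records; only the column names come from the dataset.
toy = pd.DataFrame({
    "Avalanche Season": ["Start", "Start", "Start", "Middle", "Middle", "Middle", "End"],
    "Trigger": ["Skier", "Skier", "Natural", "Natural", "Natural", "Skier", "Skier"],
})

# Count records for every (season, trigger) pair.
counts = (
    toy.groupby(["Avalanche Season", "Trigger"])
    .size()
    .reset_index(name="counts")
)

# Sort by count, then keep the first (largest) row for each season.
# Ties, if any, are resolved by whichever row sorts first.
top = (
    counts.sort_values("counts", ascending=False)
    .drop_duplicates("Avalanche Season")
    .set_index("Avalanche Season")
)
print(top)
```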

In [None]:
grouped_size_df = df_task_3a[["Avalanche Season", "Trigger", "Date"]].groupby(["Avalanche Season", "Trigger"]).size().reset_index(name="counts")

grouped_size_df.sort_values(["Avalanche Season", "counts"], ascending=[True, False])

**Notes:**
- End -> Skier (2)
- Middle -> Natural (260)
- Start -> Skier (591)