# Analyzing Utah Avalanche Data

**Participant ID:**  
**Date / Time:**

## Introduction
Welcome to our data analysis study. For this part of the study, you'll be working with a dataset sourced from the [Utah Avalanche Center](https://utahavalanchecenter.org/). The data provides insights into [avalanche occurrences](https://utahavalanchecenter.org/avalanches) in Utah.


- You will use an extension called Persist to complete **data cleanup and manipulation** tasks.
- To familiarize yourself with its functionalities, please refer to the provided [tutorial notebook](../tutorial.ipynb).
- Interactive charts and tables have been pre-created for your convenience. Run the corresponding cells to use them.
- Focus on leveraging the interactive capabilities of Persist for your analysis.
- In some cases, you will be asked to document your findings. Please do this in writing in a markdown cell.

## Tasks Overview

In this study, you are presented with three fundamental data analysis tasks. Each task is designed to test different aspects of data analysis and manipulation.

- Carefully follow the step-by-step instructions provided for each task.
- As you work through the tasks, take note of any interesting findings or challenges you encounter, either by speaking your thoughts out loud or taking notes in a markdown cell. This can include observations about the data, any issues encountered, and your overall experience with the task/method.
- Feel free to add new code and markdown cells in the notebook as necessary to complete the tasks, but please do attempt the tasks with the Persist functionality.

**Support**
- If you require assistance or need further clarification on any of the tasks, please let us know.
- If you find yourself stuck on a task and feel that you will not make any progress, you have the option to skip the task.
- Some tasks build upon the outputs of previous tasks, so skipping a task will affect your ability to proceed. To avoid such problems, we will help you load a fallback dataset.

In [1]:
import helpers as h
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

import persist_ext as PR

## Data Description

The table below describes the different columns in the dataset. Each row in the dataset represents a reported avalanche with details on location, trigger, and aspect. The data spans multiple years, starting from 2004 up to 2023.

| Column          | Description                                                    |
|-----------------|----------------------------------------------------------------|
| Date            | Date on which the avalanche was recorded                       |
| Region          | Region in Utah where the avalanche occurred                    |
| Place           | Exact location where the avalanche was recorded                |
| Trigger         | Cause of the avalanche                                         |
| Weak Layer      | Layer of snow that was weakest and likely to fail              |
| Depth_inches    | Depth of the avalanche in inches                               |
| Width_inches    | Width of the avalanche in inches                               |
| Vertical_inches | Vertical extent of the avalanche in inches                     |
| Aspect          | Direction of the slope where the avalanche occurred            |
| Elevation_feet  | Elevation of the avalanche location in feet                    |
| Coordinates     | Approximate geographical coordinates of the avalanche location |
| Comments 1      | Additional comments provided by the reporter                   |

In [3]:
df = pd.read_csv('./avalanches_data.csv')
df

Unnamed: 0,Date,;Region,Place,;Trigger,;Weak Layer,Depth_inches,Width_inches,Vertical_inches,;Aspect,Elevation_feet,Coordinates,Comments 1
0,11/9/2012,Salt Lake,Sunset Peak,Snowboarder,New Snow/Old Snow Interface,14.0,960.0,360.0,North,10400.0,"40.577977000000, -111.595817000000",While it was a small avalanche that was I caug...
1,11/11/2012,Salt Lake,Patsy Marly,Skier,New Snow/Old Snow Interface,30.0,1200.0,1200.0,North,9700.0,"40.592619000000, -111.616099000000",A North facing aspect with an exposed ridge in...
2,11/11/2012,Salt Lake,Two Dogs,Skier,Facets,36.0,840.0,5400.0,North,10200.0,"40.599291000000, -111.642315000000",Remotely triggered all the new storm snow (abo...
3,11/11/2012,Salt Lake,Emma Ridges,Skier,New Snow,18.0,600.0,6000.0,Southeast,10200.0,"40.598313000000, -111.628304000000",Impressive fast powder cloud ran in front of t...
4,11/11/2012,Salt Lake,Sunset Peak,Skier,Facets,42.0,18000.0,9600.0,North,10400.0,"40.578590000000, -111.595087000000",Three of us toured from Brighton to low saddle...
...,...,...,...,...,...,...,...,...,...,...,...,...
2387,4/22/2023,Salt Lake,Cardiff Bowl,Unknown,New Snow/Old Snow Interface,8.0,720.0,1800.0,East,9800.0,"40.592721660567, -111.649613218710",We spent the day skiing the southerly-facing a...
2388,4/22/2023,Logan,"Miller Bowl, East",Snowmobiler,New Snow/Old Snow Interface,18.0,540.0,4800.0,North,8700.0,"41.886233332343, -111.645074831510","Not sure about the story here, but we observed..."
2389,4/22/2023,Logan,Millville Peak,Snowboarder,New Snow/Old Snow Interface,12.0,3600.0,7200.0,North,8900.0,"41.677564539953, -111.718065248970",Details are a bit limited and we're not sure w...
2390,5/7/2023,Salt Lake,Red Top Mountain,Natural,Unknown,72.0,3000.0,12000.0,West,10800.0,"40.546874131921, -111.663880335390",Saw this avalanche around 9.30 AM from the top...


# Task 1: Refining Columns and Preparing Data

In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.

### **Task 1a: Remove Columns**

Remove certain columns to streamline the dataset for further analysis.
- **_Comments 1:_** Contains textual comments not crucial for quantitative analysis.
- **_Coordinates:_** Detailed location data not needed for the current scope of analysis.

#### **Instructions**
1. **Column Removal:**
	- Use the interactive table feature in Persist to remove the specified columns.
2. **Generate dataframe:**
	- Assign the modified dataframe to variable `df_task_1a`
3. **Show Output:**
	- Print the head of `df_task_1a` to show the changes.
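
For cross-checking the result of the interactive step, the same column removal can be sketched in plain pandas. This is only an illustration on a small made-up frame (`toy`, `toy_1a`, and all values are hypothetical), not a replacement for the Persist workflow the task asks for.

```python
import pandas as pd

# Hypothetical two-row stand-in for the avalanche data (values are illustrative only).
toy = pd.DataFrame({
    "Date": ["11/9/2012", "11/11/2012"],
    "Place": ["Sunset Peak", "Patsy Marly"],
    "Coordinates": ["40.57, -111.59", "40.59, -111.61"],
    "Comments 1": ["first comment", "second comment"],
})

# Drop the two columns named in the task; a misspelled name raises a KeyError.
toy_1a = toy.drop(columns=["Comments 1", "Coordinates"])
print(toy_1a.columns.tolist())
```

Comparing the column list of `df_task_1a` against a result like this is a quick way to confirm that only the two intended columns were removed.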

In [4]:
PR.PersistTable(df)

PersistWidget(data_values=[{'__id_column': '1', 'Date': 1352419200000, ';Region': 'Salt Lake', 'Place': 'Sunse…

In [5]:
df_task_1a.head()

Unnamed: 0,Date,;Region,Place,;Trigger,;Weak Layer,Depth_inches,Width_inches,Vertical_inches,;Aspect,Elevation_feet,__annotations
0,2012-11-09,Salt Lake,Sunset Peak,Snowboarder,New Snow/Old Snow Interface,14.0,960,360.0,North,10400.0,No Annotation
1,2012-11-11,Salt Lake,Patsy Marly,Skier,New Snow/Old Snow Interface,30.0,1200,1200.0,North,9700.0,No Annotation
2,2012-11-11,Salt Lake,Two Dogs,Skier,Facets,36.0,840,5400.0,North,10200.0,No Annotation
3,2012-11-11,Salt Lake,Emma Ridges,Skier,New Snow,18.0,600,6000.0,Southeast,10200.0,No Annotation
4,2012-11-11,Salt Lake,Sunset Peak,Skier,Facets,42.0,18000,9600.0,North,10400.0,No Annotation


### **Task 1b: Fix Column Names**

Next, please fix column names to ensure consistency and clarity. 

It looks like something went wrong when reading the file and some column headers start with a `;`. Please remove the semicolon. 

#### **Instructions**
1. **Rename Columns:**
    - Use the interactive table in Persist to correct the column names by removing the leading `;` from their names:
        - _;Aspect_ → _Aspect_
        - _;Region_ → _Region_
        - _;Trigger_ → _Trigger_
        - _;Weak Layer_ → _Weak Layer_
2. **Generate dataframe:**
    - Assign the revised dataframe to the variable `df_task_1b`.
3. **Show Output:**
    - Display the head of `df_task_1b` to verify the changes.
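
As a reference for what the renamed headers should look like, here is a minimal pandas sketch that strips a leading `;` from column names. The frame `toy` and the name `toy_1b` are hypothetical; the task itself should still be done in the Persist table.

```python
import pandas as pd

# Hypothetical one-row frame with the same kind of broken headers.
toy = pd.DataFrame({
    ";Region": ["Salt Lake"],
    "Place": ["Two Dogs"],
    ";Weak Layer": ["Facets"],
})

# Remove a single leading ';' from each column name; other names stay unchanged.
toy_1b = toy.rename(columns=lambda name: name[1:] if name.startswith(";") else name)
print(list(toy_1b.columns))
```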

In [6]:
PR.PersistTable(df_task_1a)

PersistWidget(data_values=[{'__id_column': '1', 'Date': 1352419200000, ';Region': 'Salt Lake', 'Place': 'Sunse…

In [7]:
df_task_1b.head()

Unnamed: 0,Date,Region,Place,Trigger,Weak Layer,Depth_inches,Width_inches,Vertical_inches,Aspect,Elevation_feet,__annotations
2375,2023-04-09,Logan,Hatties Bowl,Natural,Wet grains,12.0,360,4800.0,East,6700.0,No Annotation
1643,2019-01-08,Skyline,Staker,Snowmobiler,Facets,24.0,3000,3000.0,East,9900.0,No Annotation
1508,2017-12-23,Salt Lake,Catherine's Pass,Skier,New Snow,12.0,600,2400.0,East,10400.0,No Annotation
1505,2017-06-02,Uintas,Bald Mountain,Skier,Wet grains,3.0,480,15000.0,East,11500.0,No Annotation
986,2014-02-11,Uintas,Mill Hollow,Natural,Ground Interface,30.0,3000,3000.0,East,8400.0,No Annotation


### **Task 1c: Correcting Data Type of 'Depth_inches'**

There is a data type issue in the `Depth_inches` column of our dataframe. This column is incorrectly formatted as an object (string) due to the presence of the inches symbol `"`.

Remove any inches symbols `"` from the `Depth_inches` column and convert it to a float data type.

In [8]:
df_task_1b.dtypes

Date               datetime64[ns]
Region             string[python]
Place              string[python]
Trigger            string[python]
Weak Layer         string[python]
Depth_inches       string[python]
Width_inches                Int64
Vertical_inches           Float64
Aspect             string[python]
Elevation_feet            Float64
__annotations      string[python]
dtype: object

#### **Instructions**
1. **Identify Entries with Inches Symbol:**
    - Use the interactive table in Persist to look for rows with `"` in the `Depth_inches` column.
2. **Edit and Correct Entries:**
    - Edit the cells to remove the inches symbol from these entries. (e.g. `15"` → `15`) 
3. **Convert Data Type:**
    - Change the data type of the `Depth_inches` column from string to float.
4. **Generate Dataframe:**
    - Assign the modified dataframe to a variable `df_task_1c`.
5. **Show Output:**
    - Display the dtypes of `df_task_1c` to verify the data type correction.
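
The cleanup described in these steps can be sketched in pandas as follows. It assumes the only non-numeric character in the column is the inches symbol; `toy`, `cleaned`, and `toy_1c` are hypothetical names on made-up values.

```python
import pandas as pd

# Hypothetical column where one entry carries a trailing inches symbol.
toy = pd.DataFrame({"Depth_inches": ["14", '15"', "30.5"]})

# Remove the literal '"' character, then parse the remaining text as numbers.
cleaned = toy["Depth_inches"].str.replace('"', "", regex=False)
toy_1c = toy.assign(Depth_inches=pd.to_numeric(cleaned).astype("float64"))
print(toy_1c.dtypes)
```

If any other stray characters remain, `pd.to_numeric` raises a `ValueError`, which makes leftover bad entries easy to spot.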

In [9]:
PR.PersistTable(df_task_1b)

PersistWidget(data_values=[{'__id_column': '1', 'Date': 1352419200000, 'Region': 'Salt Lake', 'Place': 'Sunset…

In [10]:
df_task_1c.head()

Unnamed: 0,Date,Region,Place,Trigger,Weak Layer,Depth_inches,Width_inches,Vertical_inches,Aspect,Elevation_feet,__annotations
0,2012-11-09,Salt Lake,Sunset Peak,Snowboarder,New Snow/Old Snow Interface,14.0,960,360.0,North,10400.0,No Annotation
1,2012-11-11,Salt Lake,Patsy Marly,Skier,New Snow/Old Snow Interface,30.0,1200,1200.0,North,9700.0,No Annotation
2,2012-11-11,Salt Lake,Two Dogs,Skier,Facets,36.0,840,5400.0,North,10200.0,No Annotation
3,2012-11-11,Salt Lake,Emma Ridges,Skier,New Snow,18.0,600,6000.0,Southeast,10200.0,No Annotation
4,2012-11-11,Salt Lake,Sunset Peak,Skier,Facets,42.0,18000,9600.0,North,10400.0,No Annotation


# Task 2: Filtering data

In Task 2, we further improve our data by removing outliers and older records so that the dataset is more consistent.

We will also take a brief look at the relationship between the cause of an avalanche (`Trigger`) and the snow layer that failed (`Weak Layer`).

## **Task 2a: Remove Outliers**

#### **Objective**
In this task, we address data accuracy by filtering out anomalies in elevation data. We observe some records with elevations outside the plausible range for Utah, suggesting recording errors.

Remove avalanche records with elevations below 2000 feet or above 13500 feet, which are outside the realistic range for Utah.

#### **Instructions**
1. **Identify and Remove Anomalies:**
    - Interactively select data points with elevations below 2000 feet or above 13500 feet in the Persist scatterplot.
    - Use Persist's interactive features to remove these anomalous records.
2. **Generate Dataframe:**
    - Assign the cleaned dataframe to a variable `df_task_2a`.
3. **Show Output:**
    - Display the head of `df_task_2a`.
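
To sanity-check the interactive selection, the elevation filter can be approximated in pandas. This sketch uses the 2000 and 13500 feet bounds from the task description and treats them as inclusive, which is an assumption; `toy`, `toy_2a`, `LOW_FT`, and `HIGH_FT` are hypothetical names.

```python
import pandas as pd

# Hypothetical elevations: two plausible values and two likely recording errors.
toy = pd.DataFrame({"Elevation_feet": [10400.0, 950.0, 9700.0, 104000.0]})

LOW_FT, HIGH_FT = 2000, 13500  # bounds quoted in the task description

# between() is inclusive on both ends; rows with a missing elevation are dropped too.
toy_2a = toy[toy["Elevation_feet"].between(LOW_FT, HIGH_FT)]
print(toy_2a)
```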

In [None]:
PR.plot.scatterplot(df_task_1c, "Elevation_feet:Q", "Vertical_inches:Q")

## **Task 2b: Filtering Out Old Data**

The interactive bar chart below shows the data aggregated by year. There are noticeably fewer records for the years before 2010.

In this subtask, we will remove the older records, keeping only the records from 2010 onward.

#### **Instructions**
1. **Create and Analyze Bar Chart:**
    - Look at the interactive bar chart in Persist showing the number of avalanches recorded each year and identify the bars showing the data we want.
2. **Interactive Year Selection:**
    - Use a brush to interactively select and remove appropriate records.
3. **Generate Dataframe:**
    - Assign the refined dataframe to a variable `df_task_2b`.
4. **Show Output:**
    - Display the head of `df_task_2b` to verify the removal of earlier years.
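
A pandas sketch of the same year filter is shown here for comparison. It assumes that 2010 itself is kept; `toy`, `toy_2b`, and `CUTOFF_YEAR` are hypothetical names and the dates are made up.

```python
import pandas as pd

# Hypothetical dates on both sides of the cutoff.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2006-01-15", "2009-12-30", "2010-01-02", "2023-04-22"]),
})

CUTOFF_YEAR = 2010  # assumption: records from 2010 itself are kept

# Keep only rows whose year is at or after the cutoff.
toy_2b = toy[toy["Date"].dt.year >= CUTOFF_YEAR]
print(toy_2b)
```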

In [None]:
PR.plot.barchart(df_task_2a, "utcyear(Date):O", "count()", selection_type="interval")

## **Task 2c: Identifying frequently failing `Weak Layers` for avalanches triggered by _'snowboarders'_ and _'skiers'_**

#### **Instructions**
1. **Linked Bar Charts:**
    - You will start with two linked interactive bar charts: one for `Trigger` and another for `Weak Layer`.
    - Both bar charts show `count` for their respective category.
    - You can click on a trigger in the `Trigger` bar chart, and the `Weak Layer` bar chart dynamically updates to show only occurrences corresponding to the selected triggers.
2. **Interactive Selection:**
    - Interactively select triggers and use the updated `Weak Layer` bar chart.
3. **Identify the most frequent failure point:**
    - Analyze the filtered `Weak Layer` bar chart to determine the most frequently failing layer for the selected category, and note both the name of the layer and its frequency in a markdown cell.
4. **Generate Dataframe and Output:**
    - You will not generate any dataframe for this task. NOTE: Please save the notebook after you are finished with interactions.
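
The counts read off the linked charts can be double-checked with a small pandas aggregation. This is a sketch on made-up rows; `toy`, `selected`, and `counts` are hypothetical names, and the trigger labels are assumed to match the `Skier` and `Snowboarder` values seen in the data preview.

```python
import pandas as pd

# Hypothetical records; the real data has many more triggers and layers.
toy = pd.DataFrame({
    "Trigger": ["Skier", "Skier", "Snowboarder", "Natural", "Skier", "Snowboarder"],
    "Weak Layer": ["Facets", "New Snow", "Facets", "Wet grains", "Facets", "New Snow"],
})

# Keep only the two triggers of interest, then count weak layers per trigger.
selected = toy[toy["Trigger"].isin(["Skier", "Snowboarder"])]
counts = selected.groupby("Trigger")["Weak Layer"].value_counts()
print(counts)
```

Within each trigger, the largest count corresponds to the tallest bar in the filtered `Weak Layer` chart.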

In [None]:
pts = alt.selection_point(name="selector", fields=['Trigger'])

base = alt.Chart(df_task_2b).encode(y="count()")

trigger = base.mark_bar().encode(
    x="Trigger:N",
    color=alt.condition(pts, "Trigger:N", alt.value("gray"))
).add_params(pts)

weak_layer = base.mark_bar().encode(
    x="Weak Layer:N",
    color="Weak Layer:N",
    tooltip="count()"
).transform_filter(
    pts
)

chart = alt.hconcat(
    trigger, weak_layer
).resolve_scale(
    color="independent",
)

PR.PersistChart(chart, data=df_task_2b)

**Task 2c Notes:**

Write your answer here

# Task 3: Data Wrangling

### **Task 3a: Creating and Assigning 'Avalanche Season'**

#### **Objective**

In this subtask, we'll introduce a new categorical variable named `Avalanche Season` into our dataset. This addition aims to classify each avalanche record into different parts of the avalanche season (Start, Middle, End) based on the month it occurred.

Create a new category `Avalanche Season` in the dataset and assign each record to `Start`, `Middle`, or `End` of the avalanche season based on its month.

#### **Instructions**
1. **Visualization**
    - We will work with an interactive bar chart in Persist showing the count of avalanche instances aggregated by month.
2. **Define Season Categories:**
    - Based on typical avalanche seasons in Utah, you will first create a new category called `Avalanche Season` using the `Edit Categories` button in the header.
    - In the same menu you will add three options for this category -- `Start`, `Middle`, `End`.
3. **Interactive Assignment:**
    - Use Persist's interactive features to select each month and assign it to one of the `Avalanche Season` values (Start, Middle, End).
    - You should use the following ranges for assigning proper categories:
        - `Start` of Season: October, November, December, January, February
        - `Middle` of Season: March, April, May
        - `End` of Season: June, July, August, September
4. **Generate Dataframe:**
    - Assign the updated dataset to a new variable: `df_task_3a`.
5. **Show Output:**
    - Print the head of the dataframe.
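
The month-to-season assignment listed above can be written down as a lookup table, which is handy for verifying the interactive result. This is a sketch only; `SEASON_BY_MONTH`, `toy`, and `toy_3a` are hypothetical names on made-up dates.

```python
import pandas as pd

# Month number -> season label, following the ranges given in the instructions.
SEASON_BY_MONTH = {
    10: "Start", 11: "Start", 12: "Start", 1: "Start", 2: "Start",
    3: "Middle", 4: "Middle", 5: "Middle",
    6: "End", 7: "End", 8: "End", 9: "End",
}

# Hypothetical dates, one per season phase.
toy = pd.DataFrame({"Date": pd.to_datetime(["2012-11-09", "2023-04-22", "2017-06-02"])})

# Map each record's month to its season label in a new column.
toy_3a = toy.assign(**{"Avalanche Season": toy["Date"].dt.month.map(SEASON_BY_MONTH)})
print(toy_3a)
```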

In [None]:
select = alt.selection_interval(name="selector", encodings=["x"])

chart = alt.Chart(df_task_2b, height=400, width=500).mark_bar().encode(
    x=alt.X("utcmonth(Date):N"),
    y="count()",
    opacity=alt.condition(select, alt.value(1), alt.value(0.2))
).add_params(select)

PR.PersistChart(chart, data=df_task_2b)

### **Task 3b: Analyzing Top Avalanche Trigger by Season**

#### **Objective**
In this subtask, we'll analyze which trigger is most prevalent for avalanches in each season phase (Start, Middle, End) using the `Avalanche Season` category created in Task 3a.

#### **Instructions**
1. **Visualization:**
    - We have two linked interactive bar charts: one for 'Avalanche Season' and another for 'Trigger'.
    - You can select a category to highlight using the legend of the `Avalanche Season` bar chart. The `Trigger` bar chart will dynamically update in response to your selections.
2. **Analyze Trigger Data:**
    - Observe the filtered `Trigger` bar chart to identify the top trigger for the selected season phase.
    - You can hover on the bars to get the exact frequency.
3. **Document Findings:**
    - Note down the most common trigger for each season phase based on your interactive analysis in a new markdown cell.
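
To confirm the notes taken from the charts, the top trigger per season phase can also be computed directly. This is a minimal sketch on made-up rows; `toy` and `top` are hypothetical names, and it assumes the `Avalanche Season` column from Task 3a exists.

```python
import pandas as pd

# Hypothetical records already labeled with a season phase.
toy = pd.DataFrame({
    "Avalanche Season": ["Start", "Start", "Start", "Middle", "Middle", "End"],
    "Trigger": ["Skier", "Skier", "Natural", "Natural", "Natural", "Skier"],
})

# For each season, count triggers and keep the label with the highest count.
top = toy.groupby("Avalanche Season")["Trigger"].agg(lambda s: s.value_counts().idxmax())
print(top)
```

When two triggers tie within a season, this sketch reports only one of them, so the hover counts in the chart remain the better source for ties.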

In [None]:
selection = alt.selection_point(name="selector", fields=["Avalanche Season"], bind="legend")
base = alt.Chart(df_task_3a)

seasons = base.mark_bar().encode(
    x=alt.X("Avalanche Season:N").sort(["Start", "Middle", "End"]),
    y="count()",
    opacity=alt.condition(selection, alt.value(1), alt.value(0.3)),
    color="Avalanche Season:N"
).add_params(
    selection
)

trigger = base.mark_bar().encode(
    x="Trigger:N",
    y="count()",
    color="Trigger:N",
    tooltip="count()"
).transform_filter(
    selection
)

chart = seasons | trigger

chart = chart.resolve_scale(
    color="independent"
)

PR.PersistChart(chart)

**Task 3b Notes:**

Write your answer here