# Analyzing Utah Avalanche Data

**Participant ID:** P8

**Date / Time:**

## Introduction
Welcome to our data analysis study. For this part of the study, you'll be working with a dataset sourced from the [Utah Avalanche Center](https://utahavalanchecenter.org/). The data provides insights into [avalanche occurrences](https://utahavalanchecenter.org/avalanches) in Utah.


- You will use an extension called PersIst to complete **data cleanup and manipulation** tasks. 
- Interactive charts and tables have been pre-created for you. Run the corresponding cells to use them.
- Focus on leveraging the interactive capabilities of Persist for your analysis.
- Carefully follow the step-by-step instructions provided for each task.
- In some cases, you will be asked to document your findings. Please do this in writing in a markdown cell.
- As you work through the tasks, take note of any interesting findings or challenges with the software or pandas that you may encounter, either by speaking your thoughts out loud or taking notes in a markdown cell.
- Feel free to add new code and markdown cells in the notebook as necessary to complete the tasks, but please do attempt the tasks with the PersIst functionality.

In [1]:
import helpers as h
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

import persist_ext as PR

## Data Description

The table below describes the different columns in the dataset. Each row in the dataset represents a reported avalanche with details on location, trigger, and aspect. The data spans multiple years, starting from 2004 up to 2023.

| Column          | Description                                                    |
|-----------------|----------------------------------------------------------------|
| Region          | Region in Utah where the avalanche occurred                    |
| Month           | Month in which the avalanche was recorded                      |
| Day             | Day on which the avalanche was recorded                        |
| Year            | Year in which the avalanche was recorded                       |
| Trigger         | Cause of the avalanche                                         |
| Weak Layer      | Layer of snow that was weakest and likely to fail              |
| Depth_inches    | Depth of the avalanche in inches                               |
| Vertical_inches | Vertical distance covered by the avalanche in inches           |
| Aspect          | Direction of the slope where the avalanche occurred            |
| Elevation_feet  | Elevation of the location in feet                              |
| Coordinates     | Approximate geographical coordinates of the avalanche location |
| Comments 1      | Additional comments provided by the reporter                   |

In [2]:
df = pd.read_csv('./avalanches_data.csv')
df.head()

Unnamed: 0,;Region,Month,Day,Year,;Trigger,;Weak Layer,Depth_inches,Vertical_inches,;Aspect,Elevation_feet,Coordinates,Comments 1
0,Salt Lake,11,9,2012,Snowboarder,New Snow/Old Snow Interface,14.0,360.0,North,10400.0,"40.577977000000, -111.595817000000",While it was a small avalanche that was I caug...
1,Salt Lake,11,11,2012,Skier,New Snow/Old Snow Interface,30.0,1200.0,North,9700.0,"40.592619000000, -111.616099000000",A North facing aspect with an exposed ridge in...
2,Salt Lake,11,11,2012,Skier,Facets,36.0,5400.0,North,10200.0,"40.599291000000, -111.642315000000",Remotely triggered all the new storm snow (abo...
3,Salt Lake,11,11,2012,Skier,New Snow,"18.0""",6000.0,Southeast,10200.0,"40.598313000000, -111.628304000000",Impressive fast powder cloud ran in front of t...
4,Salt Lake,11,11,2012,Skier,Facets,42.0,9600.0,North,10400.0,"40.578590000000, -111.595087000000",Three of us toured from Brighton to low saddle...


# Task 1: Refining Columns and Preparing Data

In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.

### **Task 1a: Remove Columns**

Remove the following columns to streamline the dataset for further analysis:

- **_Comments 1:_** Contains textual comments not crucial for quantitative analysis.
- **_Coordinates:_** Detailed location data not needed for the current scope of analysis.

#### **Instructions**
1. **Column Removal:**
	- Use the interactive table feature in PersIst to remove the specified columns.
2. **Verify the Output:**
	- Print the head of the generated dataframe to verify the changes.
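
For cross-checking the interactive result, the same column removal can be sketched in plain pandas. This is a minimal sketch on a tiny stand-in frame; the name `sample` and its values are illustrative assumptions, not part of the study notebook.

```python
import pandas as pd

# Tiny stand-in for the avalanche data (values are illustrative only).
sample = pd.DataFrame({
    ";Region": ["Salt Lake", "Salt Lake"],
    "Depth_inches": ["14.0", '18.0"'],
    "Coordinates": ["40.577977, -111.595817", "40.598313, -111.628304"],
    "Comments 1": ["Small avalanche ...", "Fast powder cloud ..."],
})

# Drop both columns in one call; errors="ignore" skips names that are already gone.
dropped = sample.drop(columns=["Comments 1", "Coordinates"], errors="ignore")
print(dropped.columns.tolist())
```

Comparing `dropped.columns` with the columns of the dataframe generated by the table is a quick way to confirm that both columns were removed.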

In [3]:
PR.PersistTable(df, df_name="df_task_1a")

PersistWidget(data_values=[{'__id_column': '1', ';Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, '…

In [4]:
df_task_1a.head()

Unnamed: 0,;Region,Month,Day,Year,;Trigger,;Weak Layer,Depth_inches,Vertical_inches,;Aspect,Elevation_feet,Coordinates
0,Salt Lake,11,9,2012,Snowboarder,New Snow/Old Snow Interface,14.0,360.0,North,10400.0,"40.577977000000, -111.595817000000"
1,Salt Lake,11,11,2012,Skier,New Snow/Old Snow Interface,30.0,1200.0,North,9700.0,"40.592619000000, -111.616099000000"
2,Salt Lake,11,11,2012,Skier,Facets,36.0,5400.0,North,10200.0,"40.599291000000, -111.642315000000"
3,Salt Lake,11,11,2012,Skier,New Snow,"18.0""",6000.0,Southeast,10200.0,"40.598313000000, -111.628304000000"
4,Salt Lake,11,11,2012,Skier,Facets,42.0,9600.0,North,10400.0,"40.578590000000, -111.595087000000"


### **Task 1b: Fix Column Names**

It looks like something went wrong when reading the file and some column headers start with a `;`. **Please remove the semicolon from all headers**. 

#### **Instructions**
1. **Rename Columns:**
    - Use the interactive table in Persist to correct the column names by removing the leading `;` from their names:
        - _;Aspect_ → _Aspect_
        - _;Region_ → _Region_
        - _;Trigger_ → _Trigger_
        - _;Weak Layer_ → _Weak Layer_
2. **Verify the Output:**
	- Print the head of the generated dataframe to verify the changes.
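
As a reference point, the renaming described above could be expressed in pandas roughly as follows. The empty `sample` frame is a hypothetical stand-in that only carries the affected headers.

```python
import pandas as pd

# Hypothetical stand-in with the same kind of broken headers.
sample = pd.DataFrame(columns=[";Region", "Month", ";Trigger", ";Weak Layer", ";Aspect"])

# Strip leading semicolons from every header; other names are left untouched.
renamed = sample.rename(columns=lambda name: name.lstrip(";"))
print(renamed.columns.tolist())
```

Using a function instead of an explicit mapping means no header is missed, which is an easy mistake when renaming columns one at a time.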

In [5]:
PR.PersistTable(df_task_1a, df_name="df_task_1b")

PersistWidget(data_values=[{'__id_column': '1', ';Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, '…

In [7]:
df_task_1b.head()

Unnamed: 0,Region,Month,Day,Year,Trigger,Weak Layer,Depth_inches,Vertical_inches,Aspect,Elevation_feet,Coordinates
0,Salt Lake,11,9,2012,Snowboarder,New Snow/Old Snow Interface,14.0,360.0,North,10400.0,"40.577977000000, -111.595817000000"
1,Salt Lake,11,11,2012,Skier,New Snow/Old Snow Interface,30.0,1200.0,North,9700.0,"40.592619000000, -111.616099000000"
2,Salt Lake,11,11,2012,Skier,Facets,36.0,5400.0,North,10200.0,"40.599291000000, -111.642315000000"
3,Salt Lake,11,11,2012,Skier,New Snow,"18.0""",6000.0,Southeast,10200.0,"40.598313000000, -111.628304000000"
4,Salt Lake,11,11,2012,Skier,Facets,42.0,9600.0,North,10400.0,"40.578590000000, -111.595087000000"


In [6]:
df_task_1b.head()

Unnamed: 0,Region,Month,Day,Year,Trigger,;Weak Layer,Depth_inches,Vertical_inches,Aspect,Elevation_feet,Coordinates
0,Salt Lake,11,9,2012,Snowboarder,New Snow/Old Snow Interface,14.0,360.0,North,10400.0,"40.577977000000, -111.595817000000"
1,Salt Lake,11,11,2012,Skier,New Snow/Old Snow Interface,30.0,1200.0,North,9700.0,"40.592619000000, -111.616099000000"
2,Salt Lake,11,11,2012,Skier,Facets,36.0,5400.0,North,10200.0,"40.599291000000, -111.642315000000"
3,Salt Lake,11,11,2012,Skier,New Snow,"18.0""",6000.0,Southeast,10200.0,"40.598313000000, -111.628304000000"
4,Salt Lake,11,11,2012,Skier,Facets,42.0,9600.0,North,10400.0,"40.578590000000, -111.595087000000"


### **Task 1c: Correcting Data Type of 'Depth_inches'**

There is a data type issue in the `Depth_inches` column of our dataframe. This column is incorrectly formatted as an object (string) due to the presence of the inches symbol `"`.

Remove any inches symbols `"` from the `Depth_inches` column and convert it to a float data type.

In [None]:
df_task_1b.dtypes

#### **Instructions**
1. **Identify Entries with Inches Symbol:**
    - Use the interactive table in Persist to look for rows with `"` in the `Depth_inches` column.
    - _Hint_: You can search in the interactive table
2. **Edit and Correct Entries:**
    - Edit the cells to remove the inches symbol from these entries. (e.g. `15"` → `15`)
3. **Convert Data Type:**
    - Change the data type of the `Depth_inches` column from string to float.
4. **Verify the Output:**
	- Print the head of the generated dataframe to verify the changes.
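
The clean-up and conversion steps above can be sketched in pandas to check the result. This is a sketch under assumptions: `sample` is an illustrative frame, and the only stray character is assumed to be the inches symbol.

```python
import pandas as pd

# Illustrative values; one entry carries a trailing inches symbol.
sample = pd.DataFrame({"Depth_inches": ["14.0", "30.0", '18.0"', "42.0"]})

cleaned = sample.copy()
# Remove the literal '"' character, then parse the strings as numbers.
# errors="coerce" turns anything still unparseable into NaN instead of raising.
cleaned["Depth_inches"] = pd.to_numeric(
    cleaned["Depth_inches"].astype(str).str.replace('"', "", regex=False).str.strip(),
    errors="coerce",
)
print(cleaned.dtypes)
```

After the conversion, `cleaned.dtypes` should report `float64` for `Depth_inches`, which is the same check performed with `df_task_1b.dtypes` before the fix.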

In [8]:
PR.PersistTable(df_task_1b, df_name="df_task_1c")

PersistWidget(data_values=[{'__id_column': '1', 'Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, 'T…

In [9]:
df_task_1c.head()

Unnamed: 0,Region,Month,Day,Year,Trigger,Weak Layer,Depth_inches,Vertical_inches,Aspect,Elevation_feet,Coordinates
0,Salt Lake,11,9,2012,Snowboarder,New Snow/Old Snow Interface,14.0,360.0,North,10400.0,"40.577977000000, -111.595817000000"
1,Salt Lake,11,11,2012,Skier,New Snow/Old Snow Interface,30.0,1200.0,North,9700.0,"40.592619000000, -111.616099000000"
2,Salt Lake,11,11,2012,Skier,Facets,36.0,5400.0,North,10200.0,"40.599291000000, -111.642315000000"
3,Salt Lake,11,11,2012,Skier,New Snow,18.0,6000.0,Southeast,10200.0,"40.598313000000, -111.628304000000"
4,Salt Lake,11,11,2012,Skier,Facets,42.0,9600.0,North,10400.0,"40.578590000000, -111.595087000000"


In [10]:
PR.PersistTable(df_task_1c)

PersistWidget(data_values=[{'__id_column': '1', 'Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, 'T…

# Task 2: Filtering data

In Task 2, we further clean the data by removing outliers and dropping older records so that the dataset is more consistent.

## **Task 2a: Remove Outliers**

In this task, we address data accuracy by filtering out anomalies in the elevation data. We observe some records with elevations outside the plausible range for Utah, suggesting recording errors.

**Remove avalanche records with elevations below ~4,000 feet or above ~15,000 feet, which are outside the realistic range for Utah.**

#### **Instructions**
1. **Identify and Remove Anomalies:**
    - Interactively select data points with elevations below ~4,000 feet or above ~15,000 feet in the Persist Scatterplot.
    - Use Persist's interactive features to remove these anomalous records.
2. **Verify the Output:**
    - Print the head of the generated dataframe to verify the changes.
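
The brush selection in the scatterplot corresponds roughly to a range filter. The sketch below assumes hard cut-offs of exactly 4,000 and 15,000 feet (the task only gives approximate values) and uses an illustrative `sample` frame.

```python
import pandas as pd

# Illustrative elevations, including two implausible values.
sample = pd.DataFrame({
    "Elevation_feet": [10400.0, 9700.0, 120.0, 98000.0, 4000.0],
    "Vertical_inches": [360.0, 1200.0, 500.0, 800.0, 240.0],
})

# between() is inclusive on both ends; rows with a missing elevation
# evaluate to False and are therefore dropped as well.
in_range = sample["Elevation_feet"].between(4000, 15000)
filtered = sample[in_range]
print(f"removed {len(sample) - len(filtered)} of {len(sample)} rows")
```

Printing how many rows were removed makes it easier to compare against the number of points selected interactively.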

In [11]:
PR.plot.scatterplot(df_task_1c, "Elevation_feet:Q", "Vertical_inches:Q", df_name="df_task_2a")

PersistWidget(data_values=[{'__id_column': '1', 'Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, 'T…

In [12]:
df_task_2a.head()

Unnamed: 0,Region,Month,Day,Year,Trigger,Weak Layer,Depth_inches,Vertical_inches,Aspect,Elevation_feet,Coordinates
0,Salt Lake,11,9,2012,Snowboarder,New Snow/Old Snow Interface,14,360.0,North,10400.0,"40.577977000000, -111.595817000000"
1,Salt Lake,11,11,2012,Skier,New Snow/Old Snow Interface,30,1200.0,North,9700.0,"40.592619000000, -111.616099000000"
2,Salt Lake,11,11,2012,Skier,Facets,36,5400.0,North,10200.0,"40.599291000000, -111.642315000000"
3,Salt Lake,11,11,2012,Skier,New Snow,18,6000.0,Southeast,10200.0,"40.598313000000, -111.628304000000"
4,Salt Lake,11,11,2012,Skier,Facets,42,9600.0,North,10400.0,"40.578590000000, -111.595087000000"


In [13]:
PR.plot.scatterplot(df_task_2a, "Elevation_feet:Q", "Vertical_inches:Q", df_name="df_task_2a")

PersistWidget(data_values=[{'__id_column': '1', 'Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, 'T…

## **Task 2b: Filtering Out Old Data**

The interactive bar chart below shows the data aggregated by year. There are noticeably fewer records for the years before 2010.

In this subtask, we will remove the older records, keeping only the records from 2010 onward.

#### **Instructions**
1. **Analyze the Bar Chart:**
    - Identify the bars that show significantly less data.
2. **Filter Year:**
    - Select and remove the years with few records.
3. **Verify the Output:**
    - Print the head of the generated dataframe to verify the changes.
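
A pandas sketch of the same year filter, for comparison with the bar-chart selection. The `sample` frame is illustrative; the 2010 cut-off comes from the task description.

```python
import pandas as pd

# Illustrative years spanning the range described in the data description.
sample = pd.DataFrame({"Year": [2004, 2009, 2010, 2012, 2023]})

# Records per year, which is what the bar chart aggregates.
print(sample["Year"].value_counts().sort_index())

# Keep only records from 2010 onward.
recent = sample[sample["Year"] >= 2010]
print(recent["Year"].min())
```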

In [14]:
PR.plot.barchart(df_task_2a, "Year:O", "count()", selection_type="interval", df_name="df_task_2b")

PersistWidget(data_values=[{'__id_column': '1', 'Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, 'T…

In [15]:
df_task_2b.head()

Unnamed: 0,Region,Month,Day,Year,Trigger,Weak Layer,Depth_inches,Vertical_inches,Aspect,Elevation_feet,Coordinates
0,Salt Lake,11,9,2012,Snowboarder,New Snow/Old Snow Interface,14,360,North,10400,"40.577977000000, -111.595817000000"
1,Salt Lake,11,11,2012,Skier,New Snow/Old Snow Interface,30,1200,North,9700,"40.592619000000, -111.616099000000"
2,Salt Lake,11,11,2012,Skier,Facets,36,5400,North,10200,"40.599291000000, -111.642315000000"
3,Salt Lake,11,11,2012,Skier,New Snow,18,6000,Southeast,10200,"40.598313000000, -111.628304000000"
4,Salt Lake,11,11,2012,Skier,Facets,42,9600,North,10400,"40.578590000000, -111.595087000000"


# Task 3: Data Wrangling

### **Task 3a: Creating and assigning 'Avalanche Season'**

Next, we'll introduce a new categorical variable named `Avalanche Season` into our dataset. This addition aims to classify each avalanche record into different parts of the avalanche season (Start, Middle, End) based on the month it occurred in.

Create a new category `Avalanche Season` in the dataset and assign each record to `Start`, `Middle`, or `End` of the avalanche season based on its month.

#### **Instructions**
1. **Define Season Categories:**
    - Based on typical avalanche seasons in Utah, create a new category called `Avalanche Season`.
    - Add three options for this category -- `Start`, `Middle`, `End`.
2. **Interactive Category Assignment:**
    - Use Persist's interactive features to select each month and assign it to one of the `Avalanche Season` values (Start, Middle, End).
    - You should use the following ranges for assigning proper categories:
        - `Start` of Season for _October_, _November_, _December_
        - `Middle` of Season for _January_, _February_, _March_
        - `End` of Season for _April_, _May_, _June_
3. **Verify the Output:**
    - Print the head of the generated dataframe to verify the changes.
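
The month ranges above amount to a lookup from month number to season. The sketch below is one way to express that in pandas. The `sample` frame and the dictionary name are illustrative, and labeling unlisted months as `No Assignment` is an assumption that mirrors the label visible in the notebook output.

```python
import pandas as pd

# Month number -> season label, following the ranges listed in the instructions.
season_by_month = {
    10: "Start", 11: "Start", 12: "Start",
    1: "Middle", 2: "Middle", 3: "Middle",
    4: "End", 5: "End", 6: "End",
}

sample = pd.DataFrame({"Month": [11, 1, 6, 8]})

# Months outside the three ranges (July to September) have no entry in the
# mapping; they become NaN and are then labeled explicitly.
sample["Avalanche Season"] = (
    sample["Month"].map(season_by_month).fillna("No Assignment")
)
print(sample)
```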

In [16]:
select = alt.selection_interval(name="selector", encodings=["x"])

chart = alt.Chart(df_task_2b, height=400, width=500).mark_bar().encode(
    x=alt.X("Month:N").sort([10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
    y="count()",
    opacity=alt.condition(select, alt.value(1), alt.value(0.2)),
    tooltip="month(Month):N"
).add_params(select)

PR.PersistChart(chart, data=df_task_2b, df_name="df_task_3a")

PersistWidget(data_values=[{'__id_column': '1', 'Region': 'Salt Lake', 'Month': 11, 'Day': 9, 'Year': 2012, 'T…

In [17]:
df_task_3a.head()

Unnamed: 0,Avalanche Season,Ava,Region,Month,Day,Year,Trigger,Weak Layer,Depth_inches,Vertical_inches,Aspect,Elevation_feet,Coordinates
0,Start,No Assignment,Salt Lake,11,9,2012,Snowboarder,New Snow/Old Snow Interface,14,360,North,10400,"40.577977000000, -111.595817000000"
1,Start,No Assignment,Salt Lake,11,11,2012,Skier,New Snow/Old Snow Interface,30,1200,North,9700,"40.592619000000, -111.616099000000"
2,Start,No Assignment,Salt Lake,11,11,2012,Skier,Facets,36,5400,North,10200,"40.599291000000, -111.642315000000"
3,Start,No Assignment,Salt Lake,11,11,2012,Skier,New Snow,18,6000,Southeast,10200,"40.598313000000, -111.628304000000"
4,Start,No Assignment,Salt Lake,11,11,2012,Skier,Facets,42,9600,North,10400,"40.578590000000, -111.595087000000"


### **Task 3b: Analyzing Top Avalanche Trigger by Season**

Now we'll analyze which trigger is most prevalent for avalanches in each season phase (Start, Middle, End) using the `Avalanche Season` category created in Task 3a.

#### **Instructions**
1. **Context:**
    - We have a faceted bar chart. The `x` axis encodes the `Trigger` column in the data and the columns encode the newly added category `Avalanche Season`.
2. **Analyze Trigger Data:**
    - Observe the most common trigger for each season.
    - You can hover on the bars to get the exact frequency.
3. **Document Findings:**
    - Note down the most common trigger for each season based on your interactive analysis in the markdown cell.

In [18]:
NEW_COLUMN = "Avalanche Season"

chart = alt.Chart(df_task_3a).mark_bar().encode(
    x="Trigger:N",
    y="count():Q",
    color=alt.Color(f"{NEW_COLUMN}:N").sort(["Start", "Middle", "End"]),
    column=alt.Column(f"{NEW_COLUMN}:N").sort(["Start", "Middle", "End"]),
    tooltip="count()"
)

PR.PersistChart(chart)

PersistWidget(data_values=[{'__id_column': '1', 'Avalanche Season': 'Start', 'Ava': 'No Assignment', 'Region':…

**Task 3b Notes:**

- Most common Trigger for `Start` of the season: Skier (63)
- Most common Trigger for `Middle` of the season: Natural (722)
- Most common Trigger for `End` of the season: Natural (88)
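
Findings like these can be double-checked with a cross-tabulation. The sketch below uses a small made-up `sample` frame, so its counts do not correspond to the numbers noted above.

```python
import pandas as pd

# Made-up records; only the column names match the study data.
sample = pd.DataFrame({
    "Avalanche Season": ["Start", "Start", "Start", "Middle", "Middle", "End"],
    "Trigger": ["Skier", "Skier", "Natural", "Natural", "Natural", "Natural"],
})

# Rows are seasons, columns are triggers, cells are record counts.
table = pd.crosstab(sample["Avalanche Season"], sample["Trigger"])

# For each season, the trigger with the highest count and that count.
# On ties, idxmax returns the first column with the maximum.
top_trigger = table.idxmax(axis=1)
top_count = table.max(axis=1)
print(pd.DataFrame({"Trigger": top_trigger, "Count": top_count}))
```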