**_I will use color to differentiate instructions for <font color="steelblue">persist mode</font> and <font color="red">pandas mode</font>_**

# Data Analysis

In this notebook you will work with the data for [avalanches in Utah](https://utahavalanchecenter.org/avalanches). The data was sourced from the [Utah Avalanche Center](https://utahavalanchecenter.org/).

We will ask you to perform three basic data analysis tasks. Please follow the instructions and note your findings and/or any issues you face in markdown.

<font color="steelblue">In this notebook, you will be using the persIst extension to perform the tasks. The extension is already installed and enabled in this notebook environment. Please refer to the [tutorial notebook](abc.com) as needed to understand how to use the extension. To create a new interactive chart, you have to create the visualization using the [altair](https://altair-viz.github.io) visualization library. We have interactive charts created for the tasks which you can use directly instead.</font>

<font color="red">In this notebook, you will be using pandas to perform the tasks. Pandas is already installed and enabled in this notebook environment. You are free to use any library like _matplotlib_, _seaborn_, _altair_, etc. to create your visualizations. These libraries are installed as well. You can use the internet to get help with Python syntax or with how to use a particular library.</font>

You can add new code and markdown cells as required to complete the tasks.

In [1]:
import altair as alt # Load altair for charting

import pandas as pd # Load pandas for data processing

import persist_ext as PR # Import the extension

## Data Description

The table below describes the different columns in the dataset. Each row in the dataset is a reported avalanche with its location, trigger, and the aspect of the slope. The data spans multiple years, from 2004 up to 2023.

| Column          | Description                                                      |
|-----------------|------------------------------------------------------------------|
| Date            | This is the date on which the instance of avalanche was recorded |
| Region          | Region in Utah where the instance was recorded                   |
| Place           | Exact place where the instance was recorded                      |
| Trigger         | The cause of the avalanche                                       |
| Weak Layer      | Layer of the snow that was weakest and likely the one to fail    |
| Depth_inches    |                                                                  |
| Width_inches    |                                                                  |
| Vertical_inches |                                                                  |
| Aspect          | Direction of the slope where the avalanche happened              |
| Elevation_feet  |                                                                  |
| Coordinates     | Approximate location of the avalanche                            |
| Comments 1      | Comments added by the reporter                                   |

In [2]:
df = pd.read_csv('./avalanches_cleaned.csv')
df

Unnamed: 0,Date,;Region,Place,;Trigger,;Weak Layer,Depth_inches,Width_inches,Vertical_inches,;Aspect,Elevation_feet,Coordinates,Comments 1
0,11/9/2012,Salt Lake,Sunset Peak,Snowboarder,New Snow/Old Snow Interface,14.0,960.0,360.0,North,10400.0,"40.577977000000, -111.595817000000",While it was a small avalanche that was I caug...
1,11/11/2012,Salt Lake,Patsy Marly,Skier,New Snow/Old Snow Interface,30.0,1200.0,1200.0,North,9700.0,"40.592619000000, -111.616099000000",A North facing aspect with an exposed ridge in...
2,11/11/2012,Salt Lake,Two Dogs,Skier,Facets,36.0,840.0,5400.0,North,10200.0,"40.599291000000, -111.642315000000",Remotely triggered all the new storm snow (abo...
3,11/11/2012,Salt Lake,Emma Ridges,Skier,New Snow,18.0,600.0,6000.0,Southeast,10200.0,"40.598313000000, -111.628304000000",Impressive fast powder cloud ran in front of t...
4,11/11/2012,Salt Lake,Sunset Peak,Skier,Facets,42.0,18000.0,9600.0,North,10400.0,"40.578590000000, -111.595087000000",Three of us toured from Brighton to low saddle...
...,...,...,...,...,...,...,...,...,...,...,...,...
2387,4/22/2023,Salt Lake,Cardiff Bowl,Unknown,New Snow/Old Snow Interface,8.0,720.0,1800.0,East,9800.0,"40.592721660567, -111.649613218710",We spent the day skiing the southerly-facing a...
2388,4/22/2023,Logan,"Miller Bowl, East",Snowmobiler,New Snow/Old Snow Interface,18.0,540.0,4800.0,North,8700.0,"41.886233332343, -111.645074831510","Not sure about the story here, but we observed..."
2389,4/22/2023,Logan,Millville Peak,Snowboarder,New Snow/Old Snow Interface,12.0,3600.0,7200.0,North,8900.0,"41.677564539953, -111.718065248970",Details are a bit limited and we're not sure w...
2390,5/7/2023,Salt Lake,Red Top Mountain,Natural,Unknown,72.0,3000.0,12000.0,West,10800.0,"40.546874131921, -111.663880335390",Saw this avalanche around 9.30 AM from the top...


## Task 1: Cleaning the dataframe

In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.

### Task 1a: Removing columns not required for the analysis; and fixing column names.

When we print the **df** dataframe we see that we have a column called **Comments 1**. It has text comments made for each recorded instance. For our current analysis we are not going to use this column. We will drop this column from the dataframe.

We also have four columns, **;Date**, **;Region**, **;Trigger**, and **;Weak Layer**, with a leading `;` character in the column name. We should rename these columns to **Date**, **Region**, **Trigger**, and **Weak Layer** respectively by removing the leading `;` character.

<font color="steelblue">PersIst extension gives you an interactive data table. You can perform rename & drop column operations directly in the header.</font>

<font color="steelblue">Once you have done the operations, you should create a new dataframe called **cols_fixed_df**</font>

<font color="red">Write code to rename the four columns and drop the **Comments 1** column as specified. Assign the result to a new dataframe called **cols_fixed_df**</font>

**GOAL**: The task demonstrates the effectiveness of the interactive table and its ease of use for performing column actions.

#### <font color="steelblue">Persist solution</font>

In [3]:
PR.vis.interactive_table(df)

OutputWithTrrackWidget(body_widget=InteractiveTableWidget(df_column_dtypes={'__id_column': 'string', 'Date': '…

In [4]:
cols_fixed_df_persist = PR.df.get("cols_fixed_df_persist")
cols_fixed_df_persist

#### <font color="red">Pandas Solution</font>

In [5]:
df.head()

In [None]:
# Remove the stray leading ";" from the affected column names,
# then drop the comments column that is not used in the analysis.
cols_fixed_df_pandas = df.rename(columns={";Date": "Date",
                                          ";Region": "Region",
                                          ";Trigger": "Trigger",
                                          ";Weak Layer": "Weak Layer"}).drop(columns=["Comments 1"])
cols_fixed_df_pandas
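The column fix described in this task can also be done without listing every column by hand. The snippet below is a minimal sketch on a small made-up frame: the `sample` data and the `fix_columns` helper are illustrative assumptions, not part of the study notebook. It strips any leading `;` from all column names with `str.lstrip` and then drops **Comments 1**.

```python
import pandas as pd


def fix_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with leading ';' stripped from column names and 'Comments 1' dropped."""
    fixed = frame.copy()
    fixed.columns = fixed.columns.str.lstrip(";")
    # errors="ignore" keeps the call safe if the comments column is already gone
    return fixed.drop(columns=["Comments 1"], errors="ignore")


# Hypothetical sample that mimics the column naming problem in the dataset
sample = pd.DataFrame({
    ";Region": ["Salt Lake", "Logan"],
    ";Trigger": ["Skier", "Snowmobiler"],
    ";Weak Layer": ["Facets", "New Snow"],
    "Comments 1": ["first note", "second note"],
})

cols_fixed_sample = fix_columns(sample)
print(list(cols_fixed_sample.columns))  # ['Region', 'Trigger', 'Weak Layer']
```

Because the helper works on a copy, the input frame keeps its original column names, which makes it easy to compare the two versions.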

### Task 1b: Removing outliers in the records for Elevation and Vertical

Below we have an interactive scatterplot with `Elevation` on the X-axis and `Vertical` on the Y-axis. There are some outliers in the data, possibly caused by incorrect input when the avalanche instance was recorded. We should remove these outliers before we proceed with the analysis.

<font color="steelblue">The scatterplot below is an interactive altair plot. You can select the points in the plot using a rectangular brush. You can only have one brush active at a time.</font>

<font color="steelblue">The extension tries to help you complete selections faster by suggesting algorithmic selections based on your initial brush. You can look at these suggestions in the **Intent** tab. Hovering over any of the suggested selections highlights the points that will be selected. If one of the suggested options fits what you wanted to select, you can click the checkmark button on the right of the suggestion to apply the selection.</font>

<font color="steelblue">Once you are satisfied that you have selected the outliers, use the **Filter** button in the header to remove them. After filtering out all the outliers, you can use the generate dataframe button in the header to create a new python variable. Create a new dataframe called **cleaned_df** and print it in the next cell.</font>

<font color="red">The scatterplot below is a scatterplot in seaborn.</font>

<font color="red">Remove the outliers you see in the plots, and print the final plot.</font>

<font color="red">Assign the cleaned dataframe to a variable called **cleaned_df**</font>

Abstract: Remove outliers/cluster based on a pattern

#### <font color="steelblue">Persist solution</font>

In [None]:
PR.vis.scatterplot(df, "Elevation_feet", "Vertical_inches")

In [None]:
clean2 = PR.df.get("clean")
clean2

In [None]:
cleaned_df_persist = PR.df.get("cleaned_df_persist")
cleaned_df_persist

#### <font color="red">Pandas solution</font>

In [None]:
plot = alt.Chart(df).mark_point().encode(x="Elevation_feet", y="Vertical_inches")

plot

In [None]:
cleaned_df_pandas = df[df["Elevation_feet"] < 15000]

plot = alt.Chart(cleaned_df_pandas).mark_point().encode(x="Elevation_feet", y="Vertical_inches")

plot

In [None]:
cleaned_df_pandas = cleaned_df_pandas[cleaned_df_pandas["Elevation_feet"] > 2000]

plot = alt.Chart(cleaned_df_pandas).mark_point().encode(x="Elevation_feet", y="Vertical_inches")

plot
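The two threshold filters above can also be written as a single range check. The following is a minimal sketch on made-up values (the `sample` frame is an assumption for illustration; the bounds 2000 and 15000 are the ones used in the cells above). It relies on `Series.between` with `inclusive="neither"`, which needs pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical sample with two obvious data-entry errors and one missing value
sample = pd.DataFrame({
    "Elevation_feet": [10400.0, 97000.0, 950.0, 8700.0, None],
    "Vertical_inches": [360.0, 1200.0, 4800.0, 7200.0, 600.0],
})

LOW, HIGH = 2000, 15000  # same bounds as the two separate filters

# Strict bounds on both sides; missing elevations evaluate to False and are dropped
in_range = sample["Elevation_feet"].between(LOW, HIGH, inclusive="neither")
cleaned_sample = sample[in_range]
print(cleaned_sample)
```

As with the step-by-step filters, rows with a missing elevation are removed, so it is worth checking how many records that affects before adopting the thresholds.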

### Task 1c: Filter out old data

Below we have an interactive barchart with the data aggregated by year. The `Year` is on the X-axis and the `number of records for the year` is on the Y-axis. For the two years before 2010, we have very few records. We will remove these records from our dataset.

<font color="steelblue">The barchart below is an interactive altair plot. You can select the bars in the plot using a rectangular brush along Y-axis. You can only have one brush active at a time.</font>

<font color="steelblue">Once you are satisfied that you have selected the years to be filtered, use the **Filter** button in the header to either keep the years or remove them depending on your selection. After filtering the years as instructed, you can use the generate dataframe button in the header to create a new python variable. Create a new dataframe called **post_2010_df** and print it in the next cell.</font>

<font color="red">The barchart below is a barchart in seaborn.</font>

<font color="red">Filter out the data points as instructed using pandas.</font>

<font color="red">Assign the new dataframe to a variable called **post_2010_df** and plot it the same as the given barchart</font>

Abstract: Filter out data in a range

#### <font color="steelblue">Persist solution</font>

In [None]:
PR.vis.barchart(cols_fixed_df_persist, "utcyear(Date):O", "count()")

In [None]:
post_2010_df_persist = PR.df.get("post_2010_df_persist")
post_2010_df_persist

#### Pandas solution

In [None]:
alt.Chart(cols_fixed_df_pandas).mark_bar().encode(x="year(Date):O", y="count()")

In [None]:
# Work on a copy so that cols_fixed_df_pandas is not modified in place
post_2010_df_pandas = cols_fixed_df_pandas.copy()
post_2010_df_pandas["Date"] = pd.to_datetime(post_2010_df_pandas["Date"], utc=True)
post_2010_df_pandas

In [None]:
post_2010_df_pandas = post_2010_df_pandas[post_2010_df_pandas["Date"].dt.year >= 2010]
post_2010_df_pandas

In [None]:
alt.Chart(post_2010_df_pandas).mark_bar().encode(x="utcyear(Date):O", y="count()")
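The per-year record counts that the barchart shows can also be checked numerically before filtering. This is a minimal sketch on invented dates: the `sample` frame is an assumption, and the month/day/year format string is inferred from the values shown in the data preview.

```python
import pandas as pd

# Hypothetical sample using the same month/day/year layout as the Date column
sample = pd.DataFrame({
    "Date": ["3/1/2008", "12/25/2009", "11/9/2012", "11/11/2012", "4/22/2023"],
})
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y", utc=True)

# Number of records per year, sorted chronologically
records_per_year = sample["Date"].dt.year.value_counts().sort_index()
print(records_per_year)

# Keep 2010 and later; .copy() avoids chained-assignment warnings in later steps
post_2010_sample = sample[sample["Date"].dt.year >= 2010].copy()
print(len(post_2010_sample))  # 3
```

Looking at the counts first makes it clear which years are sparse, so the cutoff year is chosen from the data rather than assumed.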

## Task 2

We will use the new dataset for avalanches post 2010 to further analyse the relationship between phases of the avalanche season and other variables.

### Task 2a: Categorize data in phases

Our data is missing the phases of the season! We have to add a new column to the dataset called `Avalanche Season Phase`. The new column will have three values: `Start`, `Middle`, `End`. You have to assign each record to one of these values depending on the month. Refer to the following mapping:
- **Dec - Feb** -> `Start`
- **Mar - May** -> `Middle`
- **June - Nov** -> `End`

<font color="steelblue">First, we have to create a new category and add options to it. You can click the add category button in the header to open the category popup. Make a new category called **Avalanche Season Phase**. Then add three new options to the category: **Start**, **Middle**, **End**.</font>

<font color="steelblue">The barchart below is an interactive altair plot. You can select the bars in the plot by clicking on them. Press `shift` while clicking to select multiple bars. Clicking on empty area in the chart, clears the selection.</font>

<font color="steelblue">Once you have selected the months that should belong to the same category, you can use the categorize button in the header to assign the proper category to your selection.</font>

<font color="steelblue">When you are done with all the categorization, create a new dataframe called **season_phase_df** and print the grouped dataframe (grouped by the new column) in the next cell.</font>


<font color="red">The barchart below is a barchart in seaborn.</font>

<font color="red">Assign categories to different subsets of the data using pandas.</font>

<font color="red">Assign the new dataframe to a variable called **season_phase_df** and plot it the same as the barchart earlier</font>

Abstract: Categorize the data. Start with creating the category and then categorize.

#### Persist solution

In [None]:
PR.vis.barchart(post_2010_df_persist, alt.X("utcmonth(Date):O").sort(["Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov"]), "count()")

In [None]:
sdf_grouped = PR.df.get("sdf_grouped", groupby="Avalanche Season Phase", aggregate={"Depth_inches": "mean", "Width_inches": "mean", "Vertical_inches": "mean", "Elevation_feet": "mean"})
sdf_grouped

In [None]:
season_phase_data_persist_grouped = PR.df.get("season_phase_data_persist_grouped", groupby="Avalanche Season Phase", aggregate={"Depth_inches": "mean", "Width_inches": "mean", "Vertical_inches": "mean", "Elevation_feet": "mean"})
season_phase_data_persist_grouped

#### Pandas solution

In [None]:
post_2010_df_pandas.head()

In [None]:
# Work on a copy so that post_2010_df_pandas is not modified in place
season_phase_data_persist = post_2010_df_pandas.copy()
season_phase_data_persist["month"] = season_phase_data_persist["Date"].dt.month
season_phase_data_persist

In [None]:
season_phase_data_persist["Avalanche Season Phase"] = "Middle"

season_phase_data_persist.loc[(
    (season_phase_data_persist["month"] == 12) |
    (season_phase_data_persist["month"] == 1) |
    (season_phase_data_persist["month"] == 2)
),["Avalanche Season Phase"]] = "Start"


season_phase_data_persist.loc[(
    (season_phase_data_persist["month"] >= 6) &
    (season_phase_data_persist["month"] <= 11)
),["Avalanche Season Phase"]] = "End"

season_phase_data_persist.head()

In [None]:
season_phase_data_persist_grouped = season_phase_data_persist.groupby("Avalanche Season Phase").agg({"Depth_inches": "mean", "Width_inches": "mean", "Vertical_inches": "mean", "Elevation_feet": "mean"})
season_phase_data_persist_grouped
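The month-to-phase rules from Task 2a can also be expressed as a lookup table instead of a series of boolean masks. The sketch below uses a made-up `sample` frame and a hypothetical `PHASE_BY_MONTH` dictionary (neither is part of the notebook) and applies the mapping with `Series.map`.

```python
import pandas as pd

# Month number -> phase, following the rules listed in Task 2a
PHASE_BY_MONTH = {12: "Start", 1: "Start", 2: "Start",
                  3: "Middle", 4: "Middle", 5: "Middle"}
PHASE_BY_MONTH.update({month: "End" for month in range(6, 12)})  # Jun to Nov

# Hypothetical sample with one date per phase plus a second "Start" month
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2012-12-05", "2013-01-20", "2013-04-02", "2013-11-09"], utc=True
    ),
})

sample["Avalanche Season Phase"] = sample["Date"].dt.month.map(PHASE_BY_MONTH)

# An ordered categorical keeps Start < Middle < End when sorting or grouping
phase_order = pd.CategoricalDtype(["Start", "Middle", "End"], ordered=True)
sample["Avalanche Season Phase"] = sample["Avalanche Season Phase"].astype(phase_order)

print(sample["Avalanche Season Phase"].tolist())  # ['Start', 'Start', 'Middle', 'End']
```

Keeping the rules in one dictionary means every month is covered explicitly, and a month missing from the table would show up as a missing value rather than silently falling into a default phase.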

### Task 2b:

Abstract: Note an interesting pattern based on the plot created from the newly categorized data.


Branches for iteration not for analysis

In [None]:
PR.vis.barchart(season_phase_data_persist, alt.X("Trigger:N").sort("-y"), "count()").facet(alt.Facet("Avalanche Season Phase:N").sort(["Start", "Middle", "End"]))

In [None]:
PR.vis.barchart(season_phase_data_persist, alt.X("Aspect:N").sort("-y"), "count()").facet(alt.Facet("Avalanche Season Phase:N").sort(["Start", "Middle", "End"]))

In [None]:
PR.vis.barchart(season_phase_data_persist, alt.X("Vertical_inches:Q").bin().sort("-y"), "count()").facet(alt.Facet("Avalanche Season Phase:N").sort(["Start", "Middle", "End"]))

In [None]:
PR.vis.heatmap(season_phase_data_persist, "Avalanche Season Phase:N", "Aspect:N", "mean(Depth_inches)")