# Data Analysis

In this notebook you will work with the data for [avalanches in Utah](https://utahavalanchecenter.org/avalanches). The data was sourced from the [Utah Avalanche Center](https://utahavalanchecenter.org/).

We will ask you to perform three basic data analysis tasks. Please follow the instructions and note your findings and/or any issues you face in markdown.

<font color="steelblue">In this notebook, you will be using the persIst extension to perform the tasks. The extension is already installed and enabled in this notebook environment. Please refer to the [tutorial notebook](abc.com) as needed to understand how to use the extension. To create a new interactive chart, you have to create the visualization using the [altair](https://altair-viz.github.io) visualization library. We have interactive charts created for the tasks which you can use directly instead.</font>

You can add new code and markdown cells as required to complete the tasks.

In [None]:
import altair as alt # Load altair for charting

import pandas as pd # Load pandas for data processing

import persist_ext as PR # Import the extension

## Data Description

The table below describes the different columns in the dataset. Each row in the dataset is a reported avalanche with its location, trigger, and the aspect of the slope. The data spans multiple years, from 2004 up to 2023.

| Column          | Description                                                      |
|-----------------|------------------------------------------------------------------|
| Date            | This is the date on which the instance of avalanche was recorded |
| Region          | Region in Utah where the instance was recorded                   |
| Place           | Exact place where the instance was recorded                      |
| Trigger         | The cause of the avalanche                                       |
| Weak Layer      | Layer of the snow that was weakest and likely the one to fail    |
| Depth_inches    |                                                                  |
| Width_inches    |                                                                  |
| Vertical_inches |                                                                  |
| Aspect          | Direction of the slope where the avalanche happened              |
| Elevation_feet  |                                                                  |
| Coordinates     | Approximate location of the avalanche                            |
| Comments 1      | Comments added by the reporter                                   |

In [None]:
df = pd.read_csv('./avalanches_cleaned.csv')
df.head()

## Task 1: Cleaning the dataframe

In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.

### Task 1a: Removing columns not required for the analysis and fixing column names

When we print the **df** dataframe we see that we have a column called **Comments 1**. It has text comments made for each recorded instance. For our current analysis we are not going to use this column. We will drop this column from the dataframe.

We also have four columns: **;Aspect**, **;Region**, **;Trigger**, **;Weak_Layer**, with a `;` character in the column name. We should rename the columns to **Aspect**, **Region**, **Trigger**, **Weak Layer** respectively by removing the leading `;` character.

<font color="steelblue">PersIst extension gives you an interactive data table. You can perform rename & drop column operations directly in the header.</font>

<font color="steelblue">Once you have done the operations, you should create a new dataframe called **cols_fixed_df**</font>

In [None]:
PR.vis.interactive_table(df)
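For reference, the drop and rename operations described above can also be sketched in plain pandas. This is a minimal sketch on a toy frame, not a description of how the extension works internally; the row values are made up, and `cols_fixed_sketch` is a hypothetical name. Note that stripping only the leading `;` leaves `Weak_Layer` with its underscore.

```python
import pandas as pd

# Toy frame standing in for df; the row values are made up for illustration.
toy_df = pd.DataFrame({
    ";Region": ["Salt Lake"],
    ";Trigger": ["Skier"],
    ";Weak_Layer": ["Facets"],
    ";Aspect": ["North"],
    "Comments 1": ["example comment"],
})

# Drop the comments column, then strip the leading ';' from every column name.
cols_fixed_sketch = (
    toy_df
    .drop(columns=["Comments 1"])
    .rename(columns=lambda name: name.lstrip(";"))
)

print(list(cols_fixed_sketch.columns))
```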

## Task 2

### Task 2a: Removing outliers in the records for Elevation and Vertical

Below we have an interactive scatterplot with `Elevation` on the X-axis and `Vertical` on the Y-axis. We see that there are some outliers in the data, possibly from incorrect input when the avalanche instance was entered. We should remove these outliers before we proceed with the analysis.

<font color="steelblue">The scatterplot below is an interactive altair plot. You can select the points in the plot using a rectangular brush. You can only have one brush active at a time.</font>

<font color="steelblue">The extension tries to help you complete the selections faster by suggesting algorithmic selections based on your initial brush. You can look at these suggestions in the **Intent** tab. Hovering over any of the suggested selections highlights the points that will be selected. If you feel one of the suggested options fits what you wanted to select, you can click the checkmark button to the right of the suggestion to apply the selection.</font>

<font color="steelblue">Once you are satisfied that you have selected the outliers, use the **Filter** button in the header to remove them. After filtering out all the outliers, you can use the generate dataframe button in the header to create a new python variable. Create a new dataframe called **cleaned_df** and print it in the next cell.</font>

In [None]:
PR.vis.scatterplot(cols_fixed_df, "Elevation_feet:Q", "Vertical_inches:Q")
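As a programmatic point of comparison for the brush-based selection, outliers can also be filtered with a simple rule. The sketch below uses the common 1.5 × IQR rule, which is an assumption here and not necessarily what the extension's suggestions compute. The helper `drop_iqr_outliers`, the toy values, and `cleaned_sketch` are all hypothetical.

```python
import pandas as pd

def drop_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column.

    Rows with missing values in any of the columns are dropped as well,
    because Series.between returns False for NaN.
    """
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]

# Toy data: the last elevation looks like a data-entry slip (made-up values).
toy_df = pd.DataFrame({
    "Elevation_feet": [9500, 9800, 10100, 9900, 10000, 97000],
    "Vertical_inches": [1200, 1500, 1800, 1300, 1600, 1400],
})

cleaned_sketch = drop_iqr_outliers(toy_df, ["Elevation_feet", "Vertical_inches"])
print(len(toy_df), "->", len(cleaned_sketch))
```

A rule like this is blunt compared with inspecting the plot: it may flag legitimate extreme avalanches, so the threshold `k` is worth checking against what the scatterplot shows.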

### Task 2b: Filter out old data

Below we have an interactive barchart with data aggregated by year. We see the `Year` on the X-axis and the `number of records for the year` on the Y-axis. For the two years before 2010, we have very few records. We will remove these records from our dataset.

<font color="steelblue">The barchart below is an interactive altair plot. You can select the bars in the plot using a rectangular brush along Y-axis. You can only have one brush active at a time.</font>

<font color="steelblue">Once you are satisfied that you have selected the years to be filtered, use the **Filter** button in the header to either keep the years or remove them depending on your selection. After filtering the years as instructed, you can use the generate dataframe button in the header to create a new python variable. Create a new dataframe called **post_2010_df** and print it in the next cell.</font>

In [None]:
PR.vis.barchart(cleaned_df, "utcyear(Date):O", "count()")
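The year filter can also be expressed directly in pandas. The sketch below is a minimal example on made-up dates; it assumes the `Date` column can be parsed by `pd.to_datetime` and that records from 2010 onward are kept (the cutoff is inferred from the name **post_2010_df**). `CUTOFF_YEAR` and `post_cutoff_sketch` are hypothetical names.

```python
import pandas as pd

# Toy dates standing in for the Date column (made-up values).
toy_df = pd.DataFrame({
    "Date": ["2008-01-15", "2009-12-02", "2010-02-20", "2015-03-11"],
})

# Assumption: keep records from this year onward; adjust after reading the bar chart.
CUTOFF_YEAR = 2010

# Parse the dates and extract the calendar year of each record.
years = pd.to_datetime(toy_df["Date"]).dt.year

# Records per year, the same aggregation the bar chart displays.
print(years.value_counts().sort_index())

# Boolean mask keeps only rows at or after the cutoff year.
post_cutoff_sketch = toy_df[years >= CUTOFF_YEAR]
print(post_cutoff_sketch)
```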

## Task 3

We will use the new dataset of avalanches from 2010 onward to further analyse the relationship between phases of the avalanche season and other variables.

### Task 3: Categorize data in phases

Our data is missing the phases of the season! We have to add a new column to the dataset called `Avalanche Season Phase`. The new column will have three values: `Start`, `Middle`, `End`. You have to categorize each record into one of these values depending on the month. Refer to the following mapping for assignment:
- **Dec - Feb** -> `Start`
- **Mar - May** -> `Middle`
- **June - Nov** -> `End`
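The month-to-phase mapping above can be written as a small function. This is a minimal pandas sketch on made-up dates, assuming `Date` is already a datetime column; the function name `season_phase` is hypothetical.

```python
import pandas as pd

def season_phase(month):
    """Map a month number (1-12) to a season phase using the mapping above."""
    if month in (12, 1, 2):
        return "Start"
    if month in (3, 4, 5):
        return "Middle"
    return "End"  # June through November

# Toy dates standing in for the Date column (made-up values).
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2012-12-25", "2013-02-03", "2013-04-10", "2013-11-01"]),
})

# Extract the month number and map each record to its phase.
toy_df["Avalanche Season Phase"] = toy_df["Date"].dt.month.map(season_phase)
print(toy_df)
```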

<font color="steelblue">First, we have to create a new category and add options to it. You can click the add category button in the header to open the category popup. Make a new category called **Avalanche Season Phase**. Then add three new options to the category: **Start**, **Middle**, **End**.</font>

<font color="steelblue">The barchart below is an interactive altair plot. You can select the bars in the plot by clicking on them. Press `shift` while clicking to select multiple bars. Clicking on empty area in the chart, clears the selection.</font>

<font color="steelblue">When you have selected the months that should belong to the same category, you can use the categorize button in the header to assign the proper category to your selection.</font>

<font color="steelblue">When you are done with all the categorization, create a new dataframe called **season_phase_df** and print the grouped dataframe (grouped by the new column) in the next cell.</font>

In [None]:
PR.vis.barchart(post_2010_df, alt.X("utcmonth(Date):O").sort(["Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov"]), "count()")
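Printing the grouped dataframe could look like the sketch below. It uses a toy frame with made-up values in place of **season_phase_df** and assumes the new column is named `Avalanche Season Phase`; `phase_counts` is a hypothetical name.

```python
import pandas as pd

# Toy frame standing in for season_phase_df (made-up values).
toy_df = pd.DataFrame({
    "Avalanche Season Phase": ["Start", "Start", "Middle", "End"],
    "Depth_inches": [24, 30, 18, 12],
})

# Number of records in each phase.
phase_counts = toy_df.groupby("Avalanche Season Phase").size()
print(phase_counts)
```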