# Data Analysis

In this notebook you will work with the data for [avalanches in Utah](https://utahavalanchecenter.org/avalanches). The data was sourced from the [Utah Avalanche Center](https://utahavalanchecenter.org/).

We will ask you to perform three basic data analysis tasks. Please follow the instructions and note your findings and/or any issues you face in markdown.

<font color="red">In this notebook, you will be using pandas to perform the tasks. Pandas is already installed and enabled in this notebook environment. You are free to use any library like _matplotlib_, _seaborn_, _altair_, etc. to create your visualizations. These libraries are installed as well. You can use the internet to get help with Python syntax or with how to use a particular library.</font>

You can add new code and markdown cells as required to complete the tasks.

In [None]:
import seaborn as sns

import pandas as pd # Load pandas for data processing

## Data Description

The table below describes the different columns in the dataset. Each row in the dataset is a reported avalanche with its location, trigger, and the aspect of the slope. The data spans multiple years, from 2004 up to 2023.

| Column          | Description                                                      |
|-----------------|------------------------------------------------------------------|
| Date            | This is the date on which the instance of avalanche was recorded |
| Region          | Region in Utah where the instance was recorded                   |
| Place           | Exact place where the instance was recorded                      |
| Trigger         | The cause of the avalanche                                       |
| Weak Layer      | Layer of the snow that was weakest and likely the one to fail    |
| Depth_inches    |                                                                  |
| Width_inches    |                                                                  |
| Vertical_inches |                                                                  |
| Aspect          | Direction of the slope where the avalanche happened              |
| Elevation_feet  |                                                                  |
| Coordinates     | Approximate location of the avalanche                            |
| Comments 1      | Comments added by the reporter                                   |

In [None]:
df = pd.read_csv('./avalanches_cleaned.csv')
df

## Task 1: Cleaning the dataframe

In the first task we will perform some basic data cleaning operations to get our dataset ready for further tasks.

### Task 1a: Removing columns not required for the analysis; and fixing column names.

When we print the **df** dataframe we see that we have a column called **Comments 1**. It has text comments made for each recorded instance. For our current analysis we are not going to use this column. We will drop this column from the dataframe.

We also have four columns: **;Aspect**, **;Region**, **;Trigger**, **;Weak_Layer**, with a `;` character in the column name. We should rename the columns to **Aspect**, **Region**, **Trigger**, **Weak Layer** respectively by removing the leading `;` character.

<font color="red">Write code to rename the four columns and drop the one column as specified. Assign the result to a new dataframe called **cols_fixed_df**.</font>

In [None]:
df.head()
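
As a rough sketch of the rename-and-drop step, the snippet below applies it to a small toy dataframe rather than the real file. The toy values are made up for illustration, and the target name `Weak Layer` (with a space) follows the task text above; adjust it if the real column should keep its underscore.

```python
import pandas as pd

# Toy stand-in for the raw dataframe; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Date": ["2012-01-05", "2015-03-09"],
    ";Aspect": ["North", "East"],
    ";Region": ["Salt Lake", "Logan"],
    ";Trigger": ["Skier", "Natural"],
    ";Weak_Layer": ["Facets", "New Snow"],
    "Comments 1": ["example comment", "another comment"],
})

# Map each ';'-prefixed name to the target name given in the task text.
rename_map = {
    ";Aspect": "Aspect",
    ";Region": "Region",
    ";Trigger": "Trigger",
    ";Weak_Layer": "Weak Layer",
}

# Drop the unused comments column, then rename.
# Both calls return new dataframes, so toy_df itself is left unchanged.
cols_fixed_df = toy_df.drop(columns=["Comments 1"]).rename(columns=rename_map)

print(list(cols_fixed_df.columns))
```

Using an explicit mapping rather than stripping `;` from every column name keeps the change limited to the four columns named in the task.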

## Task 2

### Task 2a: Removing outliers in the records for Elevation and Depth

Below we have an interactive scatterplot with `Elevation` on the X-axis and `Vertical` on the Y-axis. We see that there are some outliers in the data, possibly caused by incorrect input when the avalanche instance was entered. We should remove these outliers before we proceed with the analysis.

<font color="red">The scatterplot below is a scatterplot in seaborn.</font>

<font color="red">Remove the outliers you see in the plots, and print the final plot.</font>

<font color="red">Assign the cleaned dataframe to a variable called **cleaned_df**</font>

In [None]:
plot = sns.scatterplot(data=df, x="Elevation_feet", y="Vertical_inches")

plot
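
One possible way to drop the outliers is a boolean mask with cutoffs read off the scatterplot. The sketch below is only an illustration: `drop_outliers` is a hypothetical helper, and the cutoff values and toy data are made up, so the real thresholds have to come from inspecting the actual plot.

```python
import pandas as pd

def drop_outliers(frame, max_elevation_feet, max_vertical_inches):
    """Keep rows whose elevation and vertical values are at or below the cutoffs.

    Rows with a missing value in either column are also dropped, because a
    comparison against NaN evaluates to False.
    """
    keep = (
        (frame["Elevation_feet"] <= max_elevation_feet)
        & (frame["Vertical_inches"] <= max_vertical_inches)
    )
    return frame[keep].copy()

# Toy data, made up for illustration; the last row mimics a data-entry error.
toy_df = pd.DataFrame({
    "Elevation_feet": [9500, 10200, 95000],
    "Vertical_inches": [1200, 2400, 600],
})

# Hypothetical cutoffs; choose the real ones by looking at the scatterplot.
cleaned_df = drop_outliers(toy_df, max_elevation_feet=15000, max_vertical_inches=100000)

print(cleaned_df)
```

If rows with missing measurements should be kept, the mask would need an extra `isna()` condition for each column.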

### Task 2b: Filter out old data

Below we have a barchart with the data aggregated by year. We see the `Year` on the X-axis and the `number of records for the year` on the Y-axis. For the two years before 2010, we have very few records. We will remove these records from our dataset.

<font color="red">Filter out the data points as instructed using pandas.</font>

<font color="red">Assign the new dataframe to a variable called **post_2010_df** and plot it in the same way as the given barchart.</font>

In [None]:
sns.histplot(x=pd.to_datetime(df["Date"]))
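
A minimal sketch of the year filter, shown on a toy dataframe with made-up dates. It assumes that records from 2010 itself are kept and only earlier years are removed; change the comparison if 2010 should be excluded as well.

```python
import pandas as pd

# Toy data, made up for illustration.
toy_df = pd.DataFrame({
    "Date": ["2004-12-01", "2009-02-14", "2010-01-03", "2015-03-20"],
})

# Parse the date strings once and extract the calendar year of each record.
years = pd.to_datetime(toy_df["Date"]).dt.year

# Keep only the records from 2010 onwards (assumption: 2010 is included).
post_2010_df = toy_df[years >= 2010].copy()

print(post_2010_df)
```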

## Task 3

We will use the new dataset for avalanches post 2010 to further analyse the relationship between phases of the avalanche season and other variables.

### Task 3: Categorize data in phases

Our data is missing the phase of the avalanche season! We have to add a new column to the dataset called `Avalanche Season Phase`. The new column will have three values: `Start`, `Middle`, `End`. You have to assign each record to one of these categories depending on its month. Use the following mapping:
- **Dec - Feb** -> `Start`
- **Mar - May** -> `Middle`
- **June - Nov** -> `End`

<font color="red">The barchart below is a barchart in seaborn.</font>

<font color="red">Assign categories to different subsets of the data using pandas.</font>

<font color="red">Assign the new dataframe to a variable called **season_phase_df** and plot it in the same way as the earlier barchart.</font>

In [None]:
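import calendar

dates = pd.to_datetime(df["Date"])
# Count the avalanche records per calendar month (1 = January, ..., 12 = December)
plot = sns.histplot(x=dates.dt.month, discrete=True)
# Label the twelve month positions with abbreviated month names
plot.set_xticks(ticks=range(1, 13), labels=list(calendar.month_abbr)[1:])
plot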
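
The month-to-phase mapping listed above could be sketched as follows. `season_phase` is a hypothetical helper name and the toy dates are made up for illustration; on the real data the same steps would be applied to the post-2010 dataframe.

```python
import pandas as pd

def season_phase(month):
    """Map a month number (1-12) to a phase of the avalanche season."""
    if month in (12, 1, 2):
        return "Start"   # Dec - Feb
    if month in (3, 4, 5):
        return "Middle"  # Mar - May
    return "End"         # June - Nov

# Toy data, made up for illustration.
toy_df = pd.DataFrame({
    "Date": ["2012-12-15", "2013-04-02", "2013-10-30", "2014-01-08"],
})

# Extract the month number of each record, then map it to its phase.
months = pd.to_datetime(toy_df["Date"]).dt.month
season_phase_df = toy_df.assign(**{"Avalanche Season Phase": months.map(season_phase)})

print(season_phase_df)
```

The `**{...}` form is used with `assign` only because the column name contains spaces; plain item assignment on a copy of the dataframe would work just as well.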