## ENSF611 Project
**Authors: Tobin Eberle, Tom Wilson, Jeff Wheeler**

### Project Outline

Avalanche forecasting is the practice of observing current and historical snowpack conditions to make educated guesses about how reactive the snowpack is, and thus the likelihood of avalanches. It is a mixture of an art and a science where field observations and a forecaster's experience generally have a lot of weight in the final avalanche rating for a particular area. This project will investigate the classification of avalanche ratings for each of the three avalanche zones in Yoho National Park (Canada), to determine how accurately we can determine the likelihood of avalanches solely based on weather data.

For reference the avalanche danger ratings are as follows:
- No Rating: Summer, or no snow in the mountains.
- Spring Conditions: Very variable avalanche conditions based on the fluctuation of temperature throughout the day making it difficult to forecast.
- Low: Generally safe avalanche conditions.
- Moderate: Heightened avalanche conditions.
- Considerable: Dangerous avalanche conditions.
- High: Very dangerous avalanche conditions.
- Exterme: Extraordinarily dangerous avalanche conditions.

The avalanche zones are split into three seperate areas:
- Below Treeline (btl): Elevation of a mountain band that is covered by forest.
- Treeline (tln): Sparse forest cover and is the transition between uniform forest cover below it and the alpine above it.
- Alpine (alp): Wide expanses of open, exposed terrain with few or no trees.

It is important to note there is a  difference between the avalanche zones as they each obtain a seperate rating from eachother.

### Pre-Processing
Pre-process the data to remove NaN's and fill missing data.

In [43]:
#Import the dataset and inspect the contents
import pandas as pd
df = pd.read_csv('av_dataset_ensf611.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2140 entries, 0 to 2139
Data columns (total 29 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   date                     2140 non-null   object 
 1   btl_rating               2140 non-null   object 
 2   tln_rating               2140 non-null   object 
 3   alp_rating               2140 non-null   object 
 4   problem_1                880 non-null    object 
 5   problem_2                565 non-null    object 
 6   problem_3                157 non-null    object 
 7   chance_1                 880 non-null    object 
 8   chance_2                 565 non-null    object 
 9   chance_3                 157 non-null    object 
 10  longitude                2140 non-null   float64
 11  latitude                 2140 non-null   float64
 12  station_name             2140 non-null   object 
 13  climate_id               2140 non-null   object 
 14  year                    

The strategy to deal with missing values in the following columns is as follows:

```date```: Drop first date column as it contains redundant information.

```btl/tln/alp_rating```: Delete NaN rows.

```problem_1/2/3```: Replace NaN with 'noproblem'. Format column to ensure no '-' seperates the problem description.

```chance_1/2/3```: Drop these columns as they are determined by forecasters and will skew results.

```longitude/latitude/climate_id/data_quality/station_name```: Drop columns as not relevant to dataset.

```year/month/day```: No missing values

```max/min/mean_temp```: Forward fill (extreme of temps should be relatively simlar day-to-day)

```hot/cool_deg_days```: Forward fill. Otherwise could drop these columns as they are a derivation of temperature extremes.

```total_rain/snow```: Drop columns as all null.

```total_precip```: Fill Nan with zeros as can assume no precipitaion.    

```snow_on_ground```: Forward fill as the snow remains monstly constant for missing data points

```dir_of_max_gust```: Drop as wind direction isn't as important as wind speed and wind directions of winds under 27km/h aren't recorded.

```spd_of_max_gust```: Fill with zeroes, this is only reported if wind is greated than 31km/h. 


In [None]:
#Implement the data cleaning for each column as outlined above
import re
pattern = r'[\w]'

#Dropping unneeded columns
df = df.drop(columns= ['date', 'chance_1', 'chance_2', 'chance_3', 'longitude', 'latitude', 'climate_id', 'station_name', 'data_quality', 'total_rain_mm', 'total_snow_cm', 'dir_of_max_gust_10s_deg'])

#First 50 rows are missing data, drop them
df.drop(axis = 0, inplace= True, index= range(0, 50))

#Avalanche problems
df['problem_1'].fillna('norating', inplace= True)
df['problem_2'].fillna('norating', inplace= True)
df['problem_3'].fillna('norating', inplace= True)
df['problem_1'] = df['problem_1'].replace(pattern, '', regex= True)
df['problem_2'] = df['problem_2'].replace(pattern, '', regex= True)
df['problem_3'] = df['problem_3'].replace(pattern, '', regex= True)

#Temperatures
df['max_temp_c'].ffill(inplace= True)
df['min_temp_c'].ffill(inplace= True)
df['mean_temp_c'].ffill(inplace= True)
df['cool_deg_day_c'].ffill(inplace= True)
df['heat_deg_day_c'].ffill(inplace= True)

#Precipitation
df['total_precip_mm'].fillna(0, inplace= True)
df['snow_on_ground_cm'].ffill(inplace= True)

#Wind Speed
df['spd_of_max_gust_kmh'].fillna(0, inplace= True)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2090 entries, 50 to 2139
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   btl_rating           2090 non-null   object 
 1   tln_rating           2090 non-null   object 
 2   alp_rating           2090 non-null   object 
 3   problem_1            2090 non-null   object 
 4   problem_2            2090 non-null   object 
 5   problem_3            2090 non-null   object 
 6   year                 2090 non-null   int64  
 7   month                2090 non-null   int64  
 8   day                  2090 non-null   int64  
 9   max_temp_c           2090 non-null   float64
 10  min_temp_c           2090 non-null   float64
 11  mean_temp_c          2090 non-null   float64
 12  heat_deg_day_c       2090 non-null   float64
 13  cool_deg_day_c       2090 non-null   float64
 14  total_precip_mm      2090 non-null   float64
 15  snow_on_ground_cm    2090 non-null   

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['problem_1'].fillna('norating', inplace= True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['problem_2'].fillna('norating', inplace= True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting 

Unnamed: 0,btl_rating,tln_rating,alp_rating,problem_1,problem_2,problem_3,year,month,day,max_temp_c,min_temp_c,mean_temp_c,heat_deg_day_c,cool_deg_day_c,total_precip_mm,snow_on_ground_cm,spd_of_max_gust_kmh
50,norating,norating,norating,,,,2019,2,20,-2.8,-19.2,-11.0,29.0,0.0,0.5,105.0,0.0
51,norating,norating,norating,,,,2019,2,21,-4.2,-22.0,-13.1,31.1,0.0,0.2,104.0,0.0
52,norating,norating,norating,,,,2019,2,22,-7.1,-15.1,-11.1,29.1,0.0,0.2,104.0,38.0
53,norating,norating,norating,,,,2019,2,23,-7.7,-16.2,-11.9,29.9,0.0,1.4,105.0,31.0
54,norating,norating,norating,,,,2019,2,24,-10.0,-22.7,-16.4,34.4,0.0,0.6,105.0,34.0
