<a href="https://colab.research.google.com/github/subornaa/Data-Analytics-Tutorials/blob/main/Data_Wrangling_Tutorials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">

# Data Wrangling in Python
</div>

![prf_image](https://opendata.nfis.org/mapserver/PRF_Layout.jpg)

This notebook provides Python code to analyze the Petawawa Research Forest (PRF) inventory and ancillary datasets.

All data is sourced from the following website:

https://opendata.nfis.org/mapserver/PRF.html

Please consult this page to get the most up-to-date version of the PRF data.

# Introduction and Dataset Background

This tutorial focuses on summarizing tabular data using Python. The concepts covered are common in the data pre-processing phase of an analysis workflow, but are useful in many other contexts.

The tutorial makes use of the Petawawa Research Forest (PRF) data, which is described in more detail on the tutorial series [GitHub site](https://github.com/subornaa/Data-Analytics-Tutorials).

## Tutorial goal

The goal of this tutorial is to use individual tree data to answer questions relating to the PRF. We use the [`pandas`](https://pandas.pydata.org/docs/getting_started/index.html) Python package to apply common data wrangling concepts such as dealing with null (NA) values, subsetting data, and summarizing data.
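The three concepts named above can be tried on a small made-up table before working with the PRF data. The example below is a minimal sketch: `toy_df` and its values are hypothetical and only borrow column names from the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the PRF dataset
toy_df = pd.DataFrame({
    "PlotName": ["A", "A", "B", "B"],
    "Status": ["L", "D", "L", "L"],
    "DBH": [30.0, 12.5, np.nan, 45.0],
})

# 1. Null values: count missing entries in each column
na_counts = toy_df.isna().sum()

# 2. Subsetting: keep only live trees that have a recorded DBH
live_df = toy_df[(toy_df["Status"] == "L") & toy_df["DBH"].notna()]

# 3. Summarizing: mean DBH of the live trees in each plot
mean_dbh = live_df.groupby("PlotName")["DBH"].mean()

print(na_counts)
print(mean_dbh)
```

The same three steps (count NAs, filter rows with a boolean mask, group and aggregate) apply to a larger table in the same way.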

## Data Dictionary

This tutorial uses individual tree measurements taken at permanent sample plots (PSPs) across the PRF. The data dictionary below summarizes `trees.csv`. In this data, each row is a tree and each column is an attribute (e.g., height).

| **Column**      | **Definition**                                                                 |
|------------------|-------------------------------------------------------------------------------|
| PlotName         | Plot name                                                                    |
| TreeID           | Tree ID                                                                      |
| TreeSpec         | Tree species                                                                 |
| Origin           | Origin. N = natural (includes coppice), P = planted                          |
| Status           | Status. L = Live, D = Dead (only includes decayclass 1 & 2)                  |
| DBH              | Diameter at breast height (cm)                                               |
| CrownClass       | Crown class                                                                  |
| QualityClass     | Quality class                                                                |
| DecayClass       | Decay class                                                                  |
| Ht               | Height (m), includes estimated heights                                       |
| HLF              | HLF                                                                          |
| HtFlag           | HtFlag                                                                       |
| baha             | Basal area per hectare (m²/ha) = DBH * DBH * 0.00007854 * stems              |
| ht_meas          | Height (m), if measured in the field                                         |
| stems            | Stems per hectare (number of trees/ha each tree represents)                  |
| mvol             | Gross merchantable volume (m³/ha)                                            |
| tvol             | Gross total volume (m³/ha)                                                  |
| biomass          | Aboveground biomass (kg/ha)                                                 |
| size             | Sawlog size                                                                  |
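
The `baha` formula in the table can be checked by hand. The sketch below assumes the constant 0.00007854 is π/40000 rounded (π/4 for the area of a circle from its diameter, and 1/10000 to convert cm² to m²). The helper name `basal_area_per_ha` is hypothetical and is not part of the PRF data or any library.

```python
import math

def basal_area_per_ha(dbh_cm, stems_per_ha):
    """Basal area (m²/ha) from DBH in cm, using the data dictionary formula."""
    return dbh_cm * dbh_cm * 0.00007854 * stems_per_ha

# Assumption: the constant is pi / 40000, rounded to four significant digits
constant = math.pi / 40000

# Example inputs: a tree with DBH = 67.5 cm that represents 16 stems per hectare
ba = basal_area_per_ha(67.5, 16)

print(constant)
print(ba)
```

Because `stems` scales each tree up to a per-hectare value, summing `baha` over the trees in a plot gives a per-hectare basal area for that plot.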

## References

White, Joanne C., et al. "The Petawawa Research Forest: Establishment of a remote sensing supersite." The Forestry Chronicle 95.3 (2019): 149-156.



# Install and load required packages

In [37]:
import os
import shutil
import pandas as pd
import geopandas as gpd

# Download data

In [38]:
# Download the data if it does not yet exist
if not os.path.exists("data"):
  !gdown 1UDKAdXW0h6JSf7k31PZ-srrQ3487l9e2
  !unzip prf_data.zip -d data/
  os.remove("prf_data.zip")
else:
  print("Data has already been downloaded.")

!ls data/

Data has already been downloaded.
plots.gpkg  trees.csv
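
The `!` shell commands above only run inside a notebook. In a plain Python script, the unzip step could be written with the standard library, as sketched below. The helper `extract_if_missing` and the demo file names are hypothetical, and downloading the archive itself is not covered here.

```python
import os
import tempfile
import zipfile

def extract_if_missing(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already exists.

    Returns True if extraction happened, False if it was skipped.
    """
    if os.path.exists(target_dir):
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return True

# Demonstration on a throwaway archive (hypothetical contents)
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("trees.csv", "PlotName,DBH\nA,30.0\n")

demo_target = os.path.join(workdir, "data")
first_run = extract_if_missing(demo_zip, demo_target)   # extracts
second_run = extract_if_missing(demo_zip, demo_target)  # skipped, folder exists
extracted_files = sorted(os.listdir(demo_target))
```

Like the notebook cell, the helper treats an existing target folder as a sign that the data is already in place, so rerunning it does no extra work.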


# Data Wrangling

In [39]:
trees_df = pd.read_csv("data/trees.csv")
trees_df.head()


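| Unnamed: 0 | tree_spec | PlotName | TreeID | TreeSpec | Origin | Status | DBH | CrownClass | QualityClass | DecayClass | ... | BA_all | TPH_all | codom | domht | ht_meas | stems | mvol | tvol | biomass | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PRF001 | 24.0 | 1 | P | D | 10.1 | NaN | NaN | 1.0 | ... | 33.601655 | 2688 | N | 12.223077 | NaN | 16 | 0.0 | 0.708735 | 393.3964 | Poles |
| 1 | 1 | PRF001 | 46.0 | 1 | P | D | 9.9 | NaN | NaN | 2.0 | ... | 33.601655 | 2688 | N | 12.223077 | NaN | 16 | 0.0 | 0.673254 | 375.305379 | Poles |
| 2 | 2 | PRF001 | 20.0 | 2 | N | L | 67.5 | D | A | NaN | ... | 33.601655 | 2688 | Y | 33.433333 | 33.9 | 16 | 77.327438 | 79.482658 | 39691.63995 | Large |
| 3 | 2 | PRF001 | 50.0 | 2 | N | L | 57.9 | D | U | NaN | ... | 33.601655 | 2688 | Y | 33.433333 | NaN | 16 | 56.444281 | 58.117292 | 28251.255888 | Large |
| 4 | 1 | PRF001 | 10.0 | 1 | N | L | 55.9 | D | A | NaN | ... | 33.601655 | 2688 | Y | 33.433333 | 33.0 | 16 | 48.008649 | 49.833743 | 24501.838779 | Large |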
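
The `...` in the preview means pandas has hidden some of the middle columns to keep the display narrow. A minimal sketch of two ways to see every column is shown below, using a hypothetical wide table (`wide_df`, with made-up column names) rather than the PRF data.

```python
import pandas as pd

# Hypothetical wide frame: 30 columns named col_0 ... col_29
wide_df = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

# Option 1: list every column name, regardless of display truncation
column_names = wide_df.columns.tolist()

# Option 2: lift the column limit for this block only,
# leaving the global display settings unchanged
with pd.option_context("display.max_columns", None):
    full_preview = repr(wide_df.head())

print(column_names)
print(full_preview)
```

For the tutorial data, `trees_df.columns.tolist()` would list all column names in the same way.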


## How many trees are in the dataset? How many permanent sample plots (PSPs)?

In [40]:
# Add your code here

## Which column(s) in the trees data frame contain NAs? How many rows contain NAs?

In [41]:
# Add your code here

# Solutions

In [42]:
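# How many trees are in the dataset? How many permanent sample plots (PSPs)?

# Each row is one tree, so the row count is the tree count.
# shape[0] and len() are two equivalent ways to get it.
print(trees_df.shape[0])

print(len(trees_df))

# The PSP count is the number of distinct PlotName values,
# e.g. trees_df["PlotName"].nunique()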

12590
12590
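
The output above gives the number of trees. The number of PSPs can be found by counting distinct plot names. The sketch below shows the idea on a hypothetical table (`toy_df`, with made-up plot names and values), so its counts are not the PRF answer.

```python
import pandas as pd

# Hypothetical toy data: five trees spread over three plots
toy_df = pd.DataFrame({
    "PlotName": ["P1", "P1", "P2", "P3", "P3"],
    "DBH": [10.1, 9.9, 67.5, 57.9, 55.9],
})

# Number of trees: one row per tree
n_trees = len(toy_df)

# Number of plots: count the distinct values in PlotName
n_plots = toy_df["PlotName"].nunique()

# Trees recorded in each plot
trees_per_plot = toy_df["PlotName"].value_counts()

print(n_trees, n_plots)
print(trees_per_plot)
```

Applied to the tutorial data, the equivalent call would be `trees_df["PlotName"].nunique()`.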


In [43]:
# Which column(s) in the trees data frame contain NAs? How many rows contain NAs?

# Count the NA values in each column
col_na_counts = trees_df.isnull().sum()

print("Columns that contain NAs:\n\n", col_na_counts[col_na_counts > 0])

# dropna() removes every row that has an NA in any column
trees_no_nas = trees_df.dropna()

print("\n\nNumber of rows with NAs:\n\n", len(trees_df) - len(trees_no_nas))

print(f"\n\nThere are {len(trees_no_nas)} rows in the DF which contain no NA values.")

Columns that contain NAs:

 TreeID           1470
CrownClass       1964
QualityClass     1965
DecayClass      12093
HLF             12578
HtFlag          12412
Int_ht           2846
ht_meas          8063
biomass           248
dtype: int64


Number of rows with NAs:

 12590


There are 0 rows in the DF which contain no NA values.
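
Every row contains at least one NA here because some columns (for example `HLF`) are missing for nearly all trees, so a plain `dropna()` removes everything. A minimal sketch of two alternatives follows: counting NA rows directly, and dropping rows based only on the columns an analysis needs. The table `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one column that is always missing
toy_df = pd.DataFrame({
    "DBH": [10.1, 67.5, 57.9],
    "ht_meas": [np.nan, 33.9, np.nan],
    "HLF": [np.nan, np.nan, np.nan],
})

# Count rows with at least one NA without building a second data frame
rows_with_na = toy_df.isna().any(axis=1).sum()

# Dropping on all columns removes every row
complete_all = toy_df.dropna()

# Dropping on selected columns keeps rows that are complete where it matters
complete_subset = toy_df.dropna(subset=["DBH", "ht_meas"])

print(rows_with_na, len(complete_all), len(complete_subset))
```

Which columns to pass to `subset` depends on the question being asked; the two used here are only an illustration.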


In [44]:
trees_df.isnull().sum()

Unnamed: 0,0
tree_spec,0
PlotName,0
TreeID,1470
TreeSpec,0
Origin,0
Status,0
DBH,0
CrownClass,1964
QualityClass,1965
DecayClass,12093
