<div align="center">

# Data Wrangling in Python
</div>

![prf_image](https://opendata.nfis.org/mapserver/PRF_Layout.jpg)

This notebook provides Python code to analyze the Petawawa Research Forest (PRF) inventory and ancillary datasets.

All data is sourced from the following website:

https://opendata.nfis.org/mapserver/PRF.html

Please consult this page to get the most up-to-date version of the PRF data.

# Introduction and Dataset Background

This tutorial focuses on summarizing tabular data using Python. The concepts covered are common in the data pre-processing phase of an analysis workflow, but are useful in many different contexts.

The tutorial makes use of the Petawawa Research Forest (PRF) data, which is described in more detail on the tutorial series [GitHub site](https://github.com/subornaa/Data-Analytics-Tutorials).

## Tutorial goal

The goal of this tutorial is to use individual tree data to answer questions relating to the PRF. We use the [`pandas`](https://pandas.pydata.org/docs/getting_started/index.html) Python package to apply common data wrangling concepts such as dealing with null (NA) values, subsetting data, and summarizing data.

## Data Dictionary

This tutorial makes use of individual tree measurements taken at permanent sample plots (PSPs) across the PRF. The data dictionary below summarizes the columns of `trees.csv`. In this dataset, each row is a tree and each column is an attribute (e.g., height).

| **Column**      | **Definition**                                                                 |
|------------------|-------------------------------------------------------------------------------|
| PlotName         | Plot name                                                                    |
| TreeID           | Tree ID                                                                      |
| TreeSpec         | Tree species                                                                 |
| Origin           | Origin. N = natural (includes coppice), P = planted                          |
| Status           | Status. L = Live, D = Dead (only includes decayclass 1 & 2)                  |
| DBH              | Dbh (cm)                                                                     |
| CrownClass       | Crown class                                                                  |
| QualityClass     | Quality class                                                                |
| DecayClass       | Decay class                                                                  |
| Ht               | Height (m), includes estimated heights                                       |
| HLF              | HLF                                                                          |
| HtFlag           | HtFlag                                                                       |
| baha             | Basal area/ha = Dbh * Dbh * 0.00007854 * stems                               |
| ht_meas          | Height (m), if measured in the field                                         |
| stems            | Stems per hectare (number of trees/ha each tree represents)                  |
| mvol             | Gross merchantable volume (m³/ha)                                            |
| tvol             | Gross total volume (m³/ha)                                                  |
| biomass          | Aboveground biomass (kg/ha)                                                 |
| size             | Sawlog size                                                                  |
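
The `baha` formula in the table can be checked by hand. The constant 0.00007854 is approximately π / 40000, which converts a squared diameter in centimetres into a circular cross-sectional area in square metres, so the result is in m² per hectare. The sketch below is illustrative only: the function name and the example tree are hypothetical and are not taken from the dataset.

```python
import math

# pi / 40000 converts DBH^2 (cm^2) into a circular area in m^2.
# It is approximately 0.00007854, the constant used in the data dictionary.
CM2_TO_M2_CIRCLE = math.pi / 40000


def basal_area_per_ha(dbh_cm: float, stems_per_ha: float) -> float:
    """Basal area per hectare (m^2/ha) represented by one tree record."""
    return dbh_cm * dbh_cm * CM2_TO_M2_CIRCLE * stems_per_ha


# Hypothetical tree: 20 cm DBH, representing 16 stems per hectare
example = basal_area_per_ha(20.0, 16)
print(round(example, 4))
```

Because `stems` is the number of trees per hectare that each record represents, multiplying by it scales a single stem's area up to a per-hectare value.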

## References

White, Joanne C., et al. "The Petawawa Research Forest: Establishment of a remote sensing supersite." The Forestry Chronicle 95.3 (2019): 149-156.



## Install and load required packages

In [16]:
!pip install pandas==2.2.3
!pip install geopandas==1.0.1



In [5]:
import os
import shutil
import pandas as pd
import geopandas as gpd

# Download data

**The following block of code downloads the tree dataset into the `data` folder. It first checks whether a `data` folder already exists in your path. If it does not, the code downloads a zip file from Google Drive using `gdown` and unzips it into the `data` folder. Please note that if you're running the notebook locally, the recommended approach is to manually download the dataset, store it on your local drive, and link it to this notebook accordingly.**





In [6]:
# Download the data if it does not yet exist
if not os.path.exists("data"):
  !gdown 1UDKAdXW0h6JSf7k31PZ-srrQ3487l9e2
  !unzip prf_data.zip -d data/
  os.remove("prf_data.zip")
else:
  print("Data has already been downloaded.")

!ls data/

Data has already been downloaded.
boundary.gpkg  forest_point_cloud.las  p99.tif	plots.gpkg  trees.csv
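
For local runs without notebook shell commands, the same check-then-unzip logic can be sketched in plain Python using only the standard library. This is a minimal sketch under assumptions: `ensure_data` is a hypothetical helper name, the archive is assumed to have been downloaded manually already, and the demonstration builds a throwaway zip file in a temporary directory rather than using the real PRF archive. Unlike the cell above, it does not delete the zip file afterwards.

```python
import os
import tempfile
import zipfile


def ensure_data(zip_path: str, data_dir: str = "data") -> bool:
    """Extract zip_path into data_dir unless data_dir already exists.

    Returns True if the archive was extracted, False if extraction was skipped.
    """
    if os.path.exists(data_dir):
        print("Data has already been downloaded.")
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    return True


# Demonstration with a throwaway archive (a stand-in for the real zip file)
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "prf_data.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("trees.csv", "PlotName,DBH\nPRF001,10.1\n")

    data_dir = os.path.join(tmp, "data")
    first_run = ensure_data(zip_path, data_dir)   # folder missing: extracts
    second_run = ensure_data(zip_path, data_dir)  # folder present: skips
    extracted = sorted(os.listdir(data_dir))
```

Checking for the folder rather than for individual files keeps the logic simple, but it also means a partially extracted folder would be treated as complete.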


## Load Data

To learn more about the dataset, let's load it and view the first 5 rows (the default) with the `head` function. Then, to look for anomalies and get a statistical summary of the dataset, we'll use the `info` and `describe` functions.





In [7]:
trees_df = pd.read_csv("data/trees.csv")
trees_df.head()


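| | tree_spec | PlotName | TreeID | TreeSpec | Origin | Status | DBH | CrownClass | QualityClass | DecayClass | ... | BA_all | TPH_all | codom | domht | ht_meas | stems | mvol | tvol | biomass | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PRF001 | 24.0 | 1 | P | D | 10.1 | NaN | NaN | 1.0 | ... | 33.601655 | 2688 | N | 12.223077 | NaN | 16 | 0.0 | 0.708735 | 393.3964 | Poles |
| 1 | 1 | PRF001 | 46.0 | 1 | P | D | 9.9 | NaN | NaN | 2.0 | ... | 33.601655 | 2688 | N | 12.223077 | NaN | 16 | 0.0 | 0.673254 | 375.305379 | Poles |
| 2 | 2 | PRF001 | 20.0 | 2 | N | L | 67.5 | D | A | NaN | ... | 33.601655 | 2688 | Y | 33.433333 | 33.9 | 16 | 77.327438 | 79.482658 | 39691.63995 | Large |
| 3 | 2 | PRF001 | 50.0 | 2 | N | L | 57.9 | D | U | NaN | ... | 33.601655 | 2688 | Y | 33.433333 | NaN | 16 | 56.444281 | 58.117292 | 28251.255888 | Large |
| 4 | 1 | PRF001 | 10.0 | 1 | N | L | 55.9 | D | A | NaN | ... | 33.601655 | 2688 | Y | 33.433333 | 33.0 | 16 | 48.008649 | 49.833743 | 24501.838779 | Large |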


In [36]:
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12590 entries, 0 to 12589
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   tree_spec     12590 non-null  int64  
 1   PlotName      12590 non-null  object 
 2   TreeID        11120 non-null  float64
 3   TreeSpec      12590 non-null  int64  
 4   Origin        12590 non-null  object 
 5   Status        12590 non-null  object 
 6   DBH           12590 non-null  float64
 7   CrownClass    10626 non-null  object 
 8   QualityClass  10625 non-null  object 
 9   DecayClass    497 non-null    float64
 10  Ht            12590 non-null  float64
 11  HLF           12 non-null     float64
 12  HtFlag        178 non-null    object 
 13  phf           12590 non-null  int64  
 14  baha          12590 non-null  float64
 15  CD_ht         12590 non-null  float64
 16  Int_ht        9744 non-null   float64
 17  BA_all        12590 non-null  float64
 18  TPH_all       12590 non-nu

In [35]:
trees_df.describe()

| | tree_spec | TreeID | TreeSpec | DBH | DecayClass | Ht | HLF | phf | baha | CD_ht | Int_ht | BA_all | TPH_all | domht | ht_meas | stems | mvol | tvol | biomass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12590.0 | 11120.0 | 12590.0 | 12590.0 | 497.0 | 12590.0 | 12.0 | 12590.0 | 12590.0 | 12590.0 | 9744.0 | 12590.0 | 12590.0 | 12590.0 | 4527.0 | 12590.0 | 12590.0 | 12590.0 | 12342.0 |
| mean | 21.317633 | 36.288759 | 21.354805 | 17.622415 | 1.45674 | 15.883219 | 21.408333 | 37.483717 | 0.602011 | 21.41404 | 14.03043 | 31.141931 | 2154.160445 | 17.608462 | 17.628429 | 37.483717 | 4.724783 | 5.711698 | 3286.085532 |
| std | 18.705983 | 29.183461 | 18.691398 | 11.559345 | 0.498627 | 6.713323 | 5.119208 | 59.090872 | 0.895101 | 6.532401 | 3.015557 | 11.356456 | 1719.08205 | 5.923878 | 8.522551 | 59.090872 | 11.80013 | 12.025668 | 6305.457552 |
| min | 1.0 | 1.0 | 1.0 | 2.5 | 1.0 | 1.301584 | 10.3 | 16.0 | 0.070686 | 5.65 | 7.075 | 0.453647 | 16.0 | 5.65 | 2.3 | 16.0 | 0.0 | 0.0 | 138.409692 |
| 25% | 2.0 | 15.0 | 2.0 | 10.6 | 1.0 | 11.531384 | 19.975 | 16.0 | 0.174975 | 16.970588 | 12.2 | 23.554825 | 1232.0 | 13.066667 | 11.4 | 16.0 | 0.0 | 0.939288 | 634.448847 |
| 50% | 20.0 | 30.0 | 20.0 | 14.5 | 1.0 | 14.907205 | 23.35 | 16.0 | 0.309749 | 21.2 | 13.94 | 30.202758 | 1864.0 | 16.5125 | 16.9 | 16.0 | 0.881678 | 1.970713 | 1266.913205 |
| 75% | 32.0 | 50.0 | 32.0 | 21.1 | 2.0 | 19.398785 | 24.825 | 16.0 | 0.623451 | 25.122222 | 15.833333 | 37.97556 | 2448.0 | 20.985714 | 23.0 | 16.0 | 3.687844 | 4.640412 | 2800.442589 |
| max | 74.0 | 300.0 | 74.0 | 97.5 | 2.0 | 50.3 | 26.0 | 200.0 | 11.945934 | 43.225 | 23.28 | 68.442282 | 15392.0 | 43.225 | 50.3 | 200.0 | 173.370393 | 177.874549 | 89100.864298 |


## How many trees are in the dataset? How many permanent sample plots (PSPs)?

In [8]:
# Add your code here

## Which column(s) in the trees data frame contain NAs? How many rows contain NAs?

In [9]:
# Add your code here

# Solutions

In [29]:
#@title Solution 1

# How many trees are in the dataset? How many permanent sample plots (PSPs)?

print(trees_df.shape[0])

print(len(trees_df))



12590
12590


<h4>💬 Explanation</h4>

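Each row represents one tree, so the number of rows is the number of trees. `trees_df.shape[0]` and `len(trees_df)` are two equivalent ways to get the total number of rows in a dataset. The number of PSPs is the number of distinct plot names, which can be obtained with `trees_df["PlotName"].nunique()`.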
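
Counting plots means counting distinct values in a column rather than rows. The sketch below shows both counts on a small toy data frame. The values in `toy_df` are made up for illustration, and it assumes that each PSP has a unique `PlotName`.

```python
import pandas as pd

# Toy stand-in for trees_df: five tree records across two plots (made-up values)
toy_df = pd.DataFrame({
    "PlotName": ["PRF001", "PRF001", "PRF001", "PRF002", "PRF002"],
    "DBH": [10.1, 9.9, 67.5, 14.5, 21.1],
})

n_trees = len(toy_df)                   # one row per tree
n_plots = toy_df["PlotName"].nunique()  # number of distinct plot names

print(f"Trees: {n_trees}, plots: {n_plots}")
```

By default, `nunique` ignores missing values. That is not an issue for `PlotName` here, since `info` reports no nulls in that column.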


In [30]:
#@title Solution 2

# Which column(s) in the trees data frame contain NAs? How many rows contain NAs?

col_na_counts = trees_df.isnull().sum()

print("Columns that contain NAs:\n\n", col_na_counts[col_na_counts > 0])

trees_no_nas = trees_df.dropna()

print("\n\nNumber of rows with NAs:\n\n", len(trees_df) - len(trees_no_nas))

print(f"\n\nThere are {len(trees_no_nas)} rows in the DF which contain no NA values.\n")

# Number of null values in every column (attribute), including columns with no NAs.
# This repeats the counts above without the filter, so it is optional.
print("Number of null values in each column: \n\n")
trees_df.isnull().sum()

Columns that contain NAs:

 TreeID           1470
CrownClass       1964
QualityClass     1965
DecayClass      12093
HLF             12578
HtFlag          12412
Int_ht           2846
ht_meas          8063
biomass           248
dtype: int64


Number of rows with NAs:

 12590


There are 0 rows in the DF which contain no NA values.

Number of null values in each column: 




| | 0 |
|---|---|
| tree_spec | 0 |
| PlotName | 0 |
| TreeID | 1470 |
| TreeSpec | 0 |
| Origin | 0 |
| Status | 0 |
| DBH | 0 |
| CrownClass | 1964 |
| QualityClass | 1965 |
| DecayClass | 12093 |


<h4>💬 Explanation</h4>

To check which columns contain NAs, we count the null values in each column with `isnull().sum()` and then keep only the columns whose count is greater than 0.

To count the number of rows with NAs, we drop every row that contains at least one NA value using the `dropna` function. The difference between the total number of rows and the number of rows remaining after the drop is the number of rows that contain NAs. In this dataset every one of the 12590 rows contains at least one NA, so no rows remain after the drop.
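
The same row count can also be computed directly, without building a second data frame, by flagging rows that contain at least one NA. The sketch below uses a toy data frame with made-up values; `toy_df` and the other variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for trees_df with a few missing values (made-up data)
toy_df = pd.DataFrame({
    "TreeID": [1.0, 2.0, np.nan, 4.0],
    "CrownClass": ["D", None, None, "C"],
    "DBH": [10.1, 9.9, 67.5, 57.9],
})

# Boolean Series: True where a row has at least one NA in any column
rows_with_na = toy_df.isna().any(axis=1)
n_rows_with_na = int(rows_with_na.sum())

# Cross-check against the drop-and-subtract approach
n_rows_via_dropna = len(toy_df) - len(toy_df.dropna())

print(n_rows_with_na, n_rows_via_dropna)
```

The boolean mask can also be used to inspect the affected rows, for example with `toy_df[rows_with_na]`.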

## What's the best technique to account for the missing data in this dataset? Use any appropriate technique

In [38]:
# Add your code here

In [39]:
#@title Solution 3
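
One possible approach, offered as a sketch rather than a definitive answer: because some columns (such as `HLF` and `DecayClass`) are almost entirely empty, dropping every row with any NA leaves nothing to analyze. An alternative is to check for NAs only in the columns a given analysis needs, using the `subset` argument of `dropna`. The toy data frame and the choice of `analysis_cols` below are assumptions for illustration. Which columns matter depends on the question being asked.

```python
import numpy as np
import pandas as pd

# Toy stand-in for trees_df (made-up values); HLF is mostly missing
toy_df = pd.DataFrame({
    "PlotName": ["PRF001", "PRF001", "PRF002", "PRF002"],
    "DBH": [10.1, 67.5, 14.5, 21.1],
    "biomass": [393.4, np.nan, 1266.9, 2800.4],
    "HLF": [np.nan, np.nan, np.nan, 23.35],
})

# Dropping rows with an NA in any column discards most of the data
all_complete = toy_df.dropna()

# Hypothetical analysis that only needs plot name, DBH, and biomass:
# check for NAs in those columns only
analysis_cols = ["PlotName", "DBH", "biomass"]
subset_complete = toy_df.dropna(subset=analysis_cols)

print(len(all_complete), len(subset_complete))
```

Whether to drop, fill, or keep missing values also depends on why they are missing. For example, a dead tree may legitimately have no crown class, in which case the NA carries information and should not be imputed.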

