<a href="https://colab.research.google.com/github/subornaa/Data-Analytics-Tutorials/blob/main/Data_Wrangling_Tutorials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">

# Data Wrangling in Python
</div>

![prf_image](https://github.com/subornaa/Data-Analytics-Tutorials/blob/Apram_branch/images/datawrangling.png?raw=1)

This notebook provides Python code to analyze the Petawawa Research Forest (PRF) inventory and ancillary datasets.

All data is sourced from the following website:

https://opendata.nfis.org/mapserver/PRF.html

Please consult this page to get the most up-to-date version of the PRF data.

# Introduction and Dataset Background

This tutorial focuses on summarizing tabular data using Python. The concepts covered are common in the data pre-processing phase of an analysis workflow, but are useful in many other contexts.

The tutorial makes use of the Petawawa Research Forest (PRF) data, which is described in more detail on the tutorial series [GitHub site](https://github.com/subornaa/Data-Analytics-Tutorials).

## Tutorial goal

The goal of this tutorial is to use individual tree data to answer questions relating to the PRF. We use the [`pandas`](https://pandas.pydata.org/docs/getting_started/index.html) Python package to apply common data wrangling concepts such as dealing with null (NA) values, subsetting data, and summarizing data.

## Data Dictionary

This tutorial makes use of individual tree measurements taken at permanent sample plots (PSPs) across the PRF. The data dictionary below summarizes `trees.csv`. In this dataset, each row is a tree and each column is an attribute (e.g., height).

| **Column**      | **Definition**                                                                 |
|------------------|-------------------------------------------------------------------------------|
| PlotName         | Plot name                                                                    |
| TreeID           | Tree ID                                                                      |
| TreeSpec         | Tree species                                                                 |
| Origin           | Origin. N = natural (includes coppice), P = planted                          |
| Status           | Status. L = Live, D = Dead (only includes decayclass 1 & 2)                  |
| DBH              | Dbh (cm)                                                                     |
| CrownClass       | Crown class                                                                  |
| QualityClass     | Quality class                                                                |
| DecayClass       | Decay class                                                                  |
| Ht               | Height (m), includes estimated heights                                       |
| HLF              | HLF                                                                          |
| HtFlag           | HtFlag                                                                       |
| baha             | Basal area/ha = Dbh * Dbh * 0.00007854 * stems                               |
| ht_meas          | Height (m), if measured in the field                                         |
| stems            | Stems per hectare (number of trees/ha each tree represents)                  |
| mvol             | Gross merchantable volume (m³/ha)                                            |
| tvol             | Gross total volume (m³/ha)                                                  |
| biomass          | Aboveground biomass (kg/ha)                                                 |
| size             | Sawlog size                                                                  |


*White, Joanne C., et al. "The Petawawa Research Forest: Establishment of a remote sensing supersite." The Forestry Chronicle 95.3 (2019): 149-156.*
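The `baha` definition in the data dictionary can be checked by hand. The constant 0.00007854 is π/40000, which converts a DBH in centimetres into a stem cross-sectional area in square metres. The sketch below uses a hypothetical tree (30 cm DBH, representing 25 stems per hectare); the function name is an illustrative assumption and is not part of the dataset.

```python
import math

# Cross-sectional area of a stem in m^2 from DBH in cm:
#   area = pi * (DBH / 200)^2 = DBH^2 * pi / 40000
BA_CONSTANT = math.pi / 40000  # rounds to the 0.00007854 used in the data dictionary


def basal_area_per_ha(dbh_cm: float, stems_per_ha: float) -> float:
    """Basal area (m^2/ha) represented by one measured tree."""
    return dbh_cm * dbh_cm * BA_CONSTANT * stems_per_ha


# Hypothetical tree: 30 cm DBH, representing 25 stems per hectare
example_baha = basal_area_per_ha(30.0, 25.0)
print(f"Constant: {BA_CONSTANT:.8f}")
print(f"Basal area: {example_baha:.3f} m^2/ha")
```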



## Install and load required packages

In [None]:
!pip install -q pandas==2.2.2
!pip install -q geopandas==1.0.1
!pip install -q seaborn==0.12.2

In [None]:
import os
import shutil
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Download data

The following block of code downloads the tree dataset into the `data` folder. It first checks whether the `data` folder already exists in your working directory. If not, it downloads a zip file from Google Drive using `gdown` and unzips it into the `data` folder.
Please note that if you are running the notebook locally, the recommended approach is to download the dataset manually, store it on your local drive, and point the notebook to it.


In [None]:
# Download the data if it does not yet exist
if not os.path.exists("data"):
  !gdown 1UDKAdXW0h6JSf7k31PZ-srrQ3487l9e2
  !unzip prf_data.zip -d data/
  os.remove("prf_data.zip")
else:
  print("Data has already been downloaded.")

!ls data/

## Load Data

To learn more about the dataset, let's load it and use the `head` function, which displays the first 5 rows by default. Then, to explore data anomalies and get a statistical summary of the dataset, we'll use the `info` and `describe` functions.

In [None]:
trees_df = pd.read_csv("data/trees.csv")
trees_df.head()

In [None]:
trees_df.info()

In [None]:
trees_df.describe()

# Understanding dataset

Before conducting any analysis, it is crucial to develop a thorough understanding of the dataset. This involves exploring the structure, types of variables, and the context in which the data was collected. In forestry research, understanding the data helps ensure that ecological variables such as species, DBH, and biomass are interpreted correctly and that any patterns or anomalies are recognized early. This foundational step guides the selection of appropriate analytical methods and helps prevent misinterpretation of results.
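As a rough illustration of what "exploring the structure and types of variables" can look like in practice, the sketch below inspects a tiny made-up table. The rows are hypothetical stand-ins for `trees.csv`, so the printed values say nothing about the real PRF data.

```python
import pandas as pd

# Hypothetical stand-in for trees.csv (values are made up for illustration)
toy_df = pd.DataFrame({
    "PlotName": ["P1", "P1", "P2", "P3"],
    "species": ["White pine", "Red pine", "White pine", "Ironwood"],
    "DBH": [32.5, 28.1, 40.2, 9.8],
    "Status": ["L", "L", "D", "L"],
})

n_rows, n_cols = toy_df.shape                     # table dimensions
column_types = toy_df.dtypes                      # data type of each column
unique_counts = toy_df.nunique()                  # distinct values per column
status_counts = toy_df["Status"].value_counts()   # frequency of each category

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(unique_counts)
print(status_counts)
```

Counting distinct values per column is a quick way to tell identifiers and categories apart from continuous measurements before choosing a summary method.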

**Question 1 - How many trees are in the dataset? How many permanent sample plots (PSPs)? Fill in the code below.**

In [None]:
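# Number of trees (one row per tree)
print(trees_df....[0])

# Number of permanent sample plots (unique plot names)
print(trees_df['PlotName']....())

In [None]:
#@title Solution

# How many trees are in the dataset? How many permanent sample plots (PSPs)?

# Number of trees (one row per tree)
print(trees_df.shape[0])

# Number of permanent sample plots (unique plot names)
print(trees_df['PlotName'].nunique())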



**Question 2 - How many trees are there with height greater than or equal to 30m? Fill in the code below.**

In [None]:
# Select all rows having Height greater than or equal to 30. Retain all columns in output.
large_trees = trees_df.loc[...['height']>=...,:]
display(large_trees)
# How many trees are 30 m or taller?
print(f"\n\nThere are {len(...)} trees that are 30 m or taller.\n")

In [None]:
#@title Solution
# Select all rows having Height greater than or equal to 30. Retain all columns in output.
large_trees = trees_df.loc[trees_df['height']>=30,:]
display(large_trees)
# How many trees are 30 m or taller?
print(f"\n\nThere are {len(large_trees)} trees that are 30 m or taller.\n")

**Question 3 - How many trees of natural origin are taller than 30 m? Fill in the code below.**

In [None]:
natural_large_trees = trees_df.loc[(...['height']>...) & (trees_df['...']=='N'),:]
display(natural_large_trees)
# How many natural trees are larger than 30 m?
print(f"\n\nThere are {len(...)} natural trees larger than 30 m.\n")

In [None]:
#@title Solution
natural_large_trees = trees_df.loc[(trees_df['height']>30) & (trees_df['Origin']=='N'),:]
display(natural_large_trees)
# How many natural trees are larger than 30 m?
print(f"\n\nThere are {len(natural_large_trees)} natural trees larger than 30 m.\n")

**Question 4 - Which species has the highest mean merchantable volume, total volume, and biomass? Fill in the code below.**

In [None]:
# Summary statistics for mvol grouped by species
mvol_summary = trees_df.groupby('species')['...']....()
display(mvol_summary)

# Summary statistics for tvol grouped by species
tvol_summary = trees_df.groupby('...')['tvol']....()
display(tvol_summary)

# Summary statistics for biomass grouped by species
biomass_summary = trees_df.groupby('...')['...'].describe()
display(biomass_summary)

# species with the highest mean mvol
highest_mean_mvol = mvol_summary['...'].idxmax()
print(f"The species with the highest mean mvol is: {highest_mean_mvol}")
# species with the highest mean tvol
highest_mean_tvol = tvol_summary['...'].idxmax()
print(f"The species with the highest mean tvol is: {highest_mean_tvol}")
# species with the highest mean biomass
highest_mean_biomass = biomass_summary['...'].idxmax()
print(f"The species with the highest mean biomass is: {highest_mean_biomass}")

In [None]:
#@title Solution

# Summary statistics for mvol grouped by species
mvol_summary = trees_df.groupby('species')['mvol'].describe()
display(mvol_summary)

# Summary statistics for tvol grouped by species
tvol_summary = trees_df.groupby('species')['tvol'].describe()
display(tvol_summary)

# Summary statistics for biomass grouped by species
biomass_summary = trees_df.groupby('species')['biomass'].describe()
display(biomass_summary)

# species with the highest mean mvol
highest_mean_mvol = mvol_summary['mean'].idxmax()
print(f"The species with the highest mean mvol is: {highest_mean_mvol}")
# species with the highest mean tvol
highest_mean_tvol = tvol_summary['mean'].idxmax()
print(f"The species with the highest mean tvol is: {highest_mean_tvol}")
# species with the highest mean biomass
highest_mean_biomass = biomass_summary['mean'].idxmax()
print(f"The species with the highest mean biomass is: {highest_mean_biomass}")

# Identifying & Handling missing data

Missing data is a common challenge in ecological and forestry datasets due to field limitations, measurement errors, or data entry issues. Identifying and appropriately handling missing values is essential to maintain the integrity and reliability of the analysis. Effective strategies such as imputation, exclusion, or flagging missing entries help ensure that the dataset remains robust and that subsequent analyses or models are not biased or compromised by incomplete information. Addressing missing data transparently also supports reproducibility and scientific rigor in forestry research.
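The three strategies named above (exclusion, imputation, and flagging) can be sketched on a small made-up table. The column names mirror the tutorial's, but the rows and the choice of a median fill are illustrative assumptions, not recommendations for the PRF data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing species and one missing DBH
toy_df = pd.DataFrame({
    "species": ["White pine", None, "Red pine", "Ironwood"],
    "DBH": [32.5, 18.0, np.nan, 9.8],
})

# 1. Exclusion: drop every row with at least one missing value
excluded = toy_df.dropna()

# 2. Imputation: fill missing values with a stated rule (here, the median DBH)
imputed = toy_df.copy()
imputed["DBH"] = imputed["DBH"].fillna(imputed["DBH"].median())

# 3. Flagging: keep the gap but record where it was
flagged = toy_df.copy()
flagged["species_missing"] = flagged["species"].isna()

print(excluded)
print(imputed)
print(flagged)
```

Exclusion shrinks the table, imputation changes values, and flagging changes neither, which is why the right choice depends on what each column means.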

**Question 1 - Which column(s) in the trees data frame contain NAs? How many rows contain NAs? Fill in the code below.**

In [None]:
col_na_counts = trees_df....().sum()

print("Columns that contain NAs:\n\n", col_na_counts[... > 0])

trees_no_nas = trees_df....()

print("\n\nNumber of rows with NAs:\n\n", len(trees_df) - len(trees_no_nas))

print(f"\n\nThere are {len(trees_no_nas)} rows in the DataFrame which contain no NA values.\n")

# number of null values in each column (attribute)
print(f"Number of null values in each column: \n\n")
trees_df.isnull().sum()

In [None]:
#@title Solution

# Which column(s) in the trees data frame contain NAs? How many rows contain NAs?

col_na_counts = trees_df.isnull().sum()

print("Columns that contain NAs:\n\n", col_na_counts[col_na_counts > 0])

trees_no_nas = trees_df.dropna()

print("\n\nNumber of rows with NAs:\n\n", len(trees_df) - len(trees_no_nas))

print(f"\n\nThere are {len(trees_no_nas)} rows in the DataFrame which contain no NA values.\n")

# number of null values in each column (attribute)
print(f"Number of null values in each column: \n\n")
trees_df.isnull().sum()

<details>
<summary>Explanation</summary>

To find which columns contain NAs, we count the null values in each column with `isnull().sum()` and keep only the columns whose count is greater than 0.

To count the rows with NAs, we drop every row that contains at least one NA value with the `dropna` function. The difference between the total number of rows and the number of rows left after dropping gives the number of rows that contain NAs.
</details>
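The row count can be cross-checked without dropping anything by building a per-row mask with `isnull().any(axis=1)`. The sketch below runs both approaches on a made-up table so the two results can be compared directly.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: row 1 has one NA, row 2 has two NAs
toy_df = pd.DataFrame({
    "TreeID": [1.0, 2.0, np.nan, 4.0],
    "DBH": [32.5, np.nan, np.nan, 9.8],
    "Status": ["L", "L", "D", "L"],
})

# Columns: count NAs per column, then keep only columns with at least one
na_per_column = toy_df.isnull().sum()
columns_with_na = na_per_column[na_per_column > 0]

# Rows, approach 1: difference in length before and after dropna()
rows_with_na_by_difference = len(toy_df) - len(toy_df.dropna())

# Rows, approach 2: True for each row that has any NA, then count the Trues
rows_with_na_by_mask = int(toy_df.isnull().any(axis=1).sum())

print(columns_with_na)
print(rows_with_na_by_difference, rows_with_na_by_mask)
```

Note that a row with two NAs is still counted once, so the row count can be smaller than the sum of the per-column counts.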

**Question 2 - What's the strategy to account for the missing data in the DecayClass column? Fill in the code below.**

In [None]:
# Find the rows where the 'Status' is 'D' (dead) and 'DecayClass' is NaN
missing_decay = ...[(...['Status'] == 'D') ... (trees_df['DecayClass']....())]

# As there are no such rows, we can leave 'DecayClass' as it is
print(f"\n\nRows where Status is 'D' and DecayClass is NaN: \n\n{missing_decay}")

In [None]:
#@title Solution

# Find the rows where the 'Status' is 'D' (dead) and 'DecayClass' is NaN
missing_decay = trees_df[(trees_df['Status'] == 'D') & (trees_df['DecayClass'].isna())]

# As there are no such rows, we can leave 'DecayClass' as it is
print(f"\n\nRows where Status is 'D' and DecayClass is NaN: \n\n{missing_decay}")

**Question 3 - What's the strategy used to account for the missing data in the DecayClass column?**

*Type your Answer here*

<details>
<summary>Solution</summary>

From the description of `DecayClass`, we know that it is a classification system for the degree of stem decay in standing dead trees, i.e., it only applies when the tree `Status` is `'D'` (dead). If we dropped this column because of its missing values, we would **lose a valuable forest health indicator**. However, imputing it incorrectly (e.g., filling every missing value with the mode or a placeholder) could misrepresent live trees, for which decay classification does not apply.

The strategy is therefore to leave `DecayClass` missing for rows where `Status` is `'L'` (live) and to impute only rows where `Status` is `'D'` (dead).

The code above shows that there are no trees with `Status` `'D'` (dead) and a missing `DecayClass`, so no imputation is needed.

</details>
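One way to see whether missingness depends on another column is to count the missing values within each group. The sketch below does this for a made-up `Status`/`DecayClass` pair; the rows are hypothetical and chosen so that every gap falls on a live tree.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: DecayClass is only recorded for dead trees
toy_df = pd.DataFrame({
    "Status": ["L", "L", "D", "D", "L"],
    "DecayClass": [np.nan, np.nan, 1.0, 2.0, np.nan],
})

# Count missing DecayClass values separately for each Status
missing_by_status = (
    toy_df["DecayClass"].isna()
    .groupby(toy_df["Status"])
    .sum()
)

# Rows that would actually need imputation: dead trees without a decay class
needs_imputation = toy_df[(toy_df["Status"] == "D") & (toy_df["DecayClass"].isna())]

print(missing_by_status)
print(f"Rows needing imputation: {len(needs_imputation)}")
```

When all missing values sit in the group where the attribute does not apply, the gaps are structural rather than errors, and leaving them as NA is the honest representation.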

**Question 4 - What's the appropriate strategy to account for missing data in the biomass attribute? Fill in the code below.**

In [None]:
ironwood_mask = (
    trees_df['species'].str.strip().eq('Ironwood') &
    trees_df['biomass'].isna() &
    trees_df['DBH'].notna()
)

# Apply Ironwood-specific equation where criteria are met
trees_df.loc[..., 'biomass'] = trees_df.loc[..., 'DBH'].apply(
    lambda dbh: 0.124 * (dbh ** 2.3)  # kg/tree
)

# Convert to kg/ha (assuming stand density = 1000 trees/ha)
STAND_DENSITY = 1000  # Petawawa default
trees_df.loc[ironwood_mask, 'biomass'] *= STAND_DENSITY

# Verify no remaining Ironwood missing values
print(f"Remaining missing biomass in Ironwood: {trees_df[trees_df['species'].str.strip() == 'Ironwood']['biomass'].isna().sum()}")

# Display the first few rows of the DataFrame after filling missing values in biomass
display(trees_df....())

# Remaining null values in the DataFrame
print(f"\n\nRemaining null values in the DataFrame: \n\n{trees_df.isnull().sum()}")

In [None]:
#@title Solution
ironwood_mask = (
    trees_df['species'].str.strip().eq('Ironwood') &
    trees_df['biomass'].isna() &
    trees_df['DBH'].notna()
)

# Apply Ironwood-specific equation where criteria are met
trees_df.loc[ironwood_mask, 'biomass'] = trees_df.loc[ironwood_mask, 'DBH'].apply(
    lambda dbh: 0.124 * (dbh ** 2.3)  # kg/tree
)


# Convert to kg/ha (assuming stand density = 1000 trees/ha)
STAND_DENSITY = 1000  # Petawawa default
trees_df.loc[ironwood_mask, 'biomass'] *= STAND_DENSITY

# Verify no remaining Ironwood missing values
print(f"Remaining missing biomass in Ironwood: {trees_df[trees_df['species'].str.strip() == 'Ironwood']['biomass'].isna().sum()}")

# Display the first few rows of the DataFrame after filling missing values in biomass
display(trees_df.head())

# Remaining null values in the DataFrame
print(f"\n\nRemaining null values in the DataFrame: \n\n{trees_df.isnull().sum()}")

**Question 5 - What's the strategy used to account for missing data in the biomass attribute?**

*Type Your Answer Here*

<details>
<summary>Solution</summary>

From the description of `biomass`, we know that it is the aboveground biomass of trees in kg/ha. To impute it, we need to use the biomass equation provided by the NFIS for each species that is missing biomass values.

Only `Ironwood` is missing biomass values. The `Ironwood` biomass equation (0.124 × DBH^2.3) was sourced from `Sugar Maple` allometric studies validated in Eastern Canadian forests, including Petawawa.
This proxy was chosen because Ironwood (*Ostrya virginiana*) shares similar ecological traits (shade tolerance, wood density ~0.76 g/cm³) with Sugar Maple (*Acer saccharum*), which should give biologically reasonable estimates.

Biomass per tree (kg/tree) is calculated with this equation and then scaled to kg/ha. The code above scales by a fixed stand density (`STAND_DENSITY = 1000` trees/ha); multiplying by the `stems` column (trees/ha) instead would give a plot-specific scaling. The approach follows the principle that species-specific equations are prioritized and proxies are used for less-studied species to maintain ecological accuracy.

</details>
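The scaling step can also be done per tree with the `stems` column from the data dictionary. The sketch below is a minimal illustration on made-up rows: the coefficients are copied from the Ironwood cell above and are not independently verified here, and the `stems` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the tutorial's Ironwood cell (not independently verified)
A, B = 0.124, 2.3

# Hypothetical rows; stems = trees per hectare that each measured tree represents
toy_df = pd.DataFrame({
    "species": ["Ironwood", "Ironwood", "White pine"],
    "DBH": [10.0, 20.0, 35.0],
    "stems": [25.0, 25.0, 10.0],
    "biomass": [np.nan, np.nan, 5000.0],
})

# Only fill Ironwood rows that lack biomass but have a DBH to compute it from
mask = (
    toy_df["species"].eq("Ironwood")
    & toy_df["biomass"].isna()
    & toy_df["DBH"].notna()
)

# Vectorized: kg/tree from DBH, then kg/ha using each tree's own stems value
kg_per_tree = A * toy_df.loc[mask, "DBH"] ** B
toy_df.loc[mask, "biomass"] = kg_per_tree * toy_df.loc[mask, "stems"]

print(toy_df)
```

Arithmetic on the masked columns replaces the row-by-row `apply` call and leaves rows outside the mask, such as the White pine row, untouched.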

**Question 6  - What's the strategy to account for the missing data in the TreeID column? Fill in the code below.**

In [None]:
# checking the 'TreeID' type
print(f"\n\nTreeID column is of type \n\n{trees_df['...'].dtype}")
# the unique values in the 'TreeID' column
print(f"\n\nUnique values in the 'TreeID' column: \n\n{trees_df['...'].unique()}")

# Replace the NaN values in 'TreeID' with -1.0
trees_df['TreeID'] = trees_df['...'].fillna(-1.0)

# display all the rows where TreeID is missing (now coded as -1.0)
unknown_tree_ids = trees_df[trees_df['...'] == -1.0]
display(unknown_tree_ids)
display(trees_df.head())

# Remaining null values in the DataFrame
print(f"\n\nRemaining null values in the DataFrame: \n\n{trees_df.isnull()....()}")

In [None]:
#@title Solution

# checking the 'TreeID' type
print(f"\n\nTreeID column is of type \n\n{trees_df['TreeID'].dtype}")
# the unique values in the 'TreeID' column
print(f"\n\nUnique values in the 'TreeID' column: \n\n{trees_df['TreeID'].unique()}")

# Replace the NaN values in 'TreeID' with -1.0
trees_df['TreeID'] = trees_df['TreeID'].fillna(-1.0)

# display all the rows where TreeID is missing (now coded as -1.0)
unknown_tree_ids = trees_df[trees_df['TreeID'] == -1.0]
display(unknown_tree_ids)
display(trees_df.head())

# Remaining null values in the DataFrame
print(f"\n\nRemaining null values in the DataFrame: \n\n{trees_df.isnull().sum()}")

**Question 7 - What's the strategy used to account for the missing data in the TreeID column?**

*Type your answer here*

<details>
<summary>Solution</summary>


From the description of `TreeID`, we know that it is a unique identifier for each tree. Techniques such as the mean or median are therefore meaningless here, and we instead need a special value that marks the ID as missing.

`-1.0` is a common code for missing values, but before using it we must check that no existing ID already uses that value (the `unique()` function lets us do that). We then assign `-1.0` to every missing `TreeID`.

By labeling these records as `-1.0`, we:

* **Preserve Data Integrity:** Using a sentinel value like `-1.0` avoids fabricating or guessing TreeIDs, which is crucial for accurate tracking in forestry research.

* **Enable Easy Filtering:** Analysts can quickly identify and separate records with missing IDs for quality control, or exclude them from analyses where a valid TreeID is required.

* **Prevent Data Loss:** This approach keeps all observations in the dataset, so no valuable information is discarded because of missing identifiers.

* **Maintain Consistency:** Using a consistent placeholder (like `-1.0`) is a transparent, reproducible method that is easily understood by anyone working with the data.

</details>
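The collision check described in the solution can be made explicit in code rather than done by eye. The sketch below is one possible way to do it, on a made-up `TreeID` series; the `SENTINEL` name is an illustrative assumption.

```python
import numpy as np
import pandas as pd

SENTINEL = -1.0  # illustrative name for the missing-ID code

# Hypothetical IDs with two missing entries
tree_ids = pd.Series([101.0, 102.0, np.nan, 104.0, np.nan], name="TreeID")

# Guard: the sentinel must not collide with a real ID
sentinel_is_free = not tree_ids.eq(SENTINEL).any()
if not sentinel_is_free:
    raise ValueError(f"{SENTINEL} is already used as a TreeID")

# Fill the gaps, then separate flagged records from valid ones
filled = tree_ids.fillna(SENTINEL)
n_missing = int(filled.eq(SENTINEL).sum())
valid_ids = filled[filled != SENTINEL]

print(f"Missing IDs: {n_missing}")
print(valid_ids.tolist())
```

Because every missing record now shares the same value, `TreeID` is no longer unique across the whole column, so any uniqueness check should be run on the valid IDs only.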

**Question 8  - What's the strategy to account for the missing data in the CrownClass column? Fill in the code below.**

In [None]:
# crown class should not be applicable for dead trees
wrong_crownclass = trees_df[(trees_df['Status'] == '...') & (trees_df['CrownClass'].notna())]
display(wrong_crownclass)

# Flag conflicts: CrownClass present for dead trees
trees_df['CrownClass_Conflict'] = np.where(
    (trees_df['Status'] == 'D') & (trees_df['...'].notna()),
    True,
    False
)

display(trees_df[(trees_df['Status'] == 'D') & (trees_df['CrownClass']....())])

# imputing with 'Unknown' for CrownClass with live trees
trees_df.loc[
    (trees_df['Status'] == '...') & (trees_df['CrownClass'].isna()),
    'CrownClass'
] = 'Unknown'

display(trees_df.head())

# Remaining null values in CrownClass column with live trees
print(f"\n\nRemaining null values in CrownClass column with live trees: \n\n{trees_df[trees_df['Status'] == 'L']['...'].isnull().sum()}")

# Remaining null values in the DataFrame
print(f"\n\nRemaining null values in the DataFrame: \n\n{trees_df.isnull().sum()}")

In [None]:
#@title Solution

# crown class should not be applicable for dead trees
wrong_crownclass = trees_df[(trees_df['Status'] == 'D') & (trees_df['CrownClass'].notna())]
display(wrong_crownclass)

# Flag conflicts: CrownClass present for dead trees
trees_df['CrownClass_Conflict'] = np.where(
    (trees_df['Status'] == 'D') & (trees_df['CrownClass'].notna()),
    True,
    False
)

display(trees_df[(trees_df['Status'] == 'D') & (trees_df['CrownClass'].notna())])

# imputing with 'Unknown' for CrownClass with live trees
trees_df.loc[
    (trees_df['Status'] == 'L') & (trees_df['CrownClass'].isna()),
    'CrownClass'
] = 'Unknown'

display(trees_df.head())

# Remaining null values in CrownClass column with live trees
print(f"\n\nRemaining null values in CrownClass column with live trees: \n\n{trees_df[trees_df['Status'] == 'L']['CrownClass'].isnull().sum()}")

# Remaining null values in the DataFrame
print(f"\n\nRemaining null values in the DataFrame: \n\n{trees_df.isnull().sum()}")



**Question 9 - What's the strategy used to account for the missing data in the CrownClass column?**

*Type your Answer Here*

<details>
<summary>Solution</summary>

From the description of `CrownClass`, we know that it records the crown class of live numbered trees, i.e., it only applies when the tree `Status` is `'L'` (live).

If we dropped this column because of its missing values, we would **lose a valuable forest health indicator**. It would also be reasonable to leave missing `CrownClass` values as they are for live trees, because crown class is not strongly related to the other attributes. Instead, we set these missing values to `Unknown`, which makes them easier to handle in later analysis.

Flagging `CrownClass` conflicts, i.e., cases where dead trees (`Status` = 'D') have a recorded `CrownClass`, is a good technique because it preserves data integrity and analytical transparency. `CrownClass` is a field-assessed attribute that is only meaningful for live trees, as it describes a tree’s competitive position and canopy status within a stand.
By identifying and flagging these conflicts instead of removing or altering the data, we:

* **Maintain the original data for quality control:** We can later review or exclude these flagged cases depending on the analysis goals, ensuring no information is lost prematurely.

* **Increase transparency:** Future users of the dataset can see where potential data entry errors or unusual cases exist, supporting reproducibility and trust in the data cleaning process.

* **Avoid introducing bias:** Rather than making assumptions about the correct value or deleting records, we document the issue, which is especially important for subjective or field-based variables like CrownClass.

</details>
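The flag-then-fill logic can be sketched on a few made-up rows. The crown class codes `"A"` and `"B"` are placeholders invented for this example, not the PRF coding scheme. A boolean comparison already yields `True`/`False`, so the flag column can be assigned directly without `np.where`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "A" and "B" are placeholder crown class codes
toy_df = pd.DataFrame({
    "Status": ["L", "L", "D", "D"],
    "CrownClass": ["A", np.nan, "B", np.nan],
})

is_dead = toy_df["Status"].eq("D")
is_live = toy_df["Status"].eq("L")

# Flag conflicts: a crown class recorded for a dead tree
toy_df["CrownClass_Conflict"] = is_dead & toy_df["CrownClass"].notna()

# Only live trees with no recorded class become 'Unknown'
toy_df.loc[is_live & toy_df["CrownClass"].isna(), "CrownClass"] = "Unknown"

print(toy_df)
```

The dead tree without a crown class keeps its NA, since the attribute does not apply to it, while the conflicting row keeps its original value alongside the flag.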

**Question 10  - What's the strategy to account for the missing data in the Species column? Provide the code.**

In [None]:
missing_species = trees_df[trees_df['...'].isna()]
display(missing_species)
trees_df['...'] = trees_df['species']....('Unknown')
display(trees_df.tail())

In [None]:
#@title Solution
missing_species = trees_df[trees_df['species'].isna()]
display(missing_species)
trees_df['species'] = trees_df['species'].fillna('Unknown')
display(trees_df.tail())

**Question 11 - What's the strategy used to account for the missing data in the Species column?**

*Type your Answer Here*

<details>
<summary>Solution</summary>

Imputing missing species values as `Unknown` is a good technique because it preserves the integrity of the dataset without introducing misleading information. In forestry research, species identity is often essential for ecological analysis, but sometimes data is missing due to field constraints or data entry errors.

By labeling these records as `Unknown`, we:

* **Maintain all records:** No data is lost due to missing species, which is important for analysis.

* **Avoid incorrect assumptions:** We don’t risk assigning a wrong species, which could bias results or lead to faulty conclusions.

* **Enable clear filtering:** Analysts can easily identify and handle these 'Unknown' cases separately in future analysis.

* **Support transparency:** It’s immediately clear which entries lacked species data, improving reproducibility and data quality.

</details>

# Outliers and Anomalies

Visualizing outliers and anomalies is a key step in the data preparation process, especially for ecological and forestry datasets where measurement errors or rare events can significantly influence results.

By generating plots such as boxplots, scatterplots, or histograms, analysts can quickly identify data points that deviate markedly from expected patterns. Detecting these outliers early allows for informed decisions about whether to investigate, correct, or exclude them, ultimately improving the quality and reliability of subsequent analyses. Visual tools not only make anomalies more apparent but also facilitate transparent communication of data issues to collaborators.
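A boxplot draws its whiskers at 1.5 times the interquartile range (IQR) by default, and the same rule can be applied numerically to list the points a plot would show as outliers. The sketch below applies it within each species on made-up DBH values; the data and the choice of per-species grouping are illustrative assumptions.

```python
import pandas as pd

# Hypothetical DBH values (cm); one White pine is far larger than the rest
toy_df = pd.DataFrame({
    "species": ["White pine"] * 6 + ["Red pine"] * 6,
    "DBH": [30.0, 32.0, 31.0, 29.0, 33.0, 80.0,
            20.0, 22.0, 21.0, 19.0, 23.0, 21.5],
})

# Per-species quartiles, broadcast back to one value per row
grouped = toy_df.groupby("species")["DBH"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Standard 1.5 * IQR rule
K = 1.5
toy_df["DBH_outlier"] = (toy_df["DBH"] < q1 - K * iqr) | (toy_df["DBH"] > q3 + K * iqr)

outliers = toy_df[toy_df["DBH_outlier"]]
print(outliers)
```

A flagged value is a candidate for review, not proof of an error: a very large tree can be a genuine measurement.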

**Question 1  - Are there individual trees with extremely high or low DBH (diameter at breast height) values compared to the rest of the dataset? Fill in the code below.**

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(16, 8))

# Draw the boxplot (dodge=False keeps each box centred when hue matches x)
sns.boxplot(data=trees_df, x='...', y='...', hue='...', palette='Set2', dodge=False)

# Add labels
plt.xlabel('Tree Species', fontsize=12)
plt.ylabel('DBH (cm)', fontsize=12)
plt.title('Distribution of DBH by Tree Species', fontsize=14)

# Rotate the x-axis tick labels for better readability
plt.xticks(rotation=90, ha='center')

plt.tight_layout()
plt.show()

In [None]:
#@title Solution
sns.set(style="whitegrid")
plt.figure(figsize=(16, 8))

# Draw the boxplot (dodge=False keeps each box centred when hue matches x)
sns.boxplot(data=trees_df, x='species', y='DBH', hue='species', palette='Set2', dodge=False)

# Add labels
plt.xlabel('Tree Species', fontsize=12)
plt.ylabel('DBH (cm)', fontsize=12)
plt.title('Distribution of DBH by Tree Species', fontsize=14)

# Rotate the x-axis tick labels for better readability
plt.xticks(rotation=90, ha='center')

plt.tight_layout()
plt.show()



`White pine` and `Red pine` show extremely high DBH values compared to other species, while several species such as `Black cherry` and `Ironwood` tend to have much lower values in the dataset.


**Question 2 - Do any tree species have unusually high or low biomass (kg/ha) that do not match expected ecological patterns? Fill in the code below.**

In [None]:
sns.set(style="whitegrid")

# Group by species and sum biomass for each species
biomass_by_species = trees_df.groupby('...')['biomass']....()

plt.figure(figsize=(14, 6))
biomass_by_species.plot(kind='bar')
plt.xlabel('Tree Species')
plt.ylabel('Total Biomass (kg/ha)')
plt.title('Total Biomass (kg/ha) by Species')
plt.xticks(rotation=90, ha='center')
plt.tight_layout()
plt.show()

In [None]:
#@title Solution
sns.set(style="whitegrid")

# Group by species and sum biomass for each species
biomass_by_species = trees_df.groupby('species')['biomass'].sum()

plt.figure(figsize=(14, 6))
biomass_by_species.plot(kind='bar')
plt.xlabel('Tree Species')
plt.ylabel('Total Biomass (kg/ha)')
plt.title('Total Biomass (kg/ha) by Species')
plt.xticks(rotation=90, ha='center')
plt.tight_layout()
plt.show()

The general trend in the plot shows that most tree species have moderate `total biomass (kg/ha)`, but a few species stand out with much higher values. `White pine` has the highest total biomass by a large margin, followed by `Ironwood` and `Red pine`. In contrast, species like `Black cherry`, `Northern white cedar`, and `White birch` have the lowest total biomass values in the dataset. This indicates that biomass is heavily concentrated in a few dominant species, while many others contribute relatively little.
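How concentrated the biomass is can be expressed as each species' share of the total, alongside a running cumulative share. The sketch below computes both on made-up totals, so the numbers illustrate the method only and do not describe the PRF data.

```python
import pandas as pd

# Hypothetical per-tree biomass values (kg/ha)
toy_df = pd.DataFrame({
    "species": ["White pine", "White pine", "Red pine", "Ironwood", "White birch"],
    "biomass": [6000.0, 4000.0, 3000.0, 1500.0, 500.0],
})

# Total per species, largest first
total_by_species = (
    toy_df.groupby("species")["biomass"].sum().sort_values(ascending=False)
)

# Fraction of the overall total, and the running sum of those fractions
share = total_by_species / total_by_species.sum()
cumulative_share = share.cumsum()

summary = pd.DataFrame({
    "total": total_by_species,
    "share": share.round(3),
    "cumulative_share": cumulative_share.round(3),
})
print(summary)
```

Reading down the cumulative column shows how many species are needed to account for most of the biomass.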

# Further Steps

We save the cleaned data to a new CSV to preserve our corrections and ensure future analyses use accurate, reliable information, while keeping the original raw data unchanged for reference.

In [None]:
# Save the cleaned DataFrame to a new CSV file in the data folder
trees_df.to_csv('./data/trees_data_cleaned.csv', index=False)
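
After saving, a quick round trip (write, read back, compare) can confirm that the cleaned values survive the CSV format. The sketch below does this on a made-up table in a temporary folder, so it does not touch the tutorial's `data` directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned rows, including the placeholder values used in this tutorial
toy_df = pd.DataFrame({
    "TreeID": [1.0, -1.0, 3.0],
    "species": ["White pine", "Unknown", "Ironwood"],
    "CrownClass_Conflict": [False, True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    cleaned_path = os.path.join(tmp_dir, "trees_data_cleaned.csv")
    # index=False avoids writing the row index as an extra unnamed column
    toy_df.to_csv(cleaned_path, index=False)
    reloaded = pd.read_csv(cleaned_path)

same_columns = list(reloaded.columns) == list(toy_df.columns)
same_shape = reloaded.shape == toy_df.shape
print(same_columns, same_shape)
print(reloaded)
```

Values that are still NA are written as empty fields and come back as NaN when the file is read again.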

## References

Perplexity AI. (2025). Responses to data analysis and visualization queries for forestry research in the Petawawa region. Retrieved from https://www.perplexity.ai