<a href="https://colab.research.google.com/github/subornaa/Data-Analytics-Tutorials/blob/main/Descriptive_Wrangling_Tutorials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">

# Data Wrangling in Python
</div>

![prf_image](https://opendata.nfis.org/mapserver/PRF_Layout.jpg)

This notebook provides Python code to analyze the Petawawa Research Forest (PRF) inventory and ancillary datasets.

All data is sourced from the following website:

https://opendata.nfis.org/mapserver/PRF.html

Please consult this page to get the most up-to-date version of the PRF data.

# Introduction and Dataset Background

This tutorial focuses on summarizing tabular data using Python. The concepts covered are common in the data pre-processing phase of an analysis workflow, but they are useful in many other contexts as well.

The tutorial makes use of the Petawawa Research Forest (PRF) data, which is described in more detail on the tutorial series [GitHub site](https://github.com/subornaa/Data-Analytics-Tutorials).

## Tutorial goal

The goal of this tutorial is to use individual tree data to answer questions relating to the PRF. We use the [`pandas`](https://pandas.pydata.org/docs/getting_started/index.html) Python package to apply common data wrangling concepts such as dealing with null (NA) values, subsetting data, and summarizing data.

## Data Dictionary

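This tutorial makes use of individual tree measurements taken at permanent sample plots (PSPs) across the PRF. The data dictionary below summarizes `trees.csv`. In this data, each tree is a row and each column is an attribute (e.g., height). Note that some column names in the loaded file differ from the dictionary: the DataFrame shown later in this notebook uses `species` and `height` where the dictionary lists `TreeSpec` and `Ht`.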

| **Column**      | **Definition**                                                                 |
|------------------|-------------------------------------------------------------------------------|
| PlotName         | Plot name                                                                    |
| TreeID           | Tree ID                                                                      |
| TreeSpec         | Tree species                                                                 |
| Origin           | Origin. N = natural (includes coppice), P = planted                          |
| Status           | Status. L = Live, D = Dead (only includes decayclass 1 & 2)                  |
| DBH              | Dbh (cm)                                                                     |
| CrownClass       | Crown class                                                                  |
| QualityClass     | Quality class                                                                |
| DecayClass       | Decay class                                                                  |
| Ht               | Height (m), includes estimated heights                                       |
| HLF              | HLF                                                                          |
| HtFlag           | HtFlag                                                                       |
| baha             | Basal area/ha = Dbh * Dbh * 0.00007854 * stems                               |
| ht_meas          | Height (m), if measured in the field                                         |
| stems            | Stems per hectare (number of trees/ha each tree represents)                  |
| mvol             | Gross merchantable volume (m³/ha)                                            |
| tvol             | Gross total volume (m³/ha)                                                  |
| biomass          | Aboveground biomass (kg/ha)                                                 |
| size             | Sawlog size                                                                  |

## References

White, Joanne C., et al. "The Petawawa Research Forest: Establishment of a remote sensing supersite." The Forestry Chronicle 95.3 (2019): 149-156.



## Install and load required packages






In [1]:
!pip install pandas==2.2.3
!pip install geopandas==1.0.1
!pip install seaborn==0.12.2



In [2]:
import os
import shutil
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt

# Download data





The following code block downloads the tree dataset into the `data` folder. It first checks whether a `data` folder already exists in the working directory. If it does not, the zip file is downloaded from Google Drive using `gdown` and unzipped into `data`.
If you are running the notebook locally, the recommended approach is to download the dataset manually, store it on your local drive, and point the notebook to that location.


In [3]:
# Download the data if it does not yet exist
if not os.path.exists("data"):
  !gdown 1UDKAdXW0h6JSf7k31PZ-srrQ3487l9e2
  !unzip prf_data.zip -d data/
  os.remove("prf_data.zip")
else:
  print("Data has already been downloaded.")

!ls data/

Data has already been downloaded.
boundary.gpkg                     petawawa_s2_2024.tif
forest_point_cloud.las            plots.gpkg
forest_point_cloud_footprint.gpkg prf_data.zip
p99.tif                           trees.csv
petawawa_s2_2018.tif              water.gpkg
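
The cell above relies on the `!` shell syntax and the `unzip` command, which may not be available when the notebook is run locally (for example on Windows). As a minimal sketch, assuming the zip file has already been downloaded manually, the same "extract only if the folder is missing" logic can be written with the standard library alone. The function name `extract_if_missing` and the demo archive are hypothetical and only serve as an illustration.

```python
import os
import tempfile
import zipfile


def extract_if_missing(zip_path, data_dir):
    """Unzip zip_path into data_dir unless data_dir already exists.

    Returns True if files were extracted, False if data_dir was already present.
    (Hypothetical helper, not part of the original notebook.)
    """
    if os.path.exists(data_dir):
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    return True


# Demonstration with a throwaway archive standing in for a manually downloaded zip file
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("trees.csv", "PlotName,height\nPRF001,33.9\n")

demo_data_dir = os.path.join(workdir, "data")
first_run = extract_if_missing(demo_zip, demo_data_dir)   # folder missing: extracts
second_run = extract_if_missing(demo_zip, demo_data_dir)  # folder exists: skips
extracted_files = sorted(os.listdir(demo_data_dir))
print(first_run, second_run, extracted_files)
```

With the real data, the call would point at wherever the downloaded zip file was saved and at the `data` folder used by the rest of the notebook.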


## Load Data




To learn more about the dataset, we load it and call `head`, which displays the first 5 rows by default. We then use `info` and `describe` to look for anomalies such as missing values and to get a statistical summary of the numeric columns.

In [4]:
trees_df = pd.read_csv("data/trees.csv")
trees_df.head()


Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
0,PRF001,24.0,White pine,P,D,10.1,,1.0,11.552521,0.12819,N,0.0,0.708735,393.3964,Poles
1,PRF001,46.0,White pine,P,D,9.9,,2.0,11.422529,0.123163,N,0.0,0.673254,375.305379,Poles
2,PRF001,20.0,Red pine,N,L,67.5,D,,33.9,5.725566,Y,77.327438,79.482658,39691.63995,Large
3,PRF001,50.0,Red pine,N,L,57.9,D,,32.528851,4.212773,Y,56.444281,58.117292,28251.255888,Large
4,PRF001,10.0,White pine,N,L,55.9,D,,33.0,3.926761,Y,48.008649,49.833743,24501.838779,Large


In [5]:
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12590 entries, 0 to 12589
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   PlotName    12590 non-null  object 
 1   TreeID      11120 non-null  float64
 2   species     11500 non-null  object 
 3   Origin      12590 non-null  object 
 4   Status      12590 non-null  object 
 5   DBH         12590 non-null  float64
 6   CrownClass  10626 non-null  object 
 7   DecayClass  497 non-null    float64
 8   height      12590 non-null  float64
 9   baha        12590 non-null  float64
 10  codom       12590 non-null  object 
 11  mvol        12590 non-null  float64
 12  tvol        12590 non-null  float64
 13  biomass     12342 non-null  float64
 14  size        12590 non-null  object 
dtypes: float64(8), object(7)
memory usage: 1.4+ MB


In [6]:
trees_df.describe()

Unnamed: 0,TreeID,DBH,DecayClass,height,baha,mvol,tvol,biomass
count,11120.0,12590.0,497.0,12590.0,12590.0,12590.0,12590.0,12342.0
mean,36.288759,17.622415,1.45674,15.883219,0.602011,4.724783,5.711698,3286.085532
std,29.183461,11.559345,0.498627,6.713323,0.895101,11.80013,12.025668,6305.457552
min,1.0,2.5,1.0,1.301584,0.070686,0.0,0.0,138.409692
25%,15.0,10.6,1.0,11.531384,0.174975,0.0,0.939288,634.448847
50%,30.0,14.5,1.0,14.907205,0.309749,0.881678,1.970713,1266.913205
75%,50.0,21.1,2.0,19.398785,0.623451,3.687844,4.640412,2800.442589
max,300.0,97.5,2.0,50.3,11.945934,173.370393,177.874549,89100.864298


# Understanding the dataset

**Question 1 - How many trees are in the dataset? How many permanent sample plots (PSPs)?**

In [7]:
# Add your code here




In [8]:
#@title Solution

# How many trees are in the dataset?
# Each row is one tree, so the row count is the tree count (two equivalent ways shown)

print(trees_df.shape[0])

print(len(trees_df))



12590
12590
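
The solution above prints the number of rows, which answers the first half of the question. The second half, the number of PSPs, can be answered by counting the distinct values in the `PlotName` column. A minimal sketch on a toy DataFrame (the values are illustrative stand-ins, not PRF data):

```python
import pandas as pd

# Toy stand-in for trees_df; the values are illustrative, not PRF data
toy_df = pd.DataFrame({
    "PlotName": ["PRF001", "PRF001", "PRF002", "PRF003", "PRF003"],
    "height": [33.9, 11.5, 30.1, 8.5, 14.2],
})

n_trees = len(toy_df)                    # one row per tree
n_plots = toy_df["PlotName"].nunique()   # number of distinct plot names

print(f"{n_trees} trees across {n_plots} plots")
```

Applied to the tutorial data, the equivalent call would be `trees_df["PlotName"].nunique()`.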




**Question 2 - How many trees have a height greater than or equal to 30 m?**

In [9]:
# Add your code here

In [10]:
#@title Solution
# Select all rows with a height greater than or equal to 30 m. Retain all columns in the output.
large_trees = trees_df.loc[trees_df['height']>=30,:]
display(large_trees)
# How many trees are at least 30 m tall?
print(f"\n\nThere are {len(large_trees)} trees larger than 30 m.\n")

Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
2,PRF001,20.0,Red pine,N,L,67.5,D,,33.900000,5.725566,Y,77.327438,79.482658,39691.639950,Large
3,PRF001,50.0,Red pine,N,L,57.9,D,,32.528851,4.212773,Y,56.444281,58.117292,28251.255888,Large
4,PRF001,10.0,White pine,N,L,55.9,D,,33.000000,3.926761,Y,48.008649,49.833743,24501.838779,Large
5,PRF001,71.0,White pine,N,L,51.5,D,,33.400000,3.332923,Y,41.981759,43.595733,21144.144506,Large
78,PRF002,2.0,White pine,N,L,60.2,D,,30.100000,4.554114,Y,50.084125,52.163201,26349.261013,Large
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12535,PRF332,1.0,White pine,N,L,61.4,C,,36.500000,4.737483,Y,63.179640,65.281153,31924.646846,Large
12536,PRF332,2.0,Red pine,N,L,49.2,C,,32.000000,3.041873,Y,41.436830,42.757213,20281.506347,Large
12537,PRF332,8.0,White pine,N,L,48.5,C,,32.800000,2.955931,Y,36.989005,38.475165,18565.051953,Medium
12539,PRF332,3.0,Red pine,N,L,46.2,C,,31.600000,2.682223,Y,36.424833,37.642421,17751.484851,Medium




There are 526 trees larger than 30 m.
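
The question asks for trees that are at least 30 m tall, which is why the solution filters with `>=` rather than `>`. When only the count is needed, summing the boolean mask avoids building the filtered DataFrame. A small sketch on toy heights (illustrative values, not PRF data) shows how the two operators differ for a tree that is exactly 30 m tall:

```python
import pandas as pd

# Toy heights in metres; illustrative values only
toy_df = pd.DataFrame({"height": [29.9, 30.0, 30.1, 33.4, 12.0]})

# True counts as 1 and False as 0, so summing a boolean mask counts matching rows
at_least_30 = int((toy_df["height"] >= 30).sum())    # includes the 30.0 m tree
taller_than_30 = int((toy_df["height"] > 30).sum())  # excludes the 30.0 m tree

print(at_least_30, taller_than_30)
```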



**Question 3 - How many natural trees have a height greater than 30 m?**

In [11]:
#Add your code here

In [12]:
#@title Solution
natural_large_trees = trees_df.loc[(trees_df['height']>30) & (trees_df['Origin']=='N'),:]
display(natural_large_trees)
# How many natural trees are larger than 30 m?
print(f"\n\nThere are {len(natural_large_trees)} natural trees larger than 30 m.\n")

Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
2,PRF001,20.0,Red pine,N,L,67.5,D,,33.900000,5.725566,Y,77.327438,79.482658,39691.639950,Large
3,PRF001,50.0,Red pine,N,L,57.9,D,,32.528851,4.212773,Y,56.444281,58.117292,28251.255888,Large
4,PRF001,10.0,White pine,N,L,55.9,D,,33.000000,3.926761,Y,48.008649,49.833743,24501.838779,Large
5,PRF001,71.0,White pine,N,L,51.5,D,,33.400000,3.332923,Y,41.981759,43.595733,21144.144506,Large
78,PRF002,2.0,White pine,N,L,60.2,D,,30.100000,4.554114,Y,50.084125,52.163201,26349.261013,Large
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12535,PRF332,1.0,White pine,N,L,61.4,C,,36.500000,4.737483,Y,63.179640,65.281153,31924.646846,Large
12536,PRF332,2.0,Red pine,N,L,49.2,C,,32.000000,3.041873,Y,41.436830,42.757213,20281.506347,Large
12537,PRF332,8.0,White pine,N,L,48.5,C,,32.800000,2.955931,Y,36.989005,38.475165,18565.051953,Medium
12539,PRF332,3.0,Red pine,N,L,46.2,C,,31.600000,2.682223,Y,36.424833,37.642421,17751.484851,Medium




There are 413 natural trees larger than 30 m.
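
The solution combines two conditions with `&`, and each condition needs its own parentheses because `&` binds more tightly than the comparison operators. As an alternative sketch, `DataFrame.query` expresses the same filter as a string. The toy data below is illustrative, not PRF data:

```python
import pandas as pd

# Toy stand-in for trees_df; illustrative values only
toy_df = pd.DataFrame({
    "height": [33.9, 31.0, 30.0, 35.2, 12.0],
    "Origin": ["N", "P", "N", "N", "N"],
})

# Boolean-mask form, as used in the solution
mask_result = toy_df.loc[(toy_df["height"] > 30) & (toy_df["Origin"] == "N"), :]

# Equivalent query-string form
query_result = toy_df.query("height > 30 and Origin == 'N'")

print(len(mask_result), len(query_result))
```

Both forms select the same rows; which one reads better is largely a matter of preference.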



**Question 4  - Which species has the highest merchantable volume, total volume and biomass?**

In [13]:
#Add your code here


In [14]:
#@title Solution

# Summary statistics for mvol grouped by species
mvol_summary = trees_df.groupby('species')['mvol'].describe()
display(mvol_summary)

# Summary statistics for tvol grouped by species
tvol_summary = trees_df.groupby('species')['tvol'].describe()
display(tvol_summary)

# Summary statistics for biomass grouped by species
biomass_summary = trees_df.groupby('species')['biomass'].describe()
display(biomass_summary)

# species with the highest mean mvol
highest_mean_mvol = mvol_summary['mean'].idxmax()
print(f"The species with the highest mean mvol is: {highest_mean_mvol}")
# species with the highest mean tvol
highest_mean_tvol = tvol_summary['mean'].idxmax()
print(f"The species with the highest mean tvol is: {highest_mean_tvol}")
# species with the highest mean biomass
highest_mean_biomass = biomass_summary['mean'].idxmax()
print(f"The species with the highest mean biomass is: {highest_mean_biomass}")

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
American beech,234.0,4.784825,9.339794,0.0,0.0,0.0,4.301597,54.352089
American elm,11.0,1.089485,1.696493,0.0,0.0,0.0,1.394433,5.396927
Balsam fir,1850.0,1.228208,1.46701,0.0,0.373656,0.764341,1.646249,14.930905
Balsam poplar,3.0,7.501004,12.992119,0.0,0.0,0.0,11.251505,22.503011
Basswood,86.0,10.060531,14.981966,0.0,0.0,2.529784,15.339783,74.103497
Black ash,225.0,1.139805,3.124837,0.0,0.0,0.0,0.805939,26.15439
Black cherry,11.0,1.584528,2.710791,0.0,0.0,0.0,1.960916,8.295301
Eastern hemlock,47.0,4.804712,9.689787,0.0,0.0,1.552688,3.30926,51.132485
Ironwood,248.0,0.052571,0.306358,0.0,0.0,0.0,0.0,3.324604
Jack pine,597.0,3.65268,2.881986,0.0,1.760958,3.124137,4.860344,24.195748


Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
American beech,234.0,6.427821,9.613008,0.124608,0.850631,1.736283,6.908304,56.817557
American elm,11.0,3.153926,2.383629,0.547321,1.088669,3.393826,4.582487,8.071978
Balsam fir,1850.0,1.771531,1.512205,0.233823,0.789622,1.270579,2.188899,15.601772
Balsam poplar,3.0,8.255433,13.239786,0.513878,0.611647,0.709416,12.12621,23.543004
Basswood,86.0,11.798715,15.186725,0.110702,1.280729,4.94617,17.281008,76.717508
Black ash,225.0,2.747,3.462914,0.238485,0.77721,1.475303,3.490148,28.294818
Black cherry,11.0,3.240378,2.962067,0.991583,1.138801,1.805927,4.309366,10.302958
Eastern hemlock,47.0,5.783324,10.016037,0.34119,0.842476,2.251709,4.187734,53.808661
Ironwood,248.0,1.343409,1.1742,0.121784,0.609055,0.902901,1.609466,6.931966
Jack pine,597.0,4.102191,2.873844,0.35891,2.182119,3.530161,5.263976,25.035001


Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
American beech,234.0,5419.948951,7954.981287,250.615651,812.511104,1544.438473,5755.738435,48497.309717
American elm,11.0,2095.52453,1489.736313,529.139008,815.86866,2251.789377,3013.499772,5054.202483
Balsam fir,1850.0,963.808277,813.820127,181.188854,445.673256,702.108074,1176.1415,7635.358907
Balsam poplar,3.0,3949.943954,6172.149883,320.239094,386.65789,453.076687,5764.796384,11076.516082
Basswood,86.0,5957.052938,7364.7396,138.409692,749.613214,2710.974704,8729.088721,36887.46042
Black ash,225.0,1805.176992,2060.292599,310.101196,579.467138,1112.106467,2353.504682,16211.247881
Black cherry,11.0,2408.461192,2204.171063,715.267594,857.806574,1360.300098,3236.997251,7796.828765
Eastern hemlock,47.0,3909.591667,7028.220476,341.307402,650.816528,1644.480848,2777.004549,39115.076541
Ironwood,0.0,,,,,,,
Jack pine,597.0,2274.973809,1458.124738,267.147773,1305.010406,2011.831594,2917.624715,12563.39833


The species with the highest mean mvol is: White pine 
The species with the highest mean tvol is: White pine 
The species with the highest mean biomass is: Largetooth aspen 
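
The solution ranks species by their mean per-tree value. "Highest" could also be read as the highest total across all trees of a species, and the two readings can give different answers when species differ in how many trees they have. A minimal sketch on toy data (illustrative values, not PRF data) compares both in a single `agg` call:

```python
import pandas as pd

# Toy data: few large trees of one species, many smaller trees of another
toy_df = pd.DataFrame({
    "species": ["White pine", "White pine",
                "Balsam fir", "Balsam fir", "Balsam fir", "Balsam fir"],
    "mvol": [40.0, 50.0, 25.0, 25.0, 25.0, 25.0],
})

# One row per species with the tree count, mean, and total merchantable volume
summary = toy_df.groupby("species")["mvol"].agg(["count", "mean", "sum"])

highest_mean = summary["mean"].idxmax()   # species with the largest average tree
highest_total = summary["sum"].idxmax()   # species with the largest combined volume

print(summary)
print(highest_mean, highest_total)
```

Here the species with the highest mean is not the one with the highest total, so it is worth stating which statistic an answer refers to.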


# Identifying & Handling missing data

**Question 1 - Which column(s) in the trees data frame contain NAs? How many rows contain NAs?**

In [15]:
#@title Solution

# Which column(s) in the trees data frame contain NAs? How many rows contain NAs?

col_na_counts = trees_df.isnull().sum()

print("Columns that contain NAs:\n\n", col_na_counts[col_na_counts > 0])

trees_no_nas = trees_df.dropna()

print("\n\nNumber of rows with NAs:\n\n", len(trees_df) - len(trees_no_nas))

print(f"\n\nThere are {len(trees_no_nas)} rows in the DataFrame which contain no NA values.\n")

# Number of null values in every column (attribute), including columns with no NAs
print("Number of null values in each column: \n\n")
trees_df.isnull().sum()

Columns that contain NAs:

 TreeID         1470
species        1090
CrownClass     1964
DecayClass    12093
biomass         248
dtype: int64


Number of rows with NAs:

 12587


There are 3 rows in the DataFrame which contain no NA values.

Number of null values in each column: 




PlotName          0
TreeID         1470
species        1090
Origin            0
Status            0
DBH               0
CrownClass     1964
DecayClass    12093
height            0
baha              0
codom             0
mvol              0
tvol              0
biomass         248
size              0
dtype: int64

<details>
<summary>Explanation</summary>

To find which columns contain NAs, we count the null values in each column with `isnull().sum()` and keep only the columns whose count is greater than 0.

To count the rows that contain NAs, we drop every row with at least one NA value using `dropna`. The difference between the total number of rows and the number of rows remaining after `dropna` is the number of rows that contain at least one NA.
</details>
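
As a sketch of an alternative to the `dropna` difference, rows with at least one NA can be counted directly with `isna().any(axis=1)`. The toy DataFrame below is illustrative, not PRF data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for trees_df with a few missing values; illustrative only
toy_df = pd.DataFrame({
    "TreeID": [1.0, np.nan, 3.0, 4.0],
    "species": ["Red pine", "White pine", None, "Jack pine"],
    "DBH": [10.1, 9.9, 67.5, 57.9],
})

# Columns that contain at least one NA
cols_with_na = toy_df.columns[toy_df.isna().any()].tolist()

# Rows that contain at least one NA, counted directly
rows_with_na = int(toy_df.isna().any(axis=1).sum())

# Same count via the dropna difference used in the solution
rows_via_dropna = len(toy_df) - len(toy_df.dropna())

print(cols_with_na, rows_with_na, rows_via_dropna)
```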

**Question 2 - What is an appropriate strategy to account for the missing data in the DecayClass column? Provide the code and an explanation.**

*Type your Answer here*

In [16]:
# Add your code here

In [28]:
#@title Solution


# Find the rows where the 'Status' is 'D' (dead) and 'DecayClass' is NaN
missing_decay = trees_df[(trees_df['Status'] == 'D') & (trees_df['DecayClass'].isna())]

# As there are no such rows, we can leave 'DecayClass' as it is 
print(f"\n\nRows where Status is 'D' and DecayClass is NaN: \n\n{missing_decay}")



Rows where Status is 'D' and DecayClass is NaN: 

Empty DataFrame
Columns: [PlotName, TreeID, species, Origin, Status, DBH, CrownClass, DecayClass, height, baha, codom, mvol, tvol, biomass, size]
Index: []


<details>
<summary>Explanation</summary>

From the description of `DecayClass`, we know that it is a classification system for the degree of stem decay in standing dead trees, so it only applies when `Status` is `'D'` (dead). Dropping the column because of its missing values would **lose a valuable forest health indicator**. Imputing it indiscriminately (e.g., filling every missing value with the mode or a placeholder) would assign a decay class to live trees, for which it is not applicable.

The appropriate strategy is therefore to leave `DecayClass` missing for live trees (`Status` is `'L'`) and to impute only dead trees (`Status` is `'D'`) that lack a value. The check above returns an empty DataFrame, so no dead tree is missing a decay class and no imputation is needed.

</details>
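
One way to see this pattern at a glance is to count missing `DecayClass` values separately for each `Status`. The sketch below uses a toy DataFrame (illustrative values, not PRF data) and groups the missing-value mask by status:

```python
import numpy as np
import pandas as pd

# Toy stand-in for trees_df; illustrative values only
toy_df = pd.DataFrame({
    "Status": ["D", "D", "L", "L", "L"],
    "DecayClass": [1.0, 2.0, np.nan, np.nan, np.nan],
})

# Boolean mask of missing DecayClass values, summed within each Status group
missing_by_status = toy_df["DecayClass"].isna().groupby(toy_df["Status"]).sum()

print(missing_by_status)
```

If every missing value falls in the live group, as in this toy example, the missingness is structural ("not applicable") rather than a data-quality problem.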

**Question 3  - What's the appropriate strategy to account for missing data in the biomass attribute? Provide both code and explanation**

*Type Your Answer Here*

In [18]:
#Add your code here

In [None]:
#@title Solution
# bfill: backward fill replaces each missing biomass value with the next valid value below it
# Series.bfill() is used here because fillna(method='bfill') raises a deprecation warning
trees_df.loc[:,'biomass'] = trees_df.loc[:,'biomass'].bfill()

# Display the first few rows of the DataFrame after filling missing values in biomass
display(trees_df.head())

# Remaining null values in the DataFrame
print(f"\n\nRemaining null values in the DataFrame: \n\n{trees_df.isnull().sum()}")

Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
0,PRF001,24.0,White pine,P,D,10.1,,1.0,11.552521,0.12819,N,0.0,0.708735,393.3964,Poles
1,PRF001,46.0,White pine,P,D,9.9,,2.0,11.422529,0.123163,N,0.0,0.673254,375.305379,Poles
2,PRF001,20.0,Red pine,N,L,67.5,D,,33.9,5.725566,Y,77.327438,79.482658,39691.63995,Large
3,PRF001,50.0,Red pine,N,L,57.9,D,,32.528851,4.212773,Y,56.444281,58.117292,28251.255888,Large
4,PRF001,10.0,White pine,N,L,55.9,D,,33.0,3.926761,Y,48.008649,49.833743,24501.838779,Large




Remaining null values in the DataFrame: 

PlotName          0
TreeID         1470
species        1090
Origin            0
Status            0
DBH               0
CrownClass     1964
DecayClass    12093
height            0
baha              0
codom             0
mvol              0
tvol              0
biomass           0
size              0
dtype: int64


<details>
<summary>Explanation</summary>


</details>
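
Backward fill copies the biomass of whichever tree happens to follow in row order, regardless of species or size. The Question 4 output earlier in this notebook shows a biomass count of 0 for Ironwood (248 trees), which matches the 248 missing biomass values and suggests the gaps are concentrated in one species. A minimal sketch on toy data (illustrative values, not PRF data) shows how to check missingness per species and how a species-wise median fill behaves. This is offered as one possible alternative, not as the tutorial's method:

```python
import numpy as np
import pandas as pd

# Toy stand-in for trees_df; illustrative values only
toy_df = pd.DataFrame({
    "species": ["Ironwood", "Ironwood", "Red pine", "Red pine", "Red pine"],
    "biomass": [np.nan, np.nan, 100.0, np.nan, 300.0],
})

# How many biomass values are missing within each species?
missing_by_species = toy_df["biomass"].isna().groupby(toy_df["species"]).sum()

# Median biomass of each tree's own species, aligned to the original rows
species_median = toy_df.groupby("species")["biomass"].transform("median")

# Fill gaps with the species median; a species with no observed biomass stays missing
toy_df["biomass_filled"] = toy_df["biomass"].fillna(species_median)

print(missing_by_species)
print(toy_df)
```

A species with no observed values at all cannot be filled this way, which makes the gap visible instead of silently borrowing values from unrelated trees.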

**Question 3 - What is an appropriate strategy to account for the missing data in the TreeID column? Provide the code and an explanation.**

*Type your answer here*

In [20]:
# Add your code here

In [21]:
#@title Solution 

trees_df['TreeID'] = trees_df['TreeID'].fillna('Unknown')

# display all the rows where TreeID is 'Unknown'
unknown_tree_ids = trees_df[trees_df['TreeID'] == 'Unknown']
display(unknown_tree_ids)
display(trees_df.tail())

# Remaining null values in the DataFrame
print(f"\n\nRemaining null values in the DataFrame: \n\n{trees_df.isnull().sum()}")

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  trees_df['TreeID'] = trees_df['TreeID'].fillna('Unknown')


Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
70,PRF001,Unknown,White pine,N,L,8.5,,,10.400000,1.134903,N,0.0,5.613244,3259.714610,Under
71,PRF001,Unknown,White birch,N,L,8.2,,,13.300000,1.056206,N,0.0,6.234093,4253.920594,Under
72,PRF001,Unknown,White pine,N,L,6.3,,,8.500000,0.623451,N,0.0,2.420092,1578.620915,Under
73,PRF001,Unknown,White birch,N,L,5.9,,,11.500000,0.546795,N,0.0,2.837856,1955.424983,Under
74,PRF001,Unknown,White pine,N,L,5.0,,,8.300000,0.392700,N,0.0,1.528335,995.309904,Under
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12585,PRF334,Unknown,Sugar Maple,N,L,3.7,,,8.893640,0.215043,N,0.0,0.858564,851.191939,Under
12586,PRF334,Unknown,Sugar Maple,N,L,3.3,,,8.500000,0.171060,N,0.0,0.653868,659.084439,Under
12587,PRF334,Unknown,Sugar Maple,N,L,3.2,,,8.395967,0.160850,N,0.0,0.607493,615.333049,Under
12588,PRF334,Unknown,Sugar Maple,N,L,3.0,,,7.400000,0.141372,N,0.0,0.469656,521.064025,Under


Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
12585,PRF334,Unknown,Sugar Maple,N,L,3.7,,,8.89364,0.215043,N,0.0,0.858564,851.191939,Under
12586,PRF334,Unknown,Sugar Maple,N,L,3.3,,,8.5,0.17106,N,0.0,0.653868,659.084439,Under
12587,PRF334,Unknown,Sugar Maple,N,L,3.2,,,8.395967,0.16085,N,0.0,0.607493,615.333049,Under
12588,PRF334,Unknown,Sugar Maple,N,L,3.0,,,7.4,0.141372,N,0.0,0.469656,521.064025,Under
12589,PRF334,Unknown,Sugar Maple,N,L,2.7,,,7.850948,0.114511,N,0.0,0.40446,421.73074,Under




Remaining null values in the DataFrame: 

PlotName          0
TreeID            0
species        1090
Origin            0
Status            0
DBH               0
CrownClass     1964
DecayClass    12093
height            0
baha              0
codom             0
mvol              0
tvol              0
biomass           0
size              0
dtype: int64
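
Filling a numeric column with the string `'Unknown'` leaves `TreeID` holding a mix of numbers and text, which can complicate later sorting or numeric comparisons. As a hedged alternative, the IDs can be kept numeric with the pandas nullable integer type `Int64`, with a separate boolean column flagging the rows that have no ID. The column names `TreeID_int` and `TreeID_missing` are hypothetical, and the data is a toy example:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TreeID column; illustrative values only
toy_df = pd.DataFrame({"TreeID": [24.0, 46.0, np.nan, np.nan]})

# Nullable integer type: whole-number IDs become integers, missing IDs become <NA>
toy_df["TreeID_int"] = toy_df["TreeID"].astype("Int64")

# Explicit flag so rows without an ID are easy to select later
toy_df["TreeID_missing"] = toy_df["TreeID"].isna()

print(toy_df)
```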


**Question 4 - What is an appropriate strategy to account for the missing data in the CrownClass column? Provide the code and an explanation.**

*Type your Answer Here*

In [22]:
#Add your code here

In [None]:
#@title Solution 

# crown class should not be applicable for dead trees
wrong_crownclass = trees_df[(trees_df['Status'] == 'D') & (trees_df['CrownClass'].notna())]
display(wrong_crownclass)
# rows with NaN crown class for live trees
missing_crownclass_live = trees_df[(trees_df['Status'] == 'L') & (trees_df['CrownClass'].isna())]
display(missing_crownclass_live)


Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
466,PRF009,3.0,Balsam fir,N,D,10.9,C,1.0,14.258931,0.149301,Y,0.604461,0.998581,501.231106,Poles
6247,PRF124,31.0,White spruce,P,D,17.8,D,1.0,17.471853,0.398154,Y,2.634514,3.023578,1644.203417,Poles
10179,PRF212,48.0,White spruce,N,D,18.5,C,2.0,17.828463,0.430085,Y,2.914603,3.30783,1792.940402,Poles


Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
70,PRF001,Unknown,White pine,N,L,8.5,,,10.400000,1.134903,N,0.0,5.613244,3259.714610,Under
71,PRF001,Unknown,White birch,N,L,8.2,,,13.300000,1.056206,N,0.0,6.234093,4253.920594,Under
72,PRF001,Unknown,White pine,N,L,6.3,,,8.500000,0.623451,N,0.0,2.420092,1578.620915,Under
73,PRF001,Unknown,White birch,N,L,5.9,,,11.500000,0.546795,N,0.0,2.837856,1955.424983,Under
74,PRF001,Unknown,White pine,N,L,5.0,,,8.300000,0.392700,N,0.0,1.528335,995.309904,Under
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12585,PRF334,Unknown,Sugar Maple,N,L,3.7,,,8.893640,0.215043,N,0.0,0.858564,851.191939,Under
12586,PRF334,Unknown,Sugar Maple,N,L,3.3,,,8.500000,0.171060,N,0.0,0.653868,659.084439,Under
12587,PRF334,Unknown,Sugar Maple,N,L,3.2,,,8.395967,0.160850,N,0.0,0.607493,615.333049,Under
12588,PRF334,Unknown,Sugar Maple,N,L,3.0,,,7.400000,0.141372,N,0.0,0.469656,521.064025,Under


<details>
<summary>Explanation</summary>

From the description of `CrownClass`, we know that it records the crown class of live, numbered trees, so it only applies when `Status` is `'L'` (live). Dropping the column because of its missing values would **lose a valuable forest health indicator**. Imputing it indiscriminately (e.g., filling every missing value with the mode or a placeholder) would assign a crown class to dead trees, for which it is not applicable.

The appropriate strategy is therefore to leave `CrownClass` missing for dead trees (`Status` is `'D'`) and to impute only live trees (`Status` is `'L'`) that lack a value. The first table above also lists the dead trees that have a crown class recorded, which may need to be reviewed.

</details>
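
A conditional fill of this kind can be sketched with a boolean mask and `.loc`, so that only live trees with a missing crown class are touched. The `'Unknown'` placeholder and the toy data are assumptions made for illustration. The appropriate fill value for the PRF data is a judgment call that this sketch does not settle:

```python
import pandas as pd

# Toy stand-in for trees_df; illustrative values only
toy_df = pd.DataFrame({
    "Status": ["L", "L", "D", "D"],
    "CrownClass": ["D", None, None, "C"],
})

# Live trees with no crown class recorded
live_missing = (toy_df["Status"] == "L") & (toy_df["CrownClass"].isna())

# Fill only those rows; missing values for dead trees are left untouched
toy_df.loc[live_missing, "CrownClass"] = "Unknown"

print(toy_df)
```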

**Question 5 - What is an appropriate strategy to account for the missing data in the species column? Provide the code and an explanation.**

*Type your Answer Here*

In [34]:
#Add your code here

In [47]:
#@title Solution 
missing_species = trees_df[trees_df['species'].isna()]
display(missing_species)
trees_df['species'] = trees_df['species'].fillna('Unknown')
display(trees_df.head())

Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  trees_df['species'] = trees_df['species'].fillna('Unknown')


Unnamed: 0,PlotName,TreeID,species,Origin,Status,DBH,CrownClass,DecayClass,height,baha,codom,mvol,tvol,biomass,size
0,PRF001,24.0,White pine,P,D,10.1,,1.0,11.552521,0.12819,N,0.0,0.708735,393.3964,Poles
1,PRF001,46.0,White pine,P,D,9.9,,2.0,11.422529,0.123163,N,0.0,0.673254,375.305379,Poles
2,PRF001,20.0,Red pine,N,L,67.5,D,,33.9,5.725566,Y,77.327438,79.482658,39691.63995,Large
3,PRF001,50.0,Red pine,N,L,57.9,D,,32.528851,4.212773,Y,56.444281,58.117292,28251.255888,Large
4,PRF001,10.0,White pine,N,L,55.9,D,,33.0,3.926761,Y,48.008649,49.833743,24501.838779,Large


<details>
<summary>Explanation</summary>


</details>

# Outliers and Anomalies

**An introduction, why we need it and how that will be useful for any arbitrary dataset**

cleaned_tree_data
 - showing specific column of boolean outlier (add in the tree_data)
 - more plots for biomass outlier other than exercises
 - description and new data dictionary (in txt/md file) after the cleaned data is made + exp and see
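
As a rough sketch of what a boolean outlier column could look like, the block below flags values outside 1.5 times the interquartile range (IQR). The IQR rule, the column name `biomass_outlier`, and the toy values are all assumptions made for illustration. The notes above do not specify which outlier definition the tutorial intends to use.

```python
import pandas as pd

# Toy biomass values with one extreme entry; illustrative only
toy_df = pd.DataFrame({"biomass": [400.0, 420.0, 450.0, 480.0, 500.0, 5000.0]})

# Interquartile range of the column
q1 = toy_df["biomass"].quantile(0.25)
q3 = toy_df["biomass"].quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences (an assumed rule, see the lead-in)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean column: True where the value falls outside the fences
toy_df["biomass_outlier"] = ~toy_df["biomass"].between(lower, upper)

print(toy_df)
```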