# Descriptive Analytics in Python

<img src = 'https://drive.google.com/uc?id=1WC7SSdfFseYRgmZ6lbodv898zi2nquJ0' width = 80%>

# Introduction and Dataset Background

This tutorial focuses on summarizing the tree data within the sample plots. Each plot contains multiple trees, and further model development requires aggregating the tree-level measurements to the plot level.

This tutorial makes use of the Petawawa Research Forest (PRF) data, which is described in more detail on the tutorial series [GitHub site](https://github.com/subornaa/Data-Analytics-Tutorials).

## Tutorial goal

The goal of this tutorial is to first aggregate the tree-level data to the plot level, and then join this data with the plot locations so that we can associate the tree data with exact locations in the PRF.

## Dataset description

We will work with two datasets:

1) `trees.csv`

2) `plots.gpkg`

The tree-level data (`trees.csv`) is a comma-separated values (CSV) file containing tabular data. The plots data (`plots.gpkg`) is a GeoPackage file, which contains the spatial location (i.e., coordinates) of all sample plots in the PRF.

An important column in both the trees and plots data is `PlotName`, which represents each unique plot. So for example, in plot `PRF015`, there are 40 trees.

Each sample plot has a 14.1 m radius (an area of approximately 625 m²).
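
As a quick check of the plot geometry stated above, the sketch below computes the area of a circle with a 14.1 m radius and the corresponding per-hectare scaling factor. The variable names are illustrative, and the idea that per-hectare columns in the dataset were derived with this kind of scaling is an assumption, not something confirmed here.

```python
import math

PLOT_RADIUS_M = 14.1        # plot radius stated above
SQ_M_PER_HECTARE = 10_000   # 1 ha = 10,000 m^2

# Area of a circular plot
plot_area_m2 = math.pi * PLOT_RADIUS_M ** 2

# Number of plots of this size that fit in one hectare
expansion_factor = SQ_M_PER_HECTARE / plot_area_m2

print(f"Plot area: {plot_area_m2:.1f} m^2")
print(f"One tree in a plot represents about {expansion_factor:.1f} trees per hectare")
```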

# Install and load packages

In [1]:
import os
import shutil
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt

# Part 1: Download data

The following block of code retrieves the tree dataset directly from Google Drive. This approach streamlines data storage and access, making it more efficient to manage large datasets. Please note that the code relies on shell commands and is designed to work only on Unix-based systems such as macOS and Linux, including Colab. If you are running notebooks locally on a Windows device, the recommended approach is to download the dataset manually, store it on your local drive, and point this notebook to it.

In [2]:
# Download the data if it does not yet exist
if not os.path.exists("data"):
  !gdown 1UDKAdXW0h6JSf7k31PZ-srrQ3487l9e2
  !unzip prf_data.zip -d data/
  os.remove("prf_data.zip")
else:
  print("Data has already been downloaded.")

!ls data/

Downloading...
From: https://drive.google.com/uc?id=1UDKAdXW0h6JSf7k31PZ-srrQ3487l9e2
To: /content/prf_data.zip
  0% 0.00/487k [00:00<?, ?B/s]100% 487k/487k [00:00<00:00, 7.57MB/s]
Archive:  prf_data.zip
  inflating: data/trees.csv          
  inflating: data/plots.gpkg         
plots.gpkg  trees.csv
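
For readers who download `prf_data.zip` manually, the extraction step can be done without shell commands. The following is a minimal sketch using Python's standard `zipfile` module. The helper name `extract_if_missing` is hypothetical, and the sketch assumes the archive sits in the working directory.

```python
import os
import zipfile


def extract_if_missing(zip_path, out_dir):
    """Extract zip_path into out_dir unless out_dir already exists.

    Returns the sorted list of names found in out_dir.
    """
    if not os.path.exists(out_dir):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(out_dir)
    return sorted(os.listdir(out_dir))


# Only run when the manually downloaded archive (or the data folder) is present
if os.path.exists("prf_data.zip") or os.path.exists("data"):
    print(extract_if_missing("prf_data.zip", "data"))
```

Because the function checks for the output folder first, re-running the cell does not extract the archive a second time.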


Let's first take a quick look at this dataset. Load the dataset with the appropriate function and display the first 5 rows.

In [2]:
#Q1.
# trees_df = pd....("data/trees.csv")
# trees_df....()

Let's display the number of unique plots and TreeIDs in the dataset.

In [3]:
#Q2.
# How many plots are there in the dataset
# len(trees_df['...']....())

In [4]:
#Q3.
#len(trees_df['...']....())
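
As a side note on counting distinct values, the sketch below compares two common ways of doing it on an invented series (the values are made up and are not taken from the PRF data).

```python
import pandas as pd

# Invented toy values for illustration
plot_names = pd.Series(["PRF001", "PRF015", "PRF015", "PRF020"])

n_via_len = len(plot_names.unique())   # length of the array of distinct values
n_via_nunique = plot_names.nunique()   # direct count of distinct values

print(n_via_len, n_via_nunique)
```

The two counts can differ when a column has missing values, because `nunique()` ignores them by default while `unique()` keeps `NaN` as a value.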

Let's get all the trees with the plot name `PRF015`.

In [5]:
#Q4.
# Check trees in PRF015
#trees_df[trees_df['...'] == '...']

An equivalent way to write the above code is shown below. Using string-based queries is often more readable, but both methods are valid and functionally the same.

In [6]:
# Requires trees_df from Q1 to be loaded first; otherwise this raises a NameError
trees_df.query("PlotName == 'PRF015'")
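
To illustrate the equivalence of the two filtering styles, here is a small sketch on a made-up data frame. The values and the `toy_df` name are invented for illustration.

```python
import pandas as pd

# Invented toy data for illustration
toy_df = pd.DataFrame({
    "PlotName": ["PRF001", "PRF015", "PRF015", "PRF020"],
    "TreeID": [1, 2, 3, 4],
})

# Boolean-mask filtering
mask_result = toy_df[toy_df["PlotName"] == "PRF015"]

# String-based query filtering
query_result = toy_df.query("PlotName == 'PRF015'")

# Both approaches select the same rows
print(mask_result.equals(query_result))
```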

# Part 2: Summary Statistics

Let us examine a specific column in the dataset to explore potential trends. For the column `TPH_all`, calculate the maximum, minimum, median, and mean values, grouped by the `tree_spec` column.

In [None]:
#Q1.
# stats = trees_df.groupby('...')["..."].agg(['...', '...', '...', '...'])
# stats

| tree_spec | max | min | median | mean |
|---|---|---|---|---|
| 1 | 15392 | 48 | 1792.0 | 2812.875406 |
| 2 | 15392 | 96 | 2180.0 | 2053.797101 |
| 3 | 8688 | 480 | 1344.0 | 1411.899497 |
| 12 | 15392 | 48 | 1752.0 | 2127.724138 |
| 13 | 6224 | 496 | 1792.0 | 1785.444109 |
| 19 | 2968 | 1000 | 1952.0 | 1790.808511 |
| 20 | 15392 | 256 | 1944.0 | 2103.822703 |
| 22 | 6224 | 560 | 1968.0 | 3714.440678 |
| 25 | 9248 | 512 | 1864.0 | 2381.089286 |
| 30 | 5256 | 336 | 1528.0 | 1631.692308 |


Q2.

Is there anything you notice about this dataset, in particular the max values?

*Answer here*

Let's graph this column to visualize what is happening.

In [7]:
#Q3.
# sns.set(style="whitegrid")
# plt.figure(figsize=(12, 6))

# #set the boxplot and include data
# sns.boxplot(data=..., x='...', y='...', hue='tree_spec', palette='Set2', legend=False)

# #Add labels
# plt.xlabel('Tree Species', fontsize=12)
# plt.ylabel('Trees per Hectare (TPH)', fontsize=12)
# plt.title('Distribution of TPH by Tree Species', fontsize=14)

# #Rotate the labels on the x-axis for better readability
# plt.xticks(rotation=45, ha='right')

# #Display
# plt.tight_layout()
# plt.show()

As we can see, while the majority of the values fall below 6,000, there are several noticeable outliers that could significantly skew the results of many machine learning models. This raises an important question: should we remove these outliers, or include them in our analysis moving forward? The answer depends on the context and purpose of the analysis. If the outliers represent genuine observations and are relevant to the problem at hand, it may be appropriate to include them, possibly with robust modeling techniques that can handle their influence (these will be covered in later chapters). ***It is important to avoid discarding data without a valid justification. Any reduction or pruning of the dataset should be supported by clear, logical reasoning.*** For example, if outliers result from data entry errors or are not representative of the population you are studying, excluding them could be justified. If the outlying data is related to your outcome, however, more rigorous statistical methods will be needed, which will also be covered later.

Regardless of the approach you choose, the key takeaway is this: always visualize your data before drawing conclusions. Relying solely on summary statistics from earlier steps can be misleading, as they often fail to reveal the full distribution and nuances of the dataset.
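
The point about summary statistics and outliers can be made concrete with a small sketch. The numbers below are invented, not taken from the PRF data, and the 1.5 × IQR rule used to flag points is a common convention (the one behind default boxplot whiskers), not a method prescribed by this tutorial.

```python
import pandas as pd

# Invented values: mostly moderate, with one extreme observation
values = pd.Series([1200, 1500, 1800, 2000, 2200, 15000])

# The single extreme value pulls the mean well above the median
print("mean:  ", values.mean())
print("median:", values.median())

# Flag points lying more than 1.5 * IQR beyond the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(values[is_outlier].tolist())
```

Flagging a point this way only marks it for inspection. Whether to keep or drop it still depends on the reasoning discussed above.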


# Part 3: Pipes in pandas

<img src = 'https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/f/09d3ec2e-8869-461b-9550-1a06f6606c57/df8uidr-d9dba8a8-bdbb-413f-bae9-1117cfb4c567.png/v1/fill/w_1920,h_1085/mario_background_pipe_land_by_thenightcapking_df8uidr-fullview.png?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1cm46YXBwOjdlMGQxODg5ODIyNjQzNzNhNWYwZDQxNWVhMGQyNmUwIiwiaXNzIjoidXJuOmFwcDo3ZTBkMTg4OTgyMjY0MzczYTVmMGQ0MTVlYTBkMjZlMCIsIm9iaiI6W1t7ImhlaWdodCI6Ijw9MTA4NSIsInBhdGgiOiJcL2ZcLzA5ZDNlYzJlLTg4NjktNDYxYi05NTUwLTFhMDZmNjYwNmM1N1wvZGY4dWlkci1kOWRiYThhOC1iZGJiLTQxM2YtYmFlOS0xMTE3Y2ZiNGM1NjcucG5nIiwid2lkdGgiOiI8PTE5MjAifV1dLCJhdWQiOlsidXJuOnNlcnZpY2U6aW1hZ2Uub3BlcmF0aW9ucyJdfQ.AqQ8vMbMxJwQuthyHRQJLwclpU1FrzNorFUT4aRG-J0' width = 50%>


We have seen how to use basic methods in pandas on a data frame such as `.mean()` or `.sum()`. In some cases, we may want to chain together multiple methods instead of assigning new objects across multiple lines of code. Chaining multiple methods together in this style is called `piping` (also known as method chaining).

Below is a demonstration of a pipe in pandas. Note that the full pipe is wrapped in `()`, which allows the chained expression to span multiple lines.


In [8]:
#Q1.
# Group trees by PlotName and sum biomass of all trees in plot (units are kilograms per hectare, kg/ha)

# We need to convert to a more common unit of tonnes per hectare (Mg/ha) -> divide by 1000

# trees_agg_df = (trees_df.
#                 groupby('PlotName').
#                 agg(biomass_kg_ha = ('...', '...')).
#                 reset_index().
#                 assign(biomass_Mg_ha = lambda x: x['biomass_kg_ha'] / ...))

# trees_agg_df
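
To see what the chained form offers, the sketch below performs the same aggregation twice on an invented data frame: once step by step with intermediate variables, and once as a single chain. The column names `plot` and `mass_kg` and all values are made up for illustration.

```python
import pandas as pd

# Invented toy data for illustration
toy_df = pd.DataFrame({
    "plot": ["A", "A", "B", "B", "B"],
    "mass_kg": [1000.0, 2500.0, 500.0, 1500.0, 3000.0],
})

# Step-by-step version: one intermediate object per operation
grouped = toy_df.groupby("plot")
summed = grouped.agg(mass_kg=("mass_kg", "sum"))
summed = summed.reset_index()
stepwise = summed.assign(mass_Mg=summed["mass_kg"] / 1000)

# Chained version: the outer parentheses let the expression span lines
chained = (
    toy_df
    .groupby("plot")
    .agg(mass_kg=("mass_kg", "sum"))
    .reset_index()
    .assign(mass_Mg=lambda x: x["mass_kg"] / 1000)
)

print(chained)
print(stepwise.equals(chained))
```

Inside the chain, `assign` takes a `lambda` because the intermediate data frame has no name to refer to. The lambda receives it as `x`.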

In [17]:
# Load the plot locations data
plots_gdf = gpd.read_file("data/plots.gpkg")

# Need to rename plot identifier column to match trees
plots_gdf = plots_gdf.rename(columns = {'Plot': 'PlotName'})

plots_gdf.head()

In [18]:
#Q2.
# Join summarized trees df with plot locations
#biomass_gdf = plots_gdf.merge(..., on='PlotName')

#fig, ax = plt.subplots(figsize=(10, 5))

# View the total biomass at each location
# biomass_gdf.plot(
#     column='...',
#     cmap='viridis',
#     legend=True,
#     edgecolor='black',
#     linewidth=0.5,
#     ax=ax,
# )

#ax.set_title('Total Biomass per Plot (Mg/ha)', fontsize=14)


#leg = ax.get_legend()


#plt.tight_layout()
#plt.show()


With aggregations, we can answer more complex questions that are not immediately apparent from the raw dataset. The use of pipes helps simplify the necessary code, making it more readable and efficient.
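
One detail worth checking in a join like the one above is whether every plot found a match. `merge` performs an inner join by default, so plots without tree records (or tree records without a plot) are silently dropped. Below is a minimal sketch with plain pandas data frames and invented values, using the `how` and `indicator` parameters of `merge`.

```python
import pandas as pd

# Invented stand-ins for the plot locations and the aggregated tree table
plots = pd.DataFrame({"PlotName": ["PRF001", "PRF002", "PRF003"]})
agg = pd.DataFrame({
    "PlotName": ["PRF001", "PRF002", "PRF004"],
    "biomass_Mg_ha": [120.5, 98.0, 143.2],
})

# Default inner join keeps only plots present in both tables
inner = plots.merge(agg, on="PlotName")

# An outer join with indicator=True records where each row came from
check = plots.merge(agg, on="PlotName", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(len(inner))
print(unmatched[["PlotName", "_merge"]])
```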

# Part 4: Pipes, your turn

Let's try to answer this question:

"Which living tree species has the highest average height in centimeters when accounting for particular crown class and quality class?"

In [10]:
#Q1.
# agg_tree_df = (
#     trees_df
#     ....("... == '...'")
#     .groupby(['...', '...', 'QualityClass'])
#     .agg({'...': '...'})
#     .reset_index()
#     .assign(... = lambda x: x['...'] * ...)
# )
# agg_tree_df.head()

Let's create a faceted bar chart to visualize the results.

In [11]:
#Q2.
# g = sns.catplot(
#     data=agg_tree_df,
#     x='...',
#     y='...',
#     hue='QualityClass',
#     col='CrownClass',
#     kind='bar',
#     errorbar=None,
#     palette='Set2',
#     height=5,
#     aspect=1.5,
#     col_wrap=3
# )

# #Add labels and display (catplot creates its own figure, so no plt.figure() call is needed)
# g.set_axis_labels('Tree Species', 'Avg Height (cm)')
# plt.tight_layout()
# plt.show()

Q3.

Which living species has the highest average height for a particular crown class and quality class?

*Answer here*
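
Besides reading the answer off the chart, it can also be pulled from a grouped table by selecting, within each crown class and quality class, the row with the largest mean height. The sketch below uses invented values and hypothetical column names, and `idxmax` is one possible approach rather than the method intended by the exercise.

```python
import pandas as pd

# Invented aggregated table: one row per species / crown class / quality class
toy_agg = pd.DataFrame({
    "species": [1, 2, 1, 2],
    "crown": ["C", "C", "D", "D"],
    "quality": ["A", "A", "A", "A"],
    "height_cm": [2100.0, 1850.0, 1400.0, 1650.0],
})

# Row label of the tallest species within each (crown, quality) group
top_rows = toy_agg.groupby(["crown", "quality"])["height_cm"].idxmax()

# Look those rows up in the original table
tallest = toy_agg.loc[top_rows]
print(tallest)
```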

# Part 5: Covariance and Correlation

Let's switch gears now and take a look at covariance and correlation.

Below is a filtered version of the original dataset we have been working with. While the techniques demonstrated here are applicable to datasets of any size, applying them to large datasets can often make interpretation more difficult. When conducting prediction or inference, it is generally advisable to focus on a relevant subset of the data before making broader claims about the dataset as a whole.

In [12]:
#Q1.
#filtered_trees_df = ...[['BA_all', 'TPH_all','mvol', 'tvol', 'biomass']]

Please complete the code needed to generate the tables below.

In [13]:
#Q2.
# print("Covariance matrix:")
# covarience = filtered_trees_df....()
# covarience

In [14]:
#Q3.
# print("Correlation matrix:")
# correlation = filtered_trees_df....()
# correlation

Q4.

What is something you notice about the two tables? Which table is more appropriate for examining how closely two variables are related to each other, and which one is better suited for understanding the extent to which two variables vary together or differ in magnitude?

*Answer Here*

Let's try to visualize this difference with two side-by-side heatmaps. Fill in the code below.

In [15]:
#Q5.
# fig, axes = plt.subplots(1, 2, figsize=(15, 7))


# sns.heatmap(..., annot=True, fmt=".2f", cmap="Blues", ax=axes[0])
# axes[0].set_title('Covariance Matrix')

# sns....(..., annot=True, fmt=".2f", cmap="Reds", ax=axes[1])
# axes[1].set_title('Correlation Matrix')

# plt.tight_layout()
# plt.show()

In summary, covariance and correlation both measure the tendency of two variables to move together. This is especially important in machine learning, where identifying and selecting features that are highly correlated with the target (and removing less significant ones) can improve model accuracy and help reduce overfitting. We will explore these topics further later on.

Correlation is calculated in a way that makes it scale-invariant, meaning it is unaffected by the units of the variables. This makes it particularly useful for understanding the strength and direction of a relationship between features.

Covariance, on the other hand, reflects the direction of the linear relationship but not its strength, and it is sensitive to scale. While both metrics provide insight into relationships between variables, correlation is generally more useful for feature selection.
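
The scale sensitivity described above can be checked directly. In the sketch below (invented values, hypothetical column names), the same height measurements are expressed in meters and in centimeters: the covariance with diameter changes by a factor of 100, while the correlation is unchanged. The last lines recover the correlation from the covariance and the two standard deviations, which is the standard definition of the Pearson correlation.

```python
import pandas as pd

# Invented toy measurements for illustration
toy = pd.DataFrame({
    "dbh_cm": [10.0, 15.0, 20.0, 30.0, 40.0],
    "height_m": [8.0, 11.0, 15.0, 19.0, 24.0],
})
toy["height_cm"] = toy["height_m"] * 100  # same heights, different unit

cov_m = toy["dbh_cm"].cov(toy["height_m"])
cov_cm = toy["dbh_cm"].cov(toy["height_cm"])
corr_m = toy["dbh_cm"].corr(toy["height_m"])
corr_cm = toy["dbh_cm"].corr(toy["height_cm"])

print("covariance ratio (cm vs m):", cov_cm / cov_m)
print("correlations:", corr_m, corr_cm)

# Correlation = covariance divided by the product of standard deviations
manual_corr = cov_m / (toy["dbh_cm"].std() * toy["height_m"].std())
print("manual correlation:", manual_corr)
```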

# Solutions:

### Part 1:

In [None]:
#Q1.
trees_df = pd.read_csv("data/trees.csv")
trees_df.head()

In [None]:
#Q2.
# How many plots are there in the dataset
len(trees_df['PlotName'].unique())

In [None]:
#Q3.
len(trees_df['TreeID'].unique())

In [None]:
#Q4.
# Check trees in PRF015
trees_df[trees_df['PlotName'] == 'PRF015']

### Part 2:

In [None]:
#Q1.
stats = trees_df.groupby('tree_spec')["TPH_all"].agg(['max', 'min', 'median', 'mean'])
stats

Q2.

The maximum values are substantially higher than the other summary statistics. For many tree species, the mean is significantly greater than the median, indicating a right-skewed distribution. This pattern is typically caused by a small number of extremely large values that pull the average upward. Therefore, any analysis involving the distribution of this variable should carefully account for these outliers, as failing to do so may introduce bias into the results.

In [None]:
#Q3.
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))

#set the boxplot and include data
sns.boxplot(data=trees_df, x='tree_spec', y='TPH_all', hue='tree_spec', palette='Set2', legend=False)

#Add labels
plt.xlabel('Tree Species', fontsize=12)
plt.ylabel('Trees per Hectare (TPH)', fontsize=12)
plt.title('Distribution of TPH by Tree Species', fontsize=14)

#Rotate the labels on the x-axis for better readability
plt.xticks(rotation=45, ha='right')

#Display
plt.tight_layout()
plt.show()

### Part 3:

In [None]:
#Q1.
trees_agg_df = (trees_df.
                groupby('PlotName').
                agg(biomass_kg_ha = ('biomass', 'sum')).
                reset_index().
                assign(biomass_Mg_ha = lambda x: x['biomass_kg_ha'] / 1000))

trees_agg_df

In [None]:
#Q2.
# Join summarized trees df with plot locations
biomass_gdf = plots_gdf.merge(trees_agg_df, on='PlotName')

fig, ax = plt.subplots(figsize=(10, 5))

# View the total biomass at each location
biomass_gdf.plot(
    column='biomass_Mg_ha',
    cmap='viridis',
    legend=True,
    edgecolor='black',
    linewidth=0.5,
    ax=ax,
)

ax.set_title('Total Biomass per Plot (Mg/ha)', fontsize=14)


leg = ax.get_legend()


plt.tight_layout()
plt.show()

### Part 4:

In [None]:
#Q1.
agg_tree_df = (
    trees_df
    .query("Status == 'L'")
    .groupby(['TreeSpec', 'CrownClass', 'QualityClass'])
    .agg({'ht_meas': 'mean'})
    .reset_index()
    .assign(ht_meas = lambda x: x['ht_meas'] * 100)
)
agg_tree_df.head()

In [None]:
#Q2.
g = sns.catplot(
    data=agg_tree_df,
    x='TreeSpec',
    y='ht_meas',
    hue='QualityClass',
    col='CrownClass',
    kind='bar',
    errorbar=None,
    palette='Set2',
    height=5,
    aspect=1.5,
    col_wrap=3
)

# Customize the plot (catplot creates its own figure, so no plt.figure() call is needed)
g.set_axis_labels('Tree Species', 'Avg Height (cm)')
plt.tight_layout()
plt.show()

Q3.

- Tree species 70 for crown class C

- Tree species 74 for crown class D and quality class U; tree species 1 for crown class D and quality class A

- Tree species 1 for crown class E

- Tree species 70 for crown class I

- Tree species 58 for crown class OS and quality class U; tree species 2 for crown class OS and quality class A

- Tree species 45 for crown class A

### Part 5:

In [None]:
#Q1.
filtered_trees_df = trees_df[['BA_all', 'TPH_all','mvol', 'tvol', 'biomass']]

In [None]:
#Q2.
print("Covariance matrix:")
covarience = filtered_trees_df.cov()
covarience

In [None]:
#Q3.
print("Correlation matrix:")
correlation = filtered_trees_df.corr()
correlation

Q4.

The values in the correlation matrix lie between -1 and 1, so it is best suited for examining how closely two variables are related to each other. The values in the covariance matrix vary in magnitude with the units of the variables, so it is better suited for understanding the extent to which two variables vary together.

In [None]:
#Q5.
fig, axes = plt.subplots(1, 2, figsize=(15, 7))


sns.heatmap(covarience, annot=True, fmt=".2f", cmap="Blues", ax=axes[0])
axes[0].set_title('Covariance Matrix')

sns.heatmap(correlation, annot=True, fmt=".2f", cmap="Reds", ax=axes[1])
axes[1].set_title('Correlation Matrix')

plt.tight_layout()
plt.show()