
<div style="position: relative; text-align: center;">
    <img src="img/earth_thin.png" alt="drawing" style="width:100%;">
    <div style="position: absolute; top: 50%; left: 52%; transform: translate(-50%, -50%); color: white;">
        <h1 style="text-align: left">Screening Interview</h1>
        <h2 style="text-align: left">Global Canopy Data Researcher</h2>
        <h3 style="text-align: left">Ian Goodrich</h3>
        <h4 style="text-align: left">November 2024</h4>
    </div>
</div>



## What is ENCORE?
ENCORE maps the ways in which human economic activity depends on and impacts the natural world, and is intended to support financial institutions in decision-making and reporting.

It is as much a framework as a dataset, making explicit the links between economic activities and the natural world. The data contained in the ENCORE dataset is derived from a literature review and qualitative scoring process.

## Key concepts
- ENCORE uses the International Standard Industrial Classification of All Economic Activities (ISIC) as a schema for **economic activities**.
- It maps **dependencies** of economic activities on **ecosystem services**, which are classified according to the UN System of Environmental-Economic Accounting: Ecosystem Accounting (SEEA EA).
- It also maps **pressures** exerted by economic activities on ecosystems. Pressures are classified in line with the Driver-Pressure-State-Impact-Response (DPSIR) framework, using a modified version of the Natural Capital Protocol (2016).
- Relationships have **materiality scores** indicating the extent to which an economic activity depends on or pressures the natural world.



## Goals
- Build familiarity with the ENCORE dataset and Global Canopy's work
- Use ENCORE data to generate an insight into the relationship between human activity and the natural world
- Visualise the relationships in the data in a compelling way

## Data preparation
- I do some initial eyeballing and manual cleaning in LibreOffice Calc, deleting leading rows and fixing a CSV encoding bug that caused issues when importing into Python. The cleaned data is in `data/encore/clean`.
- Materiality data is presented in "wide" format, with the first six columns containing data on the economic activities, and the remaining columns containing the materiality scores for each ecosystem service/pressure. I convert it to "long" format, so that each row represents a single relationship.
- Scores are presented as strings ("VL", "L", "M", "H", "VH"), which I convert to evenly spaced numerical values between 0 (VL) and 1 (VH) to allow for quantitative analysis. Ratings of "ND" and "N/A" are treated as missing and dropped.
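
To make the reshaping concrete, here's a minimal sketch on a made-up two-row table. The section names, column names and ratings are invented for illustration and are not taken from ENCORE. It applies the same wide-to-long `melt` and score mapping as the cell below, at a size that's easy to inspect.

```python
import pandas as pd

# Made-up wide table: one row per activity, one column per ecosystem service
wide = pd.DataFrame(
    {
        "section": ["Agriculture", "Mining"],
        "Water supply": ["VH", "M"],
        "Pollination": ["H", "ND"],
    }
)

# Same mapping as used in this notebook; "ND" and "N/A" become missing
score_map = {"VL": 0, "L": 0.25, "M": 0.5, "H": 0.75, "VH": 1, "ND": None, "N/A": None}

# Wide -> long: each row is now a single activity/service relationship
long_df = wide.melt(id_vars=["section"], var_name="ecol", value_name="rating")
long_df["rating"] = long_df["rating"].map(score_map)

# Drop relationships with no usable rating
long_df = long_df.dropna(subset=["rating"]).reset_index(drop=True)

print(long_df)
```

The "ND" cell disappears after `dropna`, so the long table only holds relationships that actually have a score.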

In [1]:
import pandas as pd
import statsmodels.api as sm
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Define file paths for reading
file_paths = [
    {"type": "dependency", "path": "data/encore/clean/06_dependency_mat_ratings.csv"},
    {"type": "pressure", "path": "data/encore/clean/07_pressure_mat_ratings.csv"},
]

# Read in data
out = []
for file in file_paths:
    df = pd.read_csv(file["path"])

    # Rename columns
    df.columns = [
        (
            i.replace("ISIC", "").strip().replace(" ", "_").lower()
            if i in df.columns[:6]
            else i
        )
        for i in df.columns
    ]
    df.rename(columns={"level_used_for_analysis": "level"}, inplace=True)

    # Reshape to long format
    df = df.melt(
        id_vars=df.columns[:6],
        value_vars=df.columns[6:],
        var_name="ecol",
        value_name="rating",
    )

    # Map scores to numbers for analysis. The spacing is a bit arbitrary; I've decided "ND" and "N/A" are null
    score_map = {
        "VL": 0,
        "L": 0.25,
        "M": 0.5,
        "H": 0.75,
        "VH": 1,
        "ND": None,
        "N/A": None,
    }
    df["rating"] = df["rating"].map(score_map)

    # Drop rows with no rating
    df.dropna(subset=["rating"], inplace=True)

    # Set type
    df["relationship_type"] = file["type"]

    # We'll want to sort things by section later, so taking the stem of the code
    df["code_stem"] = df["unique_code"].str[0]

    out.append(df)

# Main df, with a fresh index so row labels are unique across both files
df = pd.concat(out, ignore_index=True)

## Scoring activities
- The ENCORE analysis produced materiality scores at different levels of the ISIC hierarchy (either "Group" or "Class"). For my analysis, I label each activity at the lowest available level and group activities by the highest level, "Section".
- I calculate the mean materiality score for dependencies and pressures for each economic activity. This tells us, on average, across all relationships, which sectors are more dependent and which exert the most pressure.

In [2]:
# Get lowest level of analysis
df["analysis"] = df["class"].fillna(df["group"])


# Group scores with mean rating for plotting
group_scores = (
    df.groupby(["code_stem", "section", "analysis", "level", "relationship_type"])[
        "rating"
    ]
    .mean()
    .unstack()
    .reset_index()
    .sort_values("code_stem")
)

# I also shorten the really long section title for plotting
group_scores["section"] = group_scores["section"].str.replace(
    "Activities of households as employers; undifferentiated goods- and services-producing activities of households for own use",
    "Household activities",
    regex=False,  # Match the title literally rather than as a regex pattern
)

## Examining relationship between pressure and dependency
I take a quick look to see if there's any correlation between pressure and dependency.

There appears to be a positive linear relationship, so I fit a simple linear model and use its predictions to draw a trend line on the plot.

In [3]:
print("Positive correlation between dependency and pressure")
print("\n", group_scores[["dependency", "pressure"]].corr())

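# Run it through a linear model for plotting.
# Missing means are filled with 0 so every activity gets a fitted value. This treats
# "no score" as "very low", which is why the R-squared below is lower than the square
# of the pairwise correlation printed above (corr() skips rows with a missing value).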
x = group_scores["dependency"].fillna(0)
y = group_scores["pressure"].fillna(0)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
line = model.predict(X)

print("\nModel summary")
print(model.summary())

Positive correlation between dependency and pressure

 relationship_type  dependency  pressure
relationship_type                      
dependency           1.000000  0.694048
pressure             0.694048  1.000000

Model summary
                            OLS Regression Results                            
Dep. Variable:               pressure   R-squared:                       0.337
Model:                            OLS   Adj. R-squared:                  0.334
Method:                 Least Squares   F-statistic:                     139.6
Date:                Tue, 26 Nov 2024   Prob (F-statistic):           2.51e-26
Time:                        10:28:49   Log-Likelihood:                 126.06
No. Observations:                 277   AIC:                            -248.1
Df Residuals:                     275   BIC:                            -240.9
Df Model:                           1                                         
Covariance Type:            nonrobust                      
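
A quick sanity check on the numbers above: for a one-variable regression with an intercept, R-squared equals the squared Pearson correlation, so r = 0.694 would imply R-squared of about 0.48, not 0.337. The gap most likely comes from the two calculations using different rows, since `.corr()` skips missing values while the model fills them with 0. The sketch below illustrates the effect on made-up numbers (not the ENCORE data), and swaps in NumPy's `polyfit` for statsmodels to keep it self-contained.

```python
import numpy as np
import pandas as pd

# Made-up scores, NOT the ENCORE data; NaN marks an activity with no score
toy = pd.DataFrame(
    {
        "dependency": [0.1, 0.3, 0.5, 0.7, 0.9, np.nan, 0.6],
        "pressure": [0.2, 0.3, 0.6, 0.6, 0.9, 0.8, np.nan],
    }
)


def r_squared(x, y):
    """R-squared of a one-variable least-squares fit with an intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()


# Pairwise correlation: rows with a missing value are skipped
r = toy["dependency"].corr(toy["pressure"])

# Fit on complete rows only, then on rows with missing values set to 0
complete = toy.dropna()
filled = toy.fillna(0)

r2_complete = r_squared(complete["dependency"], complete["pressure"])
r2_filled = r_squared(filled["dependency"], filled["pressure"])

print(f"r^2 from corr():       {r**2:.3f}")
print(f"R^2, complete rows:    {r2_complete:.3f}")
print(f"R^2, missing set to 0: {r2_filled:.3f}")
```

On the complete rows the two measures agree; filling with 0 adds artificial points on the axes and pulls the fit away from them. Whether dropping or filling is the better choice for ENCORE depends on what a missing score is taken to mean.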

In [4]:
fig = px.scatter(
    group_scores,
    x="dependency",
    y="pressure",
    color="section",
    hover_name="analysis",
    hover_data=["level"],
)


fig.add_trace(
    go.Scatter(
        x=x,
        y=line,
        mode="lines",
        name="OLS",
        line=dict(color="white", width=2, dash="dash"),
    )
)

# Axis labels
fig.update_xaxes(title_text="Mean dependency")
fig.update_yaxes(title_text="Mean pressure")

# Lock layout
fig.layout.xaxis.fixedrange = True
fig.layout.yaxis.fixedrange = True


# Add a title
fig.update_layout(
    title="<b>Human Activity: Dependency vs Pressure</b><br>Mean Ratings by ISIC Group/Class used for analysis"
)

# Dark theme
fig.update_layout(template="plotly_dark")

# Reduce font size for legend and set legend title
fig.update_layout(legend=dict(font=dict(size=10)), legend_title_text="Section")

# Size
fig.update_layout(width=800, height=600, autosize=False)

# Show without the plotly toolbar
fig.show(config={"displayModeBar": False})

# export to html
pio.write_html(fig, "charts/dependency_pressure.html", config={"displayModeBar": False})

## Findings
- There is a positive correlation (r ≈ 0.69) between mean dependency and mean pressure scores, i.e., the activities that depend most strongly on nature also tend to exert the most pressure on it.
- The category of activity with the strongest dependencies and pressures is very clearly "Agriculture, forestry and fishing".
- Unsurprisingly, "Mining and quarrying" and "Manufacturing" also score highly on pressures, with weaker, though often still substantial, dependencies.

## Visualising links
I wanted to find a way to quickly and easily visualise the various dependencies and pressures related to each category of economic activity. I started off with a traditional network plot, but found myself spending too much time optimising layout.

Conceptualising the relationships as flows, from what activities rely on to how they pressure ecosystems, I decided on a Sankey plot.

In [None]:
# It gets a bit messy this, I'm used to building networks from the ground up,
# I think there's a better way with pandas series, but this works

# Initialise some objects we'll need
nodes = {}
edges = {
    "source": [],
    "target": [],
    "value": [],
    "label": [],
    "color": [],
}

# Plotly wants unique node ids rather than labels to id links
node_id = 0

# Loop through the data and build the nodes and edges
for _, row in (
    df.groupby(["section", "ecol", "relationship_type"])["rating"]
    .mean()  # Again, we're working on the section mean for ease of plotting and interpretation
    .reset_index()
    .iterrows()
):
    if row["section"] not in nodes:
        # Basic node data for economic activities
        nodes[row["section"]] = {}
        nodes[row["section"]]["id"] = node_id
        nodes[row["section"]]["label"] = row["section"]
        nodes[row["section"]]["type"] = "Economic Activity"

        # Each new node bumps the id
        node_id += 1

    if row["ecol"] not in nodes:
        # Pretty much the same as above
        nodes[row["ecol"]] = {}
        nodes[row["ecol"]]["id"] = node_id
        nodes[row["ecol"]]["label"] = row["ecol"]
        nodes[row["ecol"]]["type"] = row[
            "relationship_type"
        ].title()  # They're lowercase in the data

        # As above
        node_id += 1

    # Add the edge data, order is important here, dependency is the source and activity is the target
    if row["relationship_type"] == "dependency":
        edges["source"].append(nodes[row["ecol"]]["id"])
        edges["target"].append(nodes[row["section"]]["id"])

    # But if it's pressure, we want to reverse the order
    else:
        edges["source"].append(nodes[row["section"]]["id"])
        edges["target"].append(nodes[row["ecol"]]["id"])

    # Add the rating and label to our edge data
    edges["value"].append(row["rating"])
    edges["label"].append(row["relationship_type"].title())

# Positioning is important: plotly wants to arrange things in a way that puts
# the most connected nodes in the middle, but I want to keep the economic activities in the middle.


# This manually sets the horizontal position of the nodes
def horizontal_position(node):
    if node["type"] == "Economic Activity":
        return 0.5
    elif node["type"] == "Dependency":
        return 0
    else:
        return 1


# The plot itself
fig = go.Figure(
    data=[
        go.Sankey(
            arrangement="snap",
            node=dict(
                # Most of this is just defaults from the docs
                pad=15,
                thickness=20,
                line=dict(color="black", width=0.5),
                label=[node["label"] for node in nodes.values()],
                # Here we do the positioning with the function above
                x=[horizontal_position(node) for node in nodes.values()],
                # Plotly ignores xs if ys not specified, so I just set them all to 0.1
                y=[0.1 for node in nodes.values()],
            ),
            link=dict(
                # Sources and targets
                source=edges["source"],
                target=edges["target"],
                value=edges["value"],
                # Styling, a nice hard white for the hover and a very mellow and transparent green for the fill
                color="rgba(119, 221, 119, 0.1)",
                hovercolor="white",
                hovertemplate="%{source.label}<br/> -> %{target.label}<extra></extra>",
            ),
        )
    ]
)

# Add a title
fig.update_layout(
    title="<b>Dependency to pressure flows</b><br>Mean Ratings by ISIC Section"
)

# Dark theme
fig.update_layout(template="plotly_dark")

# Size
fig.update_layout(width=1400, height=768)

# Show without the plotly toolbar
fig.show(config={"displayModeBar": False})

# export to html
pio.write_html(
    fig, "charts/dependency_pressure_sankey.html", config={"displayModeBar": False}
)
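
The node and edge bookkeeping above could probably be condensed with pandas, as hinted at in the cell's opening comment. The following is only a sketch of that idea on made-up section means (the labels and ratings are invented, and the column names `source_label` and `target_label` are my own). It picks the flow direction with `Series.where` and assigns one integer id per unique label.

```python
import pandas as pd

# Made-up section-level means, NOT the ENCORE data
links = pd.DataFrame(
    {
        "section": ["Agriculture", "Agriculture", "Mining", "Mining"],
        "ecol": ["Water supply", "Emissions of GHG", "Water supply", "Emissions of GHG"],
        "relationship_type": ["dependency", "pressure", "dependency", "pressure"],
        "rating": [1.0, 0.75, 0.5, 1.0],
    }
)

# Dependencies flow ecosystem -> activity; pressures flow activity -> ecosystem
is_dep = links["relationship_type"] == "dependency"
links["source_label"] = links["ecol"].where(is_dep, links["section"])
links["target_label"] = links["section"].where(is_dep, links["ecol"])

# One integer id per unique label, shared by sources and targets
labels = pd.unique(links[["source_label", "target_label"]].to_numpy().ravel())
node_id = pd.Series(range(len(labels)), index=labels)

edges = {
    "source": links["source_label"].map(node_id).tolist(),
    "target": links["target_label"].map(node_id).tolist(),
    "value": links["rating"].tolist(),
}

print(list(labels))
print(edges)
```

Here `labels` could be passed as the node labels and `edges` as the link data. The node types used for horizontal positioning would still need a separate lookup, so the loop-based version may remain easier to follow.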

## Wrapping up

This was a pretty quick exercise. The difficult bits were:
- Getting my head around what ENCORE is and does.
- Dealing with some of plotly's more obscure features. Sankey plots are pretty new to me, so that was particularly interesting.
- An ill-fated attempt to get this notebook converted to a reveal.js-based presentation. Doable, but dealing with all the weird little visual bugs was a bit much for a screening call. Another time, perhaps.

## Possible continuation
If I were to continue working with this data I would:
- Speak to end-users before doing _anything_.
- Fully clean and restructure the CSV data published on [encorenature.org](https://encorenature.org), for example:
    - Consolidate descriptive sheets (e.g., Dependency links) with quantitative sheets (e.g., Dependency materiality ratings) using a "long" format.
    - Ensure all entities throughout the database have a unique reference to simplify merges and joins.
    - Consider options for publication in machine-readable format (e.g., JSON).
    - Validate hyperlinks in the references, several were broken.
- Incorporate more descriptive data into plots, for example providing full descriptive detail of the nature of a dependency/pressure in the scatter.
- Use Dash or D3 to make the plots fully interactive, for example allowing traversal between ISIC levels, filtering, etc.
- Examine options for making further use of the geodata hosted on [encorenature.org](https://encorenature.org).
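
As a rough illustration of the restructuring ideas above, the sketch below joins two tiny tables on a shared key and writes the result as JSON records. Everything in it is hypothetical: the `activity_id` column, the identifier values and the ratings are invented for the example and do not reflect how ENCORE is actually structured.

```python
import json

import pandas as pd

# Hypothetical identifiers and values, for illustration only
activities = pd.DataFrame(
    {"activity_id": ["A_1", "B_5"], "section": ["Agriculture", "Mining"]}
)
ratings = pd.DataFrame(
    {
        "activity_id": ["A_1", "A_1", "B_5"],
        "ecol": ["Water supply", "Pollination", "Water supply"],
        "relationship_type": ["dependency", "dependency", "dependency"],
        "rating": ["VH", "H", "M"],
    }
)

# A shared key makes the join explicit; validate raises if an id is duplicated
merged = ratings.merge(activities, on="activity_id", how="left", validate="many_to_one")

# One JSON object per relationship
records = json.loads(merged.to_json(orient="records"))
print(json.dumps(records[0], indent=2))
```

With unique references in place, descriptive and quantitative sheets could be combined the same way before export.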