# CDP: Hybrid Metrics to Evaluate Corporate Sustainability

[Hybrid Metrics](https://www.sharedvalue.org/resource/hybrid-metrics/) is an approach to evaluating companies with metrics that balance financial and non-financial performance.  
In this notebook, I combine CO2 emissions and [EBITDA](https://corporatefinanceinstitute.com/resources/knowledge/finance/what-is-ebitda/) to evaluate the sustainability of companies.  

![image](https://i.imgur.com/n4ybHol.png)

The amount of CO2 emission is one of the non-financial indicators.

If two companies earn the same revenue and one has low carbon emissions while the other does not (e.g. one sells electric cars and the other sells diesel cars), the former is considered the more sustainable company.    
This is because, if regulation of CO2 emissions becomes stricter, the former company can adapt easily and has a chance to expand its sales (these are defined as climate-related risks and opportunities in [TCFD](https://assets.bbhub.io/company/sites/60/2020/10/FINAL-2017-TCFD-Report-11052018.pdf)).



**EBITDA / CO2 emission** is a simple hybrid metric that balances financial earnings against non-financial external impact. 

![images](https://i.imgur.com/WUq8qUk.png)
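
To make the ratio concrete, here is a minimal sketch with two made-up companies that earn the same EBITDA but emit different amounts of CO2. The company names, figures and units are hypothetical and are not taken from CDP or SimFin.

```python
import pandas as pd

# Hypothetical figures for illustration only; not taken from CDP or SimFin.
companies = pd.DataFrame({
    "company": ["EV maker", "Diesel maker"],
    "ebitda": [500.0, 500.0],      # same earnings (e.g. million USD)
    "emissions": [100.0, 400.0],   # CO2 emissions (e.g. kilotonnes)
})

# Hybrid metric: earnings generated per unit of CO2 emitted.
companies["hybrid_metric"] = companies["ebitda"] / companies["emissions"]
print(companies.sort_values("hybrid_metric", ascending=False))
```

With equal earnings, the company with lower emissions gets the higher score (5.0 versus 1.25 in this toy case).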

There is still room for improvement, but it is a good first step.


## Prepare the data

* Financial data: I published a dataset of annual financial data for integration with CDP data: [Annual Financial Data For Hybrid Metrics](https://www.kaggle.com/takahirokubo0/annual-financial-data-for-hybrid-cdp-kpi)
* Non-financial (emission) data: extracted from CDP data as described in [CDP: Extract Emissions from Corporate Responses](https://www.kaggle.com/takahirokubo0/cdp-extract-emissions-from-corporate-responses)


In [None]:
!pip install simfin

In [None]:
import os
import re
import json
import pandas as pd
import numpy as np
import altair as alt
import simfin as sf
from simfin.names import *

Load the financial data from [Annual Financial Data For Hybrid Metrics](https://www.kaggle.com/takahirokubo0/annual-financial-data-for-hybrid-cdp-kpi).

In [None]:
FINANCIAL_ROOT = "../input/annual-financial-data-for-hybrid-cdp-kpi/cdp_financial_data.csv"
f_df = pd.read_csv(FINANCIAL_ROOT)
f_df.head(5)

Build the emission data from the CDP responses. For details, please refer to [CDP: Extract Emissions from Corporate Responses](https://www.kaggle.com/takahirokubo0/cdp-extract-emissions-from-corporate-responses).

In [None]:
RESPONSE_ROOT = "../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses"
YEARS = (2018, 2019, 2020)
cl_dfs = {}

for year in YEARS:
    kind = "Climate Change"
    file_name = "{}_Full_{}_Dataset.csv".format(year, kind.replace(" ", "_"))
    path = "{}/{}/{}".format(RESPONSE_ROOT, kind, file_name)
    df = pd.read_csv(path)
    cl_dfs[year] = df
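
As a quick sanity check of the naming convention used in the loop above, the following sketch only builds and prints the paths without reading any file. It reuses the same root and pattern, so it assumes nothing beyond the cell above.

```python
# Same root and naming pattern as the loading cell above; no file is read here.
RESPONSE_ROOT = "../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses"
kind = "Climate Change"

paths = {}
for year in (2018, 2019, 2020):
    file_name = "{}_Full_{}_Dataset.csv".format(year, kind.replace(" ", "_"))
    paths[year] = "{}/{}/{}".format(RESPONSE_ROOT, kind, file_name)

for year, path in paths.items():
    print(year, path)
```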

In [None]:
def extract_c6_emissions(year_df):
    """
    Extract Scope1, Scope2 and Scope3 emissions from C6.
    """
    structure = {
        "C6.1": {
            "column_name": "Scope1",
            "column_number": 1,
            "row_number": 1
        },
        "C6.3": {
            "column_name": ["Scope2-location", "Scope2-market"],
            "column_number": [1, 2],
            "row_number": 1
        },
        "C6.5": {
            "column_name": ["Scope3"],
            "column_number": 2
        }
    }
    
    items = ["account_number", "organization", "survey_year",
             "question_number", "column_number", "row_number",
             "table_columns_unique_reference", "response_value"]
    
    c6_emissions = []
    for target_number in structure:
        location = structure[target_number]
        df = year_df[year_df["question_number"] == target_number]
        
        # Select columns
        columns = location["column_number"]
        columns = columns if isinstance(columns, list) else [columns]
        for i, c in enumerate(columns):
            name = location["column_name"]
            name = name if isinstance(name, str) else name[i]
            selected = df[df["column_number"] == c]
            # Copy so the assignments below do not trigger SettingWithCopyWarning
            selected = selected[items].copy()
            
            # Filter by rows
            if "row_number" in location:
                r = location["row_number"]
                selected = selected[selected["row_number"] == r]
            
            # Preprocess response value
            selected["response_value"] = pd.to_numeric(selected["response_value"], errors="coerce")
            selected = selected.dropna(subset=["response_value"])
            selected["scope"] = pd.Series([name] * len(selected), index=selected.index)
            c6_emissions.append(selected)
        
    c6_emissions = pd.concat(c6_emissions)
    items.append("scope")
    items.remove("row_number")
    c6_emissions = c6_emissions.groupby(items).sum().reset_index()
    c6_emissions.rename(columns={"response_value": "emissions"}, inplace=True)
    
    return c6_emissions
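
The core of the extraction is a filter on question, column and row, followed by a numeric conversion. Here is a minimal sketch of that idea on a tiny hand-made table. The rows and values are hypothetical and only mimic the long layout of the CDP responses.

```python
import pandas as pd

# Hypothetical rows in the long CDP response layout (values are made up).
toy = pd.DataFrame({
    "account_number": [1, 1, 1, 2, 2],
    "question_number": ["C6.1", "C6.3", "C6.3", "C6.1", "C6.1"],
    "column_number": [1, 1, 2, 1, 1],
    "row_number": [1, 1, 1, 1, 2],
    "response_value": ["1200", "300", "250", "Question not applicable", "90"],
})

# Scope 1 is read from C6.1, column 1, row 1 (same location as in the function above).
scope1 = toy[(toy["question_number"] == "C6.1")
             & (toy["column_number"] == 1)
             & (toy["row_number"] == 1)].copy()

# Free-text answers cannot be parsed, become NaN and are dropped.
scope1["response_value"] = pd.to_numeric(scope1["response_value"], errors="coerce")
scope1 = scope1.dropna(subset=["response_value"])
print(scope1[["account_number", "response_value"]])
```

Only account 1 survives here: account 2 answered row 1 with text, and its row 2 is outside the selected location.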

Merge the financial and non-financial data.

In [None]:
def make_dataset(year_dfs, f_df):
    """
    Make financial & non-financial dataset.
    """
    
    emissions = []
    for year in year_dfs:
        e_df = extract_c6_emissions(year_dfs[year])
        e_df["survey_year"] = year
        pivot = e_df.pivot_table(index=["account_number", "survey_year"], columns="scope", values="emissions")
        pivot = pivot.reset_index()
        pivot.fillna(0, inplace=True)
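        # Note: location-based and market-based figures are two accountings of the same Scope 2
        # emissions, so this sum counts Scope 2 twice when a company reports both.
        pivot["emissions"] = pivot["Scope1"] + pivot["Scope2-location"] + pivot["Scope2-market"] + pivot["Scope3"]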
        emissions.append(pivot)
    
    emissions = pd.concat(emissions)
    df = emissions.merge(f_df, how="inner", on=["account_number", "survey_year"], suffixes=("_emission", None))    
    return df

df = make_dataset(cl_dfs, f_df)
df.head(5)

In [None]:
print("Number of records is {}.".format(len(df)))
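
The merge step can be pictured on toy data: emissions are reshaped from one row per scope to one column per scope, and then joined with the financial table on account and year. All accounts and numbers below are hypothetical, and only two scopes are used to keep the sketch short.

```python
import pandas as pd

# Hypothetical long-format emissions (one row per account, year and scope).
long_df = pd.DataFrame({
    "account_number": [1, 1, 2],
    "survey_year": [2019, 2019, 2019],
    "scope": ["Scope1", "Scope3", "Scope1"],
    "emissions": [100.0, 40.0, 70.0],
})

# Long to wide: one column per scope; scopes a company did not report become 0.
wide = long_df.pivot_table(index=["account_number", "survey_year"],
                           columns="scope", values="emissions").reset_index()
wide = wide.fillna(0)
wide["emissions"] = wide["Scope1"] + wide["Scope3"]

# Hypothetical financial table: account 3 has no emissions, account 2 has no financials.
fin = pd.DataFrame({
    "account_number": [1, 3],
    "survey_year": [2019, 2019],
    "EBITDA": [500.0, 900.0],
})

# The inner join keeps only accounts present on both sides.
merged = wide.merge(fin, how="inner", on=["account_number", "survey_year"])
print(merged)
```

Because the join is an inner join, the number of records printed above is limited by the overlap between the two sources, not by the size of either one.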

## Visualize the Components of the Hybrid Metrics

Let's visualize the relation between EBITDA and emissions before calculating the metrics.

In [None]:
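def clip(s, lower_th=1, upper_th=99):
    """
    Clip the values of a series to its [lower_th, upper_th] percentile range.
    NaN values are ignored when computing the percentiles.
    """
    # np.percentile returns NaN if the series contains NaN, which disables clipping
    _lower, _upper = np.nanpercentile(s, [lower_th, upper_th])
    return np.clip(s, _lower, _upper)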


def f_normalize(s):
    """Standardize a series to zero mean and unit variance (z-score)."""
    return (s - np.mean(s)) / np.std(s)
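
To see what these two helpers do, here is a small sketch on a hypothetical long-tailed sample. It mirrors the same formulas rather than calling the helpers, and it uses a tighter upper percentile than the default so that the effect is visible on five values.

```python
import numpy as np
import pandas as pd

# Hypothetical long-tailed sample: four small values and one extreme one.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])

# Percentile clipping (same idea as clip, with a tighter upper bound for the demo).
lower, upper = np.percentile(s, [0, 75])
clipped = s.clip(lower, upper)

# Z-score normalization (same formula as f_normalize; population std, ddof=0).
z = (s - s.mean()) / s.std(ddof=0)

print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 4.0]
print(z.round(3).tolist())
```

Clipping caps the outlier at the chosen percentile, while the z-score only shifts and rescales the values, so the outlier still dominates after normalization.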

In [None]:
pd.concat([clip(df[EBITDA]), clip(df["emissions"])], axis=1).hist(bins=10)

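Both distributions are long-tailed, so we apply a log scale. Since the logarithm is only defined for positive values, records with non-positive EBITDA or emissions are excluded first.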

In [None]:
df_without_minus = df[(df[EBITDA] > 0) & (df["emissions"] > 0)]
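
The reason for this filter shows up on a toy table with one negative EBITDA and one zero emission value. The numbers are hypothetical, and the column names are plain strings chosen for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical records: row 1 has a loss, row 2 reports zero emissions.
toy = pd.DataFrame({
    "EBITDA": [500.0, -20.0, 300.0],
    "emissions": [100.0, 50.0, 0.0],
})

# Without filtering, log gives NaN for negative values and -inf for zero.
with np.errstate(divide="ignore", invalid="ignore"):
    raw_log = np.log(toy)

# Keeping strictly positive rows makes every logarithm finite.
positive = toy[(toy["EBITDA"] > 0) & (toy["emissions"] > 0)]
safe_log = np.log(positive)
print(raw_log)
print(safe_log)
```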

In [None]:
pd.concat([clip(np.log(df_without_minus[EBITDA])), clip(np.log(df_without_minus["emissions"]))], axis=1).hist(bins=10)

In [None]:
def visualize_hybrid_metrics(df, normalize=False, max_bin=10, lower_th=1, upper_th=99):
    ebitda = np.log(df[EBITDA]) if not normalize else np.log(f_normalize(df[EBITDA]))
    emissions = np.log(df["emissions"]) if not normalize else np.log(f_normalize(df["emissions"]))
    if normalize:
        hybrid_metrics = np.log(f_normalize(df[EBITDA]) / f_normalize(df["emissions"]))
    else:
        hybrid_metrics = np.log(df[EBITDA] / df["emissions"])

    _df = pd.DataFrame({"year": df["survey_year"].apply(str),
                        "hybrid_metrics": clip(hybrid_metrics, lower_th, upper_th),
                        "ebitda(log)": ebitda, "emissions(log)": emissions})
    
    brush = alt.selection(type="interval")
    base = alt.Chart(_df).add_selection(brush)
    
    points = base.mark_circle().encode(
        x=alt.X("ebitda(log):Q", bin=alt.Bin(maxbins=max_bin) if max_bin > 0 else None),
        y=alt.Y("emissions(log):Q", bin=alt.Bin(maxbins=max_bin) if max_bin > 0 else None),
        size="count()",
        color="hybrid_metrics"
    )
    
    tick_axis = alt.Axis(labels=False, domain=False, ticks=False)

    x_ticks = base.mark_tick().encode(
        alt.X("ebitda(log):Q", axis=tick_axis),
        alt.Y("year", title="", axis=tick_axis),
        color=alt.condition(brush, "year", alt.value("lightgrey"))
    )

    y_ticks = base.mark_tick().encode(
        alt.X("year", title="", axis=tick_axis),
        alt.Y("emissions(log):Q", axis=tick_axis),
        color=alt.condition(brush, "year", alt.value("lightgrey"))
    )

    # Build the chart
    return y_ticks | (points & x_ticks)


visualize_hybrid_metrics(df_without_minus, max_bin=20)

In [None]:
visualize_hybrid_metrics(df_without_minus, normalize=True, max_bin=20)

We can confirm that low-emission companies tend to get a high hybrid_metrics score.
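
One point to keep in mind when reading the normalized chart: a z-score is negative for every value below the mean, so the log of a ratio of z-scores is only defined when both z-scores have the same sign. The sketch below illustrates this on hypothetical numbers; `z_score` is a helper written for this example with the same formula as `f_normalize`.

```python
import numpy as np
import pandas as pd


def z_score(s):
    # Same formula as f_normalize (population standard deviation).
    return (s - s.mean()) / s.std(ddof=0)


# Hypothetical values, all strictly positive.
toy = pd.DataFrame({
    "EBITDA": [10.0, 20.0, 30.0, 400.0],
    "emissions": [5.0, 500.0, 8.0, 600.0],
})

ratio = z_score(toy["EBITDA"]) / z_score(toy["emissions"])

# The log is NaN wherever the two z-scores have opposite signs.
with np.errstate(invalid="ignore"):
    metric = np.log(ratio)
print(metric)
```

In this toy case the second company (below-average EBITDA, above-average emissions) gets NaN and would not appear in a chart or in a ranking after `dropna()`.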

In [None]:
def visualize_hybrid_metrics_by_category(df, category="primary_industry",
                                         normalize=True, lower_th=1, upper_th=99):

    if normalize:
        hybrid_metrics = np.log(f_normalize(df[EBITDA]) / f_normalize(df["emissions"]))
    else:
        hybrid_metrics = np.log(df[EBITDA] / df["emissions"])
    
    _df = pd.DataFrame({"year": df["survey_year"].apply(str),
                        "hybrid_metrics": clip(hybrid_metrics, lower_th, upper_th),
                        "category": df[category]})

    return alt.Chart(_df).mark_boxplot().encode(
                x=alt.X("category:O"),
                y=alt.Y("hybrid_metrics:Q")
            ).properties(width=600)

In [None]:
visualize_hybrid_metrics_by_category(df_without_minus, category="primary_industry")
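
A table can complement the box plot. The following sketch summarizes the log hybrid metric per industry with a `groupby`. The data is hypothetical ("Financial services" is a label I made up for the example), and the non-normalized form of the metric is used.

```python
import numpy as np
import pandas as pd

# Hypothetical companies in two industries.
toy = pd.DataFrame({
    "primary_industry": ["Power generation", "Power generation",
                         "Financial services", "Financial services"],
    "EBITDA": [100.0, 200.0, 100.0, 300.0],
    "emissions": [1000.0, 4000.0, 10.0, 60.0],
})
toy["hybrid_metrics"] = np.log(toy["EBITDA"] / toy["emissions"])

# Median, minimum and maximum per industry, ordered by median.
summary = (toy.groupby("primary_industry")["hybrid_metrics"]
              .agg(["median", "min", "max"])
              .sort_values("median", ascending=False))
print(summary)
```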

There seem to be some differences between industries.  
Let's look at the ranking of the hybrid metrics together with each company's category. Does it match your intuition?

In [None]:
def rank_hybrid_metrics_by_category(df, category="primary_industry", year=2019,
                                    normalize=True, lower_th=1, upper_th=99):

    if normalize:
        hybrid_metrics = np.log(f_normalize(df[EBITDA]) / f_normalize(df["emissions"]))
    else:
        hybrid_metrics = np.log(df[EBITDA] / df["emissions"])

    
    df = df[df["survey_year"] == year]
    _df = pd.DataFrame({"name": df["organization"],
                        "hybrid_metrics": clip(hybrid_metrics, lower_th, upper_th),
                        "category": df[category]})
    
    return _df.sort_values(by="hybrid_metrics", ascending=False)


ranking_2019 = rank_hybrid_metrics_by_category(df_without_minus)
ranking_2019.dropna().head(10)

In [None]:
ranking_2019.dropna().tail(10)
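
The `dropna()` calls matter here. The ranking function computes the metric on all years and filters by year afterwards, so the metric and the name columns have different indexes when the result frame is built. A minimal sketch of that pandas behaviour, with hypothetical values:

```python
import pandas as pd

# Metric computed on all rows; names only for the rows of the selected year.
metric = pd.Series([0.5, 1.5, -0.3], index=[0, 1, 2])
names = pd.Series(["A Corp", "B Corp"], index=[0, 2])

# pandas aligns on the union of the indexes and fills the gaps with NaN.
ranking = pd.DataFrame({"name": names, "hybrid_metrics": metric})
print(ranking)

# dropna() removes the rows that belong to other years.
cleaned = ranking.dropna()
print(cleaned)
```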

In [None]:
ranking_2019[ranking_2019["category"] == "Power generation"].dropna()

Exelon and 


## Decomposition of Hybrid Metrics

The formula of EBITDA is the following. 

*EBITDA = operating income + depreciation and amortization*

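*Depreciation and amortization* represents the cost of assets smoothed over time. Using it, we can separate the hybrid metric into asset efficiency (EBITDA earned per unit of depreciation and amortization) and asset procurement efficiency (depreciation and amortization per unit of emissions).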

![formula](https://i.imgur.com/CzBt2zW.png)
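
Written out, the decomposition computed in the code of this section is the following identity. The abbreviation D&A (depreciation and amortization) and the labels under the braces are my own shorthand for the two ratios.

$$
\frac{\text{EBITDA}}{\text{CO}_2\ \text{emissions}}
= \underbrace{\frac{\text{EBITDA}}{\text{D\&A}}}_{\text{asset efficiency}}
\times
\underbrace{\frac{\text{D\&A}}{\text{CO}_2\ \text{emissions}}}_{\text{procurement efficiency}}
$$

On the log scale used in the charts, the product becomes a sum:

$$
\log\frac{\text{EBITDA}}{\text{CO}_2\ \text{emissions}}
= \log\frac{\text{EBITDA}}{\text{D\&A}}
+ \log\frac{\text{D\&A}}{\text{CO}_2\ \text{emissions}}
$$

This holds exactly only for the non-normalized values, because the z-scores of the three quantities do not cancel in the same way.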


In [None]:
def visualize_decomposed_metrics(df, normalize=True, max_bin=10, lower_th=1, upper_th=99):

    if normalize:
        hybrid_metrics = np.log(f_normalize(df[EBITDA]) / f_normalize(df["emissions"]))
        asset_efficiency = np.log(f_normalize(df[EBITDA]) / f_normalize(df[DEPR_AMOR]))
        procurement_efficiency = np.log(f_normalize(df[DEPR_AMOR]) / f_normalize(df["emissions"]))
    else:
        hybrid_metrics = np.log(df[EBITDA] / df["emissions"])
        asset_efficiency = np.log(df[EBITDA] / df[DEPR_AMOR])
        procurement_efficiency = np.log(df[DEPR_AMOR] / df["emissions"])

    _df = pd.DataFrame({"year": df["survey_year"].apply(str),
                        "asset_efficiency": clip(asset_efficiency, lower_th, upper_th),
                        "procurement_efficiency": clip(procurement_efficiency, lower_th, upper_th),
                        "hybrid_metrics": clip(hybrid_metrics, lower_th, upper_th)
                       })
    
    brush = alt.selection(type="interval")
    base = alt.Chart(_df).add_selection(brush)

    points = base.mark_circle().encode(
        x=alt.X("procurement_efficiency:Q", bin=alt.Bin(maxbins=max_bin) if max_bin > 0 else None),
        y=alt.Y("asset_efficiency:Q", bin=alt.Bin(maxbins=max_bin) if max_bin > 0 else None),
        size="count()",
        color=alt.condition(brush, "hybrid_metrics", alt.value("lightgrey"))
    )
    
    tick_axis = alt.Axis(labels=False, domain=False, ticks=False)

    x_ticks = base.mark_tick().encode(
        alt.X("procurement_efficiency:Q", axis=tick_axis),
        alt.Y("year", title="", axis=tick_axis),
        color=alt.condition(brush, "year", alt.value("lightgrey"))
    )

    y_ticks = base.mark_tick().encode(
        alt.X("year", title="", axis=tick_axis),
        alt.Y("asset_efficiency:Q", axis=tick_axis),
        color=alt.condition(brush, "year", alt.value("lightgrey"))
    )

    # Build the chart
    return y_ticks | (points & x_ticks)


visualize_decomposed_metrics(df_without_minus, normalize=False, max_bin=20)

In [None]:
visualize_decomposed_metrics(df_without_minus, normalize=True, max_bin=20)
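
A numerical check of the decomposition on synthetic data: on the log scale, the non-normalized hybrid metric equals the sum of the two components. The data below is randomly generated with parameters I chose for the sketch, so the correlations it prints say nothing about the real CDP data.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive data (hypothetical parameters).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "ebitda": rng.lognormal(5, 1, n),
    "depr_amor": rng.lognormal(4, 1, n),
    "emissions": rng.lognormal(8, 2, n),
})

hybrid = np.log(toy["ebitda"] / toy["emissions"])
asset = np.log(toy["ebitda"] / toy["depr_amor"])
procurement = np.log(toy["depr_amor"] / toy["emissions"])

# log(a / c) == log(a / b) + log(b / c)
identity_holds = np.allclose(hybrid, asset + procurement)
print(identity_holds)

# Correlation of each component with the hybrid metric (depends on the toy parameters).
corr = pd.DataFrame({"hybrid": hybrid, "asset": asset,
                     "procurement": procurement}).corr()["hybrid"]
print(corr)
```

The same correlation table computed on the real data would be one way to quantify which component drives the hybrid metric.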

In the figure above, we can see that asset procurement efficiency has more impact on the hybrid metrics than asset efficiency.  
Let's look at the box plots and the ranking.

In [None]:
def visualize_decomposed_by_category(df, efficiency="asset", category="primary_industry",
                                     normalize=True, lower_th=1, upper_th=99):
    
    if efficiency == "asset":
        name = "asset_efficiency"
        if normalize:
            metrics = np.log(f_normalize(df[EBITDA]) / f_normalize(df[DEPR_AMOR]))
        else:
            metrics = np.log(df[EBITDA] / df[DEPR_AMOR])
    else:
        name = "procurement_efficiency"
        if normalize:
            metrics = np.log(f_normalize(df[DEPR_AMOR]) / f_normalize(df["emissions"]))
        else:
            metrics = np.log(df[DEPR_AMOR] / df["emissions"])
                
    _df = pd.DataFrame({"year": df["survey_year"].apply(str),
                        name: metrics,
                        "category": df[category]})

    return alt.Chart(_df).mark_boxplot().encode(
                x=alt.X("category:O"),
                y=alt.Y("{}:Q".format(name))
            ).properties(width=600)


In [None]:
visualize_decomposed_by_category(df_without_minus, efficiency="asset")

In [None]:
visualize_decomposed_by_category(df_without_minus, efficiency="procurement")

The distributions are wider than that of the hybrid metrics itself.    
Let's look at the ranking.

In [None]:
def rank_procurement_efficiency_by_category(df, category="primary_industry", year=2019,
                                            normalize=True, lower_th=1, upper_th=99):
    df = df[df["survey_year"] == year]
    if normalize:
        procurement_efficiency = np.log(f_normalize(df[DEPR_AMOR]) / f_normalize(df["emissions"]))
    else:
        procurement_efficiency = np.log(df[DEPR_AMOR] / df["emissions"])

    _df = pd.DataFrame({"year": df["survey_year"].apply(str),
                        "name": df["organization"].apply(str),
                        "procurement_efficiency": procurement_efficiency,
                        "category": df[category]})
    
    return _df.sort_values(by="procurement_efficiency", ascending=False)


ranking_2019p = rank_procurement_efficiency_by_category(df_without_minus)
ranking_2019p.head(10)

In [None]:
ranking_2019p.dropna().tail(10)
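
The function above sorts all companies together and shows the category as a column. If a ranking within each category is wanted instead, one possible extension is a grouped rank. This is a sketch on hypothetical data, not part of the analysis above; "Retail" and the company names are made up.

```python
import pandas as pd

# Hypothetical procurement efficiency scores in two categories.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "category": ["Power generation", "Power generation", "Retail", "Retail"],
    "procurement_efficiency": [-2.0, -1.0, 3.0, 1.0],
})

# Rank 1 is the highest score within each category.
toy["rank_in_category"] = (toy.groupby("category")["procurement_efficiency"]
                              .rank(ascending=False, method="min")
                              .astype(int))
ranked = toy.sort_values(["category", "rank_in_category"])
print(ranked)
```

Ranking within a category compares a company with its peers, which reduces the effect of structural differences between industries.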

In this notebook, I showed the potential of hybrid metrics for evaluating corporate sustainability.  
Classifying the companies suggests that the metric is effective, but further verification against the responses themselves is needed. 
