In this notebook, we use [sentence embeddings](https://paperswithcode.com/task/sentence-embedding) to analyse the SDG alignment of German DAX companies, as reflected in communications by and about these companies. After trying the basic analyses presented here, we encourage users to explore more advanced questions related to sustainability and greenwashing.

The dataset used in this notebook can be [downloaded from Kaggle](https://www.kaggle.com/datasets/equintel/dax-esg-media-dataset).


In [None]:
# In this cell, set DATA_DIR to the directory from which both files can be read.
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

DATA_DIR = "/content/gdrive/MyDrive/dax_esg_dataset/"

Mounted at /content/gdrive/


In [None]:
# Make sure that all dependencies are installed.
import pandas as pd
import os
import numpy as np
import sentence_transformers
import torch
import plotly.graph_objects as go

## 1. Loading the data

In [None]:
esg_documents_df = pd.read_csv(os.path.join(DATA_DIR, "esg_documents_for_dax_companies.csv"), sep="|")
esg_documents_df.head()

Unnamed: 0.1,Unnamed: 0,company,content,datatype,date,domain,esg_topics,internal,symbol,title,url
0,2,Beiersdorf AG,Sustainability Highlight Report CARE BEYOND SK...,sustainability_report,2021-03-31,,"['CleanWater', 'GHGEmission', 'ProductLiabilit...",1,BEI,BeiersdorfAG Sustainability Report 2021,
1,3,Deutsche Telekom AG,Corporate Responsibility Report 2021 2 Content...,sustainability_report,2021-03-31,,"['DataSecurity', 'Iso50001', 'GlobalWarming', ...",1,DTE,DeutscheTelekomAG Sustainability Report 2021,
2,5,Vonovia SE,VONOVIA SE SUSTAINABILITY REPORT 2021 =For a S...,sustainability_report,2021-03-31,,"['Whistleblowing', 'DataSecurity', 'Vaccine', ...",1,VNA,VonoviaSE Sustainability Report 2021,
3,6,Merck KGaA,Sustainability Report 2021 TABLE OF CONTENTS S...,sustainability_report,2021-03-31,,"['DataSecurity', 'DataMisuse', 'DrugResistance...",1,MRK,MerckKGaA Sustainability Report 2021,
4,9,MTU,Our ideas and concepts FOR A SUSTAINABLE FUTUR...,sustainability_report,2020-03-31,,"['WorkLifeBalance', 'Corruption', 'AirQuality'...",1,MTX,MTUAeroEngines Sustainability Report 2020,


In [None]:
# form texts by concatenating title and content
esg_texts = esg_documents_df.apply(lambda row: " ".join([str(row["title"]), str(row["content"])]), axis=1)
esg_texts[0][:100]

'BeiersdorfAG Sustainability Report 2021 Sustainability Highlight Report CARE BEYOND SKIN 2021 03 For'

In [None]:
sdg_df = pd.read_csv(os.path.join(DATA_DIR, "sdg_descriptions_with_targetsText.csv"))
sdg_df.head()

Unnamed: 0,id,name,description,targets,targets_json_array,progress
0,1,No Poverty,End poverty in all its forms everywhere,"['1.1', 'By 2030, eradicate extreme poverty fo...","[{""target"":""1.1"",""description"":""By 2030, eradi...",['The impact of the COVID-19 pandemic reversed...
1,2,Zero Hunger,"End hunger, achieve food security and improved...","['2.1', 'By 2030, end hunger and ensure access...","[{""target"":""2.1"",""description"":""By 2030, end h...","['Between 2014 and the onset of the pandemic, ..."
2,3,Good Health and Well-being,Ensure healthy lives and promote well-being fo...,"['3.1', 'By 2030, reduce the global maternal m...","[{""target"":""3.1"",""description"":""By 2030, reduc...","['By April 2022, the coronavirus causing COVID..."
3,4,Quality Education,Ensure inclusive and equitable quality educati...,"['4.1', 'By 2030, ensure that all girls and bo...","[{""target"":""4.1"",""description"":""By 2030, ensur...",['The COVID-19 outbreak has caused a global ed...
4,5,Gender Equality,Achieve gender equality and empower all women ...,"['5.1', 'End all forms of discrimination again...","[{""target"":""5.1"",""description"":""End all forms ...","[""The world is not on track to achieve gender ..."


In [None]:
sdg_texts = sdg_df.apply(lambda row: " ".join([row["name"], row["description"], row["targets"], row["progress"]]), axis=1)
sdg_texts[0][:100]

"No Poverty End poverty in all its forms everywhere ['1.1', 'By 2030, eradicate extreme poverty for a"

In [None]:
companies = sorted(esg_documents_df.company.unique())
companies

array(['Beiersdorf AG', 'Deutsche Telekom AG', 'Vonovia SE', 'Merck KGaA',
       'MTU', 'E ONSE', 'RWE AG', 'Heidelberg Cement AG', 'Siemens AG',
       'Linde', 'Qiagen', 'Henkel', 'Daimler AG', 'Continental AG',
       'Bayer AG', 'Volkswagen AG', 'Fresenius', 'Symrise AG',
       'Sartorius AG', 'Porsche', 'SAP', 'Adidas AG', 'Deutsche Bank AG',
       'Puma SE', 'Siemens Healthineers AG', 'Airbus SE', 'Covestro AG',
       'Allianz SE', 'Infineon Technologies AG', 'BMW', 'Hannover R AG',
       'Siemens Energy', 'Zalando SE',
       'Muenchener Rueckversicherungs Gesellschaft AGin Muenchen',
       'Deutsche Post AG', 'BASF SE', 'Deutsche Boerse AG', 'Brenntag',
       'AkzoNobelNV', 'Vonovia'], dtype=object)

## 2. Build embeddings

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load the sentence embedding model from the Hugging Face model hub.
retriever = sentence_transformers.SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [None]:
company_text_embeddings = retriever.encode(esg_texts)
sdg_embeddings = retriever.encode(sdg_texts)
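
The model summary above reports a `max_seq_length` of 128, and sentence-transformers truncates inputs that exceed this limit, so only the beginning of a long report contributes to its embedding. One possible workaround, which is not part of the original analysis, is to split each text into chunks, embed the chunks and average the results. The sketch below assumes a hypothetical helper `embed_long_text` and a stand-in encoder `toy_encode`; the chunk size of 100 words is only a rough proxy for a token limit.

```python
import numpy as np


def embed_long_text(text, encode, words_per_chunk=100):
    """Split a text into word chunks, embed each chunk and average the embeddings.

    `encode` is any callable that maps a list of strings to a 2-D array,
    for example `retriever.encode` (hypothetical usage, not run here).
    """
    words = text.split()
    if not words:
        raise ValueError("text is empty")
    chunks = [
        " ".join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]
    chunk_embeddings = np.asarray(encode(chunks), dtype=float)
    return chunk_embeddings.mean(axis=0)


def toy_encode(chunks):
    """Stand-in encoder for illustration only: (word count, 1.0) per chunk."""
    return np.array([[len(chunk.split()), 1.0] for chunk in chunks])


# 250 words are split into chunks of 100, 100 and 50 words.
toy_embedding = embed_long_text("word " * 250, toy_encode, words_per_chunk=100)
print(toy_embedding)
```

With the real model, the call would look like `embed_long_text(esg_texts[0], retriever.encode)`. Averaging gives every chunk equal weight, which may or may not be appropriate for a given question.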

## 3. SDG alignment of the DAX companies

In [None]:
def cosine_similarity(embedding1, embedding2):
    sim = np.dot(embedding1, embedding2)/(np.linalg.norm(embedding1)*np.linalg.norm(embedding2))
    return sim

We model SDG alignment as the similarity between the company-related texts and the SDG descriptions. The function above implements standard cosine similarity. The rest of this section demonstrates some possible alignment analyses, together with their visualisations and interpretations, and closes with suggestions for more advanced alignment analyses.
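
The `cosine_similarity` function compares one pair of vectors at a time. As a minimal sketch, the same quantity can be computed for all pairs at once by normalising the rows and taking a matrix product. The function name `cosine_similarity_matrix` and the toy vectors are assumptions made for illustration; they are not part of the dataset.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the cosine similarity between every row of `a` and every row of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale each row to unit length, then a dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy 2-D vectors standing in for company and SDG embeddings.
toy_company_vectors = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_sdg_vectors = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

toy_similarities = cosine_similarity_matrix(toy_company_vectors, toy_sdg_vectors)
print(toy_similarities.round(3))
```

The result has one row per company vector and one column per SDG vector, which is the layout a heatmap expects.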

### Most relevant SDGs for DAX companies

Let's first consider the overall relevance of the 17 SDGs for all DAX companies taken together. We first compute an embedding that averages over all company texts. Then, we compare this embedding with the embeddings of the SDGs.

In [None]:
all_companies_embedding = np.mean(company_text_embeddings, axis=0)

In [None]:
sdg_relevance_scores = [cosine_similarity(all_companies_embedding, sdg_embedding) for sdg_embedding in sdg_embeddings]
sdg_relevance_series = pd.Series(sdg_relevance_scores, index=sdg_df["name"])
sdg_relevance_series.sort_values(inplace=True)
sdg_relevance_series.head()

name
Gender Equality                           0.166441
Peace, Justice and Strong Institutions    0.174364
Life On Land                              0.204423
Quality Education                         0.228199
Zero Hunger                               0.259333
dtype: float32
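
The model ends with a `Normalize()` layer, so each individual embedding has unit length. An average of unit vectors is generally shorter than 1, however, which is why the comparison divides by the norms rather than relying on a plain dot product. The toy vectors below are invented for illustration.

```python
import numpy as np

# Two orthogonal unit vectors standing in for normalised text embeddings.
unit_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
mean_vector = unit_vectors.mean(axis=0)

# The mean of unit vectors is not itself a unit vector.
mean_norm = np.linalg.norm(mean_vector)

reference = np.array([1.0, 0.0])
raw_dot = float(np.dot(mean_vector, reference))
cosine = raw_dot / (mean_norm * np.linalg.norm(reference))

print(f"norm of mean: {mean_norm:.3f}, dot: {raw_dot:.3f}, cosine: {cosine:.3f}")
```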

In [None]:
fig = go.Figure(data=[go.Bar(x=sdg_relevance_series, y=[sdg+" " for sdg in sdg_relevance_series.index], orientation='h')])
# Set the figure size and title
fig.update_layout(height=600, width=750, title="SDG relevance for DAX index")
fig.update_xaxes(title="Relevance")
fig.show()

### Most relevant SDGs for a specific company, contrasting internal and external data

In this analysis, we focus on a single company, which is set via the `COMPANY` variable. We select the embeddings of the "internal" and "external" documents for this company, average each group separately, and measure the similarity of each average with every SDG.

In [None]:
COMPANY = "Brenntag"

internal_company_indices = esg_documents_df[(esg_documents_df.company == COMPANY) & (esg_documents_df.internal == 1)].index
internal_company_embedding = np.mean(company_text_embeddings[internal_company_indices], axis=0)

external_company_indices = esg_documents_df[(esg_documents_df.company == COMPANY) & (esg_documents_df.internal == 0)].index
external_company_embedding = np.mean(company_text_embeddings[external_company_indices], axis=0)

company_sdg_relevance_scores = [[cosine_similarity(internal_company_embedding, sdg_embedding), 
                                 cosine_similarity(external_company_embedding, sdg_embedding)] for sdg_embedding in sdg_embeddings]
company_sdg_relevance_df = pd.DataFrame.from_records(company_sdg_relevance_scores, index=sdg_df["name"], columns=["internal", "external"])
company_sdg_relevance_df.sort_values("internal", inplace=True)
company_sdg_relevance_df.head()

name,internal,external
Gender Equality,0.107339,0.093954
"Peace, Justice and Strong Institutions",0.135041,0.130411
Good Health and Well-being,0.153959,0.203413
Zero Hunger,0.155495,0.193238
Climate Action,0.160574,0.240062


In [None]:
fig = go.Figure(data=[
    go.Bar(name='Internal', x=company_sdg_relevance_df["internal"], y=company_sdg_relevance_df.index, orientation='h'),
    go.Bar(name='External', x=company_sdg_relevance_df["external"], y=company_sdg_relevance_df.index, orientation='h')
])

fig.update_layout(barmode='group', height=600, width=750, title=f"SDG Relevance for {COMPANY}")
fig.update_xaxes(title="Relevance")
fig.show()

To explore the data further, modify the chart to answer the following questions:

- What are the internally most important SDGs for BMW?
- What are the most important SDGs for BMW when we take into account both the internal and the external data?
- What are the SDGs for BMW with the largest relevance gap between internal and external data?
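
For the last question, one possible approach is to add a gap column to a frame with `internal` and `external` columns and sort by its absolute value. The sketch below uses an invented toy frame, `toy_relevance_df`, whose scores are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Invented scores with the same layout as company_sdg_relevance_df.
toy_relevance_df = pd.DataFrame(
    {"internal": [0.10, 0.30, 0.25], "external": [0.12, 0.18, 0.40]},
    index=["Gender Equality", "Climate Action", "Clean Water and Sanitation"],
)

# A positive gap means the SDG is more prominent in internal documents.
toy_relevance_df["gap"] = toy_relevance_df["internal"] - toy_relevance_df["external"]
toy_relevance_df["abs_gap"] = toy_relevance_df["gap"].abs()

# Largest discrepancies first, regardless of direction.
largest_gaps = toy_relevance_df.sort_values("abs_gap", ascending=False)
print(largest_gaps)
```

Keeping the signed `gap` next to `abs_gap` shows whether an SDG is emphasised more by the company itself or by external sources.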

### Heatmap of SDG relevance for all companies

In this analysis, we look at the alignment of all companies with the 17 SDGs, creating an overview of the DAX index.

In [None]:
company_embeddings = []
for company in companies:
    company_indices = esg_documents_df[esg_documents_df.company == company].index
    company_embedding = np.mean(company_text_embeddings[company_indices], axis=0)
    company_embeddings.append(company_embedding)

In [None]:
company_records = []

for company_embedding in company_embeddings:
    company_record = []
    for sdg_embedding in sdg_embeddings:
        company_record.append(cosine_similarity(company_embedding, sdg_embedding))
    company_records.append(company_record)

In [None]:
heatmap_array = np.array(company_records)
heatmap_array.shape

(40, 17)

In [None]:
fig = go.Figure(data=go.Heatmap(
        z=heatmap_array,
        x=sdg_df["name"].tolist(),
        y=[company + " " for company in companies],
        colorscale='Viridis'))

fig.update_layout(height=1000)

fig.show()

To explore the data further, modify the chart to answer the following questions:

- What are the companies that are most aligned with the SDGs? (sort by sum of rows)
- Which SDGs are most relevant for the DAX index? (sort by sum of columns)
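
Both sorting hints can be sketched with `np.argsort` on the row and column sums. The toy matrix and labels below are invented for illustration. With the real data, `heatmap_array`, `companies` and `sdg_df["name"]` would take their place.

```python
import numpy as np

# Invented 3x3 similarity matrix: rows are companies, columns are SDGs.
toy_heatmap = np.array([
    [0.1, 0.4, 0.2],
    [0.3, 0.5, 0.6],
    [0.2, 0.1, 0.1],
])
toy_companies = ["Company A", "Company B", "Company C"]
toy_sdgs = ["SDG x", "SDG y", "SDG z"]

# Indices of rows and columns ordered by descending sum.
row_order = np.argsort(toy_heatmap.sum(axis=1))[::-1]
col_order = np.argsort(toy_heatmap.sum(axis=0))[::-1]

# Reorder the matrix and both label lists consistently.
sorted_heatmap = toy_heatmap[np.ix_(row_order, col_order)]
sorted_companies = [toy_companies[i] for i in row_order]
sorted_sdgs = [toy_sdgs[j] for j in col_order]

print(sorted_companies)
print(sorted_sdgs)
print(sorted_heatmap)
```

The reordered array and labels can then be passed to `go.Heatmap` as `z`, `y` and `x`. Note that Plotly draws the first row at the bottom of the heatmap by default.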