# Automated Analysis of ESG Reports: Evaluating Compliance with ESRS Guidelines using Semantha

In this notebook, we will explore a novel approach to analyzing Environmental, Social, and Governance (ESG) reports using our natural language processing (NLP) platform semantha. Our focus will be on the application of the European Sustainability Reporting Standards (ESRS), and how we can use NLP to assess companies' compliance with these guidelines.

ESG reports play a pivotal role in demonstrating a company's commitment to sustainable and socially responsible practices. However, evaluating these reports manually is a laborious and time-consuming task. Our tool facilitates this process by automatically analyzing the content and structure of these reports and assigning a coverage score for each one, based on how well they adhere to the ESRS guidelines.

Our approach can be broken down into two major steps:

1. Extracting relevant sections from the ESG reports and analyzing them against the ESRS guidelines. Each report receives a coverage score that signifies the level of its compliance with the guidelines.

2. Analyzing these coverage scores on both an industry-wide and company-specific level. This step involves studying the evolution of reporting quality over time, identifying top-performing industries, and highlighting the strengths and weaknesses in individual companies' reporting. These are just examples and by no means capture all the insights semantha can give us.

The following sections delve into these steps in detail.



## 1. Computing Coverage Scores with Semantha

Coverage scores are the backbone of our analysis and a key metric to evaluate the quality of ESG reporting. We computed these scores by comparing the content of each report with the ESRS guidelines. Higher coverage scores indicate a greater degree of alignment with the guidelines, and consequently, a more comprehensive and compliant ESG report.
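
As a minimal sketch of the idea, assuming a coverage score is simply the share of guideline entries for which a report contains at least one match, the computation can be written in plain Python. The function name `coverage_score` and the guideline identifiers are hypothetical and only serve the illustration; the actual matching is done by semantha.

```python
def coverage_score(matched_ids, guideline_ids):
    """Share of guideline entries that have at least one match in the report."""
    guideline_ids = set(guideline_ids)
    if not guideline_ids:
        return 0.0  # no guidelines to cover
    return len(guideline_ids & set(matched_ids)) / len(guideline_ids)


# Hypothetical guideline identifiers, for illustration only
guidelines = ["E1-1", "E1-2", "S1-1", "G1-1"]
matched = ["E1-1", "G1-1"]

score = coverage_score(matched, guidelines)
print(score)  # 0.5
```

Two of the four guideline entries are matched, so the report covers half of the library.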


### 1.1 Loading the Corpus Data

Before we can begin the analysis, we need to load our corpus of ESG reports. These reports have been collected and pre-processed into a structured format that's easy to work with.

Let's go ahead and look at our corpus.

In [None]:
import json

#Provide path to config.json (see template config.json.tpl)
config_file_path = "config.json"

with open(config_file_path, "r") as config_json:
  config = json.load(config_json)
  server_url = config["server_url"]
  api_key = config["api_key"]
  semantha_domain = config["semantha_domain"]

In [None]:
import os
import pandas as pd

#Enter the corresponding market you'd like to check: Fortune100, FTSE, DAX40
market="DAX40"

corpus = pd.read_csv("Auto ESG/"+market+"/corpus_" + market + ".csv")
display(corpus[["File", "Company", "Sector", "Year"]])

Our corpus consists of ESG reports from various companies across different sectors and years. Each row in the dataset represents a separate ESG report, with columns providing details about the company, sector, and the year of the report.

With our data now loaded, we can proceed to the next steps of our analysis.


### 1.2 Matching the Corpus Against ESRS with the Semantha SDK

In this next part of our analysis, we use our Semantha SDK to cross-reference the ESG reports in our corpus against the European Sustainability Reporting Standards (ESRS).
This matching process is a fundamental part of our coverage scoring methodology.


The first step is to install and import the semantha package.

In [None]:
%pip install semantha_sdk

import semantha_sdk

The functions defined below are helpers that retrieve library entries by tag, collect the paragraph matches in a document, split the library entries of each tag into matched and unmatched, and submit a file for comparison against the library.

In [None]:
def get_library_entries_for_tag(tag):
    return semantha.domains(domainname=semantha_domain).referencedocuments.get(
            tags=tag, fields="id,name,contentpreview"
        ).data

def get_paragraph_matches_of_doc(doc):
    match_list = []
    for page in doc.pages:
        if page.contents is not None:
            for content in page.contents:
                if content.paragraphs is not None:
                    for p in content.paragraphs:
                        if p.references is not None and len(p.references) > 0:
                            match_list.append((p, p.references))

    return match_list

def retrieve_library_matches_per_tag(tags, doc):
    result_dict = {}
    for t in tags:
        matched = []
        not_matched = []
        doc_ids_of_paragraph_matches = set([i.document_id for x in get_paragraph_matches_of_doc(doc) for i in x[1]])
        lib_entries_for_tag = get_library_entries_for_tag(t)
        for entry in lib_entries_for_tag:
            if entry.id in doc_ids_of_paragraph_matches:
                matched.append(entry)
            else:
                not_matched.append(entry)
        result_dict[t] = {
            "matched": matched,
            "not_matched": not_matched
        }
    return result_dict

def compare_to_library(in_file, threshold):
        return semantha.domains(domainname=semantha_domain).references.post(
            file=in_file,
            similaritythreshold=threshold,
            maxreferences=1
        )
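
The traversal used by the helpers above can be tried offline with stand-in objects. The sketch below is an assumption-laden illustration: the `SimpleNamespace` objects mimic only the attributes the helpers read (`pages`, `contents`, `paragraphs`, `references`, `document_id`) and are not real semantha SDK classes, and the library ids are made up.

```python
from types import SimpleNamespace as NS


def make_paragraph(text, ref_ids):
    """Build a stand-in paragraph; no references is modelled as None."""
    refs = [NS(document_id=i) for i in ref_ids] or None
    return NS(text=text, references=refs)


# Hypothetical document: one page with two paragraphs, one empty page
doc = NS(pages=[
    NS(contents=[NS(paragraphs=[
        make_paragraph("Paragraph about emissions.", ["lib-E1"]),
        make_paragraph("Introductory paragraph.", []),
    ])]),
    NS(contents=None),  # pages without content are skipped
])


def matched_library_ids(doc):
    """Collect the library document ids referenced by any paragraph."""
    ids = set()
    for page in doc.pages:
        for content in page.contents or []:
            for paragraph in content.paragraphs or []:
                for ref in paragraph.references or []:
                    ids.add(ref.document_id)
    return ids


library_ids_for_tag = ["lib-E1", "lib-E2"]  # hypothetical entries for one tag
found = matched_library_ids(doc)
matched = [i for i in library_ids_for_tag if i in found]
not_matched = [i for i in library_ids_for_tag if i not in found]
print(matched, not_matched)
```

Collecting the referenced ids into a set once per document keeps the per-tag membership checks cheap, which is the same idea as `doc_ids_of_paragraph_matches` in the helper above.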

Next, we log in to the Semantha server and fetch the tags used for referencing in the ESG reports.

In [None]:
semantha = semantha_sdk.login(server_url=server_url, key=api_key)
tags = semantha.domains(domainname=semantha_domain).tags.get()
print(tags)

Then, we iterate over the corpus of ESG reports. For each report, we use semantha to compare the report to the ESRS guidelines. We collect some metadata along with the number of matches for each tag. This collected information forms the base for our coverage score.
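
To make the shape of one result row concrete, here is a small sketch that combines report metadata with per-tag match counts. The company, tag names, and library ids are hypothetical, and the input dictionary only imitates the structure returned by `retrieve_library_matches_per_tag` (the real entries are SDK objects rather than strings).

```python
def count_matches(matches_per_tag):
    """Reduce the per-tag result to the number of matched library entries."""
    return {tag: len(result["matched"]) for tag, result in matches_per_tag.items()}


# Hypothetical result in the shape returned by retrieve_library_matches_per_tag
matches_per_tag = {
    "E1": {"matched": ["lib-1", "lib-2"], "not_matched": ["lib-3"]},
    "S1": {"matched": [], "not_matched": ["lib-4"]},
}
metadata = {"Company": "Example AG", "Sector": "Industrials", "Year": 2022}

# Same result as `metadata | num_matches`, which needs Python 3.9 or newer
row = {**metadata, **count_matches(matches_per_tag)}
print(row)
```

Each report contributes one such row, so the final table has one column per metadata field and one column per tag.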

In [None]:
from tqdm import tqdm

# Similarity threshold that controls how strict semantha is when matching: a good setting is 0.65 to 0.75
THRESHOLD = 0.65

matches = []
for index, row in tqdm(corpus.iterrows(), total=len(corpus)):
    try:
        metadata = {
            "Company": row.Company,
            "Sector": row.Sector,
            "Year": row.Year
        }

        # Open the report in a context manager so the file handle is always closed
        with open(os.path.join("Auto ESG/" + market, row.File), "rb") as file:
            doc = compare_to_library(in_file=file, threshold=THRESHOLD)
        print("Finished processing: " + file.name)
        num_matches = {
            tag: len(reference_documents["matched"]) for tag, reference_documents in retrieve_library_matches_per_tag(tags, doc).items()
        }

        matches.append(metadata | num_matches)
        saveGame = pd.DataFrame(matches)
        saveGame.to_excel("Auto ESG/" + market + "/_results/SaveGame_" + market + "_" + str(index) + "_" + str(THRESHOLD) + ".xlsx")

    except FileNotFoundError:
        print(f"WARNING: {row.File} not found.")
        continue

    # except TypeError:
    #     print(f"WARNING: {TypeError.__qualname__}")
    #     continue

matches = pd.DataFrame(matches)
matches.to_excel("Auto ESG/" + market + "/_results/matches_" + market + "_" + str(THRESHOLD) + ".xlsx")

### 1.3 Computing Coverage Scores

With the matches obtained in the previous step, we now compute the coverage scores for each ESG report. The coverage score is a key metric that represents how well a company's ESG report aligns with the ESRS guidelines.
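
The notebook reports two totals that can differ noticeably, so a worked example may help. The sketch below uses made-up numbers and hypothetical tag names: `total_score` pools all matches over all guideline entries, while `normalized_score` first computes a share per tag, averages the tags within each pillar (selected by their first letter, as the regex filters in the next cell do), and then averages the three pillars.

```python
# Hypothetical numbers: matched entries and library size per tag
matched = {"E1": 6, "E2": 2, "S1": 3, "G1": 1}
guidelines = {"E1": 10, "E2": 10, "S1": 4, "G1": 2}

# Pooled total: all matches divided by all guideline entries
total_score = sum(matched.values()) / sum(guidelines.values())

# Per-tag share of matched guideline entries
per_tag = {tag: matched[tag] / guidelines[tag] for tag in matched}


def pillar_mean(prefix):
    """Average the per-tag shares of all tags starting with the given letter."""
    values = [share for tag, share in per_tag.items() if tag.startswith(prefix)]
    return sum(values) / len(values)


pillars = {letter: pillar_mean(letter) for letter in "ESG"}

# Normalized total: every pillar counts equally, regardless of its library size
normalized_score = sum(pillars.values()) / len(pillars)

print(round(total_score, 2), round(normalized_score, 2))
```

With these numbers the pooled total is about 0.46 and the normalized total is about 0.55, because the small S and G libraries weigh as much as the larger E library once the pillars are averaged.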

In [None]:
matches_file = "Auto ESG/" + market + "/_results/matches_" + market + "_" + str(THRESHOLD) + ".xlsx"

scores = pd.read_excel(matches_file)

num_guidelines_per_tag = pd.DataFrame(columns=tags, data=[
    [len(semantha.domains(semantha_domain).referencedocuments.get(tags=tag, fields="id,name,contentpreview").data) for tag in tags]
])

matches = pd.read_excel(matches_file)
matches["Total Score"] = matches[tags].sum(axis=1) / num_guidelines_per_tag.loc[0].sum()

scores[tags] = scores[tags] / num_guidelines_per_tag.loc[0]

scores["E"] = scores[tags].filter(regex="^E").mean(axis=1)  # TODO: should be in tags as well (Includes "Sector" column for S)
scores["S"] = scores[tags].filter(regex="^S").mean(axis=1)
scores["G"] = scores[tags].filter(regex="^G").mean(axis=1)

scores["Total Score (Normalized)"] = scores[["E", "S", "G"]].mean(axis=1)
scores["Total Score"] = matches["Total Score"]
display(scores)

scores.to_excel("Auto ESG/" + market + "/_results/scores_" + market + "_" + str(THRESHOLD) + ".xlsx")

In [None]:
matches

In [None]:
# Identify rows with missing company or sector information
missing_values = scores[scores['Company'].isna() | scores['Sector'].isna()].reset_index()
display(missing_values)

In [None]:
# Drop rows with missing company or sector information
scores = scores.dropna(subset=['Company', 'Sector'])
len(scores)

## 2. Inter-Company Analyses

Having computed the coverage scores, we can now turn our attention to the cross-company analysis. This part of our investigation aims to uncover patterns and trends in ESG reporting across various companies.

In particular, we will explore questions such as:

- How has the quality of reporting evolved over time?
- Which sectors are leading in compliance with ESRS guidelines?
- Are there specific components of ESG (Environmental, Social, Governance) where certain sectors excel?

By exploring these questions, we aim to provide a broader perspective on the state of ESG reporting. Let's delve into this analysis.


### 2.1 Tracking ESG Reporting Quality Over Time
Let's commence our exploration by analyzing the temporal evolution of ESG reporting quality. This measure helps us understand how companies have adapted to the ESRS guidelines over time.

In this section, we compute the yearly average of the total coverage scores, which reflects the average alignment of ESG reports with ESRS guidelines for each year.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style of the plots
sns.set_theme(style="whitegrid")

In [None]:
# Calculate yearly average of total matches
yearly_performance = scores.groupby('Year')['Total Score'].mean().reset_index()

# Create a line plot for average performance over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', y='Total Score', data=yearly_performance, marker='o')
plt.title('Average ESG Reporting Matches Over Time', fontsize=15)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Total Score', fontsize=12)
plt.show()

### 2.2 Comparing ESG Reporting Quality Across Sectors

Next, we take a closer look at the performance of various industry sectors. Understanding sector performance can reveal industry-specific patterns and trends, providing further context to our analysis.

In this step, we compute the average total coverage scores by sector. These scores are then visualized in a bar plot, allowing us to easily compare the ESG reporting quality across different sectors.

In [None]:
# Calculate average performance by Sector
sector_performance = scores.groupby('Sector')['Total Score'].mean().sort_values(ascending=False).reset_index()

# Create a bar plot for average performance by Sector
plt.figure(figsize=(12, 6))
sns.barplot(x='Total Score', y='Sector', data=sector_performance, orient='h', palette='viridis')
plt.title('Average ESG Reporting Matches by Sector', fontsize=15)
plt.xlabel('Average Total Score', fontsize=12)
plt.ylabel('Sector', fontsize=12)
plt.show()

The bar plot above displays the average total ESG reporting score across different sectors. Higher scores indicate better alignment with ESRS guidelines, meaning those sectors exhibit a higher quality of ESG reporting. This sectoral analysis allows us to identify industry leaders and those sectors that may require additional efforts to improve their reporting practices.

In addition to overall ESG reporting quality, it's valuable to examine sector performance across the individual ESG components: Environmental, Social, and Governance. This granular perspective allows us to identify which sectors excel in specific areas and where there might be room for improvement.

Let's break down the average scores by sector for each ESG component and visualize the results.
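
The selection of the best component per sector can be sketched in plain Python. The sector names and averages below are hypothetical; the logic mirrors what `idxmax` and `max` along the columns do in the next cell.

```python
# Hypothetical sector averages per ESG component
sector_scores = {
    "Automotive": {"E": 0.30, "S": 0.20, "G": 0.10},
    "Financials": {"E": 0.10, "S": 0.15, "G": 0.25},
}

# For each sector, keep the component with the highest average and its score
best = {
    sector: max(components.items(), key=lambda kv: kv[1])
    for sector, components in sector_scores.items()
}

# Sort sectors by their best score, highest first
ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)

for sector, (component, score) in ranked:
    print(sector, component, score)
```

Note that this view keeps only each sector's strongest component, so a long bar says nothing about how the sector does on the other two.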


In [None]:
# Calculate average E, S, and G scores for each sector
sector_performance_esg = scores.groupby('Sector')[['E', 'S', 'G']].mean()

# Calculate the best ESG component for each sector
sector_performance_esg['Best Component'] = sector_performance_esg[['E', 'S', 'G']].idxmax(axis=1)
sector_performance_esg['Best Score'] = sector_performance_esg[['E', 'S', 'G']].max(axis=1)

# Reset index to make 'Sector' a column
sector_performance_esg.reset_index(inplace=True)

# Sort sectors by their best score
sector_performance_esg = sector_performance_esg.sort_values('Best Score', ascending=False)

# Create a bar plot showing the best performing ESG component for each sector
plt.figure(figsize=(12, 6))
sns.barplot(x='Best Score', y='Sector', data=sector_performance_esg, hue='Best Component', dodge=False, palette='viridis')
plt.title('Best Performing ESG Component by Sector', fontsize=15)
plt.xlabel('Best ESG Score', fontsize=12)
plt.ylabel('Sector', fontsize=12)
plt.legend(title='Best Component')
plt.show()


The plot above highlights the best performing ESG component for each sector, denoted by color, along with the corresponding score. This representation provides a snapshot of where each sector shines the brightest in its ESG reporting practices.

### 2.3 Top and Low Performers in Each Sector

Another insightful analysis involves identifying the companies with the highest and lowest ESG reporting scores within each sector. By comparing companies within the same sector, we get a clearer understanding of relative performance, taking into account the unique factors and challenges faced by companies operating in the same field.

We'll identify the top and lowest performers in each sector and visualize the results using a double bar chart, with one bar for the top-performing company and another for the lowest-performing company in each sector.
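
As a rough plain-Python equivalent of the group-wise `idxmax` and `idxmin` lookups in the next cell, the sketch below picks the highest and lowest scoring report per sector. All companies, sectors, and scores are hypothetical.

```python
# Hypothetical reports, one dictionary per report
reports = [
    {"Company": "A", "Sector": "Energy", "Total Score": 0.40},
    {"Company": "B", "Sector": "Energy", "Total Score": 0.10},
    {"Company": "C", "Sector": "Retail", "Total Score": 0.30},
]

# Group the reports by sector
by_sector = {}
for report in reports:
    by_sector.setdefault(report["Sector"], []).append(report)


def total_score(report):
    return report["Total Score"]


# For each sector, keep the report with the highest and the lowest score
extremes = {
    sector: (max(rows, key=total_score), min(rows, key=total_score))
    for sector, rows in by_sector.items()
}

for sector, (top, low) in extremes.items():
    print(sector, top["Company"], low["Company"])
```

Two edge cases are worth keeping in mind when reading the chart: a sector with a single report shows the same company as top and low performer, and because each row is one report, top and low can be the same company in different years.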


In [None]:
import numpy as np

# Identify top and bottom performers in each sector
top_performers_sector = scores.loc[scores.groupby('Sector')['Total Score'].idxmax()]
low_performers_sector = scores.loc[scores.groupby('Sector')['Total Score'].idxmin()]

# Merge the results into a single DataFrame
performance_by_sector = pd.merge(top_performers_sector, low_performers_sector, on='Sector', suffixes=('_top', '_low'))

# Create the bar chart
plt.figure(figsize=(14, 8))
barWidth = 0.25

# Set position of bar on X axis
r1 = np.arange(len(performance_by_sector))
r2 = [x + barWidth for x in r1]

# Make the plot
bars1 = plt.bar(r1, performance_by_sector['Total Score_top'], color='g', width=barWidth, edgecolor='grey', label='Top Performers')
bars2 = plt.bar(r2, performance_by_sector['Total Score_low'], color='r', width=barWidth, edgecolor='grey', label='Low Performers')

# Add xticks on the middle of the group bars
plt.xlabel('Sector', fontweight='bold', fontsize=12)
plt.xticks([r + barWidth / 2 for r in range(len(performance_by_sector))], performance_by_sector['Sector'], rotation=90)
plt.ylabel('Total Score', fontsize=12)

plt.title('Top and Low ESG Reporting Performers by Sector', fontsize=15)
plt.legend()

# Add the names of the top- and low-performing companies
for i, (top_company, low_company) in enumerate(zip(performance_by_sector['Company_top'], performance_by_sector['Company_low'])):
    plt.text(r1[i], performance_by_sector['Total Score_top'].iloc[i] + 0.02, top_company, ha='center', va='bottom', rotation=90, fontsize=10, color='green')
    plt.text(r2[i], performance_by_sector['Total Score_low'].iloc[i] + 0.02, low_company, ha='center', va='bottom', rotation=90, fontsize=10, color='red')

plt.tight_layout()
plt.show()


The double bar chart above visualizes the ESG reporting scores of the top and lowest performers in each sector. We've annotated the bars with the names of the respective companies: green labels mark the top performers and red labels the lowest performers. This chart gives us a comparative snapshot of the range of ESG reporting quality within each sector.

### 2.4 Top Performers in Each Sector
This view shows only the top performers. We do not want to single out companies for which we could not obtain all the information, or whose information is still being provided after we reached out to them. Not every company has an easy-to-access ESG report, after all.

We'll identify the top performers in each sector and visualize the results using a bar chart.

In [None]:
import numpy as np

# Identify the top performer in each sector
top_performers_sector = scores.loc[scores.groupby('Sector')['Total Score'].idxmax()]

# Create the bar chart
plt.figure(figsize=(14, 8))
barWidth = 0.25

# Set position of bars on the x-axis
r1 = np.arange(len(top_performers_sector))

# Make the plot
bars1 = plt.bar(r1, top_performers_sector['Total Score'], color='g', width=barWidth, edgecolor='grey', label='Top Performers')

# Center the xticks under the bars
plt.xlabel('Sector', fontweight='bold', fontsize=12)
plt.xticks(r1, top_performers_sector['Sector'], rotation=90)
plt.ylabel('Total Score', fontsize=12)

plt.title('Top ESG Reporting Performers by Sector', fontsize=15)
plt.legend()

# Add the names of the top-performing companies
for i, top_company in enumerate(top_performers_sector['Company']):
    plt.text(r1[i], top_performers_sector['Total Score'].iloc[i] + 0.02, top_company, ha='center', va='bottom', rotation=90, fontsize=10, color='green')

plt.tight_layout()
plt.show()

## 3. Intra-Company Analyses

Having gained insights into the broader landscape of ESG reporting across sectors and over time, we now narrow our focus to the company level. This section dives deeper into individual corporations' ESG reports to reveal their unique reporting strengths and weaknesses.

By focusing on specific companies, we can discern patterns and trends that may be obscured in aggregate data. This analysis will help us understand:

1. **Performance Over Time:** How has the company's ESG reporting quality evolved over the years?

2. **Strengths and Areas of Improvement:** What are the primary ESG topics where a given company excels, and what areas could be better targeted for improvements?

3. **ESG Landscape Visualization:** Can we use AI to generate an image that describes the company's ESG landscape, highlighting key areas of focus and potential gaps? (Spoiler: Yes)

Let's proceed to explore the ESG reporting performance of a single company. The name set below must match an entry in the `Company` column of the loaded corpus.


In [None]:
# Specify the company to analyze
company_name = "British American Tobacco"

### 3.1 ESG Reporting Quality over Time

In this analysis, we'll examine how a company's reporting quality for the three ESG pillars - Environmental, Social, and Governance - has evolved over time. To provide a concrete example, we'll focus on a specific company and track its reporting quality across the three dimensions in different years.

By comparing the reporting quality for different aspects over time, we can identify potential trends, improvements, or areas where the company might need to put more emphasis.

We use the company selected above and display its ESG performance over time.

In [None]:
# Select data for the chosen company
company_data = scores[scores['Company'] == company_name]

# Plot ESG components over time
plt.figure(figsize=(10, 6))
for component in ["E", "S", "G"]:
    component_data = company_data.groupby('Year')[component].mean().reset_index()
    plt.plot(component_data['Year'], component_data[component], label=component)

plt.title(f'ESG Components Over Time for {company_name}', fontsize=15)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Score', fontsize=12)
plt.legend()
plt.show()

The plot above visualizes the evolution of the Environmental, Social, and Governance reporting quality for a given company over time. By comparing the individual component scores, we can better understand how the company's focus on each ESG dimension has evolved.

### 3.2 Analyzing Strengths and Areas for Improvement

To gauge the performance of a company in specific ESG topics, we take a closer look at its reporting in different areas. Our aim is to identify where the company excels (its 'Strengths') and which areas might require additional attention (its 'Areas for Improvement'). To do this, we identify the top three and bottom three ESG topics for the selected company, ranked by the share of reference-library entries matched per topic.
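
The ranking step can be sketched with the standard library. The topic names and scores below are hypothetical; `heapq.nlargest` and `heapq.nsmallest` play the role that the pandas methods of the same names play in the next cell.

```python
import heapq

# Hypothetical per-topic scores for one report
topic_scores = {"E1": 0.6, "E2": 0.1, "S1": 0.4, "S2": 0.0, "G1": 0.3, "G2": 0.2}


def by_score(item):
    return item[1]


# Three highest and three lowest scoring topics
strengths = heapq.nlargest(3, topic_scores.items(), key=by_score)
weaknesses = heapq.nsmallest(3, topic_scores.items(), key=by_score)

print(strengths)   # [('E1', 0.6), ('S1', 0.4), ('G1', 0.3)]
print(weaknesses)  # [('S2', 0.0), ('E2', 0.1), ('G2', 0.2)]
```

If a report has fewer than six topics, the two lists overlap, so the same topic can appear as both a strength and an area for improvement.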

In [None]:
def get_strengths_weaknesses(company_name):
    # TODO: Remove total from the considered columns
    company_data = scores[scores['Company'] == company_name][tags]  # .select_dtypes(include='int64') # Select only numeric columns
    strengths = company_data.iloc[0].nlargest(3)
    weaknesses = company_data.iloc[0].nsmallest(3)
    return strengths, weaknesses

# Get strengths and weaknesses for the selected company
top_strengths, top_weaknesses = get_strengths_weaknesses(company_name)

# Selected company - Strengths
plt.figure(figsize=(12, 6))
top_strengths.plot(kind='barh', color='green', alpha=0.6)
plt.title(f'{company_name} - Strengths in ESG Reporting', fontsize=15)
plt.xlabel('Share of Matched Guidelines', fontsize=12)
plt.ylabel('ESG Report Sections', fontsize=12)
plt.gca().invert_yaxis()  # Reverse the order of the y-axis for better readability
plt.show()

# Selected company - Areas for Improvement
plt.figure(figsize=(12, 6))
top_weaknesses.plot(kind='barh', color='red', alpha=0.6)
plt.title(f'{company_name} - Areas for Improvement in ESG Reporting', fontsize=15)
plt.xlabel('Share of Matched Guidelines', fontsize=12)
plt.ylabel('ESG Report Sections', fontsize=12)
plt.gca().invert_yaxis()  # Reverse the order of the y-axis for better readability
plt.show()

These visualizations provide a clear perspective on a company's strengths and areas for improvement in ESG reporting, thus informing their strategic decision-making in addressing these gaps.

### 3.3 ESG Landscape Visualization

This final section of our analysis aims to provide an image-based representation of each company's performance in the ESG categories. For a given company, we display an image chosen from a folder named after the letters E, S, and G, where a letter is included only if the company's score for the respective category exceeds a predefined threshold.

The resulting visual representation can provide a quick and easily understandable summary of a company's ESG performance.
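
The letter selection can be isolated into a small function and checked without any image files. The function name `esg_key` and the scores are hypothetical; the threshold comparison and the `'-'` fallback follow the logic of `display_esg_image` in the next cell.

```python
def esg_key(pillar_scores, threshold=0.1):
    """Return the letters whose score is strictly above the threshold, or '-'."""
    letters = "".join(
        letter for letter in "ESG" if pillar_scores.get(letter, 0.0) > threshold
    )
    return letters or "-"


# Hypothetical pillar scores
print(esg_key({"E": 0.30, "S": 0.05, "G": 0.20}))  # EG
print(esg_key({"E": 0.00, "S": 0.00, "G": 0.00}))  # -
print(esg_key({"E": 0.10, "S": 0.50, "G": 0.00}))  # S
```

The comparison is strict, so a score exactly equal to the threshold does not earn its letter. The resulting key is the name of the subfolder of `Auto ESG/images` that is searched for a PNG file.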

In [None]:
import matplotlib.image as mpimg
import glob

def display_esg_image(company_name, threshold=0.1):
    company_data = scores[scores['Company'] == company_name][["E", "S", "G"]].iloc[0]
    esg_str = ''
    for letter, score in company_data.items():
        if score > threshold:
            esg_str += letter

    # If no letters are added, use '-'
    if esg_str == '':
        esg_str = '-'

    # Find the first image in the folder
    image_folder = os.path.join('Auto ESG/images', esg_str)
    image_files = glob.glob(os.path.join(image_folder, '*.png'))  # Assuming the images are in PNG format
    if not image_files:  # If the list is empty
        print(f"No images found in folder {image_folder}")
        return

    img_path = image_files[0]  # Use the first image
    img = mpimg.imread(img_path)

    # Display the image
    plt.figure(figsize=(10, 10))
    plt.imshow(img)
    plt.axis('off')  # Remove axis
    plt.title(f'{company_name} ESG Landscape', fontsize=20)
    plt.show()

# Test the function
display_esg_image(company_name)


## Conclusion

The analysis we have performed using the `semantha_sdk` has provided valuable insights into the ESG reporting quality of different companies across various sectors. It's clear that the quality of ESG reporting varies widely among companies and sectors, and that there are specific areas where certain sectors excel.

We've seen how some sectors have consistently strong performance in ESG reporting while others have room for improvement. The trends over time have shown us how the importance of ESG factors has grown and how companies have adapted to meet these changing expectations.

On an individual company level, we have identified strengths and potential areas of improvement within their ESG reporting. This information can be leveraged to improve future reports and enhance their ESG performance.

The image generation section presented an innovative way to visually represent a company's ESG performance, providing a quick and intuitive understanding of their achievements in each of the ESG components.

This detailed, data-driven approach can help stakeholders make more informed decisions and help companies identify areas where they can make positive changes. As we move towards a future where sustainability and corporate responsibility are increasingly valued, this kind of analysis will be essential for assessing and driving ESG performance.

Thank you for joining us on this deep dive into ESG report analysis. We're looking forward to seeing how these insights will drive the future of ESG reporting and performance.
