#### Use GPU (Google Colab or Kaggle) for faster results

To run in Google Colab or Kaggle, uncomment the next cells to check that a GPU is available and to install the dependencies:

In [None]:
# import torch
# print(torch.cuda.is_available())
# print(torch.cuda.get_device_name(0))
# x = torch.randn(1024, 1024, device="cuda")

In [None]:
# !pip install git+https://gitlab.com/netmode/sdg-detector.git
# !pip install neo4j
# !pip install pycountry

# def country_code_converter(input_countries):
#     """
#     :param input_countries: list containing the name of the countries (can be numpy array)
#     :return: list with the ISO alpha 3 codes for the given input (None if no match found)
#     """
#     countries = {}
#     countries_official = {}
#     countries_common = {}

#     #loops over all of the countries contained in the pycountry library and populates dictionary
#     for country in pycountry.countries:
#         countries[country.name] = country.alpha_3

#     #loops over the alpha_3 codes from the countries dictionary
#     #populates dictionary containing official names and codes
#     for alpha_3 in list(countries.values()):
#         try:
#             countries_official[pycountry.countries.get(alpha_3 = alpha_3).official_name] = alpha_3
#         except:
#             None
#     #same for common names
#     for alpha_3 in list(countries.values()):
#         try:
#             countries_common[pycountry.countries.get(alpha_3 = alpha_3).common_name] = alpha_3
#         except:
#             None

#     codes = []
#     # appends ISO codes for all matches by trying different country name types
#     # appends None if no match found
#     for i in input_countries:
#         if i in countries.keys():
#             codes.append(countries.get(i))

#         elif i in countries_official.keys():
#             codes.append(countries_official.get(i))

#         elif i in countries_common.keys():
#             codes.append(countries_common.get(i))

#         else:
#             codes.append(None)
#     return codes

In [1]:
from neo4j import GraphDatabase, basic_auth
import neo4j
import pandas as pd
import numpy as np
import time
import os
from dotenv import load_dotenv
from pathlib import Path
from SDGDetector import SDGDetector
import pycountry
from functions import country_code_converter



There are 1 GPU(s) available.
Device name: NVIDIA RTX PRO 500 Blackwell Generation Laptop GPU


In [4]:
#load the environment variables
dotenv_path = Path('~/.env').expanduser()  # Path does not expand '~' on its own
load_dotenv(dotenv_path=dotenv_path)  # This line brings all environment variables from .env into os.environ

# Get variables
SUSTAINGRAPH_URI = os.getenv('SUSTAINGRAPH_URI')
SUSTAINGRAPH_USER = os.getenv('SUSTAINGRAPH_USER')
SUSTAINGRAPH_PASSWORD = os.getenv('SUSTAINGRAPH_PASSWORD')
database_name = os.getenv('DATABASE_NAME')

# Connect to database
driver = GraphDatabase.driver(SUSTAINGRAPH_URI, auth=(SUSTAINGRAPH_USER, SUSTAINGRAPH_PASSWORD))

# Verify connectivity
with driver.session(database=database_name) as session:
    print(session.run("RETURN 'Connected to ' + $db", db=database_name).single()[0])

Connected to neo4j
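If one of the variables is missing from the `.env` file, `os.getenv` returns `None` and the driver fails with a less obvious error. A minimal sketch of a pre-flight check follows; the helper `missing_env_vars` is hypothetical and not part of the notebook, and the variable names are the ones read in the cell above.

```python
import os

# Variable names taken from the connection cell above.
REQUIRED_VARS = (
    "SUSTAINGRAPH_URI",
    "SUSTAINGRAPH_USER",
    "SUSTAINGRAPH_PASSWORD",
    "DATABASE_NAME",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"SUSTAINGRAPH_URI": "bolt://localhost:7687", "SUSTAINGRAPH_USER": "neo4j"}
print(missing_env_vars(env=example_env))
# ['SUSTAINGRAPH_PASSWORD', 'DATABASE_NAME']
```

Calling `missing_env_vars()` without arguments checks the real environment, so it could be run right after `load_dotenv`.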


In this section, we want to connect the texts regarding the **E**uropean **G**reen **D**eal strategies and the **C**ountry **S**pecific **R**ecommendations with the SDGs.

> CSRs

From an economic development perspective, the Country Specific Recommendations (CSRs) issued by the European Commission to individual EU member states aim to address a wide range of policy areas, including climate change. The CSRs relevant to climate change typically focus on increasing renewable energy sources, improving energy efficiency, promoting sustainable transport, and reducing greenhouse gas emissions in various sectors. These recommendations can be found [here](https://commission.europa.eu/index_en).

> EGD Strategies

At the European Union (EU) level, the European Green Deal (EGD) is a comprehensive plan to transform the region into a sustainable, carbon-neutral economy by 2050. The EGD identifies several priority areas, and multiple documents are produced each year to specify the action plan for each priority area. In the SustainGraph, we focus on the following strategies:


![SustainGraph-egd](https://gitlab.com/netmode/sustaingraph/-/wikis/uploads/cf09f3bd6f06e20cb877a6b9db17df23/image.png)

> SDGDetector Library

We import the textual data into the SustainGraph using the open-source library [SDGDetector](https://gitlab.com/netmode/sdg-detector). The library can classify the given texts to the SDGs with a pre-trained, fine-tuned model, associate them with the SDGs through keyword extraction, or combine the two methods. In our case, we want to relate the documents to the SDGs with a weight, defined by the index ${r}_{SDG}$.

The process of importing the EGD and CSR data is as follows:
- Regarding the EGD, we have created an Excel file '8.EGD_Strategy.xlsx' containing the text of each section of the EGD Strategies. The CSRs are PDF files containing a lot of unnecessary information, so we have to preprocess them to get the desired format. The preprocessing of the CSRs can be found [here](https://gitlab.com/netmode/sdg-text2kg/-/wikis/Country-Specific-Recommendations). At the end of this process, we get a pickle file containing all the recommendations of each country per year. The Excel and pickle files can be found under the folder 'Data'.
- By using these files, we get the index ${r}_{SDG}$ by calling the class SDG_classifier of the SDGDetector library.

> Our methodology

The [SustainNLP GitLab Repository](https://gitlab.com/netmode/sdg-text2kg) explores how NLP techniques can help promote the understanding of the SDGs in different documents. The repository compares different transformer-based models for mapping EGD and CSR texts to the SDGs, and also uses keyword extraction to find the most representative keywords of the texts and associate them with the keywords of the SDGs. More details about our methodology can be found in the repository.

In our experiments, we found that the XLNet transformer-based model achieved an F1-score of 0.9 in our text classification problem. The model can be accessed [here](https://gitlab.com/netmode/sdg-text2kg/-/blob/main/Data/Classification%20Task-Transfer%20Learning/xlnet_model).



The final graph schema is provided below:
![Alt text](wiki/SustainGraph-Policies.png)

### Constraints

Implementing Neo4j database constraints:
- CSR nodes are unique, and are identified by their 'text' property.
- CSRSubpart nodes can be duplicates when it comes to their text, and are identified by the combination of their 'text' and 'csrSubpartNbr' properties.
- PolicyArea nodes are identified by their 'name' property.

In [16]:
def create_constraint(tx,statement):
    tx.run(statement)

constraints = [
    """CREATE CONSTRAINT egd_unique IF NOT EXISTS FOR (n:EGD) REQUIRE (n.name,n.type,n.dateOfReport) IS NODE KEY""",
    """CREATE CONSTRAINT egd_name_type IF NOT EXISTS FOR (n:EGD) REQUIRE n.name IS :: STRING""",
    """CREATE CONSTRAINT egd_type_type IF NOT EXISTS FOR (n:EGD) REQUIRE n.type IS :: STRING""",
    """CREATE CONSTRAINT egd_date_type IF NOT EXISTS FOR (n:EGD) REQUIRE n.dateOfReport IS :: DATE""",

    """CREATE CONSTRAINT csr_unique IF NOT EXISTS FOR (n:CSR) REQUIRE (n.text) IS NODE KEY""",
    """CREATE CONSTRAINT csr_text_type IF NOT EXISTS FOR (n:CSR) REQUIRE n.text IS :: STRING""",
    """CREATE CONSTRAINT csrSubpart_unique IF NOT EXISTS for (n:CSRSubpart) REQUIRE (n.text, n.csrSubpartNbr) IS NODE KEY""",
    """CREATE CONSTRAINT csrSubpart_text_type IF NOT EXISTS FOR (n:CSRSubpart) REQUIRE n.text IS :: STRING""",

    """CREATE CONSTRAINT csr_policyArea_name_unique IF NOT EXISTS FOR (n:PolicyArea) REQUIRE (n.name) IS NODE KEY""",
]

with driver.session(database=database_name) as session:
    for statement_constraint in constraints:
        session.execute_write(create_constraint, statement_constraint)

#### Write batch function

In [11]:
def write_batch(tx,statement, params_list):
    tx.run(statement, parameters={"parameters": params_list})
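The import cells further down decide when to flush a batch with `index % batch_size`, which ties the batch boundaries to the dataframe index. As an alternative sketch, the parameter list can be sliced into fixed-size chunks up front. The generator `iter_batches` is a hypothetical helper, not something the notebook defines.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy parameter list: seven rows split into batches of three.
rows = [{"sdg": str(i)} for i in range(1, 8)]
sizes = [len(batch) for batch in iter_batches(rows, 3)]
print(sizes)
# [3, 3, 1]
```

Each yielded batch could then be passed to `session.execute_write(write_batch, statement=..., params_list=batch)`, in the same way the notebook calls `write_batch`.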

### European Green Deal Documents

#### Read EGD data

In [5]:
# Read EGD dataframe
data_egd = pd.read_excel('Data/5.PolicyFramework_EGD.xlsx')

# Create unique number per section's text
data_egd['number'] = data_egd.index
data_egd.number = data_egd.number.astype(str)
data_egd['unique_code'] = data_egd[['Strategy', 'number']].agg('_'.join, axis=1)

# Remove special characters and spaces
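data_egd['Text'] = data_egd['Text'].str.replace(r'[!@#$ÆØ·-]', '', regex=True)  # regex=True: treat the pattern as a character class
data_egd['Text'] = data_egd['Text'].str.replace(r'\s{2,}', ' ', regex=True)  # collapse runs of whitespace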
data_egd.Strategy = data_egd.Strategy.str.rstrip()

# Convert type of Year column to integer & create Month number column
data_egd.Year = data_egd.Year.astype('int')
months = {'March':3, 'May':5, 'July':7, 'October':10, 'November':11, 'December':12,'February':2}
data_egd['Month_number'] = data_egd['Month'].map(months)
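The two text-cleaning substitutions above can be tried on a single string with the standard `re` module. This is only a sketch of the same two patterns outside pandas; the function name `clean_text` and the sample strings are made up for illustration.

```python
import re

# Same patterns as in the preprocessing cell above.
SPECIAL_CHARS = re.compile(r"[!@#$ÆØ·-]")
MULTI_SPACE = re.compile(r"\s{2,}")


def clean_text(text):
    """Drop the listed special characters, then collapse whitespace runs."""
    return MULTI_SPACE.sub(" ", SPECIAL_CHARS.sub("", text))


print(clean_text("Clean  energy · for all!"))
# Clean energy for all
print(clean_text("climate-neutral"))
# climateneutral
```

The second call shows a side effect worth keeping in mind: the trailing `-` in the character class is a literal hyphen, so hyphenated words are joined.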

#### Calculate the associations by using the SDGDetector
Now we call the SDG_classifier class of the SDGDetector library. We specify the fine-tuned model that we have downloaded, together with the top_keywords, diversity and n_gram_range parameters. Since the transformer-based models have to be loaded into memory, we run this step on Kaggle notebooks with the GPU enabled.

In [None]:
# Add the path of the downloaded fine-tuned model to modelname
combo = SDGDetector.SDG_classifier(pretrained_model_name='XLNet', pretrained_model_path='modelname',
                                sentence_model_name='all-mpnet-base-v2')

start_time = time.time()
sdgs_egd,sdg_names_egd,r_sgd_egd = combo.predict(list(data_egd['Text']),return_association=True)
print("--- %s seconds ---" % (time.time() - start_time))

Loading XLNET model fine-tuned on OSDG-CD...
Loading Sentence Transformer model: all-mpnet-base-v2..
The association of batch 1 of 32 texts with the SDGs by using the model was calculated
The association of batch 2 of 32 texts with the SDGs by using the model was calculated
The association of batch 3 of 32 texts with the SDGs by using the model was calculated
The association of batch 4 of 32 texts with the SDGs by using the model was calculated
The association of batch 5 of 32 texts with the SDGs by using the model was calculated
The association of batch 6 of 32 texts with the SDGs by using the model was calculated
The association of batch 7 of 32 texts with the SDGs by using the model was calculated
The association of batch 8 of 32 texts with the SDGs by using the model was calculated
The association of batch 9 of 32 texts with the SDGs by using the model was calculated
The association of batch 10 of 5 texts with the SDGs by using the model was calculated
The cosine similarity score b

In [30]:
sdg = list(np.arange(1,18,1))*len(data_egd['unique_code'])
codes = np.repeat(data_egd['unique_code'],17)
association_ = list(np.concatenate(r_sgd_egd, axis=0))

df_egd = pd.DataFrame({'code':codes,'Association':association_,'sdg':sdg})
df_egds = df_egd.merge(data_egd[['unique_code','Strategy']],how='left', left_on=['code'], right_on=['unique_code'])
df_egds.drop(['unique_code'],axis=1,inplace=True)
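The cell above turns one score vector per text into a long table with one row per (text, SDG) pair. A plain-Python sketch of that reshaping is given below. It assumes, as the `np.repeat(..., 17)` call suggests, that the detector returns one score per SDG for each text; `to_long_format` is a hypothetical helper and the toy input uses 3 SDGs instead of 17 to stay short.

```python
def to_long_format(codes, associations, n_sdgs=17):
    """Expand one score vector per text into (code, sdg, Association) records."""
    records = []
    for code, scores in zip(codes, associations):
        if len(scores) != n_sdgs:
            raise ValueError(f"expected {n_sdgs} scores for {code}, got {len(scores)}")
        # SDGs are numbered from 1, matching np.arange(1, 18) in the notebook.
        for sdg, score in enumerate(scores, start=1):
            records.append({"code": code, "sdg": sdg, "Association": score})
    return records


toy = to_long_format(["A_0", "B_1"], [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]], n_sdgs=3)
print(len(toy), toy[1])
# 6 {'code': 'A_0', 'sdg': 2, 'Association': 0.7}
```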

#### Prepare EGD data

In [None]:
# Export to a pickle file and reload it
df_egds.to_pickle('EGD_combo_pred.pkl')
df_egds = pd.read_pickle('EGD_combo_pred.pkl')

# Level up to strategy level
df = df_egds.groupby(['Strategy','sdg'])['Association'].mean().reset_index()
df['Association'] = round(df['Association']*100,2)

# Create Policy Areas and Description
policy_areas = {'A Farm to Fork Strategy':['Farm to Fork'],'A New Industrial Strategy for Europe':['Sustainable Industry'],
                'A Renovation Wave for Europe - greening our buildings, creating jobs, improving lives':['Buildings and Renovations'],
                'A hydrogen strategy for a climate-neutral Europe':['Clean Energy'],
                "A new approach for a sustainable blue economy in the EU Transforming the EU's Blue Economy for a Sustainable Future":['Sustainable Industry','Eliminating pollution'],
                'An EU Strategy to harness the potential of offshore renewable energy for a climate neutral future':['Clean Energy'],
                'Chemicals Strategy for Sustainability':['Eliminating pollution'],
                'EU Biodiversity Strategy for 2030':['Biodiversity'],
                'EU Soil Strategy for 2030':['Biodiversity','Eliminating pollution','Climate Action'],
                'EU Solar Energy Strategy':['Clean Energy'],
                'EU Strategy for Sustainable and Circular Textiles':['Sustainable Industry'],
                'EU strategy to reduce methane emissions':['Eliminating pollution'],
                'Forging a climate-resilient Europe - the new EU Strategy on Adaptation to Climate Change':['Climate Action'],
                'New EU Forest Strategy for 2030':['Climate Action','Biodiversity'],
                'Powering a climate-neutral economy: An EU Strategy for Energy System Integration':['Clean Energy'],
                'Strategy for Financing the Transition to a Sustainable Economy':['Sustainable Industry'],
                'Sustainable and Smart Mobility Strategy – putting European transport on track for the future':['Sustainable Mobility']}

df['Policy Area'] = df['Strategy'].map(policy_areas)

df['Description'] = np.where(df['Association']<10, 'very low',
                             np.where((df['Association'] >= 10) & (df['Association'] < 30), 'low',
                             np.where((df['Association'] >= 30) & (df['Association'] < 60), 'medium','high')))

# Create year & month
month_dict = pd.Series(data_egd.Month_number.values,index=data_egd.Strategy).to_dict()
year_dict = pd.Series(data_egd.Year.values,index=data_egd.Strategy).to_dict()
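The nested `np.where` call above encodes four bands on the percentage scale. Written as a scalar function, the thresholds are easier to read and to check at the boundaries. This is a sketch with the same cut-offs; the name `describe_association` is hypothetical.

```python
def describe_association(weight):
    """Map an association percentage to the label used in the notebook."""
    if weight < 10:
        return "very low"
    if weight < 30:
        return "low"
    if weight < 60:
        return "medium"
    return "high"


for value in (9.99, 10, 29.99, 30, 60):
    print(value, describe_association(value))
# 9.99 very low
# 10 low
# 29.99 low
# 30 medium
# 60 high
```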

#### Import data in SustainGraph

In [14]:
# delete old data
def execute_delete(tx, statement):
    tx.run(statement)

delete_statement = """
    MATCH (n:EGD)
    DETACH DELETE n
"""

with driver.session(database=database_name) as session:
    session.execute_write(execute_delete, delete_statement)

In [17]:
# Create EGD nodes, their ASSOCIATED_WITH relationships to Goal nodes and their PolicyArea links in Neo4j, committing in batches.
statement_egd = """
UNWIND $parameters AS row
MATCH (g:Goal {code: row.sdg})
WITH g, row
MERGE (egd:EGD {
    name: row.Strategy,
    type: 'Strategy',
    dateOfReport: date({year: row.year, month: row.month})
})
MERGE (egd)-[:ASSOCIATED_WITH {
    weight: row.weight,
    description: row.desc
}]->(g)
WITH egd, row
FOREACH (policyArea IN row.policy_area |
    MERGE (p:PolicyArea {name: policyArea})
    MERGE (egd)-[:HAS_POLICY_AREA {
        dateOfReport: date({year: row.year, month: row.month})
    }]->(p)
)
"""

# Send the rows to Neo4j in batches, each inside a managed write transaction.
batch_size=10000
params=[]
batch_i = 1
with driver.session(database=database_name) as session:
    for index, row in df.iterrows():
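        # Note: the row is appended once per policy area, so strategies with several
        # policy areas send repeated rows; MERGE leaves the graph unchanged by the repeats.
        for policy in row['Policy Area']: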
            params_dict = {
                'sdg': str(row['sdg']),
                'Strategy':str(row['Strategy']),
                'policy_area':row['Policy Area'],
                'year': int(year_dict[row['Strategy']]),
                'month': int(month_dict[row['Strategy']]),
                'weight':float(row['Association']),
                'desc': str(row['Description'])
            }
            params.append(params_dict)
            if index % batch_size == 0 and index > 0:
                st = time.time()
                session.execute_write(write_batch, params_list = params,statement = statement_egd)
                et = time.time()
                # get the execution time
                elapsed_time = et - st
                print('Batch {} with {} observations : Done! ({} minutes)'.format(batch_i,len(params),elapsed_time/60))
                params = []
                batch_i +=1

    if params:
        st = time.time()  # Record start time for the last batch
        session.execute_write(write_batch, params_list=params, statement=statement_egd)
        et = time.time()
        elapsed_time = et - st
        print('{} observations: Done! ({} minutes)'.format(len(params), elapsed_time/60))

357 observations: Done! (0.003775489330291748 minutes)


> Check cypher query

In [18]:
records, summary, keys = driver.execute_query("""\
    MATCH (n:EGD)-[r:ASSOCIATED_WITH]->(g:Goal)
    RETURN count(distinct(n)) as egd
        """,routing_="r", database_=database_name)
print("{egd} EGDs in {time} ms.".format(
    egd=records[0]['egd'],
    time=summary.result_available_after,
))

records, summary, keys = driver.execute_query("""\
    MATCH (n:EGD)-[r:ASSOCIATED_WITH]->(g:Goal)
    RETURN count(distinct(r)) as rels
        """,routing_="r",database_=database_name)
print("{egd} EGD-ASSOCIATED_WITH-GOAL(expected: {expected}) in {time} ms.".format(
    egd=records[0]['rels'],
    time=summary.result_available_after,
    expected = 17*17
))

17 EGDs in 107 ms.
289 EGD-ASSOCIATED_WITH-GOAL(expected: 289) in 38 ms.


### Country Specific Recommendations

We use the [**Country-specific recommendations database**](https://ec.europa.eu/economy_finance/country-specific-recommendations-database/), which is the main tool for recording and monitoring progress with the implementation of CSRs. All CSRs adopted in the context of the European Semester since 2011 are registered in the database, as well as the Commission services' assessment of progress with their implementation over time.

The CSR data are stored in the respective excel file in the Data folder. Each CSR consists of a main body text and corresponds to one or more Policy Areas; the main body text is split into sub-CSRs, each of which has a corresponding Policy Area as well. Each excel row is marked as 'X.Y' in the CSR number column, where X corresponds to the main CSR and Y to the sub-CSRs it is comprised of. CSRs, sub-CSRs and Policy Areas are represented in the SustainGraph through CSR, CSRSubpart and PolicyArea nodes respectively.

### Preprocessing CSR data

From the excel file, we preprocess the data by:
- extracting the year from the provided date
- converting data provided in the form of semi-colon separated values into python lists
- adding an 'Identifier' column for each row, which concatenates the country, year and version properties

We then filter the dataset on the 'Identifier' column to keep only the most recent version of the data for each country per year. The most recent version is the latest version that exists for a given year, since not every version covers every year.
We create a 'Country code' column using country_code_converter and a 'unique_code' indexing column.

The final dataframe containing all preprocessed data can be found under df_csr.
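The version filter described above can be sketched without pandas as a single pass that keeps, for each (country, year) pair, the version with the largest leading number. The helper `latest_versions` is hypothetical, and the version strings in the example are invented; the only assumption carried over from the notebook is that the part before `' - '` parses as an integer.

```python
def latest_versions(rows):
    """Map (country, year) to the version whose leading number is largest."""
    latest = {}
    for country, year, version in rows:
        rank = int(version.split(" - ")[0])
        key = (country, year)
        # Keep the first version seen when two versions share the same rank.
        if key not in latest or rank > latest[key][0]:
            latest[key] = (rank, version)
    return {key: version for key, (rank, version) in latest.items()}


# Invented rows: (country, year, version)
rows = [
    ("Greece", 2019, "2020 - A"),
    ("Greece", 2019, "2022 - B"),
    ("Greece", 2020, "2022 - B"),
    ("Italy", 2019, "2021 - C"),
]
print(latest_versions(rows))
# {('Greece', 2019): '2022 - B', ('Greece', 2020): '2022 - B', ('Italy', 2019): '2021 - C'}
```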

Due to CSR data being consistently updated, including earlier years, we delete all old CSR data from the SustainGraph each time we re-run this to upload new data.

In [6]:
# delete old data
def execute_delete(tx, statement):
    tx.run(statement)

delete_statement1 = """
    MATCH (n:CSR)
    DETACH DELETE n
"""

delete_statement2 = """
    MATCH (s:CSRSubpart)
    DETACH DELETE s;
"""

with driver.session(database=database_name) as session:
    session.execute_write(execute_delete, delete_statement1)
    session.execute_write(execute_delete, delete_statement2)

In [7]:
def extract_year_month(date_str):
    """Return the leading part of a version string (the text before ' - ') as an integer."""
    year_month = date_str.split(' - ')[0]
    return int(year_month)

def convert_to_list(value):
    """Convert ';'-separated values to a title-cased list; '-' (seen in the EAR(s) column) gives an empty list."""
    if value == '-':
        return []
    return value.title().split(';')

# Read data
data_csr = pd.read_excel('Data/5.PolicyFramework_CSR.xlsx')
print('Original length of data: ', len(data_csr))

# Create Identifier column
data_csr['Identifier'] = data_csr['Country'] + '_' + data_csr['Year'].astype(str) + '_' + data_csr['Version']

# Find the most recent version of data for each country per year by filtering the Identifier column
dict_countries = {}
identifiers = []
for country in data_csr['Country'].unique():
    df_country = data_csr[data_csr['Country'] == country].reset_index(drop=True)
    dict_year_of_country = {}
    for year in df_country['Year'].unique():
        df_year_country = df_country[df_country['Year'] == year]
        versions = list(df_year_country['Version'].unique())
        recent_version = sorted(versions, key=extract_year_month, reverse=True)[0]
        dict_year_of_country[year] = recent_version
        identifiers.append(f"{country}_{year}_{recent_version}")
    dict_countries[country] = dict_year_of_country

# Filter and retain only necessary columns directly to preserve alignment
df_csr = data_csr[data_csr['Identifier'].isin(identifiers)].copy().reset_index(drop=True)
print('Length of data after selecting the most recent version: ', len(df_csr))

# Mapping country names to ISO3 codes (the converter is called once for the whole list)
csr_countries = list(df_csr['Country'].unique())
countries = dict(zip(csr_countries, country_code_converter(csr_countries)))

df_csr['Country code'] = df_csr['Country'].map(countries)

print('Unique texts in the most recent version:', len(df_csr['Text'].unique()))

# Create a dictionary to map unique texts to unique codes
text_to_code = {text: f'code_{i}' for i, text in enumerate(df_csr['Text'].unique())}
df_csr['unique_code'] = df_csr['Text'].map(text_to_code)

# Clean semi-colon-separated strings and turn them into lists
df_csr['EAR(s)'] = df_csr['EAR(s)'].apply(convert_to_list)
df_csr['Policy Area(s)'] = df_csr['Policy Area(s)'].apply(convert_to_list)


Original length of data:  32627
Length of data after selecting the most recent version:  5781
Unique texts in the most recent version: 5128


We proceed to split the df_csr dataframe, which contains all the preprocessed data, into two dataframes, corresponding to main and sub-CSRs: main_rows and sub_rows. We also keep track of sub-CSR to main CSR correspondence through the csr_dict, which lists all sub-CSRs per main CSR as identified by country, year and text.

In [8]:
# Create Main CSR column to keep track of sub-CSR to main CSR numbering
df_csr['CSR Nbr'] = pd.to_numeric(df_csr['CSR Nbr'], errors='coerce')
df_csr['Main_CSR'] = df_csr['CSR Nbr'].astype(str).str.extract(r'^(\d+)')
df_csr['csrSubpart Nbr'] = df_csr['CSR Nbr'].astype(str).str.extract(r'\.(\d+)$')

# Split into CSR main rows (e.g. numbering 1.0) and subrows (e.g. numbering 1.1): main_rows and sub_rows dataframes
main_rows = df_csr[df_csr['CSR Nbr'] % 1 == 0].copy()
sub_rows = df_csr[df_csr['CSR Nbr'] % 1 != 0].copy()
print("Number of main CSR rows in dataframe:",len(main_rows),"and number of sub-rows in dataframe:",len(sub_rows))
print("Total number of rows in dataframe:",len(main_rows)+len(sub_rows), ", expected:",len(df_csr))

# Create dictionary to keep track of sub-CSRs (sub_rows dataframe) correspondence to main CSRs (main_rows dataframe), including numbering
csr_dict = {}

for _, main_row in main_rows.iterrows():
    key = (
        main_row['Country'],
        main_row['Year'],
        main_row['CSR Nbr'],
        main_row['Text']
    )

    # Match subrows by Country, Year, and Main_CSR as values to the csr_dict dictionary
    matching_sub_rows = sub_rows[
        (sub_rows['Country'] == main_row['Country']) &
        (sub_rows['Year'] == main_row['Year']) &
        (sub_rows['Main_CSR'] == str(int(main_row['CSR Nbr'])))
    ]

    csr_dict[key] = {
        'sub_rows': matching_sub_rows.to_dict(orient='records')
    }

print("Number of keys in dictionary:",len(csr_dict),", expected:",len(main_rows))
print("Number of values in dictionary:",sum(len(value['sub_rows']) for value in csr_dict.values()),", expected:",len(sub_rows))

Number of main CSR rows in dataframe: 1348 and number of sub-rows in dataframe: 4433
Total number of rows in dataframe: 5781 , expected: 5781
Number of keys in dictionary: 1348 , expected: 1348
Number of values in dictionary: 4433 , expected: 4433
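The split above relies on reading the 'X.Y' labels as floats. A string-based alternative is sketched here as a design note: parsing as a float would map a label such as '1.10' to the same value as '1.1', whereas splitting on the dot keeps them apart. Whether such labels occur in the data is not stated in the notebook, and both helpers below are hypothetical.

```python
def split_csr_number(csr_nbr):
    """Split an 'X.Y' label into (main, subpart) integers; a bare 'X' gives subpart 0."""
    main, _, sub = str(csr_nbr).strip().partition(".")
    return int(main), int(sub or 0)


def is_main_csr(csr_nbr):
    """A row is a main CSR when its subpart number is 0 (e.g. '1.0')."""
    return split_csr_number(csr_nbr)[1] == 0


for label in ("1.0", "2.3", "1.10", "4"):
    print(label, split_csr_number(label), is_main_csr(label))
# 1.0 (1, 0) True
# 2.3 (2, 3) False
# 1.10 (1, 10) False
# 4 (4, 0) True
```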


#### Find associations by using SDGDetector

We use the SDGDetector library to compute the association of main CSRs with each of the 17 SDGs, utilising a combination of the model and keywords methods.

In [10]:
combo_csr = SDGDetector.SDG_classifier(pretrained_model_name='XLNet',pretrained_model_path='models/xlnet_model', # Add the path of the downloaded fine-tuned model to modelname
                                sentence_model_name='all-mpnet-base-v2')

start_time = time.time()

sdgs_csr,sdg_names_csr,association_csr = combo_csr.predict(list(main_rows['Text']),top_keywords=5,diversity=0.3,
n_gram_range=(1,2),return_association=True)
print("--- %s seconds ---" % (time.time() - start_time))

Loading XLNET model fine-tuned on OSDG-CD...


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Loading Sentence Transformer model: all-mpnet-base-v2..




The association of batch 1 of 32 texts with the SDGs by using the model was calculated
The association of batch 2 of 32 texts with the SDGs by using the model was calculated
The association of batch 3 of 32 texts with the SDGs by using the model was calculated
The association of batch 4 of 32 texts with the SDGs by using the model was calculated
The association of batch 5 of 32 texts with the SDGs by using the model was calculated
The association of batch 6 of 32 texts with the SDGs by using the model was calculated
The association of batch 7 of 32 texts with the SDGs by using the model was calculated
The association of batch 8 of 32 texts with the SDGs by using the model was calculated
The association of batch 9 of 32 texts with the SDGs by using the model was calculated
The association of batch 10 of 32 texts with the SDGs by using the model was calculated
The association of batch 11 of 32 texts with the SDGs by using the model was calculated
The association of batch 12 of 32 texts w

In [11]:
# Create three lists: sdg number, main CSR unique code and association to each sdg from SDGDetector
sdg = list(np.arange(1,18,1))*len(main_rows['unique_code'])
codes = np.repeat(main_rows['unique_code'],17)
association_ = list(np.concatenate(association_csr, axis=0))

# Create association column for the SDGDetector results to be imported to the dataframe
df_association = pd.DataFrame({'unique_code':codes,'Association':association_,'sdg':sdg})
df_association['Association'] = round(df_association.Association*100,2)

# Merge with the dataframe main_rows to obtain all columns
df = main_rows.merge(df_association,how='left', on=['unique_code'])
df.drop(['unique_code'],axis=1,inplace=True)

In [12]:
# Export to a pickle file and reload it
df.to_pickle('CSR_combo_pred.pkl')
df_pkl = pd.read_pickle('CSR_combo_pred.pkl')

# Categorise each row based on 'Association' value: very low, low, medium and high
df_pkl['Description'] = np.where(df_pkl['Association']<10, 'very low',
                             np.where((df_pkl['Association'] >= 10) & (df_pkl['Association'] < 30), 'low',
                             np.where((df_pkl['Association'] >= 30) & (df_pkl['Association'] < 60), 'medium','high')))

#### Import data in SustainGraph

CSR data are imported to the graph through CSR and CSRSubpart nodes.

We import all CSRs and connect them to the corresponding Policy Area through the HAS_POLICY_AREA relationship with the proper PolicyArea nodes, as mapped by the main_rows dataframe. It is crucial to note that **CSR nodes are unique**: should one or more countries share the same CSR for the same or different years, the specific CSR is only imported in the graph once, as one node. This is also the reason why the printed number of imported observations (length of main_rows dataframe) is different to the actual number of imported CSR nodes, which is checked later in the check cypher queries.



In [15]:
# Using the main_rows dataframe, import CSR nodes, merge CSR to GeoArea relationship in the neo4j and commit result in batches.
statement_csr = """
    UNWIND $parameters as row
    MERGE (csr:CSR{text:row.text})
    WITH csr,row
    MATCH (ga:GeoArea)
    WHERE ga.ISOalpha3code = row.geocode
    WITH csr,ga,row,(case when row.mip='Yes' then True else False end) as mip_val
    MERGE (csr)-[:REFERS_TO_AREA{csrNumber:row.csrnumber,versionOfReport:row.version,annualMultiannual:row.annual_multiannual,assessment:row.assessment,ear:row.ear,mip:mip_val,dateOfReport:date({year: row.year})}]->(ga)
    WITH csr, row, ga
    FOREACH (policyArea IN row.pa |
        MERGE (p:PolicyArea {name: policyArea})
        MERGE (csr)-[:HAS_POLICY_AREA{geoAreaISOalpha3code:ga.ISOalpha3code,geoAreaISOalpha2code:ga.ISOalpha2code,geoAreaM49code:ga.M49code,geoAreaEUcode:ga.EUcode,versionOfReport:row.version,dateOfReport:row.year}]->(p)
    )
    """

# Send the rows to Neo4j in batches, each inside a managed write transaction.
batch_size=10000
params=[]
batch_i = 1
with driver.session(database=database_name) as session:
    for index, row in main_rows.iterrows():
        params_dict = {
            'text': str(row['Text']),
            'pa':row['Policy Area(s)'],
            'geocode':str(row['Country code']),
            'csrnumber':int(row['CSR Nbr']),
            'version':str(row['Version']),
            'annual_multiannual':str(row['Annual/Multiannual']),
            'assessment':str(row['Assessment']),
            'mip':str(row['MIP']),
            'ear':row['EAR(s)'],
            'year':int(row['Year'])
        }
        params.append(params_dict)
        if index % batch_size == 0 and index > 0:
            st = time.time()
            session.execute_write(write_batch, params_list = params,statement = statement_csr)
            et = time.time()
            # get the execution time
            elapsed_time = et - st
            print('Batch {} with {} observations : Done! ({} minutes)'.format(batch_i,len(params),elapsed_time/60))
            params = []
            batch_i +=1

    if params:
        st = time.time()  # Record start time for the last batch
        session.execute_write(write_batch, params_list=params, statement=statement_csr)
        et = time.time()
        elapsed_time = et - st
        print('{} observations: Done! ({} minutes)'.format(len(params), elapsed_time/60))

1348 observations: Done! (0.029663928349812827 minutes)


We import the association of CSRs with each of the 17 SDGs through the 'ASSOCIATED_WITH' relationship to each Goal node.

In [16]:
# Associate each CSR with each Goal based on SDGDetector in the neo4j and commit result in batches.
statement_csr = """
    UNWIND $parameters as row
    MATCH (g:Goal{code:row.sdg})
    MATCH (csr:CSR{text:row.text})
    WITH csr,g,row
    MERGE (csr)-[:ASSOCIATED_WITH{weight:row.weight,description:row.desc}]->(g)
    """

# Send the rows to Neo4j in batches, each inside a managed write transaction.
batch_size=10000
params=[]
batch_i = 1
with driver.session(database=database_name) as session:
    for index, row in df_pkl.iterrows(): #from .pkl
        params_dict = {
            'text': str(row['Text']),
            'sdg':str(row['sdg']),
            'weight':float(row['Association']),
            'desc': str(row['Description'])
        }
        params.append(params_dict)
        if index % batch_size == 0 and index > 0:
            st = time.time()
            session.execute_write(write_batch, params_list = params,statement = statement_csr)
            et = time.time()
            # get the execution time
            elapsed_time = et - st
            print('Batch {} with {} observations : Done! ({} minutes)'.format(batch_i,len(params),elapsed_time/60))
            params = []
            batch_i +=1

    if params:
        st = time.time()  # Record start time for the last batch
        session.execute_write(write_batch, params_list=params, statement=statement_csr)
        et = time.time()
        elapsed_time = et - st
        print('{} observations: Done! ({} minutes)'.format(len(params), elapsed_time/60))

Batch 1 with 10001 observations : Done! (0.2102037231127421 minutes)
Batch 2 with 10000 observations : Done! (0.20294456084569296 minutes)
3357 observations: Done! (0.06392921209335327 minutes)
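
The first batch above holds 10001 rows rather than 10000. Assuming `df_pkl` has a default 0-based integer index, the `index % batch_size == 0 and index > 0` check first fires at index 10000, by which point rows 0 through 10000 have been collected. A minimal sketch of an alternative that slices a prepared list into even batches is shown below; `iter_batches` is a hypothetical helper, not part of the notebook, and the row count is the total of the three batches reported above.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the parameter dictionaries: 10001 + 10000 + 3357 rows in total.
rows = list(range(23358))
sizes = [len(batch) for batch in iter_batches(rows, 10000)]
print(sizes)  # [10000, 10000, 3358]
```

Each slice could then be passed to `session.execute_write` in place of the accumulated `params` list. Batch boundaries do not affect the final graph, only the size of each transaction.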


We import all sub-CSRs from the sub_rows dataframe as CSRSubpart nodes. Each CSR is connected to every CSRSubpart it consists of (as defined in csr_dict) through the HAS_SUBPART relationship. As with main CSRs, each CSRSubpart node is connected to its corresponding PolicyArea nodes through the HAS_POLICY_AREA relationship, as mapped by the sub_rows dataframe. Note that some sub-CSRs don't correspond to any Policy Area.

It is also crucial to note that **CSRSubpart nodes can share the same text**. A sub-CSR can be shared among many CSRs, which may be identical or different and may originate from one or many countries. Each CSRSubpart node is uniquely identified by the tuple of its text and its order number (Y in the X.Y format of CSR Number in the original data). Thus, sub-CSRs that share both the text and the position within the main CSR are imported into the graph once, as a single node, while sub-CSRs that share the same text but occupy a different position within the main CSR are imported as separate nodes. We made this choice so that we can keep track of the correspondence to CSR and PolicyArea nodes according to the area of origin of each text. The outgoing 'HAS_POLICY_AREA' relationship records the country of origin of each CSRSubpart by storing the country's geocodes as properties.
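
The identity rule described above can be illustrated with a small sketch. The sample texts and numbers are made up for illustration, and `subpart_key` is a hypothetical helper that assumes the CSR Number is available as a string in X.Y form.

```python
def subpart_key(text, csr_nbr):
    """Build the (text, order number) tuple that identifies a CSRSubpart."""
    parts = str(csr_nbr).split(".")
    order = parts[1] if len(parts) > 1 else ""
    return (text, order)


# Hypothetical sub-CSRs: the first two share text and position, the third only text.
sample_sub_csrs = [
    {"text": "Improve tax collection.", "csr_nbr": "1.1"},
    {"text": "Improve tax collection.", "csr_nbr": "2.1"},
    {"text": "Improve tax collection.", "csr_nbr": "3.2"},
]

unique_keys = {subpart_key(s["text"], s["csr_nbr"]) for s in sample_sub_csrs}
print(len(unique_keys))  # 2
```

Under this rule the three sample rows would produce two CSRSubpart nodes: one for position 1 and one for position 2.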

In [17]:
# Using the sub_rows dataframe and csr_dict, create CSRSubpart nodes, connect them to the related CSR and PolicyArea nodes in Neo4j, and commit the result in batches.
statement_csrsubpart = """
    UNWIND $parameters AS row
    MATCH (csr:CSR {text: row.text})
    MERGE (sp:CSRSubpart {text: row.subtext, csrSubpartNbr: row.csrSubpartNbr})
    MERGE (csr)-[:HAS_SUBPART]->(sp)
    WITH row, sp, csr
    MATCH (csr)-[]-(ga:GeoArea)
    UNWIND row.subpolicyareas AS subpolicyarea
    MATCH (pa:PolicyArea {name: subpolicyarea})
    MERGE (sp)-[:HAS_POLICY_AREA{geoAreaISOalpha3code:ga.ISOalpha3code,geoAreaISOalpha2code:ga.ISOalpha2code,geoAreaM49code:ga.M49code,geoAreaEUcode:ga.EUcode,versionOfReport:row.version,yearOfReport:row.year}]->(pa)
    """

# Collect all parameters first, then write them in batches; each batch runs in its own managed write transaction.
batch_size=10000
params=[]
batch_i = 1

# iterate through dict
for key, value in csr_dict.items():
    main_text = key[3]  # keep text as key
    for sub_row in value['sub_rows']:
        param = {
            'text': main_text,
            'subtext': sub_row['Text'],
            'subpolicyareas': sub_row['Policy Area(s)'],
            'year': sub_row['Year'],
            'version': sub_row['Version'],
            'csrSubpartNbr': sub_row['csrSubpart Nbr']
        }
        params.append(param)

with driver.session(database=database_name) as session:
    for i in range(0, len(params), batch_size):
        batch = params[i:i + batch_size]
        st = time.time()
        session.execute_write(write_batch, params_list=batch, statement=statement_csrsubpart)
        et = time.time()
        print(f'Batch {batch_i} with {len(batch)} observations: Done! ({(et - st)/60:.2f} minutes)')
        batch_i += 1

Batch 1 with 4433 observations: Done! (0.11 minutes)


> Check cypher query

In [18]:
# For main CSR nodes: count the distinct CSRs that are linked to at least one Goal
records_csr, summary, keys = driver.execute_query("""\
    MATCH (n:CSR)-[r:ASSOCIATED_WITH]-(g:Goal) RETURN count(distinct n) as r
        """,routing_="r",database_=database_name)
print("{r} CSR nodes in the graph {time} ms, expected: {expected}".format(
    r=records_csr[0]['r'],
    time=summary.result_available_after,
    expected = len(main_rows['Text'].unique())
))

# Check that there are no duplicate CSR nodes in the graph (should be zero, since CSR nodes must be unique)
records, summary, keys = driver.execute_query("""\
    MATCH (sp:CSR) WITH sp.text AS text, sp.number AS number, count(sp) AS count, collect(sp) AS nodes WHERE count > 1 RETURN text, number, count, nodes
        """,routing_="r",database_=database_name)
# Each returned record is one group of duplicates, so the number of records is the number of duplicated CSRs.
print("Number of duplicate CSR nodes in the graph: {r}, expected 0 (CSR nodes should be unique)".format(
    r=len(records)
))

1337 CSR nodes in the graph 61 ms, expected: 1337
Number of duplicate CSR nodes in the graph: 0, expected 0 (CSR nodes should be unique)


In [19]:
# For CSR to Goals relationships: ensure that every CSR is connected to the 17 Goals
records, summary, keys = driver.execute_query("""\
    MATCH (n:CSR)-[r:ASSOCIATED_WITH]-(g:Goal) RETURN count(distinct r) as r
        """,routing_="r",database_=database_name)
print("{r} 'ASSOCIATED WITH' relationships between CSRs and Goals in {time} ms, expected: {expected}".format(
    r=records[0]['r'],
    time=summary.result_available_after,
    expected = len(main_rows['Text'].unique())*17 # there are 17 Goals
))

22729 'ASSOCIATED WITH' relationships between CSRs and Goals in 48 ms, expected: 22729


In [20]:
# For CSR to GeoArea 'REFERS_TO_AREA' relationships: ensure that every CSR is connected to a GeoArea
records, summary, keys = driver.execute_query("""\
    MATCH (n:CSR)-[rel:REFERS_TO_AREA]->(ga:GeoArea) RETURN count(distinct rel) AS r
        """,routing_="r",database_=database_name)
print("{r} REFERS_TO_AREA relationships between CSRs and GeoAreas in {time} ms, expected: {expected}".format(
    r=records[0]['r'],
    time=summary.result_available_after,
    expected = len(main_rows)
))

1348 REFERS_TO_AREA relationships between CSRs and GeoAreas in 48 ms, expected: 1348
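
The three checks above use different expected values: CSR nodes are compared against the number of unique texts (1337), 'ASSOCIATED_WITH' relationships against that number times 17, and REFERS_TO_AREA relationships against the number of rows (1348). The sketch below reproduces this arithmetic on made-up rows. It assumes that rows sharing a text map to a single CSR node while each row still contributes its own area relationship; the sample data and names are hypothetical.

```python
N_GOALS = 17  # number of SDGs

# Hypothetical main rows: two countries share the same recommendation text.
sample_rows = [
    {"text": "Reduce public debt.", "country": "AAA"},
    {"text": "Reduce public debt.", "country": "BBB"},
    {"text": "Reform the pension system.", "country": "AAA"},
]

expected_nodes = len({row["text"] for row in sample_rows})
expected_goal_links = expected_nodes * N_GOALS
expected_area_links = len(sample_rows)

print(expected_nodes, expected_goal_links, expected_area_links)  # 2 34 3

# The same arithmetic with the counts reported in the outputs above.
assert 1337 * N_GOALS == 22729
```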


In [21]:
# For CSRSubpart nodes
subparts = set()
for key, value in csr_dict.items():
    for sub_row in value['sub_rows']:
        subtext = sub_row['Text']
        csr_nbr = str(sub_row['CSR Nbr'])
        csr_subpart_nbr = csr_nbr.split('.')[1] if '.' in csr_nbr else ''
        subparts.add((subtext, csr_subpart_nbr))

records, summary, keys = driver.execute_query("""\
    MATCH (n:CSRSubpart) RETURN count(n) as r
        """,routing_="r",database_=database_name)

expected_count = len(subparts)
print("{r} CSRSubpart nodes in the graph in {time} ms, expected: {expected}".format(
    r=records[0]['r'],
    time=summary.result_available_after,
    expected=expected_count
))

4056 CSRSubpart nodes in the graph in 57 ms, expected: 4056


In [22]:
# For CSR to CSRSubpart relationships: ensure that every CSRSubpart is accounted for by a main CSR
records, summary, keys = driver.execute_query("""\
    MATCH (sp:CSRSubpart) WHERE NOT (sp)<-[:HAS_SUBPART]-(:CSR) RETURN count(sp) as r
        """,routing_="r",database_=database_name)
print("CSRSubpart nodes without ingoing 'HAS_SUBPART' relationships: {r}, expected 0 (all CSRSubparts should be connected to a CSR)".format(
    r=records[0]['r']
))

CSRSubpart nodes without ingoing 'HAS_SUBPART' relationships: 0, expected 0 (all CSRSubparts should be connected to a CSR)


In [23]:
# For CSRSubpart to PolicyArea relationships: check CSRSubpart to PolicyArea connection (not all sub-CSRs have a corresponding Policy Area)
records, summary, keys = driver.execute_query("""\
    MATCH (c:CSRSubpart) WHERE NOT (c)-[:HAS_POLICY_AREA]->(:PolicyArea) RETURN count(c) as r
        """,routing_="r",database_=database_name)

empty_policy_area_nodes = set()
for idx, row in sub_rows.iterrows():
    text = row['Text']
    number = str(row.get('csrSubpart Nbr', ''))
    areas = row.get('Policy Area(s)', [])

    if not areas or (isinstance(areas, list) and all(not str(a).strip() for a in areas)):
        empty_policy_area_nodes.add((text, number))
print("CSRSubpart nodes without outgoing 'HAS_POLICY_AREA' relationships: {r}, expected: {expected}".format(
    r=records[0]['r'],
    time=summary.result_available_after,
    expected=len(empty_policy_area_nodes)
))

CSRSubpart nodes without outgoing 'HAS_POLICY_AREA' relationships: 2, expected: 2
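
The emptiness test used in the cell above can be isolated into a small predicate, which makes the edge cases easier to inspect. This is a sketch that mirrors the condition in that cell; `has_no_policy_area` is a hypothetical name, and the sample values are made up.

```python
def has_no_policy_area(areas):
    """Return True when a 'Policy Area(s)' value carries no usable area name."""
    if not areas:
        # Covers None, an empty list and an empty string.
        return True
    if isinstance(areas, list):
        # A list counts as empty when every entry is blank after stripping.
        return all(not str(a).strip() for a in areas)
    return False


print(has_no_policy_area([]))            # True
print(has_no_policy_area(["", "  "]))    # True
print(has_no_policy_area(["Taxation"]))  # False
```

Note that a non-list, non-empty value (for example a plain string) is treated as having a Policy Area, matching the behaviour of the original condition.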
