This notebook is based on the ESG tutorial from ESGBERT.

In [1]:
!pip install transformers
!pip install tika




In [2]:
!pip install pandas 
!pip install matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline # for using the models

#import spacy # for sentence extraction
from tika import parser # for the report extraction





Load the three models (environmental, social, and governance)

In [3]:
### Load the models (takes ca. 1 min)
# Environmental model.
name = "ESGBERT/EnvironmentalBERT-environmental" # path to download from HuggingFace
# In simple words, the tokenizer prepares the text for the model and the model classifies the text.
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
# The pipeline combines the tokenizer and the model into a single step.
pipe_env = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Also load the social and governance model.
# Social model.
name = "ESGBERT/SocialBERT-social"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
pipe_soc = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Governance model.
name = "ESGBERT/GovernanceBERT-governance"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
pipe_gov = pipeline("text-classification", model=model, tokenizer=tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
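
The warning above says the pipelines run on CPU because no `device` argument was passed. A minimal sketch of picking a device, assuming PyTorch is the backend; `pick_device` is a hypothetical helper, not part of the tutorial:

```python
def pick_device():
    """Return a device string that can be passed to pipeline(..., device=...)."""
    try:
        import torch  # assumption: transformers is using the PyTorch backend
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"  # first NVIDIA GPU
    # Apple Silicon GPUs are exposed through the MPS backend in recent PyTorch versions.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The result could then be passed when building each pipeline, e.g. `pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)`. I have not timed this on the data here, so treat any speed-up as something to check.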


OK, try a quick test of the environmental model on three example sentences

In [4]:
# You can input single sentences or lists of sentences into the pipeline.
sentences_test = ["Besides financial considerations, we also consider harms to the biodiversity and broader ecosystem impacts.",
                  "Scope 1 emissions are reported here on a like-for-like basis against the 2013 baseline and exclude emissions from additional vehicles used during repairs.",
                  "Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning."]
test = pipe_env(sentences_test)
print(test)

[{'label': 'environmental', 'score': 0.994878888130188}, {'label': 'environmental', 'score': 0.9979760050773621}, {'label': 'none', 'score': 0.997612714767456}]
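
The pipeline returns one dict per input sentence, in the same order as the input, and `score` is the probability of the predicted label (so `none` at 0.998 means a confident "not environmental"). A small sketch of pairing results back with their sentences, reusing the output above; the sentences are shortened here for readability:

```python
# Output copied from the test cell above.
results = [
    {"label": "environmental", "score": 0.994878888130188},
    {"label": "environmental", "score": 0.9979760050773621},
    {"label": "none", "score": 0.997612714767456},
]
# Shortened versions of the three test sentences, in the same order.
sentences = [
    "Besides financial considerations, ...",
    "Scope 1 emissions are reported ...",
    "Tokenization is used ...",
]

# Results come back in input order, so zip lines them up.
for sentence, result in zip(sentences, results):
    print(f"{result['label']:<13} {result['score']:.3f}  {sentence}")

# Keep only the sentences labeled environmental.
env_sentences = [s for s, r in zip(sentences, results) if r["label"] == "environmental"]
print(len(env_sentences))
```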


Now run it with my data, which is already split into sentences

In [25]:
file_path = '/Users/benjamin.williams/Library/CloudStorage/OneDrive-UniversityofDenver/Research/edgar-nlp/data_token.csv'

# Read the CSV into a DataFrame
df = pd.read_csv(file_path)

sentence_list = df['sentence'].tolist()

# Check the length and type to verify
print(len(sentence_list))
print(type(sentence_list))


311828
<class 'list'>


When I tried to run the model on the `sentence_list` data it threw an error, so the next code block makes sure it's a list of strings

In [26]:
str_list = [str(item) for item in sentence_list]
print(type(str_list))

<class 'list'>
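
The error itself is not shown above, but one likely cause (an assumption on my part) is missing values in the `sentence` column, which pandas reads as float `NaN`. Note that `str()` turns those into the literal string `"nan"`, which then gets classified like any other sentence. A sketch of dropping them instead; `clean_sentences` is a hypothetical helper:

```python
import math


def clean_sentences(items):
    """Keep usable text only: drop None, NaN, and blank entries; stringify the rest."""
    cleaned = []
    for item in items:
        if item is None:
            continue
        if isinstance(item, float) and math.isnan(item):
            continue  # pandas represents missing CSV cells as float NaN
        text = str(item).strip()
        if text:
            cleaned.append(text)
    return cleaned


# Made-up sample mixing real text, missing values, a blank, and a number.
sample = ["we reduced scope 1 emissions.", float("nan"), None, "  ", 2023]
print(clean_sentences(sample))  # ['we reduced scope 1 emissions.', '2023']
```

Dropping entries changes the list length, so if the labels need to line up with `df` later, it is safer to drop the same rows from `df` first (for example with `df.dropna(subset=['sentence'])`).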


So now pass `str_list` to the environmental model

In [27]:
env_test = pipe_env(str_list, padding=True, truncation=True)

# You might only want the labels.
env_labels_t = [x["label"] for x in env_test]
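
With 311,828 sentences a single call gives no feedback until it finishes. A sketch of feeding the list through in chunks so progress is visible and partial results survive an interruption; `iter_chunks`, `classify_in_chunks`, and the stand-in `fake_pipe` are hypothetical names used only for this illustration:

```python
def iter_chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_chunks(classifier, texts, chunk_size=1000, **kwargs):
    """Call `classifier` on each chunk and collect the results in input order."""
    results = []
    for chunk in iter_chunks(texts, chunk_size):
        results.extend(classifier(chunk, **kwargs))
        print(f"classified {len(results)} of {len(texts)}")
    return results


def fake_pipe(texts, **kwargs):
    """Stand-in for pipe_env so the sketch runs without downloading a model."""
    return [{"label": "none", "score": 1.0} for _ in texts]


demo = classify_in_chunks(fake_pipe, ["a", "b", "c", "d", "e"], chunk_size=2, truncation=True)
print(len(demo))
```

In the notebook the call would look something like `classify_in_chunks(pipe_env, str_list, chunk_size=10_000, padding=True, truncation=True)`. The pipeline call also accepts a `batch_size` argument, which may help on a GPU; the chunk size here is an arbitrary choice.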


Now make a DataFrame with the sentence, label, and score

In [37]:
env_score_t = [x["score"] for x in env_test]
data_env = pd.DataFrame({"sentence": str_list, "environmental": env_labels_t, "score": env_score_t})
print(data_env)

                                                 sentence  environmental  \
0          a clear vision     2023 sustainabilit y report  environmental   
1       table of contents  a letter from executive man...           none   
2       1   company profile..............................           none   
3       6   vision and progress..........................           none   
4       16   key performance indicators (kpis)...........           none   
...                                                   ...            ...   
311823  moreover, non-financial information, such as t...           none   
311824  historical, current, and forward-looking envir...  environmental   
311825  in addition, while we may seek to align these ...           none   
311826  moreover, our disclosures based on such framew...           none   
311827          park hyatt aviara resort, golf club & spa           none   

           score  
0       0.958016  
1       0.996945  
2       0.997617  
3       0.9
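
Before exporting, a quick summary of the labels can serve as a sanity check. A sketch on a toy frame with the same columns as `data_env`; the rows reuse the test-cell output from earlier, and the 0.995 cutoff is an arbitrary value chosen for illustration:

```python
import pandas as pd

# Toy frame shaped like data_env; scores come from the earlier test cell.
toy = pd.DataFrame({
    "sentence": ["biodiversity sentence", "scope 1 sentence", "tokenization sentence"],
    "environmental": ["environmental", "environmental", "none"],
    "score": [0.994878888130188, 0.9979760050773621, 0.997612714767456],
})

# How many sentences got each label, and what share of the total that is.
label_counts = toy["environmental"].value_counts()
label_share = toy["environmental"].value_counts(normalize=True)

# Environmental sentences above a (hypothetical) confidence cutoff.
confident_env = toy[(toy["environmental"] == "environmental") & (toy["score"] >= 0.995)]

print(label_counts)
print(label_share)
print(confident_env)
```

The same three lines should work on `data_env` directly by swapping `toy` for `data_env`.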

That worked, so now save it as a CSV for analysis in R

In [38]:
data_env.to_csv('/Users/benjamin.williams/Library/CloudStorage/OneDrive-UniversityofDenver/Research/edgar-nlp/esg_hugging_face/esg_env_labels.csv', index=False)