# `NB_sebastian_ESGBERT`
This notebook runs `ESGBERT/EnvironmentalBERT-environmental` and `ESGBERT/SocialBERT-social` on the `all_reviews_merged_3000_sample.csv` dataset and appends the results as two new columns.

## NB Setup

In [1]:
!pip install langdetect

Defaulting to user installation because normal site-packages is not writeable


In [2]:
# Import Libraries
import pandas as pd
import numpy as np
import re
from langdetect import detect

In [3]:
# Set display options to show all rows and columns
#pd.set_option('display.max_rows', None)  # Display all rows
#pd.set_option('display.max_columns', None)  # Display all columns
#pd.set_option('display.width', None)  # Auto-adjust width to display all columns

# Reset display options to default after printing
#pd.reset_option('display.max_rows')
#pd.reset_option('display.max_columns')
#pd.reset_option('display.width')

## `df` Setup & Cleaning

In [4]:
# Read df
df = pd.read_csv('all_reviews_merged_3000_sample.csv')
df_copy = df.copy()
df.head(1)

Unnamed: 0,reviewId,asin,date_cleaned,username,title,keyword,verified,rating_cleaned,text,train_test,sustainability (y/n)
0,R2MESKRWXZPAPR,B08SRSWFY6,2021-07-02,Mettea Green,Good for drinks,best Kids and Baby,True,4.0,Bought these in hopes to decrease buying apple...,test,0


In [5]:
# Remove columns 'train_test' and 'sustainability (y/n)'
columns_to_drop = ['train_test', 'sustainability (y/n)']
df = df.drop(columns=columns_to_drop)
df.head(1)

Unnamed: 0,reviewId,asin,date_cleaned,username,title,keyword,verified,rating_cleaned,text
0,R2MESKRWXZPAPR,B08SRSWFY6,2021-07-02,Mettea Green,Good for drinks,best Kids and Baby,True,4.0,Bought these in hopes to decrease buying apple...


In [6]:
# Function to clean and filter reviews
def clean_review(review):
    # Remove leading and trailing whitespaces
    cleaned_text = review.strip()
    
    # Normalize whitespace to single spaces
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    
    # Sentence case: First character of the first word to uppercase, rest to lowercase
    cleaned_text = cleaned_text.capitalize()
    
    # Remove punctuation (including apostrophes), emojis, and other non-word characters
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)
    
    return cleaned_text
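
To make the effect of each cleaning step concrete, the sketch below reruns the same function on two made-up review strings (the inputs are illustrative and not taken from the dataset). It suggests two side effects worth knowing about: apostrophes are stripped along with other punctuation, and removing a standalone symbol after whitespace normalization can leave a double space behind.

```python
import re

def clean_review(review):
    # Same steps as the cell above, repeated so this sketch runs on its own
    cleaned_text = review.strip()
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    cleaned_text = cleaned_text.capitalize()
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)
    return cleaned_text

# Hypothetical inputs, chosen only to exercise each step
example_a = clean_review("  Bought these and   they're PERFECT!  ")
example_b = clean_review("Great value - would buy again")

print(repr(example_a))  # 'Bought these and theyre perfect'
print(repr(example_b))  # 'Great value  would buy again' (note the double space)
```

If the double space matters downstream, one option would be to run the whitespace normalization after the punctuation removal instead of before it.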

In [7]:
# Apply cleaning function to 'text' column
df['cleaned_text'] = df['text'].apply(clean_review)

In [8]:
# Function to detect English reviews
def is_english(text):
    try:
        lang = detect(text)
        return lang == 'en'
    except Exception:  # langdetect raises on empty or feature-less text
        return False

# Filter out non-English reviews
df = df[df['cleaned_text'].apply(is_english)]

In [9]:
# Compare df before and after cleaning
print(df_copy.shape)
print(df.shape)

(3000, 11)
(2780, 10)


# ESGBERT
ESGBERT provides NLP models for sustainable finance and is a sister of ClimateBERT (https://huggingface.co/climatebert).

- HuggingFace Documentation: [ESGBERT Documentation](https://huggingface.co/ESGBERT)
- Medium Tutorial: [Analyzing ESG with AI and NLP (Tutorial#1): Report Analysis Towards ESG Risks and Opportunities](https://medium.com/@schimanski.tobi/analyzing-esg-with-ai-and-nlp-tutorial-1-report-analysis-towards-esg-risks-and-opportunities-8daa2695f6c5)

In [10]:
### MAKE SURE TO INSTALL THIS LIB: !pip install transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline # for using the models

### Load the models (takes ca. 1 min)
# Environmental model.
name = "ESGBERT/EnvironmentalBERT-environmental" # path to download from HuggingFace
# In simple words, the tokenizer prepares the text for the model and the model classifies the text.
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
# The pipeline combines tokenizer and model to one process.
pipe_env = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Also load the social model (the governance model is left commented out below).
# Social model.
name = "ESGBERT/SocialBERT-social"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
pipe_soc = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Governance model.
#name = "ESGBERT/GovernanceBERT-governance"
#tokenizer = AutoTokenizer.from_pretrained(name)
#model = AutoModelForSequenceClassification.from_pretrained(name)
#pipe_gov = pipeline("text-classification", model=model, tokenizer=tokenizer)



### Testing `ESGBERT/EnvironmentalBERT-environmental` and `ESGBERT/SocialBERT-social`

In [11]:
print(pipe_env("I like chocolate.", padding=True, truncation=True))
print(pipe_env("This product damages the environment.", padding=True, truncation=True))
print(pipe_env("The documentary had a strong focus on sustainability.", padding=True, truncation=True))
print(pipe_env("I don't like the way this product makes me feel.", padding=True, truncation=True))
print(pipe_env("The product is renewable and biodegradable.", padding=True, truncation=True))

[{'label': 'none', 'score': 0.9892128705978394}]
[{'label': 'environmental', 'score': 0.9964158535003662}]
[{'label': 'environmental', 'score': 0.9898588061332703}]
[{'label': 'none', 'score': 0.9941224455833435}]
[{'label': 'environmental', 'score': 0.9708414673805237}]


In [12]:
print(pipe_soc("This company behind the product is women and minority-owned.", padding=True, truncation=True))
print(pipe_soc("I like chocolate.", padding=True, truncation=True))
print(pipe_soc("The company does not care about its employees.", padding=True, truncation=True))
print(pipe_soc("Despite the reviews, this is not a good product.", padding=True, truncation=True))
print(pipe_soc("The company cares about social issues.", padding=True, truncation=True))

[{'label': 'social', 'score': 0.9997933506965637}]
[{'label': 'none', 'score': 0.9996131062507629}]
[{'label': 'social', 'score': 0.9997789263725281}]
[{'label': 'none', 'score': 0.9999130964279175}]
[{'label': 'social', 'score': 0.999315619468689}]


## `ESGBERT/EnvironmentalBERT-environmental` - Application
- Documentation: [EnvironmentalBERT-environmental](https://huggingface.co/ESGBERT/EnvironmentalBERT-environmental)

In [13]:
# Function to process environmental predictions
def env_process_predictions(predictions):
    for prediction in predictions:
        if prediction['label'] == 'none':
            return -prediction['score']  # Return negative score if label is 'none'
        elif prediction['label'] == 'environmental':
            return prediction['score']  # Return positive score if label is 'environmental'
        else:
            return np.nan  # Handle unexpected labels
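
The environmental and social post-processing functions in this notebook differ only in the positive label, so one possible consolidation is a single helper that takes the label as a parameter. The name `signed_score` and its arguments are hypothetical. The sketch assumes the pipeline returns a list of `{'label', 'score'}` dictionaries, as in the test outputs above, and it returns NaN for an empty list instead of `None`.

```python
import math

def signed_score(predictions, positive_label):
    """Map a pipeline result to a signed score (hypothetical helper)."""
    if not predictions:
        return math.nan  # empty result: nothing to score
    top = predictions[0]  # the outputs above hold a single top prediction
    if top['label'] == positive_label:
        return top['score']   # positive score for the target label
    if top['label'] == 'none':
        return -top['score']  # negative score when the label is 'none'
    return math.nan  # unexpected label

print(signed_score([{'label': 'environmental', 'score': 0.97}], 'environmental'))
print(signed_score([{'label': 'none', 'score': 0.99}], 'social'))
```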

In [14]:
# Apply pipe_env to each row in 'cleaned_text' column
df['EnvironmentalBERT-environmental'] = df['cleaned_text'].apply(lambda x: pipe_env(x, padding=True, truncation=True))

# Process predictions and store as floats in new column
df['EnvironmentalBERT-environmental'] = df['EnvironmentalBERT-environmental'].apply(env_process_predictions)
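
Calling the pipeline once per row through `.apply` can be slow on a few thousand reviews. Hugging Face pipelines also accept a list of strings, so one option is to send the texts in chunks. The helper `classify_in_chunks` and the stand-in `fake_pipe` below are hypothetical and exist only so the sketch runs without downloading a model. With the real `pipe_env` the call would take the same shape, assuming it returns one result dictionary per input string when given a list.

```python
def classify_in_chunks(pipe, texts, chunk_size=32, **pipe_kwargs):
    """Run `pipe` over `texts` in fixed-size chunks and collect the results."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results.extend(pipe(chunk, **pipe_kwargs))
    return results

def fake_pipe(texts, **kwargs):
    # Stand-in for pipe_env (assumption): one result dict per input string
    return [
        {'label': 'environmental' if 'compost' in t else 'none', 'score': 0.9}
        for t in texts
    ]

texts = ["These bags are compostable", "Great gift", "Works fine"]
results = classify_in_chunks(fake_pipe, texts, chunk_size=2, truncation=True)
print([r['label'] for r in results])  # ['environmental', 'none', 'none']
```

Under that assumption each element is a dictionary and not a one-element list, so the post-processing step would need to read the label and score from it directly.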

In [15]:
# Filter rows where 'EnvironmentalBERT-environmental' > 0
df[df['EnvironmentalBERT-environmental'] > 0][['cleaned_text', 'EnvironmentalBERT-environmental']]

Unnamed: 0,cleaned_text,EnvironmentalBERT-environmental
76,I gathered a lot of insight from watching this...,0.765215
91,These bags go to landfill they are not compost...,0.91902
189,Great product from fossil delivery was sharp g...,0.995022
201,Great gift for little boys with lots of energy...,0.756844
210,I liked the movie better than i thought it mus...,0.934367
219,If youre not putting this under a roof then be...,0.849106
220,Maybe its my fault for not realizing these are...,0.582875
251,Wonderful depiction of history and the struggl...,0.717814
455,4 of solar light motion are not detecting i di...,0.904172
502,Very unique feel and softness to it by far the...,0.564345


## `ESGBERT/SocialBERT-social` - Application
- Documentation: [ESGBERT/SocialBERT-social](https://huggingface.co/ESGBERT/SocialBERT-social)

In [16]:
# Function to process social predictions
def soc_process_predictions(predictions):
    for prediction in predictions:
        if prediction['label'] == 'none':
            return -prediction['score']  # Return negative score if label is 'none'
        elif prediction['label'] == 'social':
            return prediction['score']  # Return positive score if label is 'social'
        else:
            return np.nan  # Handle unexpected labels
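
Because each prediction carries only the top label and its score, the signed score should never fall between -0.5 and 0.5 if the model has exactly two labels and the pipeline applies a softmax. That is an assumption based on the `none`/`social` outputs above, not something the notebook verifies. Under that assumption, the signed value can be mapped to the probability of the positive class, which gives a continuous scale from 0 to 1. The function name `positive_probability` is hypothetical.

```python
def positive_probability(signed_score):
    """Convert a signed score to P(positive label), assuming two labels."""
    if signed_score >= 0:
        return signed_score    # the positive label won with this probability
    return 1.0 + signed_score  # 'none' won with -signed_score, so P(positive) = 1 - that

print(positive_probability(0.75))   # 0.75
print(positive_probability(-0.75))  # 0.25
```

A NaN input passes through as NaN, so rows with unexpected labels stay marked as missing.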

In [17]:
# Apply pipe_soc to each row in 'cleaned_text' column
df['ESGBERT/SocialBERT-social'] = df['cleaned_text'].apply(lambda x: pipe_soc(x, padding=True, truncation=True))

# Process predictions and store as floats in new column
df['ESGBERT/SocialBERT-social'] = df['ESGBERT/SocialBERT-social'].apply(soc_process_predictions)

In [18]:
# Filter rows where 'ESGBERT/SocialBERT-social' < 0 (reviews labelled 'none')
df[df['ESGBERT/SocialBERT-social'] < 0][['cleaned_text', 'ESGBERT/SocialBERT-social']]

Unnamed: 0,cleaned_text,ESGBERT/SocialBERT-social
0,Bought these in hopes to decrease buying apple...,-0.999913
1,I bought vivoactive 3 to replace my polar i ha...,-0.999816
2,Ive had this less then a month and the pump ha...,-0.999931
3,Bought these and theyre perfect,-0.999904
4,I had to return my first one because i got a d...,-0.999916
...,...,...
2995,My daughter and nieces loved this item they co...,-0.999925
2996,This item is something i did not order i order...,-0.999908
2997,So i bought this season back in november 2020 ...,-0.999878
2998,Great for small areas and small people,-0.997623


## Output Dataframe

In [19]:
df

Unnamed: 0,reviewId,asin,date_cleaned,username,title,keyword,verified,rating_cleaned,text,cleaned_text,EnvironmentalBERT-environmental,ESGBERT/SocialBERT-social
0,R2MESKRWXZPAPR,B08SRSWFY6,2021-07-02,Mettea Green,Good for drinks,best Kids and Baby,True,4.0,Bought these in hopes to decrease buying apple...,Bought these in hopes to decrease buying apple...,-0.992591,-0.999913
1,RP965NGI0QV7C,B074KBWL9J,2018-02-04,Ron,I bought Vivoactive 3 to replace my Polar. I ...,Watch,False,2.0,I bought Vivoactive 3 to replace my Polar. I h...,I bought vivoactive 3 to replace my polar i ha...,-0.992330,-0.999816
2,R1X00I45L1PM0X,B07H57NBGP,2021-12-18,Ashley,Broke Within a Month,best Pet supplies,True,1.0,I’ve had this less then a month and the pump h...,Ive had this less then a month and the pump ha...,-0.995453,-0.999931
3,R739IV9ESOX5V,B071ZNFVMD,2020-03-29,superdad,Great bags for the $,best Food and Grocery,True,4.0,Bought these and they're perfect!,Bought these and theyre perfect,-0.991564,-0.999904
4,R24NZ6FVQUABIO,B079775ZZQ,2020-09-19,Jordan,Works great!,best Smart home,True,4.0,I had to return my first one because I got a d...,I had to return my first one because i got a d...,-0.995013,-0.999916
...,...,...,...,...,...,...,...,...,...,...,...,...
2995,RA1JXJK9B4FLF,B08FKZWPGH,2022-01-17,Leslie Edwards,Great Gift!,Toys,True,5.0,My daughter and nieces loved this item. They c...,My daughter and nieces loved this item they co...,-0.995071,-0.999925
2996,R1BMQ5M6JJQXQA,B006W6YHHI,2018-04-04,Thomas Bernardo,"This item is something I did not order, I ...",Pet supplies,True,1.0,"This item is something I did not order, I orde...",This item is something i did not order i order...,-0.995402,-0.999908
2997,R1LBAT6WY0NZW4,B074RKHNV6,2021-04-27,Amazon Customer,,Game,True,3.0,So I bought this season back in November 2020 ...,So i bought this season back in november 2020 ...,-0.993736,-0.999878
2998,R38V9NOYK59YBP,B0765VTBLV,2019-08-27,Anthony Lombardo,Very easy to install,best Outdoor,True,5.0,Great for small areas and small people.,Great for small areas and small people,-0.992423,-0.997623


In [20]:
# Save dataset
df.to_csv('ESGBERT_all_reviews_merged_3000_sample.csv', index=False)
print('Saved as CSV.')

Saved as CSV.
