This is a proof of concept (PoC) that runs sentiment analysis on Reddit posts.

Here is how I approached it:

 - For the sentiment analysis of the macro-economic factors: I used the parameters given by Karun as keywords to search Reddit posts, then kept only the posts that mention the country_name given by the user or one of its aliases. The result is a sentiment analysis of each factor in relation to the country. I used the best sentiment model I could find.
  - I did not use the further correlations provided by Karun in the sentiment analysis itself, because using them there would mean assuming the sentiment, which would make the analysis pointless. However, we could use them to associate the sentiment results with stock performance if we wish.

Note: The installed and imported libraries could be cleaned up, as they include some old models and trials. These are indicated.


## Setting up Everything

In [76]:
!pip install praw
!pip install pandas
!pip install pycountry
!pip install numpy
!pip install transformers
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [46]:
import praw
import pandas as pd
from textblob import TextBlob
import pycountry
import re
import nltk
nltk.download('stopwords')
nltk.download('vader_lexicon')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer
import numpy

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Country Sentiment Analysis using Macro-Economic Factors
### Sentiment Analysis with FinBert
An attempt to use a better sentiment analyzer than NLTK VADER: FinBERT-Tone (https://huggingface.co/yiyanghkust/finbert-tone).


#### Testing it out

In [86]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
model_name = "yiyanghkust/finbert-tone"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text
text = "The Economy is amazing"

# Tokenize the text
inputs = tokenizer(text, return_tensors="pt")

# Get the model outputs
outputs = model(**inputs)

# Get the predicted class scores
scores = outputs.logits.softmax(dim=1).detach().numpy()[0]
print(scores)

# Get the label names
label_names = model.config.id2label.values()

# Save the scores to variables
positive_score = scores[list(label_names).index('Positive')]
negative_score = scores[list(label_names).index('Negative')]
neutral_score = scores[list(label_names).index('Neutral')]

print(positive_score)
print(negative_score)
print(neutral_score)

[3.9124135e-08 9.9999988e-01 8.4741295e-08]
0.9999999
8.4741295e-08
3.9124135e-08
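
The three `list(label_names).index(...)` lookups above can be collapsed into one label-to-score dictionary. A minimal sketch, assuming the `id2label` mapping implied by the output above (index 0 is Neutral, 1 is Positive, 2 is Negative); `scores_by_label` is a hypothetical helper, and the hard-coded values only stand in for `model.config.id2label` and the softmax output:

```python
def scores_by_label(id2label, scores):
    """Pair each class score with its label name, keyed by label."""
    return {id2label[i]: float(score) for i, score in enumerate(scores)}

# Stand-ins for model.config.id2label and the softmax scores printed above
id2label = {0: 'Neutral', 1: 'Positive', 2: 'Negative'}
scores = [3.9124135e-08, 9.9999988e-01, 8.4741295e-08]

by_label = scores_by_label(id2label, scores)
# The label with the largest score is the predicted tone
top_label = max(by_label, key=by_label.get)
print(top_label, round(by_label[top_label], 2))
```

Looking scores up by label name keeps the code correct even if a different checkpoint orders its classes differently.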


#### Implementation on Reddit

In [91]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
model_name = "yiyanghkust/finbert-tone"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
	
# Setting Up Data Cleaning
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)
    words = [word for word in text.split() if word.lower() not in stop_words]
    words = [lemmatizer.lemmatize(word) for word in words]
    text = ' '.join(words)
    return text.lower()

# Setting Up getting Country Aliases
def get_country_aliases(country_name):
    try:
        country = pycountry.countries.search_fuzzy(country_name)[0]
        aliases = [country.name]
        if country.alpha_2 == 'US':
            aliases += [country.alpha_2, country.alpha_3]
        if hasattr(country, 'official_name'):
            aliases.append(country.official_name)
        if hasattr(country, 'common_name'):
            aliases.append(country.common_name)
        return aliases
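    except LookupError:
        # search_fuzzy raises LookupError when no country matches
        return []

# Setting Up Reddit Connection
# Credentials are read from environment variables instead of being hard-coded.
# The variable names below are placeholders; set them before running the cell.
import os

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="Sentimentent_Analysis_PoC/0.0.1",
    check_for_async=False,
)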

# returns True if properly connected. This is a read_only instance.
# print(reddit.read_only) 

# Setting up the Reddit Scraping
country = input("Enter the name of the country: ")
country_aliases = get_country_aliases(country)

subreddits = ["economics", "stocks", "investing", "wallstreetbets"]
keywords = ["interest rate", "inflation", "exchange rate", "money supply", "GDP", "FII & FDI", "oil prices", "gold prices"]
ignored_keywords = ["PSA: You can"]  

# Setting up the DataFrame to store the scraped posts
df = pd.DataFrame(columns=['Title', 'Post', 'Subreddit', 'keywords_included', 'negative_score', 'neutral_score', 'positive_score', 'url'])

# Actually Scraping and Sentiment Analyzing
for sub in subreddits:
    subreddit = reddit.subreddit(sub)
    for keyword in keywords:
        for post in subreddit.search(keyword, limit=10):
            if any(ignored.lower() in post.title.lower() or ignored.lower() in post.selftext.lower() for ignored in ignored_keywords):
                continue  # ignore the post if it contains ignored keywords (compared case-insensitively)
            if post.score < 100:
                continue  # ignore posts with fewer than 100 upvotes. The threshold is arbitrary and has not been researched.
            post_text = post.title + " " + post.selftext
            cleaned_text = clean_text(post_text)
            if any(alias.lower() in cleaned_text for alias in country_aliases):
                # Tokenize the text
                inputs = tokenizer(cleaned_text, return_tensors="pt", max_length=512, truncation=True)
                # Get the model outputs
                outputs = model(**inputs)
                # Get the predicted class scores
                scores = outputs.logits.softmax(dim=1).detach().numpy()[0]
                # Get the label names
                label_names = model.config.id2label.values()
                # Save the scores to variables
                positive_score = scores[list(label_names).index('Positive')]
                negative_score = scores[list(label_names).index('Negative')]
                neutral_score = scores[list(label_names).index('Neutral')]
                df = pd.concat([df, pd.DataFrame({
                    'Title': [post.title],
                    'Post': [post.selftext],
                    'Subreddit': [sub],
                    'keywords_included': [keyword],
                    'negative_score': [negative_score],
                    'neutral_score': [neutral_score],
                    'positive_score': [positive_score],
                    'url': [post.url]
                })], ignore_index=True)

# Add new columns to the dataframe
# df['Highest Score'] = df[['negative_score', 'positive_score', 'neutral_score']].max(axis=1)
df['Highest Score'] = df[['negative_score', 'positive_score', 'neutral_score']].idxmax(axis=1) + ': ' + df[['negative_score', 'positive_score', 'neutral_score']].max(axis=1).apply(lambda x: '{:.2f}'.format(x))

# Calculate overall positive, negative, and neutral tone for each keyword
df_agg = df.groupby('keywords_included').agg({'negative_score': 'mean', 'positive_score': 'mean', 'neutral_score': 'mean'})
df_agg['Total'] = df_agg.sum(axis=1)  # Sanity check: the three mean scores should sum to 1
# Optional: uncomment the next two lines to express the means as percentages
#df_agg[['negative_score', 'positive_score', 'neutral_score']] = df_agg[['negative_score', 'positive_score', 'neutral_score']].div(df_agg['Total'], axis=0).multiply(100)
#df_agg = df_agg.drop(columns=['Total'])

# Display the modified dataframe with posts, sentiments, and URLs
display(df)

# Display the aggregate sentiment score for each keyword
display(df_agg)

Enter the name of the country: US


| | Title | Post | Subreddit | keywords_included | negative_score | neutral_score | positive_score | url | Highest Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Fed Interest Rate Decision Could Hurt Housing | | economics | interest rate | 0.999997 | 6.862281e-07 | 2.264405e-06 | https://cepr.net/fed-interest-rate-decision-co... | negative_score: 1.00 |
| 1 | Fed Tightening Reduces Horrendous Wealth Dispa... | | economics | interest rate | 0.000309 | 9.907777e-01 | 8.913389e-03 | https://wolfstreet.com/2022/12/21/fed-tighteni... | neutral_score: 0.99 |
| 2 | Latest US inflation data raises questions abou... | | economics | interest rate | 0.947478 | 5.241895e-02 | 1.027064e-04 | https://www.theguardian.com/business/2022/oct/... | negative_score: 0.95 |
| 3 | Documents from 1 May to 30 August 2022 about t... | | economics | interest rate | 0.125963 | 8.709918e-01 | 3.044910e-03 | https://www.rba.gov.au/information/foi/disclos... | neutral_score: 0.87 |
| 4 | The Federal Reserve must choose between inflat... | | economics | inflation | 0.000011 | 9.999886e-01 | 7.480334e-08 | https://finance.yahoo.com/news/federal-must-ch... | neutral_score: 1.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 161 | 1.1 million people are dead from covid-19. Wha... | 1.1 million people are dead from covid-19. Wha... | wallstreetbets | gold prices | 0.975084 | 2.198211e-02 | 2.933634e-03 | https://www.reddit.com/r/wallstreetbets/commen... | negative_score: 0.98 |
| 162 | Gold prices | Ive been wondering about how Gold used to be u... | wallstreetbets | gold prices | 0.019780 | 9.799078e-01 | 3.121550e-04 | https://www.reddit.com/r/wallstreetbets/commen... | neutral_score: 0.98 |
| 163 | Gold Prices Hit by Renewed Bets on Higher Yiel... | I was telling people not to buy Gold or Silver... | wallstreetbets | gold prices | 0.004245 | 9.561986e-01 | 3.955654e-02 | https://www.reddit.com/r/wallstreetbets/commen... | neutral_score: 0.96 |
| 164 | Is the gold price being manipulated to obfusca... | | wallstreetbets | gold prices | 0.033720 | 9.637516e-01 | 2.528664e-03 | https://www.reddit.com/gallery/y4mumr | neutral_score: 0.96 |


| keywords_included | negative_score | positive_score | neutral_score | Total |
|---|---|---|---|---|
| FII & FDI | 0.004496 | 0.03108 | 0.964424 | 1.0 |
| GDP | 0.236769 | 0.217824 | 0.545407 | 1.0 |
| exchange rate | 0.085669 | 0.122229 | 0.792102 | 1.0 |
| gold prices | 0.168358 | 0.248516 | 0.583126 | 1.0 |
| inflation | 0.172598 | 0.143636 | 0.683767 | 1.0 |
| interest rate | 0.139399 | 0.129778 | 0.730823 | 1.0 |
| money supply | 0.14676 | 0.204275 | 0.648966 | 1.0 |
| oil prices | 0.320013 | 0.184387 | 0.4956 | 1.0 |
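
One caveat about the alias filter in the loop above: `alias.lower() in cleaned_text` is a substring test, so a short alias such as `us` also matches inside words like `business`. A minimal sketch of a whole-word check, assuming the lowercased, punctuation-free text that `clean_text` returns; `mentions_alias` is a hypothetical helper and is not what produced the results above:

```python
import re

def mentions_alias(cleaned_text, aliases):
    """Return True if any alias appears as a whole word or phrase in cleaned_text."""
    for alias in aliases:
        # Mirror clean_text: strip punctuation and lowercase the alias
        normalized = re.sub(r'[^\w\s]', '', alias).lower().strip()
        if not normalized:
            continue
        # \b word boundaries stop 'us' from matching inside 'business'
        pattern = r'\b' + re.escape(normalized) + r'\b'
        if re.search(pattern, cleaned_text):
            return True
    return False

aliases = ['United States', 'US', 'USA']
print(mentions_alias('business confidence falling europe', aliases))
print(mentions_alias('us inflation data raise question', aliases))
```

Multi-word aliases that contain stop words (for example "of") may still fail to match, because `clean_text` drops stop words from the post text; passing the aliases through `clean_text` as well would be one way to handle that.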


### Trying to Implement It Using the Inference API
The API's rate limits did not allow me to test this fully.

#### Testing and Playing with the FinBert - Tone Model using the Inference API

In [72]:
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/yiyanghkust/finbert-tone"
# The token is read from an environment variable (placeholder name) instead of being hard-coded.
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

scores = query({"inputs": "The economy is going up"})
print(scores)

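# Caution: these positional lookups assume a fixed Negative/Neutral/Positive order.
# The response printed below is ordered Neutral, Negative, Positive, so the three
# values are mislabelled here; get_scores() further down maps them by label instead.
negative_score = scores[0][0]['score']
neutral_score = scores[0][1]['score']
positive_score = scores[0][2]['score']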

print(negative_score)
print(neutral_score)
print(positive_score)

def get_scores(output):
    scores = {}
    for item in output[0]:
        scores[item['label']] = item['score']
    return scores.get('Negative', 0), scores.get('Neutral', 0), scores.get('Positive', 0)

get_scores(scores)

[[{'label': 'Neutral', 'score': 0.7878367900848389}, {'label': 'Negative', 'score': 0.15684927999973297}, {'label': 'Positive', 'score': 0.05531390383839607}]]
0.7878367900848389
0.15684927999973297
0.05531390383839607


(0.15684927999973297, 0.7878367900848389, 0.05531390383839607)

#### Implementing it
This attempt failed because of the Inference API's rate limits.

In [74]:
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/yiyanghkust/finbert-tone"
# The token is read from an environment variable (placeholder name) instead of being hard-coded.
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


# Setting Up Data Cleaning

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)
    words = [word for word in text.split() if word.lower() not in stop_words]
    words = [lemmatizer.lemmatize(word) for word in words]
    text = ' '.join(words)
    return text.lower()

# Setting Up getting Country Aliases
def get_country_aliases(country_name):
    try:
        country = pycountry.countries.search_fuzzy(country_name)[0]
        aliases = [country.name]
        if country.alpha_2 == 'US':
            aliases += [country.alpha_2, country.alpha_3]
        if hasattr(country, 'official_name'):
            aliases.append(country.official_name)
        if hasattr(country, 'common_name'):
            aliases.append(country.common_name)
        return aliases
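    except LookupError:
        # search_fuzzy raises LookupError when no country matches
        return []

# Setting Up Reddit Connection
# Credentials are read from environment variables instead of being hard-coded.
# The variable names below are placeholders; set them before running the cell.
import os

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="Sentimentent_Analysis_PoC/0.0.1",
    check_for_async=False,
)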

# returns True if properly connected. This is a read_only instance.
# print(reddit.read_only) 

# Setting up the Reddit Scraping
country = input("Enter the name of the country: ")
country_aliases = get_country_aliases(country)

subreddits = ["economics", "stocks", "investing", "wallstreetbets"]
keywords = ["interest rate", "inflation", "exchange rate", "money supply", "GDP", "FII & FDI", "oil prices", "gold prices"]
ignored_keywords = ["PSA: You can"]  

# Setting up Function to order Model Output
def get_scores(output):
    scores = {}
    for item in output[0]:
        scores[item['label']] = item['score']
    return scores.get('Negative', 0), scores.get('Neutral', 0), scores.get('Positive', 0)

# Setting up the DataFrame to store the scraped posts
df = pd.DataFrame(columns=['Title', 'Post', 'Subreddit', 'keywords_included', 'negative_score', 'neutral_score', 'positive_score', 'url'])

# Actually Scraping and Sentiment Analyzing
for sub in subreddits:
    subreddit = reddit.subreddit(sub)
    for keyword in keywords:
        for post in subreddit.search(keyword, limit=10):
            if any(ignored.lower() in post.title.lower() or ignored.lower() in post.selftext.lower() for ignored in ignored_keywords):
                continue  # ignore the post if it contains ignored keywords (compared case-insensitively)
            post_text = post.title + " " + post.selftext
            cleaned_text = clean_text(post_text)
            if any(alias.lower() in cleaned_text for alias in country_aliases):
                sentiment_scores = query({'inputs': cleaned_text})
                sentiment_scores = get_scores(sentiment_scores)
                negative_score = sentiment_scores[0]
                neutral_score = sentiment_scores[1]
                positive_score = sentiment_scores[2]
                df = pd.concat([df, pd.DataFrame({
                    'Title': [post.title],
                    'Post': [post.selftext],
                    'Subreddit': [sub],
                    'keywords_included': [keyword],
                    'negative_score': [negative_score],
                    'neutral_score': [neutral_score],
                    'positive_score': [positive_score],
                    'url': [post.url]
                })], ignore_index=True)

# Add a column with the highest of the three scores.
# The column names must match the ones used when the DataFrame was created above.
score_columns = ['negative_score', 'positive_score', 'neutral_score']
df[score_columns] = df[score_columns].astype(float)
df['Highest Score'] = df[score_columns].max(axis=1)

# Calculate percentage of positive, negative, and neutral tone for each keyword
df_agg = df.groupby('keywords_included')[score_columns].mean()
df_agg['Total'] = df_agg.sum(axis=1)
df_agg[score_columns] = df_agg[score_columns].div(df_agg['Total'], axis=0).multiply(100)
df_agg = df_agg.drop(columns=['Total'])

# Display the modified dataframe with posts, sentiments, and URLs
display(df)

# Display the aggregate sentiment score for each keyword
display(df_agg)

Enter the name of the country: US


KeyError: ignored
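
When the Inference API rejects a request (for example because of rate limits, or while the model is still loading), the JSON body is typically a dict with an `error` key rather than the nested list of label scores, so `output[0]` in `get_scores` fails with a `KeyError`. That may be what happened here, although I have not confirmed it. A defensive variant might look like this; `safe_get_scores` is a hypothetical name, and the error payload in the example is made up:

```python
def safe_get_scores(output):
    """Map an Inference API response to (negative, neutral, positive), or None on failure."""
    # A successful response is assumed to look like [[{'label': ..., 'score': ...}, ...]]
    if not isinstance(output, list) or not output or not isinstance(output[0], list):
        return None
    scores = {item['label']: item['score'] for item in output[0]}
    return scores.get('Negative', 0), scores.get('Neutral', 0), scores.get('Positive', 0)

ok_response = [[
    {'label': 'Neutral', 'score': 0.7878367900848389},
    {'label': 'Negative', 'score': 0.15684927999973297},
    {'label': 'Positive', 'score': 0.05531390383839607},
]]
error_response = {'error': 'example error message'}  # made-up payload for illustration

print(safe_get_scores(ok_response))
print(safe_get_scores(error_response))
```

In the scraping loop, a `None` result could then be skipped with `continue` (or retried after a pause) instead of aborting the whole run.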

### Using NLTK VADER
VADER takes context into account. From a cursory glance at its output, though, it does not seem to do well here, even though its performance should be better than TextBlob's.

https://insight-group.github.io/MFIN7036/sentiment-analysis-lexicon-based-nv-or-tb.html#:~:text=NLTK%20Vader%20focus%20on%20analyzing,entities%20into%20consideration%20by%20POS.


In [54]:
# Setting Up Data Cleaning

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
sia = SentimentIntensityAnalyzer()

def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)
    words = [word for word in text.split() if word.lower() not in stop_words]
    words = [lemmatizer.lemmatize(word) for word in words]
    text = ' '.join(words)
    return text.lower()

# Setting Up getting Country Aliases
def get_country_aliases(country_name):
    try:
        country = pycountry.countries.search_fuzzy(country_name)[0]
        aliases = [country.name]
        if country.alpha_2 == 'US':
            aliases += [country.alpha_2, country.alpha_3]
        if hasattr(country, 'official_name'):
            aliases.append(country.official_name)
        if hasattr(country, 'common_name'):
            aliases.append(country.common_name)
        return aliases
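    except LookupError:
        # search_fuzzy raises LookupError when no country matches
        return []

# Setting Up Reddit Connection
# Credentials are read from environment variables instead of being hard-coded.
# The variable names below are placeholders; set them before running the cell.
import os

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="Sentimentent_Analysis_PoC/0.0.1",
    check_for_async=False,
)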

# returns True if properly connected. This is a read_only instance.
# print(reddit.read_only) 

# Setting up the Reddit Scraping
country = input("Enter the name of the country: ")
country_aliases = get_country_aliases(country)

subreddits = ["economics", "stocks", "investing", "wallstreetbets"]
keywords = ["interest rate", "inflation", "exchange rate", "money supply", "GDP", "FII & FDI", "oil prices", "gold prices"]
ignored_keywords = ["PSA: You can"]  
# Setting up the DataFrame to store the scraped posts
df = pd.DataFrame(columns=['Title', 'Post', 'Subreddit', 'keywords_included', 'sentiment_score', 'url'])

for sub in subreddits:
    subreddit = reddit.subreddit(sub)
    for keyword in keywords:
        for post in subreddit.search(keyword, limit=10):
            if any(ignored.lower() in post.title.lower() or ignored.lower() in post.selftext.lower() for ignored in ignored_keywords):
                continue  # ignore the post if it contains ignored keywords (compared case-insensitively)
            post_text = post.title + " " + post.selftext
            cleaned_text = clean_text(post_text)
            if any(alias.lower() in cleaned_text for alias in country_aliases):
                sentiment_score = sia.polarity_scores(cleaned_text)['compound']
                df = pd.concat([df, pd.DataFrame({
                    'Title': [post.title],
                    'Post': [post.selftext],
                    'Subreddit': [sub],
                    'keywords_included': [keyword],
                    'sentiment_score': [sentiment_score],
                    'url': [post.url]
                })], ignore_index=True)
# Calculate aggregate sentiment score for each keyword
aggregate_sentiment_scores = df.groupby(['keywords_included'])['sentiment_score'].mean()

# Display the dataframe with posts, sentiments, and URLs
display(df)

# Display the aggregate sentiment score for each keyword
display(aggregate_sentiment_scores)



Enter the name of the country: US


| | Title | Post | Subreddit | keywords_included | sentiment_score | url |
|---|---|---|---|---|---|---|
| 0 | Fed Interest Rate Decision Could Hurt Housing | | economics | interest rate | -0.1027 | https://cepr.net/fed-interest-rate-decision-co... |
| 1 | Fed Tightening Reduces Horrendous Wealth Dispa... | | economics | interest rate | -0.0516 | https://wolfstreet.com/2022/12/21/fed-tighteni... |
| 2 | Latest US inflation data raises questions abou... | | economics | interest rate | 0.4588 | https://www.theguardian.com/business/2022/oct/... |
| 3 | Documents from 1 May to 30 August 2022 about t... | | economics | interest rate | 0.8176 | https://www.rba.gov.au/information/foi/disclos... |
| 4 | The Federal Reserve must choose between inflat... | | economics | inflation | -0.5719 | https://finance.yahoo.com/news/federal-must-ch... |
| ... | ... | ... | ... | ... | ... | ... |
| 162 | 1.1 million people are dead from covid-19. Wha... | 1.1 million people are dead from covid-19. Wha... | wallstreetbets | gold prices | -0.9957 | https://www.reddit.com/r/wallstreetbets/commen... |
| 163 | Gold prices | Ive been wondering about how Gold used to be u... | wallstreetbets | gold prices | -0.7845 | https://www.reddit.com/r/wallstreetbets/commen... |
| 164 | Gold Prices Hit by Renewed Bets on Higher Yiel... | I was telling people not to buy Gold or Silver... | wallstreetbets | gold prices | 0.9538 | https://www.reddit.com/r/wallstreetbets/commen... |
| 165 | Is the gold price being manipulated to obfusca... | | wallstreetbets | gold prices | -0.3818 | https://www.reddit.com/gallery/y4mumr |


keywords_included
FII & FDI       -0.011625
GDP              0.079013
exchange rate    0.236321
gold prices      0.184271
inflation        0.486215
interest rate    0.598964
money supply     0.226635
oil prices      -0.025983
Name: sentiment_score, dtype: float64
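
VADER's `compound` value ranges from -1 to 1, which makes it awkward to compare with FinBERT's three classes. One option is to bucket it with the cut-offs commonly suggested for VADER (0.05 and above is positive, -0.05 and below is negative). Whether those cut-offs suit Reddit finance posts is untested here, and `label_compound` is a hypothetical helper:

```python
def label_compound(compound, threshold=0.05):
    """Bucket a VADER compound score into a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Two compound scores taken from the table above, plus an exactly neutral one
for score in (-0.1027, 0.4588, 0.0):
    print(score, label_compound(score))
```

With labels on both sides, the VADER and FinBERT results for the same post can be compared row by row.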

## Old Implementations
Honestly, I am not sure why I haven't deleted these.

Old Scrape Implementation

In [None]:

for sub in subreddits:
    subreddit = reddit.subreddit(sub)
    for post in subreddit.hot(limit=100):
        post_text = post.title + " " + post.selftext
        post_text = clean_text(post_text)
        for keyword in keywords:
            if keyword.lower() in post_text and any(alias.lower() in post_text for alias in country_aliases):
                sentiment_score = sia.polarity_scores(post_text)['compound']
                df = pd.concat([df, pd.DataFrame({'Title': [post.title], 'Post': [post.selftext], 'Subreddit': [sub], 'keywords_included': [keyword], 'sentiment_score': [sentiment_score], 'url': [post.url]})], ignore_index=True)


## ESG Sentiment Analysis
This is still a work in progress. We have two options:
1. I directly use the data provided by Anaya. The issue is deciding which keywords to use for each theme: someone else would need to give me a keyword list for each parameter. I think this would take a lot of time and effort, and may not be worth it.
2. I use one of these two models, which I found from the same academic who published the sentiment analysis model:
- https://huggingface.co/yiyanghkust/finbert-esg-9-categories?text=For+2002%2C+our+total+net+emissions+were+approximately+60+million+metric+tons+of+CO2+equivalents+for+all+businesses+and+operations+we+have+%EF%AC%81nancial+interests+in%2C+based+on+its+equity+share+in+those+businesses+and+operations.+This+is+more+than+our+competition
- https://huggingface.co/yiyanghkust/finbert-esg

You give them text: one tells you roughly whether it belongs to E, S, G, or None, and the other classifies it into one of 9 categories.
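
As a rough sketch of how the 4-class model could be used to sort texts into categories: `build_esg_classifier` assumes `transformers` is installed and the model can be downloaded, and `tally_categories` is a hypothetical helper. The stand-in classifier at the bottom only exists so the example runs without the model; its keyword rule and label names are made up for the demo.

```python
from collections import Counter

def build_esg_classifier():
    """Load the ESG model as a text-classification pipeline (downloads the model)."""
    from transformers import pipeline
    return pipeline("text-classification", model="yiyanghkust/finbert-esg")

def tally_categories(texts, classify):
    """Count how many texts fall into each category.

    classify is any callable that takes a string and returns a list of
    {'label': ..., 'score': ...} dicts, the shape a transformers
    text-classification pipeline returns for a single input.
    """
    counts = Counter()
    for text in texts:
        result = classify(text)
        counts[result[0]['label']] += 1
    return counts

# Stand-in classifier so the sketch runs without downloading the model
def fake_classify(text):
    label = 'Environmental' if 'emissions' in text.lower() else 'None'
    return [{'label': label, 'score': 1.0}]

texts = ['Our net emissions fell this year', 'Quarterly revenue rose']
print(tally_categories(texts, fake_classify))
```

Swapping `fake_classify` for `build_esg_classifier()` would use the real model; long posts would probably need truncating to the model's 512-token limit first, as in the Reddit loop above.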


I am not yet sure how we will use these models for the final ESG sentiment analysis.

My first thought, if we want to do ESG sentiment analysis on all platforms starting with Reddit, is to cycle through Reddit posts that mention the company name and collect a certain number of posts for each of the 9 categories, or perhaps just for each of E, S, and G. We would then run sentiment analysis on them to get an overall sentiment score for each category (the 3 or the 9) and compare it with the overall sentiment score for that category across the sector. To do this, I would need keywords that identify the sector, so that the same process can be run for it.

I think a better use of our time would be to see whether we can do this on the annual report of the company in question, and compare the result to the overall sentiment score in each category for its comparables (which would need to be input by the user).
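
The comparison step could be sketched as follows, assuming each text has already been given a category and a single sentiment score (for example, positive minus negative probability). The function names are hypothetical and the numbers are made up purely for illustration:

```python
from statistics import mean

def mean_by_category(rows):
    """rows: iterable of (category, sentiment_score) pairs -> {category: mean score}."""
    grouped = {}
    for category, score in rows:
        grouped.setdefault(category, []).append(score)
    return {category: mean(scores) for category, scores in grouped.items()}

def compare_to_peers(company_rows, peer_rows):
    """Return {category: company mean minus peer mean} for categories present in both."""
    company = mean_by_category(company_rows)
    peers = mean_by_category(peer_rows)
    return {c: company[c] - peers[c] for c in company if c in peers}

# Made-up scores, not real results
company_rows = [('Environmental', 0.5), ('Environmental', 0.0), ('Social', -0.25)]
peer_rows = [('Environmental', 0.0), ('Social', 0.25), ('Governance', 0.5)]
print(compare_to_peers(company_rows, peer_rows))
```

A positive difference would mean the company's texts read more favourably than its comparables' in that category; categories missing on either side are left out rather than treated as zero.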