# Sentiment Analysis for Long Text

Sentiment analysis, also known as opinion mining, is a computational technique that involves the use of natural language processing, machine learning, and statistical methods to analyze and determine the emotional tone, attitudes, and opinions expressed in textual data. In an era dominated by vast amounts of user-generated content on social media, reviews, and other online platforms, sentiment analysis has emerged as a crucial tool for understanding and extracting valuable insights from the immense volume of textual information.

The primary objective of sentiment analysis is to classify the sentiment conveyed in a piece of text as positive, negative, or neutral. This enables businesses, researchers, and organizations to gain a deeper understanding of public opinion, customer feedback, and overall sentiment towards products, services, brands, or any other subject of interest. By automating the process of sentiment analysis, it becomes possible to efficiently process and make sense of large datasets, enabling timely and informed decision-making.

Sentiment analysis finds application in various domains, including marketing, customer service, product development, political analysis, and social research. Its versatility makes it a valuable tool for businesses aiming to enhance customer satisfaction, monitor brand perception, and stay attuned to market trends. As technology continues to advance, sentiment analysis methods evolve to handle the complexities of language, cultural nuances, and the dynamic nature of online communication.

Most sentiment analysis methods rely on supervised learning: a machine learning model is trained on a labeled dataset, that is, examples of text paired with their sentiment labels (e.g., positive, negative, or neutral). The model learns patterns and relationships within the labeled data, enabling it to predict the sentiment of new, unseen text.
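To make the supervised setting concrete, here is a minimal sketch that trains a tiny multinomial Naive Bayes classifier on four labeled sentences. Naive Bayes is used only because it fits in a few lines of standard-library Python; it is not the model used in this project, and the sentences, labels, and function names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labeled dataset (hypothetical examples; 1 = positive, 0 = negative).
train = [
    ("stocks rose on strong earnings", 1),
    ("profit growth beat forecasts", 1),
    ("shares fell on weak earnings", 0),
    ("losses widened amid recession fears", 0),
]

def fit_naive_bayes(examples):
    """Count words per class; this counting is the 'training' step."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # ignore words never seen during training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit_naive_bayes(train)
print(predict("strong profit growth", model))        # 1
print(predict("weak shares amid recession", model))  # 0
```

The key point is that every training sentence needs a label, which is exactly what is hard to obtain for long documents.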

However, datasets with sentiment labels for long texts are not readily available online. Our project addresses this gap by building a system that can perform sentiment analysis on long text.

## Sentiment analysis using unsupervised learning

Because labeled long-text data is scarce, our system performs sentiment analysis with unsupervised learning.

### First, we import the necessary libraries

In [2]:
# data processing and Data manipulation
import numpy as np # linear algebra
import pandas as pd # data processing

import sklearn
from sklearn.model_selection import train_test_split
    
# Libraries and packages for NLP
import nltk
import gensim
from gensim.models import Word2Vec

import os
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    
print('*** --> Modules are imported: ')    
print("Python version:", sys.version)
print("numpy version:", np.__version__)
print("pandas version:", pd.__version__)

print("sklearn version:", sklearn.__version__)
print("nltk version:", nltk.__version__)
print("gensim version:", gensim.__version__)

*** --> Modules are imported: 
Python version: 3.7.16 (default, Jan 17 2023, 22:20:44) 
[GCC 11.2.0]
numpy version: 1.17.4
pandas version: 1.3.5
sklearn version: 1.0.2
nltk version: 3.8.1
gensim version: 4.2.0


### Then, we read the data that we want to perform sentiment analysis on

In [3]:
# Load the cleaned news dataset from the data directory, two levels above the current directory
data_path = os.path.abspath(os.path.join(os.pardir,
                                         os.pardir, 
                                         'data/clean_news_dataset.csv'))
df = pd.read_csv(data_path, dtype={'news_body': str})
df.head(3)

Unnamed: 0,news_body
0,"(Bloomberg) -- With just three weeks to go, 20..."
1,Investing.com – Colombia stocks were higher af...
2,Investing.com – Canada stocks were lower after...


### We now perform preprocessing on the text

Sentence preprocessing is a crucial step in preparing textual data for machine learning tasks, including natural language processing (NLP) and sentiment analysis. The goal is to transform raw text into a format that machine learning models can effectively understand and process.

In [5]:
# Adding `src` directory to the directories for interpreter to search
sys.path.append(os.path.abspath(os.path.join('../..','Model/src')))

# Importing functions and classes from utility module
from w2v_utils import (Tokenizer,
                       evaluate_model,
                       bow_vectorizer,
                       train_logistic_regressor,
                       w2v_trainer,
                       calculate_overall_similarity_score,
                       overall_semantic_sentiment_analysis,
                       list_similarity,
                       calculate_topn_similarity_score,
                       topn_semantic_sentiment_analysis,
                       define_complexity_subjectivity_reviews,
                       explore_high_complexity_reviews,
                       explore_low_subjectivity_reviews,
                       text_SSA)

In [6]:
# Instantiating the Tokenizer class
tokenizer = Tokenizer(clean=True,
                      lower=True,
                      de_noise=True,
                      remove_stop_words=True,
                      keep_negation=True)

# Example statement
statement = "I didn't like this movie. It wasn't amusing nor visually interesting . I do not recommend it."
print(tokenizer.tokenize(statement))

['NOTlike', 'movie', 'NOTamusing', 'visually', 'interesting', 'NOTrecommend']
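The `Tokenizer` class lives in the project's `w2v_utils` module, whose source is not shown here. As a rough illustration of what negation-preserving tokenization can look like, the sketch below fuses a negation word into the token that follows it. The function name, the stop-word list, and the exact rules are assumptions made for this example and are not the project's implementation.

```python
import re

# Small illustrative stop-word list (an assumption; a real tokenizer
# would normally use a much larger list, such as NLTK's).
STOP_WORDS = {"i", "it", "this", "did", "do", "was", "nor", "a", "the", "and"}
NEGATIONS = {"not", "no", "never"}

def simple_tokenize(text):
    """Lowercase, expand "n't", fuse negations into the next word, drop stop words."""
    text = text.lower().replace("n't", " not")
    words = re.findall(r"[a-z]+", text)
    tokens, negate = [], False
    for word in words:
        if word in NEGATIONS:
            negate = True              # remember the negation for the next word
            continue
        if negate:
            tokens.append("NOT" + word)
            negate = False
        elif word not in STOP_WORDS:
            tokens.append(word)
    return tokens

statement = ("I didn't like this movie. It wasn't amusing nor visually "
             "interesting . I do not recommend it.")
print(simple_tokenize(statement))
# ['NOTlike', 'movie', 'NOTamusing', 'visually', 'interesting', 'NOTrecommend']
```

Keeping the negation attached to its target lets `like` and `NOTlike` receive separate word embeddings, so the negated form is not averaged together with the affirmative one.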


In [7]:
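# Tokenize the news articles
df['tokenized_text'] = df['news_body'].astype(str).apply(tokenizer.tokenize)

# Token count per article, summarized on a log scale.
# Articles with zero tokens give log(0) = -inf, hence the -inf mean below.
df['tokenized_text_len'] = df['tokenized_text'].apply(len)
df['tokenized_text_len'].apply(np.log).describe()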

count    5.080100e+04
mean             -inf
std               NaN
min              -inf
25%      4.418841e+00
50%      5.293305e+00
75%      5.655992e+00
max      8.008033e+00
Name: tokenized_text_len, dtype: float64
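The `-inf` mean and minimum (and the `NaN` standard deviation) in this summary arise because at least one article has zero tokens, and the logarithm of zero is negative infinity. The sketch below reproduces the effect on a handful of illustrative token counts and shows two common ways to avoid it; the array values are chosen for the example and the variable names are hypothetical.

```python
import numpy as np

# Illustrative token counts, including one empty document.
lengths = np.array([0, 83, 199, 286, 3005])

with np.errstate(divide="ignore"):     # silence the divide-by-zero warning
    raw_log = np.log(lengths)          # log(0) evaluates to -inf
print(raw_log.mean())                  # -inf

# Option 1: drop empty documents before taking the log.
nonzero_log = np.log(lengths[lengths > 0])
print(np.isfinite(nonzero_log).all())  # True

# Option 2: use log(1 + x), which is defined at zero.
print(np.log1p(lengths).min())         # 0.0
```

Either option yields a finite mean and standard deviation for the length distribution.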

In [8]:
df.at[0,"tokenized_text"]

['bloomberg',
 'three',
 'weeks',
 'go',
 '2018',
 'market',
 'contrarians',
 'proving',
 'prescient',
 'outlook',
 'decidedly',
 'bullish',
 'u',
 'stocks',
 'developing',
 'nation',
 'assets',
 '12',
 'months',
 'ago',
 'forecast',
 'build',
 'upon',
 'stellar',
 '2017',
 'beaten',
 'greenback',
 'expected',
 'fare',
 'better',
 '2018',
 'rosy',
 'international',
 'growth',
 'outlook',
 'threatened',
 'lure',
 'investors',
 'away',
 'american',
 'markets',
 'despite',
 'tough',
 'talk',
 'u',
 'china',
 'risks',
 'trade',
 'war',
 'afterthought',
 'NOTmuch',
 'gone',
 'according',
 'plan',
 'dws',
 'cantor',
 'fitzgerald',
 'morgan',
 'stanley',
 'nyse',
 'ms',
 'among',
 'bet',
 'trend',
 'got',
 'right',
 'federal',
 'reserve',
 'rate',
 'hikes',
 'backdrop',
 'sharply',
 'escalating',
 'trade',
 'tensions',
 'roiled',
 'markets',
 '2018',
 'punishing',
 'u',
 'stocks',
 'causing',
 'risk',
 'averse',
 'investors',
 'flee',
 'developing',
 'nations',
 'meanwhile',
 'dollar',
 'gain

### For sentiment analysis using unsupervised learning, we first train the word embedding model

In [10]:
# Training a Word2Vec model
keyed_vectors, keyed_vocab = w2v_trainer(df['tokenized_text'])

### Then, we create positive and negative sets

In [11]:
# Find the most similar words to "good" 
keyed_vectors.most_similar('good',topn=15)

[('decent', 0.5621327757835388),
 ('bad', 0.5583915114402771),
 ('great', 0.5564372539520264),
 ('NOTgood', 0.5504531264305115),
 ('really', 0.5375811457633972),
 ('interesting', 0.5373245477676392),
 ('terrible', 0.5370712280273438),
 ('lot', 0.5247266888618469),
 ('actually', 0.5201089978218079),
 ('kind', 0.5200352668762207),
 ('excellent', 0.5190629959106445),
 ('sort', 0.5139580368995667),
 ('excited', 0.5115196704864502),
 ('perfect', 0.5087762475013733),
 ('definitely', 0.5068932175636292)]
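`most_similar` ranks vocabulary words by cosine similarity to the query word's vector. Note that `bad` and `terrible` appear close to `good` in the list above: word embeddings capture shared context rather than polarity, and antonyms often occur in similar contexts. The sketch below shows the ranking on a toy embedding table; the vectors are invented for illustration and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "good":     np.array([0.9, 0.8, 0.1]),
    "great":    np.array([0.8, 0.9, 0.2]),
    "bad":      np.array([0.7, 0.6, -0.5]),
    "dividend": np.array([-0.2, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(embeddings[word], vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for neighbour, score in most_similar("good"):
    print(neighbour, round(score, 3))
```

Because a single seed word has such mixed neighbours, the next cells build larger, hand-picked positive and negative concept sets instead of relying on the neighbours of `good` or `bad` alone.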

In [57]:
# Positive seed concepts; only those present in the Word2Vec vocabulary are kept below
positive_concepts = [
    'excellent', 'awesome', 'cool', 'decent', 'amazing', 'strong', 'good', 'great', 'funny', 'entertaining',
    'rose', 'new', 'higher', 'best', 'high', 'rising', 'largest', 'highs', 'well', 'advancing',
    'big', 'strong', 'highest', 'rise', 'goods', 'large', 'good', 'boost', 'giant', 'right',
    'largely', 'grow', 'win', 'larger', 'bigger', 'massive', 'wide', 'risen', 'increases', 'effective',
    'appeal', 'success', 'improved', 'improving', 'successful', 'advance', 'optimistic', 'boosting', 'effectively', 'renewed',
    'powerful', 'achieve', 'appropriate', 'winning', 'efficiency', 'outstanding', 'healthy', 'valuable', 'efficient', 'outperform',
    'strongest', 'true', 'experienced', 'preferred', 'outperformed', 'highlighted', 'promote', 'promising', 'awarded', 'elevated',
    'climb', 'bright', 'happy', 'incentive', 'award', 'improvements', 'succeed', 'ultra', 'rises', 'successfully',
    'enjoyed', 'correct', 'wins', 'highlights', 'jumping', 'widened', 'attracting', 'properly', 'newer', 'smooth',
    'highlighting', 'constructive', 'love', 'overcome', 'reliable', 'enhanced',
    'growth', 'profit', 'upward', 'surge', 'increase', 'bullish', 'prosperity', 'success',
    'rise', 'advantage', 'uptrend', 'positive', 'strong', 'improve', 'optimistic', 'robust',
    'healthy', 'vibrant', 'wealth', 'win', 'encouraging', 'optimism', 'peak', 'improvement',
    'upswing', 'victory', 'thriving', 'solid', 'bright', 'reward', 'succeed', 'achievement',
    'progress', 'outperform', 'good', 'favorable', 'benefit', 'high', 'record', 'stellar',
    'sustainable', 'encouragement', 'profitable', 'successful', 'triumph', 'outstanding', 'uplifting',
    'excellent', 'bright', 'triumph', 'encouragement', 'prospering', 'positive', 'optimistic', 'bullish',
    'innovate', 'expansion', 'innovation', 'abundance', 'promising', 'exceed', 'thrive', 'advancement',
    'upbeat', 'favorable', 'solidarity', 'gaining', 'advantageous', 'beneficial', 'forward', 'fruitful',
    'innovative', 'constructive', 'optimization', 'support', 'positive impact', 'success story', 'premium',
    'prominence', 'uplifting', 'booming', 'affluent', 'exemplary', 'upgraded', 'enhanced', 'leading',
    'pinnacle', 'stronghold', 'opulent', 'fertile', 'exuberant', 'felicity', 'fulfillment', 'glorious',
    'jubilant', 'momentum', 'thrive', 'success', 'prosper', 'triumphant', 'upside',
]
pos_concepts = [concept for concept in positive_concepts if concept in keyed_vocab]

In [58]:
print(len(pos_concepts))

193
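The filter above keeps every in-vocabulary entry, including repeats: `strong` and `good`, for example, each appear more than once in `positive_concepts`, so the count of 193 includes duplicates. Multi-word entries such as `'positive impact'` survive only if that exact string is a vocabulary key. If unique concepts are wanted, one possible approach is to deduplicate while preserving order, as in this sketch; the helper name and toy data are hypothetical.

```python
def filter_concepts(concepts, vocab):
    """Keep each in-vocabulary concept once, preserving first-seen order."""
    unique = dict.fromkeys(concepts)   # dicts preserve insertion order
    return [c for c in unique if c in vocab]

# Hypothetical vocabulary and seed list for illustration.
toy_vocab = {"strong", "good", "growth", "profit"}
toy_concepts = ["strong", "good", "strong", "positive impact", "growth", "good"]

print(filter_concepts(toy_concepts, toy_vocab))
# ['strong', 'good', 'growth']
```

Whether duplicates matter depends on how the scoring function aggregates the concept vectors; if it averages them, repeated words may carry extra weight.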


In [49]:
# Find the most similar words to "bad" 
keyed_vectors.most_similar('bad',topn=200)

[('good', 0.5583917498588562),
 ('terrible', 0.4661613404750824),
 ('NOTgood', 0.449031800031662),
 ('delinquent', 0.44534507393836975),
 ('inevitable', 0.4369828999042511),
 ('actually', 0.4325169324874878),
 ('dire', 0.4264444410800934),
 ('impaired', 0.42457154393196106),
 ('NOThelp', 0.4216480851173401),
 ('fake', 0.4207325279712677),
 ('risky', 0.4177751839160919),
 ('worrisome', 0.4131818413734436),
 ('kind', 0.4113539755344391),
 ('certainly', 0.41021767258644104),
 ('badly', 0.4082973301410675),
 ('true', 0.40798938274383545),
 ('NOTwelcome', 0.40792298316955566),
 ('obviously', 0.40641120076179504),
 ('defaults', 0.40503445267677307),
 ('NOTmuch', 0.40414679050445557),
 ('maybe', 0.40231677889823914),
 ('shocking', 0.4011390209197998),
 ('NOTnecessarily', 0.39946448802948),
 ('problem', 0.39700478315353394),
 ('lot', 0.3950379490852356),
 ('worrying', 0.39359360933303833),
 ('clearly', 0.39343690872192383),
 ('problematic', 0.39336955547332764),
 ('worse', 0.39316898584365845)

In [59]:
# Negative seed concepts; only those present in the Word2Vec vocabulary are kept below
negative_concepts = ['terrible','awful','horrible','boring','bad', 'disappointing', 'weak', 'poor', 'senseless','confusing', 
                     'criminal', 'wrongdoing', 'fail', 'depressed', 'stress', 'frustrated', 'pessimistic', 'hopeless', 'worthless',
                     'cheating', 'concern', 'exit', 'exhausted', 'fear', 'fears', 'lost', 'worst',  'decline', 'downturn', 'loss', 'slump', 'recession', 'downfall', 'plunge', 'slump',
    'depreciation', 'bearish', 'contraction', 'deficit', 'underperformance', 'volatility',
    'risky', 'underwhelming', 'debacle', 'woes', 'setback', 'crisis', 'bankruptcy',
    'default', 'weakness', 'instability', 'uncertainty', 'downgrade', 'collapse', 'deterioration',
    'selloff', 'risk', 'disappointing', 'distress', 'negative', 'adverse', 'implode', 'fail',
    'downgrade', 'liquidation', 'anxiety', 'turmoil', 'sluggish', 'unfavorable', 'bear market',
    'warning', 'suffer', 'plummet', 'concerns', 'drag', 'stagnation', 'shrinkage', 'inflation',
    'deflation', 'penalty', 'debt', 'suspension', 'losses', 'sell', 'shortfall', 'crash',
    'downsize', 'unstable', 'disarray', 'recessionary', 'stagnant', 'shrink', 'mismanagement',
    'deficiency', 'unprofitable', 'downward', 'trouble', 'neglect', 'warning', 'anemic', 'exposure',
    'abysmal', 'danger', 'problematic', 'downsize', 'withdrawal', 'depletion', 'dismal', 'weak',
    'stumble', 'red ink', 'slashing', 'fail', 'contagion', 'dump', 'risk-off', 'tightening',
    'stress', 'deficit', 'burden', 'crush', 'worry', 'poor', 'loss-making', 'spur', 'suffering',
    'shock', 'misfortune', 'squeeze', 'hardship', 'crunch', 'retrenchment', 'dumping', 'downward',
    'slashing', 'unfavorable', 'precarious', 'bleeding', 'damage', 'struggle', 'unstable', 'insecure',
    'downside', 'default', 'deteriorate', 'low', 'collapse', 'dumping', 'disappoint', 'rejection',
    'vulnerable', 'costly', 'tumble', 'unrest', 'downhill', 'unfavorable', 'malaise', 'blow', 'hit',
    'defeat', 'backlash', 'devaluation', 'falter', 'liquidity crunch', 'liquidate', 'negative outlook',
    'penalize', 'contraction', 'reduce', 'withhold', 'downbeat', 'strain', 'austerity', 'scarcity',
    'painful', 'disturbing', 'shrink', 'mistrust', 'shortage', 'decay', 'slashing', 'stalemate',
    'fear', 'declining', 'regression', 'shaky', 'shrinkage', 'slash', 'sluggish', 'slash',
    'dismay', 'hard hit', 'bailout', 'deplete', 'cramp', 'withdraw', 'withdrawal', 'rejection',
    'discourage', 'diminish', 'wane', 'displeasure', 'ineffective', 'discomfort', 'implode', 'crisis',
    'devaluate', 'downhearted', 'harm', 'crackdown', 'dissatisfaction', 'implode', 'downcast', 'weak',
    'ailing', 'depletion', 'challenging', 'defensive', 'failure', 'negative impact', 'undercut',
    'negative trend', 'dampen', 'downer', 'grim', 'downslide', 'drop', 'hard times', 'troubled',
    'nosedive', 'lackluster', 'bankrupt', 'liquidity crisis', 'downward spiral', 'displeasing',
    'unprofitable', 'disadvantage', 'depressed', 'downcast', 'catastrophe', 'lousy', 'despondent',
    'discouraging', 'desperation', 'forfeit', 'insecure', 'oppressive', 'panicky', 'stressed',
    'unfavorable', 'wounded', 'overwhelmed', 'unimpressive', 'inadequate', 'unfavorable', 'bitter',
    'severe', 'bleak', 'grim', 'disastrous', 'gloomy', 'woeful', 'ominous', 'harsh', 'pessimistic',
    'tragic', 'meltdown', 'debilitate', 'downhearted', 'deplorable', 'grieve', 'dismal', 'cataclysmic',
    'melancholy', 'lugubrious', 'disquieting', 'melancholic', 'grieving', 'heartbreaking', 'ruinous',
    'regrettable', 'foreboding', 'foresee', 'unsettling', 'unfavorable', 'ominous', 'worrying', 'strife',
    'adversity', 'woeful', 'melancholy', 'morose', 'unfavorable', 'disheartening', 'precarious', 'turbulent',
    'bankruptcy', 'melancholy', 'melancholy', 'unfavorable', 'undercut', 'fail', 'unsteady', 'fall', 'disarray',
    'fail', 'downhearted', 'crash', 'subside', 'hazard', 'fail', 'lamentable', 'subside', 'grievous',
    'mismanagement', 'disconcerting', 'unstable', 'hinder', 'unsound', 'hardship', 'abysmal', 'setback',
    'hardship', 'gloom', 'dismal', 'unfavorable', 'low', 'unfortunate', 'downer', 'downcast', 'lamentable',
    'unfavorable',
                    ] 
neg_concepts = [concept for concept in negative_concepts if concept in keyed_vocab]

In [60]:
print(len(neg_concepts))

293


In [15]:
df['tokenized_text']

0        [bloomberg, three, weeks, go, 2018, market, co...
1        [investing, com, colombia, stocks, higher, clo...
2        [investing, com, canada, stocks, lower, close,...
3        [washington, reuters, three, u, senators, thur...
4        [bloomberg, u, investors, looking, get, volati...
                               ...                        
50796    [investing, com, china, stocks, higher, close,...
50797    [bloomberg, canada, drive, legalize, marijuana...
50798    [sao, paulo, reuters, former, executives, braz...
50799    [bloomberg, dozen, chinese, provinces, announc...
50800    [nate, raymond, reuters, four, major, u, citie...
Name: tokenized_text, Length: 50801, dtype: object

In [16]:
df = df[df["tokenized_text"].notna()]

In [17]:
# Print the index of every article whose token list is empty
for ind, row in df.iterrows():
    if len(row['tokenized_text']) == 0:
        print(ind)

10599
32703
37590


In [18]:
# Drop the three articles with empty token lists found above
df.drop(index=[10599, 32703, 37590], inplace=True)

In [None]:
# Calculating semantic sentiment scores with the OSSA model
overall_df_scores = overall_semantic_sentiment_analysis(keyed_vectors=keyed_vectors,
                                                        positive_target_tokens=pos_concepts,
                                                        negative_target_tokens=neg_concepts,
                                                        doc_tokens=df['tokenized_text'])

# Calculating semantic sentiment scores with the TopSSA model
topn_df_scores = topn_semantic_sentiment_analysis(keyed_vectors=keyed_vectors,
                                                  positive_target_tokens=pos_concepts,
                                                  negative_target_tokens=neg_concepts,
                                                  doc_tokens=df['tokenized_text'],
                                                  topn=30)

# Store the semantic sentiment scores computed by the OSSA model in df
df['overall_PSS'] = overall_df_scores[0]
df['overall_NSS'] = overall_df_scores[1]
df['overall_semantic_sentiment_score'] = overall_df_scores[2]
df['overall_semantic_sentiment_polarity'] = overall_df_scores[3]

# Store the semantic sentiment scores computed by the TopSSA model in df
df['topn_PSS'] = topn_df_scores[0]
df['topn_NSS'] = topn_df_scores[1]
df['topn_semantic_sentiment_score'] = topn_df_scores[2]
df['topn_semantic_sentiment_polarity'] = topn_df_scores[3]
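The scoring functions come from the project's `w2v_utils` module, whose source is not shown here. As a rough sketch of how an overall semantic sentiment score of this kind could be computed, the example below assumes (this is an assumption, not the confirmed implementation) that a document is represented by the mean of its token vectors, that the positive and negative scores are cosine similarities to the mean concept vectors, and that polarity is the sign of their difference. The embeddings and function names are invented for illustration.

```python
import numpy as np

# Hypothetical 2-D embeddings, invented for illustration.
toy_vectors = {
    "rose": np.array([1.0, 0.2]), "gain": np.array([0.9, 0.1]),
    "fell": np.array([0.1, 1.0]), "loss": np.array([0.2, 0.9]),
    "stocks": np.array([0.5, 0.5]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0)

def overall_score(doc_tokens, pos_tokens, neg_tokens, vectors):
    """Return (PSS, NSS, score, polarity) under the assumptions stated above."""
    doc_vec = mean_vector(doc_tokens, vectors)
    pss = cosine(doc_vec, mean_vector(pos_tokens, vectors))  # positive similarity
    nss = cosine(doc_vec, mean_vector(neg_tokens, vectors))  # negative similarity
    score = pss - nss
    return pss, nss, score, int(score > 0)

pos, neg = ["rose", "gain"], ["fell", "loss"]
print(overall_score(["stocks", "rose"], pos, neg, toy_vectors)[3])  # 1
print(overall_score(["stocks", "fell"], pos, neg, toy_vectors)[3])  # 0
```

Under this formulation a document with no known tokens has no mean vector, which is one plausible reason for dropping the empty articles before scoring.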


### Test on long news data

In [92]:
test_df = pd.read_csv("../../data/test_news_labeled_baseline.csv")
test_df

Unnamed: 0,news_body,sentiment,raw_text,distilbert_base_uncased_label,vader_label,sentiwordnet_label
0,"(Bloomberg) -- With just three weeks to go, 20...",1,three go market proving outlook decidedly bull...,0,1,1
1,Investing.com ? Colombia stocks were higher af...,1,stocks higher close gains led close rose best ...,0,0,0
2,Investing.com ? Canada stocks were lower after...,0,canada stocks lower close consumer discretiona...,0,0,0
3,WASHINGTON (Reuters) - Three U.S. senators on ...,0,three said stop working delivery violate labor...,0,1,1
4,(Bloomberg) -- U.S. investors looking to get i...,1,looking get volatile stock market without hass...,0,1,1
...,...,...,...,...,...,...
391,(Bloomberg) -- Goldman Sachs Group Inc (NYSE...,1,group. unit billion said economic slowdown big...,0,1,1
392,Investing.com - Facebook (NASDAQ:FB) Stock fel...,0,stock fell trade volume since start session. r...,0,1,0
393,(Reuters) - United Parcel Service Inc (N:UPS...,1,united parcel service n rise quarterly revenue...,1,1,0
394,By Vibhuti Sharma(Reuters) - Netflix Inc (O:NF...,1,raising monthly percent percent video streamin...,0,1,1


In [93]:
test_df['tokenized_news'] = test_df['raw_text'].apply(lambda text: tokenizer.tokenize(text))
test_df

Unnamed: 0,news_body,sentiment,raw_text,distilbert_base_uncased_label,vader_label,sentiwordnet_label,tokenized_news
0,"(Bloomberg) -- With just three weeks to go, 20...",1,three go market proving outlook decidedly bull...,0,1,1,"[three, go, market, proving, outlook, decidedl..."
1,Investing.com ? Colombia stocks were higher af...,1,stocks higher close gains led close rose best ...,0,0,0,"[stocks, higher, close, gains, led, close, ros..."
2,Investing.com ? Canada stocks were lower after...,0,canada stocks lower close consumer discretiona...,0,0,0,"[canada, stocks, lower, close, consumer, discr..."
3,WASHINGTON (Reuters) - Three U.S. senators on ...,0,three said stop working delivery violate labor...,0,1,1,"[three, said, stop, working, delivery, violate..."
4,(Bloomberg) -- U.S. investors looking to get i...,1,looking get volatile stock market without hass...,0,1,1,"[looking, get, volatile, stock, market, withou..."
...,...,...,...,...,...,...,...
391,(Bloomberg) -- Goldman Sachs Group Inc (NYSE...,1,group. unit billion said economic slowdown big...,0,1,1,"[group, unit, billion, said, economic, slowdow..."
392,Investing.com - Facebook (NASDAQ:FB) Stock fel...,0,stock fell trade volume since start session. r...,0,1,0,"[stock, fell, trade, volume, since, start, ses..."
393,(Reuters) - United Parcel Service Inc (N:UPS...,1,united parcel service n rise quarterly revenue...,1,1,0,"[united, parcel, service, n, rise, quarterly, ..."
394,By Vibhuti Sharma(Reuters) - Netflix Inc (O:NF...,1,raising monthly percent percent video streamin...,0,1,1,"[raising, monthly, percent, percent, video, st..."


In [94]:
# Score the labeled test articles with the OSSA model
overall_df_scores = overall_semantic_sentiment_analysis(keyed_vectors=keyed_vectors,
                                                        positive_target_tokens=pos_concepts,
                                                        negative_target_tokens=neg_concepts,
                                                        doc_tokens=test_df['tokenized_news'])

In [95]:
for i in overall_df_scores[3]:
    print(i)

1
1
0
1
1
0
0
1
1
0
1
1
0
1
0
1
1
0
1
0
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
0
1
0
1
1
0
1
1
1
1
1
1
0
1
1
1
1
1
1
0
1
1
0
1
0
0
1
1
0
1
1
0
0
1
0
1
1
0
1
1
1
1
0
1
0
1
0
1
1
0
1
0
1
1
0
1
1
1
1
1
1
0
0
0
1
1
1
1
0
1
0
1
1
1
1
1
1
1
0
1
0
1
0
1
1
1
0
0
1
1
1
1
1
1
1
1
0
1
0
0
0
1
1
1
0
1
0
1
0
1
1
0
0
1
1
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
0
1
1
0
1
1
0
1
1
1
0
1
0
0
1
1
1
1
1
1
0
1
1
0
1
1
1
1
1
1
1
1
0
1
0
1
1
0
1
0
0
0
1
1
1
0
0
1
1
0
0
0
1
0
1
1
1
0
1
1
1
0
1
1
1
1
0
1
0
1
0
1
1
0
1
1
1
1
1
1
1
0
1
1
1
0
1
1
1
0
1
0
1
0
1
1
1
1
1
0
0
0
1
0
1
0
1
0
0
1
1
1
1
0
1
0
1
1
0
1
0
0
0
1
1
0
1
1
1
0
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
0
0
1
1
0
1
1
0
1
1
1
1
1
1
1
1
0
0
1
1
1
1
0
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
0
1
1
1
1
1
0
0
1
1
1
1
1
1
1
1
1
1
1
0
0
1
1
1
1
1
1
1
1
1
1


In [96]:
test_df['salt_label'] = overall_df_scores[3]

In [97]:
test_df

Unnamed: 0,news_body,sentiment,raw_text,distilbert_base_uncased_label,vader_label,sentiwordnet_label,tokenized_news,salt_label
0,"(Bloomberg) -- With just three weeks to go, 20...",1,three go market proving outlook decidedly bull...,0,1,1,"[three, go, market, proving, outlook, decidedl...",1
1,Investing.com ? Colombia stocks were higher af...,1,stocks higher close gains led close rose best ...,0,0,0,"[stocks, higher, close, gains, led, close, ros...",1
2,Investing.com ? Canada stocks were lower after...,0,canada stocks lower close consumer discretiona...,0,0,0,"[canada, stocks, lower, close, consumer, discr...",0
3,WASHINGTON (Reuters) - Three U.S. senators on ...,0,three said stop working delivery violate labor...,0,1,1,"[three, said, stop, working, delivery, violate...",1
4,(Bloomberg) -- U.S. investors looking to get i...,1,looking get volatile stock market without hass...,0,1,1,"[looking, get, volatile, stock, market, withou...",1
...,...,...,...,...,...,...,...,...
391,(Bloomberg) -- Goldman Sachs Group Inc (NYSE...,1,group. unit billion said economic slowdown big...,0,1,1,"[group, unit, billion, said, economic, slowdow...",1
392,Investing.com - Facebook (NASDAQ:FB) Stock fel...,0,stock fell trade volume since start session. r...,0,1,0,"[stock, fell, trade, volume, since, start, ses...",1
393,(Reuters) - United Parcel Service Inc (N:UPS...,1,united parcel service n rise quarterly revenue...,1,1,0,"[united, parcel, service, n, rise, quarterly, ...",1
394,By Vibhuti Sharma(Reuters) - Netflix Inc (O:NF...,1,raising monthly percent percent video streamin...,0,1,1,"[raising, monthly, percent, percent, video, st...",1


In [98]:
from sklearn.metrics import f1_score, precision_score, recall_score

# Compare the predicted labels against the reference sentiment labels
y_true = test_df["sentiment"]
y_pred = test_df["salt_label"]

# Printed in order: recall, precision, F1 (for class 1, the default pos_label)
print(recall_score(y_true, y_pred, average='binary'))
print(precision_score(y_true, y_pred, average='binary'))
print(f1_score(y_true, y_pred, average='binary'))

0.7716535433070866
0.697508896797153
0.7327102803738317


In [67]:
test_df["sentiment"]

0      1
1      1
2      0
3      0
4      1
      ..
391    1
392    0
393    1
394    1
395    0
Name: sentiment, Length: 396, dtype: int64

In [71]:
test_df["sentiment"].value_counts()

1    254
0    142
Name: sentiment, dtype: int64

In [73]:
test_df["distilbert_base_uncased_label"].value_counts()

0    355
1     41
Name: distilbert_base_uncased_label, dtype: int64

In [77]:
test_df["sentiwordnet_label"].value_counts()

1    218
0    178
Name: sentiwordnet_label, dtype: int64

In [79]:
test_df["salt_label"].value_counts()

1    281
0    115
Name: salt_label, dtype: int64
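The class counts above, combined with the recall reported earlier, are enough to reconstruct the full confusion matrix for `salt_label`. The short calculation below derives it using only numbers printed in this notebook and recomputes precision, recall, and F1 from the counts as a consistency check; the variable names are chosen for this example.

```python
# Counts taken from the outputs above.
actual_pos = 254      # rows with sentiment == 1
predicted_pos = 281   # rows with salt_label == 1
total = 396

# recall = TP / actual_pos, so the reported recall implies the true-positive count.
reported_recall = 0.7716535433070866
tp = round(reported_recall * actual_pos)

fp = predicted_pos - tp          # predicted positive but actually negative
fn = actual_pos - tp             # actually positive but predicted negative
tn = total - tp - fp - fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / total

print(tp, fp, fn, tn)            # 196 85 58 57
print(precision, recall, f1, accuracy)
```

The recomputed precision and F1 agree with the values printed by scikit-learn, and the counts show that most errors are false positives (85) rather than false negatives (58).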

In [100]:
test_df.to_csv("../../data/all_baslines_result.csv", index=False)