# Sentiment Analysis for Long Text

Sentiment analysis, also known as opinion mining, is a computational technique that involves the use of natural language processing, machine learning, and statistical methods to analyze and determine the emotional tone, attitudes, and opinions expressed in textual data. In an era dominated by vast amounts of user-generated content on social media, reviews, and other online platforms, sentiment analysis has emerged as a crucial tool for understanding and extracting valuable insights from the immense volume of textual information.

The primary objective of sentiment analysis is to classify the sentiment conveyed in a piece of text as positive, negative, or neutral. This enables businesses, researchers, and organizations to gain a deeper understanding of public opinion, customer feedback, and overall sentiment towards products, services, brands, or any other subject of interest. By automating the process of sentiment analysis, it becomes possible to efficiently process and make sense of large datasets, enabling timely and informed decision-making.

Sentiment analysis finds application in various domains, including marketing, customer service, product development, political analysis, and social research. Its versatility makes it a valuable tool for businesses aiming to enhance customer satisfaction, monitor brand perception, and stay attuned to market trends. As technology continues to advance, sentiment analysis methods evolve to handle the complexities of language, cultural nuances, and the dynamic nature of online communication.

Most sentiment analysis methods rely on supervised learning, in which a machine learning model is trained on a labeled dataset to predict the sentiment of text. The model is given examples of text together with their sentiment labels (e.g., positive, negative, or neutral) and learns patterns and relationships in that labeled data, which lets it make predictions on new, unseen text.
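As a rough illustration of the supervised setting described above, the sketch below trains a tiny bag-of-words Naive Bayes classifier on four hand-written labeled sentences. The dataset, the function names `train_naive_bayes` and `predict`, and the choice of Naive Bayes are all assumptions made for this example; they are not part of the project.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-labeled dataset, invented purely for illustration.
train_data = [
    ("the product is excellent and works great", "positive"),
    ("great service and friendly staff", "positive"),
    ("terrible quality and awful support", "negative"),
    ("the delivery was slow and the item was broken", "negative"),
]

def train_naive_bayes(examples):
    """Count words per label so that class likelihoods can be estimated."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # ignore words never seen during training
                likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(train_data)
print(predict("excellent service", *model))
print(predict("awful and broken", *model))
```

The point of the sketch is that every training sentence needs a label, which is exactly what is missing for long texts.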

However, datasets with sentiment labels for long texts are not readily available online. Our project therefore addresses this gap by building a system that can perform sentiment analysis on long text.

## Sentiment analysis using unsupervised learning

Because labeled long-text data is scarce, we built a system that uses unsupervised learning for sentiment analysis.

### First, we import the necessary libraries

In [1]:
# data processing and Data manipulation
import numpy as np # linear algebra
import pandas as pd # data processing

import sklearn
from sklearn.model_selection import train_test_split
    
# Libraries and packages for NLP
import nltk
import gensim
from gensim.models import Word2Vec

import os
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    
print('*** --> Modules are imported: ')    
print("Python version:", sys.version)
print("numpy version:", np.__version__)
print("pandas version:", pd.__version__)

print("sklearn version:", sklearn.__version__)
print("nltk version:", nltk.__version__)
print("gensim version:", gensim.__version__)

*** --> Modules are imported: 
Python version: 3.7.16 (default, Jan 17 2023, 22:20:44) 
[GCC 11.2.0]
numpy version: 1.17.4
pandas version: 1.3.5
sklearn version: 1.0.2
nltk version: 3.8.1
gensim version: 4.2.0


### Then, we read the data that we want to perform sentiment analysis on

In [2]:
# Load the news data from the data directory, which is two levels above the current directory
data_path = os.path.abspath(os.path.join(os.pardir,
                                         os.pardir, 
                                         'data/clean_news_1.csv'))
df = pd.read_csv(data_path, dtype={'news_body': str})
# df.head(3)
df

Unnamed: 0,news_body
0,"(Bloomberg) -- With just three weeks to go, 20..."
1,Investing.com – Colombia stocks were higher af...
2,Investing.com – Canada stocks were lower after...
3,WASHINGTON (Reuters) - Three U.S. senators on ...
4,(Bloomberg) -- U.S. investors looking to get i...
...,...
95,Investing.com – Sweden stocks were lower after...
96,TEL AVIV (Reuters) - Salesforce.com (N:CRM) is...
97,Investing.com – Saudi Arabia stocks were lower...
98,By Stephanie Nebehay and Ryan WooGENEVA/BEIJIN...


### We now perform preprocessing on the text

Sentence preprocessing is a crucial step in preparing textual data for machine learning tasks, including natural language processing (NLP) and sentiment analysis. The goal is to transform raw text into a format that machine learning models can effectively understand and process.

In [3]:
# Adding `src` directory to the directories for interpreter to search
sys.path.append(os.path.abspath(os.path.join('../..','Model/src')))

# Importing functions and classes from utility module
from w2v_utils import (Tokenizer,
                       evaluate_model,
                       bow_vectorizer,
                       train_logistic_regressor,
                       w2v_trainer,
                       calculate_overall_similarity_score,
                       overall_semantic_sentiment_analysis,
                       list_similarity,
                       calculate_topn_similarity_score,
                       topn_semantic_sentiment_analysis,
                       define_complexity_subjectivity_reviews,
                       explore_high_complexity_reviews,
                       explore_low_subjectivity_reviews,
                       text_SSA)

In [4]:
# Instancing the Tokenizer class
tokenizer = Tokenizer(clean= True,
                      lower= True, 
                      de_noise= True, 
                      remove_stop_words= True,
                      keep_negation=True)

# Example statement
statement = "I didn't like this movie. It wasn't amusing nor visually interesting . I do not recommend it."
print(tokenizer.tokenize(statement))

['NOTlike', 'movie', 'NOTamusing', 'visually', 'interesting', 'NOTrecommend']
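The internals of the project's `Tokenizer` class are not shown here, so the following is only a minimal sketch of how output like the one above could be produced: lowercase the text, expand `n't`, strip punctuation, fuse `not` with the following word, and drop stop words. The function `simple_tokenize` and the short `STOP_WORDS` set are hypothetical and exist only for this example.

```python
import re

# Small illustrative stop-word list; a real pipeline would use a fuller one.
# This set is an assumption of the sketch, not the project's list.
STOP_WORDS = {"i", "it", "this", "that", "the", "a", "an", "is", "was", "were",
              "do", "did", "does", "nor", "and", "or", "to", "of"}

def simple_tokenize(text):
    """Lowercase, strip punctuation, merge negations, and drop stop words."""
    text = text.lower().replace("n't", " not")   # "didn't" -> "did not"
    words = re.findall(r"[a-z0-9]+", text)       # keep alphanumeric runs only
    tokens, i = [], 0
    while i < len(words):
        if words[i] == "not" and i + 1 < len(words):
            tokens.append("NOT" + words[i + 1])  # "not like" -> "NOTlike"
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return [t for t in tokens if t not in STOP_WORDS]

statement = ("I didn't like this movie. It wasn't amusing nor visually "
             "interesting . I do not recommend it.")
print(simple_tokenize(statement))
```

Fusing the negation into the next token keeps "like" and "NOTlike" as separate vocabulary entries, so the embedding model can learn different vectors for them.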


In [5]:
# Tokenize the news articles
df['tokenized_text'] = df['news_body'].astype(str).apply(tokenizer.tokenize)

df['tokenized_text_len'] = df['tokenized_text'].apply(len)
df['tokenized_text_len'].apply(np.log).describe()

count    100.000000
mean       5.372366
std        0.625784
min        3.091042
25%        5.020437
50%        5.427148
75%        5.733326
max        6.658011
Name: tokenized_text_len, dtype: float64
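The statistics above are on a natural-log scale, which makes them hard to read as document lengths. A small sketch, using only values copied from the output above, converts them back to token counts; the shortest article comes out at about 22 tokens and the longest at about 779.

```python
import math

# Log-length statistics copied from the describe() output above.
log_stats = {"min": 3.091042, "50%": 5.427148, "max": 6.658011}

# math.exp inverts the natural log that np.log applied.
token_counts = {name: math.exp(value) for name, value in log_stats.items()}
for name, count in token_counts.items():
    print(f"{name}: about {count:.0f} tokens")
```

Exponentiating the mean of the logs would give the geometric mean of the lengths, not the arithmetic mean, so it should not be read as the average article length.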

In [6]:
df.at[0,"tokenized_text"]

['bloomberg',
 'three',
 'weeks',
 'go',
 '2018',
 'market',
 'contrarians',
 'proving',
 'prescient',
 'outlook',
 'decidedly',
 'bullish',
 'u',
 'stocks',
 'developing',
 'nation',
 'assets',
 '12',
 'months',
 'ago',
 'forecast',
 'build',
 'upon',
 'stellar',
 '2017',
 'beaten',
 'greenback',
 'expected',
 'fare',
 'better',
 '2018',
 'rosy',
 'international',
 'growth',
 'outlook',
 'threatened',
 'lure',
 'investors',
 'away',
 'american',
 'markets',
 'despite',
 'tough',
 'talk',
 'u',
 'china',
 'risks',
 'trade',
 'war',
 'afterthought',
 'NOTmuch',
 'gone',
 'according',
 'plan',
 'dws',
 'cantor',
 'fitzgerald',
 'morgan',
 'stanley',
 'nyse',
 'ms',
 'among',
 'bet',
 'trend',
 'got',
 'right',
 'federal',
 'reserve',
 'rate',
 'hikes',
 'backdrop',
 'sharply',
 'escalating',
 'trade',
 'tensions',
 'roiled',
 'markets',
 '2018',
 'punishing',
 'u',
 'stocks',
 'causing',
 'risk',
 'averse',
 'investors',
 'flee',
 'developing',
 'nations',
 'meanwhile',
 'dollar',
 'gain

In [7]:
# No sentiment labels are available for this dataset, so there is no target `y`
# and the split cannot be stratified
X = df

X_train, X_test = train_test_split(X,
                                   random_state=42,
                                   test_size=0.3)

print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)

### For sentiment analysis using unsupervised learning, we first train the word embedding model

In [8]:
# Training a Word2Vec model
keyed_vectors, keyed_vocab = w2v_trainer(df['tokenized_text'])

### Then, we create positive and negative sets

In [9]:
# Find the most similar words to "good" 
keyed_vectors.most_similar('good',topn=15)

[('market', 0.9997321963310242),
 ('would', 0.9997232556343079),
 ('also', 0.9997193813323975),
 ('u', 0.9997157454490662),
 ('said', 0.9997138381004333),
 ('markets', 0.9997131824493408),
 ('reuters', 0.9997076988220215),
 ('company', 0.999704122543335),
 ('investors', 0.9997040033340454),
 ('companies', 0.9997028708457947),
 ('could', 0.9997027516365051),
 ('global', 0.9997016191482544),
 ('since', 0.9997003674507141),
 ('one', 0.9996955394744873),
 ('deal', 0.9996939897537231)]
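The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes that measure by hand on made-up 3-dimensional vectors; the vectors and the helper name `cosine_similarity` are assumptions for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only.
good = [0.9, 0.1, 0.3]
great = [0.8, 0.2, 0.4]
bad = [-0.7, 0.6, -0.2]

print(cosine_similarity(good, great))  # close to 1: similar direction
print(cosine_similarity(good, bad))    # negative: opposing direction
```

Almost every score in the output above is close to 0.9997, even for generic words such as 'would' and 'also'. This may indicate that the embeddings are not yet well separated, which would be plausible for a model trained on only 100 articles.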

In [24]:
# To make sure that all `positive_concepts` are in the keyed word2vec vocabulary
positive_concepts = ['excellent', 'awesome', 'cool','decent','amazing', 'strong', 'good', 'great', 'funny', 'entertaining'] 
pos_concepts = [concept for concept in positive_concepts if concept in keyed_vocab]

In [25]:
# Find the most similar words to "bad" 
keyed_vectors.most_similar('bad',topn=15)

[('could', 0.9992130994796753),
 ('said', 0.9991994500160217),
 ('reuters', 0.9991929531097412),
 ('months', 0.9991893172264099),
 ('markets', 0.9991881847381592),
 ('billion', 0.9991803169250488),
 ('would', 0.9991776347160339),
 ('investors', 0.9991772770881653),
 ('deal', 0.9991771578788757),
 ('two', 0.999177098274231),
 ('000', 0.9991746544837952),
 ('tuesday', 0.9991728067398071),
 ('china', 0.9991723299026489),
 ('bank', 0.9991716146469116),
 ('public', 0.9991708993911743)]

In [25]:
str(list(keyed_vocab.keys())[2000:3000])

"['setting', 'andrew', 'talk', 'lewis', 'richard', 'gone', 'dws', 'virus', 'jeff', 'buzzfeed', 'fitzgerald', 'withstand', 'exploitative', 'driver', 'titles', 'ms', 'bankers', 'lure', 'commodity', 'giant', 'figure', 'senator', 'bull', 'nine', 'releases', 'hikes', 'underpin', 'causing', 'packages', 'spark', 'imposing', 'virtually', 'bezos', 'correctly', 'upon', 'furukawa', 'bear', 'dependent', 'kicked', 'cic', 'suit', 'lisa', 'samarco', '280', 'vale', 'argentina', 'cemex', 'manuel', 'andres', 'latam', 'populist', 'damages', 'clh', '6870', 'treasuries', 'brazil', 'conconcret', 'nordvig', 'exante', 'etb', 'ipo', 'colcap', 'decide', 'founding', 'stuff', 'NOTsee', 'broad', 'tentative', 'daniel', 'adr', 'inched', 'alibaba', 'breaks', 'weapons', 'nuclear', 'baba', 'venture', 'wagering', 'shortages', 'mike', 'iamgold', 'img', 'material', '190', '920', 'troughs', '750', 'pessimism', 'gathered', 'adrs', 'towards', 'sogn', 'gauge', 'tighter', 'deceleration', 'via', 'climb', 'helping', 'brazilian',

In [16]:
str(sorted(keyed_vocab.items(), key=lambda item: item[0]))



In [29]:
# To make sure that all `negative_concepts` are in the keyed word2vec vocabulary 
negative_concepts = ['terrible','awful','horrible','boring','bad', 'disappointing', 'weak', 'poor', 'senseless','confusing', 
                     'criminal', 'wrongdoing', 'fail', 'depressed', 'stress', 'frustrated', 'pessimistic', 'hopeless', 'worthless',
                     'cheating', 'concern', 'exit', 'exhausted', 'fear', 'fears', 'lost', 'worst', 'decline', 'fraud', 'warning', 
                     'pandemic', 'illegal', 'corruption', 'crisis', 'shutdown', 'slow', 'ban', 'attack', 'unfortunately',
                     'hurt', 'negative', 'panic', 'pullback', 'cancer', 'limit', 'uncertain', 'postponed', 'dirty', 'disease', 
                     'death', 'killed', 'dark', 'jitters', 'accused', 'dispute', 'losses', 'nervous', 'restrictions', 'fell', 
                    ] 
neg_concepts = [concept for concept in negative_concepts if concept in keyed_vocab]
len(negative_concepts)

57

In [30]:
df['tokenized_text']

0     [bloomberg, three, weeks, go, 2018, market, co...
1     [investing, com, colombia, stocks, higher, clo...
2     [investing, com, canada, stocks, lower, close,...
3     [washington, reuters, three, u, senators, thur...
4     [bloomberg, u, investors, looking, get, volati...
                            ...                        
95    [investing, com, sweden, stocks, lower, close,...
96    [tel, aviv, reuters, salesforce, com, n, crm, ...
97    [investing, com, saudi, arabia, stocks, lower,...
98    [stephanie, nebehay, ryan, woogeneva, beijing,...
99    [investing, com, canada, stocks, lower, close,...
Name: tokenized_text, Length: 100, dtype: object

In [31]:
df = df[df["tokenized_text"].notna()]

In [32]:
for ind, row in df.iterrows():
    if len(row['tokenized_text']) == 0:
        print(ind)

In [33]:
# Drop documents whose token list is empty; the check above printed no indices,
# so this leaves the current dataset unchanged
df = df[df['tokenized_text'].apply(len) > 0].copy()

In [34]:
# Calculating Semantic Sentiment Scores by OSSA model
overall_df_scores = overall_semantic_sentiment_analysis(keyed_vectors=keyed_vectors,
                                                        positive_target_tokens=pos_concepts,
                                                        negative_target_tokens=neg_concepts,
                                                        doc_tokens=df['tokenized_text'])

# Calculating Semantic Sentiment Scores by TopSSA model
topn_df_scores = topn_semantic_sentiment_analysis(keyed_vectors=keyed_vectors,
                                                  positive_target_tokens=pos_concepts,
                                                  negative_target_tokens=neg_concepts,
                                                  doc_tokens=df['tokenized_text'],
                                                  topn=30)

# Store the semantic sentiment scores computed by the OSSA model in df
df['overall_PSS'] = overall_df_scores[0]
df['overall_NSS'] = overall_df_scores[1]
df['overall_semantic_sentiment_score'] = overall_df_scores[2]
df['overall_semantic_sentiment_polarity'] = overall_df_scores[3]

# Store the semantic sentiment scores computed by the TopSSA model in df
df['topn_PSS'] = topn_df_scores[0]
df['topn_NSS'] = topn_df_scores[1]
df['topn_semantic_sentiment_score'] = topn_df_scores[2]
df['topn_semantic_sentiment_polarity'] = topn_df_scores[3]
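The implementations of `overall_semantic_sentiment_analysis` and `topn_semantic_sentiment_analysis` live in `w2v_utils` and are not shown here. The sketch below is one plausible reading of the two ideas, under stated assumptions: the overall variant compares the averaged document vector with the averaged positive and negative concept vectors, and the top-n variant averages only the `topn` document tokens most similar to each concept set. The function names, the 2-dimensional embeddings, and the exact aggregation are all hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def overall_score(embeddings, doc_tokens, pos_concepts, neg_concepts):
    """Compare the averaged document vector with each averaged concept vector."""
    doc_vec = mean_vector([embeddings[t] for t in doc_tokens if t in embeddings])
    pss = cosine(doc_vec, mean_vector([embeddings[c] for c in pos_concepts]))
    nss = cosine(doc_vec, mean_vector([embeddings[c] for c in neg_concepts]))
    return pss, nss, pss - nss

def topn_score(embeddings, doc_tokens, pos_concepts, neg_concepts, topn=2):
    """Average only the topn document tokens most similar to each concept set."""
    pos_vec = mean_vector([embeddings[c] for c in pos_concepts])
    neg_vec = mean_vector([embeddings[c] for c in neg_concepts])
    tokens = [t for t in doc_tokens if t in embeddings]
    pos_sims = sorted((cosine(embeddings[t], pos_vec) for t in tokens), reverse=True)
    neg_sims = sorted((cosine(embeddings[t], neg_vec) for t in tokens), reverse=True)
    pss = sum(pos_sims[:topn]) / len(pos_sims[:topn])
    nss = sum(neg_sims[:topn]) / len(neg_sims[:topn])
    return pss, nss, pss - nss

# Hand-made 2-dimensional embeddings, invented for this example.
embeddings = {
    "good": [1.0, 0.1], "strong": [0.9, 0.2],
    "bad": [0.1, 1.0], "weak": [0.2, 0.9],
    "stocks": [0.5, 0.5], "fell": [0.2, 0.8], "decline": [0.1, 0.9],
}
doc = ["stocks", "fell", "decline"]

print(overall_score(embeddings, doc, ["good", "strong"], ["bad", "weak"]))
print(topn_score(embeddings, doc, ["good", "strong"], ["bad", "weak"]))
```

In this sketch a document with no usable tokens has no vector to average, which is one reason to filter out empty token lists before scoring.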

### Test on long news data

In [38]:
news_text = "China’s stocks reversed course to rise Monday after data showing persistent deflationary pressures from weak domestic demand pushed them lower earlier in the session. Japan’s stocks jumped on growing bets that its central bank might not hike interest rates next week. November inflation numbers from China showed a faster-than-expected decline in consumer prices. The consumer price index fell 0.5% year-on-year, more than the 0.1% drop expected by economists polled by Reuters and the fastest slide since November 2020. The producer price index fell 3% year-on-year, compared with October’s 2.6% drop and expectations of a 2.8% decline. It also marked the 14th straight month of PPI decline and the quickest since August."

In [39]:
print(news_text)

China’s stocks reversed course to rise Monday after data showing persistent deflationary pressures from weak domestic demand pushed them lower earlier in the session. Japan’s stocks jumped on growing bets that its central bank might not hike interest rates next week. November inflation numbers from China showed a faster-than-expected decline in consumer prices. The consumer price index fell 0.5% year-on-year, more than the 0.1% drop expected by economists polled by Reuters and the fastest slide since November 2020. The producer price index fell 3% year-on-year, compared with October’s 2.6% drop and expectations of a 2.8% decline. It also marked the 14th straight month of PPI decline and the quickest since August.


In [40]:
tokenized_news = tokenizer.tokenize(news_text)

In [41]:
test = pd.DataFrame({"test": [tokenized_news]})

In [42]:
test

Unnamed: 0,test
0,"[china, stocks, reversed, course, rise, monday..."


In [43]:
overall_df_scores = overall_semantic_sentiment_analysis (keyed_vectors = keyed_vectors,
                                                   positive_target_tokens = pos_concepts, 
                                                   negative_target_tokens = neg_concepts,
                                                   doc_tokens = test['test'])

In [44]:
overall_df_scores

(0    0.18017
 Name: test, dtype: float32,
 0    0.376213
 Name: test, dtype: float32,
 0   -0.196044
 Name: test, dtype: float32,
 0    0
 Name: test, dtype: int64)
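The four series in the tuple above appear to be, in order, the positive similarity score (PSS), the negative similarity score (NSS), their difference, and a polarity flag, matching how the same tuple positions were stored in `df` earlier. The short check below uses only the numbers printed above; the polarity rule (1 for a positive score, 0 otherwise) is an assumption inferred from this single result.

```python
# Values copied from the output above, rounded as displayed.
pss, nss, score, polarity = 0.18017, 0.376213, -0.196044, 0

# The displayed score matches PSS - NSS up to float32 rounding.
difference = pss - nss
print(round(difference, 6))

# Assumed convention: polarity 1 for a positive score, 0 otherwise.
predicted_polarity = 1 if score > 0 else 0
print(predicted_polarity == polarity)
```

Under that reading, the news excerpt is scored as negative, since it is closer to the negative concept set than to the positive one.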