# Sentiment Analysis - Labelled Financial News Data

This notebook performs sentiment analysis on labelled financial news data: each article is classified as positive, negative, or neutral. The work covers exploring the dataset, preprocessing the text, choosing a feature representation, training sentiment classifiers, and evaluating their performance. The objective is to build accurate sentiment models tailored to the financial domain.

In [60]:
import string

import matplotlib.pyplot as plt
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# download the tokenizer models and the stop-word list (skipped if already up to date)
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rnrib\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rnrib\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [61]:
news = pd.read_csv('data/fin_cleaned.csv')

news

Unnamed: 0,Date_published,Headline,Synopsis,Full_text,Final Status
0,2022-06-21,"Banks holding on to subsidy share, say payment...",The companies have written to the National Pay...,ReutersPayments companies and banks are at log...,Negative
1,2022-04-19,Digitally ready Bank of Baroda aims to click o...,"At present, 50% of the bank's retail loans are...",AgenciesThe bank presently has 20 million acti...,Positive
2,2022-05-27,Karnataka attracted investment commitment of R...,Karnataka is at the forefront in attracting in...,PTIKarnataka Chief Minister Basavaraj Bommai.K...,Positive
3,2022-04-06,Splitting of provident fund accounts may be de...,The EPFO is likely to split accounts only at t...,Getty ImagesThe budget for FY22 had imposed in...,Negative
4,2022-06-14,Irdai weighs proposal to privatise Insurance I...,"Set up in 2009 as an advisory body, IIB collec...",AgenciesThere is a view in the insurance indus...,Positive
...,...,...,...,...,...
395,2022-06-10,"Banks take a cue from RBI, hike lending rates",These banks raised their respective external b...,"PTIICICI Bank, Bank of Baroda, Punjab National...",Negative
396,2022-06-29,Sebi issues Rs 27 lakh recovery notice to indi...,"In the event of non-payment, it will recover t...",ReutersThe logo of the Securities and Exchange...,Negative
397,2022-06-06,Apollo Hospital shares drop 0.68% as Sensex ...,"A total of 10,105 shares changed hands on the ...",Getty ImagesShrikant Chouhan of Kotak Securiti...,Negative
398,2022-05-16,SBI at Rs 710? What makes analysts see up to 5...,Calling the stock 'attractively valued' analys...,AgenciesThe PSU bank reported a 41.27 per cent...,Positive


In [62]:
# rename columns
news = news.rename(columns={'Date_published': 'date', 'Headline': 'headline', 'Synopsis': 'synopsis', 'Full_text': 'text', 'Final Status': 'label'})
news.head()

Unnamed: 0,date,headline,synopsis,text,label
0,2022-06-21,"Banks holding on to subsidy share, say payment...",The companies have written to the National Pay...,ReutersPayments companies and banks are at log...,Negative
1,2022-04-19,Digitally ready Bank of Baroda aims to click o...,"At present, 50% of the bank's retail loans are...",AgenciesThe bank presently has 20 million acti...,Positive
2,2022-05-27,Karnataka attracted investment commitment of R...,Karnataka is at the forefront in attracting in...,PTIKarnataka Chief Minister Basavaraj Bommai.K...,Positive
3,2022-04-06,Splitting of provident fund accounts may be de...,The EPFO is likely to split accounts only at t...,Getty ImagesThe budget for FY22 had imposed in...,Negative
4,2022-06-14,Irdai weighs proposal to privatise Insurance I...,"Set up in 2009 as an advisory body, IIB collec...",AgenciesThere is a view in the insurance indus...,Positive


In [63]:
# describe the dataset
news.describe()

Unnamed: 0,date,headline,synopsis,text,label
count,400,400,399,400,400
unique,75,368,398,400,3
top,2022-05-02,Stock market update: Stocks that hit 52-week h...,The 30-share BSE Sensex closed up 708.18 poi...,ReutersPayments companies and banks are at log...,Positive
freq,12,5,2,1,215
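
`describe()` reports three unique labels, with `Positive` the most frequent at 215 of 400 rows, but it does not show how the remaining rows are split. A minimal sketch of a class-balance check follows. The labels below are hypothetical toy values, and `'Neutral'` is only a placeholder because the third label does not appear in the output above. In the notebook the series would be `news['label']`.

```python
import pandas as pd

# Hypothetical labels for illustration; replace with news['label'] in the notebook.
labels = pd.Series(
    ['Positive', 'Negative', 'Positive', 'Neutral', 'Positive'],
    name='label',
)

# Absolute count and relative share of each class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True).round(2)

summary = pd.DataFrame({'count': counts, 'share': shares})
print(summary)
```

A strongly skewed distribution would suggest stratified splits and metrics beyond plain accuracy when the classifiers are evaluated later.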


In [64]:
# check for missing values
news.isnull().sum()

date        0
headline    0
synopsis    1
text        0
label       0
dtype: int64
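
The counts above show a single missing `synopsis`. Before deciding whether to drop or fill that row, it can help to look at it directly. The sketch below uses a small hypothetical frame (`toy`) rather than the real data. In the notebook the same mask would be applied to `news`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for `news`.
toy = pd.DataFrame({
    'headline': ['Banks hike lending rates', 'Sebi issues recovery notice'],
    'synopsis': ['Lenders raised their benchmark rates.', np.nan],
})

# Keep only the rows in which at least one column is missing.
missing_rows = toy[toy.isnull().any(axis=1)]
print(missing_rows)
```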

In [65]:
# convert date to datetime
news['date'] = pd.to_datetime(news['date'])
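
Once the column is a datetime, the `.dt` accessor makes time-based summaries straightforward. As a hedged illustration, the sketch below counts articles per month using only the five dates visible in the preview above, so the numbers say nothing about the full dataset. In the notebook, `dates` would be `news['date']`.

```python
import pandas as pd

# The five dates shown in the preview output; not the full dataset.
dates = pd.to_datetime(pd.Series([
    '2022-06-21', '2022-04-19', '2022-05-27', '2022-04-06', '2022-06-14',
]))

# Collapse each date to its calendar month and count articles per month.
per_month = dates.dt.to_period('M').value_counts().sort_index()
print(per_month)
```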

In [66]:
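text_columns = ['headline', 'synopsis', 'text']
for col in text_columns:
    # fill the missing synopsis with an empty string first, so that the cast
    # does not turn it into the literal string 'nan'
    news[col] = news[col].fillna('').astype(str)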


In [67]:
# create a sentence token column for each text column
for col in text_columns:
    # create a new column with the sentence tokens
    news[col + '_sent_tokens'] = news[col].apply(sent_tokenize)
    # count the number of sentence tokens
    news[col + '_sent_tokens_count'] = news[col + '_sent_tokens'].apply(len)


# create a word token column for each text column
for col in text_columns:
    # create a new column with the word tokens
    news[col + '_word_tokens'] = news[col].apply(word_tokenize)
    # count the number of word tokens
    news[col + '_word_tokens_count'] = news[col + '_word_tokens'].apply(len)

news.head()

Unnamed: 0,date,headline,synopsis,text,label,headline_sent_tokens,headline_sent_tokens_count,synopsis_sent_tokens,synopsis_sent_tokens_count,text_sent_tokens,text_sent_tokens_count,headline_word_tokens,headline_word_tokens_count,synopsis_word_tokens,synopsis_word_tokens_count,text_word_tokens,text_word_tokens_count
0,2022-06-21,"Banks holding on to subsidy share, say payment...",The companies have written to the National Pay...,ReutersPayments companies and banks are at log...,Negative,"[Banks holding on to subsidy share, say paymen...",1,[The companies have written to the National Pa...,1,[ReutersPayments companies and banks are at lo...,18,"[Banks, holding, on, to, subsidy, share, ,, sa...",10,"[The, companies, have, written, to, the, Natio...",33,"[ReutersPayments, companies, and, banks, are, ...",547
1,2022-04-19,Digitally ready Bank of Baroda aims to click o...,"At present, 50% of the bank's retail loans are...",AgenciesThe bank presently has 20 million acti...,Positive,[Digitally ready Bank of Baroda aims to click ...,1,"[At present, 50% of the bank's retail loans ar...",2,[AgenciesThe bank presently has 20 million act...,18,"[Digitally, ready, Bank, of, Baroda, aims, to,...",11,"[At, present, ,, 50, %, of, the, bank, 's, ret...",43,"[AgenciesThe, bank, presently, has, 20, millio...",490
2,2022-05-27,Karnataka attracted investment commitment of R...,Karnataka is at the forefront in attracting in...,PTIKarnataka Chief Minister Basavaraj Bommai.K...,Positive,[Karnataka attracted investment commitment of ...,1,[Karnataka is at the forefront in attracting i...,1,[PTIKarnataka Chief Minister Basavaraj Bommai....,31,"[Karnataka, attracted, investment, commitment,...",14,"[Karnataka, is, at, the, forefront, in, attrac...",55,"[PTIKarnataka, Chief, Minister, Basavaraj, Bom...",1062
3,2022-04-06,Splitting of provident fund accounts may be de...,The EPFO is likely to split accounts only at t...,Getty ImagesThe budget for FY22 had imposed in...,Negative,[Splitting of provident fund accounts may be d...,1,[The EPFO is likely to split accounts only at ...,2,[Getty ImagesThe budget for FY22 had imposed i...,14,"[Splitting, of, provident, fund, accounts, may...",8,"[The, EPFO, is, likely, to, split, accounts, o...",56,"[Getty, ImagesThe, budget, for, FY22, had, imp...",424
4,2022-06-14,Irdai weighs proposal to privatise Insurance I...,"Set up in 2009 as an advisory body, IIB collec...",AgenciesThere is a view in the insurance indus...,Positive,[Irdai weighs proposal to privatise Insurance ...,1,"[Set up in 2009 as an advisory body, IIB colle...",2,[AgenciesThere is a view in the insurance indu...,7,"[Irdai, weighs, proposal, to, privatise, Insur...",8,"[Set, up, in, 2009, as, an, advisory, body, ,,...",45,"[AgenciesThere, is, a, view, in, the, insuranc...",262
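
The token counts can be used to check whether article length differs between sentiment classes. The sketch below is a minimal example that reuses the five `text_word_tokens_count` values and labels from the preview above. Five rows are far too few to draw conclusions, so in the notebook the same `groupby` would be run on `news`.

```python
import pandas as pd

# The five preview rows shown above: label and word-token count of the full text.
toy = pd.DataFrame({
    'label': ['Negative', 'Positive', 'Positive', 'Negative', 'Positive'],
    'text_word_tokens_count': [547, 490, 1062, 424, 262],
})

# Number of articles, mean length, and median length per label.
length_by_label = (
    toy.groupby('label')['text_word_tokens_count']
       .agg(['count', 'mean', 'median'])
)
print(length_by_label)
```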


In [68]:

# # remove stopwords
# stop_words = set(stopwords.words('english'))
# for col in text_columns:
#     news[col + '_word_tokens'] = news[col + '_word_tokens'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

# # remove punctuation
# for col in text_columns:
#     news[col + '_word_tokens'] = news[col + '_word_tokens'].apply(lambda x: [word for word in x if word not in string.punctuation])
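
If this filtering step is enabled, one detail is worth keeping in mind: `string.punctuation` is a single string, so a membership test against it is a substring test, and multi-character tokens produced by `word_tokenize` (for example the quote tokens ` `` ` and `''`) would slip through. The sketch below shows one possible per-character check. It is not the notebook's own method, and `STOP_WORDS`, `is_punctuation`, and `clean_tokens` are hypothetical names introduced for illustration. The tiny stop-word set stands in for `stopwords.words('english')`.

```python
import string

# Hypothetical stop-word set for illustration; the notebook would use
# set(stopwords.words('english')) from NLTK instead.
STOP_WORDS = {'the', 'to', 'on', 'of', 'a', 'and'}


def is_punctuation(token):
    """Return True for a non-empty token made up only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words and not is_punctuation(tok)
    ]


tokens = ['Banks', 'holding', 'on', 'to', 'subsidy', 'share', ',', '``', 'say', "''"]
cleaned = clean_tokens(tokens)
print(cleaned)
```

Tokens that merely contain punctuation, such as abbreviations, are kept, because only tokens consisting entirely of punctuation are removed.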