# Introduction

This notebook predicts whether stock prices will increase or decrease based on sentiment analysis of stock news headlines. In this study I use data available on Kaggle: https://www.kaggle.com/datasets/avisheksood/stock-news-sentiment-analysismassive-dataset?select=Sentiment_Stock_data.csv.
This is a large dataset with 108,301 unique values, so I will only use the first 5,000 observations.

We are going to solve the classification problem using Natural Language Processing with the following steps:
- Text preprocessing: tokenization, stopword removal, lemmatization
- Converting text to vectors: a one-hot (hashed index) representation that turns each sentence into an array of vocabulary indices, followed by a word embedding
- Building and training the LSTM model
- Model performance evaluation

We will use Python with NLTK, TensorFlow/Keras, and scikit-learn.

# Importing the libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tensorflow as tf
tf.__version__


'2.15.0'

In [3]:
import nltk
import re
from nltk.corpus import stopwords

In [4]:
nltk.download('stopwords')
nltk.download('all')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /usr/share/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /usr/share/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Package basque_grammars 

True

In [5]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [6]:
from tensorflow.keras.layers import Embedding  # maps each word index to a dense, trainable vector
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

# Importing and preprocessing the data

In [7]:
df = pd.read_csv('/kaggle/input/stock-news-sentiment-analysismassive-dataset/Sentiment_Stock_data.csv')
df = df.head(5000)
df.head()

Unnamed: 0.1,Unnamed: 0,Sentiment,Sentence
0,0,0,"According to Gran , the company has no plans t..."
1,1,1,"For the last quarter of 2010 , Componenta 's n..."
2,2,1,"In the third quarter of 2010 , net sales incre..."
3,3,1,Operating profit rose to EUR 13.1 mn from EUR ...
4,4,1,"Operating profit totalled EUR 21.1 mn , up fro..."


In [8]:
df = df[['Sentiment', 'Sentence']]

In [9]:
df.shape

(5000, 2)

In [10]:
df.isnull().sum()

Sentiment    0
Sentence     0
dtype: int64

In [11]:
# Dropping null values (the check above found none, so this is a safeguard)

df = df.dropna()

In [12]:
df = df.reset_index(drop=True)  # drop=True avoids adding the old index as an extra column

In [13]:
# specifying the independent and dependent features
X = df['Sentence']
y = df['Sentiment']

In [14]:
# copying X for preprocessing

sentences = X.copy()

In [15]:
nltk.download('wordnet')
!unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Archive:  /usr/share/nltk_data/corpora/wordnet.zip
   creating: /usr/share/nltk_data/corpora/wordnet/
  inflating: /usr/share/nltk_data/corpora/wordnet/lexnames  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adv  
  inflating: /usr/share/nltk_data/corpora/wordnet/adv.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/LICENSE  
  inflating: /usr/share/nltk_data/corpora/wordnet/citation.bib  
  inflating: /usr/share/nltk_data/corpora/wordnet/noun.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/verb.exc  
  inflating: /usr/shar

In [16]:
# Cleaning and lemmatization: convert the words in each sentence to their meaningful root

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stopword set once; keep 'not' and 'no' because negations carry sentiment
all_stopwords = set(stopwords.words('english')) - {'not', 'no'}

corpus = []
for i in range(len(sentences)):
    text = re.sub('[^a-zA-Z]', ' ', sentences[i])  # replace everything except letters with spaces
    text = text.lower().split()
    text = [lemmatizer.lemmatize(word) for word in text if word not in all_stopwords]
    corpus.append(' '.join(text))

In [17]:
corpus[0]

'according gran company no plan move production russia although company growing'
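
To make the cleaning step concrete, here is a minimal sketch that reproduces the regex and stopword filtering on the first headline without NLTK. The `STOPWORDS` set is a tiny hypothetical stand-in for NLTK's English list, and lemmatization is skipped, so `plans` stays plural here while the notebook's output has `plan`.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# the notebook uses NLTK's full English list minus 'not' and 'no'.
STOPWORDS = {"to", "the", "has", "all", "that", "is", "where"}


def clean(sentence):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(token for token in tokens if token not in STOPWORDS)


headline = ("According to Gran , the company has no plans to move all "
            "production to Russia , although that is where the company is growing ")
print(clean(headline))
print(clean("Operating profit rose to EUR 13.1 mn"))
```

Note that the regex also removes digits, so figures such as `13.1` disappear from the corpus and only the surrounding words remain.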

In [18]:
len(corpus)

5000

In [19]:
corpus

['according gran company no plan move production russia although company growing',
 'last quarter componenta net sale doubled eur eur period year earlier moved zero pre tax profit pre tax loss eur',
 'third quarter net sale increased eur mn operating profit eur mn',
 'operating profit rose eur mn eur mn corresponding period representing net sale',
 'operating profit totalled eur mn eur mn representing net sale',
 'finnish talentum report operating profit increased eur mn eur mn net sale totaled eur mn eur mn',
 'clothing retail chain sepp l sale increased eur mn operating profit rose eur mn eur mn',
 'consolidated net sale increased reach eur operating profit amounted eur compared loss eur prior year period',
 'foundry division report sale increased eur mn eur mn corresponding period sale machine shop division increased eur mn eur mn corresponding period',
 'helsinki afx share closed higher led nokia announced plan team sanyo manufacture g handset nokian tyre fourth quarter earnings re

In [20]:
X[0]

'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing '

# Vectorization

### Onehot representation

In [21]:
### Vocabulary size
voc_size=50000
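
One way to judge whether `voc_size` is reasonable is to count the distinct words in the cleaned corpus. The sketch below does this on three cleaned headlines copied from the corpus output above; `sample_corpus` is only an illustrative subset, not the full data.

```python
from collections import Counter

# Three cleaned headlines taken from the corpus output (illustrative subset)
sample_corpus = [
    "third quarter net sale increased eur mn operating profit eur mn",
    "operating profit rose eur mn eur mn corresponding period representing net sale",
    "operating profit totalled eur mn eur mn representing net sale",
]

# Count how often each word occurs across all sentences
word_counts = Counter(word for sentence in sample_corpus for word in sentence.split())
distinct_words = len(word_counts)

print("distinct words:", distinct_words)
print("most frequent:", word_counts.most_common(2))
```

Running the same count on the full `corpus` would show how the real vocabulary compares with 50,000. A `voc_size` well above the real vocabulary reduces hash collisions, at the cost of a larger embedding matrix.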

In [22]:
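# one_hot hashes each word to an integer index in [1, voc_size); it returns indices, not one-hot vectors
onehot_repr = [one_hot(words, voc_size) for words in corpus]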
onehot_repr

[[10394, 31033, 20264, 4991, 28387, 21188, 5979, 33155, 9268, 20264, 45089],
 [36781,
  32400,
  400,
  17055,
  37973,
  36616,
  42928,
  42928,
  43383,
  31260,
  40926,
  26663,
  40846,
  3544,
  42890,
  33265,
  3544,
  42890,
  7274,
  42928],
 [32405, 32400, 17055, 37973, 43796, 42928, 46444, 28989, 33265, 42928, 46444],
 [28989,
  33265,
  8574,
  42928,
  46444,
  42928,
  46444,
  17586,
  43383,
  30033,
  17055,
  37973],
 [28989, 33265, 33726, 42928, 46444, 42928, 46444, 30033, 17055, 37973],
 [41566,
  16457,
  8032,
  28989,
  33265,
  43796,
  42928,
  46444,
  42928,
  46444,
  17055,
  37973,
  6424,
  42928,
  46444,
  42928,
  46444],
 [27447,
  45886,
  42660,
  555,
  21725,
  37973,
  43796,
  42928,
  46444,
  28989,
  33265,
  8574,
  42928,
  46444,
  42928,
  46444],
 [32476,
  17055,
  37973,
  43796,
  35168,
  42928,
  28989,
  33265,
  2926,
  42928,
  47872,
  7274,
  42928,
  27455,
  31260,
  43383],
 [25874,
  48500,
  8032,
  37973,
  43796,
  429

In [23]:
corpus[1]

'last quarter componenta net sale doubled eur eur period year earlier moved zero pre tax profit pre tax loss eur'

In [24]:
onehot_repr[1]

[36781,
 32400,
 400,
 17055,
 37973,
 36616,
 42928,
 42928,
 43383,
 31260,
 40926,
 26663,
 40846,
 3544,
 42890,
 33265,
 3544,
 42890,
 7274,
 42928]
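
Despite its name, Keras `one_hot` returns integer indices rather than one-hot vectors: each word is hashed into the range 1 to `voc_size - 1`. By default it relies on Python's built-in `hash`, which is not stable across interpreter sessions, so the indices shown above may differ on a rerun. The sketch below is a pure-Python approximation of the idea using an MD5-based hash. The helper names `stable_index` and `encode` are mine, and this is not the Keras implementation itself.

```python
import hashlib


def stable_index(word, voc_size):
    """Hash a word to a reproducible index in [1, voc_size - 1]; 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (voc_size - 1) + 1


def encode(sentence, voc_size):
    """Turn a cleaned sentence into a list of hashed word indices."""
    return [stable_index(word, voc_size) for word in sentence.split()]


encoded = encode("net sale eur eur", 50000)
print(encoded)

# With only two available indices, three distinct words must share at least one index
tiny = encode("profit loss sale", 3)
print(tiny)
```

The repeated word `eur` always receives the same index, which matches the repeated `42928` in the output above. The second call shows that distinct words can collide when the index range is small.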

### Word Embedding

In [25]:
# Pad with zeros at the end (or truncate) so every sentence has exactly sent_length indices
sent_length=30
embedded_docs=pad_sequences(onehot_repr,padding='post',maxlen=sent_length)
print(embedded_docs)

[[10394 31033 20264 ...     0     0     0]
 [36781 32400   400 ...     0     0     0]
 [32405 32400 17055 ...     0     0     0]
 ...
 [42943 23603 14386 ...     0     0     0]
 [14394 17713 11286 ...     0     0     0]
 [  449 44550  6777 ...     0     0     0]]


In [26]:
embedded_docs[1]

array([36781, 32400,   400, 17055, 37973, 36616, 42928, 42928, 43383,
       31260, 40926, 26663, 40846,  3544, 42890, 33265,  3544, 42890,
        7274, 42928,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0], dtype=int32)
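
The padding behaviour can be sketched in plain Python. One detail worth knowing: `padding='post'` only controls where zeros are added. The `truncating` argument is left at its default of `'pre'`, so a sentence longer than `sent_length` loses its first words, not its last. The helper `pad_post` below is a hypothetical stand-in written for illustration, not the Keras function.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding='post') with the default truncating='pre'."""
    kept = seq[-maxlen:]  # if too long, keep the last maxlen items
    return kept + [value] * (maxlen - len(kept))  # if too short, add zeros at the end


short_seq = [36781, 32400, 400]
long_seq = [1, 2, 3, 4, 5, 6, 7]

print(pad_post(short_seq, 5))
print(pad_post(long_seq, 5))
```

If the start of a headline matters more than its end, passing `truncating='post'` to `pad_sequences` would be the corresponding option.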

# Building the LSTM model

In [27]:
## Creating model
embedding_vector_features = 40  # every index in embedded_docs is represented by a 40-dimensional vector
model = Sequential()
model.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
model.add(LSTM(500))  # number of LSTM units; try different values
model.add(Dense(1, activation='sigmoid'))  # sigmoid since the output is binary
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints the table itself and returns None, so print() is not needed


In [28]:
len(embedded_docs),y.shape

(5000, (5000,))
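
Since the summary table is not shown, the model size can be estimated by hand. The sketch below applies the standard parameter-count formulas for embedding, LSTM, and dense layers to the values used above; the numbers are computed here, not read from the notebook output.

```python
voc_size = 50000
embedding_dim = 40
lstm_units = 500

# Embedding: one vector of length embedding_dim per vocabulary index
embedding_params = voc_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Roughly three million trainable parameters for 5,000 headlines is a lot, which is worth keeping in mind when reading the training log.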

# Model training

In [29]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [30]:
X_final.shape,y_final.shape

((5000, 30), (5000,))

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.3, random_state=42)

In [32]:
y_train

array([0, 0, 0, ..., 0, 0, 0])

In [33]:
# Model Training
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=75)

Epoch 1/10
47/47 ━━━━━━━━━━━━━━━━━━━━ 17s 303ms/step - accuracy: 0.6800 - loss: 0.6217 - val_accuracy: 0.6920 - val_loss: 0.6299
Epoch 2/10
47/47 ━━━━━━━━━━━━━━━━━━━━ 13s 280ms/step - accuracy: 0.7169 - loss: 0.5858 - val_accuracy: 0.8020 - val_loss: 0.4508
Epoch 3/10
47/47 ━━━━━━━━━━━━━━━━━━━━ 13s 287ms/step - accuracy: 0.8904 - loss: 0.2868 - val_accuracy: 0.8493 - val_loss: 0.4024
Epoch 4/10
47/47 ━━━━━━━━━━━━━━━━━━━━ 21s 295ms/step - accuracy: 0.9548 - loss: 0.1272 - val_accuracy: 0.8427 - val_loss: 0.4975
Epoch 5/10
47/47 ━━━━━━━━━━━━━━━━━━━━ 14s 299ms/step - accuracy: 0.9664 - loss: 0.1000 - val_accuracy: 0.8447 - val_loss: 0.4283
Epoch 6/10
47/47 ━━━━━━━━━━━━━━━━━━━━ 20s 293ms/step - accuracy: 0.9760 - loss: 0.0792 - val_accuracy: 0.8527 - val_loss: 0.4082
Epoch 7/10
47/47

<keras.src.callbacks.history.History at 0x7c7e09a30820>
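
Among the six epochs visible in the log, training accuracy keeps rising while validation loss is lowest at epoch 3, which suggests the model starts to overfit early. Keras offers an `EarlyStopping` callback for this situation; the sketch below only illustrates the patience logic in plain Python, applied to the logged `val_loss` values. The function name `early_stop_epoch` and the choice of `patience=2` are my assumptions.

```python
def early_stop_epoch(val_losses, patience):
    """Return (epoch at which training would stop, epoch with the best val_loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1  # no improvement this epoch
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# val_loss for epochs 1 to 6, copied from the training log
logged_val_loss = [0.6299, 0.4508, 0.4024, 0.4975, 0.4283, 0.4082]

stop_epoch, best_epoch = early_stop_epoch(logged_val_loss, patience=2)
print(stop_epoch, best_epoch)
```

With these values, training would stop after epoch 5 and the best weights would be those from epoch 3.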

# Predictions and model performance

In [34]:
# Predicting the test data
y_pred = model.predict(X_test)
y_pred

47/47 ━━━━━━━━━━━━━━━━━━━━ 3s 51ms/step


array([[0.17259632],
       [0.99880093],
       [0.00133554],
       ...,
       [0.57317424],
       [0.9987025 ],
       [0.00145132]], dtype=float32)

In [35]:
# making the predictions binary: probabilities above 0.6 become class 1

y_pred=np.where(y_pred > 0.6, 1,0)  # the 0.6 threshold could be tuned with an ROC curve
y_pred

array([[0],
       [1],
       [0],
       ...,
       [0],
       [1],
       [0]])
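
The effect of the threshold can be seen on the six probabilities displayed in the prediction output above. This is a small plain-Python sketch; `to_labels` is a hypothetical helper that does the same comparison as the `np.where` call.

```python
def to_labels(probabilities, threshold):
    """Label a prediction 1 when its probability is strictly above the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


# The six probabilities visible in the model.predict output
shown_probs = [0.17259632, 0.99880093, 0.00133554, 0.57317424, 0.9987025, 0.00145132]

labels_06 = to_labels(shown_probs, 0.6)
labels_05 = to_labels(shown_probs, 0.5)
print(labels_06)
print(labels_05)
```

The fourth prediction (0.573) is labelled 0 at a threshold of 0.6 but would be labelled 1 at the more common 0.5, so borderline headlines are sensitive to this choice.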

In [36]:
# Confusion matrix

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)

array([[928, 110],
       [117, 345]])
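
The matrix follows the scikit-learn convention: rows are true classes, columns are predicted classes, so it reads `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision, and recall directly from those four counts.

```python
# Counts copied from the confusion matrix above
tn, fp = 928, 110
fn, tp = 117, 345

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 1 (positive class)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0, treating 0 as the class of interest
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(accuracy, 4))
print(round(precision_1, 2), round(recall_1, 2))
print(round(precision_0, 2), round(recall_0, 2))
```

Class 1 is harder for the model: about a quarter of the true class 1 headlines are missed, compared with roughly one in ten for class 0.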

In [37]:
# Getting the accuracy score

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.8486666666666667

In [38]:
# Getting the classification report

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.89      0.89      0.89      1038
           1       0.76      0.75      0.75       462

    accuracy                           0.85      1500
   macro avg       0.82      0.82      0.82      1500
weighted avg       0.85      0.85      0.85      1500
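
Because the classes are imbalanced, accuracy is easier to interpret next to a majority-class baseline. The short sketch below computes that baseline from the support column of the report and compares it with the accuracy obtained above.

```python
# Support values from the classification report
support_0 = 1038
support_1 = 462

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(support_0, support_1) / (support_0 + support_1)

# Accuracy reported by accuracy_score above
model_accuracy = 0.8486666666666667

improvement = model_accuracy - baseline_accuracy
print(round(baseline_accuracy, 3), round(improvement, 3))
```

Always predicting class 0 would already score about 0.69 on this test split, so the LSTM's gain over that baseline is roughly 16 percentage points.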

