### 5.5: Stock Sentiment Analysis using RNN

##### Objective

This workshop aims to explore Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) architectures for text-based tasks. We will learn how to preprocess text data (through tokenisation and word embedding), apply RNNs to a news classification problem, and use LSTMs for text generation tasks such as poetry creation.

##### Learning Outcomes

1. **Text Tokenisation and Embedding:** Learn how to transform raw text into a suitable input format (tokens) and represent words or tokens as dense vector embeddings for neural network processing.
2. **RNN-Based Classification:** Apply a simple RNN model to a text classification task, using the AG News dataset as an example.
3. **LSTM for Text Generation:** Gain hands-on experience building an LSTM model to generate text using the Adele.txt poetry dataset as a case study.

##### Introduction to Text Modelling for Neural Networks

When dealing with textual data, we must convert language into numerical representations that neural networks can process. This involves two key steps:

1. **Tokenisation:** Splitting the text into smaller units, usually words (or sometimes sub-words or characters).
2. **Word Embedding:** Mapping each token to a continuous vector space (e.g., via word2vec or GloVe-style embeddings) so that semantically similar tokens have similar vector representations.

These representations capture syntactic and semantic relationships, enabling models to learn patterns across sequences of words.
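The two steps above can be sketched in plain Python. This is a minimal illustration, not the Keras `Tokenizer` used later: the helper names (`tokenise`, `build_vocab`, `encode`), the toy corpus, and the 4-dimensional random embedding table are all assumptions made for the example. In a real model the embedding vectors are learned during training or loaded from pretrained word2vec/GloVe files.

```python
import random

def tokenise(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokeniser)."""
    return text.lower().split()

def build_vocab(sentences):
    """Assign an integer id to every distinct token; ids 0 and 1 are reserved."""
    vocab = {"<PAD>": 0, "<OOV>": 1}
    for sentence in sentences:
        for token in tokenise(sentence):
            # setdefault only assigns a new id the first time a token is seen
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Map a sentence to token ids, falling back to <OOV> for unseen words."""
    return [vocab.get(token, vocab["<OOV>"]) for token in tokenise(text)]

# Hypothetical toy corpus, for illustration only
corpus = ["bitcoin price rises", "bitcoin price falls sharply"]
vocab = build_vocab(corpus)

# Randomly initialised embedding table: one vector per token id
rng = random.Random(0)
embedding_dim = 4
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

ids = encode("bitcoin price crashes", vocab)   # "crashes" is unseen, so it maps to <OOV>
vectors = [embedding_table[i] for i in ids]    # embedding lookup is just row indexing
print(ids)
print(len(vectors), len(vectors[0]))
```

The lookup at the end is all an embedding layer does at inference time: each token id selects one row of a trainable matrix.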


##### Recurrent Neural Networks (RNN) and LSTM

**Recurrent Neural Networks (RNNs)** are designed to handle sequential data by maintaining a hidden state that propagates information from one time step to the next. Traditional RNNs, however, often struggle with long-term dependencies because gradients tend to vanish or explode as they are propagated back through many time steps.
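To make the hidden-state idea concrete, here is a minimal sketch of a vanilla RNN forward pass in plain Python, using the common update rule `h_t = tanh(W_x x_t + W_h h_{t-1} + b)`. The weight values, the 2-dimensional sizes, and the function names are illustrative assumptions; a framework layer such as `SimpleRNN` learns the weights and uses optimised tensor operations.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(W_x @ x_t + W_h @ h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def rnn_forward(inputs, W_x, W_h, b):
    """Run the step over a whole sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)   # the same weights are reused at every step
    return h

# Toy, hand-picked weights: 2 input features, 2 hidden units
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
final_state = rnn_forward(sequence, W_x, W_h, b)
print(final_state)
```

Because each state depends on the previous one, feeding the same vectors in a different order generally yields a different final state, which is what lets the model be sensitive to word order.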

**Long Short-Term Memory (LSTM)** networks address these issues by introducing a memory cell and gating mechanisms (input, output, and forget gates) that regulate how information flows through the network. This allows the model to maintain longer-range dependencies, making it especially effective for language modelling and text generation tasks.
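One common formulation of the gates is written out below. The notation is an assumption introduced here rather than taken from this workshop: $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $c_t$ is the memory cell.

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The additive update of $c_t$ is the key difference from a plain RNN: when $f_t$ is close to 1, the cell state is carried forward nearly unchanged, which gives gradients a more direct path across many time steps.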


##### Example Applications

1. **AG News Classification:**
    - Feed tokenised AG News articles into an RNN architecture for topic classification.
    - The model learns to identify news categories (e.g., World, Business, Sports) by capturing patterns in word usage.
2. **LSTM for Poetry Generation:**
    - Train an LSTM on the Adele.txt dataset (or any poetry corpus) to learn linguistic styles and generate new verses.
    - By sampling from the trained model, you can produce novel lines of text that mimic the style of the training data.
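
Sampling from a trained generator can be sketched as follows. The model is assumed to output one unnormalised score (logit) per vocabulary entry. The `temperature` parameter and the function name `sample_next_token` are assumptions added for illustration, not something the workshop text specifies. Lower temperatures favour the highest-scoring token, while higher temperatures produce more varied output.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample a token index from unnormalised scores after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    max_score = max(scaled)                       # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]          # softmax probabilities
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for a 3-token vocabulary
logits = [1.0, 5.0, 2.0]
rng = random.Random(42)
cool = [sample_next_token(logits, temperature=0.1, rng=rng) for _ in range(10)]
warm = [sample_next_token(logits, temperature=2.0, rng=rng) for _ in range(10)]
print("temperature 0.1:", cool)
print("temperature 2.0:", warm)
```

To generate a verse, the sampled index would be appended to the input sequence and the model queried again, one token at a time.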

##### Conclusion

In this workshop, you explored how tokenisation, word embedding, and recurrent architectures (RNN, LSTM) form the backbone of many Natural Language Processing (NLP) applications. Implementing classification and text-generation tasks gives practical insight into how neural networks handle sequential data. These skills provide a solid foundation for more advanced NLP techniques, such as attention mechanisms and Transformer-based models.


##### **Sources**
- https://newsapi.org/
- https://newsapi.org/docs/endpoints/everything
- https://www.researchgate.net/publication/363860201_Sentimental_Classification_of_News_Headlines_using_Recurrent_Neural_Network
- https://www.ijert.org/text-classification-using-rnn
- https://www.geeksforgeeks.org/nlp/rnn-for-text-classifications-in-nlp/
- https://www.tensorflow.org/text/tutorials/text_classification_rnn
- https://dev.to/aionlinecourse/learn-how-to-build-multi-class-text-classification-models-with-rnn-and-lstm-ned

In [2]:
import requests
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In [3]:
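import os

# Read the NewsAPI key from the environment instead of hard-coding it in the notebook.
# NEWSAPI_KEY is an assumed variable name; export it in your shell before running.
API_KEY = os.environ.get('NEWSAPI_KEY', '')
query = 'MSTR'

# Define the endpoint and parameters
url = 'https://newsapi.org/v2/everything'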
params = {
    'q': query,
    'sortBy': 'popularity',
    'language': 'en',
    'pageSize': 15,  # Limits the results to 15 articles
    'apiKey': API_KEY
}

try:
    # Execute the request
    response = requests.get(url, params=params, timeout=10)  # timeout avoids hanging indefinitely
    response.raise_for_status()
    data = response.json()

    # Print the returned headlines (at most 15)
    articles = data.get('articles', [])

    print("Top 15 Important Headlines:\n" + "=" * 35)
    for i, article in enumerate(articles, 1):
        print(f"{i}. {article['title']}")
        print(f"   Source: {article['source']['name']}")
        print(f"   Link: {article['url']}\n")

except requests.exceptions.RequestException as e:
    print(f"Error fetching news: {e}")

Top 15 Important Headlines:
1. The Shocking Reason This Analyst Says Michael Saylor and MicroStrategy Stock Will Take Bitcoin Prices to $0
   Source: Barchart.com
   Link: https://www.barchart.com/story/news/113796/the-shocking-reason-this-analyst-says-michael-saylor-and-microstrategy-stock-will-take-bitcoin-prices-to-0

2. 'If People in the Rest of the World Knew What I Know': MicroStrategy's Michael Saylor's Viral Message About MSTR Stock and Bitcoin to $10 Million
   Source: Yahoo Entertainment
   Link: https://consent.yahoo.com/v2/collectConsent?sessionId=1_cc-session_8062cab1-ad18-4f12-9a69-47d8ac81e40d

3. Why MicroStrategy’s Latest Bitcoin Purchase Is Deeply Concerning
   Source: BeInCrypto
   Link: https://beincrypto.com/microstrategy-latest-bitcoin-buy-4-reasons-concerning/

4. Stock market today: Dow soars 1,000 points, leading S&P 500, Nasdaq higher as Wall Street rebounds from rout
   Source: Yahoo Entertainment
   Link: https://finance.yahoo.com/news/live/stock-market-toda

In [4]:
try:
    response = requests.get(url, params=params, timeout=10)  # timeout avoids hanging indefinitely
    response.raise_for_status()
    data = response.json()
    headlines = [article['title'] for article in data.get('articles', [])]
    for i, headline in enumerate(headlines, 1):
        print(f"{i}. {headline}")
    
except Exception as e:
    print(f"Error: {e}")

1. The Shocking Reason This Analyst Says Michael Saylor and MicroStrategy Stock Will Take Bitcoin Prices to $0
2. 'If People in the Rest of the World Knew What I Know': MicroStrategy's Michael Saylor's Viral Message About MSTR Stock and Bitcoin to $10 Million
3. Why MicroStrategy’s Latest Bitcoin Purchase Is Deeply Concerning
4. Stock market today: Dow soars 1,000 points, leading S&P 500, Nasdaq higher as Wall Street rebounds from rout
5. Bitcoin/Crypto Crash Captures Headlines as Potentially More Serious Tech-Driven Debt Retreat Progresses
6. ETF that feasts on carnage in bitcoin-holder Strategy hits record high
7. Michael Saylor's Strategy purchased $168 million in bitcoin last week
8. Strategy purchased $264 million in bitcoin last week, a slowdown from recent acquisition pace
9. Michael Saylor's Strategy made modest bitcoin purchase at start of last week's crypto crash
10. Strategy's STRC returns to $100, poised to unlock more bitcoin accumulation
11. Strategy to initiate a bitcoin

In [5]:
# Manual sentiment labels, one per headline (0=positive, 1=negative, 2=neutral)
labels = [1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 0, 0, 1, 0, 2]
assert len(labels) == len(headlines), "Each headline needs exactly one label"

# Convert integer labels to one-hot vectors
labels_cat = to_categorical(labels, num_classes=3)

# Preprocessing: Tokenization and Padding
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(headlines)

sequences = tokenizer.texts_to_sequences(headlines)
max_len = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')

vocab_size = len(tokenizer.word_index) + 1
print(f"Vocabulary size: {vocab_size}")
print(f"Max sequence length: {max_len}")
print(f"Padded shape: {padded_sequences.shape}")

Vocabulary size: 145
Max sequence length: 25
Padded shape: (15, 25)
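
What `padding='post'` does can be sketched in plain Python. This is an illustrative re-implementation, not the Keras source: the function name `pad_post` is hypothetical, and the front-truncation behaviour mirrors the documented default `truncating='pre'` of `pad_sequences`. It assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each list of token ids with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep their last `maxlen` ids,
    which mirrors the Keras default of truncating from the front.
    """
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

# Hypothetical token-id sequences of different lengths
example = [[5, 3], [7, 2, 9, 4]]
print(pad_post(example, maxlen=4))
print(pad_post([[1, 2, 3, 4, 5]], maxlen=3))
```

Here `max_len` is the longest headline in the training set, so nothing is truncated during training. A new headline longer than `max_len` would lose its leading tokens at prediction time.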


In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    padded_sequences,
    labels_cat,
    test_size=0.2,
    random_state=42,
    stratify=labels
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")

Train shape: (12, 25), Test shape: (3, 25)


In [7]:
# Build and compile RNN model

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=50,
                              input_length=max_len),
    tf.keras.layers.SimpleRNN(128, return_sequences=False),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 50)            7250      
                                                                 
 simple_rnn (SimpleRNN)      (None, 128)               22912     
                                                                 
 dense (Dense)               (None, 32)                4128      
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 3)                 99        
                                                                 
Total params: 34389 (134.33 KB)
Trainable params: 34389 (134.33 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
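
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer shapes are what you intended. The sketch below uses the standard formulas for each layer type; the variable names are chosen for the example.

```python
vocab_size = 145     # from the tokenizer output above
embed_dim = 50       # Embedding output_dim
rnn_units = 128      # SimpleRNN units
dense_units = 32     # hidden Dense layer
num_classes = 3      # softmax output

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + bias, per hidden unit
rnn_params = rnn_units * (embed_dim + rnn_units + 1)

# Dense layers: weight matrix + bias vector
dense_params = rnn_units * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

total_params = embedding_params + rnn_params + dense_params + output_params
print(embedding_params, rnn_params, dense_params, output_params, total_params)
```

With roughly 34,000 trainable parameters and only 12 training headlines, the model can memorise the training set easily, so the reported metrics should be read as a demonstration of the workflow rather than as evidence of generalisation.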


2026-02-19 23:12:42.789364: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

In [8]:
history = model.fit(
    X_train, y_train,
    epochs=25,
    batch_size=4,
    validation_split=0.25,
    verbose=1
)

Epoch 1/25


2026-02-19 23:13:10.202376: I external/local_xla/xla/service/service.cc:168] XLA service 0x72804046a470 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2026-02-19 23:13:10.202404: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce GTX 1050 Ti, Compute Capability 6.1
2026-02-19 23:13:10.212062: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2026-02-19 23:13:10.230284: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
I0000 00:00:1771560790.311476 1164945 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
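
It helps to work out how many samples each epoch actually sees. Per the Keras `fit` documentation, `validation_split` holds out the last fraction of the arrays passed in, before any shuffling. The arithmetic below is a sketch of the sample counts for this run; the exact rounding Keras applies is not needed here because 12 × 0.75 is a whole number.

```python
import math

n_train_total = 12        # rows in X_train after the train-test split
validation_split = 0.25
batch_size = 4

n_fit = int(n_train_total * (1 - validation_split))   # samples used for gradient updates
n_val = n_train_total - n_fit                         # samples held out for validation
steps_per_epoch = math.ceil(n_fit / batch_size)       # batches per epoch

print(n_fit, n_val, steps_per_epoch)
```

So the model is fitted on 9 headlines and validated on 3, with 3 weight updates per epoch. Because the held-out rows are simply the last ones in `X_train`, the validation set is not stratified by class.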


In [9]:
# Evaluate on test set + detailed metrics

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Loss:     {test_loss:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

# Predictions
y_pred_prob = model.predict(X_test, verbose=0)
y_pred = np.argmax(y_pred_prob, axis=1)
y_true = np.argmax(y_test, axis=1)

# Confusion matrix (labels fixes the row/column order even if a class is absent)
class_ids = [0, 1, 2]
cm = confusion_matrix(y_true, y_pred, labels=class_ids)
print("\nConfusion Matrix:")
print(cm)

# Full classification report
target_names = ['Positive', 'Negative', 'Neutral']
cr = classification_report(y_true, y_pred, labels=class_ids, target_names=target_names, digits=3, zero_division=0)
print("\nClassification Report:")
print(cr)


Test Loss:     0.9013
Test Accuracy: 0.6667

Confusion Matrix:
[[1 0 0]
 [0 1 0]
 [0 1 0]]

Classification Report:
              precision    recall  f1-score   support

    Positive      1.000     1.000     1.000         1
    Negative      0.500     1.000     0.667         1
     Neutral      0.000     0.000     0.000         1

    accuracy                          0.667         3
   macro avg      0.500     0.667     0.556         3
weighted avg      0.500     0.667     0.556         3
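
The figures in the report follow directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes them in plain Python; the helper `safe_div` is an assumption that mirrors `zero_division=0` by returning 0 when a denominator is zero.

```python
cm = [[1, 0, 0],
      [0, 1, 0],
      [0, 1, 0]]
names = ["Positive", "Negative", "Neutral"]

def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

report = {}
for k, name in enumerate(names):
    true_pos = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)   # column sum
    actually_k = sum(cm[k])                      # row sum
    precision = safe_div(true_pos, predicted_as_k)
    recall = safe_div(true_pos, actually_k)
    f1 = safe_div(2 * precision * recall, precision + recall)
    report[name] = (precision, recall, f1)

accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(sum(row) for row in cm)
macro_f1 = sum(f1 for _, _, f1 in report.values()) / len(report)

for name, (p, r, f1) in report.items():
    print(f"{name:>8}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1:.3f}")
```

With one test example per class, a single misclassification moves accuracy by a third, so these numbers are far too coarse to compare models.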



In [11]:
new_headline = "MSTR Has Lost 62 percent in a Year and Bitcoin Is Still Below Its Buy Price"

# Preprocess
seq = tokenizer.texts_to_sequences([new_headline])
padded = pad_sequences(seq, maxlen=max_len, padding='post')

# Predict
probs = model.predict(padded, verbose=0)[0]
pred_class = np.argmax(probs)

labels_map = {0: "Positive", 1: "Negative", 2: "Neutral"}

print(f"Headline: {new_headline}")
print(f"Probabilities: Positive {probs[0]:.3f} | Negative {probs[1]:.3f} | Neutral {probs[2]:.3f}")
print(f"Predicted sentiment: {labels_map[pred_class]} (class {pred_class})")
print(f"Positive score: {probs[0]:.3f}")

Headline: MSTR Has Lost 62 percent in a Year and Bitcoin Is Still Below Its Buy Price
Probabilities: Positive 0.170 | Negative 0.769 | Neutral 0.061
Predicted sentiment: Negative (class 1)
Positive score: 0.170
