### 5.7: News Classifications

##### Objective

This workshop explores Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) architectures for text-based tasks. We will learn how to preprocess text data (through tokenisation and word embedding), apply RNNs to a news classification problem, and use LSTMs for text generation tasks such as poetry creation.

##### Learning Outcomes

1. **Text Tokenisation and Embedding:** Learn how to transform raw text into a suitable input format (tokens) and represent tokens as dense vector embeddings for neural network processing.
2. **RNN-Based Classification:** Apply a simple RNN model to a text classification task, using the AG News dataset as an example.
3. **LSTM for Text Generation:** Gain hands-on experience building an LSTM model to generate text using the Adele.txt poetry dataset as a case study.

##### Introduction to Text Modelling for Neural Networks

When dealing with textual data, we must convert language into numerical representations that neural networks can process. This involves two key steps:

1. **Tokenisation:** Splitting the text into smaller units, usually words (or sometimes sub-words or characters).
2. **Word Embedding:** Mapping each token to a continuous vector space (e.g., via word2vec or GloVe embeddings) so that semantically similar tokens have similar vector representations.

These representations capture syntactic and semantic relationships, enabling models to learn patterns across sequences of words.
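As a rough illustration of these two steps, the sketch below tokenises two made-up headlines, builds a vocabulary, and looks up a vector for each token id. The helper names (`tokenise`, `build_vocab`, `encode`) and the sample sentences are assumptions introduced for this example, and the embedding values are random placeholders; in a real model they would be learned or loaded from pretrained vectors.

```python
import random
import re

def tokenise(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts):
    """Assign an integer id to each distinct token; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for text in texts:
        for token in tokenise(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab, max_len):
    """Map tokens to ids, then truncate or right-pad with 0 up to max_len."""
    ids = [vocab[t] for t in tokenise(text) if t in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))

# Hypothetical sample sentences, not taken from any dataset.
texts = ["Stocks rally as markets rebound", "Markets fall as stocks slide"]
vocab = build_vocab(texts)
encoded = [encode(t, vocab, max_len=6) for t in texts]
# encoded[0] is [1, 2, 3, 4, 5, 0]; shared words reuse the same id in encoded[1]

# Embedding table: one small random vector per token id (placeholder values).
rng = random.Random(0)
embedding_dim = 4
embedding = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]
vectors = [embedding[i] for i in encoded[0]]  # 6 tokens, each a 4-dimensional vector
```

The padded id lists play the same role as the output of a tokeniser followed by padding, and the table lookup is what an embedding layer performs internally.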


##### Recurrent Neural Networks (RNN) and LSTM

**Recurrent Neural Networks (RNNs)** are designed to handle sequential data by maintaining a hidden state that carries information from one time step to the next. Simple RNNs, however, often struggle with long-term dependencies because gradients tend to vanish or explode when backpropagated through many time steps.
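The hidden-state update can be sketched in a few lines of plain Python. This is a minimal illustration of the standard recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b), not the implementation used by any particular library; the dimensions, random weights, and one-hot inputs are assumptions chosen to keep the example small.

```python
import math
import random

def rnn_step(x, h_prev, W_x, W_h, b):
    """One simple-RNN step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W_x[j][i] * x[i] for i in range(len(x)))
        total += sum(W_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

rng = random.Random(0)
input_dim, hidden_dim = 3, 2  # illustrative sizes only
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden_dim)]
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
b = [0.0] * hidden_dim

# Three one-hot input vectors stand in for three embedded tokens.
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = [0.0] * hidden_dim
states = []
for x in sequence:
    h = rnn_step(x, h, W_x, W_h, b)  # the same weights are reused at every step
    states.append(h)
```

The final entry of `states` depends on every earlier input through the repeated application of `W_h`, which is also why gradients can shrink or grow across long sequences.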

**Long Short-Term Memory (LSTM)** networks mitigate these issues by introducing a memory cell and gating mechanisms (input, output, and forget gates) that regulate how information flows through the network. This allows the model to retain longer-range dependencies, making it well suited to language modelling and text generation tasks.
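To make the role of the gates concrete, here is a deliberately tiny sketch of one LSTM step in which the input, hidden state, and cell state are all single numbers. The gate parameters are set by hand (an assumption for illustration, not trained values) so that the effect of the forget gate on the memory cell is easy to see.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input, hidden state, and cell state.
    p maps each gate name to (input weight, recurrent weight, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value for the cell state
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def run(params, steps=50, c0=1.0):
    """Apply the same step repeatedly and return the final cell state."""
    h, c = 0.0, c0
    for _ in range(steps):
        h, c = lstm_step(0.5, h, c, params)
    return c

# Hand-picked parameters: the input gate is nearly closed in both settings.
base = {"input": (0.0, 0.0, -10.0), "output": (1.0, 1.0, 0.0), "candidate": (1.0, 1.0, 0.0)}
keep = dict(base, forget=(0.0, 0.0, 10.0))   # forget gate close to 1
drop = dict(base, forget=(0.0, 0.0, -10.0))  # forget gate close to 0
c_keep = run(keep)
c_drop = run(drop)
```

With the forget gate held near 1, `c_keep` stays close to its initial value of 1.0 after 50 steps, whereas `c_drop` collapses towards 0. A trained LSTM learns when to do each.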


##### Example Applications

1. **AG News Classification:**
    - Tokenise AG News articles and feed them into an RNN architecture for topic classification.
    - The model learns to identify news categories (World, Sports, Business, Sci/Tech) by capturing patterns in word usage.
2. **LSTM for Poetry Generation:**
    - Train an LSTM on the Adele.txt dataset (or any poetry corpus) to learn linguistic styles and generate new verses.
    - By sampling from the trained model, you can produce novel lines of text that mimic the style of the training data.
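Sampling from a trained model usually means turning the model's output scores into a probability distribution and drawing the next token from it. The sketch below shows one common variant, temperature sampling. The vocabulary and scores are hypothetical stand-ins for a real model's output, and the function names are introduced only for this example.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical vocabulary and model scores for the next token.
vocab = ["love", "rain", "fire", "hello"]
logits = [2.0, 1.0, 0.5, 0.1]

rng = random.Random(42)
probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.2)
next_token = vocab[sample_index(probs_sharp, rng)]
```

Lower temperatures make the output more predictable, while higher temperatures produce more varied (and more error-prone) text. In a generation loop, the sampled token would be appended to the input and the model queried again.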

##### Student Task

- Select a book dataset of your choice and preprocess the text for training.
- Implement a bidirectional LSTM or GRU model using TensorFlow for text generation.
- Train the model to learn the text style and structure, then generate new text samples.
- Share your code, model configuration, training strategies, and generated samples in the collaborative coding discussion forum.
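As a possible starting point for the preprocessing step, the sketch below turns a token sequence into (input window, next token) training pairs, which is one common way to frame text generation as a supervised problem. The function name `make_windows`, the window size, and the sample sentence are assumptions for illustration; a real book corpus would need proper tokenisation first.

```python
def make_windows(token_ids, window):
    """Return (inputs, targets): each input is `window` consecutive ids and
    the target is the id that immediately follows it."""
    inputs, targets = [], []
    for start in range(len(token_ids) - window):
        inputs.append(token_ids[start:start + window])
        targets.append(token_ids[start + window])
    return inputs, targets

# Hypothetical sample text standing in for a book corpus.
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)), start=1)}
ids = [vocab[t] for t in tokens]

X, y = make_windows(ids, window=3)
# Nine tokens with a window of three give six training pairs.
```

The resulting `X` and `y` lists can be converted to arrays and passed to a sequence model whose final layer predicts a distribution over the vocabulary.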

##### Conclusion

In this workshop, you explored how tokenisation, word embedding, and recurrent architectures (RNN, LSTM) underpin many Natural Language Processing (NLP) applications. By implementing classification and text-generation tasks, you gained practical insight into how neural networks handle sequential data. These skills provide a foundation for more advanced NLP techniques, such as attention mechanisms and Transformer-based models.

In [10]:
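import os

import requests

# Read the NewsAPI key from the environment instead of hard-coding it in the notebook
# (NEWSAPI_KEY is an assumed variable name; set it before running this cell)
API_KEY = os.environ['NEWSAPI_KEY']
query = 'MSTR'

# Define the endpoint and parameters
url = 'https://newsapi.org/v2/everything'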
params = {
    'q': query,
    'sortBy': 'popularity',
    'language': 'en',
    'pageSize': 15,  # Limits the results to 15 articles
    'apiKey': API_KEY
}

try:
    # Execute the request
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()

    # Print the list of 15 headlines
    articles = data.get('articles', [])
    
    print("Top 15 Important Headlines:\n" + "=" * 35)
    for i, article in enumerate(articles, 1):
        print(f"{i}. {article['title']}")
        print(f"   Source: {article['source']['name']}")
        print(f"   Link: {article['url']}\n")

except requests.exceptions.RequestException as e:
    print(f"Error fetching news: {e}")

Top 15 Important Headlines:
1. The Shocking Reason This Analyst Says Michael Saylor and MicroStrategy Stock Will Take Bitcoin Prices to $0
   Source: Barchart.com
   Link: https://www.barchart.com/story/news/113796/the-shocking-reason-this-analyst-says-michael-saylor-and-microstrategy-stock-will-take-bitcoin-prices-to-0

2. 'If People in the Rest of the World Knew What I Know': MicroStrategy's Michael Saylor's Viral Message About MSTR Stock and Bitcoin to $10 Million
   Source: Yahoo Entertainment
   Link: https://consent.yahoo.com/v2/collectConsent?sessionId=1_cc-session_8062cab1-ad18-4f12-9a69-47d8ac81e40d

3. Why MicroStrategy’s Latest Bitcoin Purchase Is Deeply Concerning
   Source: BeInCrypto
   Link: https://beincrypto.com/microstrategy-latest-bitcoin-buy-4-reasons-concerning/

4. Stock market today: Dow soars 1,000 points, leading S&P 500, Nasdaq higher as Wall Street rebounds from rout
   Source: Yahoo Entertainment
   Link: https://finance.yahoo.com/news/live/stock-market-toda

In [17]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

headlines = []  # stays empty if the request fails, so the check below catches it
try:
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()
    headlines = [article['title'] for article in data.get('articles', [])]
    for i, headline in enumerate(headlines, 1):
        print(f"{i}. {headline}")

except Exception as e:
    print(f"Error: {e}")

# Sample labels (0: positive, 1: negative, 2: neutral), one per headline, in order
labels = [1, 1, 1, 0, 2, 1, 1, 0, 0, 2, 0, 0, 1, 0, 2]
assert len(headlines) == len(labels), "Each headline needs exactly one label"

# Preprocessing: Tokenization and Padding
tokenizer = Tokenizer()
tokenizer.fit_on_texts(headlines)
sequences = tokenizer.texts_to_sequences(headlines)
max_len = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')
vocab_size = len(tokenizer.word_index) + 1

# Convert labels to categorical
labels_cat = tf.keras.utils.to_categorical(labels, num_classes=3)

1. The Shocking Reason This Analyst Says Michael Saylor and MicroStrategy Stock Will Take Bitcoin Prices to $0
2. 'If People in the Rest of the World Knew What I Know': MicroStrategy's Michael Saylor's Viral Message About MSTR Stock and Bitcoin to $10 Million
3. Why MicroStrategy’s Latest Bitcoin Purchase Is Deeply Concerning
4. Stock market today: Dow soars 1,000 points, leading S&P 500, Nasdaq higher as Wall Street rebounds from rout
5. Bitcoin/Crypto Crash Captures Headlines as Potentially More Serious Tech-Driven Debt Retreat Progresses
6. ETF that feasts on carnage in bitcoin-holder Strategy hits record high
7. Michael Saylor's Strategy purchased $168 million in bitcoin last week
8. Strategy purchased $264 million in bitcoin last week, a slowdown from recent acquisition pace
9. Strategy's STRC returns to $100, poised to unlock more bitcoin accumulation
10. Michael Saylor's Strategy made modest bitcoin purchase at start of last week's crypto crash
11. Strategy to initiate a bitcoin

In [18]:
# Build a simple RNN classifier (SimpleRNN can be swapped for LSTM, which often copes better with longer sequences)
model_class = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 50, input_length=max_len),
    tf.keras.layers.SimpleRNN(128),
    tf.keras.layers.Dense(3, activation='softmax')
])

model_class.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model_class.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 25, 50)            7200      
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 128)               22912     
                                                                 
 dense_1 (Dense)             (None, 3)                 387       
                                                                 
Total params: 30499 (119.14 KB)
Trainable params: 30499 (119.14 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
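The parameter counts in the summary can be checked by hand. The short sketch below recomputes them from the layer sizes; the vocabulary size of 144 is inferred from the embedding row (7200 parameters divided by 50 dimensions), and the helper functions are written only for this check.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + one bias per unit."""
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """One weight per input plus one bias, for each output unit."""
    return units * (input_dim + 1)

vocab_size = 7200 // 50  # 144, inferred from the embedding row of the summary
counts = [
    embedding_params(vocab_size, 50),  # embedding_1
    simple_rnn_params(50, 128),        # simple_rnn_1
    dense_params(128, 3),              # dense_1
]
total = sum(counts)
```

Here `counts` evaluates to `[7200, 22912, 387]` and `total` to `30499`, matching the summary above.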


In [19]:
# Training
history_class = model_class.fit(padded_sequences, 
                                labels_cat, 
                                epochs=10, 
                                batch_size=4, 
                                validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
