<a href="https://colab.research.google.com/github/SzymonNowakowski/Machine-Learning-2024/blob/master/Lab12_nlp-introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 12 - Natural Language Processing - Introduction

### Author: Szymon Nowakowski


# Introduction
---------------


# Google Crowdsource Sentiment Dataset  
--------------

It features **43k sentences** with **human-annotated emotion labels** from real-world user-generated content such as feedback, reviews, and comments.

Each sentence can be labeled with **one or more of 27 fine-grained emotion categories**, such as:
- *joy*, *amusement*, *approval*, *sadness*, *anger*, *fear*, *realization*, *pride*, etc.,  
plus an optional **neutral** label.

Because the dataset is **multi-label**, a single sentence may express a combination of emotions (e.g., *pride* and *fear*).

The dataset was released by Google as part of its **Crowdsource project** and is related to the **GoEmotions** initiative. It's particularly valuable for building models that understand nuanced emotional language in realistic user comments.




## Mapping emotions to sentiment

To use the dataset for **sentiment classification**, a rule-based mapping is applied to reduce multi-label emotion annotations into a single sentiment label:

- If **any emotion** in the label set is **positive** → classify as **Positive**
- Else if **any emotion** in the label set is **negative** → classify as **Negative**
- Else → classify as **Neutral**

This decision rule captures the dominant emotional tone of each sentence, simplifying the dataset for use in binary or ternary sentiment classification tasks.

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.w

In [11]:
from datasets import load_dataset
import pandas as pd

# Load the GoEmotions dataset (by Google)
dataset = load_dataset("go_emotions")

# Convert to binary sentiment (positive/neutral/negative)
def map_to_sentiment(example):
    positive = {
        1,   # admiration
        2,   # amusement
        3,   # approval
        7,   # caring
        10,  # desire
        12,  # excitement
        14,  # gratitude
        19,  # love
        20,  # optimism
        22,  # pride
        23,  # relief
        26   # joy
    }

    negative = {
        4,   # anger
        5,   # annoyance
        6,   # disapproval
        8,   # confusion
        9,   # disappointment
        11,  # embarrassment
        13,  # fear
        15,  # grief
        16,  # nervousness
        17,  # remorse
        18  # sadness
    }

    neutral = {
        0,   # neutral
        21,  # curiosity (ambiguous, context-specific)
        24,  # realization (realization label is often used for sentences that show understanding, acknowledgment, or reflection, without emotional intensity, like in “Oh, now I get what she meant” )
        25,  # surprise (can be negative or positive, but often context-specific)
        27   # none (used when no emotion is detected)
    }
    labels = set(example["labels"])
    if labels & positive:
        return "positive"
    elif labels & negative:
        return "negative"
    else:
        return "neutral"

# Apply sentiment mapping
dataset = dataset["train"].map(lambda x: {"sentiment": map_to_sentiment(x)})

# Convert to DataFrame for easier use
df = dataset.to_pandas()[["text", "sentiment"]]

print(df.sample(5))  # Show sample entries
print("Rows:", len(df))


                                                    text sentiment
11929  Dayyuuummm!!! If you ever get another one, you...  negative
27426  > neoprene kayaking gloves Thanks will check t...  negative
35543  Not cheating and not using exploits is honorab...  negative
17869  Someone that can check your dash and aa your j...   neutral
31134  watching [NAME] attempt hook shots gives me a ...  positive
Rows: 43410


In [12]:
from sklearn.model_selection import train_test_split

# Split into 80% train and 20% validation
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['sentiment'])

# Check the sizes
print(f"Training size: {len(train_df)}, Validation size: {len(val_df)}")
print(train_df['sentiment'].value_counts(normalize=True))
print(val_df['sentiment'].value_counts(normalize=True))

Training size: 34728, Validation size: 8682
sentiment
neutral     0.391384
positive    0.329158
negative    0.279457
Name: proportion, dtype: float64
sentiment
neutral     0.391384
positive    0.329187
negative    0.279429
Name: proportion, dtype: float64


# Tokenizer
-------------------

To feed text into a neural network, we need to represent words in a "neural-network-ish" way — that is, as numbers. The standard approach is to use a tokenizer, often from a pretrained model. However, since we plan to experiment with our own attention modules later on, **we’ll avoid using any pretrained tokenizer**.

Instead, we’ll go with a simple, word-based tokenization. As part of this, we’ll clean the text by removing any non-standard HTML tags, digits, extra whitespace, and punctuation. We’ll also convert all words to lowercase to ensure consistency.

## Special Tokens: `<PAD>` and `<UNK>`

In our text preprocessing pipeline, we convert each word to a number using a vocabulary. Two special tokens help us handle padding and unknown words.




### `<PAD>` — Padding Token

- Represents empty slots when we need all input sequences to be the same length.
- Assigned index `0`.
- Used so that batches of sentences can be processed together by the model.

For example:  
Original: `[17, 5, 23]`  
Padded:   `[17, 5, 23, 0, 0]` (for a fixed length of 5)




### `<UNK>` — Unknown Token

- Represents any word that is **not in the vocabulary**.
- Assigned index `1`.
- Occurs when:
  1. A word was **too rare in the training data** (appeared only once and was excluded from the vocabulary).
  2. A word appears **only in validation or test data**.

> In our setup, we **excluded all words that appear only once** in the training set.  
> So even in the training data, some tokens are replaced with `<UNK>`.  
> These are called **rare unknowns** — they help the model learn how to handle unusual or unfamiliar words.


By including `<UNK>` during training, we teach the model how to deal with unseen or rare words at test time — which is **crucial for generalization**.


In [15]:
import re
from collections import Counter

PAD_LEN = 32

# Tokenize with cleaning
def tokenize(text):
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)       # remove HTML tags
    text = re.sub(r'[^a-z\s]', ' ', text)      # remove digits and punctuation
    text = re.sub(r'\s+', ' ', text).strip()   # normalize whitespace
    return text.split()

# Tokenize text
train_tokens = train_df['text'].apply(tokenize)
val_tokens = val_df['text'].apply(tokenize)

# Build vocabulary from training set — exclude rare words (freq = 1)
token_counter = Counter(token for sentence in train_tokens for token in sentence)
vocab = {
    token: idx + 2
    for idx, (token, count) in enumerate(token_counter.items())
    if count > 1
}
vocab['<PAD>'] = 0
vocab['<UNK>'] = 1

# Convert tokens to indices
def tokens_to_indices(tokens, vocab):
    return [vocab.get(token, vocab['<UNK>']) for token in tokens]

# Pad or truncate sequences to a fixed length
def pad_sequence(seq, max_len=32, pad_value=0):
    if len(seq) < max_len:
        return seq + [pad_value] * (max_len - len(seq))
    else:
        return seq[:max_len]

# Apply both steps to train and val sets

train_df['input_ids'] = train_tokens.apply(lambda tokens: pad_sequence(tokens_to_indices(tokens, vocab), max_len=PAD_LEN))
val_df['input_ids'] = val_tokens.apply(lambda tokens: pad_sequence(tokens_to_indices(tokens, vocab), max_len=PAD_LEN))

# Example check
print(train_df[['text', 'input_ids']].head(3))




                                                    text  \
21866  The guy that waived a future hall of famer. Ha...   
5542                                          Stop what?   
38431                                          Oh wow!!!   

                                               input_ids  
21866  [2, 3, 4, 1, 6, 7, 8, 9, 1, 11, 12, 13, 14, 15...  
5542   [17, 18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...  
38431  [19, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...  
