# Word Embeddings: Encoding Lexical Semantics 🇦🇺

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/pytorch-mastery/blob/main/examples/pytorch-nlp/02_word_embeddings_nllp.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/pytorch-mastery/blob/main/examples/pytorch-nlp/02_word_embeddings_nllp.ipynb)

A comprehensive guide to word embeddings using PyTorch, featuring Australian tourism examples and English-Vietnamese multilingual support. Learn how to encode lexical semantics and capture semantic relationships in Australian tourism vocabulary.

## Learning Objectives

By the end of this notebook, you will:

- 🔤 **Master word embedding techniques** including Word2Vec, GloVe, and FastText
- 🇦🇺 **Train custom embeddings** on an Australian tourism corpus
- 🌏 **Handle multilingual embeddings** for English-Vietnamese text
- 📊 **Visualize semantic relationships** between Australian cities and landmarks
- 🔄 **Compare PyTorch vs TensorFlow** embedding implementations
- 🎯 **Apply embeddings** to real Australian NLP tasks

## What You'll Build

1. **Australian Tourism Word2Vec Model** - Capture semantic relationships in tourism vocabulary
2. **Multilingual Embedding Space** - Align English and Vietnamese tourism terms
3. **Semantic Similarity Engine** - Find similar Australian cities and attractions
4. **Interactive Visualization** - Explore embedding space with t-SNE and PCA
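
To make "semantic similarity" concrete before any training happens, here is a minimal, framework-free sketch of ranking words by cosine similarity. The three-dimensional vectors are hypothetical and hand-picked for illustration; the names `cosine`, `toy_vectors`, and `rank_neighbours` are assumptions and not part of the notebook's code.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hand-picked vectors; trained embeddings would have 100+ dimensions.
toy_vectors = {
    "sydney": [0.9, 0.8, 0.1],
    "melbourne": [0.8, 0.9, 0.2],
    "reef": [0.1, 0.2, 0.9],
}

def rank_neighbours(word, vectors, top_k=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

ranking = rank_neighbours("sydney", toy_vectors)
print(ranking)  # "melbourne" ranks above "reef" for these toy vectors
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is the usual choice for comparing embeddings.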

---

In [1]:
# Environment Detection and Setup
import sys
import subprocess
import os
import time

# Detect the runtime environment
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules or "kaggle" in os.environ.get('KAGGLE_URL_BASE', '')
IS_LOCAL = not (IS_COLAB or IS_KAGGLE)

print(f"🔍 Environment Detection:")
print(f"   Local Development: {IS_LOCAL}")
print(f"   Google Colab: {IS_COLAB}")
print(f"   Kaggle Notebooks: {IS_KAGGLE}")

# Platform-specific system setup
if IS_COLAB:
    print("\n⚙️  Setting up Google Colab environment...")
    !apt update -qq
    !apt install -y -qq software-properties-common
elif IS_KAGGLE:
    print("\n⚙️  Setting up Kaggle environment...")
    # Kaggle usually has most packages pre-installed
else:
    print("\n⚙️  Setting up local environment...")

🔍 Environment Detection:
   Local Development: False
   Google Colab: True
   Kaggle Notebooks: False

⚙️  Setting up Google Colab environment...
41 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
software-properties-common is already the newest version (0.99.22.9).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.


In [2]:
# Install required packages for word embeddings
required_packages = [
    "torch",
    "transformers",
    "datasets",
    "tokenizers",
    "pandas",
    "seaborn",
    "matplotlib",
    "scikit-learn",
    "tensorboard",
    "gensim",  # For Word2Vec and GloVe implementations
    "plotly",  # For interactive visualizations
]

print("📦 Installing packages for word embeddings...")
for package in required_packages:
    if IS_COLAB or IS_KAGGLE:
        !pip install -q {package}
    else:
        try:
            subprocess.run([sys.executable, "-m", "pip", "install", "-q", package],
                          capture_output=True, check=True)
        except subprocess.CalledProcessError:
            # pip exited with an error, so report a failure rather than assuming the package exists
            print(f"   ⚠️  {package} installation failed (install it manually if it is missing)")
            continue
    print(f"   ✅ {package}")

print("\n🎉 Package installation completed!")

📦 Installing packages for word embeddings...
   ✅ torch
   ✅ transformers
   ✅ datasets
   ✅ tokenizers
   ✅ pandas
   ✅ seaborn
   ✅ matplotlib
   ✅ scikit-learn
   ✅ tensorboard
   ✅ gensim
   ✅ plotly

🎉 Package installation completed!


In [4]:
# Import essential libraries for word embeddings
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter

# Data handling and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Machine learning and embeddings
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Text processing
import re
import string
from collections import Counter, defaultdict
import random
from itertools import combinations

# Gensim for pre-trained embeddings and Word2Vec
try:
    from gensim.models import Word2Vec, FastText
    from gensim.models.keyedvectors import KeyedVectors
    print("✅ Gensim imported successfully")
except ImportError:
    print("⚠️  Gensim not available - will use PyTorch implementations only")

# Set style for better notebook aesthetics
sns.set_style("whitegrid")
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (14, 8)

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

print(f"🔤 Word Embeddings Environment Ready!")
print(f"   PyTorch version: {torch.__version__}")
print(f"   Libraries loaded successfully")

✅ Gensim imported successfully
🔤 Word Embeddings Environment Ready!
   PyTorch version: 2.8.0+cu126
   Libraries loaded successfully


In [5]:
import platform

def detect_device():
    """Detect optimal device for word embeddings training."""
    if torch.cuda.is_available():
        device = torch.device("cuda")
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3

        print(f"🚀 CUDA GPU detected: {gpu_name}")
        print(f"   GPU Memory: {gpu_memory:.1f} GB")
        print(f"   Optimal for large embedding training")

        return device

    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        device = torch.device("mps")
        system_info = platform.uname()

        print(f"🍎 Apple Silicon MPS detected: {system_info.machine}")
        print(f"   Optimized for M1/M2/M3 chips")
        print(f"   Good performance for embedding training")

        return device

    else:
        device = torch.device("cpu")
        cpu_count = torch.get_num_threads()

        print(f"💻 CPU mode: {platform.processor()}")
        print(f"   Threads: {cpu_count}")
        print(f"   💡 Tip: Use smaller embedding dimensions for faster training")

        return device

# Detect and set device
DEVICE = detect_device()
print(f"\n✅ Device selected: {DEVICE}")

💻 CPU mode: x86_64
   Threads: 1
   💡 Tip: Use smaller embedding dimensions for faster training

✅ Device selected: cpu


In [6]:
# Create comprehensive Australian tourism corpus for embedding training
def create_australian_tourism_corpus():
    """
    Create a rich corpus of Australian tourism content for training embeddings.

    Returns:
        dict: Contains English and Vietnamese text with metadata
    """

    # English corpus - Australian tourism content
    english_corpus = [
        # Sydney content
        "Sydney Opera House is an iconic architectural masterpiece located on Bennelong Point in Sydney Harbour.",
        "The Sydney Harbour Bridge offers spectacular views of the harbour and city skyline.",
        "Bondi Beach is famous for surfing and hosts many international surfing competitions.",
        "The Royal Botanic Gardens Sydney showcase native Australian flora and fauna.",
        "Darling Harbour features world-class museums, restaurants, and entertainment venues.",
        "The Rocks historic area preserves Sydney's convict heritage and colonial architecture.",

        # Melbourne content
        "Melbourne is renowned for its vibrant coffee culture and laneway street art.",
        "The Royal Exhibition Building in Carlton Gardens is a UNESCO World Heritage site.",
        "Federation Square hosts cultural events and houses major galleries and museums.",
        "Melbourne's tram network is the largest in the world and iconic to the city.",
        "The Yarra River flows through Melbourne's central business district and parks.",
        "Queen Victoria Market offers fresh produce, gourmet food, and unique souvenirs.",

        # Queensland content
        "The Great Barrier Reef is the world's largest coral reef system and UNESCO World Heritage site.",
        "Brisbane's South Bank features cultural institutions, restaurants, and riverside parks.",
        "Gold Coast is famous for its theme parks, surfing beaches, and nightlife.",
        "Cairns serves as the gateway to the Great Barrier Reef and Daintree Rainforest.",
        "Fraser Island is the world's largest sand island with unique ecosystems.",
        "Whitsunday Islands offer pristine beaches and excellent sailing conditions.",

        # Western Australia content
        "Perth is one of the most isolated major cities in the world.",
        "Fremantle port city features well-preserved colonial architecture and maritime heritage.",
        "Rottnest Island is home to quokkas and beautiful secluded beaches.",
        "The Pinnacles Desert showcases thousands of limestone pillars in unique formations.",
        "Margaret River region produces world-class wines and gourmet food.",
        "Broome features Cable Beach with stunning sunsets and pearl diving history.",

        # South Australia content
        "Adelaide is known as the Festival City with numerous cultural celebrations.",
        "Barossa Valley produces premium wines and hosts international wine festivals.",
        "Kangaroo Island wildlife sanctuary protects native Australian animals in natural habitat.",
        "Adelaide Hills wine region offers cool climate varieties and scenic vineyards.",
        "Flinders Ranges feature ancient mountain landscapes and Aboriginal cultural sites.",

        # Northern Territory content
        "Uluru is a sacred Aboriginal site and iconic symbol of Australia.",
        "Kata Tjuta rock formations complement Uluru in the heart of Australia.",
        "Darwin serves as the gateway to Kakadu National Park and Top End wilderness.",
        "Kakadu National Park preserves ancient Aboriginal rock art and diverse ecosystems.",
        "Alice Springs is the heart of the Australian outback and Red Centre.",

        # Tasmania content
        "Hobart's Museum of Old and New Art challenges visitors with provocative contemporary art.",
        "Cradle Mountain-Lake St Clair National Park offers pristine wilderness hiking.",
        "Salamanca Market in Hobart features local artisans and Tasmania's finest produce.",
        "Devil's island Tasmania protects the endangered Tasmanian devil in natural habitat.",

        # ACT content
        "Canberra houses Australia's national institutions including Parliament House and galleries.",
        "Australian War Memorial commemorates the service of Australian armed forces.",
        "National Gallery of Australia showcases the finest Australian and international art.",
        "Lake Burley Griffin provides recreational activities in the heart of Canberra."
    ]

    # Vietnamese corpus - translations and local content
    vietnamese_corpus = [
        # Sydney translations
        "Nhà hát Opera Sydney là kiệt tác kiến trúc biểu tượng tọa lạc tại Bennelong Point ở Cảng Sydney.",
        "Cầu Cảng Sydney mang đến tầm nhìn ngoạn mục ra cảng và đường chân trời thành phố.",
        "Bãi biển Bondi nổi tiếng với lướt sóng và tổ chức nhiều cuộc thi lướt sóng quốc tế.",
        "Vườn Bách thảo Hoàng gia Sydney trưng bày hệ động thực vật bản địa Australia.",

        # Melbourne translations
        "Melbourne nổi tiếng với văn hóa cà phê sôi động và nghệ thuật đường phố trong các con hẻm.",
        "Tòa nhà Triển lãm Hoàng gia ở Carlton Gardens là di sản thế giới UNESCO.",
        "Quảng trường Federation tổ chức các sự kiện văn hóa và có các phòng trưng bày lớn.",
        "Mạng lưới tàu điện Melbourne là lớn nhất thế giới và mang tính biểu tượng của thành phố.",

        # Queensland translations
        "Rạn san hô Great Barrier là hệ thống rạn san hô lớn nhất thế giới và di sản UNESCO.",
        "South Bank Brisbane có các tổ chức văn hóa, nhà hàng và công viên ven sông.",
        "Gold Coast nổi tiếng với các công viên giải trí, bãi biển lướt sóng và cuộc sống về đêm.",
        "Cairns là cửa ngõ đến Rạn san hô Great Barrier và Rừng mưa Daintree.",

        # Other regions
        "Perth là một trong những thành phố lớn biệt lập nhất trên thế giới.",
        "Adelaide được biết đến là Thành phố Lễ hội với nhiều celebration văn hóa.",
        "Uluru là địa điểm thiêng liêng của thổ dân và biểu tượng của Australia.",
        "Hobart có Bảo tàng Nghệ thuật Cũ và Mới thách thức du khách với nghệ thuật đương đại.",
        "Canberra chứa các tổ chức quốc gia của Australia bao gồm Tòa nhà Quốc hội."
    ]

    return {
        'english': english_corpus,
        'vietnamese': vietnamese_corpus,
        'combined': english_corpus + vietnamese_corpus
    }

# Create the corpus
tourism_corpus = create_australian_tourism_corpus()

print("🇦🇺 Australian Tourism Corpus Created")
print("=" * 45)
print(f"   English texts: {len(tourism_corpus['english'])}")
print(f"   Vietnamese texts: {len(tourism_corpus['vietnamese'])}")
print(f"   Total corpus size: {len(tourism_corpus['combined'])}")

# Display sample texts
print(f"\n📝 Sample English text:")
print(f"   {tourism_corpus['english'][0]}")
print(f"\n📝 Sample Vietnamese text:")
print(f"   {tourism_corpus['vietnamese'][0]}")

# Analyze vocabulary
all_words = []
for text in tourism_corpus['combined']:
    words = re.findall(r'\b\w+\b', text.lower())
    all_words.extend(words)

vocab_counter = Counter(all_words)
unique_words = len(vocab_counter)
total_words = len(all_words)

print(f"\n📊 Corpus Statistics:")
print(f"   Total words: {total_words:,}")
print(f"   Unique words: {unique_words:,}")
print(f"   Vocabulary richness: {unique_words/total_words:.3f}")

# Show most common Australian terms
australian_terms = [word for word, count in vocab_counter.most_common(20)
                   if word in ['sydney', 'melbourne', 'brisbane', 'perth', 'adelaide',
                              'darwin', 'hobart', 'canberra', 'australia', 'australian',
                              'beach', 'harbour', 'reef', 'park', 'island']]
print(f"\n🏙️  Top Australian terms: {', '.join(australian_terms[:10])}")

🇦🇺 Australian Tourism Corpus Created
   English texts: 42
   Vietnamese texts: 17
   Total corpus size: 59

📝 Sample English text:
   Sydney Opera House is an iconic architectural masterpiece located on Bennelong Point in Sydney Harbour.

📝 Sample Vietnamese text:
   Nhà hát Opera Sydney là kiệt tác kiến trúc biểu tượng tọa lạc tại Bennelong Point ở Cảng Sydney.

📊 Corpus Statistics:
   Total words: 770
   Unique words: 388
   Vocabulary richness: 0.504

🏙️  Top Australian terms: sydney, australia, australian, melbourne, island


In [7]:
class AustralianEmbeddingPreprocessor:
    """
    Specialized text preprocessor for Australian tourism embeddings.

    Handles both English and Vietnamese text while preserving important
    Australian geographic and cultural terms.
    """

    def __init__(self):
        # Protected Australian terms that should not be heavily modified
        self.protected_terms = {
            'cities': ['sydney', 'melbourne', 'brisbane', 'perth', 'adelaide',
                      'darwin', 'hobart', 'canberra'],
            'landmarks': ['uluru', 'kata', 'tjuta', 'kakadu', 'pinnacles',
                         'cradle', 'mountain', 'fraser', 'rottnest'],
            'features': ['harbour', 'reef', 'outback', 'rainforest', 'desert',
                        'beach', 'island', 'river', 'park', 'gardens'],
            'cultural': ['aboriginal', 'indigenous', 'heritage', 'colonial',
                        'convict', 'federation', 'anzac']
        }

        # Vietnamese-specific terms to preserve
        self.vietnamese_terms = ['nhà', 'hát', 'opera', 'cầu', 'cảng', 'bãi', 'biển',
                               'vườn', 'bách', 'thảo', 'rạn', 'san', 'hô', 'thành', 'phố']

    def tokenize_sentence(self, text):
        """
        Tokenize a sentence into lowercase word tokens, preserving Australian terms.

        Args:
            text (str): Input text

        Returns:
            list: List of tokenized words
        """
        # Convert to lowercase
        text = text.lower()

        # Remove punctuation but preserve apostrophes in contractions
        text = re.sub(r"[^\w\s']", ' ', text)

        # Expand contractions. Note that a possessive 's also becomes " is",
        # which the short-word filter below then removes.
        contractions = {
            "n't": " not",
            "'re": " are",
            "'s": " is",
            "'ve": " have",
            "'ll": " will",
            "'d": " would"
        }

        for contraction, expansion in contractions.items():
            text = text.replace(contraction, expansion)

        # Split into words
        words = text.split()

        # Filter out very short words (except important ones)
        important_short_words = {'wa', 'sa', 'nt', 'tas', 'act', 'nsw', 'vic', 'qld'}
        words = [word for word in words
                if len(word) > 2 or word in important_short_words]

        return words

    def prepare_training_data(self, corpus):
        """
        Prepare corpus for embedding training.

        Args:
            corpus (list): List of text documents

        Returns:
            list: List of tokenized sentences
        """
        tokenized_corpus = []

        for text in corpus:
            # Split into sentences
            sentences = re.split(r'[.!?]+', text)

            for sentence in sentences:
                sentence = sentence.strip()
                if len(sentence) > 10:  # Skip very short sentences
                    tokens = self.tokenize_sentence(sentence)
                    if len(tokens) >= 3:  # Minimum sentence length
                        tokenized_corpus.append(tokens)

        return tokenized_corpus

    def analyze_vocabulary(self, tokenized_corpus):
        """
        Analyze vocabulary statistics from tokenized corpus.

        Args:
            tokenized_corpus (list): List of tokenized sentences

        Returns:
            dict: Vocabulary statistics
        """
        all_words = []
        for sentence in tokenized_corpus:
            all_words.extend(sentence)

        word_freq = Counter(all_words)

        # Find Australian-specific terms
        australian_words = []
        for word, freq in word_freq.items():
            if any(word in terms for terms in self.protected_terms.values()):
                australian_words.append((word, freq))

        # Find Vietnamese terms
        vietnamese_words = [(word, freq) for word, freq in word_freq.items()
                           if word in self.vietnamese_terms]

        return {
            'total_words': len(all_words),
            'unique_words': len(word_freq),
            'word_frequencies': word_freq,
            'australian_terms': australian_words,
            'vietnamese_terms': vietnamese_words,
            'avg_sentence_length': np.mean([len(s) for s in tokenized_corpus])
        }

# Initialize preprocessor and prepare data
preprocessor = AustralianEmbeddingPreprocessor()

# Prepare training data
tokenized_corpus = preprocessor.prepare_training_data(tourism_corpus['combined'])
vocab_stats = preprocessor.analyze_vocabulary(tokenized_corpus)

print("🔤 Text Preprocessing for Australian Tourism Embeddings")
print("=" * 55)
print(f"   Total sentences: {len(tokenized_corpus)}")
print(f"   Total words: {vocab_stats['total_words']:,}")
print(f"   Unique vocabulary: {vocab_stats['unique_words']:,}")
print(f"   Average sentence length: {vocab_stats['avg_sentence_length']:.1f} words")

print(f"\n🇦🇺 Australian terms found: {len(vocab_stats['australian_terms'])}")
australian_terms_str = ', '.join([term for term, freq in vocab_stats['australian_terms'][:10]])
print(f"   Top terms: {australian_terms_str}")

print(f"\n🇻🇳 Vietnamese terms found: {len(vocab_stats['vietnamese_terms'])}")
if vocab_stats['vietnamese_terms']:
    vietnamese_terms_str = ', '.join([term for term, freq in vocab_stats['vietnamese_terms'][:5]])
    print(f"   Sample terms: {vietnamese_terms_str}")

# Show sample tokenized sentences
print(f"\n📝 Sample tokenized sentences:")
for i, sentence in enumerate(tokenized_corpus[:3]):
    print(f"   {i+1}. {sentence[:8]}... ({len(sentence)} words)")

print(f"\n✅ Preprocessing completed - ready for embedding training!")

🔤 Text Preprocessing for Australian Tourism Embeddings
   Total sentences: 59
   Total words: 672
   Unique vocabulary: 362
   Average sentence length: 11.4 words

🇦🇺 Australian terms found: 32
   Top terms: sydney, harbour, beach, gardens, convict, heritage, colonial, melbourne, federation, river

🇻🇳 Vietnamese terms found: 14
   Sample terms: opera, nhà, hát, cảng, cầu

📝 Sample tokenized sentences:
   1. ['sydney', 'opera', 'house', 'iconic', 'architectural', 'masterpiece', 'located', 'bennelong']... (11 words)
   2. ['the', 'sydney', 'harbour', 'bridge', 'offers', 'spectacular', 'views', 'the']... (12 words)
   3. ['bondi', 'beach', 'famous', 'for', 'surfing', 'and', 'hosts', 'many']... (11 words)

✅ Preprocessing completed - ready for embedding training!
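
The tokenization rules in `tokenize_sentence` are easier to follow on a single sentence. The sketch below restates the same steps as a standalone function; `tokenize_words`, `CONTRACTIONS`, and `KEEP_SHORT` are illustrative names introduced here, not part of the preprocessor class.

```python
import re

# Same rules as tokenize_sentence above, restated as a standalone function.
CONTRACTIONS = {"n't": " not", "'re": " are", "'s": " is",
                "'ve": " have", "'ll": " will", "'d": " would"}
KEEP_SHORT = {'wa', 'sa', 'nt', 'tas', 'act', 'nsw', 'vic', 'qld'}

def tokenize_words(text):
    """Lowercase, strip punctuation, expand contractions, drop short words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", ' ', text)          # keep apostrophes for now
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # Words of one or two characters are dropped unless whitelisted.
    return [w for w in text.split() if len(w) > 2 or w in KEEP_SHORT]

tokens = tokenize_words("Melbourne's tram network is the largest in the world.")
print(tokens)
# ['melbourne', 'tram', 'network', 'the', 'largest', 'the', 'world']
```

Two effects are worth noticing: the possessive in "Melbourne's" is expanded to "melbourne is", and the length filter then removes two-letter function words such as "is" and "in" while keeping whitelisted state abbreviations.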


In [8]:
class AustralianWord2Vec(nn.Module):
    """
    PyTorch implementation of Word2Vec Skip-gram model for Australian tourism corpus.

    TensorFlow equivalent of each nn.Embedding table used below:
        embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_dim)

    This implementation uses:
    - Skip-gram architecture for better rare word representation
    - Negative sampling for efficient training
    - Australian tourism vocabulary optimization

    Args:
        vocab_size (int): Size of vocabulary
        embed_dim (int): Embedding dimension (typically 100-300)
        context_window (int): Context window size (typically 5-10)
    """

    def __init__(self, vocab_size, embed_dim=200, context_window=5):
        super(AustralianWord2Vec, self).__init__()

        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.context_window = context_window

        # Input embeddings (center words)
        self.in_embeddings = nn.Embedding(vocab_size, embed_dim)

        # Output embeddings (context words)
        self.out_embeddings = nn.Embedding(vocab_size, embed_dim)

        # Initialize embeddings with small random values
        self._init_embeddings()

        # Store Australian cities for analysis
        self.australian_cities = ['sydney', 'melbourne', 'brisbane', 'perth',
                                'adelaide', 'darwin', 'hobart', 'canberra']

    def _init_embeddings(self):
        """Initialize embedding weights."""
        # Initialize with small random values
        nn.init.uniform_(self.in_embeddings.weight, -0.5/self.embed_dim, 0.5/self.embed_dim)
        nn.init.uniform_(self.out_embeddings.weight, -0.5/self.embed_dim, 0.5/self.embed_dim)

    def forward(self, center_words, context_words, negative_words=None):
        """
        Forward pass for Word2Vec training.

        Args:
            center_words (torch.Tensor): Center word indices [batch_size]
            context_words (torch.Tensor): Context word indices [batch_size]
            negative_words (torch.Tensor): Negative sample indices [batch_size, num_negative]

        Returns:
            torch.Tensor: Loss value
        """
        batch_size = center_words.size(0)

        # Get center word embeddings
        center_embeds = self.in_embeddings(center_words)  # [batch_size, embed_dim]

        # Get context word embeddings
        context_embeds = self.out_embeddings(context_words)  # [batch_size, embed_dim]

        # Positive samples score
        pos_score = torch.sum(center_embeds * context_embeds, dim=1)  # [batch_size]
        pos_loss = -F.logsigmoid(pos_score).mean()

        # Negative sampling loss
        neg_loss = 0
        if negative_words is not None:
            num_negative = negative_words.size(1)

            # Get negative word embeddings
            neg_embeds = self.out_embeddings(negative_words)  # [batch_size, num_negative, embed_dim]

            # Compute negative scores
            center_embeds_expanded = center_embeds.unsqueeze(1).expand(-1, num_negative, -1)
            neg_scores = torch.sum(center_embeds_expanded * neg_embeds, dim=2)  # [batch_size, num_negative]

            neg_loss = -F.logsigmoid(-neg_scores).mean()

        return pos_loss + neg_loss

    def get_word_embeddings(self):
        """Get trained word embeddings."""
        return self.in_embeddings.weight.data

    def similarity(self, word1_idx, word2_idx):
        """Compute cosine similarity between two words."""
        embeddings = self.get_word_embeddings()

        emb1 = embeddings[word1_idx]
        emb2 = embeddings[word2_idx]

        # Cosine similarity
        cos_sim = F.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0))
        return cos_sim.item()

    def most_similar(self, word_idx, word_to_idx, idx_to_word, top_k=10):
        """Find most similar words to a given word."""
        embeddings = self.get_word_embeddings()
        word_embed = embeddings[word_idx].unsqueeze(0)

        # Compute similarities with all words
        similarities = F.cosine_similarity(word_embed, embeddings)

        # Get top-k most similar (excluding the word itself)
        similarities[word_idx] = -1  # Exclude the word itself
        top_indices = similarities.topk(top_k).indices

        similar_words = []
        for idx in top_indices:
            word = idx_to_word.get(idx.item(), '<UNK>')
            similarity_score = similarities[idx].item()
            similar_words.append((word, similarity_score))

        return similar_words

print("🔤 Australian Word2Vec model class defined!")
print("   Architecture: Skip-gram with negative sampling")
print("   Optimized for: Australian tourism vocabulary")
print("   Features: Similarity computation, most similar words")

🔤 Australian Word2Vec model class defined!
   Architecture: Skip-gram with negative sampling
   Optimized for: Australian tourism vocabulary
   Features: Similarity computation, most similar words
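
The loss returned by `forward` can be checked by hand on a single training pair. For a center vector v, a true context vector u, and K negative vectors n_k, the class above computes -log σ(v·u) plus the mean over k of -log σ(-v·n_k). The usual skip-gram negative-sampling objective sums over the negatives rather than averaging them. The sketch below is framework-free and uses hypothetical two-dimensional vectors; `log_sigmoid`, `dot`, and `sgns_pair_loss` are illustrative helpers, not notebook code.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_pair_loss(center, context, negatives):
    """Loss for one (center, context) pair with K negative samples."""
    pos_loss = -log_sigmoid(dot(center, context))
    neg_losses = [-log_sigmoid(-dot(center, neg)) for neg in negatives]
    # Averaging over negatives mirrors the .mean() call in forward().
    return pos_loss + sum(neg_losses) / len(neg_losses)

# Hypothetical hand-picked vectors for illustration only.
center = [1.0, 0.5]
aligned_context = [1.0, 0.5]      # points the same way as the center word
opposed_context = [-1.0, -0.5]    # points the opposite way
negatives = [[-0.5, 0.2], [0.1, -0.9]]

loss_good = sgns_pair_loss(center, aligned_context, negatives)
loss_bad = sgns_pair_loss(center, opposed_context, negatives)
print(f"aligned context: {loss_good:.4f}, opposed context: {loss_bad:.4f}")
```

The loss is lower when the center and context vectors point in the same direction, which is the pressure that pulls co-occurring words together during training while the negative term pushes randomly sampled words away.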


In [9]:
class Word2VecDataset(Dataset):
    """
    PyTorch Dataset for Word2Vec training with Australian tourism corpus.

    Generates (center_word, context_word) pairs for skip-gram training.
    """

    def __init__(self, tokenized_corpus, word_to_idx, context_window=5, num_negative=5):
        self.tokenized_corpus = tokenized_corpus
        self.word_to_idx = word_to_idx
        self.idx_to_word = {idx: word for word, idx in word_to_idx.items()}
        self.context_window = context_window
        self.num_negative = num_negative
        self.vocab_size = len(word_to_idx)

        # Generate training pairs
        self.training_pairs = self._generate_training_pairs()

        # Create word frequency table for negative sampling
        self.word_freqs = self._build_frequency_table()

    def _generate_training_pairs(self):
        """Generate (center, context) word pairs."""
        pairs = []

        for sentence in self.tokenized_corpus:
            # Convert words to indices
            word_indices = [self.word_to_idx.get(word, self.word_to_idx.get('<UNK>', 0))
                           for word in sentence]

            # Generate context pairs
            for center_idx, center_word_idx in enumerate(word_indices):
                # Define context window
                start = max(0, center_idx - self.context_window)
                end = min(len(word_indices), center_idx + self.context_window + 1)

                # Generate pairs
                for context_idx in range(start, end):
                    if context_idx != center_idx:
                        pairs.append((center_word_idx, word_indices[context_idx]))

        return pairs

    def _build_frequency_table(self):
        """Build word frequency table for negative sampling."""
        word_counts = Counter()

        for sentence in self.tokenized_corpus:
            for word in sentence:
                word_counts[word] += 1

        # Convert to frequency distribution
        total_words = sum(word_counts.values())
        word_freqs = np.zeros(self.vocab_size)

        for word, count in word_counts.items():
            if word in self.word_to_idx:
                idx = self.word_to_idx[word]
                # Smooth the unigram distribution (power = 0.75) so that rare
                # words are drawn as negative samples more often
                word_freqs[idx] = (count / total_words) ** 0.75

        # Normalize
        word_freqs = word_freqs / word_freqs.sum()

        return word_freqs

    def _negative_sampling(self, batch_size):
        """Generate negative samples."""
        negative_samples = np.random.choice(
            self.vocab_size,
            size=(batch_size, self.num_negative),
            p=self.word_freqs
        )
        return torch.LongTensor(negative_samples)

    def __len__(self):
        return len(self.training_pairs)

    def __getitem__(self, idx):
        center_word, context_word = self.training_pairs[idx]
        return torch.LongTensor([center_word]), torch.LongTensor([context_word])

# Build vocabulary from tokenized corpus
def build_vocabulary(tokenized_corpus, min_count=2):
    """Build vocabulary with minimum word frequency threshold."""
    word_counts = Counter()

    # Count word frequencies
    for sentence in tokenized_corpus:
        for word in sentence:
            word_counts[word] += 1

    # Filter by minimum count
    filtered_words = {word: count for word, count in word_counts.items()
                     if count >= min_count}

    # Create word-to-index mapping
    word_to_idx = {'<UNK>': 0}  # Unknown words
    idx_to_word = {0: '<UNK>'}

    for idx, word in enumerate(sorted(filtered_words.keys()), 1):
        word_to_idx[word] = idx
        idx_to_word[idx] = word

    return word_to_idx, idx_to_word, filtered_words

# Build vocabulary for Australian tourism corpus
word_to_idx, idx_to_word, word_counts = build_vocabulary(tokenized_corpus, min_count=2)
vocab_size = len(word_to_idx)

print("📚 Australian Tourism Vocabulary Built")
print("=" * 40)
print(f"   Vocabulary size: {vocab_size:,}")
print(f"   Corpus size: {len(tokenized_corpus)} sentences")

# Show sample vocabulary
print(f"\n🇦🇺 Sample Australian words in vocabulary:")
australian_sample = [word for word in word_to_idx.keys()
                    if word in ['sydney', 'melbourne', 'brisbane', 'perth', 'adelaide',
                               'darwin', 'hobart', 'canberra', 'australia', 'australian']]
print(f"   Cities: {', '.join(australian_sample[:8])}")

# Show most frequent words
most_frequent = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:10]
print(f"\n📊 Most frequent words:")
for word, count in most_frequent:
    print(f"   {word}: {count}")

# Create dataset and dataloader
dataset = Word2VecDataset(tokenized_corpus, word_to_idx, context_window=5, num_negative=5)
dataloader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=0)

print(f"\n⚡ Dataset created:")
print(f"   Training pairs: {len(dataset):,}")
print(f"   Batch size: 256")
print(f"   Context window: 5")
print(f"   Negative samples: 5")

print(f"\n✅ Ready for Word2Vec training!")

📚 Australian Tourism Vocabulary Built
   Vocabulary size: 141
   Corpus size: 59 sentences

üá¶üá∫ Sample Australian words in vocabulary:
   Cities: adelaide, australia, australian, brisbane, canberra, hobart, melbourne, perth

📊 Most frequent words:
   and: 31
   the: 26
   sydney: 9
   world: 8
   australia: 7
   australian: 6
   các: 6
   features: 5
   melbourne: 5
   art: 5

⚡ Dataset created:
   Training pairs: 4,950
   Batch size: 256
   Context window: 5
   Negative samples: 5

✅ Ready for Word2Vec training!
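
The output above reports 4,950 training pairs built from 59 sentences. That growth is what one would expect when each token is paired with every neighbour inside the context window. The sketch below illustrates the expansion under the assumption of a fixed, symmetric window; `skipgram_pairs` is a hypothetical helper, and the notebook's `Word2VecDataset` may build its pairs differently (for example with a dynamic window or by filtering `<UNK>` tokens).

```python
from typing import List, Tuple


def skipgram_pairs(sentence: List[int], window: int) -> List[Tuple[int, int]]:
    """Return (center, context) index pairs for one sentence.

    Hypothetical helper: assumes a fixed symmetric window around each token.
    """
    pairs = []
    for i, center in enumerate(sentence):
        # Clamp the window to the sentence boundaries
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is never its own context
                pairs.append((center, sentence[j]))
    return pairs


# Toy sentence of word indices (illustrative values only)
toy_sentence = [3, 7, 1, 4]
pairs = skipgram_pairs(toy_sentence, window=1)
print(pairs)  # [(3, 7), (7, 3), (7, 1), (1, 7), (1, 4), (4, 1)]
```

Interior tokens contribute up to `2 * window` pairs each, so with `context_window=5` even a small corpus yields thousands of pairs.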

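The dataset is created with `num_negative=5`, and the sampling routine itself is defined earlier in the notebook. For reference, the classic Word2Vec recipe draws negatives from the unigram distribution raised to the power 0.75, which reduces the dominance of very frequent words such as "and" and "the". The sketch below shows that recipe using four counts from the frequency listing above. Whether `_negative_sampling` follows the same scheme is an assumption, and `build_negative_table` and `sample_negatives` are hypothetical names.

```python
import random
from collections import Counter


def build_negative_table(word_counts, word_to_idx, power=0.75):
    """Return parallel lists: word indices and smoothed sampling probabilities."""
    indices, weights = [], []
    for word, count in word_counts.items():
        if word in word_to_idx:
            indices.append(word_to_idx[word])
            weights.append(count ** power)  # smoothing flattens the distribution
    total = sum(weights)
    return indices, [w / total for w in weights]


def sample_negatives(indices, probs, num_samples, rng):
    """Draw negative word indices (with replacement) from the smoothed distribution."""
    return rng.choices(indices, weights=probs, k=num_samples)


# Counts copied from the frequency listing; the index mapping is illustrative
toy_counts = Counter({'and': 31, 'the': 26, 'sydney': 9, 'art': 5})
toy_vocab = {'<UNK>': 0, 'and': 1, 'art': 2, 'sydney': 3, 'the': 4}

indices, probs = build_negative_table(toy_counts, toy_vocab)
rng = random.Random(0)  # seeded so the draw is reproducible
negatives = sample_negatives(indices, probs, num_samples=5, rng=rng)

for idx, p in zip(indices, probs):
    print(f"index {idx}: p = {p:.3f}")
print("negatives:", negatives)
```

With raw counts, "and" would be drawn with probability 31/71; after smoothing its share falls below that, and rarer words such as "art" are sampled somewhat more often.
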

In [10]:
# Initialize Word2Vec model
embed_dim = 200 if DEVICE.type != 'cpu' else 100  # Adjust based on device
model = AustralianWord2Vec(
    vocab_size=vocab_size,
    embed_dim=embed_dim,
    context_window=5
).to(DEVICE)

# Training configuration
learning_rate = 0.001 if DEVICE.type != 'cpu' else 0.002
num_epochs = 10 if DEVICE.type != 'cpu' else 5
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# TensorBoard setup
from datetime import datetime
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
log_dir = f"runs/australian_word2vec_{timestamp}"
writer = SummaryWriter(log_dir)

print(f"🏋️ Training Australian Tourism Word2Vec")
print("=" * 45)
print(f"   Model: Skip-gram with negative sampling")
print(f"   Embedding dimension: {embed_dim}")
print(f"   Vocabulary size: {vocab_size:,}")
print(f"   Learning rate: {learning_rate}")
print(f"   Epochs: {num_epochs}")
print(f"   Device: {DEVICE}")
print(f"   TensorBoard logs: {log_dir}")

def train_word2vec(model, dataloader, optimizer, writer, num_epochs, device):
    """Train Word2Vec model with Australian tourism corpus."""

    model.train()
    step = 0

    print(f"\n🚀 Starting Word2Vec training...")

    for epoch in range(num_epochs):
        epoch_loss = 0
        epoch_steps = 0

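        for batch_idx, (center_words, context_words) in enumerate(dataloader):
            # Move to device; squeeze(1) drops only the singleton column,
            # so a final batch of size 1 keeps its batch dimension
            center_words = center_words.squeeze(1).to(device)
            context_words = context_words.squeeze(1).to(device)

            # Generate negative samples from the dataset behind the dataloader
            # (avoids relying on the global `dataset` variable)
            batch_size = center_words.size(0)
            negative_words = dataloader.dataset._negative_sampling(batch_size).to(device)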

            # Forward pass
            optimizer.zero_grad()
            loss = model(center_words, context_words, negative_words)

            # Backward pass
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()

            # Accumulate loss
            epoch_loss += loss.item()
            epoch_steps += 1
            step += 1

            # Log to TensorBoard
            if batch_idx % 100 == 0:
                writer.add_scalar('Loss/Batch', loss.item(), step)

                # Log some embedding norms
                if batch_idx % 500 == 0:
                    with torch.no_grad():
                        embed_norms = torch.norm(model.in_embeddings.weight, dim=1).mean()
                        writer.add_scalar('Embeddings/Average_Norm', embed_norms.item(), step)

        # Calculate average epoch loss
        avg_epoch_loss = epoch_loss / epoch_steps

        # Log epoch metrics
        writer.add_scalar('Loss/Epoch', avg_epoch_loss, epoch)

        print(f"   Epoch {epoch+1:2d}/{num_epochs}: Loss = {avg_epoch_loss:.6f}")

        # Log embedding samples for specific Australian words
        if epoch % 2 == 0:  # Every 2 epochs
            with torch.no_grad():
                for city in ['sydney', 'melbourne', 'brisbane']:
                    if city in word_to_idx:
                        city_idx = word_to_idx[city]
                        city_embedding = model.in_embeddings.weight[city_idx]
                        writer.add_histogram(f'Embeddings/{city.title()}', city_embedding, epoch)

    writer.close()
    print(f"\n🎉 Word2Vec training completed!")
    print(f"   Final average loss: {avg_epoch_loss:.6f}")

    return model

# Train the model
trained_model = train_word2vec(model, dataloader, optimizer, writer, num_epochs, DEVICE)

# Save the trained model
torch.save({
    'model_state_dict': trained_model.state_dict(),
    'word_to_idx': word_to_idx,
    'idx_to_word': idx_to_word,
    'embed_dim': embed_dim,
    'vocab_size': vocab_size
}, 'australian_word2vec_model.pth')

print(f"\n💾 Model saved as: australian_word2vec_model.pth")
print(f"📊 TensorBoard logs available at: {log_dir}")
print(f"   Run: tensorboard --logdir {log_dir}")

🏋️ Training Australian Tourism Word2Vec
   Model: Skip-gram with negative sampling
   Embedding dimension: 100
   Vocabulary size: 141
   Learning rate: 0.002
   Epochs: 5
   Device: cpu
   TensorBoard logs: runs/australian_word2vec_20250923_053937

🚀 Starting Word2Vec training...
   Epoch  1/5: Loss = 1.380366
   Epoch  2/5: Loss = 1.327919
   Epoch  3/5: Loss = 1.235292
   Epoch  4/5: Loss = 1.168256
   Epoch  5/5: Loss = 1.131216

🎉 Word2Vec training completed!
   Final average loss: 1.131216

💾 Model saved as: australian_word2vec_model.pth
📊 TensorBoard logs available at: runs/australian_word2vec_20250923_053937
   Run: tensorboard --logdir runs/australian_word2vec_20250923_053937
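
Once training has finished, a quick sanity check is to look up a word's nearest neighbours by cosine similarity. The sketch below uses a small hand-made embedding matrix so that it runs on its own; `nearest_neighbours` is a hypothetical helper and the toy vectors are not trained values.

```python
import torch
import torch.nn.functional as F


def nearest_neighbours(query, embeddings, word_to_idx, idx_to_word, k=3):
    """Return up to k (word, cosine similarity) pairs closest to `query`."""
    if query not in word_to_idx:
        return []
    query_idx = word_to_idx[query]
    # Unit-normalise rows so that a dot product equals cosine similarity
    normed = F.normalize(embeddings, dim=1)
    sims = normed @ normed[query_idx]
    sims[query_idx] = float("-inf")  # exclude the query word itself
    k = min(k, embeddings.size(0) - 1)
    scores, indices = torch.topk(sims, k)
    return [(idx_to_word[i], s) for i, s in zip(indices.tolist(), scores.tolist())]


# Hand-made 3-d vectors (illustrative only, not trained values)
toy_word_to_idx = {"<UNK>": 0, "sydney": 1, "melbourne": 2, "reef": 3}
toy_idx_to_word = {i: w for w, i in toy_word_to_idx.items()}
toy_embeddings = torch.tensor([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])

neighbours = nearest_neighbours("sydney", toy_embeddings, toy_word_to_idx, toy_idx_to_word, k=2)
print(neighbours)
```

For the model trained above, the same function could be called with `trained_model.in_embeddings.weight.detach()` together with the notebook's `word_to_idx` and `idx_to_word`. With a corpus of 59 sentences and five epochs, the resulting neighbours are likely to be noisy.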


## Conclusion

<!-- Add your concluding remarks here -->

## Next Steps