# Hindi Word Embeddings: Complete Pipeline

This notebook implements a complete pipeline for creating and evaluating Hindi word embeddings using frequency-based co-occurrence matrices and PCA dimensionality reduction.

## Pipeline Overview:
1. **Data Preprocessing**: Text cleaning, tokenization, vocabulary building
2. **Co-occurrence Matrix Construction**: Frequency-based approach with window size experimentation
3. **Dimensionality Reduction**: PCA with dimension experimentation
4. **Quantitative Evaluation**: Covariance, cosine similarity, analogies, clustering
5. **Visualization**: t-SNE, PCA, similarity heatmaps

## 🔧 Install & Import Dependencies

In [1]:
pip install indic-nlp-library

Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Collecting indic-nlp-library
  Using cached indic_nlp_library-0.92-py3-none-any.whl.metadata (5.7 kB)
Collecting sphinx-argparse (from indic-nlp-library)
  Using cached sphinx_argparse-0.5.2-py3-none-any.whl.metadata (3.7 kB)
Collecting sphinx-rtd-theme (from indic-nlp-library)
  Using cached sphinx_rtd_theme-3.0.2-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting morfessor (from indic-nlp-library)
  Using cached Morfessor-2.0.6-py3-none-any.whl.metadata (628 bytes)
Collecting docutils>=0.19 (from sphinx-argparse->indic-nlp-library)
  Using cached docutils-0.21.2-py3-none-any.whl.metadata (2.8 kB)
Collecting sphinxcontrib-jquery<5,>=4 (from sphinx-rtd-theme->indic-nlp-library)
  Using cached sphinxcontrib_jquery-4.1-py2.py3-none-any.whl.metadata (2.6 kB)
Using cached indic_nlp_library-0.92-py3-none-any.whl (40 kB)
Using cached Morfessor-2.0.6-p



In [2]:
import indicnlp
from indicnlp import common
from indicnlp import loader
from indicnlp.tokenize import indic_tokenize
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from collections import Counter
import numpy as np

We explore how to generate dense word representations from text corpora and analyze the quality of those representations. The goal is to understand how to represent word meaning in a high-dimensional space and how to transfer knowledge across languages.

Please use the links below to download text corpora. \
English: https://wortschatz.uni-leipzig.de/en/download/English \
Hindi: https://wortschatz.uni-leipzig.de/en/download/Hindi

In [5]:
# Set up Indic NLP resources
INDIC_RESOURCES_PATH = "D:\\RESEARCH related\\PreCog tasks\\indic_nlp_resources"  # Replace with your path
common.set_resources_path(INDIC_RESOURCES_PATH)
loader.load()

# Load Hindi corpus file (raw string so the backslashes are not treated as escape sequences)
path = r"D:\PROJECTS\PreCog\Data\hin_news_2020_300K\hin_news_2020_300K-sentences.txt"
with open(path, 'r', encoding='utf-8') as f:
    hindi_lines = f.readlines()


## Preprocessing of data
1. Normalize the text.
2. Replace newlines with spaces.
3. Remove non-alphabetic characters (punctuation).
4. Stop-word removal: not applied, because it strongly affects the co-occurrence matrix.
5. Lemmatization and stemming (is this helpful for this task?).
6. Tokenize the words after all of the above.

In [6]:
# Preprocess Hindi text
# Build the normalizer once instead of once per sentence
normalizer = IndicNormalizerFactory().get_normalizer("hi")  # Hindi language

def preprocess_hindi(text):
    text = normalizer.normalize(text)  # Normalize the text
    text = text.replace('\n', ' ')  # Replace newlines with spaces
    tokens = indic_tokenize.trivial_tokenize(text, lang='hi')  # Tokenize the text
    # Remove empty tokens; lower() only changes Latin-script tokens such as "pic" or "pm"
    return [token.lower() for token in tokens if token.strip()]

processed_hindi = [preprocess_hindi(sent) for sent in hindi_lines]

# Example usage
for i in range(2,6):
    print(f"Original: {hindi_lines[i]}")
    print(f"Processed: {processed_hindi[i]}")
    print()

Original: 3	० में कहा कि लॉकडाउन के बाद गरीब कल्याण योजना का ऐलान किया गया था।

Processed: ['3', '०', 'में', 'कहा', 'कि', 'लॉकडाउन', 'के', 'बाद', 'गरीब', 'कल्याण', 'योजना', 'का', 'ऐलान', 'किया', 'गया', 'था', '।']

Original: 4	"100 मरीजों पर नियंत्रित क्लिनिकल ट्रायल किया गया, जिसमें तीन दिन के अंदर 69 प्रतिशत और चार दिन के अंदर शत प्रतिशत मरीज ठीक हो गए और उनकी जांच रिपोर्ट निगेटिव आई।"

Processed: ['4', '"', '100', 'मरीजों', 'पर', 'नियंत्रित', 'क्लिनिकल', 'ट्रायल', 'किया', 'गया', ',', 'जिसमें', 'तीन', 'दिन', 'के', 'अंदर', '69', 'प्रतिशत', 'और', 'चार', 'दिन', 'के', 'अंदर', 'शत', 'प्रतिशत', 'मरीज', 'ठीक', 'हो', 'गए', 'और', 'उनकी', 'जांच', 'रिपोर्ट', 'निगेटिव', 'आई', '।', '"']

Original: 5	'100 में 70 अफ़सर बनने लायक़ नहीं'

Processed: ['5', "'", '100', 'में', '70', 'अफ़सर', 'बनने', 'लायक़', 'नहीं', "'"]

Original: 6	"100 रुपये के 8 करोड़ कूपन छापे जाएँगे.

Processed: ['6', '"', '100', 'रुपये', 'के', '8', 'करोड़', 'कूपन', 'छापे', 'जाएँगे', '.']
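
The processed output above still contains the leading sentence ID from the corpus file (`3`, `4`, ...) and punctuation tokens such as `।` and `"`, so step 3 of the list is not applied by `preprocess_hindi`. A minimal sketch of one way to drop them follows. `strip_line_id` and `drop_punctuation` are hypothetical helper names, and the sketch assumes every line has the form `<id><TAB><sentence>`, as the samples suggest. Digits are kept here, which is a design choice rather than something the notebook prescribes.

```python
import unicodedata

def strip_line_id(line):
    """Drop a leading '<id><TAB>' prefix if present (assumed corpus format)."""
    head, sep, rest = line.partition('\t')
    return rest if sep and head.strip().isdigit() else line

def drop_punctuation(tokens):
    """Drop tokens made up entirely of Unicode punctuation characters."""
    return [
        tok for tok in tokens
        if not all(unicodedata.category(ch).startswith('P') for ch in tok)
    ]

line = "3\t० में कहा कि योजना का ऐलान किया गया था।\n"
tokens = ['०', 'में', 'कहा', 'कि', 'योजना', 'का', 'ऐलान', 'किया', 'गया', 'था', '।']
print(strip_line_id(line).strip())
print(drop_punctuation(tokens))
```

Applying `strip_line_id` before normalization and `drop_punctuation` after tokenization would change the vocabulary counts reported below, so the numbers in this notebook correspond to the unfiltered tokens.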



In [7]:
# Vocabulary size when keeping only words with a minimum frequency of 10
# Flatten all tokens into a single list
flat_words = [word for sentence in processed_hindi for word in sentence]

# Count word frequencies
word_counts = Counter(flat_words)

# Count words with at least 10 occurrences
min_freq = 10
num_words_10plus = sum(1 for count in word_counts.values() if count >= min_freq)

print(f"Number of words with at least {min_freq} occurrences: {num_words_10plus}")

Number of words with at least 10 occurrences: 18158


## Vocabulary Building with Frequency Filtering

In [8]:
def build_vocabulary(processed_hindi, min_freq=5):
    print(f"Building vocabulary with minimum frequency: {min_freq}...")
    
    # Flatten all tokens into a single list
    flat_words = [word for sentence in processed_hindi for word in sentence]
    
    # Count word frequencies
    word_counts = Counter(flat_words)
    
    # Filter words with minimum frequency
    filtered_words = {word: count for word, count in word_counts.items() if count >= min_freq}
    
    # Build the vocabulary from the retained words (first-seen order, not sorted by frequency)
    vocab = list(filtered_words.keys())
    vocab_size = len(vocab)
    
    print(f"📊 Vocabulary Statistics:")
    print(f"   Total unique words: {len(word_counts)}")
    print(f"   Words with freq >= {min_freq}: {vocab_size}")
    print(f"   Vocabulary reduction: {(1 - vocab_size/len(word_counts))*100:.1f}%")
    
    return vocab, word_counts, filtered_words

# Build vocabulary
vocab, word_counts, filtered_words = build_vocabulary(processed_hindi)
print(f"\n🔤 Sample vocabulary (first 20 words):")
print(vocab[:20])

print(f"\n📈 Most frequent words:")
for word, count in Counter(filtered_words).most_common(10):
    print(f"   {word}: {count}")

print(f"\n✅ Vocabulary built with {len(vocab)} words")

Building vocabulary with minimum frequency: 5...
📊 Vocabulary Statistics:
   Total unique words: 392891
   Words with freq >= 5: 27123
   Vocabulary reduction: 93.1%

🔤 Sample vocabulary (first 20 words):
['1', '03', 'मजदूरों', 'को', 'बेहतर', 'इलाज', 'के', 'लिए', 'रायपुर', 'ले', 'जाने', 'की', 'करवाई', 'गई', 'व्यवस्था', 'pic', '.', '2', '•', 'pm']

📈 Most frequent words:
   .: 231998
   के: 219853
   में: 171219
   है: 166705
   की: 139307
   को: 102600
   से: 99867
   ,: 94415
   और: 82439
   ने: 82413

✅ Vocabulary built with 27123 words


Now that we've built the vocabulary from the most frequent words, how can we assign meaningful IDs to these words instead of just arbitrary numbers?

### There are several strategies we can use:

1. Frequency-based indexing (most common and efficient)
2. POS-based ordering (useful for linguistic analysis)

In [9]:
def create_word_mappings(vocab, word_counts):
    """Create word-to-ID and ID-to-word mappings based on corpus frequency"""
    
    # Sort by corpus frequency, most frequent first, so ID 0 is the most common token.
    # Counting `vocab` itself would give every word a count of 1, because it holds unique words.
    sorted_vocab = sorted(vocab, key=lambda word: word_counts[word], reverse=True)
    
    # Create mappings: word2id and id2word
    word2id = {word: i for i, word in enumerate(sorted_vocab)}
    id2word = {i: word for i, word in enumerate(sorted_vocab)}
    
    return word2id, id2word

# Get word-to-ID and ID-to-word mappings
word2id, id2word = create_word_mappings(vocab, word_counts)

## Co-occurrence Matrix Construction with Window Size Experimentation

We construct the co-occurrence matrix by counting how often word pairs appear within a context window, whose size we choose by experiment. For each sentence, we update the matrix for every target-context word pair inside the window, weighting each pair by 1/distance so that closer words count more.

In [10]:
from scipy.sparse import lil_matrix
from tqdm import tqdm

def build_cooccurrence_matrix(sentences, word_to_idx, window_size=5):
    """Build a distance-weighted co-occurrence matrix for the given window size (the best size is chosen later)"""
    vocab_size = len(word_to_idx)
    cooc_matrix = lil_matrix((vocab_size, vocab_size), dtype=np.float32)
    
    for sentence in tqdm(sentences, desc=f"Building co-occurrence matrix (window={window_size})"):
        # Filter words that are in vocabulary
        valid_words = [word for word in sentence if word in word_to_idx]
        
        for i, target_word in enumerate(valid_words):
            target_idx = word_to_idx[target_word]
            
            # Look at context words within window
            start = max(0, i - window_size)
            end = min(len(valid_words), i + window_size + 1)
            
            for j in range(start, end):
                if i != j:
                    context_word = valid_words[j]
                    context_idx = word_to_idx[context_word]
                    # Weight by distance (closer words get higher weight)
                    distance = abs(i - j)
                    weight = 1.0 / distance
                    cooc_matrix[target_idx, context_idx] += weight
    
    return cooc_matrix.tocsr()
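
Incrementing individual cells of a `lil_matrix` can be slow on a corpus of this size. One possible alternative, sketched below with the same 1/distance weighting, accumulates the weights in a plain dictionary keyed by `(target_idx, context_idx)` and builds the sparse matrix once at the end. `build_cooccurrence_fast` is a hypothetical name, the toy data is made up for illustration, and any speed-up has not been measured on this corpus.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import coo_matrix

def build_cooccurrence_fast(sentences, word_to_idx, window_size=5):
    """Accumulate distance-weighted counts in a dict, then build a CSR matrix once."""
    weights = defaultdict(float)
    for sentence in sentences:
        ids = [word_to_idx[w] for w in sentence if w in word_to_idx]
        for i, target in enumerate(ids):
            # Look right only; each pair is mirrored, which covers the left context
            for j in range(i + 1, min(len(ids), i + window_size + 1)):
                w = 1.0 / (j - i)
                weights[(target, ids[j])] += w
                weights[(ids[j], target)] += w
    n = len(word_to_idx)
    if not weights:
        return coo_matrix((n, n), dtype=np.float32).tocsr()
    rows, cols = zip(*weights.keys())
    data = np.fromiter(weights.values(), dtype=np.float32, count=len(weights))
    return coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()

toy_sentences = [["a", "b", "c"], ["a", "c"]]
toy_index = {"a": 0, "b": 1, "c": 2}
toy_result = build_cooccurrence_fast(toy_sentences, toy_index, window_size=2)
print(toy_result.toarray())
```

The dictionary holds one entry per non-zero cell, so memory use grows with the window size; whether that trade-off is acceptable depends on the machine.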

### Window Size Analysis and Selection

Small (2-5 words): Captures local relationships (syntax). \
Medium (5-10 words): Captures semantic meaning.\
Large (>10 words): Captures broader context.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

def analyze_window_sizes(sentences, word2id, window_sizes=(2, 4, 5, 7, 9, 11, 14)):
    """Experiment with different window sizes and analyze sparsity"""
    print("🔬 Experimenting with different window sizes...")
    
    cooc_matrices = {}
    sparsity_results = []
    
    vocab_size = len(word2id)
    
    for window_size in window_sizes:
        print(f"\n📐 Testing window size: {window_size}")
        
        # Build co-occurrence matrix
        cooc_matrix = build_cooccurrence_matrix(sentences, word2id, window_size)
        cooc_matrices[window_size] = cooc_matrix
        
        # Calculate sparsity
        sparsity = 1 - cooc_matrix.nnz / (vocab_size * vocab_size)
        non_zero_entries = cooc_matrix.nnz
        
        sparsity_results.append({
            'window_size': window_size,
            'sparsity': sparsity,
            'non_zero_entries': non_zero_entries,
            'density': 1 - sparsity
        })
        
        print(f"   Matrix shape: {cooc_matrix.shape}")
        print(f"   Non-zero entries: {non_zero_entries:,}")
        print(f"   Sparsity: {sparsity:.4f}")
        print(f"   Density: {1-sparsity:.4f}")
    
    # Visualize sparsity analysis
    df_sparsity = pd.DataFrame(sparsity_results)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Sparsity vs Window Size
    ax1.plot(df_sparsity['window_size'], df_sparsity['sparsity'], 'bo-', linewidth=2, markersize=8)
    ax1.set_xlabel('Window Size')
    ax1.set_ylabel('Sparsity')
    ax1.set_title('Matrix Sparsity vs Window Size')
    ax1.grid(True, alpha=0.3)
    
    # Non-zero entries vs Window Size
    ax2.plot(df_sparsity['window_size'], df_sparsity['non_zero_entries'], 'ro-', linewidth=2, markersize=8)
    ax2.set_xlabel('Window Size')
    ax2.set_ylabel('Non-zero Entries')
    ax2.set_title('Non-zero Entries vs Window Size')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Select optimal window size (balance between sparsity and information)
    optimal_window = 5  # Default choice
    print(f"\n🎯 Selected optimal window size: {optimal_window}")
    
    return cooc_matrices, df_sparsity, optimal_window

# Experiment with window sizes
WINDOW_SIZES = [2, 4, 5, 7, 9, 11, 14]
cooc_matrices, sparsity_df, optimal_window = analyze_window_sizes(
    processed_hindi, word2id, WINDOW_SIZES
)

print(f"\n✅ Co-occurrence matrices built for {len(WINDOW_SIZES)} window sizes")

🔬 Experimenting with different window sizes...

📐 Testing window size: 2


Building co-occurrence matrix (window=2): 100%|██████████| 300000/300000 [03:08<00:00, 1588.52it/s]


   Matrix shape: (27123, 27123)
   Non-zero entries: 3,537,330
   Sparsity: 0.9952
   Density: 0.0048

📐 Testing window size: 4


Building co-occurrence matrix (window=4): 100%|██████████| 300000/300000 [06:06<00:00, 817.71it/s] 


   Matrix shape: (27123, 27123)
   Non-zero entries: 6,445,326
   Sparsity: 0.9912
   Density: 0.0088

📐 Testing window size: 5


Building co-occurrence matrix (window=5): 100%|██████████| 300000/300000 [07:13<00:00, 691.28it/s] 


   Matrix shape: (27123, 27123)
   Non-zero entries: 7,577,174
   Sparsity: 0.9897
   Density: 0.0103

📐 Testing window size: 7


Building co-occurrence matrix (window=7): 100%|██████████| 300000/300000 [11:38<00:00, 429.34it/s] 


   Matrix shape: (27123, 27123)
   Non-zero entries: 9,397,390
   Sparsity: 0.9872
   Density: 0.0128

📐 Testing window size: 9


Building co-occurrence matrix (window=9): 100%|██████████| 300000/300000 [17:25<00:00, 286.88it/s]


   Matrix shape: (27123, 27123)
   Non-zero entries: 10,772,734
   Sparsity: 0.9854
   Density: 0.0146

📐 Testing window size: 11


Building co-occurrence matrix (window=11): 100%|██████████| 300000/300000 [15:48<00:00, 316.12it/s]


   Matrix shape: (27123, 27123)
   Non-zero entries: 11,826,031
   Sparsity: 0.9839
   Density: 0.0161

📐 Testing window size: 14


Building co-occurrence matrix (window=14): 100%|██████████| 300000/300000 [14:18<00:00, 349.30it/s]


   Matrix shape: (27123, 27123)
   Non-zero entries: 12,970,329
   Sparsity: 0.9824
   Density: 0.0176


NameError: name 'pd' is not defined

## Evaluate with a Few Word Embeddings

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

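def analyze_window_size_quality(cooc_matrices, vocab, word2id, sample_words=None):
    """Analyze the quality of embeddings for different window sizes"""
    if sample_words is None:
        # Hindi counterparts of king, queen, man, woman, good, bad; English probes are
        # unlikely to be in a Hindi vocabulary. Assumed to be in the vocabulary; adjust as needed.
        sample_words = ['राजा', 'रानी', 'आदमी', 'औरत', 'अच्छा', 'बुरा']
    results = {}
    vocab_size = len(vocab)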
    
    for window_size, matrix in cooc_matrices.items():
        # Apply log transformation and normalize
        log_matrix = matrix.copy().astype(np.float32)
        log_matrix.data = np.log1p(log_matrix.data)  # log(1 + x)
        
        # Normalize rows
        row_sums = np.array(log_matrix.sum(axis=1)).flatten()
        row_sums[row_sums == 0] = 1  # Avoid division by zero
        log_matrix = log_matrix.multiply(1 / row_sums[:, np.newaxis]).tocsr()  # back to CSR for fast row access
        
        # Calculate average cosine similarity for sample word pairs
        similarities = []
        for i, word1 in enumerate(sample_words):
            if word1 in word2id:
                idx1 = word2id[word1]
                vec1 = log_matrix.getrow(idx1).toarray().flatten()  # Use getrow to access row
                
                for word2 in sample_words[i+1:]:
                    if word2 in word2id:
                        idx2 = word2id[word2]
                        vec2 = log_matrix.getrow(idx2).toarray().flatten()  # Use getrow to access row
                        sim = cosine_similarity([vec1], [vec2])[0, 0]
                        similarities.append(sim)
        
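        # Compute results (NaN if none of the sample word pairs are in the vocabulary)
        results[window_size] = {
            'avg_similarity': np.mean(similarities) if similarities else float('nan'),
            'std_similarity': np.std(similarities) if similarities else float('nan'),
            'sparsity': 1 - matrix.nnz / (vocab_size * vocab_size)  # sparsity calculation
        }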
    
    return results

# Example: Assuming cooc_matrices, vocab, and word2id are defined somewhere in the code
window_analysis = analyze_window_size_quality(cooc_matrices, vocab, word2id)

# Visualize window size analysis
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

window_sizes_list = list(window_analysis.keys())
avg_sims = [window_analysis[w]['avg_similarity'] for w in window_sizes_list]
std_sims = [window_analysis[w]['std_similarity'] for w in window_sizes_list]
sparsities = [window_analysis[w]['sparsity'] for w in window_sizes_list]

axes[0].plot(window_sizes_list, avg_sims, 'o-')
axes[0].set_title('Average Cosine Similarity')
axes[0].set_xlabel('Window Size')
axes[0].set_ylabel('Similarity')

axes[1].plot(window_sizes_list, std_sims, 'o-', color='orange')
axes[1].set_title('Similarity Standard Deviation')
axes[1].set_xlabel('Window Size')
axes[1].set_ylabel('Std Dev')

axes[2].plot(window_sizes_list, sparsities, 'o-', color='green')
axes[2].set_title('Matrix Sparsity')
axes[2].set_xlabel('Window Size')
axes[2].set_ylabel('Sparsity')

plt.tight_layout()
plt.show()

# Select best window size (balance between similarity and computational efficiency)
best_window = 7  # Hard-coded choice; not computed from window_analysis
print(f"\nSelected window size: {best_window}")
print(f"Analysis results: {window_analysis[best_window]}")

NameError: name 'cooc_matrices' is not defined
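
Average pairwise similarity over a handful of words is a coarse signal. A complementary check, sketched here, lists the nearest neighbours of a query word by cosine similarity over the matrix rows. `nearest_neighbors` is a hypothetical helper and the toy matrix is made up for illustration; on the real data one would pass a matrix from `cooc_matrices` together with `word2id` and `id2word`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def nearest_neighbors(word, matrix, word2id, id2word, top_k=5):
    """Return the top_k words whose rows are most cosine-similar to `word`."""
    if word not in word2id:
        return []
    idx = word2id[word]
    sims = cosine_similarity(matrix[idx], matrix).ravel()  # accepts sparse input
    sims[idx] = -np.inf  # never return the query word itself
    best = np.argsort(-sims)[:top_k]
    return [(id2word[int(i)], float(sims[i])) for i in best]

toy_word2id = {"a": 0, "b": 1, "c": 2}
toy_id2word = {i: w for w, i in toy_word2id.items()}
toy_matrix = csr_matrix(np.array([[0, 2, 1], [0, 4, 2], [3, 0, 0]], dtype=np.float32))
neighbors = nearest_neighbors("a", toy_matrix, toy_word2id, toy_id2word, top_k=2)
print(neighbors)
```

Reading the neighbour lists for a few common Hindi words is often a quicker sanity check than a single averaged score, although it remains a qualitative judgment.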

## What's Next?

To turn the co-occurrence matrix into word embeddings, we need to apply a technique such as one of the following:

| Method | What it does |
| --- | --- |
| PCA / SVD | Reduces the matrix to low-dimensional dense vectors |
| NMF (Non-negative Matrix Factorization) | Factorizes the co-occurrence matrix into interpretable non-negative factors |
| GloVe | Uses the co-occurrence matrix to train word vectors |
| word2vec | Learns embeddings directly via neural nets (skip-gram/CBOW) |
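
As a rough illustration of the PCA / SVD option above, a co-occurrence matrix can be reduced to dense vectors with scikit-learn's `TruncatedSVD`, which works on sparse matrices without densifying them. The sketch uses a small random matrix as a stand-in for one of the matrices in `cooc_matrices`; `n_components=8` and the `log1p` damping are illustrative assumptions, not tuned values.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for a (vocab_size x vocab_size) co-occurrence matrix
toy_cooc = sparse_random(200, 200, density=0.05, format="csr", random_state=0)

# Dampen large counts before factorizing, as in the evaluation cell
damped = toy_cooc.copy()
damped.data = np.log1p(damped.data)

svd = TruncatedSVD(n_components=8, random_state=0)
embeddings = svd.fit_transform(damped)  # dense array, one row per word

print(embeddings.shape)
print(f"Explained variance retained: {svd.explained_variance_ratio_.sum():.3f}")
```

On the real matrix, the row for a word would be looked up as `embeddings[word2id[word]]`, and the number of components would be chosen by experiment.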