## Assignment 2: Probabilistic Models and Vector Space Applications

### Assignment Overview

This assignment builds on fundamental text processing techniques to explore probabilistic language modeling, text classification, and the practical application of vector space models for information retrieval. You will implement a simple n-gram language model, build a complete text classification pipeline, develop a search engine, and analyze the core components of sequence models.

You are required to complete five coding-related tasks. For each task, you will be working with a specified dataset or corpus. Please submit your solutions in this single Jupyter Notebook (`.ipynb`) file, clearly marking each task. Ensure your code is well-commented and your findings are explained in markdown cells where requested.

### Task 1: Implementing a Bigram Language Model with Laplace Smoothing (20 Marks)

**Objective:** To understand the fundamentals of n-gram language models, including probability calculation, smoothing, and evaluation with perplexity.

**Description:** You will implement a Bigram language model from scratch. Your model will be trained on a small corpus and will use Add-One (Laplace) smoothing to handle unseen n-grams.

**Your task is to:**

1.  **Implement a training function `train_bigram_model(corpus)`:**
    * The corpus will be a list of sentences, where each sentence is a list of tokens.
    * The function should count all unigrams and bigrams in the corpus.
    * It should return the unigram counts, bigram counts, and the vocabulary size (V).

2.  **Implement a probability function `calculate_bigram_prob(prev_word, word, unigram_counts, bigram_counts, V)`:**
    * This function should calculate the smoothed probability of a `word` given the `prev_word` using the formula for Laplace (Add-One) smoothing: $P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$.

3.  **Implement a perplexity calculation function `calculate_perplexity(sentence, ...)`:**
    * This function should take a test sentence and your trained model components as input.
    * It should calculate the perplexity of the sentence using the formula: $PP(W) = P(w_1, w_2, ..., w_N)^{-1/N}$. Remember to handle the start of the sentence appropriately (e.g., by assuming a start token `<S>`).

4.  **Train and Evaluate:**
    * Train your model on the provided `train_corpus`.
    * Calculate and print the perplexity of your model on the `test_sentence`.

**Corpus:**

```python
# Sample corpus for training and testing
train_corpus = [["<S>", "i", "am", "sam", "</S>"], ["<S>", "sam", "i", "am", "</S>"], ["<S>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</S>"]]
test_sentence = ["<S>", "i", "like", "green", "ham", "</S>"]
```
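As a quick hand check of the smoothing formula, the sketch below computes two smoothed probabilities directly from raw counts on the sample corpus. The helper name `laplace_bigram_prob` is illustrative and not part of the required interface, and the sketch assumes that `<S>` and `</S>` count toward the vocabulary size.

```python
from collections import Counter

train_corpus = [
    ["<S>", "i", "am", "sam", "</S>"],
    ["<S>", "sam", "i", "am", "</S>"],
    ["<S>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</S>"],
]

# Raw counts taken straight from the corpus
unigrams = Counter(tok for sent in train_corpus for tok in sent)
bigrams = Counter(pair for sent in train_corpus for pair in zip(sent, sent[1:]))
V = len(unigrams)  # assumption: <S> and </S> are counted as vocabulary items

def laplace_bigram_prob(prev_word, word):
    """Add-one smoothed P(word | prev_word); hypothetical helper for illustration."""
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + V)

seen = laplace_bigram_prob("i", "am")      # C(i, am) = 2, C(i) = 3, V = 12
unseen = laplace_bigram_prob("i", "like")  # C(i, like) = 0, C(i) = 3, V = 12

print(f"V = {V}")
print(f"P(am | i)   = {seen:.4f}")    # (2 + 1) / (3 + 12)
print(f"P(like | i) = {unseen:.4f}")  # (0 + 1) / (3 + 12)
```

The unseen bigram still receives a nonzero probability, which is what keeps the perplexity of a test sentence finite.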

In [None]:
# Your code for Task 1 here
import numpy as np
from collections import Counter, defaultdict

# Provided Corpus
train_corpus = [["<S>", "i", "am", "sam", "</S>"], ["<S>", "sam", "i", "am", "</S>"], ["<S>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</S>"]]
test_sentence = ["<S>", "i", "like", "green", "ham", "</S>"]

def train_bigram_model(corpus):
    """
    Train a bigram language model.
    
    Args:
        corpus: A list of sentences, where each sentence is a list of tokens.
    
    Returns:
        unigram_counts: Counter of unigram counts.
        bigram_counts: Nested counts, indexed as bigram_counts[prev_word][word].
        V: Vocabulary size.
    """
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)
    
    # Count unigrams and bigrams sentence by sentence
    for sentence in corpus:
        # Count unigrams
        for token in sentence:
            unigram_counts[token] += 1
        
        # Count bigrams
        for i in range(len(sentence) - 1):
            prev_word = sentence[i]
            word = sentence[i + 1]
            bigram_counts[prev_word][word] += 1
    
    # Vocabulary size (number of unique tokens, including <S> and </S>)
    V = len(unigram_counts)
    
    return unigram_counts, bigram_counts, V

def calculate_bigram_prob(prev_word, word, unigram_counts, bigram_counts, V):
    """
    Compute the Laplace-smoothed bigram probability.
    
    Formula: P(w_i | w_{i-1}) = (C(w_{i-1}, w_i) + 1) / (C(w_{i-1}) + V)
    
    Args:
        prev_word: The previous word.
        word: The current word.
        unigram_counts: Counter of unigram counts.
        bigram_counts: Nested bigram counts.
        V: Vocabulary size.
    
    Returns:
        The smoothed probability.
    """
    # Bigram count; 0 if the bigram was never seen
    bigram_count = bigram_counts.get(prev_word, Counter()).get(word, 0)
    
    # Unigram count of the previous word; 0 if it was never seen
    prev_word_count = unigram_counts.get(prev_word, 0)
    
    # Apply Laplace (add-one) smoothing
    prob = (bigram_count + 1) / (prev_word_count + V)
    
    return prob

def calculate_perplexity(sentence, unigram_counts, bigram_counts, V):
    """
    Compute the perplexity of a sentence.
    
    Formula: PP(W) = P(w_1, w_2, ..., w_N)^{-1/N}
    where P(w_1, w_2, ..., w_N) = prod_i P(w_i | w_{i-1})
    
    Args:
        sentence: The test sentence as a list of tokens.
        unigram_counts: Counter of unigram counts.
        bigram_counts: Nested bigram counts.
        V: Vocabulary size.
    
    Returns:
        The perplexity.
    """
    # Accumulate the log-probabilities of all bigrams
    log_prob_sum = 0.0
    
    # Iterate over every bigram in the sentence
    for i in range(len(sentence) - 1):
        prev_word = sentence[i]
        word = sentence[i + 1]
        
        # Smoothed probability of this bigram
        prob = calculate_bigram_prob(prev_word, word, unigram_counts, bigram_counts, V)
        
        # Add its log-probability to the running sum
        log_prob_sum += np.log(prob)
    
    # N is the number of predicted tokens (includes </S>, excludes <S>), i.e. the number of bigrams
    N = len(sentence) - 1
    
    # Perplexity: PP(W) = exp(-log_prob_sum / N)
    perplexity = np.exp(-log_prob_sum / N)
    
    return perplexity

# Train the model
unigram_counts, bigram_counts, V = train_bigram_model(train_corpus)

# Calculate and print perplexity
perplexity = calculate_perplexity(test_sentence, unigram_counts, bigram_counts, V)
print(f"Perplexity of the test sentence: {perplexity:.2f}")
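One way to sanity-check the smoothed model above is to confirm that, for a fixed history word, the smoothed probabilities over the whole vocabulary sum to 1. The sketch below rebuilds the counts compactly so that it runs on its own; `smoothed_prob` and `totals` are illustrative names, not part of the required interface.

```python
from collections import Counter

train_corpus = [
    ["<S>", "i", "am", "sam", "</S>"],
    ["<S>", "sam", "i", "am", "</S>"],
    ["<S>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</S>"],
]

unigrams = Counter(tok for sent in train_corpus for tok in sent)
bigrams = Counter(pair for sent in train_corpus for pair in zip(sent, sent[1:]))
vocab = sorted(unigrams)
V = len(vocab)

def smoothed_prob(prev_word, word):
    """Add-one smoothed P(word | prev_word); hypothetical helper."""
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + V)

# Total probability mass assigned to the vocabulary for each history word
totals = {prev: sum(smoothed_prob(prev, w) for w in vocab) for prev in vocab}

for prev, total in totals.items():
    print(f"{prev:>6}: {total:.4f}")
```

Every history except `</S>` sums to 1. The end token is never followed by anything, so its unigram count (3) overstates how often it occurs as a context and its row sums to 12/15. This does not affect the perplexity computation, because `</S>` never appears as `prev_word` within a sentence.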


### Task 2: Text Classification with TF-IDF and Naive Bayes (20 Marks)

**Objective:** To build a complete text classification pipeline using TF-IDF feature extraction and a Multinomial Naive Bayes classifier.

**Description:** You will use `scikit-learn` to classify SMS messages as either "spam" or "ham" (not spam). This task integrates vector space representation with a classic probabilistic model.

**Your task is to:**

1.  Load the SMS Spam Collection dataset.
2.  Split the dataset into an 80% training set and a 20% testing set.
3.  Create a text processing pipeline using `sklearn.pipeline.Pipeline` that consists of two steps:
    * `TfidfVectorizer`: To convert text messages into TF-IDF vectors. Use the default parameters.
    * `MultinomialNB`: The Multinomial Naive Bayes classifier.
4.  Train the pipeline on the training data.
5.  Evaluate the trained model on the testing data. Print the following:
    * The accuracy of the model.
    * A full classification report (including precision, recall, and F1-score for each class) using `sklearn.metrics.classification_report`.
6.  Use the trained pipeline to predict the class of two new messages: `"Congratulations! You've won a $1,000 gift card. Go to http://example.com to claim now."` and `"Hi mom, I'll be home for dinner tonight."`

**Dataset:**

* **SMS Spam Collection Dataset:** A public set of labeled SMS messages.
* **Access:** Download from the UCI Machine Learning Repository: [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You will need the `SMSSpamCollection` file.
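To make the classifier step less of a black box, here is a minimal sketch of how Multinomial Naive Bayes scores a message. It uses raw word counts with add-one smoothing on a tiny made-up training set, not TF-IDF weights or the real dataset, so it illustrates the decision rule only. The names `train_nb` and `predict_nb` and the toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy labeled messages (made up for illustration, not taken from the SMS dataset)
toy_data = [
    ("spam", "win a free prize now"),
    ("spam", "free gift card claim now"),
    ("ham", "see you at dinner tonight"),
    ("ham", "i will be home for dinner"),
]

def train_nb(data):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in data:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(text, class_counts, word_counts, vocab):
    """Return the label maximizing log P(c) + sum over words of log P(w | c)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get), scores

model = train_nb(toy_data)
label_1, scores_1 = predict_nb("claim your free prize", *model)
label_2, scores_2 = predict_nb("home for dinner tonight", *model)
print(label_1, scores_1)
print(label_2, scores_2)
```

The scikit-learn pipeline applies the same kind of rule, but feeds the classifier TF-IDF weights in place of raw counts.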

In [None]:
# Your code for Task 2 here
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset, e.g., as follows; you may modify this as needed.
# df = pd.read_csv('path_to_your_dataset/SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])

# Split data

# Create and train the pipeline

# Evaluate the model

# Predict on new messages
# Step 1: Load the SMS Spam Collection dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])

# Display basic information about the dataset
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nLabel distribution:")
print(df['label'].value_counts())

# Step 2: Split the dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], 
    df['label'], 
    test_size=0.2, 
    random_state=42,
    stratify=df['label']  # Preserve the spam/ham ratio in both splits
)

print(f"\nTraining set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

# Step 3: Create a text processing pipeline using sklearn.pipeline.Pipeline
# The pipeline consists of:
# 1. TfidfVectorizer: Convert text messages into TF-IDF vectors (default parameters)
# 2. MultinomialNB: Multinomial Naive Bayes classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # Default parameters
    ('nb', MultinomialNB())
])

# Step 4: Train the pipeline on the training data
print("\nTraining the pipeline...")
pipeline.fit(X_train, y_train)
print("Training completed!")

# Step 5: Evaluate the trained model on the testing data
y_pred = pipeline.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

# Print full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Step 6: Predict the class of two new messages
new_messages = [
    "Congratulations! You've won a $1,000 gift card. Go to http://example.com to claim now.",
    "Hi mom, I'll be home for dinner tonight."
]

print("\nPredictions for new messages:")
for i, message in enumerate(new_messages, 1):
    prediction = pipeline.predict([message])[0]
    print(f"\nMessage {i}: \"{message}\"")
    print(f"Predicted class: {prediction}")


### Task 3: Building a Simple Information Retrieval System (20 Marks)

**Objective:** To apply TF-IDF and Cosine Similarity to build a basic document retrieval system that ranks documents based on their relevance to a query.

**Description:** You will create a system that takes a text query and returns the most relevant documents from a small corpus. This is the core principle behind search engines.

**Your task is to:**

1.  Use the provided `document_corpus`.
2.  Create a `TfidfVectorizer` and fit it on the corpus to learn the vocabulary and IDF weights.
3.  Transform the corpus into a TF-IDF document-term matrix.
4.  Write a function `rank_documents(query, vectorizer, doc_term_matrix, top_n=3)` that:
    * Takes a `query` string, the fitted `vectorizer`, the document-term matrix `doc_term_matrix`, and an optional `top_n` integer.
    * Transforms the input query into a TF-IDF vector using the *same* vectorizer.
    * Calculates the cosine similarity between the query vector and all document vectors in the matrix.
    * Returns the indices and content of the `top_n` most similar documents.
5.  Demonstrate your system by running it with the query `"deep learning models for vision"` and printing the ranked results.

**Dataset:**

```python
# A small corpus of document abstracts
document_corpus = [
    "The field of machine learning has seen rapid growth in recent years, especially in deep learning.",
    "Natural language processing allows machines to understand and respond to human text.",
    "Computer vision focuses on enabling computers to see and interpret the visual world.",
    "Deep learning models like convolutional neural networks are powerful for computer vision tasks.",
    "Recurrent neural networks are often used for sequential data in natural language processing."
    ...
]
```
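For intuition about what the vectorizer and the similarity step compute, the following pure-Python sketch ranks the five abstracts shown above against the demonstration query. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization that scikit-learn's `TfidfVectorizer` uses by default, together with a simplified tokenizer; the helper names are illustrative and this is not a substitute for the required scikit-learn solution.

```python
import math
import re
from collections import Counter

docs = [
    "The field of machine learning has seen rapid growth in recent years, especially in deep learning.",
    "Natural language processing allows machines to understand and respond to human text.",
    "Computer vision focuses on enabling computers to see and interpret the visual world.",
    "Deep learning models like convolutional neural networks are powerful for computer vision tasks.",
    "Recurrent neural networks are often used for sequential data in natural language processing.",
]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}

def tfidf_vector(text, idf):
    """Sparse, L2-normalized TF-IDF vector; out-of-vocabulary terms are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors equals their cosine."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

idf = fit_idf(docs)
doc_vecs = [tfidf_vector(d, idf) for d in docs]
query_vec = tfidf_vector("deep learning models for vision", idf)

scores = [cosine(query_vec, dv) for dv in doc_vecs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
for i in ranking[:3]:
    print(f"{scores[i]:.3f}  doc {i}: {docs[i]}")
```

The document that contains all five query terms (index 3) ranks first, followed by the one that mentions "deep learning" (index 0).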

In [None]:
# Your code for Task 3 here
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

document_corpus = [
    "The field of machine learning has seen rapid growth in recent years, especially in deep learning.",
    "Natural language processing allows machines to understand and respond to human text.",
    "Computer vision focuses on enabling computers to see and interpret the visual world.",
    "Deep learning models like convolutional neural networks are powerful for computer vision tasks.",
    "Recurrent neural networks are often used for sequential data in natural language processing.",
    "The advances in reinforcement learning have led to breakthroughs in game playing and robotics.",
    "Transfer learning enables models trained on large datasets to be adapted for new tasks with limited data.",
    "Unsupervised learning techniques can discover hidden patterns in data without labeled examples.",
    "Optimization algorithms such as stochastic gradient descent are crucial for training neural networks.",
    "Attention mechanisms have improved the performance of natural language translation and image captioning.",
    "Generative adversarial networks create realistic images and are used for data augmentation.",
    "Feature engineering and selection are important steps in classical machine learning pipelines.",
    "Object detection is a key task in computer vision that involves locating instances within images.",
    "The combination of convolutional and recurrent networks is used for video classification tasks.",
    "Zero-shot learning allows models to recognize objects and concepts they have not seen during training.",
    "Natural language generation is used for creating text summaries and chatbot responses.",
    "Graph neural networks leverage graph structures for tasks such as social network analysis and chemistry.",
    "Hyperparameter tuning can significantly improve the accuracy of deep learning models.",
    "Cross-modal learning involves integrating information from multiple data sources such as text and images.",
    "Evaluating model performance requires a good choice of metrics such as F1-score and RMSE."
]

# Step 2: Create a TfidfVectorizer and fit it on the corpus
vectorizer = TfidfVectorizer()
vectorizer.fit(document_corpus)

# Step 3: Transform the corpus into a TF-IDF document-term matrix
doc_term_matrix = vectorizer.transform(document_corpus)

def rank_documents(query, vectorizer, doc_term_matrix, top_n=3):
    """
    Rank documents based on their relevance to a query using cosine similarity.
    
    Parameters:
        query: Query string
        vectorizer: Fitted TfidfVectorizer
        doc_term_matrix: Document-term TF-IDF matrix
        top_n: Number of top documents to return (default: 3)
    
    Returns:
        List of tuples containing (document index, document content, similarity score),
        sorted by similarity in descending order
    """
    # Transform the query into a TF-IDF vector using the same vectorizer
    query_vector = vectorizer.transform([query])
    
    # Calculate cosine similarity between the query vector and all document vectors
    similarities = cosine_similarity(query_vector, doc_term_matrix).flatten()
    
    # Get indices of the top_n most similar documents (descending order)
    top_indices = np.argsort(similarities)[::-1][:top_n]
    
    # Return indices and corresponding document content
    ranked_docs = [(idx, document_corpus[idx], similarities[idx]) for idx in top_indices]
    
    return ranked_docs

# Step 5: Demonstrate the system with the query
query = "deep learning models for vision"
ranked_docs = rank_documents(query, vectorizer, doc_term_matrix, top_n=3)

print(f"Top {len(ranked_docs)} documents for the query: '{query}'\n")
for rank, (idx, doc, similarity) in enumerate(ranked_docs, 1):
    print(f"Rank {rank} (Similarity: {similarity:.4f}):")
    print(f"  Document {idx}: {doc}\n")

### Task 4: Implementing Viterbi for HMM POS Tagging (20 Marks)

**Objective:** Implement a simple Hidden Markov Model (HMM) POS tagger via the Viterbi algorithm.

**Description:** Implement Viterbi decoding for a small HMM and apply it to two sentences with the ambiguous word "book". Then briefly discuss why HMMs work for POS tagging and a limitation of the Markov assumption.

**Your task is to:**

1. Define two sentences:
    * `sentence1 = "The book is on the table."`
    * `sentence2 = "I want to book a flight."`
2. Implement Viterbi in log-space for a small tag set (e.g., `{DET, NOUN, VERB, PRT}`). Use the example initial (π), transition (A), and emission (B) probabilities in the parameters block below, or define your own consistent matrices and document them.
3. Run your decoder on both sentences and print the predicted tag sequence and total log-probability.
4. In a markdown cell, explain:
    * **a)** How transition and emission probabilities lead to different tags for "book" in the two sentences.
    * **b)** One sentence where the first-order Markov assumption is limiting, and why.

**Parameters:** Use the matrices shown in the section "Viterbi decoding for a simple HMM (Task 4)" below.

#### Viterbi decoding for a simple HMM (Task 4)

We illustrate HMM POS tagging with a small tag set `T = {DET, NOUN, VERB, PRT}` and vocabulary `V = {the, a, book, table, flight, is, want, to, on, i}`. The HMM comprises initial probabilities π, tag-to-tag transitions A, and tag-to-word emissions B.

Example parameters (each row sums to 1):

- Initial π:
  - P(DET)=0.50, P(NOUN)=0.20, P(VERB)=0.20, P(PRT)=0.10

- Transition A (rows: from-tag, cols: to-tag) in order [DET, NOUN, VERB, PRT]:

```text
from\to   DET    NOUN   VERB   PRT
DET      0.05   0.75   0.15   0.05
NOUN     0.05   0.10   0.75   0.10
VERB     0.10   0.35   0.40   0.15
PRT      0.05   0.10   0.75   0.10
```

- Emission B:
  - DET: the(0.80), a(0.20)
  - NOUN: book(0.45), table(0.25), flight(0.20), i(0.05), on(0.05)
  - VERB: is(0.40), want(0.35), book(0.20), to(0.03), on(0.02)
  - PRT: to(0.70), on(0.30)

Viterbi recurrence in log-space to avoid underflow:

- Initialization: `V[tag, 0] = log π[tag] + log B[tag, x0]`
- Recurrence: `V[tag, i] = log B[tag, xi] + max_prev ( V[prev, i-1] + log A[prev->tag] )`
- Backtrace from the best final tag.

We will decode the most likely tag sequence for the two Task 4 sentences using these parameters.
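To see why the same word can receive different tags, the sketch below compares the one-step score `log A[prev->tag] + log B[tag, "book"]` in the two contexts, using the example parameters above. It evaluates only the inner term of the recurrence, not a full decoder, and it assumes the preceding word has already been tagged DET ("the book") or PRT ("to book"). The helper name `step_score` is illustrative.

```python
import math

# Subset of the example transition and emission parameters relevant to "book"
A = {
    "DET": {"NOUN": 0.75, "VERB": 0.15},
    "PRT": {"NOUN": 0.10, "VERB": 0.75},
}
B_book = {"NOUN": 0.45, "VERB": 0.20}

def step_score(prev_tag, tag):
    """log A[prev_tag -> tag] + log B[tag, 'book'] (hypothetical helper)."""
    return math.log(A[prev_tag][tag]) + math.log(B_book[tag])

best_tags = {}
for prev_tag, context in [("DET", "the book"), ("PRT", "to book")]:
    scores = {tag: step_score(prev_tag, tag) for tag in ("NOUN", "VERB")}
    best_tags[context] = max(scores, key=scores.get)
    print(f"{context!r} (prev={prev_tag}): "
          f"NOUN={scores['NOUN']:.3f}, VERB={scores['VERB']:.3f} "
          f"-> {best_tags[context]}")
```

Although NOUN has the higher emission probability for "book", the strong PRT-to-VERB transition outweighs it after "to", while the strong DET-to-NOUN transition reinforces the noun reading after "the".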


In [None]:
# Task 4: Implement Viterbi for a simple POS HMM
import math
from typing import List, Dict, Tuple

# Define tag set
TAGS = ["DET", "NOUN", "VERB", "PRT"]

# Define HMM parameters
# Initial probabilities (pi)
pi: Dict[str, float] = {
    "DET": 0.50,
    "NOUN": 0.20,
    "VERB": 0.20,
    "PRT": 0.10
}

# Transition probabilities (A): from-tag -> to-tag
A: Dict[str, Dict[str, float]] = {
    "DET": {"DET": 0.05, "NOUN": 0.75, "VERB": 0.15, "PRT": 0.05},
    "NOUN": {"DET": 0.05, "NOUN": 0.10, "VERB": 0.75, "PRT": 0.10},
    "VERB": {"DET": 0.10, "NOUN": 0.35, "VERB": 0.40, "PRT": 0.15},
    "PRT": {"DET": 0.05, "NOUN": 0.10, "VERB": 0.75, "PRT": 0.10}
}

# Emission probabilities (B): tag -> word
B: Dict[str, Dict[str, float]] = {
    "DET": {"the": 0.80, "a": 0.20},
    "NOUN": {"book": 0.45, "table": 0.25, "flight": 0.20, "i": 0.05, "on": 0.05},
    "VERB": {"is": 0.40, "want": 0.35, "book": 0.20, "to": 0.03, "on": 0.02},
    "PRT": {"to": 0.70, "on": 0.30}
}

# Unknown word probability (for words not in vocabulary)
UNK = 1e-8

def emission_logprob(tag: str, word: str) -> float:
    """
    Return log-probability for emitting 'word' from 'tag'.
    Uses UNK for unseen words.
    """
    if tag in B and word in B[tag]:
        prob = B[tag][word]
    else:
        prob = UNK
    return math.log(prob)

def viterbi(tokens: List[str]) -> Tuple[List[str], float]:
    """
    Implement Viterbi algorithm in log-space for HMM POS tagging.
    
    Returns:
        Tuple of (predicted tag sequence, total log-probability)
    """
    n = len(tokens)
    num_tags = len(TAGS)
    
    # Step 1: Initialize
    # V[tag][pos] stores the log-probability of the best path ending at position pos with tag
    V = {tag: [float('-inf')] * n for tag in TAGS}
    # backpointers[tag][pos] stores the previous tag that led to this state
    backpointers = {tag: [None] * n for tag in TAGS}
    
    # Initialization: V[tag, 0] = log(pi[tag]) + log(B[tag, x0])
    for tag in TAGS:
        log_pi = math.log(pi[tag])
        log_emission = emission_logprob(tag, tokens[0])
        V[tag][0] = log_pi + log_emission
    
    # Step 2: Dynamic programming with backpointers
    # For each position i from 1 to n-1
    for i in range(1, n):
        # For each possible tag at position i
        for tag in TAGS:
            # Find the best previous tag
            best_score = float('-inf')
            best_prev_tag = None
            
            for prev_tag in TAGS:
                # Recurrence: V[tag, i] = log(B[tag, xi]) + max_prev(V[prev, i-1] + log(A[prev->tag]))
                log_transition = math.log(A[prev_tag][tag])
                score = V[prev_tag][i-1] + log_transition
                
                if score > best_score:
                    best_score = score
                    best_prev_tag = prev_tag
            
            # Add emission probability
            log_emission = emission_logprob(tag, tokens[i])
            V[tag][i] = best_score + log_emission
            backpointers[tag][i] = best_prev_tag
    
    # Step 3: Termination and backtrace
    # Find the best final tag
    best_final_score = float('-inf')
    best_final_tag = None
    
    for tag in TAGS:
        if V[tag][n-1] > best_final_score:
            best_final_score = V[tag][n-1]
            best_final_tag = tag
    
    # Backtrace to get the tag sequence
    tags = [None] * n
    tags[n-1] = best_final_tag
    
    for i in range(n-2, -1, -1):
        tags[i] = backpointers[tags[i+1]][i+1]
    
    return tags, best_final_score

# Prepare the two sentences (lowercased tokens)
sentence1 = ["the", "book", "is", "on", "the", "table"]
sentence2 = ["i", "want", "to", "book", "a", "flight"]

# Run decoder on both sentences and print outputs
print("Task 4: Viterbi Decoding Results\n")

tags1, logp1 = viterbi(sentence1)
print(f"Sentence 1: {sentence1}")
print(f"Tag sequence: {tags1}")
print(f"Log-probability: {round(logp1, 3)}\n")
print("Word-Tag pairs:")
for word, tag in zip(sentence1, tags1):
    print(f"  {word} -> {tag}")

print("\n" + "="*50 + "\n")

tags2, logp2 = viterbi(sentence2)
print(f"Sentence 2: {sentence2}")
print(f"Tag sequence: {tags2}")
print(f"Log-probability: {round(logp2, 3)}\n")
print("Word-Tag pairs:")
for word, tag in zip(sentence2, tags2):
    print(f"  {word} -> {tag}")

**Analysis for Task 4**

* How transition and emission probabilities lead to different tags for "book" in the two sentences.


* One sentence where the first-order Markov assumption is limiting, and why.

Your analysis goes here    

### Task 5: Comparing Cosine Similarity and Euclidean Distance (20 Marks)

**Objective:** To empirically demonstrate the difference between angle-based (Cosine) and magnitude-based (Euclidean) similarity measures in a vector space.

**Description:** The choice of similarity metric is crucial. This task highlights how document length affects each metric and why Cosine Similarity is often preferred for text-based topic similarity.

**Your task is to:**

1.  Define three simple documents:
    * `doc_A = "The cat sat on the mat."`
    * `doc_B = "The cat sat on the mat. The dog chased the cat."` (Longer, but on the same topic)
    * `doc_C = "The rocket launched into space."` (Different topic)
2.  Use `sklearn.feature_extraction.text.CountVectorizer` to transform these three documents into count vectors.
3.  Calculate the **Cosine Similarity** between all unique pairs of documents (A-B, A-C, B-C).
4.  Calculate the **Euclidean Distance** between all unique pairs of documents.
5.  Display your results clearly, for instance, in a Pandas DataFrame.
6.  **In a markdown cell, analyze your results:**
    * Explain why the Cosine Similarity between `doc_A` and `doc_B` is high, while their Euclidean Distance is relatively large.
    * Which metric (Cosine Similarity or Euclidean Distance) do your results suggest is better for identifying documents with similar topics, regardless of their length? Justify your answer based on your calculations.
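The length effect can be isolated with a small sketch: repeating a document scales its count vector without changing its direction. The pure-Python helpers `cosine` and `euclidean` below are illustrative stand-ins for the scikit-learn functions, and the count vector for `doc_A` is typed in by hand over the vocabulary order `[cat, mat, on, sat, the]`, which is an assumption about how the text is tokenized.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Straight-line distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hand-typed counts for "The cat sat on the mat." over [cat, mat, on, sat, the]
doc_a = [1, 1, 1, 1, 2]

results = {}
for k in (1, 2, 5):
    repeated = [k * x for x in doc_a]  # the same text repeated k times
    results[k] = (cosine(doc_a, repeated), euclidean(doc_a, repeated))
    print(f"k={k}: cosine={results[k][0]:.4f}, euclidean={results[k][1]:.4f}")
```

The cosine stays at 1 for every `k`, while the Euclidean distance grows in proportion to `k - 1`, even though the topic has not changed at all.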

In [None]:
# Your code for Task 5 here
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import pandas as pd
import numpy as np

# Step 1: Define three simple documents
doc_A = "The cat sat on the mat."
doc_B = "The cat sat on the mat. The dog chased the cat."  # Longer, but on the same topic
doc_C = "The rocket launched into space."  # Different topic

corpus = [doc_A, doc_B, doc_C]

print("Task 5: Comparing Cosine Similarity and Euclidean Distance\n")
print("Documents:")
print(f"  doc_A: {doc_A}")
print(f"  doc_B: {doc_B}")
print(f"  doc_C: {doc_C}\n")

# Step 2: Use CountVectorizer to transform documents into count vectors
vectorizer = CountVectorizer()
count_vectors = vectorizer.fit_transform(corpus)

# Convert to dense array for easier manipulation
count_matrix = count_vectors.toarray()

print("Vocabulary:", vectorizer.get_feature_names_out())
print(f"\nCount vectors shape: {count_matrix.shape}")
print("\nCount vectors:")
for i, doc in enumerate([doc_A, doc_B, doc_C]):
    print(f"  {['doc_A', 'doc_B', 'doc_C'][i]}: {count_matrix[i]}")

# Step 3: Calculate Cosine Similarity between all unique pairs
cosine_sim_AB = cosine_similarity([count_matrix[0]], [count_matrix[1]])[0][0]
cosine_sim_AC = cosine_similarity([count_matrix[0]], [count_matrix[2]])[0][0]
cosine_sim_BC = cosine_similarity([count_matrix[1]], [count_matrix[2]])[0][0]

# Step 4: Calculate Euclidean Distance between all unique pairs
euclidean_dist_AB = euclidean_distances([count_matrix[0]], [count_matrix[1]])[0][0]
euclidean_dist_AC = euclidean_distances([count_matrix[0]], [count_matrix[2]])[0][0]
euclidean_dist_BC = euclidean_distances([count_matrix[1]], [count_matrix[2]])[0][0]

# Step 5: Display results in a DataFrame
results_data = {
    'Document Pair': ['A-B', 'A-C', 'B-C'],
    'Cosine Similarity': [cosine_sim_AB, cosine_sim_AC, cosine_sim_BC],
    'Euclidean Distance': [euclidean_dist_AB, euclidean_dist_AC, euclidean_dist_BC]
}
results_df = pd.DataFrame(results_data)

print("\n" + "="*60)
print("RESULTS COMPARISON")
print("="*60)
print("\n", results_df.to_string(index=False))

# Additional analysis
print("\n\nKey Observations:")
print(f"  • Cosine Similarity A-B (same topic, different lengths): {cosine_sim_AB:.4f}")
print(f"  • Cosine Similarity A-C (different topics): {cosine_sim_AC:.4f}")
print(f"  • Euclidean Distance A-B: {euclidean_dist_AB:.4f}")
print(f"  • Euclidean Distance A-C: {euclidean_dist_AC:.4f}")

print("\n\nInterpretation:")
print("  • Cosine similarity is high for A-B because they share the same topic (cats)")
print("    and have similar word distributions, despite different lengths.")
print("  • Euclidean distance is larger for A-B because doc_B has more words,")
print("    leading to a larger magnitude difference in the vector space.")
print("  • Cosine similarity is low for A-C and B-C because they discuss different topics.")

#### **Analysis for Task 5**

**Explain why the Cosine Similarity between `doc_A` and `doc_B` is high, while their Euclidean Distance is relatively large:**

Cosine similarity measures the angle between two vectors, regardless of their magnitude (length). Since `doc_A` and `doc_B` discuss the same topic (cats), they share many common words ("the", "cat", "sat", "on", "mat") and have similar word distributions. Even though `doc_B` is longer and contains additional words ("dog", "chased"), the relative proportions of the shared words are similar, resulting in a small angle between the vectors and thus a high cosine similarity (about 0.92 with the default `CountVectorizer` settings).

Euclidean distance, on the other hand, measures the straight-line distance between two points in the vector space, which is sensitive to magnitude differences. Since `doc_B` contains more words than `doc_A`, its count vector has larger values in several dimensions. The Euclidean distance accumulates these differences (about 2.65), which makes it nearly as large as the distance between `doc_A` and the unrelated `doc_C` (3.00).

**Which metric (Cosine Similarity or Euclidean Distance) do your results suggest is better for identifying documents with similar topics, regardless of their length? Justify your answer based on your calculations.**

Based on the results, **Cosine Similarity is better** for identifying documents with similar topics regardless of their length.

Evidence from the calculations:

1. **doc_A and doc_B** (same topic, different lengths):
   - Cosine similarity is high (about 0.92), correctly identifying them as similar in topic.
   - Euclidean distance is relatively large (about 2.65), which could mislead us into thinking they are dissimilar.
2. **doc_A and doc_C** (different topics):
   - Cosine similarity is low (about 0.32, nonzero only because both documents contain "the"), correctly identifying them as dissimilar in topic.
   - Euclidean distance is 3.00, only slightly larger than for the same-topic pair A-B, so it barely separates the two cases.
3. **doc_B and doc_C** (different topics):
   - Cosine similarity is low (about 0.36), correctly identifying them as dissimilar.
   - Euclidean distance is the largest of the three (about 4.69), inflated by the length of `doc_B` as well as by the topic difference.

**Conclusion:** Cosine similarity effectively captures topic similarity by focusing on the direction (word distribution patterns) rather than magnitude (document length). This makes it well suited to text analysis where we want to identify documents discussing similar topics regardless of whether one is a short summary and another is a detailed article on the same subject. Euclidean distance, being sensitive to magnitude, tends to penalize documents of different lengths even when they share similar topics, making it less suitable for topic-based document retrieval.
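A related check, sketched below, is that Euclidean distance stops depending on document length once the vectors are scaled to unit length: for unit vectors, $\|u - v\|^2 = 2 - 2\cos(u, v)$, so the two metrics then rank pairs identically. The count vectors are typed in by hand, assuming `CountVectorizer`'s default tokenization and the alphabetical vocabulary order `[cat, chased, dog, into, launched, mat, on, rocket, sat, space, the]`; the helper names are illustrative.

```python
import math

# Hand-typed count vectors for doc_A, doc_B, doc_C (see assumptions above)
vectors = {
    "A": [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 2],
    "B": [2, 1, 1, 0, 0, 1, 1, 0, 1, 0, 4],
    "C": [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1],
}

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

unit = {name: l2_normalize(v) for name, v in vectors.items()}

results = {}
for p, q in [("A", "B"), ("A", "C"), ("B", "C")]:
    cos = dot(unit[p], unit[q])          # cosine similarity of the originals
    dist = euclidean(unit[p], unit[q])   # distance after normalization
    results[(p, q)] = (cos, dist)
    print(f"{p}-{q}: cosine={cos:.4f}, normalized distance={dist:.4f}, "
          f"sqrt(2 - 2cos)={math.sqrt(2 - 2 * cos):.4f}")
```

After normalization the same-topic pair A-B is clearly the closest pair, which is the ordering that cosine similarity gives directly.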