# 🔤 Homework 6: Sentence Embeddings
**MIS 769 - Big Data Analytics for Business | Spring 2026**

**Points:** 20 | **Due:** See WebCampus for deadline

**Author:** Richard Young, Ph.D. | UNLV Lee Business School

**Compute:** CPU (free tier); GPU recommended for faster encoding

---

## What You'll Learn

1. Explain the difference between word and sentence embeddings
2. Use Sentence Transformers to encode text semantically
3. Compute semantic similarity between documents
4. Build a simple semantic search system

---

## Understanding Cosine Similarity

Sentence embeddings work by converting text into high-dimensional vectors. **Similar meanings** produce vectors pointing in **similar directions**. The diagram below shows how cosine similarity measures this by comparing the angle between two vectors; it is the foundation of semantic search.

![Cosine similarity diagram](cosine_similarity.svg)
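
To make the idea concrete, here is a minimal sketch that computes cosine similarity straight from its definition: the dot product of two vectors divided by the product of their lengths. The 2-D vectors are made-up toy values, not real embeddings, and `cosine` is a hypothetical helper name used only for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D vectors (illustrative values only, not model output)
v_same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # parallel vectors
v_orthogonal = cosine([1.0, 0.0], [0.0, 1.0])        # perpendicular vectors
v_opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # opposite directions

print(f"Same direction: {v_same_direction:.2f}")
print(f"Orthogonal:     {v_orthogonal:.2f}")
print(f"Opposite:       {v_opposite:.2f}")
```

Note that the score depends only on direction: doubling the length of a vector leaves its cosine similarity unchanged.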

---

## Part 1: Setup and Model Loading (3 points)

In [None]:
!pip install sentence-transformers datasets pandas numpy scikit-learn matplotlib seaborn -q

from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

print("📊 MODEL LOADED")
print("=" * 60)
print("Model: all-MiniLM-L6-v2")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"Max sequence length: {model.max_seq_length}")

In [None]:
from datasets import load_dataset

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train[:2000]")
df = dataset.to_pandas()

print(f"✅ Loaded {len(df):,} reviews")
print("\nSample review (first 200 chars):")
print(df['text'].iloc[0][:200] + "...")

## Part 2: Text Encoding (4 points)

In [None]:
# Encode single sentences
sentences = [
    "This movie was absolutely fantastic!",
    "I loved this film, it was great!",
    "The movie was terrible and boring.",
    "The bank is by the river.",
    "I need to visit the bank for a loan."
]

embeddings = model.encode(sentences)

print("📊 SENTENCE EMBEDDINGS")
print("=" * 60)
print(f"Number of sentences: {len(sentences)}")
print(f"Embedding shape: {embeddings.shape}")
print("\nFirst embedding (first 10 values):")
print(embeddings[0][:10])

In [None]:
# Encode corpus (batch processing)
print("Encoding reviews (this may take 1-2 minutes)...")

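# Keep only the first 512 characters of each review to limit encoding time and memory
# (the model also truncates any input longer than model.max_seq_length tokens)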
texts = [text[:512] for text in df['text'].tolist()]
corpus_embeddings = model.encode(texts, show_progress_bar=True)

print(f"\n✅ Encoded {len(corpus_embeddings):,} reviews")
print(f"Corpus embeddings shape: {corpus_embeddings.shape}")
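
Batch processing means the texts go through the model in fixed-size groups rather than one at a time or all at once. The sketch below only illustrates that grouping, using a hypothetical `batched` helper and placeholder strings; it is not a description of how the library works internally.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-in corpus: 10 placeholder "reviews"
toy_texts = [f"review {i}" for i in range(10)]

batches = list(batched(toy_texts, batch_size=4))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # the last batch holds whatever is left over
```

In practice you would rarely batch by hand: `model.encode` accepts a `batch_size` argument, and lowering it is a common first step if encoding runs out of memory.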

## Part 3: Similarity Computation (5 points)

In [None]:
# Compute similarity between sentence pairs
print("🔍 SEMANTIC SIMILARITY")
print("=" * 60)

pairs = [
    (0, 1),  # Similar sentiment, different words
    (0, 2),  # Opposite sentiment
    (3, 4),  # Different meanings of "bank"
]

for i, j in pairs:
    sim = cosine_similarity([embeddings[i]], [embeddings[j]])[0][0]
    print(f"\nSentence A: \"{sentences[i]}\"")
    print(f"Sentence B: \"{sentences[j]}\"")
    print(f"Similarity: {sim:.2f}")

In [None]:
# Create a similarity matrix
similarity_matrix = cosine_similarity(embeddings)

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
# Shorten only the sentences that exceed 30 characters so axis labels stay readable
labels = [s if len(s) <= 30 else s[:30] + "..." for s in sentences]
sns.heatmap(similarity_matrix,
            xticklabels=labels,
            yticklabels=labels,
            annot=True, fmt=".2f", cmap="YlOrRd")
plt.title("Sentence Similarity Matrix")
plt.tight_layout()
plt.savefig('similarity_matrix.png', dpi=150)
plt.show()
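
As a rough sketch of what a pairwise cosine similarity matrix contains, the example below builds one by hand with NumPy: scale every row to unit length, then take a single matrix product. The vectors are random toy values rather than model output, and this illustrates the math only; it makes no claim about how scikit-learn implements `cosine_similarity`.

```python
import numpy as np

# Toy "embeddings": 4 vectors with 3 dimensions (random values, not model output)
rng = np.random.default_rng(seed=0)
toy_embeddings = rng.normal(size=(4, 3))

# Scale each row to unit length
norms = np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
unit_vectors = toy_embeddings / norms

# For unit vectors the dot product equals the cosine similarity,
# so one matrix product yields every pairwise score
toy_similarity = unit_vectors @ unit_vectors.T

print(toy_similarity.shape)
print(np.allclose(np.diag(toy_similarity), 1.0))      # each vector matches itself
print(np.allclose(toy_similarity, toy_similarity.T))  # the matrix is symmetric
```

These two properties explain the heatmap's layout: the diagonal is always 1.00, and the upper and lower triangles mirror each other.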

## Part 4: Semantic Search (5 points)

In [None]:
def semantic_search(query, corpus_embeddings, corpus_texts, top_k=5):
    """Find most similar documents to a query."""
    query_embedding = model.encode([query])
    similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'score': similarities[idx],
            'text': corpus_texts[idx][:200] + "..."
        })
    return results

# Test semantic search
query = "movies with surprising plot twists"

print("🔎 SEMANTIC SEARCH RESULTS")
print("=" * 60)
print(f"Query: \"{query}\"\n")

results = semantic_search(query, corpus_embeddings, texts)

print("Top 5 Results:")
for i, r in enumerate(results, 1):
    print(f"\n{i}. [Score: {r['score']:.2f}]")
    print(f"   {r['text']}")
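
The ranking step in `semantic_search` sorts every similarity score and then keeps the first `top_k`, which is perfectly adequate for 2,000 reviews. As a hedged sketch for larger corpora, `np.argpartition` can pick out the highest scores without sorting the whole array. The helper name `top_k_indices` and the score values below are hypothetical and exist only for this illustration.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    k = min(k, len(scores))
    # argpartition places the k largest scores in the last k slots (unordered)
    candidates = np.argpartition(scores, -k)[-k:]
    # sort only those k candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]

# Made-up similarity scores for a 6-document corpus
toy_scores = np.array([0.10, 0.75, 0.40, 0.90, 0.05, 0.60])

best = top_k_indices(toy_scores, k=3)
print(best.tolist())
```

For distinct scores the result matches `np.argsort(scores)[::-1][:k]`, so the swap would change speed, not rankings.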

In [None]:
# Try more queries
queries = [
    "heartwarming family movies",
    "scary horror films",
    "romantic comedy with happy ending"
]

for query in queries:
    print(f"\n{'='*60}")
    print(f"Query: \"{query}\"")
    print("-" * 60)
    results = semantic_search(query, corpus_embeddings, texts, top_k=3)
    for i, r in enumerate(results, 1):
        print(f"{i}. [{r['score']:.2f}] {r['text'][:100]}...")

## Part 5: Analysis (3 points)

In [None]:
# Compare semantic search vs keyword search
query = "films I didn't enjoy"

print("📊 SEMANTIC vs KEYWORD SEARCH")
print("=" * 60)
print(f"Query: \"{query}\"\n")

# Semantic search
print("SEMANTIC SEARCH (embedding-based):")
semantic_results = semantic_search(query, corpus_embeddings, texts, top_k=3)
for i, r in enumerate(semantic_results, 1):
    print(f"  {i}. [{r['score']:.2f}] {r['text'][:80]}...")

# Simple keyword search: case-insensitive substring match on "enjoy"
print("\nKEYWORD SEARCH (substring match on 'enjoy'):")
keyword_results = [t[:80] for t in texts if 'enjoy' in t.lower()][:3]
if keyword_results:
    for i, r in enumerate(keyword_results, 1):
        print(f"  {i}. {r}...")
else:
    print("  No keyword matches found!")

print("\n💡 Notice: Semantic search ranks reviews by meaning, not just by shared keywords!")
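
The keyword search above checks whether the string `enjoy` appears anywhere in a review, so it also matches words such as "enjoyed". The sketch below contrasts that substring check with a whole-word match on three made-up reviews (illustrative text, not taken from the dataset).

```python
import re

# Made-up reviews for illustration only
toy_reviews = [
    "I really enjoyed this film.",
    "Not a film I could enjoy at all.",
    "A dull, tedious two hours.",
]

keyword = "enjoy"

# Substring match: also hits "enjoyed"
substring_hits = [r for r in toy_reviews if keyword in r.lower()]

# Whole-word match: \b marks a word boundary
pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
whole_word_hits = [r for r in toy_reviews if pattern.search(r)]

print(f"Substring hits:  {len(substring_hits)}")
print(f"Whole-word hits: {len(whole_word_hits)}")
```

Neither variant finds the third review, which is negative but never uses the word, and the substring version returns a clearly positive review. Keep both failure modes in mind when comparing against the semantic results.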

---

## Questions to Answer

**Q1:** What makes two sentences similar according to the model?

*Your answer:*

**Q2:** Find an example where semantic search works better than keyword search.

*Your answer:*

**Q3:** Find an example where the model fails. Why might this happen?

*Your answer:*

**Q4:** How would you use this in a business application?

*Your answer:*

---

## Submission Checklist

| Item | Points | Done? |
|------|--------|-------|
| Part 1: Model loading | 3 | ☐ |
| Part 2: Text encoding | 4 | ☐ |
| Part 3: Similarity computation | 5 | ☐ |
| Part 4: Semantic search | 5 | ☐ |
| Part 5: Analysis | 3 | ☐ |
| **Total** | **20** | |