- Text Classification: Assigning predefined categories to text based on its content.
- Clustering: Grouping similar texts together based on their content.
- Semantic Textual Similarity: Measuring how similar two pieces of text are in meaning.
- Bitext Mining: Finding parallel sentences in different languages.
- Reranking: Ordering a list of texts based on relevance to a query.
- Pair Classification: Determining if two texts are related or not.
- Multilabel Classification: Assigning multiple categories to a single piece of text.
- Instruction Reranking: Ranking texts based on how well they follow given instructions.

In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
import numpy as np
from datasets import load_dataset

# Load the embedding model
model = SentenceTransformer('Qwen/Qwen3-Embedding-0.6B')



In [2]:
# 1. TEXT CLASSIFICATION
# Classify movie reviews as positive or negative

print("=" * 50)
print("1. TEXT CLASSIFICATION")
print("=" * 50)

# Sample data
train_texts = [
    "This movie was absolutely fantastic! I loved every minute.",
    "Terrible film, waste of time and money.",
    "An amazing masterpiece with brilliant acting.",
    "Boring and predictable. Don't recommend.",
    "Best movie I've seen this year!",
    "Complete disaster, awful in every way."
]
train_labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

test_texts = [
    "Great film with excellent performances.",
    "Not worth watching, very disappointing."
]

# Generate embeddings
train_embeddings = model.encode(train_texts)
test_embeddings = model.encode(test_texts)

# Train classifier
classifier = LogisticRegression()
classifier.fit(train_embeddings, train_labels)

# Predict
predictions = classifier.predict(test_embeddings)
for text, pred in zip(test_texts, predictions):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"Text: {text}")
    print(f"Predicted: {sentiment}\n")


1. TEXT CLASSIFICATION
Text: Great film with excellent performances.
Predicted: Positive

Text: Not worth watching, very disappointing.
Predicted: Negative
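
The cell above prints only hard labels. The following is a minimal sketch of how class probabilities could be inspected alongside them. It uses synthetic vectors in place of real embeddings, so the blob data, `dim`, and every `demo_*` name are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 16  # assumed stand-in for the real embedding dimension

# Synthetic "embeddings": two well-separated blobs, one per class
demo_pos = rng.normal(loc=1.0, scale=0.3, size=(6, dim))
demo_neg = rng.normal(loc=-1.0, scale=0.3, size=(6, dim))
demo_X = np.vstack([demo_pos, demo_neg])
demo_y = np.array([1] * 6 + [0] * 6)  # 1=positive, 0=negative

demo_clf = LogisticRegression().fit(demo_X, demo_y)

# Two unseen points, one near each blob
demo_new = np.vstack([np.full(dim, 0.9), np.full(dim, -0.9)])
demo_pred = demo_clf.predict(demo_new)
demo_proba = demo_clf.predict_proba(demo_new)  # columns follow demo_clf.classes_, here [0, 1]

for pred, proba in zip(demo_pred, demo_proba):
    print(f"Predicted: {pred}  P(negative)={proba[0]:.3f}  P(positive)={proba[1]:.3f}")
```

With only six training sentences, as in the cell above, probabilities near 0.5 are a useful hint that a prediction should not be trusted much.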



In [None]:
# 2. CLUSTERING
# Group similar news articles

print("=" * 50)
print("2. CLUSTERING")
print("=" * 50)

texts = [
    "Apple announces new iPhone with advanced AI features",
    "Stock market reaches all-time high today",
    "New study shows benefits of Mediterranean diet",
    "Tech giant unveils revolutionary smartphone technology",
    "Wall Street celebrates record-breaking trading day",
    "Researchers discover health benefits of olive oil"
]

embeddings = model.encode(texts)

# Cluster into 3 groups
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(embeddings)

for text, cluster in zip(texts, clusters):
    print(f"Cluster {cluster}: {text}")


2. CLUSTERING
Cluster 2: Apple announces new iPhone with advanced AI features
Cluster 1: Stock market reaches all-time high today
Cluster 0: New study shows benefits of Mediterranean diet
Cluster 1: Tech giant unveils revolutionary smartphone technology
Cluster 1: Wall Street celebrates record-breaking trading day
Cluster 0: Researchers discover health benefits of olive oil
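
The number of clusters was fixed at 3 above. When the right number is not known in advance, one common heuristic is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs, not on real embeddings; the data, `dim`, and the range of k values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
dim = 8  # assumed stand-in for the real embedding dimension

# Three synthetic "topic" blobs, each centred on a different axis
centers = 10.0 * np.eye(dim)[:3]
demo_X = np.vstack([c + rng.normal(scale=0.5, size=(10, dim)) for c in centers])

# Fit KMeans for several k and record the mean silhouette score of each fit
silhouettes = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(demo_X)
    silhouettes[k] = silhouette_score(demo_X, labels)

best_k = max(silhouettes, key=silhouettes.get)
for k, score in silhouettes.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

On real sentence embeddings the scores are usually much closer together than on clean blobs like these, so the result is better treated as a hint than as a decision.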


In [6]:
# 3. SEMANTIC TEXTUAL SIMILARITY
# Measure similarity between sentence pairs

print("\n" + "=" * 50)
print("3. SEMANTIC TEXTUAL SIMILARITY")
print("=" * 50)

sentence_pairs = [
    ("A man is playing guitar", "A person is playing a musical instrument"),
    ("A dog is running in the park", "The cat is sleeping on the couch"),
    ("The weather is sunny today", "It's a bright and clear day outside")
]

for sent1, sent2 in sentence_pairs:
    emb1 = model.encode([sent1])
    emb2 = model.encode([sent2])
    similarity = cosine_similarity(emb1, emb2)[0][0]
    print(f"Sentence 1: {sent1}")
    print(f"Sentence 2: {sent2}")
    print(f"Similarity: {similarity:.4f}\n")


3. SEMANTIC TEXTUAL SIMILARITY
Sentence 1: A man is playing guitar
Sentence 2: A person is playing a musical instrument
Similarity: 0.7289

Sentence 1: A dog is running in the park
Sentence 2: The cat is sleeping on the couch
Similarity: 0.2953

Sentence 1: The weather is sunny today
Sentence 2: It's a bright and clear day outside
Similarity: 0.8234
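
For reference, the score printed above is the cosine of the angle between the two embedding vectors: their dot product divided by the product of their lengths. A small NumPy sketch of that formula on toy vectors (the helper name `cosine` is introduced here for illustration only):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine([1, 1], [-1, -1])            # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

The value depends only on direction, not on vector length, which is why it ranges from -1 to 1.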



In [8]:
# 4. BITEXT MINING
# Find parallel sentences in different languages

print("=" * 50)
print("4. BITEXT MINING")
print("=" * 50)

english_sentences = [
    "Hello, how are you?",
    "The weather is nice today",
    "I love learning new languages"
]

spanish_candidates = [
    "El clima está agradable hoy",
    "Me encanta aprender nuevos idiomas",
    "Hola, ¿cómo estás?",
    "La comida es deliciosa",
    "Estoy muy feliz hoy",
    "¿Dónde está la biblioteca?",
    "Buenos días, señor",
    "El perro corre en el parque"
]

en_embeddings = model.encode(english_sentences)
es_embeddings = model.encode(spanish_candidates)

# Find best match for each English sentence
for i, en_sent in enumerate(english_sentences):
    similarities = cosine_similarity([en_embeddings[i]], es_embeddings)[0]
    best_match_idx = np.argmax(similarities)
    print(f"EN: {en_sent}")
    print(f"ES: {spanish_candidates[best_match_idx]}")
    print(f"Score: {similarities[best_match_idx]:.4f}\n")


4. BITEXT MINING
EN: Hello, how are you?
ES: Hola, ¿cómo estás?
Score: 0.8297

EN: The weather is nice today
ES: El clima está agradable hoy
Score: 0.8756

EN: I love learning new languages
ES: Me encanta aprender nuevos idiomas
Score: 0.9278
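
Taking the `argmax` in one direction always returns some candidate, even when no true translation is in the pool. One possible safeguard is to keep only mutual best matches, where the source sentence picks the target and the target also picks that source. The sketch below works on a hand-written similarity matrix; the numbers and the name `demo_sim` are made up for illustration.

```python
import numpy as np

# Rows: 3 source sentences, columns: 4 target candidates (made-up similarities)
demo_sim = np.array([
    [0.20, 0.10, 0.80, 0.30],
    [0.90, 0.20, 0.10, 0.40],
    [0.85, 0.30, 0.20, 0.10],  # its best target (column 0) prefers source 1
])

forward = demo_sim.argmax(axis=1)   # best target for each source
backward = demo_sim.argmax(axis=0)  # best source for each target

# Keep a pair only if both directions agree
mutual_pairs = [(int(i), int(j)) for i, j in enumerate(forward) if backward[j] == i]
print(mutual_pairs)
```

Source 2 is dropped here because its preferred target is claimed more strongly by source 1.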



In [None]:
# 5. RERANKING
# Order documents by relevance to a query

print("=" * 50)
print("5. RERANKING")
print("=" * 50)

query = "machine learning algorithms"
documents = [
    "Deep learning is a subset of machine learning using neural networks",
    "The best pizza recipe includes mozzarella and basil",
    "Random forests and decision trees are popular ML algorithms",
    "Gardening tips for growing tomatoes in summer",
    "Support vector machines are supervised learning models",
    "How to bake a chocolate cake from scratch",
    "Use distilled water for better coffee taste"
]

query_embedding = model.encode([query])
doc_embeddings = model.encode(documents)

# Calculate similarities
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

# Rerank by similarity
ranked_indices = np.argsort(similarities)[::-1]

print(f"Query: {query}\n")
print("Ranked documents:")
for rank, idx in enumerate(ranked_indices, 1):
    print(f"{rank}. [{similarities[idx]:.4f}] {documents[idx]}")


5. RERANKING
Query: machine learning algorithms

Ranked documents:
1. [0.8366] Support vector machines are supervised learning models
2. [0.8101] Random forests and decision trees are popular ML algorithms
3. [0.7485] Deep learning is a subset of machine learning using neural networks
4. [0.3850] Use distilled water for better coffee taste
5. [0.3378] How to bake a chocolate cake from scratch
6. [0.3198] Gardening tips for growing tomatoes in summer
7. [0.3118] The best pizza recipe includes mozzarella and basil
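
The ranked list above has a visible drop between the third and fourth documents. One simple heuristic for choosing how many results to keep is to cut at the largest gap between consecutive scores. This is an assumption for illustration, not something the notebook does; the scores below are copied from the output above.

```python
import numpy as np

# Scores copied from the ranked output above, already in descending order
ranked_scores = np.array([0.8366, 0.8101, 0.7485, 0.3850, 0.3378, 0.3198, 0.3118])

# Difference between each score and the next one down the list
gaps = ranked_scores[:-1] - ranked_scores[1:]

# Keep everything before the largest drop
cutoff = int(np.argmax(gaps)) + 1
print(f"Keep the top {cutoff} documents")
```

The heuristic only makes sense when the list really contains a relevant and an irrelevant group; a fixed top-k or a score threshold may be more predictable otherwise.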


In [None]:
# 6. PAIR CLASSIFICATION
# Determine if two texts are related (e.g., question-answer pairs)

print("\n" + "=" * 50)
print("6. PAIR CLASSIFICATION")
print("=" * 50)

train_pairs = [
    # Related pairs (label = 1)
    ("What is Python?", "Python is a programming language", 1),
    ("How to bake a cake?", "Mix flour, eggs, and sugar then bake at 350°F", 1),
    ("What's the weather?", "It's sunny and 75 degrees", 1),
    ("How to code?", "Start with tutorials and practice regularly", 1),
    ("What is machine learning?", "ML is a branch of AI that learns from data", 1),
    ("How to cook pasta?", "Boil water, add pasta, cook for 10 minutes", 1),
    ("What is JavaScript?", "JavaScript is a programming language for web development", 1),
    ("How to lose weight?", "Eat healthy and exercise regularly", 1),
    ("What is photosynthesis?", "Plants convert sunlight into energy", 1),
    ("How to learn guitar?", "Practice chords and scales daily", 1),    
    # Unrelated pairs (label = 0)
    ("What is Python?", "The sky is blue", 0),
    ("Best restaurants nearby?", "Python is a programming language", 0),
    ("How to bake a cake?", "The capital of France is Paris", 0),
    ("What's the weather?", "Dogs are loyal animals", 0),
    ("How to code?", "Mount Everest is the tallest mountain", 0),
    ("What is machine learning?", "Pizza has cheese and tomatoes", 0),
    ("How to cook pasta?", "The ocean is very deep", 0),
    ("What is JavaScript?", "Birds can fly south for winter", 0),
    ("How to lose weight?", "Mars is a red planet", 0),
    ("What is photosynthesis?", "Cars need gasoline to run", 0),
]

test_pairs = [
    ("What is machine learning?", "ML is a branch of artificial intelligence"),
    ("How to cook pasta?", "The capital of France is Paris"),
    ("What is Python?", "It's a high-level programming language"),
    ("How to play piano?", "Bitcoin is a cryptocurrency"),
    ("What causes rain?", "Water evaporates and condenses in clouds"),
]

print(f"Training on {len(train_pairs)} examples")
print(f"  - Related pairs: {sum(1 for _, _, label in train_pairs if label == 1)}")
print(f"  - Unrelated pairs: {sum(1 for _, _, label in train_pairs if label == 0)}\n")

# APPROACH 1: Concatenate embeddings
print("APPROACH 1: Concatenated Embeddings")
print("-" * 50)
X_train = []
y_train = []
for q, a, label in train_pairs:
    combined = model.encode([q, a])
    X_train.append(np.concatenate(combined))
    y_train.append(label)

X_train = np.array(X_train)

pair_classifier = LogisticRegression(max_iter=1000)
pair_classifier.fit(X_train, y_train)

# Test
for q, a in test_pairs:
    combined = model.encode([q, a])
    X_test = np.concatenate(combined).reshape(1, -1)
    prediction = pair_classifier.predict(X_test)[0]
    probability = pair_classifier.predict_proba(X_test)[0]
    result = "Related" if prediction == 1 else "Not Related"
    print(f"Q: {q}")
    print(f"A: {a}")
    print(f"Classification: {result} (confidence: {max(probability):.2%})\n")

# APPROACH 2: Use cosine similarity as the only feature
# (a single weight to fit, so less prone to overfitting on 20 training pairs)
print("\n" + "=" * 50)
print("APPROACH 2: Similarity-Based Classification")
print("-" * 50)

X_train_sim = []
for q, a, label in train_pairs:
    q_emb = model.encode([q])
    a_emb = model.encode([a])
    similarity = cosine_similarity(q_emb, a_emb)[0][0]
    # Use similarity as feature
    X_train_sim.append([similarity])

X_train_sim = np.array(X_train_sim)

sim_classifier = LogisticRegression(max_iter=1000)
sim_classifier.fit(X_train_sim, y_train)

# Test
for q, a in test_pairs:
    q_emb = model.encode([q])
    a_emb = model.encode([a])
    similarity = cosine_similarity(q_emb, a_emb)[0][0]
    X_test_sim = np.array([[similarity]])
    
    prediction = sim_classifier.predict(X_test_sim)[0]
    probability = sim_classifier.predict_proba(X_test_sim)[0]
    result = "Related" if prediction == 1 else "Not Related"
    
    print(f"Q: {q}")
    print(f"A: {a}")
    print(f"Similarity: {similarity:.4f}")
    print(f"Classification: {result} (confidence: {max(probability):.2%})\n")

# APPROACH 3: Simple threshold (no training needed!)
print("\n" + "=" * 50)
print("APPROACH 3: Simple Similarity Threshold")
print("-" * 50)
print("(No training needed - just set a threshold)\n")

threshold = 0.5  # Adjust based on your needs

for q, a in test_pairs:
    q_emb = model.encode([q])
    a_emb = model.encode([a])
    similarity = cosine_similarity(q_emb, a_emb)[0][0]
    
    result = "Related" if similarity > threshold else "Not Related"
    
    print(f"Q: {q}")
    print(f"A: {a}")
    print(f"Similarity: {similarity:.4f}")
    print(f"Classification: {result}\n")



6. PAIR CLASSIFICATION
Training on 20 examples
  - Related pairs: 10
  - Unrelated pairs: 10

APPROACH 1: Concatenated Embeddings
--------------------------------------------------
Q: What is machine learning?
A: ML is a branch of artificial intelligence
Classification: Related (confidence: 54.94%)

Q: How to cook pasta?
A: The capital of France is Paris
Classification: Not Related (confidence: 62.21%)

Q: What is Python?
A: It's a high-level programming language
Classification: Related (confidence: 53.82%)

Q: How to play piano?
A: Bitcoin is a cryptocurrency
Classification: Related (confidence: 54.32%)

Q: What causes rain?
A: Water evaporates and condenses in clouds
Classification: Not Related (confidence: 52.98%)


APPROACH 2: Similarity-Based Classification
--------------------------------------------------
Q: What is machine learning?
A: ML is a branch of artificial intelligence
Similarity: 0.7525
Classification: Related (confidence: 57.31%)

Q: How to cook pasta?
A: The capital
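
The fixed `threshold = 0.5` in Approach 3 is a starting point. If a few labeled pairs are available, the cut-off can be picked from them instead. The sketch below tries the midpoints between sorted similarity scores and keeps the one with the highest accuracy; the scores and labels are made-up values, not results from the model.

```python
import numpy as np

# Made-up similarity scores and labels for eight labeled pairs (illustrative only)
demo_sims = np.array([0.75, 0.68, 0.81, 0.59, 0.30, 0.42, 0.25, 0.55])
demo_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1=related, 0=unrelated

# Candidate thresholds: midpoints between consecutive sorted scores
sorted_sims = np.sort(demo_sims)
candidates = (sorted_sims[:-1] + sorted_sims[1:]) / 2

# Accuracy of the rule "related if similarity > threshold" for each candidate
accuracies = [np.mean((demo_sims > t).astype(int) == demo_labels) for t in candidates]

best_idx = int(np.argmax(accuracies))
best_threshold = float(candidates[best_idx])
best_accuracy = float(accuracies[best_idx])
print(f"Best threshold: {best_threshold:.3f} (accuracy {best_accuracy:.0%})")
```

A threshold tuned on so few pairs will not transfer reliably; holding out some labeled pairs to check it is a sensible precaution.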

In [21]:
# 7. MULTILABEL CLASSIFICATION
# Assign multiple categories to articles

print("=" * 50)
print("7. MULTILABEL CLASSIFICATION")
print("=" * 50)

from sklearn.multioutput import MultiOutputClassifier

# Articles with multiple topics
train_articles = [
    "Apple releases new AI-powered iPhone with health tracking features",
    "Stock market analysis: Tech companies lead growth this quarter",
    "Study shows exercise and diet improve mental health",
    "Google announces cloud computing services for healthcare",
    "Economic forecast: Technology sector shows strong performance"
]

# Labels: [Technology, Finance, Health]
train_labels = [
    [1, 0, 1],  # Tech + Health
    [1, 1, 0],  # Tech + Finance
    [0, 0, 1],  # Health
    [1, 0, 1],  # Tech + Health
    [1, 1, 0]   # Tech + Finance
]

test_articles = [
    "Microsoft launches healthcare AI platform for financial analysis",
]

# Train
train_emb = model.encode(train_articles)
test_emb = model.encode(test_articles)

multilabel_clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
multilabel_clf.fit(train_emb, train_labels)

# Predict
predictions = multilabel_clf.predict(test_emb)
categories = ["Technology", "Finance", "Health"]

for i, article in enumerate(test_articles):
    print(f"Article: {article}")
    assigned_cats = [cat for cat, pred in zip(categories, predictions[i]) if pred == 1]
    print("Categories:", assigned_cats)
    print()

7. MULTILABEL CLASSIFICATION
Article: Microsoft launches healthcare AI platform for financial analysis
Categories: ['Technology', 'Health']
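
The prediction above leaves out Finance even though the test article mentions financial analysis. `predict` applies a 0.5 cut-off per label, so it can help to look at the per-label probabilities and choose a cut-off deliberately. The sketch below shows the mechanics on synthetic data; the features, labels, and the 0.3 cut-off are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic features; label k is "on" when feature k is positive
demo_X = rng.normal(size=(40, 8))
demo_Y = (demo_X[:, :3] > 0).astype(int)

demo_clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(demo_X, demo_Y)

# predict_proba returns a list with one (n_samples, 2) array per label
proba_per_label = demo_clf.predict_proba(demo_X[:2])

# Collect P(label = 1) into a single (n_samples, n_labels) matrix
label_proba = np.column_stack([p[:, 1] for p in proba_per_label])

demo_threshold = 0.3  # assumed cut-off, lower than the default 0.5
demo_assigned = (label_proba >= demo_threshold).astype(int)
print(np.round(label_proba, 3))
print(demo_assigned)
```

Lowering the cut-off trades precision for recall, so it will also assign more wrong labels.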



In [22]:
# 8. INSTRUCTION RERANKING
# Rank texts based on how well instructions are followed

print("\n" + "=" * 50)
print("8. INSTRUCTION RERANKING")
print("=" * 50)

instruction = "Write a formal business email requesting a meeting"

candidate_texts = [
    "Dear Mr. Smith, I would like to request a meeting to discuss the project timeline. Please let me know your availability. Best regards, John",
    "hey wanna meet up sometime? lmk",
    "To Whom It May Concern: I am writing to formally request a meeting at your earliest convenience. Sincerely, Jane Doe",
    "The weather is nice today",
    "I am reaching out to schedule a meeting to review our quarterly goals. Would next Tuesday work for you?"
]

# Encode instruction and candidates
instruction_emb = model.encode([instruction])
candidate_embs = model.encode(candidate_texts)

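# Score each candidate by cosine similarity to the instruction.
# This is a rough proxy: it measures semantic closeness to the instruction text,
# not whether the instruction was actually followed.
scores = cosine_similarity(instruction_emb, candidate_embs)[0]

# Rank candidates from most to least similar
ranked = np.argsort(scores)[::-1]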

print(f"Instruction: {instruction}\n")
print("Ranked responses:")
for rank, idx in enumerate(ranked, 1):
    print(f"{rank}. [Score: {scores[idx]:.4f}]")
    print(f"   {candidate_texts[idx]}\n")



8. INSTRUCTION RERANKING
Instruction: Write a formal business email requesting a meeting

Ranked responses:
1. [Score: 0.7923]
   To Whom It May Concern: I am writing to formally request a meeting at your earliest convenience. Sincerely, Jane Doe

2. [Score: 0.7576]
   Dear Mr. Smith, I would like to request a meeting to discuss the project timeline. Please let me know your availability. Best regards, John

3. [Score: 0.6535]
   I am reaching out to schedule a meeting to review our quarterly goals. Would next Tuesday work for you?

4. [Score: 0.5610]
   hey wanna meet up sometime? lmk

5. [Score: 0.3588]
   The weather is nice today
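
To compare rankings like the one above across models or prompts, the ordered list can be reduced to a single number. Average precision is one common choice. The sketch below assumes hand-assigned 0/1 relevance labels listed in ranked order; the labels and the helper name `average_precision` are hypothetical.

```python
def average_precision(relevance):
    """Average precision for a ranked list of 0/1 relevance labels (best rank first)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# Hypothetical judgments: 1 = acceptable response, 0 = not acceptable
ap_perfect = average_precision([1, 1, 0, 0, 0])  # all relevant items ranked first
ap_mixed = average_precision([1, 0, 1, 0])       # a relevant item ranked below an irrelevant one
print(f"{ap_perfect:.4f} {ap_mixed:.4f}")
```

A ranking that places every relevant item first scores 1.0, and each relevant item that slips below an irrelevant one lowers the score.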

