## How to Download and Run Qwen3 
### What we’ll demo (locally, CPU)

- Text embeddings with Qwen/Qwen3-Embedding-0.6B (best choice for CPU).
- Similarity search (cosine) over a small doc set.
- Instruction-aware embeddings (query/task prefix).
- Multilingual example (English/Hindi).
- Reranking with a CPU-friendly Qwen3 reranker variant (sequence-classification head).

Why this model: Qwen3-Embedding (0.6B/4B/8B) is purpose-built for retrieval & reranking, trained with a multi-stage pipeline and supports instruction-aware inputs; it’s a new (June 5 2025) series that’s SoTA on many MTEB-style retrieval tasks. 


### Set up Environment (CPU Only)
#### Fresh venv (Windows PowerShell shown; macOS/Linux: replace python with python3)
- python -m venv .venv
- . .venv/Scripts/Activate.ps1    # Windows
- pip install --upgrade pip
- pip install torch --index-url https://download.pytorch.org/whl/cpu
- pip install transformers>=4.44.0 accelerate sentencepiece scipy numpy

### Features to be presented 

- Instruction-aware embeddings: show how adding Instruct: ... improves retrieval. (Paper/blog describe instruction-aware inputs.) 
- Multilingual retrieval: run a Hindi (or your choice) query—still retrieves the English “New Delhi” doc. (Multilingual capability is a core selling point.) 
- Lightweight CPU footprint: 0.6B runs on CPU for quick demos (your scripts above).
- Reranking stage: show base retrieval (cosine top-k) vs reranked order—the relevant doc pops to #1. The paper defines the yes/no likelihood method; our seq-cls variant emulates it. 

### Quick embedding demo (script) using "AutoModel"  

In [1]:
import torch, torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "Qwen/Qwen3-Embedding-0.6B"   # CPU-friendly

tok = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
model = AutoModel.from_pretrained(MODEL_ID)  # defaults to CPU

def last_token_pool(last_hidden_states, attention_mask):
    # Qwen3-Embedding uses last-token pooling (see README/blog)
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        seq_lengths = attention_mask.sum(dim=1) - 1
        bsz = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(bsz), seq_lengths]

def embed_texts(texts, instruction=None):
    # Instruction-aware format recommended by authors
    prefix = f"Instruct: {instruction}\nQuery: " if instruction else ""
    # Qwen3-Embedding expects the final token to be the end token
    texts = [f"{prefix}{t}{tok.eos_token}" for t in texts]
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
        pooled = last_token_pool(out.last_hidden_state, enc["attention_mask"])
        # L2 normalize (recommended)
        emb = F.normalize(pooled, p=2, dim=1)
    return emb

def cos_sim(a, b):
    return (a @ b.T).cpu().numpy()

# --- Demo data ---
instruction = "Given a web search query, retrieve relevant passages that answer the query"
queries = [
    "What is the capital of India?",
    "भारत की राजधानी क्या है?",  # Hindi
]
docs = [
    "New Delhi is the capital of India.",
    "Washington, D.C. is the capital of the United States.",
    "Mumbai is India’s financial center.",
]

# Compute embeddings
q_emb = embed_texts(queries, instruction=instruction)
d_emb = embed_texts(docs, instruction=instruction)

# Similarity search
S = cos_sim(q_emb, d_emb)
for i,q in enumerate(queries):
    order = S[i].argsort()[::-1]
    print(f"\nQuery: {q}")
    for j,idx in enumerate(order):
        print(f"  {j+1}. score={S[i,idx]:.3f} -> {docs[idx]}")

  from .autonotebook import tqdm as notebook_tqdm



Query: What is the capital of India?
  1. score=0.807 -> New Delhi is the capital of India.
  2. score=0.557 -> Mumbai is India’s financial center.
  3. score=0.479 -> Washington, D.C. is the capital of the United States.

Query: भारत की राजधानी क्या है?
  1. score=0.725 -> New Delhi is the capital of India.
  2. score=0.552 -> Mumbai is India’s financial center.
  3. score=0.428 -> Washington, D.C. is the capital of the United States.


In [2]:
import torch, math
from transformers import AutoTokenizer, AutoModel

mid = "Qwen/Qwen3-Embedding-0.6B"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModel.from_pretrained(mid)

def embed_text(texts):
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Per model card: pool on last token (EOS is auto-added in latest tokenizer)
    last_hidden = out.last_hidden_state
    # compute positions of last non-pad token in each sequence
    seq_lens = enc["attention_mask"].sum(dim=1) - 1
    pooled = last_hidden[torch.arange(last_hidden.size(0)), seq_lens]
    # L2-normalize for cosine similarity
    return torch.nn.functional.normalize(pooled, p=2, dim=1)
"""
queries = ["What is the capital of India?"]
docs    = ["New Delhi is the capital of India.",
           "The capital of United States of America is Washington D C.",
           "Bombay is the Financial district in India."]
"""
queries = ["Name the administrative capital city of the Republic of India."]
docs    = ["Washington D C is known for historical Museums.",
           "Delhi is a historic city with many monuments.",
           "The Indian rupee is the currency of India."]


E_q = embed_text(queries)
E_d = embed_text(docs)

# cosine similarity q vs each doc
sims = (E_q @ E_d.T).squeeze(0)
order = torch.argsort(sims, descending=True).tolist()

print("Query:", queries[0])
for i in order:
    print(f"{sims[i].item():.3f}  ->  {docs[i]}")

  from .autonotebook import tqdm as notebook_tqdm


Query: Name the administrative capital city of the Republic of India.
0.637  ->  Delhi is a historic city with many monuments.
0.506  ->  The Indian rupee is the currency of India.
0.453  ->  Washington D C is known for historical Museums.


### Qwen3 Reranking demo (script) Seq2Seq using "AutoModelForSequenceClassification"  

The original Qwen3-Reranker computes a yes/no log-odds score on (instruction, query, document). For a CPU-simple path, we’ll use a sequence-classification conversion of the 0.6B reranker—works nicely with standard AutoModelForSequenceClassification and returns a single relevance logit. (This conversion approach is documented in community repos & vLLM notes.)

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "tomaarsen/Qwen3-Reranker-0.6B-seq-cls"  # CPU-friendly variant

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)  # CPU by default
model.eval()

instruction = "Rerank documents for relevance to the query."
query = "Name the administrative capital city of the Republic of India."
docs = [
    "Delhi is a historic city with many monuments.",
    "India's seat of government is located in New Delhi.",
    "The Indian rupee is the currency of India.",
    "Washington D. C. is known for historical museums.",
]

pairs = [(
    f"Instruct: {instruction}\nQuery: {query}",
    d
) for d in docs]

enc = tok([q for q,d in pairs],
          [d for q,d in pairs],
          padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits.squeeze(-1)  # higher = more relevant

order = torch.argsort(logits, descending=True).tolist()
print(instruction)
print(docs)
print(query)
print("Reranked:")
for rank, idx in enumerate(order, 1):
    print(f"  {rank}. score={float(logits[idx]):.3f} -> {docs[idx]}")

Rerank documents for relevance to the query.
['Delhi is a historic city with many monuments.', "India's seat of government is located in New Delhi.", 'The Indian rupee is the currency of India.', 'Washington D. C. is known for historical museums.']
Name the administrative capital city of the Republic of India.
Reranked:
  1. score=1.398 -> Delhi is a historic city with many monuments.
  2. score=1.354 -> India's seat of government is located in New Delhi.
  3. score=-0.437 -> Washington D. C. is known for historical museums.
  4. score=-0.572 -> The Indian rupee is the currency of India.


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

mid = "tomaarsen/Qwen3-Reranker-0.6B-seq-cls"  # converted to seq-cls

tok = AutoTokenizer.from_pretrained(mid)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token or tok.sep_token or tok.cls_token
tok.padding_side = "left"

model = AutoModelForSequenceClassification.from_pretrained(mid)
model.config.pad_token_id = tok.pad_token_id
"""
query = "Best city to visit in India?"
docs  = [
    "New Delhi is the capital with historical sites.",
    "Washington D C is known for historical Museums.",
    "Bombay is a modern financial hub."
]
"""
query = "Name the administrative capital city of the Republic of India."
docs  = [
    "Indias seat of government is located in New Delhi.",
    "Washington D C is known for historical Museums.",
    "The Indian rupee is the currency of India."
]



# batch encode pairs
enc = tok([query]*len(docs), docs, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits.squeeze(-1)  # shape [batch]

order = torch.argsort(logits, descending=True).tolist()
print("Query:", query)
for i in order:
    print(f"{logits[i].item():.3f}  ->  {docs[i]}")

  from .autonotebook import tqdm as notebook_tqdm


Query: Name the administrative capital city of the Republic of India.
2.872  ->  Indias seat of government is located in New Delhi.
1.265  ->  The Indian rupee is the currency of India.
0.965  ->  Washington D C is known for historical Museums.


### Qwen3 Reranking with original yes/no format using "AutoModelForCausalLM"  

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

mid = "Qwen/Qwen3-Reranker-0.6B"  # official
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid)

# Ensure padding works for batching
if tok.pad_token is None:
    tok.pad_token = tok.eos_token or tok.sep_token or tok.cls_token
tok.padding_side = "left"
model.config.pad_token_id = tok.pad_token_id

# Minimal prompt template (matches common reranker usage)
def build_prompt(query, doc):
    # You can refine template per model card; core idea: ask yes/no relevance.
    return f"Query: {query}\nDocument: {doc}\nIs the document relevant to the query? Answer yes or no:"

def yes_no_scores(prompts):
    enc = tok(prompts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Take next-token logits at the last non-pad position
    last_pos = enc["attention_mask"].sum(dim=1) - 1
    logits_next = out.logits[torch.arange(out.logits.size(0)), last_pos]  # [batch, vocab]
    id_yes = tok.convert_tokens_to_ids(" yes")
    id_no  = tok.convert_tokens_to_ids(" no")
    # Fallback if leading space tokens aren't in vocab:
    if id_yes is None: id_yes = tok.convert_tokens_to_ids("yes")
    if id_no  is None: id_no  = tok.convert_tokens_to_ids("no")
    # Score = logit_yes - logit_no (or softmax prob_yes)
    yes = logits_next[:, id_yes]
    no  = logits_next[:, id_no]
    return (yes - no)
"""
query = "Best city to visit in India?"
docs  = [
    "New Delhi is the capital with historical sites.",
    "Washington D C is known for historical Museums.",
    "Bombay is a modern financial hub."
]
"""
query = "Name the administrative capital city of the Republic of India."
docs  = [
    "Indias seat of government is located in New Delhi.",
    "Delhi is a historic city with many monuments.",
    "Washington D C is known for historical Museums."
]

prompts = [build_prompt(query, d) for d in docs]
scores = yes_no_scores(prompts)

order = torch.argsort(scores, descending=True).tolist()
print("Query:", query)
for i in order:
    print(f"{scores[i].item():.3f}  ->  {docs[i]}")

Query: Name the administrative capital city of the Republic of India.
2.214  ->  Washington D C is known for historical Museums.
0.535  ->  Indias seat of government is located in New Delhi.
-9.568  ->  Delhi is a historic city with many monuments.


### Original yes/no scoring — Hugging Face (CPU)
#### What it does.
For each (query, document) pair, it builds a yes/no prompt and computes the conditional log-probability of the continuations " yes" and " no" (tokenized) given the prompt. It returns a score = logp(yes) − logp(no). Higher = more relevant.

Pick the smallest Qwen3 Instruct model you can run on CPU (it will still be slow). Example placeholders below; replace MODEL_ID with the smallest Qwen3 Instruct you have locally.

In [8]:
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# --- Choose a (small) Qwen3 Instruct model you can run locally on CPU ---
# Example placeholder (replace with your local model):
MODEL_ID = "Qwen/Qwen2-1.5B-Instruct"  # <-- replace with a Qwen3 Instruct if available locally
DEVICE = "cpu"
DTYPE = torch.float32

# Prompt template (concise & deterministic)
TEMPLATE = (
    "You are a reranker. Answer strictly with 'yes' or 'no'.\n"
    "Query: {query}\n"
    "Document: {doc}\n"
    "Answer:"
)

def load_model(model_id=MODEL_ID):
    tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    if tok.eos_token is None:
        tok.eos_token = tok.pad_token or "<|endoftext|>"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=DTYPE)
    model.to(DEVICE)
    model.eval()
    return tok, model

def candidate_logprob(tok, model, prompt:str, candidate:str, max_new_tokens=None):
    """
    Computes log p(candidate | prompt) by teacher-forcing the candidate tokens.
    """
    # Encode prompt and prompt+candidate
    enc_prompt = tok(prompt, return_tensors="pt")
    enc_full   = tok(prompt + candidate, return_tensors="pt")

    input_ids = enc_full.input_ids.to(DEVICE)
    attn_mask = enc_full.attention_mask.to(DEVICE)

    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attn_mask)
        logits = out.logits  # [B, T, V]
        # We need log-probs of candidate tokens only (positions after the prompt)
        # Shift so that logits[t-1] -> token[t]
        logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)  # [B, T-1, V]

    # Indices that correspond to candidate tokens
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_len   = input_ids.shape[1]
    cand_token_ids = input_ids[:, prompt_len:full_len]  # [B, Lc]

    # Gather logprobs at candidate token positions
    # Align: use the last T-1 vs tokens[1:]
    relevant_logprobs = logprobs[:, prompt_len-1:full_len-1, :].gather(
        dim=-1, index=cand_token_ids.unsqueeze(-1)
    ).squeeze(-1)  # [B, Lc]

    total_logprob = relevant_logprobs.sum(dim=1).item()
    return float(total_logprob)

def yes_no_score(tok, model, query:str, doc:str):
    prompt = TEMPLATE.format(query=query, doc=doc).strip() + " "
    # Try candidate variants with/without leading spaces to match tokenizer behavior
    candidates = ["yes", " yes", "Yes", " Yes"]
    cand_no    = ["no", " no", "No", " No"]

    def best_logp(alts):
        best = -1e30
        for a in alts:
            try:
                lp = candidate_logprob(tok, model, prompt, a)
                if lp > best:
                    best = lp
            except Exception:
                pass
        return best

    lp_yes = best_logp(candidates)
    lp_no  = best_logp(cand_no)
    return lp_yes - lp_no, lp_yes, lp_no, prompt

def rerank(tok, model, query, docs):
    scored = []
    for d in docs:
        s, lp_y, lp_n, prompt = yes_no_score(tok, model, query, d)
        scored.append((s, lp_y, lp_n, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored

if __name__ == "__main__":
    tok, model = load_model()

    query = "Name the administrative capital city of the Republic of India."
    docs = [
        "Delhi is a historic city with many monuments.",
        "India's seat of government is located in New Delhi.",
        "The Indian rupee is the currency of India.",
        "Washington D. C. is known for historical museums.",
    ]

    results = rerank(tok, model, query, docs)
    print(f"Query: {query}\n")
    for i,(score, lp_y, lp_n, d) in enumerate(results, 1):
        print(f"{i:2d}. score={score:.3f}  (logp_yes={lp_y:.3f}, logp_no={lp_n:.3f})  -> {d}")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Query: Name the administrative capital city of the Republic of India.

 1. score=0.000  (logp_yes=0.000, logp_no=0.000)  -> Delhi is a historic city with many monuments.
 2. score=0.000  (logp_yes=0.000, logp_no=0.000)  -> India's seat of government is located in New Delhi.
 3. score=0.000  (logp_yes=0.000, logp_no=0.000)  -> The Indian rupee is the currency of India.
 4. score=0.000  (logp_yes=0.000, logp_no=0.000)  -> Washington D. C. is known for historical museums.


#### FAQ & tweaks
Why check variants like " yes" vs "Yes"?

Tokenizers differ in whether the leading space is part of the token. Trying a small set of variants makes the scorer robust.

Can I batch with vLLM?

Yes—send multiple prompts (or use an array of messages) and process responses; just keep max_tokens=1 and temperature=0.

Why not generate the full “yes”/“no” string?

One token is faster and sufficient. If your model tends to emit punctuation, keep variants like " yes.".

How to integrate with your earlier embedding step?

Use embeddings to get top-k candidates, then apply yes/no scoring to those k docs.