# Zero-Shot Labeling with FAISS + DeBERTa

Goal:
- Use Sentence-BERT embeddings with FAISS to select the Top-K = 22 candidate labels.
- Validate each candidate label with a DeBERTa zero-shot model.
- Result: a final multi-label set of labels, with no supervised training.
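
The two stages can be sketched on toy data before bringing in the real models. This is a minimal illustration, not the notebook's implementation: the function names, the 2-D vectors, the label names, and the entailment scores are all made up, and a plain lookup table stands in for the zero-shot model.

```python
def top_k_by_inner_product(query, label_vectors, k):
    """Stage 1: return the k label names with the largest inner product."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), name)
        for name, vec in label_vectors.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def validate(candidates, entail_score, threshold):
    """Stage 2: keep candidates whose entailment score reaches the threshold."""
    return [lab for lab in candidates if entail_score(lab) >= threshold]


# Hypothetical unit-length "embeddings" and entailment scores.
label_vectors = {
    "Roofing Services": (1.0, 0.0),
    "Plumbing Services": (0.6, 0.8),
    "Software Development": (0.0, 1.0),
}
toy_entailment = {"Roofing Services": 0.91, "Plumbing Services": 0.12}

query = (0.8, 0.6)  # stands in for the embedded company text
candidates = top_k_by_inner_product(query, label_vectors, k=2)
final_labels = validate(candidates, lambda lab: toy_entailment.get(lab, 0.0), 0.3)

print(candidates)    # retrieval order, best match first
print(final_labels)  # subset that passed validation
```

Retrieval is cheap and only narrows the label set. The validation step decides which labels are kept, so a label that is close in embedding space can still be rejected.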


In [None]:
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from tqdm.notebook import tqdm
import torch

## 1. Load the labels and build the FAISS index
Assumptions:
- `insurance_taxonomy.csv` or `.xlsx` contains a `label` column with the full list of labels.
- A Sentence-BERT model produces the embeddings.


In [None]:
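label_file = "insurance_taxonomy.xlsx"
# The taxonomy may be shipped as .xlsx or .csv; pick the reader by extension.
df_labels = pd.read_excel(label_file) if label_file.endswith(".xlsx") else pd.read_csv(label_file)
label_list = df_labels["label"].tolist()
print(f"Loaded {len(label_list)} labels.")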

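# Sentence-BERT encoder used for both the labels and the company texts.
emb_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Unit-length embeddings make the inner product equal to cosine similarity.
label_embeddings = emb_model.encode(label_list, show_progress_bar=True, normalize_embeddings=True)
label_embeddings = np.asarray(label_embeddings, dtype="float32")

# Exact (brute-force) inner-product index over all label embeddings.
embed_dim = label_embeddings.shape[1]
index = faiss.IndexFlatIP(embed_dim)
index.add(label_embeddings)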
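
`IndexFlatIP` ranks by raw inner product, which matches cosine similarity only when the vectors have unit length. The sketch below uses hand-picked 2-D vectors (hypothetical, not model output) to show how the two rankings can disagree when a label vector is simply longer. Some Sentence-BERT checkpoints already return unit-length vectors, in which case the rankings coincide.

```python
import math


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Hypothetical label vectors: one is long, the other points the same way as the query.
labels = {"long_vector": [3.0, 0.0], "aligned_vector": [1.0, 1.0]}
query = [1.0, 1.0]

raw = {name: dot(query, vec) for name, vec in labels.items()}
unit_query = l2_normalize(query)
cosine = {name: dot(unit_query, l2_normalize(vec)) for name, vec in labels.items()}

best_raw = max(raw, key=raw.get)
best_cosine = max(cosine, key=cosine.get)
print(best_raw, best_cosine)
```

Here the raw inner product prefers the long vector, while cosine similarity prefers the one aligned with the query.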

## 2. Load the DeBERTa zero-shot model

Recommendation: **MoritzLaurer/deberta-v3-base-mnli** or another model specialized for zero-shot classification.
If needed, a BART MNLI model (e.g. `facebook/bart-large-mnli`) can be swapped in.

In [None]:
zero_shot_model_name = "MoritzLaurer/deberta-v3-base-mnli"  # one of the NLI models tested
zero_shot_classifier = pipeline("zero-shot-classification",
                                model=zero_shot_model_name,
                                device=0 if torch.cuda.is_available() else -1,  # first GPU if present, else CPU
                                framework="pt")

## 3. Load the companies and apply the flow: FAISS -> DeBERTa zero-shot

- Relevant column: `description` (and, optionally, `business_tags`, `sector`, `category`, `niche`).
- The embedded text combines `description` with `niche` and `business_tags`; `sector` and `category` are only carried through to the output.
- *K = 22* candidate labels.
- DeBERTa zero-shot (`multi_label=True`) for final validation.
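
With `multi_label=True` each label is scored independently, so the scores need not sum to 1. The pipeline also returns labels sorted by score rather than in the order they were passed in. The sketch below shows the thresholding step on a hand-written result dictionary that mimics the pipeline's `labels`/`scores` layout. The scores, the label names, and the helper `filter_by_threshold` are invented for illustration.

```python
def filter_by_threshold(result, candidates, threshold):
    """Keep candidates scoring at least `threshold`, preserving candidate order."""
    label2score = dict(zip(result["labels"], result["scores"]))
    return [lab for lab in candidates if label2score.get(lab, 0.0) >= threshold]


# Hypothetical stand-in for a zero-shot pipeline result (sorted by score, descending).
toy_result = {
    "sequence": "Description: We repair and install roofs.",
    "labels": ["Roofing Services", "Plumbing Services", "Software Development"],
    "scores": [0.94, 0.35, 0.02],
}

# Candidate order as it might come back from the similarity search.
faiss_order = ["Plumbing Services", "Software Development", "Roofing Services"]

kept = filter_by_threshold(toy_result, faiss_order, threshold=0.3)
print(kept)
```

Mapping labels to scores through a dictionary avoids relying on the two orderings being the same.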


In [None]:
import ast

companies_file = "ml_insurance_challenge.csv"
df_comp = pd.read_csv(companies_file)

K = 22                  # number of candidate labels retrieved from FAISS
entail_threshold = 0.3  # minimum zero-shot score for a label to be kept

results = []
for i in tqdm(range(len(df_comp)), desc="Processing rows"):
    row = df_comp.iloc[i]

    # Missing descriptions become "" (not the string "nan") so the empty check below works.
    description = str(row["description"]).strip() if pd.notna(row["description"]) else ""
    # business_tags is a stringified list; literal_eval parses it without executing code.
    raw_tags = row["business_tags"]
    business_tags = ast.literal_eval(raw_tags) if isinstance(raw_tags, str) and raw_tags.startswith("[") else []
    sector = str(row["sector"]).strip() if pd.notna(row["sector"]) else ""
    category = str(row["category"]).strip() if pd.notna(row["category"]) else ""
    niche = str(row["niche"]).strip() if pd.notna(row["niche"]) else ""

    # Earlier prompt variants, kept for reference:
    # combined_text = f"{description}. Business Tags: {', '.join(business_tags)} Niche: {niche}.".strip()
    # combined_text = f"{description} . Business Tags: {', '.join(business_tags)}. Sector: {sector}. Category: {category}. Niche: {niche}.".strip()
    combined_text = f"Description: {description}. This business operates in the {niche} niche. Tags: {', '.join(business_tags)}."
    if not description:
        results.append({
            "description": description,
            "final_labels": []
        })
        continue

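    # Embed the company text and retrieve the K nearest labels by inner product.
    desc_emb = emb_model.encode(combined_text, convert_to_tensor=False, normalize_embeddings=True)
    desc_emb = desc_emb.astype(np.float32).reshape(1, -1)

    distances, indices = index.search(desc_emb, K)
    candidate_labels = [label_list[idx] for idx in indices[0]]
    candidate_scores = distances[0]  # inner-product similarities, best first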


    # Validate the FAISS candidates with the zero-shot model; each label is scored independently.
    classification = zero_shot_classifier(
        combined_text,
        candidate_labels,
        multi_label=True
    )

    label2score = {lab: score for lab, score in zip(classification["labels"], classification["scores"])}

    validated = [lab for lab in candidate_labels if label2score.get(lab, 0) >= entail_threshold]

    results.append({
        "description": description,
        "business_tags": business_tags,
        "sector": sector,
        "category": category,
        "niche": niche,
        "candidate_labels": candidate_labels,
        "candidate_sim_scores": candidate_scores.tolist(),  # plain Python floats serialize cleanly to CSV
        "zero_shot_scores": {lab: float(label2score.get(lab, 0)) for lab in candidate_labels},
        "final_labels": validated
    })

In [None]:
df_out = pd.DataFrame(results)
output_file = "final_cox_v15.csv"
df_out.to_csv(output_file, index=False)
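
List-valued columns such as `final_labels` are written to CSV as their string representation, so they come back as text when the file is read again. The round trip below is a small sketch using only the standard library and made-up rows. With pandas, passing `ast.literal_eval` through the `converters` argument of `read_csv` should have the same effect.

```python
import ast
import csv
import io

# Write two hypothetical result rows to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["description", "final_labels"])
writer.writeheader()
writer.writerow({"description": "We repair roofs.", "final_labels": ["Roofing Services"]})
writer.writerow({"description": "", "final_labels": []})

# Read the rows back: every cell is now a string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_value = rows[0]["final_labels"]

# literal_eval turns the stringified lists back into Python lists.
parsed = [ast.literal_eval(row["final_labels"]) for row in rows]
print(repr(raw_value))
print(parsed)
```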