## **This notebook** analyzes the layers of the BGE-M3 model by running them on the available data (user queries) and looks for "weak layers", meaning layers that duplicate the ones before them. The model pruning logic is built on the results of this code: the most important layers (according to the metrics) are kept, and the duplicates are removed.

In [None]:
# pip install -q transformers torch scikit-learn pandas numpy tqdm

import os, json, math, numpy as np, pandas as pd, torch
from tqdm import tqdm
from sklearn.metrics import silhouette_score
from transformers import AutoTokenizer, AutoModel

# ====== CONFIG ======
MODEL_NAME = "BAAI/bge-m3"
TEXT_COL, LANG_COL = "text", "lang"
SAMPLES_PER_LANG = 1000
MAX_LEN, BATCH = 36, 32         # MAX_LEN=256 is usually enough (36 is set here); tune BATCH to fit your VRAM
OUT = "./bge_m3_stats_final"
os.makedirs(OUT, exist_ok=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42); np.random.seed(42)

# ====== LOAD YOUR DF HERE ======
# df = pd.read_parquet("/path/your_dataset.parquet")  # text/lang columns for sample_equal; run_and_collect expects one column per language
# assert {TEXT_COL, LANG_COL}.issubset(df.columns)

def sample_equal(df, n):
    return (df.groupby(LANG_COL, group_keys=False)
              .apply(lambda g: g.sample(min(n, len(g)), random_state=42))
              .reset_index(drop=True))

def pick_layers(total, k=None):
    return list(range(1, total + 1))


# ====== MODEL (FP16) ======
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map='auto'
).eval()
NUM_LAYERS = model.config.num_hidden_layers
SEL = pick_layers(NUM_LAYERS, k=7)  # k is currently ignored: pick_layers returns every layer 1..NUM_LAYERS

def mean_pool(h, mask):
    m = mask.unsqueeze(-1).float()
    return (h*m).sum(1) / (m.sum(1).clamp(min=1e-6))

def iter_batches(df, bs):
    for i in range(0, len(df), bs):
        yield df.iloc[i:i+bs]

def run_and_collect(df):
    embeds = {L: [] for L in SEL}
    labels = []
    with torch.inference_mode():
        for lang in df.columns:         # each column = one language
            texts = df[lang].dropna().astype(str).tolist()
            if len(texts) > SAMPLES_PER_LANG:
                texts = np.random.choice(texts, SAMPLES_PER_LANG, replace=False).tolist()
            for i in range(0, len(texts), BATCH):
                batch = texts[i:i+BATCH]
                enc = tok(batch, padding=True, truncation=True, max_length=MAX_LEN, return_tensors="pt").to(device)
                out = model(**enc, output_hidden_states=True)
                hs = out.hidden_states
                am = enc["attention_mask"]
                for L in SEL:
                    pooled = mean_pool(hs[L], am).float().cpu().numpy()
                    embeds[L].append(pooled)
                labels.extend([lang]*len(batch))
    for L in SEL:
        embeds[L] = np.concatenate(embeds[L], 0)
    return embeds, np.array(labels)


def compute_layer_metrics(embeds, labels):
    rows_global = []
    for L in SEL:
        X = embeds[L]
        norms = np.linalg.norm(X, axis=1)
        # cos with prev (if the previous layer was stored)
        cos_prev = np.nan
        if (L-1) in embeds:
            Xp = embeds[L-1]
            x = X / (np.linalg.norm(X, axis=1, keepdims=True)+1e-9)
            y = Xp / (np.linalg.norm(Xp, axis=1, keepdims=True)+1e-9)
            cos_prev = float((x*y).sum(1).mean())
        # silhouette
        try:
            sil = float(silhouette_score(X, labels, metric="cosine"))
        except Exception:
            sil = float("nan")
        # anisotropy ≈ mean cos over random pairs
        N = len(X)
        pairs = min(20000, N*(N-1)//2) if N>1 else 0
        if pairs>0:
            i = np.random.randint(0, N, size=pairs)
            # offset by 1..N-1 so that j != i (self-pairs have cos = 1 and would inflate the mean)
            j = (i + np.random.randint(1, N, size=pairs)) % N
            x = X[i]; y = X[j]
            x /= (np.linalg.norm(x, axis=1, keepdims=True)+1e-9)
            y /= (np.linalg.norm(y, axis=1, keepdims=True)+1e-9)
            anis = float((x*y).sum(1).mean())
        else:
            anis = float("nan")

        rows_global.append({
            "scope":"global",
            "layer":L,
            "lang":"",
            "count":int(N),
            "mean_norm":float(norms.mean()),
            "std_norm":float(norms.std()),
            "cos_with_prev":cos_prev,
            "silhouette_by_lang_cosine":sil,
            "anisotropy_mean_cosine":anis
        })
    return pd.DataFrame(rows_global)

def compute_by_lang_metrics(embeds, labels):
    rows = []
    uniq = np.unique(labels)
    for L in SEL:
        X = embeds[L]
        norms = np.linalg.norm(X, axis=1)
        cos_prev_vec = None
        if (L-1) in embeds:
            Xp = embeds[L-1]
            x = X / (np.linalg.norm(X, axis=1, keepdims=True)+1e-9)
            y = Xp / (np.linalg.norm(Xp, axis=1, keepdims=True)+1e-9)
            cos_prev_vec = (x*y).sum(1)  # [N]
        for lg in uniq:
            m = (labels == lg)
            if not m.any(): continue
            rows.append({
                "scope":"lang",
                "layer":L,
                "lang":str(lg),
                "count":int(m.sum()),
                "mean_norm":float(norms[m].mean()),
                "std_norm":float(norms[m].std()),
                "cos_with_prev":(float(cos_prev_vec[m].mean()) if cos_prev_vec is not None else np.nan),
                "silhouette_by_lang_cosine":np.nan,
                "anisotropy_mean_cosine":np.nan
            })
    return pd.DataFrame(rows)

def add_heuristics(df_all):
    # fill NaN cosine values with zeros so they can be normalized
    t = df_all.copy()
    t["cos_with_prev_f"] = t["cos_with_prev"].fillna(0.0)
    def norm01(s):
        s = s.astype(float)
        lo, hi = s.min(), s.max()
        return (s - lo) / (hi - lo) if hi - lo > 1e-9 else s*0
    cos_n  = norm01(t.loc[t["scope"]=="global","cos_with_prev_f"])
    anis_n = norm01(t.loc[t["scope"]=="global","anisotropy_mean_cosine"].fillna(t["anisotropy_mean_cosine"].min()))
    sil_n  = norm01(t.loc[t["scope"]=="global","silhouette_by_lang_cosine"].fillna(t["silhouette_by_lang_cosine"].min()))
    # write the scores back by index
    t.loc[t["scope"]=="global","prune_score"] = 0.5*cos_n + 0.3*anis_n + 0.2*sil_n
    # flags (global scope only)
    med_anis = t.loc[t["scope"]=="global","anisotropy_mean_cosine"].median()
    t["flag_prune_candidate"]  = (t["scope"]=="global") & (t["cos_with_prev_f"]>=0.98) & (t["anisotropy_mean_cosine"]>=med_anis)
    t["flag_freeze_candidate"] = (t["scope"]=="global") & (t["cos_with_prev_f"]>=0.95) & (~t["flag_prune_candidate"])
    return t.drop(columns=["cos_with_prev_f"])

def save_side_artifacts(embeds, labels):
    # embeddings.npz + manifest.json (optional)
    layers = np.array(sorted(embeds.keys()), dtype=np.int32)
    mats = [embeds[L] for L in layers]
    E = np.stack(mats, axis=0)  # [L, N, D]
    np.savez_compressed(f"{OUT}/embeddings.npz", embeddings=E, labels=labels, layers=layers)
    with open(f"{OUT}/manifest.json","w") as f:
        json.dump({"model":MODEL_NAME,"layers":layers.tolist(),"N":int(E.shape[1]),"D":int(E.shape[2]),
                   "max_len":MAX_LEN,"batch":BATCH}, f, indent=2)

def main(df):
    embeds, labels = run_and_collect(df)
    g = compute_layer_metrics(embeds, labels)         # scope=global
    bl = compute_by_lang_metrics(embeds, labels)      # scope=lang
    final = add_heuristics(pd.concat([g, bl], ignore_index=True))
    # column order and saving a single CSV
    cols = ["scope","layer","lang","count","mean_norm","std_norm","cos_with_prev",
            "silhouette_by_lang_cosine","anisotropy_mean_cosine","prune_score",
            "flag_prune_candidate","flag_freeze_candidate"]
    final = final.reindex(columns=cols)
    final.sort_values(["scope","layer","lang"], inplace=True, kind="mergesort")
    final.to_csv(f"{OUT}/final_analysis.csv", index=False)
    # auxiliary files (optional; handy for offline checks)
    save_side_artifacts(embeds, labels)
    print("✅ Saved:")
    print(f"- {OUT}/final_analysis.csv  (SINGLE CSV report)")
    print(f"- {OUT}/embeddings.npz, {OUT}/manifest.json")

# ==== USAGE ====
df = pd.read_parquet("/content/transef_lerning_merged_10_langueges.parquet")
main(df)


✅ Saved:
- ./bge_m3_stats_final/final_analysis.csv  (SINGLE CSV report)
- ./bge_m3_stats_final/embeddings.npz, ./bge_m3_stats_final/manifest.json
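
The saved `embeddings.npz` makes it possible to re-check the `cos_with_prev` column without rerunning the model. A minimal sketch, assuming the file layout written by `save_side_artifacts` (arrays `embeddings` of shape [L, N, D] and `layers` in ascending order); the helper name `cos_with_prev_offline` is hypothetical and not part of the notebook:

```python
import numpy as np

def cos_with_prev_offline(npz_path):
    """Mean cosine between each saved layer and the previous saved layer.

    Assumes the layout written by save_side_artifacts:
    embeddings [L, N, D] and layers [L] in ascending order.
    """
    data = np.load(npz_path)
    E = data["embeddings"].astype(np.float64)   # [L, N, D]
    layers = data["layers"]
    # L2-normalise every vector; the epsilon guards against zero vectors
    E = E / (np.linalg.norm(E, axis=-1, keepdims=True) + 1e-9)
    result = {}
    for k in range(1, len(layers)):
        if layers[k] - layers[k - 1] != 1:
            continue  # only compare directly adjacent layers
        # row-wise dot product of unit vectors = cosine, averaged over samples
        result[int(layers[k])] = float((E[k] * E[k - 1]).sum(-1).mean())
    return result
```

Calling `cos_with_prev_offline("./bge_m3_stats_final/embeddings.npz")` should approximately reproduce the global `cos_with_prev` values for layers 2 and up.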


In [None]:
analys = pd.read_csv("/content/bge_m3_stats_final/final_analysis.csv")

In [None]:
analys.loc[(analys['scope']=='global') & (~analys['flag_prune_candidate']), 'layer']


index,layer
0,1
1,2
12,13
14,15
15,16
16,17
17,18
18,19
19,20
20,21
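
The filter above depends on `flag_prune_candidate`, which `add_heuristics` sets when a layer is nearly identical to its predecessor (`cos_with_prev >= 0.98`) and its anisotropy is at or above the median across layers. A minimal stand-alone sketch of that rule on made-up numbers (the function name `flag_layers` and the sample values are illustrative, not taken from the run):

```python
from statistics import median

def flag_layers(stats, prune_cos=0.98, freeze_cos=0.95):
    """stats: {layer: (cos_with_prev or None, anisotropy)} -> {layer: flag}."""
    med_anis = median(a for _, a in stats.values())
    flags = {}
    for layer, (cos_prev, anis) in stats.items():
        cos_prev = 0.0 if cos_prev is None else cos_prev  # first layer has no predecessor
        if cos_prev >= prune_cos and anis >= med_anis:
            flags[layer] = "prune"    # near-duplicate of the previous layer
        elif cos_prev >= freeze_cos:
            flags[layer] = "freeze"   # similar, but not flagged for removal
        else:
            flags[layer] = "keep"
    return flags

# Made-up values for four layers, for illustration only
toy = {1: (None, 0.30), 2: (0.90, 0.35), 3: (0.99, 0.60), 4: (0.96, 0.40)}
print(flag_layers(toy))
```

With these toy values only layer 3 passes both thresholds and is marked for pruning, while layer 4 falls into the freeze band.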


# We keep layers 1, 2, 5, 9, 13 (the "duplicate" layers were thinned out), 15, 17, 21, 24. Total: 14 layers (a reduction of ~41%). The result is a classic "sandwich": the outer layers are important and the duplicates sit in the middle, which is further evidence that the layer selection is sound.
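
The ~41% figure follows from keeping 14 layers out of the encoder's total. A small sketch of that arithmetic, assuming `NUM_LAYERS` is 24 for `BAAI/bge-m3` (consistent with the layer numbers above); `pruned_fraction` is a hypothetical helper:

```python
def pruned_fraction(total_layers, kept_layers):
    """Fraction of encoder layers removed by pruning."""
    return 1 - kept_layers / total_layers

# 24 is the assumed encoder depth; 14 is the kept count stated above
print(f"{pruned_fraction(24, 14):.1%}")  # 41.7%
```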

# The layers we use for layer-wise distillation are therefore 13, 17, 24. Layer 9 (now 4) will imitate layer 13, layer 17 (now 6) will imitate layer 17, and layer 24 (now 24) will imitate layer 24.
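
Layer-wise distillation of this kind is often implemented by penalising the distance between a student layer's hidden states and those of its assigned teacher layer. The notebook does not say which loss is used, so the sketch below assumes a masked mean-squared error and uses NumPy arrays as stand-ins for hidden states; `layerwise_mse` and the toy layer mapping are hypothetical:

```python
import numpy as np

def layerwise_mse(student_hs, teacher_hs, mapping, mask):
    """Average masked MSE over (student layer -> teacher layer) pairs.

    student_hs / teacher_hs: dict {layer: array [B, T, D]}
    mapping: dict {student_layer: teacher_layer}
    mask: array [B, T], 1 for real tokens and 0 for padding
    """
    m = mask[..., None].astype(np.float64)           # [B, T, 1]
    losses = []
    for s_layer, t_layer in mapping.items():
        diff = (student_hs[s_layer] - teacher_hs[t_layer]) ** 2
        n_values = m.sum() * diff.shape[-1]          # real tokens x hidden size
        losses.append((diff * m).sum() / max(n_values, 1.0))
    return float(np.mean(losses))

# Toy example: a 2-layer student imitating layers 2 and 4 of a 4-layer teacher
rng = np.random.default_rng(0)
teacher = {k: rng.normal(size=(2, 3, 8)) for k in range(1, 5)}
student = {1: teacher[2].copy(), 2: teacher[4] + 0.1}
mask = np.array([[1, 1, 0], [1, 1, 1]])
loss = layerwise_mse(student, teacher, {1: 2, 2: 4}, mask)
print(loss)
```

In actual training the same computation would be carried out on `torch` tensors so that gradients reach the student; the NumPy version only illustrates how the layer mapping and the padding mask enter the loss.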
