## **This notebook** uses mistral-instruct-7b (Mistral-7B-Instruct-v0.2) to generate, from movie descriptions, structured outlines across 11 aspects, from genre to the mood of the ending.

Note: the model did very well, but not every description contains, for example, the tone of the ending, and so on. Even so, in more than 75% of the model's responses, 8 or more of the 11 aspects were filled in (responses with 8 filled aspects are treated as relevant; for the reasoning, see the notebook generating-queries-with-qwen).
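
The coverage figure in the note above can be checked with a few lines of Python. This is a minimal sketch, not the notebook's own code: the helper names `filled_aspects` and `coverage_share` are hypothetical, the sample data is made up, and it assumes that an aspect counts as "filled" when at least one of its three slots holds a value other than `not_specified`.

```python
NOT_SPECIFIED = "not_specified"

def filled_aspects(parsed: dict[str, list[str]]) -> int:
    """Count aspects with at least one value other than "not_specified" (assumed definition)."""
    return sum(
        any(value != NOT_SPECIFIED for value in values)
        for values in parsed.values()
    )

def coverage_share(parsed_batch: list[dict[str, list[str]]], min_filled: int = 8) -> float:
    """Fraction of responses that have at least `min_filled` filled aspects."""
    if not parsed_batch:
        return 0.0
    hits = sum(filled_aspects(parsed) >= min_filled for parsed in parsed_batch)
    return hits / len(parsed_batch)

# Made-up data with only two aspects, so the threshold is lowered to 2 for the demo.
example = [
    {"genre": ["drama", "crime", NOT_SPECIFIED], "tone": ["dark", NOT_SPECIFIED, NOT_SPECIFIED]},
    {"genre": ["comedy", NOT_SPECIFIED, NOT_SPECIFIED], "tone": [NOT_SPECIFIED] * 3},
]
print(coverage_share(example, min_filled=2))  # 0.5
```

With real data, the list would hold one parsed dictionary per model response and `min_filled` would stay at 8.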

In [None]:
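# Base prompt template; the <<synopsis>> placeholder is filled in later via str.replace,
# so a plain (non-f) string is enough.
prompt = """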
This is a short plot summary of a movie:

<<synopsis>>

Task: analyze the summary and select CLASSES from the lists below.

Rules (must follow all):
- For each aspect write EXACTLY three values, comma-separated.
- Use ONLY values from the provided lists. New words are forbidden.
- For each aspect, at least one value must be a real class (not "not_specified").
- If only one or two values fit, pad the remaining slots with "not_specified".
- Output EXACTLY 15 lines in the ORDER shown below (one line per aspect).
- No explanations, no extra lines, no trailing text.

Answer strictly in the format:
aspect_name: class_1, class_2, class_3

Lists of aspects and classes:

genre: drama, comedy, thriller, horror, action, adventure, crime, romance, sci_fi, fantasy, war, historical, biography, mystery, western, sports, musical, family, animation, documentary, satire, superhero, not_specified

subgenre: noir, neo_noir, heist, courtroom, spy, detective, slasher, survival, zombie, creature_feature, folk_horror, body_horror, psychological, paranoia, cyberpunk, steampunk, post_apocalypse, dystopia, space_opera, time_travel, alternate_history, road_movie, coming_of_age, buddy, prison, disaster, political, social_drama, satire_subgenre, mockumentary, anthology, found_footage, martial_arts, revenge, gangster, whodunit, not_specified

tone: dark, bleak, gritty, tense, suspenseful, tragic, serious, bittersweet, hopeful, uplifting, light, whimsical, romantic, ironic, satirical, epic, intimate, melancholic, not_specified

setting_place: city, suburb, village, countryside, island, desert, mountains, forest, jungle, arctic, ocean, school, university, hospital, prison, court, warzone, battlefield, military_base, space, spaceship, space_station, future_city, fantasy_world, parallel_world, post_apocalypse_wasteland, underworld_criminal, corporate_office, small_town, road, house, apartment, monastery, castle, mine, not_specified

setting_time: antiquity, medieval, renaissance, xviii_century, xix_century, early_xx, mid_xx, late_xx, present, near_future, far_future, timeless, alternate_history, not_specified

protagonist_archetype: everyman, teenager, child, family, detective, cop, private_eye, criminal, hitman, soldier, veteran, resistance_member, scientist, engineer, doctor, teacher, artist, athlete, journalist, lawyer, leader, monarch, politician, spy, hacker, survivor, antihero, superhero, monster_protagonist, outsider, immigrant, pilgrim, prisoner, not_specified

antagonist_type: person, rival, serial_killer, criminal_org, corrupt_cop, corporation, system, society, government, cult, nature, disease, poverty, addiction, inner_demon, fate, supernatural, ghost, demon, monster, alien, ai, virus, war, environment, not_specified

relationship_dynamics: love, friendship, rivalry, family, mentorship, team, found_family, betrayal, enemies_to_lovers, love_triangle, loner, mentor_student, siblings, parent_child, not_specified

core_goal: survive, escape, protect, rescue, find, uncover_secret, solve_crime, prove_innocence, revenge, win, save_world, stop_villain, build_relationship, self_discovery, rise_to_power, restore_order, heist_score, expose_conspiracy, not_specified

inciting_incident: murder, kidnapping, robbery, disaster, accident, betrayal_hook, diary, prophecy, inheritance, new_job, arrival_of_stranger, war_outbreak, outbreak_virus, aliens_arrival, time_anomaly, parallel_worlds, conspiracy_revealed, magic_awakens, ai_malfunction, forbidden_love, not_specified

themes: love, friendship, family, betrayal, loyalty, freedom, justice, power, corruption, greed, guilt, forgiveness, redemption, survival, identity, coming_of_age, class, race, immigration, war, revolution, oppression, resistance, faith, doubt, science, technology, ecology, capitalism, art, fame, myth, destiny, madness, memory, trauma, grief, hope, not_specified

scale: personal, local, national, global, cosmic, not_specified

narrative_structure: linear, non_linear, multi_timeline, frame_story, real_time, unreliable_narrator, anthology_structure, puzzle_plot, chaptered, not_specified

pacing: slow_burn, measured, brisk, breakneck, not_specified

ending_tone: happy, tragic, bittersweet, ambiguous, twist, not_specified

Now, your answer:
"""


In [None]:
import os

from huggingface_hub import login

# Hugging Face access token (left blank here; fill in before running)
hf_token = ''

login(token=hf_token)

In [None]:
import torch
import torch.nn.functional as F
import pandas as pd
from transformers import AutoTokenizer, AutoModel
import string

# === 1. dictionary of allowed classes ===
ASPECT_ORDER = [
    "genre","subgenre","tone","setting_place","setting_time",
    "protagonist_archetype","antagonist_type","relationship_dynamics",
    "core_goal","inciting_incident","themes","scale",
    "narrative_structure","pacing","ending_tone"
]

ASPECT_CLASSES = {
    "genre": ["drama","comedy","thriller","horror","action","adventure","crime","romance","sci_fi","fantasy","war","historical","biography","mystery","western","sports","musical","family","animation","documentary","satire","superhero","not_specified"],
    "subgenre": ["noir","neo_noir","heist","courtroom","spy","detective","slasher","survival","zombie","creature_feature","folk_horror","body_horror","psychological","paranoia","cyberpunk","steampunk","post_apocalypse","dystopia","space_opera","time_travel","alternate_history","road_movie","coming_of_age","buddy","prison","disaster","political","social_drama","satire_subgenre","mockumentary","anthology","found_footage","martial_arts","revenge","gangster","whodunit","not_specified"],
    "tone": ["dark","bleak","gritty","tense","suspenseful","tragic","serious","bittersweet","hopeful","uplifting","light","whimsical","romantic","ironic","satirical","epic","intimate","melancholic","not_specified"],
    "setting_place": ["city","suburb","village","countryside","island","desert","mountains","forest","jungle","arctic","ocean","school","university","hospital","prison","court","warzone","battlefield","military_base","space","spaceship","space_station","future_city","fantasy_world","parallel_world","post_apocalypse_wasteland","underworld_criminal","corporate_office","small_town","road","house","apartment","monastery","castle","mine","not_specified"],
    "setting_time": ["antiquity","medieval","renaissance","xviii_century","xix_century","early_xx","mid_xx","late_xx","present","near_future","far_future","timeless","alternate_history","not_specified"],
    "protagonist_archetype": ["everyman","teenager","child","family","detective","cop","private_eye","criminal","hitman","soldier","veteran","resistance_member","scientist","engineer","doctor","teacher","artist","athlete","journalist","lawyer","leader","monarch","politician","spy","hacker","survivor","antihero","superhero","monster_protagonist","outsider","immigrant","pilgrim","prisoner","not_specified"],
    "antagonist_type": ["person","rival","serial_killer","criminal_org","corrupt_cop","corporation","system","society","government","cult","nature","disease","poverty","addiction","inner_demon","fate","supernatural","ghost","demon","monster","alien","ai","virus","war","environment","not_specified"],
    "relationship_dynamics": ["love","friendship","rivalry","family","mentorship","team","found_family","betrayal","enemies_to_lovers","love_triangle","loner","mentor_student","siblings","parent_child","not_specified"],
    "core_goal": ["survive","escape","protect","rescue","find","uncover_secret","solve_crime","prove_innocence","revenge","win","save_world","stop_villain","build_relationship","self_discovery","rise_to_power","restore_order","heist_score","expose_conspiracy","not_specified"],
    "inciting_incident": ["murder","kidnapping","robbery","disaster","accident","betrayal_hook","diary","prophecy","inheritance","new_job","arrival_of_stranger","war_outbreak","outbreak_virus","aliens_arrival","time_anomaly","parallel_worlds","conspiracy_revealed","magic_awakens","ai_malfunction","forbidden_love","not_specified"],
    "themes": ["love","friendship","family","betrayal","loyalty","freedom","justice","power","corruption","greed","guilt","forgiveness","redemption","survival","identity","coming_of_age","class","race","immigration","war","revolution","oppression","resistance","faith","doubt","science","technology","ecology","capitalism","art","fame","myth","destiny","madness","memory","trauma","grief","hope","not_specified"],
    "scale": ["personal","local","national","global","cosmic","not_specified"],
    "narrative_structure": ["linear","non_linear","multi_timeline","frame_story","real_time","unreliable_narrator","anthology_structure","puzzle_plot","chaptered","not_specified"],
    "pacing": ["slow_burn","measured","brisk","breakneck","not_specified"],
    "ending_tone": ["happy","tragic","bittersweet","ambiguous","twist","not_specified"],
}




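def parse_one_output(text: str) -> dict[str, list[str]]:
    """Parse raw model output into {aspect: [three values]}; missing aspects default to "not_specified"."""
    # strip punctuation (except "_")
    punct = string.punctuation.replace('_', '')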

    def drop_na(lst):
        return [i for i in lst if i]

    def clean_(txt: str) -> str:
        return txt.translate(str.maketrans('', '', punct)).lower().strip()

    lines = [clean_(ln) for ln in text.split("\n") if ln.strip()]
    dict_aspect = {asp: ["not_specified"]*3 for asp in ASPECT_ORDER}

    for line in lines:
        parts = drop_na(line.split())
        if not parts:
            continue
        asp = parts[0]
        if asp in dict_aspect:
            vals = parts[1:4]
            # pad to 3 values
            while len(vals) < 3:
                vals.append("not_specified")
            dict_aspect[asp] = vals[:3]
    return dict_aspect


# === 2. flatten the dicts ===
def flatten_batch(dicts):
    tokens, aspects, spans = [], [], []
    cursor = 0
    for d in dicts:
        for asp in ASPECT_ORDER:
            vals = d[asp]  # always length 3
            tokens.extend(vals)
            aspects.extend([asp]*len(vals))
        spans.append((cursor, cursor + len(ASPECT_ORDER)*3))
        cursor += len(ASPECT_ORDER)*3
    return tokens, aspects, spans

# === 3. GPU encode ===
@torch.inference_mode()
def mean_pool(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

@torch.inference_mode()
def encode_texts(texts, tokenizer, model, device="cuda", bs=64):
    outs = []
    for i in range(0, len(texts), bs):
        chunk = texts[i:i+bs]
        enc = tokenizer([t.replace("_"," ") for t in chunk],
                        padding=True, truncation=True,
                        max_length=32, return_tensors="pt").to(device)
        out = model(**enc)
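        # Prefer the model's pooler output when it exists; otherwise fall back to mean pooling.
        # NOTE: the model cards of sentence-transformers checkpoints recommend mean pooling,
        # so for such models this branch may not be the best choice.
        if hasattr(out, "pooler_output") and out.pooler_output is not None:
            emb = out.pooler_output
        else:
            emb = mean_pool(out.last_hidden_state, enc["attention_mask"])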
        emb = F.normalize(emb, dim=-1)
        outs.append(emb)
    return torch.cat(outs, dim=0)

# === 4. build the embedding bank for each aspect ===
def build_bank(tokenizer, model, device="cuda"):
    bank = {}
    for asp in ASPECT_ORDER:
        labels = ASPECT_CLASSES[asp]
        embs = encode_texts(labels, tokenizer, model, device=device)
        bank[asp] = {lbl: embs[i] for i,lbl in enumerate(labels)}
    return bank

# === 5. map tokens to classes ===
def map_tokens(tokens, aspects, bank, tokenizer, model, device="cuda"):
    embs = encode_texts(tokens, tokenizer, model, device=device)
    mapped = []
    for tok, asp, vec in zip(tokens, aspects, embs):
        # the token is already a valid class: keep it
        if tok in bank[asp]:
            mapped.append(tok)
            continue
        # compute cosine similarities (embeddings are L2-normalized, so a dot product suffices)
        cand_labels = list(bank[asp].keys())
        cand_embs = torch.stack([bank[asp][c] for c in cand_labels], dim=0)
        sims = cand_embs @ vec
        best = int(torch.argmax(sims).item())
        best_lbl = cand_labels[best]
        mapped.append(best_lbl)
    return mapped


# === 6. reassemble the dicts ===
def reconstruct(mapped_tokens, spans):
    dicts = []
    for start,end in spans:
        local = mapped_tokens[start:end]
        d = {}
        ptr = 0
        for asp in ASPECT_ORDER:
            d[asp] = local[ptr:ptr+3]
            ptr += 3
        dicts.append(d)
    return dicts

# === 7. add to the dataset (DataFrame) ===
def to_dataframe(dicts, prompts=None):
    rows = []
    for i,d in enumerate(dicts):
        row = {}
        if prompts is not None:
            row["prompt"] = prompts[i]
        for asp in ASPECT_ORDER:
            row[f"{asp}_1"], row[f"{asp}_2"], row[f"{asp}_3"] = d[asp]
        rows.append(row)
    cols = (["prompt"] if prompts is not None else []) + [f"{a}_{k}" for a in ASPECT_ORDER for k in (1,2,3)]
    return pd.DataFrame(rows, columns=cols)

# === 8. pipeline ===
def process_batch(parsed_batch, tokenizer, model, device="cuda", prompts=None, bank=None):
    tokens, aspects, spans = flatten_batch(parsed_batch)
    # The label bank depends only on the encoder, so a prebuilt one can be passed in
    # to avoid re-encoding every class list on each batch.
    if bank is None:
        bank = build_bank(tokenizer, model, device=device)
    mapped = map_tokens(tokens, aspects, bank, tokenizer, model, device=device)
    dicts = reconstruct(mapped, spans)
    # df = to_dataframe(dicts, prompts=prompts)
    return dicts
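
The mapping step above keeps values that are already in the allowed list and snaps everything else to the class with the highest cosine similarity. The sketch below illustrates the same idea without downloading a model. It is an illustration only: character-trigram counts stand in for the MiniLM embeddings, and the helper names (`char_trigrams`, `cosine`, `snap_to_class`) are hypothetical.

```python
import math
from collections import Counter

def char_trigrams(text: str) -> Counter:
    """Toy 'embedding': character-trigram counts (a stand-in for a sentence encoder)."""
    padded = f"  {text.replace('_', ' ').lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def snap_to_class(token: str, allowed: list[str]) -> str:
    """Return the token if it is an allowed class, else the most similar allowed class."""
    if token in allowed:
        return token
    vec = char_trigrams(token)
    return max(allowed, key=lambda label: cosine(vec, char_trigrams(label)))

genres = ["drama", "comedy", "thriller", "sci_fi", "not_specified"]
print(snap_to_class("thriller", genres))   # already valid, returned unchanged
print(snap_to_class("thrillers", genres))  # snapped to "thriller"
print(snap_to_class("scifi", genres))      # snapped to "sci_fi"
```

Note that, as in the notebook, `not_specified` is itself a candidate, so an unrecognized value can also be snapped to it.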


In [None]:
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128,expandable_segments:True,garbage_collection_threshold:0.6"
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(False)

name = "mistralai/Mistral-7B-Instruct-v0.2"
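tokenizer_mistral = AutoTokenizer.from_pretrained(name)
# Batched generation with a decoder-only model needs left padding, so that the new
# tokens directly follow each prompt (the decoding loop slices by the padded prompt length).
tokenizer_mistral.padding_side = "left"
if tokenizer_mistral.pad_token_id is None:
    tokenizer_mistral.pad_token_id = tokenizer_mistral.eos_token_id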

mistral_model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float16,
    device_map="auto",                       # model-parallel across all GPUs
    low_cpu_mem_usage=True,
    # max_memory={"cuda:0":"15GiB","cuda:1":"15GiB"},  # optional
)
mistral_model.generation_config.pad_token_id = tokenizer_mistral.pad_token_id
mistral_model.eval()

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): MistralRMSNorm((4096,), eps=1e-0

In [None]:
device = "cuda:1"
MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # can be swapped for bge, e5, etc.
tok_vect = AutoTokenizer.from_pretrained(MODEL)
model_vect = AutoModel.from_pretrained(MODEL).to(device).eval()

In [None]:
import pandas as pd
df = pd.read_parquet("/kaggle/input/promts-for-sematinc/prompts.parquet")

def create_prompt(syn, base_prompt=prompt):
    mask = '<<synopsis>>'
    return base_prompt.replace(mask, syn)

df = df[['plot']].drop_duplicates()
df['prompt'] = df['plot'].apply(create_prompt)
df['desc'] = ''

df = df.sort_values(by="prompt", key=lambda col: col.str.len())
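
The notebook does not say why the prompts are sorted by length. One likely benefit is that prompts of similar length end up in the same batch, so less padding is needed. The sketch below illustrates that effect with made-up lengths; `padding_waste` is a hypothetical helper, and the notebook itself sorts by character length rather than token count.

```python
def padding_waste(lengths: list[int], batch_size: int) -> int:
    """Total number of pad positions when each batch is padded to its longest item."""
    waste = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        waste += sum(longest - n for n in batch)
    return waste

# Made-up prompt lengths that alternate between short and long.
lengths = [120, 900, 130, 880, 125, 910, 140, 895]

print(padding_waste(lengths, batch_size=4))          # 3140 pad positions, unsorted
print(padding_waste(sorted(lengths), batch_size=4))  # 100 pad positions, sorted by length
```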

In [None]:
# Skip the first 6144 rows; .copy() avoids chained-assignment warnings on later writes.
df = df.iloc[6144:].copy()

In [None]:
def dict_to_text(d: dict) -> str:
    lines = []
    for k, v in d.items():
        # v is guaranteed to be a list of 3 items
        values = ", ".join(v)
        lines.append(f"{k}: {values}")
    return "\n".join(lines)

In [None]:

saved_per_batch = 16  # write a checkpoint to disk every 16 batches
batch_size = 8
for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i+batch_size]
    prompts = batch['prompt'].tolist()

    # No max_length is given, so truncation=True is a no-op here (prompts are not truncated)
    inputs = tokenizer_mistral(prompts, return_tensors="pt", padding=True, truncation=True).to('cuda')

    with torch.inference_mode():
        outputs = mistral_model.generate(
            **inputs,
            max_new_tokens=180,
            temperature=0.0,  # ignored: with do_sample=False decoding is greedy
            do_sample=False
        )

    # keep only the newly generated tokens, i.e. everything after the padded prompt
    decoded = []
    input_len = inputs["input_ids"].shape[1]
    for output in outputs:
        text = tokenizer_mistral.decode(output[input_len:], skip_special_tokens=True)
        decoded.append(text)

    # parse the raw ("dirty") outputs
    parsed_batch = [parse_one_output(txt) for txt in decoded]
    # run them through the mapping pipeline
    descriptions = process_batch(parsed_batch, tok_vect, model_vect, device=device)
    descriptions = [dict_to_text(d) for d in descriptions]
    # write the results back
    df.loc[batch.index, 'desc'] = descriptions

    # periodic checkpoint
    if (i // batch_size + 1) % saved_per_batch == 0:
        df.to_parquet("plot_prompt_desc.parquet", index=False)

df.to_parquet("plot_prompt_desc_final.parquet", index=False)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

In [None]:
# def count_not_specified(original_dicts, mapped_dicts):
#     """
#     original_dicts: list of dicts (from the parser, before replacement)
#     mapped_dicts:   list of dicts (after process_batch)
#     returns a list of (orig_count, mapped_count) tuples, one per text
#     """
#     results = []
#     for orig, mapped in zip(original_dicts, mapped_dicts):
#         orig_count = sum(v == "not_specified" for vals in orig.values() for v in vals)
#         mapped_count = sum(v == "not_specified" for vals in mapped.values() for v in vals)
#         results.append((orig_count, mapped_count))
#     return results
# # suppose you have:
# # parsed_batch = [parse_one_output_my(txt) for txt in raw_outputs]
# # dicts, df = process_batch(parsed_batch, tok, mdl, device="cuda")

# stats = count_not_specified(parsed_batch, dicts)

# for i, (o, m) in enumerate(stats, 1):
#     print(f"Text {i}: not_specified was {o}, now {m}")
