# Tokenization and Classical Methods in NLP


## Step 1: Loading the Dataset

We start by loading AG News, a news-article classification dataset with 4 classes:
- World
- Sports
- Business
- Sci/Tech

The dataset contains 120,000 training examples and 7,600 test examples.


In [1]:
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from datasets import load_dataset
import random

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)


In [2]:
print("Loading AG News dataset...")
dataset = load_dataset("ag_news")

print(f"\nDataset structure:")
print(dataset)

print(f"\nTrain set size: {len(dataset['train'])}")
print(f"Test set size: {len(dataset['test'])}")

print("\nSample examples:")
for i in range(3):
    example = dataset['train'][i]
    print(f"\nExample {i+1}:")
    print(f"  Label: {example['label']} ({dataset['train'].features['label'].names[example['label']]})")
    print(f"  Text: {example['text'][:200]}...")


Loading AG News dataset...

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

Train set size: 120000
Test set size: 7600

Sample examples:

Example 1:
  Label: 2 (Business)
  Text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again....

Example 2:
  Label: 2 (Business)
  Text: Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense in...

Example 3:
  Label: 2 (Business)
  Text: Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during t...


In [3]:
print("Loading data into memory...")

train_texts = dataset['train']['text']
train_labels = dataset['train']['label']
test_texts = dataset['test']['text']
test_labels = dataset['test']['label']

print(f"\nTrain texts loaded: {len(train_texts)}")
print(f"Train labels loaded: {len(train_labels)}")
print(f"Test texts loaded: {len(test_texts)}")
print(f"Test labels loaded: {len(test_labels)}")

print("\nLabel distribution in training set:")
unique_labels, counts = np.unique(train_labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    label_name = dataset['train'].features['label'].names[label]
    print(f"  {label_name}: {count} ({count/len(train_labels)*100:.1f}%)")


Loading data into memory...

Train texts loaded: 120000
Train labels loaded: 120000
Test texts loaded: 7600
Test labels loaded: 7600

Label distribution in training set:
  World: 30000 (25.0%)
  Sports: 30000 (25.0%)
  Business: 30000 (25.0%)
  Sci/Tech: 30000 (25.0%)


In [4]:
print("Dataset statistics:")
# Note: both averages are estimated on the first 1,000 texts of each split, not on the full split.
print(f"  Average text length (train): {np.mean([len(text) for text in train_texts[:1000]]):.1f} characters")
print(f"  Average text length (test): {np.mean([len(text) for text in test_texts[:1000]]):.1f} characters")
print(f"  Number of classes: {len(dataset['train'].features['label'].names)}")
print(f"  Class names: {dataset['train'].features['label'].names}")

print("\n✓ Dataset successfully loaded into memory!")


Dataset statistics:
  Average text length (train): 250.2 characters
  Average text length (test): 242.2 characters
  Number of classes: 4
  Class names: ['World', 'Sports', 'Business', 'Sci/Tech']

✓ Dataset successfully loaded into memory!


## Step 2: Tokenizing the Text

Tokenization is the process of splitting text into individual units (tokens). We look at several approaches to tokenization, each with its own strengths and areas of application.


### Word-level tokenization

The simplest approach splits the text into words, and each word becomes a separate token. This is the baseline method for classical approaches such as Bag-of-Words and TF-IDF.

**How it works:** The text is lowercased and then split into runs of word characters with the regular expression `\b\w+\b`, which discards whitespace and punctuation. As a side effect, `Street's` becomes `street` and `s`, and `Short-sellers` becomes `short` and `sellers`.


In [5]:
import re
from collections import Counter

def word_level_tokenize(text):
    text = text.lower()
    tokens = re.findall(r'\b\w+\b', text)
    return tokens

sample_text = train_texts[0]
word_tokens = word_level_tokenize(sample_text)

print(f"Original text: {sample_text[:150]}...")
print(f"\nWord-level tokens ({len(word_tokens)} tokens):")
print(word_tokens[:20])
print(f"\nVocabulary size (first 1000 texts): {len(set([token for text in train_texts[:1000] for token in word_level_tokenize(text)]))}")


Original text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again....

Word-level tokens (24 tokens):
['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters', 'reuters', 'short', 'sellers', 'wall', 'street', 's', 'dwindling', 'band', 'of', 'ultra', 'cynics']

Vocabulary size (first 1000 texts): 7552


### Character-level tokenization

Character-level tokenization splits text into individual characters. Character n-grams (sequences of n consecutive characters) are a common variant. This approach is robust to typos and mixed alphabets, but it produces very long sequences.

**How it works:** Each character becomes a separate token. Overlapping character n-grams can be built from them to capture local patterns.


In [6]:
def character_level_tokenize(text):
    return list(text)

def char_ngram_tokenize(text, n=3):
    tokens = []
    for i in range(len(text) - n + 1):
        tokens.append(text[i:i+n])
    return tokens

sample_text = train_texts[0]
char_tokens = character_level_tokenize(sample_text)
char_3gram_tokens = char_ngram_tokenize(sample_text, n=3)

print(f"Original text: {sample_text[:100]}...")
print(f"\nCharacter-level tokens (first 50): {char_tokens[:50]}")
print(f"\nChar 3-gram tokens (first 20): {char_3gram_tokens[:20]}")
print(f"\nTotal characters: {len(char_tokens)}")
print(f"Total char 3-grams: {len(char_3gram_tokens)}")


Original text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\b...

Character-level tokens (first 50): ['W', 'a', 'l', 'l', ' ', 'S', 't', '.', ' ', 'B', 'e', 'a', 'r', 's', ' ', 'C', 'l', 'a', 'w', ' ', 'B', 'a', 'c', 'k', ' ', 'I', 'n', 't', 'o', ' ', 't', 'h', 'e', ' ', 'B', 'l', 'a', 'c', 'k', ' ', '(', 'R', 'e', 'u', 't', 'e', 'r', 's', ')', ' ']

Char 3-gram tokens (first 20): ['Wal', 'all', 'll ', 'l S', ' St', 'St.', 't. ', '. B', ' Be', 'Bea', 'ear', 'ars', 'rs ', 's C', ' Cl', 'Cla', 'law', 'aw ', 'w B', ' Ba']

Total characters: 144
Total char 3-grams: 142
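
The claim that character n-grams are robust to typos can be illustrated with a small sketch. It compares a word with a misspelled copy at the word level and at the char 3-gram level using Jaccard similarity. The helper names `char_ngrams` and `jaccard` and the example words are assumptions made for this illustration; they are not part of the notebook.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


clean = "tokenization"
typo = "tokenizaton"  # hypothetical misspelling: one "i" is dropped

# Word-level view: the two strings are simply two different tokens.
word_overlap = jaccard({clean}, {typo})

# Character 3-gram view: most n-grams survive the typo.
ngram_overlap = jaccard(char_ngrams(clean), char_ngrams(typo))

print(f"word-level overlap:  {word_overlap:.2f}")
print(f"char 3-gram overlap: {ngram_overlap:.2f}")
```

A word-level model sees the misspelled word as an unrelated token, while a char n-gram model still shares most features between the two spellings.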


### WordPiece tokenization (BERT)

WordPiece is a subword tokenizer used in BERT and related models. It splits words into smaller pieces (subwords), which makes it possible to handle rare words while keeping the vocabulary small.

**How it works:** Each word is split greedily, longest match first: the tokenizer takes the longest prefix found in the vocabulary, then repeats on the remainder, marking continuation pieces with `##`. Words that are missing from the vocabulary are therefore broken into known pieces, which keeps the number of [UNK] tokens low.
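
The greedy splitting can be sketched in a few lines. This is a simplified illustration of longest-match-first segmentation for a single word, not the `transformers` implementation. The function name `wordpiece_split` and the toy vocabulary are assumptions; the real `bert-base-uncased` vocabulary has 30,522 entries.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for this illustration.
toy_vocab = {"d", "##wind", "##ling", "wind", "band", "##s"}

print(wordpiece_split("dwindling", toy_vocab))
print(wordpiece_split("bands", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

With this toy vocabulary, `dwindling` splits into the same three pieces that the BERT tokenizer produces for the sample text: `d`, `##wind`, `##ling`.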


In [7]:
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sample_text = train_texts[0]
wordpiece_tokens = bert_tokenizer.tokenize(sample_text)
wordpiece_ids = bert_tokenizer.encode(sample_text, add_special_tokens=False)

print(f"Original text: {sample_text[:150]}...")
print(f"\nWordPiece tokens ({len(wordpiece_tokens)} tokens):")
print(wordpiece_tokens[:30])
print(f"\nToken IDs (first 30): {wordpiece_ids[:30]}")
print(f"\nVocabulary size: {bert_tokenizer.vocab_size}")
print(f"\nSpecial tokens: {bert_tokenizer.special_tokens_map}")


Original text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again....

WordPiece tokens (39 tokens):
['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short', '-', 'sellers', ',', 'wall', 'street', "'", 's', 'd', '##wind', '##ling', '\\', 'band', 'of', 'ultra', '-']

Token IDs (first 30): [2813, 2358, 1012, 6468, 15020, 2067, 2046, 1996, 2304, 1006, 26665, 1007, 26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055, 1040, 11101, 2989, 1032, 2316, 1997, 11087, 1011]

Vocabulary size: 30522

Special tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}


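### Unigram Language Model tokenization (SentencePiece)

Unigram LM is a probabilistic subword model implemented in SentencePiece. It uses a unigram language model over subword pieces to choose the most probable segmentation of the text, and it supports subword regularization (sampling alternative segmentations) during training.

**How it works:** The model is trained on a text corpus and assigns a probability to every piece in its vocabulary; the probability of a segmentation is the product of its piece probabilities. It operates on raw Unicode text without language-specific pre-tokenization and can sample a segmentation instead of always returning the most probable one.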
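
Choosing the most probable segmentation under a unigram model can be sketched with Viterbi-style dynamic programming. This is a minimal illustration and not SentencePiece's implementation. The function name `viterbi_segment` and all piece probabilities in `toy_probs` are made up for the example.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the most probable segmentation of text under a unigram model."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    best_start = [0] * (n + 1)          # start index of the last piece on that path
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]


# Hypothetical piece probabilities, chosen only for illustration.
toy_probs = {"s": 0.10, "e": 0.10, "l": 0.08, "r": 0.08,
             "ell": 0.05, "ers": 0.06, "sell": 0.02, "sellers": 0.0001}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("sellers", toy_log_probs)
print(pieces, round(score, 3))
```

Here `sell` + `ers` wins because the product of its piece probabilities (0.02 × 0.06) is higher than that of any other split, including the whole word as a single rare piece.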


In [8]:
import sentencepiece as spm
import os

model_prefix = "unigram_model"
vocab_size = 8000

if not os.path.exists(f"{model_prefix}.model"):
    with open("train_texts_sample.txt", "w", encoding="utf-8") as f:
        for text in train_texts[:10000]:
            f.write(text + "\n")
    
    spm.SentencePieceTrainer.train(
        input="train_texts_sample.txt",
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="unigram",
        character_coverage=1.0
    )
    
    os.remove("train_texts_sample.txt")

sp = spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")

sample_text = train_texts[0]
unigram_tokens = sp.encode(sample_text, out_type=str)
unigram_ids = sp.encode(sample_text, out_type=int)

print(f"Original text: {sample_text[:150]}...")
print(f"\nUnigram tokens ({len(unigram_tokens)} tokens):")
print(unigram_tokens[:30])
print(f"\nToken IDs (first 30): {unigram_ids[:30]}")
print(f"\nVocabulary size: {sp.get_piece_size()}")


Original text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again....

Unigram tokens (43 tokens):
['▁Wall', '▁St', '.', '▁Bears', '▁C', 'law', '▁Back', '▁In', 'to', '▁the', '▁Black', '▁(', 'Reuters', ')', '▁Reuters', '▁-', '▁Short', '-', 's', 'ell', 'ers', ',', '▁Wall', '▁Street', "'", 's', '▁dwindling', '\\', 'band', '▁of']

Token IDs (first 30): [634, 402, 4, 3932, 99, 4339, 1114, 118, 249, 5, 2977, 17, 40, 18, 146, 20, 5571, 13, 3, 690, 124, 6, 634, 678, 21, 3, 7836, 43, 2953, 10]

Vocabulary size: 8000


### Byte-level BPE tokenization (GPT-2/tiktoken)

Byte-level BPE is the tokenizer used in GPT-2 and many later language models. It operates on bytes, so it can encode any Unicode text and does not need an unknown token [UNK].

**How it works:** The text is first encoded as UTF-8 bytes, then BPE (Byte Pair Encoding) merges frequent adjacent pairs into subword tokens. Because all 256 byte values belong to the base vocabulary, any text can be encoded without preprocessing and there are no OOV (out-of-vocabulary) tokens.
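
The two stages, UTF-8 encoding followed by pair merging, can be sketched as a single BPE training step. This is a simplified illustration: GPT-2 additionally applies a regex pre-tokenizer and a byte-to-character mapping, both omitted here. The helper names `most_frequent_pair` and `merge_pair` and the example string are assumptions.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most frequent adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


text = "naïve banana"
ids = list(text.encode("utf-8"))  # base vocabulary: the 256 byte values
print(len(text), "characters ->", len(ids), "bytes")

pair = most_frequent_pair(ids)
ids_after = merge_pair(ids, pair, 256)  # 256 is the first id above the byte range
print("merged pair:", bytes(pair), "->", len(ids_after), "tokens")
```

The non-ASCII character `ï` takes two bytes, so it is covered by the base vocabulary even before any merge is learned. Repeating the merge step many times on a large corpus builds up the subword vocabulary.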


In [9]:
import tiktoken

gpt2_tokenizer = tiktoken.get_encoding("gpt2")

sample_text = train_texts[0]
bpe_tokens = gpt2_tokenizer.encode(sample_text)
bpe_token_strings = [gpt2_tokenizer.decode([token]) for token in bpe_tokens[:30]]

print(f"Original text: {sample_text[:150]}...")
print(f"\nByte-level BPE tokens ({len(bpe_tokens)} tokens):")
print(f"Token strings (first 30): {bpe_token_strings}")
print(f"\nToken IDs (first 30): {bpe_tokens[:30]}")
print(f"\nVocabulary size: {gpt2_tokenizer.n_vocab}")

decoded_text = gpt2_tokenizer.decode(bpe_tokens)
print(f"\nDecoded text matches original: {decoded_text == sample_text}")


Original text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again....

Byte-level BPE tokens (37 tokens):
Token strings (first 30): ['Wall', ' St', '.', ' Bears', ' Claw', ' Back', ' Into', ' the', ' Black', ' (', 'Reuters', ')', ' Reuters', ' -', ' Short', '-', 'sell', 'ers', ',', ' Wall', ' Street', "'s", ' dwindling', '\\', 'band', ' of', ' ultra', '-', 'cy', 'n']

Token IDs (first 30): [22401, 520, 13, 15682, 30358, 5157, 20008, 262, 2619, 357, 12637, 8, 8428, 532, 10073, 12, 7255, 364, 11, 5007, 3530, 338, 45215, 59, 3903, 286, 14764, 12, 948, 77]

Vocabulary size: 50257

Decoded text matches original: True


### Comparing the tokenizers

Let us compare the tokenizers on a single example text:


In [10]:
sample_text = train_texts[0]

print(f"Original text:\n{sample_text}\n")
print("=" * 80)

word_tokens = word_level_tokenize(sample_text)
print(f"\nWord-level: {len(word_tokens)} tokens")
print(f"Tokens: {word_tokens[:15]}")

char_tokens = character_level_tokenize(sample_text)
print(f"\nCharacter-level: {len(char_tokens)} tokens")
print(f"Tokens (first 50): {char_tokens[:50]}")

wordpiece_tokens = bert_tokenizer.tokenize(sample_text)
print(f"\nWordPiece (BERT): {len(wordpiece_tokens)} tokens")
print(f"Tokens: {wordpiece_tokens[:15]}")

unigram_tokens = sp.encode(sample_text, out_type=str)
print(f"\nUnigram (SentencePiece): {len(unigram_tokens)} tokens")
print(f"Tokens: {unigram_tokens[:15]}")

bpe_tokens = gpt2_tokenizer.encode(sample_text)
bpe_token_strings = [gpt2_tokenizer.decode([token]) for token in bpe_tokens[:15]]
print(f"\nByte-level BPE (GPT-2): {len(bpe_tokens)} tokens")
print(f"Tokens: {bpe_token_strings}")


Original text:
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.


Word-level: 24 tokens
Tokens: ['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters', 'reuters', 'short', 'sellers', 'wall', 'street', 's']

Character-level: 144 tokens
Tokens (first 50): ['W', 'a', 'l', 'l', ' ', 'S', 't', '.', ' ', 'B', 'e', 'a', 'r', 's', ' ', 'C', 'l', 'a', 'w', ' ', 'B', 'a', 'c', 'k', ' ', 'I', 'n', 't', 'o', ' ', 't', 'h', 'e', ' ', 'B', 'l', 'a', 'c', 'k', ' ', '(', 'R', 'e', 'u', 't', 'e', 'r', 's', ')', ' ']

WordPiece (BERT): 39 tokens
Tokens: ['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short']

Unigram (SentencePiece): 43 tokens
Tokens: ['▁Wall', '▁St', '.', '▁Bears', '▁C', 'law', '▁Back', '▁In', 'to', '▁the', '▁Black', '▁(', 'Reuters', ')', '▁Reuters']

Byte-level BPE (GPT-2): 37 tokens
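Tokens: ['Wall', ' St', '.', ' Bears', ' Claw', ' Back', ' Into', ' the', ' Black', ' (', 'Reuters', ')', ' Reuters', ' -', ' Short']
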

## Step 3: Vectorizing the Text

After tokenization, the texts have to be converted into numeric vectors. For each tokenization type we look at several vectorization methods: CountVectorizer, TF-IDF, HashingVectorizer, and character n-grams.
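
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It uses the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which matches the default settings of scikit-learn's `TfidfVectorizer`. The function name `tfidf` and the toy documents are assumptions made for the illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Smoothed, L2-normalized TF-IDF for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    vectors = []
    for doc in docs:
        weights = {term: tf * idf[term] for term, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors, idf


# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["oil", "prices", "rise"],
    ["oil", "stocks", "fall"],
    ["team", "wins", "final"],
]
vectors, idf = tfidf(toy_docs)
print({term: round(w, 3) for term, w in vectors[0].items()})
```

The term `oil` appears in two of the three documents, so it gets a lower weight than `prices` or `rise`, which appear in only one.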


In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import time


### Word-level vectorization

For word-level tokenization we use CountVectorizer and TfidfVectorizer with n-grams (1-2). For large vocabularies, HashingVectorizer is an option.
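
The idea behind HashingVectorizer, often called the hashing trick, can be sketched without any library support: each token is hashed straight to a column index, so no vocabulary has to be stored. The sketch below uses MD5 from the standard library purely as a stand-in. scikit-learn uses a signed MurmurHash3 and can also flip the sign of individual features, which is omitted here. The function name `hashed_bow` and the tiny `n_features` value are assumptions.

```python
import hashlib


def hashed_bow(tokens, n_features=16):
    """Map tokens to a fixed-size count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer, then fold into range.
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1
    return vector


tokens = "oil prices rise as oil demand grows".split()
vec = hashed_bow(tokens)
print(vec, sum(vec))
```

The vector length is fixed in advance, which keeps memory bounded for large vocabularies. The trade-off is that different tokens can collide in the same column and the mapping cannot be inverted back to tokens.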


In [12]:
train_subset = train_texts[:10000]
train_labels_subset = train_labels[:10000]
test_subset = test_texts[:1000]
test_labels_subset = test_labels[:1000]

print("Word-level векторизация:")
print("=" * 60)

start_time = time.time()
word_count_vectorizer = CountVectorizer(tokenizer=word_level_tokenize, ngram_range=(1, 2), max_features=50000)
word_count_train = word_count_vectorizer.fit_transform(train_subset)
word_count_test = word_count_vectorizer.transform(test_subset)
word_count_time = time.time() - start_time

print(f"CountVectorizer (1-2 grams):")
print(f"  Train shape: {word_count_train.shape}")
print(f"  Test shape: {word_count_test.shape}")
print(f"  Vocabulary size: {len(word_count_vectorizer.vocabulary_)}")
print(f"  Time: {word_count_time:.2f}s")

start_time = time.time()
word_tfidf_vectorizer = TfidfVectorizer(tokenizer=word_level_tokenize, ngram_range=(1, 2), max_features=50000)
word_tfidf_train = word_tfidf_vectorizer.fit_transform(train_subset)
word_tfidf_test = word_tfidf_vectorizer.transform(test_subset)
word_tfidf_time = time.time() - start_time

print(f"\nTfidfVectorizer (1-2 grams):")
print(f"  Train shape: {word_tfidf_train.shape}")
print(f"  Test shape: {word_tfidf_test.shape}")
print(f"  Vocabulary size: {len(word_tfidf_vectorizer.vocabulary_)}")
print(f"  Time: {word_tfidf_time:.2f}s")

start_time = time.time()
word_hash_vectorizer = HashingVectorizer(tokenizer=word_level_tokenize, ngram_range=(1, 2), n_features=2**18)
word_hash_train = word_hash_vectorizer.transform(train_subset)
word_hash_test = word_hash_vectorizer.transform(test_subset)
word_hash_time = time.time() - start_time

print(f"\nHashingVectorizer (1-2 grams, 2^18 features):")
print(f"  Train shape: {word_hash_train.shape}")
print(f"  Test shape: {word_hash_test.shape}")
print(f"  Time: {word_hash_time:.2f}s")


Word-level векторизация:
CountVectorizer (1-2 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 0.90s

TfidfVectorizer (1-2 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 0.83s

HashingVectorizer (1-2 grams, 2^18 features):
  Train shape: (10000, 262144)
  Test shape: (1000, 262144)
  Time: 0.41s


In [13]:
print("\nОценка качества Word-level векторов:")
print("=" * 60)

start_time = time.time()
lr_count = LogisticRegression(max_iter=500, random_state=42)
lr_count.fit(word_count_train, train_labels_subset)
count_pred = lr_count.predict(word_count_test)
count_acc = accuracy_score(test_labels_subset, count_pred)
count_train_time = time.time() - start_time

print(f"CountVectorizer + LogisticRegression:")
print(f"  Accuracy: {count_acc:.4f}")
print(f"  Training time: {count_train_time:.2f}s")

start_time = time.time()
lr_tfidf = LogisticRegression(max_iter=500, random_state=42)
lr_tfidf.fit(word_tfidf_train, train_labels_subset)
tfidf_pred = lr_tfidf.predict(word_tfidf_test)
tfidf_acc = accuracy_score(test_labels_subset, tfidf_pred)
tfidf_train_time = time.time() - start_time

print(f"\nTfidfVectorizer + LogisticRegression:")
print(f"  Accuracy: {tfidf_acc:.4f}")
print(f"  Training time: {tfidf_train_time:.2f}s")

start_time = time.time()
lr_hash = LogisticRegression(max_iter=500, random_state=42)
lr_hash.fit(word_hash_train, train_labels_subset)
hash_pred = lr_hash.predict(word_hash_test)
hash_acc = accuracy_score(test_labels_subset, hash_pred)
hash_train_time = time.time() - start_time

print(f"\nHashingVectorizer + LogisticRegression:")
print(f"  Accuracy: {hash_acc:.4f}")
print(f"  Training time: {hash_train_time:.2f}s")



Оценка качества Word-level векторов:
CountVectorizer + LogisticRegression:
  Accuracy: 0.8780
  Training time: 15.73s

TfidfVectorizer + LogisticRegression:
  Accuracy: 0.8930
  Training time: 10.95s

HashingVectorizer + LogisticRegression:
  Accuracy: 0.8570
  Training time: 21.10s


### Character-level vectorization

For character-level tokenization we use character n-grams (3-5), which work especially well on short and noisy texts.


In [14]:
print("Character-level векторизация:")
print("=" * 60)

start_time = time.time()
char_count_vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 5), max_features=50000)
char_count_train = char_count_vectorizer.fit_transform(train_subset)
char_count_test = char_count_vectorizer.transform(test_subset)
char_count_time = time.time() - start_time

print(f"CountVectorizer (char 3-5 grams):")
print(f"  Train shape: {char_count_train.shape}")
print(f"  Test shape: {char_count_test.shape}")
print(f"  Vocabulary size: {len(char_count_vectorizer.vocabulary_)}")
print(f"  Time: {char_count_time:.2f}s")

start_time = time.time()
char_tfidf_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 5), max_features=50000)
char_tfidf_train = char_tfidf_vectorizer.fit_transform(train_subset)
char_tfidf_test = char_tfidf_vectorizer.transform(test_subset)
char_tfidf_time = time.time() - start_time

print(f"\nTfidfVectorizer (char 3-5 grams):")
print(f"  Train shape: {char_tfidf_train.shape}")
print(f"  Test shape: {char_tfidf_test.shape}")
print(f"  Vocabulary size: {len(char_tfidf_vectorizer.vocabulary_)}")
print(f"  Time: {char_tfidf_time:.2f}s")

start_time = time.time()
char_hash_vectorizer = HashingVectorizer(analyzer='char', ngram_range=(3, 5), n_features=2**18)
char_hash_train = char_hash_vectorizer.transform(train_subset)
char_hash_test = char_hash_vectorizer.transform(test_subset)
char_hash_time = time.time() - start_time

print(f"\nHashingVectorizer (char 3-5 grams, 2^18 features):")
print(f"  Train shape: {char_hash_train.shape}")
print(f"  Test shape: {char_hash_test.shape}")
print(f"  Time: {char_hash_time:.2f}s")


Character-level векторизация:
CountVectorizer (char 3-5 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 3.81s

TfidfVectorizer (char 3-5 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 4.09s

HashingVectorizer (char 3-5 grams, 2^18 features):
  Train shape: (10000, 262144)
  Test shape: (1000, 262144)
  Time: 2.54s


In [15]:
print("\nОценка качества Character-level векторов:")
print("=" * 60)

start_time = time.time()
lr_char_count = LogisticRegression(max_iter=500, random_state=42)
lr_char_count.fit(char_count_train, train_labels_subset)
char_count_pred = lr_char_count.predict(char_count_test)
char_count_acc = accuracy_score(test_labels_subset, char_count_pred)
char_count_train_time = time.time() - start_time

print(f"CountVectorizer (char) + LogisticRegression:")
print(f"  Accuracy: {char_count_acc:.4f}")
print(f"  Training time: {char_count_train_time:.2f}s")

start_time = time.time()
lr_char_tfidf = LogisticRegression(max_iter=500, random_state=42)
lr_char_tfidf.fit(char_tfidf_train, train_labels_subset)
char_tfidf_pred = lr_char_tfidf.predict(char_tfidf_test)
char_tfidf_acc = accuracy_score(test_labels_subset, char_tfidf_pred)
char_tfidf_train_time = time.time() - start_time

print(f"\nTfidfVectorizer (char) + LogisticRegression:")
print(f"  Accuracy: {char_tfidf_acc:.4f}")
print(f"  Training time: {char_tfidf_train_time:.2f}s")

start_time = time.time()
lr_char_hash = LogisticRegression(max_iter=500, random_state=42)
lr_char_hash.fit(char_hash_train, train_labels_subset)
char_hash_pred = lr_char_hash.predict(char_hash_test)
char_hash_acc = accuracy_score(test_labels_subset, char_hash_pred)
char_hash_train_time = time.time() - start_time

print(f"\nHashingVectorizer (char) + LogisticRegression:")
print(f"  Accuracy: {char_hash_acc:.4f}")
print(f"  Training time: {char_hash_train_time:.2f}s")



Оценка качества Character-level векторов:
CountVectorizer (char) + LogisticRegression:
  Accuracy: 0.8680
  Training time: 16.59s

TfidfVectorizer (char) + LogisticRegression:
  Accuracy: 0.8890
  Training time: 12.11s

HashingVectorizer (char) + LogisticRegression:
  Accuracy: 0.8720
  Training time: 23.94s


### WordPiece vectorization

For WordPiece tokens we use CountVectorizer and TfidfVectorizer. The BERT tokenizer is passed to the vectorizers as a custom tokenizer function, so its tokens are used directly.


In [16]:
def wordpiece_tokenize_text(text):
    return bert_tokenizer.tokenize(text)

print("WordPiece векторизация:")
print("=" * 60)

start_time = time.time()
wordpiece_count_vectorizer = CountVectorizer(tokenizer=wordpiece_tokenize_text, ngram_range=(1, 2), max_features=50000)
wordpiece_count_train = wordpiece_count_vectorizer.fit_transform(train_subset)
wordpiece_count_test = wordpiece_count_vectorizer.transform(test_subset)
wordpiece_count_time = time.time() - start_time

print(f"CountVectorizer (1-2 grams):")
print(f"  Train shape: {wordpiece_count_train.shape}")
print(f"  Test shape: {wordpiece_count_test.shape}")
print(f"  Vocabulary size: {len(wordpiece_count_vectorizer.vocabulary_)}")
print(f"  Time: {wordpiece_count_time:.2f}s")

start_time = time.time()
wordpiece_tfidf_vectorizer = TfidfVectorizer(tokenizer=wordpiece_tokenize_text, ngram_range=(1, 2), max_features=50000)
wordpiece_tfidf_train = wordpiece_tfidf_vectorizer.fit_transform(train_subset)
wordpiece_tfidf_test = wordpiece_tfidf_vectorizer.transform(test_subset)
wordpiece_tfidf_time = time.time() - start_time

print(f"\nTfidfVectorizer (1-2 grams):")
print(f"  Train shape: {wordpiece_tfidf_train.shape}")
print(f"  Test shape: {wordpiece_tfidf_test.shape}")
print(f"  Vocabulary size: {len(wordpiece_tfidf_vectorizer.vocabulary_)}")
print(f"  Time: {wordpiece_tfidf_time:.2f}s")


WordPiece векторизация:
CountVectorizer (1-2 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 2.82s

TfidfVectorizer (1-2 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 2.86s


In [17]:
print("\nОценка качества WordPiece векторов:")
print("=" * 60)

start_time = time.time()
lr_wordpiece_count = LogisticRegression(max_iter=500, random_state=42)
lr_wordpiece_count.fit(wordpiece_count_train, train_labels_subset)
wordpiece_count_pred = lr_wordpiece_count.predict(wordpiece_count_test)
wordpiece_count_acc = accuracy_score(test_labels_subset, wordpiece_count_pred)
wordpiece_count_train_time = time.time() - start_time

print(f"CountVectorizer + LogisticRegression:")
print(f"  Accuracy: {wordpiece_count_acc:.4f}")
print(f"  Training time: {wordpiece_count_train_time:.2f}s")

start_time = time.time()
lr_wordpiece_tfidf = LogisticRegression(max_iter=500, random_state=42)
lr_wordpiece_tfidf.fit(wordpiece_tfidf_train, train_labels_subset)
wordpiece_tfidf_pred = lr_wordpiece_tfidf.predict(wordpiece_tfidf_test)
wordpiece_tfidf_acc = accuracy_score(test_labels_subset, wordpiece_tfidf_pred)
wordpiece_tfidf_train_time = time.time() - start_time

print(f"\nTfidfVectorizer + LogisticRegression:")
print(f"  Accuracy: {wordpiece_tfidf_acc:.4f}")
print(f"  Training time: {wordpiece_tfidf_train_time:.2f}s")



Оценка качества WordPiece векторов:
CountVectorizer + LogisticRegression:
  Accuracy: 0.8840
  Training time: 17.19s

TfidfVectorizer + LogisticRegression:
  Accuracy: 0.8920
  Training time: 13.50s


### Unigram (SentencePiece) vectorization

For Unigram tokens from SentencePiece we use the standard vectorizers with tokens produced by the trained model.


In [18]:
def unigram_tokenize_text(text):
    return sp.encode(text, out_type=str)

print("Unigram (SentencePiece) векторизация:")
print("=" * 60)

start_time = time.time()
unigram_count_vectorizer = CountVectorizer(tokenizer=unigram_tokenize_text, ngram_range=(1, 2), max_features=50000)
unigram_count_train = unigram_count_vectorizer.fit_transform(train_subset)
unigram_count_test = unigram_count_vectorizer.transform(test_subset)
unigram_count_time = time.time() - start_time

print(f"CountVectorizer (1-2 grams):")
print(f"  Train shape: {unigram_count_train.shape}")
print(f"  Test shape: {unigram_count_test.shape}")
print(f"  Vocabulary size: {len(unigram_count_vectorizer.vocabulary_)}")
print(f"  Time: {unigram_count_time:.2f}s")

start_time = time.time()
unigram_tfidf_vectorizer = TfidfVectorizer(tokenizer=unigram_tokenize_text, ngram_range=(1, 2), max_features=50000)
unigram_tfidf_train = unigram_tfidf_vectorizer.fit_transform(train_subset)
unigram_tfidf_test = unigram_tfidf_vectorizer.transform(test_subset)
unigram_tfidf_time = time.time() - start_time

print(f"\nTfidfVectorizer (1-2 grams):")
print(f"  Train shape: {unigram_tfidf_train.shape}")
print(f"  Test shape: {unigram_tfidf_test.shape}")
print(f"  Vocabulary size: {len(unigram_tfidf_vectorizer.vocabulary_)}")
print(f"  Time: {unigram_tfidf_time:.2f}s")


Unigram (SentencePiece) векторизация:
CountVectorizer (1-2 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 1.54s

TfidfVectorizer (1-2 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 1.59s


In [19]:
print("\nОценка качества Unigram векторов:")
print("=" * 60)

start_time = time.time()
lr_unigram_count = LogisticRegression(max_iter=500, random_state=42)
lr_unigram_count.fit(unigram_count_train, train_labels_subset)
unigram_count_pred = lr_unigram_count.predict(unigram_count_test)
unigram_count_acc = accuracy_score(test_labels_subset, unigram_count_pred)
unigram_count_train_time = time.time() - start_time

print(f"CountVectorizer + LogisticRegression:")
print(f"  Accuracy: {unigram_count_acc:.4f}")
print(f"  Training time: {unigram_count_train_time:.2f}s")

start_time = time.time()
lr_unigram_tfidf = LogisticRegression(max_iter=500, random_state=42)
lr_unigram_tfidf.fit(unigram_tfidf_train, train_labels_subset)
unigram_tfidf_pred = lr_unigram_tfidf.predict(unigram_tfidf_test)
unigram_tfidf_acc = accuracy_score(test_labels_subset, unigram_tfidf_pred)
unigram_tfidf_train_time = time.time() - start_time

print(f"\nTfidfVectorizer + LogisticRegression:")
print(f"  Accuracy: {unigram_tfidf_acc:.4f}")
print(f"  Training time: {unigram_tfidf_train_time:.2f}s")



Оценка качества Unigram векторов:
CountVectorizer + LogisticRegression:
  Accuracy: 0.8770
  Training time: 19.65s

TfidfVectorizer + LogisticRegression:
  Accuracy: 0.8870
  Training time: 14.48s


### Byte-level BPE vectorization

For byte-level BPE tokens we use the vectorizers with tokens produced by tiktoken.


In [20]:
def bpe_tokenize_text(text):
    tokens = gpt2_tokenizer.encode(text)
    return [gpt2_tokenizer.decode([token]) for token in tokens]

print("Byte-level BPE векторизация:")
print("=" * 60)

start_time = time.time()
bpe_count_vectorizer = CountVectorizer(tokenizer=bpe_tokenize_text, ngram_range=(1, 2), max_features=50000)
bpe_count_train = bpe_count_vectorizer.fit_transform(train_subset)
bpe_count_test = bpe_count_vectorizer.transform(test_subset)
bpe_count_time = time.time() - start_time

print(f"CountVectorizer (1-2 grams):")
print(f"  Train shape: {bpe_count_train.shape}")
print(f"  Test shape: {bpe_count_test.shape}")
print(f"  Vocabulary size: {len(bpe_count_vectorizer.vocabulary_)}")
print(f"  Time: {bpe_count_time:.2f}s")

start_time = time.time()
bpe_tfidf_vectorizer = TfidfVectorizer(tokenizer=bpe_tokenize_text, ngram_range=(1, 2), max_features=50000)
bpe_tfidf_train = bpe_tfidf_vectorizer.fit_transform(train_subset)
bpe_tfidf_test = bpe_tfidf_vectorizer.transform(test_subset)
bpe_tfidf_time = time.time() - start_time

print(f"\nTfidfVectorizer (1-2 grams):")
print(f"  Train shape: {bpe_tfidf_train.shape}")
print(f"  Test shape: {bpe_tfidf_test.shape}")
print(f"  Vocabulary size: {len(bpe_tfidf_vectorizer.vocabulary_)}")
print(f"  Time: {bpe_tfidf_time:.2f}s")


Byte-level BPE векторизация:
CountVectorizer (1-2 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 1.79s

TfidfVectorizer (1-2 grams):
  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 1.70s


In [21]:
print("\nОценка качества Byte-level BPE векторов:")
print("=" * 60)

start_time = time.time()
lr_bpe_count = LogisticRegression(max_iter=500, random_state=42)
lr_bpe_count.fit(bpe_count_train, train_labels_subset)
bpe_count_pred = lr_bpe_count.predict(bpe_count_test)
bpe_count_acc = accuracy_score(test_labels_subset, bpe_count_pred)
bpe_count_train_time = time.time() - start_time

print(f"CountVectorizer + LogisticRegression:")
print(f"  Accuracy: {bpe_count_acc:.4f}")
print(f"  Training time: {bpe_count_train_time:.2f}s")

start_time = time.time()
lr_bpe_tfidf = LogisticRegression(max_iter=500, random_state=42)
lr_bpe_tfidf.fit(bpe_tfidf_train, train_labels_subset)
bpe_tfidf_pred = lr_bpe_tfidf.predict(bpe_tfidf_test)
bpe_tfidf_acc = accuracy_score(test_labels_subset, bpe_tfidf_pred)
bpe_tfidf_train_time = time.time() - start_time

print(f"\nTfidfVectorizer + LogisticRegression:")
print(f"  Accuracy: {bpe_tfidf_acc:.4f}")
print(f"  Training time: {bpe_tfidf_train_time:.2f}s")



Оценка качества Byte-level BPE векторов:
CountVectorizer + LogisticRegression:
  Accuracy: 0.8880
  Training time: 14.74s

TfidfVectorizer + LogisticRegression:
  Accuracy: 0.9020
  Training time: 9.28s


### Comparison of all vectorization methods

Summary table of vectorization time, classification accuracy, and training time for every tokenizer and vectorizer combination:


In [22]:
import pandas as pd

results = [
    {'Токенизатор': 'Word-level', 'Векторизатор': 'CountVectorizer', 
     'Время векторизации': word_count_time, 'Accuracy': count_acc, 'Время обучения': count_train_time},
    {'Токенизатор': 'Word-level', 'Векторизатор': 'TfidfVectorizer', 
     'Время векторизации': word_tfidf_time, 'Accuracy': tfidf_acc, 'Время обучения': tfidf_train_time},
    {'Токенизатор': 'Word-level', 'Векторизатор': 'HashingVectorizer', 
     'Время векторизации': word_hash_time, 'Accuracy': hash_acc, 'Время обучения': hash_train_time},
    {'Токенизатор': 'Character-level', 'Векторизатор': 'CountVectorizer (char)', 
     'Время векторизации': char_count_time, 'Accuracy': char_count_acc, 'Время обучения': char_count_train_time},
    {'Токенизатор': 'Character-level', 'Векторизатор': 'TfidfVectorizer (char)', 
     'Время векторизации': char_tfidf_time, 'Accuracy': char_tfidf_acc, 'Время обучения': char_tfidf_train_time},
    {'Токенизатор': 'Character-level', 'Векторизатор': 'HashingVectorizer (char)', 
     'Время векторизации': char_hash_time, 'Accuracy': char_hash_acc, 'Время обучения': char_hash_train_time},
    {'Токенизатор': 'WordPiece', 'Векторизатор': 'CountVectorizer', 
     'Время векторизации': wordpiece_count_time, 'Accuracy': wordpiece_count_acc, 'Время обучения': wordpiece_count_train_time},
    {'Токенизатор': 'WordPiece', 'Векторизатор': 'TfidfVectorizer', 
     'Время векторизации': wordpiece_tfidf_time, 'Accuracy': wordpiece_tfidf_acc, 'Время обучения': wordpiece_tfidf_train_time},
    {'Токенизатор': 'Unigram', 'Векторизатор': 'CountVectorizer', 
     'Время векторизации': unigram_count_time, 'Accuracy': unigram_count_acc, 'Время обучения': unigram_count_train_time},
    {'Токенизатор': 'Unigram', 'Векторизатор': 'TfidfVectorizer', 
     'Время векторизации': unigram_tfidf_time, 'Accuracy': unigram_tfidf_acc, 'Время обучения': unigram_tfidf_train_time},
    {'Токенизатор': 'Byte-level BPE', 'Векторизатор': 'CountVectorizer', 
     'Время векторизации': bpe_count_time, 'Accuracy': bpe_count_acc, 'Время обучения': bpe_count_train_time},
    {'Токенизатор': 'Byte-level BPE', 'Векторизатор': 'TfidfVectorizer', 
     'Время векторизации': bpe_tfidf_time, 'Accuracy': bpe_tfidf_acc, 'Время обучения': bpe_tfidf_train_time},
]

df_results = pd.DataFrame(results)
df_results = df_results.sort_values('Accuracy', ascending=False)

print("Сводная таблица результатов:")
print("=" * 80)
print(df_results.to_string(index=False))

print("\n\nЛучшие результаты по Accuracy:")
print("=" * 80)
print(df_results.head(3).to_string(index=False))


Сводная таблица результатов:
    Токенизатор             Векторизатор  Время векторизации  Accuracy  Время обучения
 Byte-level BPE          TfidfVectorizer            1.702362     0.902        9.276575
     Word-level          TfidfVectorizer            0.831401     0.893       10.953832
      WordPiece          TfidfVectorizer            2.855483     0.892       13.501026
Character-level   TfidfVectorizer (char)            4.094401     0.889       12.109068
 Byte-level BPE          CountVectorizer            1.788724     0.888       14.740647
        Unigram          TfidfVectorizer            1.589704     0.887       14.481398
      WordPiece          CountVectorizer            2.819031     0.884       17.188610
     Word-level          CountVectorizer            0.904554     0.878       15.725384
        Unigram          CountVectorizer            1.543808     0.877       19.651316
Character-level HashingVectorizer (char)            2.535200     0.872       23.943213
Character-leve

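## Step 4: Comparing TF-IDF and BM25 on Byte-level BPE

BM25 (Best Matching 25) is a weighting scheme from information retrieval that extends TF-IDF: it saturates the term-frequency contribution nonlinearly and explicitly normalizes for document length. Here we compare TF-IDF and BM25 features built on byte-level BPE tokens. Note that the two setups below are not identical: the TF-IDF vectorizer uses 1-2-grams capped at 50,000 features, while the BM25 vectorizer uses unigrams only (26,128 features), so the comparison mixes the weighting scheme with the feature set.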


### BM25 vectorization

BM25 has two parameters: k1 controls how quickly the contribution of term frequency saturates, and b controls the strength of document-length normalization. This is intended to make it more robust than TF-IDF to differences in document length.
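
To make the role of k1 and b concrete, here is a minimal sketch of the per-term BM25 weight, using the same smoothed IDF and the same normalization as the `BM25Vectorizer` in this notebook. The function name `bm25_term_weight` and all numbers are illustrative assumptions, not values from AG News.

```python
import math

def bm25_term_weight(tf, doc_len, avgdl, n_docs, doc_freq, k1=1.5, b=0.75):
    """BM25 weight of one term in one document (smoothed, always-positive IDF)."""
    # IDF with +1 inside the log, so the weight never becomes negative.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Documents longer than average get a larger denominator (smaller weight).
    length_norm = 1.0 - b + b * doc_len / avgdl
    # The tf term saturates: it approaches (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)

# Saturation: the weight grows with tf but stays below idf * (k1 + 1).
for tf in (1, 2, 5, 20):
    w = bm25_term_weight(tf, doc_len=50, avgdl=50, n_docs=10_000, doc_freq=100)
    print(f"tf={tf:2d}  weight={w:.3f}")

# Length normalization: same tf, shorter document, higher weight.
short = bm25_term_weight(2, doc_len=25, avgdl=50, n_docs=10_000, doc_freq=100)
long_ = bm25_term_weight(2, doc_len=100, avgdl=50, n_docs=10_000, doc_freq=100)
print(f"short doc: {short:.3f}, long doc: {long_:.3f}")
```

Setting b=0 switches length normalization off entirely, and a larger k1 makes the weight behave more like raw term frequency.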


In [24]:
from rank_bm25 import BM25Okapi
import scipy.sparse as sp
from collections import Counter

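class BM25Vectorizer:
    """Sparse BM25 features over unigram tokens with a fit/transform interface."""

    def __init__(self, tokenizer, k1=1.5, b=0.75):
        self.tokenizer = tokenizer
        self.k1 = k1  # term-frequency saturation
        self.b = b    # strength of document-length normalization
        self.bm25 = None
        self.vocab = None
        self.idf = None
        self.avgdl = None
        self.tokenized_corpus = None
        
    def fit(self, texts):
        self.tokenized_corpus = [self.tokenizer(text) for text in texts]
        # The BM25Okapi index is stored but not used by transform(); the feature
        # weights come from the idf and avgdl computed below.
        self.bm25 = BM25Okapi(self.tokenized_corpus, k1=self.k1, b=self.b)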
        
        all_tokens = set()
        for tokens in self.tokenized_corpus:
            all_tokens.update(tokens)
        self.vocab = {token: idx for idx, token in enumerate(sorted(all_tokens))}
        
        doc_freqs = Counter()
        for tokens in self.tokenized_corpus:
            doc_freqs.update(set(tokens))
        
        n_docs = len(self.tokenized_corpus)
        # Smoothed IDF; the +1.0 inside the log keeps every weight positive.
        self.idf = {token: np.log((n_docs - doc_freqs[token] + 0.5) / (doc_freqs[token] + 0.5) + 1.0) 
                   for token in self.vocab}
        
        self.avgdl = np.mean([len(tokens) for tokens in self.tokenized_corpus])
        
        return self
    
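    def _bm25_score(self, token, tf, doc_len):
        """BM25 weight of a token that occurs tf times in a document of doc_len tokens."""
        if token not in self.idf:
            return 0.0
        idf = self.idf[token]
        # Saturating term frequency, scaled by document length relative to the average.
        norm_tf = (tf * (self.k1 + 1)) / (tf + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl))
        return idf * norm_tf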
    
    def transform(self, texts):
        tokenized_texts = [self.tokenizer(text) for text in texts]
        n_docs = len(texts)
        n_features = len(self.vocab)
        
        rows, cols, data = [], [], []
        
        for doc_idx, tokens in enumerate(tokenized_texts):
            doc_len = len(tokens)
            token_counts = Counter(tokens)
            
            for token, tf in token_counts.items():
                if token in self.vocab:
                    feature_idx = self.vocab[token]
                    score = self._bm25_score(token, tf, doc_len)
                    rows.append(doc_idx)
                    cols.append(feature_idx)
                    data.append(score)
        
        return sp.csr_matrix((data, (rows, cols)), shape=(n_docs, n_features))
    
    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)
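

The class above builds its own unigram vocabulary. A possible alternative, sketched here under assumptions, is to apply the same BM25 weighting to a count matrix produced by `CountVectorizer`, so that BM25 can share the n-gram range and `max_features` cap used for TF-IDF. `BM25Transformer` is a hypothetical helper, not part of scikit-learn or `rank_bm25`, and the toy corpus with the default word tokenizer stands in for the AG News subset and `bpe_tokenize_text`.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

class BM25Transformer:
    """Hypothetical helper: re-weights a sparse count matrix with BM25."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b

    def fit(self, counts):
        counts = sp.csr_matrix(counts)
        n_docs = counts.shape[0]
        # Document frequency: number of stored (non-zero) entries per column.
        df = np.bincount(counts.indices, minlength=counts.shape[1])
        self.idf_ = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Document length is the sum of retained feature counts, not raw tokens.
        self.avgdl_ = counts.sum(axis=1).mean()
        return self

    def transform(self, counts):
        counts = sp.csr_matrix(counts, dtype=np.float64)
        doc_len = np.asarray(counts.sum(axis=1)).ravel()
        norm = self.k1 * (1.0 - self.b + self.b * doc_len / self.avgdl_)
        out = counts.copy()
        # Row index of every stored value, so the row norm applies per entry.
        rows = np.repeat(np.arange(out.shape[0]), np.diff(out.indptr))
        out.data = out.data * (self.k1 + 1.0) / (out.data + norm[rows])
        out.data *= self.idf_[out.indices]
        return out

docs = [
    "stocks fall as oil prices rise",
    "team wins the final match",
    "oil prices hit stocks again",
]
counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)
bm25 = BM25Transformer().fit(counts)
weights = bm25.transform(counts)
print(weights.shape, weights.nnz)
```

With this layout the only difference between the two pipelines would be the weighting step, which makes an accuracy gap easier to attribute.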


In [25]:
print("Сравнение TF-IDF и BM25 на Byte-level BPE:")
print("=" * 60)

print("\nTF-IDF векторизация:")
start_time = time.time()
bpe_tfidf_vectorizer_full = TfidfVectorizer(tokenizer=bpe_tokenize_text, ngram_range=(1, 2), max_features=50000)
bpe_tfidf_train_full = bpe_tfidf_vectorizer_full.fit_transform(train_subset)
bpe_tfidf_test_full = bpe_tfidf_vectorizer_full.transform(test_subset)
tfidf_vectorization_time = time.time() - start_time

print(f"  Train shape: {bpe_tfidf_train_full.shape}")
print(f"  Test shape: {bpe_tfidf_test_full.shape}")
print(f"  Vocabulary size: {len(bpe_tfidf_vectorizer_full.vocabulary_)}")
print(f"  Time: {tfidf_vectorization_time:.2f}s")

print("\nBM25 векторизация:")
start_time = time.time()
bpe_bm25_vectorizer = BM25Vectorizer(tokenizer=bpe_tokenize_text, k1=1.5, b=0.75)
bpe_bm25_train = bpe_bm25_vectorizer.fit_transform(train_subset)
bpe_bm25_test = bpe_bm25_vectorizer.transform(test_subset)
bm25_vectorization_time = time.time() - start_time

print(f"  Train shape: {bpe_bm25_train.shape}")
print(f"  Test shape: {bpe_bm25_test.shape}")
print(f"  Vocabulary size: {len(bpe_bm25_vectorizer.vocab)}")
print(f"  Time: {bm25_vectorization_time:.2f}s")


Сравнение TF-IDF и BM25 на Byte-level BPE:

TF-IDF векторизация:




  Train shape: (10000, 50000)
  Test shape: (1000, 50000)
  Vocabulary size: 50000
  Time: 1.75s

BM25 векторизация:
  Train shape: (10000, 26128)
  Test shape: (1000, 26128)
  Vocabulary size: 26128
  Time: 2.28s


In [26]:
print("\nОценка качества TF-IDF и BM25:")
print("=" * 60)

start_time = time.time()
lr_tfidf_bpe = LogisticRegression(max_iter=500, random_state=42)
lr_tfidf_bpe.fit(bpe_tfidf_train_full, train_labels_subset)
tfidf_bpe_pred = lr_tfidf_bpe.predict(bpe_tfidf_test_full)
tfidf_bpe_acc = accuracy_score(test_labels_subset, tfidf_bpe_pred)
tfidf_bpe_train_time = time.time() - start_time

print(f"TF-IDF + LogisticRegression:")
print(f"  Accuracy: {tfidf_bpe_acc:.4f}")
print(f"  Training time: {tfidf_bpe_train_time:.2f}s")

start_time = time.time()
lr_bm25_bpe = LogisticRegression(max_iter=500, random_state=42)
lr_bm25_bpe.fit(bpe_bm25_train, train_labels_subset)
bm25_bpe_pred = lr_bm25_bpe.predict(bpe_bm25_test)
bm25_bpe_acc = accuracy_score(test_labels_subset, bm25_bpe_pred)
bm25_bpe_train_time = time.time() - start_time

print(f"\nBM25 + LogisticRegression:")
print(f"  Accuracy: {bm25_bpe_acc:.4f}")
print(f"  Training time: {bm25_bpe_train_time:.2f}s")

print(f"\nРазница в Accuracy: {abs(tfidf_bpe_acc - bm25_bpe_acc):.4f}")
if bm25_bpe_acc > tfidf_bpe_acc:
    print(f"BM25 лучше на {(bm25_bpe_acc - tfidf_bpe_acc) * 100:.2f}%")
else:
    print(f"TF-IDF лучше на {(tfidf_bpe_acc - bm25_bpe_acc) * 100:.2f}%")



Оценка качества TF-IDF и BM25:
TF-IDF + LogisticRegression:
  Accuracy: 0.9020
  Training time: 9.20s

BM25 + LogisticRegression:
  Accuracy: 0.8940
  Training time: 4.64s

Разница в Accuracy: 0.0080
TF-IDF лучше на 0.80%


In [27]:
comparison_results = pd.DataFrame([
    {
        'Метод': 'TF-IDF',
        'Время векторизации': tfidf_vectorization_time,
        'Accuracy': tfidf_bpe_acc,
        'Время обучения': tfidf_bpe_train_time,
        'Общее время': tfidf_vectorization_time + tfidf_bpe_train_time
    },
    {
        'Метод': 'BM25',
        'Время векторизации': bm25_vectorization_time,
        'Accuracy': bm25_bpe_acc,
        'Время обучения': bm25_bpe_train_time,
        'Общее время': bm25_vectorization_time + bm25_bpe_train_time
    }
])

print("\nСравнительная таблица TF-IDF vs BM25:")
print("=" * 80)
print(comparison_results.to_string(index=False))



Сравнительная таблица TF-IDF vs BM25:
 Метод  Время векторизации  Accuracy  Время обучения  Общее время
TF-IDF            1.754680     0.902        9.201822    10.956502
  BM25            2.284903     0.894        4.639738     6.924641
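

Both accuracies are measured on the 1,000-example test subset, so it helps to check how large the sampling noise is before reading much into a 0.008 gap. The sketch below uses the binomial standard error as a rough approximation. Since both models are scored on the same examples, a paired test such as McNemar's would be more appropriate, so treat this as an order-of-magnitude check only. The helper name `accuracy_std_error` is an assumption.

```python
import math

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy estimated on n test examples."""
    return math.sqrt(acc * (1.0 - acc) / n)

n_test = 1000  # size of the test subset used above
for name, acc in [("TF-IDF", 0.902), ("BM25", 0.894)]:
    half_width = 1.96 * accuracy_std_error(acc, n_test)
    print(f"{name}: {acc:.3f} +/- {half_width:.3f} (approx. 95% interval)")
```

The interval half-width comes out at roughly 0.02 for both methods, more than twice the observed difference, so on this subset the two weightings are hard to tell apart.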


## Step 5: Comparing classical ML models on TF-IDF (Byte-level BPE)

We compare several classical scikit-learn models on TF-IDF vectors built with byte-level BPE tokenization to see which ones cope best with sparse text representations.
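
"Sparse" here means that almost every entry of the feature matrix is zero, because a single news snippet contains only a tiny fraction of the vocabulary. A minimal sketch of how to measure this, assuming a toy corpus and scikit-learn's default word tokenizer in place of `bpe_tokenize_text`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the AG News subset (illustrative only).
docs = [
    "oil prices push stocks lower",
    "the home team wins the cup final",
    "new chip doubles laptop battery life",
    "leaders meet for peace talks",
]

# Default word tokenizer here; the notebook uses bpe_tokenize_text instead.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

n_rows, n_cols = X.shape
# Density is the share of explicitly stored (non-zero) entries.
density = X.nnz / (n_rows * n_cols)
print(f"shape={X.shape}, non-zeros={X.nnz}, density={density:.3f}")
```

The same two lines (`X.nnz` and `X.shape`) can be applied to `bpe_tfidf_train_full` to get the density of the real matrix.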


In [28]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


### Training the models

We train each classifier on the TF-IDF vectors built from byte-level BPE tokens and compare accuracy, training time, and prediction time.


In [29]:
print("Сравнение моделей sklearn на TF-IDF (Byte-level BPE):")
print("=" * 80)

models = {
    'LogisticRegression': LogisticRegression(max_iter=500, random_state=42),
    'MultinomialNB': MultinomialNB(alpha=1.0),
    'LinearSVC': LinearSVC(max_iter=1000, random_state=42),
    'SGDClassifier': SGDClassifier(max_iter=1000, random_state=42, loss='hinge'),
    'RandomForest': RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42, n_jobs=-1),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42),
    'DecisionTree': DecisionTreeClassifier(max_depth=20, random_state=42),
}

results_models = []

for name, model in models.items():
    print(f"\n{name}:")
    start_time = time.time()
    model.fit(bpe_tfidf_train_full, train_labels_subset)
    train_time = time.time() - start_time
    
    start_time = time.time()
    pred = model.predict(bpe_tfidf_test_full)
    pred_time = time.time() - start_time
    
    acc = accuracy_score(test_labels_subset, pred)
    
    results_models.append({
        'Модель': name,
        'Accuracy': acc,
        'Время обучения': train_time,
        'Время предсказания': pred_time,
        'Общее время': train_time + pred_time
    })
    
    print(f"  Accuracy: {acc:.4f}")
    print(f"  Время обучения: {train_time:.2f}s")
    print(f"  Время предсказания: {pred_time:.4f}s")


Сравнение моделей sklearn на TF-IDF (Byte-level BPE):

LogisticRegression:
  Accuracy: 0.9020
  Время обучения: 9.33s
  Время предсказания: 0.0013s

MultinomialNB:
  Accuracy: 0.8870
  Время обучения: 0.01s
  Время предсказания: 0.0011s

LinearSVC:
  Accuracy: 0.9140
  Время обучения: 0.30s
  Время предсказания: 0.0009s

SGDClassifier:
  Accuracy: 0.9120
  Время обучения: 0.15s
  Время предсказания: 0.0014s

RandomForest:
  Accuracy: 0.7690
  Время обучения: 0.33s
  Время предсказания: 0.0531s

GradientBoosting:
  Accuracy: 0.8510
  Время обучения: 222.84s
  Время предсказания: 0.0083s

DecisionTree:
  Accuracy: 0.6090
  Время обучения: 2.75s
  Время предсказания: 0.0009s


In [30]:
df_models = pd.DataFrame(results_models)
df_models = df_models.sort_values('Accuracy', ascending=False)

print("\n\nСводная таблица результатов моделей:")
print("=" * 80)
print(df_models.to_string(index=False))

print("\n\nТоп-3 модели по Accuracy:")
print("=" * 80)
print(df_models.head(3).to_string(index=False))




Сводная таблица результатов моделей:
            Модель  Accuracy  Время обучения  Время предсказания  Общее время
         LinearSVC     0.914        0.301154            0.000929     0.302083
     SGDClassifier     0.912        0.145760            0.001358     0.147117
LogisticRegression     0.902        9.328824            0.001313     9.330137
     MultinomialNB     0.887        0.007668            0.001107     0.008775
  GradientBoosting     0.851      222.835377            0.008273   222.843650
      RandomForest     0.769        0.331515            0.053051     0.384566
      DecisionTree     0.609        2.753665            0.000914     2.754579


Топ-3 модели по Accuracy:
            Модель  Accuracy  Время обучения  Время предсказания  Общее время
         LinearSVC     0.914        0.301154            0.000929     0.302083
     SGDClassifier     0.912        0.145760            0.001358     0.147117
LogisticRegression     0.902        9.328824            0.001313     9.330137

### Detailed analysis of the best models

Let us look at per-class precision, recall, and F1 for the three most accurate models:
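
The per-class numbers printed by `classification_report` can be reproduced from a confusion matrix, which also shows which classes get confused with each other. A minimal sketch on made-up labels (the arrays `y_true` and `y_pred` are illustrative, not model output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for four classes, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3, 2])
y_pred = np.array([0, 0, 1, 1, 2, 3, 3, 3, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tp = np.diag(cm)

precision = tp / cm.sum(axis=0)  # correct / everything predicted as the class
recall = tp / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1:       ", f1.round(2))
```

Off-diagonal cells are the informative part: in this toy example classes 2 and 3 are mistaken for each other, while classes 0 and 1 are never confused.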


In [31]:
top_3_models = df_models.head(3)['Модель'].values

print("Детальные метрики для топ-3 моделей:")
print("=" * 80)

for model_name in top_3_models:
    # The models were already fitted in the comparison loop above; reuse them.
    model = models[model_name]
    pred = model.predict(bpe_tfidf_test_full)
    
    print(f"\n{model_name}:")
    print(classification_report(test_labels_subset, pred, 
                              target_names=dataset['train'].features['label'].names))


Детальные метрики для топ-3 моделей:

LinearSVC:
              precision    recall  f1-score   support

       World       0.95      0.91      0.93       268
      Sports       0.95      0.96      0.96       274
    Business       0.86      0.86      0.86       205
    Sci/Tech       0.88      0.91      0.90       253

    accuracy                           0.91      1000
   macro avg       0.91      0.91      0.91      1000
weighted avg       0.91      0.91      0.91      1000


SGDClassifier:
              precision    recall  f1-score   support

       World       0.95      0.90      0.92       268
      Sports       0.95      0.96      0.96       274
    Business       0.86      0.85      0.86       205
    Sci/Tech       0.88      0.91      0.90       253

    accuracy                           0.91      1000
   macro avg       0.91      0.91      0.91      1000
weighted avg       0.91      0.91      0.91      1000


LogisticRegression:
              precision    recall  f1-score 