# Comparison of Sentiment Analysis Methods

| Method | Task | Key Steps | Pros | Cons |
|--------|------|-----------|------|------|
| **Word2Vec** | Vector representation | • Tokenization<br>• Word2Vec training<br>• Document vectorization<br>• Logistic regression<br>• Accuracy evaluation | • Fast<br>• Easy to use | • Loses context<br>• Limited vocabulary |
| **DistilBERT** | Contextual analysis | • Model loading<br>• Tokenization<br>• Fine-tuning<br>• Prediction<br>• Metric evaluation | • Context-aware<br>• High accuracy | • Slow<br>• Resource-intensive |
| **Zero-shot** | Classification without training | • Label definition<br>• BART-large-mnli<br>• Prediction<br>• Comparison of results | • Flexible<br>• No training required | • Less accurate<br>• Sensitive to label wording |
| **Few-shot** | Learning from a small sample | • Example selection<br>• Prompt tuning<br>• Prediction<br>• Quality evaluation | • Adaptive<br>• Needs few examples | • Sensitive to example choice<br>• Requires well-crafted prompts |

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')


from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/vladislavpleshko/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vladislavpleshko/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/vladislavpleshko/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
fn = '/Users/vladislavpleshko/Documents/VS Code/before/amazinum/data/rt-polarity.neg'

# errors='ignore' skips the invalid byte sequences found in the file
with open(fn, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()
texts_neg = content.splitlines()
print('len of texts_neg = {:,}'.format(len(texts_neg)))
for review in texts_neg[:5]:
    print('\n', review)

fn = '/Users/vladislavpleshko/Documents/VS Code/before/amazinum/data/rt-polarity.pos'

with open(fn, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()
texts_pos = content.splitlines()
print('len of texts_pos = {:,}'.format(len(texts_pos)))
for review in texts_pos[:5]:
    print('\n', review)


len of texts_neg = 5,331

 simplistic , silly and tedious . 

 it's so laddish and juvenile , only teenage boys could possibly find it funny . 

 exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . 

 [garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . 

 a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . 
len of texts_pos = 5,331

 the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 

 the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . 

 effective but too-tepid biopic

 if you

# **Method 1: Word Embeddings**
## Word2Vec

In [5]:
positive = pd.DataFrame({
    'Text': texts_pos,
    'Sentiment': 1
})

negative = pd.DataFrame({
    'Text': texts_neg,
    'Sentiment': 0
})

df = pd.concat([negative, positive], ignore_index=True)
df = df.sample(frac=1, random_state=35)
df.head()


| | Text | Sentiment |
|---|------|-----------|
| 2406 | undercover brother doesn't go far enough . it'... | 0 |
| 7971 | directors brett morgen and nanette burstein ha... | 1 |
| 7329 | the film delivers what it promises : a look at... | 1 |
| 7653 | the fast runner' transports the viewer into an... | 1 |
| 4455 | while the production details are lavish , film... | 0 |


In [6]:
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = tokenizer.tokenize(text.lower())
    return [w for w in tokens if w not in stop_words]

df['tokens'] = df['Text'].apply(preprocess)
print(df['tokens'])


2406    [undercover, brother, go, far, enough, silly, ...
7971    [directors, brett, morgen, nanette, burstein, ...
7329    [film, delivers, promises, look, wild, ride, e...
7653    [fast, runner, transports, viewer, unusual, sp...
4455    [production, details, lavish, film, little, in...
                              ...                        
3007    [even, britney, spears, really, cute, movie, r...
7148    [high, crimes, knows, mistakes, bad, movies, m...
9143    [film, centering, traditional, indian, wedding...
1295      [end, admire, ensemble, players, wonder, point]
5833    [believe, much, laugh, audacity, casting, shee...
Name: tokens, Length: 10662, dtype: object
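
To make the preprocessing step concrete, here is a minimal standard-library sketch of what `preprocess` does to a single review. It assumes that `re.findall` behaves like `RegexpTokenizer(r'\w+')` for this pattern, and `TOY_STOP_WORDS` is a hypothetical stand-in for NLTK's full English stop-word list, so the exact tokens kept by the real cell may differ.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (illustration only).
TOY_STOP_WORDS = {"it", "s", "so", "and", "only"}

def toy_preprocess(text):
    """Lowercase, split on runs of word characters, then drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [w for w in tokens if w not in TOY_STOP_WORDS]

raw = "it's so laddish and juvenile , only teenage boys could possibly find it funny ."
tokens = toy_preprocess(raw)
print(tokens)
```

Because `\w+` does not match apostrophes, a contraction such as "it's" is split into "it" and "s" before the stop-word filter runs, and punctuation tokens disappear entirely.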


In [7]:
w2v_model = Word2Vec(
    sentences=df['tokens'],   # tokenized input sentences
    vector_size=300,          # vector dimensionality: larger values capture more detail but cost more memory and time
    window=5,                 # context window: larger values take in more context but can blur word meaning
    min_count=3,              # minimum word frequency: filters out rare words
    workers=4,                # number of worker threads: more threads speed up training
    epochs=30,                # passes over the data: more passes can improve quality but increase training time
    sg=1                      # 1 = skip-gram (better for rare words), 0 = CBOW (faster to train)
)


In [8]:
def document_vector(tokens):
    vecs = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v_model.vector_size)

X = np.vstack(df['tokens'].apply(document_vector).values)
y = df['Sentiment'].values
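
The averaging in `document_vector` can be illustrated with a small sketch. The 3-dimensional `toy_vectors` below are invented for illustration (the model above uses 300 dimensions), and `toy_document_vector` is a hypothetical pure-Python analogue of the function above, not the notebook's code.

```python
# Invented 3-dimensional "embeddings", for illustration only.
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}
DIM = 3

def toy_document_vector(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return [0.0] * DIM
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(column) / len(known) for column in zip(*known)]

doc_vec = toy_document_vector(["great", "unseenword", "movie"])
empty_vec = toy_document_vector(["unseenword"])
print(doc_vec)    # [2.0, 1.0, 1.0]
print(empty_vec)  # [0.0, 0.0, 0.0]
```

Out-of-vocabulary tokens are skipped, and the order of the tokens does not affect the result, which is one way the "loses context" drawback from the comparison table shows up.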


In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=35)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)


In [10]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.74      0.74      0.74      1333
           1       0.74      0.74      0.74      1333

    accuracy                           0.74      2666
   macro avg       0.74      0.74      0.74      2666
weighted avg       0.74      0.74      0.74      2666



In [11]:
def predict_sentiment(text: str) -> dict:
    """Return the positive-class probability and the 0/1 label for one text."""
    tokens = preprocess(text)
    vec = document_vector(tokens).reshape(1, -1)
    proba = clf.predict_proba(vec)[0, 1]   # probability of class 1 (positive)
    label = int(proba >= 0.5)              # decision threshold of 0.5

    return {'text': text, 'probability_pos': round(proba, 5), 'label': label}

sample = ["I absolutely love this movie, it's fantastic!",
          "It's a poor film, so bad quality",
          "This cinema is so amazing!",
          "The acting was mediocre, but the plot was interesting",
          "A complete waste of time and money",
          "A masterpiece that will be remembered for generations",
          "The special effects were good but the story lacked depth",
          "This film changed my perspective on life",
          "Boring and predictable from start to finish",
          "A decent attempt but falls short of expectations"]

results = [predict_sentiment(text) for text in sample]

for result in results:
    print(result)


{'text': "I absolutely love this movie, it's fantastic!", 'probability_pos': 0.72403, 'label': 1}
{'text': "It's a poor film, so bad quality", 'probability_pos': 0.02212, 'label': 0}
{'text': 'This cinema is so amazing!', 'probability_pos': 0.89474, 'label': 1}
{'text': 'The acting was mediocre, but the plot was interesting', 'probability_pos': 0.03864, 'label': 0}
{'text': 'A complete waste of time and money', 'probability_pos': 0.21635, 'label': 0}
{'text': 'A masterpiece that will be remembered for generations', 'probability_pos': 0.91906, 'label': 1}
{'text': 'The special effects were good but the story lacked depth', 'probability_pos': 0.76679, 'label': 1}
{'text': 'This film changed my perspective on life', 'probability_pos': 0.9412, 'label': 1}
{'text': 'Boring and predictable from start to finish', 'probability_pos': 0.00674, 'label': 0}
{'text': 'A decent attempt but falls short of expectations', 'probability_pos': 0.15272, 'label': 0}


In [12]:
similar_words = w2v_model.wv.most_similar('hope', topn=5)
similar_words


[('capability', 0.5830621719360352),
 ('sunshine', 0.5558199882507324),
 ('ray', 0.5453183650970459),
 ('jie', 0.5423682928085327),
 ('solace', 0.5328478217124939)]

In [13]:
# similarity
similarity = w2v_model.wv.similarity('good','phone')
similarity


0.16821066
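
The value returned by `wv.similarity` is the cosine similarity between the two word vectors. A minimal sketch of that measure on made-up 2-dimensional vectors (the helper name `cosine_similarity` and the inputs are illustrative assumptions):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, so a score of about 0.17 for 'good' and 'phone' indicates a weak association.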

In [14]:
# analogies
res = w2v_model.wv.most_similar(
    positive=['woman','king'], # what we are looking for + its context (vectors that are added)
    negative=['man'], # what we move away from (vector that is subtracted)
    topn=1
)
res


[('lion', 0.5050972700119019)]

**Mnemonic:**
- **positive** = [what_we_seek, search_context]
- **negative** = [what_we_move_away_from]
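
The arithmetic behind this mnemonic can be sketched with invented 2-dimensional vectors. This is a simplified illustration: `toy`, `cosine`, and `toy_analogy` are hypothetical, and the raw vectors are added and subtracted directly, whereas gensim normalizes the vectors before combining them.

```python
import math

# Invented vectors; axis 0 loosely encodes gender, axis 1 royalty.
toy = {
    "man":    [1.0, 0.0],
    "woman":  [-1.0, 0.0],
    "king":   [1.0, 1.0],
    "queen":  [-1.0, 1.0],
    "prince": [1.0, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    query = [0.0, 0.0]
    for word in positive:
        query = [q + x for q, x in zip(query, toy[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, toy[word])]
    # Input words are excluded from the candidates.
    candidates = [w for w in toy if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, toy[w]))

best = toy_analogy(positive=["woman", "king"], negative=["man"])
print(best)  # queen
```

Vectors trained on a corpus of roughly ten thousand short reviews need not encode such relations cleanly, which may explain why the real model returns 'lion' rather than 'queen'.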

# **Method 2: Transformers**

## DistilBERT

In [15]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch


2025-06-14 18:09:18.360423: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [16]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2,                 # number of classes
    output_attentions=True,      # return attention weights
    output_hidden_states=True,   # return hidden states
    return_dict=True,            # return outputs as a dict-like object
    dropout=0.2,                 # dropout probability
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
text = 'Machine learning is amazing!'

enc = tokenizer(
    text,
    return_tensors = 'pt',
    truncation = True,
    padding = 'longest'
)

with torch.no_grad():
    outputs = model(**enc)




In [18]:
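# The classification head is newly initialized (see the warning above) and has
# not been fine-tuned, so this prediction is not yet meaningful.
logits = outputs.logits
probs = torch.softmax(logits, dim=1)        # convert logits to class probabilities
label = torch.argmax(probs, dim=1).item()   # index of the most probable class

print('Logits: ', logits)
print('Probabilities: ', probs)
print('Predicted label (0=neg, 1=pos): ', label)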


Logits:  tensor([[ 0.2470, -0.1086]])
Probabilities:  tensor([[0.5880, 0.4120]])
Predicted label (0=neg, 1=pos):  0
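
The step from logits to probabilities can be reproduced by hand. The sketch below is a plain-Python softmax (the helper name `softmax` is an assumption, not part of the notebook) applied to the two logits printed above.

```python
import math

def softmax(values):
    """Exponentiate each value and normalize so the results sum to 1."""
    shift = max(values)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2470, -0.1086]        # copied from the printed tensor above
probs = softmax(logits)
label = probs.index(max(probs))   # same role as torch.argmax

print([round(p, 4) for p in probs])  # [0.588, 0.412]
print(label)                         # 0
```

The near-even split reflects the untrained classification head: the model has no learned preference for either class yet.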


# **Method 3:** 
## Zero-Shot

In [None]:
from transformers import pipeline

classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')


Device set to use mps:0


In [None]:
text = "Machine learning is so amazing!"

# Candidate labels (classes) that we want to predict
candidate_labels = ["negative", "positive", "neutral"]

# Classification
result = classifier(text, candidate_labels)


In [None]:
print('TEXT: ', text)
print('LABELS: ')

for label, score in zip(result['labels'], result['scores']):
    print(f'{label}: {round(score,3)}')


TEXT:  Machine learning is so amazing!
LABELS: 
positive: 0.8
neutral: 0.134
negative: 0.066
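
A rough sketch of how these scores arise, assuming the pipeline's usual behavior: each candidate label is turned into an NLI hypothesis (the default template is "This example is {}."), the model scores how strongly the text entails each hypothesis, and with `multi_label=False` the entailment scores are normalized across labels with softmax. No model is loaded here; `fake_logits` are invented numbers and `zero_shot_scores` is a hypothetical helper.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits, sorted from best to worst."""
    shift = max(entailment_logits.values())
    exps = {label: math.exp(v - shift) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    scores = {label: e / total for label, e in exps.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

template = "This example is {}."
candidate_labels = ["negative", "positive", "neutral"]
hypotheses = [template.format(label) for label in candidate_labels]
print(hypotheses)

# Invented entailment logits, for illustration only (not real model output).
fake_logits = {"negative": -1.0, "positive": 2.0, "neutral": 0.0}
ranked = zero_shot_scores(fake_logits)
for label, score in ranked:
    print(f"{label}: {round(score, 3)}")
```

Because the hypotheses are built from the label strings themselves, rewording a label changes the hypothesis the model sees, which is why the comparison table lists sensitivity to label wording as a drawback.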


In [None]:
test_samples = [
    # Clearly positive
    "This movie is an absolute masterpiece with stunning visuals!",
    "The experience was incredible, exceeded all my expectations!",

    # Clearly negative
    "Terrible waste of time, one of the worst films ever made.",
    "I regret watching this, it was painfully disappointing.",

    # Neutral
    "The movie was neither good nor bad, just average.",
    "It has some good moments and some bad ones.",

    # Mixed feelings
    "Despite great acting, the plot was confusing and messy.",
    "Beautiful visuals but completely lacking in substance.",

    # Hard cases
    "While not particularly entertaining, it raises important questions.",
    "It's so bad that it's actually good, in a weird way."
]


In [None]:
for t in test_samples:
    print('Analyzing text:\n',t)
    res = classifier(t, candidate_labels, multi_label=False)
    print('Predictions:')
    for label, score in zip(res['labels'], res['scores']):
        print(f'{label}: {round(score,3)}')
    print(f'{"="*20} \n')


Analyzing text:
 This movie is an absolute masterpiece with stunning visuals!
Predictions:
positive: 0.986
neutral: 0.008
negative: 0.007

Analyzing text:
 The experience was incredible, exceeded all my expectations!
Predictions:
positive: 0.991
neutral: 0.006
negative: 0.003

Analyzing text:
 Terrible waste of time, one of the worst films ever made.
Predictions:
negative: 0.997
neutral: 0.002
positive: 0.001

Analyzing text:
 I regret watching this, it was painfully disappointing.
Predictions:
negative: 0.992
neutral: 0.007
positive: 0.001

Analyzing text:
 The movie was neither good nor bad, just average.
Predictions:
neutral: 0.951
negative: 0.024
positive: 0.024

Analyzing text:
 It has some good moments and some bad ones.
Predictions:
neutral: 0.452
positive: 0.395
negative: 0.153

Analyzing text:
 Despite great acting, the plot was confusing and messy.
Predictions:
negative: 0.965
neutral: 0.021
positive: 0.014

Analyzing text:
 Beautiful visuals but completely lacking in substan