Install the libraries if needed

In [1]:
!pip install transformers datasets tensorflow scikit-learn tf-keras

Defaulting to user installation because normal site-packages is not writeable


We will use a pretrained BERT model together with the BERT tokenizer

Note: transformers does not support Keras 3 yet, so we install keras-2.19... (the tf-keras package in the command above)

In [2]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

2025-04-16 00:15:47.921657: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-04-16 00:15:47.969492: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744751747.991737   13286 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744751747.997477   13286 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1744751748.014700   13286 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [3]:
print("Number of available GPU devices: ", len(tf.config.list_physical_devices('GPU')))

Number of available GPU devices:  1


The dataset comes from Kaggle, so we install the required dependency

In [4]:
!pip install kagglehub

Defaulting to user installation because normal site-packages is not writeable


Download an interesting dataset

In [5]:
import kagglehub

path = kagglehub.dataset_download("mdismielhossenabir/sentiment-analysis")

print("Path to dataset files:", path)

Path to dataset files: /home/t33nsy/.cache/kagglehub/datasets/mdismielhossenabir/sentiment-analysis/versions/1


In [6]:
!ls /home/t33nsy/.cache/kagglehub/datasets/mdismielhossenabir/sentiment-analysis/versions/1

sentiment_analysis.csv


In [7]:
df = pd.read_csv(path+"/sentiment_analysis.csv")

In [8]:
df.head(5)

Unnamed: 0,Year,Month,Day,Time of Tweet,text,sentiment,Platform
0,2018,8,18,morning,What a great day!!! Looks like dream.,positive,Twitter
1,2018,8,18,noon,"I feel sorry, I miss you here in the sea beach",positive,Facebook
2,2017,8,18,night,Don't angry me,negative,Facebook
3,2022,6,8,morning,We attend in the class just for listening teac...,negative,Facebook
4,2022,6,8,noon,"Those who want to go, let them go",negative,Instagram


Convert the labels to integers

In [9]:
# drop rows with missing text or sentiment
df = df.dropna(subset=['text', 'sentiment'])

# cast the texts to strings (in case some values are numbers or other types)
df['text'] = df['text'].astype(str)

label_encoder = LabelEncoder()
df['sentiment'] = label_encoder.fit_transform(df['sentiment'])
df.head(5)

Unnamed: 0,Year,Month,Day,Time of Tweet,text,sentiment,Platform
0,2018,8,18,morning,What a great day!!! Looks like dream.,2,Twitter
1,2018,8,18,noon,"I feel sorry, I miss you here in the sea beach",2,Facebook
2,2017,8,18,night,Don't angry me,0,Facebook
3,2022,6,8,morning,We attend in the class just for listening teac...,0,Facebook
4,2022,6,8,noon,"Those who want to go, let them go",0,Instagram
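
LabelEncoder numbers the classes in sorted order of their names, which is why negative became 0 and positive became 2 in the table above. The sketch below reproduces that mapping with NumPy alone; the `labels` list is a made-up sample for illustration, not the dataset itself.

```python
import numpy as np

# Hypothetical sample of raw labels (illustration only, not the real dataset)
labels = ["positive", "positive", "negative", "neutral", "negative"]

# np.unique returns the sorted unique values; with return_inverse=True it also
# returns the position of every element in that sorted array, i.e. its integer code
classes, codes = np.unique(labels, return_inverse=True)

mapping = {str(name): code for code, name in enumerate(classes)}
print(mapping)         # {'negative': 0, 'neutral': 1, 'positive': 2}
print(codes.tolist())  # [2, 2, 0, 1, 0]
```

In the notebook itself the same information is available as `label_encoder.classes_`, and `label_encoder.inverse_transform` maps the integers back to names.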


Split the data into training and validation sets

In [10]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'].tolist(), df['sentiment'].tolist(), test_size=0.2, random_state=42
)

In [11]:
len(train_texts)

399
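
The value 399 follows from the split arithmetic: with a fractional `test_size`, scikit-learn rounds the held-out part up and gives the rest to training. A small check, assuming the 499 rows reported by the class counts in this notebook (199 + 166 + 134):

```python
import math

n_samples = 499   # 199 + 166 + 134 rows, per the class counts in this notebook
test_size = 0.2

# For a float test_size the held-out part is rounded up,
# and the training part receives the remaining rows
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 399 100
```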

Tokenize the texts with the BERT tokenizer

In [12]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors="tf", max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors="tf", max_length=128)

I0000 00:00:1744751755.649884   13286 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1767 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
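
With `padding=True` every sequence in one tokenizer call is padded to the longest sequence in that call, and `truncation=True` with `max_length=128` cuts anything longer. The toy function below only illustrates that idea on made-up token ids: `pad_and_truncate` is a hypothetical helper, not part of transformers, and it ignores special tokens such as [CLS] and [SEP].

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy version of truncation plus padding to the longest sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Made-up token ids, truncated to at most 4 tokens
ids, mask = pad_and_truncate([[5, 6, 7], [8], [1, 2, 3, 4, 5, 6]], max_length=4)
print(ids)   # [[5, 6, 7, 0], [8, 0, 0, 0], [1, 2, 3, 4]]
print(mask)  # [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```

Because the training and validation texts are tokenized in separate calls, their padded lengths may differ; this is harmless here since each set is batched on its own.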


Build the tf.data datasets

In [13]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    tf.convert_to_tensor(train_labels)
)).shuffle(1000).batch(4)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    tf.convert_to_tensor(val_labels)
)).batch(4)
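
With a batch size of 4, the number of steps per epoch follows directly from the split sizes. A quick check, assuming 399 training and 100 validation examples (499 rows minus the 399 used for training):

```python
import math

batch_size = 4
n_train, n_val = 399, 100  # split sizes assumed from this notebook

# batch() keeps the final incomplete batch by default, so the count rounds up
train_steps = math.ceil(n_train / batch_size)
val_steps = math.ceil(n_val / batch_size)
last_train_batch = n_train - (train_steps - 1) * batch_size

print(train_steps, val_steps)  # 100 25
print(last_train_batch)        # 3
```

The shuffle buffer of 1000 is larger than the training set, so the whole training set is reshuffled on every epoch.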

Check how many examples each sentiment class has

In [14]:
df['sentiment'].value_counts()

sentiment
1    199
2    166
0    134
Name: count, dtype: int64
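
These counts also give a simple reference point for the accuracy reported later: a model that always predicts the most frequent class. A short sketch using the counts printed above:

```python
counts = {1: 199, 2: 166, 0: 134}  # value_counts() output above

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
majority_share = counts[majority_label] / total

print(total)                     # 499
print(majority_label)            # 1
print(round(majority_share, 3))  # 0.399
```

So on the full dataset a constant prediction of class 1 would be right about 40% of the time; the share on the validation split may differ slightly because the split is not stratified.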

Number of unique classes

In [15]:
df['sentiment'].nunique()

3

Fine-tuning the pretrained BERT

In [16]:
from transformers import TFBertForSequenceClassification

Loading the model

In [17]:
num_classes = len(label_encoder.classes_)
num_classes

3

In [18]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_classes)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
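
The newly initialized `classifier.weight` and `classifier.bias` form a single dense layer on top of the encoder. Its size can be checked by hand, assuming the 768-dimensional hidden state of bert-base models and float32 weights; the encoder parameter count is taken from the `model.summary()` output in this notebook.

```python
hidden_size = 768   # hidden size of bert-base models
num_classes = 3

# one weight per (hidden unit, class) pair plus one bias per class
classifier_params = hidden_size * num_classes + num_classes
encoder_params = 109_482_240                 # "bert" row of model.summary()
total_params = encoder_params + classifier_params

print(classifier_params)                     # 2307
print(total_params)                          # 109484547
print(round(total_params * 4 / 2**20, 2))    # 417.65 (MB, assuming 4 bytes per weight)
```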


Compiling the model

In [19]:
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = ['accuracy']
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0 (unused)
                                                                 
 classifier (Dense)          multiple                  2307      
                                                                 
Total params: 109484547 (417.65 MB)
Trainable params: 109484547 (417.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
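
The loss is configured with `from_logits=True` because the classification head outputs raw scores rather than probabilities. The NumPy sketch below shows what sparse categorical cross-entropy on logits computes; `sparse_ce_from_logits` is an illustrative helper with made-up inputs, not the Keras implementation.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy between integer labels and raw (unnormalized) logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # log-softmax, with the row maximum subtracted for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of the true class in every row and average
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Hypothetical logits for two examples and three classes
uniform = sparse_ce_from_logits([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
confident = sparse_ce_from_logits([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]], [0, 2])

print(round(uniform, 4))    # 1.0986, i.e. ln(3) for a uniform guess over 3 classes
print(confident < uniform)  # True
```

A loss near ln(3) ≈ 1.10 therefore means the model is no better than guessing among the three classes.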


Training the model

In [20]:
# train_dataset and val_dataset are already batched with .batch(4),
# so batch_size is not passed to fit()
model.fit(train_dataset, validation_data=val_dataset, epochs=2)

Epoch 1/2


I0000 00:00:1744751784.098615   13535 service.cc:152] XLA service 0x7effa960a870 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1744751784.098660   13535 service.cc:160]   StreamExecutor device (0): NVIDIA GeForce RTX 3050 Laptop GPU, Compute Capability 8.6
2025-04-16 00:16:24.105543: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1744751784.120170   13535 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1744751784.201964   13535 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/2


<tf_keras.src.callbacks.History at 0x7f01783ec070>

Save the model weights

In [25]:
model.save_weights(filepath="./BERT2.keras")

Load the model weights. Note that this cell reads ./BERT.keras, while the cell above wrote ./BERT2.keras.

In [20]:
model.load_weights("./BERT.keras")

In [21]:
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="tf", truncation=True, padding=True, max_length=128)
    logits = model(inputs)[0]
    predicted_class = tf.argmax(logits, axis=1).numpy()[0]
    return label_encoder.inverse_transform([predicted_class])[0]

print(predict_sentiment("I feel hopeless and lost."))
print(predict_sentiment("I'm okay, just a bit tired."))
print(predict_sentiment("I'm feeling great today!"))
print(predict_sentiment("Enjoyable but could've been a lot better"))
print(predict_sentiment("Nah it's mid"))
print(predict_sentiment("It's mid"))


negative
negative
positive
positive
negative
neutral
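
`predict_sentiment` returns only the most likely class. Applying a softmax to the logits also gives a rough confidence for that class. A minimal sketch with made-up logits, assuming the LabelEncoder class order (negative, neutral, positive):

```python
import numpy as np

def softmax(logits):
    """Turn a vector of logits into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()          # stabilizes the exponentials
    exp = np.exp(z)
    return exp / exp.sum()

class_names = ["negative", "neutral", "positive"]  # LabelEncoder order
logits = [2.0, 1.0, 0.1]                           # hypothetical model output

probs = softmax(logits)
best = int(np.argmax(probs))
print(class_names[best], round(float(probs[best]), 3))  # negative 0.659
```

In the notebook the same could be done with `tf.nn.softmax(logits, axis=1)` inside `predict_sentiment`.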


Let's look at the model's errors

In [22]:
y_pred = model.predict(val_encodings, batch_size=4).logits



In [23]:
predicted_labels = np.argmax(y_pred, axis=1)
true_labels = val_labels

wrong_idx = np.where(predicted_labels != true_labels)[0]
print(f"Accuracy on test {1 - len(wrong_idx)/len(y_pred)}")

Accuracy on test 0.8
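
A single accuracy number hides which classes get confused with each other. A confusion matrix makes that visible; the sketch below builds one with NumPy on made-up labels (the arrays are hypothetical, not the notebook's predictions).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

# Hypothetical labels for illustration (0=negative, 1=neutral, 2=positive)
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
accuracy = np.trace(cm) / cm.sum()        # correct predictions lie on the diagonal
recall = cm.diagonal() / cm.sum(axis=1)   # per-class share of true examples recovered

print(cm)
print(accuracy)  # 0.625
print(recall.round(3))
```

In the notebook, `true_labels` and `predicted_labels` would take the place of `y_true` and `y_pred`.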


In [24]:
wrong_texts = [val_texts[i] for i in wrong_idx]
wrong_true = [true_labels[i] for i in wrong_idx]
wrong_pred = [predicted_labels[i] for i in wrong_idx]

wrong_true_labels = label_encoder.inverse_transform(wrong_true)
wrong_pred_labels = label_encoder.inverse_transform(wrong_pred)

# Build the table of misclassified examples
df_wrong = pd.DataFrame({
    "Text": wrong_texts,
    "True Label": wrong_true_labels,
    "Predicted Label": wrong_pred_labels
})

# Show the first 10 errors
pd.set_option('display.max_colwidth', 150)
df_wrong.head(10)


Unnamed: 0,Text,True Label,Predicted Label
0,"'there are people and then there are pencils' some are sharp, some are not and some can be sharpened my pencil philosophy.....",neutral,negative
1,"Screw the reviews, I thought Wolverine was awesome. But not enough Dominic Monaghan for my liking.",positive,negative
2,I saw amazing heeels. But they were too big,neutral,negative
3,How looks like our company new logo?,positive,neutral
4,is home alone.. Doing hw,neutral,negative
5,Here`s a brief preview: OMG James is creepy in that role! I`m scared of him,negative,positive
6,"If u do, please pray 4 me. Lord knows I need it.",neutral,negative
7,"doing pretty well, up and wide awake",positive,neutral
8,"i want to wake up early, and get a coffee tomorrow (today) ! it`s going to be a busyy day! but have to keep writing.. booo whoo!",neutral,positive
9,can`t school just be done already? it hurts too much... seeing him every day,positive,negative
