# Laboratory Work No. 6 "Predicting Movie Success from Reviews"
### Completed by Пьянова Анна Олеговна, student of group БВТ2101

### Goal:
Predict the success of movies from their reviews (Predict Sentiment From Movie Reviews)
### Tasks:
- Become familiar with the classification task
- Study ways of representing text as input to an artificial neural network (ANN)
- Achieve a prediction accuracy of at least 95%


### Implementation

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras import models, layers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import imdb

Loading the data

In [2]:
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
data = np.concatenate((training_data, testing_data), axis=0)
targets = np.concatenate((training_targets, testing_targets), axis=0)

Exploring the data

In [3]:
print("Categories:", np.unique(targets))
print("Number of unique words:", len(np.unique(np.hstack(data))))
length = [len(i) for i in data]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))

Categories: [0 1]
Number of unique words: 9998
Average Review length: 234.75892
Standard Deviation: 173


In [4]:
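print("Label:", targets[0])
print(data[0])
index = imdb.get_word_index()
reverse_index = {value: key for key, value in index.items()}
# Word indices in the dataset are offset by 3: 0, 1 and 2 are reserved
# for padding, start of sequence and unknown words, so they decode to "#".
decoded = " ".join(reverse_index.get(i - 3, "#") for i in data[0])
print(decoded)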


Label: 1
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
# this film was just brilliant casting location sce
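
The `i - 3` offset used when decoding comes from the defaults of `imdb.load_data`: index 1 marks the start of a sequence, index 2 replaces out-of-vocabulary words, and every real word is stored as its frequency rank plus 3. A minimal sketch of that round trip, using a small made-up vocabulary (`toy_index`, `encode` and `decode` are hypothetical names for illustration, not part of Keras):

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

INDEX_FROM = 3      # real words are shifted by this offset
START, OOV = 1, 2   # reserved ids: start of sequence, out-of-vocabulary word


def encode(text):
    """Map a text to ids: start token first, then rank + 3 for each known word."""
    ids = [START]
    for word in text.lower().split():
        rank = toy_index.get(word)
        ids.append(rank + INDEX_FROM if rank is not None else OOV)
    return ids


def decode(ids):
    """Invert encode(); reserved ids have no word and are shown as '#'."""
    reverse = {rank: word for word, rank in toy_index.items()}
    return " ".join(reverse.get(i - INDEX_FROM, "#") for i in ids)


ids = encode("The film was truly brilliant")
print(ids)          # [1, 4, 5, 6, 2, 7]
print(decode(ids))  # # the film was # brilliant
```

The leading `#` in the decoded review above is this start token, and any later `#` stands for a word outside the 10000-word vocabulary.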

Preparing the data

In [5]:
def vectorize(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        sequence = [word for word in sequence if word < dimension]
        results[i, sequence] = 1
    return results

data1 = vectorize(data)
targets = np.array(targets).astype("float32")

# Hold out the first 10000 reviews for validation and train on the rest
test_x = data1[:10000]
test_y = targets[:10000]
train_x = data1[10000:]
train_y = targets[10000:]
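
To make the effect of `vectorize` concrete, here is a small sketch that applies the same function to three made-up sequences with a tiny `dimension` of 8. The toy sequences are assumptions chosen for illustration; the function body is the one from the cell above.

```python
import numpy as np


def vectorize(sequences, dimension=10000):
    # Multi-hot encoding: column j of row i is 1 if word index j occurs in review i
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        sequence = [word for word in sequence if word < dimension]
        results[i, sequence] = 1
    return results


# Made-up reviews: the first two contain the same words in a different order
# and with different repetition; the third contains an index outside the range.
toy = [[1, 4, 4, 7], [7, 4, 1], [2, 9]]
encoded = vectorize(toy, dimension=8)
print(encoded)
```

The first two rows come out identical, which shows that this representation discards both word order and word counts, and index 9 in the third sequence is silently dropped because it does not fit into 8 columns.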

Building and training the model (representation vector size = 10000)

In [6]:
model = models.Sequential()

# Input layer: one multi-hot vector of length 10000 per review
model.add(layers.Input(shape=(10000,)))
# Hidden layers with dropout between them
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(50, activation="relu"))
# Output layer: probability that the review is positive (label 1)
model.add(layers.Dense(1, activation="sigmoid"))

model.summary()
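
The output of `model.summary()` is not preserved here, but the number of trainable parameters can be worked out by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, and `Dropout` layers have no parameters. A short sketch of that arithmetic (`dense_params` is a hypothetical helper, not a Keras function):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input size followed by the width of each Dense layer in the model above
layer_sizes = [10000, 50, 50, 50, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [500050, 2550, 2550, 51]
print(sum(per_layer))  # 505201
```

Almost all parameters sit in the first `Dense` layer, so the size of the input vector dominates the size of the model.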

In [7]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

results = model.fit(train_x, train_y, epochs=2, batch_size=500, validation_data=(test_x, test_y))

# Mean validation accuracy over the two epochs
print(np.mean(results.history["val_accuracy"]))

Epoch 1/2
80/80 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.7374 - loss: 0.5288 - val_accuracy: 0.8966 - val_loss: 0.2599
Epoch 2/2
80/80 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9181 - loss: 0.2161 - val_accuracy: 0.8961 - val_loss: 0.2613
0.8963499963283539
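
The numbers in this log can be checked with a little arithmetic. The sketch below assumes the standard IMDB split of 25000 training and 25000 test reviews (50000 after concatenation) and uses the rounded accuracies printed in the log, so the mean agrees with the printed value only up to rounding.

```python
import math

total_reviews = 50000   # assumed: 25000 training + 25000 test reviews, concatenated
held_out = 10000        # the first 10000 rows serve as validation data
batch_size = 500

# Number of batches per epoch, shown as "80/80" in the progress bar
train_size = total_reviews - held_out
steps_per_epoch = math.ceil(train_size / batch_size)
print(steps_per_epoch)

# Mean of the per-epoch validation accuracies, as printed after training
val_accuracy = [0.8966, 0.8961]
mean_val_accuracy = sum(val_accuracy) / len(val_accuracy)
print(round(mean_val_accuracy, 5))
```

Validation accuracy does not improve from the first epoch to the second while training accuracy keeps rising, so the averaged value is close to what either epoch reports on its own.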


Building and training the model (representation vector size = 5000)

In [22]:
data2 = vectorize(data, 5000)
targets = np.array(targets).astype("float32")

test_x = data2[:10000]
test_y = targets[:10000]
train_x = data2[10000:]
train_y = targets[10000:]

In [23]:
model = models.Sequential()

# Input layer: one multi-hot vector of length 5000 per review
model.add(layers.Input(shape=(5000,)))
# Hidden layers with dropout between them
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(50, activation="relu"))
# Output layer: probability that the review is positive (label 1)
model.add(layers.Dense(1, activation="sigmoid"))

model.summary()

In [24]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

results = model.fit(train_x, train_y, epochs=2, batch_size=500, validation_data=(test_x, test_y))

# Mean validation accuracy over the two epochs
print(np.mean(results.history["val_accuracy"]))

Epoch 1/2
80/80 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - accuracy: 0.7104 - loss: 0.5453 - val_accuracy: 0.8856 - val_loss: 0.2762
Epoch 2/2
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9074 - loss: 0.2410 - val_accuracy: 0.8889 - val_loss: 0.2642
0.8872499763965607


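Building and training the model (representation vector size = 13000)

In [25]:
# The data was loaded with num_words=10000, so no word index reaches 10000:
# the 3000 extra columns are always zero and carry no additional information.
data3 = vectorize(data, 13000)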
targets = np.array(targets).astype("float32")

test_x = data3[:10000]
test_y = targets[:10000]
train_x = data3[10000:]
train_y = targets[10000:]

In [26]:
model = models.Sequential()

# Input layer: one multi-hot vector of length 13000 per review
model.add(layers.Input(shape=(13000,)))
# Hidden layers with dropout between them
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(50, activation="relu"))
# Output layer: probability that the review is positive (label 1)
model.add(layers.Dense(1, activation="sigmoid"))

model.summary()

In [27]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

results = model.fit(train_x, train_y, epochs=2, batch_size=500, validation_data=(test_x, test_y))

# Mean validation accuracy over the two epochs
print(np.mean(results.history["val_accuracy"]))

Epoch 1/2
80/80 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - accuracy: 0.7378 - loss: 0.5120 - val_accuracy: 0.8953 - val_loss: 0.2594
Epoch 2/2
80/80 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - accuracy: 0.9214 - loss: 0.2089 - val_accuracy: 0.8952 - val_loss: 0.2643
0.895249992609024
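
The three runs can be put side by side with a few lines of Python. The sketch below only re-uses the rounded per-epoch validation accuracies from the logs above; the `runs` dictionary and the `TARGET` constant (the 95% goal from the task list) are names introduced here for illustration.

```python
# Vector size -> validation accuracy per epoch, copied from the training logs
runs = {
    5000: [0.8856, 0.8889],
    10000: [0.8966, 0.8961],
    13000: [0.8953, 0.8952],
}
TARGET = 0.95  # accuracy goal stated in the tasks

# Mean validation accuracy per vector size
means = {dim: sum(acc) / len(acc) for dim, acc in runs.items()}
for dim in sorted(means):
    status = "reached" if means[dim] >= TARGET else "below target"
    print(f"{dim:>6}: {means[dim]:.4f} ({status})")

best_dim = max(means, key=means.get)
print("Best vector size:", best_dim)
```

In these runs the 10000-wide representation scores highest, the 5000-wide one is about one percentage point lower, and all three stay below the 95% target with this architecture and two epochs of training.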


In [11]:
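def predict_sentiment(text):
    # Encode the text the way imdb.load_data does: start token 1, then
    # word rank + 3, and 2 for words outside the 10000-word vocabulary.
    sequence = [1]
    for word in text.lower().split():
        rank = index.get(word.strip(".,!?;:\"()"))
        sequence.append(rank + 3 if rank is not None and rank + 3 < 10000 else 2)
    # The vector length must match the input size of the current model
    vector = vectorize([sequence], dimension=model.input_shape[-1])

    # predict returns an array of shape (1, 1); take the scalar probability
    probability = float(model.predict(vector)[0][0])

    sentiment = "positive" if probability > 0.5 else "negative"
    confidence = probability if sentiment == "positive" else 1 - probability

    return sentiment, confidence, text

user_text = input("Your text: ")
sentiment, confidence, text = predict_sentiment(user_text)
print(f"{text}\n Sentiment: {sentiment} ({confidence:.2f})")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step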
This is movie is messed up and I'll give it just one star since maybe at least some messed up people can actually enjoy this film, it has at least SOME audience appeal. And for the record, I have no issues with BDSM or people participating in it, that's not the most screwed up thing I'm talking about! Beyond that, from a non soft and hard-core porn lens and movie-making lens, this STINKS! This main character has no appeal at all and I feel so sorry for her. She's as mentally ill as the man taking advantage of her and manipulating her, who is also a horrible character and human being overall. In fact, he COERCES her and she falls for it each step of the way! He's sick and amoral and she's just plain sad and pathetic with barely any self-esteem except maybe a little at the very end. I cannot empathize with either character which makes it non-investable and the script is god-awful too! Barely ANY redeeming qualities.
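
How the input text is split into words matters for the lookup in `index`: a token such as `STINKS!` will not be found if the vocabulary keys are lowercase words without trailing punctuation, which is assumed here and not verified against the actual word index. A minimal sketch of a regex-based tokenizer under that assumption (`tokenize` is a hypothetical helper, and `sample` is shortened from the review above):

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sample = "Beyond that, this STINKS! Barely ANY redeeming qualities."

print(sample.split())    # whitespace split keeps capitals and punctuation
print(tokenize(sample))  # normalized tokens for the vocabulary lookup
```

Words that still miss the vocabulary after normalization should map to the out-of-vocabulary index 2, so that the input is encoded the same way as the training data.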
