# ARD - project
## *Determining the universe from a character's sentence*

**Author**: Brenda Lesniczakova, LES0045 <br>
**Datasets**: Scripts from Star Wars, Rick and Morty, Harry Potter and Lord of the Rings

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import stopwords
from nltk.stem import wordnet
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

from tensorflow import string as tf_string
from tensorflow import keras
from keras.models import Model
from keras.layers.experimental.preprocessing import TextVectorization
from keras.layers import Input, Embedding, Dropout, Dense, Flatten, GRU, LSTM, Bidirectional
from keras.layers import Conv2D, MaxPool2D, Reshape, Concatenate #, CuDNNLSTM, CuDNNGRU
from tensorflow.compat.v1.keras.layers import CuDNNGRU, CuDNNLSTM
from keras.callbacks import EarlyStopping
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# All data import
The data comes as CSV files (Harry Potter, Lord of the Rings, Rick and Morty) and whitespace-delimited text files (Star Wars). Every dataset contains character names and their dialogues.

In [None]:
folder = '/kaggle/input'
for dirname, _, filenames in os.walk(folder):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Star Wars: script of original trilogy (episodes 4-6)
Source: https://www.kaggle.com/xvivancos/star-wars-movie-scripts

In [None]:
swIV = pd.read_csv(folder + '/star-wars-movie-scripts/SW_EpisodeIV.txt', 
                   delim_whitespace=True, usecols=[1,2], names=['name', 'txt'], skiprows=[0])
swV = pd.read_csv(folder + '/star-wars-movie-scripts/SW_EpisodeV.txt', 
                  delim_whitespace=True, usecols=[1,2], names=['name', 'txt'], skiprows=[0])
swVI = pd.read_csv(folder + '/star-wars-movie-scripts/SW_EpisodeVI.txt', 
                   delim_whitespace=True, usecols=[1,2], names=['name', 'txt'], skiprows=[0])
sw = pd.concat([swIV, swV, swVI], ignore_index=True)
sw['universe'] = 'Star Wars'
sw

## Harry Potter: script of first three movies
Source: https://www.kaggle.com/gulsahdemiryurek/harry-potter-dataset

In [None]:
hp1 = pd.read_csv(folder + '/harry-potter-dataset/Harry Potter 1.csv', 
                  sep=';', encoding='ISO-8859-1', names=['name', 'txt'], skiprows=[0])
hp2 = pd.read_csv(folder + '/harry-potter-dataset/Harry Potter 2.csv', 
                  sep=';', encoding='ISO-8859-1', names=['name', 'txt'], skiprows=[0])
hp3 = pd.read_csv(folder + '/harry-potter-dataset/Harry Potter 3.csv', 
                  sep=';', encoding='ISO-8859-1', names=['name', 'txt'], skiprows=[0])
hp = pd.concat([hp1, hp2, hp3], ignore_index=True)
hp['universe'] = 'Harry Potter'
hp

## Lord of the Rings: script of all three movies
Source: https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data

In [None]:
lr = pd.read_csv(folder + '/lord-of-the-rings-data/lotr_scripts.csv', 
                 usecols=[1,2], names=['name', 'txt'], skiprows=[0])
lr['universe'] = 'Lord of the Rings'
lr

## Rick and Morty: script of first three seasons
Source: https://www.kaggle.com/andradaolteanu/rickmorty-scripts

In [None]:
rm = pd.read_csv(folder + '/rickmorty-scripts/RickAndMortyScripts.csv', 
                 usecols=[4,5], names=['name', 'txt'], skiprows=[0])
rm['universe'] = 'Rick and Morty'
rm

# Dataset

In [None]:
data = pd.concat([sw, hp, lr, rm], ignore_index=True)
data

In [None]:
label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data.universe)
universes = data.groupby(['universe', 'label']).count().reset_index()
universes.loc[universes.universe == 'Harry Potter', 'name'] = hp.name.value_counts().size
universes.loc[universes.universe == 'Lord of the Rings', 'name'] = lr.name.value_counts().size
universes.loc[universes.universe == 'Rick and Morty', 'name'] = rm.name.value_counts().size
universes.loc[universes.universe == 'Star Wars', 'name'] = sw.name.value_counts().size
universes.columns = ['universe', 'label', 'cnt_characters', 'cnt_dialogues']
universes

In [None]:
print('The dataset contains', data.txt.size, 'dialogues between', universes.cnt_characters.sum(), 
      'characters from', universes.label.size, 'universes:\n')
for index, row in universes.iterrows():
    print('\t', row.universe, '\n\t\tCount of characters:', row.cnt_characters, 
          '\n\t\tCount of dialogues:', row.cnt_dialogues, '\n')

In [None]:
data.txt.isna().value_counts()

### *There is one missing value in this dataset. The record containing it is removed before further processing.*

In [None]:
data.dropna(inplace=True)
data['clean_txt'] = data['txt'].apply(lambda x: re.sub(r'[^A-Za-z]+', ' ', x))
data['clean_txt'] = data['clean_txt'].apply(lambda x: x.lower())
data['clean_txt'] = data['clean_txt'].apply(lambda x: x.strip())

stop_words = stopwords.words('english')
data['clean_txt'] = data['clean_txt'].apply(lambda x: ' '.join([words for words in x.split() 
                                                                if words not in stop_words]))
lem = wordnet.WordNetLemmatizer()
data['clean_txt'] = data['clean_txt'].apply(lambda x: ' '.join([lem.lemmatize(item, pos='v') 
                                                                for item in x.split()]))
data.head()

### *Some dialogues consist only of stop words, so they are empty after cleaning. These records are removed.*
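
A minimal sketch of how a dialogue can become empty during cleaning, using only the standard library. The stop-word set is a tiny hypothetical stand-in for NLTK's English list, `clean_text` is an illustrative helper that is not part of the notebook, and the lemmatization step is left out.

```python
import re

# Hypothetical, heavily shortened stand-in for stopwords.words('english').
STOP_WORDS = {'i', 'am', 'you', 'are', 'it', 'is', 'the', 'a', 'no', 'not', 'do'}

def clean_text(text):
    """Keep letters only, lowercase, strip, then drop stop words."""
    text = re.sub(r'[^A-Za-z]+', ' ', text).lower().strip()
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

print(repr(clean_text("Use the Force, Luke!")))  # 'use force luke'
print(repr(clean_text("No, I am not.")))         # ''
```

The second sentence contains nothing but stop words, so its cleaned form is the empty string that the next cell filters out.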

In [None]:
del_lines = data.loc[data.clean_txt == '']
data = data.loc[data.clean_txt != '']
del_lines.head()

# Data summary
*Number of characters and dialogues per universe, plus the number of dialogues left after cleaning.*

In [None]:
universes['cnt_cleaned'] = data.label.value_counts()
universes

### *The Harry Potter script is somewhat larger than the others, but tests with fewer records (using only one or two movie scripts) gave worse results.*

In [None]:
fig = plt.figure(figsize = (12,5))
ax = fig.add_subplot(111)
sns.countplot(x='universe', data=data)  # keyword arguments work across seaborn versions
plt.xlabel('Universe', size = 15)
plt.ylabel('Count', size= 15)
plt.xticks(size = 12)
plt.title("Count of dialogues by universe" , size = 20)
plt.show()

In [None]:
data['length'] = data['txt'].apply(lambda x: len(x))

idx, row, col = 0, 2, 2
fig, axes = plt.subplots(row, col, figsize=(15, 8))
fig.suptitle('Length of texts', size=20)
colors = ['orange', 'green', 'brown', 'lightblue']
for r in range(row):
    for c in range(col):
        data.loc[data['label'] == idx].hist(ax=axes[r,c], column='length', by='universe', 
                                            bins=50, xrot=0, color=colors[idx])
        idx += 1

In [None]:
plt.figure(figsize=(12,5))
for col in data.universe.unique():
    ax = sns.distplot(data.loc[(data['universe'] == col) & (data['length'] < 350), 'length'], kde = False)
    
ax.legend(data.universe.unique())
ax.set_title('Distribution of length of texts', fontsize = 20)
ax.set_xlabel('length')
sns.despine(left = True)
ax.set_ylabel('count')
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('Length of texts', size=20)

sns.boxplot(ax=axes[0], x='universe', y='length', data=data, palette='coolwarm')
axes[0].set_title('with outliers', size=15)
axes[0].tick_params('x', rotation=15)

sns.boxplot(ax=axes[1], x='universe', y='length', data=data, palette='coolwarm', showfliers=False)
axes[1].set_title('no outliers', size=15)
axes[1].tick_params('x', rotation=15)

### *Dialogue length is measured in characters and usually falls between 25 and 70. Dialogues in the Harry Potter universe are considerably shorter than the others. The Rick and Morty series contains the longest dialogues, though usually only slightly longer. On the other hand, a few of its texts exceed a thousand characters.*

# Splitting data into training, validation and test sets

In [None]:
X = data.clean_txt
y = data.label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1,
                                                      random_state=13, stratify=y_train)
print('Train:', X_train.shape, y_train.shape)
print('Test:', X_test.shape, y_test.shape)
print('Validation:', X_valid.shape, y_valid.shape)

y_train_vect = to_categorical(y_train)
y_valid_vect = to_categorical(y_valid)
print('\nEncoding labels example:')
for i in range(5):
    print('  ', list(y_train)[i], '  ', y_train_vect[i])
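
The two `train_test_split` calls are chained, so the validation share is 10% of the remaining 80%, not 10% of the whole dataset. The sketch below works through that arithmetic and shows what one row of `to_categorical` looks like. `split_fractions` and `one_hot` are hypothetical helpers written in plain Python for illustration only.

```python
def split_fractions(test_size, valid_size):
    """Fractions of the full dataset after two chained splits."""
    train_and_valid = 1.0 - test_size
    valid = train_and_valid * valid_size
    train = train_and_valid - valid
    return train, valid, test_size

def one_hot(label, num_classes):
    """Plain-Python equivalent of one row produced by to_categorical."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

train, valid, test = split_fractions(test_size=0.2, valid_size=0.1)
print(f'train={train:.2f}, valid={valid:.2f}, test={test:.2f}')
print(one_hot(2, 4))  # [0.0, 0.0, 1.0, 0.0]
```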

# Vocabulary: TextVectorization

In [None]:
# Count how often each cleaned word occurs across all dialogues.
word_freq = {}
for txt in data.clean_txt:
    for word in txt.split():
        # Use the word itself as the key (not the bound method `word.index`).
        word_freq[word] = word_freq.get(word, 0) + 1
print('Count of unique words:', len(word_freq))

In [None]:
embedding_dim = 128 
vocab_size = len(word_freq)
sequence_length = 64 
vect_layer = TextVectorization(max_tokens=vocab_size, output_mode='int',
                               output_sequence_length=sequence_length)
vect_layer.adapt(data.clean_txt.values)

print('Vocabulary example: ', vect_layer.get_vocabulary()[:10])
print('Vocabulary shape: ', len(vect_layer.get_vocabulary()))
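
As a rough illustration of integer-mode vectorization, the sketch below imitates the idea in plain Python: a frequency-ordered vocabulary, index 0 reserved for padding, index 1 for unknown tokens, and a fixed output length. `build_vocab` and `vectorize` are hypothetical helpers. The real `TextVectorization` layer also standardizes text and has its own tie-breaking, which is not reproduced here.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; slots 0 and 1 are reserved."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def vectorize(text, vocab, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ['use force luke', 'force strong one', 'get schwifty']
vocab = build_vocab(texts, max_tokens=6)
print(vocab)
print(vectorize('force wizard', vocab, sequence_length=4))  # [2, 1, 0, 0]
```

Because two slots are reserved, a limit of `max_tokens` leaves room for `max_tokens - 2` actual words.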

# Model

In [None]:
input_layer = Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = Embedding(vocab_size, embedding_dim)(x_v)
x = LSTM(64, activation='relu', return_sequences=True)(emb)
x = GRU(64, activation='relu', return_sequences=True)(x)
x = Flatten()(x)
x = Dense(64, 'relu')(x)
x = Dense(32, 'relu')(x)
x = Dropout(0.2)(x)
output_layer = Dense(4, 'softmax')(x)

model = Model(input_layer, output_layer)
model.summary()
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
es = EarlyStopping(monitor='val_loss', min_delta=0, patience=70, restore_best_weights=True)
batch_size = 512
epochs = 50
history = model.fit(X_train.values, y_train_vect, validation_data=(X_valid.values, y_valid_vect), 
                    callbacks=[es], epochs=epochs, batch_size=batch_size)
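
A note on the callback settings: with `patience=70` and `epochs=50`, the early-stopping condition can never be met, so training always runs for all 50 epochs. The sketch below is a simplified, plain-Python version of the patience rule applied to a made-up validation-loss curve. `stop_epoch` is a hypothetical helper, not the Keras implementation.

```python
def stop_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up curve: improves for 10 epochs, then plateaus for 40 more.
losses = [1.0 - 0.05 * i for i in range(10)] + [0.6] * 40
print(stop_epoch(losses, patience=5))   # 15
print(stop_epoch(losses, patience=70))  # None
```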

# Graph of history

In [None]:
def show_history(history):
    plt.figure(figsize=(12,5))
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

In [None]:
show_history(history)

# Classification report

In [None]:
def class_report(y_test, y_pred_vect):
    y_pred = np.argmax(y_pred_vect, axis=1)
    test_accuracy = np.sum(y_pred == y_test.values) / y_test.size
    print('Test accuracy:', test_accuracy)
    print('Accuracy score: ', accuracy_score(y_test, y_pred))
    print('F1 score: ', f1_score(y_test, y_pred, average='macro'), '\n')
    print(classification_report(y_true=y_test, y_pred=y_pred))

    conf_mtx = confusion_matrix(y_test, y_pred)
    df_conf_mtx = pd.DataFrame(conf_mtx, index=universes.universe, columns=universes.universe)
    plt.figure(figsize=(12,5))
    sns.heatmap(df_conf_mtx, fmt='d', annot=True, cmap='Blues')
    plt.xlabel('Predicted label', size = 15)
    plt.ylabel('True label', size= 15)
    plt.title('Confusion matrix', size=20)
    plt.show()
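
For reference, accuracy and the macro-averaged F1 score can both be derived from the confusion matrix. The plain-Python sketch below shows the arithmetic on a made-up two-class matrix. The numbers are illustrative and are not results from this notebook, and `accuracy` and `macro_f1` are hypothetical helpers.

```python
def accuracy(conf_mtx):
    """Share of samples on the diagonal (rows: true, columns: predicted)."""
    total = sum(sum(row) for row in conf_mtx)
    return sum(conf_mtx[k][k] for k in range(len(conf_mtx))) / total

def macro_f1(conf_mtx):
    """Unweighted mean of the per-class F1 scores."""
    n = len(conf_mtx)
    scores = []
    for k in range(n):
        tp = conf_mtx[k][k]
        fp = sum(conf_mtx[r][k] for r in range(n)) - tp
        fn = sum(conf_mtx[k]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

# Made-up example matrix.
example = [[8, 2],
           [1, 9]]
print(accuracy(example))            # 0.85
print(round(macro_f1(example), 4))  # 0.8496
```

Macro averaging weights every universe equally, so it is more sensitive than accuracy to weak performance on a smaller class.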

In [None]:
class_report(y_test, model.predict(X_test.values))

# Model - Bidirectional layer

In [None]:
input_layer = Input(shape=(1,), dtype=tf_string)
x_v = vect_layer(input_layer)
emb = Embedding(vocab_size, embedding_dim)(x_v)
x = Bidirectional(LSTM(128, return_sequences=True))(emb)
x = Dropout(0.5)(x)
x = Bidirectional(LSTM(64))(x)
x = Dropout(0.5)(x)
x = Dense(32, 'relu')(x)
x = Dropout(0.5)(x)
output_layer = Dense(4, 'softmax')(x)

model_bdirect = Model(input_layer, output_layer)
model_bdirect.summary()
model_bdirect.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
es = EarlyStopping(monitor='val_loss', min_delta=0, patience=70, restore_best_weights=True)
batch_size = 512
epochs = 50
history_bdirect = model_bdirect.fit(X_train.values, y_train_vect, 
                                    validation_data=(X_valid.values, y_valid_vect), 
                                    callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
show_history(history_bdirect)

In [None]:
class_report(y_test, model_bdirect.predict(X_test.values))

# Vocabulary: Tokenizer

In [None]:
tokenizer_keras = Tokenizer(oov_token = "<OOV>")
tokenizer_keras.fit_on_texts(data.clean_txt)
word_index = tokenizer_keras.word_index
vocab_token_size = len(word_index)
print('Vocabulary shape:', vocab_token_size)
list(word_index.items())[:10]

In [None]:
def prepare_data(X, tokenizer, max_len):
    sequences = tokenizer.texts_to_sequences(X)
    padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
    return padded
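
The Keras `Tokenizer` numbers words from 1 (with the OOV token at index 1) and leaves index 0 free for padding, which is why the embedding layers further down use `vocab_token_size+1` rows. The sketch below shows the effect of `padding='post'` and `truncating='post'` in plain Python. `pad_post` is a hypothetical helper, not the Keras function.

```python
def pad_post(sequence, max_len, value=0):
    """Cut off the tail beyond max_len, then right-pad with `value`."""
    trimmed = sequence[:max_len]
    return trimmed + [value] * (max_len - len(trimmed))

print(pad_post([5, 3, 9], max_len=5))           # [5, 3, 9, 0, 0]
print(pad_post([5, 3, 9, 4, 7, 2], max_len=5))  # [5, 3, 9, 4, 7]
```

Judging by the length plots above, most dialogues contain far fewer tokens than the `max_len` used in the next cell, so most positions in each row are likely to be padding zeros.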

In [None]:
max_len = 512
X_train_vect = prepare_data(X_train, tokenizer_keras, max_len)
X_valid_vect = prepare_data(X_valid, tokenizer_keras, max_len)
X_test_vect = prepare_data(X_test, tokenizer_keras, max_len)

In [None]:
input_layer = Input(shape=(max_len,))
emb = Embedding(vocab_token_size+1, embedding_dim, trainable=False)(input_layer)
x = Bidirectional(LSTM(128, return_sequences=True))(emb)
x = Dropout(0.5)(x)
x = Bidirectional(LSTM(64))(x)
x = Dropout(0.5)(x)
x = Dense(32, 'relu')(x)
x = Dropout(0.5)(x)
output_layer = Dense(4, 'softmax')(x)

model_token = Model(input_layer, output_layer)
model_token.summary()
model_token.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
es = EarlyStopping(monitor='val_loss', min_delta=0, patience=70, restore_best_weights=True)
batch_size = 512
epochs = 50
history_token = model_token.fit(X_train_vect, y_train_vect, 
                                validation_data=(X_valid_vect, y_valid_vect), 
                                callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
show_history(history_token)

In [None]:
class_report(y_test, model_token.predict(X_test_vect))

# Embedding file: GloVe Dictionary
File **glove.840B.300d.pkl** was imported from https://www.kaggle.com/authman/pickled-glove840b300d-for-10sec-loading

In [None]:
glove_embeddings = np.load(folder + '/pickled-glove840b300d-for-10sec-loading/glove.840B.300d.pkl',
                           allow_pickle=True)
embedding_dim = len(glove_embeddings['the'])
print("There are", len(glove_embeddings), "words and", embedding_dim,  "dimensions in Glove Dictionary.")

In [None]:
embedding_mtx = np.zeros((vocab_token_size+1, embedding_dim))
for word, idx in word_index.items():
    if word in glove_embeddings:
        embedding_mtx[idx] = glove_embeddings[word]
        
tokenized = pd.DataFrame([word_index]).T.reset_index()
tokenized.columns = ['words','index']
temp_mtx = pd.DataFrame(embedding_mtx).reset_index()
temp_mtx = temp_mtx.drop(0, axis = 0)
df_embedding_mtx = pd.merge(tokenized, temp_mtx, on = 'index')
df_embedding_mtx
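
The lookup above leaves an all-zero row for every word that is missing from the pretrained dictionary. The sketch below reproduces the idea with plain Python lists and a made-up 3-dimensional dictionary, and also reports the share of words that were found. `build_embedding_matrix` and the toy data are hypothetical.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    found = 0
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
            found += 1
    return matrix, found / len(word_index)

# Toy data: a made-up "pretrained" dictionary with a single known word.
word_index = {'<OOV>': 1, 'force': 2, 'schwifty': 3}
vectors = {'force': [0.1, 0.2, 0.3]}
matrix, coverage = build_embedding_matrix(word_index, vectors, dim=3)
print(matrix)
print(f'coverage: {coverage:.2f}')  # coverage: 0.33
```

When the embedding layer is frozen, words with a zero row carry the same vector as the padding index, so checking coverage on the real vocabulary may be worthwhile.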

# Model with embedding file

In [None]:
input_layer = Input(shape=(max_len,))
emb = Embedding(vocab_token_size+1, embedding_dim, weights=[embedding_mtx], 
                trainable=False)(input_layer)
x = Bidirectional(LSTM(128, return_sequences=True))(emb)
x = Dropout(0.5)(x)
x = Bidirectional(LSTM(64))(x)
x = Dropout(0.5)(x)
x = Dense(32, 'relu')(x)
x = Dropout(0.5)(x)
output_layer = Dense(4, 'softmax')(x)

model_glove = Model(input_layer, output_layer)
model_glove.summary()
model_glove.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
es = EarlyStopping(monitor='val_loss', min_delta=0, patience=70, restore_best_weights=True)
batch_size = 512
epochs = 50
history_glove = model_glove.fit(X_train_vect, y_train_vect, 
                                validation_data = (X_valid_vect, y_valid_vect),
                                callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
show_history(history_glove)

In [None]:
class_report(y_test, model_glove.predict(X_test_vect))

# Model using CNN

In [None]:
filter_sizes = [1,2,3,5]
num_filters = 42

input_layer = Input(shape=(max_len,))
emb = Embedding(vocab_token_size+1, embedding_dim, weights=[embedding_mtx],
                trainable=False)(input_layer)
x = Reshape((max_len, embedding_dim, 1))(emb)

conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim),
                             kernel_initializer='he_normal', activation='tanh')(x)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim),
                             kernel_initializer='he_normal', activation='tanh')(x)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), 
                             kernel_initializer='he_normal', activation='tanh')(x)
conv_3 = Conv2D(num_filters, kernel_size=(filter_sizes[3], embedding_dim),
                             kernel_initializer='he_normal', activation='tanh')(x)
maxpool_0 = MaxPool2D(pool_size=(max_len - filter_sizes[0] + 1, 1))(conv_0)
maxpool_1 = MaxPool2D(pool_size=(max_len - filter_sizes[1] + 1, 1))(conv_1)
maxpool_2 = MaxPool2D(pool_size=(max_len - filter_sizes[2] + 1, 1))(conv_2)
maxpool_3 = MaxPool2D(pool_size=(max_len - filter_sizes[3] + 1, 1))(conv_3)

z = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2, maxpool_3])   
z = Flatten()(z)
z = Dropout(0.1)(z)
output_layer = Dense(4, 'softmax')(z)

model_cnn = Model(input_layer, output_layer)
model_cnn.summary()
model_cnn.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])
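
The shapes in this model follow from simple arithmetic: each kernel spans the full embedding width, and each pooling window spans the whole remaining sequence axis, so every filter contributes exactly one value. The sketch below checks this in plain Python, assuming stride-1 convolutions with 'valid' padding and the `max_len` of 512 set earlier. The helper names are hypothetical.

```python
def conv_output_len(input_len, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return input_len - kernel_size + 1

def pooled_len(input_len, pool_size):
    """Output length of non-overlapping max pooling (stride equals pool size)."""
    return input_len // pool_size

max_len, num_filters = 512, 42
filter_sizes = [1, 2, 3, 5]

for k in filter_sizes:
    conv_len = conv_output_len(max_len, k)
    # The pool window is as long as the convolution output, leaving one value.
    print(f'kernel {k}: conv length {conv_len}, pooled length {pooled_len(conv_len, conv_len)}')

flat_features = len(filter_sizes) * num_filters
print('features after Concatenate + Flatten:', flat_features)  # 168
```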

In [None]:
es = EarlyStopping(monitor='val_loss', min_delta=0, patience=70, restore_best_weights=True)
batch_size = 512
epochs = 50
history_cnn = model_cnn.fit(X_train_vect, y_train_vect, 
                            validation_data = (X_valid_vect, y_valid_vect),
                            callbacks=[es], epochs=epochs, batch_size=batch_size)

In [None]:
show_history(history_cnn)

In [None]:
class_report(y_test, model_cnn.predict(X_test_vect))

# Summary
The dataset was split into three parts: training (72%), validation (8%) and testing (20%). The models were trained on the values of the attribute **clean_txt**. The labels represent the universe of each dialogue; they were encoded as integers and then represented as categorical vectors. Five models were trained:
  1. base model with *LSTM* and *GRU* layers, using the *TextVectorization* vocabulary - accuracy: ~0.729, f1-score: ~0.711
  2. model with *Bidirectional*(*LSTM*) layers, also using the *TextVectorization* vocabulary - accuracy: ~0.724, f1-score: ~0.710
  3. model using the *Tokenizer* vocabulary, with much worse results - accuracy: ~0.531, f1-score: ~0.432
  4. model adding the **GloVe dictionary** to the tokenized vocabulary - accuracy: ~0.739, f1-score: ~0.725
  5. model using 2D convolution layers and max pooling - accuracy: ~0.737, f1-score: ~0.722

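In general, models 2, 3 and 4 share the same architecture; the main difference is the embedding layer used. Models 1 and 2 use the *TextVectorization* vocabulary. Model 3 uses the *Tokenizer* vocabulary with an embedding layer that is frozen (`trainable=False`) at its random initialization, which probably explains its much lower accuracy. Models 4 and 5 use the *Tokenizer* vocabulary with the **GloVe dictionary**.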

Models 1-4 use the *relu* activation function (model 5 uses *tanh* in its convolution layers), and every model ends with a *softmax* output layer because the label is categorical.

A batch size of 512 proved to be the best, and 50 epochs turned out to be the practical limit: with more epochs, saving and committing the notebook version took too long and was cancelled after 9 hours.