In this project we will develop a transformer architecture equipped only with an encoder structure to do the task of sentiment analysis. For this model we will use a dataset of 1500 customer feedbacks; the model reaches an accuracy of ~99% on the training dataset and ~90% on the testing dataset after approximately 500 iterations. Alternatively, we start training the model with a very small dataset of 100 customer feedbacks. The word tokens are mapped to GloVe word embeddings and the accuracy of the model reaches ~90% and ~92% on the training and testing datasets after 100 iterations respectively. 

In [1]:
# Loading the required packages: 
import tensorflow as tf
import time
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization, Layer,SimpleRNN
from tensorflow.keras.models import Sequential
from tensorflow import  reshape, shape, transpose

from transformers import DistilBertTokenizerFast #, TFDistilBertModel
from transformers import TFDistilBertForTokenClassification

import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split 
import matplotlib.pyplot as plt
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer,LancasterStemmer
import re

In [2]:
def pred(y): 
    '''
    This function maps the probabilities outputed by the model back to the rankings list 
    and outputs the ranking with the highest probability. 
    
    inputs: 
    y  (1,m)     : Probability output of the RNN model 
    
    outputs: 
    res (string) : The ranking corresponding to the most probable outcome. 
    
    '''
    y = y.tolist()
    #ranking = ['Below Average' , 'Average' , 'Above Average']
    res = ranking[y.index(max(y))]
    return(res)


In [191]:
#need to write another function that maps the correct output of the function to the rankings. 
def vec_output(y): 
    m = len(ranking)
    txt = y
    v = np.zeros(m) 
    j = ranking.index(txt)
    v[j] = 1
    return v 


In [192]:
#Loading the data: 
CustomerFeed = 'Canva_reviews.xlsx'
df = pd.read_excel(CustomerFeed)

print(df)

                                               reviewId            userName  \
0     gp:AOqpTOFxf3fttcT5DSvFIn9KPp5FErgH9yC533Fmoxv...      Donna Caritero   
1     gp:AOqpTOEq6rNIWLnPV4KFTctWvm0mpGEQljtD6mvy1H-...  Soumi Mukhopadhyay   
2     gp:AOqpTOE86hSyPRHZgYt28Uk5zGe4FZGb1hkmtFDiYJ2...   Theknown _unknown   
3     gp:AOqpTOHSuKkVTcM3QgCCKysHQlxEnk2ocOKsUMiMIJy...        Anthony Dean   
4     gp:AOqpTOEOrZt5H6jXPiplJyffCd5ZBnVXACTWgwNsF1R...   Neha Diana Wesley   
...                                                 ...                 ...   
1495  gp:AOqpTOHhnXMpylU3f-1V1KbR2hwWArOilxPlKI6K4xY...            Reen Ali   
1496  gp:AOqpTOEcz62DHS-amqTB5xGMhM4_R0UJpcv_HDNny9i...     Shaurya Chilwal   
1497  gp:AOqpTOFMqEqa_kpp29Q8wjcBmKUCAvOQGQx4KZQ8b83...           GK Gaming   
1498  gp:AOqpTOGY4z3pUxeiqGzn2ad3Noxqlbm-9DZ3ksHqD1_...    1203_Vani Sharma   
1499  gp:AOqpTOFVGZ0MXyR-Gv_d2cYf2KD709Hwple_u7OZE4y...           MeLLy EcK   

                                              userI

In [193]:
df = df[["review", "Sentiment"]]
df.head()

Unnamed: 0,review,Sentiment
0,Overall it's really an amazing app. I've been ...,Negative
1,Hey! Yes I gave a 5 star rating... coz I belie...,Positive
2,Canva used to be a good app! But recently I've...,Negative
3,"It's a brilliant app, but I have just one prob...",Negative
4,This was such a great app. I used to make BTS ...,Negative


In [194]:
def edit_txt(review):
    """
    This function receives a text and returns it edited as follows: 
    1, all words converted to lower case 
    2, integers removed
    3, tokenize the words 
    4, punctuation removed 
    5, common words that are unnecessary are removed. 
    """
    
    review_edited = []

    #Converting to lower case: 
    review_edited = review.lower() 
    
    #Removing integers: 
    pattern = r'[0-9]'
    # Match all digits in the string and replace them with an empty string
    review_edited = re.sub(pattern, '', review_edited) 

    #Tokenize the comment: 
    review_edited = word_tokenize(review_edited) 

    #Removing punctuation 
    tokenizer = RegexpTokenizer(r'\w+')
    review_edited = [''.join(tokenizer.tokenize(word)) for word in review_edited if len(tokenizer.tokenize(word))>0]

    #Removing common words: 
    #remove_list = stopwords.words('english') 
    #to_remove = [ "not",'don',"don't",'should',"should've", 'ain','aren',"aren't",'couldn',"couldn't",'didn',"didn't",'doesn',"doesn't",'hadn',"hadn't",'hasn',"hasn't",'haven',"haven't",'isn',"isn't",'mightn',"mightn't",'mustn',"mustn't",'needn',"needn't",'shan',"shan't",'shouldn',"shouldn't",'wasn',"wasn't",'weren',"weren't",'won',"won't",'wouldn', "wouldn't"]
 
    #review_edited = [word for word in review_edited if not word in remove_list]
    return(review_edited) 



In [195]:
# Defining the review dataset as x: 
x = df["review"] 
dfrank = df.iloc[:,1]

print(x[10])

y = df["Sentiment"].tolist()
ranking = np.unique(y)
ranking = ranking.tolist()
print(f"\nCorresponding ranking: {y[10]}\n")
print(f"Rankigns include {ranking}")


Really great editing app, its all around which makes it great. Has everything I need for basic editing. It makes editing easier because of premade tools and stickers, designs, etc. I gave it four stars only because of how slow it loads, especially at starting the app. It is pretty stressful, so you really gotta have patience at waiting for stuff to load.

Corresponding ranking: Positive

Rankigns include ['Negative', 'Positive']


In [196]:
#creating the dictionary: 
reviews_edited = [edit_txt(review) for review in x]
print(f"Comment before editting: {x[13]}")
print(f"Comment after editting: {reviews_edited[13]}")

Split = [] 
Dic = []
dictionary = np.unique([word for review in reviews_edited for word in review]).tolist()
print(dictionary[0:30])
len(dictionary)

Comment before editting: Unable to save my work. Nothing works :(
Comment after editting: ['unable', 'to', 'save', 'my', 'work', 'nothing', 'works']
['_', 'a', 'aa', 'aap', 'ability', 'able', 'about', 'above', 'absolutely', 'acc', 'accepted', 'access', 'accessibilities', 'accessible', 'accidentally', 'accoding', 'according', 'account', 'across', 'action', 'activity', 'actual', 'actually', 'ad', 'adaptable', 'add', 'added', 'adding', 'addition', 'address']


2317

In [197]:
# Load the word embeddigns:
embeddings_dict = {}
with open("glove.6B.50d.txt", 'r') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector

words =  list(embeddings_dict.keys())
vectors = [embeddings_dict[word] for word in words]

In [198]:
#dividing the dataset into 75% training set and 25% test set: 
x = x.to_list()
X_train, X_test, y_train, y_test = train_test_split(x,y, 
                                   random_state=104,  
                                   test_size=0.25,  
                                   shuffle=True) 


In [200]:
# Edit the text in the training and texting datasets: 
X_train = [edit_txt(comment) for comment in X_train]
X_test = [edit_txt(comment) for comment in X_test]

In [201]:
type(X_train[13])

list

In [88]:
def gvec_input(x,m,e): 

    "    This function takes any input, x, and returns a glove vector based on the \n",
    "    words introduced in the vocabulary (400,000 words). This function returns k vectors where k is the number of words in the \n",
    "    sentence. Every vector corresponds to a word in the dictionary and each entry will describe a feature of the word. \n",
    "    \n",
    "    inputs: \n",
    "    \n",
    "    x (string) : a statement from customers. \n",
    "    m (int)    : size of the sequence \n",
    "    e (int)    : size of the embeddings \n",
    "    outputs: \n",
    "    v (m,n)    : where m is the number of words in the sentence and n = 50 is the number of total features describing a word. \n",
    "\n",
    n = len(x)
    gv = np.zeros((n,m, e))

    for i in range(0, n): #looping over each comment 
        txt = x[i] #select the ith comment  
        txt = (txt[:m] if len(txt) > m else txt + ['<pad>'] * (m - len(txt))) #shorten or add extra padding
        for l in range(m): #looping over each word 
        
            # add the embedding of all ones for pads
            if txt[l] == "<pad>": 
                gv[i,l,:] = np.zeros(e) 
        
            # if a word is not is the list of Glove embeddings, then assign an array which is the average of all embeddings:  
            elif txt[l] not in words: 
                gv[i,l,:] = np.mean(vectors, axis = 0)
                # add the word embeddings: 
            else: 
                gv[i,l,:] = embeddings_dict[txt[l]]
    return(gv)

In [89]:
#converting x_train and x_test to word embeddings: 
m = 30
e = 50
X_trainmod = gvec_input(X_train,m,e)
X_testmod = gvec_input(X_test,m,e) 

In [90]:
print(X_trainmod[0])
print(X_trainmod.shape)
print(X_testmod.shape)

[[ 0.11891     0.15255    -0.082073   ... -0.57511997 -0.26671001
   0.92120999]
 [ 0.79238999  0.21864     0.68711001 ... -0.066753   -0.39660001
   0.74818999]
 [ 0.60307997 -0.32023999  0.088857   ... -0.25187001 -0.26879001
   0.36657   ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]
(1125, 30, 50)
(375, 30, 50)


In [91]:
# Map the y_training and y_testing datasets to Boolean 0, 1: 
y_trainmod = (np.array([vec_output(y) for y in y_train])).reshape(len(y_train), len(ranking))
y_testmod = (np.array([vec_output(y) for y in y_test])).reshape(len(y_test),len(ranking))
y_trainmod[0:5]

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])

In [93]:
# Calculate the angles for positional embeddings: 

def get_angles(pos, k, d):
    """
    Get the angles for the positional encoding
    
    Arguments:
        pos -- Column vector containing the positions [[0], [1], ...,[N-1]]
        k --   Row vector containing the dimension span [[0, 1, 2, ..., d-1]]
        d(integer) -- Encoding size
    
    Returns:
        angles -- (pos, d) numpy array 
    """
    
    # Get i from dimension span k
    i = k//2
    # Calculate the angles using pos, i and d
    angles = pos/ (10000)**(2*i/d)

    
    return angles
    
def pos_emb(len_seq,len_emb): 
    
    """
    This function creates the positional embeddings for all the words in the sequence based on: 
    
    Input: 
    len_seq (int) : The length of the sequences inputed into the model. 
    len_emb (int) : The length of the word embeddings for every word in the sequence. 

    Note: the size of the positional encoding and the word embeddings must match in order to add them in the next step. 

    Output: 
    res (np.array(len_seq, len_emb)) : ith row of this matrix represents the positional encodings for the ith position in the sequence. 

    """

    len_i = int(len_emb/2)

    # Initialize the matrix to save positional encodings: 
    res = np.zeros((len_seq,len_emb))
    angles = np.zeros((len_seq,len_emb))
    
    #for each position in the sequence 
    for pos in range(len_seq): 
        
        #calculate the angles: 
        for i in range(len_i): 
            angles[pos,2*i] = pos/(10000**(2*i/len_emb))
            angles[pos, 2*i +1] = pos/(10000**(2*i/len_emb)) 
        
        # Calculate the entries corresponding to each position 
        #for j in range(len_i): 
        res[pos, 0::2] = np.sin(angles[pos,0::2])
        res[pos,1::2] = np.cos(angles[pos,0::2])
            
    return(tf.cast(res.reshape(1,len_seq,len_emb), dtype=tf.float32))


In [94]:
# Create the positional embeddings: 
position_enc = pos_emb(X_trainmod.shape[1],X_trainmod.shape[2])
position_enc.shape

TensorShape([1, 30, 50])

In [95]:
# Add the positional encoding to the word embeddings: 
X_trainmod = X_trainmod + position_enc 
print(X_trainmod.shape)

X_testmod = X_testmod + position_enc 
X_testmod.shape

(1125, 30, 50)


TensorShape([375, 30, 50])

In [96]:
def self_attention(q,k,v):
    """
    let me define you later gorgeous! 
    
    """
    
    
    # Perform matrix multiplication on the last two dimensions
    dotqk = tf.matmul(q, k, transpose_b = True)

    dim_k = tf.cast(k.shape[-1],tf.float32)
    normalized_dotqk = dotqk/tf.math.sqrt(dim_k)
    
    #then add the masking if masking if given" 
    #if masking is not None: 
        #normalized_dotqk += (1 - masking)* (-1e9)
    
    attention_scores =  tf.nn.softmax(tf.cast(normalized_dotqk, dtype=tf.float32),axis = -1)
    res = tf.matmul(attention_scores,v) 
    
    return(res)
    

In [97]:
def FullFeedForward(n_1, emb_size):#the model must return vectors of the same size as the embeddings of the input so can be combined with decoder
    model = Sequential([
    Dense(n_1, activation='tanh', name="dense1"), #relu? (#samples, len_seq, n_1)
    Dense(emb_size, activation='tanh', name="dense2")# linear? (#samples, len_seq, emb_size)
])
    return(model)
    

In [98]:
# Define a reshape_tensor which will be later on used for the Multi-head attention: 

def reshape_tensor(q_matrix, heads, pre_attention): 
    """
    """
    
    #pre_attention, we'll need to reform into 4d 
    if pre_attention:

        dense_qre = reshape(q_matrix, (shape(q_matrix)[0], shape(q_matrix)[1], heads, -1))
        dense_qre = transpose(dense_qre, ([0, 2, 3, 1]))
        
        
    #post_attention, we'll need to revert back to 3d: 
    else: 
        q_matrix_transpose = transpose(q_matrix, ([0,3,1,2]))
        dense_qre = reshape(q_matrix_transpose, (shape(q_matrix_transpose)[0], shape(q_matrix_transpose)[1], -1)) 
        
        
    return(dense_qre)
        

In [99]:
class MultiHeadAttention(Layer): 

    def __init__(self, dim_kv, dim_q, len_emb, heads, **kwargs):
        
        super(MultiHeadAttention, self).__init__(**kwargs) 
        self.heads = heads
        self.denseq = Dense(units = dim_q)
        self.densek = Dense(units = dim_kv)
        self.densev = Dense(units = dim_kv) 
        self.dense = Dense(units = len_emb)
    
    def call(self,q,k,v, **kwargs): #by passing self, you passed all the attributes you've defined above. 
       
        # Define the query, key, and value matrices: 
        dense_q = self.denseq(q) # shape = (#samples, len_seq, dim_q)
        dense_k = self.densek(k) # shape = (#samples, len_seq, dim_k) 
        dense_v = self.densev(v) # shape = (#samples, len_seq, dim_v) 
        
        # Reshape: 
        dense_qre = reshape_tensor(dense_q, self.heads, pre_attention = True) #shape = (#samples, #heads, dim_q/heads, len_seq)
        dense_kre = reshape_tensor(dense_k, self.heads, pre_attention = True) #shape = (#samples, #heads, dim_k/heads, len_seq)
        dense_vre = reshape_tensor(dense_v, self.heads, pre_attention = True) #shape = (#samples, #heads, dim_v/heads, len_seq) 
        
        # Calculate the attention scores: 
        attention_scores = self_attention(dense_qre, dense_kre,dense_vre) #shape = (#samples, #heads, dim_q/heads, len_seq)
        
        # Revert the shape:
        attention_with_v = reshape_tensor(attention_scores, self.heads, pre_attention = False) #shape = (#samples, len_seq, dim_q)
        
        # Run through another dense and add to the initial x: 
        res = self.dense(attention_with_v)  # shape = (#samples, len_seq, d_model) 
        
        return(res)


Note that each Query, Key, and Value matrix will be divided into heads. More specifically, dim_q/heads and dim_kv/heads must still be integers for the model to work. 

In [155]:
# Check if it works: 
dim_kv = 40
dim_q = 40
len_emb = 50
heads = 4

masking = None

function = MultiHeadAttention(dim_kv, dim_q, len_emb, heads)
function(X_trainmod, X_trainmod,X_trainmod).shape

TensorShape([1125, 30, 50])

In [156]:
class Encoder(Layer):
    
    def __init__(self, dim_kv, dim_q, heads, fnn_neurons, len_emb, iter, len_rank, rnn_units, drop_rate):
        super(Encoder,self).__init__()
        self.len_emb = len_emb
        self.mha     = MultiHeadAttention(dim_kv, dim_q, len_emb, heads)
        self.norm    = LayerNormalization(epsilon = 1e-6)
        self.drop    = Dropout(rate = drop_rate)
        self.fnn     = FullFeedForward(fnn_neurons, len_emb)
        self.iter    = iter
        self.rnn     = SimpleRNN(units = rnn_units, return_sequences=False)
        self.dense  = Dense(units = len_rank, activation = 'softmax') 



    def call(self,x,training): 

        
        #len_seq = x.shape[1]
        
        # Add positional encodings: 
        #x += pos_emb(len_seq, self.len_emb)
        for _ in range(self.iter): 
            # Add dropout layer:
            drop_x = self.drop(x, training = training)
            # Calculate the attention scores: 
            mha_scores = self.mha(drop_x, drop_x, drop_x)

            # Add dropout and normalize: 
            dropout_1 = self.drop(mha_scores, training = training)
            norm_1  = self.norm(dropout_1 + x )
            
            #Run through a fully connected neural network: 
            fnn_output = self.fnn(norm_1) 
              
            # Add dropout:
            dropout_2 = self.drop(fnn_output, training = training)
               
            # Normalize: 
            x = self.norm(dropout_2 + norm_1)

        # Run through a dense layer to combine all the word embeddings of each word: 
        x = self.rnn(x)

        # Run through a dense layer activation function = 'softmax': 
        probs = self.dense(x)
        return probs


In [169]:
dim_kv = 40 
dim_q = 40 
len_emb = 50
heads = 4 
masking = None 
fnn_neurons = 30
drop_rate = 0.1
len_rank = len(ranking)
rnn_units = 20 
iter = 6 #based on the paper  

encoder = Encoder(dim_kv, dim_q, heads, fnn_neurons, len_emb, iter, len_rank,rnn_units, drop_rate = 0.1)
output_encoder = encoder(X_trainmod, training = True)
output_encoder.shape

TensorShape([1125, 2])

In [170]:
y_trainmod = y_trainmod.reshape(1125,2)
print(y_trainmod.shape)

(1125, 2)


In [171]:
inputs = tf.keras.Input(shape=(30, len_emb))
outputs = encoder(inputs, training=True)  # Assuming training=True for now
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()

In [172]:
from tensorflow.keras.optimizers import Adam
opt = Adam(0.002,beta_1 = 0.9, beta_2 = 0.999, decay = 0.01) 
model.compile(loss = "categorical_crossentropy", optimizer = opt, metrics = ["accuracy"])

In [176]:
model.fit(X_trainmod,y_trainmod, epochs=5, batch_size=500) #usually trained for 500 iterations reaches accuracy = 98.92% 

Epoch 1/5
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 104ms/step - accuracy: 0.9885 - loss: 0.0311
Epoch 2/5
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 108ms/step - accuracy: 0.9863 - loss: 0.0445
Epoch 3/5
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 107ms/step - accuracy: 0.9935 - loss: 0.0223
Epoch 4/5
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 129ms/step - accuracy: 0.9898 - loss: 0.0291
Epoch 5/5
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 104ms/step - accuracy: 0.9953 - loss: 0.0204


<keras.src.callbacks.history.History at 0x124c2a610>

In [163]:
y_testmod = y_testmod.reshape(375,2)
print(y_testmod.shape)

(375, 2)


In [177]:
# Evaluate the model on the testing set: 
model.evaluate(X_testmod, y_testmod)

[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.8965 - loss: 0.5702


[0.5557928085327148, 0.8960000276565552]

In [178]:
predictions = model.predict(X_testmod)
predictions = np.argmax(predictions, axis = -1)
output = [ranking[int(x)] for x in predictions]
for i in range(len(output)): 
    print(f"Comment: {x[i]}\n\nRanking: {y_test[i]}, prediction: {output[i]}\n\n")

[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 26ms/step
Comment: Overall it's really an amazing app. I've been using this for the past 5 years however I only have one issue though and I wanted this to get address since I think this issue had lasted for how many years? The texts were blurred and when you zoom it out it's pixelated. I thought this issue only occurs on mobile apps however it was also present on the website. Please fix this. I still remember the time when I can export high definition texts and I love that experience. Thank you!

Ranking: Positive, prediction: Positive


Comment: Hey! Yes I gave a 5 star rating... coz I believe it deserves it! I mostly use the desktop version and I am seriously so satisfied with this app in both android and desktop version. I just came here to thank the developers for this beautiful app and its facilities. I literally find almost everything that I need for and the best part is even without the premium feature it provides u

Conclusion: The model reaches ~98% accuracy on the training dataset and ~90% accuracy on the testing set with only 600 iterations taking less than 5 minutes to compute. As compared to a simple Recurrent Neural Network, in which the accuracy on the training dataset did not surpass 70%, this is a very good progress. Also note that in some cases when the model prediction is labeled as wrong, the comment itself is quite confusing and might not be the result of the model incapability to understand the task of sentiment analysis. A better approach would have been to introduce 3 labels instead of 2. 

### What if we have a very limited dataset? 

In [202]:
#Loading the data: 
CustomerFeed = 'CustomerFeedback.xlsx'
df = pd.read_excel(CustomerFeed)
print(df)

                                            Sentence         Ranking 
0   looks beautiful I am in love with this product .   Above Average 
1                               I really like this .   Above Average 
2         I like this but the design could be better         Average 
3                            I do not like the smell   Below Average 
4           Works well but the smell is too strong .         Average 
..                                                ...             ...
95                                     not satisfied   Below Average 
96                                     does not work   Below Average 
97                               does not smell good   Below Average 
98                          does not work for my son   Below Average 
99                                     Saves me time   Above Average 

[100 rows x 2 columns]


In [203]:
dffed = df.iloc[:,0]
x = dffed.to_numpy()
dfrank = df.iloc[:,1]

y = dfrank.to_numpy()
print(x[1:5])
print(y[:3])

['I really like this . ' 'I like this but the design could be better '
 'I do not like the smell ' 'Works well but the smell is too strong . ']
['Above Average ' 'Above Average ' 'Average ']


In [204]:
ranking = np.unique(y)
ranking = ranking.tolist()
ranking

['Above Average ', 'Average ', 'Below Average ']

In [205]:
lenx = len(x)
Split = [] 
Dic = []
for i in range(0,len(x)):
    split = x[i].split()
    for i in range(0,len(split)): 
        split[i] = split[i].lower()
    Dic.extend(np.unique(split))
dictionary = np.unique(Dic)#this is our new dictionary. 
dictionary = dictionary.tolist()

# Add the extra padding token: 
dictionary =  dictionary + ["<pad>"] 

# Print the resutls: 
dictionary[1:10]

['a', 'about', 'after', 'am', 'amazing', 'and', 'average', 'bad', 'be']

In [206]:
#dividing the dataset into 75% training set and 25% test set: 
x = x.tolist()
X_train, X_test, y_train, y_test = train_test_split(x,y, 
                                   random_state=104,  
                                   test_size=0.25,  
                                   shuffle=True) 
print(X_train[1:5])
print(y_train[1:5])

['looks amazing ', 'love this but design could be better ', 'I love the color ', 'good quality ']
['Above Average ' 'Average ' 'Above Average ' 'Average ']


In [208]:
# Edit the text in the training and texting datasets: 
X_train = [edit_txt(comment) for comment in X_train]
X_test = [edit_txt(comment) for comment in X_test]

In [218]:
m = 15
e = 50 
X_trainmod = gvec_input(X_train,m,e) 
X_testmod  = gvec_input(X_test,m,e) 

In [219]:
print(X_trainmod.shape)
print(X_testmod.shape)

(75, 15, 50)
(25, 15, 50)


In [220]:
# Map the y_training and y_testing datasets to Boolean 0, 1: 
y_trainmod = (np.array([vec_output(y) for y in y_train])).reshape(len(y_train), len(ranking))
y_testmod = (np.array([vec_output(y) for y in y_test])).reshape(len(y_test),len(ranking))
y_trainmod.shape

(75, 3)

In [221]:
# Create the positional embeddings: 
position_enc = pos_emb(X_trainmod.shape[1],X_trainmod.shape[2])
position_enc.shape

TensorShape([1, 15, 50])

In [222]:
# Add the positional encoding to the word embeddings: 
X_trainmod = X_trainmod + position_enc 
print(X_trainmod.shape)

X_testmod = X_testmod + position_enc 
X_testmod.shape

(75, 15, 50)


TensorShape([25, 15, 50])

In [225]:
dim_kv = 20
dim_q = 20 
len_emb = 50
heads = 2 
fnn_neurons = 20
drop_rate = 0.1
len_rank = len(ranking)
rnn_units = 20 
iter = 6 #based on the paper  

encoder = Encoder(dim_kv, dim_q, heads, fnn_neurons, len_emb, iter, len_rank,rnn_units, drop_rate = 0.1)
output_encoder = encoder(X_trainmod, training = True)
output_encoder.shape

TensorShape([75, 3])

In [238]:
inputs = tf.keras.Input(shape=(15, len_emb))
outputs = encoder(inputs, training=True)  # Assuming training=True for now
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()

In [239]:
from tensorflow.keras.optimizers import Adam
opt = Adam(0.002,beta_1 = 0.9, beta_2 = 0.999, decay = 0.01) 
model.compile(loss = "categorical_crossentropy", optimizer = opt, metrics = ["accuracy"])

In [246]:
model.fit(X_trainmod,y_trainmod, epochs=100, batch_size=500) 

Epoch 1/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step - accuracy: 0.7600 - loss: 0.6133
Epoch 2/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step - accuracy: 0.7600 - loss: 0.5788
Epoch 3/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step - accuracy: 0.8667 - loss: 0.4301
Epoch 4/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step - accuracy: 0.8667 - loss: 0.4563
Epoch 5/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step - accuracy: 0.7867 - loss: 0.4968
Epoch 6/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step - accuracy: 0.7467 - loss: 0.7015
Epoch 7/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step - accuracy: 0.7867 - loss: 0.5424
Epoch 8/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step - accuracy: 0.7867 - loss: 0.5487
Epoch 9/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[3

<keras.src.callbacks.history.History at 0x2a902a2d0>

In [247]:
model.evaluate(X_testmod, y_testmod)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 267ms/step - accuracy: 0.9200 - loss: 0.3098


[0.30978140234947205, 0.9200000166893005]

In [249]:
predictions = model.predict(X_testmod)
predictions = np.argmax(predictions, axis = -1)
output = [ranking[int(x)] for x in predictions]
for i in range(len(output)): 
    print(f"Comment: {X_test[i]}\n\nRanking: {y_test[i]}, prediction: {output[i]}\n\n")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 12ms/step
Comment: ['i', 'am', 'very', 'satisfied', 'with', 'this', 'product']

Ranking: Above Average , prediction: Above Average 


Comment: ['i', 'like', 'this', 'but', 'the', 'design', 'could', 'be', 'better']

Ranking: Average , prediction: Average 


Comment: ['it', 'does', 'not', 'work']

Ranking: Below Average , prediction: Below Average 


Comment: ['too', 'expensive']

Ranking: Below Average , prediction: Below Average 


Comment: ['looks', 'beautiful']

Ranking: Above Average , prediction: Above Average 


Comment: ['love', 'this', 'but', 'too', 'expensive']

Ranking: Average , prediction: Above Average 


Comment: ['works', 'well', 'but', 'too', 'large']

Ranking: Average , prediction: Average 


Comment: ['does', 'not', 'work', 'for', 'my', 'son']

Ranking: Below Average , prediction: Below Average 


Comment: ['great', 'quality']

Ranking: Above Average , prediction: Above Average 


Comment: ['works', 'well', 

The model is trained on a very small dataset and reaches 92% accuracy; classifying only 2 comments wrong against 25 comments in the testing set. The dot product attention model seems to have a better performance than the additive model in terms of the accuracy it reaches both when predicting the testing and training datasets. Furthermore, the dot-product attention seems to have a superior performance even when the dataset is as small as only 100 samples. 

In [118]:
def edit_txt(review):
    """
    This function receives a text and returns it edited as follows: 
    1, all words converted to lower case 
    2, integers removed
    3, tokenize the words 
    4, punctuation removed 
    5, common words that are unnecessary are removed. 
    """
    
    review_edited = []

    #Converting to lower case: 
    review_edited = review.lower() 
    
    #Removing integers: 
    pattern = r'[0-9]'
    # Match all digits in the string and replace them with an empty string
    review_edited = re.sub(pattern, '', review_edited) 

    #Tokenize the comment: 
    review_edited = word_tokenize(review_edited) 

    #Removing punctuation 
    tokenizer = RegexpTokenizer(r'\w+')
    review_edited = [''.join(tokenizer.tokenize(word)) for word in review_edited if len(tokenizer.tokenize(word))>0]

    #Removing common words: 
    remove_list = stopwords.words('english') 
    to_remove = [ "not",'don',"don't",'should',"should've", 'ain','aren',"aren't",'couldn',"couldn't",'didn',"didn't",'doesn',"doesn't",'hadn',"hadn't",'hasn',"hasn't",'haven',"haven't",'isn',"isn't",'mightn',"mightn't",'mustn',"mustn't",'needn',"needn't",'shan',"shan't",'shouldn',"shouldn't",'wasn',"wasn't",'weren',"weren't",'won',"won't",'wouldn', "wouldn't"]
 
    review_edited = [word for word in review_edited if not word in remove_list]
    return(review_edited) 



In [128]:
# Defining the review dataset as x: 
x = df["review"] 
dfrank = df.iloc[:,1]

print(x[10])

y = df["Sentiment"].tolist()
ranking = np.unique(y)
ranking = ranking.tolist()
print(f"\nCorresponding ranking: {y[10]}\n")
print(f"Rankigns include {ranking}")


Really great editing app, its all around which makes it great. Has everything I need for basic editing. It makes editing easier because of premade tools and stickers, designs, etc. I gave it four stars only because of how slow it loads, especially at starting the app. It is pretty stressful, so you really gotta have patience at waiting for stuff to load.

Corresponding ranking: Positive

Rankigns include ['Negative', 'Positive']


In [129]:
#creating the dictionary: 
reviews_edited = [edit_txt(review) for review in x]
print(f"Comment before editting: {x[13]}")
print(f"Comment after editting: {reviews_edited[13]}")

Split = [] 
Dic = []
dictionary = np.unique([word for review in reviews_edited for word in review]).tolist()
print(dictionary[1:30])
len(dictionary)

Comment before editting: Unable to save my work. Nothing works :(
Comment after editting: ['unable', 'save', 'work', 'nothing', 'works']
['aa', 'aap', 'ability', 'able', 'absolutely', 'acc', 'accepted', 'access', 'accessibilities', 'accessible', 'accidentally', 'accoding', 'according', 'account', 'across', 'action', 'activity', 'actual', 'actually', 'ad', 'adaptable', 'add', 'added', 'adding', 'addition', 'address', 'adds', 'administrative', 'adobe']


2196

In [130]:
#dividing the dataset into 75% training set and 25% test set: 
x = x.to_list()
X_train, X_test, y_train, y_test = train_test_split(x,y, 
                                   random_state=104,  
                                   test_size=0.25,  
                                   shuffle=True) 


In [131]:
# Edit the text in the training and texting datasets: 
X_train = [edit_txt(comment) for comment in X_train]
X_test = [edit_txt(comment) for comment in X_test]

In [132]:
X_train[0]

['spend',
 'much',
 'time',
 'working',
 'poster',
 'app',
 'allowing',
 'download',
 'simply',
 'wasted',
 'hard',
 'work']

In [133]:
#converting x_train and x_test to word embeddings: 
m = 30
e = 50
X_trainmod = gvec_input(X_train,m,e)
X_testmod = gvec_input(X_test,m,e) 

In [134]:
# Map the y_training and y_testing datasets to Boolean 0, 1: 
y_trainmod = (np.array([vec_output(y) for y in y_train])).reshape(len(y_train), 1, len(ranking))
y_testmod = (np.array([vec_output(y) for y in y_test])).reshape(len(y_test), 1, len(ranking))
y_trainmod[0:5]

array([[[1., 0.]],

       [[1., 0.]],

       [[1., 0.]],

       [[1., 0.]],

       [[0., 1.]]])

In [135]:
# Add the positional encoding to the word embeddings: 
X_trainmod = X_trainmod + position_enc 
print(X_trainmod.shape)

X_testmod = X_testmod + position_enc 
X_testmod.shape

(1125, 30, 50)


TensorShape([375, 30, 50])

In [136]:
# Check if it works: 
dim_kv = 30 
dim_q = 20 
len_emb = 50
heads = 2 
masking = None 
fnn_neurons = 20
drop_rate = 0.1
len_rank = len(ranking)
rnn_units = 20 

encoder = Encoder(dim_kv, dim_q, heads, fnn_neurons, len_emb, 1, len_rank,rnn_units, drop_rate = 0.1)
output_encoder = encoder(X_trainmod, training = True)
output_encoder.shape

TensorShape([1125, 2])

In [137]:
y_trainmod = y_trainmod.reshape(1125,2)
y_trainmod.shape

(1125, 2)

In [138]:
inputs = tf.keras.Input(shape=(30, len_emb))
outputs = encoder(inputs, training=True)  # Assuming training=True for now
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()

In [139]:
from tensorflow.keras.optimizers import Adam
opt = Adam(0.003,beta_1 = 0.9, beta_2 = 0.999, decay = 0.01) 
model.compile(loss = "categorical_crossentropy", optimizer = opt, metrics = ["accuracy"])

In [146]:
model.fit(X_trainmod,y_trainmod, epochs=300, batch_size=100)

Epoch 1/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.9750 - loss: 0.0714
Epoch 2/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.9631 - loss: 0.0886
Epoch 3/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.9712 - loss: 0.0703
Epoch 4/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.9618 - loss: 0.0855
Epoch 5/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.9641 - loss: 0.1141
Epoch 6/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.9771 - loss: 0.0651
Epoch 7/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.9759 - loss: 0.0566
Epoch 8/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8ms/step - accuracy: 0.9816 - loss: 0.0569
Epoch 9/300
[1m12/12[0m [32m━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x2c89b78d0>

In [144]:
y_testmod = y_testmod.reshape(375,2)
y_testmod.shape

(375, 2)

In [147]:
# Evaluate the model: 
model.evaluate(X_testmod, y_testmod)

[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.8948 - loss: 0.5264 


[0.5319152474403381, 0.8826666474342346]

In [148]:
predictions = model.predict(X_testmod)
predictions = np.argmax(predictions, axis = -1)
output = [ranking[int(x)] for x in predictions]
for i in range(len(output)): 
    print(f"Ranking: Comment: {x[i]}\n\nRanking: {y_test[i]}, prediction: {output[i]}\n\n")

[1m12/12[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step
Ranking: Comment: Overall it's really an amazing app. I've been using this for the past 5 years however I only have one issue though and I wanted this to get address since I think this issue had lasted for how many years? The texts were blurred and when you zoom it out it's pixelated. I thought this issue only occurs on mobile apps however it was also present on the website. Please fix this. I still remember the time when I can export high definition texts and I love that experience. Thank you!

Ranking: Positive, prediction: Positive


Ranking: Comment: Hey! Yes I gave a 5 star rating... coz I believe it deserves it! I mostly use the desktop version and I am seriously so satisfied with this app in both android and desktop version. I just came here to thank the developers for this beautiful app and its facilities. I literally find almost everything that I need for and the best part is even without the premium feat

Conclusion: 