# Tian Xu
# Assignment 3

https://github.com/xutian0117/QMSS5074.git

## *1*. Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful. Who could benefit from a model like this? Explain.

The descriptive statistics show that this IMDb training set covers a wide range of review lengths, from 52 to 13,704 characters with a mean of about 1,325 characters, which indicates diverse review styles and depth of content. The set is also perfectly balanced between negative and positive sentiment (12,500 reviews each). This balance is important for training unbiased and accurate sentiment analysis models, because a model cannot score well simply by favoring one class.

Such models can be immensely beneficial to movie studios, online review platforms, and marketing firms, as they provide key insights into public opinion and assist in refining content recommendations and marketing strategies.

## Get and prepare the data

In [10]:
# Get raw imdb dataset
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2023-12-18 03:15:11--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2023-12-18 03:15:36 (3.32 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [11]:
# Untar it to a new folder
! tar xf aclImdb_v1.tar.gz

In [12]:
# Build corpus of docs and labels
import os

imdb_dir = 'aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            # Use a context manager so the file is closed even on errors
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

In [13]:
len(texts)

25000

In [14]:
print(texts[0])
print(labels[0])

This movie tries to hard to be something that it's not....a good movie. It wants you to be fooled from begining to end,But fails.From when it starts to get interesting it falls apart and you're just hoping the ending gives you some clue of just what is going on but it didn't.<br /><br />
0


In [15]:

import pandas as pd

data = pd.DataFrame({
    'text': texts,
    'label': labels
})


print("Head of the Data:")
print(data.head())

# Descriptive statistics
text_lengths = data['text'].apply(len)
print("\nDescriptive Statistics for Text Length:")
print(text_lengths.describe())

print("\nLabel Distribution:")
print(data['label'].value_counts())


Head of the Data:
                                                text  label
0  This movie tries to hard to be something that ...      0
1  I saw virtually no redeeming qualities in this...      0
2  PROBLEM CHILD is one of the worst movies I hav...      0
3  ...was so that I could, in good conscience, te...      0
4  The story at the outset is interesting: slaver...      0

Descriptive Statistics for Text Length:
count    25000.00000
mean      1325.06964
std       1003.13367
min         52.00000
25%        702.00000
50%        979.00000
75%       1614.00000
max      13704.00000
Name: text, dtype: float64

Label Distribution:
0    12500
1    12500
Name: label, dtype: int64


### Data preprocessing

In [16]:
# Tokenize the data into padded sequences of integer word indices
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # We will cut reviews after 100 words in sequence
training_samples = 10000  # We will be training on 10000 samples
validation_samples = 10000  # We will be validating on 10000 samples
max_words = 10000  # We will only consider the top 10,000 words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)


sequences = tokenizer.texts_to_sequences(texts) # converts words in each text to each word's numeric index in tokenizer dictionary.

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Split the data into a training set and a validation set.
# But first, shuffle the data, since we started from data
# where samples are ordered (all negative first, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples] #100 words
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)


In [17]:
#Text example
print(texts[0])

#Text transformed to sequence using tokenizer
print(sequences[0])

This movie tries to hard to be something that it's not....a good movie. It wants you to be fooled from begining to end,But fails.From when it starts to get interesting it falls apart and you're just hoping the ending gives you some clue of just what is going on but it didn't.<br /><br />
[11, 17, 494, 5, 251, 5, 27, 139, 12, 42, 21, 3, 49, 17, 9, 490, 22, 5, 27, 4449, 36, 5, 127, 18, 993, 36, 51, 9, 514, 5, 76, 218, 9, 731, 968, 2, 332, 40, 1379, 1, 274, 405, 22, 46, 2297, 4, 40, 48, 6, 167, 20, 18, 9, 158, 7, 7]


In [18]:
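# Sequences from the tokenizer, cut to 100 tokens or padded with zeros whenever a text is shorter than 100 tokens.
# Note: data was shuffled above, so data[0] is not necessarily the review in texts[0].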
data[0]

array([   1, 1881,  247, 7217,   48,  124,    1, 1881, 7616,   37,   22,
         25,    5,   27, 3996,   69,    1,  133,  118,    1,  229,   13,
       1307,    2,  185,   46,   49,  916,   36,    1,  308,  234,    9,
        607,   35, 3917,  703,    2,    1,  478, 3702,    4,    1,  229,
        109, 3324,  968, 1574,   69,    4,  291,  228,    3, 7685,  478,
         16,   65, 1636,   44,   11,    6,  392,   30,    3,  748,  747,
         22,  795,    9,   30,   29, 2200,   11,   17,    6,   35,   75,
         12,   10,  162,   90,    1, 2150,   41, 3594,  231,  140,   12,
         10,  885,    5, 1271,   53,   20,   58, 1662,    2,   10,  119,
        370], dtype=int32)
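
The padding step can be illustrated with a small pure-Python sketch. It mimics what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): long reviews keep only their last `maxlen` tokens, and short ones are padded with zeros on the left. The helper `pad_pre` and the toy sequences are hypothetical and are not part of the notebook.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic default pad_sequences behavior for one sequence (assumes maxlen > 0)."""
    # Pre-truncation: keep only the last `maxlen` tokens.
    seq = list(seq)[-maxlen:]
    # Pre-padding: fill with `value` on the left up to `maxlen`.
    return [value] * (maxlen - len(seq)) + seq


short = [11, 17, 494]                 # shorter than maxlen
long = [11, 17, 494, 5, 251, 5, 27]   # longer than maxlen

padded_short = pad_pre(short, maxlen=5)
padded_long = pad_pre(long, maxlen=5)

print(padded_short)  # [0, 0, 11, 17, 494]
print(padded_long)   # [494, 5, 251, 5, 27]
```

This is why the array above contains no zeros: that review has more than 100 tokens, so only its last 100 are kept.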

## 2. Run at least three prediction models to try to predict the IMDB sentiment dataset well.



### a. Use an Embedding layer and LSTM layers in at least one model

In [19]:
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

maxlen = 100

model1 = Sequential()
model1.add(Embedding(10000, 16, input_length=maxlen))

model1.add(LSTM(128))

model1.add(Dense(1, activation='sigmoid'))

model1.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

model1.summary()

history = model1.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 16)           160000    
                                                                 
 lstm (LSTM)                 (None, 128)               74240     
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 234369 (915.50 KB)
Trainable params: 234369 (915.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
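
The parameter counts in the summary above can be checked by hand. The following is a minimal sketch, assuming the standard Keras layer layouts (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One vector of size `dim` per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(10000, 16),
    "lstm": lstm_params(16, 128),
    "dense": dense_params(128, 1),
}
total = sum(counts.values())

print(counts)  # {'embedding': 160000, 'lstm': 74240, 'dense': 129}
print(total)   # 234369
```

Most of the parameters sit in the Embedding layer, even though its dimension is only 16.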


### b. Use an Embedding layer and Conv1D layers in at least one model

In [20]:
from tensorflow.keras.layers import Dense, Embedding, Conv1D, GlobalMaxPooling1D
from tensorflow.keras.models import Sequential

maxlen = 100

model2 = Sequential()
model2.add(Embedding(10000, 16, input_length=maxlen))

model2.add(Conv1D(32, 3, activation='relu'))

model2.add(Conv1D(32, 3, activation='relu'))

model2.add(GlobalMaxPooling1D())

model2.add(Dense(1, activation='sigmoid'))

model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

model2.summary()

history = model2.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 16)           160000    
                                                                 
 conv1d (Conv1D)             (None, 98, 32)            1568      
                                                                 
 conv1d_1 (Conv1D)           (None, 96, 32)            3104      
                                                                 
 global_max_pooling1d (Glob  (None, 32)                0         
 alMaxPooling1D)                                                 
                                                                 
 dense_1 (Dense)             (None, 1)                 33        
                                                                 
Total params: 164705 (643.38 KB)
Trainable params: 164705 (643.38 KB)
Non-trainable params: 0 (0.00 Byte)
______________
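
The output shapes and parameter counts for the convolutional model can be reproduced with a short calculation. This is a sketch that assumes the Keras defaults used above (`padding='valid'`, stride 1); the helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size):
    # 'valid' padding with stride 1 shortens the sequence by kernel_size - 1.
    return length - kernel_size + 1


def conv1d_params(in_channels, filters, kernel_size):
    # Each filter has kernel_size * in_channels weights plus one bias.
    return (kernel_size * in_channels + 1) * filters


len1 = conv1d_output_length(100, 3)    # after the first Conv1D
len2 = conv1d_output_length(len1, 3)   # after the second Conv1D

conv1 = conv1d_params(16, 32, 3)       # input channels = embedding dimension
conv2 = conv1d_params(32, 32, 3)       # input channels = previous filters
dense = 32 * 1 + 1                     # GlobalMaxPooling1D leaves 32 features
total = 10000 * 16 + conv1 + conv2 + dense

print(len1, len2)    # 98 96
print(conv1, conv2)  # 1568 3104
print(total)         # 164705
```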

### c. Use transfer learning with GloVe embeddings for at least one of these models

In [4]:
# What if we wanted to use a matrix of pretrained embeddings?  Same as transfer learning before, but now we are importing a pretrained Embedding matrix:
# Download Glove embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-12-18 03:08:07--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-12-18 03:08:07--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-12-18 03:08:07--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [5]:
! unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [22]:

import os

glove_dir = os.getcwd()

embeddings_index = {}
# Each line holds a word followed by its 100 embedding coefficients
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))




Found 400001 word vectors.


In [23]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
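
A useful sanity check on this step is how many of the in-vocabulary words actually receive a pretrained vector. The sketch below repeats the same loop on toy data and records the words that are missing from the embedding index; the small `word_index` and `embeddings_index` dictionaries are made up for illustration and plain lists stand in for the NumPy matrix.

```python
max_words = 5
embedding_dim = 3

# Toy stand-ins (hypothetical): tokenizer vocabulary and pretrained vectors.
word_index = {"the": 1, "movie": 2, "good": 3, "begining": 4, "rare": 7}
embeddings_index = {
    "the": [0.1, 0.2, 0.3],
    "movie": [0.4, 0.5, 0.6],
    "good": [0.7, 0.8, 0.9],
    "rare": [1.0, 1.1, 1.2],
}

embedding_matrix = [[0.0] * embedding_dim for _ in range(max_words)]
in_vocab = 0
missing = []
for word, i in word_index.items():
    if i >= max_words:
        continue  # outside the top max_words words, never looked up
    in_vocab += 1
    vector = embeddings_index.get(word)
    if vector is None:
        missing.append(word)  # row stays all-zeros
    else:
        embedding_matrix[i] = list(vector)

coverage = (in_vocab - len(missing)) / in_vocab
print(missing)   # ['begining']
print(coverage)  # 0.75
```

Misspelled words such as "begining" in the first review would have no GloVe vector and therefore map to an all-zero row, as does index 0, which is reserved for padding.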

In [25]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.models import Sequential



# Define the model
model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=maxlen, weights=[embedding_matrix], trainable=False))
model3.add(Flatten())
model3.add(Dense(32, activation='relu'))
model3.add(Dense(1, activation='sigmoid'))

model3.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])


model3.summary()


Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 flatten_1 (Flatten)         (None, 10000)             0         
                                                                 
 dense_4 (Dense)             (None, 32)                320032    
                                                                 
 dense_5 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1320065 (5.04 MB)
Trainable params: 320065 (1.22 MB)
Non-trainable params: 1000000 (3.81 MB)
_________________________________________________________________


In [27]:
history = model3.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [26]:
# Evaluate model on test set (need to preprocess test data to same structure first)

test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
            # Use a context manager so the file is closed even on errors
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

# Reuse the tokenizer object that was fit on the training data above
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)
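
Reusing the training tokenizer means that test words never seen in training, or ranked outside the top `num_words`, are silently dropped when no out-of-vocabulary token is configured. The following pure-Python sketch imitates that behavior on toy data; `fit_word_index` and `to_sequence` are hypothetical helpers that use simple whitespace splitting, not the Keras implementation.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index, num_words):
    """Map words to indices, skipping unknown words and indices >= num_words."""
    return [
        word_index[w]
        for w in text.lower().split()
        if w in word_index and word_index[w] < num_words
    ]


train_texts = ["good movie", "bad movie"]
word_index = fit_word_index(train_texts)

# "plot" never appeared in training, so it is dropped from the test sequence.
seq_all = to_sequence("good plot bad movie", word_index, num_words=10)
# With a smaller vocabulary limit, the lowest-ranked known word is dropped too.
seq_small = to_sequence("good plot bad movie", word_index, num_words=3)

print(word_index)  # {'movie': 1, 'good': 2, 'bad': 3}
print(seq_all)     # [2, 3, 1]
print(seq_small)   # [2, 1]
```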

### d. Discuss which models performed better and point out relevant hyper-parameter values for successful models

Model 2 (Embedding + Conv1D) performed best on the test set, with an accuracy of 81.49% and a loss of 0.4919. It outperformed Model 1 (Embedding + LSTM; accuracy 79.95%, loss 0.8645) and Model 3 (frozen GloVe embeddings + Dense; accuracy 68.00%, loss 0.9363).


The relevant hyperparameters for Model 2 are an Embedding layer with a 10,000-word vocabulary and dimension 16, two Conv1D layers with 32 filters and kernel size 3 (ReLU activation), followed by GlobalMaxPooling1D and a single-unit Dense layer with sigmoid activation. It was trained with the RMSprop optimizer for 10 epochs, with batch size 32 and a 20% validation split, on 10,000 reviews cut to 100 tokens each.

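The Conv1D layers in Model 2 likely helped it capture local contextual features (short word patterns) in the text, which may explain its higher accuracy and lower loss. The much higher test losses of Models 1 and 3 may point to overfitting over the 10 epochs, although the per-epoch metrics are not shown here to confirm this.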

In [30]:
model1.evaluate(x_test, y_test)




[0.8644613027572632, 0.7994800209999084]

In [31]:
model2.evaluate(x_test, y_test)




[0.4919321835041046, 0.8148800134658813]

In [29]:

model3.evaluate(x_test, y_test)





[0.9362882971763611, 0.6800400018692017]
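
The three `evaluate` results above can be collected and ranked in one place. This is a small sketch that hard-codes the `[loss, accuracy]` pairs printed above; the dictionary keys are descriptive labels chosen here for illustration.

```python
# (loss, accuracy) pairs copied from the evaluate outputs above.
results = {
    "model1 (LSTM)": (0.8644613027572632, 0.7994800209999084),
    "model2 (Conv1D)": (0.4919321835041046, 0.8148800134658813),
    "model3 (GloVe + Dense)": (0.9362882971763611, 0.6800400018692017),
}

# Rank by test accuracy, highest first.
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for name, (loss, acc) in ranked:
    print(f"{name:<24} loss={loss:.4f}  acc={acc:.2%}")

best_name = ranked[0][0]
print("Best:", best_name)  # Best: model2 (Conv1D)
```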