## The Problem: Large Movie Review Dataset
### Classify movie reviews from IMDB into positive or negative sentiment.
### Download the dataset [here](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)

In [16]:
# imports

from gensim.models import KeyedVectors
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, Dense, Input, GlobalAveragePooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

import utils

## Exploring the data

In [17]:
# Importing & preprocessing the dataset

train_ds = text_dataset_from_directory('../data/aclImdb/train')
test_ds = text_dataset_from_directory('../data/aclImdb/test')

dfTrain = pd.DataFrame(train_ds.unbatch().as_numpy_iterator(), columns=['text', 'label'])
# The train folder also contains an unlabeled 'unsup' class (label 2); keep only negative (0) and positive (1).
dfTrain = dfTrain[dfTrain['label'] < 2].reset_index()
print("dfTrain", dfTrain.shape[0], dfTrain['label'].value_counts())
dfTest = pd.DataFrame(test_ds.unbatch().as_numpy_iterator(), columns=['text', 'label'])
print("dfTest", dfTest.shape[0], dfTest['label'].value_counts())
# Evaluate on a stratified 25% sample of the test set.
# Taking an explicit copy keeps the column assignment below from raising pandas' SettingWithCopyWarning.
_, xts = train_test_split(dfTest, stratify=dfTest['label'], test_size=0.25)
xts = xts.copy()

# The dataset yields bytes, so decode the reviews to str.
dfTrain['text'] = dfTrain['text'].map(lambda x: x.decode())
xts['text'] = xts['text'].map(lambda x: x.decode())

Found 75000 files belonging to 3 classes.
Found 25000 files belonging to 2 classes.
dfTrain 25000 0    12500
1    12500
Name: label, dtype: int64
dfTest 25000 0    12500
1    12500
Name: label, dtype: int64

In [18]:
pd.options.display.max_colwidth = 200
dfTrain.sample(n=5)

| | index | text | label |
|---|---|---|---|
| 3322 | 10132 | There are movies that are awful, and there are movies that are so awful they are deemed long-forgotten and unwatchable. Also, lots of violence and bad stuff (not just cheesy stuff; you know what I... | 0 |
| 23743 | 71411 | I found a DVD of "I Dream Of Jeanie" in the $1.00 bin at Wal-Mart. When I saw that it was the "story of Stephen Foster", being a musician and music educator, I had to see it. I had no idea what ye... | 0 |
| 2631 | 7995 | Anna (Charlotte Burke), who is just on the verge of puberty, begins to have strange dreams which start affecting her in real life--especially involving a boy named Mark (Elliott Spiers) who she me... | 1 |
| 4110 | 12496 | This review contains spoilers for those who are not aware of the details of the true story on which this movie is based.<br /><br />The right to be presumed "Innocent until proven guilty" is a bas... | 1 |
| 5395 | 16376 | A study in bad. Bad acting, bad music, bad screenplay, bad editing, bad direction and a bad idea. Pieces of schlock don't come any cheesier or unintentionally funnier than this... thing. By the en... | 0 |


In [19]:
dfTrain['label'].value_counts()

0    12500
1    12500
Name: label, dtype: int64

In [20]:
print(dfTrain.loc[0, 'text'])

The subject notwithstanding, this is an amateur, exhibitionist movie--or an effort at one--which is about as interesting and daring as a moody high school student's composition book full of death "poetry". To be sure, it will disturb viewers who are hell-bent on being disturbed, but the success will be attributable to themselves, not to the director. To genuinely get under somebody's skin requires sensibility, discipline, technique, and talent, as well as an eye and an ear. The film does contain one evocative image, shown as a still (and also used on the video case), but with no development leading up to or away from it. If the director had had an eye, he would have seen it as a possible starting point for an interesting movie--that is, a movie.


## Tokenize the text

In [21]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dfTrain['text'].tolist())
train_sequences = tokenizer.texts_to_sequences(dfTrain['text'].tolist())
test_sequences = tokenizer.texts_to_sequences(xts['text'].tolist())

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 88582 unique tokens.
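
The word index built above can be pictured as a frequency ranking: the Keras `Tokenizer` lowercases the text, strips most punctuation, and assigns index 1 to the most frequent word, leaving 0 unused. The following is a minimal pure-Python sketch of that idea, not the Keras implementation. The names `simple_tokens`, `build_word_index` and `texts_to_sequences` are hypothetical, the regex is a simplified stand-in for the Keras filter list, and ties between equally frequent words are broken alphabetically here, which may differ from Keras.

```python
import re
from collections import Counter

def simple_tokens(text):
    # Lowercase and keep runs of letters, digits and apostrophes (simplified filter).
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts):
    # Count every token, then rank by descending frequency (alphabetical on ties).
    # Ranks start at 1 so that index 0 stays free for padding.
    counts = Counter(tok for t in texts for tok in simple_tokens(t))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    # Words missing from the index are silently dropped.
    return [[word_index[tok] for tok in simple_tokens(t) if tok in word_index]
            for t in texts]

toy_texts = ["The movie was bad, the acting was bad.", "The movie was great!"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

Frequent words such as 'the' end up with small indices, which matches the many low numbers in the real sequence printed in the next cell.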


In [22]:
print(train_sequences[0])

[1, 872, 8775, 11, 6, 32, 2365, 30568, 17, 39, 32, 778, 30, 28, 60, 6, 41, 14, 218, 2, 3899, 14, 3, 4348, 309, 392, 15492, 7262, 271, 365, 4, 338, 4588, 5, 27, 249, 9, 77, 11037, 794, 34, 23, 605, 5558, 20, 109, 4010, 18, 1, 1018, 77, 27, 34701, 5, 529, 21, 5, 1, 164, 5, 2064, 76, 463, 16816, 2387, 3432, 7476, 7693, 3108, 2, 673, 14, 70, 14, 32, 741, 2, 32, 4844, 1, 19, 124, 3022, 28, 11038, 1456, 614, 14, 3, 128, 2, 79, 340, 20, 1, 371, 417, 18, 16, 54, 939, 968, 53, 5, 39, 242, 36, 9, 44, 1, 164, 66, 66, 32, 741, 26, 59, 25, 107, 9, 14, 3, 611, 1853, 210, 15, 32, 218, 17, 12, 6, 3, 17]


In [23]:
print([tokenizer.index_word[k] for k in train_sequences[0]])

['the', 'subject', 'notwithstanding', 'this', 'is', 'an', 'amateur', 'exhibitionist', 'movie', 'or', 'an', 'effort', 'at', 'one', 'which', 'is', 'about', 'as', 'interesting', 'and', 'daring', 'as', 'a', 'moody', 'high', 'school', "student's", 'composition', 'book', 'full', 'of', 'death', 'poetry', 'to', 'be', 'sure', 'it', 'will', 'disturb', 'viewers', 'who', 'are', 'hell', 'bent', 'on', 'being', 'disturbed', 'but', 'the', 'success', 'will', 'be', 'attributable', 'to', 'themselves', 'not', 'to', 'the', 'director', 'to', 'genuinely', 'get', 'under', "somebody's", 'skin', 'requires', 'sensibility', 'discipline', 'technique', 'and', 'talent', 'as', 'well', 'as', 'an', 'eye', 'and', 'an', 'ear', 'the', 'film', 'does', 'contain', 'one', 'evocative', 'image', 'shown', 'as', 'a', 'still', 'and', 'also', 'used', 'on', 'the', 'video', 'case', 'but', 'with', 'no', 'development', 'leading', 'up', 'to', 'or', 'away', 'from', 'it', 'if', 'the', 'director', 'had', 'had', 'an', 'eye', 'he', 'would', 

In [25]:
MAX_SEQUENCE_LENGTH = max(max(map(len, train_sequences)), max(map(len, test_sequences)))
MAX_SEQUENCE_LENGTH

2493

In [26]:
train_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)

In [27]:
print([tokenizer.index_word.get(k, '<PAD>') for k in train_data[0]])

['<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', 
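
The padded row starts with a long run of `<PAD>` because `pad_sequences`, with its default settings, fills (and truncates) at the front of each sequence using the value 0, and 0 is not a key in `tokenizer.index_word`. A small pure-Python sketch of that default behavior, under the assumption that `maxlen` is at least 1; `pad_pre` is a hypothetical name used only for illustration:

```python
def pad_pre(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` items (front truncation) ...
        kept = list(seq)[-maxlen:]
        # ... then left-fill with `value` until the row has `maxlen` items.
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy = [[1, 872, 8775], [11, 6], [5, 4, 3, 2, 1, 9]]
for row in pad_pre(toy, maxlen=5):
    print(row)
```

Every row comes out with the same length, which is what lets the sequences be stacked into a single array for the model.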

# Train a classifier with Word Embeddings

In [28]:
countries_wiki = KeyedVectors.load('wiki-countries.w2v')

In [29]:
embedding_layer = utils.make_embedding_layer(countries_wiki, tokenizer, MAX_SEQUENCE_LENGTH)
countries_wiki_model = Sequential([
    Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32'),
    embedding_layer,
    GlobalAveragePooling1D(),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
countries_wiki_model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
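
`utils.make_embedding_layer` is a local helper whose source is not shown in this notebook. A common pattern for such a helper is to build a matrix whose row `i` holds the pretrained vector of the word with tokenizer index `i`, leaving all-zero rows for padding and for words the pretrained vocabulary lacks. The sketch below illustrates that pattern only and is not necessarily what the helper does; `build_embedding_matrix` is a hypothetical name, and a plain dict stands in for the gensim `KeyedVectors`.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    # Row 0 is reserved for padding, so the matrix needs len(word_index) + 1 rows.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    missing = 0
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
        else:
            missing += 1  # no pretrained vector: the row stays all zeros
    return matrix, missing

# Toy stand-ins for tokenizer.word_index and the pretrained vectors (assumed values).
toy_word_index = {"the": 1, "movie": 2, "zzyzx": 3}
toy_vectors = {"the": np.array([0.1, 0.2, 0.3]), "movie": np.array([0.4, 0.5, 0.6])}
toy_matrix, toy_missing = build_embedding_matrix(toy_word_index, toy_vectors, dim=3)
print(toy_matrix.shape, toy_missing)
```

A matrix like this would typically initialize a frozen `Embedding` layer, for instance through `embeddings_initializer=tf.keras.initializers.Constant(matrix)` with `trainable=False`.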

In [30]:
countries_wiki_history = countries_wiki_model.fit(
    train_data, dfTrain['label'].values,
    validation_data=(test_data, xts['label'].values),
    batch_size=64, epochs=30
)

Epoch 1/30
...
Epoch 30/30


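As an illustration, take a single review of 9 token indices and an embedding size of 128. The input has shape (1, 9). The embedding layer looks up one 128-dimensional vector per token, producing a tensor of shape (1, 9, 128).
`GlobalAveragePooling1D` then averages over the 9 token positions, collapsing (1, 9, 128) into (1, 128): one fixed-size vector per review, which feeds the dense layers.
In short, the shapes go from (1, 9) to (1, 9, 128) to (1, 128). The same steps apply in this notebook with `MAX_SEQUENCE_LENGTH` positions in place of 9.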
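
The shape changes described here can be reproduced with plain NumPy. This is a sketch under assumed toy sizes (a 50-word vocabulary and random vectors), not the Keras layers themselves: the lookup mimics an embedding layer and the mean mimics average pooling without a mask, so padded positions count toward the average.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 50, 128                  # illustrative sizes
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([[0, 0, 3, 17, 4, 9, 21, 8, 2]])   # one review, shape (1, 9)
embedded = embedding_matrix[token_ids]           # one vector per token -> (1, 9, 128)
pooled = embedded.mean(axis=1)                   # average over the 9 positions -> (1, 128)

print(token_ids.shape, embedded.shape, pooled.shape)
```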

# Train with a different set of word embeddings

## GloVe: Global Vectors for Word Representation
### Download [here](http://nlp.stanford.edu/data/glove.6B.zip)

In [None]:
glove_wiki = KeyedVectors.load_word2vec_format('data/glove.6B/glove.6B.300d.txt', binary=False, no_header=True)
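
The `no_header=True` argument is needed because GloVe text files start directly with the vectors, whereas the word2vec text format begins with a line giving the vocabulary size and vector dimension. Each GloVe line is a word followed by its space-separated components. A minimal pure-Python sketch of reading that layout; `parse_glove_lines` is a hypothetical name, and the two 3-dimensional vectors are made-up toy values:

```python
import io

def parse_glove_lines(handle):
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # First field is the word, the rest are the vector components.
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

toy_file = io.StringIO("the 0.1 0.2 0.3\nmovie -0.4 0.5 0.6\n")
toy_glove = parse_glove_lines(toy_file)
print(len(toy_glove), len(toy_glove["the"]))
```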

In [None]:
embedding_layer = utils.make_embedding_layer(glove_wiki, tokenizer, MAX_SEQUENCE_LENGTH)

glove_model = Sequential([
    Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32'),
    embedding_layer,
    GlobalAveragePooling1D(),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
glove_model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [None]:
# Note: batch_size is 32 here, while the countries_wiki model above was trained with 64.
glove_history = glove_model.fit(
    train_data, dfTrain['label'].values,
    validation_data=(test_data, xts['label'].values),
    batch_size=32, epochs=30
)

In [None]:
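# Compare validation accuracy per epoch for the two embedding sets
plt.plot(countries_wiki_history.history['val_accuracy'], label='Countries Wiki')
plt.plot(glove_history.history['val_accuracy'], label='All Wiki')
plt.xlabel('Epoch')
plt.ylabel('Validation accuracy')
plt.legend()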