<a href="https://colab.research.google.com/github/saffarizadeh/BUAN4061/blob/main/IMDb_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://saffarizadeh.com/Logo.png" width="300px"/>

# *BUAN 4061: Advanced Business Analytics*

# **Text to Sequence: IMDb Example**

Instructor: Dr. Kambiz Saffarizadeh

---

Credit: Laurence Moroney (https://github.com/lmoroney)

Install `beautifulsoup4` to manipulate HTML. (`beautifulsoup4` is preinstalled in the Colab environment.)

Side Note: this library is mostly used for web scraping.

In [None]:
!pip install beautifulsoup4



In [None]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

# Data

## Download the dataset

In [None]:
train_dataset = tfds.load('imdb_reviews', split="train")

It is easier to first convert the dataset to an iterable of numpy arrays using `tfds.as_numpy()` before preprocessing the textual data.

In [None]:
train_dataset = tfds.as_numpy(train_dataset)

You can iterate through the dataset to see what it looks like:

In [None]:
for item in train_dataset:
  print(item)
  break # breaks the loop after one iteration

{'label': 0, 'text': b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."}


In [None]:
# Alternative to check what's going on inside the dataset
train_dataset_iterator = iter(train_dataset)
next(train_dataset_iterator)

{'label': 0,
 'text': b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."}

## First try: extract the text without the cleaning steps

In [None]:
imdb_docs = []
imdb_labels = []

for item in train_dataset:
    # str() on a bytes object keeps the b"..." wrapper; no decoding or cleaning yet
    imdb_docs.append(str(item['text']))
    imdb_labels.append(item['label'])

In [None]:
print(imdb_docs[0])

b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
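The leading `b"` in the printed document appears because `str()` on a `bytes` object returns its printable representation instead of decoding it, so the prefix and quotes become part of the string the tokenizer sees. A minimal sketch on a made-up byte string (`raw` is a hypothetical sample, not taken from the dataset):

```python
# A made-up byte string standing in for item['text'] (hypothetical sample).
raw = b"Great movie!"

as_repr = str(raw)             # representation of the bytes object, prefix included
as_text = raw.decode("utf-8")  # the actual text

print(as_repr)  # b'Great movie!'
print(as_text)  # Great movie!
```

Decoding with `.decode('UTF-8')`, as the second try does, avoids the stray prefix.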


In [None]:
print(imdb_labels[0])

0


## Tokenizer

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=5000)

In [None]:
tokenizer.fit_on_texts(imdb_docs)

In [None]:
sequences = tokenizer.texts_to_sequences(imdb_docs)

In [None]:
print(tokenizer.word_index)

In [None]:
print(imdb_docs[0])

In [None]:
print(sequences[0])
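To make the two tokenizer steps less of a black box, here is a simplified pure-Python imitation of what fitting and converting do. It is only a sketch: the helper names `fit_word_index` and `to_sequences` and the toy documents are hypothetical, and the punctuation filtering that the Keras `Tokenizer` applies by default is left out.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders by count, highest first (ties keep first-seen order)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace each known word with its index; skip indices >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
        sequences.append(seq)
    return sequences

toy_docs = ["the movie was great", "the movie was bad", "great acting"]
toy_index = fit_word_index(toy_docs)
toy_sequences = to_sequences(toy_docs, toy_index, num_words=4)

print(toy_index)      # full index, regardless of num_words
print(toy_sequences)  # only indices 1..3 survive
```

As in the sketch, the full `word_index` lists every word seen during fitting; the `num_words` limit only takes effect when texts are converted to sequences.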

## Second try: extract the text with the cleaning steps

In [None]:
from bs4 import BeautifulSoup
import string

In [None]:
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at",
             "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do",
             "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having",
             "he", "hed", "hes", "her", "here", "heres", "hers", "herself", "him", "himself", "his", "how",
             "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", "is", "it", "its", "itself",
             "lets", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought",
             "our", "ours", "ourselves", "out", "over", "own", "same", "she", "shed", "shell", "shes", "should",
             "so", "some", "such", "than", "that", "thats", "the", "their", "theirs", "them", "themselves", "then",
             "there", "theres", "these", "they", "theyd", "theyll", "theyre", "theyve", "this", "those", "through",
             "to", "too", "under", "until", "up", "very", "was", "we", "wed", "well", "were", "weve", "were",
             "what", "whats", "when", "whens", "where", "wheres", "which", "while", "who", "whos", "whom", "why",
             "whys", "with", "would", "you", "youd", "youll", "youre", "youve", "your", "yours", "yourself",
             "yourselves"]

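`str.maketrans` builds a translation table for `str.translate`. In the three-argument form used here, every character in the third argument is mapped to `None`, which deletes it. Documentation: https://docs.python.org/3.9/library/stdtypes.html?highlight=maketrans#str.maketrans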

In [None]:
table = str.maketrans('', '', string.punctuation)
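A quick check of what this table does to individual words may help. The sample words below are chosen for illustration; the table itself is built exactly as in the cell above.

```python
import string

# Three-argument form: characters listed in the third argument are deleted.
table = str.maketrans('', '', string.punctuation)

no_apostrophe = "don't".translate(table)
no_period = "storyline.".translate(table)
only_punct = "-".translate(table)

print(no_apostrophe)     # dont
print(no_period)         # storyline
print(repr(only_punct))  # ''
```

This is why the stopword list contains entries such as `"hed"` and `"ill"` without apostrophes: the comparison happens after punctuation has been removed.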

In [None]:
imdb_docs = []
imdb_labels = []

for item in train_dataset:
    # Decode the raw bytes to text and lowercase it
    document = item['text'].decode('UTF-8').lower()
    # Pad separators with spaces so that adjacent words split apart
    document = document.replace(",", " , ")
    document = document.replace(".", " . ")
    document = document.replace("-", " - ")
    document = document.replace("/", " / ")
    # Create a soup to strip HTML and keep only the text;
    # naming the parser avoids the "no parser was explicitly specified" warning
    soup = BeautifulSoup(document, "html.parser")
    document = soup.get_text()

    words = document.split()
    filtered_document = ""
    for word in words:
        # Remove punctuation, then drop stopwords
        word = word.translate(table)
        if word not in stopwords:
            filtered_document = filtered_document + word + " "
    imdb_docs.append(filtered_document)
    imdb_labels.append(item['label'])

In [None]:
print(imdb_docs[0])

absolutely terrible movie  dont lured christopher walken michael ironside  great actors  must simply worst role history  even great acting not redeem movies ridiculous storyline  movie early nineties us propaganda piece  pathetic scenes columbian rebels making cases revolutions  maria conchita alonso appeared phony  pseudo  love affair walken nothing pathetic emotional plug movie devoid real meaning  disappointed movies like  ruining actors like christopher walkens good name  barely sit  
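The double spaces in the output above come from tokens such as `.` that become empty strings once punctuation is removed but are still appended with a trailing space. The sketch below is a variation, not the notebook's method: it skips empty tokens, keeps the stopwords in a set for faster lookup, and joins the result. The names `filter_words` and `toy_stopwords` are hypothetical, and the stopword set is only a small subset of the full list.

```python
import string

# A small, hypothetical subset of the stopword list, stored as a set.
toy_stopwords = {"this", "was", "an", "be", "in", "by", "or"}
table = str.maketrans('', '', string.punctuation)

def filter_words(document, stopwords):
    """Lowercase, strip punctuation, drop stopwords and empty tokens."""
    kept = []
    for word in document.lower().split():
        word = word.translate(table)
        if word and word not in stopwords:
            kept.append(word)
    return " ".join(kept)

cleaned = filter_words("This was an absolutely terrible movie . Don't be lured in", toy_stopwords)
print(cleaned)  # absolutely terrible movie dont lured
```

The extra spaces do no harm for tokenization, since splitting on whitespace ignores them; the variation only makes the printed text tidier.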


In [None]:
print(imdb_labels[0])

0


In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=25000)

In [None]:
tokenizer.fit_on_texts(imdb_docs)

In [None]:
sequences = tokenizer.texts_to_sequences(imdb_docs)

In [None]:
print(tokenizer.word_index)

# Final dataset

In [None]:
print(f'Number of documents {len(imdb_docs)}')
print(f'Number of sequences {len(sequences)}')
print(f'Number of labels {len(imdb_labels)}')

Number of documents 25000
Number of sequences 25000
Number of labels 25000


# Using the tokenizer on new data

In [None]:
sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[516, 5229, 147], [516, 6489, 147], [5229, 516]]
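Each sentence produced fewer numbers than it has words. This is consistent with "is", "a", and "it" having been removed as stopwords before fitting, so they never entered the vocabulary, and words missing from the vocabulary are skipped. The sketch below reproduces the behavior with a plain dictionary. The `encode` helper and `toy_word_index` are hypothetical; the four indices are copied from the output above, and stripping all of `string.punctuation` is a simplification of the tokenizer's own filtering.

```python
import string

# Indices copied from the output above; every other word counts as unknown.
toy_word_index = {"today": 516, "sunny": 5229, "day": 147, "rainy": 6489}
punct_table = str.maketrans('', '', string.punctuation)

def encode(sentence, word_index):
    """Lowercase, strip punctuation, and keep only words found in word_index."""
    words = sentence.lower().translate(punct_table).split()
    return [word_index[w] for w in words if w in word_index]

new_sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?',
]
encoded = [encode(s, toy_word_index) for s in new_sentences]
print(encoded)  # [[516, 5229, 147], [516, 6489, 147], [5229, 516]]
```

If unknown words should be kept as a placeholder instead of disappearing, the Keras `Tokenizer` accepts an `oov_token` argument at construction time.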


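You can create a reverse dictionary to translate these numbers back to words. Words that were dropped during encoding (here "is" and "a") cannot be recovered, so the decoded text contains only the words the tokenizer kept.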

In [None]:
reverse_word_index = {}

for (key, value) in tokenizer.word_index.items():
    reverse_word_index[value] = key

# shorter version:
# reverse_word_index = {value: key for (key, value) in tokenizer.word_index.items()}

In [None]:
decoded_review = ""

for i in sequences[0]:
    word = reverse_word_index.get(i, '?')
    decoded_review = decoded_review + ' ' + word

# shorter version:
# decoded_review = ' '.join([reverse_word_index.get(i, '?') for i in sequences[0]])

In [None]:
print(decoded_review)

 today sunny day
