<a href="https://colab.research.google.com/github/tqnhu2407/Introduction_to_NLP/blob/main/src.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding Language into Numbers

## Getting Started with Tokenization

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
from bs4 import BeautifulSoup
import string

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
import json

In [None]:
sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]

In [None]:
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}
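The word index above can be approximated in a few lines of plain Python. The sketch below is not the Keras implementation: `build_word_index` is a hypothetical helper, and it assumes that lowercasing, replacing `string.punctuation` with spaces, and ranking words by frequency is close enough to what `fit_on_texts` does for these sentences.

```python
import string
from collections import Counter

# Map every punctuation character to a space (a simplification of the Keras default filters).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, starting at index 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(_PUNCT_TO_SPACE).split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

demo_sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?',
]
print(build_word_index(demo_sentences))
```

For these three sentences the sketch yields the same mapping as the tokenizer output above: frequent words get low indices, and `today?` is counted as `today`.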


## Turning Sentences into Sequences

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[1, 2, 3, 4, 5], [1, 2, 3, 6, 5], [2, 7, 4, 1]]


In [None]:
test_data = [
 'Today is a snowy day',
 'Will it be rainy tomorrow?'
]

The test data contains words (`snowy`, `will`, `be`, `tomorrow`) that do not appear in the vocabulary built from `sentences`.

In [None]:
test_sequences = tokenizer.texts_to_sequences(test_data)
print(test_sequences)

[[1, 2, 3, 5], [7, 6]]


The unseen words are silently dropped, so a lot of information is lost.

### Using out-of-vocabulary tokens

In [None]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

{'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}
[[2, 3, 4, 1, 6], [1, 8, 1, 7, 1]]
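To make the effect of the out-of-vocabulary token concrete, here is a minimal pure-Python sketch of the lookup step. `to_sequences` is a hypothetical helper that only mimics `texts_to_sequences` under simplifying assumptions (punctuation replaced by spaces, whitespace splitting); the two dictionaries are copied from the outputs above.

```python
import string

_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def to_sequences(texts, word_index, num_words=None, oov_token=None):
    """Hypothetical helper mimicking texts_to_sequences on an already fitted word index."""
    fallback = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().translate(_PUNCT_TO_SPACE).split():
            idx = word_index.get(word)
            # Unknown words, and words ranked at or beyond num_words, are out-of-vocabulary.
            if idx is None or (num_words is not None and idx >= num_words):
                idx = fallback
            if idx is not None:  # with no OOV token, such words are simply skipped
                seq.append(idx)
        sequences.append(seq)
    return sequences

demo_test_data = ['Today is a snowy day', 'Will it be rainy tomorrow?']
plain_index = {'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}
index_with_oov = {'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5,
                  'day': 6, 'rainy': 7, 'it': 8}

print(to_sequences(demo_test_data, plain_index))                      # unseen words vanish
print(to_sequences(demo_test_data, index_with_oov, oov_token='<OOV>'))  # unseen words become 1
```

With the OOV token, each sequence keeps the length of its sentence, so word positions are preserved even when the words themselves are unknown.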


### Understanding padding

A model expects all of its inputs to be in the **same shape**, so sequences of different lengths have to be padded to a common length.

In [None]:
sentences = [
 'Today is a sunny day',
 'Today is a rainy day',
 'Is it sunny today?',
 'I really enjoyed walking in the snow today'
]

In [None]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[2, 3, 4, 5, 6], [2, 3, 4, 7, 6], [3, 8, 5, 2], [9, 10, 11, 12, 13, 14, 15, 2]]


In [None]:
padded = pad_sequences(sequences)
print(padded)

[[ 0  0  0  2  3  4  5  6]
 [ 0  0  0  2  3  4  7  6]
 [ 0  0  0  0  3  8  5  2]
 [ 9 10 11 12 13 14 15  2]]


The output above uses **pre-padding**, the default: zeros are added at the start of each shorter sequence.

In [None]:
padded = pad_sequences(sequences, padding='post')
print(padded)

[[ 2  3  4  5  6  0  0  0]
 [ 2  3  4  7  6  0  0  0]
 [ 3  8  5  2  0  0  0  0]
 [ 9 10 11 12 13 14 15  2]]


That is a lot of padding just to accommodate one long sentence. Setting `maxlen` caps the sequence length.

In [None]:
padded = pad_sequences(sequences, padding='post', maxlen=6)
print(padded)

[[ 2  3  4  5  6  0]
 [ 2  3  4  7  6  0]
 [ 3  8  5  2  0  0]
 [11 12 13 14 15  2]]


With `maxlen=6`, the longest sequence,

[ 9 10 11 12 13 14 15  2]

loses tokens at the beginning, because truncation removes tokens from the start of a sequence by default (`truncating='pre'`).

In [None]:
padded = pad_sequences(sequences, padding='post', maxlen=6, truncating='post')
print(padded)

[[ 2  3  4  5  6  0]
 [ 2  3  4  7  6  0]
 [ 3  8  5  2  0  0]
 [ 9 10 11 12 13 14]]
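The interaction between `padding`, `truncating`, and `maxlen` is easier to see when written out by hand. The following is a pure-Python sketch, not the Keras implementation: `pad` is a hypothetical stand-in for `pad_sequences` that returns nested lists instead of a NumPy array.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical stand-in for pad_sequences (returns lists, not a NumPy array)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate first: 'pre' drops tokens from the start, 'post' from the end.
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        # Then fill the remaining positions with the padding value.
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

demo_sequences = [[2, 3, 4, 5, 6], [2, 3, 4, 7, 6], [3, 8, 5, 2],
                  [9, 10, 11, 12, 13, 14, 15, 2]]

for row in pad(demo_sequences, maxlen=6, padding='post'):
    print(row)   # last row keeps its final six tokens
for row in pad(demo_sequences, maxlen=6, padding='post', truncating='post'):
    print(row)   # last row keeps its first six tokens
```

Which end to truncate depends on where the informative words sit; the sketch only shows the mechanics.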


# Removing Stopwords and Cleaning Text

* Remove HTML tags using BeautifulSoup.
* Filter out words that appear in a list of stopwords.
* Remove punctuation using the `string` library.
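The three steps above can be sketched as one small function. To keep the example free of third-party installs, it swaps BeautifulSoup for the standard library's `HTMLParser`; `strip_html`, `clean_text`, and the tiny `SAMPLE_STOPWORDS` set are hypothetical names used only for illustration.

```python
import string
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text between tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

# Translation table that deletes every punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Tiny illustrative stopword set; a real list is much longer.
SAMPLE_STOPWORDS = {'a', 'i', 'is', 'it', 'the', 'this', 'was'}

def clean_text(text, stopwords=SAMPLE_STOPWORDS):
    text = strip_html(text.lower())                             # 1. remove HTML tags
    words = (w.translate(PUNCT_TABLE) for w in text.split())    # 2. remove punctuation
    return " ".join(w for w in words if w and w not in stopwords)  # 3. drop stopwords

print(clean_text("<p>This movie was <b>great</b>, I loved it!</p>"))
```

A set is used for the stopwords so that each membership check is a constant-time lookup.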




# Working with Real Data Sources

## Getting Text from TensorFlow Datasets

In [None]:
import tensorflow_datasets as tfds

In [None]:
imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split='train'))
for item in train_data:
  imdb_sentences.append(str(item['text']))


In [None]:
print(imdb_sentences[0])

b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."


In [None]:
print(len(imdb_sentences))

25000


In [None]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)

In [None]:
print(tokenizer.word_index)



**br** shows up as a word in the index because the reviews contain `<br />`, the HTML tag that inserts a single line break.

Update the code to use BeautifulSoup to remove the HTML tags, a string translation table to remove the punctuation, and the list below to remove stopwords.

In [None]:
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'nor', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

In [None]:
table = str.maketrans('', '', string.punctuation)

In [None]:
imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split="train"))

In [None]:
for item in train_data:
    sentence = item['text'].decode('UTF-8').lower()
    soup = BeautifulSoup(sentence, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
    sentence = soup.get_text()  # remove HTML tags
    # Put spaces around separators so "annoying-conclusion" or "him/her" split into separate words
    sentence = sentence.replace(",", " , ")
    sentence = sentence.replace(".", " . ")
    sentence = sentence.replace("-", " - ")
    sentence = sentence.replace("/", " / ")
    words = sentence.split()
    filtered_sentence = ""
    for word in words:
        word = word.translate(table)  # remove punctuation
        if word and word not in stopwords:  # skip empty strings and stopwords
            filtered_sentence += word + " "
    imdb_sentences.append(filtered_sentence)

In [None]:
tokenizer = Tokenizer(num_words=25000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)

In [None]:
print(tokenizer.word_index)



### Using the IMDb subwords datasets

In [None]:
(train_data, test_data), info = tfds.load('imdb_reviews/subwords8k', split = (tfds.Split.TRAIN, tfds.Split.TEST), as_supervised=True, with_info=True)


In [None]:
encoder = info.features['text'].encoder
print(f'Vocabulary size: {encoder.vocab_size}')

Vocabulary size: 8185


In [None]:
print(encoder.subwords)



In [None]:
sample_string = 'Today is a sunny day'

encoded_string = encoder.encode(sample_string)
print(f'Encoded string is {encoded_string}')

Encoded string is [6427, 4869, 9, 4, 2365, 1361, 606]


Token IDs are 1-based (ID 0 is reserved for padding), so token 6427 corresponds to `encoder.subwords[6426]`.

The 5 words are encoded into 7 subword tokens.

In [None]:
for t in encoded_string:
    print(f'token={t}: {encoder.subwords[t-1]}')

token=6427: Tod
token=4869: ay_
token=9: is_
token=4: a_
token=2365: sun
token=1361: ny_
token=606: day


In [None]:
original_string = encoder.decode(encoded_string)
print(original_string)

Today is a sunny day
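The encode and decode round trip can be imitated with a toy vocabulary. This is only an illustration of the general idea of subword lookup with 1-based IDs, not the algorithm the tfds encoder uses; `toy_subwords`, `toy_encode`, and `toy_decode` are hypothetical, and the vocabulary holds just the seven pieces printed above.

```python
# Hypothetical toy vocabulary; '_' marks a trailing space, as in the output above.
toy_subwords = ['Tod', 'ay_', 'is_', 'a_', 'sun', 'ny_', 'day']

def toy_encode(text, subwords):
    """Greedy longest-match encoding that returns 1-based token IDs."""
    text = text.replace(' ', '_')
    ids, pos = [], 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at the current position.
        match = max((s for s in subwords if text.startswith(s, pos)), key=len, default=None)
        if match is None:
            raise ValueError(f'no subword covers {text[pos:]!r}')
        ids.append(subwords.index(match) + 1)  # IDs start at 1
        pos += len(match)
    return ids

def toy_decode(ids, subwords):
    """Look up each ID (minus one) and turn the '_' markers back into spaces."""
    return ''.join(subwords[i - 1] for i in ids).replace('_', ' ')

toy_ids = toy_encode('Today is a sunny day', toy_subwords)
print(toy_ids)
print(toy_decode(toy_ids, toy_subwords))
```

As with the real encoder, five words turn into seven tokens, and decoding restores the original string.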


## Getting Text from CSV Files

Temporarily skipped

## Getting Text from JSON Files

### Reading JSON files

In [None]:
sentences = []
labels = []
urls = []

In [None]:
with open('Sarcasm_Headlines_Dataset.json', 'r') as f:
    for line in f:  # each line is parsed as its own JSON object
        obj = json.loads(line)
        sentence = obj['headline'].lower()
        # Strip HTML before padding "/" with spaces; otherwise closing tags such as </b> get mangled
        soup = BeautifulSoup(sentence, 'html.parser')
        sentence = soup.get_text()  # remove HTML tags
        sentence = sentence.replace(",", " , ")
        sentence = sentence.replace(".", " . ")
        sentence = sentence.replace("-", " - ")
        sentence = sentence.replace("/", " / ")
        words = sentence.split()
        filtered_sentence = ""
        for word in words:
            word = word.translate(table)  # remove punctuation
            if word and word not in stopwords:  # skip empty strings and stopwords
                filtered_sentence = filtered_sentence + word + " "
        sentences.append(filtered_sentence)
        urls.append(obj['article_link'])
        labels.append(obj['is_sarcastic'])

This cell raises `FileNotFoundError` unless `Sarcasm_Headlines_Dataset.json` is present in the working directory.
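The reading loop can be tried without the dataset file. The sketch below feeds two made-up records through the same line-by-line parsing, using `io.StringIO` in place of the real file; the field names match the loop above, while the headlines and URLs are invented for illustration.

```python
import io
import json

# Two invented records in the one-object-per-line layout the loop above expects.
sample_file = io.StringIO(
    '{"article_link": "https://example.com/a", "headline": "Local man wins award", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "Area cat declares itself mayor", "is_sarcastic": 1}\n'
)

demo_headlines, demo_labels, demo_urls = [], [], []
for line in sample_file:
    line = line.strip()
    if not line:            # tolerate blank lines
        continue
    record = json.loads(line)
    demo_headlines.append(record['headline'].lower())
    demo_labels.append(record['is_sarcastic'])
    demo_urls.append(record['article_link'])

print(demo_headlines)
print(demo_labels)
```

Because every line is a complete JSON object, the file is parsed with `json.loads` per line rather than a single `json.load` on the whole file.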

In [None]:
len(labels)

In [None]:
training_size = 23000
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

In [None]:
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=10, padding='post', truncating='post')

In [None]:
print(word_index)

In [None]:
# Reuse the tokenizer fitted on the training set; refitting on the test set would leak test vocabulary
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=10, padding='post', truncating='post')