
In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Ingest Data & Import Libraries

In [4]:
!wget --no-check-certificate \
https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv -O /tmp/bbc-text.csv

import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stopwords list from https://github.com/Yoast/YoastSEO.js/blob/develop/src/config/stopwords.js
# Convert it to a Python list and paste it here
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]

--2020-04-10 03:13:22--  https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.203.128, 2404:6800:4008:c07::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.203.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5057493 (4.8M) [text/csv]
Saving to: ‘/tmp/bbc-text.csv’


2020-04-10 03:13:23 (142 MB/s) - ‘/tmp/bbc-text.csv’ saved [5057493/5057493]



# Read Data Into Lists of Sentences & Labels

In [5]:
sentences = []
labels = []
with open("/tmp/bbc-text.csv", 'r') as csvfile:
  reader = csv.reader(csvfile)
  header = next(reader)  # skip the header row so it is not read as data
  for row in reader:
    labels.append(row[0])
    sentences.append(row[1])

print(len(sentences))
print(sentences[0])

2225
tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to 
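
The `stopwords` list defined earlier is not applied in the cells shown here. If you want to drop those words before tokenizing, one possible approach is sketched below. `remove_stopwords` is a hypothetical helper, the short stopword list and the sample sentence are stand-ins, and splitting on whitespace is a simplifying assumption.

```python
# Hypothetical helper, not part of the notebook: drop stopwords from one sentence.
def remove_stopwords(sentence, stopwords):
    stopword_set = set(stopwords)  # a set gives fast membership tests
    kept = [word for word in sentence.split() if word not in stopword_set]
    return " ".join(kept)

# Stand-in data for illustration; the notebook would pass its full stopwords list.
sample_stopwords = ["the", "in", "of", "with"]
sample = "tv future in the hands of viewers with home theatre systems"
cleaned = remove_stopwords(sample, sample_stopwords)
print(cleaned)  # tv future hands viewers home theatre systems
```

Applied inside the reading loop, this would amount to `sentences.append(remove_stopwords(row[1], stopwords))`, which shrinks both the vocabulary and the sequence lengths reported further down.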

# Tokenize Sentence Corpus & Develop Word Index

Tokenization maps each word in the corpus to an integer (its word index) and stores the mapping as {word: index} key-value pairs.

In [6]:
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(len(word_index))

29727
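
To make the idea of a word index concrete, here is a rough pure-Python sketch of how such a mapping can be built. It is not the Keras implementation: `build_word_index` is a hypothetical helper, the toy corpus is made up, and the sketch only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation by default. The assumption it shares with Keras is that more frequent words get smaller indices and the OOV token takes index 1.

```python
from collections import Counter

# Hypothetical helper: build a {word: index} mapping ordered by frequency.
def build_word_index(texts, oov_token="<OOV>"):
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    word_index = {oov_token: 1}  # index 1 is reserved for unseen words
    # Most frequent word gets index 2, the next one index 3, and so on.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

toy_corpus = ["the cat sat", "the cat ran", "the dog"]
toy_index = build_word_index(toy_corpus)
print(toy_index["<OOV>"], toy_index["the"], toy_index["cat"])  # 1 2 3
print(len(toy_index))  # 6
```

The length printed by the notebook cell (29727) is the size of this kind of dictionary for the full BBC corpus, including the OOV entry.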


# Convert Sentences into Sequences of Tokens & Pad for Uniformity

In [7]:
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

[177 265   7 ...   0   0   0]
(2225, 4491)
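
The shape (2225, 4491) means every one of the 2225 articles was extended to the length of the longest one, 4491 tokens, with zeros appended at the end because of `padding='post'`. The pure-Python sketch below mimics that behaviour on toy data. `pad_post` is a hypothetical helper, not the Keras function, and it ignores the truncation options that `pad_sequences` also offers.

```python
# Hypothetical helper: pad every sequence at the end up to the longest length.
def pad_post(sequences, value=0):
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]

toy_sequences = [[2, 3, 4], [2, 5], [6]]
toy_padded = pad_post(toy_sequences)
for row in toy_padded:
    print(row)
# [2, 3, 4]
# [2, 5, 0]
# [6, 0, 0]
```

Using 0 as the padding value is safe with a word index like the one above, since real tokens start at index 1.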


# Label Tokenization & Develop Label Word Index  

oov_token reserves a single index for out-of-vocabulary (OOV) tokens, that is, words the tokenizer did not see when it was fitted. In the label tokenizer below, OOV takes index 1, so the label indices start at 2.
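
As a rough illustration of the OOV lookup, the pure-Python sketch below converts text to indices and falls back to the OOV index for unknown words. `to_sequence` is a hypothetical helper rather than the Keras method, and the label names and indices in `toy_label_index` are made-up values for the example, not the notebook's actual output.

```python
# Hypothetical helper: look up each word, falling back to the OOV index.
def to_sequence(text, word_index, oov_token="<OOV>"):
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]

# Made-up index for illustration; OOV holds 1, so known labels start at 2.
toy_label_index = {"<OOV>": 1, "sport": 2, "business": 3}
print(to_sequence("sport", toy_label_index))    # [2]
print(to_sequence("weather", toy_label_index))  # [1]
```

If the labels are later used as class targets, keep in mind that indices 0 and 1 never occur for known labels, so the smallest real label is 2.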

In [8]:
label_tokenizer = Tokenizer(oov_token="<OOV>")
label_tokenizer.fit_on_texts(labels)
label_seq = label_tokenizer.texts_to_sequences(labels)
label_word_index = label_tokenizer.word_index
print(label_seq)
print(label_word_index)

[[5], [3], [2], [2], [6], [4], [4], [2], [2], [6], [6], [3], [3], [4], [2], [3], [4], [2], [3], [5], [5], [5], [2], [2], [5], [2], [6], [5], [4], [6], [4], [5], [6], [6], [3], [4], [5], [6], [4], [3], [4], [2], [3], [2], [5], [6], [4], [4], [4], [3], [2], [4], [3], [3], [2], [4], [3], [2], [2], [3], [3], [2], [3], [2], [3], [5], [3], [6], [5], [3], [4], [3], [4], [2], [3], [5], [3], [2], [2], [3], [3], [2], [4], [3], [6], [4], [4], [3], [6], [3], [2], [2], [4], [2], [4], [2], [3], [2], [3], [6], [6], [2], [3], [4], [4], [5], [2], [6], [2], [5], [3], [6], [2], [6], [2], [6], [6], [4], [2], [2], [6], [4], [3], [5], [3], [3], [5], [2], [4], [2], [5], [6], [2], [3], [3], [5], [6], [5], [2], [3], [3], [3], [5], [2], [5], [3], [2], [6], [2], [5], [2], [5], [4], [3], [5], [6], [2], [3], [4], [3], [6], [4], [4], [6], [4], [3], [6], [4], [4], [6], [4], [2], [3], [4], [4], [3], [6], [2], [3], [3], [2], [5], [2], [5], [5], [2], [3], [2], [4], [6], [4], [3], [4], [3], [5], [4], [6], [4], [5], [3],