<a href="https://colab.research.google.com/github/visha1Sagar/Practice/blob/main/Text_Preprocessing_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Preprocessing for NLP**


In this lab we will learn how to preprocess raw text data and make it suitable for further processing by deep learning (DL) models. We will perform this processing in two ways:
1. Using the NLTK library
2. Using TensorFlow



# Using NLTK



Import NLTK library and download relevant modules.

In [None]:
import os
import random
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


nltk.download('maxent_ne_chunker')
nltk.download('words')
random.seed(92)

# Load the TensorBoard notebook extension
%load_ext tensorboard

Let's define a corpus of raw text. For this example, the following text is taken from the Wikipedia page on NLP (https://en.wikipedia.org/wiki/Natural_language_processing).
You will notice that the text contains words, numbers, punctuation, and other special characters. This is the RAW form of the text. For different NLP tasks, this data must be preprocessed to make it more structured and workable.

In [None]:
# Keep the corpus as a plain string; wrapping it in str([...]) would embed list brackets and escaped newlines in the text.
corpus = """In 2003, word n-gram model, at the time the best statistical algorithm, was overperformed by a multi-layer perceptron (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in language modelling) by Yoshua Bengio with co-authors.[8]
In 2010, Tomáš Mikolov (then a PhD student at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden layer to language modelling,[9] and in the following years he went on to develop Word2vec.
In the 2010s, representation learning and deep neural network-style (featuring many hidden layers) machine learning methods became widespread in natural language processing.
That popularity was due partly to a flurry of results showing that such techniques[10][11] can achieve state-of-the-art results in many natural language tasks, e.g., in language modeling[12] and parsing.[13][14]
This is increasingly important in medicine and healthcare, where NLP helps analyze notes and text in electronic health records that would otherwise be inaccessible for study when seeking to improve care[15] or protect patient privacy.[16]"""

**Tokenization**
Tokenization means dividing a piece of raw text into semantically and syntactically meaningful units, called tokens.

These tokens can be

1.   sentence-level (sentence tokens)
2.   word-level (word tokens).

The following code passes the raw text (corpus) to the sent_tokenize() function, which splits the text into sentences. Each sentence is then further tokenized into words using the word_tokenize() function.

In [None]:
# NLTK treats each word, number, and punctuation mark as a distinct token.
token_list = []
sentences = nltk.sent_tokenize(str(corpus))
for sentence in sentences:
  words = nltk.word_tokenize(sentence)
  token_list.extend(words) #list appending
  #tagged_words = nltk.pos_tag(words)
  #named_entities = nltk.ne_chunk(tagged_words)

# The pos_tag() and ne_chunk() functions perform Part-Of-Speech (POS) tagging and Named-Entity Recognition, respectively.
# POS tagging assigns a part-of-speech category to each word. For example, the word 'horse' is tagged as a noun and 'walk' is tagged as a verb.
# A named entity is a word (or a set of words) that represents the name of a person, place, organisation, unit of measurement, etc.

print(token_list)
print ('\nLength of word_list:', len(token_list))

Another way to tokenize a piece of text is to use Python's built-in str.split() method, which splits on whitespace only, as shown below. (optional)

In [None]:
word_list = corpus.split()
print(word_list)
print ('\nLength of word_list:', len(word_list))
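
To see how plain whitespace splitting differs from punctuation-aware tokenization, the following sketch compares `str.split()` with a simple regular expression. The regex is only an illustrative stand-in, not NLTK's actual tokenizer, and the sample sentence is an assumption chosen for brevity.

```python
import re

sample = "In 2010, Mikolov applied a simple network."

# str.split() only breaks on whitespace, so punctuation stays glued to words.
whitespace_tokens = sample.split()

# Illustrative regex: runs of word characters, or any single non-space symbol.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `2010,` and `network.` remain single tokens, whereas the regex separates the comma and the full stop.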

Next, we filter the tokens and keep only those that consist entirely of alphabetic characters.

In [None]:
alphabets_only = [word for word in word_list if word.isalpha()]
print(alphabets_only)
print ('\nLength of word_list:', len(alphabets_only))
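
Because `str.isalpha()` returns `False` for any token containing a digit or punctuation mark, whitespace-split tokens such as `model,` are discarded entirely. A minimal sketch of one possible alternative, assuming it is acceptable to strip surrounding punctuation before filtering (the token list is a hypothetical sample):

```python
import string

tokens = ["In", "2003,", "word", "n-gram", "model,", "words."]

# Plain isalpha() filter: drops anything with a digit or punctuation.
strict = [t for t in tokens if t.isalpha()]

# Strip leading/trailing punctuation first, then apply the same filter.
stripped = [t.strip(string.punctuation) for t in tokens]
lenient = [t for t in stripped if t.isalpha()]

print(strict)
print(lenient)
```

The hyphenated token `n-gram` is still dropped in both cases, since the hyphen sits inside the word.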

**Text Normalisation** here means removing any capitalisation in the text by converting all tokens to lower case.

In [None]:
lower_case = [word.lower() for word in alphabets_only]
print(lower_case)
print ('\nLength of word_list:', len(lower_case))

**Stopword removal** means removing words that appear frequently in the English language but contribute very little to the overall meaning of the text. Note that stopword removal should not be applied to every NLP task, since in some applications these stopwords carry important information.

In [None]:
from nltk.corpus import stopwords

stopwords_nltk = set(stopwords.words('english'))
print(stopwords_nltk)
print ('\nLength of word_list:', len(stopwords_nltk), '\n')

cleaned_words = [word for word in lower_case if word not in stopwords_nltk]
print(cleaned_words)
print ('\nLength of word_list:', len(cleaned_words))
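
As a small illustration of why stopword removal is task-dependent, the sketch below uses a tiny hand-written stopword set (an assumption for the example; NLTK's English list is much longer, and it also contains the negation "not"). After removal, two sentences with opposite sentiment become identical.

```python
stop_words = {"this", "is", "not", "a"}  # hypothetical mini stopword list

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stop_words]

positive = remove_stopwords("This is a good movie")
negative = remove_stopwords("This is not a good movie")

print(positive)
print(negative)
```

For a task such as sentiment classification, this loss of the negation would be harmful, so the stopword list may need to be adjusted or skipped.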

# Using TensorFlow

This concludes the first part of the lab, which dealt with preprocessing raw text using NLTK. Now let's use TensorFlow to perform similar preprocessing (note how the two libraries differ in implementation) and build a complete example of sentiment analysis (here, sarcasm detection on headlines).

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
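
The printed `word_index` maps each word to an integer rank. A rough pure-Python sketch of that idea, assuming lowercasing, punctuation stripping, and ranking by descending frequency starting from index 1 (the helper `build_word_index` is hypothetical and only approximates the Keras `Tokenizer`):

```python
import string
from collections import Counter

def build_word_index(texts):
    """Approximate a frequency-ranked word index (1 = most frequent)."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # most_common() orders by count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

demo_sentences = ["i love my dog", "I, love my cat", "You love my dog!"]
demo_index = build_word_index(demo_sentences)
print(demo_index)
```

Note that `I,` and `i` collapse to the same entry, and `dog!` is counted as `dog`.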

In [None]:
import json
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import datetime

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
# Pad (or truncate) the sequences so that they all have the same length,
# i.e. every sequence in the batch is brought to the given standard length (maxlen).
padded = pad_sequences(sequences, maxlen=5)
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)
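
By default, `pad_sequences` pads and truncates at the front of each sequence. The sketch below mimics that behaviour in plain Python, assuming a pad value of 0; `pad_pre` is a hypothetical helper and the input sequences are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    result = []
    for seq in sequences:
        kept = seq[-maxlen:]  # 'pre' truncation drops items from the front
        result.append([value] * (maxlen - len(kept)) + kept)  # 'pre' padding
    return result

demo_sequences = [[5, 3, 2, 4], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
demo_padded = pad_pre(demo_sequences, maxlen=5)
print(demo_padded)
```

The seven-token sequence loses its first two tokens, which is why the start of the longest sentence is missing from the padded output when `maxlen=5`.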

In [None]:
!wget --no-check-certificate \
    https://storage.googleapis.com/learning-datasets/sarcasm.json \
    -O /tmp/sarcasm.json

In [None]:
with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []

for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])

In [None]:
vocab_size = 10000    # vocabulary size (matches the tokenizer's num_words)
embedding_dim = 16
max_length = 100
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
training_size = 20000

In [None]:
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

In [None]:
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
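
Since the tokenizer is fitted only on the training sentences, test headlines can contain unseen words, which are mapped to the `oov_token`. A minimal sketch of that lookup, assuming the OOV token holds index 1 and using a hypothetical hand-written `toy_index`:

```python
toy_index = {"<OOV>": 1, "my": 2, "love": 3, "dog": 4}  # hypothetical vocabulary

def to_sequence(text, word_index, oov="<OOV>"):
    """Map each lowercase word to its index, falling back to the OOV index."""
    return [word_index.get(w, word_index[oov]) for w in text.lower().split()]

demo_sequence = to_sequence("my dog loves my manatee", toy_index)
print(demo_sequence)
```

Both `loves` and `manatee` are missing from the toy vocabulary, so they share the OOV index.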

In [None]:
# Need this block to get it to work with TensorFlow 2.x
import numpy as np
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Sigmoid output for two-class (binary) classification.
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
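
The parameter counts reported by `model.summary()` can be checked by hand. The arithmetic below uses the hyperparameters defined earlier (`vocab_size = 10000`, `embedding_dim = 16`) and the usual formulas for embedding and dense layers; the pooling layer has no trainable weights.

```python
vocab_size, embedding_dim = 10000, 16

embedding_params = vocab_size * embedding_dim   # one vector per vocabulary entry
dense_24_params = embedding_dim * 24 + 24       # weights + biases
dense_1_params = 24 * 1 + 1                     # weights + bias
total_params = embedding_params + dense_24_params + dense_1_params

print(embedding_params, dense_24_params, dense_1_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.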

In [None]:
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

history = model.fit(training_padded, training_labels, epochs=30 , validation_data=(testing_padded, testing_labels), verbose=2, callbacks=[tensorboard_callback])

In [None]:
sentence = ["granny starting to fear spiders in the garden might be real", "game of thrones season finale showing this sunday night"]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(model.predict(padded))

# binary classification.
# sarcastic class and Non sarcastic class.
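
The sigmoid output is a score between 0 and 1 for each input sentence. A minimal sketch of turning such scores into class labels, assuming a 0.5 threshold and using made-up scores rather than real model output:

```python
scores = [0.91, 0.08]   # hypothetical values, not actual predictions
threshold = 0.5         # assumed decision boundary

predicted_labels = ["sarcastic" if s >= threshold else "not sarcastic" for s in scores]
print(predicted_labels)
```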

In [None]:
%tensorboard --logdir logs/fit

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

# Language modelling

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


In [None]:
import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np

In [None]:
!wget --no-check-certificate \
    https://storage.googleapis.com/learning-datasets/irish-lyrics-eof.txt \
    -O /tmp/irish-lyrics-eof.txt

In [None]:
tokenizer = Tokenizer()

# Read the whole lyrics file; the context manager closes it afterwards.
with open('/tmp/irish-lyrics-eof.txt') as f:
    data = f.read()

corpus = data.lower().split("\n")

tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(total_words)


In [None]:
input_sequences = []
for line in corpus:
	token_list = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)

# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]

ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
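
The loop above turns each line into a set of growing prefixes, pads them on the left, and uses the last token of each prefix as the label. A small sketch of the same steps in plain Python, using a hypothetical four-token line (the token ids are made up):

```python
line_tokens = [4, 2, 66, 8]   # hypothetical ids for a four-word line

# Every prefix of length >= 2 becomes one training example.
prefixes = [line_tokens[:i + 1] for i in range(1, len(line_tokens))]

# Left-pad with zeros to the length of the longest prefix.
longest = max(len(p) for p in prefixes)
padded_rows = [[0] * (longest - len(p)) + p for p in prefixes]

# Inputs are all but the last token; the target is the last token.
inputs = [row[:-1] for row in padded_rows]
targets = [row[-1] for row in padded_rows]

print(inputs)
print(targets)
```

In the notebook, the padding length is taken over all lines of the corpus rather than a single line, and the targets are additionally one-hot encoded.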

In [None]:
print(tokenizer.word_index['in'])
print(tokenizer.word_index['the'])
print(tokenizer.word_index['town'])
print(tokenizer.word_index['of'])
print(tokenizer.word_index['athy'])
print(tokenizer.word_index['one'])
print(tokenizer.word_index['jeremy'])
print(tokenizer.word_index['lanigan'])

In [None]:
print(xs[6])

In [None]:
print(ys[6])

In [None]:
print(xs[5])
print(ys[5])

In [None]:
print(tokenizer.word_index)

In [None]:
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(learning_rate=0.01)  # 'lr' is a deprecated alias in recent TensorFlow versions
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1)
# Print a layer-by-layer summary of the trained model.
model.summary()


In [None]:
import matplotlib.pyplot as plt


def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.show()

In [None]:
plot_graphs(history, 'accuracy')


In [None]:
seed_text = "I've got a bad feeling about this"
next_words = 100

for _ in range(next_words):
	token_list = tokenizer.texts_to_sequences([seed_text])[0]
	token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
	predicted = np.argmax(model.predict(token_list), axis=-1)[0]  # scalar index of the most likely next word
	output_word = ""
	for word, index in tokenizer.word_index.items():
		if index == predicted:
			output_word = word
			break
	seed_text += " " + output_word
print(seed_text)
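
The inner loop scans the whole vocabulary for every generated word. One possible alternative is a reverse dictionary that is built once. The sketch below shows the idea with a hypothetical toy vocabulary and a stand-in `fake_predict` function in place of the trained model.

```python
toy_word_index = {"the": 1, "wild": 2, "rover": 3}       # hypothetical vocabulary
index_word = {i: w for w, i in toy_word_index.items()}   # reverse lookup, built once

def fake_predict(tokens):
    """Stand-in for the model: cycles through ids 1..3."""
    return tokens[-1] % 3 + 1

tokens = [1]
for _ in range(4):
    tokens.append(fake_predict(tokens))  # greedy: always take the predicted id

generated = " ".join(index_word[t] for t in tokens)
print(generated)
```

Each lookup in `index_word` is a single dictionary access, so the cost per generated word no longer grows with the vocabulary size.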

# Deliverable

Write a detailed document on what you learned in this lab, covering the different steps involved in preprocessing text for NLP tasks and how language modelling is done in the task above. Clearly explain the method and the results. Also explain how and in what ways TensorBoard helps in analyzing model training and validation. The answer should be specific to the lab and the results obtained.

It's an **individual** lab.