
## Test of Deep Learning Foundation Skills in TF 2.0

This test checks that you are comfortable building a simple LSTM model in TensorFlow 2.0.

Please copy the notebook and then go through and fill in the missing code. We want to make sure you understand the concepts and the code. Please feel free to add comments etc.

Once you are done, please create a shareable link and submit it.

## Loading in GloVe Embeddings and the data set

In [0]:
!wget -qq https://www.dropbox.com/s/v14xhvjmfniraf3/glove6b100dtxt.zip
  
!unzip glove6b100dtxt.zip

!wget -qq https://www.dropbox.com/s/fi2ytva8yvbobu1/newsgroup20.zip
!unzip -qq newsgroup20.zip

!rm -r __MACOSX

Archive:  glove6b100dtxt.zip
  inflating: glove.6B.100d.txt       


In [0]:
%tensorflow_version 2.0

`%tensorflow_version` only switches the major version: `1.x` or `2.x`.
You set: `2.0`. This will be interpreted as: `2.x`.


TensorFlow 2.x selected.


## Imports 

Bring in the components needed to preprocess the text and run it through the model.

In [0]:
import os
import sys
import numpy as np

import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.utils import to_categorical

from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D, LSTM
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import Constant

In [0]:
print(tf.__version__)

2.0.0


In [0]:
BASE_DIR = ''
GLOVE_DIR = os.path.join(BASE_DIR, '')
TEXT_DATA_DIR = os.path.join(BASE_DIR, '20_newsgroup')

MAX_SEQUENCE_LENGTH = 200
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

In [0]:
#Assembling the GloVe word vectors
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
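
Each line of the GloVe text file holds a token followed by its vector components, which is why the loop above splits the line and converts everything after the first field to floats. The sketch below repeats that parsing on two made-up three-dimensional lines (the values are illustrative, not taken from GloVe) and compares the resulting vectors with cosine similarity. `parse_glove_line` and `cosine_similarity` are hypothetical helper names.

```python
import math

# Two made-up lines in the GloVe text format: a token followed by its
# space-separated vector components (the real file has 100 values per line).
sample_lines = [
    "cat 1.0 0.0 1.0",
    "dog 1.0 0.0 0.5",
]

def parse_glove_line(line):
    """Split one line into (word, list of floats)."""
    parts = line.split()
    return parts[0], [float(x) for x in parts[1:]]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

toy_index = dict(parse_glove_line(line) for line in sample_lines)
similarity = cosine_similarity(toy_index["cat"], toy_index["dog"])
print(similarity)
```

With the real `embeddings_index`, the same function could be applied to any two words that are present in the dictionary.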


### Check an example GloVe embedding

In [0]:
embeddings_index['hello']

array([ 0.26688  ,  0.39632  ,  0.6169   , -0.77451  , -0.1039   ,
        0.26697  ,  0.2788   ,  0.30992  ,  0.0054685, -0.085256 ,
        0.73602  , -0.098432 ,  0.5479   , -0.030305 ,  0.33479  ,
        0.14094  , -0.0070003,  0.32569  ,  0.22902  ,  0.46557  ,
       -0.19531  ,  0.37491  , -0.7139   , -0.51775  ,  0.77039  ,
        1.0881   , -0.66011  , -0.16234  ,  0.9119   ,  0.21046  ,
        0.047494 ,  1.0019   ,  1.1133   ,  0.70094  , -0.08696  ,
        0.47571  ,  0.1636   , -0.44469  ,  0.4469   , -0.93817  ,
        0.013101 ,  0.085964 , -0.67456  ,  0.49662  , -0.037827 ,
       -0.11038  , -0.28612  ,  0.074606 , -0.31527  , -0.093774 ,
       -0.57069  ,  0.66865  ,  0.45307  , -0.34154  , -0.7166   ,
       -0.75273  ,  0.075212 ,  0.57903  , -0.1191   , -0.11379  ,
       -0.10026  ,  0.71341  , -1.1574   , -0.74026  ,  0.40452  ,
        0.18023  ,  0.21449  ,  0.37638  ,  0.11239  , -0.53639  ,
       -0.025092 ,  0.31886  , -0.25013  , -0.63283  , -0.0118

## Load the text samples and their labels

In [0]:
# Process the text 
print('Processing text dataset')

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids

for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):

        #new label id
        label_id = len(labels_index)
        # directory name is label name
        labels_index[name] = label_id

        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                args = {} if sys.version_info < (3,) else {'encoding': 'latin-1'}
                with open(fpath, **args) as f:
                    t = f.read()
                    i = t.find('\n\n')  # skip header
                    if 0 < i:
                        t = t[i:]
                    texts.append(t)
                labels.append(label_id)

print('Found %s texts.' % len(texts))

Processing text dataset
Found 19997 texts.
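
The loading loop drops each post's header by cutting at the first blank line. A minimal sketch of that step on a made-up message is shown below; `strip_header` is a hypothetical helper name and the message text is invented for illustration.

```python
def strip_header(message):
    """Drop everything before the first blank line, mirroring the loop above."""
    i = message.find('\n\n')
    # find() returns -1 when there is no blank line; as in the loop above,
    # the message is kept whole in that case (and when the match is at index 0).
    return message[i:] if i > 0 else message

raw = "From: someone@example.com\nSubject: test\n\nBody of the post.\n"
body = strip_header(raw)
print(repr(body))
```

Because the slice starts at the blank line itself, the kept text begins with two newline characters.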


In [0]:
labels_index

{'alt.atheism': 0,
 'comp.graphics': 1,
 'comp.os.ms-windows.misc': 2,
 'comp.sys.ibm.pc.hardware': 3,
 'comp.sys.mac.hardware': 4,
 'comp.windows.x': 5,
 'misc.forsale': 6,
 'rec.autos': 7,
 'rec.motorcycles': 8,
 'rec.sport.baseball': 9,
 'rec.sport.hockey': 10,
 'sci.crypt': 11,
 'sci.electronics': 12,
 'sci.med': 13,
 'sci.space': 14,
 'soc.religion.christian': 15,
 'talk.politics.guns': 16,
 'talk.politics.mideast': 17,
 'talk.politics.misc': 18,
 'talk.religion.misc': 19}

## Tokenize the words 

Please use the `Tokenizer` class and `pad_sequences` to prepare the text.

In [0]:
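# Only the most frequent words (index below MAX_NUM_WORDS) are kept
# when texts are converted to sequences
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
# Fit on the texts to build the vocabulary
tokenizer.fit_on_texts(texts)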

In [0]:
print(f'Found {len(tokenizer.word_counts)} unique tokens!' )

Found 174074 unique tokens!
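
The count above is larger than `MAX_NUM_WORDS` because the tokenizer records every word it sees; the cap is only applied later, when texts are converted to sequences. The plain-Python sketch below illustrates the idea under simplifying assumptions: it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation. `build_word_index` and `to_sequence` are hypothetical helper names.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 stays unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only known words whose index is below num_words; drop the rest."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

toy_texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(toy_texts)
sequence = to_sequence("the dog sat", word_index, num_words=3)
print(word_index)
print(sequence)
```

In this toy run the full index holds five words, but with `num_words=3` only ids 1 and 2 can appear in a sequence.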


Pad the sequences to a common length.


In [0]:
data = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_SEQUENCE_LENGTH)
data[100]

array([  595,    67,   251,   820,   251,  7310,     5,   114,   251,
       19980,     3,    35,     8,     1,   150,  3543,   252,   251,
        7310,   137,    52,   287,    68,  1997, 11234,     5,  1002,
          12,   816,    22,   314,   816,    82,     2,   495,  1819,
        1242,  3065,     5,   240,    11,  2792,    22,   114,   177,
          11,  2792,     5,   350,    11,     8,    14,   117,   150,
         610,    29, 19979,    61, 11995,   288,   634,  1002,   121,
        1451,     6,   354,  3591,   137,     5,     6,   699,     1,
         422,   137,    17,  4406,    33,   686,     6,    33,   686,
           1,   134,   639,   310,     8,     1,  1267,     3,   362,
           6,  2992,    13,  4318,   110,  3918,    14,   253,     2,
          16,     1,  3336,     3,   374,  2506,     5, 13811,     2,
        8333,     9,    11,     8,   287,  1724,     2,   116,     1,
        2520,  4015,    22,  2506,    15,     1,   230,     3,  2114,
        6851,    82,
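
By default `pad_sequences` pads with zeros at the front of short sequences and truncates long ones from the front, so the last `maxlen` token ids are the ones kept. A plain-Python sketch of that default behaviour follows; `pad_pre` is a hypothetical helper name, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen ids of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                       # truncate from the front
        padding = [value] * (maxlen - len(kept))   # pad at the front
        padded.append(padding + kept)
    return padded

result = pad_pre([[5, 6, 7, 8, 9], [3, 4]], maxlen=4)
print(result)
```

Padding at the front means the real tokens sit next to the final LSTM time step.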

## One-hot encode the labels

Convert the integer label ids into one-hot encoded vectors.

In [0]:
# keep the original labels list
labels_orig = labels

In [0]:


labels = to_categorical(labels_orig)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (19997, 200)
Shape of label tensor: (19997, 20)
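
The label tensor has one column per class, which is what one-hot encoding produces: each row is all zeros except for a 1 at the position of the label id. A small plain-Python sketch of the idea is below; `one_hot` is a hypothetical helper, whereas the notebook itself uses `to_categorical`.

```python
def one_hot(label_ids, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = []
    for label in label_ids:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)
```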


In [0]:
# shuffle the samples and compute the size of the held-out set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])


## Make a train/test split

Hold out the last `num_validation_samples` rows of the shuffled data as the test set and train on the rest.

In [0]:
train_data = data[:-num_validation_samples]
test_data = data[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
test_labels= labels[-num_validation_samples:]
print(f' labels in training set: {len(train_data)}')
print(f' labels in test set: {len(test_data)}')

 labels in training set: 15998
 labels in test set: 3999
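
The split relies on applying one shared permutation to both the data and the labels, so that every sample keeps its label. The sketch below shows the same idea in plain Python on toy values; `shuffled_split` is a hypothetical helper, and the fixed seed is an addition for reproducibility (the notebook shuffles without one).

```python
import random

def shuffled_split(data, labels, test_fraction, seed=0):
    """Shuffle data and labels with one shared permutation, then split."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)   # fixed seed: same split every run
    data = [data[i] for i in indices]
    labels = [labels[i] for i in indices]
    split = len(data) - int(test_fraction * len(data))
    return data[:split], data[split:], labels[:split], labels[split:]

docs = ["a", "b", "c", "d", "e"]
ids = [0, 1, 2, 3, 4]
train_x, test_x, train_y, test_y = shuffled_split(docs, ids, test_fraction=0.2)
print(len(train_x), len(test_x))

# Same arithmetic as the notebook: 19997 samples with a 0.2 hold-out.
n_test = int(0.2 * 19997)
print(n_test, 19997 - n_test)
```

The last two numbers match the set sizes printed above.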


## Matrix for embedding

In [0]:
word_index = tokenizer.word_index

In [0]:
print('Preparing embedding matrix.')

# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Preparing embedding matrix.
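
Row `i` of the embedding matrix holds the pre-trained vector of the word whose tokenizer index is `i`; row 0 (the padding id) and rows for words missing from GloVe stay all-zero. A toy illustration in plain Python, with a made-up vocabulary and made-up two-dimensional vectors:

```python
toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3}        # hypothetical tokenizer output
toy_embeddings = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}  # hypothetical 2-d vectors

dim = 2
num_rows = len(toy_word_index) + 1   # +1 because index 0 is the padding id
matrix = [[0.0] * dim for _ in range(num_rows)]

for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = list(vector)     # words without a vector keep their zero row

for row in matrix:
    print(row)
```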


## Define the model and compile using the Functional API

In [0]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

# LSTM layer with 128 units.
# Dropout on the inputs and on the recurrent state reduces overfitting,
# i.e. the gap between training and validation accuracy.
lstm_layer = LSTM(128, dropout=0.2, recurrent_dropout=0.2)(embedded_sequences)

# Dense output layer with one node per label name; softmax for multiclass classification.
dense_layer = Dense(len(labels_index), activation='softmax')(lstm_layer)
preds = dense_layer


In [0]:
model = Model(sequence_input, preds)
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 200, 100)          2000100   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               117248    
_________________________________________________________________
dense (Dense)                (None, 20)                2580      
Total params: 2,119,928
Trainable params: 119,828
Non-trainable params: 2,000,100
_________________________________________________________________
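
The parameter counts in the summary can be checked by hand. The embedding has one 100-dimensional row per vocabulary index; an LSTM layer has four gate blocks, each with input weights, recurrent weights, and a bias; the dense layer has one weight per input-output pair plus a bias per output. The short calculation below reproduces the figures using the sizes reported in the summary.

```python
vocab_rows = 20001   # num_words: 20000 + 1
embed_dim = 100      # EMBEDDING_DIM
units = 128          # LSTM units
classes = 20         # number of newsgroups

embedding_params = vocab_rows * embed_dim
# Four gate blocks, each with (input + recurrent) weights and a bias vector.
lstm_params = 4 * ((embed_dim + units) * units + units)
dense_params = units * classes + classes

total = embedding_params + lstm_params + dense_params
trainable = lstm_params + dense_params   # the embedding layer is frozen

print(embedding_params, lstm_params, dense_params)
print(total, trainable)
```

Only the LSTM and dense layers are trainable, since the embedding layer was created with `trainable=False`.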


In [0]:


model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
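
For a one-hot label, categorical cross-entropy reduces to the negative log of the probability the model assigns to the true class. The sketch below computes it for a single made-up three-class prediction; it is a simplified illustration that omits the clipping and batch averaging Keras applies, and `categorical_crossentropy` here is a local helper, not the Keras function.

```python
import math

def categorical_crossentropy(one_hot_label, probabilities):
    """Negative log-probability of the true class for one sample."""
    return -sum(
        y * math.log(p)
        for y, p in zip(one_hot_label, probabilities)
        if y > 0  # zero entries contribute nothing to the sum
    )

loss = categorical_crossentropy([0, 1, 0], [0.2, 0.7, 0.1])
print(loss)
```

A confident correct prediction gives a loss near zero, while a low probability on the true class gives a large loss.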


### Train the model using model.fit

In [0]:
#early_stopping_cb = tf.keras.callbacks.EarlyStopping(monitor='val_loss')

In [0]:
model.fit(train_data, train_labels,
          batch_size=512,
          epochs=50,
          validation_split=0.2,
        #  callbacks = [early_stopping_cb] # To stop training after convergence 
          )

Train on 12798 samples, validate on 3200 samples
Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f07d67dda20>
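
The commented-out `EarlyStopping` callback would end training once the validation loss stops improving instead of always running all 50 epochs. The sketch below illustrates a generic patience rule on made-up loss values; `stopping_epoch` is a hypothetical helper, and the exact bookkeeping in Keras may differ.

```python
def stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only.
print(stopping_epoch([1.0, 0.8, 0.9, 0.7], patience=1))
print(stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.7], patience=2))
```

A larger patience tolerates short plateaus at the cost of a few extra epochs.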

### Test on a sample

In [0]:
sample_idx=-88
texts[sample_idx].split('\n')

['',
 '',
 'In article <sbuckley.735337212@sfu.ca> sbuckley@fraser.sfu.ca (Stephen Buckley) writes:',
 '>muttiah@thistle.ecn.purdue.edu (Ranjan S Muttiah) writes:',
 '',
 '>>Mr. Clinton said today that the horrible tragedy of the Waco fiasco',
 '>>should remind those who join cults of the dangers of doing so.',
 '>>Now, I began scratching my head thinking (a bad sign :-), "don\'t the ',
 '>>mainstream religions (in this case Christianity...or the 7th day ',
 ">>adventist in particular) just keep these guys going ? Isn't Mr. Clinton ",
 '>>condemning his own religion ? After all, isn\'t it a cult too ?"',
 '',
 '>>... bad thoughts these.',
 '',
 '>  well it depends on whether you take the literal dictionary definition of',
 '>cult and say all faiths are cults, or if you take a more social-context',
 '>view of "cult which allows you to recognize mainstream religions as ',
 '>socially-acceptable and cults as groups that involve techniques of brain-',
 '>washing and all the other character

In [0]:
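# Do prediction step here
sample_idx = -88
# Take the sample from the original `texts` list, because `data` was shuffled for the train/test split.
# The sample may therefore have been seen during training.
# Wrapping the text in a list also works for sample_idx = -1, where the slice [-1:0] would be empty.
predict_data = pad_sequences(tokenizer.texts_to_sequences([texts[sample_idx]]), maxlen=MAX_SEQUENCE_LENGTH)
# Despite the variable name, the softmax output layer returns class probabilities, not logits.
prediction_logits = model.predict(predict_data)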


In [0]:
prediction_logits

array([[4.6901825e-01, 2.9038871e-05, 1.1197404e-05, 2.1173032e-06,
        1.6490731e-05, 2.5303987e-05, 2.1990618e-06, 1.4413213e-05,
        6.5092390e-06, 4.5575682e-05, 9.3843946e-06, 4.9531559e-04,
        9.2440850e-06, 7.8234397e-04, 5.4128811e-04, 7.9022804e-03,
        2.7562813e-03, 4.5764712e-03, 3.1540256e-02, 4.8221603e-01]],
      dtype=float32)

In [0]:
prediction_label_best = prediction_logits.argmax()
print(f'Label Id with highest probability is: {prediction_label_best}')

Label Id with highest probability is: 19


In [0]:
label_names = list(labels_index.keys())
label_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [0]:
# And get the predicted label text
print(f'Predicted Label name is: {label_names[prediction_label_best]}')

Predicted Label name is: talk.religion.misc


In [0]:
# Label name from the dataset
actual_label_id = labels_orig[sample_idx]
print(f'Actual Label name is: {label_names[actual_label_id]}')

Actual Label name is: talk.religion.misc
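
Looking beyond the single best class can be informative. The sketch below copies the probabilities printed for `prediction_logits` above (they come from a softmax layer, so they sum to roughly 1) and lists the three most likely labels in plain Python.

```python
# Values copied from the prediction output shown earlier in the notebook.
probs = [
    4.6901825e-01, 2.9038871e-05, 1.1197404e-05, 2.1173032e-06,
    1.6490731e-05, 2.5303987e-05, 2.1990618e-06, 1.4413213e-05,
    6.5092390e-06, 4.5575682e-05, 9.3843946e-06, 4.9531559e-04,
    9.2440850e-06, 7.8234397e-04, 5.4128811e-04, 7.9022804e-03,
    2.7562813e-03, 4.5764712e-03, 3.1540256e-02, 4.8221603e-01,
]
label_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Indices of the three largest probabilities, highest first.
top3 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:3]
for i in top3:
    print(f'{label_names[i]}: {probs[i]:.3f}')
```

For this sample the top two classes, `talk.religion.misc` and `alt.atheism`, are nearly tied, so the correct prediction was a close call.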


#### All Done!