# Introduction

# Practical

## Downloading Required Files
- `tensorflow_hub` - Library for loading pretrained models from TensorFlow Hub, a repository of trained models for many use cases
- `tensorflow_text` - TF.Text is a TensorFlow library of text-related ops, modules, and subgraphs. The library can perform the preprocessing regularly required by text-based models, and includes other features useful for sequence modeling not provided by core TensorFlow.

In [5]:
!pip install tensorflow-text



In [6]:
# importing libraries
import tensorflow_hub as hub
import tensorflow_text as text

## Setting the Download URLs
- For Preprocessor
- For Encoder

In [7]:
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

## Creating Preprocessing Objects
- Use `hub.KerasLayer` to wrap the preprocessor; the resulting layer is callable like a function, so inputs can be passed to it directly
- `bert_preprocess_model`: preprocesses text and returns a dictionary
- `test_text`: the text we pass to `bert_preprocess_model()` to get the preprocessed text


In [8]:
# creating hub layers
bert_preprocess_model = hub.KerasLayer(preprocess_url)

In [9]:
# creating testing text
test_text = ['I love pizza', 'India is my country', 'Italy is fun place to visit']

In [10]:
# preprocessing text
text_preprocessed = bert_preprocess_model(test_text)
text_preprocessed.keys()

dict_keys(['input_type_ids', 'input_word_ids', 'input_mask'])

## Understanding Preprocessed Dictionary Keys

- input_mask
- input_type_ids
- input_word_ids


### Input Mask

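- A mask marking which positions hold real tokens (1) and which hold padding (0)
- Every sequence starts with a `[CLS]` token and ends with a `[SEP]` token, so the number of 1s is `number of tokens in the sentence + 2`
- In `input_word_ids`, the `[CLS]` token has id 101 and the `[SEP]` token has id 102
- Tensor shape: `(number of sentences, 128)`, where 128 is the maximum sequence length; positions beyond the end of the sentence are padded with zeros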
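
The link between padding and the mask can be sketched in plain Python. This is only an illustration, not the preprocessor's actual implementation: `build_input_mask`, `PAD_ID`, and `SEQ_LEN` are hypothetical names, and the token ids are the ones this notebook prints for 'I love pizza' in the `input_word_ids` output.

```python
# Hypothetical helper: derive a mask from padded token ids.
PAD_ID = 0      # assumption: padding positions carry id 0, as in this notebook's outputs
SEQ_LEN = 128   # fixed sequence length produced by the preprocessor

def build_input_mask(token_ids):
    """Return 1 for every real token (including the special tokens) and 0 for padding."""
    return [1 if tok != PAD_ID else 0 for tok in token_ids]

# 'I love pizza': id 101 at the start, three word tokens, id 102 at the end, then padding
token_ids = [101, 1045, 2293, 10733, 102] + [PAD_ID] * (SEQ_LEN - 5)
mask = build_input_mask(token_ids)

print(mask[:8])    # [1, 1, 1, 1, 1, 0, 0, 0]
print(sum(mask))   # 5 = 3 word tokens + 2 special tokens
```

The first row of the real `input_mask` tensor shows the same pattern: five ones followed by zeros.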

In [11]:
# input_mask 
text_preprocessed['input_mask']

<tf.Tensor: shape=(3, 128), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0

### Input Type Ids
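- Assigns a segment id to every token so that multiple sentences packed into one input can be told apart
- Useful when a single input contains more than one sentence, for example a sentence pair
- Not useful in our case: each input holds a single sentence, so every id is 0
- Helps in contextualizing the sentence
- Can be useful when each record, such as a row of a pandas DataFrame, holds more than one sentence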
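
To make the segment ids concrete, here is a minimal sketch of how they could be laid out for a single sentence and for a sentence pair. It assumes the usual BERT layout of `[CLS] A [SEP] B [SEP]`; the helper `build_input_type_ids` and the token lists are hypothetical, and padding is left out for brevity.

```python
# Hypothetical helper: segment ids for the layout "[CLS] A [SEP] B [SEP]".
def build_input_type_ids(tokens_a, tokens_b=None):
    # Segment 0 covers [CLS], the first sentence, and its [SEP]
    type_ids = [0] * (len(tokens_a) + 2)
    if tokens_b:
        # Segment 1 covers the second sentence and its closing [SEP]
        type_ids += [1] * (len(tokens_b) + 1)
    return type_ids

single = build_input_type_ids(["i", "love", "pizza"])
pair = build_input_type_ids(["i", "love", "pizza"], ["it", "is", "tasty"])

print(single)  # [0, 0, 0, 0, 0]
print(pair)    # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

With single-sentence inputs like the ones in this notebook, only the all-zero case occurs.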

In [12]:
# input_type_ids
text_preprocessed['input_type_ids']

<tf.Tensor: shape=(3, 128), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0

### Input Word Ids

- Holds the token ids of the input sequences
- Gives each individual word an id, so the same word always maps to the same id
- Each word is encoded (the ids come from the model's vocabulary), padded, and separated by special tokens
- Shape: `(number of sentences, 128)`, where 128 is the maximum length of each sequence
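
The encoding step can be approximated with a toy vocabulary lookup. This is a simplified sketch, not the real tokenizer: `toy_vocab` and `encode` are hypothetical, the word ids are copied from this notebook's output for 'I love pizza', and splitting on whitespace stands in for the more involved subword tokenization the real preprocessor performs.

```python
# Toy vocabulary: ids copied from this notebook's output for 'I love pizza'.
toy_vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
             "i": 1045, "love": 2293, "pizza": 10733}

def encode(sentence, vocab, seq_len=128):
    # Lowercase because the model is "uncased"; whitespace split is a simplification
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    ids = [vocab[tok] for tok in tokens]
    # Pad with the [PAD] id up to the fixed sequence length
    return ids + [vocab["[PAD]"]] * (seq_len - len(ids))

ids = encode("I love pizza", toy_vocab)
print(ids[:7])   # [101, 1045, 2293, 10733, 102, 0, 0]
print(len(ids))  # 128
```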

In [13]:
# input_word_ids
text_preprocessed['input_word_ids']

<tf.Tensor: shape=(3, 128), dtype=int32, numpy=
array([[  101,  1045,  2293, 10733,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 

## Creating Embeddings

- Embeddings are numeric vector representations of words.

- In word2vec, each word gets a single fixed embedding, so the same word has the same vector even when it is used in a different context.

- BERT (`Bidirectional Encoder Representations from Transformers`) solves this issue by creating contextualized embeddings.


USAGE:
- WORD EMBEDDINGS GENERATION: at each position, the embedding of a word also carries attention-weighted information (in the form of feature vectors) from the other parts of the sentence, both before and after it. This helps the model predict words without bias toward one direction and produces good embedding vectors.

- CLASSIFICATION: the model processes the whole sentence in parallel instead of word by word, which speeds up computation. Because it sees the whole sentence in one go, it can capture the meaning of each word correctly and make predictions with high probability values.

- WORDS/SENTENCES PREDICTIONS: because BERT is pretrained to predict masked words from their context, it can also suggest words to fill in a sentence, similar to autocomplete.


STEPS INVOLVED
- Create an encoder/model object using the hub API; it is callable like a function
- Pass `text_preprocessed` into the model (important)
- The model returns a dictionary
- Check the keys

In [14]:
# create our model- encoder which encodes to create word embeddings
bert_encoder_model = hub.KerasLayer(encoder_url)

In [15]:
# encoding - creating embeddings
bert_results = bert_encoder_model(text_preprocessed)

In [16]:
# check the keys
bert_results.keys()

dict_keys(['default', 'pooled_output', 'sequence_output', 'encoder_outputs'])

## Understanding Embedding Dictionary

- pooled_output
- sequence_output
- encoder_outputs

### Pooled Output
- Embedding for the entire sentence
- Shape: `(number of sentences, number of hidden units)`, where the hidden size is 768 in this case
- These 768 values are generally non-zero: each one is a learned feature that carries part of the contextual meaning of the sentence, with values ranging from negative to positive. This contextual representation is a large part of why BERT is so popular and powerful in NLP tasks

In [17]:
# pooled_output
bert_results['pooled_output']

<tf.Tensor: shape=(3, 768), dtype=float32, numpy=
array([[-0.8258812 , -0.23015198,  0.44445884, ...,  0.364323  ,
        -0.6134957 ,  0.88324934],
       [-0.86772424, -0.396556  , -0.3774799 , ..., -0.11730427,
        -0.6479708 ,  0.89818156],
       [-0.73642206, -0.19927329,  0.39546844, ...,  0.17739205,
        -0.57021993,  0.7637285 ]], dtype=float32)>
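
One common way to use sentence embeddings like these is to compare them with cosine similarity. The sketch below is a plain-Python illustration on made-up 3-dimensional vectors; `cosine_similarity` and the `emb_*` vectors are hypothetical and are not taken from the model output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (real pooled outputs have 768 values each)
emb_a = [1.0, 0.0, 1.0]
emb_b = [2.0, 0.0, 2.0]   # same direction as emb_a
emb_c = [0.0, 1.0, 0.0]   # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(round(sim_ab, 6))  # 1.0
print(sim_ac)            # 0.0
```

With the real embeddings, the rows of `bert_results['pooled_output']` could be passed to the same function after converting them with `.numpy()`.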

### Sequence output

- This is the representation/embedding of each individual token of a sentence
- After preprocessing, every sentence has a length of 128
- The hidden size is 768
- The first dimension is the number of sentences

So the shape of the sequence output tensor is `(number of sentences, sequence length (128), number of hidden units (768))`
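
Because every sequence is padded to the same length, the sequence output also contains vectors for padding positions. A minimal sketch of how the input mask could be used to keep only the vectors of real tokens is shown here on toy shapes (1 sentence, length 4, hidden size 3). The helper `real_token_vectors` and all values are hypothetical.

```python
# Toy data: 1 sentence, sequence length 4, hidden size 3
# (the real tensors in this notebook use 128 and 768).
toy_sequence_output = [[[0.1, 0.2, 0.3],   # [CLS]
                        [0.4, 0.5, 0.6],   # word token
                        [0.7, 0.8, 0.9],   # [SEP]
                        [0.0, 0.1, 0.0]]]  # padding position
toy_input_mask = [[1, 1, 1, 0]]

def real_token_vectors(seq_out, mask):
    """For each sentence, keep only the vectors whose mask entry is 1."""
    return [
        [vec for vec, keep in zip(sentence, sentence_mask) if keep == 1]
        for sentence, sentence_mask in zip(seq_out, mask)
    ]

kept = real_token_vectors(toy_sequence_output, toy_input_mask)
print(len(kept[0]))  # 3
print(kept[0][0])    # [0.1, 0.2, 0.3]
```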


In [18]:
# sequence_output
bert_results['sequence_output']

<tf.Tensor: shape=(3, 128, 768), dtype=float32, numpy=
array([[[ 0.13989685,  0.30410853,  0.06352295, ..., -0.10268749,
          0.22148427,  0.14977898],
        [ 0.32137358,  0.29354602, -0.1408123 , ..., -0.20841433,
          0.83567744,  0.20442402],
        [ 1.1886755 ,  0.64992356,  0.6742487 , ...,  0.10617819,
          0.41522282, -0.23921698],
        ...,
        [ 0.2002529 ,  0.27129635,  0.45000356, ...,  0.22973916,
          0.02177444,  0.11564779],
        [ 0.1203473 ,  0.17267032,  0.36960474, ...,  0.27761802,
          0.02981692,  0.06090191],
        [-0.23403423, -0.10138148,  0.3471217 , ...,  0.44735333,
          0.04013262, -0.04988689]],

       [[-0.02354308,  0.37282205, -0.09210153, ..., -0.44056374,
          0.01131827,  0.3328247 ],
        [-0.6194298 , -0.28675115, -0.72419006, ..., -0.38319874,
          0.31782746, -0.2673164 ],
        [ 0.22914922, -0.26078117, -0.25905174, ..., -0.38670355,
          0.53242093,  0.7462712 ],
        ...,

### Encoder Output

- A list of the intermediate activations of the transformer blocks, one tensor per block; the last entry is the same as the sequence output.

- Each entry has the same shape as the sequence output: `(number of sentences, sequence length (128), number of hidden units (768))`

In [19]:
# encoder_output
bert_results['encoder_outputs']

[<tf.Tensor: shape=(3, 128, 768), dtype=float32, numpy=
 array([[[ 0.17126326,  0.05714692, -0.0674366 , ...,  0.01732911,
           0.14355004,  0.04887236],
         [ 0.63244945,  1.2133186 , -0.05207317, ...,  0.5324665 ,
           0.8438143 ,  0.26414844],
         [ 1.3925557 ,  0.6356847 ,  0.7422873 , ...,  0.4379002 ,
           0.93434745,  0.0773314 ],
         ...,
         [-0.05122608, -0.17709947,  0.735608  , ...,  0.41847372,
          -0.19071554,  0.02412996],
         [-0.15421638, -0.21907634,  0.5903027 , ...,  0.48437566,
          -0.10654722, -0.06792136],
         [-0.02584023, -0.14862329,  0.59410834, ...,  0.7565186 ,
          -0.3872101 , -0.13480763]],
 
        [[ 0.15777123,  0.04426716, -0.13644354, ..., -0.00874431,
           0.12490405,  0.09001191],
         [-0.18299678,  0.27142665, -0.63025737, ..., -0.63325226,
           0.11569571,  0.18398017],
         [-0.7387831 , -0.22359407, -0.29233336, ...,  0.04634879,
           0.6207118 ,  0.38

In [20]:
# check if encoder_outputs[-1] is the same as sequence_output

bert_results['encoder_outputs'][-1] == bert_results['sequence_output']

<tf.Tensor: shape=(3, 128, 768), dtype=bool, numpy=
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  