 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# `Word representations`

**Two types of models:**

* **co-occurrence based models** - train over the whole corpus and capture global dependencies and context

* **predictive models** - train over a smaller section (**context window**) and capture local dependencies

* we will mostly use predictive models (more on them in the following notebook)

**Two types of word representations:**

* Sparse representations 

* Distributed representations 

# `Sparse word representations`

* Bag-of-words and TF-IDF are **sparse representations**
    * to represent n different words, we require n dimensions
    * in practice: we need as many dimensions as there are words in the dictionary of our entire corpus


* this leads to the **curse of dimensionality**
    * since there are a lot of different words in corpora, representing n words with n dimensions produces very long vectors in which most of the values are zeros
    
    
* solution: **limit our vocabulary (limit the number of dimensions of our Vector Space Model)**
    * words removed are called **OOV words (Out-Of-Vocabulary words)**
    * problem: can't assign semantic meaning to OOV words
    
    
* **additional problem**: hard to find connections between words, because every word gets its own dimension that is unrelated to all the others

**We can solve this problem using Word Embeddings**
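To make the sparsity problem concrete, here is a minimal sketch in plain Python. The three-sentence corpus and the helper name `bag_of_words` are made up for illustration. It builds a vocabulary, encodes one sentence as a bag-of-words vector, and shows that most entries are zero and that an out-of-vocabulary word is silently lost.

```python
# Toy corpus, invented for illustration only
corpus = [
    "the bank closed my account",
    "my credit card was declined",
    "the mortgage payment was late",
]

# Vocabulary: one dimension per distinct word in the corpus
vocabulary = sorted({word for doc in corpus for word in doc.split()})
word_to_dim = {word: i for i, word in enumerate(vocabulary)}


def bag_of_words(text):
    """Count-based sparse vector; words outside the vocabulary are dropped."""
    vector = [0] * len(vocabulary)
    for word in text.split():
        if word in word_to_dim:
            vector[word_to_dim[word]] += 1
    return vector


vector = bag_of_words("the bank declined my loan")  # "loan" is an OOV word
zero_share = vector.count(0) / len(vector)

print(len(vocabulary), "dimensions")
print(vector)
print(f"{zero_share:.0%} of the entries are zero")
```

Even with this tiny vocabulary most of the vector is empty, and the word "loan" leaves no trace in it at all.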

# `Distributed Representations`

* for Deep Learning we typically transform words into so-called **word embeddings**
    * this allows us to represent words in a neural-network-friendly way

* distributed representations are based on the following ideas:
    * words used in similar contexts have similar meanings
    * meaning can be derived by performing statistical analysis of word usage

* great because:
    * can create representations for words that are NOT sparse
    * can decrease the number of dimensions we need to represent some corpus of words 

## `Vector Space Model`

* model for representing text data as vectors in some n-dimensional space

* similar data will be represented with vectors of similar values
    * this makes calculating similarities between examples easy


<center><img src="https://edlitera-images.s3.amazonaws.com/vector_space_model.png" width="800">

inspired by:
<br>
https://www.researchgate.net/publication/298215705_A_Quantitative_Evalution_of_the_Enhanced_Topic-based_Vector_Space_Model

* vectors that are similar are closer to each other (e.g. the vector "car" is close to the vector "vehicle", but far from the vector "wave")

# `Word embeddings`

* instead of representing each word as a one-hot encoded vector we can represent words as combinations of features

* can create embeddings for entire documents!

* drastically decreases the number of values needed to represent a word 

* results in dense representations

* we can also add new words without needing to add new features

* **IMPORTANT:** because words share the same features, we can easily deduce which words are similar to each other
    * example: the vectors representing the words "boy" and "prince" would be closer to each other than the vectors representing "girl" and "prince"

### `Word embeddings example`


<center><img src="https://edlitera-images.s3.amazonaws.com/dense_representation.png" width="700">

source:
<br>
https://www.r-craft.org/r-news/get-busy-with-word-embeddings-an-introduction/

## `Concept of analogies`

* we can add and subtract embeddings

**Example:**

* this works for a lot of different relationships:

    * `embedding(smallest) - embedding(small) + embedding(big) ≈ biggest`
    * `embedding(Paris) - embedding(France) + embedding(Italy) ≈ Rome`
    

* very complex relationships are accurately represented using this method
* this scales very well
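The arithmetic above can be sketched with a handful of hand-crafted vectors. The values below are hypothetical and chosen so that the analogy works out exactly; embeddings from a trained model would only satisfy it approximately. The helper names `cosine_similarity` and `analogy` are also just illustrative.

```python
import math

# Hand-crafted toy embeddings (hypothetical values, NOT from a trained model).
# Dimensions, loosely: [French-ness, Italian-ness, capital-ness]
embedding = {
    "France": [1.0, 0.0, 0.0],
    "Paris": [1.0, 0.0, 1.0],
    "Italy": [0.0, 1.0, 0.0],
    "Rome": [0.0, 1.0, 1.0],
    "pizza": [0.0, 0.9, 0.1],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to embedding(a) - embedding(b) + embedding(c)."""
    query = [x - y + z for x, y, z in zip(embedding[a], embedding[b], embedding[c])]
    # The three input words are excluded from the search
    candidates = [word for word in embedding if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine_similarity(query, embedding[word]))


result = analogy("Paris", "France", "Italy")
print(result)
```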

## `Defining similarity`

* measuring how similar two vectors are (e.g. checking that Paris - France + Italy ≈ Rome) can be done using various methods:
<br>

    * using the Euclidean distance between two points
        * based on the Pythagorean theorem
    * using the cosine similarity 
        * checks whether the two vectors are pointing in the same direction; the magnitude of the vectors is not important


<center><img src="https://edlitera-images.s3.amazonaws.com/euclidean_distance_and_cosine_similarity.png" width="700">

source:
<br>
https://towardsdatascience.com/building-a-backend-system-for-artificial-intelligence-c404efade360
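Both measures are short enough to write out by hand. The sketch below uses made-up 2-D vectors to show that the two can disagree: `v2` points in the same direction as `v1` but is three times longer, while `v3` is nearby but points elsewhere.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration
v1 = [1.0, 2.0]
v2 = [3.0, 6.0]  # same direction as v1, larger magnitude
v3 = [2.0, 1.0]  # close to v1, different direction

print("v1 vs v2:", euclidean_distance(v1, v2), cosine_similarity(v1, v2))
print("v1 vs v3:", euclidean_distance(v1, v3), cosine_similarity(v1, v3))
```

By Euclidean distance `v3` is the closer neighbor of `v1`, but by cosine similarity `v2` is the better match, because only the direction counts.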

# `Keras Embedding Layer`

* Keras layer designed for embedding text data 

   

**Can be used in multiple ways:**

* Alone - to learn word embeddings which we can save for later and use in other models


* Part of a Deep Learning model - embeddings are learned together with everything else


* Transfer learning - loading pretrained embeddings

## `Layer arguments`

**Three important arguments we need to specify:**

* input_dim
* output_dim
* input_length

### `input_dim`


* vocabulary size

* directly connected with our integer encoded values
    * if we have integer encoded values from 0-20, our vocabulary size would be 21
    * this will also be the value we need to use for the **input_dim** argument

### `output_dim`


* vector space size
    * a vector space size of 100 will mean that each word gets encoded as a vector with 100 dimensions
 

* there is no "best value"
    * usually we test out multiple values until we find the best one
   

### `input_length`


* length of the input sequence

* works the same as for any other layer
    * the total number of words in some document that gets processed in our embedding layer is our **input_length**
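Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The sketch below imitates that lookup in plain Python to show how the three arguments relate; it is a simplified stand-in, not the Keras implementation, and the random values take the place of weights that training would adjust.

```python
import random

input_dim = 21      # vocabulary size: integer codes 0..20
output_dim = 4      # size of each word vector (kept tiny for readability)
input_length = 5    # number of tokens in each input sequence

random.seed(0)
# One row of `output_dim` numbers per vocabulary entry
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(output_dim)]
    for _ in range(input_dim)
]


def embed(sequence):
    """Replace every integer code with its row of the embedding matrix."""
    return [embedding_matrix[token_id] for token_id in sequence]


sequence = [3, 17, 0, 20, 3]  # one integer-encoded document of length input_length
embedded = embed(sequence)

print(len(embedded), "x", len(embedded[0]))        # input_length x output_dim
print("trainable weights:", input_dim * output_dim)
```

The same word (code 3) always maps to the same vector, and the number of trainable weights is simply `input_dim * output_dim`.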

## `Example`


In [1]:
# Import needed libraries

import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
# Load in data and create DataFrame

df = pd.read_csv("https://edlitera-datasets.s3.amazonaws.com/customer_complaints_dataset.csv")


In [3]:
# Take a look at our data

df

Unnamed: 0,Product,Consumer_complaint_narrative,category_id
0,"Credit reporting, credit repair services, or o...",I've reached out to Discover in the past. They...,0
1,"Credit reporting, credit repair services, or o...",Two accounts are still on my credit history af...,0
2,Checking or savings account,I have asked to close my Key Bank Checking acc...,1
3,Credit card or prepaid card,I opened a citi double cash card the beginning...,2
4,"Credit reporting, credit repair services, or o...",There are several inaccuracies on my report th...,0
...,...,...,...
752134,Credit card,"Automated calls from "" XXXX with Capital One '...",10
752135,Debt collection,I have disputed my debts several times with no...,6
752136,Mortgage,My father died in XX/XX/XXXX. Left me his only...,4
752137,Credit reporting,cfbp i would Like to file a complaint on Exper...,9


In [4]:
# Prepare dataframe sample

df = df.iloc[:125000, :]
df = df.sample(frac=1).reset_index(drop=True)

In [5]:
# Separate dependent feature from the independent feature

X = df["Consumer_complaint_narrative"]

y = df["category_id"]

In [6]:
# Take a look at our independent feature

X

0         I am in desperate need of assistance postponin...
1         According to my recent credit report request f...
2         I was in the XXXX for a week my second year of...
3         This account is not mine, nor do I know whom i...
4         I have reachead out to Equifax about inaccurat...
                                ...                        
124995    After I have a total loss of my car the past X...
124996    ATTENTION DISPUTE DEPARTMENT XXXX, XXXX SOC SE...
124997    i sent in my normal payment on time and with e...
124998    XXXX loan through US Bank was not being report...
124999    On XX/XX/2020, my Bank of America debit card w...
Name: Consumer_complaint_narrative, Length: 125000, dtype: object

In [25]:
# Take a look at our dependent feature

y

0         4
1         0
2         6
3         6
4         0
         ..
124995    5
124996    6
124997    4
124998    5
124999    2
Name: category_id, Length: 125000, dtype: int64

In [26]:
# Separate data into training data and testing data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

In [27]:
# Separate data into training data and validation data

X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)
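Applying a 20% split twice does not give a 60/20/20 division. A quick check of the arithmetic, assuming the 125,000 rows selected above:

```python
n_samples = 125_000   # rows kept by df.iloc[:125000, :]
test_size = 0.20      # fraction used in both train_test_split calls

n_test = round(n_samples * test_size)
n_train_full = n_samples - n_test
n_valid = round(n_train_full * test_size)   # 20% of the remaining 80%
n_train = n_train_full - n_valid

print(n_train, n_valid, n_test)
print(n_train / n_samples, n_valid / n_samples, n_test / n_samples)
```

The result is a 64% / 16% / 20% division, which agrees with the `len(X_train)` value of 80000 printed further down.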

### `Keras tokenizer`

* API for preparing text 

In [23]:
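from keras.preprocessing.text import Tokenizer

# Small demo of how the tokenizer behaves when it is given a single string
t = Tokenizer(
    num_words=33, 
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\nx',
    oov_token="<OOV>")
fit_text = "The earth is an awesome place live"

# NOTE: fit_text is a plain string, not a list of strings, so Keras iterates
# over it character by character and treats every character as its own text.
# To tokenize by word, pass a list instead: t.fit_on_texts([fit_text])
t.fit_on_texts(fit_text)
sequences = t.texts_to_sequences(fit_text)

print("sequences : ",sequences,'\n')

print("word_index : ",t.word_index)
# The empty lists [] in the output are the spaces between the words:
# a text that consists only of a space contains no tokens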


sequences :  [[4], [5], [2], [], [2], [3], [9], [4], [5], [], [6], [7], [], [3], [10], [], [3], [11], [2], [7], [12], [13], [2], [], [14], [8], [3], [15], [2], [], [8], [6], [16], [2]] 

word_index :  {'<OOV>': 1, 'e': 2, 'a': 3, 't': 4, 'h': 5, 'i': 6, 's': 7, 'l': 8, 'r': 9, 'n': 10, 'w': 11, 'o': 12, 'm': 13, 'p': 14, 'c': 15, 'v': 16}


* a text tokenization class

```
tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', oov_token=None,
    document_count=0, **kwargs
)
```

**Important arguments:**
    
   * **num_words** - maximum number of words to keep, based on word frequency; only the most common `num_words - 1` words are kept
   * **filters** - a string; every character in that string gets removed from the texts
   * **lower** - whether to convert text to lowercase
   * **split** - the separator used to split text into words
   * **oov_token** - placeholder token that replaces all out-of-vocabulary words in `texts_to_sequences()` calls

In [28]:
from keras.preprocessing.text import Tokenizer

In [29]:
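# Define tokenizer
# Set number of words as 10 000
# Set value of oov token
# Use the default filters plus the letter "x" (presumably to strip the XXXX masks
# in the complaints; note that this also removes every "x" inside ordinary words)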

tokenizer = Tokenizer(
    num_words=10_000, 
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\nx',
    oov_token="<OOV>")


**Keras tokenizer methods:**
    
   * `fit_on_texts()`
   * `texts_to_sequences()`
   * `texts_to_matrix()`
   * `sequences_to_matrix()`

* two important methods we will use are **`fit_on_texts()`** and **`texts_to_sequences()`**

**`fit_on_texts()`**

* we use it to build (or update) the tokenizer's vocabulary from a list of texts
* the first method we call
* **only use on training data**

* attributes:
    * word_counts - gives us all of the words inside the dictionary together with their counts
    * word_docs - gives us all of the words inside the dictionary together with the number of documents each word appeared in
    * word_index - maps each word to a unique integer
    * document_count - total number of documents that we used to fit the tokenizer

In [30]:
# Fit tokenizer on train data

tokenizer.fit_on_texts(X_train)

**`texts_to_sequences()`**

* converts tokens of some corpus into sequences of integers
* used after fit_on_texts(), but before padding
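What the two methods do can be imitated in a few lines of plain Python. This is a simplified emulation, not the Keras implementation: it skips the `filters` step, and the helper names `fit_word_index` and `to_sequences` are made up for this sketch.

```python
from collections import Counter


def fit_word_index(texts, oov_token="<OOV>"):
    """Assign integers by descending frequency; index 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequences(texts, word_index, num_words=None, oov_token="<OOV>"):
    """Map words to integers; unknown or too-rare words become the OOV index."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_id)
            if num_words is not None and idx >= num_words:
                idx = oov_id  # outside the num_words most frequent entries
            ids.append(idx)
        sequences.append(ids)
    return sequences


train_texts = ["the card was declined", "the bank closed the account"]
word_index = fit_word_index(train_texts)
sequences = to_sequences(["the bank declined the loan"], word_index)

print(word_index)
print(sequences)
```

The word "loan" never appeared in the training texts, so it is encoded with the OOV index 1. This is also why the vocabulary is built from the training data only: words that appear solely in the validation or test data are then handled the same way as any unseen word would be later on.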


In [31]:
# Convert into sequences of integers

X_train = tokenizer.texts_to_sequences(X_train)
X_valid = tokenizer.texts_to_sequences(X_valid)
X_test = tokenizer.texts_to_sequences(X_test)

In [32]:
len(X_train)

80000

In [33]:
X_train

[[75, 3667, 44, 2, 218, 365, 19, 6, 95, 653, ...],
 [3, 48, 7, 91, 1327, 8, 5, 5, 11, 222, ...],
 ...]


### Padding

* when using the Keras embedding layer (or any other neural network layer), all inputs in a batch must have the same dimensions

* **problem:** sentences often differ in length

* **solution:** select a max length for the sentences and make sure that all sentences are of that length
    * if a sentence is longer, truncate it 
    * if a sentence is shorter, pad it with zeros
    
    
 

* always pad all data (training, validation and testing data)
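The truncate-or-pad rule can be written out in plain Python. The helper `pad_post` below is a hypothetical stand-in that imitates "post" padding and "post" truncating on short made-up sequences; it is meant to show the idea, not to replace the Keras utility.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut sequences off at the end and fill short ones up at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq[:maxlen])                    # truncate: keep the first maxlen items
        seq = seq + [value] * (maxlen - len(seq))   # pad: append zeros up to maxlen
        padded.append(seq)
    return padded


sequences = [[75, 3667, 44], [3, 48, 7, 91, 1327, 8, 5]]
padded = pad_post(sequences, maxlen=5)

print(padded)
```

After this step every sequence has exactly `maxlen` entries, so the whole dataset can be stored as one rectangular array.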

In [51]:
from keras.preprocessing.sequence import pad_sequences

In [35]:
max_len_of_sequence = max(len(x) for x in X_train)

print(max_len_of_sequence)

5277


In [52]:
# Define values important for padding

max_length = 100
trunc_type = "post"
padding_type = "post"

In [49]:
all_len = [len(x) for x in X_train]
len(X_train[np.argmax(all_len)])

5277

In [53]:
# Pad train, validation and test data

X_train = pad_sequences(X_train, padding=padding_type, maxlen=max_length, truncating=trunc_type)
X_valid = pad_sequences(X_valid, padding=padding_type, maxlen=max_length, truncating=trunc_type)
X_test = pad_sequences(X_test, padding=padding_type, maxlen=max_length, truncating=trunc_type)

### Model creation

* don't worry about this part: we'll cover it in detail in just a bit

* for now, you only need to understand the embedding layer:
    
    * as the **input_dim** argument use the size of your vocabulary + 1 (index 0 is reserved for padding)
    * as the **output_dim** argument use the number of dimensions you want each word vector to have
    * as the **input_length** argument use the number of words you limited your sentences to (here: `max_length`)

In [54]:
len(tokenizer.word_index)

45124

In [55]:
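from keras.layers import Embedding, Dense, LSTM
from keras.models import Sequential

embedding_dim = 100  # size of each word vector (output_dim)
input_dim = len(tokenizer.word_index) + 1  # vocabulary size + 1 (index 0 is the padding value)

model = Sequential()
# Turn each of the 100 integer codes in a sequence into a 100-dimensional vector
model.add(Embedding(input_dim=input_dim, output_dim=embedding_dim, input_length=max_length))
# Summarize the whole sequence as a single 32-dimensional vector
model.add(LSTM(32, recurrent_dropout=0.2)) 
model.add(Dense(16, activation="relu"))
# One output probability per complaint category
model.add(Dense(y.nunique(), activation="softmax"))

model.summary()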

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 100)          4512500   
                                                                 
 lstm (LSTM)                 (None, 32)                17024     
                                                                 
 dense (Dense)               (None, 16)                528       
                                                                 
 dense_1 (Dense)             (None, 17)                289       
                                                                 
Total params: 4,530,341
Trainable params: 4,530,341
Non-trainable params: 0
_________________________________________________________________
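The parameter counts in the summary can be reproduced by hand. The sketch below uses the sizes from this notebook and the standard formulas for embedding, LSTM and dense layers; the variable names are just for this check.

```python
vocab_rows = 45_124 + 1   # len(tokenizer.word_index) + 1
embedding_dim = 100
lstm_units = 32
dense_units = 16
n_classes = 17

# Embedding: one vector of embedding_dim weights per vocabulary row
embedding_params = vocab_rows * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: weights plus one bias per unit
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params + output_params

print(embedding_params, lstm_params, dense_params, output_params)
print(total_params)
```

Almost all of the weights sit in the embedding layer, which is why the vocabulary size has such a large effect on model size.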


### Model compiling and training

In [56]:
from keras.optimizers import Adam
from keras.losses import SparseCategoricalCrossentropy
from keras.metrics import SparseCategoricalAccuracy

# Labels are integer class ids (not one-hot vectors), hence the "sparse" variants
loss_function = SparseCategoricalCrossentropy()

metric = SparseCategoricalAccuracy()

optim = Adam()

model.compile(loss=loss_function, optimizer=optim, metrics=[metric])
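The "sparse" loss and metric work directly on integer labels. A minimal plain-Python sketch of what they compute, using made-up predictions for two examples and three classes (the functions are simplified illustrations and leave out details such as the numerical clipping a real implementation applies):

```python
import math


def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean negative log-probability assigned to the correct class."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


def sparse_categorical_accuracy(y_true, y_pred):
    """Share of examples whose most probable class equals the label."""
    hits = [probs.index(max(probs)) == label for label, probs in zip(y_true, y_pred)]
    return sum(hits) / len(hits)


y_true = [0, 2]                      # integer class ids
y_pred = [[0.7, 0.2, 0.1],           # softmax outputs, one row per example
          [0.1, 0.3, 0.6]]

loss = sparse_categorical_crossentropy(y_true, y_pred)
accuracy = sparse_categorical_accuracy(y_true, y_pred)

print(loss, accuracy)
```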

In [57]:
num_epochs = 10

batch_size = 128

history = model.fit(X_train, 
                    y_train, 
                    batch_size=batch_size, 
                    epochs=num_epochs, 
                    verbose=1, 
                    validation_data=(X_valid, y_valid))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [58]:
score = model.evaluate(X_test, y_test, verbose=1)



In [59]:
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

Test loss: 0.6687042713165283 / Test accuracy: 0.8143600225448608


**Problem we typically run into: not enough data!**

**Solution: use pretrained word vectors**

* we use pretrained vectors from:

    * `spaCy`
    * `Word2Vec`
    * `GloVe`

* using pretrained components in our model is the basis of **transfer learning**
    * more on transfer learning later
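Using pretrained vectors usually comes down to building a matrix whose row `i` holds the pretrained vector of the word with index `i`. The sketch below shows that alignment step with a tiny hypothetical `pretrained` dictionary and `word_index`; real vectors would be loaded from a file distributed with GloVe or Word2Vec.

```python
# Hypothetical pretrained vectors (real ones have 50-300 dimensions)
pretrained = {
    "bank": [0.1, 0.3, -0.2],
    "card": [0.4, -0.1, 0.0],
}

# Hypothetical tokenizer vocabulary; "xyzzy" has no pretrained vector
word_index = {"<OOV>": 1, "bank": 2, "card": 3, "xyzzy": 4}
embedding_dim = 3

# Row 0 stays all zeros because index 0 is the padding value
embedding_matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]

for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[idx] = list(vector)   # copy the pretrained vector into row idx

for row in embedding_matrix:
    print(row)
```

Words without a pretrained vector keep an all-zero row here. A matrix like this is typically used as the initial weights of the embedding layer, often with the layer frozen so that the pretrained values are not overwritten during training.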
