# Text Classification Overview
---

## Contents

1. [Overview](#1.-Overview)
2. [How to represent a sentence or tokens?](#2.-How-to-represent-a-sentence-or-tokens?)

In [1]:
import numpy as np
np.random.seed(777)

---

## 1. Overview

The examples below are all **"Text Classification"** problems.

* Sentiment analysis: is this review positive or negative?
* Text categorization: which category does this blog post belong to?
* Intent classification: is this a question about a Chinese restaurant?

We can define "Text Classification" as:
* Input: A natural language sentence / paragraph
* Output: a category to which the input text belongs
    - There is a fixed number **C** of categories
    
---

## 2. How to represent a sentence or tokens?

What is a sentence? A sentence can be viewed as a variable-length sequence of tokens, where each token is drawn from a vocabulary.

This is similar to speaking. For example, "the quick brown fox jumps over the lazy dog" can be viewed as the list "\[the, quick, brown, fox, jumps, over, the, lazy, dog\]", spoken one token at a time.

$$X=(x_1, x_2, \cdots, x_t, \cdots, x_T) \quad \text{where } x_t \in V$$

So a sentence $X$ becomes a sequence of $T$ tokens, and $V$ is the vocabulary: the set of all unique tokens in the training data.

The choice of token unit is up to you: you can try words, strings, or even bits. Here we take the word as our token unit.

Since a computer can't understand "words", we map each word to an integer index.

Once encoded, the sentence can be represented as **a sequence of integer indices**.

In [2]:
sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = {}
for token in sentence:
    if vocab.get(token) is None:
        vocab[token] = len(vocab)
sentence = list(map(vocab.get, sentence))
sentence

[0, 1, 2, 3, 4, 5, 0, 6, 7]
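
The vocabulary above is built from a single sentence. As a minimal sketch of the same idea for a small training set, the snippet below builds the vocabulary from several sentences and then encodes a new one. The `<unk>` entry for words never seen during training, the helper names `build_vocab` and `encode`, and the toy corpus are assumptions made for illustration; they are not part of the original notebook.

```python
def build_vocab(corpus, unk_token="<unk>"):
    """Map every unique token in the corpus to an integer index."""
    vocab = {unk_token: 0}  # assumption: index 0 is reserved for unknown tokens
    for sentence in corpus:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(sentence, vocab, unk_token="<unk>"):
    """Turn a sentence into a list of integer indices."""
    return [vocab.get(token, vocab[unk_token]) for token in sentence.split()]


corpus = ["the quick brown fox", "the lazy dog"]  # toy training data
toy_vocab = build_vocab(corpus)
encoded = encode("the quick cat", toy_vocab)
print(encoded)  # "cat" is not in the vocabulary, so it maps to index 0
```

Reserving an index for unknown words is one common way to keep the encoder from failing on text outside the training data; other schemes are possible.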

Another way to represent a token is the **"one-hot representation"**: each token becomes a vector whose length is the vocabulary size, with a 1 at the token's vocabulary index and 0 everywhere else.

For example, the token "the" has index 0 in our vocabulary, so

$$\text{the} = [1, 0, 0, 0, 0, 0, 0, 0]$$

In [3]:
one_hot = np.eye(len(vocab), dtype=int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
print('index of "the" in the vocabulary:', vocab['the'])
print('one-hot vector:', one_hot[vocab['the']])

index of "the" in the vocabulary: 0
one-hot vector: [1 0 0 0 0 0 0 0]
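
A whole sentence can be one-hot encoded at once by indexing the identity matrix with the list of token indices, which gives a matrix of shape $(T, \vert V \vert)$. The sketch below rebuilds the vocabulary so that it runs on its own; the names `tokens`, `indices`, and `sentence_one_hot` are only illustrative.

```python
import numpy as np

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)
indices = [vocab[token] for token in tokens]

one_hot = np.eye(len(vocab), dtype=int)
sentence_one_hot = one_hot[indices]  # one row per token, one column per vocabulary entry

print(sentence_one_hot.shape)        # (T, |V|)
print(sentence_one_hot.sum(axis=1))  # every row contains exactly one 1
```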


However, these integer indices are arbitrary, so they can't capture the "meaning" of words.

**What is the "meaning" (semantics) of a word?**

Natural Language Processing offers several hypotheses. The following are quoted from the paper ["From Frequency to Meaning: Vector Space Models of Semantics"](https://arxiv.org/abs/1003.1141):

>**Statistical semantics hypothesis**: Statistical patterns of human word usage can be
used to figure out what people mean (Weaver, 1955; Furnas et al., 1983). – If units of text
have similar vectors in a text frequency matrix, then they tend to have similar meanings.
(We take this to be a general hypothesis that subsumes the four more specific hypotheses
that follow.)
>
>**Bag of words hypothesis**: The frequencies of words in a document tend to indicate
the relevance of the document to a query (Salton et al., 1975). – If documents and pseudodocuments
(queries) have similar column vectors in a term–document matrix, then they tend to have similar meanings.
>
>**Distributional hypothesis**: Words that occur in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957; Deerwester et al., 1990). – If words have similar row vectors in a word–context matrix, then they tend to have similar meanings.
>
>**Extended distributional hypothesis**: Patterns that co-occur with similar pairs tend
to have similar meanings (Lin & Pantel, 2001). – If patterns have similar column vectors
in a pair–pattern matrix, then they tend to express similar semantic relations.
>
>**Latent relation hypothesis**: Pairs of words that co-occur in similar patterns tend
to have similar semantic relations (Turney et al., 2003). – If word pairs have similar row
vectors in a pair–pattern matrix, then they tend to have similar semantic relations.

These hypotheses suggest that the meaning of a word can be represented as a vector, and that words with similar meanings have similar vectors, i.e. vectors that lie close to each other. **Cosine similarity** measures the similarity between two non-zero vectors of an inner product space as the cosine of the angle between them, so we use it to measure how similar the meanings of two words are.
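
Written out for two vectors $x$ and $y$ with $d$ components (the symbols $x$, $y$, and $\theta$ are introduced here only for illustration), the definition is:

$$\cos(\theta) = \frac{x \cdot y}{\Vert x \Vert \, \Vert y \Vert} = \frac{\sum_{k=1}^{d} x_k y_k}{\sqrt{\sum_{k=1}^{d} x_k^2}\ \sqrt{\sum_{k=1}^{d} y_k^2}}$$

The value ranges from $-1$ (opposite directions) through $0$ (orthogonal) to $1$ (same direction).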

However, any two distinct one-hot vectors are orthogonal, so their cosine similarity is always 0, no matter which two words we compare.

In [4]:
def cos_similiarity(x, y):
    return (np.dot(x, y) / (np.linalg.norm(x)*np.linalg.norm(y))).round(5)

In [5]:
word1 = one_hot[vocab['fox']]  # array([0, 0, 0, 1, 0, 0, 0, 0])
word2 = one_hot[vocab['dog']]  # array([0, 0, 0, 0, 0, 0, 0, 1])
print('cosine similarity between word "fox" and word "dog" is :', cos_similiarity(word1, word2))

cosine similarity between word "fox" and word "dog" is : 0.0
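
The same holds for every pair of distinct words, not just "fox" and "dog". As a quick check (a sketch that assumes a vocabulary of 8 tokens, as in this notebook), the matrix of all pairwise cosine similarities between one-hot vectors is the identity matrix: each word is similar only to itself.

```python
import numpy as np

vocab_size = 8  # assumption: same vocabulary size as the example sentence
one_hot = np.eye(vocab_size)

# Divide each dot product by the product of the two row norms (all equal to 1 here).
norms = np.linalg.norm(one_hot, axis=1, keepdims=True)
pairwise = (one_hot @ one_hot.T) / (norms * norms.T)

print(np.array_equal(pairwise, np.eye(vocab_size)))
```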


Then how should we represent a token so that it reflects its "meaning"? To get a useful similarity, we need to represent words with **dense vectors**. One-hot vectors are **sparse vectors**: they contain mostly zeros.

In [6]:
vec1 = np.array([1, 2, 3, 4])
vec2 = np.array([1, 2, 3, 5])
print('cosine similarity between "vec1" and "vec2" is :', cos_similiarity(vec1, vec2))

cosine similarity between "vec1" and "vec2" is : 0.994


So, how do we solve this? Assume there is a matrix $W \in \Bbb{R}^{\vert V \vert \times d}$ whose rows are $d$-dimensional vectors, one per token. If we train this matrix inside a neural network that solves a classification problem, its rows can learn to capture the tokens' meanings. Looking up a token's vector is a simple matrix multiplication: since each token is a one-hot vector $t_i$, multiplying it by $W$ selects a single row $w_i$ of $W$, and this row vector serves as the representation of the word's meaning.

$$w_i = t_i \cdot W, \quad \text{where } t_i = [0, \cdots, \underset{i\text{-th index}}{1}, \cdots, 0],\ \text{len}(t_i)=\vert V \vert$$

In [7]:
d = 5
W = np.random.rand(len(vocab), d).round(3)
print('vector space for all tokens, size of (len(vocab), d)')
print(W)
print()
print('a dense vector representation for word "fox", at row index 3')
print(np.dot(word1, W))  # one-hot token of word "fox" = array([0, 0, 0, 1, 0, 0, 0, 0])

vector space for all tokens, size of (len(vocab), d)
[[0.153 0.302 0.062 0.46  0.835]
 [0.927 0.727 0.768 0.269 0.644]
 [0.093 0.08  0.59  0.343 0.989]
 [0.626 0.682 0.552 0.269 0.373]
 [0.223 0.186 0.391 0.193 0.611]
 [0.883 0.622 0.253 0.18  0.816]
 [0.225 0.517 0.518 0.6   0.533]
 [0.013 0.524 0.896 0.77  0.123]]

a dense vector representation for word "fox", at row index 3
[0.626 0.682 0.552 0.269 0.373]


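In practice, we skip the multiplication and simply look up the row of $W$ at the token's index. During the backward pass, the incoming gradient is written into the same rows of the gradient of $W$; all other rows receive zero gradient.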

In [8]:
idxes = [np.argmax(v) for v in [word1, word2]]
print('Index for word "fox" and "dog" is {}'.format(idxes))
print('Row vector for word "fox" and "dog": ')
print(W[idxes, :])

Index for word "fox" and "dog" is [3, 7]
Row vector for word "fox" and "dog": 
[[0.626 0.682 0.552 0.269 0.373]
 [0.013 0.524 0.896 0.77  0.123]]
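
To see that the lookup and the matrix multiplication really give the same result, the sketch below compares the two for a whole sequence of indices. The random matrix and the seed are stand-ins chosen for illustration, not the values printed above.

```python
import numpy as np

rng = np.random.default_rng(0)          # illustrative seed
vocab_size, d = 8, 5
W = rng.random((vocab_size, d))         # stand-in embedding matrix
indices = [0, 1, 2, 3, 4, 5, 0, 6, 7]   # the encoded example sentence

one_hot = np.eye(vocab_size)
by_matmul = one_hot[indices] @ W        # (T, |V|) x (|V|, d) -> (T, d)
by_lookup = W[indices]                  # plain row indexing, no multiplication

print(by_matmul.shape)
print(np.allclose(by_matmul, by_lookup))
```

The lookup avoids building the $(T, \vert V \vert)$ one-hot matrix at all, which matters once the vocabulary is large.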


In [9]:
dout = np.random.rand(2, d).round(3)
dW = np.zeros_like(W)
dW[idxes] = dout
print('Update word "fox" & "dog": ')
print(dW)

Update word "fox" & "dog": 
[[0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.296 0.612 0.726 0.463 0.769]
 [0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.192 0.558 0.551 0.472 0.792]]


When the same index appears more than once, the gradients of all vectors that refer to that index are summed together.

In [25]:
idxes = [np.argmax(v) for v in [word1, word2, word1]]
dout = dout[np.array([0, 1, 0])]
dW = np.zeros_like(W)
# np.add.at sums the rows of dout that share an index, even when their values differ
np.add.at(dW, idxes, dout)

In [28]:
print('Update word "fox" & "dog": ')
print(dW)

Update word "fox" & "dog": 
[[0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.592 1.224 1.452 0.926 1.538]
 [0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.   ]
 [0.192 0.558 0.551 0.472 0.792]]
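
One pitfall when accumulating these gradients in NumPy: with repeated indices, `dW[idxes] += dout` keeps only one contribution per row, while `np.add.at` sums all of them. The sketch below uses made-up gradients of ones to show the difference.

```python
import numpy as np

vocab_size, d = 8, 5
idxes = [3, 7, 3]            # "fox", "dog", "fox"
dout = np.ones((3, d))       # made-up upstream gradients, one row per index

dW_buffered = np.zeros((vocab_size, d))
dW_buffered[idxes] += dout   # repeated index 3 is written once, not accumulated

dW_summed = np.zeros((vocab_size, d))
np.add.at(dW_summed, idxes, dout)  # unbuffered: repeated index 3 is added twice

print(dW_buffered[3])  # [1. 1. 1. 1. 1.]
print(dW_summed[3])    # [2. 2. 2. 2. 2.]
```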


This process is called **Embedding**. In PyTorch, the `nn.Embedding` layer embeds words into a dense vector space.

In [43]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [48]:
# load pretrained vectors (a torch.Tensor) with from_pretrained; freeze=False keeps them trainable
embed_layer = nn.Embedding.from_pretrained(torch.FloatTensor(W), freeze=False)
idxes_tensor = torch.LongTensor(idxes)
embeded = embed_layer(idxes_tensor)
print('Embedding Layer parameters: ')
print(embed_layer.weight)
print()
print('Embedded vector for "fox" and "dog": ')
print(embeded)

Embedding Layer parameters: 
Parameter containing:
tensor([[0.1530, 0.3020, 0.0620, 0.4600, 0.8350],
        [0.9270, 0.7270, 0.7680, 0.2690, 0.6440],
        [0.0930, 0.0800, 0.5900, 0.3430, 0.9890],
        [0.6260, 0.6820, 0.5520, 0.2690, 0.3730],
        [0.2230, 0.1860, 0.3910, 0.1930, 0.6110],
        [0.8830, 0.6220, 0.2530, 0.1800, 0.8160],
        [0.2250, 0.5170, 0.5180, 0.6000, 0.5330],
        [0.0130, 0.5240, 0.8960, 0.7700, 0.1230]], requires_grad=True)

Embedded vector for "fox" and "dog": 
tensor([[0.6260, 0.6820, 0.5520, 0.2690, 0.3730],
        [0.0130, 0.5240, 0.8960, 0.7700, 0.1230],
        [0.6260, 0.6820, 0.5520, 0.2690, 0.3730]],
       grad_fn=<EmbeddingBackward>)


In [40]:
dout_tensor = torch.tensor(dout, requires_grad=True)
print('Gradient before backward into Embedding Layer: ')
print(dout_tensor)
print()
print('Update word "fox" & "dog": ')
print(embeded.grad_fn(dout_tensor))

Gradient before backward into Embedding Layer: 
tensor([[0.2960, 0.6120, 0.7260, 0.4630, 0.7690],
        [0.1920, 0.5580, 0.5510, 0.4720, 0.7920],
        [0.2960, 0.6120, 0.7260, 0.4630, 0.7690]],
       dtype=torch.float64, requires_grad=True)

Update word "fox" & "dog": 
tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5920, 1.2240, 1.4520, 0.9260, 1.5380],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1920, 0.5580, 0.5510, 0.4720, 0.7920]],
       dtype=torch.float64, grad_fn=<Error>)
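
Calling `grad_fn` directly, as above, is mainly useful for peeking inside autograd. In ordinary training code the same gradient usually reaches the layer through `backward()` and is then stored in `weight.grad`. A minimal sketch, assuming a freshly initialized layer and an upstream gradient of ones (both chosen for illustration):

```python
import torch
import torch.nn as nn

embed_layer = nn.Embedding(8, 5)        # 8 tokens, 5 dimensions, random weights
idxes_tensor = torch.tensor([3, 7, 3])  # "fox", "dog", "fox"

embedded = embed_layer(idxes_tensor)    # shape (3, 5)
upstream = torch.ones_like(embedded)    # made-up gradient flowing back into the layer
embedded.backward(upstream)

grad = embed_layer.weight.grad          # shape (8, 5), same as the weight matrix
print(grad[3])  # row used twice: its two gradients are summed
print(grad[7])  # row used once
print(grad[0])  # unused row: zero gradient
```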


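The layer is easy to use, and after training we can take word vectors from the Embedding layer and compute the similarity between them. The weights used below are still the random values of $W$, so the resulting number only demonstrates the call.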

In [63]:
wv1 = embed_layer.weight[vocab.get('fox')]
wv2 = embed_layer.weight[vocab.get('dog')]
print('Calculate similarity for "fox" & "dog" {:.4f}'.format(F.cosine_similarity(wv1, wv2, dim=0)))

Calculate similarity for "fox" & "dog" 0.7316
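
With a trained embedding matrix, the same measure can rank the whole vocabulary by similarity to a query word. The sketch below uses a random stand-in matrix instead of trained weights, so the ranking it prints carries no meaning; the helper name `most_similar` is hypothetical.

```python
import numpy as np

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
vocab = {token: i for i, token in enumerate(tokens)}

rng = np.random.default_rng(0)   # illustrative seed
W = rng.random((len(vocab), 5))  # stand-in for trained embedding weights


def most_similar(word, W, vocab, top_k=3):
    """Return the top_k tokens with the highest cosine similarity to `word`."""
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)  # normalize every row
    sims = unit @ unit[vocab[word]]                       # cosine similarity to the query
    sims[vocab[word]] = -np.inf                           # exclude the query itself
    best = np.argsort(-sims)[:top_k]                      # indices sorted by descending similarity
    index_to_token = {i: token for token, i in vocab.items()}
    return [(index_to_token[int(i)], float(sims[i])) for i in best]


neighbours = most_similar("fox", W, vocab)
print(neighbours)
```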
