# Primitive Embeddings (Sparse Vector)

In this first tutorial, I show you primitive embeddings (also called preprocessing, featurizing, or vectorizing) for natural language.

As you will see in the later tutorials, the embeddings in this example are very basic and are rarely used in practice. But they make a good first example for understanding NLP.

There are many types of embeddings, such as character, word, sentence, or document embeddings. In this notebook, I'll show you sentence vectorization.

*back to [index](https://github.com/tsmatz/nlp-tutorials/)*

## Install required packages

In [None]:
!pip install scikit-learn nltk pandas

In [None]:
import nltk
nltk.download("popular")

## Count Vectorize

One of the most primitive methods for vectorizing text is count vectorization.<br>
This method is based on one-hot vectorization: each element represents the count of the corresponding word in a document, as follows.

![Count vectorize](images/count_vectorize.png)

Count vectorization is very straightforward and easy for humans to understand, but it builds sparse vectors (in which almost all elements are zero) whose length equals the vocabulary size, so it is also resource-intensive. Note that it can waste a lot of time and resources on large data.
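
To make the mechanics concrete, here is a minimal sketch of count vectorization in plain Python, without scikit-learn. It assumes whitespace tokenization and input that has already been lowercased and lemmatized by hand; the names `toy_texts`, `toy_vocabulary`, and `count_vector` are only illustrative.

```python
from collections import Counter

# Toy corpus, already lowercased and lemmatized by hand ("pens" -> "pen").
toy_texts = [
    "this is a book",
    "these are pen and my pen is here",
]

# Tokenize on whitespace and build a sorted vocabulary over all documents.
toy_tokens = [text.split() for text in toy_texts]
toy_vocabulary = sorted({tok for doc in toy_tokens for tok in doc})

def count_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

toy_vectors = [count_vector(doc, toy_vocabulary) for doc in toy_tokens]

print(toy_vocabulary)
for vec in toy_vectors:
    print(vec)
```

Each vector has one slot per vocabulary word, so with a realistic vocabulary of tens of thousands of words, almost every slot of every document is zero.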

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
import pandas as pd

lemmatizer = WordNetLemmatizer()

# Convert :
# "pens" -> "pen"
# "wolves" -> "wolf"
def my_lemmatizer(text):
    return [lemmatizer.lemmatize(t) for t in word_tokenize(text)]

vectorizer = CountVectorizer(
    tokenizer=my_lemmatizer)
texts = [
    "This is a book",
    "These are pens and my pen is here"
]
vectors = vectorizer.fit_transform(texts)

cols = [k for k, v in sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])]
df = pd.DataFrame(vectors.toarray(), columns=cols)
df

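| | a | and | are | book | here | is | my | pen | these | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 2 | 1 | 0 |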


Because the resulting vectors are very high-dimensional and sparse, this vectorization often results in low performance (low accuracy) in many ML use cases. (Neural networks in particular don't work well with very high-dimensional, sparse vectors.)<br>
The following example classifies documents into the 20 categories of the 20 Newsgroups dataset.

> Note : In real usage, please train with a special symbol, such as "[UNK]", standing in for unknown words.
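
One possible reading of the note above, as a minimal sketch: words that are too rare in the training data are replaced with "[UNK]" before training, and any word outside the vocabulary is mapped to the same symbol at prediction time. The helper names `build_vocab` and `replace_unknown`, and the `min_count` threshold, are assumptions made for illustration and are not part of this notebook.

```python
from collections import Counter

UNK = "[UNK]"

def build_vocab(tokenized_docs, min_count=2):
    """Keep tokens seen at least min_count times; always include UNK."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    vocab.add(UNK)
    return vocab

def replace_unknown(tokens, vocab):
    """Map every token outside the vocabulary to UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]

train_docs = [
    ["this", "is", "a", "book"],
    ["this", "is", "a", "pen"],
]
vocab = build_vocab(train_docs, min_count=2)

# Rare training words become UNK, so the model sees UNK during training.
train_unk = [replace_unknown(doc, vocab) for doc in train_docs]

# A word never seen in training ("wolf") is mapped to the same symbol.
test_unk = replace_unknown(["this", "is", "a", "wolf"], vocab)

print(train_unk)
print(test_unk)
```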

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Load train dataset
train = fetch_20newsgroups(
    subset="train",
    remove=("headers", "footers", "quotes"))

# Count vectorize
vectorizer.fit(train.data)
X_train = vectorizer.transform(train.data)
y_train = train.target

# Train
clf = MultinomialNB(alpha=.01)
clf.fit(X_train, y_train)

# Evaluate accuracy
test = fetch_20newsgroups(
    subset="test",
    remove=("headers", "footers", "quotes"))
X_test = vectorizer.transform(test.data)
y_test = test.target
y_pred = clf.predict(X_test)
score = metrics.accuracy_score(y_test, y_pred)
print("classification accuracy: {}".format(score))



classification accuracy: 0.6240042485395645


## TF-IDF weighting

In the above example, content words such as "book" or "pen" have the same weight as function words such as "a", "for", or "the".<br>
Using TF-IDF, you can give priority to the words that rarely appear in the given corpus.

TF (=**T**erm **F**requency) is

$$ \frac{\#d(w)}{\sum_{w^{\prime} \in d} \#d(w^{\prime})} $$

in which $ \#d(w) $ means the count of word $w$ in document $d$.<br>
In other words, TF is the count of word $w$ in document $d$, normalized by the total number of words in $d$.

TF-IDF (TF multiplied by IDF, the **I**nverse **D**ocument **F**requency) is

$$ \frac{\#d(w)}{\sum_{w^{\prime} \in d} \#d(w^{\prime})} \times \log{\frac{|D|}{|\{d \in D:w\in d\}|}}$$

where $D$ is a large corpus (a set of documents).

If a word $w$ (such as "a" or "the") is included in every document $d \in D$, the second factor (IDF) becomes $\log 1 = 0$, its smallest value. If a word is included in only a few documents $d \in D$, the second factor becomes relatively large.
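
As a small worked example, the two formulas above can be evaluated directly in plain Python. This is a sketch of the textbook definition only, using hand-lemmatized tokens. The function names `tf`, `idf`, and `tfidf` are illustrative.

```python
import math
from collections import Counter

# Hand-lemmatized tokens of the two example sentences.
docs = [
    ["this", "is", "a", "book"],
    ["these", "are", "pen", "and", "my", "pen", "is", "here"],
]

def tf(word, doc):
    """Count of word in doc, divided by the total number of words in doc."""
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    """log(|D| / number of documents containing word).

    Raises ZeroDivisionError for a word that appears in no document.
    """
    doc_freq = sum(1 for d in docs if word in d)
    return math.log(len(docs) / doc_freq)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tfidf("is", docs[0], docs))    # "is" is in every document, so IDF is log(1) = 0
print(tfidf("book", docs[0], docs))  # 1/4 * log(2)
print(tfidf("pen", docs[1], docs))   # 2/8 * log(2)
```

With this definition, a word that occurs in every document contributes nothing to the vector.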

Let's look at the following example.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Convert :
# "pens" -> "pen"
# "wolves" -> "wolf"
def my_lemmatizer(text):
    return [lemmatizer.lemmatize(t) for t in word_tokenize(text)]

# Count vectorize
count_vectorizer = CountVectorizer(tokenizer=my_lemmatizer)
texts = [
    "This is a book",
    "These are pens and my pen is here"
]
count_vectors = count_vectorizer.fit_transform(texts)

# TF-IDF weighting
tfidf_trans = TfidfTransformer(use_idf=True).fit(count_vectors)
tfidf_vectors = tfidf_trans.transform(count_vectors)

As you can see above, "is" is the only word included in both documents. The word "pen" is also used twice, but both occurrences are in the second document and it does not appear in the first.<br>
As a result, only the word "is" has a small IDF weight.

In [4]:
cols = [k for k, v in sorted(count_vectorizer.vocabulary_.items(), key=lambda item: item[1])]
df = pd.DataFrame([tfidf_trans.idf_], columns=cols)
df

| | a | and | are | book | here | is | my | pen | these | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.405465 | 1.405465 | 1.405465 | 1.405465 | 1.405465 | 1.0 | 1.405465 | 1.405465 | 1.405465 | 1.405465 |
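
These values (1.0 and 1.405465) differ from what the $\log{\frac{|D|}{|\{d \in D:w\in d\}|}}$ formula would give (0 and about 0.693). With its default settings (`smooth_idf=True`), scikit-learn's `TfidfTransformer` uses a smoothed variant, $\ln{\frac{1+n}{1+\mathrm{df}(w)}} + 1$, where $n$ is the number of documents and $\mathrm{df}(w)$ is the number of documents containing $w$. The following sketch reproduces the IDF row with NumPy. The count matrix is typed in by hand from the earlier output, and the variable names are illustrative.

```python
import numpy as np

# Count matrix from the count-vectorization example
# (columns: a, and, are, book, here, is, my, pen, these, this).
counts = np.array([
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 1, 2, 1, 0],
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed IDF, as used by TfidfTransformer with smooth_idf=True (the default).
smoothed_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.round(smoothed_idf, 6))
```

The "+1" terms keep the weight of a word that appears in every document at 1 instead of 0, so such words are down-weighted but not removed.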


The generated vectors have the following values.<br>
As you can see below, the word "is" has a relatively small value compared with the other words in the same document.<br>
The second document ("These are pens and my pen is here") has more words than the first document ("This is a book"). Because each row is normalized (to unit L2 length, by default, in `TfidfTransformer`), the values in the second document are smaller than those in the first document.<br>
The word "pen" appears twice in the second document, so its value is 2x that of the other words with the same IDF weight in this document.

In [5]:
df = pd.DataFrame(tfidf_vectors.toarray(), columns=cols)
df

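| | a | and | are | book | here | is | my | pen | these | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.534046 | 0.0 | 0.0 | 0.534046 | 0.0 | 0.379978 | 0.0 | 0.0 | 0.0 | 0.534046 |
| 1 | 0.0 | 0.324336 | 0.324336 | 0.0 | 0.324336 | 0.230768 | 0.324336 | 0.648673 | 0.324336 | 0.0 |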
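
The rows above can be reproduced by hand. This sketch assumes the `TfidfTransformer` defaults: raw counts as the TF part, a smoothed IDF of $\ln{\frac{1+n}{1+\mathrm{df}(w)}} + 1$, and L2 normalization of each row. The count matrix is copied by hand from the count-vectorization output, and the variable names are illustrative.

```python
import numpy as np

# Columns: a, and, are, book, here, is, my, pen, these, this
counts = np.array([
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 1, 2, 1, 0],
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight raw counts by IDF, then scale each row to unit Euclidean length.
weighted = counts * idf
row_norms = np.linalg.norm(weighted, axis=1, keepdims=True)
manual_tfidf = weighted / row_norms

print(np.round(manual_tfidf, 6))
```

Because of the row normalization, a longer document spreads the same unit length over more non-zero entries, which is why the second row has smaller values.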


Let's classify the text into the 20 newsgroup categories again, this time with TF-IDF weighting. (Compare the result with the previous one.)

In [6]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Load train dataset
train = fetch_20newsgroups(
    subset="train",
    remove=("headers", "footers", "quotes"))

# Count vectorize
count_vectorizer.fit(train.data)
X_train_count = count_vectorizer.transform(train.data)

# TF-IDF weighting
tfidf_trans = TfidfTransformer(use_idf=True).fit(X_train_count)
X_train_tfidf = tfidf_trans.transform(X_train_count)

# Train
y_train = train.target
clf = MultinomialNB(alpha=.01)
clf.fit(X_train_tfidf, y_train)

# Evaluate accuracy
test = fetch_20newsgroups(
    subset="test",
    remove=("headers", "footers", "quotes"))
X_test_count = count_vectorizer.transform(test.data)
X_test_tfidf = tfidf_trans.transform(X_test_count)
y_pred = clf.predict(X_test_tfidf)
y_test = test.target
score = metrics.accuracy_score(y_test, y_pred)
print("classification accuracy: {}".format(score))



classification accuracy: 0.6964949548592672


**TF-IDF can also be applied to dense vectors**, such as CBOW (continuous bag-of-words), by weighting each word's vector (so-called weighted CBOW, or WCBOW) as follows:

$$ \frac{1}{\sum_{i=1}^{k} \verb|tfidf|(w_i)} \sum_{i=1}^{k} \verb|tfidf|(w_i) v(w_i) $$

where $v(\cdot)$ is the vector representation (dense vector) of a word and $\verb|tfidf|(\cdot)$ is its TF-IDF weight.
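
The weighted average above can be sketched with NumPy as follows. The word vectors and TF-IDF weights here are made-up toy values chosen only for illustration, and `weighted_cbow` is a hypothetical helper name. In practice the vectors would come from a trained embedding model and the weights from a fitted TF-IDF model.

```python
import numpy as np

# Toy 3-dimensional dense word vectors (made-up values).
word_vectors = {
    "this": np.array([0.1, 0.3, 0.5]),
    "is":   np.array([0.2, 0.1, 0.0]),
    "book": np.array([0.9, 0.4, 0.7]),
}

# Toy TF-IDF weights for the same words (made-up values).
tfidf_weights = {"this": 0.5, "is": 0.1, "book": 0.4}

def weighted_cbow(tokens, word_vectors, weights):
    """TF-IDF-weighted average of the word vectors of the given tokens."""
    w = np.array([weights[t] for t in tokens])           # shape (k,)
    vecs = np.stack([word_vectors[t] for t in tokens])   # shape (k, dim)
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

sentence_vector = weighted_cbow(["this", "is", "book"], word_vectors, tfidf_weights)
print(sentence_vector)
```

With equal weights this reduces to the plain average of the word vectors, so the TF-IDF weights only shift the sentence vector toward the rarer, more informative words.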