# Text Sparse Representations

Most classic text representation methods are based on histograms of word occurrences in each document. This technique is commonly referred to as the bag-of-words representation, or as a sparse text representation. While this simple approach works quite well, there are different ways of transforming a text document and of weighting the importance of each word to its document.


## Tokenization and Vocabulary
First, text needs to be converted into minimal units of text, the so-called tokens. A token can be a word, a span of characters, or a span of words.
This is the role of a tokenizer, also called an analyzer. In the example below we use scikit-learn's CountVectorizer, which combines a simple tokenizer with token counting:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


In [1]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Custom stop words: scikit-learn documents this parameter as 'english', a list, or None
my_stop_words = ['is', 'the']
# UNIGRAMS
vectorizer = CountVectorizer(ngram_range=(1, 1), analyzer='word', stop_words=None)
#vectorizer = CountVectorizer(ngram_range=(1, 1), analyzer='word', stop_words='english')
#vectorizer = CountVectorizer(ngram_range=(1, 1), analyzer='word', stop_words=my_stop_words)

# UNIGRAMS and BIGRAMS
#vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')

# Character GRAMS
#vectorizer = CountVectorizer(ngram_range=(3,4), analyzer='char')


The examples above illustrate how you can change the unit of tokenization (_word_ or _character_), the range of _n-gram_ sizes, and the stop words. Refer to the reference documentation for more information about all the available options.
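To see what each configuration actually produces, one option is to call the analyzer directly on a single string. The sketch below uses `build_analyzer()`, which returns the callable that the vectorizer applies to each document. The sample strings are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = 'This is the first document.'

# Word unigrams: lowercased words of two or more characters (the default token pattern).
unigrams = CountVectorizer(ngram_range=(1, 1), analyzer='word').build_analyzer()
print(unigrams(sentence))

# Word unigrams followed by word bigrams.
uni_bigrams = CountVectorizer(ngram_range=(1, 2), analyzer='word').build_analyzer()
print(uni_bigrams(sentence))

# Custom stop words are removed from the token list.
no_stops = CountVectorizer(analyzer='word', stop_words=['is', 'the']).build_analyzer()
print(no_stops(sentence))

# Character 3-grams and 4-grams of a short string.
char_grams = CountVectorizer(ngram_range=(3, 4), analyzer='char').build_analyzer()
print(char_grams('abcd'))
```

Inspecting the analyzer output this way is a cheap check before fitting a vocabulary on a large corpus.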

Once the tokenizer is defined, you can learn a dictionary by parsing the entire corpus. This will create a vocabulary $V$ of $L$ unique tokens,

$V = \{v_1, \ldots, v_L\}$.




In [2]:
vectorizer.fit(corpus)
vocabulary = vectorizer.get_feature_names_out()
print(vocabulary)

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
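The position of each token in the vocabulary matters, because it determines which vector dimension the token occupies. As a minimal sketch, the fitted vectorizer exposes this mapping through its `vocabulary_` attribute. The variable names `token_to_index` and `L` are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer().fit(corpus)

# vocabulary_ maps each token to its column index in the document vectors.
token_to_index = vectorizer.vocabulary_

# L is the vocabulary size, i.e. the number of dimensions of every vector.
L = len(token_to_index)
print(L)

# List the (token, index) pairs ordered by index.
print(sorted(token_to_index.items(), key=lambda item: item[1]))
```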


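## One-hot encoding

Transforming each vocabulary token on its own yields its one-hot encoding: a vector of $L$ dimensions with a single 1 at the position of that token.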

In [3]:
X = vectorizer.transform(vocabulary)
print(X.todense())

[[1 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 1]]
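One way to connect one-hot vectors to word counts is to note that summing the one-hot vectors of all tokens in a text gives the number of times each token occurs. The sketch below checks this for one sentence. It assumes the default `CountVectorizer` settings, and the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer().fit(corpus)

document = 'This document is the second document.'
tokens = vectorizer.build_analyzer()(document)

# One one-hot row per token occurrence, then sum the rows column by column.
one_hot_rows = vectorizer.transform(tokens)
summed = np.asarray(one_hot_rows.sum(axis=0)).ravel()

# Count vector computed directly by the vectorizer for the same text.
counts = vectorizer.transform([document]).toarray().ravel()

print(summed)
print(np.array_equal(summed, counts))
```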


## Document representation

The vocabulary then allows us to transform every document into a vector representation, where each dimension corresponds to a unique token and holds the number of times that token occurs in that particular document. Hence, each document is represented as a vector

$d_i = (v_{1,i}, \ldots, v_{L,i})$

where $v_{j,i}$ is the number of times token $v_j$ occurs in document $i$.

The next example transforms the entire set of documents of our corpus into 4 vectors of $L$ dimensions:

In [4]:
X = vectorizer.transform(corpus)
print(X.todense())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
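The matrix printed above is stored in a sparse format, which is why `todense()` is needed to display it. As a rough sketch of what "sparse" means here, the code below reports how many entries are actually stored. The names `n_docs`, `n_tokens` and `density` are introduced for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Shape is (number of documents, vocabulary size L).
n_docs, n_tokens = X.shape

# nnz counts the stored non-zero entries; only these take up memory.
density = X.nnz / (n_docs * n_tokens)

print(type(X).__name__, X.shape)
print(X.nnz, round(density, 2))
```

In this toy corpus 21 of the 36 entries are non-zero. With a realistic vocabulary of many thousands of tokens, each document typically uses only a small fraction of them, so the density is much lower.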


# Advanced Tokenization
Bag-of-words representations can also be built with other, more advanced tokenization approaches that can offer better results, such as the linguistic features provided by spaCy:
https://spacy.io/usage/linguistic-features

## Tokenization with spaCy

In [5]:
import spacy
from spacy import displacy
from pathlib import Path

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

save_figures = False

print("token".ljust(10), "lemma".ljust(10), "pos".ljust(6), "tag".ljust(6), "dep".ljust(10),
            "shape".ljust(10), "alpha", "stop")
print("------------------------------------------------------------------------------")
for token in doc:
    print(token.text.ljust(10), token.lemma_.ljust(10), token.pos_.ljust(6), token.tag_.ljust(6), token.dep_.ljust(10),
            token.shape_.ljust(10), token.is_alpha, token.is_stop)

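# With jupyter=True, displacy displays the figure inline in the notebook
displacy.render(doc, style="dep", jupyter=True)

if save_figures:
    # Render again with jupyter=False to get the markup as a string for saving
    html_dep = displacy.render(doc, style="dep", jupyter=False)
    output_path = Path("./images/") / "demo-dep.html"
    output_path.write_text(html_dep, encoding="utf-8")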


token      lemma      pos    tag    dep        shape      alpha stop
------------------------------------------------------------------------------
Apple      Apple      PROPN  NNP    nsubj      Xxxxx      True False
is         be         AUX    VBZ    aux        xx         True True
looking    look       VERB   VBG    ROOT       xxxx       True False
at         at         ADP    IN     prep       xx         True True
buying     buy        VERB   VBG    pcomp      xxxx       True False
U.K.       U.K.       PROPN  NNP    dobj       X.X.       False False
startup    startup    NOUN   NN     dep        xxxx       True False
for        for        ADP    IN     prep       xxx        True True
$          $          SYM    $      quantmod   $          False False
1          1          NUM    CD     compound   d          False False
billion    billion    NUM    CD     pobj       xxxx       True False


## Named Entities

In [6]:
import spacy
from spacy import displacy
from pathlib import Path

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text.ljust(12), ent.label_.ljust(10), ent.start_char, ent.end_char)

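# With jupyter=True, displacy displays the figure inline in the notebook
displacy.render(doc, style="ent", jupyter=True)

if save_figures:
    # Render again with jupyter=False to get the markup as a string for saving
    html_ent = displacy.render(doc, style="ent", jupyter=False)
    output_path = Path("./images/") / "demo-ent.html"
    output_path.write_text(html_ent, encoding="utf-8")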


Apple        ORG        0 5
U.K.         GPE        27 31
$1 billion   MONEY      44 54
