In this notebook, we will use functions from the [Scikit-learn](https://scikit-learn.org/stable/index.html) library to vectorize sentences with different methods.

The sentences are taken from the [Parallel Meaning Bank](https://pmb.let.rug.nl/), a multilingual corpus of public domain text annotated with many features. First, let's download the corpus and extract the content of the archive. We use version 1.0.0 because it is small, about 11 MB (later versions are much larger).

In [None]:
!wget https://pmb.let.rug.nl/releases/pmb-1.0.0.zip
!unzip pmb-1.0.0.zip

The `data` directory contains several subdirectories named `p*` (parts). Each part directory in turn contains several `d*` subdirectories, one per document.

We will read only the files named `en.tok.off`, which contain the tokenized English text, one token per line.

In [None]:
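from glob import glob

corpus = []
# Each matching file holds one English document, one token per line.
# Sorting makes the document order reproducible across runs.
for filename in sorted(glob("pmb-1.0.0/data/p*/d*/en.tok.off")):
    with open(filename, encoding="utf-8") as f:
        # Skip the first three fields of each line and keep the token text.
        tokens = [" ".join(line.strip().split(" ")[3:]) for line in f]
    # Store the document as a single space-separated string.
    corpus.append(" ".join(tokens))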
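To make the parsing step concrete, here is a minimal sketch that applies the same slicing to a few made-up lines. The sample lines are hypothetical and assume that each line of `en.tok.off` has the layout "start offset, end offset, token id, token"; check a real file from the archive before relying on this.

```python
# Hypothetical sample lines, assuming the layout:
# <start offset> <end offset> <token id> <token text>
sample_lines = [
    "0 3 1001 She\n",
    "4 9 1002 likes\n",
    "10 18 1003 New York\n",
    "18 19 1004 .\n",
]

# Same slicing as in the loop: drop the first three fields and
# re-join the rest, so a token containing a space stays intact.
sample_tokens = [" ".join(line.strip().split(" ")[3:]) for line in sample_lines]

# One document becomes one space-separated string.
sample_document = " ".join(sample_tokens)
print(sample_tokens)
print(sample_document)
```

Joining everything after the third field, rather than taking only the fourth field, keeps a token whole if it happens to contain a space.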

At the end of the loop, the variable *corpus* contains a list of documents, each represented as a single string of space-separated tokens.

In [None]:
print(corpus[0:2])

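Scikit-learn (*sklearn* for short) provides useful functions for vectorization. A `CountVectorizer` object computes a vocabulary from a corpus (fit) and vectorizes sentences by counting word occurrences (transform). Here, `max_features=1000` limits the vocabulary to the 1000 most frequent terms in the corpus.
The result is returned as a 2-dimensional SciPy sparse matrix, with one row per document and one column per vocabulary term.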

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1000)
corpus_vectorized = cv.fit_transform(corpus)
print(corpus_vectorized[0])
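The sparse output above can be hard to read, so here is a minimal sketch on a two-sentence toy corpus, converted to a dense array for inspection. The names `toy_corpus`, `toy_cv`, `toy_counts`, and `toy_vocab` are illustrative and not part of the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
toy_corpus = ["the cat sat on the mat", "the dog sat"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_corpus)

# vocabulary_ maps each term to its column index; list terms in column order.
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_vocab)             # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(toy_counts.toarray())  # [[1 0 1 1 1 2], [0 1 0 0 1 1]]

# transform() reuses the fitted vocabulary; unseen words ("bird") are ignored.
unseen = toy_cv.transform(["the bird sat"]).toarray()
print(unseen)                # [[0 0 0 0 1 1]]
```

Note that with its default settings `CountVectorizer` lowercases the text and keeps only tokens of at least two alphanumeric characters, so punctuation and single-letter tokens do not appear in the vocabulary.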

`TfidfVectorizer` works similarly to `CountVectorizer`, but instead of raw counts it weights each feature by its TF-IDF score, so that terms occurring in many documents count for less.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(max_features=1000)
corpus_vectorized = tv.fit_transform(corpus)
print(corpus_vectorized[0])
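To see what these weights are, the sketch below recomputes them by hand on a toy corpus and compares the result with `TfidfVectorizer`. It assumes the vectorizer's default settings (smoothed IDF, L2 row normalization, no sublinear term frequency); with other parameters the formula differs. The `toy_*` names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration only.
toy_corpus = ["the cat sat on the mat", "the dog sat"]

toy_tv = TfidfVectorizer()
toy_tfidf = toy_tv.fit_transform(toy_corpus).toarray()

# Manual recomputation under the default settings.
toy_tf = CountVectorizer().fit_transform(toy_corpus).toarray()
n_docs = toy_tf.shape[0]
doc_freq = (toy_tf > 0).sum(axis=0)                # documents containing each term
toy_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
weighted = toy_tf * toy_idf
# Scale each row to unit Euclidean length (L2 normalization).
toy_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(toy_manual, toy_tfidf))
```

Terms that appear in both toy sentences ("the", "sat") get the lowest IDF, while terms unique to one sentence get a higher one.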