In [1]:
from gensim import corpora, models
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

**Vector transformations in Gensim**

Now that we know what vector transformations are, let's get used to creating and using them. We will perform these transformations with Gensim, although scikit-learn can be used as well. We'll have a look at scikit-learn's approach later on.

Let's create our corpus now. We discussed earlier that a corpus is a collection of documents. In our examples, each document is just one sentence, but this is obviously not the case in most real-world examples we will be dealing with. We should also note that preprocessing gets rid of all punctuation marks, so as far as our vector representation is concerned, each document is just one sentence.

Of course, before we start, be sure to install Gensim. As with spaCy, pip or conda is the best way to do this, depending on your working environment.

In [3]:
# We performed very similar preprocessing when we introduced spaCy. What do our documents look like now?
documents = [u"Football club Arsenal defeat local rivals this weekend.", 
             u"Weekend football frenzy takes over London.", 
             u"Bank open for take over bids after losing millions.", 
             u"London football clubs bid to move to Wembley stadium.", 
             u"Arsenal bid 50 million pounds for striker Kane.", 
             u"Financial troubles result in loss of millions for bank.", 
             u"Western bank files for bankruptcy after financial losses.", 
             u"London football club is taken over by oil millionaire from Russia.", 
             u"Banking on finances not working for Russia."]

texts = []
for document in documents:
    doc = nlp(document)
    text = [
        w.lemma_
        for w in doc
        if not w.is_stop and not w.is_punct and not w.like_num
    ]
    texts.append(text)

dictionary = corpora.Dictionary(texts)

Let's start by whipping up a bag-of-words representation for our mini-corpus. Gensim allows us to do this very conveniently through its `Dictionary` class.

In [4]:
print(dictionary.token2id)

{'Arsenal': 0, 'club': 1, 'defeat': 2, 'football': 3, 'local': 4, 'rival': 5, 'weekend': 6, 'London': 7, 'frenzy': 8, 'take': 9, 'bank': 10, 'bid': 11, 'lose': 12, 'million': 13, 'open': 14, 'Wembley': 15, 'stadium': 16, 'Kane': 17, 'arsenal': 18, 'pound': 19, 'striker': 20, 'financial': 21, 'loss': 22, 'result': 23, 'trouble': 24, 'bankruptcy': 25, 'file': 26, 'western': 27, 'Russia': 28, 'millionaire': 29, 'oil': 30, 'banking': 31, 'finance': 32, 'work': 33}


There are 34 unique tokens in our corpus (IDs 0 to 33), each of which is assigned an index value in our dictionary. Note that the mapping is case-sensitive: `Arsenal` and `arsenal` receive separate IDs. When we refer to a word's word_id henceforth, we mean the word's integer ID as assigned by the dictionary.

We will be using the `doc2bow` method, which, as the name suggests, helps convert our document to bag-of-words.

In [5]:
corpus = [dictionary.doc2bow(text) for text in texts] 

If we print our corpus, we'll see the bag-of-words representation of the documents we used: each document becomes a list of (word_id, count) pairs.

In [6]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(3, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
 [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
 [(1, 1), (3, 1), (7, 1), (11, 1), (15, 1), (16, 1)],
 [(11, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
 [(10, 1), (13, 1), (21, 1), (22, 1), (23, 1), (24, 1)],
 [(10, 1), (21, 1), (22, 1), (25, 1), (26, 1), (27, 1)],
 [(1, 1), (3, 1), (7, 1), (9, 1), (28, 1), (29, 1), (30, 1)],
 [(28, 1), (31, 1), (32, 1), (33, 1)]]
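
To make the conversion less of a black box, here is a minimal pure-Python sketch of what a bag-of-words conversion does. It is not Gensim's implementation: `doc_to_bow` is a hypothetical helper written for illustration, and `token2id` is a hand-copied slice of the mapping printed earlier.

```python
from collections import Counter

# Hand-copied slice of the token-to-ID mapping printed earlier
# (an illustrative stand-in, not the full dictionary).
token2id = {'Arsenal': 0, 'club': 1, 'defeat': 2, 'football': 3,
            'local': 4, 'rival': 5, 'weekend': 6}


def doc_to_bow(tokens, token2id):
    """Hypothetical helper: return sorted (word_id, count) pairs.

    Tokens missing from the mapping are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


tokens = ['football', 'club', 'Arsenal', 'defeat', 'local', 'rival', 'weekend']
bow = doc_to_bow(tokens, token2id)
print(bow)
```

Every count is 1 here because no token repeats within a single sentence; a token appearing twice in a document would show up as `(word_id, 2)`.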

In [7]:
# TF-IDF representation

tfidf = models.TfidfModel(corpus)

In [9]:
for document in tfidf[corpus]:
    print(document)

[(0, 0.4538520228951382), (1, 0.2269260114475691), (2, 0.4538520228951382), (3, 0.1675032779320012), (4, 0.4538520228951382), (5, 0.4538520228951382), (6, 0.3106776504135697)]
[(3, 0.2421296766697527), (6, 0.44909138478886224), (7, 0.32802654645398593), (8, 0.6560530929079719), (9, 0.44909138478886224)]
[(10, 0.29019840161676663), (11, 0.29019840161676663), (12, 0.5803968032335333), (13, 0.3973019972146358), (14, 0.5803968032335333)]
[(1, 0.29431054749542984), (3, 0.21724253258131515), (7, 0.29431054749542984), (11, 0.29431054749542984), (15, 0.5886210949908597), (16, 0.5886210949908597)]
[(11, 0.24253562503633302), (17, 0.48507125007266605), (18, 0.48507125007266605), (19, 0.48507125007266605), (20, 0.48507125007266605)]
[(10, 0.2615055248879334), (13, 0.35801943340074827), (21, 0.35801943340074827), (22, 0.35801943340074827), (23, 0.5230110497758668), (24, 0.5230110497758668)]
[(10, 0.24434832234965204), (21, 0.33453001789363906), (22, 0.33453001789363906), (25, 0.4886966446993041), 
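
The weights above can be checked by hand. The sketch below assumes the weighting that the printed numbers appear to follow, namely term frequency multiplied by log2(N / document frequency), followed by scaling each document vector to unit length. `tfidf_vector` is a hypothetical helper, not part of Gensim.

```python
import math
from collections import Counter

# The bag-of-words corpus printed earlier, copied verbatim.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
    [(3, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
    [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
    [(1, 1), (3, 1), (7, 1), (11, 1), (15, 1), (16, 1)],
    [(11, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
    [(10, 1), (13, 1), (21, 1), (22, 1), (23, 1), (24, 1)],
    [(10, 1), (21, 1), (22, 1), (25, 1), (26, 1), (27, 1)],
    [(1, 1), (3, 1), (7, 1), (9, 1), (28, 1), (29, 1), (30, 1)],
    [(28, 1), (31, 1), (32, 1), (33, 1)],
]

num_docs = len(bow_corpus)
# Document frequency: in how many documents does each word_id occur?
doc_freq = Counter(word_id for doc in bow_corpus for word_id, _ in doc)


def tfidf_vector(doc):
    """Hypothetical helper: tf * log2(N / df), then L2-normalise."""
    raw = [(word_id, count * math.log2(num_docs / doc_freq[word_id]))
           for word_id, count in doc]
    norm = math.sqrt(sum(weight * weight for _, weight in raw))
    if norm == 0:
        return []
    return [(word_id, weight / norm) for word_id, weight in raw]


first = tfidf_vector(bow_corpus[0])
for word_id, weight in first:
    print(word_id, round(weight, 4))
```

Words that occur in only one document (such as word_id 0) get the largest weights, while `football` (word_id 3), which appears in four of the nine documents, gets the smallest weight in the first document.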

In [10]:
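# Creating n-grams: train a bigram (phrase) model on the tokenised texts,
# then apply it so that frequent word pairs are merged into single tokens.
# On a corpus this small, no pair is frequent enough to be merged.
bigram = models.Phrases(texts)
texts = [bigram[line] for line in texts]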

In [11]:
texts

[['football', 'club', 'Arsenal', 'defeat', 'local', 'rival', 'weekend'],
 ['weekend', 'football', 'frenzy', 'take', 'London'],
 ['bank', 'open', 'bid', 'lose', 'million'],
 ['London', 'football', 'club', 'bid', 'Wembley', 'stadium'],
 ['arsenal', 'bid', 'pound', 'striker', 'Kane'],
 ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
 ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
 ['London', 'football', 'club', 'take', 'oil', 'millionaire', 'Russia'],
 ['banking', 'finance', 'work', 'Russia']]
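
The texts come back unchanged, which is worth a quick sanity check. Assuming Gensim's documented default of `min_count=5` for `Phrases`, a word pair has to occur at least five times before it is even considered. The sketch below simply counts adjacent pairs in our texts; it does not reproduce the phrase scoring itself.

```python
from collections import Counter

# The tokenised texts printed above, copied verbatim.
texts = [
    ['football', 'club', 'Arsenal', 'defeat', 'local', 'rival', 'weekend'],
    ['weekend', 'football', 'frenzy', 'take', 'London'],
    ['bank', 'open', 'bid', 'lose', 'million'],
    ['London', 'football', 'club', 'bid', 'Wembley', 'stadium'],
    ['arsenal', 'bid', 'pound', 'striker', 'Kane'],
    ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
    ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
    ['London', 'football', 'club', 'take', 'oil', 'millionaire', 'Russia'],
    ['banking', 'finance', 'work', 'Russia'],
]

# Count every pair of adjacent tokens across all documents.
pair_counts = Counter(
    pair
    for tokens in texts
    for pair in zip(tokens, tokens[1:])
)

most_common_pair, highest_count = pair_counts.most_common(1)[0]
print(most_common_pair, highest_count)
```

The most frequent pair occurs only three times, so nothing reaches the assumed minimum count of five. On a larger corpus, or with a lower `min_count` and `threshold`, pairs such as `football club` could be merged into a single token.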

In [33]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [34]:
print(dictionary.token2id)

{'Arsenal': 0, 'club': 1, 'defeat': 2, 'football': 3, 'local': 4, 'rival': 5, 'weekend': 6, 'London': 7, 'frenzy': 8, 'take': 9, 'bank': 10, 'bid': 11, 'lose': 12, 'million': 13, 'open': 14, 'Wembley': 15, 'stadium': 16, 'Kane': 17, 'arsenal': 18, 'pound': 19, 'striker': 20, 'financial': 21, 'loss': 22, 'result': 23, 'trouble': 24, 'bankruptcy': 25, 'file': 26, 'western': 27, 'Russia': 28, 'millionaire': 29, 'oil': 30, 'banking': 31, 'finance': 32, 'work': 33}


In [35]:
# Keep only tokens found in at least 5 documents and in at most 50% of all documents.
dictionary.filter_extremes(no_below=5, no_above=0.5)

In [36]:
print(dictionary)

Dictionary<0 unique tokens: []>
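
The empty dictionary is a consequence of our tiny corpus: no token appears in five or more of the nine documents. The sketch below mimics the document-frequency filter in plain Python to show this. `surviving_tokens` is a hypothetical helper, it assumes the thresholds mean "at least `no_below` documents" and "at most `no_above` as a fraction of all documents", and it ignores Gensim's other options.

```python
from collections import Counter

# The tokenised texts used to build the dictionary, copied verbatim.
texts = [
    ['football', 'club', 'Arsenal', 'defeat', 'local', 'rival', 'weekend'],
    ['weekend', 'football', 'frenzy', 'take', 'London'],
    ['bank', 'open', 'bid', 'lose', 'million'],
    ['London', 'football', 'club', 'bid', 'Wembley', 'stadium'],
    ['arsenal', 'bid', 'pound', 'striker', 'Kane'],
    ['financial', 'trouble', 'result', 'loss', 'million', 'bank'],
    ['western', 'bank', 'file', 'bankruptcy', 'financial', 'loss'],
    ['London', 'football', 'club', 'take', 'oil', 'millionaire', 'Russia'],
    ['banking', 'finance', 'work', 'Russia'],
]


def surviving_tokens(texts, no_below, no_above):
    """Hypothetical helper: tokens that pass a document-frequency filter."""
    num_docs = len(texts)
    # Use set(tokens) so each document counts a token at most once.
    doc_freq = Counter(token for tokens in texts for token in set(tokens))
    return sorted(
        token for token, df in doc_freq.items()
        if no_below <= df <= no_above * num_docs
    )


print(surviving_tokens(texts, no_below=5, no_above=0.5))  # same thresholds as above
print(surviving_tokens(texts, no_below=2, no_above=0.5))  # a gentler threshold
```

With `no_below=5` nothing survives, matching the output above. Lowering the threshold to `no_below=2` keeps the handful of tokens shared by at least two documents, which is a more sensible setting for a corpus of this size.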
