<center><img src="./images/logo_fmkn.png" width=300 style="display: inline-block;"></center> 

## Machine Learning 2
### Seminar 6. Topic Modeling

<br />
<br />
March 24, 2022

### Introduction

This notebook looks at two models from the `gensim` library:

  - LDA (Latent Dirichlet Allocation), a topic model
  - word2vec, a word-embedding model

Sources of inspiration: 

  - https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html
  - https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html


## The LDA (Latent Dirichlet Allocation) model

### Data preparation

We install the topic-modeling library `gensim` (http://radimrehurek.com/gensim/) and the NLTK library (http://nltk.org/), which we will need for lemmatization.

In [2]:
!pip install --upgrade gensim
!pip install --upgrade nltk

Collecting gensim
  Downloading gensim-4.1.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.0 MB)
     |████████████████████████████████| 24.0 MB 3.9 MB/s            
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 4.7 MB/s             
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.1.2 smart-open-5.2.1
Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 3.8 MB/s            
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.6.5
    Uninstalling nltk-3.6.5:
      Successfully uninstalled nltk-3.6.5
Successfully installed nltk-3.7

In [3]:
import nltk
import numpy as np
import os
import sys

from gensim import corpora, models, similarities
from math import log
from time import time

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

We read the source texts into a list of documents. After preprocessing, each document will be a list of lemmas (tokens). In this example the whole collection is loaded into RAM, although `gensim` makes it possible to avoid that at every stage of building the model.
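
As a rough sketch of what streaming can look like: gensim's corpus-consuming classes accept any iterable of documents, so an object whose `__iter__` reads documents lazily keeps only one document in memory at a time. The class name `StreamingCorpus` and the one-document-per-line file layout are assumptions made for this illustration; they are not part of gensim or of this notebook.

```python
import os
import re
import tempfile


class StreamingCorpus:
    """Yield one tokenized document at a time from a text file (one document per line).

    Hypothetical helper, not part of gensim: because __iter__ reopens the file,
    the object can be iterated several times, which multi-pass training needs.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # Only the current line is held in memory.
                yield re.findall(r"\w+", line.lower())


# Tiny demonstration with a temporary two-document file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Neural networks learn\nTopic models describe documents\n")

stream = StreamingCorpus(tmp.name)
first_pass = list(stream)
second_pass = list(stream)  # a plain generator would already be exhausted here
os.remove(tmp.name)
print(first_pass)
```

The restartable `__iter__` is the important design choice: a bare generator can be consumed only once, while training with several passes has to walk the corpus repeatedly.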

The collection consists of papers from the NeurIPS conference, one of the standard collections for topic modeling. It contains 1740 documents, averaging about 3,100 tokens each before cleaning.

In [4]:
import tarfile
import re
import urllib.request, zipfile


tarfile_url = 'https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'
filename = 'nips12raw_str602.tgz'
urllib.request.urlretrieve(tarfile_url, filename)

def extract_documents(fname=filename):
    with tarfile.open(fname, mode='r:gz') as tar:
        # Ignore directory entries, as well as files like README, etc.
        files = [
            m for m in tar.getmembers()
            if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
        ]
        for member in sorted(files, key=lambda x: x.name):
            member_bytes = tar.extractfile(member).read()
            yield member_bytes.decode('utf-8', errors='replace')


In [5]:
docs = list(extract_documents())
print(len(docs))
print(docs[0][:500])

1740
1 
CONNECTIVITY VERSUS ENTROPY 
Yaser S. Abu-Mostafa 
California Institute of Technology 
Pasadena, CA 91125 
ABSTRACT 
How does the connectivity of a neural network (number of synapses per 
neuron) relate to the complexity of the problems it can handle (measured by 
the entropy)? Switching theory would suggest no relation at all, since all Boolean 
functions can be implemented using a circuit with very low connectivity (e.g., 
using two-input NAND gates). However, for a network that learns a pr


Data preparation steps:
- tokenize and clean the text
- lemmatize
- add n-grams (here, bigrams)
- build the dictionary
- filter out tokens that occur too often or too rarely

In [6]:
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

In [7]:
print(np.sum([len(doc) for doc in docs]))

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

print(np.sum([len(doc) for doc in docs]))

5461201
5115888


In [8]:
print(docs[1][:50])

['stochastic', 'learning', 'networks', 'and', 'their', 'electronic', 'implementation', 'joshua', 'alspector', 'robert', 'b', 'allen', 'victor', 'hut', 'and', 'srinagesh', 'satyanarayana', 'bell', 'communications', 'research', 'morristown', 'nj', 'abstract', 'we', 'describe', 'a', 'family', 'of', 'learning', 'algorithms', 'that', 'operate', 'on', 'a', 'recurrent', 'symmetrically', 'connected', 'neuromorphic', 'network', 'that', 'like', 'the', 'boltzmann', 'machine', 'settles', 'in', 'the', 'presence', 'of', 'noise']


In [9]:
# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
print(np.sum([len(doc) for doc in docs]))

4629808


In [10]:
# Remove tokens that contain an underscore ('_' is used below to recognize bigrams).
docs = [[token for token in doc if '_' not in token] for doc in docs]
print(np.sum([len(doc) for doc in docs]))

4626035
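
The cleaning cells above can be condensed into a single function. This is only a sketch using the standard library: `clean_tokens` is a hypothetical name, and `re.findall(r"\w+", ...)` stands in for NLTK's `RegexpTokenizer(r'\w+')`, which should split text the same way.

```python
import re


def clean_tokens(text):
    """Lowercase, split into word-character runs, and drop numeric,
    one-character, and underscore-containing tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        token for token in tokens
        if not token.isnumeric() and len(token) > 1 and "_" not in token
    ]


print(clean_tokens("A 2-layer net_1 has 10 units in 1987"))
```

Applying all filters in one pass avoids building three intermediate copies of a corpus with several million tokens.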


In [13]:
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /home/avalur/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [14]:
print(lemmatizer.lemmatize('abstracts'),
      lemmatizer.lemmatize('documents'))

abstract document


In [15]:
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

In [16]:
# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)


2022-02-22 16:24:47,605 : INFO : collecting all words and their counts
2022-02-22 16:24:47,606 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2022-02-22 16:24:57,926 : INFO : collected 1114019 token types (unigram + bigrams) from a corpus of 4626035 words and 1740 sentences
2022-02-22 16:24:57,927 : INFO : merged Phrases<1114019 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
2022-02-22 16:24:57,928 : INFO : Phrases lifecycle event {'msg': 'built Phrases<1114019 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000> in 10.32s', 'datetime': '2022-02-22T16:24:57.928760', 'gensim': '4.1.2', 'python': '3.9.7 (default, Nov 25 2021, 21:01:41) \n[GCC 9.3.0]', 'platform': 'Linux-4.4.0-22000-Microsoft-x86_64-with-glibc2.31', 'event': 'created'}


In [17]:
for token in bigram[docs[0][:100]]:
    if '_' in token:
        print(token)

abu_mostafa
california_institute
technology_pasadena
ca_abstract
neural_network
boolean_function
can_be
very_low
learning_rule
lower_bound
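
Why are pairs such as `neural_network` merged while most adjacent words are not? The log above shows `min_count=20` and `threshold=10.0`. Assuming the default scorer follows the formula from Mikolov et al. (2013), a pair is kept when the score below exceeds the threshold. The function name `phrase_score` and all counts are hypothetical and chosen only to illustrate the behaviour.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Bigram score in the style of Mikolov et al. (2013); higher means more phrase-like."""
    if count_a == 0 or count_b == 0:
        return float("-inf")
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# Hypothetical counts, not taken from the NeurIPS corpus.
threshold = 10.0
strong_pair = phrase_score(count_a=1000, count_b=500, count_ab=120,
                           vocab_size=100_000, min_count=20)
weak_pair = phrase_score(count_a=50_000, count_b=40_000, count_ab=300,
                         vocab_size=100_000, min_count=20)
print(strong_pair > threshold, weak_pair > threshold)
```

The score rewards pairs that co-occur often relative to how frequent each word is on its own, which is why two very common words need many more co-occurrences to qualify.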


In [18]:
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

In [19]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

2022-02-22 16:25:22,772 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2022-02-22 16:25:27,340 : INFO : built Dictionary(77874 unique tokens: ['0a', '2h', '2h2', '2he', '2n']...) from 1740 documents (total 4944899 corpus positions)
2022-02-22 16:25:27,341 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(77874 unique tokens: ['0a', '2h', '2h2', '2he', '2n']...) from 1740 documents (total 4944899 corpus positions)", 'datetime': '2022-02-22T16:25:27.341599', 'gensim': '4.1.2', 'python': '3.9.7 (default, Nov 25 2021, 21:01:41) \n[GCC 9.3.0]', 'platform': 'Linux-4.4.0-22000-Microsoft-x86_64-with-glibc2.31', 'event': 'created'}


Let us remove words that are too rare (for example, typos) and words that are too frequent (for example, stop words or simply common non-topical terms). The `filter_extremes` function removes from the dictionary every token that occurs in fewer than `no_below` documents or in more than a fraction `no_above` of all documents.

In [20]:
# Remove rare and common tokens.
# Filter out words that occur in fewer than 20 documents or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

2022-02-22 16:25:35,231 : INFO : discarding 69258 tokens: [('0a', 19), ('2h', 16), ('2h2', 1), ('2he', 3), ('a', 1740), ('about', 1058), ('abstract', 1740), ('after', 1087), ('alently', 2), ('all', 1658)]...
2022-02-22 16:25:35,232 : INFO : keeping 8616 tokens which were in no less than 20 and no more than 870 (=50.0%) documents
2022-02-22 16:25:35,297 : INFO : resulting dictionary: Dictionary(8616 unique tokens: ['2n', 'a2', 'a_follows', 'ability', 'abu']...)
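
The filtering rule can be sketched in a few lines of plain Python. This mirrors only the two bounds reported in the log above (both inclusive); the real `filter_extremes` also has a `keep_n` parameter, which is not modelled here, and the function name and toy documents are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (both bounds inclusive)."""
    # Count each token once per document, however often it repeats there.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}


toy_docs = [
    ["the", "neuron", "spike"],
    ["the", "neuron", "layer"],
    ["the", "kernel", "layer"],
    ["the", "typo"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here `the` is dropped for appearing in every document, while `spike`, `kernel`, and `typo` are dropped for appearing in only one.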


We represent every document as a bag-of-words vector.

In [21]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [22]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 8616
Number of documents: 1740
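
To make the bag-of-words representation concrete, here is a minimal sketch of what `doc2bow` produces: a sorted list of `(token_id, count)` pairs in which tokens missing from the dictionary are skipped. The function name and the toy vocabulary are hypothetical.

```python
from collections import Counter


def to_bag_of_words(tokens, token2id):
    """Map a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


token2id = {"layer": 0, "neuron": 1, "spike": 2}  # hypothetical toy vocabulary
bow = to_bag_of_words(["neuron", "spike", "neuron", "the"], token2id)
print(bow)
```

Word order is discarded entirely, which is exactly the assumption LDA makes about documents.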


## Training the model

We are now ready to build a topic model of our collection. We use the online LDA implementation from `gensim`, passing it the vectorized corpus, the dictionary, and the number of topics (10). The remaining parameters are discussed later.

In [23]:
id2word = dictionary.id2token

In [24]:
start = time()
# Set training parameters.
num_topics = 10
chunksize = 2000  # batch-size
passes = 5   # epochs
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
# `id2token` is filled lazily, so access one item first to force it to be built.
temp = dictionary[0]
id2word = dictionary.id2token

model = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)
print('Evaluation time: {}'.format((time()-start) / 60))  # elapsed training time, in minutes

2022-02-22 16:25:53,889 : INFO : using autotuned alpha, starting with [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
2022-02-22 16:25:53,892 : INFO : using serial LDA version on this node
2022-02-22 16:25:53,902 : INFO : running online (multi-pass) LDA training, 10 topics, 5 passes over the supplied corpus of 1740 documents, updating model once every 1740 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2022-02-22 16:25:53,904 : INFO : PROGRESS: pass 0, at document #1740/1740
2022-02-22 16:26:24,637 : INFO : optimized alpha [0.058923192, 0.08117263, 0.10330559, 0.057845123, 0.0694567, 0.07500869, 0.07700171, 0.062445663, 0.08848998, 0.085008636]
2022-02-22 16:26:24,650 : INFO : topic #3 (0.058): 0.006*"image" + 0.006*"tree" + 0.005*"object" + 0.004*"recognition" + 0.004*"node" + 0.003*"hidden" + 0.003*"density" + 0.003*"matrix" + 0.003*"layer" + 0.002*"training_set"
2022-02-22 16:26:24,651 : INFO : topic #0 (0.059): 0.008*"

Evaluation time: 1.3982077638308208


Let us look at the result. We are interested in part of the matrix $\Phi$ of word probabilities within topics, namely the top words of each topic. The whole NeurIPS collection is about machine learning, so the topics are hard to judge, although some interpretability is visible.

In [26]:
# Print the top words of every topic, one topic per column.
# Columns are 16 characters wide so that long tokens such as 'generalization' do not run together.
for position in range(10):
    row = []
    for topic in range(10):
        row.append(model.show_topic(topic)[position][0].center(16, ' '))
    print(''.join(row))

     neuron           chip          control          image           action          hidden           cell          gaussian         image           neuron
      rule          circuit          policy          object          robot           layer           neuron          matrix       recognition        hidden
   activation        analog         optimal           tree         prediction         net           response        mixture          speech          noise
   connection        signal          action       recognition    reinforcement   generalization     stimulus         class            word           matrix
    dynamic          neuron         dynamic           node          control        classifier        visual      approximation      sequence        dynamic
     layer          voltage        threshold        distance         target          class          activity       likelihood        class        hidden_unit
     symbol          memory          bound         character          goal        hidden_unit       synaptic         sample           face           kernel
      cell            vlsi          theorem          digit           image        training_set       spike            log             hmm            layer
    binding           cell         controller        class            hand          trained        frequency         prior          context            m

In [27]:
top_topics = model.top_topics(corpus)

2022-02-22 16:37:05,231 : INFO : CorpusAccumulator accumulated stats from 1000 documents


In [28]:
top_topics[0]

([(0.0073933867, 'neuron'),
  (0.0071649444, 'hidden'),
  (0.0054954486, 'noise'),
  (0.00519319, 'matrix'),
  (0.0041962005, 'dynamic'),
  (0.0041336617, 'hidden_unit'),
  (0.0035209463, 'kernel'),
  (0.0033246516, 'layer'),
  (0.003298026, 'map'),
  (0.003261122, 'eq'),
  (0.0031875812, 'component'),
  (0.0030365295, 'solution'),
  (0.002856641, 'connection'),
  (0.0028020288, 'field'),
  (0.0027057056, 'rule'),
  (0.002698164, 'fig'),
  (0.002629931, 'generalization'),
  (0.0024733716, 'optimal'),
  (0.0024529216, 'signal'),
  (0.002380457, 'dimensional')],
 -0.9328473022670469)

In [29]:
model.inference([corpus[0]])[0]

array([[3.4935974e-02, 4.3625142e-02, 7.7860468e+02, 3.9755680e-02,
        4.4036742e-02, 5.1865373e-02, 5.4756276e-02, 1.9868952e+01,
        5.1610108e-02, 4.7824293e-02]], dtype=float32)
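
The row returned by `model.inference` is unnormalized, so it does not sum to one. A minimal sketch of turning it into a topic distribution for the document, using the values printed above (the variable names are illustrative):

```python
# Values copied from the output above (document 0, 10 topics).
gamma = [
    3.4935974e-02, 4.3625142e-02, 7.7860468e+02, 3.9755680e-02, 4.4036742e-02,
    5.1865373e-02, 5.4756276e-02, 1.9868952e+01, 5.1610108e-02, 4.7824293e-02,
]

total = sum(gamma)
theta = [value / total for value in gamma]  # now sums to 1

# Index of the topic with the largest share in this document.
dominant_topic = max(range(len(theta)), key=theta.__getitem__)
print(dominant_topic, round(theta[dominant_topic], 3))
```

After normalization almost all of the probability mass of this document falls on a single topic, with a small remainder on one other topic.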

## Evaluating models with perplexity

We would like to evaluate the model with something more convincing than eyeballing topic profiles and document profiles. This is needed in order to compare different models, for example models trained with different parameters. Let us learn to measure **perplexity**. The function `model.state.get_lambda` returns the unnormalized matrix $\Phi$, and `model.inference` estimates the unnormalized matrix $\Theta$ for a list of documents. 

We iterate over the collection and compute the perplexity from its definition. The lower the perplexity, the better.
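
Written out, the quantity computed in the next cell is the following. The symbols $n_{dw}$ (the count of word $w$ in document $d$) and $n$ (the total corpus length) are introduced here for illustration; $\phi_{wt}$ and $\theta_{td}$ denote entries of the normalized matrices $\Phi$ and $\Theta$.

$$
\mathcal{P} = \exp\left(-\frac{1}{n} \sum_{d} \sum_{w \in d} n_{dw} \, \ln \sum_{t} \phi_{wt}\,\theta_{td}\right),
\qquad n = \sum_{d} \sum_{w \in d} n_{dw}.
$$

The inner sum over topics $t$ is the model's probability of word $w$ in document $d$, so the perplexity is the exponential of the negative average log-likelihood per token.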

In [30]:
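def perplexity(model, corpus):
    """Compute the perplexity of an LDA model on a corpus; lower is better."""
    corpus_length = 0
    log_likelihood = 0.0
    # Normalize each row of lambda to get p(word | topic), i.e. the matrix Phi.
    lam = model.state.get_lambda()
    topic_profiles = lam / lam.sum(axis=1, keepdims=True)
    for document in corpus:
        # gamma has shape (1, num_topics); normalizing gives p(topic | document).
        gamma, _ = model.inference([document])
        document_profile = gamma[0] / gamma[0].sum()
        for term_id, term_count in document:
            corpus_length += term_count
            # p(word | document) = sum over topics of p(word | topic) * p(topic | document)
            term_probability = np.dot(document_profile, topic_profiles[:, term_id])
            log_likelihood += term_count * log(term_probability)
    return np.exp(-log_likelihood / corpus_length)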

In [31]:
print('Perplexity: {}'.format(perplexity(model, corpus)))

Perplexity: 2769.3908373846953


In [32]:
model_5 = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=5,
    passes=passes,
    eval_every=eval_every
)
model_20 = models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=20,
    passes=passes,
    eval_every=eval_every
)

2022-02-22 16:37:31,649 : INFO : using autotuned alpha, starting with [0.2, 0.2, 0.2, 0.2, 0.2]
2022-02-22 16:37:31,653 : INFO : using serial LDA version on this node
2022-02-22 16:37:31,659 : INFO : running online (multi-pass) LDA training, 5 topics, 5 passes over the supplied corpus of 1740 documents, updating model once every 1740 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2022-02-22 16:37:31,661 : INFO : PROGRESS: pass 0, at document #1740/1740
2022-02-22 16:37:54,738 : INFO : optimized alpha [0.17777632, 0.20432077, 0.12614259, 0.07163164, 0.15984705]
2022-02-22 16:37:54,745 : INFO : topic #0 (0.178): 0.003*"neuron" + 0.003*"image" + 0.003*"recognition" + 0.003*"classifier" + 0.003*"word" + 0.003*"matrix" + 0.003*"class" + 0.003*"node" + 0.003*"memory" + 0.003*"layer"
2022-02-22 16:37:54,747 : INFO : topic #1 (0.204): 0.006*"neuron" + 0.004*"signal" + 0.004*"cell" + 0.003*"class" + 0.003*"noise" + 0.003*"layer" + 0.0

2022-02-22 16:38:37,304 : INFO : PROGRESS: pass 0, at document #1740/1740
2022-02-22 16:39:11,777 : INFO : optimized alpha [0.038862128, 0.05100475, 0.047491785, 0.048265025, 0.038935926, 0.04641892, 0.04216214, 0.032821245, 0.044396333, 0.04052047, 0.043786053, 0.038821142, 0.040446002, 0.04046346, 0.048172873, 0.04609909, 0.045116663, 0.03775596, 0.044976816, 0.040786207]
2022-02-22 16:39:11,805 : INFO : topic #7 (0.033): 0.008*"cell" + 0.004*"rule" + 0.004*"control" + 0.004*"correlation" + 0.003*"neuron" + 0.003*"spline" + 0.003*"muscle" + 0.003*"potential" + 0.003*"stimulus" + 0.003*"phase"
2022-02-22 16:39:11,806 : INFO : topic #17 (0.038): 0.008*"hidden" + 0.005*"control" + 0.005*"policy" + 0.004*"hidden_unit" + 0.004*"optimal" + 0.003*"layer" + 0.003*"noise" + 0.003*"object" + 0.003*"action" + 0.003*"net"
2022-02-22 16:39:11,808 : INFO : topic #14 (0.048): 0.005*"image" + 0.004*"recognition" + 0.004*"net" + 0.004*"hidden" + 0.003*"layer" + 0.003*"distance" + 0.003*"mixture" + 0.

In [33]:
print('Perplexity 5: {}'.format(perplexity(model_5, corpus)))
print('Perplexity 20: {}'.format(perplexity(model_20, corpus)))

Perplexity 5: 3071.2870198514443
Perplexity 20: 2515.209696656869


# The Word2Vec model

Word2Vec is one of the main neural models of the "pre-transformer" era (2013-2018). The idea is to map words into an $N$-dimensional space (embeddings) with certain properties: the more similar the contexts in which two words are used, the closer their embeddings.

The `gensim` library implements two methods of training word2vec:

  - Skip-gram (SG)
  - Continuous bag-of-words (CBOW)
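
The difference between the two methods lies in what is predicted from what: skip-gram predicts each context word from the center word, while CBOW predicts the center word from the whole context. The sketch below only generates the training examples for each scheme; it leaves out everything else a real implementation does (for example negative sampling and frequent-word subsampling), and all function names are hypothetical.

```python
def context_window(tokens, position, window):
    """Tokens within `window` positions of `position`, excluding the center token."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def skipgram_examples(tokens, window):
    """Skip-gram: one (center, context word) example per neighbour."""
    return [
        (center, neighbour)
        for i, center in enumerate(tokens)
        for neighbour in context_window(tokens, i, window)
    ]


def cbow_examples(tokens, window):
    """CBOW: one (all context words, center) example per position."""
    return [(context_window(tokens, i, window), center) for i, center in enumerate(tokens)]


sentence = ["topic", "models", "describe", "documents"]
print(skipgram_examples(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

In gensim the choice between the two is made with the `sg` argument of `Word2Vec` (`sg=1` for skip-gram, `sg=0` for CBOW).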


## Demo of a pretrained model

For the demonstration we take a pretrained model trained on the Google News dataset; it contains about 3M English words and phrases.

In [34]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

2022-02-22 16:40:52,058 : INFO : Creating /home/avalur/gensim-data





2022-02-22 16:45:17,215 : INFO : word2vec-google-news-300 downloaded
2022-02-22 16:45:17,227 : INFO : loading projection weights from /home/avalur/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
2022-02-22 16:47:00,218 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from /home/avalur/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2022-02-22T16:47:00.218484', 'gensim': '4.1.2', 'python': '3.9.7 (default, Nov 25 2021, 21:01:41) \n[GCC 9.3.0]', 'platform': 'Linux-4.4.0-22000-Microsoft-x86_64-with-glibc2.31', 'event': 'load_word2vec_format'}


In [35]:
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")

word #0/3000000 is </s>
word #1/3000000 is in
word #2/3000000 is for
word #3/3000000 is that
word #4/3000000 is is
word #5/3000000 is on
word #6/3000000 is ##
word #7/3000000 is The
word #8/3000000 is with
word #9/3000000 is said


Let us take the vector of the word `king`.

In [36]:
vec_king = wv['king']
print(vec_king[:10])

[ 0.12597656  0.02978516  0.00860596  0.13964844 -0.02563477 -0.03613281
  0.11181641 -0.19824219  0.05126953  0.36328125]


With the model we can measure how similar a word is to other words (cosine similarity: the higher the value, the closer the words).

In [37]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
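
The scores above are cosine similarities between word vectors. A minimal standard-library sketch of that measure follows; the vectors are hypothetical 3-dimensional examples, whereas the vectors of this model have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Because only the angle matters, a vector and any positive multiple of it are treated as identical, which keeps the measure insensitive to vector length.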


Or find the words closest in meaning.

In [38]:
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

[('SUV', 0.8532191514968872), ('vehicle', 0.8175784349441528), ('pickup_truck', 0.7763689756393433), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.7565719485282898)]


In [39]:
vec_example = wv['king'] - wv['man'] + wv['woman']

similars = wv.most_similar(positive=[vec_example])
print(similars)

[('king', 0.8449392914772034), ('queen', 0.730051577091217), ('monarch', 0.6454662084579468), ('princess', 0.6156250834465027), ('crown_prince', 0.5818676948547363), ('prince', 0.5777117013931274), ('kings', 0.561366617679596), ('sultan', 0.5376775860786438), ('Queen_Consort', 0.5344247221946716), ('queens', 0.5289887189865112)]
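
In the output above `king` itself ranks first, because a precomputed vector was passed and the query words cannot be excluded from the candidates. Calling `most_similar` with word keys, as in `wv.most_similar(positive=['king', 'woman'], negative=['man'])`, skips the query words. The sketch below reproduces that idea on a hand-made toy embedding: the vectors and the `analogy` helper are hypothetical, and the normalization gensim applies to the input vectors is omitted.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative),
    skipping the query words themselves."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    query = set(positive) | set(negative)
    ranked = sorted(
        ((word, cosine(target, vec)) for word, vec in vectors.items() if word not in query),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]


# Hypothetical toy embedding: the axes are (royalty, gender, person-ness).
toy = {
    "king": [1.0, 1.0, 0.5],
    "queen": [1.0, -1.0, 0.5],
    "man": [0.0, 1.0, 0.5],
    "woman": [0.0, -1.0, 0.5],
    "prince": [0.8, 1.0, 0.5],
}
result = analogy(toy, positive=["king", "woman"], negative=["man"])
print(result)
```

Excluding the query words matters in practice: the vector `king - man + woman` usually stays close to `king`, so without the exclusion the analogy appears to return its own input.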


In [44]:
vec_example = wv['programmer'] - wv['woman'] + wv['man'] 

similars = wv.most_similar(positive=[vec_example])
print(similars)

[('programmer', 0.8918285965919495), ('programmers', 0.5779235363006592), ('programer', 0.5624995827674866), ('Programmer', 0.5415107011795044), ('sysadmin', 0.5366033911705017), ('Jon_Shiring', 0.5260592699050903), ('coder', 0.5256212949752808), ('modder', 0.4957827925682068), ('animator', 0.4935148060321808), ('engineer', 0.48991018533706665)]
