# Text Data Analysis


Nikolenko, Kadurin, Arkhangelskaya. **Глубокое обучение. Погружение в мир нейронных сетей** (Deep Learning: A Dive into the World of Neural Networks). Chapter 7.


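### What tasks can be solved by processing text?
"Мама мыла раму, и теперь она блестит" ("Mom washed the window frame, and now it shines")  
"Мама мыла раму, и теперь она сильно устала" ("Mom washed the window frame, and now she is very tired")  

In Russian the pronoun "она" can grammatically refer to either "мама" or "рама", so resolving it requires understanding the meaning of the sentence.

"The trophy did not fit in the suitcase because it was too big. What exactly was too big, the suitcase or the trophy?"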

http://commonsensereasoning.org/winograd.html


1. syntactic tasks
  * part-of-speech and morphological tagging
  * splitting words into morphemes (suffix, prefix, etc.)
  * stemming, lemmatization (?)
  * sentence segmentation (complicated by initials and abbreviations) and word segmentation (e.g. for Chinese)
  * finding names and titles in text (named entities)
  * word sense disambiguation in a given context (e.g. Russian "замок": "castle" or "lock")
  * building a syntactic parse tree
  * determining which other objects a word refers to
2. text understanding tasks with supervision (a "teacher")
  * predicting the next character
  * information retrieval
  * sentiment analysis
  * relation and fact extraction
  * question answering
3. text understanding and generation (how to evaluate quality?)
  * text generation
  * automatic summarization
  * machine translation
  * dialogue models (chatbot)
  
Indirect tasks:
  * image captioning
  * speech recognition
  
**Business tasks**:
  * speech recognition (voice assistant)
  * chatbot (replacing technical support for most requests)
  * finding an exact answer to a question in a document base (for example, a base of standards)
  * assessing opinions about a product on social networks
  * ... (your suggestions?)

# Topic Modeling

A topic model automatically determines which topics each document in a collection belongs to, as well as which words (terms) characterize each topic.

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d5/%D0%A2%D0%B5%D0%BC%D0%B0%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B0%D1%8F_%D0%BC%D0%BE%D0%B4%D0%B5%D0%BB%D1%8C.png">

In [1]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation # Non-negative Matrix Factorization & LDA
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()


# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Loading dataset...
done in 409.642s.


In [2]:
len(data_samples)

2000

In [3]:
data_samples[0]

"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

In [4]:


# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()

# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

# Fit the NMF model
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Extracting tf-idf features for NMF...
done in 1.357s.
Extracting tf features for LDA...
done in 1.443s.

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...
done in 1.655s.

Topics in NMF model (Frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope

In [5]:
lda.components_


array([[ 4.96604155,  4.3537397 , 21.42539886, ...,  1.57926488,
         1.33933502,  1.20988436],
       [ 0.48391034,  1.85845783, 14.04720958, ..., 74.59501615,
        59.36116266,  0.27698642],
       [ 0.18708486,  0.13728929,  0.31409364, ...,  1.02679042,
         2.56259123,  0.13662652],
       ...,
       [ 3.22343848, 39.1368944 , 11.24910558, ..., 23.37779481,
         3.06315114,  0.15230766],
       [ 1.41871388, 47.53082031, 16.14390001, ..., 82.46751192,
        16.51319941, 28.11660323],
       [ 4.02759659,  1.24781464, 13.26101699, ..., 29.0225105 ,
         0.24834416,  0.13033208]])

In [6]:
lda.components_.shape

(10, 1000)

# Naive Bayes
* The label of every document is known
* Every document has exactly one label  


What can be done if there is no information about the labels?
#### The clustering problem  
It can be solved with the EM algorithm:
* E-step: compute the expectations of which document belongs to which topic
* M-step: use Naive Bayes to estimate the probabilities $p(w|t)$ with the labels fixed
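
When the labels are fixed, estimating $p(w|t)$ reduces to counting words per topic. Here is a minimal sketch with add-one (Laplace) smoothing; the helper name `estimate_word_probs`, the smoothing choice and the toy documents are assumptions made for illustration, not part of the original material.

```python
from collections import Counter, defaultdict

def estimate_word_probs(docs, labels, smoothing=1.0):
    """Estimate p(w|t) from tokenized, labeled documents by smoothed counting."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = defaultdict(Counter)          # counts[t][w] = occurrences of w under label t
    for doc, t in zip(docs, labels):
        counts[t].update(doc)
    probs = {}
    for t, c in counts.items():
        total = sum(c.values()) + smoothing * len(vocab)
        probs[t] = {w: (c[w] + smoothing) / total for w in vocab}
    return probs

# Toy data (hypothetical): one label per document.
docs = [["sugar", "health", "sugar"], ["car", "driving"], ["health", "doctor"]]
labels = ["food", "cars", "food"]
p_w_given_t = estimate_word_probs(docs, labels)
print(p_w_given_t["food"]["sugar"])
```

In the unsupervised setting the hard labels would be replaced by the expectations computed on the E-step, so each document contributes fractional counts to every topic.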


## EM algorithm (Expectation-maximization)

Solves the clustering problem.  
It fits the parameters of a model to data for which the answer (the hidden variables) is unknown.  

Expectation step:
* fix the model parameters
* compute the values of the hidden variables

Maximization step:
* fix the hidden variables
* compute the model parameters

Repeat until convergence.

There is a mathematical justification that the method converges to a local extremum: the value of the likelihood function does not decrease at any step. The likelihood $p(\mathcal{X} | \theta)$ measures how plausible the model is with the given parameters, that is, how well it describes the data.

A special case of the EM algorithm (with hard cluster assignments) is **k-means**.  
Cluster labels are the hidden variables Z (latent variables)  
Model parameters are the cluster centers  
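
The alternation of the two steps can be sketched for k-means in NumPy. This is an illustrative sketch only: the function name `kmeans`, the initialization from random data points, the stopping rule and the toy data are all assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iter):
        # E-step: with centers fixed, assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: with labels fixed, move each center to the mean of its points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```

Each point receives exactly one label here, which is what distinguishes k-means from the soft assignments of a general EM procedure.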

<img src="kmeans.png">

Another variant of the EM algorithm is the separation of a mixture of Gaussians (Gaussian Mixture Model, GMM).

Model parameters: the center of each cluster and its covariance matrix (which here describes the shape of the multivariate normal distribution, or Gaussian)  
Hidden variables: the probability of belonging to each Gaussian (the cluster label is chosen as the most probable cluster)
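
A short sketch of the same idea with scikit-learn's `GaussianMixture`, which is fitted with EM. The synthetic data and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data (hypothetical): two Gaussian blobs with different spreads.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(gmm.means_)               # model parameters: one center per Gaussian
print(gmm.covariances_.shape)   # one covariance matrix per Gaussian
resp = gmm.predict_proba(X)     # hidden variables: membership probabilities
labels = resp.argmax(axis=1)    # hard label = most probable component
```

Unlike k-means, every point keeps a full probability vector over the components, and a hard label is only derived at the end.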


<img src="gauss.png">

## PLSA (Probabilistic latent semantic analysis)

What if every document can have many labels?

Consider the model:
* Every word in a document $d$ is generated from some topic $t \in T$
* A document is generated by some distribution over topics $p(t|d)$
* A word is generated from the topic (not from the document): $p(w|d, t) = p(w|t)$
* We obtain the likelihood: $$p(w|d) = \sum_{t \in T}p(w|t)p(t|d) $$
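
The sum over topics is a matrix product: if the word-topic probabilities are stored as a matrix and the topic-document probabilities as another, their product gives $p(w|d)$ for every word and document. A small numeric sketch; the array names `phi` and `theta` and all the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical toy parameters: 4 words, 2 topics, 3 documents.
phi = np.array([[0.5, 0.0],    # phi[w, t] = p(w|t); each column sums to 1
                [0.3, 0.1],
                [0.2, 0.3],
                [0.0, 0.6]])
theta = np.array([[1.0, 0.5, 0.2],   # theta[t, d] = p(t|d); each column sums to 1
                  [0.0, 0.5, 0.8]])

p_w_given_d = phi @ theta   # p(w|d) = sum over t of p(w|t) * p(t|d)
print(p_w_given_d[:, 1])    # word distribution of the document that mixes both topics equally
```

Each column of the product is again a probability distribution over words, so a document is modeled as a mixture of topic word distributions.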

The resulting model is probabilistic latent semantic analysis (pLSA).

http://www.machinelearning.ru/wiki/index.php?title=%D0%92%D0%B5%D1%80%D0%BE%D1%8F%D1%82%D0%BD%D0%BE%D1%81%D1%82%D0%BD%D1%8B%D0%B9_%D0%BB%D0%B0%D1%82%D0%B5%D0%BD%D1%82%D0%BD%D1%8B%D0%B9_%D1%81%D0%B5%D0%BC%D0%B0%D0%BD%D1%82%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8%D0%B9_%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D0%B7

Training:  
We need the quantities:
* $p(w|t)$: probabilities of words in topics, denoted $\phi_{wt}$

* $p(t|d)$: probabilities of topics in documents, denoted $\theta_{td}$

E-step:
* fix $\phi_{wt}$ and $\theta_{td}$
* compute $$p(t|d,w) = \frac{\phi_{wt} \theta_{td}}{\sum_{s \in T}\phi_{ws} \theta_{sd}}$$ for all topics, for every document, for every term
* compute the number of occurrences of term $w$ in document $d$ that are generated by topic $t$, where $n_{dw}$ is the number of times $w$ occurs in $d$: $$n_{dwt} = n_{dw}p(t|d,w)$$

M-step:
* from the computed $p(t|d,w)$, update the model estimates $\phi_{wt}$ and $\theta_{td}$
* $$n_{wt} = \sum_d n_{dwt}$$ $$n_{td} = \sum_{w \in d} n_{dwt}$$ $$n_t=\sum_w n_{wt}$$ $$n_d=\sum_{t \in T} n_{td}$$
* $$\theta_{td} = \frac{n_{td}}{n_d}$$ $$\phi_{wt} = \frac{n_{wt}}{n_t}$$
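
The two steps above can be sketched directly in NumPy. This is a dense, unoptimized illustration that stores the full $n_{dwt}$ array; the function name `plsa_em`, the random initialization, the fixed iteration count and the toy count matrix are assumptions.

```python
import numpy as np

def plsa_em(n_dw, n_topics, n_iter=50, seed=0):
    """n_dw: (documents x words) count matrix. Returns phi (W x T) and theta (T x D)."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    phi = rng.random((W, n_topics))
    phi /= phi.sum(axis=0)                 # columns are p(w|t)
    theta = rng.random((n_topics, D))
    theta /= theta.sum(axis=0)             # columns are p(t|d)
    for _ in range(n_iter):
        # E-step: p(t|d,w) is proportional to phi_wt * theta_td.
        p_tdw = phi.T[:, None, :] * theta[:, :, None]        # shape (T, D, W)
        p_tdw /= p_tdw.sum(axis=0, keepdims=True) + 1e-12
        n_dwt = n_dw[None, :, :] * p_tdw                      # expected counts, (T, D, W)
        # M-step: re-estimate phi and theta from the expected counts.
        n_wt = n_dwt.sum(axis=1).T                            # (W, T)
        n_td = n_dwt.sum(axis=2)                              # (T, D)
        phi = n_wt / n_wt.sum(axis=0, keepdims=True)          # divide by n_t
        theta = n_td / n_td.sum(axis=0, keepdims=True)        # divide by n_d
    return phi, theta

# Toy counts (hypothetical): 4 documents, 4 words, two obvious word groups.
n_dw = np.array([[4, 3, 0, 0],
                 [3, 4, 0, 0],
                 [0, 0, 5, 2],
                 [0, 0, 2, 5]], dtype=float)
phi, theta = plsa_em(n_dw, n_topics=2)
print(np.round(theta, 2))
```

The sketch assumes every document contains at least one word; an empty document would make $n_d$ zero and needs separate handling.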


There is no need to store the matrix $n_{dwt}$: one can iterate over the documents and accumulate $n_{wt}$ and $n_{td}$.

Drawbacks of pLSA:
* Many local extrema
* Many parameters, so the model overfits
* The goal is not just to reach a local extremum but to achieve interpretability, that is, to find a "good" extremum

## LDA (Latent Dirichlet Allocation)

In the general case, to improve pLSA, regularization is added to the log-likelihood:

$$\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} + \sum_i \tau_i R_i(\Phi, \Theta)$$

If we add a prior distribution, namely the Dirichlet distribution, we obtain the LDA (Latent Dirichlet Allocation) algorithm.
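
As a worked illustration, one common way to write this regularizer is the log of Dirichlet priors on the columns of $\Phi$ and $\Theta$, taken up to an additive constant. The hyperparameter symbols $\beta_w$ and $\alpha_t$ are notation introduced here as an assumption; they do not appear in the text above.

$$R(\Phi, \Theta) = \sum_{t \in T} \sum_{w} (\beta_w - 1) \ln \phi_{wt} + \sum_{d \in D} \sum_{t \in T} (\alpha_t - 1) \ln \theta_{td}$$

Maximizing the regularized log-likelihood with $\tau = 1$ then changes only the M-step, which becomes a smoothed version of the pLSA update, where $(x)_+ = \max(x, 0)$:

$$\phi_{wt} \propto (n_{wt} + \beta_w - 1)_+ \qquad \theta_{td} \propto (n_{td} + \alpha_t - 1)_+$$

With $\beta_w = \alpha_t = 1$ the regularizer vanishes and the updates coincide with those of pLSA.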

As a result we get a "good" interpretable solution (better than with pLSA).


One document can contain several topics.  
We build a hierarchical model:  
* first level: a mixture whose components correspond to topics
* second level: a multinomial variable with a Dirichlet prior, which defines the "distribution over topics" in a document
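
The hierarchical model can be read as a recipe for generating a document. A minimal sketch of that generative process; the vocabulary, the hyperparameter values and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["sugar", "health", "doctor", "car", "driving", "road"]  # toy vocabulary
n_topics, doc_len = 2, 8
alpha = np.full(n_topics, 0.5)      # Dirichlet prior over topics in a document
beta = np.full(len(vocab), 0.5)     # Dirichlet prior over words in a topic

phi = rng.dirichlet(beta, size=n_topics)   # phi[t] = word distribution of topic t
theta = rng.dirichlet(alpha)               # topic distribution of one document

words = []
for _ in range(doc_len):
    t = rng.choice(n_topics, p=theta)      # pick a topic for this word position
    w = rng.choice(len(vocab), p=phi[t])   # pick a word from that topic
    words.append(vocab[w])
print(theta, words)
```

Training runs this process in reverse: given only the words, it infers plausible values of `theta` and `phi`.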

Training:
* Gibbs sampling
* online variational Bayes

In [7]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

In [11]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kapmik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kapmik\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [12]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string


stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()


def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]        

In [13]:
doc_clean

[['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
 ['father',
  'spends',
  'lot',
  'time',
  'driving',
  'sister',
  'around',
  'dance',
  'practice'],
 ['doctor',
  'suggest',
  'driving',
  'may',
  'cause',
  'increased',
  'stress',
  'blood',
  'pressure'],
 ['sometimes',
  'feel',
  'pressure',
  'perform',
  'well',
  'school',
  'father',
  'never',
  'seems',
  'drive',
  'sister',
  'better'],
 ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle']]

In [15]:
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [16]:
dictionary.keys(), dictionary.values()

([0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34],
 ValuesView(<gensim.corpora.dictionary.Dictionary object at 0x000001A28C72B5C0>))

In [17]:
dictionary.doc2bow(doc_clean[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]
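
The list of pairs above is a bag-of-words encoding: each pair is a term id and the number of times that term occurs in the document. A simplified pure-Python stand-in shows the idea; it is not gensim's implementation, and the helper names `build_vocab` and `to_bow` as well as the id assignment order are assumptions.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every unique token (sorted order within each document)."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["sugar", "bad", "consume", "sister", "like", "sugar", "father"],
        ["father", "spends", "lot", "time"]]
token2id = build_vocab(docs)
bow = to_bow(docs[0], token2id)
print(bow)
```

Word order is discarded by this encoding, which is the representation that topic models such as LDA work with.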

In [18]:
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

In [19]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, '0.091*"sugar" + 0.064*"sister" + 0.064*"father"'), (1, '0.029*"father" + 0.029*"sister" + 0.029*"pressure"'), (2, '0.079*"driving" + 0.045*"pressure" + 0.045*"suggest"')]


In [21]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [22]:
pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary)



# https://www.kaggle.com/benhamner/nips-papers

In [20]:
import pandas as pd

ds = pd.read_csv('papers.csv')
ds['paper_text']

0       767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1       683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2       394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3       Bayesian Query Construction for Neural\nNetwor...
4       Neural Network Ensembles, Cross\nValidation, a...
5       U sing a neural net to instantiate a\ndeformab...
6       Plasticity-Mediated Competitive Learning\n\nTe...
7       ICEG Morphology Classification using an\nAnalo...
8       Real-Time Control of a Tokamak Plasma\nUsing N...
9       Real-Time Control of a Tokamak Plasma\nUsing N...
10      Learning To Play the Game of Chess\n\nSebastia...
11      Multidimensional Scaling and Data Clustering\n...
12      An experimental comparison\nof recurrent neura...
13      133\n\nTRAINING MULTILAYER PERCEPTRONS WITH TH...
14      Interference in Learning Internal\nModels of I...
15      Active Learning with Statistical Models\n\nDav...
16      A Rapid Graph-based Method for\nArbitrary Tran...
17      Ocular

In [19]:
len(ds)

7241

In [19]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string


stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()


def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in ds['paper_text']]        

In [23]:
doc_clean[0]

[u'767',
 u'selforganization',
 u'associative',
 u'database',
 u'application',
 u'hisashi',
 u'suzuki',
 u'suguru',
 u'arimoto',
 u'osaka',
 u'university',
 u'toyonaka',
 u'osaka',
 u'560',
 u'japan',
 u'abstract',
 u'efficient',
 u'method',
 u'selforganizing',
 u'associative',
 u'database',
 u'proposed',
 u'together',
 u'application',
 u'robot',
 u'eyesight',
 u'system',
 u'proposed',
 u'database',
 u'associate',
 u'input',
 u'output',
 u'first',
 u'half',
 u'part',
 u'discussion',
 u'algorithm',
 u'selforganization',
 u'proposed',
 u'aspect',
 u'hardware',
 u'produce',
 u'new',
 u'style',
 u'neural',
 u'network',
 u'latter',
 u'half',
 u'part',
 u'applicability',
 u'handwritten',
 u'letter',
 u'recognition',
 u'autonomous',
 u'mobile',
 u'robot',
 u'system',
 u'demonstrated',
 u'introduction',
 u'let',
 u'mapping',
 u'f',
 u'x',
 u'given',
 u'here',
 u'x',
 u'finite',
 u'infinite',
 u'set',
 u'another',
 u'finite',
 u'infinite',
 u'set',
 u'learning',
 u'machine',
 u'observes',
 u'se

In [25]:
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [29]:
doc_term_matrix[0]

[(0, 5),
 (1, 22),
 (2, 1),
 (3, 3),
 (4, 3),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 4),
 (9, 1),
 (10, 2),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 2),
 (15, 1),
 (16, 4),
 (17, 9),
 (18, 1),
 (19, 1),
 (20, 1),
 (21, 11),
 (22, 2),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 4),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1),
 (33, 12),
 (34, 2),
 (35, 1),
 (36, 1),
 (37, 1),
 (38, 1),
 (39, 1),
 (40, 1),
 (41, 1),
 (42, 1),
 (43, 1),
 (44, 1),
 (45, 4),
 (46, 2),
 (47, 3),
 (48, 1),
 (49, 1),
 (50, 1),
 (51, 7),
 (52, 1),
 (53, 1),
 (54, 1),
 (55, 1),
 (56, 7),
 (57, 1),
 (58, 1),
 (59, 1),
 (60, 1),
 (61, 1),
 (62, 1),
 (63, 1),
 (64, 1),
 (65, 1),
 (66, 1),
 (67, 1),
 (68, 1),
 (69, 1),
 (70, 1),
 (71, 1),
 (72, 1),
 (73, 3),
 (74, 2),
 (75, 1),
 (76, 2),
 (77, 1),
 (78, 1),
 (79, 1),
 (80, 2),
 (81, 1),
 (82, 6),
 (83, 1),
 (84, 1),
 (85, 1),
 (86, 2),
 (87, 1),
 (88, 6),
 (89, 1),
 (90, 1),
 (91, 3),
 (92, 1),
 (93, 2),
 (94, 2),
 (95, 1),
 (96, 1),
 (97, 1),
 (98, 2),
 (99, 4),
 (100, 

In [32]:
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word=dictionary, passes=1)

In [33]:
for t in ldamodel.print_topics(num_topics=30, num_words=10):
    print(t)

(0, u'0.021*"network" + 0.009*"neural" + 0.008*"1" + 0.007*"input" + 0.007*"model" + 0.005*"output" + 0.005*"unit" + 0.005*"time" + 0.005*"2" + 0.004*"learning"')
(1, u'0.016*"1" + 0.012*"2" + 0.008*"model" + 0.008*"x" + 0.006*"0" + 0.006*"p" + 0.006*"algorithm" + 0.006*"graph" + 0.005*"r" + 0.005*"n"')
(2, u'0.013*"image" + 0.013*"model" + 0.010*"1" + 0.006*"2" + 0.005*"x" + 0.005*"3" + 0.005*"feature" + 0.004*"a" + 0.004*"using" + 0.004*"object"')
(3, u'0.014*"1" + 0.012*"model" + 0.008*"data" + 0.008*"x" + 0.007*"learning" + 0.005*"distribution" + 0.005*"2" + 0.005*"k" + 0.005*"0" + 0.005*"3"')
(4, u'0.022*"1" + 0.015*"2" + 0.015*"x" + 0.010*"algorithm" + 0.010*"k" + 0.009*"n" + 0.009*"0" + 0.007*"f" + 0.007*"function" + 0.006*"p"')
(5, u'0.016*"1" + 0.011*"0" + 0.009*"2" + 0.009*"x" + 0.009*"n" + 0.007*"p" + 0.007*"model" + 0.007*"j" + 0.006*"c" + 0.006*"r"')
(6, u'0.012*"model" + 0.011*"1" + 0.008*"0" + 0.008*"2" + 0.007*"data" + 0.005*"x" + 0.005*"k" + 0.005*"p" + 0.005*"b" + 0.0

### bigrams



In [41]:
doc_bigrams = [[t1 + '_' + t2 for t1, t2 in zip(doc, doc[1:])] for doc in doc_clean]

dictionary = corpora.Dictionary(doc_bigrams)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_bigrams]
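
The `zip(doc, doc[1:])` idiom pairs every token with its right neighbour. A tiny standalone illustration; the helper name `to_bigrams` and the token list are made up.

```python
def to_bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return [t1 + "_" + t2 for t1, t2 in zip(tokens, tokens[1:])]

tokens = ["neural", "network", "training", "algorithm"]
print(to_bigrams(tokens))
```

Because the bigrams are built after stop words and punctuation have been removed, a pair may join words that were not adjacent in the original text.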

In [43]:
doc_term_matrix[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 2),
 (6, 1),
 (7, 1),
 (8, 2),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 2),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 1),
 (21, 1),
 (22, 2),
 (23, 1),
 (24, 4),
 (25, 3),
 (26, 6),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1),
 (33, 1),
 (34, 1),
 (35, 1),
 (36, 1),
 (37, 1),
 (38, 1),
 (39, 1),
 (40, 1),
 (41, 1),
 (42, 1),
 (43, 1),
 (44, 1),
 (45, 1),
 (46, 1),
 (47, 1),
 (48, 1),
 (49, 1),
 (50, 1),
 (51, 1),
 (52, 2),
 (53, 1),
 (54, 1),
 (55, 1),
 (56, 1),
 (57, 1),
 (58, 1),
 (59, 1),
 (60, 1),
 (61, 1),
 (62, 1),
 (63, 1),
 (64, 1),
 (65, 1),
 (66, 1),
 (67, 1),
 (68, 1),
 (69, 1),
 (70, 1),
 (71, 1),
 (72, 1),
 (73, 1),
 (74, 1),
 (75, 1),
 (76, 2),
 (77, 1),
 (78, 1),
 (79, 1),
 (80, 1),
 (81, 1),
 (82, 1),
 (83, 1),
 (84, 1),
 (85, 1),
 (86, 1),
 (87, 1),
 (88, 1),
 (89, 1),
 (90, 1),
 (91, 1),
 (92, 1),
 (93, 1),
 (94, 1),
 (95, 1),
 (96, 1),
 (97, 1),
 (98, 1),
 (99, 1),
 (100, 1),

In [44]:
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word=dictionary, passes=1)

In [45]:
for t in ldamodel.print_topics(num_topics=30, num_words=10):
    print(t)

(0, u'0.001*"reinforcement_learning" + 0.001*"et_al" + 0.001*"s_a" + 0.001*"1_2" + 0.000*"machine_learning" + 0.000*"value_function" + 0.000*"reward_function" + 0.000*"0_1" + 0.000*"0_0" + 0.000*"neural_network"')
(1, u'0.001*"1_1" + 0.001*"et_al" + 0.001*"0_1" + 0.000*"1_2" + 0.000*"machine_learning" + 0.000*"x_x" + 0.000*"figure_1" + 0.000*"2_2" + 0.000*"k_k" + 0.000*"figure_2"')
(2, u'0.001*"1_1" + 0.001*"1_2" + 0.001*"machine_learning" + 0.001*"et_al" + 0.001*"0_0" + 0.001*"x_x" + 0.001*"neural_information" + 0.001*"2_2" + 0.001*"0_1" + 0.001*"information_processing"')
(3, u'0.001*"et_al" + 0.001*"machine_learning" + 0.001*"1_2" + 0.001*"x_x" + 0.001*"1_1" + 0.001*"0_1" + 0.001*"processing_system" + 0.001*"0_0" + 0.001*"neural_network" + 0.000*"figure_1"')
(4, u'0.001*"et_al" + 0.001*"x_x" + 0.001*"1_2" + 0.001*"0_0" + 0.001*"information_processing" + 0.001*"processing_system" + 0.001*"0_1" + 0.000*"machine_learning" + 0.000*"neural_network" + 0.000*"gaussian_process"')
(5, u'0.001

In [47]:
#save model
# ldamodel.save('nips.bigrams')

#Load model
ldamodel = Lda.load('nips.bigrams')

In [49]:
for t in ldamodel.print_topics(num_topics=10, num_words=5):
    print(t)

(0, u'0.001*"reinforcement_learning" + 0.001*"et_al" + 0.001*"s_a" + 0.001*"1_2" + 0.000*"machine_learning"')
(1, u'0.001*"1_1" + 0.001*"et_al" + 0.001*"0_1" + 0.000*"1_2" + 0.000*"machine_learning"')
(2, u'0.001*"1_1" + 0.001*"1_2" + 0.001*"machine_learning" + 0.001*"et_al" + 0.001*"0_0"')
(3, u'0.001*"et_al" + 0.001*"machine_learning" + 0.001*"1_2" + 0.001*"x_x" + 0.001*"1_1"')
(4, u'0.001*"et_al" + 0.001*"x_x" + 0.001*"1_2" + 0.001*"0_0" + 0.001*"information_processing"')
(5, u'0.001*"et_al" + 0.001*"neural_network" + 0.000*"figure_2" + 0.000*"computer_vision" + 0.000*"arxiv_preprint"')
(6, u'0.003*"neural_network" + 0.001*"et_al" + 0.001*"0_0" + 0.001*"hidden_unit" + 0.001*"1_1"')
(7, u'0.002*"neural_network" + 0.001*"et_al" + 0.001*"neural_information" + 0.001*"figure_1" + 0.000*"1_2"')
(8, u'0.001*"neural_network" + 0.001*"et_al" + 0.001*"machine_learning" + 0.001*"1_1" + 0.001*"0_1"')
(9, u'0.001*"f_x" + 0.001*"1_2" + 0.001*"1_1" + 0.001*"0_1" + 0.001*"2_2"')


# https://www.kaggle.com/mrisdal/fake-news/data

In [2]:
import pandas as pd

ds = pd.read_csv('fake.csv', usecols = ['text'])
ds.dropna(axis=0, inplace=True, subset=['text'])
ds = ds.sample(frac=1.0)
ds['text'].head()

6852     Comments \nRepublican nominee Donald Trump is ...
3557     It is no longer a question of whether or not f...
6654     Tuesday 1 November 2016 by James W School to t...
12753    posted by Eddie One of the world’s largest tra...
51       November 13, 2016 By 21wire Leave a Comment \n...
Name: text, dtype: object

In [3]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string


stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()


def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word.decode('utf-8')) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in ds['text']]        

In [4]:
doc_clean[1]

[u'longer',
 u'question',
 u'whether',
 u'financial',
 u'market',
 u'u',
 u'economy',
 u'collapse',
 u'that',
 u'according',
 u'host',
 u'expert',
 u'mainstream',
 u'alternative',
 u'given',
 u'question',
 u'\u201cwhen\u201d',
 u'moment',
 u'come',
 u'according',
 u'christine',
 u'hughes',
 u'chief',
 u'investment',
 u'strategist',
 u'otterwood',
 u'capital',
 u'soon',
 u'basing',
 u'assessment',
 u'historically',
 u'deadon',
 u'yield',
 u'curve',
 u'analysis',
 u'hughes',
 u'say',
 u'latest',
 u'update',
 u'client',
 u'we\u2019re',
 u'looking',
 u'maximum',
 u'breaking',
 u'point',
 u'2020',
 u'time',
 u'next',
 u'12',
 u'\u2013',
 u'15',
 u'month',
 u'likely',
 u'scenario',
 u'peg',
 u'next',
 u'crisis',
 u'right',
 u'beginning',
 u'2018',
 u'first',
 u'chart',
 u'near',
 u'perfect',
 u'accuracy',
 u'thus',
 u'far',
 u'show',
 u'rapidly',
 u'yield',
 u'curve',
 u'collapsed',
 u'last',
 u'12',
 u'month',
 u'hughes',
 u'explains',
 u'mean',
 u'expect',
 u'2018',
 u'year',
 u'reckoning'

In [5]:
from nltk import FreqDist

# use nltk fdist to get a frequency distribution of all words
fdist = FreqDist(word for d in doc_clean for word in d)


In [6]:
len(fdist)

213611

In [7]:
k = 50000
top_k_words = fdist.most_common(k)
top_k_words[-10:]

[(u'pontevedra', 4),
 (u'bayou', 4),
 (u'boomed', 4),
 (u'przeciwko', 4),
 (u'tazed', 4),
 (u'cabinet\u2019s', 4),
 (u'permettra', 4),
 (u'\u0442\u043e\u043b\u0447\u043a\u043e\u0432', 4),
 (u'\u043f\u0440\u0438\u0432\u043b\u0435\u0447\u044c', 4),
 (u'suburbia', 4)]

In [8]:
k = 15000
top_k_words = fdist.most_common(k)
top_k_words[-10:]

[(u'oak', 27),
 (u'trustworthy', 27),
 (u'\u0432\u0430\u043c', 27),
 (u'ersten', 27),
 (u'22nd', 27),
 (u'aspiring', 27),
 (u'scoundrel', 27),
 (u'lao', 27),
 (u'\u042f\u043f\u043e\u043d\u0438\u0438', 27),
 (u'don\xe2\u20ac\u2122t', 27)]

In [9]:
top_k_words = dict(top_k_words)

In [10]:
doc_clean_freqs = [[w for w in doc if w in top_k_words] for doc in doc_clean]
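
The vocabulary pruning done in the last few cells can be condensed into a single helper. A sketch using `collections.Counter` from the standard library; the function name `keep_top_k` and the toy documents are hypothetical.

```python
from collections import Counter

def keep_top_k(docs, k):
    """Keep only tokens that are among the k most frequent in the whole corpus."""
    freq = Counter(token for doc in docs for token in doc)
    vocab = {token for token, _ in freq.most_common(k)}
    return [[t for t in doc if t in vocab] for doc in docs]

docs = [["trump", "campaign", "trump"], ["campaign", "court", "trump"], ["oath"]]
print(keep_top_k(docs, k=2))
```

Pruning rare tokens shrinks the dictionary and removes much of the noise (typos, foreign words, mis-encoded strings) that is visible at the tail of the frequency list.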

In [11]:
doc_clean_freqs[0]

[u'comment',
 u'republican',
 u'nominee',
 u'donald',
 u'trump',
 u'admitted',
 u'serial',
 u'sexual',
 u'predator',
 u'recorded',
 u'word',
 u'confirm',
 u'much',
 u'dozen',
 u'woman',
 u'come',
 u'forward',
 u'accuse',
 u'sexual',
 u'misconduct',
 u'campaign',
 u'desperately',
 u'fighting',
 u'put',
 u'lid',
 u'growing',
 u'awareness',
 u'trump',
 u'testifying',
 u'oath',
 u'trial',
 u'federal',
 u'court',
 u'accusation',
 u'raping',
 u'girl',
 u'previously',
 u'undisclosed',
 u'second',
 u'girl',
 u'even',
 u'younger',
 u'case',
 u'thrown',
 u'may',
 u'due',
 u'error',
 u'june',
 u'two',
 u'new',
 u'witness',
 u'say',
 u'worked',
 u'convicted',
 u'child',
 u'rapist',
 u'billionaire',
 u'epstein',
 u'part',
 u'appears',
 u'girl',
 u'party',
 u'revealed',
 u'deposition',
 u'convinced',
 u'victim',
 u'attend',
 u'four',
 u'different',
 u'party',
 u'promise',
 u'money',
 u'industry',
 u'\u2013',
 u'story',
 u'match',
 u'account',
 u'another',
 u'person',
 u'would',
 u'arrange',
 u'under

In [12]:
import gensim
from gensim import corpora

doc_bigrams = [[t1 + '_' + t2 for t1, t2 in zip(doc, doc[1:])] for doc in doc_clean_freqs]

dictionary = corpora.Dictionary(doc_bigrams)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_bigrams]

In [13]:
doc_term_matrix[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 1),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 1),
 (21, 1),
 (22, 1),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1),
 (33, 1),
 (34, 1),
 (35, 1),
 (36, 1),
 (37, 1),
 (38, 1),
 (39, 1),
 (40, 1),
 (41, 1),
 (42, 1),
 (43, 1),
 (44, 1),
 (45, 1),
 (46, 1),
 (47, 1),
 (48, 1),
 (49, 1),
 (50, 1),
 (51, 1),
 (52, 1),
 (53, 1),
 (54, 1),
 (55, 1),
 (56, 1),
 (57, 1),
 (58, 3),
 (59, 1),
 (60, 1),
 (61, 1),
 (62, 1),
 (63, 1),
 (64, 1),
 (65, 1),
 (66, 1),
 (67, 1),
 (68, 1),
 (69, 1),
 (70, 1),
 (71, 1),
 (72, 1),
 (73, 1),
 (74, 1),
 (75, 2),
 (76, 1),
 (77, 1),
 (78, 1),
 (79, 1),
 (80, 1),
 (81, 1),
 (82, 1),
 (83, 1),
 (84, 1),
 (85, 1),
 (86, 1),
 (87, 1),
 (88, 1),
 (89, 1),
 (90, 1),
 (91, 1),
 (92, 1),
 (93, 1),
 (94, 1),
 (95, 1),
 (96, 1),
 (97, 1),
 (98, 1),
 (99, 1),
 (100, 1),

In [None]:
Lda = gensim.models.ldamodel.LdaModel


num_topics = 100
chunksize = 300


# low alpha means each document is only represented by a small number of topics, and vice versa
# low eta means each topic is only represented by a small number of words, and vice versa

ldamodel = Lda(
    doc_term_matrix, 
    num_topics=num_topics, 
    id2word=dictionary, 
    alpha=1e-2, 
    eta=0.5e-2, 
    chunksize=chunksize, 
    minimum_probability=0.0, 
    passes=2, 
)
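
The effect of a low versus high `alpha` can be checked by sampling document-topic distributions from a symmetric Dirichlet and measuring how much weight the dominant topic receives. This is only an illustration of the prior, not of gensim's training; the helper name `mean_top_topic_share` and the chosen values are assumptions.

```python
import numpy as np

def mean_top_topic_share(alpha, n_topics=10, n_docs=1000, seed=0):
    """Average weight of the most probable topic under a symmetric Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    return theta.max(axis=1).mean()

for alpha in (0.01, 1.0, 10.0):
    print(alpha, round(mean_top_topic_share(alpha), 2))
```

With a small `alpha` almost all of the mass falls on a single topic per document, while a large `alpha` spreads it nearly evenly; `eta` plays the same role for the words within a topic.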

In [None]:
#save model
ldamodel.save('fake.bigrams')

#Load model
# ldamodel = Lda.load('fake.bigrams')

In [None]:
for t in ldamodel.print_topics(num_topics=100, num_words=5):
    print(t)

In [None]:
ldamodel.show_topic(topicid=4, topn=20)

In [None]:
ldamodel.get_document_topics(doc_term_matrix[0])

# VK walls

In [None]:
import json
import os
from pprint import pprint


def get_res_arr(filename):
    # Load one JSON file and return the values of its top-level object as a list.
    with open(filename) as f:
        data = json.load(f)
    return list(data.values())

path='./user_posts/'
super_arr = []
for filename in os.listdir(path):
    #print(filename)
    super_arr.extend(get_res_arr(path + filename))

print(len(super_arr))

In [None]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string


stop = set(stopwords.words('russian'))
exclude = set(string.punctuation) 
# NOTE: WordNet covers English only, so Russian tokens pass through the lemmatizer unchanged.
lemma = WordNetLemmatizer()


def clean(doc):
    #print(doc)
    doc = str(doc)
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in super_arr]   
print(doc_clean[0:1])

In [None]:
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [None]:
doc_term_matrix[0]

In [None]:
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word=dictionary, passes=1)

In [23]:
# ldamodel.save('vk.unigrams')
ldamodel = Lda.load('vk.unigrams')

In [25]:
for t in ldamodel.print_topics(num_topics=30, num_words=10):
    print(t[1])

0.014*"n" + 0.004*"это" + 0.004*"бизнес" + 0.003*"компании" + 0.003*"деньги" + 0.003*"1" + 0.003*"рублей" + 0.002*"10" + 0.002*"день" + 0.002*"bitcoin"
0.013*"помоги" + 0.012*"3" + 0.012*"вк" + 0.012*"зайди" + 0.011*"пройти" + 0.011*"ссылке" + 0.008*"игру" + 0.007*"помощь" + 0.007*"нужна" + 0.007*"интерны"
0.024*"с" + 0.019*"♥" + 0.017*"тебе" + 0.016*"рождения" + 0.015*"днем" + 0.015*"открытки" + 0.013*"😉" + 0.012*"узнай" + 0.012*"n" + 0.011*"❤"
0.016*"это" + 0.005*"очень" + 0.005*"просто" + 0.004*"я" + 0.004*"тебе" + 0.003*"—" + 0.003*"всё" + 0.002*"хочу" + 0.002*"всем" + 0.002*"ещё"
0.011*"n" + 0.008*"nи" + 0.006*"—" + 0.004*"это" + 0.004*"жизнь" + 0.004*"жизни" + 0.004*"любовь" + 0.003*"день" + 0.003*"пусть" + 0.003*"–"
0.017*"the" + 0.011*"to" + 0.011*"a" + 0.010*"and" + 0.010*"you" + 0.009*"of" + 0.009*"i" + 0.008*"in" + 0.007*"my" + 0.006*"for"
0.006*"і" + 0.005*"з" + 0.004*"люблю" + 0.004*"❤️" + 0.003*"😂" + 0.003*"спасибо" + 0.003*"день" + 0.003*"russia" + 0.003*"😍" + 0.003*"😊"


### Bigrams

In [None]:
doc_bigrams = [[t1 + '_' + t2 for t1, t2 in zip(doc, doc[1:])] for doc in doc_clean]

dictionary = corpora.Dictionary(doc_bigrams)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_bigrams]

In [None]:
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word=dictionary, passes=1)

In [26]:
# ldamodel.save('vk.bigrams')
ldamodel = Lda.load('vk.bigrams')

In [29]:
for t in ldamodel.print_topics(num_topics=30, num_words=10):
    print(t[0])

0
1
2
3
4
5
6
7
8
9


http://www.machinelearning.ru/wiki/images/8/82/BMMO11_14.pdf
http://www.machinelearning.ru/wiki/images/f/f7/DirichletProcessNotes.pdf 
