<center><img src="img/logo_hse_black.jpg"></center>

<h1><center>Machine Learning Methods</center></h1>
<h2><center>Topic Modeling</center></h2>

In [32]:
%matplotlib inline

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12,8)

# Working with text

In [34]:
df = pd.read_csv('./data/labeledTrainData.tsv', sep='\t')
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


### Standard preprocessing from the previous session

In [35]:
## Keep only words written in Latin letters
import re
regex = re.compile(u"[A-Za-z]+")

def words_only(text, regex=regex):
    return " ".join(regex.findall(text))


df.review = df.review.str.lower()
df.loc[:, 'review'] = df.review.apply(words_only)
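
To make the effect of this step concrete, here is a small self-contained sketch on a made-up sentence (the sample string is an assumption, not a row from the dataset). Digits, punctuation, and the angle brackets of HTML tags are dropped, while tag names such as `br` survive as ordinary words, which is why `br` is treated as a stop word in the next step.

```python
import re

# Same pattern as in the notebook: runs of ASCII letters only
regex = re.compile(r"[A-Za-z]+")

def words_only(text, regex=regex):
    """Keep only Latin-letter tokens, joined by single spaces."""
    return " ".join(regex.findall(text))

# Hypothetical sample, lowercased first as in the notebook
sample = "Great movie!<br />10/10, would watch again.".lower()
cleaned = words_only(sample)
print(cleaned)
```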

In [36]:
import nltk

In [37]:
# nltk.download()
# Just in case:
# open the Models tab and download punkt

In [38]:
## Remove stop words
from nltk.corpus import stopwords
# A set gives fast membership checks; 'br' is a leftover of <br /> HTML tags
mystopwords = set(stopwords.words('english') + ['br'])

def remove_stopwords(text, mystopwords=mystopwords):
    try:
        return " ".join(token for token in nltk.word_tokenize(text) if token not in mystopwords)
    except (TypeError, AttributeError):
        # Non-string input (e.g. NaN) becomes an empty review
        return ""

df.review = df.review.apply(remove_stopwords)
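
The filtering logic itself is simple and can be tried without NLTK. The sketch below is only an illustration: the tiny stop-word set and the function name `remove_stopwords_simple` are assumptions, and a plain whitespace split stands in for `nltk.word_tokenize`.

```python
# Toy stop-word set (an assumption for illustration;
# the notebook uses NLTK's English list plus 'br')
toy_stopwords = {"the", "a", "is", "this", "and", "br"}

def remove_stopwords_simple(text, stop=toy_stopwords):
    """Drop stop words from an already lowercased, letters-only string."""
    # A whitespace split stands in for nltk.word_tokenize here
    return " ".join(tok for tok in text.split() if tok not in stop)

filtered = remove_stopwords_simple("this is a great movie br and the cast is superb")
print(filtered)
```

Storing the stop words in a set rather than a list keeps each membership check cheap, which matters when the function is applied to tens of thousands of reviews.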

In [39]:
df.review.sample(10)

21607    love science fiction fascinated egyptian mytho...
4182     called remakes good originals one crosses bord...
735      legion mick garris haters feel direct horror f...
4068     bad neither animals eddie murphy anything say ...
24646    sorry charming whimsical film first saw soon f...
4818     wow terrible adaptation beautiful novel gripes...
19646    last year remake hills eyes one better attempt...
19371    one get enjoy gem invisible ray often forget s...
2070     many people beat street inspired lifestyle som...
3591     long defunct prison shut years opened ethan sh...
Name: review, dtype: object

In [40]:
# nltk.download()
# Just in case:
# open the Corpora tab and download wordnet

In [41]:
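## Normalize the text: lemmatization
wnl = nltk.WordNetLemmatizer()
# lemmatize() assumes a noun by default (pos='n'), so the verb form is left unchanged
print(wnl.lemmatize('loved'))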

loved


In [42]:
## Normalize the text: stemming
stemmer = nltk.PorterStemmer()
print(stemmer.stem('pilot'))

pilot


In [43]:
%%time

def normalize(text, method=stemmer):
    # Stem every whitespace-separated token; non-string input becomes an empty string
    try:
        return " ".join(method.stem(word) for word in text.split())
    except AttributeError:
        return ""

df.loc[:, 'review_normed'] = df.review.apply(normalize)

CPU times: user 50 s, sys: 47.5 ms, total: 50 s
Wall time: 50.1 s


In [45]:
df.loc[:, 'review_normed'].sample(10)

8569     old west alway men live breath violenc women h...
19580    realli love sexi action sci fi film sixti actr...
7104     supernatur peter weir thriller truli one haunt...
7482     normal would never rent movi like know go bad ...
8691     yet anoth ventur realm teen gross comedi set c...
24047    ripoff dozen better film particularli steven m...
12391    worst thing crush act pretti bad plot virtual ...
4548     anoth fantasi favorit ralph bakshi watch youtu...
21835    movi terribl first read plot summari look ok w...
12001    tri like program realli even bought pilot film...
Name: review_normed, dtype: object

## Topic modeling
Currently, the most popular topic modeling libraries are:
* [Gensim](https://radimrehurek.com/gensim/)
* [BigARTM](https://bigartm.readthedocs.io/en/stable/index.html)

We will not cover BigARTM today (and perhaps never will), simply because its [examples repository](https://github.com/bigartm/bigartm-book/) already contains plenty of material that thoroughly explains how everything works (for example, [here](https://github.com/bigartm/bigartm-book/blob/master/ARTM_tutorial_Fun.ipynb) or [here](https://github.com/bigartm/bigartm-book/blob/master/ARTM_example_RU.ipynb)).

In [None]:
# !pip install gensim

In [46]:
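from gensim.corpora import Dictionary

# Tokenize each normalized review and map tokens to integer ids
texts = [doc.split() for doc in df.review_normed]
dictionary = Dictionary(texts)
# Each document becomes a sparse list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]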
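
The bag-of-words encoding built here can be sketched in pure Python. This is a simplified illustration, not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and ids are assigned in order of first appearance, which need not match the ids Gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from vocab."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())

# Toy tokenized documents (made up, already stemmed)
toy_texts = [["good", "movi", "good", "plot"], ["bad", "plot"]]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(tokens, vocab) for tokens in toy_texts]
print(vocab)
print(toy_corpus)
```

Each document is reduced to the ids of the tokens it contains and how often each occurs; word order is discarded, which is the assumption LDA works under.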

In [56]:
dictionary[0]

'actual'

In [48]:
corpus[4]

[(0, 1),
 (4, 1),
 (10, 1),
 (23, 1),
 (43, 1),
 (50, 1),
 (56, 1),
 (64, 1),
 (71, 1),
 (82, 1),
 (83, 1),
 (84, 1),
 (89, 1),
 (90, 2),
 (94, 1),
 (102, 1),
 (122, 1),
 (124, 1),
 (136, 1),
 (148, 1),
 (149, 1),
 (151, 1),
 (171, 1),
 (176, 1),
 (178, 1),
 (208, 1),
 (209, 1),
 (217, 1),
 (261, 1),
 (266, 1),
 (286, 1),
 (307, 1),
 (312, 1),
 (318, 1),
 (327, 1),
 (369, 1),
 (419, 1),
 (463, 2),
 (478, 2),
 (488, 1),
 (504, 1),
 (517, 1),
 (518, 1),
 (519, 1),
 (520, 1),
 (521, 1),
 (522, 1),
 (523, 2),
 (524, 1),
 (525, 5),
 (526, 1),
 (527, 1),
 (528, 1),
 (529, 1),
 (530, 1),
 (531, 1),
 (532, 1),
 (533, 1),
 (534, 1),
 (535, 1),
 (536, 1),
 (537, 1),
 (538, 1),
 (539, 1),
 (540, 1),
 (541, 1),
 (542, 1),
 (543, 1),
 (544, 1),
 (545, 1),
 (546, 1),
 (547, 1),
 (548, 1),
 (549, 3),
 (550, 1),
 (551, 1),
 (552, 1),
 (553, 1),
 (554, 1),
 (555, 1),
 (556, 1),
 (557, 2),
 (558, 1),
 (559, 1),
 (560, 1),
 (561, 1),
 (562, 1),
 (563, 1),
 (564, 1),
 (565, 1),
 (566, 1),
 (567, 1),
 (568

In [57]:
from gensim.models import ldamodel

In [60]:
%%time
lda = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, 
                        num_topics=20,
                        alpha='auto', eta='auto', 
                        iterations = 20, passes = 10, 
                        random_state=123)

CPU times: user 2min 4s, sys: 779 ms, total: 2min 5s
Wall time: 1min 53s


In [61]:
## Get the topic descriptions

topics = lda.show_topics(20)
for t in range(20):
    print('==========')
    print(topics[t][1])

0.011*"match" + 0.010*"disney" + 0.009*"footbal" + 0.007*"team" + 0.007*"cinderella" + 0.007*"sport" + 0.007*"south" + 0.006*"girl" + 0.006*"one" + 0.005*"indian"
0.009*"allen" + 0.007*"stori" + 0.007*"one" + 0.007*"beauti" + 0.006*"english" + 0.006*"princ" + 0.006*"woodi" + 0.006*"castl" + 0.005*"rose" + 0.005*"tale"
0.018*"war" + 0.010*"american" + 0.006*"world" + 0.006*"soldier" + 0.006*"german" + 0.005*"polit" + 0.005*"countri" + 0.005*"christian" + 0.005*"documentari" + 0.004*"men"
0.056*"show" + 0.023*"seri" + 0.019*"episod" + 0.017*"tv" + 0.013*"year" + 0.012*"watch" + 0.009*"season" + 0.009*"first" + 0.008*"one" + 0.008*"time"
0.013*"murder" + 0.010*"get" + 0.010*"kill" + 0.009*"killer" + 0.008*"polic" + 0.007*"cop" + 0.007*"car" + 0.006*"one" + 0.006*"thriller" + 0.006*"man"
0.014*"robert" + 0.012*"role" + 0.012*"play" + 0.009*"perform" + 0.009*"cast" + 0.007*"john" + 0.006*"well" + 0.006*"actor" + 0.006*"mr" + 0.005*"book"
0.010*"life" + 0.008*"charact" + 0.007*"one" + 0.007*
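
The formatted strings above are convenient to read but awkward to process. As a sketch, they can be parsed back into (word, weight) pairs with a regular expression; `parse_topic` is a hypothetical helper written for this illustration. Gensim can also return the pairs directly via `lda.show_topics(formatted=False)`.

```python
import re

def parse_topic(description):
    """Turn '0.018*"war" + 0.010*"american"' into [('war', 0.018), ('american', 0.01)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', description)
    return [(word, float(weight)) for weight, word in pairs]

# A shortened topic string in the same format as the output above
topic = '0.018*"war" + 0.010*"american" + 0.006*"world"'
parsed = parse_topic(topic)
print(parsed)
```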

In [62]:
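## Get the document descriptions (per-document topic weights)
from gensim import matutils
# corpus2dense returns a (topics x documents) array; transpose to get one row per document
T = matutils.corpus2dense(lda[corpus], 20).T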

In [80]:
t1 = T[10] # Topic vector of the document at index 10

In [81]:
strongest_topics = np.argsort(t1)

In [82]:
topics[strongest_topics[-1]][1]

'0.018*"war" + 0.010*"american" + 0.006*"world" + 0.006*"soldier" + 0.006*"german" + 0.005*"polit" + 0.005*"countri" + 0.005*"christian" + 0.005*"documentari" + 0.004*"men"'

In [83]:
topics[strongest_topics[-2]][1]

'0.013*"murder" + 0.010*"get" + 0.010*"kill" + 0.009*"killer" + 0.008*"polic" + 0.007*"cop" + 0.007*"car" + 0.006*"one" + 0.006*"thriller" + 0.006*"man"'

In [84]:
df.iloc[10,2]

'happens army wetbacks towelheads godless eastern european commies gather forces south border gary busey kicks butts course another laughable example reagan era cultural fallout bulletproof wastes decent supporting cast headed l q jones thalmus rasulala'
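
The same lookup can be done for all documents at once. The sketch below uses a made-up 3 x 4 document-topic matrix in place of `T`; `top_topics` is a hypothetical helper. One caveat to keep in mind: by default Gensim omits topics whose probability is below a small threshold (`minimum_probability`), so the dense matrix contains zeros for them.

```python
import numpy as np

# Toy document-topic matrix: 3 documents x 4 topics (made-up values)
T_toy = np.array([
    [0.10, 0.60, 0.00, 0.30],
    [0.00, 0.05, 0.90, 0.05],
    [0.25, 0.25, 0.40, 0.10],
])

def top_topics(doc_vector, k=2):
    """Indices of the k largest topic weights, strongest first."""
    return np.argsort(doc_vector)[::-1][:k]

dominant = T_toy.argmax(axis=1)        # strongest topic for every document
top2_doc0 = top_topics(T_toy[0], k=2)  # two strongest topics of document 0
print(dominant, top2_doc0)
```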

## Topic visualization
[pyLDAvis](https://pyldavis.readthedocs.io/en/latest/readme.html)

In [None]:
!pip install pyLDAvis

In [85]:
# Note: newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis
import pyLDAvis

vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)

