Contents of this notebook:
* Text preprocessing: lemmatization and stop-word removal
* Computing TF-IDF scores
* Training a Word2Vec model to compare its speed with TF-IDF
* Text analysis (listing the most and least frequent words)

#### Importing libraries

In [None]:
!pip install pymorphy2



In [None]:
import pymorphy2
import nltk
import sklearn.feature_extraction.text  # import the submodule explicitly; a bare `import sklearn` may not load it
import pandas as pd
import gensim

#### Song lyrics
Three songs by Depeche Mode were chosen for the analysis.

In [None]:
texts = [
'''
Precious and fragile things
Need special handling
My God what have we done to you?
We always tried to share
The tenderest of care
Now look what we have put you through
Things get damaged, things get broken
I thought we'd manage, but words left unspoken
Left us so brittle
There was so little left to give
Angels with silver wings
Shouldn't know suffering
I wish I could take the pain for you
If God has a master plan
That only He understands
I hope it's your eyes He's seeing through
Things get damaged, things get broken
I thought we'd manage, but words left unspoken
Left us so brittle
There was so little left to give
I pray you learn to trust
Have faith in both of us
And keep room in your hearts for two
Things get damaged, things get broken
I thought we'd manage, but words left unspoken
Left us so brittle
There was so little left to give
''',
'''
Reach out, touch faith
Your own personal Jesus
Someone to hear your prayers
Someone who cares
Your own personal Jesus
Someone to hear your prayers
Someone who's there
Feeling unknown
And you're all alone
Flesh and bone
By the telephone
Lift up the receiver
I'll make you a believer
Take second best
Put me to the test
Things on your chest
You need to confess
I will deliver
You know I'm a forgiver
Reach out, touch faith
Reach out, touch faith
Your own personal Jesus
Someone to hear your prayers
Someone who cares
Your own personal Jesus
Someone to hear your prayers
Someone who's there
Feeling unknown
And you're all alone
Flesh and bone
By the telephone
Lift up the receiver
I'll make you a believer
I will deliver
You know I'm a forgiver
Reach out, touch faith
Your own personal Jesus
Reach out, touch faith
Reach out, touch faith
Reach out, touch faith
(Reach out, reach out)
Reach out, touch faith
Reach out and touch faith
''',
'''
You had something to hide
Should have hidden it, shouldn't you?
Now you're not satisfied
With what you're being put through
It's just time to pay the price
For not listening to advice
And deciding in your youth
On the policy of truth
Things could be so different now
It used to be so civilized
You will always wonder how
It could have been if you'd only lied
It's too late to change events
It's time to face the consequence
For delivering the proof
In the policy of truth
Never again
Is what you swore
The time before
Never again
Is what you swore
The time before
Now you're standing there tongue-tied
You better learn your lesson well
Hide what you have to hide
And tell what you have to tell
You'll see your problems multiplied
If you continually decide
To faithfully pursue
The policy of truth
''']

#### Lemmatization


In [None]:
morph = pymorphy2.MorphAnalyzer()
morph

<pymorphy2.analyzer.MorphAnalyzer at 0x7b69d23dc0d0>

In [None]:
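def tokenize(text):
  # Keep only runs of two or more word characters (drops punctuation and one-letter tokens)
  tokenizer = nltk.tokenize.RegexpTokenizer(r'\w{2,}')
  tokenized_text = tokenizer.tokenize(text)
  # pymorphy2 targets Russian morphology; English tokens come back lowercased
  # but otherwise unchanged (note 'things' and 'handling' in the output below)
  return [morph.parse(w)[0].normal_form for w in tokenized_text]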

tokenize(texts[0])[:10]

['precious',
 'and',
 'fragile',
 'things',
 'need',
 'special',
 'handling',
 'my',
 'god',
 'what']
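
To see what the `\w{2,}` pattern keeps and drops, the same rule can be sketched with the standard `re` module. This is only an illustrative stand-in for `RegexpTokenizer`: `simple_tokenize` is a hypothetical helper, and plain lowercasing takes the place of the `normal_form` lookup.

```python
import re


def simple_tokenize(text):
    """Return lowercased runs of two or more word characters."""
    return [token.lower() for token in re.findall(r"\w{2,}", text)]


sample = "I thought we'd manage, but words left unspoken"
# One-letter pieces such as "I" and the "d" of "we'd" do not match the pattern
print(simple_tokenize(sample))
```

Because the apostrophe is not a word character, contractions are split and their one-letter tails are discarded.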

#### Stop words

In [None]:
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words[:10]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

### Applying TF-IDF

In [None]:
tfidf_vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(tokenizer=tokenize, stop_words=stop_words)
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)



In [None]:
# One row per song, one column per vocabulary term
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

document,advice,alone,always,angels,believer,best,better,bone,brittle,broken,...,unknown,unspoken,us,used,well,wings,wish,wonder,words,youth
0,0.0,0.0,0.046501,0.061143,0.0,0.0,0.0,0.0,0.183429,0.183429,...,0.0,0.183429,0.244572,0.0,0.0,0.061143,0.061143,0.0,0.183429,0.0
1,0.0,0.093657,0.0,0.0,0.093657,0.046828,0.0,0.093657,0.0,0.0,...,0.093657,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.103602,0.0,0.078792,0.0,0.0,0.0,0.103602,0.0,0.0,0.0,...,0.0,0.0,0.0,0.103602,0.103602,0.0,0.0,0.103602,0.0,0.103602
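
The weights in the table can be reproduced by hand. The sketch below is a plain-Python approximation of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The function `tfidf` and the toy documents are hypothetical and are not part of the notebook's pipeline.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        # Scale the row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()} if norm else {})
    return rows


# Toy documents (assumed, already tokenized and filtered)
toy_docs = [["things", "get", "broken"], ["reach", "touch", "faith", "faith"], ["things", "truth"]]
for row in tfidf(toy_docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

A term that occurs in several documents gets a smaller IDF, so within one song it ends up with a lower weight than an equally frequent term that is unique to that song.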


### Applying word2vec

In [None]:
def process_texts(texts, tokenizer, stop_words):
  # Tokenize each document with the supplied tokenizer, then drop stop words.
  # (The original ignored the `tokenizer` argument and always called the global `tokenize`.)
  stop_words = set(stop_words)  # set membership checks are O(1)
  return [
    [word for word in tokenizer(text) if word not in stop_words]
    for text in texts
  ]


processed_texts = process_texts(texts, tokenize, stop_words)
processed_texts[0][:10]

['precious',
 'fragile',
 'things',
 'need',
 'special',
 'handling',
 'god',
 'done',
 'always',
 'tried']

In [None]:
# Default hyperparameters; note that min_count=5 drops words seen fewer than 5 times
word2vec = gensim.models.Word2Vec(processed_texts)
word2vec

<gensim.models.word2vec.Word2Vec at 0x7b69d2460220>

In [None]:
# Example of using word2vec: the words most similar to 'get'
word2vec.wv.most_similar(positive=['get'], topn=5)

[('touch', 0.14874057471752167),
 ('left', 0.045399945229291916),
 ('things', 0.03842667490243912),
 ('someone', 0.023778842762112617),
 ('reach', 0.02101920172572136)]
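
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure in plain Python follows; the helper `cosine_similarity` and the toy vectors are hypothetical and stand in for the 100-dimensional vectors stored in `word2vec.wv`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (assumed values, for illustration only)
get_vec = [0.2, -0.1, 0.4]
touch_vec = [0.1, 0.3, 0.2]
print(round(cosine_similarity(get_vec, touch_vec), 3))
```

Values close to 0, like those in the output above, mean the vectors are nearly orthogonal. On a corpus of three songs that is likely a sign that training has barely moved the vectors from their random initialization, so the ranking should not be read as a meaningful semantic result.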

### Conclusions on the two approaches

Keep in mind that TF-IDF relies purely on counting statistics, so it runs quickly and scales well as the volume of text grows.  
Word2vec trains a shallow neural network over several passes through the corpus, so on large volumes of text it takes considerably longer. On this small dataset, however, Word2vec finished almost instantly, which suggests that for small inputs the two approaches run at comparable speed.
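
One way to back the speed comparison with numbers is to time each step with `time.perf_counter`. The helper `time_call` below is a hypothetical sketch, and `sorted` is only a stand-in workload so the snippet runs on its own.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with the call you want to measure
result, seconds = time_call(sorted, range(100_000, 0, -1))
print(f"sorted() took {seconds:.4f} s")
```

In this notebook the same helper could wrap `tfidf_vectorizer.fit_transform(texts)` and `gensim.models.Word2Vec(processed_texts)`. A single run on three songs is noisy, so repeating the measurement several times would give a fairer picture.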

### Analysis: most and least frequent words

In [None]:
# Sum each word's TF-IDF weight over the three songs and sort in descending order
word_freq_srs = tfidf_df.sum(axis=0).sort_values(ascending=False)
word_freq_tuples = list(zip(word_freq_srs.index, word_freq_srs))

print('Most frequent words: (word: tfidf)', *[f'* {word}: {tfidf:.2f}' for word, tfidf in word_freq_tuples[:10]], sep='\n')
# Note: the slice [-1:-10:-1] walks back from the last entry and yields 9 words, not 10
print('Least frequent words: (word: tfidf)', *[f'* {word}: {tfidf:.2f}' for word, tfidf in word_freq_tuples[-1:-10:-1]], sep='\n')

Most frequent words: (word: tfidf)
* left: 0.55
* reach: 0.52
* touch: 0.42
* time: 0.41
* someone: 0.37
* faith: 0.37
* get: 0.37
* things: 0.34
* truth: 0.31
* hide: 0.31
Least frequent words: (word: tfidf)
* confess: 0.05
* second: 0.05
* chest: 0.05
* test: 0.05
* best: 0.05
* plan: 0.06
* tenderest: 0.06
* handling: 0.06
* pray: 0.06
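
The ranking above uses summed TF-IDF weights, which are normalized per song and therefore only a proxy for how often a word occurs. If raw occurrence counts are wanted, they can be taken directly from the token lists. A minimal sketch follows; `word_counts` and the toy token lists are hypothetical stand-ins for `processed_texts`.

```python
from collections import Counter


def word_counts(token_lists):
    """Count how many times each token occurs across all documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


# Toy token lists (assumed, already tokenized and filtered)
toy_texts = [
    ["reach", "touch", "faith", "reach"],
    ["things", "get", "broken", "things"],
    ["faith"],
]
counts = word_counts(toy_texts)
print(counts.most_common(3))
```

`Counter.most_common(n)` returns the `n` most frequent tokens. For the least frequent ones, `counts.most_common()[:-n-1:-1]` walks the full list from the end.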
