# Introduction to Natural Language Processing

Materials:
* Makrushin S.V. Lecture 9: Introduction to natural language text processing
* https://realpython.com/nltk-nlp-python/
* https://scikit-learn.org/stable/modules/feature_extraction.html

## Problems to work through together

In [2]:
pip install pymorphy2


In [4]:
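from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import WordNetLemmatizer, SnowballStemmer
# Import only the name that is actually used instead of a wildcard import.
from nltk.metrics import edit_distance
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pymorphy2
import pandas as pd
import nltk
import random
from functools import reduce
nltk.download('wordnet')
nltk.download('stopwords')
# NOTE: this stemmer targets Russian text; the recipe descriptions used in the
# lab below are in English, where SnowballStemmer("english") would be the match.
stemmer = SnowballStemmer("russian")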

[nltk_data] Downloading package wordnet to /Users/macbook/nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/macbook/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


1. Read the words from the file `litw-win.txt` and store them in the list `words`. In the given sentence, fix all typos by replacing each misspelled word with the closest word from `words` (closest in terms of Levenshtein distance). Treat a word as misspelled if it does not occur in `words`.

2. Split the text from the statement of task 1 into words; apply stemming and lemmatization to the words.

3. Convert the sentences from the statement of task 1 into vectors using `CountVectorizer`.
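
Task 3 relies on the bag-of-words model. As a rough illustration of what `CountVectorizer` computes, the sketch below builds the count matrix by hand in plain Python. The regular expression imitates scikit-learn's default token pattern (lowercased words of two or more characters); the helper name `count_vectors` and the sample sentences are hypothetical and not part of the assignment.

```python
import re
from collections import Counter

# Imitates CountVectorizer's default tokenization: words of 2+ characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def count_vectors(sentences):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in sentence i."""
    tokenized = [TOKEN_RE.findall(s.lower()) for s in sentences]
    # The vocabulary is the sorted set of all tokens, one column per term.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

sentences = ["the cat sat on the mat", "the dog sat"]  # hypothetical input
vocabulary, matrix = count_vectors(sentences)
print(vocabulary)
print(matrix)
```

Each row has one entry per vocabulary term, so sentences of different lengths become vectors of the same dimension.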

## Lab 9

### Edit distance

1.1 Load the preprocessed recipe descriptions from the file `preprocessed_descriptions.csv`. Build the set of unique words `words` that occur in the recipe descriptions (use `word_tokenize` from `nltk`).

In [6]:
# Work with a random-sized prefix of the dataset (between 301 and 500 rows).
n = random.randint(301, 500)
preprocessed_descriptions = pd.read_csv("preprocessed_descriptions.csv")[:n]
preprocessed_descriptions.head()
# Tokenize every non-missing description and flatten the tokens into one list.
words = [
    token
    for item in preprocessed_descriptions["preprocessed_descriptions"].to_list()
    if isinstance(item, str)
    for token in word_tokenize(item)
]
words[:10]

['an',
 'original',
 'recipe',
 'created',
 'by',
 'chef',
 'scott',
 'meskan',
 'george',
 's']

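1.2 Generate 5 pairs of randomly chosen words and compute the edit distance between the words in each pair.

In [7]:
words = list(set(words))  # keep unique words only
# random.sample picks two distinct words; each pair is stored as "word1 word2".
# Tokens contain no spaces, so split() recovers both words.
pairs = [' '.join(random.sample(words, k=2)) for _ in range(5)]
print(pairs)
distances = [edit_distance(*pair.split()) for pair in pairs]
distances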

['bevvie balanced', '10 football', 'tortilla true', 'supposed cordial', 'circuit substituted']


[6, 8, 6, 8, 9]

1.3 Write a function that, for a given word `word`, returns the `k` words from the list `words` that are closest to it (closeness is measured by Levenshtein distance).

In [8]:
def k_nearest(word: str, k: int = 1):
    """Return the k words from `words` with the smallest edit distance to `word`."""
    ranked = sorted(words, key=lambda w: edit_distance(w, word))
    return ranked[:k]

k_nearest('check', k=3)

['check', 'chuck', 'cheap']
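
To make the distance behind this ranking concrete, here is a minimal pure-Python sketch of the classic Levenshtein dynamic program (unit cost for insertion, deletion and substitution, which is what `edit_distance` computes with its default arguments), together with a nearest-word typo corrector in the spirit of task 1 from the joint problems. The function names `levenshtein` and `correct` and the tiny vocabulary are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

def correct(sentence, vocabulary):
    """Replace each token missing from vocabulary with its nearest known word."""
    known = set(vocabulary)
    fixed = []
    for token in sentence.split():
        if token in known:
            fixed.append(token)
        else:
            fixed.append(min(vocabulary, key=lambda w: levenshtein(w, token)))
    return " ".join(fixed)

vocabulary = ["check", "chuck", "cheap", "recipe", "the"]  # hypothetical
print(levenshtein("kitten", "sitting"))        # 3
print(correct("chek the recipi", vocabulary))  # check the recipe
```

Only two rows of the distance table are kept at a time, so memory grows with the length of the second word rather than with the product of both lengths.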

### Stemming and lemmatization

2.1 Based on the results of 1.1, create a `pd.DataFrame` with the columns:
    * word
    * stemmed_word
    * normalized_word

Use the `word` column as the index.

Use `SnowballStemmer` for stemming and `WordNetLemmatizer` for normalization. Compare the results of stemming and lemmatization.

In [10]:
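lemmatizer = WordNetLemmatizer()

# NOTE: `stemmer` was created as SnowballStemmer("russian"), while the
# descriptions are in English, so the stemmed column below equals the input.
# SnowballStemmer("english") would be the matching choice for this data.
stemmed_words = [stemmer.stem(word) for word in words]
normalized_words = [lemmatizer.lemmatize(word) for word in words]
# Only rows 10..19 are kept so that the table stays small.
df = pd.DataFrame(
    list(zip(words, stemmed_words, normalized_words))[10:20],
    columns=['word', 'stemmed_word', 'normalized_word'],
)
df = df.set_index('word')  # the task asks for `word` as the index
df.head()

| word | stemmed_word | normalized_word |
|---|---|---|
| glasses | glasses | glass |
| party | party | party |
| 2009 | 2009 | 2009 |
| how | how | how |
| carts | carts | cart |

In this sample the lemmatizer reduces plurals to the singular (`glasses` to `glass`, `carts` to `cart`), whereas the stemmed column is identical to the input because the stemmer was configured for Russian.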


2.2 Remove the stop words from the recipe descriptions. What share of the total number of words did the stop words make up? Compare the top 10 most frequent words before and after removing the stop words.

In [11]:
# stopwords.words() without a language argument returns the stop words of every
# language bundled with NLTK; a set makes the membership test fast.
st_w = set(stopwords.words())
texts = ' '.join(item for item in preprocessed_descriptions["preprocessed_descriptions"].to_list() if isinstance(item, str))
tokens = word_tokenize(texts)
total = len(tokens)
kept = 0              # number of tokens that are not stop words
words_dict = {}       # counts over all tokens
words_dict_stop = {}  # counts with stop words removed
for w in tokens:
    if w not in st_w:
        words_dict_stop[w] = words_dict_stop.get(w, 0) + 1
        kept += 1
    words_dict[w] = words_dict.get(w, 0) + 1


print(f'Share of stop words: {(total - kept) / total}')
print(f'Top 10 words with stop words: {"; ".join(sorted(words_dict.keys(), key=lambda x: words_dict[x], reverse=True)[:10])}')
print(f'Top 10 words without stop words: {"; ".join(sorted(words_dict_stop.keys(), key=lambda x: words_dict_stop[x], reverse=True)[:10])}')

Share of stop words: 0.5832258064516129
Top 10 words with stop words: the; a; i; and; this; it; to; is; of; for
Top 10 words without stop words: recipe; make; easy; great; time; made; dish; delicious; bread; soup
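
The same bookkeeping can be written more compactly with `collections.Counter`. The sketch below is one possible variant, not the notebook's method; the helper name `stop_word_stats`, the toy sentence and the four-word stop list are hypothetical, and the token list is assumed to be non-empty.

```python
from collections import Counter

def stop_word_stats(tokens, stop_words, top_n=10):
    """Return (share of stop words, top words before, top words after removal)."""
    stop_words = set(stop_words)
    all_counts = Counter(tokens)
    kept_counts = Counter(t for t in tokens if t not in stop_words)
    # Share of tokens that were removed as stop words (tokens must be non-empty).
    share = 1 - sum(kept_counts.values()) / len(tokens)
    return share, all_counts.most_common(top_n), kept_counts.most_common(top_n)

tokens = "this is the best soup recipe and the soup is easy".split()
stop_words = ["this", "is", "the", "and"]  # hypothetical tiny stop list
share, top_all, top_kept = stop_word_stats(tokens, stop_words, top_n=2)
print(share)
print(top_all)
print(top_kept)
```

`Counter.most_common` returns `(word, count)` pairs, so the frequencies are available alongside the ranking.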


### Vector representation of text

3.1 Randomly select 5 recipes from the dataset. Represent the description of each recipe as a numeric vector using `TfidfVectorizer`.

In [12]:
# Drop rows without a description first: TfidfVectorizer cannot handle NaN.
data = preprocessed_descriptions.dropna(subset=['preprocessed_descriptions']).sample(5)
print(data)

# Fit TF-IDF on the five sampled descriptions and vectorize them in one step.
vectorizer = TfidfVectorizer()
sent_vec = vectorizer.fit_transform(data['preprocessed_descriptions']).toarray()
for i, recipe in enumerate(data['preprocessed_descriptions']):
    print("Recipe:\n", recipe)
    print("Vector:\n", sent_vec[i])
    print()

                                        name  \
330  actual pf chang s mongolian beef recipe   
74                            big mac  pizza   
13   blepandekager   danish   apple pancakes   
376                   african style broccoli   
238                     7 layer b  l  t  dip   

                             preprocessed_descriptions  
330  this is the actual recipe of the mongolian bee...  
74   this cheeseburger pizza is so different from t...  
13   this recipe has been posted here for play in z...  
376  i found this on the web after searching for af...  
238  there are many layered dip recipes out there  ...  
Recipe:
 this is the actual recipe of the mongolian beef at pf changs  enjoy  found on pf chang s website  http   www pfchangs com chefscorner recipes gluten free mongolian beef recipe pdf
Vector:
 [0.         0.17920832 0.         0.         0.         0.
 0.         0.         0.1200179  0.         0.         0.35841665
 0.         0.         0.         0.         
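
To show where such weights come from, the sketch below recomputes TF-IDF by hand under scikit-learn's default settings as I understand them: raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. Tokenization is simplified to whitespace splitting, which differs from `TfidfVectorizer` (it drops one-character tokens); the helper name `tfidf` and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [d.split() for d in docs]  # simplification: whitespace tokens
    vocabulary = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF, assumed to match TfidfVectorizer(smooth_idf=True).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard empty documents
        rows.append([x / norm for x in row])
    return vocabulary, rows

docs = ["beef recipe", "pizza recipe"]  # hypothetical toy corpus
vocabulary, rows = tfidf(docs)
print(vocabulary)
for row in rows:
    print([round(x, 3) for x in row])
```

A term that occurs in every document (`recipe` here) receives the smallest IDF, so it contributes less to each vector than the terms that distinguish the documents.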

3.2 Compute the similarity between every pair of recipes selected in task 3.1 using the cosine distance (`scipy.spatial.distance.cosine`). Present the results as a `pd.DataFrame` table. Use the recipe names as row and column labels.

In [13]:
import itertools
from scipy.spatial import distance

max_pair = None
max_result = -1

# Re-vectorize the same five recipes, this time dropping English stop words.
vectorizer3 = TfidfVectorizer(analyzer="word", stop_words="english")
texts = data["preprocessed_descriptions"].to_list()
names = data["name"].to_list()
transform3 = vectorizer3.fit_transform(texts)

all_data = list(zip(texts, transform3.toarray()))

coeff_dict = {}
for (text1, vec1), (text2, vec2) in itertools.product(all_data, repeat=2):
    # scipy returns the cosine distance; 1 - distance is the cosine similarity.
    similarity = 1 - distance.cosine(vec1, vec2)
    coeff_dict.setdefault(text1, []).append(similarity)

    # Track the most similar pair of different recipes for task 3.3.
    if similarity > max_result and text1 != text2:
        max_result = similarity
        max_pair = (text1, text2)

    print(f"{text1}\n{text2}\n{similarity}\n")

df_final2 = pd.DataFrame.from_dict(coeff_dict)
# The task asks for recipe names, not descriptions, as row and column labels.
df_final2.columns = names
df_final2.index = names
df_final2

this is the actual recipe of the mongolian beef at pf changs  enjoy  found on pf chang s website  http   www pfchangs com chefscorner recipes gluten free mongolian beef recipe pdf
this is the actual recipe of the mongolian beef at pf changs  enjoy  found on pf chang s website  http   www pfchangs com chefscorner recipes gluten free mongolian beef recipe pdf
1

this is the actual recipe of the mongolian beef at pf changs  enjoy  found on pf chang s website  http   www pfchangs com chefscorner recipes gluten free mongolian beef recipe pdf
this cheeseburger pizza is so different from the others on  zaar  i renamed it  big mac  because it tastes like a big mac only better   my son christopher was served this as an appetizer at a bar  very surprising how good 
0.0

this is the actual recipe of the mongolian beef at pf changs  enjoy  found on pf chang s website  http   www pfchangs com chefscorner recipes gluten free mongolian beef recipe pdf
this recipe has been posted here for play in zwt9

| | actual pf chang s mongolian beef recipe | big mac pizza | blepandekager danish apple pancakes | african style broccoli | 7 layer b l t dip |
|---|---|---|---|---|---|
| actual pf chang s mongolian beef recipe | 1.0 | 0.0 | 0.234507 | 0.040489 | 0.054197 |
| big mac pizza | 0.0 | 1.0 | 0.0 | 0.037339 | 0.0 |
| blepandekager danish apple pancakes | 0.234507 | 0.0 | 1.0 | 0.062987 | 0.027637 |
| african style broccoli | 0.040489 | 0.037339 | 0.062987 | 1.0 | 0.025048 |
| 7 layer b l t dip | 0.054197 | 0.0 | 0.027637 | 0.025048 | 1.0 |
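
The values in the table are cosine similarities, that is, one minus the cosine distance that `scipy.spatial.distance.cosine` returns. The relationship can be sketched in plain Python as follows; the function names and the two sample vectors are hypothetical, and both vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """One minus the cosine similarity, the quantity scipy calls cosine distance."""
    return 1 - cosine_similarity(u, v)

u = [1.0, 0.0, 1.0]  # hypothetical vectors
v = [1.0, 1.0, 0.0]
print(cosine_similarity(u, v))
print(cosine_distance(u, v))
```

For non-negative TF-IDF vectors the similarity lies between 0 (no shared terms) and 1 (same direction), which is why the diagonal of the table is 1.0.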


3.3 Which recipes are the most similar? Comment on the result (in words).

In [14]:
print(f"The most similar pair of descriptions in the sample above:\n\n{max_pair[0]}\n\n{max_pair[1]}\n\n{max_result}")

The most similar pair of descriptions in the sample above:

this is the actual recipe of the mongolian beef at pf changs  enjoy  found on pf chang s website  http   www pfchangs com chefscorner recipes gluten free mongolian beef recipe pdf

this recipe has been posted here for play in zwt9   scandinavia   this recipe was found at website  mindspring com   christian s danish recipes 

0.2345070476524006

The similarity of about 0.23 is low in absolute terms. It appears to come from generic tokens that both descriptions share, such as `recipe`, `recipes`, `website` and `com`, rather than from any resemblance between the dishes themselves.
