# Работа со строковыми значениями

__Автор задач: Блохин Н.В. (NVBlokhin@fa.ru)__

Материалы:
* Макрушин С.В. Лекция "Работа со строковыми значениям"
* https://pyformat.info/
* https://docs.python.org/3/library/re.html
    * https://docs.python.org/3/library/re.html#flags
    * https://docs.python.org/3/library/re.html#functions
* https://pythonru.com/primery/primery-primeneniya-regulyarnyh-vyrazheniy-v-python
* https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/
* https://realpython.com/nltk-nlp-python/

## Задачи для совместного разбора

1. Вывести на экран данные из словаря `obj` построчно в виде `k = v`, задав формат таким образом, чтобы знак равенства оказался на одной и той же позиции во всех строках. Строковые литералы обернуть в кавычки.

In [1]:
obj = {
    "home_page": "https://github.com/pypa/sampleproject",
    "keywords": "sample setuptools development",
    "license": "MIT",
}

2. Написать регулярное выражение,которое позволит найти номера групп студентов.

In [4]:
obj = pd.Series(["Евгения гр.ПМ19-1", "Илья пм 20-4", "Анна 20-3"])
obj

0    Евгения гр.ПМ19-1
1         Илья пм 20-4
2            Анна 20-3
dtype: object

3. Разбейте текст формулировки задачи 2 на слова.

## Лабораторная работа 6

In [4]:
import pandas as pd
from bs4 import BeautifulSoup

### Форматирование строк

1\. Загрузите данные из файла `recipes_sample.csv` (__ЛР2__) в виде `pd.DataFrame` `recipes` При помощи форматирования строк выведите информацию об id рецепта и времени выполнения 5 случайных рецептов в виде таблицы следующего вида:

    
    |      id      |  minutes  |
    |--------------------------|
    |    61178     |    65     |
    |    202352    |    80     |
    |    364322    |    150    |
    |    26177     |    20     |
    |    224785    |    35     |
    
Обратите внимание, что ширина столбцов заранее неизвестна и должна рассчитываться динамически, в зависимости от тех данных, которые были выбраны. 

In [3]:
df = pd.read_csv('recipes_sample.csv')
df_5 = df.sample(5)
max_len_id = max(df_5['id'].astype(str).apply(len))
max_len_minutes = max(df_5['minutes'].astype(str).apply(len))

id = 'id'
minutes = 'minutes'

print(f'|{id:^{max_len_id + 8}}|{minutes:^{max_len_minutes + 8}}|')
print('|' + '-'*(8 + 9 + len(id) + len(minutes))+ '|')

for index, row in df_5.iterrows():
  id = row['id']
  minutes = row['minutes']
  print(f'|{id:^{max_len_id + 8}}|{minutes:^{max_len_minutes + 8}}|')

|      id      |  minutes  |
|--------------------------|
|    40343     |    75     |
|    96220     |    135    |
|    392229    |    75     |
|    499449    |    27     |
|    406099    |    70     |


2\. Напишите функцию `show_info`, которая по данным о рецепте создает строку (в смысле объекта python) с описанием следующего вида:

```
"Название Из Нескольких Слов"

1. Шаг 1
2. Шаг 2
----------
Автор: contributor_id
Среднее время приготовления: minutes минут
```

    
Данные для создания строки получите из файлов `recipes_sample.csv` (__ЛР2__) и `steps_sample.xml` (__ЛР3__). 
Вызовите данную функцию для рецепта с id `170895` и выведите (через `print`) полученную строку на экран.

In [5]:
df_recipes = pd.read_csv('recipes_sample.csv')
with open('steps_sample.xml') as f:
  ab = BeautifulSoup(f, 'xml')

steps_dict = {}

for recipe in ab.find_all('recipe'):
  recipe_id = int(recipe.id.text)
  steps = [step.text for step in recipe.find_all('step')]
  steps_dict[recipe_id] = steps

In [6]:
# сбор значений атрибутов для recipe_id = 170895
row = df_recipes[df_recipes['id'] == 170895]

name = row['name'].iloc[0]
minutes = row['minutes'].iloc[0]
author_id = row['contributor_id'].iloc[0]
steps = steps_dict[170895]

In [7]:
def show_info(name: str, steps: list, minutes: int, author_id: int) -> str:
  s = ''
  s += f'"{name.title()}"\n\n'

  for index, step in enumerate(steps):
    s += f'{index+1}. {step.capitalize()}\n'
  s += '-' * 10 + '\n'
  s += f'Автор: {author_id}\n'
  s += f'Среднее время приготовления: {minutes} минут\n'


  return s

In [8]:
print(show_info(name, steps, minutes, author_id))

"Leeks And Parsnips  Sauteed Or Creamed"

1. Clean the leeks and discard the dark green portions
2. Cut the leeks lengthwise then into one-inch pieces
3. Melt the butter in a medium skillet , med
4. Heat
5. Add the garlic and fry 'til fragrant
6. Add leeks and fry until the leeks are tender , about 6-minutes
7. Meanwhile , peel and chunk the parsnips into one-inch pieces
8. Place in a steaming basket and steam 'til they are as tender as you prefer
9. I like them fork-tender
10. Drain parsnips and add to the skillet with the leeks
11. Add salt and pepper
12. Gently sautee together for 5-minutes
13. At this point you can serve it , or continue on and cream it:
14. In a jar with a screw top , add the half-n-half and arrowroot
15. Shake 'til blended
16. Turn heat to low under the leeks and parsnips
17. Pour in the arrowroot mixture , stirring gently as you pour
18. If too thick , gradually add the water
19. Let simmer for a couple of minutes
20. Taste to adjust seasoning , probably an addi

## Работа с регулярными выражениями

In [10]:
import re

3\. Напишите регулярное выражение, которое ищет следующий паттерн в строке: число (1 цифра или более), затем пробел, затем слова: hour или hours или minute или minutes. Произведите поиск по данному регулярному выражению в каждом шаге рецепта с id 25082. Выведите на экран все непустые результаты, найденные по данному шаблону.

In [11]:
for index, step in enumerate(steps_dict[25082]):
  res = re.findall(r'[1-9]{1}[0-9]+ hour[s]?|[1-9]{1}[0-9]+ minute[s]?', step)
  if len(res) != 0 :
    print(f'Шаг {index+1}: {res}')

Шаг 6: ['20 minutes']
Шаг 8: ['10 minutes']
Шаг 14: ['10 minutes']
Шаг 17: ['20 minutes', '30 minutes']


4\. Напишите регулярное выражение, которое ищет шаблон вида "this..., but" _в начале строки_ . Между словом "this" и частью ", but" может находиться произвольное число букв, цифр, знаков подчеркивания и пробелов. Никаких других символов вместо многоточия быть не может. Пробел между запятой и словом "but" может присутствовать или отсутствовать.

Используя строковые методы `pd.Series`, выясните, для каких рецептов данный шаблон содержится в тексте описания. Выведите на экран количество таких рецептов и 3 примера подходящих описаний (текст описания должен быть виден на экране полностью).

In [12]:
df_recipes['description'].isna().sum()

623

In [14]:
df_recipes['description'].fillna(' ', inplace=True)
df_4 = df[df_recipes['description'].str.contains('^this[\w\d\s]+,[ ]?but', regex=True)]
print('Количество подходящих описаний:', df_4.shape[0])

Количество подходящих описаний: 134


In [15]:
pd.set_option('max_colwidth', int(df_4['description'].apply(len).max() + 10))
df_4['description'].sample(3)

21528                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              this grilled chicken gets great flavor from the outdoor grill, but cook time outside is quick because it is partially cooked beforehand.  marinating time is not included in the prep or cook time.
24245                                                                                                                                                            

5\. В текстах шагов рецептов обыкновенные дроби имеют вид "a / b". Используя регулярные выражения, уберите в тексте шагов рецепта с id 72367 пробелы до и после символа дроби. Выведите на экран шаги этого рецепта после их изменения.

In [16]:
for step in steps_dict[72367]:
  print(re.sub(r'\d+ / \d+', '\1/\2', step, count=0))

mix butter , flour , / c
sugar and 1-/ t
vanilla
press into greased 9" springform pan
mix cream cheese , / c
sugar , eggs and / t
vanilla beating until fluffy
pour over dough
combine apples , / c
sugar and cinnamon
arrange on top of cream cheese mixture and sprinkle with almonds
bake at 350 for 45-55 minutes , or until tester comes out clean


### Сегментация текста

In [21]:
import nltk

In [22]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /Users/ilya/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/ilya/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/ilya/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/ilya/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/ilya/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to /Users/ilya/nltk_data

True

6\. Разбейте тексты шагов рецептов на слова при помощи пакета `nltk`. Посчитайте и выведите на экран кол-во уникальных слов среди всех рецептов. Словом называется любая последовательность алфавитных символов (для проверки можно воспользоваться `str.isalpha`). При подсчете количества уникальных слов не учитывайте регистр.

In [32]:
arr = steps_dict.values()
flat_arr = [item.lower() for sublist in arr for item in sublist]

flat_arr[:3]

['in 1 / 4 cup butter , saute carrots , onion , celery and broccoli stems for 5 minutes',
 'add thyme , oregano and basil',
 'saute 5 minutes more']

In [33]:
from nltk.tokenize.toktok import ToktokTokenizer
toktok = ToktokTokenizer()
res = toktok.tokenize(flat_arr)

print('Количество уникальных слов:', len(set(filter(str.isalpha, res))))

Количество уникальных слов: 14953


7\. Разбейте описания рецептов из `recipes` на предложения при помощи пакета `nltk`. Найдите 5 самых длинных описаний (по количеству _предложений_) рецептов в датасете и выведите строки фрейма, соответствующие этим рецептами, в порядке убывания длины.

In [35]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
df_recipes['sentence_count'] = df_recipes['description'].apply(lambda x : len(tokenizer.tokenize(x))) 
df_recipes.sort_values('sentence_count', ascending=False).head(5)

Unnamed: 0,name,id,minutes,contributor_id,submitted,n_steps,description,n_ingredients,sentence_count
18408,my favorite buttercream icing for decorating,334113,30,681465,2008-10-30,12.0,"this wonderful icing is used for icing cakes and cookies as well as for borders and art work on cakes. it makes a delicious filling also between the layers of cakes and under fondant icing. you can make roses but it takes 3 or more days to dry them depending on the humidity. \r\n\r\nthere are many versions of “buttercream” icing. some are made with eggs and all butter. some varieties, you have to cook your sugar to a softball stage. others are 100% shortening or a combination of shortening and butter.\r\n\r\neach decorator has his or her favorite. i personally think that the best taste and textured recipe is the one that has you cook your sugar, add to whipped eggs and use pounds of butter per batch. but…. i live in a state that can easily be a 100 degrees for days on end during the summer and you know what butter does on hot days. it melts! ...",,76
481,alligator claws avocado fritters with chipotle lime dip,287008,45,765354,2008-02-19,,"a translucent golden-brown crust allows the green of the avocado to be seen. the crispy exterior is a counterpoint to the unctuous interior. these are a signature dish for me, and the one i most often get requests to make (although my seafood and ricotta stuffed buckwheat pancakes run a close second).\r\n\r\nthese fritters came about ten years ago when i was shopping for a dinner i was making for a friend who is a cia-trained chef. i was in a vegetable market and saw these gorgeous avocados that i just knew would be ripe in the next two days. i tried to think of what i could do with them since a) everyone serves cold avocado, and b) i really am not fond of guacamole. as i tried to think of what i could make with them that was hot, the work 'fritters' jumped into my head. having never made a fritter before, i was a little surprised to have tha...",9.0,27
22566,rich barley mushroom soup,328708,60,221776,2008-10-03,,"this is one of the best soups i've ever made and it is even worthy of company. so simple, yet rich in deep, mushroomy flavor. the inspiration was zaar #26877, a delicious mushroom rice casserole. i found i couldn't stop eating the liquid before putting the casserole into the oven and that gave me the idea that the base would make a delicious soup. and it does! \r\nuse plenty of fresh mushrooms. i buy them when they are marked 1/2 price at the grocery, as this is a good way to use your 'shrooms that are starting to get dark. it is the soy sauce that transforms the broth from ho-hum to yum. i try to use low sodium or home-made no sodium chicken broth so that i can use the soy for the sodium. there is no sense of ""asian"" in this soup at all. ( i would not make this without the soy. ) just a little bit adds the depth of flavor and even color...",10.0,24
6779,chocolate tea,205348,6,428824,2007-01-14,,"i wrote this because there are an astounding lack of chocolate tea recipes on the internet. \r\n\r\n the first time i heard about chocolate tea was doing a web search on chocolate. there seem to be a few companies out there who sell chocolate tea. i like to stay up late and had run out of coffee. i was in real need for a good tasting caffene beverage. i first thought chocolate tea would be yucky. we are conditioned to accept chocolate with coffee as a rule but not tea. i was very mistaken! \r\n\r\n tea and chocolate goes very well with each other and it is also very good for your body. both tea and chocolate are loaded with antioxidents. you may however not want to give this to small children because of the caffene. \r\n\r\n not having a recipe to follow, i created one. (this one) i used these ingredients because i had them on hand and it was quick...",,23
16296,little bunny foo foo cake carrot cake with cream cheese frosti,316000,68,689540,2008-07-27,14.0,"the first time i made this cake i grated a million pounds of carrots on a knucklebuster. then they invented cuisinarts! now it is much faster to shred the carrots on a fine shredding disk and no bloody knuckles! i have baked it in 8"", 9"", 9x13"" pans so if you want to experiment with pan size it works. one thing i found was baking and stacking the three layers is tricky. my favorite way is two 8"" pans for a nice layer cake and an 8"" square pan to put into the freezer for unexpected company. i hope you try this wonderful cake. update: in the spirit of carrot cake stories, this cake was invented by a bunny named foo-foo. he is very famous and even has a hit song which goes like this: sing to the tune of 'down by the station'.......... \r\n\r\n\r\n little bunny foo foo,\r\nhopping through the forest,\r\nscooping up the field mice,\r\nand bo...",,23


8\. Напишите функцию, которая для заданного предложения выводит информацию о частях речи слов, входящих в предложение, в следующем виде:
```
PRP   VBD   DT      NNS     CC   VBD      NNS        RB   
 I  omitted the raspberries and added strawberries instead
``` 
Для определения части речи слова можно воспользоваться `nltk.pos_tag`.

Проверьте работоспособность функции на названии рецепта с id 241106.

Обратите внимание, что часть речи должна находиться ровно посередине над соотвествующим словом, а между самими словами должен быть ровно один пробел.


In [28]:
from nltk import pos_tag
from nltk import word_tokenize

In [29]:
def tag(sentence: str) -> str:
  tokenizer = word_tokenize(sentence)
  tagged_list = pos_tag(tokenizer)

  res_word = ''
  res_tag = ''

  for elem in tagged_list:
    word, tag = elem[0], elem[1]
    n_spaces = abs(len(word) - len(tag))
    if len(word) >= len(tag):
      tag = ' ' * (n_spaces//2) + tag + ' ' * (n_spaces - n_spaces//2)
    else:
      word = ' ' * (n_spaces//2) + tag + ' ' * (n_spaces - n_spaces//2)
      
    res_tag += tag + ' '
    res_word += word + ' '

  return '\n'.join([res_tag, res_word])

In [30]:
sentence = df_recipes[df_recipes['id'] == 241106]['name'].tolist()[0]
print(tag(sentence))

   JJ     NNS    IN     NNS    VBP    JJ   CC   JJ    NNS   
eggplant steaks with chickpeas feta cheese and black olives 
