# Part-of-Speech Tagging for Russian

In this lesson, we're going to learn about the textual analysis methods *part-of-speech tagging* and *keyword extraction*. These methods will help us computationally parse sentences and better understand words in context.

---

## spaCy and Natural Language Processing (NLP)

To computationally identify parts of speech, we're going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.

To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. If you've used the preprocessing or named entity recognition notebooks for this language, you can skip the steps for installing spaCy and downloading the language model.

## Install spaCy

To use spaCy, we first need to install the library. 

Russian models are only available starting in spaCy 3.0. 

If you run into errors because spaCy 2.x is installed, you can run `!pip uninstall spacy -y` first, then run the cell below.

In [None]:
!pip install -U "spacy>=3.0"

## Import Libraries

Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [2]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)


We're also going to import the `Counter` class from the `collections` module for counting nouns, verbs, adjectives, etc., and the `pandas` library for organizing and displaying data. The two `pd.set_option()` calls raise the pandas default limits on how many rows are displayed and how wide a column can be.
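
As a quick illustration of what `Counter` does, here is a minimal sketch that tallies a short list of POS labels. The list is hand-written for illustration and is not output from the model.

```python
from collections import Counter

# Hand-written POS labels for illustration only (not model output)
pos_labels = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]

# Counter tallies how many times each item appears in the list
pos_tally = Counter(pos_labels)

# .most_common() returns (item, count) pairs, most frequent first
print(pos_tally.most_common())
```

We'll use the same `.most_common()` call later to rank adjectives, nouns, and verbs.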

## Download Language Model

Next we need to download the Russian-language model (`ru_core_news_md`), which will be processing and making predictions about our texts. You can read more about the corpus it was trained on, on the [spaCy model page](https://spacy.io/models/ru). You can download the `ru_core_news_md` model by running the cell below:

In [None]:
!python -m spacy download ru_core_news_md

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian, as well as Russian starting in spaCy 3.0. Which languages have their own trained models changes between spaCy releases, so check that page for the current list. spaCy also offers language and tokenization support for a number of languages with external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

## Load Language Model

Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.

In [3]:
nlp = spacy.load('ru_core_news_md')

## Create a Processed spaCy Document

Whenever we use spaCy, our first step will be to create a processed spaCy `document` with the loaded NLP model `nlp()`. Most of the heavy NLP lifting is done in this line of code. After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

In [4]:
filepath = '../texts/other-languages/ru.txt'
with open(filepath, encoding='utf-8') as file:
    text = file.read()
document = nlp(text)

## spaCy Part-of-Speech Tagging
The tags that spaCy uses for part-of-speech are based on work done by [Universal Dependencies](https://universaldependencies.org/), an effort to create a set of part-of-speech tags that work across many different languages. Texts from various languages are annotated using this common set of tags, and contributed to a common repository that can be used to train models like the ones spaCy provides.

The Universal Dependencies page has information about the annotated corpora available for each language; it's worth looking into the corpora that were annotated for your language.

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝             |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     |                                               |


Above is a POS chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy's POS tagging in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) on our sample `document` with the `style=` parameter set to `"dep"` (short for dependency parsing).
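
Here is a minimal sketch of what a `displacy` call with `style="dep"` can look like. So that it runs without downloading a model, it builds a tiny, hand-annotated stand-in document instead of using the `document` from this lesson. The sentence, tags, and dependency labels are assumptions written by hand for illustration, not model output.

```python
from spacy import displacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated stand-in sentence ("She did not love the garden"); not model output
words = ["Она", "не", "любила", "сад"]
pos = ["PRON", "PART", "VERB", "NOUN"]
deps = ["nsubj", "advmod", "ROOT", "obj"]
heads = [2, 2, 2, 2]  # index of each token's head; the ROOT token points to itself

example_doc = Doc(Vocab(), words=words, pos=pos, heads=heads, deps=deps)

# jupyter=False makes displacy return the SVG markup as a string
# instead of trying to draw the diagram inline
svg = displacy.render(example_doc, style="dep", jupyter=False)
print(svg[:60])
```

In a notebook, passing a single sentence, for example `displacy.render(list(document.sents)[0], style="dep")`, should draw the diagram inline and is usually easier to read than a diagram of the whole text.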

## Get Part-Of-Speech Tags

To get part-of-speech tags for every word in a document, we have to iterate through all the tokens in the document and pull out the `.pos_` attribute for each token. We'll also pull out the `.lemma_` attribute, which gives us the un-inflected (dictionary) form of the word, and the `.dep_` attribute, which gives us the token's dependency relation to the other words in its sentence.


In [5]:
for token in document:
    print(token.lemma_, token.pos_, token.dep_)

яблони PROPN nsubj
цвести VERB ROOT
* PUNCT punct


 SPACE conj
i PROPN appos

 SPACE punct
зачем ADV advmod
она PRON nsubj
так ADV advmod
сделать VERB ROOT
, PUNCT punct
что SCONJ mark
я PRON nsubj
не PART advmod
уметь VERB ccomp
жить VERB xcomp
без ADP case
нее PRON obl
? PUNCT punct
это PRON obj
она PRON nsubj
сделать VERB ROOT
, PUNCT punct
я PRON nsubj
не PART advmod
виноватый ADJ conj
… PUNCT punct


 SPACE parataxis
я PRON nsubj
написать VERB ROOT
это PRON obj
– PUNCT punct
и CCONJ cc
мне PRON iobj
стать VERB parataxis
странный ADJ xcomp
. PUNCT punct
говорить VERB ROOT
точно ADV advmod
о ADP case
возлюбленной NOUN obl
. PUNCT punct
но CCONJ cc
возлюбленной NOUN nsubj
у ADP case
меня PRON obl
нет VERB ROOT
. PUNCT punct
это PRON obj
мой DET det
мать NOUN nsubj
сделать VERB ROOT
так ADV advmod
, PUNCT punct
что SCONJ mark
я PRON nsubj
умирать VERB advcl
без ADP case
нее PRON obl
. PUNCT punct
если SCONJ mark
человек NOUN obj
держать VERB advcl
в ADP case
тепло NOUN obl
весь DET d

. PUNCT punct
и CCONJ cc
непременно ADV advmod
ты PRON nsubj
хотеть VERB ROOT
бы AUX aux
скрыть VERB xcomp
этот DET det
счастие NOUN obj
, PUNCT punct
потому ADV mark
что SCONJ fixed
тебе PRON iobj
стыдно ADV advcl
и CCONJ cc
казаться VERB conj
, PUNCT punct
что SCONJ mark
ты PRON nsubj
один DET amod
так ADV advmod
чувствовать VERB ccomp
и CCONJ advmod
другие ADJ nsubj
не PART advmod
понять VERB conj
, PUNCT punct
а CCONJ cc
между ADP case
тем PRON obl
все PART advmod
так ADV advmod
же PART advmod
чувствовать VERB conj
– PUNCT punct
и CCONJ cc
каждый DET nsubj
стыдиться VERB parataxis
, PUNCT punct
думать VERB advcl
, PUNCT punct
что SCONJ mark
он PRON nsubj
один DET ccomp
. PUNCT punct
у ADP case
каждый DET obl
быть VERB ROOT
светлый ADJ amod
облако NOUN nsubj
в ADP case
прошлое NOUN nmod
. PUNCT punct


 SPACE ROOT
и CCONJ cc
у ADP case
меня PRON obl
оно PRON nsubj
быть VERB conj
, PUNCT punct
как SCONJ case
у ADP case
всех PRON obl
" PUNCT punct
. PUNCT punct


 SPACE ROOT
ii ADJ ap

она PRON nsubj
же PART advmod
сама ADJ acl
прежде ADV advmod
это PRON obl
не PART advmod
бояться VERB advcl
? PUNCT punct


 SPACE ROOT
да PART cc
и CCONJ cc
что SCONJ mark
тут ADV advmod
нехороший ADJ conj
? PUNCT punct
пусть PART advmod
я PRON nsubj
похожий ADJ ROOT
на ADP case
женщина NOUN obl
! PUNCT punct
я PRON nsubj
любить VERB ROOT
прежде ADV advmod
всего PRON fixed
весь DET det
красивый ADJ obj
– PUNCT punct
но CCONJ cc
без ADP case
суровость NOUN conj
, PUNCT punct
без ADP case
сила NOUN conj
, PUNCT punct
а CCONJ cc
нежный ADJ conj
и CCONJ cc
простой ADJ conj
. PUNCT punct
я PRON nsubj
не PART advmod
виноватый ADJ ROOT
, PUNCT punct
что SCONJ mark
я PRON nsubj
такой DET det
… PUNCT punct


 SPACE ccomp
мама NOUN nsubj
долго ADV advmod
просить VERB ROOT
у ADP case
меня PRON obl
прощение NOUN obj
, PUNCT punct
и CCONJ cc
мы PRON nsubj
помириться VERB conj
. PUNCT punct
но CCONJ cc
я PRON nsubj
не PART advmod
забыть VERB ROOT
её DET det
слово NOUN obj
и CCONJ cc
часто ADV advmo

знать VERB ROOT
, PUNCT punct
что SCONJ mark
тут ADV advmod
мы PRON nsubj
не PART advmod
понимать VERB ccomp
друг PRON obj
друга PRON fixed
, PUNCT punct
только PART advmod
тут ADV advmod
. PUNCT punct
она PRON nsubj
даже PART advmod
сад NOUN nsubj
наш DET obj
не PART advmod
любить VERB ROOT
, PUNCT punct
– PUNCT punct
гулять VERB nsubj
ходить VERB parataxis
по ADP case
улица NOUN obl
, PUNCT punct
говорить VERB conj
, PUNCT punct
что SCONJ mark
солнечный ADJ amod
свет NOUN nsubj
гораздо ADV advmod
беспокойный ADJ ccomp
полутьма NOUN obl
гостиная NOUN nmod
, PUNCT punct
а CCONJ cc
её DET det
дух NOUN nsubj
хороший ADJ conj
запах NOUN obl
настоящий ADJ amod
весна NOUN nmod
. PUNCT punct


 SPACE ROOT
когда SCONJ mark
я PRON nsubj
об ADP case
это PRON obl
думать VERB advcl
, PUNCT punct
приходило VERB conj
невольный ADJ amod
чувство NOUN nsubj
на ADP case
минута NOUN nmod
, PUNCT punct
что SCONJ mark
она PRON nsubj
не PART advmod
молодой ADJ acl
, PUNCT punct
что SCONJ mark
она PRON nsub

воздух NOUN obj
, PUNCT punct
пища NOUN conj
, PUNCT punct
– PUNCT punct
а CCONJ cc
жить VERB xcomp
без ADP case
них PRON obl
не PART advmod
мочь VERB conj
. PUNCT punct
я PRON nsubj
знать VERB ROOT
, PUNCT punct
что SCONJ mark
я PRON nsubj
слабый ADJ amod
, PUNCT punct
слабый ADJ conj
человек NOUN ccomp
. PUNCT punct
у ADP case
меня PRON obl
нет VERB ROOT
сила NOUN nsubj
сделать VERB nmod
против ADP case
себя PRON obl
, PUNCT punct
не PART advmod
страдать VERB conj
, PUNCT punct
когда SCONJ mark
я PRON nsubj
страдать VERB advcl
… PUNCT punct


 SPACE conj
я PRON nsubj
был AUX cop
последний ADJ amod
год NOUN nsubj
в ADP case
москва PROPN obl
, PUNCT punct
когда SCONJ mark
заболеть VERB advcl
отец NOUN nsubj
, PUNCT punct
и CCONJ cc
мама NOUN nsubj
уехать VERB conj
перед ADP case
рождество PROPN obl
. PUNCT punct
а CCONJ cc
в ADP case
конец NOUN obl
январь NOUN nmod
я PRON nsubj
заболеть VERB ROOT
сам ADJ acl
, PUNCT punct
бросить VERB conj
все PRON obj
и CCONJ cc
приехать VERB conj
дом

то PRON nsubj
проволочься VERB conj
по ADP case
земля NOUN obl
– PUNCT punct
и CCONJ cc
затихнуть VERB conj
. PUNCT punct
я PRON nsubj
мочь VERB ROOT
бы AUX aux
встать VERB xcomp
и CCONJ cc
посмотреть VERB conj
, PUNCT punct
что SCONJ mark
это PRON nsubj
такой DET ccomp
, PUNCT punct
– PUNCT punct
изгородь NOUN nsubj
приходиться VERB parataxis
мне PRON iobj
в ADP case
одном NUM nummod
место NOUN obl
чуть ADV advmod
высоко ADJ advmod
пояс NOUN obl
, PUNCT punct
– PUNCT punct
но CCONJ cc
я PRON nsubj
подумать VERB conj
: PUNCT punct
" PUNCT punct
ну PART parataxis
, PUNCT punct
все ADV nsubj
равный ADJ fixed
, PUNCT punct
не PART advmod
стоять VERB parataxis
" PUNCT punct
. PUNCT punct
весенний ADJ amod
воздух NOUN nsubj
утомить VERB ROOT
меня PRON obj
, PUNCT punct
мне PRON iobj
хотеться VERB conj
быть AUX cop
спокойный ADJ xcomp
и CCONJ cc
дремать VERB conj
. PUNCT punct


 SPACE ROOT
но CCONJ cc
опять ADV advmod
повториться VERB conj
шорох NOUN nsubj
и CCONJ cc
умолкнуть VERB conj
сра

глаз NOUN nsubj
тоже ADV advmod
бледный ADJ conj
, PUNCT punct
но CCONJ cc
прозрачный ADJ conj
, PUNCT punct
как SCONJ case
чистый ADJ amod
вода NOUN obl
. PUNCT punct
я PRON nsubj
не PART advmod
знать VERB ROOT
, PUNCT punct
какой DET det
цвет NOUN obj
были AUX cop
этот DET det
глаз NOUN nsubj
. PUNCT punct
я PRON nsubj
думать VERB ROOT
, PUNCT punct
в ADP case
полдень NOUN obl
, PUNCT punct
когда SCONJ mark
небо NOUN nsubj
очень ADV advmod
синий ADJ acl:relcl
, PUNCT punct
они PRON nsubj
темнеть VERB ccomp
. PUNCT punct


 SPACE ROOT
так ADV advmod
– PUNCT punct
я PRON nsubj
помнить VERB parataxis
весь DET det
черта NOUN obj
, PUNCT punct
помнить VERB conj
тоненький ADJ amod
, PUNCT punct
прямой ADJ conj
полоска NOUN obj
бровь NOUN nmod
, PUNCT punct
светло ADJ conj
- ADJ conj
розовый ADJ conj
, PUNCT punct
сжать VERB amod
губа NOUN conj
, PUNCT punct
а CCONJ cc
весь DET det
лицо NOUN nsubj
ускользать VERB conj
от ADP case
меня PRON obl
. PUNCT punct
и CCONJ cc
я PRON nsubj
почти ADV

туман NOUN obl
, PUNCT punct
в ADP case
котором PRON obl
я PRON nsubj
жить VERB acl:relcl
последний ADJ amod
день NOUN obj
, PUNCT punct
я PRON nsubj
беспокоиться VERB conj
, PUNCT punct
мне PRON iobj
недоставать VERB conj
что PRON nsubj
- PRON obj
то PRON nsubj
– PUNCT punct
я PRON nsubj
редко ADV advmod
видеть VERB parataxis
мама NOUN obj
. PUNCT punct
я PRON nsubj
ей PRON iobj
не PART advmod
хотеть VERB ROOT
ничто PRON obj
рассказывать VERB xcomp
– PUNCT punct
ведь SCONJ mark
она PRON nsubj
не PART advmod
любить VERB parataxis
сад NOUN obj
, PUNCT punct
а CCONJ cc
это PRON nsubj
только PART advmod
о ADP case
сад NOUN conj
. PUNCT punct
но CCONJ cc
мама NOUN nsubj
была AUX cop
мне PRON iobj
нужный ADJ ROOT
, PUNCT punct
как SCONJ mark
я PRON nsubj
сам ADJ acl
. PUNCT punct
я PRON nsubj
теперь ADV advmod
только PART advmod
понять VERB ROOT
, PUNCT punct
что SCONJ mark
она PRON nsubj
чувствовать VERB ccomp
не PART advmod
то PRON obj
, PUNCT punct
что SCONJ mark
я PRON nsubj
, PUNCT pun

никогда ADV advmod
не PART advmod
лгать VERB conj
. PUNCT punct
я PRON nsubj
думать VERB ROOT
, PUNCT punct
она PRON nsubj
увидала VERB ccomp
, PUNCT punct
что SCONJ mark
мне PRON iobj
тяжёлый ADJ ccomp
. PUNCT punct


 SPACE ROOT
– PUNCT punct
ну PART parataxis
, PUNCT punct
хороший ADJ parataxis
, PUNCT punct
– PUNCT punct
перебить VERB parataxis
она PRON nsubj
, PUNCT punct
– PUNCT punct
ты PRON nsubj
и PART advmod
сам ADJ amod
, PUNCT punct
казаться VERB parataxis
, PUNCT punct
не PART advmod
подозревать VERB parataxis
, PUNCT punct
что SCONJ mark
это PRON nsubj
так ADV advmod
случиться VERB ccomp
. PUNCT punct
но CCONJ cc
помни NOUN ROOT
, PUNCT punct
володя VERB parataxis
. PUNCT punct
наш DET det
отношение NOUN nsubj
не PART advmod
таковы ADJ ROOT
. PUNCT punct
я PRON nsubj
не PART advmod
мочь VERB ROOT
никогда ADV advmod
и CCONJ cc
не PART advmod
мочь VERB conj
быть AUX cop
пассивно ADV amod
- ADJ amod
нежный ADJ amod
мать NOUN xcomp
. PUNCT punct
я PRON nsubj
тебе PRON iobj
жи

, PUNCT punct
что PRON nsubj
сильный ADJ ccomp
меня PRON obl
– PUNCT punct
я PRON nsubj
покориться VERB parataxis
и CCONJ advmod
тут ADV advmod
неизбежный ADJ xcomp
. PUNCT punct
я PRON nsubj
еще ADV advmod
раз NOUN obl
посмотреть VERB ROOT
на ADP case
солнце NOUN obl
, PUNCT punct
слабо ADV advmod
улыбнуться VERB conj
и CCONJ cc
, PUNCT punct
не PART advmod
оглядываться VERB advcl
, PUNCT punct
не PART advmod
останавливаться VERB advcl
, PUNCT punct
пойти VERB conj
в ADP case
сад NOUN obl
. PUNCT punct


 SPACE ROOT
ix PROPN appos

 SPACE punct
когда SCONJ mark
я PRON nsubj
захлопнуть VERB ROOT
за ADP case
себя PRON obl
калитка NOUN obj
и CCONJ cc
сделать VERB conj
несколько NUM nummod:gov
шаг NOUN obj
вглубь ADV advmod
– PUNCT punct
я PRON nsubj
вдруг ADV advmod
ожить VERB parataxis
. PUNCT punct
ожить VERB ROOT
и CCONJ cc
все PRON obj
забыть VERB conj
. PUNCT punct
с ADP case
каждый DET det
секунда NOUN obl
мне PRON iobj
делаться VERB ROOT
лёгкий ADJ advmod
и CCONJ cc
радостный ADJ 

хотеться VERB ROOT
, PUNCT punct
кроме ADP case
того PRON obl
, PUNCT punct
что PRON nsubj
быть VERB acl
. PUNCT punct
я PRON nsubj
думать VERB ROOT
, PUNCT punct
это PRON nsubj
и PART advmod
есть AUX cop
счастие NOUN ccomp
. PUNCT punct
весь DET det
сад NOUN nsubj
наполнялся VERB ROOT
новый ADJ amod
, PUNCT punct
сильный ADJ conj
аромат NOUN obl
. PUNCT punct
месяц NOUN nsubj
никнул VERB ROOT
и CCONJ cc
уходить VERB conj
с ADP case
небо NOUN obl
, PUNCT punct
но CCONJ cc
яблони VERB nsubj
не PART advmod
темнеть VERB conj
. PUNCT punct
они PRON nsubj
были AUX cop
белый ADJ ROOT
не PART advmod
от ADP case
лунный ADJ amod
свет NOUN obl
. PUNCT punct


 SPACE ROOT
когда SCONJ mark
пройти VERB advcl
время NOUN nsubj
, PUNCT punct
и CCONJ cc
весь DET det
круг NOUN xcomp
нас PRON obj
стать VERB conj
ясный ADJ xcomp
и CCONJ cc
холодный ADJ conj
, PUNCT punct
небо NOUN nsubj
позеленело VERB conj
, PUNCT punct
и CCONJ cc
утренний ADJ amod
сумерки NOUN nsubj
спуститься VERB conj
, PUNCT punct
– 

были AUX ROOT
с ADP case
нею PRON obl
удивительный ADJ amod
отношение NOUN nsubj
… PUNCT punct


 SPACE ROOT
я PRON nsubj
неестественно ADV advmod
рассмеяться VERB conj
и CCONJ cc
сказать VERB conj
: PUNCT punct


 SPACE parataxis
– PUNCT punct
да PART parataxis
, PUNCT punct
да PART parataxis
, PUNCT punct
вы PRON nsubj
правый ADJ conj
… PUNCT punct


 SPACE punct
и CCONJ cc
сам ADJ acl
пожать VERB ROOT
рука NOUN obj
господин NOUN iobj
. PUNCT punct


 SPACE ROOT
потом ADV advmod
ее PRON obj
похоронить VERB conj
, PUNCT punct
и CCONJ cc
я PRON nsubj
уехать VERB conj
. PUNCT punct
зачем ADV advmod
бы AUX aux
я PRON nsubj
там ADV advmod
остаться VERB ROOT
? PUNCT punct
о ADP case
март PROPN obl
я PRON nsubj
не PART advmod
спросить VERB ROOT
, PUNCT punct
в ADP case
сад NOUN obl
не PART advmod
заходить VERB conj
… PUNCT punct


 SPACE conj
и CCONJ cc
пройти VERB conj
сколько ADV nummod:gov
- ADV nummod:gov
то DET det
год NOUN obl
. PUNCT punct
я PRON nsubj
не PART advmod
помнить VERB ROO

## Practicing with the Example Text
When working with languages that have inflection, we typically use `token.lemma_` instead of `token.text` like you'll find in the English examples. This is important when we're counting, so that differently-inflected forms of a word (e.g. masculine vs. feminine or singular vs. plural) aren't counted as if they were different words.

In [6]:
filepath = "../texts/other-languages/ru.txt"
with open(filepath, encoding="utf-8") as file:
    document = nlp(file.read())
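
To see why counting lemmas matters for an inflected language, here is a minimal sketch that tallies the same words twice: once by surface form and once by lemma. The (surface form, lemma) pairs are hand-written for illustration and are not output from the model.

```python
from collections import Counter

# Hand-written (surface form, lemma) pairs for illustration; not model output.
# "сад", "сада", and "саду" are all inflected forms of the noun "сад" (garden).
pairs = [
    ("сад", "сад"),
    ("сада", "сад"),
    ("саду", "сад"),
    ("саду", "сад"),
    ("мама", "мама"),
    ("маму", "мама"),
]

surface_tally = Counter(surface for surface, lemma in pairs)
lemma_tally = Counter(lemma for surface, lemma in pairs)

print(surface_tally.most_common())  # each inflected form is counted separately
print(lemma_tally.most_common())    # all forms are grouped under one lemma
```

Counting by surface form splits the six tokens across five different entries, while counting by lemma groups them under two.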

## Get Adjectives

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |

To extract and count the adjectives in the example text, we will follow the same model as above, except we'll add an `if` statement that will pull out words only if their POS label matches "ADJ".

```{admonition} Python Review!
:class: pythonreview
While we demonstrate how to extract parts of speech in the sections below, we're also going to reinforce some integral Python skills. Notice how we use `for` loops and `if` statements to `.append()` specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.
```

Here we make a list of the adjectives identified in the example text:

In [7]:
adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.lemma_)

In [8]:
adjs

['виноватый',
 'странный',
 'неодетый',
 'двадцатиградусный',
 'хороший',
 'лёгкий',
 'последний',
 'важный',
 'значительный',
 'прямой',
 'унылый',
 'быстрый',
 'однообразный',
 'нужный',
 'старый',
 'сам',
 'красивый',
 'нежный',
 'похожий',
 'больший',
 'жёлтый',
 'тёмный',
 'должный',
 'последний',
 'пошлый',
 'трусливый',
 'хороший',
 'порядочный',
 'хороший',
 'длинный',
 'трудный',
 'холодный',
 'самые',
 'обыденный',
 'скучный',
 'прекрасный',
 'сентиментальный',
 'трудный',
 'тихий',
 'глубокий',
 'пожилой',
 'первый',
 'толковый',
 'маленький',
 'простой',
 'малороссийский',
 'далёкий',
 'тёмный',
 'тёплый',
 'милый',
 'невольный',
 'радостный',
 'другие',
 'светлый',
 'ii',
 'скучный',
 'самого',
 'счастливый',
 'должный',
 'печальный',
 'пожилой',
 'другой',
 'молодой',
 'тоненький',
 'больший',
 'чёрный',
 'свежий',
 'блестящий',
 'быстрый',
 'странный',
 'самую',
 'ранний',
 'необычный',
 'светлый',
 'другого',
 'хороший',
 'низкий',
 'тёмный',
 'равный',
 'большой',
 'но

Then we count the unique adjectives in this list with the `Counter` class:

In [9]:
adjs_tally = Counter(adjs)

In [10]:
adjs_tally.most_common()

[('должный', 18),
 ('хороший', 17),
 ('сам', 13),
 ('странный', 11),
 ('первый', 11),
 ('красивый', 10),
 ('равный', 8),
 ('сама', 8),
 ('тёмный', 7),
 ('холодный', 7),
 ('молодой', 7),
 ('последний', 6),
 ('нужный', 6),
 ('похожий', 6),
 ('простой', 6),
 ('другие', 6),
 ('светлый', 6),
 ('большой', 6),
 ('больший', 5),
 ('жёлтый', 5),
 ('чистый', 5),
 ('чужой', 5),
 ('бледный', 5),
 ('нежный', 4),
 ('трудный', 4),
 ('маленький', 4),
 ('рад', 4),
 ('весенний', 4),
 ('белый', 4),
 ('-', 4),
 ('розовый', 4),
 ('тяжёлый', 4),
 ('лёгкий', 3),
 ('пошлый', 3),
 ('длинный', 3),
 ('скучный', 3),
 ('глубокий', 3),
 ('чёрный', 3),
 ('новый', 3),
 ('удивительный', 3),
 ('самое', 3),
 ('умный', 3),
 ('слабый', 3),
 ('лишний', 3),
 ('серый', 3),
 ('страшный', 3),
 ('высоко', 3),
 ('спокойный', 3),
 ('неясный', 3),
 ('сильный', 3),
 ('близкий', 3),
 ('виноватый', 2),
 ('прямой', 2),
 ('быстрый', 2),
 ('старый', 2),
 ('тихий', 2),
 ('пожилой', 2),
 ('далёкий', 2),
 ('невольный', 2),
 ('радостный', 2)

Then we make a dataframe from this list:

In [11]:
df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]

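|    | adj      | count |
|:--:|:--------:|:-----:|
| 0  | должный  | 18    |
| 1  | хороший  | 17    |
| 2  | сам      | 13    |
| 3  | странный | 11    |
| 4  | первый   | 11    |
| 5  | красивый | 10    |
| 6  | равный   | 8     |
| 7  | сама     | 8     |
| 8  | тёмный   | 7     |
| 9  | холодный | 7     |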
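
The full adjective tally also contains a few entries that aren't really words, such as `'-'`, which the model tagged as ADJ inside hyphenated adjectives. Here is a minimal sketch of one way to drop such entries after counting. It uses a small sample of (lemma, count) pairs copied from the tally, and `cleaned_tally` is a name introduced here for illustration.

```python
from collections import Counter

# A few (lemma, count) pairs copied from the adjective tally in this lesson
sample_tally = Counter({"должный": 18, "хороший": 17, "розовый": 4, "-": 4})

# Keep only lemmas made up entirely of letters; str.isalpha() is False for "-"
cleaned_tally = Counter(
    {lemma: count for lemma, count in sample_tally.items() if lemma.isalpha()}
)

print(cleaned_tally.most_common())
```

In the notebook, a similar filter could be applied while building the list by also checking `token.is_alpha` in the `if` statement.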


## Get Nouns

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |

To extract and count nouns, we can follow the same model as above, except we will change our `if` statement to check for POS labels that match "NOUN".

In [12]:
nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.lemma_)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]

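|    | noun    | count |
|:--:|:-------:|:-----:|
| 0  | сад     | 33    |
| 1  | мама    | 31    |
| 2  | глаз    | 15    |
| 3  | небо    | 15    |
| 4  | день    | 14    |
| 5  | раз     | 13    |
| 6  | слово   | 12    |
| 7  | человек | 11    |
| 8  | год     | 10    |
| 9  | солнце  | 10    |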


## Get Verbs

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| VERB  | verb                      | run, runs, running, eat, ate, eating          |

To extract and count verbs, we can follow a model similar to the examples above. This time, however, we're going to make our code even more economical and efficient (while still changing our `if` statement to match the POS label "VERB").

```{admonition} Python Review!
:class: pythonreview
We can use a [*list comprehension*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Python/More-Lists-Loops.html#List-Comprehensions) to get our list of verbs in a single line of code! Closely examine the first line of code below:
```

In [13]:
verbs = [token.lemma_ for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]

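|    | verb     | count |
|:--:|:--------:|:-----:|
| 0  | мочь     | 49    |
| 1  | знать    | 47    |
| 2  | быть     | 46    |
| 3  | сказать  | 37    |
| 4  | говорить | 31    |
| 5  | думать   | 27    |
| 6  | любить   | 25    |
| 7  | стать    | 24    |
| 8  | жить     | 20    |
| 9  | казаться | 20    |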
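
The adjective, noun, and verb examples all repeat the same pattern with a different POS label, so the pattern could be wrapped in one small function. The sketch below is one possible way to do that. `count_lemmas_by_pos()` is a hypothetical helper (not part of spaCy), and the stand-in tokens only mimic the `.lemma_` and `.pos_` attributes so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

def count_lemmas_by_pos(tokens, pos_label):
    """Tally the lemmas of all tokens whose .pos_ equals pos_label."""
    return Counter(token.lemma_ for token in tokens if token.pos_ == pos_label)

# Stand-in tokens so the sketch runs without a model;
# in the notebook, pass the spaCy `document` instead
stand_in_tokens = [
    SimpleNamespace(lemma_="мама", pos_="NOUN"),
    SimpleNamespace(lemma_="любить", pos_="VERB"),
    SimpleNamespace(lemma_="сад", pos_="NOUN"),
    SimpleNamespace(lemma_="сад", pos_="NOUN"),
]

print(count_lemmas_by_pos(stand_in_tokens, "NOUN").most_common())
```

With the loaded model, a call such as `count_lemmas_by_pos(document, "VERB")` should give the same tally as the list comprehension and `Counter` steps used for verbs.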


# Keyword Extraction

## Get Sentences with Keyword

spaCy can also identify sentences in a document. To access sentences, we can iterate through `document.sents` and pull out the `.text` of each sentence.

We can use spaCy's sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below. Note that the function assumes that the keyword provided will be exactly the same as it appears in the text.

With the function `find_sentences_with_keyword()`, we will iterate through `document.sents` and pull out any sentence that contains a particular "keyword". Then we will display these sentences with the keywords bolded.

In [14]:
import re
from IPython.display import Markdown, display

In [15]:
def find_sentences_with_keyword(keyword, document):
    
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            
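            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            #(re.escape treats the keyword as literal text; the lambda keeps the capitalization found in the sentence)
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(re.escape(keyword), lambda match: f"**{match.group(0)}**", sentence, flags=re.IGNORECASE)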
            
            display(Markdown(sentence))

In [16]:
find_sentences_with_keyword(keyword="хороший", document=document)

и если не **хороший**, то порядочный, как говорят.
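
Because `find_sentences_with_keyword()` matches the literal string, a search for `хороший` only finds sentences that contain that exact form, even though the lemma `хороший` was counted many more times in the adjective tally. One possible workaround is to compare lemmas instead of raw text. The sketch below is an illustration under stated assumptions: `find_sentences_with_lemma()` is a hypothetical helper, and the stand-in document only mimics the `.sents`, `.text`, and `.lemma_` attributes so the example runs without a model.

```python
from types import SimpleNamespace

def find_sentences_with_lemma(lemma, document):
    """Return the text of every sentence containing a token with the given lemma."""
    lemma = lemma.lower()
    matches = []
    for sentence in document.sents:
        if any(token.lemma_.lower() == lemma for token in sentence):
            matches.append(sentence.text.replace("\n", " ").strip())
    return matches

# Stand-ins that mimic the parts of spaCy's API used above (hypothetical data)
class StandInSentence(list):
    """A list of tokens that also exposes .text, like a spaCy sentence span."""
    @property
    def text(self):
        return " ".join(token.text for token in self)

def make_sentence(pairs):
    return StandInSentence(
        SimpleNamespace(text=text, lemma_=lemma) for text, lemma in pairs
    )

stand_in_document = SimpleNamespace(sents=[
    make_sentence([("хорошая", "хороший"), ("весна", "весна")]),
    make_sentence([("тёмный", "тёмный"), ("сад", "сад")]),
])

print(find_sentences_with_lemma("хороший", stand_in_document))
```

Here the first stand-in sentence is returned even though it contains the feminine form `хорошая` rather than the string `хороший`.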

## Get Keyword in Context

We can also find out about a keyword's more immediate context — its neighboring words to the left and right — and we can fine-tune our search with POS tagging.

To do so, we will first create a list of what's called *ngrams*. "Ngrams" are any sequence of *n* tokens in a text. They're an important concept in computational linguistics and NLP. (Have you ever played with [Google's *Ngram* Viewer](https://books.google.com/ngrams)?)

Below we're going to make a list of *bigrams*, that is, all the two-word combinations from the sample text. We're going to use these bigrams to find the neighboring words that appear alongside particular keywords.

In [17]:
#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

In [18]:
#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
        
    return ngrams

In [19]:
bigrams = get_bigrams(tokens_and_labels)

Let's take a peek at the bigrams:

In [20]:
bigrams[5:20]

[[('так', 'ADV'), ('сделала', 'VERB')],
 [('сделала', 'VERB'), ('что', 'SCONJ')],
 [('что', 'SCONJ'), ('я', 'PRON')],
 [('я', 'PRON'), ('не', 'PART')],
 [('не', 'PART'), ('умею', 'VERB')],
 [('умею', 'VERB'), ('жить', 'VERB')],
 [('жить', 'VERB'), ('без', 'ADP')],
 [('без', 'ADP'), ('нее', 'PRON')],
 [('нее', 'PRON'), ('Это', 'PRON')],
 [('Это', 'PRON'), ('она', 'PRON')],
 [('она', 'PRON'), ('сделала', 'VERB')],
 [('сделала', 'VERB'), ('я', 'PRON')],
 [('я', 'PRON'), ('не', 'PART')],
 [('не', 'PART'), ('виноват', 'ADJ')],
 [('виноват', 'ADJ'), ('Я', 'PRON')]]
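
Notice that one of the bigrams above pairs `нее` with `Это`, even though a question mark separates them in the text. Because punctuation tokens were filtered out before pairing, some bigrams cross sentence boundaries. If that matters for your analysis, one option is to build bigrams sentence by sentence. The sketch below assumes the words have already been grouped into one list per sentence. `get_bigrams_within_sentences()` is a hypothetical helper, and the two word lists are shortened, hand-split examples.

```python
def get_bigrams_within_sentences(sentences):
    """Build bigrams separately for each sentence so no pair crosses a boundary."""
    bigrams = []
    for words in sentences:
        # zip pairs each word with the word that follows it
        bigrams.extend(zip(words, words[1:]))
    return bigrams

# Shortened, hand-split word lists for illustration (hypothetical input)
sentences = [
    ["жить", "без", "нее"],
    ["Это", "она", "сделала"],
]

print(get_bigrams_within_sentences(sentences))
```

The items in each list can be anything, including the `(token.text, token.pos_)` tuples used above. In the notebook, the per-sentence lists could be built by looping over `document.sents`.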

Now that we have our list of bigrams, we're going to make a function `get_neighbor_words()`. This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the `pos_label` parameter.

In [21]:
def get_neighbor_words(keyword, bigrams, pos_label = None):
    
    neighbor_words = []
    keyword = keyword.lower()
    
    for bigram in bigrams:
        
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        
        #Check to see if keyword is in the bigram
        if keyword in words:
            
            for word, label in bigram:
                
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label is None:
                        neighbor_words.append(word.lower())
    
    return Counter(neighbor_words).most_common()

In [25]:
get_neighbor_words("сад", bigrams)

[('в', 9),
 ('и', 6),
 ('не', 4),
 ('там', 2),
 ('как', 2),
 ('теплый', 1),
 ('чистый', 1),
 ('к', 1),
 ('мой', 1),
 ('оживший', 1),
 ('то', 1),
 ('я', 1),
 ('люблю', 1),
 ('увидеть', 1),
 ('весну', 1),
 ('ix', 1),
 ('весь', 1),
 ('наполнялся', 1)]

In [24]:
get_neighbor_words("сад", bigrams, pos_label='VERB')

[('оживший', 1), ('люблю', 1), ('увидеть', 1), ('наполнялся', 1)]
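
`get_neighbor_words()` pools the words that come before and after the keyword. If the direction matters, a variation along these lines could tally them separately. This is only a sketch: `get_left_right_neighbors()` is a hypothetical helper, not part of the lesson's code or of spaCy, and the sample bigrams are copied from the `bigrams[5:20]` output shown earlier.

```python
from collections import Counter

def get_left_right_neighbors(keyword, bigrams):
    """Count the words directly before and directly after a keyword."""
    keyword = keyword.lower()
    left, right = Counter(), Counter()
    for (first_word, _), (second_word, _) in bigrams:
        if second_word.lower() == keyword:
            left[first_word.lower()] += 1
        if first_word.lower() == keyword:
            right[second_word.lower()] += 1
    return left, right

# A few bigrams copied from the `bigrams[5:20]` output
sample_bigrams = [
    [("я", "PRON"), ("не", "PART")],
    [("не", "PART"), ("умею", "VERB")],
    [("я", "PRON"), ("не", "PART")],
    [("не", "PART"), ("виноват", "ADJ")],
]

left, right = get_left_right_neighbors("не", sample_bigrams)
print(left.most_common())   # words that appear directly before the keyword
print(right.most_common())  # words that appear directly after the keyword
```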

## Your Turn!

Try out `find_sentences_with_keyword()` and `get_neighbor_words()` with your own keywords of interest.

In [None]:
find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)

In [None]:
get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)