# Seminar 01. Jupyter, Python, NumPy, NLTK

[Dr. Constantine Korikov, Huawei](mailto:constantine.korikov@huawei.com)

Dr. Valentin Malykh, Huawei

## 1. Jupyter notebook

### 1.1.  What is Jupyter
[Jupyter Notebook Users Manual](https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Users%20Manual.ipynb)
![](https://jupyter.org/assets/labpreview.png)

Jupyter is a de facto standard in programming education. A notebook consists of cells, which can be of different types:
1. code
2. markdown
3. raw

Under the hood, a notebook is a JSON document that stores the cells together with metadata. For instance, the first few lines of this notebook are listed below.
```json
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Seminar 01. Python, NumPy, NLTK\n",
    "\n",
    "Dr. Constantine Korikov, Huawei"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Python crash course"
   ]
  }
 ]
}
```
> **Note**
>
> JSON (JavaScript Object Notation) is a lightweight data-interchange format.
> It is easy for humans to read and write. It is easy for machines to parse and generate. 
>
> [Read more about JSON](https://www.json.org/json-en.html)
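
Because a notebook is plain JSON, it can be inspected with the standard `json` module. The following is a minimal sketch: the notebook text is a small hand-written example (not this notebook), and the variable names are chosen only for illustration.

```python
import json
from collections import Counter

# A tiny, hand-written notebook document (hypothetical content).
notebook_text = """
{
 "cells": [
  {"cell_type": "markdown", "metadata": {}, "source": ["# Title\\n", "Some text"]},
  {"cell_type": "code", "metadata": {}, "execution_count": null,
   "outputs": [], "source": ["1 + 2 + 3"]}
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}
"""

notebook = json.loads(notebook_text)

# Count the cells of each type.
cell_types = Counter(cell["cell_type"] for cell in notebook["cells"])
print(dict(cell_types))

# The "source" field is a list of lines; join it to get the cell text.
first_cell_text = "".join(notebook["cells"][0]["source"])
print(first_cell_text)
```

The same approach works for a real `.ipynb` file if the text is read from disk instead of a string literal.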

### 1.2. Cells

#### Markdown cells

All text in this file is contained in markdown cells. Jupyter supports many Markdown features. For example, some text formatting is shown below.

`**This is bold text**`

**This is bold text**

`__This is bold text__`

__This is bold text__

`*This is italic text*`

*This is italic text*

`_This is italic text_`

_This is italic text_

`~~Strikethrough~~`

~~Strikethrough~~

`![logo](http://www-file.huawei.com/-/media/corporate/images/home/logo/huawei_logo.png)`

![logo](http://www-file.huawei.com/-/media/corporate/images/home/logo/huawei_logo.png)

`$$\int_{-\infty}^{+\infty} e^{-x^2}\,dx = \sqrt{\pi}$$`

$$
    \int_{-\infty}^{+\infty} e^{-x^2}\,dx = \sqrt{\pi}
$$

> **Note** 
>
> Markdown is a lightweight markup language with plain-text-formatting syntax.
>
> [Read more about markdown features in Jupyter](https://athena.brynmawr.edu/jupyter/hub/dblank/public/Jupyter%20Notebook%20Users%20Manual.ipynb)

#### Code cells

Code cells let users work with the programming backend (the kernel) in REPL mode.

> **Note**
>
> REPL (read–eval–print loop) is a simple, interactive computer programming environment that takes single user 
> inputs, evaluates them, and returns the result to the user. A program written in a REPL environment is executed 
> piecewise.  
>
> [Read more about REPL](https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop)

Typically, if you type code in a code cell and press `Shift` + `Enter`, you will see the result produced by the backend. Here, the backend is Python.

In [2]:
1+2+3

6

If you put the `!` symbol before a command, the line is passed to the system shell instead.

In [1]:
!python --version

Python 3.7.4


#### Raw cells

The notebook does not render raw cells. Internally, a raw cell looks like this:
```json
{
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "This is a raw cell"
   ]
}
```

### 1.3. Magic commands

Jupyter provides internal commands, known as magic commands. They are part of neither Python nor the shell. Line magics start with the `%` symbol and apply to a single line; cell magics start with `%%` and apply to the whole cell. To get the full list of supported magic commands, run the following line.

In [4]:
%lsmagic

Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cd  %clear  %cls  %colors  %conda  %config  %connect_info  %copy  %ddir  %debug  %dhist  %dirs  %doctest_mode  %echo  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %macro  %magic  %matplotlib  %mkdir  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %ren  %rep  %rerun  %reset  %reset_selective  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%cmd  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python 

For example, the following command lists the variables defined in the interactive namespace (the list is empty in a fresh session).

In [8]:
%who_ls

[]

The `%timeit` command runs Python code many times to reduce the influence of other tasks on the machine, such as disk flushing and OS scheduling, and reports the mean and standard deviation of the timings.

In [2]:
%timeit 2**128

689 ns ± 55.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
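
Outside of Jupyter, a similar measurement can be made with the standard `timeit` module. This is a minimal sketch; the repeat and loop counts are arbitrary choices, not the values `%timeit` picks automatically.

```python
import timeit

# Repeat the measurement 5 times, with 100_000 executions per run.
timings = timeit.repeat("2**128", repeat=5, number=100_000)

# Per-loop time of the fastest run, in nanoseconds.
best_ns = min(timings) / 100_000 * 1e9
print(f"best of 5: {best_ns:.1f} ns per loop")
```

Taking the minimum rather than the mean is a common choice, since slower runs are usually caused by interference from other processes.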


The following magic command helps with the installation of additional Python packages. It runs the package manager `pip`.

In [51]:
%pip install matplotlib

Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'python -m pip install --upgrade pip' command.


> **Note**
>
> [Read more about magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html)

## 2.  Python crash course

![image.png](attachment:image.png)

### 2.1. Python basics

Python is a friendly language named after the famous British show "Monty Python". Let's print the title of the show.

In [52]:
print("Monty Python")

Monty Python


Now, let's introduce a variable and print its value. Here, the `f` before the opening `"` marks a formatted string (f-string).

In [10]:
x = 6
print(f"Variable x={x}")

Variable x=6


The display format can be tuned.

In [11]:
y = 3.1415926
print(f"Variable y={y:.3f}")

Variable y=3.142


Python has a rich environment where you can find packages for many tasks. To include additional functionality just import a module by its name.
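
As a small illustration of the import forms, here is a sketch that uses only standard-library modules; the alias `stats` is an arbitrary choice.

```python
import math                     # import a whole module
from math import sqrt           # import a single name from a module
import statistics as stats      # import a module under an alias

circle_area = math.pi * 2 ** 2
root = sqrt(16)
average = stats.mean([1, 2, 3, 4])

print(circle_area, root, average)
```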

Python works with objects, and variables hold references to them rather than copies. An example of an object is a `list`.

In [4]:
a = [1,2,3, 5]

Let's add a new value to the end of the list.

In [5]:
a.append(4)
a 
# `append` modifies the list in place and returns None,
# so we put `a` on the last line to display the result
# (by the way, these lines are an example of comments in Python)

[1, 2, 3, 5, 4]

So, let's introduce another name `b` for the same list and append `5` through it.

In [6]:
b = a
b.append(5)
b

[1, 2, 3, 5, 4, 5]

Because `a` and `b` refer to the same object, we see the change in `a` too.

In [7]:
a

[1, 2, 3, 5, 4, 5]

A deep copy, in contrast, is an independent object. Let's copy `a` into `d` and then modify only the copy.

In [9]:
from copy import deepcopy 
d = deepcopy(a)

In [10]:
d

[1, 2, 3, 5, 4, 5]

In [12]:
d.append("ce")

In [15]:
d

[1, 2, 3, 5, 4, 5, 'ce']

In [11]:
a

[1, 2, 3, 5, 4, 5]

So, be careful with mutable objects. If you really need an independent copy of a list, use `deepcopy` from the `copy` module.

In [17]:
from copy import deepcopy
c = deepcopy(a)
a.append(6)
c

[1, 2, 3, 5, 4, 5]
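
For nested lists, the difference between a shallow and a deep copy becomes visible. The sketch below uses illustrative variable names that do not appear elsewhere in this notebook.

```python
from copy import copy, deepcopy

nested = [[1, 2], [3, 4]]
shallow = copy(nested)      # new outer list, but the inner lists are shared
deep = deepcopy(nested)     # new outer list and new inner lists

nested[0].append(99)        # mutate an inner list
nested.append([5, 6])       # mutate the outer list

print(shallow)  # [[1, 2, 99], [3, 4]]
print(deep)     # [[1, 2], [3, 4]]
```

The shallow copy sees the change to the shared inner list but not the new element of the outer list; the deep copy sees neither.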

### 2.2. Types and operations

The following code shows integer arithmetic.

In [60]:
x = 5
print(type(x))
print(x + 1)
print(x - 1)
print(x * 2)
print(x ** 2)  # Power

print(x)
x += 1
print(x)
x *= 2
print(x)

<class 'int'>
6
4
10
25
5
6
12


Also, Python supports bignum arithmetic.

> **Note**
>
> Bignum (arbitrary-precision) arithmetic is a way of calculating
> with numbers whose precision
> is limited only by the available memory of the host system. 

In [61]:
x = 5
y = 5**9999
print(y)

1002474549841290401859511186748595549864313556267685167884285808455847906190156808037822139249684482672304369857241167441626548863537825460329643379095915623717484182223212245787697556804213362675471052162049915791443259424788179828045457969580843869758123426189545097184418216405207466514580333490061648947884893864081057331884617990443033844971619573590297463192164842761560908301677577734033046079844746262077283996350692676845754412109655129006470742338684873074650743886524037778869879980705609483394351010458508990663611379529748795979846550477683047147221153370981453605719638487307491464939384190712279638904467276914113540042800481189771437050053932557066202944459440288064581572353660606291247331510992720871198980647301145026051136469627409806166220889934780630957015286219418544940437623751369656857529693784530764390047773304247434184638451591189009920818089870736319042647556954035483555224284363642013542821325118369953321999851450118080986644929497032335395079461880897680164911729588

In [62]:
y = 2.5
print(type(y))
print(y, y + 1, y * 2, y ** 2)

<class 'float'>
2.5 3.5 5.0 6.25


Boolean operations

In [63]:
t = True
f = False
print(type(t)) 
print(t and f)
print(t or f) 
print(not t)  
print(t != f)  

<class 'bool'>
False
True
False
True


Strings

In [64]:
hello = 'hello'
world = "world"
print(hello, )  
print(len(hello))  # String length
hw = hello + ' ' + world  # String concatenation
print(hw)

hello
5
hello world


In [65]:
multiline = """
One 
   Two
      Three
"""
print(multiline)


One 
   Two
      Three



Some string methods

In [66]:
print(hw.capitalize())  # Capitalize a string
print(hw.upper())       # Convert a string to uppercase
print(hw.replace('l', '(ell)'))  # Replace all instances of one substring with another

Hello world
HELLO WORLD
he(ell)(ell)o wor(ell)d


Control flow and loops

In [67]:
x = 1
# If-else
if x < 0:
    print("Negative")
elif x == 0:
    print('Zero')
else:
    print('Positive')
    
# Ternary operator
a = "One" if x>0 else "Two"
print(a)

# While loop
while x > 0:
    print(x)
    x-=1
    
# For loop
for _ in range(5):
    print(x)
    
# List comprehension
b = [x**2 for x in range(5)]
print(b)

Positive
One
1
0
0
0
0
0
[0, 1, 4, 9, 16]


### 2.3. Functions

A function is a useful unit for decomposing a program. The following example shows how to define a function.

In [32]:
def plus_one(x: int) -> int:
    """This function returns incremented value"""
    return x+1

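It is simple to use. Note that type hints are not checked at runtime, so the call below with a `float` argument also works.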

In [33]:
plus_one(5.14)

6.14

If there is no need to describe the details, a function can be defined more briefly.

In [70]:
def plus_one(x): return x+1

plus_one(5)

6

or even without a name

In [71]:
(lambda x: x+1)(5)

6

In [34]:
fun = lambda x: x+1

In [36]:
fun(6)

7

Another kind of function is a generator, which produces its values lazily, one at a time, each time `next` is called:

In [38]:
def counts(value=0, step=1):
    while 1:
        value += step
        yield value
    
g = counts(step=3)
type(g), next(g), next(g), next(g)

(generator, 3, 6, 9)

> **Note**
> 
> Lazy evaluation is an evaluation strategy which delays the evaluation
> of an expression until its value is needed and which also avoids repeated evaluations.
>
> [Read more about lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation)
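
To see lazy evaluation at work, the sketch below takes a few values from an infinite generator with `itertools.islice`. The function name `squares` is illustrative.

```python
from itertools import islice

def squares():
    """Yield 1, 4, 9, ... without ever building the full sequence."""
    n = 1
    while True:
        yield n * n
        n += 1

# Only the five requested values are ever computed.
first_five = list(islice(squares(), 5))
print(first_five)  # [1, 4, 9, 16, 25]

# A generator expression is lazy too: nothing is computed until iteration.
lazy = (n * n for n in range(10**12))
first, second = next(lazy), next(lazy)
print(first, second)  # 0 1
```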

### 2.4. Containers in python

Python has several built-in containers; they are listed below.

In [73]:
c_tpl = (1, 1.2, "x")
c_rng = range(10)
c_fst = frozenset({1,2,3}) # readonly set
c_bts = bytes((3,1,4,5,1,5))
c_lst = [1,2,3]
c_dct = {1: "One", 2: "Two", 3: "Three"}
c_set = {1,2,3}
c_bar = bytearray((3,1,4,5,1,5)) # writable bytes

Additional containers can be found in the `collections` module, for example the useful `namedtuple`.

In [39]:
from collections import namedtuple 
Point = namedtuple('Point', ['x', 'y'])
p = Point(1,2)

In [40]:
p

Point(x=1, y=2)

or `Counter`, an implementation of a multiset

In [75]:
from collections import Counter

s = 'hello world'
c = Counter(s)
c

Counter({'h': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'w': 1, 'r': 1, 'd': 1})

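or a `dataclass` as a mutable alternative to `namedtuple`. The `dataclasses` module is part of the standard library since Python 3.7; installing the package from PyPI is only needed on Python 3.6.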

In [76]:
%pip install dataclasses

Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [77]:
from dataclasses import dataclass

@dataclass
class Structure:
    name: str
    value: float
    
s = Structure("x", 2)
s

Structure(name='x', value=2)
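
The difference in mutability can be checked directly. In this sketch the class name `MutablePoint` is hypothetical and serves only for the comparison.

```python
from collections import namedtuple
from dataclasses import dataclass

Point = namedtuple("Point", ["x", "y"])

@dataclass
class MutablePoint:
    x: float
    y: float

p = Point(1, 2)
q = MutablePoint(1, 2)

q.x = 10                      # fine: dataclass fields can be reassigned
try:
    p.x = 10                  # namedtuple fields are read-only
except AttributeError as error:
    print("namedtuple:", error)

print(p, q)
```

If an immutable dataclass is needed, `@dataclass(frozen=True)` makes assignments to fields raise an error as well.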

![image.png](attachment:image.png)

### 2.5.  Regular expressions in python

![image.png](attachment:image.png)

A regular expression, regex or regexp, is a sequence of characters that define a search pattern.
Usually such patterns are used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

Python supports regular expression with the help of module `re`. It is convenient to play with online services like [http://www.pyregex.com/](http://www.pyregex.com/) to see how regexps work.

There are other services:
- [https://www.regextester.com/](https://www.regextester.com/)
- [https://regex101.com/](https://regex101.com/)
- [https://regexr.com/](https://regexr.com/)
- [https://pythex.org/](https://pythex.org/)

A pattern string can contain:
- Special characters like `\t` (tab symbol) or `\\` (the `\` symbol).
- Character classes, which match one symbol from a set, e.g. `[ae]` (symbol a or symbol e), `[A-Z]` (any symbol from A to Z), `\d` (any digit), `.` (any symbol except a newline).
- Anchors like `^` (start of the line) or `$` (end of the line).
- Matching groups (can be accessed after the pattern is applied to a string): `()` (the subpattern of a matching group is placed between parentheses). There are several modifiers of matching groups.
- Quantifiers, which specify how many instances of the previous element to match, like `*` (0 or more) or `+` (1 or more).

> **Note**
>
> [Read more about regular expressions in Python](https://docs.python.org/library/re.html)
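
The pattern elements listed above can be tried out on a short string. This is a small sketch; the sample text and variable names are made up for the illustration.

```python
import re

text = "Order 66 shipped on 2020-02-17\tto Anna and Emma."

numbers = re.findall(r"\d+", text)                    # character class + quantifier
names = re.findall(r"[AE]\w+", text)                  # words starting with A or E
starts_with_order = bool(re.search(r"^Order", text))  # anchor: start
ends_with_emma = bool(re.search(r"Emma\.$", text))    # anchor: end, escaped dot
has_tab = bool(re.search(r"\t", text))                # special character
date_parts = re.search(r"(\d+)-(\d+)-(\d+)", text).groups()  # matching groups

print(numbers)      # ['66', '2020', '02', '17']
print(names)        # ['Anna', 'Emma']
print(starts_with_order, ends_with_emma, has_tab)  # True True True
print(date_parts)   # ('2020', '02', '17')
```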

For instance, let's look at how to extract the integer and fractional parts of a floating-point number using a regular expression.

In [78]:
import re
pattern = r"(\d+)\.(\d+)" # the r prefix marks a raw string, so backslashes are kept as is
matches = re.match(pattern, "3.1415926")
matches.groups()

('3', '1415926')

This regular expression works as follows. The first matching group `(\d+)` captures `3`: `\d` matches a digit (for ASCII text, the same as `[0-9]`), and the `+` quantifier repeats the match one or more times, as many times as possible. The next element, `\.`, matches the character `.` literally. The second matching group is the same as the first, and it captures `1415926`.
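
Note that `re.match` returns `None` when the pattern does not match, so calling `.groups()` on the result would fail. A slightly more defensive variant is sketched below; the helper name `split_float` and the group names are illustrative choices, and `fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

FLOAT_PATTERN = re.compile(r"(?P<integer>\d+)\.(?P<fraction>\d+)")

def split_float(text):
    """Return (integer, fraction) strings, or None if text is not like 12.34."""
    match = FLOAT_PATTERN.fullmatch(text)
    if match is None:
        return None
    return match.group("integer"), match.group("fraction")

print(split_float("3.1415926"))  # ('3', '1415926')
print(split_float("42"))         # None
```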

## 3. NumPy

![image.png](attachment:image.png)

[NumPy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)

NumPy is a Python library for scientific computing that provides efficient arrays. NumPy arrays are called *ndarray*, short for N-dimensional array.

To start using NumPy, just import the library. If a library is used frequently in a program, it is convenient to give it an alias. Here, `np` is the widely used alias for NumPy.

In [43]:
import numpy as np

### 3.1. Why are NumPy arrays efficient?

An ndarray consists of 3 parts:
- data buffer (a packed sequence of homogeneous data)
- metadata describing the data type
- metadata describing the layout (shape and strides)

![image.png](attachment:image.png)

That is why operations on an ndarray are much faster than the same operations on built-in Python lists.

In [41]:
%timeit a = [i**2 for i in range(1000)]

1.73 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [44]:
%timeit b = np.arange(1000)**2

13.8 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
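
The parts of an ndarray described above can be inspected through its attributes. The sketch below also shows that a transpose reuses the same data buffer and only swaps the metadata.

```python
import numpy as np

x = np.arange(6, dtype=np.int32).reshape(2, 3)

print(x.dtype)      # int32
print(x.shape)      # (2, 3)
print(x.strides)    # (12, 4)
print(x.itemsize)   # 4 bytes per element
print(x.nbytes)     # 24 bytes in the data buffer

# Transposing creates a view: same buffer, swapped shape and strides.
t = x.T
print(t.shape, t.strides)        # (3, 2) (4, 12)
print(np.shares_memory(x, t))    # True
```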


### 3.2. Simple NumPy operations

Simple 1D array

In [82]:
# Create array
x = np.array([1, 2, 3], np.int32)

# Add an element to the end of array
x = np.append(x, np.int32(4))

# Some information about array (+pack them into tuple)
(type(x),
 x.shape, # shape
 x.dtype, # data type
 x[2],    # get element by index
 x[:2]    # slice (get subarray)
)

(numpy.ndarray, (4,), dtype('int32'), 3, array([1, 2]))

### 3.3. Indexes

A stride is the number of bytes to step through memory to reach the next element along an axis.

In [83]:
# We expect the following code to return (3*4, 4), where 4 is the size of an element in bytes.
x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]
             ],
             np.int32)
x.strides

(12, 4)
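
Strides can be used to compute where an element lives in the data buffer: the byte offset of `x[i, j]` is `i * strides[0] + j * strides[1]`. The following sketch checks this on a copy of the buffer obtained with `tobytes()`, which is valid here because the array is stored contiguously in row-major order.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], np.int32)

i, j = 2, 1
offset = i * x.strides[0] + j * x.strides[1]   # 2*12 + 1*4 = 28 bytes

raw = x.tobytes()                               # row-major copy of the buffer
value = np.frombuffer(raw[offset:offset + x.itemsize], dtype=x.dtype)[0]
print(offset, value, x[i, j])                   # 28 8 8
```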

Some array operations can be performed simply by changing the strides. Here is an example of reversing an array.

In [84]:
x = np.array([1,2,3], np.int8)
x.strides, x[::-1].strides

((1,), (-1,))

Reshaping changes the strides.

In [85]:
x = np.array([[1,2,3],
              [4,5,6],
             ],
             np.int8)
x.strides, x.reshape(6,1), x.reshape(6,1).strides

((3, 1), array([[1],
        [2],
        [3],
        [4],
        [5],
        [6]], dtype=int8), (1, 1))

Indexing by mask

In [86]:
x = np.array([1,2,3], np.int8)
x[[False, True, False]]

array([2], dtype=int8)

A mask can also be computed with a boolean expression

In [87]:
x = np.arange(1, 10)
x[x%2==0]

array([2, 4, 6, 8])
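
Masks can be combined with the element-wise operators `&`, `|` and `~`, and they can also be used for assignment. A short sketch:

```python
import numpy as np

x = np.arange(1, 10)

# Element-wise AND; the parentheses are required because & binds tighter than ==.
mask = (x % 2 == 0) & (x > 4)
selected = x[mask]
print(selected)                  # [6 8]

# A mask on the left-hand side assigns to the selected elements only.
y = x.copy()
y[y % 2 == 1] = 0
print(y)                         # [0 2 0 4 0 6 0 8 0]
```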

Indexing by a list of indexes

In [88]:
x = np.array([1,2,3], np.int8)
x[[1,2]]

array([2, 3], dtype=int8)

### 3.4. Some useful built-in array operations

Hadamard (element-wise) product and matrix product

In [89]:
a = np.array([[1,2],[3,4]])
b = np.array([[1,0],[0,1]])

print(a*b, a.dot(b), sep='\n\n')

[[1 0]
 [0 4]]

[[1 2]
 [3 4]]


Sum, mean, max, argmax

In [90]:
x = np.random.rand(10)
(
    x,
    x.sum(),
    x.mean(),
    x.max(),
    x.argmax()
)

(array([0.09727269, 0.38019134, 0.59662354, 0.1182084 , 0.06387711,
        0.42617565, 0.38189879, 0.74874706, 0.19205411, 0.74278158]),
 3.747830265021937,
 0.3747830265021937,
 0.7487470561334292,
 7)

Broadcasting

In [91]:
a = np.array([[1,2],[3,4]])
b = np.array([[0,1]])
a+b

array([[1, 3],
       [3, 5]])
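
Broadcasting aligns shapes from the last axis backwards, and an axis of size 1 is stretched to match the other operand. The sketch below adds a column to a row and then shows a pair of shapes that cannot be broadcast; the array names are illustrative.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([[1, 2, 3, 4]])      # shape (1, 4)

# Axes of size 1 are stretched, so the result has shape (3, 4).
table = col + row
print(table.shape)
print(table)

# Shapes that differ in an axis where neither size is 1 cannot be broadcast.
try:
    np.ones((2, 3)) + np.ones((2,))
    broadcast_failed = False
except ValueError as error:
    broadcast_failed = True
    print("broadcast error:", error)
```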

## 4. NLTK


![image.png](attachment:image.png)

The package can be installed directly from Jupyter.

In [92]:
%pip install nltk

Collecting nltk
Installing collected packages: nltk
Successfully installed nltk-3.4.5
Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'python -m pip install --upgrade pip' command.


We will use some NLTK modules that need additional data. The `download` function fetches it.

In [93]:
import nltk
#nltk.set_proxy('http://user:password@proxy.example.com:8080')
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\v00524754\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\v00524754\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

Proxy settings are optional here. Let's use the text of the Zen of Python to play with some functions of NLTK.

In [46]:
import this
import codecs

zen_of_python = codecs.encode(this.s, 'rot13')

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [47]:
russian_text = """Граф Лев Николаевич Толсто́й[К 1] (28 августа [9 сентября] 1828, Ясная Поляна, Тульская губерния, Российская 
империя — 7 [20] ноября 1910, станция Астапово, Рязанская губерния, Российская империя) — один из наиболее известных русских 
писателей и мыслителей, один из величайших писателей-романистов мира[4]. Участник обороны Севастополя. Просветитель, публицист, 
религиозный мыслитель, его авторитетное мнение послужило причиной возникновения нового религиозно-нравственного течения — 
толстовства. За свои взгляды был отлучен от церкви. Член-корреспондент Императорской Академии наук (1873), почётный академик 
по разряду изящной словесности (1900)[5]. Был номинирован на Нобелевскую премию по литературе (1902, 1903, 1904, 1905). 
Впоследствии отказался от дальнейшей номинации.

Писатель, ещё при жизни признанный главой русской литературы[6]. Творчество Льва Толстого ознаменовало новый этап в русском и 
мировом реализме, выступив мостом между классическим романом XIX века и литературой XX века. Лев Толстой оказал сильное влияние 
на эволюцию европейского гуманизма, а также на развитие реалистических традиций в мировой литературе. Произведения Льва 
Толстого многократно экранизировались и инсценировались в СССР и за рубежом; его пьесы ставились на сценах всего мира[6]. Лев 
Толстой был самым издаваемым в СССР писателем за 1918—1986 годы: общий тираж 3199 изданий составил 436,261 млн экземпляров[7].

Наиболее известны такие произведения Толстого, как романы «Война и мир», «Анна Каренина», «Воскресение», 
автобиографическая[8][6] трилогия «Детство», «Отрочество», «Юность»[К 2], повести «Казаки», «Смерть Ивана Ильича», «Крейцерова 
соната», «Отец Сергий», «Хаджи-Мурат», цикл очерков «Севастопольские рассказы», драмы «Живой труп», «Плоды просвещения» и 
«Власть тьмы», автобиографические религиозно-философские произведения «Исповедь» и «В чём моя вера?» и др.
"""

> **Note**
> 
> We decode the text with rot13 because the source of the `this` module stores the Zen of Python in rot13-encoded form.
>
> [See source of this module](https://github.com/python/cpython/blob/master/Lib/this.py)

### 4.1. Tokenization

Tokenization is the process of splitting text into tokens. Let's split the text into sentences.

In [48]:
from nltk.tokenize import sent_tokenize

tokens = sent_tokenize(zen_of_python)
print(tokens)

['The Zen of Python, by Tim Peters\n\nBeautiful is better than ugly.', 'Explicit is better than implicit.', 'Simple is better than complex.', 'Complex is better than complicated.', 'Flat is better than nested.', 'Sparse is better than dense.', 'Readability counts.', "Special cases aren't special enough to break the rules.", 'Although practicality beats purity.', 'Errors should never pass silently.', 'Unless explicitly silenced.', 'In the face of ambiguity, refuse the temptation to guess.', 'There should be one-- and preferably only one --obvious way to do it.', "Although that way may not be obvious at first unless you're Dutch.", 'Now is better than never.', 'Although never is often better than *right* now.', "If the implementation is hard to explain, it's a bad idea.", 'If the implementation is easy to explain, it may be a good idea.', "Namespaces are one honking great idea -- let's do more of those!"]


Then, we split it into words.

In [49]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(zen_of_python)
print(tokens)

['The', 'Zen', 'of', 'Python', ',', 'by', 'Tim', 'Peters', 'Beautiful', 'is', 'better', 'than', 'ugly', '.', 'Explicit', 'is', 'better', 'than', 'implicit', '.', 'Simple', 'is', 'better', 'than', 'complex', '.', 'Complex', 'is', 'better', 'than', 'complicated', '.', 'Flat', 'is', 'better', 'than', 'nested', '.', 'Sparse', 'is', 'better', 'than', 'dense', '.', 'Readability', 'counts', '.', 'Special', 'cases', 'are', "n't", 'special', 'enough', 'to', 'break', 'the', 'rules', '.', 'Although', 'practicality', 'beats', 'purity', '.', 'Errors', 'should', 'never', 'pass', 'silently', '.', 'Unless', 'explicitly', 'silenced', '.', 'In', 'the', 'face', 'of', 'ambiguity', ',', 'refuse', 'the', 'temptation', 'to', 'guess', '.', 'There', 'should', 'be', 'one', '--', 'and', 'preferably', 'only', 'one', '--', 'obvious', 'way', 'to', 'do', 'it', '.', 'Although', 'that', 'way', 'may', 'not', 'be', 'obvious', 'at', 'first', 'unless', 'you', "'re", 'Dutch', '.', 'Now', 'is', 'better', 'than', 'never', '.

For example, we can use this list of tokens to find the most common tokens in the text.

In [50]:
from nltk.probability import FreqDist
dist = FreqDist(tokens)
dist.most_common(10)

[('.', 18),
 ('is', 10),
 ('better', 8),
 ('than', 8),
 ('to', 5),
 ('the', 5),
 (',', 4),
 ('of', 3),
 ('Although', 3),
 ('never', 3)]
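
Punctuation dominates the list above. A common preprocessing step is to keep only alphabetic tokens and lowercase them before counting. The sketch below does this with the standard `collections.Counter` on a short hand-written token list, so it runs without NLTK.

```python
from collections import Counter

# A short sample of tokens (the first two lines of the Zen of Python).
tokens = ['Beautiful', 'is', 'better', 'than', 'ugly', '.',
          'Explicit', 'is', 'better', 'than', 'implicit', '.']

# Keep alphabetic tokens only and normalize the case.
words = [token.lower() for token in tokens if token.isalpha()]

top = Counter(words).most_common(3)
print(top)
```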

Unfortunately, NLTK is not optimized for Russian, so we will not use it here. Another common tokenizer for Russian is `razdel`. Let us install it:

In [99]:
%pip install razdel

Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [100]:
from razdel import sentenize, tokenize

text_generator = sentenize(russian_text)
print(next(text_generator))
print(next(text_generator))

list(tokenize(russian_text))[:20]

Substring(0, 308, 'Граф Лев Николаевич Толсто́й[К 1] (28 августа [9 сентября] 1828, Ясная Поляна, Тульская губерния, Российская \nимперия — 7 [20] ноября 1910, станция Астапово, Рязанская губерния, Российская империя) — один из наиболее известных русских \nписателей и мыслителей, один из величайших писателей-романистов мира[4].')
Substring(309, 338, 'Участник обороны Севастополя.')


[Substring(0, 4, 'Граф'),
 Substring(5, 8, 'Лев'),
 Substring(9, 19, 'Николаевич'),
 Substring(20, 28, 'Толсто́й'),
 Substring(28, 29, '['),
 Substring(29, 30, 'К'),
 Substring(31, 32, '1'),
 Substring(32, 33, ']'),
 Substring(34, 35, '('),
 Substring(35, 37, '28'),
 Substring(38, 45, 'августа'),
 Substring(46, 47, '['),
 Substring(47, 48, '9'),
 Substring(49, 57, 'сентября'),
 Substring(57, 58, ']'),
 Substring(59, 63, '1828'),
 Substring(63, 64, ','),
 Substring(65, 70, 'Ясная'),
 Substring(71, 77, 'Поляна'),
 Substring(77, 78, ',')]
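
To get a feeling for the decisions a word tokenizer has to make, here is a deliberately naive regex-based sketch. It is not how NLTK or `razdel` work internally; the function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

result = simple_tokenize("Special cases aren't special enough.")
print(result)
# ['Special', 'cases', 'aren', "'", 't', 'special', 'enough', '.']
```

The contraction `aren't` is split into three pieces here, whereas `word_tokenize` above produced `are` and `n't`. Handling such cases well is the reason to prefer a dedicated tokenizer.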

### 4.2. Stemming

Usually, we want to preprocess text before performing analysis. Normalization is a preprocessing technique that helps simplify analysis. Stemming is a type of normalization. The following code shows how to use the Porter stemming method to reduce words to their stems.

In [51]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
[porter.stem(word) for word, freq in dist.most_common(10)]

['.', 'is', 'better', 'than', 'to', 'the', ',', 'of', 'although', 'never']

Let's stem a single word. Note that a stem is not necessarily a dictionary word.

In [53]:
porter.stem("Flying")

'fli'

### 4.3. Lemmatization

Another normalization method is lemmatization, which reduces a word to its dictionary form (lemma). Let's try to apply the WordNet lemmatizer to the words.

In [54]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
[wnl.lemmatize(word) for word, freq in dist.most_common(10)]

['.', 'is', 'better', 'than', 'to', 'the', ',', 'of', 'Although', 'never']

In [None]:
wnl.lemmatize("corpora")

Lemmatization for Russian is not an easy task either. A common tool for it is `pymorphy2`, which we will use.

In [105]:
%pip install pymorphy2

Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [106]:
from pymorphy2 import MorphAnalyzer

In [107]:
morph = MorphAnalyzer()

morph.parse(next(tokenize(russian_text)).text)

[Parse(word='граф', tag=OpencorporaTag('NOUN,inan,masc sing,nomn'), normal_form='граф', score=0.25, methods_stack=((<DictionaryAnalyzer>, 'граф', 33, 0),)),
 Parse(word='граф', tag=OpencorporaTag('NOUN,inan,masc sing,accs'), normal_form='граф', score=0.25, methods_stack=((<DictionaryAnalyzer>, 'граф', 33, 3),)),
 Parse(word='граф', tag=OpencorporaTag('NOUN,anim,masc sing,nomn'), normal_form='граф', score=0.25, methods_stack=((<DictionaryAnalyzer>, 'граф', 52, 0),)),
 Parse(word='граф', tag=OpencorporaTag('NOUN,inan,femn plur,gent'), normal_form='графа', score=0.25, methods_stack=((<DictionaryAnalyzer>, 'граф', 55, 8),))]