https://mrjob.readthedocs.io/en/latest/guides/quickstart.html

In [2]:
# mrjob can be installed with pip or conda
# ! pip install mrjob
# ! conda install mrjob

In [3]:
import os
import re

import numpy as np

- A **mapper** takes a single key and value as input, and returns zero or more (key, value) pairs. The pairs from all map outputs of a single step are grouped by key.

- A **combiner** takes a key and a subset of the values for that key as input and returns zero or more (key, value) pairs. Combiners are optimizations that run immediately after each mapper and can be used to decrease total data transfer. Combiners should be idempotent (produce the same output if run multiple times in the job pipeline).

- A **reducer** takes a key and the complete set of values for that key in the current step, and returns zero or more arbitrary (key, value) pairs as output.

    After the reducer has run, if there are more steps, the individual results are arbitrarily assigned to mappers for further processing. If there are no more steps, the results are sorted and made available for reading.
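
The three roles above can be sketched in plain Python, without mrjob. This is only an illustration: the helper names `group_by_key` and `run_step` are hypothetical, and the combiner is run once per input line, which simplifies how a real framework schedules it.

```python
from collections import defaultdict


def mapper(_, line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield word.lower(), 1


def combiner(key, values):
    # Pre-aggregate a subset of the values for one key.
    yield key, sum(values)


def reducer(key, values):
    # Receives every value for the key and produces the final count.
    yield key, sum(values)


def group_by_key(pairs):
    # Hypothetical helper: collect values under their key,
    # as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_step(lines):
    # Hypothetical helper: one map -> combine -> reduce pass over the input.
    shuffled = []
    for line in lines:
        # Here the combiner only sees the output of a single mapper call.
        for key, values in group_by_key(mapper(None, line)).items():
            shuffled.extend(combiner(key, values))
    result = {}
    for key, values in sorted(group_by_key(shuffled).items()):
        for out_key, out_value in reducer(key, values):
            result[out_key] = out_value
    return result


counts = run_step(["to be or not to be", "be quick"])
print(counts)
```

Because the combiner and the reducer both just sum, running the combiner zero, one, or several times does not change the final counts.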


# Word Count

Let's work with some text and count a few things in it.

## Lines, Words, Chars

In [4]:
%%writefile job.py

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Writing job.py


```
python3 job.py our_file.txt
```

![](img/mrjob_example_1.png)

## Names

Let's make the task a little harder and estimate how many times "first name + patronymic" pairs are mentioned in the text.

For this we need to come up with a regular expression.

In [5]:
with open('data/crime-punishment.txt', 'r') as file:
    text = file.read()

In [6]:
import re

# raw string, so that \s is not treated as an (invalid) string escape
name_regex = re.compile(r'([A-Z][a-z]{3,})\s([A-Z][a-z]{2,}(ich|itch|vna))')

In [7]:
for res in re.finditer(name_regex, text):
    name = res.group()
    name = re.sub(r'\s+', ' ', name)  # normalize the whitespace between the two words
    print(name)

Alyona Ivanovna
Alyona Ivanovna
Alyona Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Amalia Fyodorovna
Katerina Ivanovna
Ivan Ivanitch
Katerina Ivanovna
Katerina Ivanovna
Darya Frantsovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Darya Frantsovna
Sofya Semyonovna
Amalia Fyodorovna
Darya Frantsovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Ivan Afanasyvitch
Ivan Afanasyvitch
Katerina Ivanovna
Semyon Zaharovitch
Katerina Ivanovna
Katerina Ivanovna
Amalia Fyodorovna
Semyon Zaharovitch
Semyon Zaharovitch
Semyon Zaharovitch
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Katerina Ivanovna
Praskovya Pavlovna
Vassily Ivanovitch
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Marfa Petrovna
Pyotr Petrovitch
Marfa Petrovna
Pyotr Petrovitch
Pyotr Petrovitch
Pyotr Petrovitch
Pyotr Petrovitch
Pyotr Petrovitch
Pyotr 

We have checked that the regex produces something plausible.
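
One detail is worth checking on a small input before moving on: an mrjob mapper receives the text one line at a time, so a name broken across a line break can be matched in the whole text but not line by line. The sketch below uses a hypothetical `sample` string (not taken from the novel) and `collections.Counter` to compare the two ways of counting.

```python
import re
from collections import Counter

# Same pattern as above, written as a raw string.
name_regex = re.compile(r'([A-Z][a-z]{3,})\s([A-Z][a-z]{2,}(ich|itch|vna))')

# Hypothetical sample text; the second mention is split across two lines.
sample = (
    "Katerina Ivanovna spoke to Pyotr Petrovitch.\n"
    "Then Katerina\n"
    "Ivanovna left, and Ivan Ivanitch stayed."
)

# Counting over the whole text: \s also matches the newline.
whole_text = Counter(
    re.sub(r'\s+', ' ', match.group())
    for match in name_regex.finditer(sample)
)

# Counting line by line, the way a mapper sees the input.
per_line = Counter(
    match.group()
    for line in sample.splitlines()
    for match in name_regex.finditer(line)
)

print(whole_text["Katerina Ivanovna"], per_line["Katerina Ivanovna"])
```

The line-by-line count is lower whenever a pair straddles a line break, so the numbers from the job can be somewhat smaller than the ones from the loop over the whole text.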

Now let's use it in our job.

In [8]:
%%writefile job.py

import re
from mrjob.job import MRJob

# compile once, with a raw string so that \s is not treated as a string escape
PATTERN = re.compile(r'([A-Z][a-z]{3,})\s([A-Z][a-z]{2,}(ich|itch|vna))')

class MRWordMiddleNameCounts(MRJob):

    def mapper(self, _, line):
        for name in re.finditer(PATTERN, line):
            name = re.sub(r'\s+', ' ', name.group())
            yield name, 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordMiddleNameCounts.run()

Overwriting job.py


The `-r local` argument runs the job locally in several subprocesses rather than a single one. The `-q` argument suppresses debug output.

![](img/mrjob_example_2.png)

## Most common middle name

Now let's add one more step to our program: finding the most popular patronymic.

*(my bet is that it will be a form of either Pyotr or Ivan)*

Here we use two steps. The first step produces aggregates of the form (count, patronymic), and in the second step an additional reducer takes the maximum.

In [9]:
%%writefile job.py
import re

from mrjob.job import MRJob
from mrjob.step import MRStep

PATTERN = re.compile(r'[A-Z][a-z]{2,}(ich|itch|vna)')

class MRWordMostPopularMiddleName(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper, combiner=self.combiner, reducer=self.reducer),
            MRStep(reducer=self.most_common_reducer)
        ]
    
    def mapper(self, _, line):
        for name in re.finditer(PATTERN, line):
            yield name.group(), 1

    def combiner(self, key, values):
        yield key, sum(values)
    
    def reducer(self, key, values):
        yield None, (sum(values), key)

    def most_common_reducer(self, _, values):
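        # each value is a (count, patronymic) pair, so yielding the largest one
        # emits key=count, value=patronymic
        yield max(values)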


if __name__ == '__main__':
    MRWordMostPopularMiddleName.run()

Overwriting job.py
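
The data flow of the two steps can be traced in plain Python. This is a sketch on hypothetical sample lines (not taken from the novel), with `collections.Counter` standing in for the first step's mapper, combiner and reducer.

```python
import re
from collections import Counter

PATTERN = re.compile(r'[A-Z][a-z]{2,}(ich|itch|vna)')

# Hypothetical sample lines.
lines = [
    "Katerina Ivanovna and Alyona Ivanovna",
    "Pyotr Petrovitch met Marfa Petrovna",
    "Katerina Ivanovna again",
]

# Step 1: the mapper emits (patronymic, 1); combiner and reducer sum the ones.
counts = Counter(
    match.group() for line in lines for match in PATTERN.finditer(line)
)

# The first reducer re-keys every result to None as (count, patronymic),
# so a single reducer call in step 2 sees all of them.
pairs = [(count, name) for name, count in counts.items()]

# Step 2: pairs compare element by element, so max() picks the highest count
# and breaks ties by the alphabetically last patronymic.
most_common = max(pairs)
print(most_common)
```

Putting the count first in the pair is what makes a plain `max` work here; with (patronymic, count) the maximum would be chosen alphabetically.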


![](img/mrjob_example_3.png)

*:)*

# Average of numbers

Let's generate a file of numbers to use as an example. It will contain n lines with m numbers each.

In [66]:
mat = np.random.randint(-5, 255, size=(1337, 42))

with open(os.path.join('data','digits'), 'w') as file:
    for line in mat:
        file.write(f'{str(line.tolist())[1:-1]}\n')

In [73]:
mat.mean()

123.99216440502903

In [67]:
with open(os.path.join('data','digits'), 'r') as file:
    mat_text = file.readlines()

In [72]:
%%writefile job.py

from mrjob.job import MRJob

class MRNumbersAverager(MRJob):
    def mapper(self, _, line):
        for number in line.strip().split(','):
            yield 1, int(number)

    def reducer(self, key, values):
        values = list(values)
        yield "avg", sum(values) / len(values)


if __name__ == '__main__':
    MRNumbersAverager.run()

Overwriting job.py
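
The reducer above collects every number under a single key into one list. A common alternative, sketched here in plain Python rather than as the job itself, is to pass around partial (sum, count) pairs. These can be merged in any order, whereas averaging partial averages would give a wrong result when the groups differ in size. The helper name `combine` and the sample `lines` are hypothetical.

```python
def mapper(_, line):
    # One partial aggregate per line: (sum of the numbers, how many there are).
    numbers = [int(token) for token in line.strip().split(',')]
    yield "avg", (sum(numbers), len(numbers))


def combine(values):
    # Hypothetical helper: merge partial (sum, count) pairs into one pair.
    total, count = 0, 0
    for partial_sum, partial_count in values:
        total += partial_sum
        count += partial_count
    return total, count


def reducer(key, values):
    # The final average is computed only once, from the merged pair.
    total, count = combine(values)
    yield key, total / count


# Hypothetical input lines in the same comma-separated format as data/digits.
lines = ["1, 2, 3", "10", "4, 5"]
partials = [value for line in lines for _, value in mapper(None, line)]
(key, average), = reducer("avg", partials)
print(key, average)

# Averaging the per-line averages gives a different (wrong) number.
naive = sum(s / c for s, c in partials) / len(partials)
print(naive)
```

In an mrjob job the same merging logic could serve as a combiner, since merging (sum, count) pairs several times leaves the final average unchanged.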
