# MapReduce - Demo 1

## Example 1 - List of words

The datasets, mappers, and reducers used in this demo are in the `resources/` folder.

In [2]:
!cat ./resources/datasets/animals.txt 

wolf
monkey
giraffe
elephant
kangaroo
lion
giraffe
wolf
kangaroo
wolf
monkey
lion
lion
zebra

### mapper1.py

`mapper1.py` maps each record (one input line) to the value `1`, emitting a tab-separated key/value pair per line.
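
The source of `mapper1.py` is not reproduced in this export, so the following is only a minimal sketch of a mapper with the behavior described above. The function name `map_records` is hypothetical, and the tab separator is taken from the output shown further down.

```python
from typing import Iterable, Iterator


def map_records(lines: Iterable[str]) -> Iterator[str]:
    """Emit one tab-separated `key<TAB>1` pair per non-empty input line."""
    for line in lines:
        key = line.strip()
        if key:  # skip blank lines
            yield f"{key}\t1"


# In a streaming script the input would be sys.stdin:
#     for pair in map_records(sys.stdin): print(pair)
sample = ["wolf\n", "monkey\n", "giraffe\n"]
for pair in map_records(sample):
    print(pair)
```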

In [8]:
%load resources/mappers/mapper1.py

In [3]:
!cat ./resources/datasets/animals.txt | python resources/mappers/mapper1.py

wolf	1
monkey	1
giraffe	1
elephant	1
kangaroo	1
lion	1
giraffe	1
wolf	1
kangaroo	1
wolf	1
monkey	1
lion	1
lion	1
zebra	1


### reducer1.py

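`reducer1.py` sums the values of each key. It expects its input to be sorted by key: run directly on the mapper output, it merges only lines whose keys are already adjacent.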

In [3]:
!cat ./resources/datasets/animals.txt | python resources/mappers/mapper1.py | python resources/reducers/reducer1.py

wolf	1
monkey	1
giraffe	1
elephant	1
kangaroo	1
lion	1
giraffe	1
wolf	1
kangaroo	1
wolf	1
monkey	1
lion	2
zebra	1
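
In the output above only the two consecutive `lion` lines were merged, which is what a reducer that sums runs of equal keys would produce. A minimal sketch of such a reducer, assuming tab-separated `key<TAB>value` input; `reduce_pairs` is a hypothetical name and not necessarily how `reducer1.py` is written:

```python
from itertools import groupby
from typing import Iterable, Iterator


def reduce_pairs(lines: Iterable[str], sep: str = "\t") -> Iterator[str]:
    """Sum the values of each run of consecutive lines sharing a key."""
    pairs = (line.rstrip("\n").split(sep) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}{sep}{total}"


unsorted = ["lion\t1", "wolf\t1", "lion\t1", "lion\t1"]
print(list(reduce_pairs(unsorted)))          # runs only: lion 1, wolf 1, lion 2
print(list(reduce_pairs(sorted(unsorted))))  # full totals: lion 3, wolf 1
```

Because `groupby` only groups adjacent items, the reducer never has to hold more than one key in memory, at the price of requiring sorted input.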


The **sort and shuffle** phase can be simulated with `sort -k1,1`, which sorts the lines by their first field so that identical keys become adjacent.

In [4]:
!cat ./resources/datasets/animals.txt | python resources/mappers/mapper1.py | sort -k1,1 

elephant	1
giraffe	1
giraffe	1
kangaroo	1
kangaroo	1
lion	1
lion	1
lion	1
monkey	1
monkey	1
wolf	1
wolf	1
wolf	1
zebra	1


In [5]:
!cat ./resources/datasets/animals.txt | python resources/mappers/mapper1.py | sort -k1,1 | python resources/reducers/reducer1.py

elephant	1
giraffe	2
kangaroo	2
lion	3
monkey	2
wolf	3
zebra	1
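
On a single machine `sort -k1,1` is enough to stand in for the sort and shuffle phase. With several reducers, the phase also has to route all pairs with the same key to the same reducer. The sketch below illustrates that idea only; `shuffle_and_sort`, `n_reducers`, and the CRC32-based partitioning are assumptions made for illustration, not a description of any particular framework.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def shuffle_and_sort(pairs: Iterable[Pair], n_reducers: int) -> Dict[int, List[Pair]]:
    """Route each pair to a partition by key hash, then sort each partition by key."""
    partitions: Dict[int, List[Pair]] = defaultdict(list)
    for key, value in pairs:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        index = zlib.crc32(key.encode("utf-8")) % n_reducers
        partitions[index].append((key, value))
    return {index: sorted(bucket) for index, bucket in partitions.items()}


mapped = [("wolf", 1), ("lion", 1), ("wolf", 1), ("zebra", 1), ("lion", 1)]
for index, bucket in sorted(shuffle_and_sort(mapped, n_reducers=2).items()):
    print(index, bucket)
```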


## Example 2 - Counting words of a book

Find more books here: https://www.gutenberg.org/browse/scores/top

In [6]:
!head ./resources/datasets/book.txt

The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Pride and Prejudice



### mapper2.py
`mapper2.py` splits each line on whitespace and emits the value `1` for each word.
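
A minimal sketch of such a word mapper, assuming the same tab-separated output as in Example 1. The name `map_words` is hypothetical, and the actual `mapper2.py` may differ.

```python
from typing import Iterable, Iterator


def map_words(lines: Iterable[str]) -> Iterator[str]:
    """Emit `word<TAB>1` for every whitespace-separated token."""
    for line in lines:
        # split() with no argument splits on any run of whitespace
        for word in line.split():
            yield f"{word}\t1"


for pair in map_words(["Title: Pride and Prejudice\n"]):
    print(pair)
```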

In [7]:
!cat ./resources/datasets/book.txt | python resources/mappers/mapper2.py | tail

Project	1
Gutenberg	1
EBook	1
of	1
Pride	1
and	1
Prejudice,	1
by	1
Jane	1
Austen	1


In [6]:
!cat ./resources/datasets/book.txt | python resources/mappers/mapper2.py | sort -k1,1 | python resources/reducers/reducer1.py | tail -20

yours,	2
yours,”	2
yours.	3
yours.”	3
“Yours	1
“Yours,	2
Yours,	2
yourself	28
yourself,	9
yourself,”	2
yourself;	1
yourself?	1
yourself.	4
yourself.”	4
yourself--and	1
yourselves	1
yourselves?	1
youth	7
youth,	2
youths	1


### Question:
How could you remove stop words and invalid characters, such as the punctuation attached to the words in the output above?
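
One possible approach is to normalize tokens in the mapper, before they reach the reducer. The sketch below lowercases the text, keeps only runs of letters (with an optional internal apostrophe), and drops stop words. `map_clean_words`, the `TOKEN` pattern, and the tiny `STOP_WORDS` set are all assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

# Runs of letters, allowing an internal apostrophe (straight or curly).
TOKEN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def map_clean_words(lines: Iterable[str]) -> Iterator[str]:
    """Emit `word<TAB>1` for each lowercased token that is not a stop word."""
    for line in lines:
        for token in TOKEN.findall(line.lower()):
            if token not in STOP_WORDS:
                yield f"{token}\t1"


print(list(map_clean_words(["“Yours,” said the youth--and yourself?\n"])))
```

With this mapper, variants such as `yourself,` and `yourself?` would all collapse into the single key `yourself`.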

## Example 3 - CSV Table

In [7]:
!cat ./resources/datasets/futbr.csv 

ano,campeonato brasileiro,copa do brasil
1989,Vasco da Gama,Grêmio
1990,Corinthians,Flamengo
1991,Criciúma,Corinthians
1992,Flamengo,Internacional
1993,Palmeiras,Cruzeiro
1994,Palmeiras,Grêmio
1995,Botafogo,Corinthians
1996,Grêmio,Cruzeiro
1997,Vasco da Gama,Grêmio
1998,Corinthians,Palmeiras
1999,Corinthians,Juventude
2000,Vasco da Gama,Cruzeiro
2001,Atlético Paranaense,Grêmio
2002,Santos,Corinthians
2003,Cruzeiro,Cruzeiro
2004,Santos,Santo André
2005,Corinthians,Paulista
2006,São Paulo,Flamengo
2007,São Paulo,Fluminense
2008,São Paulo,Sport
2009,Flamengo,Corinthians
2010,Fluminense,Santos
2011,Corinthians,Vasco da Gama
2012,Fluminense,Palmeiras
2013,Cruzeiro,Flamengo
2014,Cruzeiro,Atlético Mineiro
2015,Corinthians,Palmeiras
2016,Palmeiras,Grêmio
2017,Corinthians,Cruzeiro
2018,Palmeiras,Cruzeiro

### mapper3.py

`mapper3.py` splits each line by `,` and emits the second field (here, the Campeonato Brasileiro champion) with the value `1`.
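
A minimal sketch of a mapper with this behavior. Skipping the header row is an assumption based on the output, which contains no header entry. The names `map_second_field` and `skip_header` are hypothetical.

```python
from typing import Iterable, Iterator


def map_second_field(
    lines: Iterable[str], sep: str = ",", skip_header: bool = True
) -> Iterator[str]:
    """Emit `<second field><sep>1` for each data row of a delimited file."""
    rows = iter(lines)
    if skip_header:
        next(rows, None)  # drop the header row, if there is one
    for line in rows:
        fields = line.rstrip("\n").split(sep)
        if len(fields) > 1:  # ignore blank or malformed rows
            yield f"{fields[1]}{sep}1"


sample = [
    "ano,campeonato brasileiro,copa do brasil\n",
    "1989,Vasco da Gama,Grêmio\n",
    "1990,Corinthians,Flamengo\n",
]
print(list(map_second_field(sample)))  # ['Vasco da Gama,1', 'Corinthians,1']
```

A plain `split` works here because no field contains a quoted comma; for general CSV input, Python's `csv.reader` would be the safer choice.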

In [10]:
!cat ./resources/datasets/futbr.csv | python resources/mappers/mapper3.py

Vasco da Gama,1
Corinthians,1
Criciúma,1
Flamengo,1
Palmeiras,1
Palmeiras,1
Botafogo,1
Grêmio,1
Vasco da Gama,1
Corinthians,1
Corinthians,1
Vasco da Gama,1
Atlético Paranaense,1
Santos,1
Cruzeiro,1
Santos,1
Corinthians,1
São Paulo,1
São Paulo,1
São Paulo,1
Flamengo,1
Fluminense,1
Corinthians,1
Fluminense,1
Cruzeiro,1
Cruzeiro,1
Corinthians,1
Palmeiras,1
Corinthians,1
Palmeiras,1


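### reducer2.py

`reducer2.py` takes the field separator as its first command-line argument. The first call below passes `'.'`, which does not occur in the comma-separated mapper output, so the split yields a single piece and the unpacking fails. The second call passes `','` and works.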

In [13]:
!cat ./resources/datasets/futbr.csv | python resources/mappers/mapper3.py | sort -k1,1 | python resources/reducers/reducer2.py '.'

Traceback (most recent call last):
  File "resources/reducers/reducer2.py", line 13, in <module>
    key,value = line.split(sys.argv[1])
ValueError: not enough values to unpack (expected 2, got 1)
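
The traceback comes from unpacking the result of `line.split(sys.argv[1])` into exactly two names. A minimal sketch of a reducer that accepts the separator as a parameter and reports a missing separator more clearly; `parse_pair` and `reduce_pairs` are hypothetical names, and splitting on the last separator is a design choice of this sketch rather than the behavior of `reducer2.py`:

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def parse_pair(line: str, sep: str) -> Tuple[str, int]:
    """Split `key<sep>value` on the last separator, failing with a clear message."""
    key, found, value = line.rstrip("\n").rpartition(sep)
    if not found:
        raise ValueError(f"separator {sep!r} not found in line {line!r}")
    return key, int(value)


def reduce_pairs(lines: Iterable[str], sep: str) -> Iterator[str]:
    """Sum values per run of equal keys; input must already be sorted by key."""
    pairs = (parse_pair(line, sep) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}{sep}{sum(value for _, value in group)}"


rows = ["Corinthians,1", "Corinthians,1", "Santos,1"]
print(list(reduce_pairs(rows, ",")))  # ['Corinthians,2', 'Santos,1']
```

In a command-line script, `sep` would come from `sys.argv[1]`, as in the traceback above.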


In [14]:
!cat ./resources/datasets/futbr.csv | python resources/mappers/mapper3.py | sort -k1,1 | python resources/reducers/reducer2.py ','

Atlético Paranaense,1
Botafogo,1
Corinthians,7
Criciúma,1
Cruzeiro,3
Flamengo,2
Fluminense,2
Grêmio,1
Palmeiras,4
Santos,2
São Paulo,3
Vasco da Gama,3


## Example 4 - IoT Log

### reducer3.py

In [15]:
!cat ./resources/datasets/futbr.csv | python resources/mappers/mapper3.py | sort -k1,1 | python resources/reducers/reducer2.py ',' | tail

Corinthians,7
Criciúma,1
Cruzeiro,3
Flamengo,2
Fluminense,2
Grêmio,1
Palmeiras,4
Santos,2
São Paulo,3
Vasco da Gama,3


In [16]:
!cat ./resources/datasets/iot-temperature.csv | python resources/mappers/mapper3.py | sort -k1,1 | python resources/reducers/reducer2.py ',' | tail

37.1,16
37.2,13
37.3,26
37.4,76
37.5,161
37.6,97
37.7,128
37.8,68
37.9,53
38.0,21


In [17]:
!cat ./resources/datasets/iot-temperature.csv | python resources/mappers/mapper3.py | sort -k1,1 | python resources/reducers/reducer3.py ','  | tail

37,16
37,13
37,26
37,76
38,161
38,97
38,128
38,68
38,53
38,21
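
In the output above the keys are rounded to whole degrees, but the counts are still reported once per original key, so `37` and `38` each appear several times. If the goal is one total per whole degree (an assumption, since the notebook does not state it), the counts have to be summed after rounding. A minimal sketch that takes `key,count` lines such as the `reducer2.py` output; `reduce_rounded` is a hypothetical name and half-up rounding is a choice of this sketch:

```python
from collections import Counter
from decimal import ROUND_HALF_UP, Decimal
from typing import Dict, Iterable


def reduce_rounded(lines: Iterable[str], sep: str = ",") -> Dict[int, int]:
    """Round each numeric key half-up to an integer, then sum counts per rounded key."""
    totals: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        key, count = line.rstrip("\n").split(sep)
        rounded = int(Decimal(key).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
        totals[rounded] += int(count)
    return dict(sorted(totals.items()))


rows = ["37.3,26", "37.4,76", "37.5,161", "37.6,97"]
print(reduce_rounded(rows))  # {37: 102, 38: 258}
```

A dictionary is used instead of run-based grouping because a text sort does not keep numerically close keys adjacent (for example, `9.6` and `10.1` both round to `10` but sort far apart as strings).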


In [18]:
!cat ./resources/datasets/iot-temperature.csv | python resources/mappers/mapper5.py | tail

total,21.6
total,21.6
total,21.6
total,21.6
total,21.6
total,21.7
total,21.7
total,21.7
total,21.7
total,21.7


In [19]:
!cat ./resources/datasets/iot-temperature.csv | python resources/mappers/mapper5.py | sort -k1,1 | python resources/reducers/reducer4.py ',' 

len,68876
sum,1751648.8999999953
avg,25.431919681746837
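
The last two cells compute a global average by mapping every reading to the single key `total` and letting the reducer report the count, sum, and mean. The sources of `mapper5.py` and `reducer4.py` are not shown, so this is only a sketch under assumptions: the temperature is taken to be the second CSV field, the file is assumed to have a header row, and `map_total`, `reduce_stats`, and the sample column names are hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple


def map_total(
    lines: Iterable[str], sep: str = ",", skip_header: bool = True
) -> Iterator[str]:
    """Emit `total<sep><second field>` so that every reading shares one key."""
    rows = iter(lines)
    if skip_header:
        next(rows, None)  # drop the header row, if there is one
    for line in rows:
        fields = line.rstrip("\n").split(sep)
        if len(fields) > 1:
            yield f"total{sep}{fields[1]}"


def reduce_stats(lines: Iterable[str], sep: str = ",") -> Tuple[int, float, float]:
    """Return (count, sum, mean) of the values; the mean is NaN for empty input."""
    values = [float(line.rstrip("\n").split(sep)[1]) for line in lines if line.strip()]
    total = math.fsum(values)  # fsum limits accumulated floating-point error
    mean = total / len(values) if values else math.nan
    return len(values), total, mean


sample = ["id,temperature\n", "a,21.5\n", "b,22.5\n", "c,25.0\n"]
print(reduce_stats(map_total(sample)))  # (3, 69.0, 23.0)
```

The trailing digits in `sum,1751648.8999999953` above are typical of error accumulated by naive floating-point summation; `math.fsum` is one way to reduce it.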
