# MapReduce

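MapReduce is a programming technique for analyzing large amounts of data across a cluster of machines. In this Jupyter notebook, you'll get a sense of how Hadoop MapReduce works. Note, however, that this notebook runs locally rather than on a cluster.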

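The biggest difference between Hadoop and Spark is that Spark does as much of the computation as possible in memory, which avoids moving data back and forth across the cluster. Hadoop writes intermediate results to disk, which tends to be slower. Hadoop is an older technology than Spark, but it is also a milestone in big data technology.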

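If you click the Jupyter notebook icon at the top of the workspace, you'll be taken to the workspace folder, where you'll see a file called "songplays.txt". This is a text file in which each line represents a song that was played in the Sparkify app. The MapReduce code will count how many times each song was played. In other words, it counts how many times each song title appears in the list.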


# MapReduce vs. Hadoop MapReduce

The two terms are similar, but they don't mean the same thing! MapReduce is a programming technique; Hadoop MapReduce is one specific implementation of that technique.
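
To make the distinction concrete, here is a minimal sketch of the MapReduce technique in plain Python, with no Hadoop involved. The names `run_map_reduce`, `song_mapper`, and `song_reducer` are made up for this illustration, and everything runs in memory on one machine, so it shows only the shape of the technique, not how Hadoop distributes the work.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run a map, group-by-key, and reduce pass over records in memory."""
    grouped = defaultdict(list)
    # Map: every record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    # Reduce: each key is handled once, together with all of its values.
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


def song_mapper(song):
    yield (song, 1)


def song_reducer(song, counts):
    yield (song, sum(counts))


# Three made-up plays, just to exercise the functions.
plays = ["Deep Dreams", "Data House Rock", "Deep Dreams"]
song_counts = run_map_reduce(plays, song_mapper, song_reducer)
print(song_counts)  # {'Deep Dreams': 2, 'Data House Rock': 1}
```

An implementation such as Hadoop MapReduce follows the same pattern but spreads the map and reduce work across many machines.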

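Some of the syntax may look a bit odd, so be sure to read the explanation and comments for each section. You'll learn more about the syntax later in the course.

Run each of the code cells below and look at the output.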

In [None]:
# Install mrjob library. This package is for running MapReduce jobs with Python
# In Jupyter notebooks, "!" runs terminal commands from inside notebooks 

! pip install mrjob

In [None]:
%%file wordcount.py
# %%file is an IPython cell magic that saves the contents of this cell to a file

from mrjob.job import MRJob # import the mrjob library

class MRSongCount(MRJob):
    
    # the map step: each line in the txt file is read as a key, value pair
    # in this case, each line in the txt file only contains a value but no key
    # _ means that in this case, there is no key for each line
    def mapper(self, _, song):
        # output each line as a tuple of (song_names, 1) 
        yield (song, 1)

    # the reduce step: combine all tuples with the same key
    # in this case, the key is the song name
    # then sum all the values of the tuple, which will give the total song plays
    def reducer(self, key, values):
        yield (key, sum(values))
        
if __name__ == "__main__":
    MRSongCount.run()

In [None]:
# run the code as a terminal command
! python wordcount.py songplays.txt
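
When the job finishes, it prints one line per song. Assuming the job uses mrjob's default output format (a JSON-encoded key, a tab, then a JSON-encoded value), a line can be read back into Python roughly as in the sketch below. The helper `parse_output_line` and the sample lines are illustrative and not part of the notebook; the two counts are the ones quoted in the summary in this notebook.

```python
import json


def parse_output_line(line):
    """Split one 'key<TAB>value' output line and decode both halves from JSON."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)


# Hypothetical lines in the shape the job is assumed to print.
sample_output = ['"Deep Dreams"\t1131\n', '"Data House Rock"\t510\n']
parsed = [parse_output_line(line) for line in sample_output]
print(parsed)  # [('Deep Dreams', 1131), ('Data House Rock', 510)]
```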

# Summary of what happens in the code

songplays.txt contains a list of songs like this:

Deep Dreams
Data House Rock
Deep Dreams
Data House Rock
Broken Networks
Data House Rock
etc.....

In the map step, the code reads the text file one line at a time and produces a series of tuples like this:

(Deep Dreams, 1)  
(Data House Rock, 1)  
(Deep Dreams, 1)  
(Data House Rock, 1)  
(Broken Networks, 1)  
(Data House Rock, 1)  
etc.....
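
The map step can be imitated in plain Python. This is only a sketch: `map_song` is a hypothetical stand-in for the `mapper` method, and `sample_lines` holds just the six song titles listed above rather than the full file.

```python
def map_song(line):
    """Emit one (song, 1) pair for a single line of the play log."""
    yield (line.rstrip("\n"), 1)


# The six sample lines shown above, as they would be read from a text file.
sample_lines = [
    "Deep Dreams\n",
    "Data House Rock\n",
    "Deep Dreams\n",
    "Data House Rock\n",
    "Broken Networks\n",
    "Data House Rock\n",
]

mapped = [pair for line in sample_lines for pair in map_song(line)]
for pair in mapped:
    print(pair)
```

Each input line produces exactly one pair, so the mapper does not need to know anything about the other lines. That independence is what lets the map step be split across machines.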

Finally, the tuples are grouped by key, and the reduce step sums the values for each key. After grouping, each song name is paired with all of its values:

(Deep Dreams, \[1, 1, 1, 1, 1, 1, ... \])  
(Data House Rock, \[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\])  
(Broken Networks, \[1, 1, 1, ...\])  
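
One way to picture the grouping is to sort the mapped pairs by key and then collect neighbors that share a key. The sketch below does this with `itertools.groupby` on the six sample pairs. It illustrates the idea only and makes no claim about how mrjob or Hadoop group keys internally.

```python
from itertools import groupby
from operator import itemgetter

# The (song, 1) pairs for the six sample lines.
mapped = [
    ("Deep Dreams", 1),
    ("Data House Rock", 1),
    ("Deep Dreams", 1),
    ("Data House Rock", 1),
    ("Broken Networks", 1),
    ("Data House Rock", 1),
]

# Sort by key so equal keys sit next to each other, then group them.
mapped_sorted = sorted(mapped, key=itemgetter(0))
grouped = [
    (song, [count for _, count in pairs])
    for song, pairs in groupby(mapped_sorted, key=itemgetter(0))
]
print(grouped)
# [('Broken Networks', [1]), ('Data House Rock', [1, 1, 1]), ('Deep Dreams', [1, 1])]
```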

Summing each list gives the output:

(Deep Dreams, 1131)  
(Data House Rock, 510)  
(Broken Networks, 828)  
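
As a sanity check on the reduce step, the sketch below sums the grouped values for the six sample lines and compares the result with a direct count from `collections.Counter`. The totals are much smaller than the ones above because only six lines are used here; the variable names are illustrative.

```python
from collections import Counter

# Grouped values for the six sample lines (not the full songplays.txt file).
grouped = {
    "Deep Dreams": [1, 1],
    "Data House Rock": [1, 1, 1],
    "Broken Networks": [1],
}

# Reduce: sum the list of ones for each song.
reduced = {song: sum(ones) for song, ones in grouped.items()}

# Cross-check against counting the titles directly.
plays = [
    "Deep Dreams",
    "Data House Rock",
    "Deep Dreams",
    "Data House Rock",
    "Broken Networks",
    "Data House Rock",
]
direct_counts = dict(Counter(plays))

print(reduced)  # {'Deep Dreams': 2, 'Data House Rock': 3, 'Broken Networks': 1}
print(reduced == direct_counts)  # True
```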