<font color="#483D8B">
<h1  align="center">Map Reduce</h1>
<h2  align="center">Lab 2</h2>
<h4 align="center">
INET4710 Spring 2018<br>
Submitted by (your name here)</h4>
</font>

---------------

Lab Objectives
- Understand the map-reduce programming framework.
- Learn how to write a simple map-reduce pipeline in Python (single input, single output).



---------------
### Map Reduce Program:  Word Count

Modified from  “A Simple Example in Python” <br> at
<https://www.princeton.edu/researchcomputing/computational-hardware/hadoop/mapred-tut/>

**1. Create a text file containing the text of the poem**
[The Road Not Taken](https://www.poetryfoundation.org/poems/44272/the-road-not-taken)

**2. In the code section below, write code to read the file and write the mapper function. The code should:**
- Count words
- Run successfully
- Remove punctuation
- Convert text to lower case
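One detail worth checking for the "remove punctuation" requirement: `string.punctuation` contains ASCII characters only, so typographic marks such as an em dash or a curly apostrophe pass through untouched. The following is a minimal sketch of a Unicode-aware alternative, assuming it is acceptable to delete every character whose Unicode category starts with `P`. The helper names `normalize` and `tokenize` are hypothetical and not part of the lab template.

```python
import unicodedata


def normalize(line):
    """Lower-case a line and drop every character Unicode classifies as punctuation."""
    kept = (ch for ch in line if not unicodedata.category(ch).startswith("P"))
    return "".join(kept).lower()


def tokenize(line):
    """Split a normalized line on whitespace."""
    return normalize(line).split()


# "\u2014" is an em dash, which string.punctuation does not contain.
print(tokenize("Two roads diverged in a wood, and I\u2014"))
```

Because punctuation is deleted rather than replaced by a space, two words joined only by a dash would be merged into one token. Replacing with a space instead is an equally reasonable choice.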


These URLs may be helpful:<br>
[stdin encoding](http://stackoverflow.com/questions/16549332/python-3-how-to-specify-stdin-encoding)<br>
[map reduce example](https://github.com/cielavenir/procon/blob/master/hackerrank/map-reduce-advanced-count-number-of-friends.py)<br>
[python dict](https://www.tutorialspoint.com/python/python_dictionary.htm)

In [None]:
import string


class MapReduce:
    def __init__(self):
        self.intermediate = dict()
        self.result = []

    def emitIntermediate(self, key, value):
        self.intermediate.setdefault(key, [])
        self.intermediate[key].append(value)

    def emit(self, value):
        self.result.append(value)

    def execute(self, data, mapper, reducer):
        for record in data:
            mapper(record)

        for key in sorted(self.intermediate.keys()):
            # print(key, self.intermediate[key])
            reducer(key, self.intermediate[key])

        self.result.sort()
        for item in self.result:
            print(item)


mapReducer = MapReduce()


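def mapper(line):
    """
    Split a line into normalized words and emit (word, 1) for each.
    :param line: one line of input text
    :return: None; pairs are collected through mapReducer.emitIntermediate
    """
    # string.punctuation covers ASCII only, so characters such as the em dash survive
    for c in string.punctuation:
        line = line.replace(c, "")
    line = line.strip().lower()
    for word in line.split():
        mapReducer.emitIntermediate(word, 1)


def reducer(key, list_of_values):
    """
    Sum the counts collected for one word and emit (word, total).
    :param key: the word
    :param list_of_values: list of 1s, one per occurrence of the word
    :return: None; the result is collected through mapReducer.emit
    """
    mapReducer.emit((key, sum(list_of_values)))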


if __name__ == '__main__':
    # Read the poem one line per record; the context manager closes the file.
    with open("/Users/reiven/Documents/Python/CSCI4710/Lab2/TheRoadNotTaken.txt", "r") as file:
        inputData = file.readlines()
    mapReducer.execute(inputData, mapper, reducer)


('a', 3)
('about', 1)
('ages', 2)
('all', 1)
('and', 9)
('another', 1)
('as', 5)
('back', 1)
('be', 2)
('because', 1)
('bent', 1)
('better', 1)
('black', 1)
('both', 2)
('by', 1)
('claim', 1)
('come', 1)
('could', 2)
('day', 1)
('difference', 1)
('diverged', 2)
('doubted', 1)
('down', 1)
('equally', 1)
('ever', 1)
('fair', 1)
('far', 1)
('first', 1)
('for', 2)
('grassy', 1)
('had', 2)
('has', 1)
('having', 1)
('hence', 1)
('how', 1)
('i', 8)
('if', 1)
('in', 4)
('it', 2)
('i—', 1)
('just', 1)
('kept', 1)
('knowing', 1)
('lay', 1)
('leads', 1)
('leaves', 1)
('less', 1)
('long', 1)
('looked', 1)
('made', 1)
('morning', 1)
('no', 1)
('not', 1)
('oh', 1)
('on', 1)
('one', 3)
('other', 1)
('passing', 1)
('perhaps', 1)
('really', 1)
('roads', 2)
('same', 1)
('shall', 1)
('should', 1)
('sigh', 1)
('somewhere', 1)
('sorry', 1)
('step', 1)
('stood', 1)
('telling', 1)
('that', 3)
('the', 8)
('them', 1)
('then', 1)
('there', 1)
('this', 1)
('though', 1)
('to', 2)
('took', 2)
('travel', 1)
('trave

---------------
### Map Reduce Program:  Natural Language Toolkit

Following along with the tutorial:
Let us consider a more advanced example, which is based on word counting. A typical text mining or natural language processing task involves the following steps:
 
1. Tokenization
2. Text normalization
3. Stop word removal
4. Stemming
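The four steps above can be sketched in plain Python before bringing in NLTK. This is only an illustration of the order of operations: `STOP_WORDS` is a hypothetical, deliberately tiny stop list, and `toy_stem` is a crude suffix stripper standing in for a real stemmer. Neither reproduces what NLTK's stopwords corpus or its Snowball stemmer would return.

```python
import re

# Hypothetical, deliberately tiny stop list; NLTK's English list is far longer.
STOP_WORDS = {"a", "and", "i", "in", "the", "to"}


def toy_stem(word):
    """Crude suffix stripping, standing in for a real stemmer such as Snowball."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)               # 1. tokenization
    tokens = [t.lower() for t in tokens]                   # 2. text normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 3. stop word removal
    return [toy_stem(t) for t in tokens]                   # 4. stemming


print(preprocess("Two roads diverged in a yellow wood, and I looked"))
# ['two', 'road', 'diverg', 'yellow', 'wood', 'look']
```

Stems such as `diverg` are not dictionary words. That is expected: a stemmer only needs to map related word forms to the same key so that they are counted together.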

For more information about NLTK, browse to https://pypi.python.org/pypi/nltk

NLTK is already installed as part of the Anaconda installation; however, we need to download the stopwords corpus before running the tutorial NLTK program. Please refer to the lab 2 assignment for instructions about downloading the stopwords corpus.

For this section:

1. Download the stopwords corpus.
2. Modify the NLTK code given in the lab 2 assignment to run in the Map Reduce class for this lab.
3. Run the code on "The Road Not Taken" poem file.

In [None]:
import re
import string

import nltk.stem
from nltk.corpus import stopwords


class MapReduce:
    def __init__(self):
        self.intermediate = dict()
        self.result = []

    def emitIntermediate(self, key, value):
        self.intermediate.setdefault(key, [])
        self.intermediate[key].append(value)

    def emit(self, value):
        self.result.append(value)

    def execute(self, data, mapper, reducer):
        for record in data:
            mapper(record)

        for key in sorted(self.intermediate.keys()):
            #            print(key, self.intermediate[key])
            reducer(key, self.intermediate[key])

        self.result.sort()
        for item in self.result:
            print(item)


mapReducer = MapReduce()


# Build the stop-word set and the stemmer once rather than once per line.
STOPS = set(stopwords.words('english'))
STEMMER = nltk.stem.SnowballStemmer("english")


def mapper(line):
    # remove punctuation
    for c in string.punctuation:
        line = line.replace(c, "")

    # remove numbers (re.sub returns a new string, so the result must be kept)
    line = re.sub(r"\d+", "", line)

    line = line.strip().lower()
    for word in line.split():
        # skip stop words before stemming
        if word in STOPS:
            continue
        word = STEMMER.stem(word)
        if word:
            mapReducer.emitIntermediate(word, 1)


def reducer(key, list_of_values):
    mapReducer.emit((key, len(list_of_values)))


if __name__ == '__main__':
    # Run once if the stop-word corpus has not been downloaded yet:
    # nltk.download('stopwords')

    # Read the poem one line per record; the context manager closes the file.
    with open("/Users/reiven/Documents/Python/CSCI4710/Lab2/TheRoadNotTaken.txt", "r") as file:
        inputData = file.readlines()
    mapReducer.execute(inputData, mapper, reducer)


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


---------------
### Map Reduce Program:  pyspark Word Count


1. Modify the code at
https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
2. Run the code on text read from the "The Road Not Taken" file

The code should:
- Count words using pyspark
- Remove punctuation
- Convert text to lower case
- Run successfully
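The linked `wordcount.py` example expresses the count as a chain of `flatMap`, `map`, and `reduceByKey`. The sketch below imitates that chain in plain Python so that each step can be inspected without starting Spark. It is not pyspark code: `reduce_by_key` is a hypothetical stand-in that folds the values of each key pairwise, which is roughly how `reduceByKey(add)` combines counts, and the two sample lines are made up for the illustration.

```python
from operator import add


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: fold the values of each key with func."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = func(combined[key], value)
        else:
            combined[key] = value
    return combined


lines = ["Two roads diverged", "two roads"]

# flatMap: each line yields several words, flattened into one sequence
words = [word for line in lines for word in line.lower().split()]

# map: each word becomes a (word, 1) pair
pairs = [(word, 1) for word in words]

# reduceByKey(add): the 1s that share a key are added together
counts = reduce_by_key(pairs, add)
print(counts)
# {'two': 2, 'roads': 2, 'diverged': 1}
```

Unlike the `reducer` in the `MapReduce` class earlier in this lab, which receives the whole list of values for a key, the function passed to `reduceByKey` only ever sees two values at a time, so it should be associative and commutative.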

In [2]:
import sys
import re
from operator import add
import string

from pyspark.sql import SparkSession


def remove_punctuation(word):
    for c in string.punctuation:
        word = word.replace(c, "")
    return word

if __name__ == "__main__":
    # if len(sys.argv) != 2:
    #     print("Usage: wordcount <file>", file=sys.stderr)
    #     exit(-1)

    spark = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("lab2")\
        .getOrCreate()

    lines = spark.read.text("/Users/reiven/Documents/Python/CSCI4710/Lab2/TheRoadNotTaken.txt").rdd.map(lambda r: r[0])
    # remove punctuation, lowercase, and split on any whitespace;
    # split() with no argument never yields empty strings, unlike split(' ')
    counts = lines.flatMap(lambda x: remove_punctuation(x).lower().split()) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()