# SIADS 516: Homework 2
Version 2.0.20201020.1
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

# Using the Spark RDD API to analyze text
Data are from 
https://www.kaggle.com/nzalake52/new-york-times-articles

## Objectives
1. To gain familiarity with PySpark
2. To learn the basics of the Spark RDD API
3. To practice solving a real-world problem

## Overview

This project was inspired by an actual event experienced by a UMSI student. The student was applying for a job with a large multi-national corporation (let's call it XYZ, Inc.). XYZ, Inc. was looking for someone who could conduct an analysis of a massive (terabyte-size) text dataset. They had heard about Spark and planned to investigate it, but had not yet found anyone internally with the skill set required to tackle the problem. The UMSI student indicated that they had experience with Spark and could likely handle the task. The hiring supervisor then provided a non-Spark script and asked the student to demonstrate how that script could be translated to work in a Spark environment. The student completed the conversion and has secured a job at XYZ, Inc., pending completion of their degree.

This assignment simulates that exact situation. In this assignment you will take a Python-based script that does
part-of-speech tagging on a large dataset and convert it, as much as possible, to use a PySpark-based approach.

The original script was written by Luke Petschauer and a forked version is available at https://github.com/umsi-data-science/NP_chunking_with_nltk/blob/master/NP_chunking_with_the_NLTK.ipynb. That page provides a detailed explanation of the original code and an excellent overview and justification for the use of
part-of-speech tagging and a super-gentle introduction to Natural Language Processing (NLP).  **You should read through and study that notebook before you start this assignment.** The code from the "Final Code" section is reproduced in the first code cell below, which you should run.

Run and study that cell and review https://github.com/umsi-data-science/NP_chunking_with_nltk/blob/master/NP_chunking_with_the_NLTK.ipynb.  
You will take a similar approach to analyze news articles from the New York Times using PySpark. There are two tasks (Task 1 and Task 2) to complete below.
**During development, create and use a short dataset based on a small fraction of the complete
dataset; when you're happy with your code, run it on the complete dataset.**
The complete analysis should take about 10 minutes to run.
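
One way to build such a development subset is to copy the first few hundred lines of the full file into a smaller file and load that file instead. The sketch below uses only the standard library; the function name `make_dev_subset`, the destination file name, and the line count are illustrative assumptions, not part of the assignment.

```python
from itertools import islice
from pathlib import Path


def make_dev_subset(src, dst, n_lines=500):
    """Copy the first n_lines lines of src into dst; return how many were written."""
    written = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in islice(fin, n_lines):
            fout.write(line)
            written += 1
    return written


# Hypothetical usage: the source path is the one this notebook loads;
# the destination file name is an assumption.
src = Path("data/nytimes/nytimes_news_articles.txt")
if src.exists():
    make_dev_subset(src, "data/nytimes/nytimes_sample.txt", n_lines=500)
```

Taking the leading lines keeps whole articles together. A random subset could instead be drawn inside Spark with `RDD.sample(withReplacement, fraction, seed)` once the data is loaded as an RDD.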


### Load the required nltk and other libraries

In [2]:
import nltk
nltk.download('book') # NOTE: this should be unnecessary for Coursera image (should be preloaded)
import re
import pprint
from nltk import Tree

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/macarthur/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/macarthur/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/macarthur/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/macarthur/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/macarthur/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/macarthur/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!

### The next cell is the original (non-Spark) script

In [3]:
# This is the original (non-Spark) script

patterns = """
    NP: {<JJ>*<NN*>+}
    {<JJ>*<NN*><CC>*<NN*>+}
    """

NPChunker = nltk.RegexpParser(patterns)

def prepare_text(input):
    sentences = nltk.sent_tokenize(input)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences


def parsed_text_to_NP(sentences):
    nps = []
    for sent in sentences:
        tree = NPChunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps


def sent_parse(input):
    sentences = prepare_text(str(input))
    nps = parsed_text_to_NP(sentences)
    return nps


text_to_be_analyzed = """WASHINGTON - Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.
"We were going to ride our pitching," Manager Terry Collins said before Wednesday’s game. "But we're not riding it right now. We've got as many problems with our pitching as we do anything."
Wednesday's 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz's place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets' lineup to overcome against Max Scherzer, the Nationals' starter.
"We're not even giving ourselves chances," Collins said, adding later, "We just can’t give our pitchers any room to work."
The Mets did not score until the ninth inning, when a last-gasp two-run homer by James Loney off Nationals reliever Shawn Kelley snapped a streak of 23 scoreless innings for the team."""


nps = sent_parse(text_to_be_analyzed)
print(nps)

['Stellar pitching', 'afloat', 'first half', 'last season', 'encore', 'pennant-winning season', 'lineup', 'pitching', 'thin', 'pitching', 's game', 'pitching', 'anything', '4-2 loss', 'place', 'spot starter', 'deficit', 'lineup', 'starter', 'room', 'ninth inning', 'last-gasp two-run homer', 'reliever', 'streak', 'team']


In [4]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('SIADS 516 Homework 2') \
    .getOrCreate() 

sc = spark.sparkContext

In [35]:
text = sc.textFile('data/nytimes/nytimes_news_articles.txt')
# show the first 2 lines of the file
text.take(2)

['URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html',
 '']

In [31]:
TOKEN_RE = re.compile(r"\b[\w']+\b")
def pos_tag_counter(line):
    toks = nltk.regexp_tokenize(line, TOKEN_RE)
    postoks = nltk.tag.pos_tag(toks)

    return postoks
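
To see what the tokenizer pattern does before any tagging happens, the sketch below applies the same regular expression with the standard library's `re.findall`. It assumes `nltk.regexp_tokenize` behaves like `findall` for this pattern, which is worth confirming in your own environment. The sample phrases are taken from the article text above.

```python
import re

TOKEN_RE = re.compile(r"\b[\w']+\b")

# A straight apostrophe inside a word is kept; sentence punctuation is dropped.
print(TOKEN_RE.findall("We're not riding it right now."))
# ["We're", 'not', 'riding', 'it', 'right', 'now']

# A trailing possessive apostrophe is dropped: the match must end on a word boundary.
print(TOKEN_RE.findall("the Mets' lineup"))
# ['the', 'Mets', 'lineup']

# A curly apostrophe (U+2019) is not in the character class, so the word splits in two.
print(TOKEN_RE.findall("Wednesday\u2019s game"))
# ['Wednesday', 's', 'game']
```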

### Task 1: Create an RDD pipeline (i.e. sequence of transformations) that:
1. filters out blank lines 
2. filters out lines starting with 'URL'
3. creates a single list (using flatMap) that applies the pos_tag_counter function to each line
4. maps each resulting line to show the part of speech (which is the second element returned from the pos_tag_counter)
5. converts each resulting line to a pairRDD with POS tags as keys and values of 1
6. reduces the resulting RDD by key, adding up all the 1s (like the lecture and lab examples)
7. sorts the resulting list by the counts, in descending order.

Your output should look something like:
```
[('NN', 5628),
 ('IN', 4690),
 ('NNP', 4575),
 ('DT', 3913),
 ('JJ', 2550),
 ('NNS', 2345),
 ('VBD', 1931),
 ('RB', 1312),
 ('PRP', 1227),
 ('VB', 1170),
 ('CC', 1086),
 ('TO', 1043),
 ('VBN', 935),
 ('VBG', 895)...
```
In other words, the count of each part-of-speech tag sorted in descending order.
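
The seven steps can be traced on a few lines of text without Spark. The following plain-Python sketch mirrors each RDD transformation with a standard-library equivalent. `toy_tagger` and its tiny lexicon are hypothetical stand-ins for `pos_tag_counter`, so the tags it produces are illustrative only.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for pos_tag_counter: looks tags up in a tiny hand-made lexicon.
TOY_LEXICON = {"the": "DT", "a": "DT", "mets": "NNP", "lost": "VBD", "game": "NN", "pitcher": "NN"}


def toy_tagger(line):
    """Return (word, tag) pairs; unknown words default to 'NN'."""
    return [(word, TOY_LEXICON.get(word.lower(), "NN")) for word in line.split()]


lines = ["URL: http://example.com/article", "", "The Mets lost the game", "a pitcher lost"]

# Steps 1-2 (filter): drop blank lines and lines that start with 'URL'.
kept = [ln for ln in lines if ln and not ln.startswith("URL")]
# Step 3 (flatMap): one flat list of (word, tag) pairs across all lines.
tagged = list(chain.from_iterable(toy_tagger(ln) for ln in kept))
# Step 4 (map): keep only the tag, the second element of each pair.
tags = [tag for _word, tag in tagged]
# Steps 5-6 (map to (tag, 1), then reduceByKey): Counter sums one per occurrence of each key.
counts = Counter(tags)
# Step 7 (sortBy): order the (tag, count) pairs by count, descending.
result = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(result)
```

Checking the logic on a handful of lines like this makes it easier to spot mistakes before running the Spark version on the full dataset.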

In [47]:
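# Steps 1 and 2: drop blank lines and lines that start with 'URL'
text_filtered = text.filter(lambda x: len(x) > 0 and not x.startswith('URL'))
# Steps 3 and 4: tag every token, then keep only the POS tag (second element of each pair)
text_pos = text_filtered.flatMap(pos_tag_counter).map(lambda x: x[1])
# Step 5: build a pair RDD of (tag, 1)
text_pos_kv = text_pos.map(lambda x: (x, 1))
# Steps 6 and 7: sum the 1s per tag, then sort by count in descending order
text_pos_sorted = text_pos_kv.reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], ascending=False)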

res_preview = text_pos_sorted.take(5)
print('preview first five elements from results: ', res_preview)
res = text_pos_sorted.collect()
print('results: ', res)



[('NN', 1126487), ('IN', 928880), ('NNP', 853054), ('DT', 761466), ('JJ', 498464)]


### Run the next cell before proceeding to Task 2

In [48]:
grammar = r"""
    NBAR:
        {<NN.*|JJS>*<NN.*>}
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""

  
def tokenize_chunk_parse(line):
    chunker = nltk.RegexpParser(grammar)
  
    toks = nltk.regexp_tokenize(line, TOKEN_RE)
    postoks = nltk.tag.pos_tag(toks)

    tree = chunker.parse(postoks)

    return [term for term in leaves(tree)] 
  
def leaves(tree):
    for subtree in tree.subtrees(filter = lambda t: t.label()=='NP'):
        yield subtree.leaves()
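
`tokenize_chunk_parse` returns one list per line, and each element of that list is a noun phrase represented as a list of `(word, tag)` pairs. The hand-written data below imitates that shape (the phrases and tags are made up for illustration, not produced by NLTK). It shows that flattening across lines, as `flatMap` does, yields one element per noun phrase, and that `len` of a phrase counts its words.

```python
from itertools import chain

# Made-up stand-ins for the per-line return values of tokenize_chunk_parse.
line_results = [
    [[("spot", "NN"), ("starter", "NN")], [("innings", "NNS")]],
    [],  # a line with no noun phrases contributes nothing after flattening
    [[("streak", "NN"), ("of", "IN"), ("innings", "NNS")]],
]

# Plain-Python equivalent of flatMap: concatenate the per-line lists into one list of phrases.
phrases = list(chain.from_iterable(line_results))

for phrase in phrases:
    words = [word for word, _tag in phrase]
    # len(phrase) is the number of words; len(word) would be a character count instead.
    print(len(phrase), " ".join(words))
```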

### Task 2:  Create an RDD pipeline to show the distribution of the length of noun phrases
You should wind up with an RDD showing the number of noun phrases of each length, where length is the number of words in the phrase. For example, the following output shows there are 6157 noun phrases of
length 1, 1833 of length 2, 654 of length 3, and so on:
```
[(1, 6157),
 (2, 1833),
 (3, 654),
 (4, 204),
 (5, 65),
 (6, 16),
 (8, 6),
 (7, 4),
 (9, 3)]
```
The steps should be:
1. Apply (using flatMap) the ```tokenize_chunk_parse``` function to each line in the ```text``` RDD
2. Use map to emit the length of each noun phrase
3. Use map to convert each resulting line to a pairRDD with lengths of noun phrases as keys and values of 1
4. Reduce the resulting RDD by key, adding up all the 1s (like the lecture and lab examples)
5. Sort the resulting list by the counts, in descending order.
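
As a cross-check on a development subset, the same distribution can be computed in plain Python. The sketch assumes the noun phrases have already been gathered into a list named `phrases` (a hypothetical name, with made-up contents here). The comments note which step each line mirrors.

```python
from collections import Counter

# Made-up noun phrases, each a list of (word, tag) pairs.
phrases = [
    [("lineup", "NN")],
    [("spot", "NN"), ("starter", "NN")],
    [("pitching", "NN")],
    [("streak", "NN"), ("of", "IN"), ("innings", "NNS")],
    [("room", "NN")],
    [("ninth", "NN"), ("inning", "NN")],
]

lengths = [len(p) for p in phrases]   # step 2: map each phrase to its word count
pairs = [(n, 1) for n in lengths]     # step 3: map to (length, 1) pairs
totals = Counter()
for key, one in pairs:                # step 4: reduce by key with addition
    totals[key] += one
# Step 5: sort the (length, count) pairs by count, descending.
distribution = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(distribution)  # [(1, 3), (2, 2), (3, 1)]
```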

In [168]:
# Step 1: one element per noun phrase; each phrase is a list of (word, tag) pairs
text_np = text.flatMap(tokenize_chunk_parse)
# Step 2: the length of a noun phrase is its number of words
text_np_len = text_np.map(len)
# Step 3: build a pair RDD of (length, 1)
text_np_kv = text_np_len.map(lambda x: (x, 1))
# Steps 4 and 5: sum the 1s per length, then sort by count in descending order
text_np_sorted = text_np_kv.reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], ascending=False)

res2_preview = text_np_sorted.take(5)
print('preview first five elements from results: ', res2_preview)
res2 = text_np_sorted.collect()
print('results: ', res2)
