# SIADS 516: Homework 2
Version 1.0.20200303.1
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

# Using the Spark RDD API to analyze text
Data are from 
https://www.kaggle.com/nzalake52/new-york-times-articles

## Objectives
1. To gain familiarity with PySpark
2. To learn the basics of the Spark RDD API
3. To practice solving a real-world problem

## Overview

This project was inspired by an actual event experienced by a UMSI student. This student was applying for a
job with a large multinational corporation (let's call it XYZ, Inc.). XYZ, Inc. was looking for someone who could
conduct an analysis of a massive (terabyte-size) text dataset. They had heard about Spark and planned to investigate it
but hadn't yet found anyone internally who had the skill set required to tackle the problem. The UMSI student
indicated that they had experience with Spark and could likely handle the task. The hiring supervisor then provided
a non-Spark script and asked the student to demonstrate how that script could be translated to work in a Spark environment.
The student was able to do the conversion and, pending completion of their degree, will have secured a job at XYZ, Inc.

This assignment simulates that exact situation. In this assignment you will take a Python-based script that does
part-of-speech tagging on a large dataset and convert it, as much as possible, to use a PySpark-based approach.

The original script was written by Luke Petschauer and a forked version is available at https://github.com/umsi-data-science/NP_chunking_with_nltk/blob/master/NP_chunking_with_the_NLTK.ipynb. That page provides a detailed explanation of the original code and an excellent overview and justification for the use of
part-of-speech tagging.  The code from the "Final Code" section is reproduced in the first code cell below, which you should run.

Run and study that cell and review https://github.com/umsi-data-science/NP_chunking_with_nltk/blob/master/NP_chunking_with_the_NLTK.ipynb.  
You will be taking a similar approach to analyze news articles from the New York Times using PySpark. There are two tasks (Task 1 and Task 2) to complete below.
**You should create and use a short dataset, based on a small fraction of the complete
dataset, in your development work, and then, when you're happy with your code, run it on the complete dataset.**
The complete analysis should take about 10 minutes to run.
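
The short development dataset mentioned above could be produced in several ways. The sketch below is one possibility using only the standard library: it copies the first few articles of a text file into a smaller file. It assumes that every article begins with a line starting with `URL:` (as in the sample shown later in this notebook); the function name `write_sample`, the file names, and the article count are hypothetical and not part of the assignment.

```python
import os
import tempfile


def write_sample(src_path, dst_path, n_articles=4):
    """Copy the first n_articles articles from src_path to dst_path.

    Assumes each article starts with a line beginning with 'URL:'.
    Returns the number of articles actually written.
    """
    seen = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.startswith("URL:"):
                if seen == n_articles:
                    break  # stop before the next article starts
                seen += 1
            if seen > 0:
                dst.write(line)
    return seen


# Tiny self-contained demonstration using a temporary directory and made-up articles.
demo_dir = tempfile.mkdtemp()
full_path = os.path.join(demo_dir, "full.txt")
mini_path = os.path.join(demo_dir, "mini.txt")

with open(full_path, "w", encoding="utf-8") as f:
    f.write("URL: http://example.com/a\n\nFirst article.\n\n"
            "URL: http://example.com/b\n\nSecond article.\n\n"
            "URL: http://example.com/c\n\nThird article.\n")

written = write_sample(full_path, mini_path, n_articles=2)
with open(mini_path, encoding="utf-8") as f:
    sample_lines = f.read().splitlines()
print(written, sample_lines)
```

Writing the sample with an explicit UTF-8 encoding matters here because `sc.textFile` expects UTF-8 input.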

### Load the required nltk and other libraries

In [74]:
import nltk
# nltk.download('book')  # NOTE: this should be unnecessary for Coursera image (should be preloaded)
import re
import pprint
from nltk import Tree

In [75]:
# These downloads appear to be needed for the cells below to run, so keep them uncommented and in this cell.
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### The next cell is the original (non-Spark) script

In [76]:
# This is the original (non-Spark) script

patterns = """
    NP: {<JJ>*<NN*>+}
    {<JJ>*<NN*><CC>*<NN*>+}
    """


NPChunker = nltk.RegexpParser(patterns)


def prepare_text(input):
    sentences = nltk.sent_tokenize(input)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences



def parsed_text_to_NP(sentences):
    nps = []
    for sent in sentences:
        tree = NPChunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps



def sent_parse(input):
    sentences = prepare_text(str(input))
    nps = parsed_text_to_NP(sentences)
    return nps



text_to_be_analyzed = """WASHINGTON - Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.
"We were going to ride our pitching," Manager Terry Collins said before Wednesday’s game. "But we're not riding it right now. We've got as many problems with our pitching as we do anything."
Wednesday's 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz's place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets' lineup to overcome against Max Scherzer, the Nationals' starter.
"We're not even giving ourselves chances," Collins said, adding later, "We just can’t give our pitchers any room to work."
The Mets did not score until the ninth inning, when a last-gasp two-run homer by James Loney off Nationals reliever Shawn Kelley snapped a streak of 23 scoreless innings for the team."""


nps = sent_parse(text_to_be_analyzed)

print(nps)



['Stellar pitching', 'afloat', 'first half', 'last season', 'encore', 'pennant-winning season', 'lineup', 'pitching', 'thin', 'pitching', 's game', 'pitching', 'anything', '4-2 loss', 'place', 'spot starter', 'deficit', 'lineup', 'starter', 'room', 'ninth inning', 'last-gasp two-run homer', 'reliever', 'streak', 'team']


In [77]:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('SIADS 516 Homework 2') \
    .getOrCreate() 

sc = spark.sparkContext



In [78]:
sc


### Read the data (starting with a smaller development sample)

In [80]:


# For development, process a MUCH smaller amount of text than the full dataset.
# The full dataset would be loaded like this:
#     text = sc.textFile('data/nytimes/nytimes_news_articles.txt')
#     # show the first two lines of the file
#     text.take(2)

# A smaller sample is used instead, so all remaining commands process it faster.
# Once this is proven to work, the full-dataset command (text) above goes back in.





In [81]:

# Development sample of a few articles. Restore the full-dataset commands once the pipeline is known to work.


textfile = """URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html

WASHINGTON — Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.
“We were going to ride our pitching,” Manager Terry Collins said before Wednesday’s game. “But we’re not riding it right now. We’ve got as many problems with our pitching as we do anything.”
Wednesday’s 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz’s place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets’ lineup to overcome against Max Scherzer, the Nationals’ starter.
“We’re not even giving ourselves chances,” Collins said, adding later, “We just can’t give our pitchers any room to work.”
The Mets did not score until the ninth inning, when a last-gasp two-run homer by James Loney off Nationals reliever Shawn Kelley snapped a streak of 23 scoreless innings for the team.
The Mets were swept in the three-game series and fell six games behind the Nationals in the National League East. Of late, the Mets have looked worse than their 40-37 record.
“I don’t think we’ve played half our games yet this year,” right fielder Curtis Granderson said. “There’s still a lot of things left that can and hopefully will happen.”
Scherzer toyed with the Mets, who were initially without Granderson after he was scratched from the lineup with lingering calf tightness. Even though Granderson has been inconsistent this season, he had hit well against Scherzer in the past. Alejandro De Aza, who entered the game with a .165 average, started in right field instead because Collins said the team had few options.
After Scherzer gave up a single to Asdrubal Cabrera and walked Loney in the second inning, he retired the next 18 batters, until an eighth-inning single by Brandon Nimmo.
The Mets struggled again with runners on base. After Nimmo and the pinch-hitting Granderson singled in the eighth, pinch-hitter Travis d’Arnaud grounded out, and De Aza struck out.
“If they keep adding pressure on themselves, they’re going to continue to struggle,” Collins said. “That’s one of the things we try to make sure they have to understand: They have to be themselves.”
General Manager Sandy Alderson, Collins and the coaching staff have met about the offense and discussed the odd dynamics: Some players are performing at or better than their career averages, but the lineup as a whole has struggled immensely, especially with runners in scoring position.
“We’re just not driving in any runs,” Collins said. “That’s been the frustrating part. It’s not that we’re striking out. We’re popping up, or a double-play ball.”
The Mets have a power-hitting team, so asking players to bunt or hit and run would go against their strengths.
“When you start to change a team that’s built one way and start to make them do something different, you’re going to get your butt beat,” Collins said.
Earlier in the season, the Mets appeared like an all-or-nothing, home-run-driven team. Although they hit only .211 as a team in May, they smashed 40 home runs. They have a higher average in June, but they have hit only 24 homers, and the inconsistent offense has put a strain on the pitching staff.
In the second inning, Verrett gave up a solo home run to the ex-Met Daniel Murphy. Collins wanted to limit the workloads of Addison Reed and Jeurys Familia, so he turned to reliever Sean Gilmartin in the eighth. Gilmartin gave up a two-run homer to Murphy, who has hit .429 (15 for 35) with four home runs against the Mets this season, his first since leaving the team.
“I felt like I kept us in the game and gave us a chance to come back and win it,” Verrett said. “I wish that I wouldn’t have given up the two runs.”
Verrett was put in this position because of the effects of bone spurs on the Mets’ rotation. The team asked Verrett to start Wednesday and gave Matz an extra day of rest after he received anti-inflammatory medication for the large bone spur in his left elbow. He will try to pitch through it.
Noah Syndergaard has a smaller and less intrusive bone spur in the back of his right elbow.
“As long as I’m staying on my anti-inflammatories and my mechanics are on point, I’m able to go out there every five days and compete,” Syndergaard said.
For the Mets, the immediate road ahead will be even tougher. Matz was expected to pitch Thursday against the Chicago Cubs, one of the best teams in baseball this season.

URL: http://www.nytimes.com/2016/06/30/nyregion/mayor-de-blasios-counsel-to-leave-next-month-to-lead-police-review-board.html

Mayor Bill de Blasio’s counsel and chief legal adviser, Maya Wiley, is resigning next month from her City Hall position to become the chairwoman of the Civilian Complaint Review Board, New York City’s independent oversight agency for the Police Department.
The move represents the latest shake-up for the de Blasio administration amid continuing state and federal investigations into the mayor’s fund-raising, and fills a two-month vacancy at the police review board created by the resignation of its chairman, Richard D. Emery, in April.
A civil rights lawyer and advocate for racial and social justice, Ms. Wiley joined the de Blasio administration in early 2014 to focus on legal issues as well as on the mayor’s efforts to address issues of inequality. But over time, Ms. Wiley became discouraged over not being part of Mr. de Blasio’s inner circle and felt cut out of both legal questions and advocacy, according to a person familiar with her thinking. On the former, Mr. de Blasio often relied instead on the city’s corporation counsel and Henry Berger, the mayor’s special counsel; on the latter, he favored his top political aides. The person requested anonymity to discuss private conversations.
More recently, Ms. Wiley was assigned to help craft the administration’s legal response to the state and federal inquiries as well as to requests for the public disclosure of documents, notably emails between Mr. de Blasio and trusted advisers outside the administration.
It was in response to a question from reporters about the withholding of those emails with advisers that Ms. Wiley, defending the practice, described the advisers as “agents of the city” — a designation that appeared novel and resulted in days of unfavorable press coverage.
In a statement on Wednesday, Mr. de Blasio thanked Ms. Wiley for her service and congratulated her on her new role.
The review board investigates allegations of misconduct by officers and makes recommendations for discipline to the Police Department. Its data on the number of complaints, and its investigations of officers, provide an important barometer of police behavior and a politically important one for Mr. de Blasio, a Democrat who campaigned on improving police-community relations.
Mr. Wiley will also take a position at the New School in Manhattan. Her moves were reported by The Wall Street Journal.
The announcement of Ms. Wiley’s departure from City Hall followed that of a recently hired director of social media, Scott Kleinberg, who resigned on Saturday, just eight weeks after being hired to bring greater personality to the Twitter, Facebook and other online accounts associated with the mayor’s office. His resignation was reported by DNA Info.
In a Facebook post that was later removed, Mr. Kleinberg complained of long hours and micromanagement and described his experience with the administration as working with “political hacks plus a boss who just couldn’t get it,” adding, “It was a bad combination for sure.” Mr. Kleinberg declined to comment.
The departures came less than two months after Karen Hinton, the mayor’s top spokeswoman, announced her resignation from the administration. (She stayed in the position until mid-June.)

URL: http://www.nytimes.com/2016/06/30/nyregion/three-men-charged-in-killing-of-cuomo-administration-lawyer.html

In the early morning hours of Labor Day last year, a group of gunmen from the 8-Trey street gang made their way through a crowd of revelers gathered near a Brooklyn public housing project to celebrate J’ouvert, a pre-dawn party that precedes the annual West Indian American Day Parade. The housing project was “enemy territory,” the authorities said, the stronghold of a rival gang, the Folk Nation.
As hundreds of people drank and danced in costume, the warring factions spotted each other, and a gunfight broke out in the darkness. Caught in the crossfire was an up-and-coming lawyer in the administration of Gov. Andrew M. Cuomo, Carey Gabay, who was at the festivities with his brother. Mr. Gabay, 43, who was of Jamaican heritage, died a week later from his wounds.
For some in government and law enforcement circles, the death of Mr. Gabay, who had risen from a childhood in Bronx public housing to Harvard Law School and then to public service, was emblematic of the havoc that street gangs have inflicted on New York City residents.
On Wednesday, as part of a continuing investigation, the authorities announced that three men had been indicted in the killing.
One of the men, Micah Alleyne, 24, was already in custody, having been indicted last month. The other two — Tyshawn Crawford, 21, and Keith Luncheon, 24 — were named along with Mr. Alleyne in a new indictment on Wednesday, which holds all three equally responsible for Mr. Gabay’s murder. Each of the men is charged with second-degree murder.
“These defendants are charged with creating a killing field in a crowd of innocent people, showing depraved indifference to human life and causing the death of Carey Gabay,” Ken Thompson, the Brooklyn district attorney, said in a statement about the case.
At a news conference at his office, Mr. Thompson said that 20 gang members, armed with as many as 27 guns, had taken part in the gunfight in which dozens of shots were fired. He promised to seek indictments against the others involved in the episode. “This is just the beginning,” he said.
At the time of his death, Mr. Gabay was first deputy counsel for the Empire State Development Corporation and lived with his wife in the Clinton Hill section of Brooklyn. He was shot in the head while trying to escape the gunfire by ducking between two cars in a parking lot at the Ebbets Field Houses on Bedford Avenue in Crown Heights.
The housing complex had long been Folk Nation turf, Mr. Thompson said, and several of the gang’s members were among the crowd on an outdoor patio there during the J’ouvert celebration. Most of the shots fired that morning came from the direction of the patio, Mr. Thompson added. Members of both the Folk Nation and the 8-Treys had standing orders, he said, to open fire at each other on sight.
“When these two gangs see each other,” Mr. Thompson said, “there is no talking, just shooting.”
Surveillance-camera footage taken that morning, which Mr. Thompson played at the news conference, showed people frantically fleeing the patio and scurrying for cover as shots rang out. Investigators also discovered footage that Mr. Thompson said shows Mr. Alleyne firing a gun into the crowd, then running into a building at the Ebbets Field Houses.
Mr. Alleyne, who was living in a homeless shelter in Jamaica, Queens, when he was arrested in May, was a Folk Nation member, Mr. Thompson said. He added that Mr. Crawford, who lives in East New York, belonged to the Hoodstarz gang, which is allied with the Folk Nation. Mr. Thompson described Mr. Luncheon, who lives in Crown Heights, as a member of the 8-Trey gang, a faction of the Crips.
A fourth defendant, Stanley Elianor, 25, was indicted in October on weapons charges in connection with Mr. Gabay’s murder for carrying a MAC-10 machine pistol during the melee.
Police Commissioner William J. Bratton, appearing with Mr. Thompson at the news conference, said his department was targeting street gangs across the city, adding that officers had arrested gang members in three separate operations in Manhattan on Wednesday morning. Mr. Bratton also said that this year, for the first time, J’ouvert organizers would have to secure a permit for their event, which has traditionally taken place without official city approval. The police, the commissioner said, would increase their presence at the festivities in September to ensure there was no violence.
“If so much as a sneeze is made by these gangs in the run-up to the festival, we will be there,” Mr. Bratton said, “and not to say, ‘gesundheit.’ ”

URL: http://www.nytimes.com/2016/06/30/nyregion/tekserve-precursor-to-the-apple-store-to-close-after-29-years.html

It was the Apple Store in New York City before there was such a thing as an Apple Store.
Before iPods and iPads and iPhones, before Apple started selling and servicing its devices out of a glass cube on Fifth Avenue, the eclectic Tekserve store on West 23rd Street in Manhattan was where customers went for upgrades to their PowerBook laptops or to have their computers fixed.
But times have changed, Tekserve’s managers said, and on Wednesday, they announced that the company was closing its retail and customer-service operation. The service center will remain open until July 31, and the retail store will close on Aug. 15. About 70 employees will lose their jobs, the company said.
“This is a cultural shift,” the company’s chief executive, Jerry Gepner, said in an interview in his office above the store. “It’s not a failure of the business. It’s like this giant wave finally crashed down upon us.”
"""


In [82]:

# Restore this once the pipeline is working properly:
# text = sc.textFile('data/nytimes/nytimes_news_articles.txt')
# # show the first two lines of the file
# text.take(2)

# Development sample; replace with the full dataset above once the pipeline is working properly.
text = sc.textFile('data/nytimes/minitext.txt')
# show the first two lines of the file
text.take(2)



['URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html',
 '']

In [83]:

text.take(10)


['URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html',
 '',
 'WASHINGTON — Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.',
 '“We were going to ride our pitching,” Manager Terry Collins said before Wednesday’s game. “But we’re not riding it right now. We’ve got as many problems with our pitching as we do anything.”',
 'Wednesday’s 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz’s place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets’ lineup to overcome against Max Scherzer, the Nationals’ starter.',
 '“We’re not even giving ourselves chances,” Collins said, adding later, “We just 

In [84]:

text.collect()


['URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html',
 '',
 'WASHINGTON — Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.',
 '“We were going to ride our pitching,” Manager Terry Collins said before Wednesday’s game. “But we’re not riding it right now. We’ve got as many problems with our pitching as we do anything.”',
 'Wednesday’s 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz’s place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets’ lineup to overcome against Max Scherzer, the Nationals’ starter.',
 '“We’re not even giving ourselves chances,” Collins said, adding later, “We just 

In [86]:
print(text.take(20))

['URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html', '', 'WASHINGTON — Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.', '“We were going to ride our pitching,” Manager Terry Collins said before Wednesday’s game. “But we’re not riding it right now. We’ve got as many problems with our pitching as we do anything.”', 'Wednesday’s 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz’s place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets’ lineup to overcome against Max Scherzer, the Nationals’ starter.', '“We’re not even giving ourselves chances,” Collins said, adding later, “We just can’t

In [87]:

# Quick test first

TOKEN_RE = re.compile(r"\b[\w']+\b")
# Matches any run of word characters and apostrophes (a 'word'),
# and disregards punctuation such as . , ; etc.

# Try the tokenizer and the tagger on a short string before wrapping them in a function.

stringtext = "Hi my name is Tom, with lot's; of , punctuation , ; ; : <"

# nltk.regexp_tokenize(stringtext, TOKEN_RE)
#   output:   ['Hi', 'my', 'name', 'is', 'Tom', 'with', "lot's", 'of', 'punctuation']

toks = nltk.regexp_tokenize(stringtext, TOKEN_RE)
postoks = nltk.tag.pos_tag(toks)

for entry in postoks:
    print(entry)



('Hi', 'NNP')
('my', 'PRP$')
('name', 'NN')
('is', 'VBZ')
('Tom', 'NNP')
('with', 'IN')
("lot's", 'NN')
('of', 'IN')
('punctuation', 'NN')
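
To see what `TOKEN_RE` does independently of NLTK, the same pattern can be applied with `re.findall`. This is only an illustrative sketch, and the second sample string is chosen for the demonstration. It shows one consequence of the pattern: a typographic apostrophe (U+2019) is matched neither by `\w` nor by `'`, so a word such as "Wednesday’s" is split in two.

```python
import re

TOKEN_RE = re.compile(r"\b[\w']+\b")

# Punctuation is dropped; a straight apostrophe inside a word is kept.
plain = "Hi my name is Tom, with lot's; of , punctuation , ; ; : <"
plain_tokens = TOKEN_RE.findall(plain)
print(plain_tokens)

# A curly apostrophe (U+2019) is neither a word character nor "'",
# so it acts as a separator between tokens.
curly = "Wednesday\u2019s game"
curly_tokens = TOKEN_RE.findall(curly)
print(curly_tokens)
```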


<hr>

In [88]:

# DO NOT DELETE THIS CELL # 

TOKEN_RE = re.compile(r"\b[\w']+\b")


def pos_tag_counter(line):
    """Tokenize one line of text and return a list of (word, POS tag) tuples."""
    toks = nltk.regexp_tokenize(line, TOKEN_RE)
    postoks = nltk.tag.pos_tag(toks)
    return postoks
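
Because `pos_tag_counter` returns a whole list of tuples for every line, the way it is applied to the RDD determines the shape of the result. The plain-Python sketch below mimics the difference between a `map`-style and a `flatMap`-style application. `fake_tagger` is a hypothetical stand-in for `pos_tag_counter` (it tags every token `'X'`), so the example runs without NLTK or Spark.

```python
from itertools import chain


def fake_tagger(line):
    # Hypothetical stand-in for pos_tag_counter: one (word, tag) tuple per token.
    return [(word, "X") for word in line.split()]


lines = ["the Mets lost", "Scherzer pitched"]

# map-like behaviour: one result per input line (a list of lists)
mapped = [fake_tagger(line) for line in lines]

# flatMap-like behaviour: the per-line lists are concatenated into one flat list
flat_mapped = list(chain.from_iterable(fake_tagger(line) for line in lines))

print(len(mapped), len(flat_mapped))
```

With the flat shape, each element is a single `(word, tag)` tuple, which is what later steps that index into the tuple rely on.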



### Task 1: Create an RDD pipeline (i.e. sequence of transformations) that:
1. filters out blank lines 
2. filters out lines starting with 'URL'
3. applies the pos_tag_counter function to each line using flatMap, creating a single list of (word, tag) tuples
4. maps each resulting tuple to its part of speech (the second element of each tuple returned by pos_tag_counter)
5. converts each resulting tag to a pairRDD element with the tag as the key and a value of 1
6. reduces the resulting RDD by key, adding up all the 1s (like the lecture and lab examples)
7. sorts the resulting list by the counts, in descending order.

Your output should look something like:
```
[('NN', 5628),
 ('IN', 4690),
 ('NNP', 4575),
 ('DT', 3913),
 ('JJ', 2550),
 ('NNS', 2345),
 ('VBD', 1931),
 ('RB', 1312),
 ('PRP', 1227),
 ('VB', 1170),
 ('CC', 1086),
 ('TO', 1043),
 ('VBN', 935),
 ('VBG', 895)...
 ```
 In other words, the count of each part-of-speech tag sorted in descending order.
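
Steps 5 to 7 follow the familiar word-count pattern. As a rough plain-Python analogy (this is not Spark code), the sketch below builds `(tag, 1)` pairs, merges the values that share a key with a two-argument function, and sorts by count in descending order. The `reduce_by_key` helper and the short tag list are invented for illustration.

```python
def reduce_by_key(pairs, func):
    # Hypothetical helper mimicking the effect of reduceByKey on a local list.
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


tags = ["NN", "DT", "NN", "IN", "NN", "DT"]

pairs = [(tag, 1) for tag in tags]                            # step 5
counts = reduce_by_key(pairs, lambda x, y: x + y)             # step 6
ranked = sorted(counts, key=lambda kv: kv[1], reverse=True)   # step 7

print(ranked)
```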

In [89]:

# Decomposition: examining the pipeline below one stage at a time

pos_tag_counts = text.filter(lambda line: len(line) > 0) \
    .filter(lambda line: re.findall('^(?!URL).*', line)) \
    .flatMap(pos_tag_counter) \
    .map(lambda word: word[1]) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .sortBy(lambda x: x[1], ascending = False)

pos_tag_counts.collect()


# Intermediate output when the pipeline is cut off after each stage:

# After .flatMap(pos_tag_counter): ONE unified list of all the (word, tag) tuples
# [('WASHINGTON', 'NNP'),
#  ('Stellar', 'NNP'),
#  ('pitching', 'NN'),
#  ('kept', 'VBD'),
#  ('the', 'DT'),
#  ('Mets', 'NNPS'),

# After .map(lambda word: word[1]): only the tags (NNP, NN, etc.)
# ['NNP',
#  'NNP',
#  'NN',
#  'VBD',
#  'DT',
#  'NNPS',
#  'NN',
#  'IN',
#  'DT',
#  'JJ',
#  'NN',

# After .map(lambda word: (word, 1)): (tag, 1) pairs
# [('NNP', 1),
#  ('NNP', 1),
#  ('NN', 1),
#  ('VBD', 1),
#  ('DT', 1),
#  ('NNPS', 1),
#  ('NN', 1),
#  ('IN', 1),
#  ('DT', 1),
#  ('JJ', 1),
#  ('NN', 1),
#  ('IN', 1),
#  ('JJ', 1),
#  ('NN', 1),
#  ('IN', 1),
#  ('PRP$', 1),
#  ('JJ', 1),

# After .reduceByKey(lambda x, y: x + y), before sorting: one (tag, count) pair per tag
# [('VBD', 84),
#  ('PRP$', 29),
#  ('NNS', 73),
#  ('CC', 48),
#  ('PRP', 50),
#  ('VBP', 35),
#  ('VBZ', 12),
#  ('TO', 36),
#  ('RB', 50),
#  ('CD', 30),
#  ('WRB', 2),
#  ('JJR', 7),
#  ('EX', 1),
#  ('WP', 6),
#  ('RP', 10),
#  ('JJS', 2),
#  ('FW', 6),
#  ('NNP', 159),
#  ('NN', 207),
#  ('DT', 139),
#  ('NNPS', 7),
#  ('IN', 174),


[('NN', 207),
 ('IN', 174),
 ('NNP', 159),
 ('DT', 139),
 ('VBD', 84),
 ('JJ', 79),
 ('NNS', 73),
 ('PRP', 50),
 ('RB', 50),
 ('CC', 48),
 ('VB', 41),
 ('TO', 36),
 ('VBP', 35),
 ('CD', 30),
 ('PRP$', 29),
 ('VBG', 28),
 ('VBN', 28),
 ('VBZ', 12),
 ('RP', 10),
 ('JJR', 7),
 ('NNPS', 7),
 ('MD', 7),
 ('WP', 6),
 ('FW', 6),
 ('WDT', 4),
 ('RBR', 3),
 ('WRB', 2),
 ('JJS', 2),
 ('EX', 1),
 ('PDT', 1)]
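
The second `filter` in the pipeline relies on `re.findall` returning an empty (falsy) list for lines that start with `URL`. The sketch below checks that behaviour on a few made-up lines and compares it with `str.startswith`, which is shown only as a possible alternative and not as the required form. Note that the regular expression returns `['']` for an empty string, which is truthy, so the separate blank-line filter is still needed.

```python
import re

# Made-up sample lines for illustration.
sample_lines = [
    "URL: http://example.com/article.html",
    "",
    "The Mets did not score until the ninth inning.",
    "A line that mentions URL in the middle.",
]


def keep_regex(line):
    # Predicate as written in the pipeline: truthy when the line does not start with 'URL'.
    return bool(re.findall('^(?!URL).*', line))


def keep_startswith(line):
    # Possible alternative that gives the same result on these inputs.
    return not line.startswith('URL')


regex_flags = [keep_regex(line) for line in sample_lines]
startswith_flags = [keep_startswith(line) for line in sample_lines]
print(regex_flags, startswith_flags)
```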

In [90]:

# DO NOT DELETE THIS CELL #

pos_tag_counts = text.filter(lambda line: len(line) > 0) \
    .filter(lambda line: re.findall('^(?!URL).*', line)) \
    .flatMap(pos_tag_counter) \
    .map(lambda word: word[1]) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .sortBy(lambda x: x[1], ascending = False)

pos_tag_counts.collect()



# output from a run on the complete dataset:
# [('NN', 1126515),
#  ('IN', 928916),
#  ('NNP', 853093),
#  ('DT', 761492),
#  ('JJ', 498482),
#  ('NNS', 437116),
#  ('VBD', 379509),
#  ('PRP', 282603),
#  ('RB', 271053),
#  ('CC', 231491),
#  ('VB', 223717),
#  ('CD', 187602),
#  ('TO', 187005),
#  ('VBN', 174980),
#  ('VBZ', 169149),
#  ('VBG', 163653),
#  ('VBP', 143368),
#  ('PRP$', 107984),
#  ('MD', 67185),
#  ('WDT', 44582),
#  ('WP', 42406),
#  ('WRB', 33160),
#  ('RP', 29345),
#  ('JJR', 24746),
#  ('NNPS', 18870),
#  ('JJS', 16425),
#  ('EX', 12397),
#  ('RBR', 12286),
#  ('RBS', 5146),
#  ('PDT', 3784),
#  ('FW', 2793),
#  ('WP$', 2329),
#  ('POS', 493),
#  ('UH', 325),
#  ('$', 219),
#  ('LS', 5),
#  ("''", 2)]




# When minitext is the input, the output is the 30-item list shown in the cell output below.


# Note: URL lines appear in the text like this (each one starts with "URL:"):
    
# The departures came less than two months after Karen Hinton, the mayor’s top spokeswoman, announced her resignation from the administration. (She stayed in the position until mid-June.)

# URL: http://www.nytimes.com/2016/06/30/nyregion/three-men-charged-in-killing-of-cuomo-administration-lawyer.html

# In the early morning hours of Labor Day last year, a group of gunmen from the 8-Trey street gang made their way through a crowd of revelers gathered n


[('NN', 207),
 ('IN', 174),
 ('NNP', 159),
 ('DT', 139),
 ('VBD', 84),
 ('JJ', 79),
 ('NNS', 73),
 ('PRP', 50),
 ('RB', 50),
 ('CC', 48),
 ('VB', 41),
 ('TO', 36),
 ('VBP', 35),
 ('CD', 30),
 ('PRP$', 29),
 ('VBG', 28),
 ('VBN', 28),
 ('VBZ', 12),
 ('RP', 10),
 ('JJR', 7),
 ('NNPS', 7),
 ('MD', 7),
 ('WP', 6),
 ('FW', 6),
 ('WDT', 4),
 ('RBR', 3),
 ('WRB', 2),
 ('JJS', 2),
 ('EX', 1),
 ('PDT', 1)]
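
The two `filter` steps in the cell above can be checked locally without Spark. The following is a minimal sketch using only the standard `re` module; the names `lines`, `keep`, and `URL_FILTER` are made up for illustration, and the sample lines are copied from the article text. It also suggests why the length filter runs first: on an empty string, `re.findall` with this pattern returns `['']`, which is a non-empty (truthy) list.

```python
import re

# Sample lines shaped like the dataset: a URL line, a blank line, an article line
lines = [
    "URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html",
    "",
    "The Mets were swept in the three-game series and fell six games behind the Nationals in the National League East.",
]

# Same pattern as the pipeline: a negative lookahead rejects lines that start with "URL"
URL_FILTER = re.compile(r'^(?!URL).*')


def keep(line):
    """Mirror the two filter steps: drop empty lines, then drop lines starting with 'URL'."""
    return len(line) > 0 and bool(URL_FILTER.findall(line))


kept = [line for line in lines if keep(line)]

# An equivalent check without a regular expression (an alternative, not the notebook's code)
same_as_startswith = [line for line in lines if line and not line.startswith("URL")]

print(kept)
print(kept == same_as_startswith)
```

Only the article line survives both filters, and the `startswith` variant keeps the same lines for this sample.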

In [91]:


# Alternative: build the (tag, 1) pair in a single map step instead of two
pos_tag_counts = text.filter(lambda line: len(line) > 0) \
    .filter(lambda line: re.findall('^(?!URL).*', line)) \
    .flatMap(pos_tag_counter) \
    .map(lambda word: (word[1], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .sortBy(lambda x: x[1], ascending = False)
    
pos_tag_counts.collect()



[('NN', 207),
 ('IN', 174),
 ('NNP', 159),
 ('DT', 139),
 ('VBD', 84),
 ('JJ', 79),
 ('NNS', 73),
 ('PRP', 50),
 ('RB', 50),
 ('CC', 48),
 ('VB', 41),
 ('TO', 36),
 ('VBP', 35),
 ('CD', 30),
 ('PRP$', 29),
 ('VBG', 28),
 ('VBN', 28),
 ('VBZ', 12),
 ('RP', 10),
 ('JJR', 7),
 ('NNPS', 7),
 ('MD', 7),
 ('WP', 6),
 ('FW', 6),
 ('WDT', 4),
 ('RBR', 3),
 ('WRB', 2),
 ('JJS', 2),
 ('EX', 1),
 ('PDT', 1)]
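
To see why the two versions of the pipeline give the same counts, here is a small plain-Python sketch. It is a local stand-in, not Spark code: `tagged` is a hypothetical list of `(word, tag)` pairs with made-up tags, and `reduce_by_key` is an illustrative helper that merges values per key the way `reduceByKey` is used above.

```python
# Hypothetical (word, tag) pairs, shaped like the items the flatMap step emits
tagged = [("Mets", "NNP"), ("lost", "VBD"), ("the", "DT"), ("game", "NN"), ("Nationals", "NNP")]

# Two-step version: first extract the tag, then pair it with 1
two_step = [(tag, 1) for tag in (pair[1] for pair in tagged)]

# One-step version: build the (tag, 1) pair directly
one_step = [(pair[1], 1) for pair in tagged]


def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: merge the values of equal keys with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Sum the 1s per tag, then sort by count in descending order
counts = sorted(
    reduce_by_key(one_step, lambda x, y: x + y).items(),
    key=lambda kv: kv[1],
    reverse=True,
)

print(two_step == one_step)
print(counts)
```

Both formulations emit identical pairs, so everything downstream of the `map` step is unaffected by the choice.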

<hr>

### Run the next cell before proceeding to Task 2

In [93]:

# DO NOT DELETE THIS CELL #

grammar = r"""
    NBAR:
        {<NN.*|JJS>*<NN.*>}
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""

  
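def tokenize_chunk_parse(line):
    """Tokenize and POS-tag a line, then return its NP chunks as lists of (word, tag) pairs."""
    # Build a chunker from the NBAR/NP grammar defined above
    chunker = nltk.RegexpParser(grammar)

    toks = nltk.regexp_tokenize(line, TOKEN_RE)  # split the line into tokens
    postoks = nltk.tag.pos_tag(toks)             # tag each token: (word, tag)

    tree = chunker.parse(postoks)                # group the tagged tokens into chunks

    # leaves() is defined below; the name is resolved when this function is called
    return list(leaves(tree))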
  
    
def leaves(tree):
    for subtree in tree.subtrees(filter = lambda t: t.label()=='NP'):
        yield subtree.leaves()
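
To get a feel for what the `NBAR` tag pattern `{<NN.*|JJS>*<NN.*>}` accepts, the sketch below mimics it with the standard `re` module over a string of bracketed tags. This is a simplified illustration only, not how NLTK's `RegexpParser` is implemented; `NBAR_RE`, `nbar_runs`, and the tag sequence are all made up for this example.

```python
import re

# Simplified stand-in for {<NN.*|JJS>*<NN.*>}: zero or more noun-like or JJS tags,
# followed by one noun-like tag (NN, NNS, NNP, NNPS)
NBAR_RE = re.compile(r'(?:<(?:NN[^<>]*|JJS)>)*<NN[^<>]*>')


def nbar_runs(tags):
    """Return each run of tags that the simplified NBAR pattern accepts."""
    # Encode the tags as one string, e.g. "<DT><NN>", so a regex can scan them
    encoded = "".join(f"<{tag}>" for tag in tags)
    return [re.findall(r'<([^<>]+)>', run) for run in NBAR_RE.findall(encoded)]


# Hypothetical tag sequence (made up for illustration, not tagger output)
tags = ["DT", "JJS", "NN", "NNS", "IN", "NNP", "VBD"]
print(nbar_runs(tags))
```

For this sequence the sketch finds two runs, `['JJS', 'NN', 'NNS']` and `['NNP']`; a `JJS` tag on its own is not accepted because the pattern must end in a noun-like tag.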
        
        
        

<hr>

### Task 2:  Create an RDD pipeline to show the distribution of the length of noun phrases
You should wind up with an RDD showing the number of each length of noun phrase.  For example, the following output shows there are 6157 noun phrases of
length 1, 1833 of length 2, 654 of length 3, and so on:
```
[(1, 6157),
 (2, 1833),
 (3, 654),
 (4, 204),
 (5, 65),
 (6, 16),
 (8, 6),
 (7, 4),
 (9, 3)]
```
The steps should be:
1. Apply (using flatMap) the ```tokenize_chunk_parse``` function to each line in the ```text``` RDD
2. Use map to emit the length of each noun phrase
3. Use map to convert each length to a key-value pair (forming a pair RDD) with the phrase length as the key and a value of 1
4. Reduce the resulting RDD by key, adding up all the 1s (like the lecture and lab examples)
5. Sort the resulting list by the counts, in descending order.
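
The steps above can be rehearsed in plain Python before writing the RDD pipeline. The following is a minimal sketch, not Spark code: `parsed_lines` is hypothetical data shaped like the output of `tokenize_chunk_parse` (a list of noun phrases per line, each phrase a list of `(word, tag)` pairs), and the tags are made up rather than produced by the tagger.

```python
from collections import Counter

# Hypothetical per-line results: each line yields a list of noun phrases,
# and each noun phrase is a list of (word, tag) pairs
parsed_lines = [
    [[("Stellar", "NNP"), ("pitching", "NN")], [("Mets", "NNPS")]],
    [[("Manager", "NNP"), ("Terry", "NNP"), ("Collins", "NNP")], [("game", "NN")]],
]

# Step 1 (flatMap): flatten the per-line lists into one stream of phrases
phrases = [phrase for line in parsed_lines for phrase in line]

# Steps 2 and 3 (map): emit a (length, 1) pair for each phrase
pairs = [(len(phrase), 1) for phrase in phrases]

# Step 4 (reduceByKey): add up the 1s for each length
length_counts = Counter()
for length, one in pairs:
    length_counts[length] += one

# Step 5 (sortBy): order by count, descending
result = sorted(length_counts.items(), key=lambda kv: kv[1], reverse=True)
print(result)
```

The keys here are phrase lengths, not words, which is the main difference from the word-count examples.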

In [94]:

np_counts = text.filter(lambda line: len(line) > 0) \
    .filter(lambda line: re.findall('^(?!URL).*', line)) \
    .flatMap(tokenize_chunk_parse) \
    .map(lambda phrase: (len(phrase), 1)) \
    .reduceByKey(lambda x, y: x+y) \
    .sortBy(lambda x: x[1], ascending = False)


np_counts.collect()


[(1, 215), (2, 62), (3, 20), (4, 7), (5, 3), (6, 1)]

<hr>
<br>
<hr>

# Appendix - Tom's Notes

### The text

```
URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html

WASHINGTON — Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.
“We were going to ride our pitching,” Manager Terry Collins said before Wednesday’s game. “But we’re not riding it right now. We’ve got as many problems with our pitching as we do anything.”
Wednesday’s 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz’s place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets’ lineup to overcome against Max Scherzer, the Nationals’ starter.
“We’re not even giving ourselves chances,” Collins said, adding later, “We just can’t give our pitchers any room to work.”
The Mets did not score until the ninth inning, when a last-gasp two-run homer by James Loney off Nationals reliever Shawn Kelley snapped a streak of 23 scoreless innings for the team.
The Mets were swept in the three-game series and fell six games behind the Nationals in the National League East. Of late, the Mets have looked worse than their 40-37 record.
“I don’t think we’ve played half our games yet this year,” right fielder Curtis Granderson said. “There’s still a lot of things left that can and hopefully will happen.”
Scherzer toyed with the Mets, who were initially without Granderson after he was scratched from the lineup with lingering calf tightness. Even though Granderson has been inconsistent this season, he had hit well against Scherzer in the past. Alejandro De Aza, who entered the game with a .165 average, started in right field instead because Collins said the team had few options.
After Scherzer gave up a single to Asdrubal Cabrera and walked Loney in the second inning, he retired the next 18 batters, until an eighth-inning single by Brandon Nimmo.
The Mets struggled again with runners on base. After Nimmo and the pinch-hitting Granderson singled in the eighth, pinch-hitter Travis d’Arnaud grounded out, and De Aza struck out.
“If they keep adding pressure on themselves, they’re going to continue to struggle,” Collins said. “That’s one of the things we try to make sure they have to understand: They have to be themselves.”
General Manager Sandy Alderson, Collin
```