# HW4
## MSDS-7337
## Author: Taylor Bonar
---
1. Run one of the part-of-speech (POS) taggers available in Python. 
   * Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
   * Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.


In [6]:
from platform import python_version
import nltk
import spacy
print(f"""Python Version: {python_version()}
NLTK v.{nltk.__version__}
spaCy v.{spacy.__version__}""")

nlp = spacy.load("en_core_web_sm")

Python Version: 3.8.3
NLTK v.3.5
spaCy v.3.0.6


In [7]:
long_sentence = "Don't judge each day by the harvest you reap but by the seeds that you plant. "
short_sentence = "To be great is to be misunderstood."

long_pos_tags = nltk.pos_tag(nltk.word_tokenize(long_sentence))
short_pos_tags = nltk.pos_tag(nltk.word_tokenize(short_sentence))

print(f"""Original Long Quote: "{long_sentence}" -- Robert Louis Stevenson
Long sentence POS tags:
{long_pos_tags}

Original Quote: "{short_sentence}" -- Ralph Waldo Emerson
Short sentence POS tags:
{short_pos_tags}
""")

Original Long Quote: "Don't judge each day by the harvest you reap but by the seeds that you plant. " -- Robert Louis Stevenson
Long sentence POS tags:
[('Do', 'VBP'), ("n't", 'RB'), ('judge', 'VB'), ('each', 'DT'), ('day', 'NN'), ('by', 'IN'), ('the', 'DT'), ('harvest', 'NN'), ('you', 'PRP'), ('reap', 'VBP'), ('but', 'CC'), ('by', 'IN'), ('the', 'DT'), ('seeds', 'NNS'), ('that', 'IN'), ('you', 'PRP'), ('plant', 'NN'), ('.', '.')]

Original Quote: "To be great is to be misunderstood." -- Ralph Waldo Emerson
Short sentence POS tags:
[('To', 'TO'), ('be', 'VB'), ('great', 'JJ'), ('is', 'VBZ'), ('to', 'TO'), ('be', 'VB'), ('misunderstood', 'NN'), ('.', '.')]



The tagger may be less accurate on the shorter sentence because there are fewer contextual clues to draw on. The error in the output above is "misunderstood", which NLTK tags as NN (singular noun). In this sentence it is a past participle used as a predicate after "to be", so VBN (or JJ) would be the expected tag. My conjecture is that, with no determiner, object, or adverb nearby, the tagger has little besides the word itself to go on. The VBZ tag on "is" is correct: VBZ marks a third-person singular present verb, and "is" is that form of "be", even though it does not end in the suffix -s or -es like most VBZ verbs.
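
To make the tags in the output above easier to read, here is a minimal sketch of a tag glossary. The glosses are my paraphrases of the Penn Treebank tag descriptions, and `PENN_GLOSSARY`, `short_tags_recorded`, and `explain_tags` are hypothetical names introduced only for this illustration.

```python
# Glosses paraphrased from the Penn Treebank tag set; only the tags that
# appear in the short-sentence output are listed (an assumption of this sketch).
PENN_GLOSSARY = {
    "TO": "the word 'to'",
    "VB": "verb, base form",
    "VBZ": "verb, 3rd person singular present",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}

# Tags copied from the recorded NLTK output for the short sentence.
short_tags_recorded = [
    ('To', 'TO'), ('be', 'VB'), ('great', 'JJ'), ('is', 'VBZ'),
    ('to', 'TO'), ('be', 'VB'), ('misunderstood', 'NN'), ('.', '.'),
]


def explain_tags(tagged_tokens, glossary=PENN_GLOSSARY):
    """Return (word, tag, gloss) triples; unknown tags get a placeholder."""
    return [(word, tag, glossary.get(tag, "not in this glossary"))
            for word, tag in tagged_tokens]


for word, tag, gloss in explain_tags(short_tags_recorded):
    print(f"{word:<15}{tag:<5}{gloss}")
```

Reading the tags side by side with their glosses makes it easier to check each token against the sentence by hand.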

----

2.	Run a different POS tagger in Python. Process the same two sentences from question 1.
    * Does it produce the same or different output?
    * Explain any differences as best you can.


In [21]:
spacy_long_pos = nlp(long_sentence)
spacy_short_pos = nlp(short_sentence)

print(f"""Original Long Quote: "{long_sentence}" -- Robert Louis Stevenson
NLTK POS tags:
{long_pos_tags}
spaCy POS tags:
{ [(token.text, token.pos_) for token in spacy_long_pos]}

Original Quote: "{short_sentence}" -- Ralph Waldo Emerson
NLTK POS tags:
{short_pos_tags}
spaCy POS tags:
{ [(token.text, token.pos_) for token in spacy_short_pos]}""")

Original Long Quote: "Don't judge each day by the harvest you reap but by the seeds that you plant. " -- Robert Louis Stevenson
NLTK POS tags:
[('Do', 'VBP'), ("n't", 'RB'), ('judge', 'VB'), ('each', 'DT'), ('day', 'NN'), ('by', 'IN'), ('the', 'DT'), ('harvest', 'NN'), ('you', 'PRP'), ('reap', 'VBP'), ('but', 'CC'), ('by', 'IN'), ('the', 'DT'), ('seeds', 'NNS'), ('that', 'IN'), ('you', 'PRP'), ('plant', 'NN'), ('.', '.')]
spaCy POS tags:
[('Do', 'AUX'), ("n't", 'PART'), ('judge', 'VERB'), ('each', 'DET'), ('day', 'NOUN'), ('by', 'ADP'), ('the', 'DET'), ('harvest', 'NOUN'), ('you', 'PRON'), ('reap', 'VERB'), ('but', 'CCONJ'), ('by', 'ADP'), ('the', 'DET'), ('seeds', 'NOUN'), ('that', 'DET'), ('you', 'PRON'), ('plant', 'VERB'), ('.', 'PUNCT')]

Original Quote: "To be great is to be misunderstood." -- Ralph Waldo Emerson
NLTK POS tags:
[('To', 'TO'), ('be', 'VB'), ('great', 'JJ'), ('is', 'VBZ'), ('to', 'TO'), ('be', 'VB'), ('misunderstood', 'NN'), ('.', '.')]
spaCy POS tags:
[('To', 'PAR

NLTK and spaCy label parts of speech with different tag sets, so most of the differences are notational:
* The word "Do" is tagged VBP (verb, present tense, not third-person singular) by NLTK, while spaCy tags it AUX.
* The contraction "n't" is tagged PART by spaCy but RB by NLTK, and the word "to" is tagged PART by spaCy but TO by NLTK.
* Two tokens differ in substance rather than in naming: NLTK tags "that" as IN and the final "plant" as NN, while spaCy tags them DET and VERB.

However, spaCy's `pos_` attribute reports coarse Universal POS tags, so several Penn tags collapse into a single label such as VERB, PART, or AUX. The finer detail is still stored on each token in the Doc object: `tag_` holds the fine-grained tag, `lemma_` the lemma, and `morph` the morphological features.
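
The extra per-token detail can be pulled out with a small helper. This is a minimal sketch: `token_details` is a hypothetical name, and it only reads the `text`, `pos_`, `tag_`, `lemma_`, and `morph` attributes that spaCy tokens expose.

```python
def token_details(doc):
    """Collect per-token attributes from a spaCy Doc (or any iterable of
    objects exposing the same attribute names)."""
    return [
        {
            "text": token.text,
            "pos": token.pos_,          # coarse Universal POS tag
            "tag": token.tag_,          # fine-grained tag
            "lemma": token.lemma_,      # dictionary form of the word
            "morph": str(token.morph),  # morphological features as a string
        }
        for token in doc
    ]
```

Assuming the `nlp` pipeline and `short_sentence` from the earlier cells, `for row in token_details(nlp(short_sentence)): print(row)` would list the fine-grained tag next to each coarse one.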

---

3.	In a news article from this week’s news, find a random sentence of at least 10 words.
    * Looking at the Penn tag set, manually POS tag the sentence yourself.
    * Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?
    * Explain any differences between the two taggers and your manual tagging as much as you can.


In [22]:
news_sentence = "Meanwhile, the number of available homes on the market has dropped nearly 40% in that same 12 month period."

manual_tagging = [('Meanwhile', 'RB'), (',', ','), ('the', 'DT'), ('number', 'NN'), ('of', 'IN'), ('available', 'JJ'), ('homes', 'NNS'), ('on', 'RP'),
                  ('the', 'DT'), ('market', 'NN'), ('has', 'VBZ'), ('dropped', 'VBF'), ('nearly', 'RBR'), ('40', 'CD'), ('%', 'NN'), ('in', 'RP'),
                  ('that', 'DT'), ('same', 'JJ'), ('12', 'CD'), ('month', 'NN'), ('period', 'NN')]

# NLTK POS tags from tokenized sentence
nltk_tags = nltk.pos_tag(nltk.word_tokenize(news_sentence))

# spaCy parsing and tagging into a spaCy Doc object
nlp_news_doc = nlp(news_sentence)
# Extracting only text and pos_ for analysis
spacy_news_txt_pos = [(token.text, token.pos_) for token in nlp_news_doc]

print(f"""Original News Quote: "{news_sentence}"
Manual POS tags:
{manual_tagging}
NLTK POS tags:
{nltk_tags}
spaCy POS tags:
{spacy_news_txt_pos}""")


Original News Quote: "Meanwhile, the number of available homes on the market has dropped nearly 40% in that same 12 month period."
Manual POS tags:
[('Meanwhile', 'RB'), (',', ','), ('the', 'DT'), ('number', 'NN'), ('of', 'IN'), ('available', 'JJ'), ('homes', 'NNS'), ('on', 'RP'), ('the', 'DT'), ('market', 'NN'), ('has', 'VBZ'), ('dropped', 'VBF'), ('nearly', 'RBR'), ('40', 'CD'), ('%', 'NN'), ('in', 'RP'), ('that', 'DT'), ('same', 'JJ'), ('12', 'CD'), ('month', 'NN'), ('period', 'NN')]
NLTK POS tags:
[('Meanwhile', 'RB'), (',', ','), ('the', 'DT'), ('number', 'NN'), ('of', 'IN'), ('available', 'JJ'), ('homes', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('market', 'NN'), ('has', 'VBZ'), ('dropped', 'VBN'), ('nearly', 'RB'), ('40', 'CD'), ('%', 'NN'), ('in', 'IN'), ('that', 'DT'), ('same', 'JJ'), ('12', 'CD'), ('month', 'NN'), ('period', 'NN'), ('.', '.')]
spaCy POS tags:
[('Meanwhile', 'ADV'), (',', 'PUNCT'), ('the', 'DET'), ('number', 'NOUN'), ('of', 'ADP'), ('available', 'ADJ'), ('homes', 

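Comparing NLTK with my manual tagging, the misclassifications were on my part, mostly in verb forms, adverbs, and prepositions. I tagged "on" and "in" as RP (particle) where NLTK has IN, "dropped" as VBF, which is not a Penn tag, where NLTK has VBN (past participle), and "nearly" as RBR (comparative adverb) where NLTK has RB. I also left out the sentence-final period, which NLTK tags as '.'.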
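
The manual and NLTK tag lists can also be compared mechanically. The sketch below pairs them position by position, which works here only because both lists use the same tokens in the same order apart from the final period. `tag_disagreements`, `manual_recorded`, and `nltk_recorded` are hypothetical names, and the two lists are copied from the recorded output above.

```python
from itertools import zip_longest

# Tag lists copied from the recorded output above.
manual_recorded = [
    ('Meanwhile', 'RB'), (',', ','), ('the', 'DT'), ('number', 'NN'),
    ('of', 'IN'), ('available', 'JJ'), ('homes', 'NNS'), ('on', 'RP'),
    ('the', 'DT'), ('market', 'NN'), ('has', 'VBZ'), ('dropped', 'VBF'),
    ('nearly', 'RBR'), ('40', 'CD'), ('%', 'NN'), ('in', 'RP'),
    ('that', 'DT'), ('same', 'JJ'), ('12', 'CD'), ('month', 'NN'),
    ('period', 'NN'),
]
nltk_recorded = [
    ('Meanwhile', 'RB'), (',', ','), ('the', 'DT'), ('number', 'NN'),
    ('of', 'IN'), ('available', 'JJ'), ('homes', 'NNS'), ('on', 'IN'),
    ('the', 'DT'), ('market', 'NN'), ('has', 'VBZ'), ('dropped', 'VBN'),
    ('nearly', 'RB'), ('40', 'CD'), ('%', 'NN'), ('in', 'IN'),
    ('that', 'DT'), ('same', 'JJ'), ('12', 'CD'), ('month', 'NN'),
    ('period', 'NN'), ('.', '.'),
]


def tag_disagreements(reference, candidate):
    """Return (index, reference_pair, candidate_pair) for every position
    where the two sequences differ; a missing token appears as None."""
    return [
        (index, ref, cand)
        for index, (ref, cand) in enumerate(zip_longest(reference, candidate))
        if ref != cand
    ]


for index, mine, theirs in tag_disagreements(manual_recorded, nltk_recorded):
    print(index, mine, theirs)
```

If the two taggers tokenized the sentence differently, a positional comparison like this would drift out of alignment, so a real comparison would first need to align the tokens.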

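Comparing spaCy with my manual tagging, the labels cannot match directly: spaCy's `pos_` output uses Universal POS (UPOS) tags, a coarser set meant to distinguish lexical and grammatical properties across languages, while I tagged with the Penn set.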
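
One way to put Penn tags and UPOS tags on the same footing is a lookup table. The table below is a rough, hand-written approximation and not an official mapping: for example, IN can also correspond to SCONJ, and VBZ corresponds to AUX when the verb is an auxiliary. `PENN_TO_UPOS`, `to_upos`, `manual_head`, and `spacy_head` are hypothetical names, and only the six spaCy tags visible in the recorded output are used.

```python
# Approximate Penn-to-UPOS lookup (an assumption of this sketch; it covers
# only the tags that occur in the news sentence).
PENN_TO_UPOS = {
    "CD": "NUM", "DT": "DET", "IN": "ADP", "JJ": "ADJ",
    "NN": "NOUN", "NNS": "NOUN", "RB": "ADV", "RBR": "ADV",
    "RP": "ADP", "VBN": "VERB", "VBZ": "VERB",
    ",": "PUNCT", ".": "PUNCT",
}


def to_upos(tagged_tokens, mapping=PENN_TO_UPOS):
    """Convert (word, penn_tag) pairs to (word, upos) pairs. Tags missing
    from the mapping become 'X', the UPOS label for 'other'."""
    return [(word, mapping.get(tag, "X")) for word, tag in tagged_tokens]


# First six manual tags, and the six spaCy tags visible in the recorded output.
manual_head = [('Meanwhile', 'RB'), (',', ','), ('the', 'DT'),
               ('number', 'NN'), ('of', 'IN'), ('available', 'JJ')]
spacy_head = [('Meanwhile', 'ADV'), (',', 'PUNCT'), ('the', 'DET'),
              ('number', 'NOUN'), ('of', 'ADP'), ('available', 'ADJ')]

print(to_upos(manual_head) == spacy_head)
```

A tag that is not in the Penn set, such as the VBF I used for "dropped", falls through to X, which flags it for a second look.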