# Lab1-Assignment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the assignment for Lab 1 of the text mining course. 

**Points**: each exercise is prefixed with the number of points you can obtain for the exercise.

We assume you have worked through the following notebooks:
* **Lab1.1-introduction**
* **Lab1.2-introduction-to-NLTK**
* **Lab1.3-introduction-to-spaCy** 

In this assignment, you will process an English text (**Lab1-apple-samsung-example.txt**) with both NLTK and spaCy and discuss the similarities and differences.

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Tip: how to read a file from disk
Let's open the file **Lab1-apple-samsung-example.txt** from disk.

In [4]:
from pathlib import Path

In [5]:
cur_dir = Path().resolve() # this should provide you with the folder in which this notebook is placed
path_to_file = cur_dir / 'Lab1-apple-samsung-example.txt'
print(path_to_file)
print('does path exist? ->', path_to_file.exists())

/Users/stefaniaconte/Desktop/newenv/Lab1-apple-samsung-example.txt
does path exist? -> True


If the output from the code cell above states that **does path exist? -> False**, please check that the file **Lab1-apple-samsung-example.txt** is in the same directory as this notebook.

In [6]:
with open(path_to_file, encoding='utf-8') as infile:  # explicit encoding, the text contains characters such as £
    text = infile.read()

print('number of characters', len(text))

number of characters 1139
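
As a variant of the cells above, the following sketch wraps the same steps in a small helper and reads the file with an explicit encoding, which avoids platform-dependent defaults for characters such as `£`. The function name `read_text_file` and the assumption that the file is UTF-8 encoded are ours, not part of the original notebook.

```python
from pathlib import Path


def read_text_file(filename, folder=None):
    """Read a text file from `folder` (default: the current working directory).

    Hypothetical helper; assumes the file is UTF-8 encoded.
    """
    folder = Path(folder) if folder is not None else Path().resolve()
    path = folder / filename
    if not path.exists():
        # Same check as the 'does path exist?' print above, but as an error
        raise FileNotFoundError(f'{path} not found; is it next to the notebook?')
    return path.read_text(encoding='utf-8')
```

With the assignment file in the notebook's folder, `text = read_text_file('Lab1-apple-samsung-example.txt')` should give the same string as the `open` call above.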


## [total points: 4] Exercise 1: NLTK
In this exercise, we use NLTK to apply **Part-of-speech (POS) tagging**, **Named Entity Recognition (NER)**, and **Constituency parsing**. The following code snippet already performs sentence splitting and tokenization. 

In [7]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize

In [8]:
sentences_nltk = sent_tokenize(text)

In [9]:
tokens_per_sentence = []
for sentence_nltk in sentences_nltk:
    sent_tokens = word_tokenize(sentence_nltk)
    tokens_per_sentence.append(sent_tokens)

We will use lists to keep track of the output of the NLP tasks. We can hence inspect the output for each task using the index of the sentence.

In [10]:
sent_id = 1
print('SENTENCE', sentences_nltk[sent_id])
print('TOKENS', tokens_per_sentence[sent_id])

SENTENCE The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
TOKENS ['The', 'six', 'phones', 'and', 'tablets', 'affected', 'are', 'the', 'Galaxy', 'S', 'III', ',', 'running', 'the', 'new', 'Jelly', 'Bean', 'system', ',', 'the', 'Galaxy', 'Tab', '8.9', 'Wifi', 'tablet', ',', 'the', 'Galaxy', 'Tab', '2', '10.1', ',', 'Galaxy', 'Rugby', 'Pro', 'and', 'Galaxy', 'S', 'III', 'mini', '.']


### [point: 1] Exercise 1a: Part-of-speech (POS) tagging
Use `nltk.pos_tag` to perform part-of-speech tagging on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [11]:
pos_tags_per_sentence = [] #stores POS tags for each sentence List[Tuple]
for tokens in tokens_per_sentence:
    pos_tags = nltk.pos_tag(tokens) #each tuple consists of a token and its corresponding POS tag
    pos_tags_per_sentence.append(pos_tags)
    #print(pos_tags) #prints POS-tagged tokens for one sentence at a time

In [12]:
print(pos_tags_per_sentence)

[[('https', 'NN'), (':', ':'), ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ'), ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'), ('California', 'NNP'), ('on', 'IN'), ('November', 'NNP'), ('23', 'CD'), ('list', 'NN'), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), ('operating', 'VBG'), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), ('Apple', 'NNP'), ('claims', 'VBZ'), ('infringe', 'VB'), ('its', 'PRP$'), ('patents', 'NNS'), ('.', '.')], [('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'), ('tablets', 'NNS'), ('affected', 'VBN'), ('are', 'VBP'), ('the', 'DT'), ('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NN
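
Printing the full nested list is hard to read. As a hedged illustration of how the `(token, tag)` tuples can be inspected, the sketch below counts tags and filters tokens by tag prefix. The tagged tokens are the first ten of the second sentence, copied from the output above; the helper names `tag_counts` and `tokens_with_tag` are hypothetical.

```python
from collections import Counter

# First ten tagged tokens of the second sentence, copied from the output above
tagged_sentence = [
    ('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'),
    ('tablets', 'NNS'), ('affected', 'VBN'), ('are', 'VBP'), ('the', 'DT'),
    ('Galaxy', 'NNP'), ('S', 'NNP'),
]


def tag_counts(tagged_tokens):
    """Count how often each POS tag occurs."""
    return Counter(tag for _token, tag in tagged_tokens)


def tokens_with_tag(tagged_tokens, prefix):
    """Return the tokens whose tag starts with `prefix` (e.g. 'NN' or 'VB')."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]


print(tag_counts(tagged_sentence))
print('nouns:', tokens_with_tag(tagged_sentence, 'NN'))
print('verbs:', tokens_with_tag(tagged_sentence, 'VB'))
```

The same helpers can be applied to any element of `pos_tags_per_sentence`, for example `tag_counts(pos_tags_per_sentence[1])`.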

### [point: 1] Exercise 1b: Named Entity Recognition (NER)
Use `nltk.chunk.ne_chunk` to perform Named Entity Recognition (NER) on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [13]:
from nltk.chunk import ne_chunk

In [14]:
ner_tags_per_sentence = []
for pos_tags in pos_tags_per_sentence: #from before (List[Tuple])
    ner_tree = ne_chunk(pos_tags)
    ner_tags_per_sentence.append(ner_tree)
    #print(ner_tree)

In [15]:
print(ner_tags_per_sentence)

[Tree('S', [('https', 'NN'), (':', ':'), ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ'), ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'), Tree('ORGANIZATION', [('San', 'NNP'), ('Jose', 'NNP')]), ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'), Tree('GPE', [('California', 'NNP')]), ('on', 'IN'), ('November', 'NNP'), ('23', 'CD'), ('list', 'NN'), ('six', 'CD'), Tree('ORGANIZATION', [('Samsung', 'NNP')]), ('products', 'NNS'), ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'), Tree('GPE', [('Bean', 'NNP')]), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), ('operating', 'VBG'), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), Tree('PERSON', [('Apple', 'NNP')]), ('claims', 'VBZ'), ('infringe', 'VB'), ('its', 'PRP$'), ('patents', 'NNS'), ('.', '.')]), Tree('S', [('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'),
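
The chunker returns one `Tree` per sentence, in which each named entity is a subtree whose label is the entity type. A minimal sketch of how such a tree could be flattened into `(entity text, label)` pairs is shown below. The tree is a hand-copied fragment of the output above, and the helper name `extract_entities` is hypothetical.

```python
from nltk import Tree

# Fragment of the first sentence's chunk tree, copied from the output above
ner_tree = Tree('S', [
    ('the', 'DT'),
    Tree('ORGANIZATION', [('San', 'NNP'), ('Jose', 'NNP')]),
    ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'),
    Tree('GPE', [('California', 'NNP')]),
])


def extract_entities(tree):
    """Return (entity text, label) pairs for the chunks directly below the root."""
    entities = []
    for child in tree:
        if isinstance(child, Tree):  # plain tokens are (token, tag) tuples
            text = ' '.join(token for token, _tag in child.leaves())
            entities.append((text, child.label()))
    return entities


print(extract_entities(ner_tree))
```

Applied to the real output, `[extract_entities(tree) for tree in ner_tags_per_sentence]` would give one entity list per sentence, which is easier to compare with spaCy's `doc.ents`.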

### [points: 2] Exercise 1c: Constituency parsing
Use the `nltk.RegexpParser` to perform constituency parsing on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [16]:
constituent_parser = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')

In [17]:
constituency_output_per_sentence = []
for pos_tags in pos_tags_per_sentence:
    # Parse the POS-tagged sentence using the defined grammar
    parse_tree = constituent_parser.parse(pos_tags)
    constituency_output_per_sentence.append(parse_tree)
    #print(parse_tree)

In [18]:
print(constituency_output_per_sentence)

[Tree('S', [Tree('NP', [('https', 'NN')]), (':', ':'), Tree('NP', [('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ')]), ('Documents', 'NNS'), Tree('VP', [Tree('V', [('filed', 'VBN')])]), ('to', 'TO'), Tree('NP', [('the', 'DT')]), ('San', 'NNP'), ('Jose', 'NNP'), Tree('NP', [('federal', 'JJ'), ('court', 'NN')]), Tree('P', [('in', 'IN')]), ('California', 'NNP'), Tree('P', [('on', 'IN')]), ('November', 'NNP'), ('23', 'CD'), Tree('NP', [('list', 'NN')]), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), Tree('VP', [Tree('V', [('running', 'VBG')]), Tree('NP', [('the', 'DT')])]), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), Tree('VP', [Tree('V', [('operating', 'VBG')])]), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), ('Apple', 'NNP'), Tree('VP', [Tree('V', [('claims', 'VBZ')])]), Tree('VP', [Tree('V', [

Augment the RegexpParser so that it also detects Named Entity Phrases (NEP), e.g., so that it detects *Galaxy S III* and *Ice Cream Sandwich*.

In [19]:
constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {}             # ???''')

In [20]:
constituency_v2_output_per_sentence = []
for pos_tags in pos_tags_per_sentence:
    # Parse the POS-tagged sentence using the defined grammar
    parse_tree = constituent_parser_v2.parse(pos_tags)
    constituency_v2_output_per_sentence.append(parse_tree)
    #print(parse_tree)

In [21]:
print(constituency_v2_output_per_sentence)

[Tree('S', [Tree('NP', [('https', 'NN')]), (':', ':'), Tree('NP', [('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ')]), ('Documents', 'NNS'), Tree('VP', [Tree('V', [('filed', 'VBN')])]), ('to', 'TO'), Tree('NP', [('the', 'DT')]), ('San', 'NNP'), ('Jose', 'NNP'), Tree('NP', [('federal', 'JJ'), ('court', 'NN')]), Tree('P', [('in', 'IN')]), ('California', 'NNP'), Tree('P', [('on', 'IN')]), ('November', 'NNP'), ('23', 'CD'), Tree('NP', [('list', 'NN')]), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), Tree('VP', [Tree('V', [('running', 'VBG')]), Tree('NP', [('the', 'DT')])]), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), Tree('VP', [Tree('V', [('operating', 'VBG')])]), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), ('Apple', 'NNP'), Tree('VP', [Tree('V', [('claims', 'VBZ')])]), Tree('VP', [Tree('V', [
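
In the cell above the `NEP` rule is still empty, so the output is identical to that of the first parser. One possible way to fill it in, offered as a sketch rather than the intended solution, is to chunk one or more consecutive proper nouns (`<NNP>+`). The parser name `constituent_parser_nep` is hypothetical, and the tagged tokens are copied from the POS output of the first sentence.

```python
import nltk

# Same grammar as above, with one candidate NEP rule filled in (an assumption):
# a NEP is one or more consecutive proper nouns.
constituent_parser_nep = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {<NNP>+}       # NEP -> NNP+''')

# Tagged tokens copied from the POS output of the first sentence
tagged_fragment = [
    ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'),
    ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'),
    ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"),
    ('operating', 'VBG'), ('systems', 'NNS'),
]

tree = constituent_parser_nep.parse(tagged_fragment)

# Collect the text of every NEP chunk
nep_phrases = [
    ' '.join(token for token, _tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NEP')
]
print(nep_phrases)
```

Because the rule relies on the POS tags, it inherits the tagger's mistakes: *Ice Cream Sandwich* is found as one phrase, but *Jelly Bean* is not, since `Jelly` was tagged `RB`. Whether *Galaxy S III* is detected in full likewise depends on all three tokens being tagged `NNP`.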

## [total points: 1] Exercise 2: spaCy
Use spaCy to process the same text as you analyzed with NLTK.

In [22]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [23]:
# Part-of-speech tagging
doc = nlp(text)
for token in doc:
    print(f'{token.text:15} {token.lemma_:15} {token.pos_:5} {token.tag_:5} {token.dep_:7}')
# Named entity recognition
for ent in doc.ents:
    print(f'{ent.text:15} {ent.start_char:5} {ent.end_char:5} {ent.label_:5}')
# Dependency parsing
for token in doc:
    print(f"{token.text:10} {token.dep_:10} {token.head.text:10} {token.head.pos_:5} {list(token.children)}")
# Visualisation of the dependency parse (requires `from spacy import displacy`)
# displacy.render(doc, style="dep", jupyter=True, options={'distance': 90})

https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html NOUN  NNS   amod   


              

              SPACE _SP   dep    
Documents       document        NOUN  NNS   nsubj  
filed           file            VERB  VBD   ROOT   
to              to              ADP   IN    prep   
the             the             DET   DT    det    
San             San             PROPN NNP   nmod   
Jose            Jose            PROPN NNP   nmod   
federal         federal         ADJ   JJ    amod   
court           court           NOUN  NN    pobj   
in              in              ADP   IN    prep   
California      California      PROPN NNP   pobj   
on              on              ADP   IN    prep   
November        November        PROPN NNP   pobj   
23              23              NUM   CD    nummod 
list            list      

Small tip: you can use **sents = list(doc.sents)** to access a sentence by index, e.g. **sents[2]** for the third sentence.


In [24]:
from spacy import displacy

sents = list(doc.sents)
# Access a sentence by index
specific_sentence = sents[2]  # in this case the third one
print(specific_sentence.text)

# Part-of-speech tagging for the specific sentence
for token in specific_sentence:
    print(f'{token.text:15} {token.pos_:5}')

# Named Entity Recognition for the specific sentence
for ent in specific_sentence.ents:
    print(f'{ent.text:15} {ent.label_:5}')

# Dependency parse for the specific sentence
displacy.render(specific_sentence, style="dep", jupyter=True, options={'distance': 90})

## [total points: 7] Exercise 3: Comparison NLTK and spaCy
We will now compare the output of NLTK and spaCy, i.e., how do they differ?

### [points: 3] Exercise 3a: Part of speech tagging
Compare the output from NLTK and spaCy regarding part of speech tagging.

* To compare, you will probably want to go sentence by sentence. Describe whether the sentence splitting differs between NLTK and spaCy. If so, where do they differ?
* After checking the sentence splitting, select a sentence for which you expect interesting results and perhaps differences. Motivate your choice.
* Compare the output in `token.tag_` from spaCy to the part of speech tagging from NLTK for each token in your selected sentence. Are there any differences? This is not a trick question; it is possible that there are no differences.


NLTK and spaCy both split the text into individual sentences, but they rely on different algorithms and models for this task, so their sentence boundaries can differ. For the sentence we selected, the main difference we noticed lies not in the sentence splitting but in the tokenization, in particular of currency symbols and amounts.


"In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices."

We chose this sentence because it contains several difficult elements and has a complex overall structure.
The only tokenization difference we noticed concerns '£0.66bn'. NLTK treats it as a single token, which makes sense since it represents one monetary amount (although, as the NLTK output in Exercise 3b shows, it does split '$1.05bn' into '$' and '1.05bn'). spaCy, in contrast, splits '£0.66bn' into two tokens, '£' and '0.66bn', separating the currency symbol from its value. This shows that each library applies its own tokenization rules, with spaCy splitting more finely in this case.
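
The tokenization difference described above can be located automatically by aligning the two token lists. The sketch below uses the standard library's `difflib` for this. The helper name `token_differences` is hypothetical, and the two short token lists only reproduce the bracketed amount as described in the text, not the full tokenizer output.

```python
from difflib import SequenceMatcher

# The bracketed amount as tokenized by each library (per the description above)
nltk_tokens = ['(', '£0.66bn', ')']
spacy_tokens = ['(', '£', '0.66bn', ')']


def token_differences(tokens_a, tokens_b):
    """Return the (tokens_a span, tokens_b span) pairs where the lists disagree."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != 'equal'
    ]


print(token_differences(nltk_tokens, spacy_tokens))
```

Passing the full token lists of a sentence, e.g. the output of `word_tokenize` and `[token.text for token in sentence]`, would list every position where the two tokenizers disagree.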

When comparing spaCy's `token.tag_` output with NLTK's part-of-speech tags for each token in our chosen sentence, we should keep in mind that any differences stem from the different models each library uses. The word "was" is a good example of how the two tools can describe the same token differently. NLTK tags it as VBD (verb, past tense) and makes no separate distinction for auxiliary verbs. spaCy may also give VBD in its fine-grained `token.tag_`, which follows the same Penn Treebank style tagset, while its coarse-grained `token.pos_` uses Universal POS labels and can label "was" as AUX (auxiliary verb).
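
A token-by-token tag comparison can be scripted once both tag sequences are available. The sketch below uses the opening tokens of the first sentence, because both tag sequences for those tokens appear in the outputs earlier in this notebook (`nltk.pos_tag` in Exercise 1a and `token.tag_` in Exercise 2). The helper name `tag_disagreements` is hypothetical, and the comparison assumes both libraries tokenized this fragment identically.

```python
# (token, tag) pairs copied from the NLTK output in Exercise 1a
nltk_tags = [
    ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'),
    ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN'),
    ('in', 'IN'), ('California', 'NNP'), ('on', 'IN'), ('November', 'NNP'),
    ('23', 'CD'),
]

# (token, tag_) pairs copied from the spaCy output in Exercise 2
spacy_tags = [
    ('Documents', 'NNS'), ('filed', 'VBD'), ('to', 'IN'), ('the', 'DT'),
    ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN'),
    ('in', 'IN'), ('California', 'NNP'), ('on', 'IN'), ('November', 'NNP'),
    ('23', 'CD'),
]


def tag_disagreements(tags_a, tags_b):
    """Return (token, tag_a, tag_b) for aligned tokens whose tags differ."""
    return [
        (token_a, tag_a, tag_b)
        for (token_a, tag_a), (token_b, tag_b) in zip(tags_a, tags_b)
        if token_a == token_b and tag_a != tag_b
    ]


for token, nltk_tag, spacy_tag in tag_disagreements(nltk_tags, spacy_tags):
    print(f'{token:10} NLTK: {nltk_tag:5} spaCy: {spacy_tag:5}')
```

For this fragment the two taggers disagree on `filed` (VBN vs. VBD) and `to` (TO vs. IN). If the tokenizations differ, as with '£0.66bn', the token lists need to be aligned first.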

### [points: 2] Exercise 3b: Named Entity Recognition (NER)
* Describe differences between the output from NLTK and spaCy for Named Entity Recognition. Which one do you think performs better?

In [25]:
from spacy import displacy

# Use a separate variable so that the full text loaded earlier is not overwritten
selected_sentence = """In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices."""

# spaCy: visualise the named entities
doc_selected = nlp(selected_sentence)
displacy.render(doc_selected, jupyter=True, style='ent')

# NLTK: tokenize, POS-tag and chunk the named entities
for sentence in nltk.sent_tokenize(selected_sentence):
    tokens = nltk.word_tokenize(sentence)
    tokens_pos_tagged = nltk.pos_tag(tokens)
    tokens_pos_tagged_and_named_entities = ne_chunk(tokens_pos_tagged)
    print()
    print('ORIGINAL SENTENCE', sentence)
    print('NAMED ENTITY RECOGNITION OUTPUT', tokens_pos_tagged_and_named_entities)


ORIGINAL SENTENCE In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices.
NAMED ENTITY RECOGNITION OUTPUT (S
  In/IN
  (GPE August/NNP)
  ,/,
  (PERSON Samsung/NNP)
  lost/VBD
  a/DT
  (GSP US/NNP)
  patent/NN
  case/NN
  to/TO
  (GPE Apple/NNP)
  and/CC
  was/VBD
  ordered/VBN
  to/TO
  pay/VB
  its/PRP$
  rival/JJ
  $/$
  1.05bn/CD
  (/(
  £0.66bn/NN
  )/)
  in/IN
  damages/NNS
  for/IN
  copying/VBG
  features/NNS
  of/IN
  the/DT
  (ORGANIZATION iPad/NN)
  and/CC
  (ORGANIZATION iPhone/NN)
  in/IN
  its/PRP$
  (GPE Galaxy/NNP)
  range/NN
  of/IN
  devices/NNS
  ./.)


The outputs for the given sentence differ in the following ways:

**Entities identified**:
spaCy recognizes a wider range of entity types, including organizations ("Apple") and monetary values ("$1.05bn", "£0.66bn").
NLTK, on the other hand, uses fewer entity types and leaves the monetary amounts untagged.

**Accuracy of labels**:
spaCy assigns more accurate labels, using specific types such as DATE, ORG, GPE, and MONEY. It is not flawless, though: it makes a mistake when labeling the "Galaxy" device and detects "iPad" as an organization.
NLTK labels "Samsung" as a PERSON and "Apple" and "Galaxy" as GPE (geo-political entity), which is less accurate and less informative. It also tags "August" as a GPE and both "iPad" and "iPhone" as ORGANIZATION.

Overall, we would say that spaCy performs better at Named Entity Recognition for the given sentence, because it recognizes a wider range of entity types and provides more accurate and informative labels.

### [points: 2] Exercise 3c: Constituency/dependency parsing
Choose one sentence from the text and run constituency parsing using NLTK and dependency parsing using spaCy.
* Describe briefly the difference between constituency parsing and dependency parsing.
* Describe differences between the output from NLTK and spaCy.

Dependency parsing and constituency parsing are two methods used to represent the structure of sentences in natural language processing.

- **Constituency parsing** breaks a sentence down into its grammatical components, or phrases, using a hierarchical structure that is usually represented as a parse tree. The tree reflects the layered phrase structure of the sentence: each node denotes a constituent, and each edge links a constituent to the larger phrase that contains it. Constituency parsing identifies the noun phrases, verb phrases, prepositional phrases, and other constituents of a sentence, as well as their hierarchical relationships. With respect to ambiguity, the explicit phrase structure can make certain attachment decisions visible, but structural ambiguity remains a challenge: a sentence can have several valid parse trees, which makes it harder to determine the intended structure.
- In **dependency parsing**, each word is a node, and the relations between words are directed, typed edges in a graph. Unlike constituency parsing, dependency parsing does not rely on phrasal constituents or sub-phrases. Instead, it represents the syntax of a sentence through relationships between individual words. The goal is to determine which word depends on which other word, and how. Because it focuses on word-to-word relations rather than on phrase structure, it is often considered the more language-independent of the two approaches.

In summary, constituency parsing yields a layered structure of phrase-based constituents, while dependency parsing yields a network of word-to-word relations.
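
To make the contrast concrete, here is a small hand-written illustration of the two data shapes. It does not call NLTK or spaCy: the toy sentence, the bracketing, and the dependency labels are our own annotations, and the helper names are hypothetical.

```python
# Constituency analysis: nested (label, children...) tuples, i.e. phrases within phrases
constituency = (
    'S',
    ('NP', 'Samsung'),
    ('VP', ('V', 'lost'), ('NP', ('DT', 'a'), ('NN', 'case'))),
)

# Dependency analysis: one (dependent, relation, head) triple per word.
# Following spaCy's convention, the root is its own head.
dependencies = [
    ('Samsung', 'nsubj', 'lost'),
    ('lost', 'ROOT', 'lost'),
    ('a', 'det', 'case'),
    ('case', 'dobj', 'lost'),
]


def leaves(node):
    """Return the words covered by a constituent."""
    _label, *children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def phrases(node):
    """Yield (label, words) for every constituent, outermost first."""
    label, *children = node
    yield label, leaves(node)
    for child in children:
        if not isinstance(child, str):
            yield from phrases(child)


def dependents_by_head(triples):
    """Group the (relation, dependent) pairs under their head word."""
    grouped = {}
    for dependent, relation, head in triples:
        if relation != 'ROOT':
            grouped.setdefault(head, []).append((relation, dependent))
    return grouped


print(list(phrases(constituency)))
print(dependents_by_head(dependencies))
```

The first print lists nested phrases such as the VP spanning "lost a case", while the second lists only word-to-word links: "lost" governs "Samsung" and "case", and "case" governs "a".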

**Differences in output between the two:**

**NLTK**
The NLTK output is a tree structure that groups the tokens of the sentence into syntactic categories, such as noun phrases and verb phrases, without emphasizing the named entities in the text. It is geared towards showing the grammar and structure of the sentence rather than explicitly identifying and categorizing named entities.

**spaCy**
The spaCy output is more straightforward when it comes to identifying named entities. spaCy pairs words or phrases with their entity types, such as organizations (ORG), geopolitical entities (GPE), DATE, and MONEY. It also provides a more detailed analysis of each word's role in the sentence, including its relationship to other words. Because it assigns specific labels to named entities, it is easier to extract information about people, organizations, locations, and more.

**Conclusion**
Overall, spaCy is generally perceived to perform better than NLTK on Named Entity Recognition tasks. It identifies entities more explicitly and categorizes them into predefined classes, which is useful for extracting information, analyzing data, or improving search algorithms. spaCy's processing efficiency and its higher accuracy on complex NER tasks make it a preferred choice for NLP applications focused on identifying and categorizing named entities.

# End of this notebook