## DATA 622 Natural Language Processing
### Homework 4

Questions
Use the first two sentences of the Gettysburg Address by Abraham Lincoln.
1. Tokenization
Tokenize both sentences into words using spaCy. Print the list of tokens for each sentence.
Also use the benepar library.
2. Part-of-Speech Tagging
Print the part-of-speech (POS) tag for each token in the first sentence.
3. Dependency Parsing
Print the dependency relation and head word for each token in the second sentence.
4. Constituent Parsing
Using the NLTK and benepar libraries, print the constituency (phrase structure) parse tree
of the first sentence.
5. Extract Noun Phrases
Using spaCy, extract all noun phrases (noun chunks) from both sentences.
6. CRF and HMM
Why do you use CRF and HMM? How do they differ? Please summarize in fewer than 50 words.

In [None]:
!pip install benepar

Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Downloading torch_struct-0.5-py3-none-any.whl (34 kB)
Building wheels for collected packages: benepar
  Building wheel for benepar (setup.py) ... done
  Created wheel for benepar: filename=benepar-0.2.0-py3-none-any.whl size=37625 sha256=99c1bd515c01c2d525c4a21ce48a1dbd09c094d118ba5599afd3d6a8d0513fd7
  Stored in directory: /root/.cache/pip/wheels/9b/84/c1/f2ac877f519e2864e7dfe52a1c17fe5cdd50819cb8d1f1945f
Successfully built benepar
Installing collected packages: torch-struct, benepar
Successfully installed benepar-0.2.0 torch-struct-0.5


In [None]:
import benepar
import spacy
import spacy.cli

# Download the small English spaCy model
spacy.cli.download("en_core_web_sm")

# Load the spaCy pipeline
nlp = spacy.load("en_core_web_sm")

# Download the benepar model, then add the constituency parser to the pipeline
benepar.download("benepar_en3")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


<benepar.integrations.spacy_plugin.BeneparComponent at 0x7b3f7dd99490>

1. Tokenization (spaCy + benepar)

In [None]:
import spacy
import benepar

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Add benepar
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

# First two sentences of the Gettysburg Address
sentences = [
    "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.",
    "Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure."
]

# Tokenization
for sent in sentences:
    doc = nlp(sent)
    print([token.text for token in doc])

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.']
['Now', 'we', 'are', 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.']


2. Part-of-Speech Tagging (first sentence)

In [None]:
doc1 = nlp(sentences[0])
for token in doc1:
    print(f"{token.text} --> {token.pos_}")

Four --> NUM
score --> NOUN
and --> CCONJ
seven --> NUM
years --> NOUN
ago --> ADV
our --> PRON
fathers --> NOUN
brought --> VERB
forth --> ADV
on --> ADP
this --> DET
continent --> NOUN
, --> PUNCT
a --> DET
new --> ADJ
nation --> NOUN
, --> PUNCT
conceived --> VERB
in --> ADP
Liberty --> PROPN
, --> PUNCT
and --> CCONJ
dedicated --> VERB
to --> ADP
the --> DET
proposition --> NOUN
that --> SCONJ
all --> DET
men --> NOUN
are --> AUX
created --> VERB
equal --> ADJ
. --> PUNCT


3. Dependency Parsing (second sentence)

In [None]:
doc2 = nlp(sentences[1])
for token in doc2:
    print(f"{token.text} --> {token.dep_} --> {token.head.text}")

Now --> advmod --> engaged
we --> nsubjpass --> engaged
are --> auxpass --> engaged
engaged --> ROOT --> engaged
in --> prep --> engaged
a --> det --> war
great --> amod --> war
civil --> amod --> war
war --> pobj --> in
, --> punct --> engaged
testing --> advcl --> endure
whether --> mark --> conceived
that --> det --> nation
nation --> nsubj --> conceived
, --> punct --> nation
or --> cc --> nation
any --> det --> nation
nation --> conj --> nation
so --> advmod --> conceived
conceived --> ccomp --> testing
and --> cc --> conceived
so --> advmod --> dedicated
dedicated --> conj --> conceived
, --> punct --> endure
can --> aux --> endure
long --> advmod --> endure
endure --> advcl --> engaged
. --> punct --> engaged


4. Constituent Parsing (first sentence)

In [None]:
doc1 = nlp(sentences[0])
sent = list(doc1.sents)[0]
print(sent._.parse_string)   # phrase structure tree

(S (NP (CD Four) (NN score)) (CC and) (S (ADVP (NP (CD seven) (NNS years)) (RB ago)) (NP (PRP$ our) (NNS fathers)) (VP (VBD brought) (ADVP (RB forth)) (PP (IN on) (NP (DT this) (NN continent))) (, ,) (NP (NP (DT a) (JJ new) (NN nation)) (, ,) (VP (VBN conceived) (PP (IN in) (NP (NNP Liberty))))) (, ,) (CC and) (VP (VBN dedicated) (PP (IN to) (NP (DT the) (NN proposition) (SBAR (IN that) (S (NP (DT all) (NNS men)) (VP (VBP are) (VP (VBN created) (S (ADJP (JJ equal)))))))))))) (. .))
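
The bracketed string above is hard to read on a single line. As a minimal, standard-library-only sketch, the snippet below parses a fragment copied from that output (the embedded clause "all men are created equal") into nested tuples, then lists its words and its NP constituents. The helper names `parse_brackets`, `leaves`, and `phrases` are hypothetical and are not part of benepar, spaCy, or NLTK.

```python
import re

def parse_brackets(s):
    """Parse a Penn-style bracketed string into (label, children) tuples.

    Hypothetical helper for illustration; leaf words are plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "every constituent starts with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()

def leaves(tree):
    """Return the words under a tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words

def phrases(tree, wanted):
    """Yield the text of every constituent whose label equals `wanted`."""
    label, children = tree
    if label == wanted:
        yield " ".join(leaves(tree))
    for child in children:
        if not isinstance(child, str):
            yield from phrases(child, wanted)

# Fragment copied from the parse string printed above
fragment = "(S (NP (DT all) (NNS men)) (VP (VBP are) (VP (VBN created) (S (ADJP (JJ equal))))))"
tree = parse_brackets(fragment)
print(leaves(tree))
print(list(phrases(tree, "NP")))
```

Since the question asks for NLTK, note that `nltk.Tree.fromstring(sent._.parse_string)` builds a comparable tree object, and its `pretty_print()` method draws it as an indented diagram.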


5. Extract Noun Phrases (noun chunks)

In [None]:
for i, sent in enumerate(sentences):
    doc = nlp(sent)
    print(f"Sentence {i+1} noun phrases:")
    for chunk in doc.noun_chunks:
        print("-", chunk.text)

Sentence 1 noun phrases:
- Four score
- our fathers
- this continent
- a new nation
- Liberty
- the proposition
- all men
Sentence 2 noun phrases:
- we
- a great civil war
- that nation
- any nation


6. CRF vs HMM (summary in < 50 words)

Both label sequences, such as POS tags or named entities. HMMs are generative: they model the joint probability of words and hidden tags. CRFs are discriminative: they model tags conditioned on the whole sentence, allowing overlapping features and often higher accuracy.
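
To make the generative side of this comparison concrete, here is a minimal sketch of Viterbi decoding for a two-tag HMM. Every probability in `START`, `TRANS`, and `EMIT` is hypothetical and chosen only for illustration; none is estimated from the Gettysburg Address or any corpus. The score of a tag sequence is the product of start, transition, and emission probabilities, which is the joint probability of the words and the tags.

```python
# Toy HMM tagger. All probabilities are hypothetical illustration values.
START = {"NOUN": 0.7, "VERB": 0.3}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.6, "VERB": 0.4}}
EMIT = {"NOUN": {"men": 0.8, "endure": 0.2},
        "VERB": {"men": 0.1, "endure": 0.9}}

def viterbi(words):
    """Return (best tag sequence, its joint probability) under the toy HMM."""
    # best[t][tag] = (probability of the best path ending in `tag` at step t,
    #                 previous tag on that path)
    best = [{tag: (START[tag] * EMIT[tag][words[0]], None) for tag in START}]
    for word in words[1:]:
        column = {}
        for tag in START:
            # Pick the predecessor that maximizes path probability * transition
            prev, score = max(
                ((p, best[-1][p][0] * TRANS[p][tag]) for p in START),
                key=lambda item: item[1],
            )
            column[tag] = (score * EMIT[tag][word], prev)
        best.append(column)

    # Trace back from the most probable final tag
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    prob = best[-1][tag][0]
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1], prob

tags, prob = viterbi(["men", "endure"])
print(tags, round(prob, 4))
```

A linear-chain CRF can be decoded with the same dynamic program, but it replaces these local probabilities with weighted feature scores that are normalized over all tag sequences for the given sentence, which is what lets it use overlapping features of the input.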