# Homework 7
### Due Date: 03/31/25
- Using the knowledge gained from the lecture and the reading, complete the following tasks in Python. Ensure you have spaCy installed and the en_core_web_sm language model downloaded.

In [1]:
# Installing the spaCy library
!pip install spacy

# Installing the small English language model used by spaCy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Task 1: Tokenization
### 1. Write a Python script to tokenize the following text: 
    "The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!"


In [2]:
text = ("""The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!""")

import spacy
# Load spaCy's small English model
nlp = spacy.load('en_core_web_sm')

# Apply the NLP pipeline to the text
doc = nlp(text)

# Extract tokens
tokens = [token.text for token in doc]  # collect the token texts in a list
print(tokens)


['The', 'quick', 'brown', 'fox', 'does', "n't", 'jump', 'over', 'the', 'lazy', 'dog', '.', 'Natural', 'Language', 'Processing', 'is', 'fascinating', '!']


### 2. Questions to Answer in Comments:

**1. How does spaCy process the various tokens?**
    Hint: Loop through the doc container using the token attributes: .text, .head, .lemma_, .morph.
- Tokenization segments text into words, punctuation marks, and similar units; the resulting sequence of tokens makes up a doc. spaCy processes each token by applying rules specific to the language. It first splits the text on whitespace (similar to split), and then the tokenizer performs two checks on each substring: 1. Does it match an exception rule (ex: "doesn't" contains no whitespace but still needs to be split)? 2. Can a prefix, suffix, or infix be split off?

**2. How does spaCy handle punctuation marks like periods and commas?**
- Punctuation marks are treated as separate tokens.

**3. What happens when the text includes contractions (e.g., "don't")?**
- spaCy handles contractions through its tokenizer exception rules, so a contraction is split into two tokens. For example, "doesn't" becomes "does" and "n't" in the output above.
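
The splitting steps described in the answers above (whitespace split, exception lookup, prefix and suffix splitting) can be sketched in plain Python. This is a toy illustration, not spaCy's actual tokenizer: the names `EXCEPTIONS`, `PREFIXES`, `SUFFIXES`, and `toy_tokenize` are hypothetical, the exception table holds only two entries, and infix handling is left out.

```python
# Toy illustration only: NOT spaCy's tokenizer, just the same idea in miniature.
# All names and tables below are hypothetical.
EXCEPTIONS = {"doesn't": ["does", "n't"], "don't": ["do", "n't"]}
PREFIXES = ('"', "(", "[")
SUFFIXES = (".", ",", "!", "?", '"', ")", "]")


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():  # step 1: split on whitespace
        prefixes, suffixes = [], []
        # step 2: stop as soon as the chunk matches an exception;
        # otherwise peel off one prefix or suffix character and re-check
        while chunk and chunk not in EXCEPTIONS:
            if chunk.startswith(PREFIXES):
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefixes)
        if chunk in EXCEPTIONS:
            tokens.extend(EXCEPTIONS[chunk])  # exception rule decides the split
        elif chunk:
            tokens.append(chunk)
        tokens.extend(suffixes)
    return tokens


print(toy_tokenize("The quick brown fox doesn't jump over the lazy dog."))
```

On the Task 1 sentence this sketch produces the same token list as the spaCy output above, but only because that sentence needs nothing beyond these few rules.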


# Task 2: Part-of-Speech Tagging
### Extend your script to include part-of-speech tagging for the tokens.
    Hint: Use token.pos_ and token.tag_ properties.

In [3]:
# Part-of-speech tagging added to the tokens
for token in doc:  # loop through each token and print its text, lemma, pos_ and tag_ attributes
    print(f"{token.text} - {token.lemma_} - POS: {token.pos_}, Tag: {token.tag_}")

# POS features
# token.text - shows the original token
# token.lemma_ - shows the base (dictionary) form of the word
# token.pos_ - shows the coarse Universal Part Of Speech (UPOS) category of the word (noun, verb, adjective)
# token.tag_ - fine-grained tag (a more specific subcategory of the category in token.pos_)
    # ex: token.text = walked, token.pos_: VERB, token.tag_: VBD (past tense verb)


The - the - POS: DET, Tag: DT
quick - quick - POS: ADJ, Tag: JJ
brown - brown - POS: ADJ, Tag: JJ
fox - fox - POS: NOUN, Tag: NN
does - do - POS: AUX, Tag: VBZ
n't - not - POS: PART, Tag: RB
jump - jump - POS: VERB, Tag: VB
over - over - POS: ADP, Tag: IN
the - the - POS: DET, Tag: DT
lazy - lazy - POS: ADJ, Tag: JJ
dog - dog - POS: NOUN, Tag: NN
. - . - POS: PUNCT, Tag: .
Natural - Natural - POS: PROPN, Tag: NNP
Language - Language - POS: PROPN, Tag: NNP
Processing - processing - POS: NOUN, Tag: NN
is - be - POS: AUX, Tag: VBZ
fascinating - fascinating - POS: ADJ, Tag: JJ
! - ! - POS: PUNCT, Tag: .


### Questions to Answer in Comments:
1. **Identify the POS tags for "quick," "jumps," and "is."**
- "quick" POS tag: ADJ (adjective)
- "jumps" POS tag: VERB (the text contains the base form "jump", which is tagged VERB with the fine-grained tag VB)
- "is" POS tag: AUX (auxiliary verb)

2. **Why might POS tagging be useful for tasks like grammar checking or machine translation?**
- POS tagging identifies the type of each word (ex: noun, verb, adjective) and therefore its role in the sentence, which helps predict what may come next. For example, a machine learning model might learn that a noun is likely to follow an adjective. For grammar checking, an unexpected sequence of tags can point to an error; for machine translation, knowing a word's part of speech helps choose the right translation and word order. Other attributes, such as token.is_stop, flag common words that carry little information, which helps identify the tokens in a doc that are more valuable to a machine learning model.
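
The idea that a noun is likely to follow an adjective can be illustrated with a small rule over the tags printed above. This is a minimal sketch, not a real grammar checker: `TAGGED` copies the (text, POS) pairs of the first sentence from the Task 2 output, and `noun_phrases` is a hypothetical helper that collects runs of determiners and adjectives ending in a noun.

```python
# (text, POS) pairs copied from the Task 2 output for the first sentence
TAGGED = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("does", "AUX"), ("n't", "PART"), ("jump", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT"),
]


def noun_phrases(tagged):
    """Collect DET/ADJ runs that end in a NOUN or PROPN (hypothetical helper)."""
    phrases, current = [], []
    for word, pos in tagged:
        if pos in ("DET", "ADJ"):
            current.append(word)          # modifier: keep building the phrase
        elif pos in ("NOUN", "PROPN") and current:
            current.append(word)          # noun closes the phrase
            phrases.append(" ".join(current))
            current = []
        else:
            current = []                  # anything else breaks the run
    return phrases


print(noun_phrases(TAGGED))  # ['The quick brown fox', 'the lazy dog']
```

A run of modifiers that never reaches a noun yields no phrase, which is the kind of signal a simple grammar check could flag.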

# Task 3: Named Entity Recognition (NER)
### Modify your script to identify named entities in the following text:
    "Barack Obama was the 44th President of the United States. He was born in Hawaii."

In [4]:
text = ("""Barack Obama was the 44th President of the United States. He was born in Hawaii.""")

# Tokenize
doc = nlp(text)
tokens = [token.text for token in doc]  # collect the token texts in a list

# Display named entities in the text
# doc.ents contains only the named entities, so we loop through the recognized entities rather than all tokens.
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")  # Print each entity's text, label, and explanation of the label

# Better presentation using displaCy
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)  # style='ent' highlights entities, using a different color for each category (ex: purple for person, white for ordinal, yellow for GPE)



Barack Obama: PERSON (People, including fictional)
44th: ORDINAL ("first", "second", etc.)
the United States: GPE (Countries, cities, states)
Hawaii: GPE (Countries, cities, states)


### Questions to Answer in Comments:

1. **Which entities are recognized by spaCy?**
   Hint: Loop through doc.ents 
- Barack Obama, 44th, the United States, Hawaii

2. **What entity types are assigned to "Barack Obama" and "Hawaii"?**
    Hint: Use the ent.label_ property of each entity in doc.ents
- Barack Obama: PERSON (People, including fictional)
- Hawaii: GPE (Geopolitical entity: countries, cities, states, etc)
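
To see which entity types occur and how often, the recognized entities can be grouped by label. The sketch below works on (text, label) pairs copied from the output above so that it runs without a model; with spaCy loaded, the same pairs could be built from `(ent.text, ent.label_) for ent in doc.ents`. The names `ENTITIES` and `group_by_label` are hypothetical.

```python
from collections import defaultdict

# (text, label) pairs copied from the Task 3 output
ENTITIES = [
    ("Barack Obama", "PERSON"),
    ("44th", "ORDINAL"),
    ("the United States", "GPE"),
    ("Hawaii", "GPE"),
]


def group_by_label(entities):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


for label, texts in group_by_label(ENTITIES).items():
    print(f"{label}: {texts}")
```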

# Task 4: Experimentation
### Write a new sentence or paragraph of your choice and run the spaCy pipeline on it.
### Experiment with changing words, adding punctuation, or introducing typos.


In [24]:
#Text with no errors
text = "Despite loving my semester abroad in London and the opportunity to travel around Europe I really missed Football season at Notre Dame"
doc = nlp(text)

# Display POS features of text
# Added feature: token.dep_ shows the syntactic relationship between tokens. It was added because the changes made between text and text1 (specifically the misspelling of "loving") might change this feature.
# Note: the Tag column prints token.pos_ again, so it repeats the coarse POS; token.tag_ would give the fine-grained tag.
for token in doc:
    print(f"{token.text} - {token.lemma_} - POS: {token.pos_}, Tag: {token.pos_}, Dependency: {token.dep_}")

# Display named entities in the text
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")


Despite - despite - POS: SCONJ, Tag: SCONJ, Dependency: prep
loving - love - POS: VERB, Tag: VERB, Dependency: pcomp
my - my - POS: PRON, Tag: PRON, Dependency: poss
semester - semester - POS: NOUN, Tag: NOUN, Dependency: dobj
abroad - abroad - POS: ADV, Tag: ADV, Dependency: advmod
in - in - POS: ADP, Tag: ADP, Dependency: prep
London - London - POS: PROPN, Tag: PROPN, Dependency: pobj
and - and - POS: CCONJ, Tag: CCONJ, Dependency: cc
the - the - POS: DET, Tag: DET, Dependency: det
opportunity - opportunity - POS: NOUN, Tag: NOUN, Dependency: conj
to - to - POS: PART, Tag: PART, Dependency: aux
travel - travel - POS: VERB, Tag: VERB, Dependency: acl
around - around - POS: ADP, Tag: ADP, Dependency: prep
Europe - Europe - POS: PROPN, Tag: PROPN, Dependency: pobj
I - I - POS: PRON, Tag: PRON, Dependency: nsubj
really - really - POS: ADV, Tag: ADV, Dependency: advmod
missed - miss - POS: VERB, Tag: VERB, Dependency: ROOT
Football - Football - POS: PROPN, Tag: PROPN, Dependency: compound

In [22]:

# Text with a typo ("lving" should be "loving"), "semester" changed to "time", and a comma added after "Europe"
text1 = "Despite lving my time abroad in London and the opportunity to travel around Europe, I really missed Football season at Notre Dame"
doc1 = nlp(text1)

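# Note: the Tag column prints token.pos_ (coarse POS), in the same format as the previous cell; token.tag_ would give the fine-grained tag.
for token in doc1:
    print(f"{token.text} - {token.lemma_} - POS: {token.pos_}, Tag: {token.pos_}, Dependency: {token.dep_}")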

for ent in doc1.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")


Despite - despite - POS: SCONJ, Tag: SCONJ, Dependency: prep
lving - lve - POS: VERB, Tag: VERB, Dependency: pcomp
my - my - POS: PRON, Tag: PRON, Dependency: poss
time - time - POS: NOUN, Tag: NOUN, Dependency: dobj
abroad - abroad - POS: ADV, Tag: ADV, Dependency: advmod
in - in - POS: ADP, Tag: ADP, Dependency: prep
London - London - POS: PROPN, Tag: PROPN, Dependency: pobj
and - and - POS: CCONJ, Tag: CCONJ, Dependency: cc
the - the - POS: DET, Tag: DET, Dependency: det
opportunity - opportunity - POS: NOUN, Tag: NOUN, Dependency: conj
to - to - POS: PART, Tag: PART, Dependency: aux
travel - travel - POS: VERB, Tag: VERB, Dependency: acl
around - around - POS: ADP, Tag: ADP, Dependency: prep
Europe - Europe - POS: PROPN, Tag: PROPN, Dependency: pobj
, - , - POS: PUNCT, Tag: PUNCT, Dependency: punct
I - I - POS: PRON, Tag: PRON, Dependency: nsubj
really - really - POS: ADV, Tag: ADV, Dependency: advmod
missed - miss - POS: VERB, Tag: VERB, Dependency: ROOT
Football - Football - POS:

In [18]:
# Comparison
from spacy import displacy

# Render the named entities of original text (doc)
print("Original Text:")
displacy.render(doc, style='ent', jupyter=True)

# Render the named entities of altered text (doc1)
print("Altered Text:")
displacy.render(doc1, style='ent', jupyter=True)


Original Text:


Altered Text:


### Questions to Answer in Comments:

1. **How does spaCy handle your modifications?**
- Despite the typo in "loving", spaCy still tags "lving" as a VERB with the pcomp dependency (complement of the preposition "Despite"), although its lemma becomes "lve", which is not a real word. The replacement of "semester" with "time" is tagged as a NOUN with the same dobj dependency. Lastly, the added comma appears as its own token, tagged PUNCT with the punct dependency.

2. **Did any entities or tags change? Why might this happen?**
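- Because the changes did not alter the meaning of the sentence, none of the tags or entities changed (as shown in the output above). The model predicts tags from the surrounding context as well as from the word itself, so a small typo or a synonym in an otherwise unchanged sentence may leave the predictions the same. The visible differences are the lemma of the misspelled word ("lve") and the extra token for the comma.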
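
The comparison in the answers above was done by eye; it can also be checked programmatically. The sketch below aligns the two token sequences with `difflib.SequenceMatcher` from the standard library and reports which tokens were edited and whether any aligned token changed its POS or dependency label. The rows are copied from the two outputs above, up to "missed" (where the printed output is cut off); `ORIGINAL`, `ALTERED`, and `compare_runs` are hypothetical names.

```python
from difflib import SequenceMatcher

# (text, pos, dep) rows copied from the two outputs above, up to "missed"
ORIGINAL = [
    ("Despite", "SCONJ", "prep"), ("loving", "VERB", "pcomp"), ("my", "PRON", "poss"),
    ("semester", "NOUN", "dobj"), ("abroad", "ADV", "advmod"), ("in", "ADP", "prep"),
    ("London", "PROPN", "pobj"), ("and", "CCONJ", "cc"), ("the", "DET", "det"),
    ("opportunity", "NOUN", "conj"), ("to", "PART", "aux"), ("travel", "VERB", "acl"),
    ("around", "ADP", "prep"), ("Europe", "PROPN", "pobj"), ("I", "PRON", "nsubj"),
    ("really", "ADV", "advmod"), ("missed", "VERB", "ROOT"),
]
ALTERED = [
    ("Despite", "SCONJ", "prep"), ("lving", "VERB", "pcomp"), ("my", "PRON", "poss"),
    ("time", "NOUN", "dobj"), ("abroad", "ADV", "advmod"), ("in", "ADP", "prep"),
    ("London", "PROPN", "pobj"), ("and", "CCONJ", "cc"), ("the", "DET", "det"),
    ("opportunity", "NOUN", "conj"), ("to", "PART", "aux"), ("travel", "VERB", "acl"),
    ("around", "ADP", "prep"), ("Europe", "PROPN", "pobj"), (",", "PUNCT", "punct"),
    ("I", "PRON", "nsubj"), ("really", "ADV", "advmod"), ("missed", "VERB", "ROOT"),
]


def compare_runs(a, b):
    """Return (token_edits, label_changes) for two lists of (text, pos, dep) rows."""
    matcher = SequenceMatcher(a=[r[0] for r in a], b=[r[0] for r in b], autojunk=False)
    token_edits, label_changes = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            token_edits.append((tag, [r[0] for r in a[i1:i2]], [r[0] for r in b[j1:j2]]))
        # Compare labels wherever the tokens line up one-to-one
        if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
            for row_a, row_b in zip(a[i1:i2], b[j1:j2]):
                if row_a[1:] != row_b[1:]:
                    label_changes.append((row_a, row_b))
    return token_edits, label_changes


edits, changes = compare_runs(ORIGINAL, ALTERED)
print("Token edits:", edits)
print("Label changes:", changes)
```

For these rows the token edits are the two replacements and the inserted comma, and the list of label changes is empty, which agrees with the answer to question 2 for the part of the output that is shown.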