## HOMEWORK 7 
### Elements of Computing 
### Sabrina Cohen 

### Task 1 - Tokenization


In [2]:
import spacy

# Load spaCy's small English model
nlp = spacy.load('en_core_web_sm')

# Input text
text = (
    "The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!"
)

# Apply the spaCy pipeline to the text
doc = nlp(text)

# Extract and print token list 
tokens_spacy = [token.text for token in doc]
print(" Tokens:", tokens_spacy)

# Detailed token analysis
print("\n Token | Lemma | Head | Morph:")

for token in doc:
    print(f"{token.text} - Lemma: {token.lemma_} | Head: {token.head.text} | Morph: {token.morph}")


 Tokens: ['The', 'quick', 'brown', 'fox', 'does', "n't", 'jump', 'over', 'the', 'lazy', 'dog', '.', 'Natural', 'Language', 'Processing', 'is', 'fascinating', '!']

 Token | Lemma | Head | Morph:
The - Lemma: the | Head: fox | Morph: Definite=Def|PronType=Art
quick - Lemma: quick | Head: fox | Morph: Degree=Pos
brown - Lemma: brown | Head: fox | Morph: Degree=Pos
fox - Lemma: fox | Head: jump | Morph: Number=Sing
does - Lemma: do | Head: jump | Morph: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
n't - Lemma: not | Head: jump | Morph: Polarity=Neg
jump - Lemma: jump | Head: jump | Morph: VerbForm=Inf
over - Lemma: over | Head: jump | Morph: 
the - Lemma: the | Head: dog | Morph: Definite=Def|PronType=Art
lazy - Lemma: lazy | Head: dog | Morph: Degree=Pos
dog - Lemma: dog | Head: over | Morph: Number=Sing
. - Lemma: . | Head: jump | Morph: PunctType=Peri
Natural - Lemma: Natural | Head: Language | Morph: Number=Sing
Language - Lemma: Language | Head: Processing | Morph: Number=Si

#### Questions to Answer: 
* How does spaCy process the various tokens? (Hint: Loop through the doc container using the token attributes: .text, .head, .lemma_, .morph.)
    - SpaCy processes the text by breaking it into individual units called tokens (words and punctuation marks), each represented by a token object. Each token has its own attributes, and analyzing them allows us to figure out the purpose and characteristics of each word within the sentence. In this case, the code displayed the following attributes:
        - Text: The token exactly as it appears in the text.
        - Lemma: The base/dictionary form of a word.
        - Head: The main word another word depends on.
        - Morph: Grammatical features like tense, number, or gender.
* How does spaCy handle punctuation marks like periods and commas?
    - SpaCy treats each punctuation mark as its own token, listed the same way as words. This matters because spaCy also analyzes these marks and reports their attributes (for example, the period has Morph: PunctType=Peri), even though they are not words.
* What happens when the text includes contractions (e.g., "don't")?
    - SpaCy splits contractions into two tokens. In this case, "doesn't" became 'does' and 'n't'. The 'n't' token has the lemma 'not' and the morphological feature Polarity=Neg, so spaCy treats it as a negation and can still analyze the grammatical structure of the sentence accurately.
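
The splitting behavior described above can be imitated with a short regular expression. This is only a toy sketch under my own assumptions: `TOKEN_PATTERN` and `toy_tokenize` are hypothetical names, and the pattern is not how spaCy's tokenizer is actually implemented. It only reproduces two of the observed effects: punctuation becomes its own token, and "n't" is split off from its stem.

```python
import re

# Toy pattern (NOT spaCy's real rules). Alternatives are tried in order:
#   1. a word stem that is directly followed by "n't" ("doesn't" -> "does")
#   2. the "n't" clitic itself
#   3. any other run of word characters
#   4. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


text = (
    "The quick brown fox doesn't jump over the lazy dog. "
    "Natural Language Processing is fascinating!"
)
print(toy_tokenize(text))
```

For this sentence the toy tokenizer returns the same token list that spaCy printed in Task 1. It would not match spaCy on other inputs (for example, possessives like "dog's"), because spaCy uses a much larger set of language-specific rules and exceptions.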

### Task 2 - Part-of-Speech Tagging 

In [3]:
for token in doc:
    print(f"{token.text} - {token.lemma_}: {token.pos_}")

The - the: DET
quick - quick: ADJ
brown - brown: ADJ
fox - fox: NOUN
does - do: AUX
n't - not: PART
jump - jump: VERB
over - over: ADP
the - the: DET
lazy - lazy: ADJ
dog - dog: NOUN
. - .: PUNCT
Natural - Natural: PROPN
Language - Language: PROPN
Processing - processing: NOUN
is - be: AUX
fascinating - fascinating: ADJ
! - !: PUNCT


#### Questions to Answer: 
* Identify the POS tags for "quick," "jumps," and "is."
    - The POS tag for "quick" is ADJ, since it's an adjective that describes the noun "fox."
    - The sample text contains "jump" rather than "jumps"; its POS tag is VERB, because it's the main action in the sentence.
    - The POS tag for "is" is AUX (auxiliary verb), as it helps connect the subject to the description "fascinating."
* Why might POS tagging be useful for tasks like grammar checking or machine translation?
    - POS tagging is useful for grammar checking because it helps catch sentence structure issues, such as a noun used where a verb should be, or a word order that doesn't make sense. It can guide tools to suggest better phrasing and help users catch mistakes. For machine translation, it helps the system understand the role each word plays in a sentence, so it can recreate that structure in another language and produce a grammatically correct translation.
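
To make the grammar-checking idea concrete, here is a minimal sketch of a rule that works on POS tags alone. The (token, tag) pairs are copied from the Task 2 output for the first sentence. The rule itself, the set `SUSPICIOUS_AFTER_DET`, and the function `flag_det_errors` are hypothetical and only for illustration; a real grammar checker would need far more than one rule.

```python
# (token, POS) pairs copied from the Task 2 output (first sentence only).
tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("does", "AUX"), ("n't", "PART"), ("jump", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT"),
]

# Assumed toy rule: a determiner should not be followed directly by
# a verb, an auxiliary, a preposition, or punctuation.
SUSPICIOUS_AFTER_DET = {"VERB", "AUX", "ADP", "PUNCT"}


def flag_det_errors(pairs):
    """Return (determiner, next_token) pairs that break the toy rule."""
    flags = []
    # Walk over each token together with the token that follows it.
    for (word, pos), (next_word, next_pos) in zip(pairs, pairs[1:]):
        if pos == "DET" and next_pos in SUSPICIOUS_AFTER_DET:
            flags.append((word, next_word))
    return flags


print(flag_det_errors(tagged))  # → []

# A deliberately scrambled sentence, tagged by hand for this example.
broken = [("The", "DET"), ("jump", "VERB"), ("fox", "NOUN"), (".", "PUNCT")]
print(flag_det_errors(broken))  # → [('The', 'jump')]
```

The point of the sketch is that the check never looks at the words themselves, only at their tags, which is why accurate POS tagging matters for tools like this.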

### Task 3 - Named Entity Recognition 

In [4]:
text = (
    "Barack Obama was the 44th President of the United States. He was born in Hawaii."
)

# Apply the spaCy pipeline to the text
doc = nlp(text)
tokens_spacy = [token.text for token in doc]

# Display named entities in the text
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

Barack Obama: PERSON (People, including fictional)
44th: ORDINAL ("first", "second", etc.)
the United States: GPE (Countries, cities, states)
Hawaii: GPE (Countries, cities, states)


#### Questions to Answer: 
* Which entities are recognized by spaCy? (Hint: Loop through doc.ents) 
    - SpaCy recognizes 'Barack Obama' as a PERSON, '44th' as an ORDINAL, and 'the United States' and 'Hawaii' as GPEs (geopolitical entities).
* What entity types are assigned to "Barack Obama" and "Hawaii"? (Hint: Use the ent.label_ property)
    - Barack Obama is labeled PERSON, which is correct, and Hawaii is labeled GPE, a category that covers countries, cities, and states, so it is also accurate.
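
The answers above can also be read off programmatically. The sketch below uses the (text, label) pairs copied from the Task 3 output instead of calling spaCy, so it runs without the model; the variable names `by_label` and `label_of` are my own.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the Task 3 output.
entities = [
    ("Barack Obama", "PERSON"),
    ("44th", "ORDINAL"),
    ("the United States", "GPE"),
    ("Hawaii", "GPE"),
]

# Group entity texts by their label.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

# Look up the label for a specific entity text.
label_of = dict(entities)

print(dict(by_label))
print(label_of["Barack Obama"], label_of["Hawaii"])  # → PERSON GPE
```

With a live pipeline, the same pairs could be collected from `doc.ents` using `ent.text` and `ent.label_`, as in the loop in the cell above.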

### Task 4 - Experimentation 

In [None]:
# Original - Clean Text 
original_text = (
    "After solving a tricky NLP script in computing class, Sabrina went to the TDS open house. "
    "She met with faculty, explored cool projects, and talked about her plans to continue the Computing & Digital Technologies minor with excitement!"
)

# Apply the spaCy pipeline to the text
doc_clean = nlp(original_text)


# Entities
print("Clean version - Named Entities:")
for ent in doc_clean.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

# Extract and print token list 
tokens_spacy = [token.text for token in doc_clean]
print(" Tokens:", tokens_spacy)

for token in doc_clean:
    print(f"{token.text} - {token.lemma_}: {token.pos_}")






Clean version - Named Entities:
NLP: ORG (Companies, agencies, institutions, etc.)
Sabrina: PERSON (People, including fictional)
TDS: ORG (Companies, agencies, institutions, etc.)
the Computing & Digital Technologies: ORG (Companies, agencies, institutions, etc.)
 Tokens: ['After', 'solving', 'a', 'tricky', 'NLP', 'script', 'in', 'computing', 'class', ',', 'Sabrina', 'went', 'to', 'the', 'TDS', 'open', 'house', '.', 'She', 'met', 'with', 'faculty', ',', 'explored', 'cool', 'projects', ',', 'and', 'talked', 'about', 'her', 'plans', 'to', 'continue', 'the', 'Computing', '&', 'Digital', 'Technologies', 'minor', 'with', 'excitement', '!']
After - after: ADP
solving - solve: VERB
a - a: DET
tricky - tricky: ADJ
NLP - NLP: PROPN
script - script: NOUN
in - in: ADP
computing - compute: VERB
class - class: NOUN
, - ,: PUNCT
Sabrina - Sabrina: PROPN
went - go: VERB
to - to: ADP
the - the: DET
TDS - TDS: PROPN
open - open: ADJ
house - house: NOUN
. - .: PUNCT
She - she: PRON
met - meet: VERB
with

In [6]:
# Modified (typos + weird punctuation + lowercase names)
modified_text = (
    "Aftr solving... a triky NLP skript in Computing Class!!  ,sabrina went to the TDS opnhouse. "
    "she met with faculty, explored cool projects, and talked abut her plans to continue the Computting & digital Technolgies minor with excitment!"
)

# Apply the spaCy pipeline to the text
doc_modified = nlp(modified_text)

# Print entities from modified version
print(" Modified version - Named Entities:")
for ent in doc_modified.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

# Extract and print token list 
tokens_spacy = [token.text for token in doc_modified]
print(" Tokens:", tokens_spacy)

for token in doc_modified:
    print(f"{token.text} - {token.lemma_}: {token.pos_}")

 Modified version - Named Entities:
NLP: ORG (Companies, agencies, institutions, etc.)
Computing: PERSON (People, including fictional)
TDS: ORG (Companies, agencies, institutions, etc.)
Computting &: ORG (Companies, agencies, institutions, etc.)
Technolgies: PRODUCT (Objects, vehicles, foods, etc. (not services))
 Tokens: ['Aftr', 'solving', '...', 'a', 'triky', 'NLP', 'skript', 'in', 'Computing', 'Class', '!', '!', ' ', ',', 'sabrina', 'went', 'to', 'the', 'TDS', 'opnhouse', '.', 'she', 'met', 'with', 'faculty', ',', 'explored', 'cool', 'projects', ',', 'and', 'talked', 'abut', 'her', 'plans', 'to', 'continue', 'the', 'Computting', '&', 'digital', 'Technolgies', 'minor', 'with', 'excitment', '!']
Aftr - Aftr: PROPN
solving - solving: NOUN
... - ...: PUNCT
a - a: DET
triky - triky: NOUN
NLP - NLP: PROPN
skript - skript: NOUN
in - in: ADP
Computing - Computing: PROPN
Class - Class: PROPN
! - !: PUNCT
! - !: PUNCT
  -  : SPACE
, - ,: PUNCT
sabrina - sabrina: PROPN
went - go: VERB
to - to

#### Questions to Answer: 
* How does spaCy handle your modifications?
    - SpaCy struggles when the input contains typos, unusual punctuation, or inconsistent capitalization. In the clean version, it correctly identified "Sabrina" as a PERSON, "TDS" as an ORG, and "the Computing & Digital Technologies" as an ORG. In the modified version, I removed the capitalization from "Sabrina", so although it was still tagged as PROPN (proper noun), it was no longer recognized as a named entity. I also introduced a typo in "Technologies" (misspelled as "Technolgies") and removed the capital letter from "Digital." As a result, spaCy incorrectly labeled "Technolgies" as a PRODUCT, showing that spelling and formatting errors confuse the model.

* Did any entities or tags change? Why might this happen?
    - Yes, several entity labels and part-of-speech (POS) tags changed or disappeared. For example, "the Computing & Digital Technologies" (ORG) was split into "Computting &" (ORG) and "Technolgies" (PRODUCT) in the modified version. Also, the POS tag for "solving" changed from VERB to NOUN, likely because the misspelled words around it made it harder for spaCy to recognize it as an action. Punctuation such as '!' and '...' was still tokenized correctly, but it added noise that may have affected the interpretation. This shows that spaCy relies on specific patterns, so even small modifications can lead to misclassifications.
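
The comparison between the two runs can be summarized with a small helper. This is a sketch under stated assumptions: the two dictionaries are copied by hand from the entity output of the clean and modified cells, `diff_entities` is a hypothetical helper name, and keying on the entity text assumes each entity string appears only once per run.

```python
# Entity text -> label, copied from the clean-version output.
clean_ents = {
    "NLP": "ORG",
    "Sabrina": "PERSON",
    "TDS": "ORG",
    "the Computing & Digital Technologies": "ORG",
}

# Entity text -> label, copied from the modified-version output.
modified_ents = {
    "NLP": "ORG",
    "Computing": "PERSON",
    "TDS": "ORG",
    "Computting &": "ORG",
    "Technolgies": "PRODUCT",
}


def diff_entities(before, after):
    """Compare two entity dictionaries by exact entity text."""
    return {
        # Same text and same label in both runs.
        "kept": sorted(t for t in before if after.get(t) == before[t]),
        # Present before, missing after.
        "lost": sorted(t for t in before if t not in after),
        # New entity strings that only appear after the changes.
        "gained": sorted(t for t in after if t not in before),
        # Same text in both runs but with a different label.
        "relabeled": sorted(
            t for t in before if t in after and after[t] != before[t]
        ),
    }


for change, texts in diff_entities(clean_ents, modified_ents).items():
    print(f"{change}: {texts}")
```

Because the comparison is by exact text, the split of the minor's name shows up as one lost entity and two gained ones, which matches the observation in the answer above.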