In [3]:
# Installing the spaCy library
!pip install spacy

# Installing the small English language model used by spaCy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Task 1: Tokenization
### 1. Write a Python script to tokenize the following text: 
    "The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!"


In [20]:
text = ("""The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!""")

import spacy
# Load spaCy's small English model
nlp = spacy.load('en_core_web_sm')

# Apply the NLP pipeline to the text
doc = nlp(text)

# Extract tokens
tokens = [token.text for token in doc]  # collect the token strings in a list
print(tokens)


['The', 'quick', 'brown', 'fox', 'does', "n't", 'jump', 'over', 'the', 'lazy', 'dog', '.', 'Natural', 'Language', 'Processing', 'is', 'fascinating', '!']


### 2. Questions to Answer in Comments:

**1. How does spaCy process the various tokens?**
    Hint: Loop through the doc container using the token attributes: .text, .head, .lemma_, .morph.
- Tokenization segments text into words, punctuation marks, and similar units; the resulting tokens make up a doc. spaCy processes each candidate token with rules specific to the language. It first splits the text on whitespace (similar to split), and then the tokenizer performs two checks on each substring: 1. Does it match an exception? For example, "doesn't" contains no whitespace but still needs to be split into "does" and "n't". 2. Can a prefix, suffix, or infix be split off? For example, the period in "dog." is split off as a suffix.
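The two checks described above can be sketched in plain Python. This is a toy illustration, not spaCy's implementation: the names `EXCEPTIONS`, `PREFIX_CHARS`, `SUFFIX_CHARS`, and `toy_tokenize` are hypothetical, the rule tables are tiny, and infix handling is left out.

```python
# Toy rule tables (hypothetical and far smaller than spaCy's real ones)
EXCEPTIONS = {"doesn't": ["does", "n't"], "don't": ["do", "n't"]}
PREFIX_CHARS = set("(\"'")
SUFFIX_CHARS = set(")\"'.,!?")


def toy_tokenize(text):
    """Split on whitespace, then apply exception and prefix/suffix checks."""
    tokens = []
    for chunk in text.split():              # step 1: whitespace split
        trailing = []                       # suffixes peeled off, right to left
        while chunk:
            if chunk in EXCEPTIONS:         # check 1: special-case exception
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PREFIX_CHARS:  # check 2a: split off a prefix
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIX_CHARS: # check 2b: split off a suffix
                trailing.append(chunk[-1])
                chunk = chunk[:-1]
            else:                           # nothing left to split
                tokens.append(chunk)
                chunk = ""
        tokens.extend(reversed(trailing))   # restore left-to-right order
    return tokens


sample = ("The quick brown fox doesn't jump over the lazy dog. "
          "Natural Language Processing is fascinating!")
toy_tokens = toy_tokenize(sample)
print(toy_tokens)
```

On this sentence the toy rules give the same token list as the spaCy output printed earlier, but only because the sentence needs nothing beyond one exception and two suffix splits.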

**2. How does spaCy handle punctuation marks like periods and commas?**
- Punctuation marks are treated as separate tokens. In the output above, "." and "!" each appear as their own token.

**3. What happens when the text includes contractions (e.g., "don't")?**
- spaCy splits text so that each token is in its simplest form. For this reason a contraction is split into two tokens: in the output above, "doesn't" becomes "does" and "n't".
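One way to check these answers is to ask the tokenizer which rule produced each token. The sketch below assumes spaCy is installed and uses a blank English pipeline (tokenizer only, no trained model); `nlp_blank`, `sample`, and `sample_tokens` are names chosen for this example.

```python
import spacy

# A blank English pipeline has the rule-based tokenizer but no trained
# components, so no model download is needed for this check.
nlp_blank = spacy.blank("en")

sample = "The fox doesn't jump."
sample_tokens = [token.text for token in nlp_blank(sample)]
print(sample_tokens)

# Tokenizer.explain() pairs each token with the name of the rule
# (special case, prefix, suffix, infix, plain token) that produced it.
for rule, token_text in nlp_blank.tokenizer.explain(sample):
    print(f"{token_text!r:10} <- {rule}")
```

The rule names in the second column make it visible that the contraction is handled by a special-case exception while the final period is split off as a suffix.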


# Task 2: Part-of-Speech Tagging
### Extend your script to include part-of-speech tagging for the tokens.
    Hint: Use token.pos_ and token.tag_ properties.

In [26]:
#WHICH TO USE
#ask if the "is." period is an error
for token in doc:
    print(f"{token.text} - {token.lemma_}: {token.pos_}")

for token in doc:
    print(f"{token.text} - {token.lemma_} - POS: {token.pos_}, Tag: {token.tag_}")


The - the: DET
quick - quick: ADJ
brown - brown: ADJ
fox - fox: NOUN
does - do: AUX
n't - not: PART
jump - jump: VERB
over - over: ADP
the - the: DET
lazy - lazy: ADJ
dog - dog: NOUN
. - .: PUNCT
Natural - Natural: PROPN
Language - Language: PROPN
Processing - processing: NOUN
is - be: AUX
fascinating - fascinating: ADJ
! - !: PUNCT
The - the - POS: DET, Tag: DT
quick - quick - POS: ADJ, Tag: JJ
brown - brown - POS: ADJ, Tag: JJ
fox - fox - POS: NOUN, Tag: NN
does - do - POS: AUX, Tag: VBZ
n't - not - POS: PART, Tag: RB
jump - jump - POS: VERB, Tag: VB
over - over - POS: ADP, Tag: IN
the - the - POS: DET, Tag: DT
lazy - lazy - POS: ADJ, Tag: JJ
dog - dog - POS: NOUN, Tag: NN
. - . - POS: PUNCT, Tag: .
Natural - Natural - POS: PROPN, Tag: NNP
Language - Language - POS: PROPN, Tag: NNP
Processing - processing - POS: NOUN, Tag: NN
is - be - POS: AUX, Tag: VBZ
fascinating - fascinating - POS: ADJ, Tag: JJ
! - ! - POS: PUNCT, Tag: .
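To look up the tags for specific words instead of scanning the full printout, a small helper can filter the tokens. This is a minimal sketch: `pos_lookup` and `Row` are hypothetical names, and the stand-in rows are copied from the output above so the block runs without loading a model.

```python
from collections import namedtuple


def pos_lookup(tokens, words):
    """Map each requested word to the (pos_, tag_) pairs of matching tokens."""
    wanted = {word.lower() for word in words}
    found = {}
    for token in tokens:
        key = token.text.lower()
        if key in wanted:
            found.setdefault(key, []).append((token.pos_, token.tag_))
    return found


# Stand-in rows copied from the printed output above; in the notebook,
# pass the spaCy doc itself instead of this list.
Row = namedtuple("Row", ["text", "pos_", "tag_"])
rows = [Row("quick", "ADJ", "JJ"), Row("jump", "VERB", "VB"), Row("is", "AUX", "VBZ")]

lookup = pos_lookup(rows, ["quick", "jumps", "is"])
print(lookup)
```

In the notebook the call would be `pos_lookup(doc, ["quick", "jumps", "is"])`. A word that is missing from the result, such as "jumps" here, does not occur in the text in that exact form.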


### Questions to Answer in Comments:
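1. **Identify the POS tags for "quick," "jumps," and "is."**
- "quick" POS tag: ADJ (adjective); fine-grained tag JJ
- "jumps" POS tag: VERB; fine-grained tag VB (the text contains "jump", not "jumps")
- "is" POS tag: AUX (auxiliary verb); fine-grained tag VBZ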

2. **Why might POS tagging be useful for tasks like grammar checking or machine translation?**
- One of the main features of POS tagging is th

# Task 3: Named Entity Recognition (NER)
### Modify your script to identify named entities in the following text:
    "Barack Obama was the 44th President of the United States. He was born in Hawaii."

In [27]:
# ASK: why is "the" included in the entity "the United States"?
# ASK: does Q2 of the written part need to be answered using code (token properties)?
text = ("""Barack Obama was the 44th President of the United States. He was born in Hawaii.""")

# Run the full pipeline (tokenization, tagging, NER) on the new text
doc = nlp(text)
tokens = [token.text for token in doc]  # collect the token strings in a list

# Display named entities in the text
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

#Better Presentation using Displacy
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)



Barack Obama: PERSON (People, including fictional)
44th: ORDINAL ("first", "second", etc.)
the United States: GPE (Countries, cities, states)
Hawaii: GPE (Countries, cities, states)


### Questions to Answer in Comments:

1. **Which entities are recognized by spaCy?**
   Hint: Loop through doc.ents 
- Barack Obama, 44th, the United States, Hawaii

2. **What entity types are assigned to "Barack Obama" and "Hawaii"?**
    Hint: Use the ent.label_ property (for individual tokens, token.ent_type_)
- Barack Obama: PERSON
- Hawaii: GPE (geopolitical entity: countries, cities, states, etc.)
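Grouping the entities by label makes both answers readable at a glance. This is a sketch only: `entities_by_label` and `Ent` are hypothetical names, and the stand-in entities are copied from the output above so the block runs on its own.

```python
from collections import defaultdict, namedtuple


def entities_by_label(ents):
    """Group entity texts under their label, keeping document order."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-in entities copied from the printed output above; in the notebook,
# pass doc.ents instead of this list.
Ent = namedtuple("Ent", ["text", "label_"])
ents = [
    Ent("Barack Obama", "PERSON"),
    Ent("44th", "ORDINAL"),
    Ent("the United States", "GPE"),
    Ent("Hawaii", "GPE"),
]

grouped = entities_by_label(ents)
print(grouped)
```

In the notebook the call would be `entities_by_label(doc.ents)`.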

# Task 4: Experimentation
### Write a new sentence or paragraph of your choice and run the spaCy pipeline on it.
### Experiment with changing words, adding punctuation, or introducing typos.


In [39]:
#Text with typo (lovd should be loved)
text = "Despite I lovd my semester abroad in London and traveling around Europe, I really missed Football season at Notre Dame"
doc = nlp(text)

for token in doc:
    print(f"{token.text} - Lemma: {token.lemma_}, POS: {token.pos_}")

for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

#Better Presentation using Displacy
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)


Despite - Lemma: despite, POS: SCONJ
I - Lemma: I, POS: PRON
lovd - Lemma: lovd, POS: VERB
my - Lemma: my, POS: PRON
semester - Lemma: semester, POS: NOUN
abroad - Lemma: abroad, POS: ADV
in - Lemma: in, POS: ADP
London - Lemma: London, POS: PROPN
and - Lemma: and, POS: CCONJ
traveling - Lemma: travel, POS: VERB
around - Lemma: around, POS: ADP
Europe - Lemma: Europe, POS: PROPN
, - Lemma: ,, POS: PUNCT
I - Lemma: I, POS: PRON
really - Lemma: really, POS: ADV
missed - Lemma: miss, POS: VERB
Football - Lemma: Football, POS: PROPN
season - Lemma: season, POS: NOUN
at - Lemma: at, POS: ADP
Notre - Lemma: Notre, POS: PROPN
Dame - Lemma: Dame, POS: PROPN
Entity: London, Label: GPE
Entity: Europe, Label: LOC
Entity: Notre Dame, Label: FAC
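To see exactly which tags change between an original sentence and a modified one, the two annotation lists can be diffed. The sketch below uses the standard-library `difflib`; `diff_annotations` is a hypothetical helper, and the `before`/`after` values are placeholders rather than real spaCy output.

```python
from difflib import SequenceMatcher


def diff_annotations(before, after):
    """Return (operation, before_span, after_span) for each differing span."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (op, before[i1:i2], after[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Placeholder (text, tag) pairs, not real spaCy output. In the notebook,
# build each list with [(t.text, t.pos_) for t in nlp(some_text)].
before = [("a", "X"), ("b", "Y")]
after = [("a", "X"), ("b", "Z")]

changes = diff_annotations(before, after)
print(changes)
```

Because the matcher aligns sequences rather than positions, the comparison still works when a modification changes the number of tokens, for example when added punctuation introduces an extra token.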


### Questions to Answer in Comments:

1. **How does spaCy handle your modifications?**


2. **Did any entities or tags change? Why might this happen?**
  