
Write a script that includes the following:


* A sentence of ~10 words that is grammatically correct and free of spelling errors.
* Process the sentence using NLTK or SpaCy to generate and output part of speech tags.
* Edit the sentence to introduce 2-3 spelling errors and generate/output part of speech tags again

Write a paragraph discussing your results:
* Were there errors in tagging the misspelled words?
* Was the algorithm still able to tag misspelled words correctly?
* Why do you think the algorithm did, or did not, tag the misspelled words correctly?

1. Import and download all required nltk and spacy libraries

In [11]:
!pip install spacy nltk
!pip install pandas

ERROR: Could not find a version that satisfies the requirement as (from versions: none)
ERROR: No matching distribution found for as

In [14]:
import spacy
import nltk
import pandas as pd  # used later for the side-by-side comparison table
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('punkt_tab')
nltk.download('words')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

2. Set the text to be used for POS tagging

In [3]:
text = "Imagination is the spark that lights your world, but only those who face their fears can see its full glow."

In [6]:

# Process with SpaCy for POS tagging and dependency parsing
print("Original Text:")
nlp = spacy.load("en_core_web_sm") # Load the language model
doc_original = nlp(text) # Process the text to create a Doc object
for token in doc_original:
    print(f"{token.text} ({token.pos_}), ({token.dep_})")


Original Text:
Imagination (NOUN), (nsubj)
is (AUX), (ROOT)
the (DET), (det)
spark (NOUN), (attr)
that (PRON), (nsubj)
lights (VERB), (relcl)
your (PRON), (poss)
world (NOUN), (dobj)
, (PUNCT), (punct)
but (CCONJ), (cc)
only (ADV), (advmod)
those (PRON), (nsubj)
who (PRON), (nsubj)
face (VERB), (relcl)
their (PRON), (poss)
fears (NOUN), (dobj)
can (AUX), (aux)
see (VERB), (conj)
its (PRON), (poss)
full (ADJ), (amod)
glow (NOUN), (dobj)
. (PUNCT), (punct)


In [8]:

# Introduce spelling errors into the text
text_with_errors = "Imagination is the spark that lits your world, but only those who face their feears can see its full glow."

print(text_with_errors)

Imagination is the spark that lits your world, but only those who face their feears can see its full glow.


In [9]:

print("\nText with Spelling Errors:")
doc_errors = nlp(text_with_errors) # Process the text with errors
for token in doc_errors:
    print(f"{token.text} ({token.pos_}), ({token.dep_})")




Text with Spelling Errors:
Imagination (NOUN), (nsubj)
is (AUX), (ROOT)
the (DET), (det)
spark (NOUN), (attr)
that (PRON), (nsubj)
lits (VERB), (relcl)
your (PRON), (poss)
world (NOUN), (dobj)
, (PUNCT), (punct)
but (CCONJ), (cc)
only (ADV), (advmod)
those (PRON), (nsubj)
who (PRON), (nsubj)
face (VERB), (relcl)
their (PRON), (poss)
feears (NOUN), (dobj)
can (AUX), (aux)
see (VERB), (conj)
its (PRON), (poss)
full (ADJ), (amod)
glow (NOUN), (dobj)
. (PUNCT), (punct)


In [15]:
# Build a side-by-side comparison DataFrame.
# Tokens are matched by index, up to the shorter length in case token counts differ.
comparison_data = []
n_tokens = min(len(doc_original), len(doc_errors))

for i in range(n_tokens):
    # Original text token/POS
    orig_token = doc_original[i].text
    orig_pos = doc_original[i].pos_

    # Error text token/POS
    err_token = doc_errors[i].text
    err_pos = doc_errors[i].pos_

    comparison_data.append({
        "Original Token": orig_token,
        "Original POS": orig_pos,
        "Error Token": err_token,
        "Error POS": err_pos
    })

df_comparison = pd.DataFrame(comparison_data)

df_comparison

| | Original Token | Original POS | Error Token | Error POS |
|---|---|---|---|---|
| 0 | Imagination | NOUN | Imagination | NOUN |
| 1 | is | AUX | is | AUX |
| 2 | the | DET | the | DET |
| 3 | spark | NOUN | spark | NOUN |
| 4 | that | PRON | that | PRON |
| 5 | lights | VERB | lits | VERB |
| 6 | your | PRON | your | PRON |
| 7 | world | NOUN | world | NOUN |
| 8 | , | PUNCT | , | PUNCT |
| 9 | but | CCONJ | but | CCONJ |
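
The index-based pairing works here because both sentences split into the same number of tokens. As a minimal sketch, the tagged output printed earlier can be hard-coded and filtered down to only the positions where the token text changed, which makes the two misspellings easy to spot. The helper names `parse_tagged` and `changed_tokens` are illustrative assumptions and are not part of the notebook; the tags themselves are copied from the spaCy output above.

```python
# Token/POS pairs copied from the two spaCy runs printed above.
ORIGINAL = (
    "Imagination/NOUN is/AUX the/DET spark/NOUN that/PRON lights/VERB "
    "your/PRON world/NOUN ,/PUNCT but/CCONJ only/ADV those/PRON who/PRON "
    "face/VERB their/PRON fears/NOUN can/AUX see/VERB its/PRON full/ADJ "
    "glow/NOUN ./PUNCT"
)
WITH_ERRORS = (
    "Imagination/NOUN is/AUX the/DET spark/NOUN that/PRON lits/VERB "
    "your/PRON world/NOUN ,/PUNCT but/CCONJ only/ADV those/PRON who/PRON "
    "face/VERB their/PRON feears/NOUN can/AUX see/VERB its/PRON full/ADJ "
    "glow/NOUN ./PUNCT"
)


def parse_tagged(line):
    """Turn 'token/POS token/POS ...' into a list of (token, pos) tuples."""
    return [tuple(item.rsplit("/", 1)) for item in line.split()]


def changed_tokens(original, with_errors):
    """Return (orig_token, orig_pos, err_token, err_pos, tags_match) for each
    position where the token text differs between the two runs."""
    return [
        (o_tok, o_pos, e_tok, e_pos, o_pos == e_pos)
        for (o_tok, o_pos), (e_tok, e_pos) in zip(original, with_errors)
        if o_tok != e_tok
    ]


original = parse_tagged(ORIGINAL)
with_errors = parse_tagged(WITH_ERRORS)

for row in changed_tokens(original, with_errors):
    print(row)
```

Only the two misspelled positions are reported, and in both the final field is `True`, meaning the POS tag did not change.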


## Analysis of Results

### Overall POS Tag Consistency
Comparing the two runs, every token in the misspelled sentence receives the same POS tag (and the same dependency label) as its counterpart in the original sentence. For example:

- **Original**: "lights" (VERB)  
  **Misspelled**: "lits" (VERB)  
- **Original**: "fears" (NOUN)  
  **Misspelled**: "feears" (NOUN)

---

### Why SpaCy Still Tagged Misspelled Words Correctly

#### Contextual Clues
spaCy’s tagger does not rely solely on a dictionary of correctly spelled words. Instead, it draws on the surrounding words (context), the syntactic structure, and the surface form of each token, such as its suffix. For instance, “lits” appears in the slot “that ___ your world,” where the word takes “your world” as its object, a position normally filled by a verb (“that lights your world”). Consequently, spaCy tags it as a verb even though it is misspelled.

#### Morphology & Syntax
“feears” directly follows the possessive “their” (“their feears”), which strongly suggests a noun. The misspelling also keeps the final “-s” of “fears,” so it still looks like a plural noun with one extra character.
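
To make these two kinds of cue concrete, here is a toy sketch of the evidence available around each misspelled word: the neighbouring words and the word's final letter. This is not how spaCy works internally (its pipelines learn their own representations from data rather than using a hand-written feature list); the helper `context_features` is hypothetical, and plain whitespace splitting stands in for spaCy's tokenizer.

```python
def context_features(tokens, i):
    """Hand-crafted cues for tokens[i]: neighbouring words and final letter."""
    return {
        "prev": tokens[i - 1].lower() if i > 0 else "<start>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<end>",
        "suffix": tokens[i][-1].lower(),
    }


text = ("Imagination is the spark that lights your world, but only those "
        "who face their fears can see its full glow.")
text_with_errors = ("Imagination is the spark that lits your world, but only "
                    "those who face their feears can see its full glow.")

# Whitespace splitting is a simplification; spaCy also separates punctuation.
original = text.split()
misspelled = text_with_errors.split()

for i, (good, bad) in enumerate(zip(original, misspelled)):
    if good != bad:
        feats = context_features(misspelled, i)
        same = feats == context_features(original, i)
        print(f"{good!r} vs {bad!r}: {feats} identical={same}")
```

For both misspelled words, the feature dictionary is identical to the one computed for the correct spelling, which illustrates why a model leaning on context and suffixes has little reason to change its tag.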

---

### Potential for Tagging Errors
In this example, the misspellings are relatively mild (just one or two letters off), so the words still resemble their original forms. However, if the spelling errors were more extreme, spaCy might struggle to match them to recognizable morphological patterns or to infer them from context, potentially leading to incorrect POS tags.

Real-world text can contain typos or nonstandard spellings (like internet slang), which may cause more frequent or severe tagging errors, depending on how far those spellings are from spaCy’s learned patterns.
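
One way to put a number on how far a misspelling is from the intended word is edit distance. The sketch below is a standard dynamic-programming Levenshtein distance applied to the two misspellings from this notebook. Treating edit distance as a rough proxy for tagging difficulty is an assumption made for illustration; spaCy does not compute it.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


for right, wrong in [("lights", "lits"), ("fears", "feears")]:
    print(f"{right} -> {wrong}: {levenshtein(right, wrong)} edit(s)")
```

Here “lits” is two edits from “lights” and “feears” is one edit from “fears,” which matches the description of these errors as mild.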

---

### Summary

- **Were there errors in tagging the misspelled words?**  
  In this case, no. The misspelled tokens (“lits” and “feears”) were still assigned the same part of speech as their correctly spelled counterparts (“lights” and “fears”).

- **Did the algorithm tag misspelled words correctly?**  
  Yes, it did. The minor misspellings did not confuse spaCy’s POS tagger enough to change the tags.

- **Why?**  
  Because spaCy uses a statistical model that relies heavily on the context and syntax of the surrounding tokens, as well as on morphological cues. Slight variations in spelling often do not override these contextual signals.

Overall, this illustrates that context-based POS taggers like spaCy can be fairly robust to small spelling errors in determining a word’s part of speech.
