
In [1]:
!pip install spacy nltk
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import spacy

# Load the small English spaCy pipeline (tokenizer, tagger, parser, lemmatizer, NER)
nlp = spacy.load("en_core_web_sm")

In [3]:
text = "Apple is looking at buying U.K. startup for $1 billion. Elon Musk founded SpaceX in 2002."
doc = nlp(text)

print("Original Text:")
print(text)

Original Text:
Apple is looking at buying U.K. startup for $1 billion. Elon Musk founded SpaceX in 2002.


Tokenization

In [4]:
print("Tokenization:")
tokens = [token.text for token in doc]
print(tokens)

Tokenization:
['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.', 'Elon', 'Musk', 'founded', 'SpaceX', 'in', '2002', '.']


Stop Word Removal

In [5]:
print("Stop Word Removal:")
# Keep only tokens that are neither stop words nor punctuation
filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(filtered_tokens)

Stop Word Removal:
['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion', 'Elon', 'Musk', 'founded', 'SpaceX', '2002']
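
The filter above relies on spaCy's `is_stop` and `is_punct` token flags. As a rough illustration of the idea only, the sketch below reproduces the same result in plain Python. `STOP_WORDS` is a tiny hypothetical stand-in (spaCy's real list is much larger), and `is_punct` is an assumed helper that approximates punctuation using Unicode categories; neither is spaCy's actual implementation.

```python
import unicodedata

# Hypothetical, deliberately tiny stop-word set; spaCy's real list is far larger.
STOP_WORDS = {"is", "at", "for", "in"}


def is_punct(token: str) -> bool:
    """Approximate check: every character belongs to a Unicode 'P' category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


# Token list copied from the tokenization output above.
tokens = ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for',
          '$', '1', 'billion', '.', 'Elon', 'Musk', 'founded', 'SpaceX',
          'in', '2002', '.']

# Compare case-insensitively and drop pure-punctuation tokens.
filtered = [t for t in tokens if t.lower() not in STOP_WORDS and not is_punct(t)]
print(filtered)
```

In this sketch `$` survives because it is a currency symbol rather than punctuation, which mirrors the output above.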


Lemmatization (spaCy)

In [7]:
print("Lemmatization:")
for token in doc:
    print(token.text, "→", token.lemma_)

Lemmatization:
Apple → Apple
is → be
looking → look
at → at
buying → buy
U.K. → U.K.
startup → startup
for → for
$ → $
1 → 1
billion → billion
. → .
Elon → Elon
Musk → Musk
founded → found
SpaceX → SpaceX
in → in
2002 → 2002
. → .
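
Most tokens in the listing above are already in their base form, so it can help to show only the ones the lemmatizer actually changed. The sketch below is one possible way to do that in plain Python. The `pairs` list is copied by hand from the output above; in the notebook it could presumably be built as `[(t.text, t.lemma_) for t in doc]`.

```python
# (text, lemma) pairs copied from the lemmatization output above.
pairs = [
    ("Apple", "Apple"), ("is", "be"), ("looking", "look"), ("at", "at"),
    ("buying", "buy"), ("U.K.", "U.K."), ("startup", "startup"),
    ("for", "for"), ("$", "$"), ("1", "1"), ("billion", "billion"),
    (".", "."), ("Elon", "Elon"), ("Musk", "Musk"), ("founded", "found"),
    ("SpaceX", "SpaceX"), ("in", "in"), ("2002", "2002"), (".", "."),
]

# Keep only the tokens whose lemma differs from the surface form.
changed = [(text, lemma) for text, lemma in pairs if text != lemma]

for text, lemma in changed:
    print(f"{text} -> {lemma}")
```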


POS Tagging

In [8]:
print("POS Tagging:")
for token in doc:
    print(token.text, "→", token.pos_)

POS Tagging:
Apple → PROPN
is → AUX
looking → VERB
at → ADP
buying → VERB
U.K. → PROPN
startup → VERB
for → ADP
$ → SYM
1 → NUM
billion → NUM
. → PUNCT
Elon → PROPN
Musk → PROPN
founded → VERB
SpaceX → PROPN
in → ADP
2002 → NUM
. → PUNCT


Named Entity Recognition (NER)

In [9]:
print("Named Entity Recognition:")
# doc.ents holds the entity spans predicted by the pipeline's NER component
for ent in doc.ents:
    print(ent.text, "→", ent.label_)

Named Entity Recognition:
Apple → ORG
U.K. → GPE
$1 billion → MONEY
Elon Musk → PERSON
2002 → DATE


Sentence Segmentation

In [10]:
print("Sentence Segmentation:")
for sent in doc.sents:
    print(sent.text)

Sentence Segmentation:
Apple is looking at buying U.K. startup for $1 billion.
Elon Musk founded SpaceX in 2002.


Text Cleaning (Removing Punctuation and Whitespace)

In [11]:
clean_tokens = [
    token.text
    for token in doc
    if not token.is_punct and not token.is_space
]

clean_text = " ".join(clean_tokens)

print("Cleaned Text:")
print(clean_text)

Cleaned Text:
Apple is looking at buying U.K. startup for $ 1 billion Elon Musk founded SpaceX in 2002
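
Joining token texts with a single space loses the original spacing, which is why `$1` becomes `$ 1` in the output above. One possible alternative, sketched below in plain Python, keeps each token's trailing whitespace instead. The `(text, trailing_whitespace)` pairs are written out by hand as an assumption about what `token.text` and `token.whitespace_` hold for this text, and `join_preserving_spacing` is a hypothetical helper, not a spaCy function.

```python
def join_preserving_spacing(pairs, drop):
    """Rebuild text from (text, trailing_whitespace) pairs, skipping dropped tokens."""
    parts = []
    for text, ws in pairs:
        if drop(text):
            # Carry over the dropped token's space so its neighbours do not fuse.
            if ws and parts and not parts[-1].endswith(" "):
                parts.append(ws)
            continue
        parts.append(text + ws)
    return "".join(parts).strip()


# Assumed (token.text, token.whitespace_) values for the example text.
pairs = [
    ("Apple", " "), ("is", " "), ("looking", " "), ("at", " "),
    ("buying", " "), ("U.K.", " "), ("startup", " "), ("for", " "),
    ("$", ""), ("1", " "), ("billion", ""), (".", " "),
    ("Elon", " "), ("Musk", " "), ("founded", " "), ("SpaceX", " "),
    ("in", " "), ("2002", ""), (".", ""),
]

# Drop only the sentence-final periods in this small example.
clean_text = join_preserving_spacing(pairs, drop=lambda t: t == ".")
print(clean_text)
```

The design choice here is to treat whitespace as belonging to the token before it, so removing a token never silently merges the words around it.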
