
In [4]:
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "spaCy is a powerful library for natural language processing."

# Process the text using spaCy
doc = nlp(text)

# Step 1: Tokenization (split text into words/tokens)
print("All Tokens:")
for token in doc:
    print(token.text)

# Sentence Tokenization
print("Sentence Tokenization:")
for sent in doc.sents:
    print(sent)

# Step 2: Remove stopwords and punctuation
print("\nTokens after removing stopwords and punctuation:")
for token in doc:
    if not token.is_stop and token.is_alpha:  # removes stopwords and non-alphabetic tokens (like punctuation)
        print(token.text)

# Step 3: Lemmatization (convert to base form)
print("\nLemmatized tokens (after stopword removal):")
for token in doc:
    if not token.is_stop and token.is_alpha:
        print(token.lemma_)


All Tokens:
spaCy
is
a
powerful
library
for
natural
language
processing
.
Sentence Tokenization:
spaCy is a powerful library for natural language processing.

Tokens after removing stopwords and punctuation:
spaCy
powerful
library
natural
language
processing

Lemmatized tokens (after stopword removal):
spacy
powerful
library
natural
language
processing


In [None]:
import spacy
from nltk.stem import PorterStemmer

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Initialize NLTK's stemmer
stemmer = PorterStemmer()

# Sample text
text = "Hello! I'm using Google Colab to learn NLP. spaCy and NLTK are helpful libraries."

# Process text with spaCy
doc = nlp(text)

# Perform stemming on each word (excluding punctuation)
print("Stemmed Words:")
for token in doc:
    if token.is_alpha:  # ignore punctuation/numbers
        print(f"{token.text} → {stemmer.stem(token.text)}")


Stemmed Words:
Hello → hello
I → i
using → use
Google → googl
Colab → colab
to → to
learn → learn
NLP → nlp
spaCy → spaci
and → and
NLTK → nltk
are → are
helpful → help
libraries → librari
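
The stems printed above ("googl", "librari") are not dictionary words, which is expected: a stem only needs to be a stable key shared by related word forms. The sketch below illustrates this by grouping words under their Porter stem. The word list and the `groups` variable are assumptions made for illustration and are not part of the cells above.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word list (an assumption, not taken from the sample text)
words = ["library", "libraries", "learn", "learning", "learned"]

# Map each stem to the original word forms that reduce to it
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, forms in groups.items():
    print(f"{stem} ← {forms}")
```

Grouping by stem in this way is a common first step when counting term frequencies, since singular and plural forms end up in the same bucket.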


In [6]:


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

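# Download required NLTK data (a no-op when the package is already up to date)
nltk.download('punkt')      # sentence/word tokenizer models
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # WordNet data used by WordNetLemmatizer
nltk.download('omw-1.4')    # Open Multilingual WordNet
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases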


# Sample text
text = "NLTK is a powerful library for natural language processing tasks like tokenization, stemming, and lemmatization."

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)


# Sentence Tokenization (run NLTK on this cell's text; `doc` was a spaCy object left over from an earlier cell)
print("Sentence Tokenization:")
for sent in nltk.sent_tokenize(text):
    print(sent)

# Remove Stopwords (punctuation tokens are not stopwords, so they are kept)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("\nAfter Stopword Removal:", filtered_tokens)

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("\nAfter Stemming:", stemmed_tokens)

# Lemmatization (WordNetLemmatizer treats every word as a noun unless a POS tag is passed)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("\nAfter Lemmatization:", lemmatized_tokens)


Tokens: ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', 'tasks', 'like', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
Sentence Tokenization:
NLTK is a powerful library for natural language processing tasks like tokenization, stemming, and lemmatization.

After Stopword Removal: ['NLTK', 'powerful', 'library', 'natural', 'language', 'processing', 'tasks', 'like', 'tokenization', ',', 'stemming', ',', 'lemmatization', '.']

After Stemming: ['nltk', 'power', 'librari', 'natur', 'languag', 'process', 'task', 'like', 'token', ',', 'stem', ',', 'lemmat', '.']

After Lemmatization: ['NLTK', 'powerful', 'library', 'natural', 'language', 'processing', 'task', 'like', 'tokenization', ',', 'stemming', ',', 'lemmatization', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
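
The "After Stopword Removal" output above still contains ',' and '.', because punctuation is not in the stopword list. A minimal sketch of filtering both in one pass follows, mirroring the `is_alpha` check used in the spaCy cell. To keep it runnable without any downloads, it uses a tiny hand-written stopword set; that set and the `clean_tokens` helper are assumptions for illustration, and in the notebook `stopwords.words('english')` would take the place of the set.

```python
# Tokens copied from the "Tokens:" output of the NLTK cell
tokens = ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language',
          'processing', 'tasks', 'like', 'tokenization', ',', 'stemming', ',',
          'and', 'lemmatization', '.']

# Assumption: a small stand-in stopword set that covers only this sentence
stop_words = {'is', 'a', 'for', 'and'}


def clean_tokens(tokens, stop_words):
    """Keep alphabetic tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]


cleaned = clean_tokens(tokens, stop_words)
print(cleaned)
```

Note that `str.isalpha()` also drops numbers and tokens with apostrophes or hyphens, so a looser check may suit text where those matter.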
