<a href="https://colab.research.google.com/github/supunabeywickrama/my-colab-work/blob/main/Remove_Stopwords_%26_Punctuation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Using Stemming**

In [1]:
import nltk
import spacy

In [2]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [3]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet
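
The output above suggests that stemming is rule-based suffix stripping: `ate` is left alone because no suffix rule applies, and `ability` is cut down to the non-word `abil`. The sketch below is a toy illustration of that idea only. It is not the Porter algorithm, and both the `SUFFIXES` list and the `toy_stem` helper are hypothetical names chosen for this example.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of
# rule-based stemming. The suffix list is a hypothetical choice.
SUFFIXES = ("ing", "able", "ity", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["eating", "eats", "eat", "ate", "adjustable", "ability"]:
    print(word, "|", toy_stem(word))
```

Because the rules only look at the spelling, an irregular form such as `ate` can never be mapped to `eat`. Handling that case needs vocabulary knowledge, which is what lemmatization adds.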


## **Using Lemmatization**

In [5]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("Mando talked for 3 hours although talking isn't his thing he become talkative")

for token in doc:
    print(token, "|", token.lemma_)

Mando | Mando
talked | talk
for | for
3 | 3
hours | hour
although | although
talking | talk
is | be
n't | not
his | his
thing | thing
he | he
become | become
talkative | talkative


In [6]:
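# Reuse the `nlp` pipeline loaded in the previous cell.
doc = nlp("Mando talked for 3 hours although talking isn't his thing he become talkative")

for token in doc:
    # lemma_ is the lemma as a string; lemma is its integer hash ID in the vocabulary
    print(token, "|", token.lemma_, "|", token.lemma)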

Mando | Mando | 7837215228004622142
talked | talk | 13939146775466599234
for | for | 16037325823156266367
3 | 3 | 602994839685422785
hours | hour | 9748623380567160636
although | although | 343236316598008647
talking | talk | 13939146775466599234
is | be | 10382539506755952630
n't | not | 447765159362469301
his | his | 2661093235354845946
thing | thing | 2473243759842082748
he | he | 1655312771067108281
become | become | 12558846041070486771
talkative | talkative | 13364764166055324990


In [8]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [9]:
doc = nlp("Bro, you wanna go? Brah, don't say no ! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Bro
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brah
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [10]:
doc[0]

Bro

## **Customizing Lemmas with the `attribute_ruler`**

In [11]:
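# Fetch the existing attribute_ruler component from the pipeline
ar = nlp.get_pipe('attribute_ruler')

# Map the exact token texts "Bro" and "Brah" to the lemma "Brother"
ar.add([[{"TEXT":"Bro"}],[{"TEXT":"Brah"}]],{"LEMMA":"Brother"})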

doc = nlp("Bro, you wanna go? Brah, don't say no ! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [12]:
doc[0]

Bro

In [14]:
doc[0].lemma_

'Brother'

# ***Project: Text Cleaner (Stopwords & Punctuation Removal)***

**Goal**

To build a reusable Python function that:

* Normalizes accented characters (“Café” → “Cafe”)

* Lowercases the text

* Removes punctuation

* Collapses extra whitespace

* Removes stopwords (words like “the”, “is”, “and”)
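
The accent-normalization step can also be approximated with the standard library alone. The sketch below is an assumption-laden alternative to `unidecode`, not what this notebook uses: it decomposes each character with `unicodedata` and drops the combining marks. The helper name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Café déjà vu"))  # prints: Cafe deja vu
```

Unlike `unidecode`, this approach only strips diacritics. It does not transliterate other scripts or drop symbols such as emojis, so the two are not interchangeable.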

In [16]:
!pip -q install nltk unidecode pandas

In [28]:
import re, string
from unidecode import unidecode
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
def build_stopwords(language='english', extra=None, keep=None):
    """Return a stopword set: the NLTK list for `language`, plus `extra`, minus `keep`."""
    base = set(stopwords.words(language)) if language else set()
    if extra:
        base |= set(map(str.lower, extra))
    if keep:
        base -= set(map(str.lower, keep))
    return base

def clean_text(text,
               language='english',
               remove_punct=True,
               to_lower=True,
               normalize_accents=True,
               extra_stopwords=None,
               keep_stopwords=None):
    """Normalize accents, lowercase, strip punctuation, collapse whitespace, and remove stopwords."""
    if text is None:
        return ''
    if not isinstance(text, str):
        text = str(text)

    s = text
    if normalize_accents:
        # Transliterate to ASCII, e.g. "Café" -> "Cafe"
        s = unidecode(s)
    if to_lower:
        s = s.lower()

    if remove_punct:
        # Delete every ASCII punctuation character
        s = s.translate(str.maketrans('', '', string.punctuation))

    # Collapse runs of whitespace into single spaces
    s = re.sub(r"\s+", " ", s).strip()

    # With language=None only the extra stopwords (if any) are applied
    sw = build_stopwords(language, extra=extra_stopwords, keep=keep_stopwords)
    # Compare in lowercase so stopwords are still removed when to_lower=False
    tokens = [t for t in s.split() if t.lower() not in sw]
    return ' '.join(tokens)

print(clean_text("This is, perhaps, the simplest possible example!"))


perhaps simplest possible example
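
One consequence of stripping punctuation before filtering stopwords is that contractions lose their apostrophe and may no longer match the stopword list. The sketch below illustrates this with the standard library only. The `STOPWORDS` set is a hypothetical mini list for this example (NLTK's English list is much larger and, as far as I know, also contains contracted forms such as "don't"), and `strip_punct` is a hypothetical helper.

```python
import string

# Hypothetical mini stopword list for illustration only.
STOPWORDS = {"i", "don't", "do", "not", "it"}

def strip_punct(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

text = "i don't like it"

# Punctuation first, then stopword filtering: "don't" becomes "dont" and survives.
punct_first = [t for t in strip_punct(text).split() if t not in STOPWORDS]

# Stopword filtering first, then punctuation: "don't" is matched and removed.
stop_first = [strip_punct(t) for t in text.split() if t not in STOPWORDS]

print(punct_first)  # ['dont', 'like']
print(stop_first)   # ['like']
```

If leftover tokens like `dont` matter for your data, one option is to filter stopwords before removing punctuation, or to pass the apostrophe-free forms through `extra_stopwords`.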


In [24]:
sample_texts = [
    "Hello!!! This is a SAMPLE sentence, with punctuation.",
    "NLTK helps remove stopwords; it's handy.",
    "Café con leche — déjà vu… and emojis 😊 are removed by accent normalization.",
]

cleaned = [clean_text(t) for t in sample_texts]
for before, after in zip(sample_texts, cleaned):
    print('\nOriginal:', before)
    print('Cleaned :', after)


Original: Hello!!! This is a SAMPLE sentence, with punctuation.
Cleaned : hello sample sentence punctuation

Original: NLTK helps remove stopwords; it's handy.
Cleaned : nltk helps remove stopwords handy

Original: Café con leche — déjà vu… and emojis 😊 are removed by accent normalization.
Cleaned : cafe con leche deja vu emojis removed accent normalization


In [25]:
import pandas as pd

try:
    df = pd.read_csv('sample_texts.csv')
    if 'text' in df.columns:
        df['clean_text'] = df['text'].apply(clean_text)
        display(df.head())
        df.to_csv('cleaned_texts.csv', index=False)
        print('💾 Saved cleaned file -> cleaned_texts.csv')
    else:
        print("CSV loaded but missing a 'text' column.")
except FileNotFoundError:
    print('No sample_texts.csv found. Upload your CSV first.')


|   | id | text | clean_text |
|---|----|------|------------|
| 0 | 1 | Hello!!! This is a SAMPLE sentence, with punct... | hello sample sentence punctuation |
| 1 | 2 | NLTK helps remove stopwords; it's handy. | nltk helps remove stopwords handy |
| 2 | 3 | Café con leche — déjà vu… and emojis 😊 are rem... | cafe con leche deja vu emojis removed accent n... |
| 3 | 4 | This is the last example; it should be cleaned... | last example cleaned nicely |


💾 Saved cleaned file -> cleaned_texts.csv
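
The cell above expects a file named `sample_texts.csv` with a `text` column. If you do not have one to upload, a minimal sketch for creating a small test file with the standard library might look like this. The helper `write_sample_csv` is hypothetical, the `id` column mirrors the output shown above, and the two rows are taken from the `sample_texts` list used earlier.

```python
import csv

def write_sample_csv(path="sample_texts.csv"):
    """Write a tiny CSV with `id` and `text` columns and return its path."""
    rows = [
        (1, "Hello!!! This is a SAMPLE sentence, with punctuation."),
        (2, "NLTK helps remove stopwords; it's handy."),
    ]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "text"])  # header row expected by the cell above
        writer.writerows(rows)
    return path

write_sample_csv()
```

Run this before the `pd.read_csv('sample_texts.csv')` cell so that the `FileNotFoundError` branch is not triggered.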






In [26]:
user_text = "Type your text here to clean. Remove stopwords and punctuation!"
print('Original:', user_text)
print('Cleaned :', clean_text(user_text))

Original: Type your text here to clean. Remove stopwords and punctuation!
Cleaned : type text clean remove stopwords punctuation


In [27]:
assert clean_text('The the the.') == ''
assert clean_text('Hello, world!') == 'hello world'
assert clean_text('Café!') == 'cafe'
print('✅ All tests passed!')

✅ All tests passed!
