## Natural Language Processing With Python's NLTK Package

##### Install the packages with pip (ideally inside a virtual environment), then download the NLTK resources used below.

In [1]:
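# Uncomment and run once. Inside a notebook, prefix the pip commands with "%" or "!".
#pip install nltk
#pip install textblob
#pip install pyspellchecker

# NLTK data used by the steps below (run `import nltk` first).
#nltk.download('punkt_tab')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger_eng')
#nltk.download('wordnet')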

### Step 1: Print the text
#### Purpose: Set up the imports and print the text stored in example_text.

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
import re

example_text = "Once you are FULLY vaccinated, you can traval in the United States without getting tested or self-quarantining after traveling. https://www.cdc.gov"

print(example_text)

Once you are FULLY vaccinated, you can traval in the United States without getting tested or self-quarantining after traveling. https://www.cdc.gov


### Step 2: URL Removal
#### Purpose: Removes any substring starting with "http" followed by non-whitespace characters using regex.

In [3]:
# Removal of URLs:
def remove_url(text_data):
    return re.sub(r"http\S+", "", text_data)

processed_text = remove_url(example_text)

print(processed_text)

Once you are FULLY vaccinated, you can traval in the United States without getting tested or self-quarantining after traveling. 
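
The pattern `http\S+` leaves behind the space that preceded the URL, and it does not match links written as `www.` without a scheme. A minimal sketch of a slightly broader variant, assuming that behavior is wanted; the names `URL_PATTERN` and `remove_url_strict` and the sample sentence are illustrative and not part of the original notebook:

```python
import re

# Hypothetical variant: matches http://, https://, and bare "www." links.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_url_strict(text_data):
    # Drop the URLs, then collapse the whitespace they leave behind.
    without_urls = URL_PATTERN.sub("", text_data)
    return re.sub(r"\s+", " ", without_urls).strip()

sample = "See www.cdc.gov/travel  https://www.cdc.gov for details."
print(remove_url_strict(sample))  # See for details.
```

Requiring `://` after `http` also avoids deleting ordinary words that merely begin with "http".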


### Step 3: Convert to Lowercase
#### Purpose: Converts all characters to lowercase for standardization.

In [4]:
# lowercases
def lower_case(text_data):
    return text_data.lower()

lower = lower_case(processed_text)

print(lower)

once you are fully vaccinated, you can traval in the united states without getting tested or self-quarantining after traveling. 


### Step 4: Spelling Correction
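#### Purpose: Correct misspelled words (here, "traval" becomes "travel") with TextBlob's correct(), which replaces each unknown word with the most likely known word a small edit distance away.

In [5]:
# Spelling correction
# TextBlob checks each word and substitutes the most probable known spelling.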
def correct_spelling(text):
    return str(TextBlob(text).correct())

processed_text = correct_spelling(lower)
print(processed_text)

once you are fully vaccinated, you can travel in the united states without getting tested or self-quarantining after traveling. 
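
TextBlob's documentation describes `correct()` as based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below shows the core idea only: generate every string one edit away from the input and keep the most frequent known word. The toy frequency table `WORD_COUNTS` is an assumption for illustration (TextBlob ships its own, much larger data), and the sketch stops at one edit for brevity.

```python
from collections import Counter

# Toy word-frequency table (hypothetical values, for illustration only).
WORD_COUNTS = Counter({"travel": 50, "travail": 2, "traveling": 20, "you": 100})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Return all strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct_word(word):
    # Known words are returned unchanged.
    if word in WORD_COUNTS:
        return word
    # Otherwise pick the most frequent known word one edit away, if any.
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print(correct_word("traval"))  # travel
```

Both "travel" and "travail" are one edit away from "traval"; the higher count decides. A word with no known neighbor is returned as-is.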


### Step 5: Tokenization
#### Purpose: Use NLTK's word_tokenize to split the cleaned, lowercase string into a list of tokens. Punctuation marks become separate tokens.

In [6]:
# Tokenization
#import nltk
#nltk.download('punkt_tab')

def token(text_data):
    return word_tokenize(text_data)

word_tokens = token(processed_text)
print(word_tokens)

['once', 'you', 'are', 'fully', 'vaccinated', ',', 'you', 'can', 'travel', 'in', 'the', 'united', 'states', 'without', 'getting', 'tested', 'or', 'self-quarantining', 'after', 'traveling', '.']
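
To see why a tokenizer is used instead of `str.split()`, the sketch below compares whitespace splitting with a rough regex tokenizer. The regex is only an approximation written for this illustration; it is not the algorithm `word_tokenize` uses, and the name `simple_tokenize` is hypothetical.

```python
import re

text = "once vaccinated, you can travel without self-quarantining."

# Whitespace splitting keeps punctuation glued to the neighboring word.
print(text.split())
# ['once', 'vaccinated,', 'you', 'can', 'travel', 'without', 'self-quarantining.']

def simple_tokenize(text_data):
    # Words (allowing internal hyphens) or single punctuation characters.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text_data)

print(simple_tokenize(text))
# ['once', 'vaccinated', ',', 'you', 'can', 'travel', 'without', 'self-quarantining', '.']
```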


### Step 6: Stopword removal 
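#### Purpose: Remove stopwords, the common function words (such as "you", "are", and "the") that carry little meaning on their own, using NLTK's English stopword list.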

In [7]:
from nltk.corpus import stopwords
import string

# Stopword Removal
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [w for w in tokens if w.lower() not in stop_words]

filtered_tokens = remove_stopwords(word_tokens)
print(filtered_tokens)

['fully', 'vaccinated', ',', 'travel', 'united', 'states', 'without', 'getting', 'tested', 'self-quarantining', 'traveling', '.']
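
A stopword list is a choice, not a fixed rule, and it is often extended with domain-specific words. The sketch below shows one way to make the list a parameter. `BASE_STOP_WORDS` is a stand-in containing only the eight distinct words removed from the example sentence above, so the snippet runs without NLTK data; the function name and the `extra` parameter are hypothetical.

```python
# Stand-in stop list (assumption): the words removed from the example sentence.
# In the notebook, the full list comes from stopwords.words('english').
BASE_STOP_WORDS = frozenset(
    {"once", "you", "are", "can", "in", "the", "or", "after"}
)

def remove_stopwords_with(tokens, stop_words=BASE_STOP_WORDS, extra=()):
    # Merge the base list with any domain-specific additions.
    active = set(stop_words) | {w.lower() for w in extra}
    return [w for w in tokens if w.lower() not in active]

tokens = ["once", "you", "are", "fully", "vaccinated", ",", "you", "can", "travel"]
print(remove_stopwords_with(tokens))
# ['fully', 'vaccinated', ',', 'travel']
print(remove_stopwords_with(tokens, extra=["fully"]))
# ['vaccinated', ',', 'travel']
```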


### Step 7: Punctuation removal 
#### Purpose: Drop tokens that are punctuation marks, such as "," and ".". Hyphenated words like "self-quarantining" are kept.

In [8]:
# Punctuation removal
# A token counts as punctuation only if every character in it is punctuation.
def remove_punctuation(tokens):
    return [w for w in tokens if not all(ch in string.punctuation for ch in w)]

clean_tokens = remove_punctuation(filtered_tokens)

print(clean_tokens)

['fully', 'vaccinated', 'travel', 'united', 'states', 'without', 'getting', 'tested', 'self-quarantining', 'traveling']
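
One subtlety when filtering on `string.punctuation`: it is a string, so `token in string.punctuation` is a substring test. Single marks such as "," pass it, but multi-character punctuation tokens such as "..." do not. The sketch below compares the substring test with a character-by-character test; the helper name `is_punctuation` and the sample tokens are illustrative.

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token):
    # True only for non-empty tokens made up entirely of punctuation characters.
    return bool(token) and all(ch in PUNCT for ch in token)

tokens = ["travel", "...", "self-quarantining", ".", "''"]

# Substring test: "..." and "''" are not substrings of string.punctuation,
# so they are kept.
print([w for w in tokens if w not in string.punctuation])
# ['travel', '...', 'self-quarantining', "''"]

# Character-by-character test: every punctuation-only token is removed.
print([w for w in tokens if not is_punctuation(w)])
# ['travel', 'self-quarantining']
```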


### Step 8: Part-of-speech (POS) tagging
#### Purpose: Assign POS tags to each word (token) in a list of tokens using NLTK’s pos_tag() function.

In [9]:
# Part-of-speech tagging
def part_of_speech(tokens):
    return nltk.pos_tag(tokens)

tags = part_of_speech(clean_tokens)

print(tags)

[('fully', 'RB'), ('vaccinated', 'VBN'), ('travel', 'NN'), ('united', 'JJ'), ('states', 'NNS'), ('without', 'IN'), ('getting', 'VBG'), ('tested', 'VBN'), ('self-quarantining', 'JJ'), ('traveling', 'NN')]


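What does POS tagging do?

It identifies the grammatical role of each word. NLTK's default tagger uses Penn Treebank tags, such as:

    'NN' = Noun, singular (e.g., "travel")

    'NNS' = Noun, plural (e.g., "states")

    'JJ' = Adjective (e.g., "united")

    'VBN' = Verb, past participle (e.g., "vaccinated")

    'VBG' = Verb, gerund or present participle (e.g., "getting")

    'RB' = Adverb (e.g., "fully")

    'IN' = Preposition or subordinating conjunction (e.g., "without")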
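
The tagger output is easier to read when grouped by tag. A minimal sketch, with the `(word, tag)` pairs copied from the output above so that it runs without NLTK; the names `TAG_NAMES` and `group_by_tag` are illustrative.

```python
from collections import defaultdict

# Tagger output copied from the cell above.
tags = [('fully', 'RB'), ('vaccinated', 'VBN'), ('travel', 'NN'),
        ('united', 'JJ'), ('states', 'NNS'), ('without', 'IN'),
        ('getting', 'VBG'), ('tested', 'VBN'),
        ('self-quarantining', 'JJ'), ('traveling', 'NN')]

# Penn Treebank descriptions for the tags that occur in this example.
TAG_NAMES = {
    'NN': 'noun, singular',
    'NNS': 'noun, plural',
    'JJ': 'adjective',
    'VBN': 'verb, past participle',
    'VBG': 'verb, gerund/present participle',
    'RB': 'adverb',
    'IN': 'preposition/subordinating conjunction',
}

def group_by_tag(tagged_tokens):
    # Collect the words seen under each tag, preserving token order.
    grouped = defaultdict(list)
    for word, tag in tagged_tokens:
        grouped[tag].append(word)
    return dict(grouped)

for tag, words in group_by_tag(tags).items():
    print(f"{tag:4} {TAG_NAMES.get(tag, 'unknown'):40} {words}")
```

Note that the tagger labels "traveling" as a noun here, which matters for the lemmatization step that follows.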

### Step 9: Lemmatization Using POS Tags
#### Purpose: Reduce each token to its dictionary form (lemma), using the token's POS tag to choose the matching WordNet category (noun, verb, adjective, or adverb).

In [10]:
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

def pos_lemm(text_data):
    tag_map = defaultdict(lambda: wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV

    lemtzr = WordNetLemmatizer()

    for token, tag in pos_tag(text_data):
        lemma = lemtzr.lemmatize(token, tag_map[tag[0]])
        print("{0:20}{1:20}{2:20}".format(token, "lemma =>", lemma))

# pos_lemm prints its results and returns None, so call it directly.
pos_lemm(clean_tokens)

fully               lemma =>            fully               
vaccinated          lemma =>            vaccinate           
travel              lemma =>            travel              
united              lemma =>            united              
states              lemma =>            state               
without             lemma =>            without             
getting             lemma =>            get                 
tested              lemma =>            test                
self-quarantining   lemma =>            self-quarantining   
traveling           lemma =>            traveling           
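
The `tag_map` lookup uses only the first letter of each Penn Treebank tag, and anything that is not an adjective, verb, or adverb falls back to noun. The sketch below isolates that mapping. It uses the literal values of the WordNet constants (`wn.NOUN == 'n'`, `wn.VERB == 'v'`, `wn.ADJ == 'a'`, `wn.ADV == 'r'`) so it runs without NLTK; the helper name `wordnet_pos` is hypothetical.

```python
from collections import defaultdict

# Same mapping as in pos_lemm, written with the literal WordNet POS letters.
tag_map = defaultdict(lambda: "n")   # default: noun
tag_map["J"] = "a"                   # JJ, JJR, JJS -> adjective
tag_map["V"] = "v"                   # VB, VBD, VBG, VBN, ... -> verb
tag_map["R"] = "r"                   # RB, RBR, RBS -> adverb

def wordnet_pos(penn_tag):
    # Only the first letter of the Penn Treebank tag is consulted.
    return tag_map[penn_tag[0]]

examples = [("vaccinated", "VBN"), ("states", "NNS"),
            ("traveling", "NN"), ("without", "IN")]
for word, tag in examples:
    print(f"{word:12} {tag:4} -> {wordnet_pos(tag)}")
```

This explains the last row of the output: "traveling" was tagged `NN`, so it is lemmatized as a noun and stays unchanged. Had it been tagged as a verb, the lemmatizer would likely have returned "travel".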


## Typical Next Steps After Lemmatization
1. Text Vectorization – Convert text to numbers for machine learning. Choose based on model requirements:

| Method              | Description                               | Use Case                                           |
| ------------------- | ----------------------------------------- | -------------------------------------------------- |
| `CountVectorizer`   | Bag-of-words (raw term counts)            | Simple baselines with classical ML models          |
| `TfidfVectorizer`   | Term frequency-inverse document frequency | Down-weighting words that appear in most documents |
| `Word2Vec`, `GloVe` | Dense, static word embeddings             | Capturing semantic similarity between words        |
| `BERT`, `GPT`, etc. | Contextual embeddings                     | Transformer-based deep NLP                         |


2. Modeling or Analysis. Now you're ready for your main NLP task, such as:

    Classification (e.g., spam detection, sentiment analysis)

    Clustering (e.g., topic modeling with LDA)

    Named Entity Recognition

    Summarization

    Question Answering
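
The first two rows of the vectorization table can be illustrated without any library. The sketch below computes bag-of-words counts and a plain TF-IDF weight (`tf * log(N / df)`) for two made-up documents of lemmatized tokens. It is an illustration of the idea only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Two tiny "documents" of already-lemmatized tokens (made up for illustration).
docs = [
    ["fully", "vaccinate", "travel", "state"],
    ["travel", "test", "travel", "state"],
]

def bag_of_words(tokens):
    # Raw term counts for one document.
    return Counter(tokens)

def tf_idf(documents):
    n_docs = len(documents)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

print(bag_of_words(docs[1]))
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that occur in every document ("travel", "state") get a weight of zero, while terms unique to one document ("fully", "test") get a positive weight.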