### Basic Text Processing Techniques
1. Tokenization 
- Involves splitting text into smaller units like words or sentences.

In [6]:
from nltk.tokenize import TreebankWordTokenizer
import nltk
nltk.download('punkt')

# example text
text = 'NLP is fun and useful!'

# Tokenize the text
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
tokens

[nltk_data] Downloading package punkt to /home/wanyua/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['NLP', 'is', 'fun', 'and', 'useful', '!']
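
To make the idea of splitting text into tokens concrete, here is a minimal sketch that uses only the standard-library `re` module. It is not how `TreebankWordTokenizer` works internally (that tokenizer applies many more rules, for example for contractions). The function name `simple_tokenize` and the pattern are assumptions chosen for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample_tokens = simple_tokenize("NLP is fun and useful!")
print(sample_tokens)  # ['NLP', 'is', 'fun', 'and', 'useful', '!']
```

For this short sentence the sketch gives the same tokens as the cell above, but the two approaches can differ on harder input such as contractions or abbreviations.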

2. Stopword Removal
- Stopwords are common words such as 'I' and 'and' that carry little meaning on their own. Removing them helps focus on the meaningful words.

In [7]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Define a set of stopwords
stop_words = set(stopwords.words('english'))

# Filter out stopwords from the tokenized text
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
filtered_tokens

[nltk_data] Downloading package stopwords to /home/wanyua/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['NLP', 'fun', 'useful', '!']
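
The filter above lowercases each token before checking it against the stopword set. The sketch below shows why that matters, using a small hand-picked set (`demo_stop_words`, hypothetical and much shorter than NLTK's English list) and made-up tokens.

```python
# Hypothetical, hand-picked stopword set for illustration only
demo_stop_words = {"and", "is", "the", "i"}

demo_tokens = ["And", "the", "model", "is", "fast"]

# Comparing the raw token misses capitalized stopwords such as "And"
case_sensitive = [t for t in demo_tokens if t not in demo_stop_words]

# Lowercasing only for the comparison removes them and keeps the original casing
case_insensitive = [t for t in demo_tokens if t.lower() not in demo_stop_words]

print(case_sensitive)    # ['And', 'model', 'fast']
print(case_insensitive)  # ['model', 'fast']
```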

3. Stemming
- Stemming reduces words to a root form by chopping off suffixes, which helps group related word forms together. The resulting stem is not always a dictionary word.

Example: 'running' -> 'run'


In [8]:
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Stem the filtered tokens
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
stemmed_tokens

['nlp', 'fun', 'use', '!']
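
To illustrate what "chopping off suffixes" means, here is a toy suffix stripper. It is a deliberately naive sketch and not the Porter algorithm. The function name `toy_stem`, the suffix list and the minimum stem length are all assumptions made for this example.

```python
def toy_stem(word, suffixes=("ing", "ed", "ly", "s"), min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when enough of the word is left to be a plausible stem
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

toy_stems = {w: toy_stem(w) for w in ["running", "jumped", "cats", "quickly", "is"]}
print(toy_stems)
# {'running': 'runn', 'jumped': 'jump', 'cats': 'cat', 'quickly': 'quick', 'is': 'is'}
```

The toy rule leaves 'runn' for 'running', whereas `PorterStemmer` has additional rules (including one for doubled consonants) and returns 'run'. This is why a tested stemmer is usually preferable to hand-written suffix lists.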

4. Lemmatization
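- Lemmatization reduces words to their base or dictionary form (lemma) by considering the word's meaning and part of speech. Note that `WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument is given.

Example: 'better' -> 'good' (when treated as an adjective)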


In [9]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the filtered tokens
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print('Lemmatized tokens:', lemmatized_tokens)

[nltk_data] Downloading package wordnet to /home/wanyua/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatized tokens: ['NLP', 'fun', 'useful', '!']
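
The tokens above come back unchanged because none of them has a different noun lemma. To show how the part of speech changes the result, here is a toy dictionary-based lemmatizer. The lookup table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical; a real lemmatizer such as `WordNetLemmatizer` consults WordNet rather than a hand-written table.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech)
TOY_LEMMAS = {
    ("better", "a"): "good",   # adjective
    ("running", "v"): "run",   # verb
    ("mice", "n"): "mouse",    # noun
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better"))           # better (no noun entry, so unchanged)
print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("mice"))             # mouse
```

The same idea applies to the NLTK lemmatizer: passing `pos='a'` or `pos='v'` to `lemmatize` tells it to look the word up as an adjective or verb instead of a noun.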


5. Case conversion
- Converting text to lowercase ensures consistency during processing (e.g. 'Cat' and 'cat' are treated the same).

In [10]:
# Converting text to lowercase
lowercase_tokens = [token.lower() for token in lemmatized_tokens]
print(f"Lowercase tokens: {lowercase_tokens}")

Lowercase tokens: ['nlp', 'fun', 'useful', '!']
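
A short sketch of why case conversion matters in practice: when counting word frequencies, differently cased spellings are counted separately unless they are lowercased first. The word list here is made up for illustration.

```python
from collections import Counter

words = ["Cat", "cat", "CAT", "dog"]

# Without case conversion, each spelling is a separate key
counts_raw = Counter(words)

# Lowercasing first merges the three spellings of "cat"
counts_lower = Counter(w.lower() for w in words)

print(len(counts_raw))      # 4
print(counts_lower["cat"])  # 3
```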


In [4]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/wanyua/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/wanyua/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/wanyua/nltk_data...


True