Regular expressions (regex) played a central role in early natural language processing (NLP), before machine-learned models such as transformers became standard, for several reasons:

1. Pattern Matching:
Regex provided a powerful and efficient way to identify and extract patterns in text. This was especially useful for tasks such as:
Information Extraction: Identifying names, dates, email addresses, and other specific data points.
Text Cleaning: Removing unwanted characters, whitespace, and formatting artifacts from text data.
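
As a rough illustration of both tasks, the sketch below extracts email addresses and ISO-style dates and then cleans the text. The sample string and the two patterns are assumptions made for this example; they are deliberately simple and would miss many real-world formats.

```python
import re

# Hypothetical sample text with messy spacing and a stray HTML tag.
raw = "Contact:   jane.doe@example.org  on 2024-05-17 <br> or  call   later.\n"

# Deliberately simple patterns; production-grade email and date matching needs stricter rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Information extraction: pull out every email address and date.
emails = EMAIL.findall(raw)
dates = ISO_DATE.findall(raw)

# Text cleaning: drop HTML-like tags, then collapse runs of whitespace.
no_tags = re.sub(r"<[^>]+>", " ", raw)
cleaned = re.sub(r"\s+", " ", no_tags).strip()

print(emails)
print(dates)
print(cleaned)
```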

2. Simplicity and Speed:
Most regex operations run quickly and need little memory, which made them a good fit for early NLP systems with limited processing power. Speed is typical rather than guaranteed: patterns with nested quantifiers can backtrack heavily on some inputs.

3. Linguistic Rule Implementation:
Before sophisticated machine learning models, much of NLP relied on hand-crafted linguistic rules. Regex allowed developers to encode these rules in a concise and maintainable way, facilitating tasks like:
Tokenization: Splitting text into words, sentences, or other meaningful units.
Normalization: Converting text to a standard format, such as lowercasing or stripping common suffixes as a crude form of stemming.
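
A minimal sketch of rule-based tokenization and normalization using only the re module. The function names (regex_tokenize, split_sentences, normalize) and the suffix list are assumptions for illustration; the suffix stripping is a crude stand-in for a real stemmer, not an implementation of one.

```python
import re

def regex_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+(?:[-']\w+)* keeps hyphenated words and contractions together;
    # [^\w\s] picks up any remaining punctuation character on its own.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def normalize(token):
    """Lowercase, then strip a few English suffixes if at least 3 characters precede them."""
    token = token.lower()
    return re.sub(r"(?<=\w{3})(?:ing|ed|s)$", "", token)

sample = "The 150-mile-long river isn't dammed. Salmon keep returning!"
tokens = regex_tokenize(sample)
sentences = split_sentences(sample)

print(sentences)
print(tokens)
print([normalize(t) for t in tokens])
```

The sentence splitter breaks on abbreviations such as "Dr." and the suffix rule produces non-words such as "damm", which is the kind of edge case that pushed practitioners toward dedicated tokenizers and stemmers.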

4. Text Classification and Preprocessing:
Regex was essential for preprocessing text data for various NLP tasks. By using regex, developers could:
Filter Text: Remove noise or irrelevant parts of text data.
Feature Extraction: Generate features based on text patterns for use in simpler statistical models like Naive Bayes or Logistic Regression.
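
The sketch below shows how regex patterns might be turned into features for a simple classifier, for example a spam filter. The feature names, patterns, and sample message are hypothetical; the resulting dictionary could be passed to any model that accepts numeric features.

```python
import re

# Hypothetical feature patterns; the names and regexes are illustrative only.
FEATURE_PATTERNS = {
    "has_url": re.compile(r"https?://\S+"),
    "has_money": re.compile(r"[$€£]\s?\d+(?:[.,]\d+)?"),
    "has_shouting": re.compile(r"\b[A-Z]{3,}\b"),
    "has_repeated_punct": re.compile(r"[!?]{2,}"),
}

def extract_features(text):
    """Return binary pattern features plus a simple digit count."""
    features = {name: int(bool(p.search(text))) for name, p in FEATURE_PATTERNS.items()}
    features["num_digits"] = len(re.findall(r"\d", text))
    return features

def filter_noise(text):
    """Remove URLs and collapse whitespace before further processing."""
    without_urls = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", without_urls).strip()

msg = "WIN $1000 now!!! Visit http://example.com/prize today"
features = extract_features(msg)
filtered = filter_noise(msg)

print(features)
print(filtered)
```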

5. Lack of Data and Resources:
In the early days, large annotated datasets were scarce, and computing resources were limited. Regex allowed NLP practitioners to perform meaningful text manipulation and analysis without the need for extensive training data or powerful hardware.

6. Flexibility and Generality:
Regex is a general tool that can be applied to any language or text format. It provided a flexible approach for handling diverse text data, which was crucial before the development of specialized NLP models.

Explanation of Concepts
Matching Literal Strings:
Finds exact matches of the string "fox".

Using Metacharacters:
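\b\w{3}\b matches words of exactly 3 word characters. Because \w covers letters, digits, and the underscore, digit groups such as "234" also match.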

Character Classes:
[a-zA-Z]+ matches runs of one or more ASCII letters.

Predefined Character Classes:
\d{3}-\d{3}-\d{3} matches a pattern similar to a phone number.

Anchors:
^The matches "The" at the start of the string.

Groups and Capturing:
(\w+)@(\w+\.\w+) captures the username and domain of an email separately.

Non-capturing Groups:
(?:\+\d{1,3}-)?\d{3}-\d{3}-\d{3} matches phone numbers with an optional country code.

Lookahead and Lookbehind:
\b\w+(?=\sfox) matches a word that appears before the word "fox".
(?<=\bquick\s)\w+ matches a word that appears after the word "quick".

In [6]:
import re

# Example text
text = "The quick brown fox jumps over the lazy dog. Email: example@example.com. Phone: +1-234-567-890."

# 1. Matching Literal Strings
pattern_literal = re.compile(r"fox")
matches_literal = pattern_literal.findall(text)
print("1. Matching Literal Strings:\n", matches_literal)

# 2. Using Metacharacters (e.g., ., ^, $, *, +, ?, {}, [], |, \)
pattern_meta = re.compile(r"\b\w{3}\b")  # Words of exactly 3 word characters (letters, digits, or underscore)
matches_meta = pattern_meta.findall(text)
print("\n2. Using Metacharacters:\n", matches_meta)

# 3. Character Classes
pattern_class = re.compile(r"[a-zA-Z]+")  # Runs of one or more ASCII letters
matches_class = pattern_class.findall(text)
print("\n3. Character Classes:\n", matches_class)

# 4. Predefined Character Classes (e.g., \d, \D, \s, \S, \w, \W)
pattern_predefined = re.compile(r"\d{3}-\d{3}-\d{3}")  # Phone number pattern
matches_predefined = pattern_predefined.findall(text)
print("\n4. Predefined Character Classes:\n", matches_predefined)

# 5. Anchors (e.g., ^, $)
pattern_anchor = re.compile(r"^The")  # Matches 'The' at the start of the string
matches_anchor = pattern_anchor.findall(text)
print("\n5. Anchors:\n", matches_anchor)

# 6. Groups and Capturing
pattern_group = re.compile(r"(\w+)@(\w+\.\w+)")  # Captures username and domain separately
matches_group = pattern_group.findall(text)
print("\n6. Groups and Capturing:\n", matches_group)

# 7. Non-capturing Groups
pattern_non_capturing = re.compile(r"(?:\+\d{1,3}-)?\d{3}-\d{3}-\d{3}")  # Matches phone number with optional country code
matches_non_capturing = pattern_non_capturing.findall(text)
print("\n7. Non-capturing Groups:\n", matches_non_capturing)

# 8. Lookahead and Lookbehind
pattern_lookahead = re.compile(r"\b\w+(?=\sfox)")  # Matches word before 'fox'
matches_lookahead = pattern_lookahead.findall(text)
print("\n8. Lookahead:\n", matches_lookahead)

pattern_lookbehind = re.compile(r"(?<=\bquick\s)\w+")  # Matches word after 'quick'
matches_lookbehind = pattern_lookbehind.findall(text)
print("\n8. Lookbehind:\n", matches_lookbehind)


1. Matching Literal Strings:
 ['fox']

2. Using Metacharacters:
 ['The', 'fox', 'the', 'dog', 'com', '234', '567', '890']

3. Character Classes:
 ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'Email', 'example', 'example', 'com', 'Phone']

4. Predefined Character Classes:
 ['234-567-890']

5. Anchors:
 ['The']

6. Groups and Capturing:
 [('example', 'example.com')]

7. Non-capturing Groups:
 ['+1-234-567-890']

8. Lookahead:
 ['brown']

8. Lookbehind:
 ['brown']


Tokenization:
Sentence Tokenization: Splitting the text into sentences using sent_tokenize.
Word Tokenization: Splitting the text into words using word_tokenize.

Stemming:
Using PorterStemmer to reduce words to a stem by stripping suffixes. For example, "running" becomes "run". A stem is not guaranteed to be a dictionary word.

Lemmatization:
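Using WordNetLemmatizer to reduce words to their dictionary base form (lemma) by looking them up in WordNet. The lemmatizer does not infer context on its own; it relies on the part of speech passed in. lemmatize("running", pos="v") returns "run", whereas the default (pos="n", noun) leaves "running" unchanged. The code below uses the default, so verbs are not reduced.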

Stopwords Removal:
Removing common words (stopwords) that may not contribute significant meaning to the text. The list of stopwords is obtained from nltk.corpus.stopwords.

Keyword Dictionaries:
Using a predefined dictionary (keywords) to map certain words to specific categories or keywords. For example, "river" is mapped to "water_body".

In [7]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# Example text
text = ("The Kennebec River now runs mostly clean, thanks to laws that reduced pollution. "
        "Yet four hydroelectric dams, two built in the early 20th century and the other two in the 1980s, "
        "remain on the lower reaches of the 150-mile-long river and continue to prevent endangered salmon "
        "from reaching their single most important spawning tributary, the Sandy River.")

# 1. Tokenization
# Sentence Tokenization
sentences = sent_tokenize(text)
print("1. Sentence Tokenization:\n", sentences)

# Word Tokenization
words = word_tokenize(text)
print("\n1. Word Tokenization:\n", words)

# 2. Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("\n2. Stemming:\n", stemmed_words)

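# 3. Lemmatization
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun by default (pos='n');
# pass pos='v' to reduce verbs such as 'running' to 'run'
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("\n3. Lemmatization:\n", lemmatized_words)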

# 4. Stopwords Removal
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("\n4. Stopwords Removal:\n", filtered_words)

# 5. Keyword Dictionaries
keywords = {"river": "water_body", "pollution": "environment_issue", "salmon": "fish"}
keyword_dict = {word: keywords[word.lower()] for word in words if word.lower() in keywords}
print("\n5. Keyword Dictionaries:\n", keyword_dict)



1. Sentence Tokenization:
 ['The Kennebec River now runs mostly clean, thanks to laws that reduced pollution.', 'Yet four hydroelectric dams, two built in the early 20th century and the other two in the 1980s, remain on the lower reaches of the 150-mile-long river and continue to prevent endangered salmon from reaching their single most important spawning tributary, the Sandy River.']

1. Word Tokenization:
 ['The', 'Kennebec', 'River', 'now', 'runs', 'mostly', 'clean', ',', 'thanks', 'to', 'laws', 'that', 'reduced', 'pollution', '.', 'Yet', 'four', 'hydroelectric', 'dams', ',', 'two', 'built', 'in', 'the', 'early', '20th', 'century', 'and', 'the', 'other', 'two', 'in', 'the', '1980s', ',', 'remain', 'on', 'the', 'lower', 'reaches', 'of', 'the', '150-mile-long', 'river', 'and', 'continue', 'to', 'prevent', 'endangered', 'salmon', 'from', 'reaching', 'their', 'single', 'most', 'important', 'spawning', 'tributary', ',', 'the', 'Sandy', 'River', '.']

2. Stemming:
 ['the', 'kennebec', 'ri

One-Hot Encoding:
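CountVectorizer(binary=True) creates a binary presence/absence vector for each document: 1 if a vocabulary word occurs in it, 0 otherwise. Strictly speaking, this is a binary bag-of-words rather than a one-hot vector per word.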
The vocabulary is displayed to show the mapping of words to their indices.

Bag-of-Words:
CountVectorizer() creates a vector of word counts for each document.
The vocabulary is displayed to show the mapping of words to their indices.
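
To make the difference between counts, binary presence, and a true one-hot vector concrete, here is a small sketch in plain Python. The two-document corpus and the helper names are assumptions for this example; the token pattern imitates CountVectorizer's default (tokens of two or more word characters) but this is not scikit-learn's implementation.

```python
import re
from collections import Counter

# Hypothetical two-document corpus.
docs = [
    "the river runs clean",
    "the dams block the river",
]

def tokenize(doc):
    # Lowercase and keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: sorted unique tokens mapped to column indices.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in tokenize(d)}))}

def count_vector(doc):
    """Bag-of-words: how many times each vocabulary word occurs in the document."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]

def binary_vector(doc):
    """Binary bag-of-words: 1 if the word occurs at all, else 0."""
    return [min(c, 1) for c in count_vector(doc)]

def one_hot(word):
    """A true one-hot vector describes a single word, not a document."""
    return [1 if tok == word else 0 for tok in vocab]

print(vocab)
print(count_vector(docs[1]))
print(binary_vector(docs[1]))
print(one_hot("river"))
```

With only one document in the corpus, every vocabulary word is present by construction, so the binary vector is all ones, as in the notebook output.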

TF-IDF:
TfidfVectorizer() creates a TF-IDF weighted vector for each document.
The vocabulary is displayed to show the mapping of words to their indices.
compute_tf function calculates term frequency (TF) manually.
compute_idf function calculates inverse document frequency (IDF) manually.
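
IDF only becomes informative when the corpus has several documents. The sketch below uses a hypothetical three-document corpus and one common smoothed variant, idf = ln((1 + N) / (1 + df)) + 1, which is the form scikit-learn documents for its default smooth_idf=True setting. The function names are assumptions, and this is an illustration rather than a reimplementation of TfidfVectorizer.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus of three short documents.
docs = [
    ["river", "runs", "clean"],
    ["dams", "block", "river"],
    ["salmon", "need", "river", "access"],
]

def idf_smooth(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + d)) + 1 for tok, d in df.items()}

def tfidf(doc, idf):
    """Raw term counts times IDF, scaled to unit L2 length."""
    counts = Counter(doc)
    weights = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}

idf = idf_smooth(docs)
doc0 = tfidf(docs[0], idf)

print(idf["river"], idf["salmon"])
print(doc0)
```

"river" appears in every document and gets the minimum weight, while "salmon" appears in only one and is weighted higher.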

Word2Vec CBOW:
Word2Vec model with sg=0 (CBOW) creates word vectors.
An example word vector for 'river' is displayed.

Word2Vec Skip-gram:
Word2Vec model with sg=1 (Skip-gram) creates word vectors.
An example word vector for 'river' is displayed.
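
The two settings differ in how training examples are formed from a context window. The sketch below builds those examples for a short token list; the helper names are assumptions, and it is a simplification that ignores details of real training such as subsampling and negative sampling.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for each position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=2):
    # CBOW: the context words jointly predict the center word.
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: the center word predicts each context word separately.
    return [
        (center, ctx)
        for center, context in context_windows(tokens, window)
        for ctx in context
    ]

tokens = ["salmon", "swim", "up", "the", "river"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow)
print(skipgram)
```

CBOW yields one example per position, while skip-gram yields one per (center, context) pair, which is part of why skip-gram tends to be slower to train on the same text.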

Average Word Embeddings:
Function average_word_vectors computes the average vector for a sentence.
The average word embedding for the example text is displayed.
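
Averaging can be checked by hand on a toy example. The 3-dimensional vectors below are made up for illustration (real models use far more dimensions), and average_vector is a hypothetical plain-Python counterpart of the notebook's average_word_vectors.

```python
# Hypothetical 3-dimensional embeddings; the values are invented for this example.
toy_vectors = {
    "river": [0.9, 0.1, 0.0],
    "salmon": [0.7, 0.3, 0.2],
    "dam": [0.2, 0.8, 0.4],
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the vectors component by component.
    return [sum(component) / len(known) for component in zip(*known)]

# "the" has no vector, so only "salmon" and "river" contribute.
sentence_vec = average_vector(["the", "salmon", "river"], toy_vectors)
print(sentence_vec)
```

Averaging discards word order, so "dams block salmon" and "salmon block dams" would get the same sentence vector.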

In [5]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
from collections import Counter
import math
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Example text
text = ["The Kennebec River now runs mostly clean, thanks to laws that reduced pollution. "
        "Yet four hydroelectric dams, two built in the early 20th century and the other two in the 1980s, "
        "remain on the lower reaches of the 150-mile-long river and continue to prevent endangered salmon "
        "from reaching their single most important spawning tributary, the Sandy River."]

# Tokenizing the text
tokens = [word_tokenize(sentence.lower()) for sentence in text]

# 1. One-Hot Encoding
vectorizer = CountVectorizer(binary=True)
one_hot = vectorizer.fit_transform(text).toarray()
print("One-Hot Encoding:\n", one_hot)

# Display One-Hot Encoding interim steps
one_hot_vocab = vectorizer.vocabulary_
print("\nOne-Hot Encoding Vocabulary:\n", one_hot_vocab)

# 2. Bag-of-Words
bow_vectorizer = CountVectorizer()
bow = bow_vectorizer.fit_transform(text).toarray()
print("\nBag-of-Words:\n", bow)

# Display Bag-of-Words interim steps
bow_vocab = bow_vectorizer.vocabulary_
print("\nBag-of-Words Vocabulary:\n", bow_vocab)

# 3. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(text).toarray()
print("\nTF-IDF:\n", tfidf)

# Display TF-IDF interim steps
tfidf_vocab = tfidf_vectorizer.vocabulary_
print("\nTF-IDF Vocabulary:\n", tfidf_vocab)

# Calculate TF and IDF manually for illustration
def compute_tf(text):
    tf_text = []
    for sentence in text:
        tokens = word_tokenize(sentence.lower())
        counter = Counter(tokens)
        total_words = len(tokens)
        tf_text.append({word: count / total_words for word, count in counter.items()})
    return tf_text

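def compute_idf(text):
    tokenized_text = [word_tokenize(sentence.lower()) for sentence in text]
    idf_dict = {}
    N = len(tokenized_text)  # number of documents
    all_tokens_set = set(item for sublist in tokenized_text for item in sublist)

    for tkn in all_tokens_set:
        # Document frequency: number of documents that contain the token
        contains_token = sum(1 for sublist in tokenized_text if tkn in sublist)
        # Smoothed IDF, log((1 + N) / (1 + df)) + 1, stays positive even
        # when a token appears in every document
        idf_dict[tkn] = math.log((1 + N) / (1 + contains_token)) + 1
    return idf_dict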

tf = compute_tf(text)
idf = compute_idf(text)
print("\nTF:\n", tf)
print("\nIDF:\n", idf)

# Tokenizing the text for Word2Vec
tokens = [word_tokenize(sentence.lower()) for sentence in text]

# 4. Word2Vec CBOW (Continuous Bag of Words)
cbow_model = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, sg=0)
print("\nWord2Vec CBOW example for 'river':\n", cbow_model.wv['river'])

# 5. Word2Vec Skip-gram
skipgram_model = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, sg=1)
print("\nWord2Vec Skip-gram example for 'river':\n", skipgram_model.wv['river'])

# 6. Average Word Embeddings (using Word2Vec CBOW)
def average_word_vectors(tokens, model):
    vector_size = model.vector_size
    avg_vector = np.zeros((vector_size,))
    num_words = 0
    for word in tokens:
        if word in model.wv:
            avg_vector = np.add(avg_vector, model.wv[word])
            num_words += 1
    if num_words > 0:
        avg_vector = np.divide(avg_vector, num_words)
    return avg_vector

avg_vector = average_word_vectors(tokens[0], cbow_model)
print("\nAverage Word Embeddings for the example text (using CBOW):\n", avg_vector)


One-Hot Encoding:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1]]

One-Hot Encoding Vocabulary:
 {'the': 41, 'kennebec': 16, 'river': 33, 'now': 23, 'runs': 34, 'mostly': 22, 'clean': 6, 'thanks': 39, 'to': 43, 'laws': 17, 'that': 40, 'reduced': 31, 'pollution': 27, 'yet': 46, 'four': 11, 'hydroelectric': 13, 'dams': 8, 'two': 45, 'built': 4, 'in': 15, 'early': 9, '20th': 2, 'century': 5, 'and': 3, 'other': 26, '1980s': 1, 'remain': 32, 'on': 25, 'lower': 19, 'reaches': 29, 'of': 24, '150': 0, 'mile': 20, 'long': 18, 'continue': 7, 'prevent': 28, 'endangered': 10, 'salmon': 35, 'from': 12, 'reaching': 30, 'their': 42, 'single': 37, 'most': 21, 'important': 14, 'spawning': 38, 'tributary': 44, 'sandy': 36}

Bag-of-Words:
 [[1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1
  1 1 1 1 1 7 1 2 1 2 1]]

Bag-of-Words Vocabulary:
 {'the': 41, 'kennebec': 16, 'river': 33, 'now': 23, 'runs': 34, 'mostly': 22, 'clean': 6,



In [11]:
# These examples illustrate different methods for sentiment analysis:
# Dictionary-Based Approach: Uses predefined sentiment lexicons to calculate sentiment scores.
# Classifier-Based Approach: Uses machine learning models (e.g., BERT) to classify the sentiment.
# Aspect Sentiment Modeling: Analyzes sentiments related to specific aspects or topics within the text.

#!pip install vaderSentiment
#!pip install transformers
#!pip install textblob

text = ("The Kennebec River now runs mostly clean, thanks to laws that reduced pollution. "
        "Yet four hydroelectric dams, two built in the early 20th century and the other two in the 1980s, "
        "remain on the lower reaches of the 150-mile-long river and continue to prevent endangered salmon "
        "from reaching their single most important spawning tributary, the Sandy River.")


In [12]:
# Dictionary-based sentiment analysis using VADER
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = ("The Kennebec River now runs mostly clean, thanks to laws that reduced pollution. "
        "Yet four hydroelectric dams, two built in the early 20th century and the other two in the 1980s, "
        "remain on the lower reaches of the 150-mile-long river and continue to prevent endangered salmon "
        "from reaching their single most important spawning tributary, the Sandy River.")

analyzer = SentimentIntensityAnalyzer()
sentiment = analyzer.polarity_scores(text)

print(f"Dictionary-Based Sentiment Analysis: {sentiment}")


Dictionary-Based Sentiment Analysis: {'neg': 0.034, 'neu': 0.781, 'pos': 0.184, 'compound': 0.7645}
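
The compound score is usually turned into a label with a threshold. The sketch below uses the cutoffs of +0.05 and -0.05 suggested in the VADER documentation; the function name label_from_compound is an assumption, and the scores are copied from the output above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the notebook output above.
scores = {"neg": 0.034, "neu": 0.781, "pos": 0.184, "compound": 0.7645}
label = label_from_compound(scores["compound"])
print(label)
```

The passage is labeled positive even though its second sentence describes dams blocking endangered salmon, which shows how a word-level lexicon can miss the overall meaning of a text.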


In [13]:
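# Classifier-based sentiment analysis using a pre-trained model (DistilBERT fine-tuned on SST-2)
# Requires a deep learning backend: install PyTorch (or TensorFlow 2.x) alongside transformers
from transformers import pipeline

# Name the model explicitly instead of relying on the pipeline default
classifier = pipeline(
    'sentiment-analysis',
    model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',
)
sentiment = classifier(text)

print(f"Classifier-Based Sentiment Analysis: {sentiment}")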


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


RuntimeError: At least one of TensorFlow 2.0 or PyTorch should be installed. To install TensorFlow 2.0, read the instructions at https://www.tensorflow.org/install/ To install PyTorch, read the instructions at https://pytorch.org/.

In [14]:
# Aspect-based sentiment analysis using a lexicon-based scorer (TextBlob)

from textblob import TextBlob

# Manually assigning each sentence of the text to an aspect
aspects = {
    "environment": "The Kennebec River now runs mostly clean, thanks to laws that reduced pollution.",
    "dams": "Yet four hydroelectric dams, two built in the early 20th century and the other two in the 1980s, "
            "remain on the lower reaches of the 150-mile-long river and continue to prevent endangered salmon "
            "from reaching their single most important spawning tributary, the Sandy River."
}

aspect_sentiments = {}
for aspect, aspect_text in aspects.items():
    blob = TextBlob(aspect_text)
    sentiment = blob.sentiment
    aspect_sentiments[aspect] = sentiment

print("Aspect-Based Sentiment Analysis:")
for aspect, sentiment in aspect_sentiments.items():
    print(f"{aspect.capitalize()}: {sentiment}")


Aspect-Based Sentiment Analysis:
Environment: Sentiment(polarity=0.2833333333333333, subjectivity=0.45000000000000007)
Dams: Sentiment(polarity=0.13392857142857142, subjectivity=0.3982142857142857)


TextBlob computes subjectivity based on the presence of subjective terms and phrases in the text. The subjectivity score is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. TextBlob uses a predefined lexicon of subjective terms to determine the subjectivity of a piece of text.

Here’s a brief explanation of how TextBlob computes subjectivity:

Lexicon-Based Analysis: TextBlob's default analyzer, which is based on the pattern library, uses a lexicon of words that have been pre-labeled with polarity and subjectivity scores.

Averaging: The overall subjectivity score is an average over the lexicon words found in the text. It therefore reflects how subjective the matched words are, not simply how many of them appear; words missing from the lexicon do not contribute.

Modifiers: TextBlob looks beyond individual words in a limited way. Intensifiers such as "very" scale the scores of the word that follows, and negation affects polarity.
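
The idea of lexicon-based scoring can be sketched in a few lines. Everything below is hypothetical: the mini-lexicon, its scores, and the intensifier multipliers are invented for illustration, and the function is a toy model of the general approach rather than TextBlob's actual code.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity). Values are made up.
LEXICON = {
    "clean": (0.4, 0.7),
    "important": (0.4, 1.0),
    "early": (0.1, 0.3),
}
# Hypothetical intensifiers: multiplier applied to the next lexicon word.
INTENSIFIERS = {"very": 1.3}

def lexicon_sentiment(tokens):
    """Return (polarity, subjectivity) averaged over the lexicon words found."""
    scores = []
    boost = 1.0
    for tok in tokens:
        tok = tok.lower()
        if tok in INTENSIFIERS:
            boost = INTENSIFIERS[tok]
            continue
        if tok in LEXICON:
            p, s = LEXICON[tok]
            # Clamp so boosted scores stay in their valid ranges.
            scores.append((max(-1.0, min(1.0, p * boost)), min(1.0, s * boost)))
        boost = 1.0  # an intensifier only affects the word right after it
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(s for _, s in scores) / n

print(lexicon_sentiment(["the", "river", "runs", "very", "clean"]))
print(lexicon_sentiment(["clean", "and", "early"]))
```

Because the result is an average, adding a mildly subjective word such as "early" lowers the subjectivity score instead of raising it.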