# Overview of Text Data Analysis and Introduction to NLP

## 1. Introduction

Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics that focuses on enabling computers to understand, interpret, and generate human language. Text data analysis is an essential part of NLP, as it allows us to extract valuable information and insights from large volumes of unstructured text data. In this notebook, we will cover some fundamental techniques used in text data analysis, including regular expressions, tokenization, stopword removal, stemming, lemmatization, and text feature extraction. The examples use NLTK and scikit-learn; spaCy and TextBlob are other popular Python libraries for NLP.

Before diving into the techniques, it's essential to understand the purpose of each step in the text data analysis process:

- **Regular Expressions:** Regular expressions are a powerful tool for pattern matching and searching within text data. They allow you to create complex patterns using a specific syntax and then match these patterns against text data. Regular expressions are particularly helpful in text preprocessing tasks, such as data cleaning and extraction.
- **Tokenization:** Tokenization is the process of breaking down the text into individual words or tokens. This step is crucial because it allows us to analyze the text at the word level and build a structured representation of the data.
- **Stopword Removal:** Stopwords are common words that do not carry much meaningful information (e.g., "a", "an", "the"). Removing stopwords helps reduce the dimensionality of the data and focus on more relevant words.
- **Stemming:** Stemming is the process of reducing words to their root or base form (e.g., "running" -> "run"). This step helps in consolidating similar words and reducing the overall complexity of the text.
- **Lemmatization:** Similar to stemming, lemmatization is the process of converting words to their base form, but it considers the context and part of speech to derive the root word (e.g., "better" -> "good"). It is more accurate than stemming but can be computationally more expensive.
- **Text Feature Extraction:** Feature extraction involves converting the text into a numerical representation that can be used as input for machine learning algorithms. Common techniques include Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF).

These steps play a vital role in preparing the text data for further analysis, making it easier for algorithms to extract meaningful insights and perform advanced NLP tasks.

## 2. Regular Expressions

Regular expressions (regex) are a powerful tool for pattern matching and searching within text data. The regex syntax is a combination of special characters and literal text that defines a search pattern. Here is an overview of some common regex syntax elements:

1. Literal characters: Regular characters match themselves exactly, e.g., a, b, c.
2. Special characters: Special characters have a specific meaning in the regex syntax, such as:
    - `.`: Matches any single character except a newline.
    - `^`: Matches the start of a string.
    - `$`: Matches the end of a string.
    - `*`: Matches zero or more repetitions of the preceding character or group.
    - `+`: Matches one or more repetitions of the preceding character or group.
    - `?`: Matches zero or one repetition of the preceding character or group.
    - `{m,n}`: Matches at least m and at most n repetitions of the preceding character or group.
    - `[ ]`: Defines a character set. Matches any single character inside the brackets. E.g., [abc] matches either a, b, or c.
    - `( )`: Groups characters or regex patterns. It allows you to apply quantifiers or other regex elements to the entire group.
    - `|`: Acts as an OR operator. Matches either the pattern before or after the |. E.g., a|b matches either a or b.
    - `\`: Escapes special characters, making them literal characters. E.g., \. matches a period, not any character.
3. Character classes: Predefined character sets that match specific types of characters:
    - `\d`: Matches any decimal digit (0-9).
    - `\D`: Matches any non-digit character.
    - `\s`: Matches any whitespace character (space, tab, newline).
    - `\S`: Matches any non-whitespace character.
    - `\w`: Matches any word character (letters, digits, or underscores).
    - `\W`: Matches any non-word character.
4. Lookahead and lookbehind assertions: These allow you to match a pattern based on what comes before or after it without including the matched text in the result:
    - `(?=...)`: Positive lookahead. Matches if the pattern inside the lookahead is found after the current position.
    - `(?!...)`: Negative lookahead. Matches if the pattern inside the lookahead is not found after the current position.
    - `(?<=...)`: Positive lookbehind. Matches if the pattern inside the lookbehind is found before the current position.
    - `(?<!...)`: Negative lookbehind. Matches if the pattern inside the lookbehind is not found before the current position.

These are just some of the basic regex syntax elements. Regular expressions can become very complex as you combine these elements to create more advanced patterns. To practice and learn more about regex syntax, you can use online tools like [regex101.com](https://regex101.com) or [regexr.com](https://regexr.com).
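As a hedged illustration of how several of these elements combine, the sketch below runs a few patterns with Python's built-in `re` module. The sample string and the variable names are invented for this example and are not part of the notebook's data.

```python
import re

# Hypothetical sample string, invented for this illustration
text = "Order 66 shipped on 2024-05-17; invoice total: $1,250.00 (paid)."

# \d with {m} quantifiers: four digits, two digits, two digits
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Positive lookbehind: an amount that is preceded by an escaped "$"
amounts = re.findall(r"(?<=\$)[\d,]+\.\d{2}", text)

# Positive lookahead: a word that is immediately followed by a colon
labels = re.findall(r"\w+(?=:)", text)

# Grouping with alternation; (?:...) groups without capturing
status = re.findall(r"\b(?:paid|unpaid)\b", text)

# \s+ matches any run of whitespace, which is replaced by one space
cleaned = re.sub(r"\s+", " ", "too   many \t spaces\n here").strip()

print(dates)    # ['2024-05-17']
print(amounts)  # ['1,250.00']
print(labels)   # ['total']
print(status)   # ['paid']
print(cleaned)  # too many spaces here
```

The lookaround patterns return only the text between the assertions: the `$` and the `:` are checked but are not part of the match.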

In [17]:
import re

# Sample text
text = "The rain in Spain falls mainly on the plain."

# Define a regular expression pattern
pattern = r"[a-zA-Z]*ain\b"

# Find all occurrences of the pattern in the text
matches = re.findall(pattern, text)

print("Matches:", matches)

Matches: ['rain', 'Spain', 'plain']


In this example, we're searching for words that end in "ain". The regular expression pattern r"[a-zA-Z]*ain\b" is defined as follows:

`[a-zA-Z]*`: zero or more ASCII letters

`ain`: the literal characters "ain"

`\b`: a word boundary, so "ain" must end the word (which is why "mainly" is not matched)
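A pattern built from `\b`, a literal `S`, and `\w+` matches words that start with a capital "S". A minimal sketch on the same sentence, with a variable name chosen for this illustration:

```python
import re

text = "The rain in Spain falls mainly on the plain."

# \b anchors the match at a word boundary, S is a literal capital letter,
# and \w+ consumes the remaining word characters
capital_s_words = re.findall(r"\bS\w+", text)

print(capital_s_words)  # ['Spain']
```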

## 3. Tokenization

Tokenization is the process of splitting a text into smaller units called tokens, usually words or sentences. This is an essential step in text data preprocessing, because later steps such as stopword removal and stemming operate on individual tokens.

In [18]:
!pip install nltk



In [19]:
import nltk
nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize, sent_tokenize

sample_text = "Tokenization is an essential step in NLP. It helps in breaking down text into smaller units."

# Word tokenization
word_tokens = word_tokenize(sample_text)
print("Word tokens:")
print(word_tokens)

# Sentence tokenization
sent_tokens = sent_tokenize(sample_text)
print("\nSentence tokens:")
print(sent_tokens)

Word tokens:
['Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.', 'It', 'helps', 'in', 'breaking', 'down', 'text', 'into', 'smaller', 'units', '.']

Sentence tokens:
['Tokenization is an essential step in NLP.', 'It helps in breaking down text into smaller units.']


[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
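To make the idea concrete without any downloads, a rough regex-based tokenizer can be sketched with the standard library. This is an approximation for simple English text, not how NLTK's `word_tokenize` or `sent_tokenize` work internally, and the two function names are hypothetical.

```python
import re

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample_text = "Tokenization is an essential step in NLP. It helps in breaking down text into smaller units."

print(simple_word_tokenize(sample_text))
print(simple_sent_tokenize(sample_text))
```

On this sample the result matches the NLTK output shown above, but the sketch splits abbreviations such as "e.g." into separate pieces and handles contractions such as "don't" differently, which is one reason to prefer a dedicated tokenizer.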


## 4. Stopword Removal

Stopwords are common words that don't carry much meaning and are often removed from text data during preprocessing. Examples include "a," "an," "the," "in," and "is."

In [26]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

len(stop_words)

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


179

In [37]:
# Remove stopwords with an explicit loop (case-insensitive comparison)
filtered_tokens_2 = []
for token in word_tokens:
    if token.lower() not in stop_words:
        filtered_tokens_2.append(token)

In [38]:
# Remove selected punctuation tokens only
filtered_tokens_3 = []
for token in word_tokens:
    if token not in [".", ";", "!"]:
        filtered_tokens_3.append(token)

In [46]:
# Remove both punctuation tokens and stopwords
filtered_tokens_4 = []
for token in word_tokens:
    if token not in [".", ";", "!"] and token.lower() not in stop_words:
        filtered_tokens_4.append(token)

In [29]:
# Same stopword filter as filtered_tokens_2, written as a list comprehension
filtered_tokens = [token for token in word_tokens if token.lower() not in stop_words]

In [30]:
filtered_tokens_2 == filtered_tokens

True

In [51]:
# Fails as shown: filtered_tokens_4 drops the "." tokens, but filtered_tokens keeps them
assert filtered_tokens_4 == filtered_tokens, "not the same"

AssertionError: not the same

In [52]:
print("Original tokens:")
print(word_tokens)
print("\nFiltered tokens (stopwords removed):")
print(filtered_tokens)

Original tokens:
['Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.', 'It', 'helps', 'in', 'breaking', 'down', 'text', 'into', 'smaller', 'units', '.']

Filtered tokens (stopwords removed):
['Tokenization', 'essential', 'step', 'NLP', '.', 'helps', 'breaking', 'text', 'smaller', 'units', '.']
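The same filtering can be sketched without NLTK to show the mechanics. The small `STOP_WORDS` set below is a hand-picked stand-in for illustration only (NLTK's English list is much longer), and `string.punctuation` is used as an assumed definition of punctuation tokens.

```python
import string

tokens = ['Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.',
          'It', 'helps', 'in', 'breaking', 'down', 'text', 'into', 'smaller',
          'units', '.']

# Hypothetical mini stopword list; a real list has many more entries
STOP_WORDS = {"is", "an", "in", "it", "down", "into"}

# Assumed definition of punctuation: single ASCII punctuation characters
PUNCTUATION = set(string.punctuation)

# Stopwords only: compare in lowercase so that "It" is caught as well
no_stopwords = [t for t in tokens if t.lower() not in STOP_WORDS]

# Stopwords and punctuation: additionally drop punctuation tokens
content_words = [t for t in no_stopwords if t not in PUNCTUATION]

print(no_stopwords)
print(content_words)
```

Filtering stopwords alone keeps the "." tokens, so a list that also drops punctuation compares unequal to it.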


## 5. Stemming and Lemmatization

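Stemming and lemmatization are techniques used to reduce words to their base or root form. Stemming strips affixes (typically suffixes) using heuristic rules, so the result is not always a real word, while lemmatization maps a word to its dictionary form (lemma) using a lexical knowledge base such as WordNet. NLTK's `WordNetLemmatizer` treats its input as a noun unless a part of speech is passed through the `pos` argument, and it returns the word unchanged when it finds no matching entry, as happens with "lemmatized" below.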

In [61]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = 'lemmatized'

stemmed_word = stemmer.stem(word)
lemmatized_word = lemmatizer.lemmatize(word)

print(f"Original word: {word}")
print(f"Stemmed word: {stemmed_word}")
print(f"Lemmatized word: {lemmatized_word}")

Original word: lemmatized
Stemmed word: lemmat
Lemmatized word: lemmatized


[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
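The difference between the two approaches can be sketched with toy versions: a rule-based suffix stripper and a dictionary lookup. Both functions, the suffix list, and the lookup table are invented for illustration; they are far cruder than the Porter algorithm or WordNet.

```python
# Hypothetical suffix rules, tried in this order
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a lexical knowledge base
LEMMAS = {"better": "good", "studies": "study", "ran": "run"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

for word in ["running", "jumped", "studies", "better"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
# running -> runn | running
# jumped -> jump | jumped
# studies -> studi | study
# better -> better | good
```

The stemmer applies rules blindly and can produce non-words such as "runn" and "studi", while the lookup only helps for words it knows about, which mirrors the trade-off between speed and accuracy described in the introduction.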


## 6. Text Feature Extraction (Bag of Words, TF-IDF)

Text feature extraction is the process of transforming text data into a structured format that can be used as input for machine learning algorithms. Bag of Words and TF-IDF (Term Frequency-Inverse Document Frequency) are two popular methods for text feature extraction.

### 6.1 Bag of Words

In [67]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document but not the second or the last.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

In [68]:
bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

print("Bag of Words:")
bow

Bag of Words:


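| | and | but | document | first | is | last | not | one | or | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 3 | 0 | 1 |
| 1 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |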
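To show what the vectorizer computes here, the counts can be reproduced with the standard library. This sketch assumes `CountVectorizer`'s default behavior of lowercasing the text and keeping tokens of two or more word characters; `tokenize` is a hypothetical helper, not a scikit-learn function.

```python
import re
from collections import Counter

corpus = [
    'This is the first document but not the second or the last.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(doc):
    """Lowercase and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter of term frequencies per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is the sorted set of all terms seen in the corpus
vocabulary = sorted(set().union(*counts))

# Each row holds one document's count for every vocabulary term
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row corresponds to one row of the table: the first document contains "the" three times, so its entry in the `the` column is 3.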


### 6.2 TF-IDF

In [71]:
corpus = [
    'This is document but not the second or last.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

In [72]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

print("TF-IDF:")
tfidf

TF-IDF:


| | and | but | document | first | is | last | not | one | or | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.413592 | 0.26399 | 0.0 | 0.215829 | 0.413592 | 0.413592 | 0.0 | 0.413592 | 0.326081 | 0.215829 | 0.0 | 0.215829 |
| 1 | 0.0 | 0.0 | 0.728794 | 0.0 | 0.297919 | 0.0 | 0.0 | 0.0 | 0.0 | 0.450103 | 0.297919 | 0.0 | 0.297919 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.0 | 0.267104 | 0.0 | 0.0 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.0 | 0.42797 | 0.670497 | 0.349893 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.349893 | 0.0 | 0.349893 |
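These weights can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1` for `n` documents, and L2 normalization of each row. The helper names `tokenize` and `tfidf_row` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    'This is document but not the second or last.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(doc):
    """Lowercase and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

term_counts = [Counter(tokenize(doc)) for doc in corpus]
n_docs = len(corpus)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for counts in term_counts for term in counts)

# Smoothed IDF (assumed default: smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_row(counts):
    """Weight counts by IDF, then scale the row to unit L2 length."""
    weights = {t: tf * idf[t] for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(counts) for counts in term_counts]

print(round(rows[0]["this"], 6))      # approx. 0.215829
print(round(rows[0]["but"], 6))       # approx. 0.413592
print(round(rows[1]["document"], 6))  # approx. 0.728794
```

Terms that occur in every document ("is", "the", "this") get the minimum IDF of 1, while a term found in only one document, such as "but", gets the largest weight in its row.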


### 6.3 Comparison

#### Bag of Words (BoW) Representation:
- BoW is a simple method that represents text data as a "bag" or unordered set of words, disregarding grammar and word order but keeping track of the frequency of each word.
- In the BoW model, each document is represented as a vector in a high-dimensional space, with each dimension corresponding to a unique word from the entire corpus.

Advantages of BoW:
1. Easy to understand and implement.
2. Works well for simple text classification and sentiment analysis tasks when word order and context are less important.

Disadvantages of BoW:
1. Ignores word order and context, which can be crucial for understanding the meaning of a text.
2. Results in a sparse matrix due to the high dimensionality of the feature space, which can lead to increased memory and computational requirements.
3. Treats every occurrence of every word as equally informative, so common words that contribute little to the meaning of a document can dominate the counts.

#### TF-IDF (Term Frequency-Inverse Document Frequency) Representation:
- TF-IDF is an improvement over the BoW model that assigns weights to words based on their importance within a document and across the entire corpus.
- It takes into account not only the frequency of words in a document (Term Frequency) but also their rarity across the entire corpus (Inverse Document Frequency).

Advantages of TF-IDF:
1. Considers the importance of words, giving more weight to rare words and less weight to common words, thus helping to identify distinguishing features in a document.
2. Can lead to better performance in text classification and information retrieval tasks compared to BoW, especially when the word order is not critical.

Disadvantages of TF-IDF:
1. More complex to understand and implement compared to BoW.
2. Still results in a sparse matrix with high dimensionality, similar to BoW.
3. Like BoW, it ignores word order and context, which can limit its effectiveness in some NLP tasks.

In summary, both BoW and TF-IDF have their strengths and weaknesses. BoW is a simpler approach that works well for basic text analysis tasks, while TF-IDF provides a more refined representation by considering the importance of words within documents and across the corpus. However, both methods disregard word order and context, which can be limiting factors in certain NLP tasks.

## 7. Exercises

**Exercise 1:** Given a text string, preprocess the text by performing the following tasks:

1. Tokenize the text into words. `word_tokenize()`
2. Convert all words to lowercase. `.lower()`
3. Remove stopwords.
4. Perform stemming on the remaining words.

**Exercise 2:** Using a corpus of your choice, create a Bag of Words representation and a TF-IDF representation. Compare the two representations and discuss the advantages and disadvantages of each method.