<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/01.btp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/01.btp.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

**Basic Text Processing**
---
📝 SALP chapter 2

## 🍎 An Intriguing Example

How do we read and comprehend the text below?
- parse sentences, words
- search for patterns
- recognize named entities
- find the meaning of words in their context
- feel the sentiment, etc.

In [3]:
text = """
John Smith, 123 Main St, Anytown USA 12345
Phone: (555) 123-4567
Email: [john.smith@example.com](mailto:john.smith@example.com)
Occupation: Software Engineer

Jane Doe, 456 Elm St, Othertown USA 67890
Phone: 1-800-789-0123
Email: janedoe@gmail.com
Occupation: Marketing Manager
"""

We can find the following information from this text:

* Names (first and last)
* Addresses
* Phone numbers
* Email addresses
* Occupations

In [2]:
import re

# Define regex patterns for each piece of information
name_pattern = r"[A-Za-z]+ [A-Za-z]+"
address_pattern = r"\d+ [A-Za-z]+ St, [A-Za-z]+ USA \d{5}"
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}|\d-\d{3}-\d{4}"
email_pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
occupation_pattern = r"Software Engineer|Marketing Manager"

# Use regex to find all occurrences of each pattern
names = re.findall(name_pattern, text)
addresses = re.findall(address_pattern, text)
phones = re.findall(phone_pattern, text)
emails = re.findall(email_pattern, text)
occupations = re.findall(occupation_pattern, text)

# Print the extracted information
print("Names:")
for name in names:
    print(name)

print("\nAddresses:")
for address in addresses:
    print(address)

print("\nPhone Numbers:")
for phone in phones:
    print(phone)

print("\nEmail Addresses:")
for email in emails:
    print(email)

print("\nOccupations:")
for occupation in occupations:
    print(occupation)


Names:
John Smith
Main St
Anytown USA
Software Engineer
Jane Doe
Elm St
Othertown USA
Marketing Manager

Addresses:
123 Main St, Anytown USA 12345
456 Elm St, Othertown USA 67890

Phone Numbers:
(555) 123-4567
0-789-0123

Email Addresses:
john.smith@example.com
john.smith@example.com
janedoe@gmail.com

Occupations:
Software Engineer
Marketing Manager


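The regex features used:

* Character classes (`[A-Za-z]+`, `\d+`)
* Escaped metacharacters (`\(`, `\)`, `\.`)
* Alternation (`|`)
* Quantifiers (`+`, `{3}`, `{4}`, `{5}`)

Word boundaries (`\b`), capturing groups (`(...)`), the `*` quantifier, and anchors (`^`, `$`) do not appear in these patterns; they are introduced in the sections that follow.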
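The output above shows two weaknesses. `name_pattern` matches any two adjacent words, so `Main St` and `Software Engineer` are reported as names, and the second alternative of `phone_pattern` captures only `0-789-0123` from `1-800-789-0123`. A minimal sketch of tighter patterns follows. It assumes that a name sits at the start of a line before a comma and that an occupation follows an `Occupation: ` label; both hold for this sample but are not general rules.

```python
import re

# Same sample text as in the cell above.
text = """
John Smith, 123 Main St, Anytown USA 12345
Phone: (555) 123-4567
Email: [john.smith@example.com](mailto:john.smith@example.com)
Occupation: Software Engineer

Jane Doe, 456 Elm St, Othertown USA 67890
Phone: 1-800-789-0123
Email: janedoe@gmail.com
Occupation: Marketing Manager
"""

# Assumption: a name is two capitalized words at the start of a line, before a comma.
name_pattern = r"^([A-Z][a-z]+ [A-Z][a-z]+),"
# The second alternative now covers all four digit groups of 1-800-789-0123.
phone_pattern = r"\(\d{3}\) \d{3}-\d{4}|\d-\d{3}-\d{3}-\d{4}"
email_pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+"
# Assumption: the occupation is whatever follows the "Occupation: " label.
occupation_pattern = r"^Occupation: (.+)$"

# re.MULTILINE lets ^ and $ match at every line, not only at the string ends.
names = re.findall(name_pattern, text, flags=re.MULTILINE)
phones = re.findall(phone_pattern, text)
# dict.fromkeys drops the duplicate address from the mailto link, keeping order.
emails = list(dict.fromkeys(re.findall(email_pattern, text)))
occupations = re.findall(occupation_pattern, text, flags=re.MULTILINE)

print(names)        # ['John Smith', 'Jane Doe']
print(phones)       # ['(555) 123-4567', '1-800-789-0123']
print(emails)       # ['john.smith@example.com', 'janedoe@gmail.com']
print(occupations)  # ['Software Engineer', 'Marketing Manager']
```

Anchoring on line position and field labels trades generality for precision: the patterns no longer fire on arbitrary word pairs, but they only work on text laid out like this sample.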


## Introduction to Regular Expressions
* Regular expressions (regex) are a powerful tool for matching patterns in text data.
* In NLP, regex is used for tasks such as:
	+ Text preprocessing
	+ Information extraction
	+ Sentiment analysis


## Basic Concepts
* **Pattern**: A regular expression is a pattern that matches one or more strings of text.
* **Literal characters**: Characters that match themselves (e.g. `a` matches the letter "a").
* **Metacharacters**: Characters with a special meaning (e.g. `.` matches any single character except a newline).
* **Escaping**: Using a backslash (`\`) to treat metacharacters as literal characters.
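The difference between a metacharacter and its escaped, literal form is easiest to see side by side. The sketch below uses a made-up sample string (an assumption for illustration) and the standard `re.escape` helper.

```python
import re

sample = "Price: 3.50 or 3x50"  # hypothetical sample string

# Unescaped '.' is a metacharacter: it matches any character, including 'x'.
loose = re.findall(r"3.50", sample)
# Escaped '\.' matches only a literal dot.
strict = re.findall(r"3\.50", sample)
# re.escape adds the backslashes for you when the pattern comes from data.
escaped = re.escape("3.50")

print(loose)    # ['3.50', '3x50']
print(strict)   # ['3.50']
print(escaped)  # 3\.50
```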


## Basic Regex Patterns
* **Matching a word**: `\bword\b` matches the whole word "word".
	+ Example: `import re; print(re.search(r'\bhello\b', 'hello world'))`
* **Matching a digit**: `\d` matches any single digit.
	+ Example: `import re; print(re.search(r'\d', 'abc123def'))`
* **Matching whitespace**: `\s` matches any whitespace character (space, tab, newline).
	+ Example: `import re; print(re.search(r'\s', 'hello world'))`
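The one-line examples above print match objects. A slightly longer sketch makes the matched text and positions explicit; the extra input string with a tab is an assumption added for illustration.

```python
import re

# \b requires a word boundary on both sides, so "hell" inside "hello" fails.
whole = re.search(r"\bhello\b", "hello world")
partial = re.search(r"\bhell\b", "hello world")
print(whole.group(), whole.span())  # hello (0, 5)
print(partial)                      # None

# \d matches one digit; adding + collects the whole run.
first_digit = re.search(r"\d", "abc123def").group()
digit_runs = re.findall(r"\d+", "abc123def")
print(first_digit)  # 1
print(digit_runs)   # ['123']

# \s+ treats any run of spaces or tabs as a single separator.
words = re.split(r"\s+", "hello   world\tagain")
print(words)  # ['hello', 'world', 'again']
```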

## Character Classes
* **Character class**: A set of characters enclosed in square brackets (`[]`).
* **Matching a single character from the class**: `[abc]` matches any one of "a", "b", or "c".
	+ Example: `import re; print(re.search(r'[abc]', 'back'))` (with `'hello'` as input the result is `None`, since it contains none of the three letters)
* **Negating a character class**: `[^abc]` matches any single character that is not "a", "b", or "c".
	+ Example: `import re; print(re.search(r'[^abc]', 'hello'))`
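A short sketch of character classes in use, including ranges such as `[A-Z]`, which the extraction patterns earlier rely on. The input strings are assumptions chosen for illustration.

```python
import re

# No 'a', 'b' or 'c' in "hello", so there is no match.
no_match = re.search(r"[abc]", "hello")
# The first character of "back" is in the class.
in_class = re.search(r"[abc]", "back").group()
# A negated class skips 'a', 'b', 'c' and stops at the first other character.
not_in_class = re.search(r"[^abc]", "abcde").group()
# Ranges: uppercase letters, and runs of hexadecimal digits.
capitals = re.findall(r"[A-Z]", "Hello World")
hex_runs = re.findall(r"[0-9a-f]+", "ff 1a zz")

print(no_match)      # None
print(in_class)      # b
print(not_in_class)  # d
print(capitals)      # ['H', 'W']
print(hex_runs)      # ['ff', '1a']
```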


## Quantifiers
* **Quantifier**: A metacharacter that specifies how many times the preceding element may repeat.
* **Matching zero or more occurrences**: `*` matches any number (including zero) of the preceding element.
	+ Example: `import re; print(re.search(r'ab*', 'a'))`
* **Matching one or more occurrences**: `+` matches one or more of the preceding element.
	+ Example: `import re; print(re.search(r'ab+', 'ab'))`
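To compare quantifiers directly, the sketch below checks which patterns match a whole string using `re.fullmatch`. It also includes `?` (zero or one) and `{3}` (exactly three), which are standard quantifiers not described in the list above.

```python
import re

patterns = [r"ab*", r"ab+", r"ab?", r"ab{3}"]
results = {}

for s in ["a", "ab", "abbb"]:
    # fullmatch succeeds only if the pattern consumes the entire string.
    results[s] = [p for p in patterns if re.fullmatch(p, s)]
    print(f"{s!r} is fully matched by {results[s]}")

# 'a'    -> ab*, ab?
# 'ab'   -> ab*, ab+, ab?
# 'abbb' -> ab*, ab+, ab{3}
```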

## Groups and Capturing
* **Group**: A subpattern enclosed in parentheses (`()`), which can be quantified or captured as a unit.
* **Capturing a group**: `(\w+)` captures one or more word characters as group 1.
	+ Example: `import re; print(re.search(r'(\w+)', 'hello world').group(1))`
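A sketch of how captured groups are read back from a match. The named-group syntax `(?P<name>...)` is part of Python's `re` module; the group names `user` and `domain` are chosen here for illustration.

```python
import re

m = re.search(r"(\w+) (\w+)", "hello world")
print(m.group(0))  # hello world   (the whole match)
print(m.group(1))  # hello
print(m.group(2))  # world

# With several groups, findall returns one tuple per match.
parts = re.findall(r"\((\d{3})\) (\d{3})-(\d{4})", "Phone: (555) 123-4567")
print(parts)  # [('555', '123', '4567')]

# Named groups make the result self-describing.
email = re.search(r"(?P<user>[\w.]+)@(?P<domain>[\w.]+)", "janedoe@gmail.com")
print(email.group("user"), email.group("domain"))  # janedoe gmail.com
```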

## Anchors
* **Anchor**: A metacharacter that specifies the position of a pattern in a string.
* **Matching the start of a string**: `^` matches the start of a string.
	+ Example: `import re; print(re.search(r'^hello', 'hello world'))`
* **Matching the end of a string**: `$` matches at the end of a string (or just before a trailing newline).
	+ Example: `import re; print(re.search(r'world$', 'hello world'))`
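By default `^` and `$` refer to the whole string. With the `re.MULTILINE` flag they also match at the start and end of every line, which is useful for line-oriented text. The two-line sample below is an assumption for illustration.

```python
import re

at_start = re.search(r"^hello", "hello world")
not_at_start = re.search(r"^world", "hello world")
at_end = re.search(r"world$", "hello world")
print(at_start is not None, not_at_start, at_end is not None)  # True None True

lines = "Phone: 1\nEmail: a@b.c"  # hypothetical two-line input
# Without the flag, ^ matches only at the very beginning of the string.
first_only = re.findall(r"^\w+", lines)
# With re.MULTILINE, ^ matches at the beginning of each line.
every_line = re.findall(r"^\w+", lines, flags=re.MULTILINE)
print(first_only)  # ['Phone']
print(every_line)  # ['Phone', 'Email']
```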

## Greedy vs. Lazy Matching
* **Greedy matching**: Matches as many characters as possible (default behavior).
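* **Lazy matching**: Matches as few characters as possible (append `?` to a quantifier, e.g. `*?`, `+?`).
	+ Example: `import re; print(re.search(r'a.*?b', 'a hello b and b'))` matches `a hello b`, whereas the greedy `a.*b` matches the whole string.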
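The difference matters most when a delimiter occurs more than once. The sketch below runs a greedy and a lazy pattern over the same string; the HTML-like snippet is an assumption for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r"<.+>", html)
# Lazy: .+? stops at the first '>' that completes a match.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```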

## Using Regex in NLP
* **Preprocessing text data**: Use regex to remove punctuation, collapse whitespace, etc. (lowercasing itself is a plain string operation, e.g. `str.lower()`).
* **Extracting information**: Use regex to extract specific patterns from text data (e.g. phone numbers, email addresses).
* **Sentiment analysis**: Use regex to extract sentiment-bearing phrases from text data.
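A minimal sketch of the preprocessing step, assuming the goal is lowercase text with punctuation removed and whitespace collapsed. The function name `normalize` and the sample sentence are hypothetical, and real pipelines often keep some punctuation (apostrophes, hyphens) rather than dropping all of it.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()                   # plain string method, no regex needed
    text = re.sub(r"[^\w\s]", " ", text)  # anything that is not a word char or whitespace
    text = re.sub(r"\s+", " ", text)      # runs of whitespace become one space
    return text.strip()

cleaned = normalize("Hello,   World!  It's 2024.")
print(cleaned)  # hello world it s 2024
```

Replacing punctuation with a space instead of deleting it keeps neighbouring words apart, at the cost of splitting contractions such as `it's`. Note that `\w` includes the underscore, so underscores survive this step.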

🔗 [re — Regular expression operations](https://docs.python.org/3/library/re.html)

## 🍎 Example
Text Analysis with NLTK:

* Tokenize the text into individual words and sentences
* Perform stemming on the tokens (i.e., strip suffixes to reduce words to a stem, which may not be a dictionary word)
* Identify named entities in the text (e.g., people, places, organizations)

In [4]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.chunk import ne_chunk

# Download required NLTK data if necessary
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')

# use the same text as above
# text = """
# The quick brown fox jumps over the lazy dog. The sun is shining brightly today.
# """

# Tokenize the text into individual words and sentences
word_tokens = word_tokenize(text)
sentence_tokens = sent_tokenize(text)

print("Word Tokens:")
for token in word_tokens:
    print(token)

print("\nSentence Tokens:")
for sentence in sentence_tokens:
    print(sentence)

# Perform stemming on the tokens
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in word_tokens]

print("\nStemmed Words:")
for stemmed_word in stemmed_words:
    print(stemmed_word)

# Identify named entities in the text
tagged_text = nltk.pos_tag(word_tokens)  # reuse the tokens computed above
named_entities = ne_chunk(tagged_text)

print("\nNamed Entities:")
for tree in named_entities:
    if hasattr(tree, 'label'):
        print(tree.label(), end=': ')
        for leaf in tree.leaves():
            print(leaf[0], end=' ')
        print()


Word Tokens:
John
Smith
,
123
Main
St
,
Anytown
USA
12345
Phone
:
(
555
)
123-4567
Email
:
[
john.smith
@
example.com
]
(
mailto
:
john.smith
@
example.com
)
Occupation
:
Software
Engineer
Jane
Doe
,
456
Elm
St
,
Othertown
USA
67890
Phone
:
1-800-789-0123
Email
:
janedoe
@
gmail.com
Occupation
:
Marketing
Manager

Sentence Tokens:

John Smith, 123 Main St, Anytown USA 12345
Phone: (555) 123-4567
Email: [john.smith@example.com](mailto:john.smith@example.com)
Occupation: Software Engineer

Jane Doe, 456 Elm St, Othertown USA 67890
Phone: 1-800-789-0123
Email: janedoe@gmail.com
Occupation: Marketing Manager

Stemmed Words:
john
smith
,
123
main
st
,
anytown
usa
12345
phone
:
(
555
)
123-4567
email
:
[
john.smith
@
example.com
]
(
mailto
:
john.smith
@
example.com
)
occup
:
softwar
engin
jane
doe
,
456
elm
st
,
othertown
usa
67890
phone
:
1-800-789-0123
email
:
janedo
@
gmail.com
occup
:
market
manag

Named Entities:
PERSON: John 
GPE: Smith 
PERSON: Anytown 
PERSON: Software Engineer Jane

The chunker's labels are unreliable on this input: `Smith` is tagged as a GPE, `Anytown` as a PERSON, and `Software Engineer Jane` is merged into one PERSON. One plausible reason is that the text is a list of contact fields rather than running prose.

## NLTK features used:

* Tokenization (`word_tokenize`, `sent_tokenize`)
* Stemming (`PorterStemmer`)
* Part-of-speech tagging (`pos_tag`)
* Named entity recognition (`ne_chunk`)

🔗 [Natural Language Toolkit](https://www.nltk.org/)