### Natural Language Processing

#### Porter Stemmer
The Porter stemmer is an algorithm for stemming words in NLP. It reduces a word to its root form (stem) by stripping suffixes. Martin Porter published it in 1980. Stemming collapses the variants of a word into one token, which shrinks the vocabulary and makes text processing more efficient.

Example: the word "running" is normalized to "run", which is the __stem__ of the word. A stem is not always a dictionary word:
* easily -> easili
* flying -> fli
* happiness -> happi
* organization -> organ

In [3]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "flies", "easily", "happiness", "happy", "organizer", "organization"]

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

# Print results
print(set(stemmed_words))

{'organ', 'run', 'easili', 'fli', 'happi'}


#### List Comprehension

<code>
[treatment(word) for word in tokens if condition]
</code>

* It iterates over each word in tokens.
* It checks whether the condition holds for that word.
* If so, it applies treatment to the word (such as stemming).
* It returns all treated words in a new list.
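
The steps above can be written out as an explicit loop, which may make the order of operations easier to see. This is a minimal sketch: `treatment` is a hypothetical stand-in (it only uppercases the word) and the token list is made up for illustration.

```python
tokens = ["running", "the", "dogs", "is", "happy"]
stopwords = {"the", "is"}


def treatment(word):
    # Hypothetical stand-in for a real treatment such as stemming
    return word.upper()


# List comprehension: filter first, then treat each surviving word
treated = [treatment(word) for word in tokens if word not in stopwords]

# Equivalent explicit loop
treated_loop = []
for word in tokens:
    if word not in stopwords:
        treated_loop.append(treatment(word))

print(treated)       # ['RUNNING', 'DOGS', 'HAPPY']
print(treated_loop)  # ['RUNNING', 'DOGS', 'HAPPY']
```

Both forms build the same list; the comprehension is simply more compact.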


In [13]:
tokens = ["running", "the", "quickly", "dogs", "happy", "is", "organization", "in"]
stopwords_en = {"the", "is", "in"}

# Processing with stemming and filtering
# note that "dogs" is dropped because it has fewer than 5 characters
stemmed_tokens = [stemmer.stem(word) for word in tokens if word not in stopwords_en and len(word) >= 5]

print(stemmed_tokens)

['run', 'quickli', 'happi', 'organ']


#### n-gram

In NLP, an n-gram is a sequence of n consecutive words in a text.
* Unigram (1-gram) -> Single words
* Bigram (2-gram) -> Pairs of consecutive words

Example: "This is a great product"
* Unigram -> "This", "is", "a", "great", "product"
* Bigram -> "This is", "is a", "a great", "great product"

Compared to unigrams, bigrams capture relationships between adjacent words and therefore some local context.
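
The same idea extends to any n. The sketch below builds n-grams in plain Python by zipping shifted copies of the token list; the helper name `ngrams` and the whitespace tokenization are assumptions made for this illustration, not part of NLTK.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest shifted copy, so only full n-grams are kept
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "This is a great product".split()

unigram_list = ngrams(tokens, 1)
bigram_list = ngrams(tokens, 2)
trigram_list = ngrams(tokens, 3)

print(bigram_list)
# [('This', 'is'), ('is', 'a'), ('a', 'great'), ('great', 'product')]
print(len(unigram_list), len(bigram_list), len(trigram_list))  # 5 4 3
```

A text of k tokens yields k - n + 1 n-grams, which is why the counts drop by one at each step.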

In [8]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import bigrams

# Download the tokenizer data. Recent NLTK releases load 'punkt_tab',
# older releases load 'punkt', so both are requested here.
nltk.download('punkt')
nltk.download('punkt_tab')

# Example text
text = "This is a great product"

# Tokenize the lowercased text into words
tokens = word_tokenize(text.lower())


In [None]:
# Generate unigrams (single words)
unigrams = tokens

# Generate bigrams (pairs of words)
bigrams_list = list(bigrams(tokens))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams_list)

#### Pointwise Mutual Information (PMI)
PMI is a statistical measure used in Natural Language Processing (NLP) to quantify how strongly two words are associated with each other.
- It measures how much more often two words appear together than they would if they occurred independently.
- Higher PMI → Stronger association (e.g., "New York" appears together far more often than chance predicts).
- PMI near 0 → The words co-occur about as often as chance predicts (e.g., "the book" is common but not meaningful). A negative PMI means they co-occur less often than chance.

$$ PMI(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)} $$

Where:
- $P(x, y)$ = Probability that words x and y appear together
- $P(x)$ = Probability of x appearing anywhere
- $P(y)$ = Probability of y appearing anywhere
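
The formula translates directly into a small function. This is a sketch under stated assumptions: the name `pmi` and its parameters are hypothetical, probabilities are estimated as simple relative frequencies, and the counts are made-up numbers chosen so that the results are exact.

```python
import math


def pmi(count_xy, count_x, count_y, total_bigrams, total_words):
    """Pointwise mutual information (base 2) from raw counts."""
    p_xy = count_xy / total_bigrams  # P(x, y): relative frequency of the bigram
    p_x = count_x / total_words      # P(x)
    p_y = count_y / total_words      # P(y)
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: the pair occurs 4 times as often as independence predicts
strong = pmi(count_xy=4, count_x=8, count_y=8, total_bigrams=64, total_words=64)

# Hypothetical counts: the pair occurs exactly as often as independence predicts
chance = pmi(count_xy=1, count_x=8, count_y=8, total_bigrams=64, total_words=64)

print(strong)  # 2.0
print(chance)  # 0.0
```

A PMI of 2.0 means the pair is 2² = 4 times more frequent than chance, while 0.0 means no association.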

#### Example

<code>
"The battery life is amazing."
"This phone has amazing battery performance."
"Amazing product with long battery life."
</code>
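
After lowercasing and ignoring punctuation, the three sentences contain 17 word tokens and 14 bigrams (bigrams do not cross sentence boundaries).

- "battery": count -> 3, Probability 3/17
- "amazing": count -> 3, Probability 3/17

__Bigram "amazing battery" Appears 1 Time__ (in the second sentence), so its probability is 1/14.

$PMI(\text{amazing}, \text{battery}) = \log_2 \frac{P(\text{amazing}, \text{battery})}{P(\text{amazing}) \times P(\text{battery})}$

$PMI(\text{amazing}, \text{battery}) = \log_2 \frac{1/14}{(3/17) \times (3/17)}$

$PMI(\text{amazing}, \text{battery}) = \log_2 (2.29) \approx 1.20$

Since the PMI is positive, "amazing" and "battery" appear together more often than chance predicts, which suggests that "amazing battery" is a meaningful phrase. With a corpus this small, the value is only illustrative.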

__Why Use PMI for Bigrams?__
- Identifies collocations (meaningful word pairs) → "New York", "machine learning", "battery life".
- Filters out common but unimportant word pairs → "the book", "in a", "this is".
- Used for bigram selection in NLP tasks.


In [17]:
from collections import Counter
import math
import nltk
from nltk.util import bigrams

# Example text corpus
corpus = [
    "The battery life is amazing.",
    "This phone has amazing battery performance.",
    "Amazing product with long battery life."
]

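# Tokenize text, keeping only alphabetic tokens so punctuation is not counted as a word
tokenized_corpus = [
    [token for token in nltk.word_tokenize(doc.lower()) if token.isalpha()]
    for doc in corpus
]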

# Count unigrams and bigrams
unigram_counts = Counter(word for doc in tokenized_corpus for word in doc)
bigram_counts = Counter(bigram for doc in tokenized_corpus for bigram in bigrams(doc))

# Total number of words and bigrams
total_words = sum(unigram_counts.values())
total_bigrams = sum(bigram_counts.values())

# Calculate PMI for each bigram
pmi_scores = {}
for bigram, bigram_count in bigram_counts.items():
    word_x, word_y = bigram
    p_x = unigram_counts[word_x] / total_words
    p_y = unigram_counts[word_y] / total_words
    p_xy = bigram_count / total_bigrams

    # Compute PMI score
    pmi_scores[bigram] = math.log2(p_xy / (p_x * p_y))

# Print top PMI bigrams
sorted_pmi = sorted(pmi_scores.items(), key=lambda x: x[1], reverse=True)
print("Top PMI Bigrams:", sorted_pmi[:5])

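##### Expected output
For a `CountVectorizer` fitted on the three sentences annotated below, the vocabulary (word → column index) and the count matrix are:

<pre>
Vocabulary: {'this': 5, 'is': 2, 'great': 1, 'product': 4, 'amazing': 0, 'love': 3}
Sparse Matrix:
 [[0 1 1 0 1 1]  # "This is a great product"
  [1 0 1 0 1 1]  # "This product is amazing"
  [0 1 0 1 1 1]] # "I love this great product"
</pre>

- Each row represents a sentence.
- Each column corresponds to a word, at the index given by vectorizer.vocabulary_ (amazing, great, is, love, product, this).
- Numbers indicate word occurrences in each sentence.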

#### How fit_transform() Works Internally
| Text | Tokenized Words | Vectorized Output |
| -------- | ------- | -------- |
| "This is a great product" | ['this', 'is', 'great', 'product'] | [0, 1, 1, 0, 1, 1] |
| "This product is amazing" | ['this', 'product', 'is', 'amazing'] | [1, 0, 1, 0, 1, 1] |
| "I love this great product" | ['love', 'this', 'great', 'product'] | [0, 1, 0, 1, 1, 1] |

- fit() learns the vocabulary from the text corpus.
- transform() converts text into a sparse matrix of word counts.
- fit_transform() performs both steps in a single call.
- The resulting count matrix is a common input for NLP tasks like text classification, clustering, and topic modeling.
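
The two steps can be sketched in plain Python. This is a conceptual sketch, not scikit-learn's implementation: `tokenize`, `fit_vocabulary`, and `transform_counts` are hypothetical helper names, the regular expression mirrors `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters), and a list of lists stands in for the sparse matrix.

```python
import re

# Tokens are runs of two or more word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """'fit' step: map each unique token to a column index, in sorted order."""
    words = sorted({token for doc in docs for token in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def transform_counts(docs, vocabulary):
    """'transform' step: one row of word counts per document."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            if token in vocabulary:  # words unseen during fit are ignored
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


docs = [
    "This is a great product",
    "This product is amazing",
    "I love this great product",
]

vocabulary = fit_vocabulary(docs)
matrix = transform_counts(docs, vocabulary)

print(vocabulary)
# {'amazing': 0, 'great': 1, 'is': 2, 'love': 3, 'product': 4, 'this': 5}
print(matrix)
# [[0, 1, 1, 0, 1, 1], [1, 0, 1, 0, 1, 1], [0, 1, 0, 1, 1, 1]]
```

Single-character tokens such as "a" and "I" do not match the pattern, so they never enter the vocabulary.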


#### Bag of Words (BoW)
Bag of Words is a text representation technique in NLP:
- Each document (text) is represented as a collection (bag) of its words, described over a vocabulary of the __unique__ words in the corpus.
- It ignores word order and grammar.
- It counts word occurrences to create a numerical feature vector.
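
A short sketch can make the "ignores word order" point concrete. It uses `collections.Counter` as the bag; the helper name `bag_of_words`, the whitespace tokenization, and the sample sentences are assumptions made for this illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count each lowercased, whitespace-separated word in the text."""
    return Counter(text.lower().split())


bag_a = bag_of_words("dog bites man")
bag_b = bag_of_words("man bites dog")

# Different sentences, identical bags: word order is lost
same_bag = bag_a == bag_b
print(same_bag)  # True

# Repeated words are counted, not just recorded as present
repeated = bag_of_words("good good movie")
print(repeated["good"], repeated["movie"])  # 2 1
```

Losing word order is the price of the model's simplicity: the two sentences mean opposite things but get the same representation.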


#### Corpus
It is a collection of text documents used for NLP analysis.
- A corpus can be a collection of books, news articles, tweets, or product reviews.
- In the BoW model, a corpus is used to build the vocabulary.

__Example: List of Documents__

<pre>
corpus = [
    "I love NLP and machine learning",
    "Machine learning is amazing",
    "Deep learning and NLP are the future"
]    
</pre>


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample Corpus
corpus = [
    "I love NLP and machine learning",
    "Machine learning is amazing",
    "Deep learning and NLP are the future"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Convert text corpus into BoW sparse matrix
X = vectorizer.fit_transform(corpus)

# Get feature names (vocabulary)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Convert sparse matrix to array
print("BoW Representation:\n", X.toarray())

Expected Output:

<pre>
Vocabulary: ['amazing', 'and', 'are', 'deep', 'future', 'is', 'learning', 'love', 'machine', 'nlp', 'the']
BoW Representation:
 [[0 1 0 0 0 0 1 1 1 1 0]
  [1 0 0 0 0 1 1 0 1 0 0]
  [0 1 1 1 1 0 1 0 0 1 1]]
</pre>

- Each row represents a sentence in vector form.
- Each column represents a word from the vocabulary, in alphabetical order.
- Each cell stores how many times that word occurs in that sentence.

