# Introduction to Natural Language Processing (NLP)

Natural Language Processing, or NLP, is a field at the intersection of computer science, artificial intelligence, and linguistics. It involves the development of algorithms and systems that enable computers to understand, interpret, and generate human language, and to derive useful meaning from it.

NLP helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics.

![nlp](../assets/nlp.png)

## Challenges in NLP

### Part-of-Speech Tagging

![pos](../assets/pos.jpeg)

Part-of-speech (POS) tagging is the process of assigning a part of speech to each word in a text, such as noun, verb, adjective, etc. This is challenging because:

- **Ambiguity**: A word can have multiple parts of speech based on the context. For example, "book" can be a noun ("I read a book") or a verb ("Book a table").
- **Contextual Use**: Words may be used in a figurative sense, which can confuse POS taggers.
- **New Words**: New words, slang, and jargon keep emerging, and POS taggers need regular updates to handle them.

### Text Segmentation

![senseg](../assets/sentence_segmentation.jpeg)

Text segmentation involves dividing text into meaningful units, such as sentences or topics. Challenges include:

- **Sentence Boundary Detection**: Punctuation marks like periods can be used for abbreviations, decimals, etc., and not always to end sentences.
- **Tokenization**: Different languages and scripts have different tokenization rules, and some don't use whitespace.
- **Topic Segmentation**: Identifying topic shifts in a text requires understanding of the content, which is a non-trivial task.
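
To make the sentence-boundary problem concrete, here is a minimal sketch that uses only Python's `re` module. The function name `naive_sentence_split` and the sample text are illustrative assumptions, not part of any library. The rule "split after `.`, `!` or `?` followed by whitespace" is deliberately naive.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (deliberately naive)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = "Dr. Smith paid $3.50 for coffee. He left at 5 p.m. today."
sentences = naive_sentence_split(sample)
for s in sentences:
    print(repr(s))
```

The sample contains two sentences, but the rule breaks after "Dr." and after "p.m." as well, producing four fragments. The decimal point in "$3.50" survives only because no whitespace follows it.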

### Word Sense Disambiguation

Word sense disambiguation is the task of determining which sense of a word is active in a given context. Challenges include:

- **Polysemy**: Many words have multiple meanings, and identifying the correct one is difficult without deep understanding.
- **Limited Context**: Sometimes the surrounding text is not enough to determine the word sense.
- **Lack of Resources**: For less-resourced languages, there might not be enough data to train disambiguation systems.

For example, "plant" has a different sense in each of these sentences:

> - Many plants and animals live in the rainforest.
> - The manufacturing plant produced widgets.

### Syntax Disambiguation

Syntax disambiguation deals with the different ways in which words can be combined to form sentences. Challenges here include:

- **Structural Ambiguity**: Sentences can often be parsed in multiple ways ("I saw the man with the telescope").
- **Complex Constructions**: Some languages have free word order or allow for nested clauses, making parsing difficult.
- **Idiomatic Expressions**: Phrases that don't follow standard syntax rules can confuse parsers.

Each of the following sentences has two readings, depending on what the prepositional phrase attaches to:

> - Annie hit a man with an umbrella.
> - I shot an elephant in my pyjamas.
> - The tourist saw the woman with a telescope.

### Imperfect or Irregular Input

Language is often messy and unpredictable, leading to challenges such as:

- **Typos and Spelling Errors**: Mistakes in writing can lead to misinterpretation by NLP systems.
- **Non-standard Language**: Use of slang, abbreviations, and non-standard grammar can be problematic.
- **Multilingual Text**: Text containing multiple languages can complicate processing.

## Applications of NLP

- **Text Classification**: Assigning categories or labels to text, such as spam detection in email services.
- **Machine Translation**: Translating text from one language to another, like Google Translate.
- **Sentiment Analysis**: Identifying the sentiment of text, used in social media monitoring and market research.
- **Chatbots and Virtual Assistants**: Powering conversational agents like Siri, Alexa, and customer service bots.
- **Information Extraction**: Extracting structured information from unstructured text, such as named entity recognition.
- **Summarization**: Generating a shortened version of a text, retaining its most important information.
- **Speech Recognition**: Translating spoken language into text, used in voice user interfaces.
- **Question Answering**: Building systems that automatically answer questions posed by humans in natural language, as ChatGPT does.

## Brief History of NLP

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. Because the test is carried out through conversation in natural language, it is closely tied to the goals of NLP.

### Milestones in the History of NLP

- **1950s**: The era of symbolic NLP, rule-based systems that tried to encode human knowledge and grammar rules into computers.
- **1960s**: Development of the first chatbot, ELIZA, and further work on machine translation.
- **1970s-1980s**: The rise of computational linguistics and the development of more sophisticated models for handling syntax and semantics.
- **1990s**: Introduction of statistical NLP, leveraging large amounts of data and statistical methods to process language.
- **2000s**: The emergence of machine learning in NLP, with systems beginning to learn from data rather than relying on hand-coded rules.
- **2010s-Present**: The rise of deep learning has revolutionized NLP, leading to the development of models like BERT and GPT that can handle complex language tasks with unprecedented accuracy.

# Text Preprocessing

Text preprocessing is a critical step in NLP. It involves preparing and cleaning text data for further analysis and modeling. The goal is to simplify the text and remove any noise that might distract the machine learning algorithms from understanding the core content.

Raw text data is often messy and unstructured, with various issues:

- Irrelevant characters and symbols
- Inconsistent formatting
- Typos and spelling errors
- Diverse languages and slang
- Stopwords (commonly used words that may not be useful in analysis)

## Tokenization

Tokenization is the process of breaking down text into smaller units, called *tokens*. Tokens can be words, numbers, or punctuation marks. It's the first step in turning unstructured text into a form that can be analyzed.

### White-space Tokenization

This is the simplest form of tokenization. It splits the text by white spaces, including spaces, tabs, and new line characters.

In [1]:
def whitespace_tokenizer(text):
    return text.split()

# Example usage:
text = "Natural language processing is fun."
tokens = whitespace_tokenizer(text)
print(tokens)

['Natural', 'language', 'processing', 'is', 'fun.']


### Punctuation-based Tokenization

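This method uses punctuation as well as white space to delimit tokens. The regular expression below keeps only runs of word characters, so punctuation marks are dropped rather than returned as separate tokens.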

In [2]:
import re

def punctuation_tokenizer(text):
    return re.findall(r'\b\w+\b', text)

# Example usage:
text = "Natural language processing is fun!"
tokens = punctuation_tokenizer(text)
print(tokens)

['Natural', 'language', 'processing', 'is', 'fun']
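
If punctuation marks should be kept as tokens of their own instead of being dropped, one possible variant is sketched below. The function name `punctuation_keeping_tokenizer` is an assumption made for this example.

```python
import re

def punctuation_keeping_tokenizer(text):
    # \w+ matches a run of word characters; [^\w\s] matches any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens_with_punct = punctuation_keeping_tokenizer("Natural language processing is fun!")
print(tokens_with_punct)
```

This prints the same five words followed by `'!'` as a sixth token. Contractions are still handled crudely: "Don't" becomes `Don`, `'`, `t`, which is one reason to prefer a library tokenizer.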


### Using NLP Libraries for Tokenization

Libraries like `NLTK` and `spaCy` provide robust tokenization functions that handle edge cases and are more sophisticated than the simple white-space or punctuation-based methods.

In [3]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize



In [4]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

text = "Natural language processing is fun!"
tokens = word_tokenize(text)
print("\n", tokens)


 ['Natural', 'language', 'processing', 'is', 'fun', '!']




In [5]:
import spacy

# Download the spaCy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

In [7]:
text = "Natural language processing is fun!"
doc = nlp(text)

tokens = [token.text for token in doc]
print(tokens)

['Natural', 'language', 'processing', 'is', 'fun', '!']


# Text Normalization

Text normalization involves transforming text into a more uniform format to improve the performance of text analysis algorithms. Two common text normalization techniques are *stemming* and *lemmatization*.

## Stemming

Stemming is the process of reducing words to their stem, base, or root form, generally a written word form. The idea is to remove affixes (prefixes and suffixes) from a word to get at its core meaning.

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This process is quite crude and a stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
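
The suffix-stripping idea can be sketched in a few lines. This is a toy illustration only: the function name `toy_stem`, the suffix list, and the minimum stem length of 3 are assumptions chosen for the example, and real stemmers such as Porter's apply many more rules and conditions.

```python
def toy_stem(word, suffixes=("ing", "ly", "ed", "es", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "jumped", "jumps", "quickly", "running", "sing"]:
    print(w, "->", toy_stem(w))
```

"jumping", "jumped" and "jumps" all map to "jump", which is the desired grouping. "running" becomes "runn", showing how crude the approach is: the stem is not a real word and does not match the stem of "run".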

### Porter Stemmer

The Porter stemming algorithm, published by Martin Porter in 1980, is one of the oldest and most widely used stemmers. It is designed for English and applies a series of rules, in successive phases, to strip suffixes.

### Snowball Stemmer

The Snowball Stemmer, also known as the English Stemmer or Porter2 Stemmer, is a slightly improved version of the Porter stemmer and is part of a larger framework called Snowball. It offers stemmers for several languages besides English.

### Advantages and Disadvantages of Stemming

**Advantages:**
- Simple to implement and fast to run.
- Reduces the size of the vocabulary the model is exposed to.
- Often improves the performance of text classification models.

**Disadvantages:**
- Can produce stems that are not actual words.
- Sometimes too aggressive, cutting off too much of the word and changing the meaning.
- Does not consider the context of the word, which can lead to inaccuracies.

In [8]:
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [9]:
# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

In [10]:
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly', 'better', 'mice', 'feet']

porter_stems = [porter.stem(word) for word in words]
print(f"Porter Stemmer: {porter_stems}")

snowball_stems = [snowball.stem(word) for word in words]
print(f"Snowball Stemmer: {snowball_stems}")

Porter Stemmer: ['run', 'runner', 'run', 'ran', 'run', 'easili', 'fairli', 'better', 'mice', 'feet']
Snowball Stemmer: ['run', 'runner', 'run', 'ran', 'run', 'easili', 'fair', 'better', 'mice', 'feet']


## Lemmatization

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.

While stemming often involves rule-based chopping of ends of words, lemmatization involves a linguistic approach to reduce a word to its base or root form. Lemmatization uses vocabulary and morphological analysis, often with the aid of part-of-speech tagging, to return the base or dictionary form of a word, known as the lemma.

### The Role of Part-of-Speech Tagging in Lemmatization

Part-of-speech (POS) tagging is crucial in lemmatization because many words have different lemmas based on their part of speech in a sentence. For example, the word "saw" can be a verb or a noun, and the lemma would differ accordingly ("see" for the verb, "saw" for the noun).

Both `nltk` and `spacy` ship with POS taggers and lemmatizers.

### Advantages and Disadvantages of Lemmatization

**Advantages:**
- Produces lemmas, which are actual words, improving interpretability.
- Generally more accurate than stemming because it uses vocabulary and part-of-speech information.

**Disadvantages:**
- More computationally expensive than stemming.
- Requires additional information (POS tags).
- May not improve performance significantly more than stemming for some applications.

In [11]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')



True

In [12]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

In [13]:
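# Define a function to convert a POS tag to the format recognized by the lemmatizer.
# Note: each word is tagged in isolation here, so the tagger sees no sentence context.
def get_wordnet_pos(word):
    """Map the first letter of the word's Penn Treebank tag to the WordNet POS constant lemmatize() accepts."""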
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [14]:
# Lemmatize words with POS tags
lemmatized_words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words]
print("Lemmatized Words:", lemmatized_words)

Lemmatized Words: ['run', 'runner', 'run', 'ran', 'run', 'easily', 'fairly', 'well', 'mouse', 'foot']


In [15]:
# Let's use spacy on another example to demonstrate pos and lemmatization
sentence = "The striped bats are hanging on their feet for better sleep."

In [16]:
doc = nlp(sentence)

In [17]:
# POS tagging and lemmatization
print(f"{'Text':{8}} {'Lemma':{8}} {'POS':{6}} {'Tag':{6}} {'Explanation'}")
print()
for token in doc:
    print(f"{token.text:{8}} {token.lemma_:{8}} {token.pos_:{6}} {token.tag_:{6}} {spacy.explain(token.tag_)}")

Text     Lemma    POS    Tag    Explanation

The      the      DET    DT     determiner
striped  striped  ADJ    JJ     adjective (English), other noun-modifier (Chinese)
bats     bat      NOUN   NNS    noun, plural
are      be       AUX    VBP    verb, non-3rd person singular present
hanging  hang     VERB   VBG    verb, gerund or present participle
on       on       ADP    IN     conjunction, subordinating or preposition
their    their    PRON   PRP$   pronoun, possessive
feet     foot     NOUN   NNS    noun, plural
for      for      ADP    IN     conjunction, subordinating or preposition
better   well     ADJ    JJR    adjective, comparative
sleep    sleep    NOUN   NN     noun, singular or mass
.        .        PUNCT  .      punctuation mark, sentence closer


In [18]:
# Output the lemmatized form of each word
lemmatized_sentence = " ".join([token.lemma_ for token in doc])
print(lemmatized_sentence)

the striped bat be hang on their foot for well sleep .


# Language Modeling

Language modeling is a critical task that deals with predicting the probability of a sequence of words. It is used in various applications such as speech recognition, machine translation, and text generation.

## Formula

A language model is a probabilistic model that assigns a probability to a sequence of words, effectively capturing the likelihood that the sequence will occur in a language. In mathematical terms, given a sequence of words $ w_1, w_2, \ldots, w_n $, the language model estimates the probability:

$$ P(w_1, w_2, \ldots, w_n) $$

This probability can be decomposed using the chain rule of probability as:

$$ P(w_1, w_2, \ldots, w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot \ldots \cdot P(w_n | w_1, w_2, \ldots, w_{n-1}) $$

## N-gram Models

An n-gram is a contiguous sequence of $ n $ items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs according to the application. In the context of language modeling, we are typically talking about words. An n-gram language model approximates the probability of each word in a sequence by conditioning only on the $ n-1 $ previous words instead of the full history. This is known as the Markov assumption.

### Unigram Models

A unigram model is the simplest form of a statistical language model. It assumes that the probability of a word is independent of the words before it.

$$ P(w_1, w_2, \ldots, w_n) = P(w_1) \cdot P(w_2) \cdot \ldots \cdot P(w_n) $$

### Bigram Models

A bigram model, also known as a 2-gram model, assumes that the probability of a word depends only on the immediately preceding word.

$$ P(w_n | w_1, w_2, \ldots, w_{n-1}) \approx P(w_n | w_{n-1}) = \frac{Count(w_{n-1}, w_n)}{Count(w_{n-1})} $$

### Trigram Models and Higher-Order Models

Trigram models extend this to consider the two preceding words, and higher-order models consider more history. However, as the history increases, these models become more complex and require more data to estimate the probabilities accurately.

### Challenges

**Sparsity**: As $ n $ increases, the likelihood of encountering unseen n-grams (those not present in the training corpus) increases, leading to sparsity.

**Curse of Dimensionality**: With a vocabulary of $ V $ words there are $ V^n $ possible n-grams, so the number of parameters to be estimated grows exponentially with $ n $.

### Smoothing Techniques

Smoothing techniques are used to handle the issue of zero probabilities for unseen n-grams. Common techniques include:

- **Add-One (Laplace) Smoothing**: Adding one to all the n-gram counts.
- **Add-k Smoothing**: Adding a small constant $ k $ to the counts.
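- **Backoff and Interpolation**: Backoff falls back to a lower-order n-gram probability when the higher-order n-gram has a zero count; interpolation always mixes the probabilities of several orders using weights that sum to one.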
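
As a rough sketch of add-k smoothing for bigrams, the snippet below adds $ k $ to every bigram count and $ k \cdot V $ to the denominator, where $ V $ is the vocabulary size. The function name `add_k_bigram_prob` and the six-word toy corpus are assumptions made for illustration.

```python
from collections import Counter

def add_k_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Estimate P(word | prev) with add-k smoothing; k=1 gives Laplace smoothing."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

toks = "the cat sat on the mat".split()
uni = Counter(toks)
bi = Counter(zip(toks, toks[1:]))
V = len(uni)  # 5 distinct words

seen = add_k_bigram_prob("the", "cat", bi, uni, V)    # (1 + 1) / (2 + 5)
unseen = add_k_bigram_prob("the", "sat", bi, uni, V)  # (0 + 1) / (2 + 5)
print(seen, unseen)
```

The unseen bigram ("the", "sat") now gets a small non-zero probability, at the cost of taking probability mass away from the bigrams that were actually observed.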

In [19]:
from nltk import bigrams
from collections import Counter, defaultdict

In [20]:
# Sample corpus
corpus = "I am Sam. Sam I am. I do not like green eggs and ham."

In [21]:
# Tokenize the corpus
tokens = nltk.word_tokenize(corpus)

In [22]:
# Calculate bigram frequencies
bigram_freqs = Counter(bigrams(tokens))

In [23]:
# Calculate total number of bigrams
total_bigrams = sum(bigram_freqs.values())

In [24]:
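# Calculate the relative frequency of each bigram among all bigrams.
# Note: this estimates the joint P(w_{n-1}, w_n), not the conditional P(w_n | w_{n-1}).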
bigram_probs = {bigram: freq / total_bigrams for bigram, freq in bigram_freqs.items()}

In [25]:
# Display bigram probabilities
for bigram, prob in bigram_probs.items():
    print(f"Probability of {bigram}: {prob}")

Probability of ('I', 'am'): 0.125
Probability of ('am', 'Sam'): 0.0625
Probability of ('Sam', '.'): 0.0625
Probability of ('.', 'Sam'): 0.0625
Probability of ('Sam', 'I'): 0.0625
Probability of ('am', '.'): 0.0625
Probability of ('.', 'I'): 0.0625
Probability of ('I', 'do'): 0.0625
Probability of ('do', 'not'): 0.0625
Probability of ('not', 'like'): 0.0625
Probability of ('like', 'green'): 0.0625
Probability of ('green', 'eggs'): 0.0625
Probability of ('eggs', 'and'): 0.0625
Probability of ('and', 'ham'): 0.0625
Probability of ('ham', '.'): 0.0625
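
The cells above divide each bigram count by the total number of bigrams, which gives joint relative frequencies. The conditional estimate from the bigram formula, $ Count(w_{n-1}, w_n) / Count(w_{n-1}) $, can be sketched as follows. To keep the snippet dependency-free, the same corpus is written out as a hand-tokenized list; the variable names are assumptions for this example.

```python
from collections import Counter

# Same toy corpus as above, tokenized by hand.
toy_tokens = ["I", "am", "Sam", ".", "Sam", "I", "am", ".", "I", "do", "not",
              "like", "green", "eggs", "and", "ham", "."]

bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))
# Count each word only where it starts a bigram (the final token starts none),
# so the conditional probabilities for a given history sum to 1.
history_counts = Counter(toy_tokens[:-1])

cond_probs = {
    (prev, word): count / history_counts[prev]
    for (prev, word), count in bigram_counts.items()
}

print(cond_probs[("I", "am")])    # 2 of the 3 occurrences of "I" are followed by "am"
print(cond_probs[("am", "Sam")])  # 1 of the 2 occurrences of "am" is followed by "Sam"
```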


# Vector Space Model

The Vector Space Model (VSM) is a mathematical model used to represent text documents as vectors of identifiers, such as index terms. It is used in information retrieval and text mining to measure the similarity between documents. In VSM, each dimension corresponds to a separate term, and the value in each dimension represents the significance of the term in the document.

## Term-Document Matrix

In VSM, a Term-Document Matrix is a mathematical representation of a text corpus. It describes the frequency of terms that occur in the collection of documents. In a Term-Document Matrix, rows correspond to terms in the corpus while columns correspond to documents. Each entry in this matrix denotes the frequency or the weight of a term in a document.

Here's a simple example of a Term-Document Matrix for five documents:

![tf](../assets/tf.png)

Such matrices are often sparse since not all words appear in all documents.

## TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

The TF-IDF value is calculated as follows:

- **Term Frequency (TF)**, which measures how frequently a term occurs in a document. It is calculated as the number of times a term `t` appears in a document `d`, divided by the total number of terms in the document.

$$ TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

- **Inverse Document Frequency (IDF)**, which measures how important a term is within the entire corpus. It is calculated as the logarithm of the number of documents in the corpus divided by the number of documents where the term `t` appears.

$$ IDF(t, D) = \log \left( \frac{\text{Total number of documents in corpus } D}{\text{Number of documents with term } t} \right) $$

- **TF-IDF**, the product of TF and IDF:

$$ TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D) $$
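
The three formulas can be computed directly with the standard library. The sketch below is a minimal illustration on a hand-tokenized toy corpus; the function names `tf`, `idf` and `tf_idf` are assumptions, and `idf` assumes the term occurs in at least one document.

```python
import math

docs = [
    "the sky is blue".split(),
    "the sun is bright".split(),
    "the sun in the sky is bright".split(),
]

def tf(term, doc):
    # Occurrences of the term divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Assumes the term appears in at least one document of the corpus.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("the", docs[0], docs))   # "the" is in every document, so IDF = log(1) = 0
print(tf_idf("blue", docs[0], docs))  # (1/4) * log(3/1)
```

Library implementations often use variants of these formulas. scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed IDF and L2-normalizes each document vector by default, so its values differ from the ones computed here.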

## Cosine Similarity

Cosine similarity is a measure used to determine how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. This metric is, therefore, a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is particularly convenient in positive space. Because TF-IDF weights are never negative, the cosine similarity of two document vectors is bounded in [0, 1]: 0 means the documents share no terms, and 1 means their term weights are proportional, as they are for identical content.

The formula for calculating the cosine similarity between two vectors $ A $ and $ B $ is:

$$ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} $$
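
The formula translates directly into code. The helper below is a minimal sketch; the name `cosine_similarity_manual` is an assumption chosen to avoid clashing with the scikit-learn function used later, and it assumes neither vector is all zeros.

```python
import math

def cosine_similarity_manual(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity_manual([1, 2, 0], [2, 4, 0]))  # same direction, different length
print(cosine_similarity_manual([1, 0, 0], [0, 3, 0]))  # orthogonal
print(cosine_similarity_manual([1, 2], [-1, -2]))      # opposite direction
```

The first pair differs only in magnitude, so the similarity is 1 up to floating-point rounding; the second and third pairs give 0 and -1 respectively.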

## Limitations

- Assumes term independence, which is not always the case.
- Does not capture the semantic relationship between words.
- High-dimensional and sparse vectors due to the size of the vocabulary.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer  # CountVectorizer is the raw-count alternative
from sklearn.metrics.pairwise import cosine_similarity

In [27]:
# Sample documents
documents = [
    'The sky is blue',
    'The sun is bright',
    'The sun in the sky is bright',
    'We can see the shining sun, the bright sun'
]

In [28]:
# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

In [29]:
# Vectorize the documents
tfidf_matrix = vectorizer.fit_transform(documents)

In [30]:
# Display the TF-IDF matrix
print(tfidf_matrix.toarray())

[[0.65919112 0.         0.         0.         0.42075315 0.
  0.         0.51971385 0.         0.34399327 0.        ]
 [0.         0.52210862 0.         0.         0.52210862 0.
  0.         0.         0.52210862 0.42685801 0.        ]
 [0.         0.3218464  0.         0.50423458 0.3218464  0.
  0.         0.39754433 0.3218464  0.52626104 0.        ]
 [0.         0.23910199 0.37459947 0.         0.         0.37459947
  0.37459947 0.         0.47820398 0.39096309 0.37459947]]


In [31]:
# Calculate cosine similarity between the 2nd document and every document (including itself)
cosine_similarities = cosine_similarity(tfidf_matrix[1], tfidf_matrix)
print(cosine_similarities)

[[0.36651513 1.         0.72875508 0.54139736]]


In [32]:
# Let's initialize another TF-IDF Vectorizer, with stop words removal
vectorizer = TfidfVectorizer(stop_words='english')

In [33]:
# Vectorize the documents
tfidf_matrix = vectorizer.fit_transform(documents)

In [34]:
# Display the TF-IDF matrix
print(tfidf_matrix.toarray())

[[0.78528828 0.         0.         0.6191303  0.        ]
 [0.         0.70710678 0.         0.         0.70710678]
 [0.         0.53256952 0.         0.65782931 0.53256952]
 [0.         0.36626037 0.57381765 0.         0.73252075]]


In [35]:
# Calculate cosine similarity between the 2nd document and every document (including itself)
cosine_similarities = cosine_similarity(tfidf_matrix[1], tfidf_matrix)
print(cosine_similarities)

[[0.         1.         0.75316704 0.77695558]]


# Word Embeddings

## What are Word Embeddings?

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging NLP problems.

In essence, word embeddings are a form of word representation that bridges the human understanding of language to that of a machine. They map words to dense vectors in a continuous space (typically tens to a few hundred dimensions, far fewer than the vocabulary size), where words that have similar meanings are located in close proximity to one another.

## Why Use Word Embeddings?

Traditional language models often represent words as one-hot encoded vectors where each word is represented by a vector with a dimensionality equal to the size of the vocabulary. The main issue with this approach is that the resulting vectors are sparse and do not capture any information about word relationships.

Word embeddings address this by providing a dense representation where similar words have a similar encoding. Importantly, word embeddings can capture nuances about words, such as their semantic and syntactic information.
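
The limitation of one-hot vectors can be seen in a short sketch. The three-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for this example.

```python
vocab = ["king", "queen", "apple"]

def one_hot(word):
    """Vector with a 1 at the word's index and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(one_hot("king"))
print(dot(one_hot("king"), one_hot("queen")))  # distinct words never overlap
print(dot(one_hot("king"), one_hot("apple")))
```

Every pair of distinct words has a dot product of 0, so "king" is exactly as far from "queen" as it is from "apple". The vector length also grows with the vocabulary, which is what makes the representation sparse.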

## Word2Vec

Word2Vec is a popular algorithm to produce word embeddings by training a neural network with a single hidden layer. Word2Vec comes with two model architectures:

### Architecture (CBOW and Skip-gram)

- **Continuous Bag of Words (CBOW)**: The CBOW model predicts the current word based on the context, and the context is represented as a bag of words. Hence, the order of words in the context does not influence prediction (bag of words model).

- **Skip-gram**: The Skip-gram model works in the reverse manner: it tries to predict the context words for a given word.

### Training Word2Vec

The Word2Vec model is trained with either one of these architectures. Both learn word vectors that are useful for prediction: CBOW predicts a word from its surrounding context, and Skip-gram predicts the surrounding context from a word.

Given a training sequence of words $ w_1, w_2, \ldots, w_T $, the objective of the Skip-gram model is to maximize the average log probability of the context words around each center word:

$$ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} | w_t) $$

where $ c $ is the size of the training context (which can be a function of the center word $ w_t $). Writing $ w_I $ for the input (center) word and $ w_O $ for the output (context) word, $ p(w_{t+j} | w_t) $ is defined using the softmax function:

$$ p(w_O | w_I) = \frac{\exp({v'_{w_O}}^T v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^T v_{w_I})} $$

where $ v_w $ and $ v'_w $ are the "input" and "output" vector representations of $ w $, and $ W $ is the number of words in the vocabulary.
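
A small numeric sketch of this softmax is shown below. The three-word vocabulary and the 2-dimensional "input" and "output" vectors are made-up numbers for illustration, and the function name `skipgram_softmax` is an assumption.

```python
import math

# Toy vectors for a 3-word vocabulary; the values are invented for illustration.
vocab = ["cat", "dog", "car"]
v_in = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
v_out = {"cat": [0.8, 0.1], "dog": [1.0, 0.0], "car": [0.1, 0.9]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def skipgram_softmax(center):
    """Return p(w_O | center) for every candidate w_O in the vocabulary."""
    scores = [dot(v_out[w], v_in[center]) for w in vocab]
    max_score = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return {w: e / total for w, e in zip(vocab, exps)}

probs = skipgram_softmax("cat")
print(probs)
```

The denominator sums over the whole vocabulary, which is expensive when $ W $ is large. Practical Word2Vec implementations therefore usually replace the full softmax with an approximation such as negative sampling or hierarchical softmax.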

### Applications

One of the fascinating properties of Word2Vec embeddings is their ability to capture analogies and relationships between words. The classic example often cited to demonstrate this is the relationship between "man" and "woman," and "king" and "queen."

Word2Vec can capture these relationships because it learns vector representations of words in such a way that the geometric relationships between the vectors capture semantic relationships between the words. For instance, the difference between the vectors for "man" and "woman" often encodes the concept of gender. Similarly, the difference between "king" and "queen" captures the same concept of gender.

![word2vec](../assets/word2vec.png)

In practice, this means that if we take the vector for "king," subtract the vector for "man," and then add the vector for "woman," we end up with a vector that is close to the vector for "queen." Mathematically, this relationship can be represented as:

$$ \textbf{vector}('king') - \textbf{vector}('man') + \textbf{vector}('woman') \approx \textbf{vector}('queen') $$
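
The arithmetic can be illustrated with hand-made vectors. The 2-dimensional values below are invented so that one axis loosely stands for "royalty" and the other for "gender"; they are not learned embeddings, and the helper `cosine` is an assumption for this example.

```python
import math

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Purely illustrative.
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman
target = [k - m + w for k, m, w in zip(toy_vectors["king"],
                                       toy_vectors["man"],
                                       toy_vectors["woman"])]

# Exclude the three query words, as analogy evaluations usually do.
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print(best)
```

With a trained gensim model, the analogous query is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`. A model trained on a small corpus will not necessarily reproduce the textbook answer.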

Let's use the `20 Newsgroups` dataset, which is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups.

In [56]:
from sklearn.datasets import fetch_20newsgroups
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

In [57]:
# Fetch the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

In [58]:
# Preprocess the text using gensim's simple_preprocess
# This tokenizes the text, lowercases it, and drops punctuation as well as very short (1 character) and very long (over 15 characters) tokens
corpus = [simple_preprocess(doc) for doc in newsgroups_train.data]

In [51]:
corpus[0]

['am', 'sure', 'some', 'bashers', 'of', 'pens', 'fans', 'are', 'pretty',
 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts',
 'about', 'the', 'recent', 'pens', 'massacre', 'of', 'the', 'devils',
 'actually', 'am', 'bit', 'puzzled', 'too', 'and', 'bit', 'relieved',
 'however', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'non',
 'pittsburghers', 'relief', 'with', 'bit', 'of', 'praise', 'for', 'the',
 'pens', 'man', 'they', 'are', 'killing', 'those', 'devils', 'worse',
 'than', 'thought', 'jagr', 'just', 'showed', 'you', 'why', 'he', 'is',
 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', 'he', 'is',
 'also', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs',
 'bowman', 'should', 'let', 'jagr', 'have', 'lot', 'of', 'fun', 'in', 'the',
 'next', 'couple', 'of', 'games', 'since', 'the', 'pens', 'are', 'going',
 'to', 'beat', 'the', 'pulp', 'out', ...

In [59]:
# Train the Word2Vec model
model = Word2Vec(corpus, vector_size=100, window=5, min_count=5, workers=4)

# vector_size: dimensionality of the learned word vectors
# window: words up to 5 positions away count as context for the target word
# min_count: words appearing fewer than 5 times in the corpus are dropped and become out-of-vocabulary (OOV)
# workers: number of worker threads used for training
# sg is left at its default (0), so this trains the CBOW architecture; pass sg=1 for Skip-gram

In [63]:
len(model.wv)

26415

In [64]:
# Explore the model
# Let's find the most similar words to 'computer'
similar_words = model.wv.most_similar('computer', topn=10)
print("Most similar words to 'computer':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

Most similar words to 'computer':
network: 0.7449
computers: 0.7254
workstation: 0.7233
project: 0.6973
shopper: 0.6936
tech: 0.6903
lab: 0.6875
library: 0.6865
programming: 0.6849
graphics: 0.6814


In [65]:
model.wv.get_vector('car')

array([-0.8774082 ,  0.53276944, -0.32874092,  0.88266826, -0.8883441 ,
       -0.15561628, -0.15580866, -0.7052471 , -0.76239884, -0.1923875 ,
       -0.1924472 , -0.15094143, -1.6838006 , -1.6016271 ,  0.22194852,
        0.72246164, -0.52380466, -1.4295394 , -0.36418247,  2.6347096 ,
        0.81822747,  2.6630943 ,  0.6631937 , -0.90736663,  1.2503573 ,
        0.03981446,  0.45485   , -0.5701292 , -0.3246041 , -0.14345819,
       -1.5114781 ,  2.816173  , -0.21027419,  0.08023712, -2.0966618 ,
        0.54303384,  0.02905747,  0.31574315, -1.5534451 , -2.046396  ,
        1.2995341 ,  0.49864534,  0.22088557,  0.70367074,  1.97401   ,
       -0.5607663 , -1.1499857 , -0.08362885, -0.9210121 , -1.2266659 ,
        0.61028874,  0.73315275,  0.64478916, -0.9818015 , -0.42163518,
       -0.5276648 , -0.9552855 ,  2.0337946 ,  1.0988691 ,  0.3882128 ,
       -0.43149963,  1.9364501 ,  0.54808974, -2.835774  , -0.6691579 ,
        0.49988824, -0.83838195,  1.2940536 , -0.5815418 , -2.12

In [66]:
model.wv.similarity('car', 'engine')

0.69244367

In [68]:
similar_words = model.wv.most_similar('car', topn=10)
print("Most similar words to 'car':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

Most similar words to 'car':
bike: 0.8367
helmet: 0.7180
tires: 0.7116
seat: 0.7113
gear: 0.7003
engine: 0.6924
battery: 0.6881
dealer: 0.6715
front: 0.6579
bought: 0.6566


# Advanced Embeddings

## GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations of words. It stands for "Global Vectors for Word Representation," and it is specifically designed to capture global word-word co-occurrence statistics from a corpus. The resulting representations showcase interesting linear substructures of the word vector space.

### How GloVe Works

The GloVe model is trained on the non-zero elements of a word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. Rather than updating its parameters one local context window at a time, as Word2Vec does, GloVe first aggregates co-occurrence counts across the whole corpus into an explicit matrix and then fits the word vectors to those global statistics.
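As a rough illustration of the co-occurrence matrix described above, the sketch below counts word pairs in a tiny, made-up corpus. The function name `cooccurrence_counts`, the `window` size, and the toy sentences are assumptions for this example; it uses plain counts and stores only the non-zero entries in a dictionary.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context) pair appears within `window` tokens."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1.0
    return dict(counts)

# Hypothetical toy corpus, already tokenized
toy_corpus = [
    ["ice", "is", "cold"],
    ["steam", "is", "hot"],
    ["ice", "is", "solid"],
]
X = cooccurrence_counts(toy_corpus, window=2)
print(X[("ice", "is")])   # 2.0: the pair occurs in two sentences
print(X[("is", "hot")])   # 1.0
```

Pairs that never co-occur, such as `("ice", "hot")`, are simply absent from `X`, which mirrors training on the non-zero elements only.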

The model then fits a weighted least-squares factorization of this matrix, yielding a word vector space in which the dot product of two word vectors approximates the logarithm of the words' co-occurrence count. As a consequence, differences between word vectors reflect ratios of co-occurrence probabilities.

Given a co-occurrence matrix $X$, where $X_{ij}$ denotes the number of times word $j$ occurs in the context of word $i$, the GloVe model aims to learn a vector $w_i$ for each word $i$ such that the dot product $w_i^T w_j$ approximates the logarithm of $X_{ij}$, up to additive bias terms.

The training objective of GloVe is:

$$ J = \sum_{i,j=1}^V f(X_{ij}) (w_i^T w_j + b_i + b_j - \log X_{ij})^2 $$

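where $V$ is the size of the vocabulary, $b_i$ and $b_j$ are scalar bias terms for words $i$ and $j$, and $f$ is a weighting function that gives rare co-occurrences less weight and caps the weight of very frequent ones, so that neither dominates training. Because $f(0) = 0$, pairs that never co-occur drop out of the sum.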
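To make the objective concrete, here is a minimal sketch that evaluates $J$ for a handful of toy values. It follows the formula as written here, with a single set of vectors $w$ and biases $b$. The weighting function uses the form and defaults suggested in the GloVe paper ($x_{max} = 100$, $\alpha = 0.75$); the function names and all numbers in the toy data are assumptions for illustration.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: grows as (x / x_max) ** alpha, then is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, w, b):
    """Evaluate J for sparse counts X, word vectors w, and biases b.

    X maps (i, j) index pairs to non-zero co-occurrence counts.
    """
    total = 0.0
    for (i, j), x_ij in X.items():
        dot = sum(a * c for a, c in zip(w[i], w[j]))
        err = dot + b[i] + b[j] - math.log(x_ij)
        total += glove_weight(x_ij) * err ** 2
    return total

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow
X = {(0, 1): 100.0, (1, 0): 100.0}
w = [[1.0, 0.0], [2.0, 1.0]]
b = [0.0, 0.0]
print(glove_loss(X, w, b))
```

Training would adjust `w` and `b` by gradient descent to reduce this value; the sketch only evaluates the loss once.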

## FastText

FastText is another word embedding method that extends Word2Vec to consider subword information. This means that it takes into account the internal structure of words while learning word representations. FastText is particularly useful for languages with rich morphology and for understanding words outside the training vocabulary.

### How FastText Works

FastText represents each word as a bag of character n-grams, in addition to the word itself. This means that the word "apple" with $n=3$ would be represented as the following n-grams: "<ap", "app", "ppl", "ple", "le>" (where "<" and ">" are added to denote the beginning and end of the word, respectively).

The model then learns vector representations for these character n-grams, and the word vector is computed as the sum of the n-gram vectors. This allows FastText to produce a representation for a word not seen during training by summing the vectors of its component n-grams.
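The n-gram decomposition described above can be sketched in a few lines. The helper name `char_ngrams` is an assumption for this example, and it extracts n-grams of a single length only.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` after adding boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
```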

In [42]:
from gensim.models.fasttext import FastText

In [43]:
# Train the FastText model
# The `vector_size` parameter specifies the dimensionality of the word vectors,
# `window` specifies the maximum distance between the current and predicted word within a sentence
# `min_count` specifies the minimum count of words to consider
# `workers` specifies the number of worker threads to train the model.
ft_model = FastText(vector_size=100, window=5, min_count=5, workers=4)
ft_model.build_vocab(corpus_iterable=corpus)
ft_model.train(corpus_iterable=corpus, total_examples=len(corpus), epochs=5)

(12579369, 16476320)

In [44]:
# Let's find the most similar words to 'computer'
similar_words = ft_model.wv.most_similar('computer', topn=10)
print("Most similar words to 'computer':")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

Most similar words to 'computer':
microcomputer: 0.9610
supercomputer: 0.9562
compute: 0.9279
computrac: 0.9031
compusa: 0.8849
computers: 0.8772
compuadd: 0.8659
compulink: 0.8654
compulsion: 0.8566
computes: 0.8551


In [55]:
ft_model.wv.similarity('glamping', 'camping')

0.94228536

In [69]:
'glamping' in ft_model.wv.key_to_index

False

In [70]:
'f1' in ft_model.wv.key_to_index

False

In [71]:
ft_model.wv.similarity('car', 'f1')

0.052527193
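Neither 'glamping' nor 'f1' is in the vocabulary, yet both similarity calls above return a value, because FastText builds vectors for unseen words from their character n-grams. The sketch below is a rough illustration of why the two results differ so much. It looks at 3-grams only (gensim's FastText uses n-gram lengths 3 through 6 by default), and the helper names are assumptions for this example. Shared n-grams are only a proxy for vector similarity, not the model's actual computation.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of `word` with boundary markers."""
    marked = f"<{word}>"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

def shared_ngrams(a, b, n=3):
    """N-grams that two words have in common."""
    return char_ngrams(a, n) & char_ngrams(b, n)

print(sorted(shared_ngrams("glamping", "camping")))  # ['amp', 'ing', 'mpi', 'ng>', 'pin']
print(sorted(shared_ngrams("car", "f1")))            # []
```

'glamping' shares most of its 3-grams with 'camping', so its composed vector lands close by, while 'f1' shares none with 'car'. Subword information helps with spelling-related words but says nothing about meaning that is not reflected in the characters.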

# Text Classification with Naive Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features. They are simple, fast to train, and often serve as strong baselines in machine learning and natural language processing (NLP) for text classification tasks such as spam filtering and sentiment analysis.

## Understanding Naive Bayes

Bayes' theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:

$$ P(y \mid x_1, \ldots, x_n) = \frac{P(y) P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)} $$

In Naive Bayes classification, we are interested in finding the class with the highest probability given the features. The "naive" assumption of conditional independence between every pair of features, given the value of the class variable, simplifies the computation as follows:

$$ P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^n P(x_i \mid y) $$

Because the denominator $P(x_1, \ldots, x_n)$ is the same for every class, it does not affect which class has the maximum probability, so we can drop it and use the following classification rule:

$$ \hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i \mid y) $$
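The classification rule can be worked through on a toy example. The sketch below is a minimal from-scratch version that treats words as features, works in log space to avoid multiplying many small probabilities, and applies add-one (Laplace) smoothing so that unseen word/class pairs do not get zero probability. The smoothing choice, the function names, and the tiny dataset are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log priors and add-one-smoothed log likelihoods from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {t: math.log((word_counts[c][t] + 1) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(tokens, log_prior, log_like):
    """Apply the argmax rule in log space; words outside the vocabulary are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy training set
docs = [["win", "money", "now"], ["meeting", "at", "noon"],
        ["win", "prize", "money"], ["lunch", "meeting", "today"]]
labels = ["spam", "ham", "spam", "ham"]

log_prior, log_like = train_nb(docs, labels)
print(predict_nb(["win", "money"], log_prior, log_like))      # spam
print(predict_nb(["meeting", "today"], log_prior, log_like))  # ham
```

Summing log probabilities gives the same argmax as multiplying the probabilities themselves, since the logarithm is monotonic.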

## Naive Bayes in NLP

In NLP, Naive Bayes classifiers are commonly applied to text classification problems. When dealing with text, the features are usually the frequency or presence of words. For example, in a spam filtering application, the features might be the presence or frequency of specific words or sequences of words in an email.

### Multinomial Naive Bayes

The Multinomial Naive Bayes classifier is a specific instance of a Naive Bayes classifier which is widely used for document classification problems. It accounts for the number of occurrences of each word (term frequency) for classification.

### Bernoulli Naive Bayes

The Bernoulli Naive Bayes classifier is suitable when your feature vectors are binary (i.e., 0s and 1s). An example is text classification with a bag-of-words model in which each 1 or 0 represents the presence or absence of a word in the document.
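The difference between the two variants comes down to how a document is turned into features. The sketch below builds both representations for one made-up document; the vocabulary, the document, and the function names are assumptions for illustration.

```python
from collections import Counter

def count_features(tokens, vocab):
    """Term-frequency vector: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def binary_features(tokens, vocab):
    """Presence vector: 1 if the vocabulary word occurs at all, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

vocab = ["free", "money", "meeting"]
tokens = "free money free free".split()

print(count_features(tokens, vocab))   # [3, 1, 0]  (input for Multinomial Naive Bayes)
print(binary_features(tokens, vocab))  # [1, 1, 0]  (input for Bernoulli Naive Bayes)
```

The multinomial model sees that "free" occurs three times, while the Bernoulli model only sees that it occurs.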

In [45]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

In [46]:
# Fetch the dataset
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

In [47]:
# Create a pipeline that vectorizes the data then applies Multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())

In [48]:
# Train the model
model.fit(newsgroups_train.data, newsgroups_train.target)

Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('multinomialnb', MultinomialNB())])

In [49]:
# Predict the categories of the test data
predicted_categories = model.predict(newsgroups_test.data)

In [50]:
# Evaluate the model
print(classification_report(newsgroups_test.target, predicted_categories, target_names=newsgroups_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.92      0.90      0.91       319
         comp.graphics       0.95      0.95      0.95       389
               sci.med       0.96      0.91      0.93       396
soc.religion.christian       0.91      0.97      0.94       398

              accuracy                           0.93      1502
             macro avg       0.93      0.93      0.93      1502
          weighted avg       0.93      0.93      0.93      1502

