# Understanding Language as Data

## **Outline**


- Introduction to NLP techniques and applications.
- Limitations of simple vectorization techniques like Bag-of-Words.
- Embeddings: mathematical text representation in a continuous vector space.
- **Hands-on Lab:** Find similar documents.


## **Natural Language Processing (NLP)**

- Natural Language Processing (NLP) is a branch of artificial intelligence (AI).
- It focuses on the interaction between computers and human language.
- NLP enables computers to understand, process, and generate human language as data.





## **Why NLP Matters**

- NLP has practical applications in various domains:
  - Customer service chatbots
  - Sentiment analysis of social media data
  - Language translation
  - Content generation
- It bridges the gap between human communication and AI systems.




## **NLP Tasks**

- NLP involves a range of tasks, including:
  - Understanding language structure
  - Text completion
  - Text generation
  - Dialogue systems (chatbots)
  - Story generation
- These tasks have real-world applications and are continuously evolving.


## **Understanding Language as Data**

**Tokenization Example**

- Tokenization is the process of splitting text into words or phrases.
- It's a fundamental step in NLP.
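
As a rough illustration of the idea, the sketch below splits a sentence into word and punctuation tokens with a regular expression. The function name `simple_tokenize` and the regex are assumptions made for this example; real tokenizers (such as the NLTK and spaCy ones used later in this notebook) handle many more cases.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Apple Inc. is headquartered in Cupertino, California.")
print(tokens)
```

Note that lowercasing discards information that tasks such as NER may rely on, so whether to do it depends on the task.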

**Named Entity Recognition (NER) Demo**

- Visit the [spaCy NER Demo](https://explosion.ai/demos/displacy-ent).
- Input a sentence with named entities, e.g., "Apple Inc. is headquartered in Cupertino, California."
- Click "Visualize" to see entity recognition.


## **Language Generation Tasks**

**Text Completion**

- Demonstrated by Google Search's auto-suggestion feature.
- It suggests the next word or phrase as you type your search query.

**Text Generation**

- Try the "GPT-3 Playground" by OpenAI.
- Enter prompts like "Once upon a time, in a land far, far away..."
- Click "Create" to generate creative text.


## **Text Classification**

* Text classification is a common task where text sequences are categorized.
    * **Examples**:
        * Classifying e-mails as **spam** or **not spam**.
        * Categorizing news articles into **sport**, **business**, **politics**, etc.

* In chatbot development, it's crucial to understand what the user wants. This is called **intent classification**.
    * **Example**:
        * User says: "How's the weather tomorrow?"
        * Bot understands the intent as: **Check Weather**

Note: With intent classification, there can often be a multitude of categories.
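
A minimal sketch of intent classification, assuming a TF-IDF representation and a logistic regression classifier from scikit-learn. The utterances and the intent labels (`check_weather`, `set_alarm`) are made up for illustration; a real chatbot would need far more training data and usually many more intents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training utterances and intent labels, for illustration only
utterances = [
    "what is the weather like today",
    "will it rain tomorrow",
    "is it going to be sunny this weekend",
    "set an alarm for 7 am",
    "wake me up at six",
    "set a timer for ten minutes",
]
intents = ["check_weather"] * 3 + ["set_alarm"] * 3

# Vectorize the text, then fit a linear classifier on the vectors
intent_classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
intent_classifier.fit(utterances, intents)

prediction = intent_classifier.predict(["How's the weather tomorrow?"])[0]
print(prediction)
```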



## **Sentiment Analysis**

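* **Sentiment analysis** is often framed as a regression problem (it can also be treated as classification into labels such as positive, neutral, and negative). In the regression setting, the aim is to assign a numerical value representing how positive or negative a sentence is.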
    * **Example**:
        * Sentence: "I love this product!"
        * Sentiment Score: +0.9 (where +1 is very positive and -1 is very negative)

* An advanced form is **aspect-based sentiment analysis** (ABSA). Here, sentiment scores are given to different parts of the sentence.
    * **Example**:
        * Sentence: "In this restaurant, I liked the cuisine, but the atmosphere was awful."
        * Sentiments:
            * Cuisine: +0.8
            * Atmosphere: -0.9






## **Keyword Extraction**

* **Keyword extraction** is akin to NER. However, it focuses on automatically extracting words vital to a sentence's meaning, without any prior training on specific entity types.
    * **Example**:
        * Sentence: "The Great Barrier Reef is a natural wonder located off the coast of Australia."
        * Extracted Keywords: **Great Barrier Reef**, **natural wonder**, **coast**, **Australia**




## **Text Clustering**

* **Text clustering** involves grouping similar text sequences or sentences. It can be particularly helpful in contexts like grouping related inquiries in technical support interactions.
    * **Example**:
        * Tech Support Messages:
            * "My device won't turn on."
            * "I can't get my gadget to power up."
            * "How do I update my software?"
        * Clustered Results:
            * Cluster 1: **Power issues** - "My device won't turn on.", "I can't get my gadget to power up."
            * Cluster 2: **Software updates** - "How do I update my software?"





## **Question Answering**

* **Question answering** is about a model's capability to respond to a specific query. Given a text passage and a question, the model identifies the segment of the text containing the answer or, in some cases, generates the answer text.
    * **Example**:
        * Passage: "The Eiffel Tower is located in Paris and was completed in 1889."
        * Question: "When was the Eiffel Tower completed?"
        * Answer: "The Eiffel Tower was completed in **1889**."




## **Text Generation**

* **Text Generation** pertains to a model's ability to produce novel text. It's like a classification task predicting the next character or word based on a given *text prompt*.
    * **Example**:
        * Prompt: "Once upon a time,"
        * Generated continuation: "in a land far away, there lived a wise old dragon."

* Advanced models, like GPT-3, can tackle other NLP tasks such as classification via techniques like [prompt programming](https://towardsdatascience.com/software-3-0-how-prompting-will-change-the-rules-of-the-game-a982fbfe1e0) or [prompt engineering](https://medium.com/swlh/openai-gpt-3-and-prompt-engineering-dcdc2c5fcd29). See also: https://platform.openai.com/docs/guides/prompt-engineering
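
To make the "predict the next word" framing concrete, here is a toy sketch that counts which word follows which in a tiny made-up text and then generates greedily. It is a bigram counter, not a neural network, and the training text and function names are assumptions for this example only.

```python
from collections import Counter, defaultdict

# Tiny made-up training text, for illustration only
training_text = "the dragon lived in a cave . the dragon was wise . the king lived in a castle ."
tokens = training_text.split()

# Count how often each word follows each other word
next_word_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Pick the most frequent follower of `word`, or None if the word was never seen."""
    followers = next_word_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

def generate(prompt_word, max_new_tokens):
    """Greedy generation: repeatedly append the predicted next word."""
    generated = [prompt_word]
    for _ in range(max_new_tokens):
        nxt = predict_next(generated[-1])
        if nxt is None:
            break
        generated.append(nxt)
    return generated

print(generate("the", 5))
```

A neural language model replaces the count table with a learned function of the whole preceding context, but the generation loop has the same shape.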




## **Text Summarization**

* **Text summarization** is about enabling a computer to "read" lengthy text and condense it into a short, coherent summary.
    * **Example**:
        * Original Text: "The solar system consists of the Sun and the objects that orbit it. These objects include planets, dwarf planets, moons, and asteroids. The largest planet in the solar system is Jupiter, while Mercury is the smallest."
        * Summarized Text: "The solar system includes the Sun, planets, and other celestial objects. Jupiter is the largest planet, and Mercury is the smallest."



## **Machine Translation and NLP**

* **Machine translation** is a blend of understanding text in one language and generating text in another.
    * **Traditional Approach**:
        1. Use parsers to convert a sentence into a syntax tree.
        2. Extract higher-level semantic structures for the sentence's meaning.
        3. Generate the translated output based on this meaning and the target language's grammar.
    * **Modern Approach**: Use neural networks for more effective results in many NLP tasks.

> Traditionally, many NLP tasks were tackled using methods like grammars. However, the paradigm has shifted towards neural network-based solutions in recent years.

* **Resources**:
    - Classical methods can be found in the [Natural Language Processing Toolkit (NLTK)](https://www.nltk.org).
    - The [NLTK Book](https://www.nltk.org/book/) provides an online guide on solving NLP tasks using NLTK.

* **Course Approach**:
    * We will predominantly focus on Neural Networks for NLP and incorporate NLTK as required.





## **Neural Networks: From Images to Text**

* **Tabular Data & Images**:
    - We've explored using neural networks for fixed-size inputs like tabular data and images.
    - Images have a predetermined input size.

* **Text**:
    - Text is a variable-length sequence, making it distinct.
    - Textual patterns can be intricate. For instance, the distance between a negation and the word it applies to can vary, yet both cases should be recognized as the same pattern.
        * **Examples**:
            - "I do not like oranges."
            - "I do not like those big colorful tasty oranges."

* **Solution for Text**:
    - Traditional convolutional networks might not capture such complex patterns in text.
    - To process language effectively, we introduce new neural architectures:
        1. **Recurrent Networks**
        2. **Transformers**

## **Representing Text**

<img src="./images/ascii-character-map.png" width="500" align="center"/>


- To solve NLP tasks with neural networks, we need to represent text as tensors.
- Computers use encodings like ASCII or UTF-8 to map text characters to numbers.
- These numbers carry no meaning by themselves; a neural network must learn meaning during training.
- Two common approaches to text representation are character-level and word-level.
- In either approach, text is tokenized, each token is converted to a number, and the numbers are fed into the network, for example as one-hot vectors.
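
The sketch below shows both approaches on one short string and builds one-hot vectors for the word-level version. The example text and all variable names are assumptions for illustration; the notebook cells further down build the vocabulary from the actual dataset instead.

```python
text = "i like hot dogs"

# Character-level: every distinct character gets an index
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
char_ids = [char_to_idx[ch] for ch in text]

# Word-level: every distinct word gets an index
words = text.split()
vocab_words = sorted(set(words))
word_to_idx = {w: i for i, w in enumerate(vocab_words)}
word_ids = [word_to_idx[w] for w in words]

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

word_one_hot = [one_hot(i, len(vocab_words)) for i in word_ids]

print(char_ids)
print(word_ids)
print(word_one_hot)
```

The character-level sequence is longer but has a tiny vocabulary; the word-level sequence is shorter but its one-hot vectors grow with the vocabulary size.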


## **N-Grams**

- Precise word meanings depend on context (e.g., "neural network" vs. "fishing network").
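- We can capture some context by treating pairs of consecutive words (bi-grams) or triples (tri-grams) as single tokens.
- This approach, called n-grams, increases the dictionary size considerably, because many more word combinations than single words exist.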
- N-grams can also be used with character-level representation.
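
A small sketch of n-gram extraction, assuming whitespace tokenization. The helper name `ngrams` is chosen for this example; it works on any list, so the same function covers both word-level and character-level n-grams.

```python
def ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "i do not like oranges".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

# Character-level tri-grams of a single word
char_trigrams = ["".join(g) for g in ngrams(list("network"), 3)]

print(bigrams)
print(trigrams)
print(char_trigrams)
```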

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Bag-of-Words and TF/IDF**


<img src="./images/bow.png" width="90%"/>


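- Text classification requires a fixed-size vector representation, whatever the length of the text.
- Bag of Words (BoW) combines the representations of all words in a text into one vector, typically by counting how often each word occurs.
- These word frequencies give a rough indication of what the text is about.
- TF/IDF (Term Frequency-Inverse Document Frequency) reduces the weight of words that are common across the whole document collection, such as "the" or "is".
- A word scores highly when it is frequent in one document but rare in the rest of the collection.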
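
The following sketch computes TF/IDF by hand using the plain textbook definition, tf = count / document length and idf = log(N / document frequency). This is an assumption made for clarity: scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Three tiny pre-tokenized documents, for illustration only
toy_corpus = [
    "i like hot dogs".split(),
    "the dog ran fast".split(),
    "it is hot outside".split(),
]

def tf_idf(docs):
    """Return one {word: score} dictionary per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(toy_corpus)
print(scores[0])
```

In the first document, "hot" also occurs in another document, so it receives a lower score than "like", which is unique to that document.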

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Semantics of Text**

- The approaches above (one-hot encoding, n-grams, BoW, TF/IDF) cannot fully capture text semantics.
- More powerful neural network models are required.
- Explore further in attached notebooks.

In [None]:
!pip install datasets

In [None]:
!pip install torchtext==0.6.0

In [None]:
import torch
import torchtext
import os
import collections
os.makedirs('./data',exist_ok=True)

In [None]:
from datasets import load_dataset

ag_news_dataset = load_dataset('ag_news')

# Access training and test splits
train_dataset = ag_news_dataset['train']
test_dataset = ag_news_dataset['test']

In [None]:
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

In [None]:
# Look at the first training example without copying the whole split
train_dataset[0]

A Hugging Face `Dataset` returns a dictionary of columns when sliced, so we convert each split to a plain Python list of row dictionaries, which is easier to iterate over and slice in the cells below:

In [None]:
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)

In [None]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
tokenizer('Magnetic resonance imaging, or MRI, is a noninvasive medical imaging test that produces detailed images of almost every internal structure in the human body, including the organs, bones, muscles and blood vessels.')


In [None]:
test_phrase = "The seeds of the Coffea plant's fruits are separated to produce unroasted green coffee beans. The beans are roasted and then ground into fine particles typically steeped in hot water before being filtered out, producing a cup of coffee. "
print(test_phrase)

In [None]:
tokenizer(test_phrase)


- Tokenize using NLTK

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

tokens = word_tokenize(test_phrase)

print(tokens)

- Tokenize using spaCy

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(test_phrase)  # pass the variable, not the literal string "test_phrase"

tokens = [token.text for token in doc]

print(tokens)

- Compute TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# The given phrase
text = ["The seeds of the Coffea plant's fruits are separated to produce unroasted green coffee beans. The beans are roasted and then ground into fine particles typically steeped in hot water before being filtered out, producing a cup of coffee."]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(text)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to a pandas DataFrame for better readability
import pandas as pd
df_tfidf = pd.DataFrame(tfidf_matrix.T.toarray(), index=feature_names, columns=["TF-IDF"])

# Display the TF-IDF scores
df_tfidf.sort_values(by=["TF-IDF"], ascending=False)


In [None]:
counter = collections.Counter()
for c in train_dataset:
    counter.update(tokenizer(c['text']))
vocab = torchtext.vocab.Vocab(counter, min_freq=1)
len(vocab)

In [None]:
## vocab.freqs

In [None]:
train_dataset[:10]

Using the vocabulary, we can easily encode our tokenized string into a sequence of numbers:

In [None]:
vocab_size = len(vocab)
print(f"Vocab size {vocab_size}")

stoi = vocab.stoi # convert tokens to indices

def encode(x):
    return [stoi[s] for s in tokenizer(x)]

encode('I love to play with my words')

In [None]:
from collections import Counter

# Assuming you have your text data pre-processed (tokenized)
all_tokens = []  # List to store all tokens
for sample in train_dataset:  # Iterate through training data (adjust for test if needed)
    text = sample['text']  # Access the text field (might differ based on dataset format)
    tokens = text.split()  # Tokenize the text
    all_tokens.extend(tokens)

# Create a counter to track word frequencies
counter = Counter(all_tokens)

# Define a minimum frequency threshold (optional)
min_freq = 5  # Words appearing fewer than 5 times are excluded

# Filter words based on frequency (optional)
filtered_words = [word for word, count in counter.items() if count >= min_freq]

# Create a dictionary mapping words to unique integer IDs (vocabulary)
word2idx = {word: i for i, word in enumerate(filtered_words)}

In [None]:
import pandas as pd
pd.DataFrame({"key":word2idx.keys(), "index":word2idx.values()}).head()

In [None]:
vocab

<img src="https://github.com/wsko/Generative_AI/blob/main/Day-3/images/border.jpg?raw=1" height="10" width="1500" align="center"/>

- BoW is a widely used traditional vector representation in text analysis.
- Each word is associated with a vector index.
- Vector elements store the count of word occurrences in a given document.
![Image showing how a bag of words vector representation is represented in memory.](./images/bag-of-words-example.png)
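
To show what happens inside a BoW vectorizer, here is a hand-rolled sketch using only the standard library. The helper names (`bow_tokenize`, `bow_vector`) and the letters-only regex are assumptions for this example; scikit-learn's `CountVectorizer` uses a different default token pattern that drops single-character tokens, so its vocabulary will not match exactly.

```python
import re
from collections import Counter

toy_corpus = ["I like hot dogs.", "The dog ran fast.", "Its hot outside."]

def bow_tokenize(text):
    """Lowercase and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

# Each distinct word in the corpus gets a fixed position in the vector
bow_vocab = sorted({tok for doc in toy_corpus for tok in bow_tokenize(doc)})
bow_index = {tok: i for i, tok in enumerate(bow_vocab)}

def bow_vector(text):
    """Count occurrences of each vocabulary word; unknown words are ignored."""
    vec = [0] * len(bow_vocab)
    for tok, count in Counter(bow_tokenize(text)).items():
        if tok in bow_index:
            vec[bow_index[tok]] = count
    return vec

print(bow_vocab)
print(bow_vector("My dog likes hot dogs on a hot day."))
```

Words that never appeared in the corpus ("my", "likes", "day") leave no trace in the vector, which is one of the limitations of this representation.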

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

In [None]:
train_corpus = [item['text'] for item in train_dataset]

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit_transform(train_corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

In [None]:
X_counts = vectorizer.fit_transform(train_corpus[:1000])

# TfidfTransformer: transform the document-term count matrix into a TF-IDF matrix
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

# Optional: convert the sparse matrix to a dense array for easier inspection
tfidf_matrix = X_tfidf.toarray()
tfidf_matrix.shape

<img src="./images/border.jpg" height="10" width="1500" align="center"/>

## **Word Embeddings**

- When training classifiers based on BoW or TF/IDF, we work with high-dimensional sparse vectors of length `vocab_size`.
- Like the one-hot encodings they are built from, these vectors are memory-inefficient and treat every word as independent, so they carry no notion of semantic similarity between words.


![Embedding and Semantic Similarity](./images/NLP1.webp)

- **Embedding** is the idea of representing words as lower-dimensional dense vectors that capture semantic meaning.
- It reduces the dimensionality of word vectors.

- The embedding layer takes a word as input and produces an output vector of specified `embedding_size`.
- Unlike one-hot encoding, it takes a word number as input, avoiding large one-hot-encoded vectors.

- Using an embedding layer as the first layer in a classifier network transforms it from a bag-of-words model to an **embedding bag** model.
- In this model, words are converted into embeddings, and an aggregate function (e.g., `sum`, `average`, `max`) is applied to these embeddings.

![Embedding Classifier Example](./images/embedding-classifier-example.png)
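
The sketch below imitates an embedding bag with NumPy: an embedding table is indexed by word number and the resulting vectors are aggregated. The table is random here (in a real network it is learned), and the sizes and names such as `toy_vocab_size` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, embedding_size = 6, 4

# Embedding table: one row of `embedding_size` numbers per word index.
# Random here; in a trained network these values are learned.
embedding_table = rng.normal(size=(toy_vocab_size, embedding_size))

def embedding_bag(word_ids, mode="mean"):
    """Look up each word's embedding and aggregate them into one vector."""
    vectors = embedding_table[word_ids]  # shape: (len(word_ids), embedding_size)
    if mode == "sum":
        return vectors.sum(axis=0)
    if mode == "max":
        return vectors.max(axis=0)
    return vectors.mean(axis=0)

# Indexing by word number gives the same result as multiplying by a one-hot vector
one_hot = np.zeros(toy_vocab_size)
one_hot[3] = 1.0
print(np.allclose(one_hot @ embedding_table, embedding_table[3]))

# Documents of different lengths map to vectors of the same size
short_doc, long_doc = [0, 3], [0, 3, 5, 1, 2]
print(embedding_bag(short_doc).shape, embedding_bag(long_doc).shape)
```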


## **Word Embeddings**

- Early NLP models used word embeddings.
- Each word mapped to a fixed-size vector.
- Word2Vec, GloVe, and FastText were popular methods.


- Word2Vec
  - Word2Vec is a popular word embedding technique in natural language processing (NLP).
  - It transforms words into dense vectors in a continuous vector space.
  - Two main approaches: Skip-gram and Continuous Bag of Words (CBOW).
  - Word2Vec captures semantic relationships and context between words.
  - Used for various NLP tasks, including text classification, sentiment analysis, and recommendation systems.
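
To illustrate the difference between the two approaches, this sketch builds the training examples each one would see for a short sentence. It only shows how the (input, target) pairs are formed, not the training itself; the function names are assumptions for this example. In gensim's `Word2Vec`, the `sg` parameter selects between them (`sg=0` for CBOW, `sg=1` for Skip-gram).

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word is used to predict each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window):
    """(context, center) pairs: the neighbors together predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

sentence = ["word", "embeddings", "capture", "context"]
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```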

In [None]:
%pip install gensim

In [None]:
# Import necessary libraries
from gensim.models import Word2Vec

# Sample sentences for training the Word2Vec model
sentences = [
    ["machine", "learning", "is", "awesome"],
    ["word", "embeddings", "capture", "context"],
    ["natural", "language", "processing", "rocks"]
]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=10, window=5, min_count=1, sg=0)

# Get the word vector for a specific word
word_vector = model.wv['learning']
print("Vector for 'learning':\n", word_vector)

# Find similar words to a given word
similar_words = model.wv.most_similar('awesome', topn=2)
print("Words similar to 'awesome':\n", similar_words)


## Example: Quantify Word Similarity Using Word2Vec Embeddings

In [None]:
import pandas as pd
import numpy as np
import string
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


import gensim
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/wsko/Statistics/main/complaints02.csv')
data.head(10)

In [None]:
# Cleaning the raw text
documents = data.iloc[:, 0].tolist()

nltk.download('stopwords')
nltk.download('punkt')


# Initialize stop words and stemmer
stop_words = set(stopwords.words('english')).union({"xxxx", "xxxxxxxx"})## additional/special stop words
stemmer = PorterStemmer()

# Function to clean text
def clean_text(doc):
    # Convert to lower case
    doc = doc.lower()
    # Remove punctuation
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    # Remove stop words and perform stemming
    doc = ' '.join([stemmer.stem(word) for word in nltk.word_tokenize(doc) if word not in stop_words])
    return doc

# Clean the documents
cleaned_documents = [clean_text(doc) for doc in documents]

data['clean'] = cleaned_documents  # reuse the list computed above instead of cleaning twice
print(data.loc[1]['Consumer_complaint'])
print('__________')
print(data.loc[1]['clean'])

In [None]:
# Train Word2Vec model
word2vec_model = Word2Vec(sentences=[doc.split() for doc in cleaned_documents],
                          vector_size=100, window=5, min_count=5)

In [None]:
words = list(word2vec_model.wv.index_to_key)[:100]
list(word2vec_model.wv.index_to_key)[:10]

In [None]:
## visualize word similarity for the top 100 tokens (by the frequency)
word_embeddings = word2vec_model.wv[words]
#word_embeddings

In [None]:
# Reduce dimensions using t-SNE
tsne = TSNE(n_components=2, random_state=42)
word_embeddings_2d = tsne.fit_transform(word_embeddings)[:100,:]

# Plot the embeddings
plt.figure(figsize=(14, 10))
plt.scatter(word_embeddings_2d[:, 0], word_embeddings_2d[:, 1], edgecolors='k', c='r')

for i, word in enumerate(words):
    plt.annotate(word, xy=(word_embeddings_2d[i, 0], word_embeddings_2d[i, 1]))

plt.title('2D Word Embeddings Visualization')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()


## **Limitations of Word Embeddings**
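  - Each word gets a single static vector, no matter which sentence it appears in.
  - They therefore struggle with polysemy: "bank" has the same vector in "river bank" and "bank account".
  - Relationships that depend on word order or on the surrounding sentence are not represented.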

## **Transformers**

  - The Transformer architecture revolutionized NLP.
  - Introduced self-attention mechanisms.
  - Captures context and dependencies effectively.
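
A minimal NumPy sketch of scaled dot-product self-attention, the core operation behind the bullets above. It assumes a single attention head with random projection matrices and leaves out masking, positional encodings, and everything else a full Transformer layer contains.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Each row of X (one token) attends to every row of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # token-to-token relevance
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))      # stand-in for 5 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)
print(weights.sum(axis=1))
```

Because every output row is a weighted mix of all token vectors, the representation of a word depends on the sentence around it, which static word embeddings cannot offer.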




# **Lab:** Find similar documents


- In the Complaints dataset, find the most similar and most dissimilar customer complaints using Word2Vec embeddings.

In [None]:
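def document_vector(doc):
    # Split the cleaned string into tokens and remove out-of-vocabulary words
    ##(keep in mind that if you train w2v on a training set, your test set may contain new words)
    words = [word for word in doc.split() if word in word2vec_model.wv.key_to_index]
    # A document with no in-vocabulary words is represented by a zero vector
    if not words:
        return np.zeros(word2vec_model.wv.vector_size)
    # Calculate the mean of word vectors
    ## to represent a document as a single vector (a row in a table)
    return np.mean(word2vec_model.wv[words], axis=0)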

# Create feature vectors for each document
X_embed = np.array([document_vector(doc) for doc in cleaned_documents])
X_embed.shape

### Find similar and different documents

In [None]:
data['Consumer_complaint'][0]

In [None]:
data['Consumer_complaint'][1]

In [None]:
cosine_similarity(X_embed[[0,1],:])[1,0]
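
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on small made-up vectors; the helper name `cosine` is an assumption for this example, and scikit-learn's `cosine_similarity` computes the same quantity for every pair of rows at once.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
same_direction = cosine(a, 2 * a)                                  # length does not matter
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine(a, -a)

print(same_direction, orthogonal, opposite)
```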

In [None]:
#compute the similarities between complaint [0] and other complaints:
## similarity[j] holds the score for complaint j + 1, because the loop starts at 1
similarity = []
for i in range(1,X_embed.shape[0]):
    similarity.append(cosine_similarity(X_embed[[0,i],:])[1,0])
similarity = np.array(similarity)

In [None]:
## find the most dissimilar complaint to complaint [0]
least_idx = similarity.argmin() + 1  # shift back to the row index in data
(least_idx, similarity[least_idx - 1])

In [None]:
data['Consumer_complaint'][least_idx]

In [None]:
## find the most similar complaint to complaint [0]
most_idx = similarity.argmax() + 1  # shift back to the row index in data
(most_idx, similarity[most_idx - 1])

In [None]:
data['Consumer_complaint'][most_idx]