# Lab 06 - Text Representation - Annette Bazan

#### What is text representation?
- Text representation is the process of converting textual data into numeric values that a machine can process and understand.
- There are 3 popular techniques to perform text representation:
1. Bag-of-Words (BoW)
2. TF-IDF (Term Frequency-Inverse Document Frequency)
3. Word Embedding: Word2Vec

## 1. Bag-of-Words (BoW)
* In this technique, each document is represented by how often each word occurs in it.
* The CountVectorizer class from the scikit-learn library lets us build the bag-of-words representation.
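
To make the counting idea concrete, here is a minimal pure-Python sketch of a unigram bag-of-words. It is not how CountVectorizer is implemented internally; the helper `tokenize` is a hypothetical name, and its regular expression is an assumption chosen to mirror CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

documents = ['This is the first document.',
             'This document is the second document',
             'And this is the third one.']

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens made of
    # two or more word characters (assumed to mirror CountVectorizer's default).
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens in the corpus.
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

# Each document becomes one row of counts, with one column per vocabulary word.
bow_matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row has one entry per vocabulary word, so the vector length grows with the vocabulary rather than with the length of the document.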

In [1]:
# 1. importing libraries

# general libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# BoW library
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# 2. getting the text (document)
documents = ['This is the first document.',
             'This document is the second document',
             'And this is the third one.']

In [3]:
# 3. let's create the bag-of-words model using unigrams, bigrams and trigrams
# an n-gram is a contiguous sequence of n words (or characters)
bow_vectorizer = CountVectorizer(ngram_range=(1,3))

x = bow_vectorizer.fit_transform(documents)

In [4]:
# 4. listing the features (the n-gram vocabulary) that the model learned from the documents
print(bow_vectorizer.get_feature_names_out())

['and' 'and this' 'and this is' 'document' 'document is' 'document is the'
 'first' 'first document' 'is' 'is the' 'is the first' 'is the second'
 'is the third' 'one' 'second' 'second document' 'the' 'the first'
 'the first document' 'the second' 'the second document' 'the third'
 'the third one' 'third' 'third one' 'this' 'this document'
 'this document is' 'this is' 'this is the']
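
The feature list above can be reproduced by hand. The sketch below slides a window of size 1, 2 and 3 over each tokenized document; `word_ngrams` is a hypothetical helper, and the tokenization regex is an assumption meant to mirror CountVectorizer's defaults.

```python
import re

documents = ['This is the first document.',
             'This document is the second document',
             'And this is the third one.']

def word_ngrams(tokens, n_min, n_max):
    # Hypothetical helper: slide a window of each size n over the tokens
    # and join the words inside the window with a single space.
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = set()
for doc in documents:
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    features.update(word_ngrams(tokens, 1, 3))

# Sort the unique n-grams alphabetically to get the column order.
features = sorted(features)
print(len(features))
print(features)
```

For these three documents the set contains 9 unigrams, 11 bigrams and 10 trigrams, which accounts for the 30 columns of the matrix.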


In [5]:
# 5. converting text (documents) into numeric values
print(x.toarray())

[[0 0 0 1 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 1 1]
 [0 0 0 2 1 1 0 0 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0 0]
 [1 1 1 0 0 0 0 0 1 1 0 0 1 1 0 0 1 0 0 0 0 1 1 1 1 1 0 0 1 1]]


**Conclusion for BoW:**

1. **List 5 points you learned about BoW technique**
* Import necessary libraries like NumPy, pandas, and visualization tools (Matplotlib, Seaborn).
* Use CountVectorizer from sklearn.feature_extraction.text to implement BoW.
* Prepare a list of text documents as input data.
* Convert text into a bag-of-words model using unigram, bigram, and trigram.
* Transform text into numerical feature vectors and display the output.
2. **How is BoW performing text representation?**
BoW (Bag of Words) represents text by counting word occurrences, ignoring grammar and word order. It creates a sparse vector where each word is a feature with its frequency in the text.
3. **What are the limitations of BoW technique?**
It ignores context, word meaning, and semantic relationships. In addition, each vector has one entry per vocabulary term, so the representation is high-dimensional and sparse.
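
A small illustration of the word-order limitation, using two made-up sentences (they are not part of the lab data): both contain the same words with the same frequencies, so a unigram bag-of-words cannot tell them apart.

```python
from collections import Counter

# Hypothetical example sentences with opposite meanings.
sentence_a = "the dog bites the man"
sentence_b = "the man bites the dog"

bow_a = Counter(sentence_a.split())
bow_b = Counter(sentence_b.split())

# The word counts are identical even though the sentences differ.
print(bow_a == bow_b)
```

Adding bigrams or trigrams, as in the model above, recovers some local word order, at the cost of a larger vocabulary.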

## 2. TF-IDF (Term Frequency-Inverse Document Frequency)
- While BoW only counts words and treats every word as equally important, TF-IDF weights each word by how significant it is to a document relative to the whole collection.
- TF: the number of times a word occurs in the document
- IDF: natural log (base e) of (total number of documents / number of documents that contain the term)
- TF-IDF: TF * IDF
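
The following sketch works through the calculation by hand for the three lab documents. One caveat: with its default settings, scikit-learn's TfidfVectorizer uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit L2 length, so its numbers differ from the plain formula above. The sketch follows those defaults; `tokenize` is a hypothetical helper and its regex is an assumption.

```python
import math
import re
from collections import Counter

documents = ['This is the first document.',
             'This document is the second document',
             'And this is the third one.']

def tokenize(text):
    # Hypothetical helper: lowercase, keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(documents)

# Document frequency: the number of documents that contain each term.
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocabulary}

# Smoothed IDF, following TfidfVectorizer's default settings.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
       for term in vocabulary}

tfidf_matrix = []
for doc in tokenized:
    tf = Counter(doc)                                # raw term counts
    row = [tf[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in row))        # L2 length of the row
    tfidf_matrix.append([v / norm for v in row])

for row in tfidf_matrix:
    print([round(v, 4) for v in row])
```

Words that appear in every document ('is', 'the', 'this') get the lowest IDF, so rarer words such as 'first' or 'second' receive higher weights within their document.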

In [6]:
# 1. importing libraries

# general libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# TF-IDF library
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
# 2. getting the text (document)
documents = ['This is the first document.',
             'This document is the second document',
             'And this is the third one.']

In [8]:
# 3. let's create the TF-IDF model
# creating TF, IDF, TF-IDF = TF * IDF
tfidf_vectorizer = TfidfVectorizer()

x_tfidf = tfidf_vectorizer.fit_transform(documents)

In [9]:
# 4. listing the features (vocabulary terms) that the model learned from the documents
print(tfidf_vectorizer.get_feature_names_out())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [10]:
# 5. converting text (documents) into numeric values
# printing out the TF-IDF matrix
print(x_tfidf.toarray())

[[0.         0.46941728 0.61722732 0.3645444  0.         0.
  0.3645444  0.         0.3645444 ]
 [0.         0.7284449  0.         0.28285122 0.         0.47890875
  0.28285122 0.         0.28285122]
 [0.49711994 0.         0.         0.29360705 0.49711994 0.
  0.29360705 0.49711994 0.29360705]]


**Conclusion for TF-IDF:**

1. **List 5 points you learned about TF-IDF technique.**
- TfidfVectorizer from sklearn.feature_extraction.text is used to implement the technique.
- TF-IDF assigns weights to words based on their importance, reducing the impact of common words.
- It calculates Term Frequency (TF), Inverse Document Frequency (IDF), and their product (TF-IDF).
- The output is a feature matrix where each word is represented by its TF-IDF score.
- It converts text documents into numerical vectors, making them suitable for machine learning models.
2. **How is TF-IDF performing text representation?**
TF-IDF (Term Frequency-Inverse Document Frequency) represents text by weighting words based on their frequency in a document and rarity across documents.
It reduces the importance of common words and highlights unique terms for better text representation.
3. **What are the limitations of TF-IDF technique?**
It ignores word order, context, and semantic meaning, and it may not perform well with short texts or dynamic vocabulary changes.


## 3. Word Embedding (Word2Vec)
- While TF-IDF performs text representation without modeling any relationship between words, Word2Vec learns vectors that reflect relationships between words, including contextual and semantic nuances.

In [11]:
# 1. importing libraries

# general libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# word2vec libraries
# gensim library: it needs to be installed
# import nltk

In [12]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Installing collected packages: scipy, gensim
  Attempting uninstall: scipy
    Found existing installation: scipy 1.14.1
    Uninstalling scipy-1.14.1:
      Successfully 

In [13]:
from gensim.models import Word2Vec
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [14]:
# 2. getting the text (document)
documents = ['This is the first document.',
             'This document is the second document',
             'And this is the third one.']

In [15]:
# 3. tokenization of the text and display the words
tokenized_documents = [word_tokenize(doc.lower()) for doc in documents]
tokenized_documents

[['this', 'is', 'the', 'first', 'document', '.'],
 ['this', 'document', 'is', 'the', 'second', 'document'],
 ['and', 'this', 'is', 'the', 'third', 'one', '.']]

In [16]:
# 4. creating and training the Word2Vec model
w2vec_model = Word2Vec(sentences=tokenized_documents, vector_size=100,
                       window=5, min_count=1, workers=4)

In [17]:
# 5. getting the vectors
vector_representation = w2vec_model.wv['document']
print(vector_representation)

[-5.3672609e-04  2.3644192e-04  5.1034638e-03  9.0088481e-03
 -9.3043381e-03 -7.1172416e-03  6.4599109e-03  8.9737894e-03
 -5.0164633e-03 -3.7643004e-03  7.3806536e-03 -1.5336993e-03
 -4.5375801e-03  6.5538706e-03 -4.8608989e-03 -1.8169655e-03
  2.8765604e-03  9.9168427e-04 -8.2857795e-03 -9.4498657e-03
  7.3113004e-03  5.0703306e-03  6.7592640e-03  7.6371350e-04
  6.3505652e-03 -3.4057787e-03 -9.4622851e-04  5.7676826e-03
 -7.5218501e-03 -3.9357515e-03 -7.5110761e-03 -9.3052583e-04
  9.5381113e-03 -7.3201829e-03 -2.3341659e-03 -1.9368482e-03
  8.0780918e-03 -5.9310989e-03  4.5382971e-05 -4.7547058e-03
 -9.6027423e-03  5.0065457e-03 -8.7603368e-03 -4.3926029e-03
 -3.5016976e-05 -2.9605580e-04 -7.6620397e-03  9.6154800e-03
  4.9820296e-03  9.2333099e-03 -8.1580607e-03  4.4947835e-03
 -4.1369139e-03  8.2554476e-04  8.4977103e-03 -4.4625821e-03
  4.5185820e-03 -6.7884498e-03 -3.5496773e-03  9.3978141e-03
 -1.5769212e-03  3.2110573e-04 -4.1396245e-03 -7.6837917e-03
 -1.5078725e-03  2.47046

* In Word2Vec:
1. a numeric vector is assigned to each word based on its context within the text,
2. the machine infers the relationships between words from their vectors,
3. words are translated into mathematical vectors so that the machine can capture their semantic values.
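
One common way to read relationships off word vectors is cosine similarity. Below is a minimal sketch with made-up 3-dimensional vectors; the names and values are hypothetical and are not taken from the model trained above.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: dot product divided by the
    # product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen only for illustration.
vec_king = [0.8, 0.1, 0.3]
vec_queen = [0.7, 0.2, 0.3]
vec_apple = [-0.2, 0.9, -0.4]

print(cosine_similarity(vec_king, vec_queen))   # vectors pointing the same way
print(cosine_similarity(vec_king, vec_apple))   # vectors pointing apart
```

With the model trained in this lab, gensim offers the same measure through `w2vec_model.wv.similarity('document', 'first')`. On a corpus of only three sentences the resulting values should not be expected to be meaningful.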

### **Conclusion:**
- Word2Vec captures word relationships and contextual meaning, unlike BoW and TF-IDF.
- It represents words as dense numerical vectors instead of sparse frequency-based representations.
- Word2Vec can identify word similarities using cosine similarity between vectors.
- The gensim library provides an efficient implementation of Word2Vec.
- Tokenization is necessary before training the Word2Vec model.
- Word2Vec has two main architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
- CBOW predicts a word based on its surrounding context.
- Skip-Gram predicts surrounding words given a target word.
- A larger window size captures broader context, while a smaller one focuses on local meaning.
- Word2Vec requires sufficient training data for meaningful word representations.
- Words with similar meanings have closer vector representations.
- It helps in NLP tasks like word clustering and document classification.
- Word2Vec does not account for words with multiple meanings.
- The quality of word embeddings improves with more training data.
- The output vectors can be visualized using techniques like PCA or t-SNE.
- Unlike TF-IDF, Word2Vec captures semantic and syntactic word properties.
- Pre-trained Word2Vec models (e.g., Google’s) can be used for various NLP applications.
- Training a Word2Vec model requires tuning parameters like vector_size, window, and min_count.
- Words not present in training data won’t have vector representations.
- Word2Vec improves machine understanding of text, enabling more advanced NLP applications.
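
To illustrate the difference between the two architectures and the effect of the window size, the sketch below builds the training examples each one would see for a single sentence. It is a simplification under stated assumptions: `skipgram_pairs` and `cbow_examples` are hypothetical helpers, and real implementations add steps such as subsampling and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window):
    # Skip-Gram: each target word is paired with every word that lies
    # within `window` positions of it (target -> context word).
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    # CBOW: the surrounding context words are the input and the
    # word in the middle is the prediction target.
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = ['this', 'is', 'the', 'first', 'document']
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

In gensim, the `sg` parameter of `Word2Vec` selects the architecture: `sg=0` (the default) trains CBOW and `sg=1` trains Skip-Gram.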

##### End of Lab 06