# 3.3 Text data vectorization

After *tokenizing, removing stopwords and punctuation, converting text to lowercase, and lemmatizing* the text data, the next common preprocessing step is to turn the text into numerical vectors. Typical options include:

*   **Count-based representation/Feature extraction**: This step involves converting the text data into a numerical format that can be used for machine learning algorithms. Two common approaches are:

       a.   **Bag of Words** (**BoW**): This method creates a vocabulary of unique words in the corpus and represents each document as a vector indicating the frequency of each word's occurrence.  It ignores the order of words in the text.

       b.  **Term Frequency-Inverse Document Frequency (TF-IDF)**: This technique weights a word's frequency in a document by how rare the word is across the entire corpus, so words that appear in most documents are down-weighted. It helps identify words that are distinctive to a document.

<br>

*   **Word embeddings**: If you want to capture semantic meaning and context, you can use pre-trained word embeddings like `Word2Vec`, `GloVe`, or `FastText`. These embeddings provide vector representations for words, where words with similar meanings have similar vectors. This step is especially useful for tasks like text classification, sentiment analysis, and machine translation.


## Bag of Words as count-based representation 

Here's how the BoW model works:

**Vocabulary Creation**: The first step is to create a vocabulary, which is a list of all unique words in your corpus (collection of documents). Each word in the vocabulary is assigned a unique index.

**Counting Word Occurrences**: For each document in your corpus, you count how many times each word in the vocabulary appears in that document.

**Vector Representation**: You represent each document as a vector where each element corresponds to a word in the vocabulary, and the value of each element is the count of how many times that word appears in the document. These vectors are typically quite high-dimensional and sparse.
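The three steps above can be sketched in plain Python before turning to a library. This is a minimal illustration on a hypothetical toy corpus that is assumed to be already lowercased and free of punctuation; the names `docs`, `vocab`, `word_to_index`, and `to_bow_vector` are made up for this example.

```python
from collections import Counter

# Toy corpus (hypothetical), already lowercased and stripped of punctuation
docs = ["happy families are all alike",
        "every unhappy family is unhappy in its own way"]

# Step 1: vocabulary creation: sorted unique words, each with a unique index
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
word_to_index = {word: i for i, word in enumerate(vocab)}

# Steps 2 and 3: count occurrences and place each count at the word's index
def to_bow_vector(tokens):
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

bow = [to_bow_vector(tokens) for tokens in tokenized]

print(vocab)
for row in bow:
    print(row)
```

Each row has one entry per vocabulary word, and most entries are zero even for this tiny corpus, which is why real bag-of-words matrices are stored in sparse form.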

**CountVectorizer** is a feature extraction method provided by the scikit-learn (sklearn) library, which is a popular machine learning library in Python. It is used for converting a collection of text documents into a matrix of token (word) counts. 

In [1]:
# Import CountVectorizer in sklearn
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;', 'Every unhappy family is unhappy in its own way.']

# Create an instance of CountVectorizer
annak_vectorizer = CountVectorizer()

# Fit your text data
annak_vec = annak_vectorizer.fit(annak)

# Transform your text data
annak_bow = annak_vectorizer.transform(annak)

# Extract the vocabulary
vocabulary = annak_vectorizer.get_feature_names_out()

In [2]:
print("annak_bow: \n", annak_bow, "\n")
print("annak_bow.toarray(): \n", annak_bow.toarray(), "\n")
print("vocabulary: \n",vocabulary)

In [3]:
import pandas as pd

pd.set_option('display.max_columns', None)
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
print(pd.DataFrame(annak_bow.toarray(), columns = annak_vectorizer.get_feature_names_out()))

`from sklearn.feature_extraction.text import CountVectorizer`

`CountVectorizer` implements both tokenization and occurrence counting in a single class.

Here are some of the most commonly used parameters:

*   `input`: Specifies how the data passed to `fit` or `transform` is interpreted. The default, `'content'`, expects a sequence of strings, one per document; `'filename'` expects a list of file paths to read; `'file'` expects file-like objects.
*   `stop_words`: Words to exclude from the feature matrix. Pass `'english'` to use the built-in English list, or pass your own list. Stop words are common words like "the," "and," "in," which are often removed to reduce noise in text data.
*   `lowercase`: A boolean (`True`/`False`) parameter that determines whether the text is converted to lowercase before tokenizing. The default is `True`, which treats words in a case-insensitive manner.
*   `analyzer`: Defines what constitutes a "token" in the text. The default is `'word'`; set it to `'char'` or `'char_wb'` for character n-grams, or pass a custom function. A custom function receives each raw document and must return its tokens, so the vectorizer's own preprocessing, tokenization, and n-gram settings no longer apply.
*   `ngram_range`: Controls the range of n-grams to consider. An n-gram is a contiguous sequence of n items from the given text. For example, `ngram_range=(1, 2)` considers both unigrams (individual words) and bigrams (pairs of consecutive words).
*   `max_features`: Limits the vocabulary to the given number of features, keeping the terms with the highest frequency across the corpus. The default, `None`, keeps all features.
*   `max_df` and `min_df`: Ignore tokens whose document frequency is above `max_df` or below `min_df`. An integer is read as an absolute number of documents; a float between 0 and 1 is read as a proportion of documents.
*   `binary`: If set to `True`, the output contains binary values (`0` or `1`) for word absence or presence, instead of word counts.
*   `token_pattern`: A regular expression that defines what a token is. It is only used when `analyzer='word'`. The default pattern selects tokens of two or more alphanumeric characters, so single-character words are dropped.
*   `vocabulary`: You can pass a custom vocabulary as a dictionary where keys are terms (words) and values are the indices to be used in the matrix.

The output of `CountVectorizer` is typically a sparse matrix with *rows* representing documents and *columns* representing unique tokens (words or n-grams). Each cell in the matrix contains the *count* of how many times a particular token appears in a particular document.

The fitted vectorizer provides the following methods and attributes for producing and inspecting this output:

*   `fit_transform()` or `transform()`: These methods convert a collection of text documents into the matrix of token counts. `fit_transform()` learns the vocabulary and transforms in one step; `transform()` applies an already learned vocabulary to new text.

*   `vocabulary_`: This attribute is a dictionary that maps the unique tokens (words or n-grams) to their respective column indices in the matrix. It provides a way to understand which column corresponds to which token.

*   `stop_words_`: This attribute holds the terms that were ignored because they occurred in too many documents (`max_df`), in too few documents (`min_df`), or were cut off by `max_features`. It is not the list passed through the `stop_words` parameter.

*   `get_feature_names_out()`: This method returns an array of feature names in column order, so each name corresponds to a column of the transformed matrix. It replaces the older `get_feature_names()`, which is no longer available in recent scikit-learn releases.

*   `inverse_transform()`: This method takes the transformed matrix as input and returns, for each document, the array of tokens with non-zero counts. It does not restore the original text: word order and repetitions are lost.


These attributes and methods allow you to inspect and work with the output of the `CountVectorizer` to better understand and utilize the vectorized representation of your text data.


In [4]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# stop_words accepts 'english', a list, or None, so convert the frozenset to a list
annak_vectorizer2 = CountVectorizer(stop_words = list(ENGLISH_STOP_WORDS))
annak_vect2 = annak_vectorizer2.fit(annak)
annak_bow2 = annak_vectorizer2.transform(annak)

print(pd.DataFrame(annak_bow2.toarray(), columns = annak_vectorizer2.get_feature_names_out()))

In [5]:
annak_vectorizer3 = CountVectorizer(ngram_range =(3,3))
annak_bow3 = annak_vectorizer3.fit_transform(annak)

print("annak: \n", annak, "\n")
print(pd.DataFrame(annak_bow3.toarray(), columns = annak_vectorizer3.get_feature_names_out()))

### Using the parameter `analyzer`

In [7]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [8]:
import re

def set_clean(raw_text):
    """Clean one document and return its list of lemmatized tokens."""
    set_stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Strip punctuation, convert to lowercase, and tokenize
    no_punc = re.sub(r'[^\w\s]', '', raw_text)
    lowercase_no_punc = no_punc.lower()
    tokenized_text = word_tokenize(lowercase_no_punc)

    # Drop stop words
    no_stop = [w for w in tokenized_text if w not in set_stop_words]

    # Lemmatize under each WordNet part-of-speech tag in turn
    # (verb, noun, adjective, adverb, satellite adjective)
    lc_text = no_stop
    for pos in ("v", "n", "a", "r", "s"):
        lc_text = [lemmatizer.lemmatize(word, pos=pos) for word in lc_text]
    return lc_text

In [9]:
# Create an instance of CountVectorizer
annak_vect3 = CountVectorizer(analyzer=set_clean)

# Fit and transform your text data
annak_bow3 = annak_vect3.fit_transform(annak)

print(pd.DataFrame(annak_bow3.toarray(), columns = annak_vect3.get_feature_names_out()))

In [10]:
tweets = pd.read_csv("datasets/tweets_sample.csv")
print("tweets.shape = ", tweets.shape)
print("tweets.columns = ", tweets.columns)

In [11]:
# Create an instance of CountVectorizer
tweets_vectorizer = CountVectorizer(analyzer=set_clean)

# Fit and transform your text data
tweets_bow = tweets_vectorizer.fit_transform(tweets.text)

# Extract the vocabulary
tweets_vocabulary = tweets_vectorizer.get_feature_names_out()

# bag-of-words as a data frame
tweets_bow_df = pd.DataFrame(tweets_bow.toarray(), columns = tweets_vocabulary)

print("tweets_bow_df.shape: ",tweets_bow_df.shape,"\n")
print("last five columns: \n", tweets_bow_df.iloc[:, -5:])

## TF-IDF as count-based representation

In scikit-learn, the `TfidfVectorizer` is a feature extraction class that transforms a collection of text documents into a matrix of `TF-IDF` (**Term Frequency-Inverse Document Frequency**) features.

TF-IDF is a statistical measure used in natural language processing and information retrieval to evaluate the importance of a word within a document relative to a collection of documents (corpus).

$$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$$

where 

*   **Term Frequency** (**TF**): This measures how frequently a term occurs in a specific document. It is calculated as the number of times a term appears in a document divided by the total number of terms in that document. This can be expressed as:

$$\text{TF}(t, d) = (\text{Number of times term } t \text{ appears in document } d) / (\text{Total number of terms in document } d)$$

*   **Inverse Document Frequency** (**IDF**): This measures how rare, and therefore how informative, a term is across the entire corpus. It is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing the term, plus one. The added one keeps terms that occur in every document, for which the logarithm is zero, from being zeroed out entirely. The formula is:

$$\text{IDF}(t) = \log((\text{Total number of documents in the corpus}) / (\text{Number of documents containing term }t)) + 1$$
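To make the formulas concrete, here is a minimal plain-Python sketch that applies them exactly as written above to a hypothetical, already tokenized toy corpus. The helper names `tf`, `idf`, and `tf_idf` are made up for this example. It is not meant to reproduce `TfidfVectorizer` output: by default that class smooths the IDF and L2-normalizes each row, so its numbers will generally differ.

```python
import math

# Toy corpus (hypothetical), already tokenized
docs = [["happy", "family", "alike"],
        ["unhappy", "family", "unhappy", "way"]]

def tf(term, doc):
    # Count of the term divided by the number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / df) + 1; assumes the term occurs in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing) + 1

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("family", docs[1], docs))
print(tf_idf("unhappy", docs[1], docs))
```

In the second document, "family" appears in both documents, so its IDF is 1 and its score is just its term frequency. "unhappy" appears only in that document and occurs twice, so it gets a higher score.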

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(analyzer = set_clean)

# Fit and transform your text data
annak_tfidf = tfidf_vectorizer.fit_transform(annak)

# Extract the vocabulary
vocabulary = tfidf_vectorizer.get_feature_names_out()

print("annak: \n", annak, "\n")
print("vocabulary: ", vocabulary, "\n")

print(pd.DataFrame(annak_tfidf.toarray(), columns = tfidf_vectorizer.get_feature_names_out()))

In [13]:
# Tfidf on the tweets dataset
tweets_tfidf_vect = TfidfVectorizer(analyzer=set_clean)

# Fit and transform your text data
tweets_tfidf = tweets_tfidf_vect.fit_transform(tweets.text)

# Extract the vocabulary
tweets_vocabulary = tweets_tfidf_vect.get_feature_names_out()

print("tweets_vocabulary shape:", tweets_vocabulary.shape,"\n")

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print("TF-IDF values for 5 tokens: \n",pd.DataFrame(tweets_tfidf.toarray(), columns=tweets_tfidf_vect.get_feature_names_out()).iloc[:, -15:-10])

## One-hot encoding for text data vectorization

In [14]:
annak_tokens = [set_clean(line) for line in annak]

print("annak_tokens: \n", annak_tokens,"\n")

# One-hot encoding with pandas get_dummies function
print("word vectors with one-hot encoding:")

[pd.get_dummies(tokens) for tokens in annak_tokens]

In [15]:
annak_str = ' '.join(annak)
annak_str_tokens = word_tokenize(annak_str)
print("annak_str_tokens: \n", annak_str_tokens,"\n")
print(pd.get_dummies(annak_str_tokens))

## Word embeddings for text data vectorization

Word embeddings are used to convert words or phrases into dense vectors of real numbers, typically with hundreds of dimensions, where each dimension represents a feature of the word or its context. 

* `GloVe`, short for "Global Vectors for Word Representation," is a widely used word embedding model in natural language processing (NLP). It was developed by researchers at Stanford University.

Here are some of the key features and concepts associated with GloVe:

*   **Vector Space Model**: GloVe is based on the Vector Space Model (VSM) of word semantics. It operates on the idea that words that frequently co-occur in similar contexts have related meanings.

*   **Word Co-occurrence**: GloVe takes a co-occurrence matrix as input, where each element of the matrix represents the number of times a word occurs in the context of another word within a specified window of text. This matrix is derived from a large corpus of text.

*   **Objective Function**: GloVe's objective function minimizes the weighted squared difference between the dot product of two word vectors (plus bias terms) and the logarithm of the words' co-occurrence count. As a result, words that appear in similar contexts end up with similar vectors.

*   **Training**: GloVe is trained using a global approach, considering the entire corpus of text, as opposed to the local context window approach used in Word2Vec. It leverages statistical information from the entire dataset for word embedding creation.

*   **Dimensionality**: Users can specify the dimensionality of the word embeddings when training GloVe, typically ranging from 50 to 300 dimensions, depending on the application and available resources.
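The co-occurrence matrix that GloVe starts from can be illustrated with a short sketch. This is a simplified, assumed version on a hypothetical toy corpus: it counts every pair within a fixed window equally, whereas the GloVe paper describes down-weighting pairs that are further apart. The names `corpus`, `window`, and `cooc` are illustrative.

```python
from collections import defaultdict

# Toy corpus (hypothetical); real GloVe training uses very large corpora
corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2  # number of context words considered on each side (assumed value)

cooc = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        for j in range(start, i):
            # Count the pair symmetrically
            cooc[(word, tokens[j])] += 1.0
            cooc[(tokens[j], word)] += 1.0

print(cooc[("sat", "on")])
print(cooc[("cat", "mat")])
```

Pairs such as ("sat", "on") that appear close together in both sentences get a higher count, while ("cat", "mat") never falls within the window and stays at zero.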

Reference and downloads: [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)

In [16]:
import numpy as np

# Define the path to your downloaded GloVe file
glove_file = 'datasets/glove.6B.50d.txt'  # Example file for 50-dimensional vectors

In [17]:
#Load the GloVe word vectors into a dictionary
def load_glove_vectors(filename):
    word_vectors = {}
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            vector = np.array(parts[1:], dtype=np.float32)
            word_vectors[word] = vector
    return word_vectors

In [18]:
glove_vectors = load_glove_vectors(glove_file)

In [19]:
# Access the vector for a specific word
word = "king"
if word in glove_vectors:
    vector = glove_vectors[word]
    print(f"Vector for '{word}':\n{vector}")

In [20]:
# !pip install scikit-learn
# Compute the cosine similarity between two words
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
word1 = "king"
word2 = "queen"
if word1 in glove_vectors and word2 in glove_vectors:
    similarity = cosine_similarity([glove_vectors[word1]], [glove_vectors[word2]])[0][0]
    print(f"Similarity between '{word1}' and '{word2}': {similarity}")
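The score returned by `cosine_similarity` is the cosine of the angle between the two vectors: their dot product divided by the product of their lengths. The sketch below computes it directly with NumPy on small hypothetical vectors, so the result can be checked by hand; `cosine_sim` is a helper defined only for this illustration.

```python
import numpy as np

def cosine_sim(u, v):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional vectors, for illustration only
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

print(cosine_sim(a, b))
print(cosine_sim(a, c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score close to 0, which is why cosine similarity is a common choice for comparing word embeddings.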

Here's a step-by-step guide on how to apply GloVe to text data:

*   **Download Pre-trained GloVe Vectors**: You can download pre-trained GloVe vectors from the official website or other sources. GloVe comes in different dimensions (e.g., 50D, 100D, 300D), and you should choose the one that best suits your needs. A larger dimension generally captures more semantic information but requires more memory.

*   **Load GloVe Vectors**: You can load the pre-trained GloVe vectors into your Python script using a library like gensim or simply by reading the text file. Here's an example using `gensim`:

In [22]:
from gensim.models import KeyedVectors

# Load pre-trained GloVe word vectors
# (GloVe text files have no header line; no_header=True requires gensim 4.0 or later)
glove_model = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)

In [23]:
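def text_to_vector(text, model):
    # Split on whitespace; tokens are looked up as-is, so capitalized words
    # or words with attached punctuation are skipped if the model lacks them
    words = text.split()
    words_list = [word for word in words if word in model]
    word_vectors = [model[word] for word in words_list]
    # To get a single vector for the whole text, average the word vectors:
    # text_vector = sum(word_vectors) / len(word_vectors)
    return words_list, word_vectors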

In [24]:
# Text to be converted to word vectors
text = "This is a sample sentence."

# Convert text to word vector
vector = text_to_vector(text, glove_model)
print(vector[0], "\n")

print("is: \n", vector[1][0],"\n")
print("a: \n", vector[1][1],"\n")
print("sample: \n", vector[1][2])


In [25]:
glove_model.similar_by_vector(vector[1][2], topn=1)
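A common way to represent a whole text with a single vector is to average the vectors of its words, the step hinted at in the commented-out line of `text_to_vector`. The sketch below shows this with a small hypothetical dictionary of 3-dimensional vectors standing in for a loaded GloVe model; `toy_vectors` and `average_vector` are names made up for this example, and lowercasing the text first is an added assumption.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a loaded GloVe model
toy_vectors = {
    "is": np.array([0.1, 0.2, 0.3]),
    "a": np.array([0.0, 0.1, 0.0]),
    "sample": np.array([0.5, 0.3, 0.9]),
}

def average_vector(text, vectors):
    # Keep only tokens that have an embedding
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return None  # no known words, so no vector can be formed
    return np.mean(found, axis=0)

doc_vector = average_vector("This is a sample sentence.", toy_vectors)
print(doc_vector)
```

Averaging discards word order, much like bag of words does, but the resulting vector is dense and has the same dimensionality for every text.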