### 1. Count Vectorization
The simplest way to represent a document as a vector is with a **bag-of-words Count Vectorizer**. It turns each document into a 1-D array of term counts, which I think is a good starting point for my clustering algorithms. Let's see how it works.

By default, the Count Vectorizer expects each document as a single string, not as a tokenized list. Each row of the input series, representing one document, must therefore be a string. This ensures that every point in the clustering represents an entire document, rather than an isolated vectorized word.
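If an earlier preprocessing step produced token lists, they can be joined back into strings before vectorizing. The snippet below is a minimal sketch of that step; `tokenized_docs` and its contents are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tokenized documents (illustrative data, not from the dataset)
tokenized_docs = [
    ["my", "order", "arrived", "late"],
    ["where", "is", "my", "refund"],
]

# CountVectorizer expects one string per document, so join the tokens back up
string_docs = [" ".join(tokens) for tokens in tokenized_docs]

vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(string_docs)

# One row per document, one column per vocabulary term
print(doc_term_matrix.shape)
```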

I set the `min_df` parameter to 5 so that only terms appearing in at least 5 documents are included in my **count vectorized** data.

In [4]:
import numpy as np


from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
# %load C:\Sagar Study\ML and Learning\Projects\customer-support-bot\amazon_customer_support\test_environment.py
import sys
import numpy as np 

REQUIRED_PYTHON = "python3"


def main():
    system_major = sys.version_info.major
    if REQUIRED_PYTHON == "python":
        required_major = 2
    elif REQUIRED_PYTHON == "python3":
        required_major = 3
    else:
        raise ValueError("Unrecognized python interpreter: {}".format(
            REQUIRED_PYTHON))

    if system_major != required_major:
        raise TypeError(
            "This project requires Python {}. Found: Python {}".format(
                required_major, sys.version))
    else:
        print(">>> Development environment passes all tests!")


if __name__ == '__main__':
    main()


>>> Development environment passes all tests!


In [21]:
print(REQUIRED_PYTHON)

python3


ImportError: attempted relative import with no known parent package

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Get the feature names (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement)
feature_names = vectorizer.get_feature_names_out()

# Print the count vectorized representation
print(X.toarray())

# Print the feature names
print(feature_names)

In [1]:
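# Vectorizing the data with Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# string_processed_inbound is assumed to be defined in an earlier preprocessing
# step, with one string per document
bag_of_words = CountVectorizer(min_df=5)
inbound_cv = bag_of_words.fit_transform(string_processed_inbound)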

### 2. TF-IDF

In [None]:
tfidf = TfidfVectorizer(min_df=5, ngram_range = (1,3))
# Storing tfidf data and transforming them into sparse matrices
inbound_tfidf = tfidf.fit_transform(string_processed_inbound)
inbound_tfidf

## Pretrained word embeddings
Bag-of-words approaches capture which words appear in a text, but they discard word order (n-grams recover only short local phrases). This limitation motivates more sophisticated vectorization techniques: pretrained embeddings map text into dense vectors that carry semantic information, which should give the intent clusters more meaningful representations.
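To make the word-order point concrete, here is a small sketch with two invented sentences that use the same words in a different order. With unigram counts they get identical vectors; adding bigrams via `ngram_range` tells them apart, but only through short local phrases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative sentences with the same words in a different order
docs = [
    "the package arrived before the refund",
    "the refund arrived before the package",
]

# Unigram counts: both sentences map to the same vector
unigram_vectors = CountVectorizer().fit_transform(docs).toarray()
print((unigram_vectors[0] == unigram_vectors[1]).all())

# Unigrams + bigrams: phrases such as "package arrived" now differ
bigram_vectors = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()
print((bigram_vectors[0] == bigram_vectors[1]).all())
```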

#### 3. GloVe

I'll explore various word embeddings to encode the text differently and assess their effectiveness, beginning with GloVe, an unsupervised learning algorithm for generating word vector representations.

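Gensim simplifies the use of pretrained word embeddings, offering a specialized data format for easy loading into a numpy array. Here, though, I parse the raw GloVe text file directly into a dictionary that maps each word to its vector.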

Note: I won't use this method, because my clustering algorithms require each tweet to be a single point, while this method vectorizes individual words. I'm keeping this section in my notebook to track progress.


In [52]:
%%time
with open("C:\\Sagar Study\\ML and Learning\\Projects\\customer-support-bot\\amazon_customer_support\\objects\\glove.6B.50d.txt", "r", encoding="utf-8") as file: 
    word_to_vec_map = {} 
    for line in file: 
        values = line.split() 
        curr_word = values[0] 
        coefs = np.array(values[1:], dtype=np.float32)  # Fix: Convert values to float
        word_to_vec_map[curr_word] = coefs 

CPU times: total: 2.12 s
Wall time: 4.5 s


In [53]:
word_to_vec_map["enemy"]

array([ 0.70153 , -0.43853 ,  1.0509  , -0.33431 ,  0.67151 , -0.17677 ,
        0.80079 ,  0.90158 , -0.29513 , -0.65586 , -0.083404,  0.35146 ,
       -0.41071 ,  0.29446 , -1.1955  , -0.45611 ,  0.56877 ,  0.073522,
       -1.2616  ,  0.22276 , -0.57735 ,  0.12075 ,  0.54712 , -0.34094 ,
        0.2164  , -1.804   , -0.70362 , -0.56337 ,  1.8773  ,  0.1301  ,
        2.271   , -0.25882 , -0.46309 , -0.7759  , -0.22926 ,  0.62156 ,
       -0.043353, -0.60943 , -1.6791  , -0.018271,  0.53893 , -0.50689 ,
        0.88454 , -0.11158 ,  0.57013 , -0.69098 , -0.43072 , -0.45332 ,
       -0.27984 , -0.056133], dtype=float32)

In [28]:
word_to_vec_map["Sea"]

KeyError: 'Sea'

The lookup fails because the keys in `glove.6B` are lowercase, so a word has to be lowercased before it is looked up:

word_to_vec_map["sea"]
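The `KeyError` for `"Sea"` above suggests that lookups should be case-insensitive and should handle out-of-vocabulary words without crashing. A minimal sketch of such a helper follows; `lookup` and `toy_map` are hypothetical names, and the 3-d vectors are made up to keep the example self-contained.

```python
import numpy as np


def lookup(word, word_to_vec_map):
    """Return the vector for `word` (lowercased), or None if it is out of vocabulary."""
    return word_to_vec_map.get(word.lower())


# Tiny stand-in for the real GloVe map (made-up 3-d vectors, for illustration only)
toy_map = {"sea": np.array([0.1, 0.2, 0.3], dtype=np.float32)}

print(lookup("Sea", toy_map))    # found via the lowercase key
print(lookup("zzzxq", toy_map))  # out of vocabulary
```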

## 4. Doc2Vec
Since this is the main embedding method I will use in my pipeline, I show how I put this embedding to use in the next notebook. {to be edited later on}

## 5. Hugging Face
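Hugging Face is a company that maintains the `transformers` library and a hub of pretrained NLP models. I explore their encoders.

BERT may not be a good option here, because a large part of its pretraining data is Wikipedia text, which reads very differently from tweets.

I am not sure what Doc2Vec is trained on (as far as I know, it is usually trained on one's own corpus rather than shipped pretrained). I think my results will be better if I find a Twitter-based word embedding!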

## 6. fastText

fastText is an embedding method developed by the Facebook AI team.

Like BERT, it doesn't seem to be a good option at first glance, since its pretrained vectors depend heavily on Wikipedia for training.

Nevertheless, I will try it and evaluate whether it works well.

# Scaling the data 

Before clustering, let's make scaled versions of the dataset, which helps distance-based clustering methods. In general, these vectors shouldn't really need scaling, but it may help for computational purposes. I only do this for my count vectorized and TF-IDF vectorized data, not for the ones with the more meaningful word embeddings.
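The scaler itself isn't named here, so the following is only a sketch under an assumption: it uses scikit-learn's `MaxAbsScaler`, which suits sparse count and TF-IDF matrices because it never shifts values away from zero. The small `counts` matrix is an invented stand-in for the vectorized data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Small stand-in for a count-vectorized matrix (illustrative values)
counts = csr_matrix(np.array([[2, 0, 1],
                              [0, 4, 0],
                              [1, 2, 0]], dtype=np.float64))

# MaxAbsScaler divides each column by its largest absolute value,
# so zeros stay zero and the matrix stays sparse
scaler = MaxAbsScaler()
counts_scaled = scaler.fit_transform(counts)

print(counts_scaled.toarray())
```

`StandardScaler(with_mean=False)` would be another option for sparse input; centering is disabled in that case because subtracting the mean would make the matrix dense.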