### Count Vectorization
The simplest way to represent a document as a vector is with a **bag-of-words Count Vectorizer**. It turns each document into a 1-D array of term counts, which I think is a good starting point for my clustering algorithms. Let's see how it works.

By default, the Count Vectorizer expects each document as a single string, not a tokenized list, so every row of the input series must be a string. This ensures that every point in the clustering represents an entire tweet rather than isolated vectorized words.

I set the `min_df` parameter to 5 so that only terms appearing in at least 5 documents are included in my **count vectorized** data.
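To make the `min_df` cut-off concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix. It is not scikit-learn's implementation: the `count_vectorize` helper, the whitespace tokenization, and the toy tweets are all assumptions made for illustration. The point is that `min_df` filters on the number of documents containing a term, not on the term's total count.

```python
from collections import Counter

def count_vectorize(documents, min_df=1):
    """Toy bag-of-words (hypothetical helper): one count row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep only the terms that appear in at least `min_df` documents.
    vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Made-up tokenized tweets; they must be joined back into strings first.
tokenized_tweets = [
    ["prime", "prime", "prime", "order"],
    ["order", "refund"],
    ["refund", "late"],
]
documents = [" ".join(tokens) for tokens in tokenized_tweets]

vocabulary, rows = count_vectorize(documents, min_df=2)
print(vocabulary)
print(rows)
```

Here "prime" occurs three times in total but only in one document, so it is dropped when `min_df=2`, while "order" and "refund" each occur in two documents and are kept.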

In [55]:
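import numpy as np
import pickle as pkl
import joblib
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer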

ModuleNotFoundError: No module named 'matplotlib'

In [20]:
# %load C:\Sagar Study\ML and Learning\Projects\customer-support-bot\amazon_customer_support\test_environment.py
import sys
import numpy as np 

REQUIRED_PYTHON = "python3"


def main():
    system_major = sys.version_info.major
    if REQUIRED_PYTHON == "python":
        required_major = 2
    elif REQUIRED_PYTHON == "python3":
        required_major = 3
    else:
        raise ValueError("Unrecognized python interpreter: {}".format(
            REQUIRED_PYTHON))

    if system_major != required_major:
        raise TypeError(
            "This project requires Python {}. Found: Python {}".format(
                required_major, sys.version))
    else:
        print(">>> Development environment passes all tests!")


if __name__ == '__main__':
    main()


>>> Development environment passes all tests!


In [21]:
print(REQUIRED_PYTHON)

python3


ImportError: attempted relative import with no known parent package

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Get the feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = vectorizer.get_feature_names_out()

# Print the count vectorized representation
print(X.toarray())

# Print the feature names
print(feature_names)

In [1]:
# Vectorizing the data with Count Vectorizer
# (string_processed_inbound is assumed to come from the earlier preprocessing step)
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words = CountVectorizer(min_df=5)
inbound_cv = bag_of_words.fit_transform(string_processed_inbound)

### 2. TFIDF

In [None]:
tfidf = TfidfVectorizer(min_df=5, ngram_range = (1,3))
# Storing tfidf data and transforming them into sparse matrices
inbound_tfidf = tfidf.fit_transform(string_processed_inbound)
inbound_tfidf
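As a rough illustration of what the TFIDF weights mean, the sketch below computes them by hand with the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is how scikit-learn documents its default settings. The `tfidf_rows` helper, the whitespace tokenization, and the toy documents are assumptions; `min_df` and `ngram_range` are not modelled here.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Toy TF-IDF (hypothetical helper): unigrams only, whitespace tokens."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(doc_freq)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize each row so that document length does not dominate.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

documents = ["order late", "order refund", "order cancelled"]  # made-up tweets
vocabulary, rows = tfidf_rows(documents)
late = rows[0][vocabulary.index("late")]
order = rows[0][vocabulary.index("order")]
print(late > order)
```

"order" appears in every document, so it carries less weight in the first row than "late", which appears only once in the corpus.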

## Pretrained word embeddings
While bag-of-words approaches capture which words are present in a text, they discard word order. This limitation motivates more sophisticated text vectorization techniques: pretrained embeddings map text to dense vectors that reflect meaning, which should let tweets with the same intent land close together.
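A quick way to see the word-order limitation is to compare two sentences that use the same words in a different order. This is a small sketch with made-up sentences and a hypothetical `bag_of_words` helper, not part of the pipeline.

```python
from collections import Counter

def bag_of_words(text):
    """Count whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())

a = bag_of_words("the package arrived before the refund")
b = bag_of_words("the refund arrived before the package")

# Both sentences contain exactly the same words, so their counts are identical.
print(a == b)  # True
```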

#### 3. GloVe

I'll explore various word embeddings to encode text differently and assess their effectiveness, beginning with GloVe, an unsupervised learning algorithm for generating word vector representations.

Gensim simplifies the use of pretrained word embeddings, offering a specialized data format for easy loading into a numpy array.

Note: I won't utilize this method because my clustering algorithms require tweets to represent single points, while this method transforms individual words. I'm keeping this note in my notebook to track progress.


In [52]:
%%time
with open("C:\\Sagar Study\\ML and Learning\\Projects\\customer-support-bot\\amazon_customer_support\\objects\\glove.6B.50d.txt", "r", encoding="utf-8") as file: 
    word_to_vec_map = {} 
    for line in file: 
        values = line.split() 
        curr_word = values[0] 
        coefs = np.array(values[1:], dtype=np.float32)  # Fix: Convert values to float
        word_to_vec_map[curr_word] = coefs 

CPU times: total: 2.12 s
Wall time: 4.5 s


In [53]:
word_to_vec_map["enemy"]

array([ 0.70153 , -0.43853 ,  1.0509  , -0.33431 ,  0.67151 , -0.17677 ,
        0.80079 ,  0.90158 , -0.29513 , -0.65586 , -0.083404,  0.35146 ,
       -0.41071 ,  0.29446 , -1.1955  , -0.45611 ,  0.56877 ,  0.073522,
       -1.2616  ,  0.22276 , -0.57735 ,  0.12075 ,  0.54712 , -0.34094 ,
        0.2164  , -1.804   , -0.70362 , -0.56337 ,  1.8773  ,  0.1301  ,
        2.271   , -0.25882 , -0.46309 , -0.7759  , -0.22926 ,  0.62156 ,
       -0.043353, -0.60943 , -1.6791  , -0.018271,  0.53893 , -0.50689 ,
        0.88454 , -0.11158 ,  0.57013 , -0.69098 , -0.43072 , -0.45332 ,
       -0.27984 , -0.056133], dtype=float32)

In [28]:
word_to_vec_map["Sea"]

KeyError: 'Sea'

The `KeyError` above occurs because the `glove.6B` vocabulary is uncased: every key is lowercase, so words must be lowercased before lookup.

word_to_vec_map["sea"]
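Since the `glove.6B` vocabulary is uncased, a small lookup helper avoids the `KeyError` seen with "Sea". This is only a sketch: the `lookup` function and the two-dimensional toy vectors are made up for illustration and stand in for the real 50-dimensional map.

```python
def lookup(word, word_to_vec_map):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    return word_to_vec_map.get(word.lower())

# Made-up 2-d vectors standing in for the real GloVe map.
toy_map = {"sea": [0.1, 0.2], "enemy": [0.3, 0.4]}

print(lookup("Sea", toy_map))     # found after lowercasing
print(lookup("kraken", toy_map))  # None: not in the toy vocabulary
```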

## 4. Doc2Vec
Since this is the main embedding method I will use in my pipeline, I show how I used this embedding in the next notebook. {to be edited later on}

## 5. Hugging Face
Hugging Face is a company that builds widely used open-source NLP tooling and hosts pretrained models. I explore their encoders.

BERT wouldn't really be a good option because a large part of its pretraining data is Wikipedia text, which reads very differently from tweets.

I am not sure what Doc2Vec is trained on. I think my results will be better if I find a Twitter-based word embedding!

## 6. fastText

fastText is an embedding method developed by Facebook's AI research team.

Like BERT, it doesn't seem to be a good option at first glance, because its pretrained vectors rely heavily on Wikipedia text.

Nevertheless, I will try it and evaluate whether it works well.

# Scaling the data 

Before we cluster, let's make scaled versions of our dataset, which helps distance-based clustering methods. In general, these vectors shouldn't really need scaling, but it may speed up computation. I only do this for my count vectorized and TFIDF vectorized data, not for the more meaningful word embeddings.
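The scaler used in the next cell is `MaxAbsScaler`, which divides each feature by its largest absolute value. A minimal pure-Python sketch of that idea follows; the `max_abs_scale` helper and the toy count matrix are assumptions for illustration. Because nothing is shifted, zeros stay zeros, which is why this kind of scaling suits sparse count data.

```python
def max_abs_scale(rows):
    """Divide each column by its maximum absolute value (toy helper)."""
    n_cols = len(rows[0])
    col_max = [max(abs(row[j]) for row in rows) for j in range(n_cols)]
    return [
        # An all-zero column is left at zero instead of dividing by zero.
        [row[j] / col_max[j] if col_max[j] else 0.0 for j in range(n_cols)]
        for row in rows
    ]

counts = [[2, 0, 1], [4, 0, 0], [0, 0, 3]]  # made-up count matrix
scaled = max_abs_scale(counts)
print(scaled)
```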

In [None]:
# Fitting and transforming to create scaled versions of the data with MaxAbsScaler, which keeps sparse matrices sparse
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
inbound_cv = scaler.fit_transform(inbound_cv)
outbound_cv = scaler.fit_transform(outbound_cv)

In [None]:
inbound_cv[:5].toarray()  # sparse matrices have no .head(); show the first 5 rows densely

In [None]:
outbound_cv[:5].toarray()  # sparse matrices have no .head(); show the first 5 rows densely

# Tweet Collection With Clustering

Training Consideration: Because I will be working with neural networks in the next phase of my project, my objective is to identify approximately `x` distinct intents in the data through Exploratory Data Analysis (EDA). Each intent should ideally have around 1000 similar tweets, which will serve as the training data for that intent, generated with Gensim's `model.docvecs...` (complete next) function.

I aim to employ clustering methods and topic modeling techniques on my dataset to extract prominent themes, intending to manually assign labels to these clusters.

Achieving successful outcomes will be more feasible if my dataset pertains uniformly to the domain of customer service. I've taken care to ensure that this Twitter data related to Amazon primarily focuses on this domain. This careful curation aims to enable my upcoming model to capture the subtleties required for intent classification. In essence, the conversational efficacy of the bot largely depends on the language it's been trained on.

I understand that clustering algorithms may not group the data exactly as I anticipate beforehand. Clustering is an algorithmic process and might not neatly group intents together as desired. However, I will follow my curiosity and see whether any insightful trends emerge.

## <font color='blue'>1. K Means</font>
My first approach for clustering my word vectors is K-Means, which tends to perform well on blob-shaped clusters.
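For intuition, here is a minimal pure-Python sketch of the K-Means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not scikit-learn's implementation: the helper names, the toy points, and the deterministic farthest-point initialization (a simple stand-in for k-means++, not the same thing) are all assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def init_centroids(points, k):
    # Deterministic farthest-point start (assumption; not k-means++).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids

def kmeans(points, k, max_iter=300):
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # WCSS (inertia): total squared distance of points to their centroid.
    wcss = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wcss

# Two made-up, well-separated blobs.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids, wcss = kmeans(points, k=2)
print(labels, centroids, wcss)
```

On blobs like these, the two centroids settle at the blob centers after one update, which is the situation where K-Means works best.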

A drawback is that it is very slow on a dataset of this size, and picking the value of K is hard: I don't even know how many intents there are in the data. This is why I start with large jumps in K to get a high-level idea of which range performs best, then search more finely to decide which K best matches the number of intents in my dataset.
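One way to compare candidate values of K is the silhouette score, which the training loop in this notebook records through `silhouette_score`. The sketch below computes it by hand with Euclidean distance; the `silhouette` helper and the toy points are assumptions for illustration, and it needs at least two clusters and Python 3.8+ for `math.dist`.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient (toy helper; needs at least 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # made-up points
good = silhouette(points, [0, 0, 1, 1])  # clusters match the two groups
bad = silhouette(points, [0, 1, 0, 1])   # clusters cut across the groups
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters, while scores near or below 0 indicate overlapping or misassigned points.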

## 1.1. K-Means for my TFIDF and Count Vectorized Data
First I cluster the TFIDF and Count Vectorized data. Honestly, I won't really be expecting good results from it, so I won't spend a lot of effort doing this actual clustering on these two. But it's worth a shot to demonstrate an older and suboptimal approach.


I'm running the K-Means algorithm on my entire dataset while tuning the `n_clusters` hyperparameter. The first progress bar tracks progress across the datasets, while the second tracks progress across the different values of `n_clusters`.

Running K-Means for 10 values of K, from 10 to 100, on both data types took approximately three hours before the scaling step. Thankfully, after scaling, training was noticeably faster.

Below this section, I save my results with Python's serialization package, `pickle`, so that I don't need to rerun the process.
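The save-and-reload pattern can be checked in isolation with a small round trip. This is only a sketch: the dictionary contents are made-up values, and a temporary directory stands in for the notebook's `objects/` folder.

```python
import os
import pickle
import tempfile

results = {"example_wcss": [3.0, 2.0, 1.5]}  # made-up values, not real results

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.pkl")
    # Write in binary mode; HIGHEST_PROTOCOL gives the most compact format.
    with open(path, "wb") as handle:
        pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)
    # Read it back exactly as the notebook does for its saved metrics.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == results)  # True
```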

In [None]:
%%time
# My grand dictionaries that will store all my results
wcss_grand = {}
labels_grand = {}
silhouette_scores_grand = {}
n_clusters = [10,20,30,40,50,60,70,80,90,100]

# Iterating through all the differently embedded data
# (keys are the dataset names, values are the vectorized data)
for name, dataset in tqdm(vectorized_data.items()):
    
    # I store my metrics at these following lists
    wcss = []
    labels = []
    silhouette_scores = []
    
    # Looping through values of k
    for k in tqdm(n_clusters):    
        print(f'Currently fitting {name} with {k} clusters... Please wait')
        
        # Initializing with k-means++ helps avoid the random initialization trap.
        kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state = 10)
        kmeans.fit(dataset)
        wcss.append(kmeans.inertia_)
        
        # Getting the silhouette score
        labels.append(kmeans.labels_)
        silhouette_scores.append(silhouette_score(dataset, kmeans.labels_))
        
        # Saving the models
        filename = f'models/kmeans/{name}-{k}neighbors.sav'
        joblib.dump(kmeans, filename)
        
    # Updating grand dictionary
    wcss_grand[name + '_wcss'] = wcss
    labels_grand[name + '_labels'] = labels
    silhouette_scores_grand[name + '_silhouettes'] = silhouette_scores

# Saving all my results (pickle is imported as pkl at the top of the notebook)
with open('objects/wcss_grand.pkl', 'wb') as handle:
    pkl.dump(wcss_grand, handle, protocol=pkl.HIGHEST_PROTOCOL)

with open('objects/labels_grand.pkl', 'wb') as handle:
    pkl.dump(labels_grand, handle, protocol=pkl.HIGHEST_PROTOCOL)

with open('objects/silhouette_scores_grand.pkl', 'wb') as handle:
    pkl.dump(silhouette_scores_grand, handle, protocol=pkl.HIGHEST_PROTOCOL)

### Reading back in the results

In [None]:
# Storing it into objects I can use in this notebook

with open('objects/wcss_grand.pkl', 'rb') as handle: # Change path 
    wcss_grand = pkl.load(handle)
with open('objects/labels_grand.pkl','rb') as handle: # Change path
    labels_grand = pkl.load(handle)
with open('objects/silhouette_scores_grand.pkl','rb') as handle: # Change path 
    silhouette_scores_grand = pkl.load(handle)

## Finding the Best K-Means Model
I will try creating an elbow plot to see whether a clear elbow exists. I will do so on both the **count vectorized** and **TFIDF** features.
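When the elbow is hard to judge by eye, one rough heuristic is to pick the K where the WCSS curve bends most, measured by the largest second difference. This is a sketch under assumptions: the `elbow_k` helper is hypothetical, the WCSS values are made up rather than taken from my results, and the heuristic only makes sense for evenly spaced values of K.

```python
def elbow_k(k_values, wcss):
    """Return the K with the largest second difference in WCSS (heuristic)."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(wcss) - 1):
        # Drop before this K minus drop after it: large means a sharp bend.
        bend = (wcss[i - 1] - wcss[i]) - (wcss[i] - wcss[i + 1])
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k

k_values = [10, 20, 30, 40, 50]
toy_wcss = [1000.0, 600.0, 300.0, 280.0, 265.0]  # made-up, not real results
best = elbow_k(k_values, toy_wcss)
print(best)
```

In the toy curve the WCSS falls steeply until K = 30 and flattens afterwards, so the heuristic returns 30. A curve without a clear bend will still return some K, so the plot should be inspected as well.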

In [None]:
# Elbow Plot count vectorized
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), wcss_grand['inbound_cv_ma_wcss'], color = 'magenta')
plt.title('Elbow Method (Count Vectorized)')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Elbow Plot tfidf
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), wcss_grand['inbound_tfidf_ma_wcss'], color = 'magenta')
plt.title('Elbow Method (TFIDF)')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Silhouette Plot count vectorized
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), silhouette_scores_grand['inbound_cv_ma_silhouettes'], color = 'red')
plt.title('Silhouette Method (Count Vectorized)')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

# Silhouette Plot tfidf
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), silhouette_scores_grand['inbound_tfidf_ma_silhouettes'], color = 'red')
plt.title('Silhouette Method (TFIDF)')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

We see that these plots don't really change much between the TFIDF and Count Vectorized data, which supports my statement above that these features aren't going to be the most useful.

## Visualizing my clusters with t-SNE 

I try different color maps and choose one so it's easier to distinguish between clusters.