# Intent Clustering, Document Embeddings, and Unsupervised Learning 

In this notebook, each preprocessed tweet from the previous notebook is assigned a cluster label using document embedding methods and unsupervised learning techniques such as K-Means, DBSCAN, LDA, and t-SNE.

My objective is to run a heuristic search with clustering algorithms and let them surface the intents (although I won't be able to tell which intent each cluster actually represents). This way, at least I will be able to estimate the number of different intents in the data.

In [17]:
# %load C:\Sagar Study\ML and Learning\Projects\customer-support-bot\amazon_customer_support\test_environment.py
import sys

REQUIRED_PYTHON = "python3"


def main():
    system_major = sys.version_info.major
    if REQUIRED_PYTHON == "python":
        required_major = 2
    elif REQUIRED_PYTHON == "python3":
        required_major = 3
    else:
        raise ValueError("Unrecognized python interpreter: {}".format(
            REQUIRED_PYTHON))

    if system_major != required_major:
        raise TypeError(
            "This project requires Python {}. Found: Python {}".format(
                required_major, sys.version))
    else:
        print(">>> Development environment passes all tests!")


if __name__ == '__main__':
    main()


>>> Development environment passes all tests!


In [18]:
print(REQUIRED_PYTHON)

python3


In [66]:
# Standard Libraries
import os 
import time 
import pickle
import joblib 


# Data Munging and Mathematical Operation
import numpy as np 
import pandas as pd 

# tqdm progress bars 
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

# Sklearn Transformers, Utilities, and Metrics
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import silhouette_score, silhouette_samples

# Data Visualization 
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.manifold import TSNE

# Word Embeddings (CountVectorizer, TFIDF, HashingVectorizer, Word2Vec, GloVe, FastText, Doc2Vec)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
import gensim


# Doc2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile, common_texts

In [21]:
# Import processed_v1 data 
processed_v1_df = pd.read_pickle('../data/processed/processed_v1.pkl').reset_index(drop=True)

# An overview 
processed_v1_df.head()

Unnamed: 0,inbound_text,author_id,created_at,outbound_text,response_tweet_id,inbound_lang,inbound_hashtags,outbound_hashtags,clean_inbound_text,clean_outbound_text,outbound_tokens_pos,inbound_tokens_pos
0,@AmazonHelp 3 different people have given 3 di...,AmazonHelp,2017-10-31 23:28:00+00:00,@115820 We'd like to take a further look into ...,619.0,en,[],[],different people have given different answers ...,wed like to take a further look into this with...,"[-PRON-: NOUN, d: VERB, like: VERB, to: NOUN, ...","[different: NOUN, people: NOUN, have: NOUN, gi..."
1,Way to drop the ball on customer service @1158...,AmazonHelp,2017-10-31 22:29:00+00:00,@115820 I'm sorry we've let you down! Without ...,616.0,en,[],[],way to drop the ball on customer service so pi...,i am sorry we have let you down without provid...,"[i: NOUN, be: NOUN, sorry: NOUN, -PRON-: NOUN,...","[way: NOUN, to: NOUN, drop: VERB, the: NOUN, b..."
2,@115823 I want my amazon payments account CLOS...,AmazonHelp,2017-10-31 22:28:34+00:00,@115822 I am unable to affect your account via...,,en,[],[],i want my amazon payments account closed dm me...,i am unable to affect your account via twitter...,"[i: NOUN, be: NOUN, unable: NOUN, to: NOUN, af...","[i: NOUN, want: VERB, -PRON-: NOUN, amazon: NO..."
3,@AmazonHelp @115826 Yeah this is crazy we’re l...,AmazonHelp,2017-11-01 12:53:34+00:00,@115827 Thanks for your patience. ^KM,,en,[],[],yeah this is crazy were less than a week away ...,thanks for your patience km,"[thank: NOUN, for: NOUN, -PRON-: NOUN, patienc...","[yeah: NOUN, this: NOUN, be: NOUN, crazy: NOUN..."
4,@115828 How about you guys figure out my Xbox ...,AmazonHelp,2017-10-31 22:28:00+00:00,@115826 I'm sorry for the wait. You'll receive...,627.0,en,[],[],how about you guys figure out my xbox one x pr...,i am sorry for the wait you will receive an em...,"[i: NOUN, be: NOUN, sorry: NOUN, for: NOUN, th...","[how: NOUN, about: NOUN, -PRON-: NOUN, guy: NO..."


## Document Embedding and Vectorization 
It's necessary to represent the inbound requests as numeric vectors so that the clustering algorithms can measure the similarity between different queries/requests.

- Some popular Embedding Methods 
    - CountVectorizer 
    - TfidfVectorizer 
    - HashingVectorizer 
    - Word2Vec 
    - GloVe 
    - FastText 
    - Doc2Vec 

- Since I am working on sequences, they are bound to be of different lengths. Word-level embeddings produce one vector per word, so each tweet becomes a variable-length sequence of vectors. I can pad these to a fixed length, but the resulting 2-D representation isn't accepted by standard statistical clustering algorithms, which expect one fixed-length vector per sample. However, I can experiment with the following operations in the **upcoming iterations** and assess whether I get better results: 
    - **Averaging word embeddings**: Calculate the average of individual word embeddings in a sentence to obtain a single vector representing the sentence. While simple, this method might lose some sentence-specific information 
    - **Weighted average of word embeddings**: Assign weights to words based on their importance in the sentence (e.g., TFIDF weights) and calculate a weighted average of word embeddings 
    - **Doc2Vec/FastText with tagging**: Use techniques like Doc2Vec or FastText with additional document tags to infer document-level vectors that encapsulate the meanings of sentences/documents 

- For the time being, I will stick with four document-level representations going ahead: `CountVectorizer`, `TfidfVectorizer`, `HashingVectorizer`, and `Doc2Vec` 
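
The two averaging ideas listed above can be sketched as follows. This is a minimal illustration, not code from this project: `toy_vectors` and `toy_idf` are made-up 3-dimensional vectors and weights standing in for real GloVe vectors and TF-IDF weights, and both helper functions are hypothetical names.

```python
import numpy as np

# Toy 3-dimensional word vectors (hypothetical values, standing in for GloVe).
toy_vectors = {
    "order": np.array([1.0, 0.0, 0.0]),
    "late": np.array([0.0, 1.0, 0.0]),
    "refund": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(text, vectors, dim):
    """Mean of the vectors of all in-vocabulary words; zeros if none match."""
    hits = [vectors[w] for w in text.split() if w in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)

def weighted_embedding(text, vectors, weights, dim):
    """Weighted mean; `weights` maps word -> importance (e.g. an IDF value)."""
    words = [w for w in text.split() if w in vectors]
    if not words:
        return np.zeros(dim)
    w = np.array([weights.get(word, 1.0) for word in words])
    stacked = np.vstack([vectors[word] for word in words])
    return (w[:, None] * stacked).sum(axis=0) / w.sum()

toy_idf = {"order": 1.0, "late": 3.0, "refund": 2.0}  # hypothetical weights
plain = average_embedding("my order is late", toy_vectors, dim=3)
weighted = weighted_embedding("my order is late", toy_vectors, toy_idf, dim=3)
print(plain, weighted)
```

Either function returns one fixed-length vector per tweet regardless of how many words it contains, which is the shape the clustering algorithms need. The weighted version pulls the result toward the rarer, more informative word.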

### 1. Count Vectorization
The easiest way to represent a document as a vector is with a **bag of words Count Vectorizer**. This turns each document into a 1-D array of term counts, which I think is a good starting point for my clustering algorithms. Let's see how it works. 

By default, the Count Vectorizer expects an iterable of raw strings, not tokenized lists. Each row, representing a document, must be a string. This ensures that every point in the clustering encapsulates an entire sequence, rather than isolated vectorized words.

I set the `min_df` parameter to 5 so that only terms appearing in at least 5 documents are included in my **count vectorized** data. 

In [19]:
## Sample CountVectorizer Testing 
# Sample data


[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [22]:
processed_v1_df["clean_inbound_text"]

0         different people have given different answers ...
1         way to drop the ball on customer service so pi...
2         i want my amazon payments account closed dm me...
3         yeah this is crazy were less than a week away ...
4         how about you guys figure out my xbox one x pr...
                                ...                        
122335    i sent you guys a dm regarding the status of m...
122336    this is happening in my area w prime deliverie...
122337    got my at am thanks for fulfilling the order fast
122338    no exchange available for i need to exchange m...
122339    there should be bonus and gifts for regular cu...
Name: clean_inbound_text, Length: 122340, dtype: object

In [76]:
# Vectorizing inbound requests with Count Vectorizer 
start_time = time.time()
# Separate vectorizers so the outbound fit does not overwrite the inbound vocabulary
bag_of_words = CountVectorizer(min_df=5)
bag_of_words_outbound = CountVectorizer(min_df=5)

# Fitting the vectorizers and storing the count vectorized data as sparse matrices
inbound_cv = bag_of_words.fit_transform(processed_v1_df["clean_inbound_text"])
outbound_cv = bag_of_words_outbound.fit_transform(processed_v1_df["clean_outbound_text"])

end_time = time.time()  
print("Time Elapsed: {} seconds".format(end_time - start_time))

Time Elapsed: 3.8683745861053467 seconds


### 2. TFIDF

In [77]:
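start_time = time.time()
# Separate vectorizers so the outbound fit does not overwrite the inbound vocabulary
tfidf = TfidfVectorizer(min_df=5, ngram_range=(1, 3))
tfidf_outbound = TfidfVectorizer(min_df=5, ngram_range=(1, 3))

# Fitting the vectorizers and storing the tfidf data as sparse matrices
inbound_tfidf = tfidf.fit_transform(processed_v1_df["clean_inbound_text"])
outbound_tfidf = tfidf_outbound.fit_transform(processed_v1_df["clean_outbound_text"])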
end_time = time.time()  
print("Time Elapsed: {} seconds".format(end_time - start_time))

Time Elapsed: 13.996706247329712 seconds


In [30]:
print(inbound_tfidf.toarray())

MemoryError: Unable to allocate 93.6 GiB for an array with shape (122340, 102638) and data type float64
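
The `MemoryError` above comes from `.toarray()`, which tries to materialize every zero of the sparse matrix. A quick sketch of the arithmetic, plus a way to peek at a few rows without densifying everything. The `demo` matrix is a small random stand-in that I made up for illustration, not the real `inbound_tfidf`.

```python
import numpy as np
from scipy import sparse

# Shape and dtype taken from the MemoryError message above.
n_rows, n_cols = 122340, 102638
bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes
dense_gib = n_rows * n_cols * bytes_per_value / 2**30
print(f"Dense array would need about {dense_gib:.1f} GiB")

# Small random stand-in for inbound_tfidf (hypothetical data, same CSR format).
demo = sparse.random(1000, 5000, density=0.001, format="csr", random_state=0)

# CSR storage only holds the non-zero values plus their index arrays.
sparse_bytes = demo.data.nbytes + demo.indices.nbytes + demo.indptr.nbytes
dense_bytes = demo.shape[0] * demo.shape[1] * demo.dtype.itemsize
print(sparse_bytes, dense_bytes)

# Densify only the rows you want to look at.
first_rows = demo[:5].toarray()
```

For the real data, `inbound_tfidf[:5].toarray()` would be the equivalent inspection step.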

## Pretrained word embeddings
While bag-of-words approaches can capture the presence of words in a text, they fail to preserve the crucial aspect of word order. This limitation necessitates the exploration of more sophisticated text vectorization techniques. With these more advanced methods, we can encode tweets into more meaningful vector representations.

### 3. GloVe 

I'll explore various word embeddings to encode text differently and assess their effectiveness, beginning with GloVe word embeddings—an unsupervised learning algorithm for generating word vector representations.

Gensim simplifies the use of pretrained word embeddings, offering a specialized data format for easy loading into a numpy array.

Note: I won't use this method for clustering because my clustering algorithms require each tweet to be represented as a single fixed-length vector, while this method embeds individual words. I'm keeping this note in my notebook to track progress.


In [31]:
%%time
with open("../objects/glove.6B.100d.txt", "r", encoding="utf-8") as file: 
    word_to_vec_map = {} 
    for line in file: 
        values = line.split() 
        curr_word = values[0] 
        coefs = np.array(values[1:], dtype=np.float32)  # Fix: Convert values to float
        word_to_vec_map[curr_word] = coefs 

CPU times: total: 4.89 s
Wall time: 9.07 s


In [53]:
processed_v1_df["clean_inbound_text"][5]

'why is my order at my local courier for the last days and still has not been delivered to me over week late 😡'

**Observations** 
- Oh, I forgot to remove emojis in my first preprocessing step. I will come back to it. For the time being, let's see how the other models embed them.

In [57]:
def text_to_word_sequence(text): 
    return text.split()

# Apply embeddings to text data 
def text_to_embeddings(text): 
    # Split text into words 
    words = text_to_word_sequence(text)
    # Initialize empty list of embeddings 
    embeddings_list = []
    # Loop through words 
    for word in words: 
        # If word exists in embeddings, append to embeddings list 
        if word in word_to_vec_map: 
            embeddings_list.append(word_to_vec_map[word])
    # Return embeddings list 
    return embeddings_list  

processed_v1_df["inbound_text_glove"] = processed_v1_df["clean_inbound_text"].progress_apply(text_to_embeddings)

progress-bar: 100%|██████████| 122340/122340 [00:04<00:00, 29371.80it/s]


In [67]:
inbound_text_glove = processed_v1_df["inbound_text_glove"]

# Store inbound_text_glove as a pickle file
with open("../objects/inbound_text_glove.pkl", "wb") as file: 
    pickle.dump(inbound_text_glove, file, protocol=pickle.HIGHEST_PROTOCOL)

# Drop inbound_text_glove column
processed_v1_df.drop(columns=["inbound_text_glove"], inplace=True)

### 4. Doc2Vec
Main algorithm for my document embeddings, to be included later 

### 5. Hugging Face
- Search for Twitter trained sentence embeddings 

### 6. FastText 
- An embedding methodology developed by Facebook AI team
- Trained on Wikipedia text, doesn't seem like the ideal choice 
- Nevertheless, we will try and evaluate if it works well 

## Scaling the data 

Let's create scaled versions of our dataset before clustering. This is particularly useful for distance-based clustering techniques. Normally, these vectors don't require scaling, but it could enhance computational efficiency. I apply scaling only to my count vectorized and TF-IDF vectorized data, not to the ones with more meaningful word embeddings.


In [82]:
# Fitting and transforming to create scaled versions of the data with MaxAbsScaler, which keeps sparse matrices sparse
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()

scaled_inbound_cv = scaler.fit_transform(inbound_cv)
scaled_outbound_cv = scaler.fit_transform(outbound_cv)
scaled_inbound_tfidf = scaler.fit_transform(inbound_tfidf)  
scaled_outbound_tfidf = scaler.fit_transform(outbound_tfidf)    
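
A small sketch of why `MaxAbsScaler` suits sparse data: it divides each column by its maximum absolute value, so zeros stay zeros and the matrix stays sparse. The `counts` matrix holds made-up values for illustration. `StandardScaler` is shown only for contrast, since it accepts sparse input only when centering is switched off.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Tiny stand-in for a count matrix (hypothetical values).
counts = sparse.csr_matrix(np.array([
    [0, 2, 0],
    [4, 0, 0],
    [0, 1, 5],
], dtype=np.float64))

# MaxAbsScaler divides each column by its max absolute value: zeros stay zeros.
scaled = MaxAbsScaler().fit_transform(counts)
print(scaled.toarray())

# StandardScaler can only handle sparse input if centering is disabled.
std_scaled = StandardScaler(with_mean=False).fit_transform(counts)
```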

In [72]:
inbound_cv.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [84]:
scaled_inbound_cv.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

# Tweet Collection With Clustering

Training consideration: because I will be working with neural networks in the next phase of my project, my objective is to identify approximately `x` distinct intents in the data through Exploratory Data Analysis (EDA). Each intent should ideally have about 1000 similar tweets, which I can gather as training data for that intent with Gensim's document-vector `most_similar` function (`model.dv.most_similar` in Gensim 4, `model.docvecs.most_similar` in earlier versions).

I aim to employ clustering methods and topic modeling techniques on my dataset to extract prominent themes, intending to manually assign labels to these clusters.

Achieving successful outcomes will be more feasible if my dataset pertains uniformly to the domain of customer service. I've taken care to ensure that this Twitter data related to Amazon primarily focuses on this domain. This careful curation aims to enable my upcoming model to capture the subtleties required for intent classification. In essence, the conversational efficacy of the bot largely depends on the language and domain it's been trained on.

With respect to the nature of clustering algorithms, I understand that they may not group the data exactly as I anticipate. It's an algorithmic process and might not cleanly cluster intents together as desired. However, I will follow my curiosity and look for insightful trends, if there are any!

## <font color='blue'>1. K Means</font>
My first approach for clustering my document vectors is K-Means, which tends to perform well on blob-shaped clusters.

A drawback is that it is very slow, and picking the value for K is hard - I don't even know how many intents there are in the data. This is why I start with larger jumps of K to get a higher level idea of which performs the best, then I dive deeper to finally decide what K works the best for finding the optimal number of intents in my dataset.
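
The coarse-then-fine search for K described above can be sketched like this. It is an illustration on synthetic blobs, not on the tweet data, and `best_k` is a hypothetical helper that picks K by silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four well-separated blobs (hypothetical).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

def best_k(data, candidates):
    """Return the candidate K with the highest silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return max(scores, key=scores.get), scores

# Pass 1: large jumps of K to find the promising region.
coarse_k, coarse_scores = best_k(X, [2, 4, 6, 8])

# Pass 2: the coarse winner plus the values between it and its neighbours.
fine_candidates = [k for k in range(coarse_k - 1, coarse_k + 2) if k >= 2]
fine_k, fine_scores = best_k(X, fine_candidates)
print(coarse_k, fine_k)
```

On the real data the coarse grid would be the jumps of 10 used in this notebook, with the fine pass covering the values around the best coarse K.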

## 1.1. K-Means for Count Vectorized and Tfidf data
First I cluster the TFIDF and Count Vectorized data. Honestly, I won't really be expecting good results from it, so I won't spend a lot of effort doing this actual clustering on these two. But it's worth a shot to demonstrate an older and suboptimal approach.


I'm currently running the K-Means algorithm on my entire dataset. Alongside this, I'm performing hyperparameter optimization specifically on the `n_clusters` parameter. The outer progress bar tracks the datasets being clustered, while the inner bar tracks the different values of `n_clusters`.

Running K-Means for 10 values of K, from 10 to 100, on both data types took approximately three hours before the scaling step. Thankfully, post-scaling, the training procedure noticeably sped up.

Below this section, I've saved my results using Python's serialization package, `pickle`, to avoid the need to rerun the process!
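
If the runtime becomes a bottleneck, one option (not what this notebook does) is to swap in `MiniBatchKMeans` and to compute the silhouette score on a random subset via the `sample_size` argument. A minimal sketch on random sparse stand-in data, with all sizes chosen arbitrarily:

```python
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

# Random sparse stand-in for the vectorized tweets (hypothetical data).
X = sparse.random(2000, 300, density=0.02, format="csr", random_state=0)

# MiniBatchKMeans updates centroids from small random batches instead of
# the full dataset at every iteration.
model = MiniBatchKMeans(n_clusters=10, batch_size=256, n_init=3, random_state=10)
labels = model.fit_predict(X)

# silhouette_score is quadratic in the number of samples; sample_size
# scores a random subset instead of all rows.
score = silhouette_score(X, labels, sample_size=500, random_state=10)
print(f"inertia={model.inertia_:.2f}, sampled silhouette={score:.3f}")
```

The trade-off is that both the centroids and the silhouette value become approximations, so they are best used for narrowing down K rather than for final numbers.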

In [87]:
# Vectorized data
vectorized_data = {"scaled_inbound_cv": scaled_inbound_cv, "scaled_inbound_tfidf": scaled_inbound_tfidf}

# Briefly showing the contents of i and j
for i, j in enumerate(vectorized_data.items()): print(i, j)

0 ('scaled_inbound_cv', <122340x8365 sparse matrix of type '<class 'numpy.float64'>'
	with 1941976 stored elements in Compressed Sparse Row format>)
1 ('scaled_inbound_tfidf', <122340x102638 sparse matrix of type '<class 'numpy.float64'>'
	with 4094473 stored elements in Compressed Sparse Row format>)


In [89]:
%%time
# My grand dictionaries that will store all my results
wcss_grand = {}
labels_grand = {}
silhouette_scores_grand = {}
n_clusters = [10,20,30,40,50,60,70,80,90,100]

# Iterating through all the differently embedded data
for i,j in tqdm(enumerate(vectorized_data.items())): 
    name = j[0] # Here j[0] is the name of the dataset
    dataset = j[1] # And j[1] is the actual data
    
    # I store my metrics at these following lists
    wcss = []
    labels = []
    silhouette_scores = []
    
    # Looping through values of k
    for k in tqdm(n_clusters):    
        print(f'Currently fitting {name} with {k} clusters... Please wait')
        
        # Initializing with k-means++ helps avoid falling into the random initialization trap.
        kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state = 10)
        kmeans.fit(dataset)
        wcss.append(kmeans.inertia_)
        
        # Getting the silhouette score
        labels.append(kmeans.labels_)
        silhouette_scores.append(silhouette_score(dataset, kmeans.labels_))
        
        # Saving the models
        filename = f'models/kmeans/{name}-{k}neighbors.sav'
        joblib.dump(kmeans, filename)
        
    # Updating grand dictionary
    wcss_grand[name + '_wcss'] = wcss
    labels_grand[name + '_labels'] = labels
    silhouette_scores_grand[name + '_silhouettes'] = silhouette_scores

# Saving all my results
with open("../objects/wcss_grand.pkl", "wb") as file:
    pickle.dump(wcss_grand, file, protocol=pickle.HIGHEST_PROTOCOL)

with open("../objects/labels_grand.pkl", "wb") as file:
    pickle.dump(labels_grand, file, protocol=pickle.HIGHEST_PROTOCOL)
    
with open("../objects/silhouette_scores_grand.pkl", "wb") as file:
    pickle.dump(silhouette_scores_grand, file, protocol=pickle.HIGHEST_PROTOCOL)

CPU times: total: 0 ns
Wall time: 1.96 ms


### Reading back in the results

In [91]:
# Storing it into objects I can use in this notebook

with open("../objects/wcss_grand.pkl", "rb") as file:
    wcss_grand = pickle.load(file)
with open("../objects/labels_grand.pkl", "rb") as file: 
    labels_grand = pickle.load(file)
with open("../objects/silhouette_scores_grand.pkl", "rb") as file: 
    silhouette_scores_grand = pickle.load(file)

In [96]:
silhouette_scores_grand

{}

## Finding the Best K-Means Model
I will try creating an elbow plot to see if there exists a clear elbow. I will do so on both the **count vectorized** and **tfidf** features
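
Besides eyeballing the plot, a rough numeric check for an elbow is to look for the largest second difference of the WCSS curve. The sketch below uses made-up WCSS values purely for illustration; they are not results from this notebook.

```python
import numpy as np

ks = np.array([10, 20, 30, 40, 50, 60])
# Hypothetical WCSS values, for illustration only (not results from this notebook).
wcss = np.array([1000.0, 700.0, 400.0, 380.0, 365.0, 355.0])

# Second difference of the curve: large where the slope flattens abruptly.
second_diff = np.diff(wcss, n=2)

# second_diff[i] is centred on ks[i + 1], hence the +1 offset.
elbow_k = int(ks[np.argmax(second_diff) + 1])
print(elbow_k)
```

This heuristic is sensitive to noise, so I would treat it as a sanity check on the plot rather than a replacement for it.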

In [92]:
# Keys follow the '<dataset name>_<metric>' pattern written by the K-Means cell above

# Elbow Plot count vectorized
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), wcss_grand['scaled_inbound_cv_wcss'], color = 'magenta')
plt.title('Elbow Method (Count Vectorized)')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Elbow Plot tfidf
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), wcss_grand['scaled_inbound_tfidf_wcss'], color = 'magenta')
plt.title('Elbow Method (TFIDF)')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Silhouette Plot count vectorized
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), silhouette_scores_grand['scaled_inbound_cv_silhouettes'], color = 'red')
plt.title('Silhouette Method (Count Vectorized)')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

# Silhouette Plot tfidf
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), silhouette_scores_grand['scaled_inbound_tfidf_silhouettes'], color = 'red')
plt.title('Silhouette Method (TFIDF)')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

We see that these plots don't really change much between the TFIDF and Count Vectorized data, which supports my statement above that these features aren't going to be the most useful. 

## Visualizing my clusters with t-SNE 

I try different color maps and choose one so that it's easier to distinguish between clusters. 

Available sequential colormaps:
```['viridis', 'plasma', 'inferno', 'magma', 'cividis']```

```['Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds',
            'YlOrBr', 'YlOrRd', 'OrRd', 'PuRd', 'RdPu', 'BuPu',
            'GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn']```

Available qualitative colormaps:
```['Pastel1', 'Pastel2', 'Paired', 'Accent',
                        'Dark2', 'Set1', 'Set2', 'Set3',
                        'tab10', 'tab20', 'tab20b', 'tab20c']```

In [None]:
# Check out the shape of the scaled data 
scaled_inbound_cv.shape, scaled_inbound_tfidf.shape 


t-SNE is a stochastic and computationally expensive method, so it will take some time to run, especially because we have a large number of rows. 

In [None]:
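%%time 
# Instantiate t-SNE (init='random' because PCA initialization does not accept sparse input)
tsne = TSNE(n_components=2, init='random', random_state=1, n_jobs=-1)

# Fit t-SNE on the scaled count vectorized and TFIDF data
inbound_cv_ma_tsne = tsne.fit_transform(scaled_inbound_cv)
inbound_tfidf_ma_tsne = tsne.fit_transform(scaled_inbound_tfidf)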
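
Running t-SNE directly on a very wide sparse matrix is slow. A common workaround, which this notebook does not use, is to reduce the data with `TruncatedSVD` first and run t-SNE on the reduced matrix. A minimal sketch on random sparse stand-in data, with the component count and perplexity chosen arbitrarily:

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Random sparse stand-in for the vectorized tweets (hypothetical data).
X = sparse.random(200, 1000, density=0.01, format="csr", random_state=0)

# Step 1: TruncatedSVD works directly on sparse input and yields a dense,
# low-dimensional matrix.
svd = TruncatedSVD(n_components=20, random_state=1)
X_reduced = svd.fit_transform(X)

# Step 2: t-SNE on the reduced matrix. Perplexity must be smaller than the
# number of samples.
tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=1)
X_2d = tsne.fit_transform(X_reduced)
print(X_reduced.shape, X_2d.shape)
```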

In [None]:
# Plotting my visualization for each value of n_clusters with my count vectorized data

for k in range(10,101,10):
    # Getting the right K-Means cluster labels.
    labels = joblib.load(f'models/kmeans/scaled_inbound_cv-{k}neighbors.sav').labels_ # Filenames match those written by the K-Means cell
    
    # Visualize high-dimensional data
    plt.figure(figsize=(13,12))
    plt.scatter(inbound_cv_ma_tsne[:,0], inbound_cv_ma_tsne[:,1], s=20, c = labels, cmap = 'magma')
    plt.title(f'2-D t-SNE Representation of my Count Vectorized Inbound data with {k} Clusters')
    plt.xlabel('t-SNE dimension 1')
    plt.ylabel('t-SNE dimension 2')
    plt.show()

- For now, we are seeing many clusters embedded into each other without any clear demarcation between them. Should I play with the `perplexity` parameter to gain more insight? 
- Also, at this stage it is difficult to observe the clusters separately, since the data has many dimensions and we are projecting it onto a 2-D space 

In [None]:
# Plotting my visualization for each value of n_clusters with my tfidf data

for k in range(10,101,10):
    # Getting the right K-Means cluster labels.
    labels = joblib.load(f'models/kmeans/scaled_inbound_tfidf-{k}neighbors.sav').labels_
    
    # Visualize high-dimensional data
    plt.figure(figsize=(13,12))
    plt.scatter(inbound_tfidf_ma_tsne[:,0], inbound_tfidf_ma_tsne[:,1], s=20, c = labels, cmap = 'magma')
    plt.title(f'2-D t-SNE Representation of my TFIDF Inbound data with {k} Clusters')
    plt.xlabel('t-SNE dimension 1')
    plt.ylabel('t-SNE dimension 2')
    plt.show()

There doesn't seem to be much difference in the outputs of CountVectorized clusters and TFIDF clusters. Let's check for Doc2Vec and other semantic embeddings

## 1.2. K-Means for my Doc2Vec data

Notice that I did not scale my d2v data on purpose because I do not want to skew the distances that the pretrained model created.

In [None]:
# Vectorized data
vectorized_data = {'inbound_cv_d2v': inbound_d2v}

# Briefly showing the contents of i and j
for i, j in enumerate(vectorized_data.items()):
    print(i, j)

In [None]:
# My d2v dictionaries that will store all my results
wcss_d2v = {}
labels_d2v = {}
silhouette_scores_d2v = {}
n_clusters = [10,20,30,40,50,60,70,80,90,100]

# Iterating through all the differently embedded data
for i,j in tqdm(enumerate(vectorized_data.items())): 
    name = j[0] # Here j[0] is the name of the dataset
    dataset = j[1] # And j[1] is the actual data
    
    # I store my metrics at these following lists
    wcss = []
    labels = []
    silhouette_scores = []
    
    # Looping through values of k
    for k in tqdm(n_clusters):    
        print(f'Currently fitting {name} with {k} clusters... Please wait')
        
        # Initializing with k-means++ helps avoid falling into the random initialization trap.
        kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state = 10)
        kmeans.fit(dataset)
        wcss.append(kmeans.inertia_)
        
        # Getting the silhouette score
        labels.append(kmeans.labels_)
        silhouette_scores.append(silhouette_score(dataset, kmeans.labels_))
        
        # Saving the models
        filename = f'models/kmeans/{name}-{k}neighbors.sav'
        joblib.dump(kmeans, filename)
        
    # Updating d2v dictionary
    wcss_d2v[name + '_wcss'] = wcss
    labels_d2v[name + '_labels'] = labels
    silhouette_scores_d2v[name + '_silhouettes'] = silhouette_scores

# Saving all my results, now with a d2v tag
with open('objects/wcss_d2v.pkl', 'wb') as handle:
    pickle.dump(wcss_d2v, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('objects/labels_d2v.pkl', 'wb') as handle:
    pickle.dump(labels_d2v, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open('objects/silhouette_scores_d2v.pkl', 'wb') as handle:
    pickle.dump(silhouette_scores_d2v, handle, protocol=pickle.HIGHEST_PROTOCOL)

### Reading back in the results.

In [None]:
# Storing it into objects I can use in this notebook

with open('objects/wcss_d2v.pkl', 'rb') as handle:
    wcss_d2v = pickle.load(handle)
with open('objects/labels_d2v.pkl','rb') as handle:
    labels_d2v = pickle.load(handle)
with open('objects/silhouette_scores_d2v.pkl','rb') as handle:
    silhouette_scores_d2v = pickle.load(handle)

Here are my plots: 

In [None]:
# Elbow Plot d2v
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), wcss_d2v['inbound_cv_d2v_wcss'], color = 'magenta')
plt.title('Elbow Method (Doc2Vec)')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Silouette Plot d2v
plt.figure(figsize=(8,5))
plt.plot(range(10, 101, 10), silhouette_scores_d2v['inbound_cv_d2v_silhouettes'], color = 'red')
plt.title('Silhouette Method (Doc2Vec)')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()

Great, the silhouette score is now slightly higher and no longer negative. K = 20 looks like the best choice in this case: it has the highest silhouette score, and the elbow plot shows something of an elbow there, definitely more than at 80 clusters, where the curve seems to be completely smooth.

In [None]:
%%time
# Instantiate t-SNE
tsne = TSNE(n_components=2, random_state=1, n_jobs=-1)

# Fit t-SNE
inbound_d2v_tsne = tsne.fit_transform(inbound_d2v)

In [None]:
# Plotting my visualization for each value of n_clusters, now with Doc2Vec embedded data

for k in range(10,101,10):
    # Getting the right K-Means cluster labels.
    labels = joblib.load(f'models/kmeans/inbound_cv_d2v-{str(k)}neighbors.sav').labels_
    
    # Visualize high-dimensional data
    plt.figure(figsize=(13,12))
    plt.scatter(inbound_d2v_tsne[:,0], inbound_d2v_tsne[:,1], s=20, c = labels, cmap = 'magma')
    plt.title(f'2-D t-SNE Representation of my Doc2Vec Inbound data with {k} Clusters')
    plt.xlabel('t-SNE dimension 1')
    plt.ylabel('t-SNE dimension 2')
    plt.show()

It's super hard to tell based on a t-SNE plot, so for that reason, I will evaluate how it clustered at a later section by actually looking at the labels!

## <font color = 'blue'>2. LDA (Latent Dirichlet Allocation) </font>
My second approach to clustering is LDA topic modelling, which models each document as a mixture of topics. My goal is still to cluster, but with this method I hope to get more useful, distinct topics.

Useful articles:
* https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
* https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05

There is also a newer, deep-learning based method called LDA2Vec, which could be interesting to explore as well.

However, because priorities have shifted, I will leave this as a future step.

In [None]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Count vectorizer and count matrix for the inbound tweets
count_vectorizer = CountVectorizer(min_df=5)
count_data = count_vectorizer.fit_transform(processed_v1_df["clean_inbound_text"])
 
# Helper function: print the n_top_words highest-weighted words of each topic
def print_topics(model, count_vectorizer, n_top_words):
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    words = count_vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        
# Tweak the two parameters below
number_topics = 5
number_words = 10
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
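
Since my goal is still to cluster, one way to turn LDA topics into cluster labels is to assign each document its most probable topic. This is a sketch under assumptions: the six-sentence `docs` corpus is made up, and three topics is an arbitrary choice.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical corpus standing in for the cleaned inbound tweets.
docs = [
    "my order is late and still not delivered",
    "package delivery is late again",
    "i want a refund for my order",
    "please refund my payment",
    "prime video is not loading",
    "video keeps buffering on prime",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(doc_term)  # shape (n_docs, n_topics), rows sum to 1

# Treat the most probable topic of each document as its cluster label.
topic_labels = doc_topic.argmax(axis=1)
for text, label in zip(docs, topic_labels):
    print(label, text)
```

Unlike K-Means, the full `doc_topic` row also tells me how mixed a tweet is, which could help flag tweets that carry more than one intent.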

In [None]:
%%time
# Note: recent pyLDAvis releases (3.4+) expose this module as pyLDAvis.lda_model
from pyLDAvis import sklearn as sklearn_lda
import pickle 
import pyLDAvis
LDAvis_data_filepath = os.path.join('./ldavis_prepared_'+str(number_topics))
# This is a bit time consuming - set the condition to False
# to skip the visualization prep and load the saved file instead
if True:
    LDAvis_prepared = sklearn_lda.prepare(lda, count_data, count_vectorizer)
    # Pickle files must be opened in binary mode
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.save_html(LDAvis_prepared, './ldavis_prepared_'+ str(number_topics) +'.html')

For the sake of time, I decided not to use DBSCAN, because I expect it to produce a clustering result similar to K-Means. I could also have used Gaussian Mixture Models or Hierarchical Clustering to achieve this clustering result. **Use GMMs and analyze your result on them as well**
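
As a starting point for the GMM follow-up, here is a minimal sketch on synthetic blobs, not on the tweet data. `GaussianMixture` expects dense input, so the sparse count/TFIDF matrices would need a dimensionality reduction step first. That step is an assumption on my part and is not shown here.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic dense stand-in data (hypothetical).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [0, 8]],
                  cluster_std=0.7, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)        # one cluster index per sample
soft_labels = gmm.predict_proba(X)  # per-sample membership probabilities
print(gmm.bic(X))                   # lower BIC suggests a better n_components
```

Comparing `gmm.bic(X)` across several values of `n_components` plays the same role as the elbow and silhouette plots do for K-Means.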

## Finding and Visualizing Intent Differences Between Clusters

I saved all my models in the `models/kmeans` folder. All that I have to do is pick a hyperparameter setting for a model and visualize which words are in its clusters.

In [None]:
processed_inbound

I think it's really useful to see the top 10 words in a cluster to get a good idea of what the intents in that cluster are! (Check out the intents...Edit sentences in the final notebook)