# RAPIDS UMAP, Tfidf, and KMeans Discover 15 Essay Topics
In this notebook we will find the essay topics using RAPIDS cuDF and cuML (Tfidf, UMAP, and KMeans). First we will convert each essay into a Tfidf embedding. Then we will use UMAP to reduce these embeddings to two dimensions. Lastly we will run KMeans on the 2D embeddings to find the essay topics!

# Load RAPIDS

In [None]:
import pandas as pd, os
import cudf, cuml, cupy
from tqdm import tqdm
import numpy as np
print('RAPIDS',cudf.__version__)

# RAPIDS cudf
We will read the train text files into a RAPIDS cuDF DataFrame with one row per essay.

In [None]:
# https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533
train_names, train_texts = [], []
train_dir = '../input/feedback-prize-2021/train'
for f in tqdm(os.listdir(train_dir)):
    train_names.append(f.replace('.txt', ''))
    # Use a context manager so each file handle is closed after reading
    with open(os.path.join(train_dir, f), 'r') as fh:
        train_texts.append(fh.read())
train_text_df = cudf.DataFrame({'id': train_names, 'text': train_texts})
train_text_df.head()

In [None]:
train_text_df.tail()

# RAPIDS Tfidf
We will use Tfidf to convert each text into an embedding vector with up to 25,000 entries, one per vocabulary term (`max_features=25_000`).

In [None]:
from cuml.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english', binary=True, max_features=25_000)
text_embeddings = tfidf.fit_transform( train_text_df.text ).toarray()
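
To make the Tfidf step more concrete, here is a minimal CPU-only sketch in plain Python of what a binary Tfidf embedding computes. It is not the cuML implementation. The helper name `binary_tfidf`, the whitespace tokenizer, the lack of stop-word removal, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization are assumptions chosen for illustration.

```python
import math
from collections import Counter

def binary_tfidf(docs):
    """Hypothetical helper: return (vocab, rows) of L2-normalized binary Tfidf vectors."""
    # binary=True analogue: only the presence of a term matters, not its count
    tokenized = [set(d.lower().split()) for d in docs]
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in toks)
    vocab = sorted(df)
    n = len(docs)
    # Smoothed IDF (assumed convention): rarer terms get larger weights
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        row = [idf[t] if t in toks else 0.0 for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])  # scale each row to unit length
    return vocab, rows

docs = ["phones in class", "driverless cars", "phones and cars"]
vocab, rows = binary_tfidf(docs)
```

In this toy corpus, "class" appears in one document and "phones" in two, so within the first document "class" receives the larger weight. That is the property that lets the highest-weighted words act as topic labels later on.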

# RAPIDS UMAP
We will use UMAP to reduce the Tfidf embedding vectors to two dimensions, then copy the result from the GPU to a NumPy array for clustering and plotting.

In [None]:
from cuml import UMAP
umap = UMAP()
embed_2d = umap.fit_transform(text_embeddings)
embed_2d = cupy.asnumpy( embed_2d )

# RAPIDS KMeans
We will run KMeans with 15 clusters on the 2D UMAP embeddings to find clusters of essays. These are the essay topics!

In [None]:
from cuml import KMeans
kmeans = KMeans(n_clusters=15)
kmeans.fit(embed_2d)
train_text_df['cluster'] = kmeans.labels_
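
For readers who want to see what the clustering step does, below is a small plain-Python sketch of Lloyd's algorithm, the standard iteration behind KMeans. It is an illustration only, not how cuML implements it. The function name `kmeans_2d`, the random initialization, and the fixed iteration count are assumptions made for this example.

```python
import random

def kmeans_2d(points, k, n_iter=20, seed=0):
    """Hypothetical helper: cluster 2D points with Lloyd's algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        labels = [
            min(range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels, centers

# Two well-separated toy groups standing in for the 2D UMAP output
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans_2d(points, k=2)
```

Because UMAP tends to place similar essays in compact, well-separated groups, this simple nearest-center rule is enough to recover them in two dimensions.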

# Display Essay Topics
We will display the result of UMAP, which reduced each text to two dimensions. We observe that the essays cluster into 15 groups. These are the 15 essay topics! Additionally, we will label each group with its most important word, meaning the vocabulary term with the highest mean Tfidf weight across the essays in that group.

In [None]:
import matplotlib.pyplot as plt

centers = kmeans.cluster_centers_
print(kmeans.labels_)
plt.figure(figsize=(10,10))
plt.scatter(embed_2d[:,0], embed_2d[:,1], s=1, c=kmeans.labels_)
plt.title('UMAP Plot of Train Text using Tfidf features\nRAPIDS Discovers the 15 essay topics!',size=16)

for k in range(len(centers)):
    mm = cupy.mean( text_embeddings[train_text_df.cluster.values==k],axis=0 )
    ii = cupy.argmax(mm)
    top_word = tfidf.vocabulary_.iloc[ii]
    plt.text(centers[k,0]-1,centers[k,1]+0.75,f'{k+1}-{top_word}',size=16)

plt.show()

# Display Example Text
We will display three example texts from each of the first five essay topics, along with the five most important words for each of those topics.

In [None]:
for k in range(5):
    mm = cupy.mean( text_embeddings[train_text_df.cluster.values==k],axis=0 )
    ii = cupy.asnumpy( cupy.argsort(mm)[-5:][::-1] )
    # Move the vocabulary to the host; to_array() is not available in newer cuDF releases
    top_words = tfidf.vocabulary_.to_pandas().values[ii]
    print('#'*25)
    print(f'### Essay Topic {k+1}')
    print('### Top 5 Words',top_words)
    print('#'*25)
    tmp = train_text_df.loc[train_text_df.cluster==k].sample(3, random_state=123)
    for j in range(3):
        txt = tmp.iloc[j,1]
        print('-'*10,f'Example {j+1}','-'*10)
        print(txt,'\n')
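
The top-word logic used in the two cells above (average the Tfidf rows of a cluster, then take the highest-weighted terms) can be sketched without a GPU as follows. The helper name `top_words_per_cluster` and the toy vocabulary, weights, and labels are hypothetical and exist only to show the computation.

```python
def top_words_per_cluster(rows, labels, vocab, n_top=2):
    """Hypothetical helper: rank vocabulary terms by mean weight within each cluster."""
    result = {}
    for k in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == k]
        # Column-wise mean of the cluster's embedding rows
        mean = [sum(col) / len(members) for col in zip(*members)]
        # Indices of terms sorted from highest to lowest mean weight
        order = sorted(range(len(vocab)), key=lambda i: mean[i], reverse=True)
        result[k] = [vocab[i] for i in order[:n_top]]
    return result

# Toy data: four documents, four terms, two clusters (all values made up)
vocab = ["cars", "phones", "school", "venus"]
rows = [
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.1, 0.2, 0.0],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.9, 0.2, 0.1],
]
labels = [0, 0, 1, 1]
top = top_words_per_cluster(rows, labels, vocab)
print(top)
```

A term such as "school" that appears in every cluster can still rank highly, which is one reason the single top word is only a rough label for a topic.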