# The Hot Forum Topics

This notebook borrows its word tokenization and vectorization methods from [https://www.kaggle.com/pavlofesenko/strategies-to-earn-discussion-medals/notebook](https://www.kaggle.com/pavlofesenko/strategies-to-earn-discussion-medals/notebook).

## 1. Introduction

Different kinds of forum topics attract different numbers of views, messages and replies. This notebook focuses on one relation: the number of views versus the words in a topic's title. Applying hierarchical clustering to the tokenized and vectorized titles reveals keywords that frequently appear in hot topics. This may help Kagglers choose title words that attract more attention in the Kaggle community.


In [None]:
import pandas as pd
import spacy # For natural language processing (NLP)

from sklearn.manifold import TSNE # For dimension reduction
from sklearn.cluster import KMeans # For clustering
from sklearn.cluster import AgglomerativeClustering # For hierarchical clustering
import seaborn as sns # For visualization

from collections import Counter

## 2. Preprocessing of forum messages

In [None]:
messages = pd.read_csv('/kaggle/input/meta-kaggle/ForumTopics.csv')
messages = messages[messages.Title.notna()] #Remove topics without titles
messages.tail()

## 3. Vectorization and Visualization of Forum Topics
For simplicity, forum topics with at least 2000 views are considered "hot".

In [None]:
corpus = messages[messages.TotalViews >= 2000].Title.tolist()[:1000] # Keep hot topics and take the first 1000 titles (file order, not ranked by views)
corpus[-5:]

In [None]:
nlp = spacy.load('en_core_web_lg', disable=['ner']) # Load the large English spaCy model; named entity recognition is not needed here

In [None]:
batch = nlp.pipe(corpus) #Tokenization
corpus_tok = []
for doc in batch:
    tokens = [token.lemma_.lower() for token in doc if token.is_alpha and token.has_vector and not token.is_stop]
    tokens_str = ' '.join(tokens)
    if tokens_str != '':
        corpus_tok.append(tokens_str)

corpus_tok[-5:]

In [None]:
batch_tok = nlp.pipe(corpus_tok) #Vectorization
X = []
for doc in batch_tok:
    X.append(doc.vector)
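
To my understanding, spaCy's `doc.vector` defaults to the average of the token vectors, so each title ends up as one point in the embedding space. A minimal sketch of that averaging in plain Python follows; the 3-dimensional `toy_vectors` and the helper `average_vector` are made up for illustration and are not part of spaCy.

```python
# Toy 3-dimensional "word vectors"; the values are invented for illustration.
toy_vectors = {
    "submission": [1.0, 0.0, 2.0],
    "score": [3.0, 2.0, 0.0],
}


def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors of the known tokens."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        return []  # No known tokens, so there is nothing to average
    return [sum(column) / len(rows) for column in zip(*rows)]


title_vector = average_vector("submission score".split(), toy_vectors)
print(title_vector)  # [2.0, 1.0, 1.0]
```

Because the result is an average, word order is lost: two titles with the same words in a different order get the same vector.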

In [None]:
X_emb = TSNE(random_state=0).fit_transform(X) #Dimension Reduction
df = pd.DataFrame(X_emb, columns=['x', 'y'])
df.tail()

In [None]:
sns.scatterplot(x='x', y='y', data=df, edgecolor='none', alpha=0.5) # Visualize topics (recent seaborn versions require x and y as keywords)

In [None]:
model_h = AgglomerativeClustering(n_clusters=3) # Hierarchical clustering with three clusters
df['Label'] = model_h.fit_predict(X_emb)
df['Tokens'] = corpus_tok
df.tail()
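
`AgglomerativeClustering` uses Ward linkage by default: it starts with every point in its own cluster and repeatedly merges the pair whose merge increases the within-cluster variance the least. The sketch below illustrates that idea in plain Python on made-up 2-D points. It is a naive illustration, not scikit-learn's implementation, and the names `ward_cost` and `agglomerate` are hypothetical.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster variance if clusters a and b are merged."""
    centroid_a = [sum(c) / len(a) for c in zip(*a)]
    centroid_b = [sum(c) / len(b) for c in zip(*b)]
    dist2 = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def agglomerate(points, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # Every point starts as its own cluster
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Made-up points forming three well-separated pairs
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
toy_clusters = agglomerate(toy_points, n_clusters=3)
print(toy_clusters)
```

Note that the notebook clusters the 2-D t-SNE coordinates `X_emb` rather than the original word vectors, so the clusters reflect the layout of the embedding plot.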

In [None]:
palette = sns.color_palette(n_colors=3) # Visualize clusters, one color per cluster
sns.scatterplot(x='x', y='y', data=df, edgecolor='none', alpha=0.5, hue='Label', palette=palette)

### Display words in each cluster and their frequency

In [None]:
cluster0 = ' '.join(df[df.Label == 0].Tokens.tolist())
words0 = Counter(cluster0.split())
words0.most_common(10)

#### In **cluster0**, the most frequent words are "submission", "leaderboard", "final" and "score". Topics about submissions and final competition results may therefore be popular among Kagglers.

In [None]:
words0.most_common()[:-11:-1] # The 10 least common words in cluster0

In [None]:
cluster1 = ' '.join(df[df.Label == 1].Tokens.tolist())
words1 = Counter(cluster1.split())
words1.most_common(10)

#### **cluster1** shows that Kagglers also like to talk about datasets and data files.

In [None]:
cluster2 = ' '.join(df[df.Label == 2].Tokens.tolist())
words2 = Counter(cluster2.split())
words2.most_common(10)

#### **cluster2** suggests that topics about model training and evaluation can also be hot in the Kaggle community.
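
The three per-cluster cells above repeat the same steps, so they could be folded into one loop. A minimal sketch with a hypothetical helper `top_words` follows; `toy_rows` is made-up data standing in for the `Label` and `Tokens` columns of `df`, not output from this notebook.

```python
from collections import Counter

# Hypothetical (label, tokens) pairs standing in for df.Label and df.Tokens
toy_rows = [
    (0, "submission score"),
    (0, "final submission"),
    (1, "data file"),
    (1, "data set"),
]


def top_words(rows, n=10):
    """Map each cluster label to its n most frequent words."""
    counters = {}
    for label, tokens in rows:
        # One Counter per cluster, updated with the words of each title
        counters.setdefault(label, Counter()).update(tokens.split())
    return {label: counter.most_common(n) for label, counter in counters.items()}


result = top_words(toy_rows, n=1)
print(result)  # {0: [('submission', 2)], 1: [('data', 2)]}
```

With the notebook's DataFrame, a call such as `top_words(zip(df.Label, df.Tokens))` should give the top 10 words for every cluster at once.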

## 4. Conclusion

Simple NLP and hierarchical clustering on forum topic titles reveal some characteristics of hot forum topics. However, the sample is small, with only 972 tokenized titles as shown below, so the patterns found here may be biased. Repeating the experiment on more data with the same NLP and clustering techniques could give a more reliable picture of what makes a forum topic hot.

In [None]:
df