# WordClouds of Competition Forums

This Notebook generates visualisations of the most common terms in each competition forum using the forum posts HTML source available in [Meta Kaggle][1]. It's a simple method to show an eye-catching, bandwidth-busting, browser-melting, at-a-glance view of what each competition is about.

If you use simple word counts for each competition, the same common words rise to the top every time, like *data*, *kaggle*, *model*, *thanks for sharing*, *congratulations* etc :)

We need a way to ignore the words common to all competitions and let the competition-specific terms rise to the top. Using [Tf-Idf][2] and treating each forum as a document works quite well!

The title of each thread is prepended to every message, then the complete text of all the forum posts for a competition is concatenated and treated as one document. The *inverse document frequency* weights demote words that appear in many forums, whilst the *term frequency* weights count how often each word occurs within a competition's forum. The `tf*idf` product is used for the word cloud frequencies.
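To make the weighting concrete, here is a minimal pure-Python sketch on made-up data (the forum names, titles and messages are hypothetical). It uses raw term counts and the smoothed idf `ln((1 + n) / (1 + df)) + 1`, which is the default formula in scikit-learn's `TfidfVectorizer`. It leaves out the stemming, stop-word removal and L2 normalisation that the vectorizer in this notebook also applies, so treat it as an illustration rather than the exact computation.

```python
import math
from collections import Counter

# Hypothetical toy data: (forum, thread title, message text).
posts = [
    ("digits", "pixel model", "thanks for sharing the model"),
    ("digits", "pixel model", "pixel data model"),
    ("titanic", "survival model", "thanks for sharing the model"),
    ("titanic", "survival model", "survival data model"),
    ("taxi", "trip model", "thanks for sharing the model"),
    ("taxi", "trip model", "trip data model"),
]

# Prepend the thread title to every message, then join everything per forum.
parts_by_forum = {}
for forum, title, message in posts:
    parts_by_forum.setdefault(forum, []).append(f"{title} {message}")
docs = {forum: " ".join(parts).split() for forum, parts in parts_by_forum.items()}

# Document frequency: in how many forums does each word occur?
n_docs = len(docs)
doc_freq = Counter(word for words in docs.values() for word in set(words))

def tfidf(words):
    """Raw term count times smoothed idf, ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(words)
    return {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }

# Most frequent word by raw count versus highest tf-idf weight, per forum.
raw_top = {forum: Counter(words).most_common(1)[0][0] for forum, words in docs.items()}
weights = {forum: tfidf(words) for forum, words in docs.items()}
tfidf_top = {forum: max(w, key=w.get) for forum, w in weights.items()}

print("raw counts:", raw_top)
print("tf-idf:    ", tfidf_top)
```

In this toy corpus *model* is the most frequent word in every forum by raw count, but because it occurs in all three forums its idf is the minimum of 1, and the forum-specific word wins once the weights are multiplied.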

**There's now a [T-SNE scatterplot][5] at the end showing all competitions in a 2D space built from the words that the Kagglers/hosts/sponsors chose to write :)**

### Credits

None of the actual mathematics of doing this are apparent in this notebook;
what you see here is just glue code sitting on libraries, so many thanks to the authors:

 - https://amueller.github.io/word_cloud/
 - https://github.com/amueller/word_cloud
 - https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
 - https://plotly.com/python/plotly-express/

### Directions


&uarr; Use *Edit &rarr; Find in This Page...* (or *⌘+F* or *control+F*) to search for words, e.g. search for "magic". The top 20 words per forum are shown.

&rarr; Use the navigation bar to find specific competitions.

### See Also

The [same idea can be applied to the top ranked discussion users][4]. Concatenate all their posts and generate a wordcloud per user instead of per forum. Then see each user's specific areas of expertise ;)


[1]: https://www.kaggle.com/kaggle/meta-kaggle "Meta Kaggle"
[2]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
[3]: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html "TfidfVectorizer"
[4]: https://www.kaggle.com/jtrotman/what-are-you-talking-about
[5]: #Scatter-Plot


In [1]:
import gc, os, re, sys, time
import pandas as pd, numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from IPython.display import HTML, display
from wordcloud import WordCloud
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
import plotly.express as px
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

MK = Path('../input/meta-kaggle')
ID = 'Id'
FORUM_ID = 'ForumId'
HOST = 'https://www.kaggle.com'

# Copied from:
# https://github.com/GaelVaroquaux/my_topics/blob/master/topics_extraction.py
PROTECTED_WORDS = ['pandas', 'itertools', 'physics', 'keras']

def no_plural_stemmer(word):
    """ A stemmer that tries to apply only on plural. The goal is to keep
        the readability of the words.
    """
    word = word.lower()
    if word.endswith('s') and not (word in PROTECTED_WORDS
                                   or word.endswith('sis')):
        stemmed_word = stemmer.stem(word)
        if len(stemmed_word) == len(word) - 1:
            word = stemmed_word
    return word

# Not perfect but better than nothing
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: (no_plural_stemmer(w) for w in analyzer(doc))

def simple_slug(txt):
    return re.sub(r'[^a-zA-Z0-9\-_]+', '-', txt.lower())

def html_to_text(r):
    return BeautifulSoup(r, 'html').text

def search_url(q):
    return f'https://www.kaggle.com/search?q={q}'

In [2]:
NROWS = None # For testing on subset

# Competitions
comps = pd.read_csv(MK / 'Competitions.csv', parse_dates=['DeadlineDate'], index_col=ID)
tags = pd.read_csv(MK / 'Tags.csv', index_col=ID)
ctags = pd.read_csv(MK / 'CompetitionTags.csv')
ctags['Slug'] = ctags.TagId.map(tags.Slug)
comps['Tags'] = ctags.groupby('CompetitionId').Slug.apply(" : ".join)
comps['Tags'] = comps['Tags'].fillna("none")
comps['Year'] = comps.DeadlineDate.dt.year
comps = comps.drop_duplicates(subset=['ForumId'], keep='last') # 575380 and 585319

# Forum Details
forums = pd.read_csv(MK / 'Forums.csv', index_col=ID)
topics = pd.read_csv(MK / 'ForumTopics.csv', index_col=ID)
topics['Title'] = topics['Title'].fillna('')

# Forum Messages
msgs = pd.read_csv(MK / 'ForumMessages.csv', index_col=ID, nrows=NROWS)
msgs = msgs.dropna(subset=['Message'])
msgs.insert(0, 'ForumId', msgs.ForumTopicId.map(topics.ForumId))
msgs.insert(0, 'ParentForumId', msgs.ForumId.map(forums.ParentForumId))
text = ('<html>' + msgs.Message + '</html>').apply(html_to_text)
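# regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default
text = text.str.replace(r'(https?://\S+|\[/?quote.*?\])', ' ', regex=True) # strip URLs, [quote]
text = text.str.replace(r'([\W_]{4,})', ' ', regex=True) # and junk
text = text.str.replace(r'([a-fA-F0-9]{12,})', ' ', regex=True) # long hash-like strings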
# Add topic titles to each post - this over-weights the title words a little,
#  depending on how many messages are in a topic
msgs['Text'] = msgs.ForumTopicId.map(topics.Title) + " " + text
msgs.shape

# General Forums

Quick preview of the forums table.

Meta Kaggle does not generally include forums for active competitions - the exceptions within the *Active Competitions* category refer to *InClass* competitions (teaching/learning).

The main attention-grabbing *Featured* and *Research* competitions appear under *Past Competitions*. There are over 300 and these are the focus: there will be one wordcloud for each.

In [3]:
forums.shape

In [4]:
forums.describe(include='all').T

In [5]:
top_level = forums[forums.ParentForumId.isnull()].copy()
top_level['ForumCount'] = forums.groupby('ParentForumId').size()
top_level

In [6]:
colormaps = [
    # Sequential
    'Purples',
    'Blues',
    'Greens',
    'Oranges',
    'Reds',
    'YlOrBr',
    'YlOrRd',
    'OrRd',
    'PuRd',
    'RdPu',
    'BuPu',
    'GnBu',
    'PuBu',
    'PuBuGn',
    'BuGn',
    'YlGn',
    # Qualitative
    'Paired',
    'Accent',
    'Set1',
    'Set2',
    'Set3',
    'tab10',
    'tab20',
    # Sequential2
    'spring',
    'summer',
    'autumn',
    'winter',
    'cool',
    'Wistia',
    # Miscellaneous
    'gist_rainbow',
    'rainbow'
]

In [7]:
np.random.seed(42) # what else?
np.random.shuffle(colormaps)

In [8]:
def competition_html(forumid):
    df = comps[comps['ForumId'] == forumid]
    if len(df) != 1:
        return ""
    c = df.iloc[0]
    return (
        '<p>'
        f'<i>{c.HostSegmentTitle} Competition</i>:'
        f'   <b><a target=_blank href="{HOST}/c/{c.Slug}">{c.Title}</a></b>'
        f'   "<i>{c.Subtitle}</i>"'
        '<br/>'
        f'<i>TotalTeams</i>: <b>{c.TotalTeams}</b>'
        '<br/>'
        f'<i>DeadlineDate</i>: <b>{c.DeadlineDate.strftime("%c")}</b>'
    )

In [9]:
NTOP = 200
SHOW_TOP_WORDS = 20

# Using a class is better than one big function.
# For example you can fork the Notebook and have a look at the 'tfv' member.
class CloudGenerator:
    def __init__(self, tag, par, max_df=0.95):
        self.par = par
        self.tag = tag
        docs = {}
        # One big document per FORUM_ID.
        # Note: code in 'run' relies on Python 3 feature of storing key/value pairs
        #  in order they were added.
        for fid, df in par.groupby(FORUM_ID):
            docs[fid] = '\n'.join(df.Text)

        self.tfv = StemmedTfidfVectorizer(ngram_range=(1, 1),
                                     max_df=max_df,
                                     dtype=np.float32,
                                     stop_words='english')
        self.xall = self.tfv.fit_transform(docs.values())
        # get_feature_names() was removed in scikit-learn 1.2
        self.words = self.tfv.get_feature_names_out()
        self.ids = list(docs.keys())
        self.rows = []

    def save(self):
        tag = self.tag
        # save the stop words (determined by the max_df parameter)
        np.savetxt(f'{tag}_stop_words.txt', list(sorted(self.tfv.stop_words_)), '%s')
        # save stats of all forums to one file
        cols = [ FORUM_ID, 'count', 'max', 'mean' ] + [f'tok{i}' for i in range(NTOP)]
        df = pd.DataFrame(self.rows, columns=cols).set_index(FORUM_ID)
        df.insert(0, 'Title', df.index.map(forums.Title))
        df.to_csv(f'{tag}_word_stats.csv')
        
    def run(self):
        for row, (fid, df) in enumerate(self.par.groupby(FORUM_ID)):
            x = self.xall[row]
            s = pd.Series(index=self.words, data=x.toarray().ravel())
            s = s.sort_values(ascending=False)
            
            # save top words for CSV output
            l = s.head(NTOP).index.str.replace(' ', '_').tolist()
            self.rows.append([fid, (s > 0).sum(), s.max(), s.mean()] + l)
        
            title = forums.Title[fid]
            nchars = df.Message.str.len().sum()
            ntopics = df.ForumTopicId.nunique()
            nmsg = df.shape[0]
            top = s.head(SHOW_TOP_WORDS).index
            top = [f"<a href='{search_url(w)}'>{w}</a>" for w in top]
            top = ', '.join(top)
            query = title
        
            html = f"<h1 id='{simple_slug(title)}'>{title}</h1>"
            html += competition_html(fid)
            url = f"{HOST}/search?q={query}+in%3Atopics"

            html += (
                f"<h3>Forum</h3>"
                f"<ul>"
                f"<li>Search Kaggle for <a href='{url}'>{query}</a> in topics"
                f"<li>{ntopics} topics; {nmsg/ntopics:.1f} messages per topic"
                f"<li>{nmsg} messages; {nchars} raw characters; {nchars/nmsg:.0f} chars per message"
                f"<li>{df.PostUserId.nunique()} unique users"
                f"<li>Top {SHOW_TOP_WORDS} words: {top}"
                f"</ul>"
            )
            
            wc = WordCloud(background_color='black',
                           width=800,
                           height=600,
                           colormap=colormaps[row % len(colormaps)],
                           collocations=False,
                           random_state=row,
                           min_font_size=10,
                           max_font_size=200).generate_from_frequencies(s[s>0])
            
            if False:
                # wordcloud library now supports SVG
                #   - but needs latest docker image; and
                #   - renders poorly on this site, with overlapping words
                html += wc.to_svg()
                display(HTML(html))
            else:
                display(HTML(html))
                fig, ax = plt.subplots(figsize=(12, 9))
                ax.imshow(wc, interpolation='bilinear')
                ax.axis('off')
                plt.tight_layout()
                plt.show()

In [10]:
cg = CloudGenerator('Competitions', msgs.query("ParentForumId==8"))
cg.run()
cg.save()

# \*\*\* Bonus Content \*\*\*

Not just competition forums - let's also look at the top level ***General Forums*** found where *ParentForumId* is 9.


In [11]:
forums.query("ParentForumId==9")

In [12]:
cg2 = CloudGenerator('General', msgs.query("ParentForumId==9")) #, max_df=1.0
cg2.run()
cg2.save()

# Forums TSNE

Reduce the competitions down to two dimensions based on their forum text - competitions with similar forums should appear close together.
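As a rough intuition for what "similar forums" means here, the sketch below compares made-up term-weight vectors with cosine similarity. The forum names, words and weights are all hypothetical, and this is not what t-SNE itself computes; it only illustrates that forums sharing highly weighted words score as closer than forums with no vocabulary in common.

```python
import math

# Hypothetical term-weight vectors for three forums (word -> tf-idf weight).
toy_vectors = {
    "digits":  {"pixel": 5.0, "cnn": 3.0, "augmentation": 2.0},
    "whales":  {"pixel": 2.0, "cnn": 4.0, "fluke": 5.0},
    "titanic": {"survival": 5.0, "cabin": 3.0, "xgboost": 2.0},
}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

image_pair = cosine(toy_vectors["digits"], toy_vectors["whales"])
mixed_pair = cosine(toy_vectors["digits"], toy_vectors["titanic"])
print(f"digits vs whales:  {image_pair:.2f}")
print(f"digits vs titanic: {mixed_pair:.2f}")
```

The two image-style forums share *pixel* and *cnn*, so their similarity is well above zero, while the tabular-style forum shares no words with them and scores exactly zero.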

First reduce using SVD.

In [13]:
NSVD = 120
svd = TruncatedSVD(n_components=NSVD, random_state=42)
xc = svd.fit_transform(cg.xall)
np.round(svd.explained_variance_ratio_.cumsum(), 2)

In [14]:
svd_df = pd.DataFrame(xc, index=list(map(int,cg.ids))).add_prefix('svd')
svd_df['Title'] = forums['Title']
svd_df.to_csv("CompetitionForumsSVD.csv", index_label=FORUM_ID)
svd_df.shape

In [15]:
tsne = TSNE(perplexity=20,
            early_exaggeration=1,
            init='pca',
            method='exact',
            learning_rate=5,
            n_iter=5000)
x2 = tsne.fit_transform(xc)
tsne_df = pd.DataFrame(x2, index=list(map(int, cg.ids))).add_prefix('tsne')
tsne_df.shape

Define type of competition for scatterplot symbol

In [16]:
CTYPE = 'CType'
comps[CTYPE] = "Default"
comps.loc[comps.Title.str.contains(r"Santa\b"), CTYPE] = 'Santa'
comps.loc[comps.Tags.str.contains("tabular-"), CTYPE] = 'Tabular'
comps.loc[comps.Tags.str.contains("image-"), CTYPE] = 'Image'
comps.loc[comps.Tags.str.contains("basketball"), CTYPE] = 'Basketball'
comps[CTYPE].value_counts()

Add competition fields into the forums table

In [17]:
forums_full = forums.join(tsne_df)
forums_full['TopicCount'] = forums_full.index.map(topics.ForumId.value_counts())
forums_full = forums_full.join(comps.reset_index().set_index(FORUM_ID).drop(columns=["Title"]))
forums_full = forums_full.dropna(subset=['Year'] + list(tsne_df.columns))
forums_full.shape

In [18]:
forums_full.to_csv("CompetitionForumsTSNE.csv", index_label=FORUM_ID)

# Scatter Plot

Note how there seems to be a drift over time - towards image competitions.

The long-running Christmas optimization competitions get a space of their own, in a region closer to "old" forums. The same goes for the long-running NCAA March Madness competitions (most of the "basketball" type).

Image-based competitions appear to cluster in the "newer" region (2019 onwards), presumably because the forums discuss the same metrics, etc.

(This makes a good competition browsing UI! Would be nice to be able to click links to competitions here...)

In [19]:
# Forums TSNE
tmp = forums_full.assign(DeadlineDate=forums_full.DeadlineDate.dt.strftime('%c'))
fig = px.scatter(tmp,
                 title='Competition Forums',
                 x='tsne0',
                 y='tsne1',
                 symbol=CTYPE,
                 hover_name='Title',
                 hover_data=[
                     'EvaluationAlgorithmAbbreviation', 'TopicCount',
                     'DeadlineDate', 'TotalTeams', 'Tags'
                 ],
                 color='Year')
fig.update_traces(marker=dict(size=9,
                              line=dict(width=1, color='black')),
                  selector=dict(mode='markers'))
fig.update_layout(height=750, showlegend=False)

### See Also

[What Are You Talking About?](https://www.kaggle.com/jtrotman/what-are-you-talking-about)

In [20]:
_ = """
Re-run to include recent competitions:

    2021-06-23 | Slug:iwildcam2021-fgvc8
    2021-06-28 | Slug:coleridgeinitiative-show-us-the-data
    2021-07-05 | Slug:tabular-playground-series-jun-2021
    2021-08-10 | Slug:google-smartphone-decimeter-challenge
    2021-08-11 | Slug:commonlitreadabilityprize


"""