# What Are You Talking About?

This notebook investigates the words and phrases used by the top-ranked Kaggle users in the discussion rankings; all discussion Masters and Grandmasters are included as well, whatever their rank.
It uses the forum posts HTML source available in [Meta Kaggle][1].

All of a user's discussion posts are treated as one document and weighted with [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Words (and bigrams/trigrams) common to most of the *shown* users are downweighted or ignored, so the visualisations highlight words and phrases that are particular to each user.
BeautifulSoup is used for parsing messages and cleaning is fairly minimal: for example, source code in posts is left in (possible improvement: remove *&lt;code&gt;* tags).
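
The possible improvement mentioned above could look something like the following. This is a minimal sketch, not part of the notebook: the function name `text_without_code` and the sample post are hypothetical, and it uses the built-in `html.parser` backend rather than `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

def text_without_code(html):
    """Return the visible text of a post, skipping anything inside <code>."""
    soup = BeautifulSoup(html, 'html.parser')
    # remove one <code> element at a time, so nested tags are handled safely
    tag = soup.find('code')
    while tag is not None:
        tag.decompose()
        tag = soup.find('code')
    # same text extraction call as the notebook's tokenizer
    return soup.get_text('\n', strip=True)

# hypothetical post, for illustration only
post = '<p>Try this:</p><pre><code>import numpy as np</code></pre><p>It works :)</p>'
print(text_without_code(post))
```

Inline identifiers wrapped in *&lt;code&gt;* would be dropped too, which may or may not be desirable: library names quoted that way are often exactly the distinctive terms a word cloud should show.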

***NOTE:*** *Competition details (including forum posts) are not included in [Meta Kaggle][1] until* ***after the competition deadline***.

### Directions

&uarr; Use *Edit &rarr; Find in This Page...* or *CMD+F* to search for words, e.g. "vowpal" (the top N words per user are shown as text).

&rarr; Use the navigation bar to jump around.

## Contents

 * [Read Forums](#Read-Forums)
 * [TfidfVectorizer](#TfidfVectorizer)
 * [User Reports](#User-Reports)
 * [Kaggle Writers 2D Semantic Space](#Kaggle-Writers-2D-Semantic-Space)

<!--
### Ideas for Improvement

 - use color function: e.g. words relating to metrics, sponsors, users get their own color.
 - better mapping of infrequent plural terms to the singular version
 - remove text within "code" tags?
 - count &lt;img&gt; tags (and others? links?) in posts
 - more user stats e.g. gold/silver/bronze counts - although these are available on the [main rankings page][5].

### Done

 - custom tokenizing to include smileys :) (although smileys are already sideways - when they are printed vertically in the word cloud they end up upside down!)

-->

### Main customization settings are in the first cell - feel free to fork & edit, e.g. if you're ranked 101 ;-P


[1]: https://www.kaggle.com/kaggle/meta-kaggle "Meta Kaggle"
[2]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
[3]: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html "TfidfVectorizer"
[4]: https://www.kaggle.com/jtrotman/wordclouds-of-competition-forums
[5]: https://www.kaggle.com/rankings?group=discussion
[6]: https://www.kaggle.com/search


In [1]:
# All users currently ranked here or higher are shown
SHOW_TOP_RANKS = 100

# Words to show as text/search links, above the word cloud
SHOW_TOP_WORDS = 30

# Top words per user to save in CSV file
SAVE_TEXT_WORDS = 200

# If >MAX_DF of users use a word/bigram/trigram it will be ignored
# Look in the stop_words.txt output file to see what this excludes.
MAX_DF = 0.8

# Colormap names:
# https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html
COLORS = {
    4: 'Wistia',  # GM
    3: 'autumn',  # Master
    2: 'spring',  # Expert
}
HOST = 'https://www.kaggle.com'
TIER_NAMES = ['Novice', 'Contributor', 'Expert', 'Master', 'Grand Master']
TIER_COLORS = ['green', 'blue', 'purple', 'orange', 'gold', 'black']
IMGS = {
    2: f"{HOST}/static/images/tiers/expert@48.png",
    3: f"{HOST}/static/images/tiers/master@48.png",
    4: f"{HOST}/static/images/tiers/grandmaster@48.png"
}

In [2]:
from jt_mk_utils import *

In [3]:
%matplotlib inline
import gc, os, re, sys, time
import pandas as pd, numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from IPython.display import HTML, display
import plotly.express as px
from wordcloud import WordCloud
from bs4 import BeautifulSoup
import zlib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

ID = 'Id'
KEY = 'ForumId'
USER_RANKS_CSV = Path(f'../input/kaggle-discussion-user-rankings')

FONT_PATH = '/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf'

EMOJI = ''.join([chr(c) for c in range(0x1f600, 0x1f641)])
EMOTICON = r"[:;8X]['`\-\^]?[\)\(/dp]"

REGEX = [
    re.compile(r'(?:[\'\"]?https?://\S+)'),         # strip URLs   
    re.compile(r'(?:\'s\b)'),                       # strip 's
    re.compile(r'([a-fA-F0-9_\-]{12,})'),           # long hash-like strings
    re.compile(r'\[quote.*\[/quote\]', flags=re.S)  # strip [quote]
]

TOKEN_PATTERNS = [
    f"(?:[{EMOJI}])",
    f"(?:{EMOTICON})",
    r"(?:\bt-sne\b)",
    r"(?:\b\w[\w']+\w+\b)",
    r"(?:\b\w\.[\w\.]+\b)", # e.g.  i.e.
    r"(?:\b\w\w+\b)"
]
TOKEN_PATTERN = re.compile("|".join(TOKEN_PATTERNS))

def bigrams(words):
    return list(map(' '.join, zip(words, words[1:])))

def trigrams(words):
    return list(map(' '.join, zip(words, words[1:], words[2:])))

def tokenize(s):
    soup = BeautifulSoup(s, 'lxml')
    text = soup.get_text('\n', strip=True)
    # text cleaning
    for r in REGEX:
        text = r.sub(' ', text)
    lines = text.splitlines()
    tokens = []
    # tokenize and add 2-grams and 3-grams *within lines*
    for line in lines:
        l = TOKEN_PATTERN.findall(line.lower())
        tokens.extend(l)
        tokens.extend(bigrams(l))
        tokens.extend(trigrams(l))
    return tokens

# this is to ensure :-p :-D and XD are upper case
def fix_emoticons(s):
    return re.sub(f'({EMOTICON}\\b)', lambda m: m.group(1).upper(), s,
                  flags=re.IGNORECASE)

def simple_slug(txt):
    # raw string avoids the invalid-escape warning for \-
    return re.sub(r'[^a-zA-Z0-9\-_]+', '-', txt.lower())

def search_url(q):
    return f'{HOST}/search?q={q}'

def compress(s):
    return zlib.compress(s.lower().encode('utf-8'))

# Read Forums

In [4]:
def add_discussion_tier_old(dst, col):
    # UserAchievements.csv is out of date, see:
    # https://www.kaggle.com/kaggle/meta-kaggle/discussion/181048
    # and no longer maintained:
    # https://www.kaggle.com/kaggle/meta-kaggle/discussion/255635#1405368
    df = pd.read_csv(MK / 'UserAchievements.csv')
    df = df.query('AchievementType=="Discussion"').set_index('UserId')
    dst['DiscussionTier'] = dst[col].map(df.Tier)
    dst['DiscussionRanking'] = dst[col].map(df.CurrentRanking) 

def add_discussion_tier(dst, col):
    # my replacement for UserAchievements.csv
    df = pd.read_csv(USER_RANKS_CSV / 'DiscussionRankings.csv').set_index('UserId')
    dst['DiscussionTier'] = dst[col].map(df.Tier)
    dst['DiscussionRanking'] = dst[col].map(df.CurrentRanking) 

users = read_users(index_col=ID)
forums = read_forums(index_col=ID)
topics = read_forum_topics(index_col=ID)
posts_df = read_forum_messages(index_col=ID).dropna(subset=['Message'])
posts_df = posts_df.sort_index()
add_discussion_tier(posts_df, 'PostUserId')
posts_df.insert(0, 'ForumId', posts_df.ForumTopicId.map(topics.ForumId))
posts_df.insert(0, 'ParentForumId', posts_df.ForumId.map(forums.ParentForumId))
posts_df.shape

In [5]:
# fork to try "Medal>0 and (DiscussionTier>=3 or DiscussionRanking<=@SHOW_TOP_RANKS)"
sub_df = posts_df.query("DiscussionTier>=3 or DiscussionRanking<=@SHOW_TOP_RANKS")
sub_df.shape

In [6]:
# tokenize all messages
tokenized_messages = sub_df.Message.apply(tokenize)
tokenized_messages.shape

In [7]:
# concatenate all the tokenized messages for each user
KEY_LIST = ['DiscussionRanking', 'PostUserId']
docs = tokenized_messages.groupby([sub_df[k] for k in KEY_LIST]).sum()
docs.shape

# TfidfVectorizer
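
To make the weighting concrete, here is a small pure-Python sketch of the idea: one token list per user, terms used by more than `max_df` of users dropped, and the rest scored by term count times inverse document frequency. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows; the function `tfidf` and the three toy users are hypothetical and only illustrate the mechanism.

```python
import math
from collections import Counter

def tfidf(docs, max_df=0.8):
    """Toy TF-IDF over pre-tokenized documents (one token list per user)."""
    n = len(docs)
    # document frequency: in how many users' documents does each term appear?
    df = Counter(term for doc in docs.values() for term in set(doc))
    # drop terms used by more than max_df of all users
    vocab = {t for t, c in df.items() if c <= max_df * n}
    weights = {}
    for user, doc in docs.items():
        tf = Counter(t for t in doc if t in vocab)
        # raw count times smoothed IDF (rarer terms get a higher weight)
        row = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        # L2-normalise so long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        weights[user] = {t: w / norm for t, w in row.items()}
    return weights

# hypothetical users and tokens, for illustration only
docs = {
    'alice': ['thanks', 'xgboost', 'xgboost', 'feature'],
    'bob':   ['thanks', 'vowpal', 'vowpal', 'feature'],
    'carol': ['thanks', 'keras', 'keras', 'keras'],
}
weights = tfidf(docs, max_df=0.8)
for user, row in weights.items():
    print(user, max(row, key=row.get))
```

Here "thanks" is used by all three users, which exceeds the 0.8 threshold, so it never reaches any user's weights; "feature" is shared by two users and survives with a lower weight than the terms unique to one user.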

In [8]:
# Full list of settings here:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
tfv = TfidfVectorizer(analyzer=lambda l: l,
                      max_df=MAX_DF,
                      dtype=np.float32)
xall = tfv.fit_transform(docs)
display(HTML(f"fit_transform shape: {xall.shape}"))
# Post processing!
tfv.vocabulary_ = {fix_emoticons(k): v for k, v in tfv.vocabulary_.items()}
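# get_feature_names() was removed in scikit-learn 1.2; fall back to it on older versions
words = (tfv.get_feature_names_out() if hasattr(tfv, 'get_feature_names_out')
         else tfv.get_feature_names())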
np.savetxt(f'stop_words.txt', list(sorted(tfv.stop_words_)), '%s')

In [9]:
# Save IDF weights (rarer words have a higher weight)
idf = pd.Series(tfv.idf_, words).sort_values()
idf.to_frame('IDF').to_csv(f'idf_weights.csv', index_label='Words')

# User Reports

In [10]:
def generate_clouds():
    rows = []
    for row, ((rank, uid), df) in enumerate(sub_df.groupby(KEY_LIST)):
        u = users.loc[uid]
        s = pd.Series(index=words, data=xall[row].toarray().ravel())
        s = s[s > 0].sort_values(ascending=False)
        
        top_words = s.head(SAVE_TEXT_WORDS).index.str.replace(' ', '_').tolist()
        rows.append([rank, u.UserName, len(s), s.max(), s.mean()] + top_words)

        if 0:
            # filter out words/bigrams that are in longer bi/trigrams
            s = s.head(1000)
            # sort by token length, longest first
            s = s[s.index.str.len().argsort()[::-1]]
            use = {}
            for w, c in s.items():
                if all(w not in prev for prev in use): # O(N^2) runtime alert!
                    use[w] = c
            s = pd.Series(use)
            s = s.sort_values(ascending=False)

        tier = df.iloc[0].DiscussionTier
        days = df.PostDate.dt.date.nunique()
        chars = df.Message.str.len().sum()

        top = s.head(SHOW_TOP_WORDS).index
        top = [f"<a href='{search_url(w)}'>{w}</a>" for w in top]
        top = ', '.join(top)

        html = (
            f'<img src="{IMGS[tier]}" style="display: inline;" /> &nbsp; '
            f'<h1 style="display: inline;" id="{u.UserName}">#{rank:.0f} {u.DisplayName}</h1> '
            f'(@{u.UserName}) '
            f'- Joined {u.RegisterDate.strftime("%A %-d %b %Y")}'
            #
            f'<ul>'
            f'<li><a href="{HOST}/{u.UserName}/discussion">Discussion index</a>'
            f'<li>Posted in {df.ForumTopicId.nunique()} unique topics'
            f'<li>{days} unique days;'
            f'    {df.shape[0]/days:.1f} posts per day'
            f'<li>{df.shape[0]} messages;'
            f'    {chars} raw characters;'
            f'    {int(chars/df.shape[0])} chars per message'
            f'<li>Top {SHOW_TOP_WORDS} words: {top}'
            f'</ul>'
            f'<h3>Top Forums</h3>'
        )
        
        display(HTML(html))
        # count this user's messages per forum title (one row per message, not per forum)
        topf = df.ForumId.map(forums.Title).value_counts()
        topf = topf.to_frame("Message Count")
        topf.index.name = "Forum"
        display(topf.head(5))
        
        wc = WordCloud(background_color='black',
                       width=800,
                       height=600,
                       colormap=COLORS[tier],
                       font_path=FONT_PATH,
                       random_state=1 + row,
                       min_font_size=10,
                       max_font_size=200).generate_from_frequencies(s)
        fig, ax = plt.subplots(figsize=(12, 9))
        ax.imshow(wc, interpolation='bilinear')
        ax.axis('off')
        plt.tight_layout()
        plt.show()
        
    # save stats of all users to one file
    c1 = [ 'Rank', 'UserName', 'count', 'max', 'mean' ]
    c2 = [f'tok{i}' for i in range(SAVE_TEXT_WORDS)]
    df = pd.DataFrame(rows, columns=c1+c2).set_index('Rank')
    df.to_csv(f'user_word_stats.csv')

In [11]:
generate_clouds()

# Kaggle Writers 2D Semantic Space

Is there a way to see how similar each word cloud is to the others?

We have one row of data per user and one column per word, bigram or trigram, so there are many columns. We can use linear algebra (TruncatedSVD) to compress the data to fewer dimensions, which makes entries easier to compare, then use manifold learning (t-SNE) to force it down to just two dimensions. Each user should end up positioned in the 2D space close to other users with similar word and phrase usage. We may see clusters of users who use similar language, or outliers, where a user posts one specific kind of message a lot.
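
The two-step reduction described above can be tried on toy data first. The following is a minimal sketch, assuming a random, mostly-zero matrix as a stand-in for the real TF-IDF output; the sizes (30 users, 200 terms, 10 components) and the perplexity are illustrative choices, not the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for the TF-IDF matrix: 30 "users" x 200 "terms", mostly zeros
x = rng.random((30, 200)) * (rng.random((30, 200)) < 0.1)

# step 1: linear compression to a handful of dense components
svd = TruncatedSVD(n_components=10, random_state=42)
xc = svd.fit_transform(x)

# step 2: non-linear embedding into 2D; perplexity must be below the sample count
tsne = TSNE(n_components=2, perplexity=5, init='pca', random_state=0)
x2 = tsne.fit_transform(xc)

print(xc.shape, x2.shape)
```

Only the distances between nearby points in a t-SNE plot carry much meaning; the axes themselves have no interpretation, and the layout can change with the random seed and perplexity.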


In [12]:
from sklearn.decomposition import TruncatedSVD
NSVD = 80
svd = TruncatedSVD(n_components=NSVD, random_state=42)
xc = svd.fit_transform(xall)
svd.explained_variance_ratio_.cumsum()

In [13]:
from sklearn.manifold import TSNE
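# n_iter was renamed to max_iter in scikit-learn 1.5, so pick whichever keyword this version accepts
import inspect
ITER_KW = 'max_iter' if 'max_iter' in inspect.signature(TSNE).parameters else 'n_iter'
tsne = TSNE(perplexity=20, early_exaggeration=1, init='pca', method='exact',
            learning_rate=5, **{ITER_KW: 5000})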
x2 = tsne.fit_transform(xc)

users_df = pd.DataFrame(list(docs.index), columns=KEY_LIST)
users_df = pd.concat((users_df, pd.DataFrame(x2).add_prefix('tsne')), axis=1)
users_df = users_df.set_index('PostUserId')
users_df = users_df.join(users)
users_df = users_df.sort_values('DiscussionRanking', ascending=False)

Now plot it!

In [14]:
fig = px.scatter(
    users_df.assign(Tier=np.asarray(TIER_NAMES)[users_df.PerformanceTier],
                    Size=(10 / np.log1p(users_df.DiscussionRanking)).round(3),
                    RegisterDate=users_df.RegisterDate.dt.strftime("%A %-d %b %Y"),
                    Year=users_df.RegisterDate.dt.year),
    title='Kaggle Writers 2D Semantic Space',
    x='tsne0',
    y='tsne1',
    #symbol='Year',
    size='Size',
    hover_name='DisplayName',
    hover_data=['UserName', 'RegisterDate', 'DiscussionRanking'],
    color='Tier',
    color_discrete_map=dict(zip(TIER_NAMES, TIER_COLORS)))
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.update_layout(showlegend=False)

It seems there is a set of *core* users who've been on the site longer, in the middle of the plot, probably using similar terms.

Some of the points are so close they are overlapping - hover over them to see the names or click+drag to zoom in.

Idea: use voting information to connect the dots. Do users who write using similar vocabularies up-vote **each other** more often?!

### See Also

[Wordclouds of Competition Forums](https://www.kaggle.com/jtrotman/wordclouds-of-competition-forums)