# CommonLit: What Are We Reading About? 
## Visualization with Bokeh and Topic Modeling with BERTopic

When I got interested in this competition, I wondered what all the texts in our training and test sets are about. We have limited information about the sources and categories of the texts, so we need other tools to find out. I found some public notebooks that use topic modeling with LSA/LDA, but in my experience BERTopic can be an interesting alternative.

The dataset is not huge, so we can display all the texts in a single scatterplot and explore them with a hover tool.

If you're interested in topic modeling, I've written a blog post about it here (you can open and run it in colab): https://skok.ai/2021/05/27/Topic-Models-Introduction.html

**If you find this useful, please upvote!**

In [None]:
!pip install --upgrade pip
!pip install --upgrade numpy
!pip install --upgrade sentence_transformers
!conda install -c conda-forge hdbscan -y
!pip install bokeh
!pip install --upgrade bertopic[visualization]

In [None]:
import random

import numpy as np
import pandas as pd
import sklearn.manifold
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

import bokeh.io
import bokeh.models as bmo
import bokeh.plotting as bpl
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, HoverTool, LinearColorMapper
from bokeh.palettes import plasma, d3, Turbo256
from bokeh.plotting import figure
from bokeh.transform import transform

random.seed(42)  # makes the random color sampling for topics reproducible
bokeh.io.output_notebook()  # render Bokeh plots inline in the notebook

In [None]:
sub = pd.read_csv('../input/commonlitreadabilityprize/sample_submission.csv')
test = pd.read_csv('../input/commonlitreadabilityprize/test.csv')
train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')

train['set'] = 'train'
test['set'] = 'test'

combined = pd.concat([train, test], ignore_index=True)
# Test rows have no target; fill them with the placeholder value 3
combined['target'] = combined['target'].fillna(3)

texts = combined.excerpt.values.tolist()
targets = combined.target.values.tolist()
sets = combined.set.values.tolist()

model = SentenceTransformer('stsb-distilbert-base')
embeddings = model.encode(texts)
# Fix random_state so the 2D layout is the same on every run
out = sklearn.manifold.TSNE(n_components=2, random_state=42).fit_transform(embeddings)

color_mapper = LinearColorMapper(palette='Plasma256', low=min(targets), high=max(targets))

# Semantic space with targets color map

In the first chart, I'd like to explore the semantic space of the texts and see whether it correlates with the targets. Each dot represents one text, and the color intensity corresponds to its target. I arbitrarily set the target of texts from the test set to 3 so that I can distinguish them on the map.
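
To see how the placeholder value interacts with the color scale, here is a minimal, dependency-free sketch of a linear color mapping. It is an approximation for illustration only, not Bokeh's actual `LinearColorMapper` implementation; the function `linear_color_index` and the target values are hypothetical.

```python
def linear_color_index(value, low, high, n_colors=256):
    """Map a value in [low, high] to a palette index in [0, n_colors - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    frac = (value - low) / (high - low)
    frac = min(max(frac, 0.0), 1.0)  # clip values outside the range
    return min(int(frac * n_colors), n_colors - 1)


# Hypothetical targets: four training scores plus the placeholder 3 for a test row
train_targets = [-3.5, -1.0, 0.0, 1.5]
PLACEHOLDER = 3.0
targets = train_targets + [PLACEHOLDER]

low, high = min(targets), max(targets)
indices = [linear_color_index(t, low, high) for t in targets]
print(indices)
```

Because `high` is taken from `max(targets)`, the placeholder sets the top of the scale: test points get the last palette color, and training points share the remaining part of the range.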

In [None]:
SETS = ['train', 'test']
MARKERS = ['circle', 'triangle']

list_x = out[:,0]
list_y = out[:,1]
desc = texts

source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, targets=targets, dset=sets))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("(x,y)", "(@x, @y)"),
    ('desc', '@desc'),
    ('targets', '@targets'),
    ('dset', '@dset')
])

# Bokeh 3 expects width/height and legend_field (older releases used plot_width/plot_height and legend)
p = figure(width=800, height=800, tools=[hover], title="First Look at the Data")
p.scatter('x', 'y', size=10, source=source, legend_field='dset',
          color={'field': 'targets', 'transform': color_mapper},
          marker=bokeh.transform.factor_mark('dset', MARKERS, SETS))

bpl.show(p)

# Semantic space with topics

Now let's see if we can cluster the texts with the BERTopic topic model. The model seems to have a hard time discovering topics in the stories and prose; it does much better on texts that appear to come from Wikipedia or similar sources. There is also a substantial percentage of outliers.
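
BERTopic assigns topic `-1` to documents it could not place in any cluster, so the share of outliers can be measured by counting that label. The sketch below uses a hypothetical list of topic assignments in place of the `topics` returned by `fit_transform`; `outlier_share` is an illustrative helper, not part of BERTopic.

```python
from collections import Counter


def outlier_share(topic_ids):
    """Return the fraction of documents labeled -1 (the outlier topic)."""
    if not topic_ids:
        raise ValueError("topic_ids must not be empty")
    return Counter(topic_ids)[-1] / len(topic_ids)


# Hypothetical assignments for eight documents
example_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
share = outlier_share(example_topics)
print(f"{share:.1%} of documents are outliers")
```

Running the same helper on the real `topics` list would give the outlier percentage for this dataset.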

In [None]:
model = BERTopic(language="english", min_topic_size=20)
topics, probs = model.fit_transform(texts)

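# Human-readable label per topic; index 0 is reserved for the outlier topic (-1)
topic_words = ['-1: outlier']
for i in range(len(set(topics)) - 1):
    tpc = model.get_topic(i)[:7]  # top 7 (word, score) pairs
    words = [x[0] for x in tpc]
    tw = ' '.join([str(i) + ':'] + words)
    topic_words.append(tw)

# Topic ids start at -1, so shift by one to index into topic_words
exp_topics = [topic_words[x + 1] for x in topics]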

clrs = random.sample(Turbo256, len(set(topics)))
color_map = bmo.CategoricalColorMapper(factors=topic_words, palette=clrs)
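
The list lookup with `x+1` in the cell above assumes that the outlier topic `-1` is present and that topic ids are contiguous. If a run produced no outliers, the last topic would fall off the end of the list. A dictionary keyed by topic id avoids that assumption. This is a minimal sketch: `build_topic_labels` is a hypothetical helper, and `topic_keywords` is made-up data standing in for the `(word, score)` pairs that `model.get_topic(i)` returns.

```python
def build_topic_labels(topic_keywords, top_n=7):
    """Build a 'id: word word ...' label per topic id.

    topic_keywords maps a topic id to a list of (word, score) pairs.
    """
    labels = {-1: "-1: outlier"}
    for topic_id, pairs in topic_keywords.items():
        if topic_id == -1:
            continue  # the outlier label is fixed above
        words = [word for word, _ in pairs[:top_n]]
        labels[topic_id] = " ".join([f"{topic_id}:"] + words)
    return labels


# Hypothetical keywords for two topics, and assignments without any outliers
topic_keywords = {
    0: [("sea", 0.10), ("ship", 0.08)],
    1: [("cell", 0.12), ("plant", 0.09)],
}
topics_example = [0, 1, 0, 1]

labels = build_topic_labels(topic_keywords)
exp_topics_example = [labels[t] for t in topics_example]
print(exp_topics_example)
```

The values of `labels` could then serve as the `factors` of the categorical color mapper in the same way `topic_words` does.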

In [None]:
list_x = out[:,0]
list_y = out[:,1]
desc = texts

source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, topic=exp_topics, target=targets, dset=sets,))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ('desc', '@desc'),
    ('topic', '@topic'),
    ('target', '@target'),
    ('dset', '@dset'),
])

# Bokeh 3 expects width/height and legend_field (older releases used plot_width/plot_height and legend)
p = figure(width=800, height=800, tools=[hover], title="Topics from BERTopic model")
p.scatter('x', 'y', size=10, source=source,
          fill_color=transform('topic', color_map),
          marker=bokeh.transform.factor_mark('dset', MARKERS, SETS),
          legend_field='dset'
)
# p.legend.location = "top_left"
# p.legend.click_policy="hide"

bpl.show(p)

Here is the list of topics discovered by BERTopic in our dataset. 

In [None]:
topic_df = model.get_topic_freq()

def get_keywords(i):
    if i == -1: return 'outlier'
    tpc = model.get_topic(i)[:7]
    words = [x[0] for x in tpc]
    tw = ' '.join(words)
    return tw

topic_df['keywords'] = topic_df['Topic'].apply(get_keywords)

topic_df
