# Stack Overflow Tags

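This notebook shows how to train dense vector embeddings for Stack Overflow question tags with itembed, then project them to two dimensions with UMAP and plot them with Bokeh.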

The dataset was extracted with the [Stack Exchange Data Explorer](https://data.stackexchange.com/) and is released under [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/). It contains the first one million questions that have at least four tags, selected with the following query:

```sql
SELECT Id, Tags
FROM Posts
WHERE LEN(Tags) - LEN(REPLACE(Tags, '<', '')) >= 4
ORDER BY Id
```
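The `WHERE` clause counts tags by measuring how much shorter the string gets once every `<` is removed. The same trick can be sketched in Python, assuming the raw `Tags` column stores tags as `<tag1><tag2>...` (which the query implies but this notebook does not show); `count_tags` is a hypothetical helper, not part of any library:

```python
def count_tags(tags: str) -> int:
    """Count tags in a string such as '<c#><linq>' (assumed raw format)."""
    # Removing every '<' shortens the string by exactly one character per tag,
    # so the length difference equals the number of tags.
    return len(tags) - len(tags.replace("<", ""))


examples = [
    "<c#><linq><web-services><.net-3.5>",  # four tags, kept
    "<python><pandas>",                    # two tags, filtered out
]

# Same threshold as the SQL filter
kept = [tags for tags in examples if count_tags(tags) >= 4]
```

Only the first example passes the filter, mirroring what the query keeps.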

In [1]:
import pandas as pd

import umap

from bokeh.plotting import ColumnDataSource, figure, show
from bokeh.io import output_notebook

from itembed import (
    pack_itemsets,
    initialize_syn,
    UnsupervisedTask,
    train,
)

In [2]:
# Initialize Bokeh
output_notebook()

In [3]:
# Load raw dataset
tag_df = pd.read_csv('stackoverflow.csv')
tag_df.head(10)

| | id | tags |
|---|---|---|
| 0 | 4 | c#;floating-point;type-conversion;double;decimal |
| 1 | 11 | c#;datetime;time;datediff;relative-time-span |
| 2 | 13 | html;browser;timezone;user-agent;timezone-offset |
| 3 | 16 | c#;linq;web-services;.net-3.5 |
| 4 | 17 | mysql;database;binary-data;data-storage |
| 5 | 19 | performance;algorithm;language-agnostic;unix;pi |
| 6 | 25 | c++;c;sockets;mainframe;zos |
| 7 | 36 | sql;sql-server;datatable;rdbms |
| 8 | 39 | c#;.net;vb.net;timer |
| 9 | 42 | php;plugins;architecture;hook |


In [4]:
# Split each semicolon-separated tag string into a list of strings
itemsets = tag_df.tags.str.split(';').values

In [5]:
# Pack itemsets into contiguous arrays
labels, indices, offsets = pack_itemsets(itemsets, min_count=10)
num_label = len(labels)
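
To make the packed layout concrete, here is a minimal pure-Python sketch of what such a packing step could look like. It is not itembed's implementation: `pack_itemsets_sketch` is a hypothetical stand-in, and dropping itemsets with fewer than two remaining items is an assumption made for illustration.

```python
from collections import Counter


def pack_itemsets_sketch(itemsets, min_count=1):
    """Hypothetical stand-in for pack_itemsets, for illustration only."""
    # Count how often each label occurs across all itemsets
    counts = Counter(label for itemset in itemsets for label in itemset)

    # Keep frequent labels only and give each one an integer index
    labels = sorted(label for label, count in counts.items() if count >= min_count)
    label_to_index = {label: i for i, label in enumerate(labels)}

    indices = []   # all kept items, concatenated
    offsets = [0]  # itemset k spans indices[offsets[k]:offsets[k + 1]]
    for itemset in itemsets:
        kept = [label_to_index[label] for label in itemset if label in label_to_index]
        # Assumption: an itemset with fewer than two items has no co-occurrence
        if len(kept) < 2:
            continue
        indices.extend(kept)
        offsets.append(len(indices))
    return labels, indices, offsets


toy_itemsets = [
    ["c#", "linq", ".net"],
    ["c#", ".net", "timer"],
    ["sql", "c#"],
]
toy_labels, toy_indices, toy_offsets = pack_itemsets_sketch(toy_itemsets, min_count=2)
```

With `min_count=2`, only `c#` and `.net` survive, and the third itemset is dropped because a single item remains.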

In [6]:
# Initialize both embedding sets from a uniform distribution
num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)
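
As a rough picture of uniform initialization, the sketch below draws small values centered on zero with NumPy. The range of ±0.5 / `num_dimension` follows the word2vec convention and is an assumption here; `initialize_syn_sketch` is a hypothetical name and may differ from what `initialize_syn` actually does.

```python
import numpy as np


def initialize_syn_sketch(num_label, num_dimension, seed=0):
    """Hypothetical initializer: small uniform values centered on zero."""
    rng = np.random.default_rng(seed)
    # Assumption: word2vec-style bound, shrinking as the dimension grows
    bound = 0.5 / num_dimension
    syn = rng.uniform(-bound, bound, size=(num_label, num_dimension))
    return syn.astype(np.float32)


toy_syn0 = initialize_syn_sketch(1000, 64)
toy_syn1 = initialize_syn_sketch(1000, 64, seed=1)
```

Using a different seed for each set keeps the two matrices from starting out identical.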

In [7]:
# Define unsupervised task, i.e. using co-occurrences
task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)
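
For intuition about training on co-occurrences with negative samples, here is a minimal NumPy sketch of one skip-gram negative sampling update for a single pair. It uses the standard word2vec-style logistic loss as an assumption; `sgns_step` and `pair_loss` are hypothetical helpers, and itembed's actual sampling and update rules may differ.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pair_loss(syn0, syn1, left, right, negatives):
    """Logistic loss for one positive pair and its negative samples."""
    positive = -np.log(sigmoid(syn1[right] @ syn0[left]))
    negative = -np.log(sigmoid(-(syn1[negatives] @ syn0[left]))).sum()
    return float(positive + negative)


def sgns_step(syn0, syn1, left, right, negatives, learning_rate=0.1):
    """One gradient step: pull the pair together, push negatives away."""
    targets = np.asarray([right, *negatives], dtype=np.int64)
    expected = np.zeros(len(targets))
    expected[0] = 1.0  # only the first target truly co-occurs with `left`

    v = syn0[left].copy()
    u = syn1[targets]  # fancy indexing returns a copy
    error = sigmoid(u @ v) - expected  # gradient of the loss w.r.t. each score

    syn0[left] -= learning_rate * (error @ u)
    # subtract.at accumulates correctly even if a target index repeats
    np.subtract.at(syn1, targets, learning_rate * np.outer(error, v))


rng = np.random.default_rng(0)
demo_syn0 = rng.uniform(-0.1, 0.1, size=(4, 8))
demo_syn1 = rng.uniform(-0.1, 0.1, size=(4, 8))

loss_before = pair_loss(demo_syn0, demo_syn1, 0, 1, [2, 3])
for _ in range(100):
    sgns_step(demo_syn0, demo_syn1, left=0, right=1, negatives=[2, 3])
loss_after = pair_loss(demo_syn0, demo_syn1, 0, 1, [2, 3])
```

Repeating the step on the same pair lowers its loss, which is the behavior that `num_negative=5` scales up to five negatives per positive pair.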

In [8]:
# Do training
# Note: because the sampling strategy differs, more epochs are needed than with word2vec
train(task, num_epoch=100)

100%|██████████████████████████████████████████████████████████████████████| 1562400/1562400 [09:14<00:00, 2820.07it/s]


In [9]:
# Both embedding sets are equivalent, so either one can be used
syn = syn0
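
Before projecting, a quick sanity check on trained embeddings is to look up nearest neighbors by cosine similarity. The sketch below uses a toy matrix rather than the trained `syn`; `most_similar` is a hypothetical helper written for this example, not an itembed function.

```python
import numpy as np


def most_similar(query, labels, syn, topn=3):
    """Return the topn labels closest to `query` by cosine similarity."""
    labels = list(labels)
    # Normalize rows so that a dot product equals cosine similarity
    unit = syn / np.linalg.norm(syn, axis=1, keepdims=True)
    scores = unit @ unit[labels.index(query)]
    order = np.argsort(-scores)  # best match first
    ranked = [(labels[i], float(scores[i])) for i in order if labels[i] != query]
    return ranked[:topn]


# Toy vectors: the first two rows point one way, the last two another
toy_labels = ["c#", ".net", "sql", "mysql"]
toy_syn = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])
neighbors = most_similar("c#", toy_labels, toy_syn, topn=2)
```

With the real data, the call would look like `most_similar('c#', labels, syn)`, assuming `labels` and `syn` share the same row order.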

In [10]:
# Project with UMAP, using cosine similarity measure
model = umap.UMAP(metric='cosine', verbose=1)
projection = model.fit_transform(syn)

UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
     learning_rate=1.0, local_connectivity=1.0, metric='cosine',
     metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, random_state=None,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=1)
Construct fuzzy simplicial set
Mon Apr 27 11:08:39 2020 Finding Nearest Neighbors
Mon Apr 27 11:08:39 2020 Building RP forest with 11 trees
Mon Apr 27 11:08:41 2020 NN descent for 14 iterations
	 0  /  14
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	 5  /  14
Mon Apr 27 11:08:47 2020 Finished Nearest Neighbor Search
Mon Apr 27 11:08:51 2020 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Mon Apr 27 11:09:07 2020 Finished embedding


In [11]:
# Pack as a Bokeh data source
source = ColumnDataSource(data=dict(
    x=projection[:, 0],
    y=projection[:, 1],
    label=labels,
))

# Create plot
p = figure(
    width=900,
    height=600,
    tooltips=[
        ('label', '@label'),
    ],
)

# Draw tags as points
p.scatter(
    'x', 'y',
    source=source,
)

# Show in notebook
show(p)