## StarSpace

StarSpace [[1]](#fn1) is an example of the entity embedding approach, which uses a similarity function between entities to construct a prediction task for a neural network. It learns to embed objects of different types in a common vector space where they can be compared to each other. The problems that StarSpace can address include learning word-, sentence-, and document-level embeddings, ranking, text classification, graph embedding, and image classification.

This notebook requires a working StarSpace program, which can be built on any modern Linux or Windows machine as described in the building instructions in its [GitHub repository](https://github.com/facebookresearch/StarSpace). In addition, the following packages are required:

- gensim==3.8.3
- matplotlib==3.3.2
- scikit-learn==0.23.2

-----
<span id="fn1"> [1] Ledell Yu Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. Starspace: Embed all the things! In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 5569–5577, 2018. </span>

----

We will follow the official StarSpace documentation and reimplement its text classification example. First of all, we need to compile the StarSpace binary.

In [5]:
!git clone git@github.com:facebookresearch/StarSpace.git

Cloning into 'StarSpace'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 873 (delta 0), reused 0 (delta 0), pack-reused 868
Receiving objects: 100% (873/873), 3.05 MiB | 654.00 KiB/s, done.
Resolving deltas: 100% (567/567), done.


In [6]:
!cd StarSpace && make

g++ -pthread -std=gnu++11 -O3 -funroll-loops -g -c src/utils/normalize.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/dict.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -g -c src/utils/args.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/proj.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/parser.cpp -o parser.o
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/data.cpp -o data.o
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/model.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/starspace.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/doc_parser.cpp -o doc_parser.o
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/doc_data.cpp -o doc_data.o
g++ -pthread -std=gnu++11

The executable is now available as `StarSpace/starspace`. The original bash script for the text classification example is available in the [StarSpace GitHub repository](https://github.com/facebookresearch/Starspace/blob/master/examples/classification_ag_news.sh). We will reimplement it as a Jupyter notebook.

The data is based on [Antonio Gulli's corpus (AG)](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), which is a collection of more than 1 million news articles. Zhang et al. [[2]](#fn2) used it to construct a smaller corpus by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, so the corpus contains 120,000 training samples and 7,600 testing samples in total. Let's download, unpack and inspect the corpus.

----
<span id="fn2"> [2] Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).</span>

In [14]:
!wget -c https://dl.fbaipublicfiles.com/starspace/ag_news_csv.tar.gz -P data
!cd data && tar -xzvf ag_news_csv.tar.gz
!ls data

--2020-11-10 16:08:16--  https://dl.fbaipublicfiles.com/starspace/ag_news_csv.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 2606:4700:10::6816:4b8e, 2606:4700:10::ac43:904, 2606:4700:10::6816:4a8e, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2606:4700:10::6816:4b8e|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

ag_news_csv/
ag_news_csv/train.csv
ag_news_csv/test.csv
ag_news_csv/classes.txt
ag_news_csv/readme.txt
ag_news_csv  ag_news_csv.tar.gz


There are four classes. Each news item in the train and test sets is labelled with the (1-based) line number of its class in `classes.txt`.
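As a minimal sketch of this lookup, the class names below are copied from `classes.txt`, and the helper name `class_name` is hypothetical (it is not part of the dataset or of StarSpace):

```python
# Class names in the order in which they appear in classes.txt (one per line).
classes = ["World", "Sports", "Business", "Sci/Tech"]


def class_name(index):
    """Map a 1-based class index from the CSV to its name (hypothetical helper)."""
    return classes[int(index) - 1]


# The CSV stores the index as a quoted string, e.g. "3".
print(class_name("3"))
```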

In [107]:
!cat data/ag_news_csv/classes.txt
!head -n 5 data/ag_news_csv/train.csv

World
Sports
Business
Sci/Tech
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
"3","Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market."
"3","Oil and Economy Cloud Stocks' Outlook (Reuters)","Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums."
"3","Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)","Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday."
"3","Oil prices soar to all-time re

Let's read the data into a Pandas DataFrame object and perform text processing. The text will be converted to lowercase and a number of characters will be replaced. The category also needs to be prefixed with `__label__`, as required by the fastText file format. The transformed data will be shuffled and written back into a fastText-compatible text file. As shown below, both the train and test data are perfectly balanced across the four categories.

In [55]:
import pandas as pd
import os
from pprint import pprint

idx2category = {1: '__label__world',2: '__label__sports', 3:'__label__business', 4:'__label__scitech'}

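def preprocess(df):
    # Map the numeric class index to its fastText-style label.
    df = df.replace({'category': idx2category})
    # Merge title and body into a single lowercase text column.
    df['text'] = df['title'] + ' ' + df['body']
    df = df.drop(labels=['title', 'body'], axis=1)
    df['text'] = df['text'].str.lower()
    # Literal (non-regex) replacements, applied in the given order.
    for s, rep in [("'"," ' "),
                   ('"',''),
                   ('.',' . '),
                   ('<br />',' '),
                   (',',' , '),
                   ('(',' ( '),
                   (')',' ) '),
                   ('!',' ! '),
                   ('?',' ? '),
                   (';',' '),
                   (':',' '),
                   ('\\',''),
                   ('  ',' ')
                  ]:
        # regex=False keeps '.', '(' and '?' from being read as regex syntax.
        df['text'] = df['text'].str.replace(s, rep, regex=False)
    # Shuffle the rows.
    df = df.sample(frac=1)
    return df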

for filename in ['data/ag_news_csv/train.csv','data/ag_news_csv/test.csv']:
    df = pd.read_csv(filename, names=['category', 'title', 'body'])
    df = preprocess(df)
    print('File {}'.format(os.path.split(filename)[1]))
    pprint(df['category'].value_counts().to_dict())
    with open('{}.pp'.format(os.path.splitext(filename)[0]), 'w') as fp:
        for row in df.itertuples():
            fp.write('{} {}\n'.format(row.category, row.text))

File train.csv
{'__label__business': 30000,
 '__label__scitech': 30000,
 '__label__sports': 30000,
 '__label__world': 30000}
File test.csv
{'__label__business': 1900,
 '__label__scitech': 1900,
 '__label__sports': 1900,
 '__label__world': 1900}
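To make the target file format concrete, here is a small sketch of how one line is written and read back. The helper names `to_fasttext_line` and `parse_fasttext_line` are hypothetical, and the parsing rule (every whitespace-separated token starting with the prefix is a label) is an assumption about the format rather than StarSpace's own parser:

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <text>' (hypothetical helper)."""
    return "{}{} {}".format(prefix, label, text)


def parse_fasttext_line(line, prefix="__label__"):
    """Split a line into (labels, words) based on the label prefix."""
    tokens = line.split()
    labels = [t[len(prefix):] for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return labels, words


line = to_fasttext_line("business", "oil prices soar")
print(line)
print(parse_fasttext_line(line))
```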


We can now run StarSpace on the preprocessed files, using exactly the same parameters as the example from the StarSpace repository. The combination of `trainMode 0` and the `fastText` file format (the default) selects the mode in which the labels are individual tokens, i.e., a classification task.

In [56]:
!./StarSpace/starspace train \
  -trainFile "data/ag_news_csv/train.pp" \
  -model "data/ag_news_csv/model" \
  -initRandSd 0.01 \
  -adagrad false \
  -ngrams 1 \
  -lr 0.01 \
  -epoch 5 \
  -thread 20 \
  -dim 10 \
  -negSearchLimit 5 \
  -trainMode 0 \
  -label "__label__" \
  -similarity "dot" \
  -verbose false

Arguments: 
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 20
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 0
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : data/ag_news_csv/train.pp
Read 5M words
Number of words in dictionary:  94698
Number of labels in dictionary: 4
Loading data from file : data/ag_news_csv/train.pp
Total number of examples loaded : 120000
Training epoch 0: 0.01 0.002
Epoch: 100.0%  lr: 0.008000  loss: 0.032078  eta: <1min   tot: 0h0m0s  (20.0%)
 ---+++   
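The argument dump above lists `loss: hinge`, `margin: 0.05` and `similarity: dot`. As a rough illustration of what a margin-based ranking loss of this kind computes for a single training example, consider the sketch below. It is written under the assumption that the loss sums `max(0, margin - sim(text, correct label) + sim(text, wrong label))` over the sampled negative labels; it is not StarSpace's actual implementation, and all vectors are made-up toy values.

```python
def dot(u, v):
    """Dot-product similarity between two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def hinge_ranking_loss(text_vec, pos_label_vec, neg_label_vecs, margin=0.05):
    """Sum of hinge terms over the sampled negative labels (illustrative sketch)."""
    pos_sim = dot(text_vec, pos_label_vec)
    return sum(max(0.0, margin - pos_sim + dot(text_vec, neg))
               for neg in neg_label_vecs)


text = [1.0, 0.0]                    # toy text embedding
correct = [0.5, 0.0]                 # toy embedding of the correct label
negatives = [[0.2, 0.0], [0.6, 0.0]] # toy embeddings of two wrong labels

# Only the second negative violates the margin and contributes to the loss.
print(hinge_ranking_loss(text, correct, negatives))
```

A negative label contributes nothing once it scores lower than the correct label by at least the margin, so training concentrates on the labels that are still confused with the correct one.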

The resulting StarSpace model is an embedding of the input into a common space, which is 10-dimensional in our case (remember the `-dim 10` setting). It can be easily loaded into a dataframe and inspected. As shown in the table below, the model embeds everything into a common space: the words that are present in the documents as well as the categories (the last four rows). This way, we are free to compare entities of different kinds.

In [57]:
pd.read_csv('data/ag_news_csv/model.tsv', sep='\t', header=None, keep_default_na=False)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `.` | 0.007482 | 0.003819 | 0.012805 | 0.023903 | -0.014069 | 0.037838 | -0.020683 | -0.100598 | 0.019306 | -0.064737 |
| 1 | `the` | 0.012291 | -0.020647 | -0.005732 | 0.001266 | 0.007939 | 0.029393 | -0.015839 | -0.009221 | -0.006826 | 0.030366 |
| 2 | `,` | 0.014190 | -0.031597 | 0.019656 | 0.018430 | 0.028667 | -0.000426 | -0.013080 | 0.034038 | 0.033295 | 0.019923 |
| 3 | | -0.018692 | -0.010998 | 0.024330 | 0.028363 | -0.088141 | 0.007864 | 0.014710 | -0.001971 | 0.046167 | -0.050714 |
| 4 | `to` | -0.004290 | 0.027648 | 0.003929 | 0.017618 | -0.025623 | 0.008797 | 0.000542 | 0.017099 | 0.002773 | 0.011998 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94697 | `hand-to-hand` | 0.010720 | -0.020686 | -0.004865 | 0.016998 | -0.028212 | 0.002781 | -0.018049 | 0.001938 | -0.016406 | 0.000593 |
| 94698 | `__label__business` | -0.096305 | 0.075784 | -0.090052 | -0.233349 | 0.024357 | -0.265917 | 0.222153 | 0.049783 | 0.356689 | 0.134585 |
| 94699 | `__label__scitech` | 0.029514 | -0.248513 | 0.031928 | 0.022888 | -0.443912 | -0.124359 | 0.026620 | 0.110927 | 0.017640 | -0.150377 |
| 94700 | `__label__world` | 0.031889 | 0.127961 | 0.114498 | 0.191659 | 0.071189 | 0.209188 | -0.016802 | -0.242704 | 0.146559 | -0.245721 |
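Because words and labels live in the same space, a text can be scored against every label directly. The sketch below illustrates the idea with made-up 2-dimensional vectors. It assumes that a text is represented by the sum of its word vectors and compared to the labels with the dot product; StarSpace may additionally rescale the text vector, which would not change the ranking. The names `embed_text` and `rank_labels` are hypothetical.

```python
# Toy embedding table: words and labels share one vector space (made-up values).
toy_embeddings = {
    "oil": [0.9, 0.1],
    "stocks": [0.8, 0.0],
    "match": [0.0, 0.9],
    "__label__business": [1.0, 0.0],
    "__label__sports": [0.0, 1.0],
}


def embed_text(text, emb, prefix="__label__"):
    """Sum the vectors of all known, non-label tokens in the text."""
    dim = len(next(iter(emb.values())))
    vec = [0.0] * dim
    for word in text.split():
        if word in emb and not word.startswith(prefix):
            vec = [a + b for a, b in zip(vec, emb[word])]
    return vec


def rank_labels(text, emb, prefix="__label__"):
    """Return all labels sorted by decreasing dot product with the text vector."""
    vec = embed_text(text, emb, prefix)
    scores = {name: sum(a * b for a, b in zip(vec, label_vec))
              for name, label_vec in emb.items() if name.startswith(prefix)}
    return sorted(scores, key=scores.get, reverse=True)


# Unknown words such as "rally" are simply skipped.
print(rank_labels("oil stocks rally", toy_embeddings))
```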


Now we can compute predictions and measure the performance. In test mode, StarSpace reports the hit@k evaluation metric, the fraction of test examples for which the correct answer is among the top k predictions. In our case, where each text has exactly one category, the hit@1 metric is the relevant one (although in general, assigning tags to texts is a multi-label classification problem). StarSpace achieves the score $hit@1=0.46$, which means that in 46% of test cases the model's first prediction is the correct answer.
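The metric itself is simple to compute. The following sketch uses hypothetical toy predictions (not StarSpace output); the function name `hit_at_k` is ours:

```python
def hit_at_k(ranked_predictions, true_labels, k=1):
    """Fraction of examples whose true label is among the top-k ranked predictions."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


# Each inner list is one example's labels, ordered from best to worst score.
ranked = [["world", "sports"], ["sports", "world"], ["world", "sports"]]
truth = ["world", "world", "sports"]

print(hit_at_k(ranked, truth, k=1))
print(hit_at_k(ranked, truth, k=2))
```

With a single correct label per example, hit@1 coincides with ordinary classification accuracy.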

In [58]:
!./StarSpace/starspace test \
  -model "data/ag_news_csv/model" \
  -testFile "data/ag_news_csv/test.pp" \
  -ngrams 1 \
  -dim 10 \
  -label "__label__" \
  -thread 10 \
  -similarity "dot" \
  -trainMode 0 \
  -verbose false \
  -predictionFile "data/ag_news_csv/test.y"

Arguments: 
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 50
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to load a trained starspace model.
STARSPACE-2018-2
Model loaded.
Loading data from file : data/ag_news_csv/test.pp
Total number of examples loaded : 7600
Predictions use 4 known labels.
------Loaded model args:
Arguments: 
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropout

The performance of StarSpace in this particular example is not great. It differs significantly from the published results [[1]](#fn1), where the authors report 91.6% accuracy on the test set for this task. It is unclear what causes this discrepancy. On the other hand, the performance of a baseline classifier based on tf-idf + SVM can be easily demonstrated. Its performance is similar to that of the BOW + multinomial logistic regression reported in the paper.

In [60]:
import gensim
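def to_tfidf(documents, dic=None, tfidf_model=None):
    """Convert raw texts to tf-idf vectors.

    Pass the dictionary and tf-idf model fitted on the training data
    to transform the test data into the same feature space.
    """
    # Apply gensim's default filters (lowercasing, stopword removal, stemming, ...).
    documents = [gensim.parsing.preprocessing.preprocess_string(doc) for doc in documents]
    if dic is None:
        dic = gensim.corpora.Dictionary(documents)
        # Drop very rare and very common tokens (gensim defaults).
        dic.filter_extremes()
    bows = [dic.doc2bow(doc) for doc in documents]
    if tfidf_model is None:
        tfidf_model = gensim.models.tfidfmodel.TfidfModel(dictionary=dic)
    tfidf_vectors = tfidf_model[bows]
    return tfidf_vectors, dic, tfidf_model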


train = pd.read_csv('data/ag_news_csv/train.csv', names=['category', 'title', 'body'])
X_train = [x.title + ' ' + x.body for x in train.itertuples()]
y_train = [x.category for x in train.itertuples()]

test = pd.read_csv('data/ag_news_csv/test.csv', names=['category', 'title', 'body'])
X_test = [x.title + ' ' + x.body for x in test.itertuples()]
y_test = [x.category for x in test.itertuples()]

X_train_tfidf, dic, tfidf_model = to_tfidf(X_train)
X_test_tfidf, _, __ = to_tfidf(X_test, dic, tfidf_model)
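To show what the tf-idf transformation does to a document, here is a pure-Python sketch on a toy corpus. It assumes the weighting that, to our understanding, gensim's `TfidfModel` uses by default: raw term frequency times `log2(N / df)`, followed by L2 normalization. The function name `tfidf_vectors` is hypothetical, and the sketch leaves out the preprocessing and dictionary filtering done above.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Compute L2-normalized tf-idf weights for a list of token lists."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Keep only non-zero weights; terms present in every document get weight 0.
        vectors.append({t: w / norm for t, w in weights.items() if norm > 0 and w > 0})
    return vectors


docs = [["oil", "price"], ["oil", "match"]]
for vec in tfidf_vectors(docs):
    print(vec)
```

In this toy corpus the term `oil` occurs in both documents, so it carries no discriminative weight and drops out of both vectors.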

The tf-idf model in combination with a linear SVM achieves an accuracy of 91%. Because this is an ordinary multiclass classification problem, this metric is the same as the hit@1 reported by StarSpace.

In [61]:
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y_train)

svc = LinearSVC()
svc.fit(gensim.matutils.corpus2csc(X_train_tfidf, num_terms=len(dic)).T, le.transform(y_train))
y_predicted = svc.predict(gensim.matutils.corpus2csc(X_test_tfidf, num_terms=len(dic)).T)
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(le.transform(y_test), y_predicted)))

Accuracy: 0.910


Now that we have embeddings for a large number of words, let's try to run clustering to see if the embedding vectors can be used to partition the words into four categories.

In [62]:
import numpy as np
from sklearn.cluster import KMeans

model = pd.read_csv('data/ag_news_csv/model.tsv', sep='\t', header=None, keep_default_na=False)
embeddings = model[model.columns[1:]]
kmeans = KMeans(n_clusters=4, random_state=42).fit(embeddings)
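The cluster index stored for each row corresponds to the assignment step of k-means: every point is assigned to its nearest centroid in Euclidean distance. The sketch below reproduces that step with NumPy on made-up 2-dimensional points; the helper `assign_to_centroids` is hypothetical and is not part of scikit-learn.

```python
import numpy as np


def assign_to_centroids(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centroids).
    diffs = points[:, None, :] - centroids[None, :, :]
    squared_distances = (diffs ** 2).sum(axis=2)
    return squared_distances.argmin(axis=1)


points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

print(assign_to_centroids(points, centroids))
```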

The resulting clusters are quite interesting. The first three correspond closely to the topics World, Sci/Tech, and Business, while the last and by far the largest is less specific and contains words from all topics.

In [63]:
words_array = model[0].to_numpy()
for ci in range(kmeans.n_clusters):
    cluster_words = np.compress(kmeans.labels_==ci, words_array)
    print('Cluster {} ({} instances)'.format(ci, len(cluster_words)))
    print(cluster_words[:100])
    print('')

Cluster 0 (1584 instances)
['.' '-' "'" 'ap' 'iraq' 'york' 'president' 'says' 'sunday' 'would'
 'security' 'government' 'people' 'afp' 'night' 'china' 'minister' 'bush'
 'killed' 'stocks' 'european' 'talks' 'league' 'country' 'reported'
 'british' 'japan' 'india' 'police' 'prime' 'iraqi' 'leader' 'say'
 'baghdad' 'expected' 'election' 'her' 'north' 'under' 'war' 'australia'
 'military' 'cut' 'nuclear' 'higher' 'un' 'official' 'palestinian' 'sox'
 'attack' 'troops' 'russia' 'israeli' 'gaza' 'press' 'west' 'even'
 'including' 'general' 'man' 'iran' 'football' 'released' 'forces'
 'athens' 'past' 'europe' 'investors' 'peace' 'release' 'canadian'
 'russian' 'beat' 'pakistan' 'public' 'eu' 'where' 'foreign'
 'presidential' 'bomb' 'attacks' 'israel' 'nations' 'championship' 'korea'
 'australian' 'kerry' 'leaders' 'french' 'men' 'face' 'death' 'killing'
 'darfur' 'leading' 'arafat' '#36' 'seven' 'army' 'capital']

Cluster 1 (1735 instances)
['' '(' 'by' 'reuters' '&lt' 'first' 'u' 'company' '