## StarSpace

StarSpace [[1]](#fn1) is an entity embedding approach which uses a similarity function between entities to construct a prediction task for a neural network. It maps objects of different types into a common vector space where they can be compared to each other. StarSpace can learn word-, sentence- and document-level embeddings, and it can be applied to ranking, text classification, graph embedding, image classification, and other tasks. We will follow the official documentation of StarSpace and implement simple text classification.
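
To make the idea of a similarity-based prediction task more concrete, here is a minimal sketch of a margin ranking (hinge) loss over dot-product similarities. It rewards an entity for being more similar to its correct partner than to randomly chosen wrong ones. The function names, the toy vectors and the simplified form of the loss are assumptions made for illustration; this is not StarSpace's actual implementation.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def hinge_ranking_loss(anchor, positive, negatives, margin=0.05):
    """Sum of max(0, margin - sim(anchor, positive) + sim(anchor, negative))."""
    pos_sim = dot(anchor, positive)
    return sum(max(0.0, margin - pos_sim + dot(anchor, neg)) for neg in negatives)


# Toy 2-dimensional embeddings (made-up values, for illustration only).
document = [1.0, 0.0]
correct_label = [0.5, 0.0]
wrong_labels = [[0.2, 0.9], [0.6, -0.3]]

loss = hinge_ranking_loss(document, correct_label, wrong_labels)
print(round(loss, 3))
```

Only the second wrong label contributes to the loss here, because it is more similar to the document than the correct label is.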

This notebook requires a working StarSpace program, which can be built on any modern Linux or Windows machine as described in the building instructions in the [GitHub repository](https://github.com/facebookresearch/StarSpace). Here, we use the Linux toolchain to build the StarSpace executable. If you run this notebook on Windows, you can use either Visual Studio or tools such as [MinGW with MSYS](http://www.mingw.org/) or [Cygwin](https://www.cygwin.com/) to compile StarSpace.

-----
<span id="fn1"> [1] Ledell Yu Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. Starspace: Embed all the things! In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 5569–5577, 2018. </span>

----

First of all, we need to ensure that all the required libraries are available. The `-q` parameter is used to suppress long installation reports produced by `pip`.

In [1]:
!pip -q install gensim==3.8.3
!pip -q install matplotlib==3.3.3
!pip -q install scikit-learn==0.23.2
!pip -q install pandas==1.1.4

We clone the source code repository and compile the starspace binary.

In [14]:
import os
class StopExecution(Exception):
    def _render_traceback_(self):
        pass

if os.name == 'nt':
    print('ERROR: you are running this notebook on a Windows system. Please open the StarSpace Visual Studio solution file (https://github.com/facebookresearch/StarSpace/blob/master/MVS/StarSpace.sln) and build the project.')   
    raise StopExecution
else:
    !git clone https://github.com/facebookresearch/StarSpace.git && cd StarSpace && make

Cloning into 'StarSpace'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 873 (delta 0), reused 0 (delta 0), pack-reused 868
Receiving objects: 100% (873/873), 3.05 MiB | 5.39 MiB/s, done.
Resolving deltas: 100% (567/567), done.
g++ -pthread -std=gnu++11 -O3 -funroll-loops -g -c src/utils/normalize.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/dict.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -g -c src/utils/args.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/proj.cpp
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/parser.cpp -o parser.o
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/data.cpp -o data.o
g++ -pthread -std=gnu++11 -O3 -funroll-loops -I/usr/local/bin/boost_1_63_0/ -g -c src/model.cpp
g++ -pthread -std

The executable is now available as `StarSpace/starspace`. The original bash script (`classification_ag_news.sh`) for the text classification example is available in the [StarSpace GitHub repository](https://github.com/facebookresearch/Starspace/blob/master/examples/classification_ag_news.sh). We reimplement it as a Jupyter notebook.

The data is based on [Antonio Gulli's corpus (AG)](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), which is a collection of more than 1 million news articles. From this collection, Zhang et al. [[2]](#fn2) constructed a smaller corpus containing only the four largest news categories from the original corpus. Each category (i.e. class value) contains 30,000 training instances and 1,900 testing instances, so there are 120,000 training samples in total while 7,600 samples are reserved for testing. We download, unpack and inspect the corpus.

----
<span id="fn2"> [2] Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).</span>

----

In [2]:
import tarfile
import requests
import os

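# Make sure the target directory exists before writing the archive.
os.makedirs('data', exist_ok=True)

response = requests.get('https://dl.fbaipublicfiles.com/starspace/ag_news_csv.tar.gz')
response.raise_for_status()  # fail early if the download did not succeed
with open('data/ag_news_csv.tar.gz', 'wb') as file:
    file.write(response.content)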

with tarfile.open('data/ag_news_csv.tar.gz', 'r:gz') as tar:
    tar.extractall(path='data')
print(os.listdir('data/ag_news_csv'))

['classes.txt', 'test.csv', 'readme.txt', 'train.csv']


There are four classes. In the train and test sets, each news item is labeled with the 1-based line number of its class in `classes.txt`. The training data looks as follows.

In [3]:
import pandas as pd
pd.set_option('display.max_colwidth', 30)
print(pd.read_csv('data/ag_news_csv/classes.txt', names=['categories']))
print(pd.read_csv('data/ag_news_csv/train.csv', names=['category', 'title', 'body']).iloc[:5])

  categories
0      World
1     Sports
2   Business
3   Sci/Tech
   category                          title                           body
0         3  Wall St. Bears Claw Back I...  Reuters - Short-sellers, W...
1         3  Carlyle Looks Toward Comme...  Reuters - Private investme...
2         3  Oil and Economy Cloud Stoc...  Reuters - Soaring crude pr...
3         3  Iraq Halts Oil Exports fro...  Reuters - Authorities have...
4         3  Oil prices soar to all-tim...  AFP - Tearaway world oil p...


We read the data into a pandas DataFrame and preprocess the text by converting it to lowercase and replacing or padding a number of punctuation characters. The category is prefixed with `__label__`, as required by the fastText file format. The transformed data is randomly shuffled and written to a fastText-compatible text file. The four news categories are balanced in the training data as well as in the test data.

In [4]:
from pprint import pprint

idx2category = {1: '__label__world', 2: '__label__sports', 3: '__label__business', 4: '__label__scitech'}

def preprocess(df):
    df = df.replace({'category': idx2category})
    df['text'] = df['title'] + ' ' + df['body']
    df = df.drop(labels=['title', 'body'], axis=1)
    df['text'] = df['text'].str.lower()
    for s, rep in [("'"," ' "),
                   ('"',''),
                   ('.',' . '),
                   ('<br />',' '),
                   (',',' , '),
                   ('(',' ( '),
                   (')',' ) '),
                   ('!',' ! '),
                   ('?',' ? '),
                   (';',' '),
                   (':',' '),
                   ('\\',''),
                   ('  ',' ')
                  ]:
        # All patterns are literal strings, so disable regex interpretation explicitly.
        df['text'] = df['text'].str.replace(s, rep, regex=False)
    df = df.sample(frac=1, random_state=42)
    return df

for filename in ['data/ag_news_csv/train.csv','data/ag_news_csv/test.csv']:
    df = pd.read_csv(filename, names=['category', 'title', 'body'])
    df = preprocess(df)
    print('File {}'.format(os.path.split(filename)[1]))
    pprint(df['category'].value_counts().to_dict())
    with open('{}.pp'.format(os.path.splitext(filename)[0]), 'w') as fp:
        for row in df.itertuples():
            fp.write('{} {}\n'.format(row.category, row.text))

File train.csv
{'__label__business': 30000,
 '__label__scitech': 30000,
 '__label__sports': 30000,
 '__label__world': 30000}
File test.csv
{'__label__business': 1900,
 '__label__scitech': 1900,
 '__label__sports': 1900,
 '__label__world': 1900}
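
Each line of the `.pp` files written above consists of the prefixed category followed by the preprocessed text. The following minimal sketch shows how such a line can be built and split back into labels and words. The helpers `to_fasttext_line` and `parse_fasttext_line` and the example sentence are hypothetical and exist only for illustration; they are not part of fastText or StarSpace.

```python
LABEL_PREFIX = '__label__'


def to_fasttext_line(label, text):
    """Format one example as '<prefix><label> <text>'."""
    return '{}{} {}'.format(LABEL_PREFIX, label, text)


def parse_fasttext_line(line):
    """Split a line into its label tokens and its remaining word tokens."""
    tokens = line.split()
    labels = [t for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    return labels, words


# Made-up example sentence, not taken from the corpus.
line = to_fasttext_line('business', 'stocks rise as oil prices fall')
print(line)
print(parse_fasttext_line(line))
```

Tokens are told apart only by the prefix, which is why the same prefix is later passed to StarSpace through the `-label` option.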


We can now run StarSpace on the preprocessed files. The set of parameters is the same as in the example from the StarSpace repository. The combination of `trainMode 0` and `fileFormat fastText` defines the mode in which the labels are individual words, i.e. the classification task.

In [5]:
!./StarSpace/starspace train \
  -trainFile "data/ag_news_csv/train.pp" \
  -model "data/ag_news_csv/model" \
  -initRandSd 0.01 \
  -adagrad false \
  -ngrams 1 \
  -lr 0.01 \
  -epoch 5 \
  -thread 20 \
  -dim 10 \
  -negSearchLimit 5 \
  -trainMode 0 \
  -label "__label__" \
  -similarity "dot" \
  -verbose false

Arguments: 
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 20
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 0
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to initialize starspace model.
Build dict from input file : data/ag_news_csv/train.pp
Read 5M words
Number of words in dictionary:  94698
Number of labels in dictionary: 4
Loading data from file : data/ag_news_csv/train.pp
Total number of examples loaded : 120000
Training epoch 0: 0.01 0.002
Epoch: 100.0%  lr: 0.008117  loss: 0.035385  eta: <1min   tot: 0h0m0s  (20.0%)
 ---++

The resulting StarSpace model embeds the input into a common 10-dimensional space (set by the `-dim 10` option). We load it into a DataFrame and inspect it. As shown in the table below, the model embeds everything into a common space: the words that are present in the documents, but also the categories (the last four rows). In this way, we can compare entities of different kinds.

In [6]:
pd.read_csv('data/ag_news_csv/model.tsv', sep='\t', header=None, keep_default_na=False)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,.,-0.001581,-0.055738,-0.001461,0.013572,-0.024389,0.012898,-0.027400,0.030329,-0.078572,-0.081473
1,the,0.041248,0.020253,-0.005631,-0.013228,-0.002068,0.004240,-0.013099,0.036625,0.028696,0.005871
2,",",-0.044975,0.007411,-0.001072,-0.001351,0.026816,0.001681,0.010960,-0.018680,-0.026508,-0.018127
3,,-0.091575,-0.034052,0.025836,-0.002135,-0.019016,0.052091,-0.035150,-0.017636,-0.067598,0.067879
4,to,0.017022,0.029204,-0.007912,0.016093,-0.007380,-0.014567,0.000096,0.024154,-0.013684,0.001100
...,...,...,...,...,...,...,...,...,...,...,...
94697,maleafter,0.010264,-0.013049,-0.005277,0.017525,-0.015361,0.006922,-0.019601,-0.002084,-0.017456,0.004337
94698,__label__business,-0.216275,-0.143102,0.020306,-0.139674,0.052156,-0.408132,0.139542,0.122431,0.164689,0.138649
94699,__label__world,-0.038814,-0.109869,0.016513,0.057183,-0.339918,0.145005,-0.015179,0.134849,-0.202327,-0.179774
94700,__label__sports,0.295794,0.340246,-0.038027,0.061239,-0.011269,0.242878,0.025850,-0.299439,0.355422,-0.252484
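
Because words and labels live in the same space, a document can be scored against every label directly. The sketch below illustrates this with made-up 3-dimensional vectors. It assumes that a document is represented by the plain sum of its word vectors and that labels are ranked by dot product; any length normalization that StarSpace may apply is ignored, and the names `embed` and `rank_labels` are hypothetical.

```python
import numpy as np

# Made-up 3-dimensional embeddings, for illustration only.
word_vectors = {
    'oil': np.array([0.9, 0.1, 0.0]),
    'prices': np.array([0.8, 0.0, 0.1]),
    'match': np.array([0.0, 0.9, 0.1]),
}
label_vectors = {
    '__label__business': np.array([1.0, 0.0, 0.0]),
    '__label__sports': np.array([0.0, 1.0, 0.0]),
}


def embed(tokens):
    """Sum the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.sum(known, axis=0) if known else np.zeros(3)


def rank_labels(tokens):
    """Return labels sorted by descending dot-product similarity."""
    doc = embed(tokens)
    scores = {label: float(doc @ vec) for label, vec in label_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


print(rank_labels('oil prices rise'.split()))
```

The first element of the returned list plays the role of the model's top prediction for the document.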


We compute predictions and measure the performance. In test mode, StarSpace reports the hit@k evaluation metric, which is the fraction of test cases in which the correct answer is among the top k predictions. We are interested in the most probable category, therefore we use the hit@1 metric (in general, assignment of categories to text can be viewed as a multi-label classification problem). StarSpace achieves the score $hit@1=0.46$, which means that in 46% of test cases the model's first prediction is the correct answer.
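
The hit@k metric itself is simple to compute from ranked predictions. The following minimal sketch uses made-up rankings for three test examples; the function name `hit_at_k` and the data are assumptions for illustration and are unrelated to StarSpace's own evaluation code.

```python
def hit_at_k(ranked_predictions, true_labels, k=1):
    """Fraction of examples whose true label is among the top-k predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Made-up rankings for three test examples, best prediction first.
ranked = [
    ['sports', 'world', 'business'],
    ['world', 'business', 'sports'],
    ['business', 'sports', 'world'],
]
truth = ['sports', 'business', 'world']

print(hit_at_k(ranked, truth, k=1))
print(hit_at_k(ranked, truth, k=2))
```

With a single true label per example, hit@1 is the same quantity as classification accuracy, and hit@k can only grow as k increases.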

In [7]:
!./StarSpace/starspace test \
  -model "data/ag_news_csv/model" \
  -testFile "data/ag_news_csv/test.pp" \
  -ngrams 1 \
  -dim 10 \
  -label "__label__" \
  -thread 10 \
  -similarity "dot" \
  -trainMode 0 \
  -verbose false \
  -predictionFile "data/ag_news_csv/test.y"

Arguments: 
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 50
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
useWeight: 0
weightSep: :
Start to load a trained starspace model.
STARSPACE-2018-2
Model loaded.
Loading data from file : data/ag_news_csv/test.pp
Total number of examples loaded : 7600
Predictions use 4 known labels.
------Loaded model args:
Arguments: 
lr: 0.01
dim: 10
epoch: 5
maxTrainTime: 8640000
validationPatience: 10
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: dot
maxNegSamples: 10
negSearchLimit: 5
batchSize: 5
thread: 10
minCount: 1
minCountLabel: 1
label: __label__
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 0
fileFormat: fastText
normalizeText: 0
dropoutLHS: 0
dropout

This result was obtained using the parameters specified by the authors in the [published example](https://github.com/facebookresearch/Starspace/blob/master/examples/classification_ag_news.sh). The hit@1 score of 46.4% is far below the published result [[1]](#fn1), where the authors report 91.6% accuracy on the test set for this task using the same number of dimensions (10).

On the other hand, our implementation of a baseline classifier based on TF-IDF + SVM, presented below, reaches an accuracy (91%) close to that of the BOW + multinomial logistic regression baseline (88.8%) reported in the paper [[3]](#fn3).

---
<span id="fn3"> [3]  Zhang, X., and LeCun, Y. 2015. Text understanding from scratch. arXiv preprint arXiv:1502.01710. </span>

----

In [8]:
import gensim
def to_tfidf(documents, dic=None, tfidf_model=None):
    documents = [gensim.parsing.preprocessing.preprocess_string(doc) for doc in documents]
    if dic is None:
        dic = gensim.corpora.Dictionary(documents)
        dic.filter_extremes()
    bows = [dic.doc2bow(doc) for doc in documents]
    if tfidf_model is None:
        tfidf_model = gensim.models.tfidfmodel.TfidfModel(dictionary=dic)
    tfidf_vectors = tfidf_model[bows]
    return tfidf_vectors, dic, tfidf_model


train = pd.read_csv('data/ag_news_csv/train.csv', names=['category', 'title', 'body'])
X_train = [x.title + ' ' + x.body for x in train.itertuples()]
y_train = [x.category for x in train.itertuples()]

test = pd.read_csv('data/ag_news_csv/test.csv', names=['category', 'title', 'body'])
X_test = [x.title + ' ' + x.body for x in test.itertuples()]
y_test = [x.category for x in test.itertuples()]

X_train_tfidf, dic, tfidf_model = to_tfidf(X_train)
X_test_tfidf, _, __ = to_tfidf(X_test, dic, tfidf_model)

TF-IDF weighting combined with a linear SVM achieves an accuracy of 91%. Because every document belongs to exactly one of the four classes, accuracy is the same quantity as the hit@1 metric reported by StarSpace.

In [9]:
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y_train)

svc = LinearSVC()
svc.fit(gensim.matutils.corpus2csc(X_train_tfidf, num_terms=len(dic)).T, le.transform(y_train))
y_predicted = svc.predict(gensim.matutils.corpus2csc(X_test_tfidf, num_terms=len(dic)).T)
print('Accuracy: {:.3f}'.format(metrics.accuracy_score(le.transform(y_test), y_predicted)))

Accuracy: 0.910


We have embeddings for a large number of words, so we can run clustering to see whether the embedding vectors can be used to partition the words into four categories.

In [14]:
import numpy as np
from sklearn.cluster import KMeans

model = pd.read_csv('data/ag_news_csv/model.tsv', sep='\t', header=None, keep_default_na=False)
embeddings = model[model.columns[1:]]
kmeans = KMeans(n_clusters=4, random_state=12345).fit(embeddings)

The three smaller clusters closely match the topics Business, World, and Sci/Tech while the largest cluster is less specific and contains words from all topics.

In [15]:
words_array = model[0].to_numpy()
for ci in range(kmeans.n_clusters):
    cluster_words = np.compress(kmeans.labels_==ci, words_array)
    print('Cluster {} ({} instances)'.format(ci, len(cluster_words)))
    print(cluster_words[:100])
    print('')

Cluster 0 (1640 instances)
['us' 'company' 'oil' 'inc' 'yesterday' '?' 'corp' 'prices' 'years'
 'group' 'season' 'deal' 'sales' 'business' 'billion' 'former'
 'washington' 'profit' 'states' '/b&gt' 'b&gt' 'chief' 'american' 'shares'
 'take' 'bank' 'third' 'federal' 'companies' 'co' 'maker' 'bid' 'largest'
 'industry' 'big' 'giant' '5' 'growth' 'investor' '//www' 'href=http'
 '/a&gt' 'trade' 'earnings' 'dollar' 'buy' 'gold' 'union' 'amp' 'stock'
 'loss' 'agreed' 'months' 'aspx' 'com/fullquote'
 'target=/stocks/quickinfo/fullquote&gt' 'like' 'firm' 'air' 'rose'
 'executive' 'jobs' 'update' 'price' 'boston' 'economy' 'drug' 'ahead'
 'pay' 'near' 'biggest' 'economic' 'peoplesoft' 'car' 'o' 'street' 'work'
 'your' 'free' '2005' 'much' '6' 'presidential' 'workers' 'wins' 'america'
 'nation' 'share' 'financial' 'fall' 'wall' 'fell' 'lower' 'september'
 'crude' 'october' 'chicago' 'job' '11' 'consumer']

Cluster 1 (89619 instances)
['the' ',' 'to' 'a' 'of' 'in' 'and' 's' 'on' 'for' '#39' ')' '