## An example of using SLEEC

In [1]:
import sleec

### Check math kernel
Run the project in an environment created with [Anaconda](https://conda.io/docs/user-guide/install/index.html#regular-installation). <br>
- The NumPy and SciPy builds in such an environment are linked against MKL, which makes the matrix calculations faster.
- If any <code>*_mkl_info</code> entry is reported as <code>NOT AVAILABLE</code>, create a new environment with <code>conda create -n __[env_name]__ python=3.6 anaconda</code>.
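The internals of <code>sleec.check_mkl()</code> are not shown in this notebook. As a rough, independent check, the sketch below captures NumPy's own build report and searches it for MKL. The helper name `blas_config_text` is made up for this illustration, and the layout of the report differs between NumPy versions, so only a plain substring search is used.

```python
import contextlib
import io

import numpy as np


def blas_config_text():
    """Return NumPy's build configuration report as a string (hypothetical helper)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against
    return buffer.getvalue()


report = blas_config_text()
# The report format is version dependent, so this is only a coarse check.
print("MKL mentioned in NumPy build info:", "mkl" in report.lower())
```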

In [2]:
sleec.check_mkl()

mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/amiszhang_rx/tools/anaconda3/envs/nlu/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/amiszhang_rx/tools/anaconda3/envs/nlu/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/amiszhang_rx/tools/anaconda3/envs/nlu/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/amiszhang_rx/tools/anaconda3/envs/nlu/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/amiszhang_rx/tools/anaconda3/envs/nlu/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/amiszhang_rx/tools/anaconda3/envs/nlu/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/amiszhang_rx/tools/anaconda3/envs/nlu/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs =

### Load data
Download the [data](http://manikvarma.org/downloads/XC/XMLRepository.html) used for SLEEC. **Bibtex** is used as an example.<br>
SLEEC uses a bag-of-words model, so the raw text that a deep learning model would require is not needed.
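For orientation, here is a minimal parsing sketch. It assumes the repository's sparse text layout: a header line with the number of points, features, and labels, followed by one line per sample of the form `label,label feat:value feat:value`. That layout and the function name `parse_xc_text` are assumptions for illustration; the actual <code>sleec.load_file</code> and <code>sleec.load_data</code> may work differently.

```python
def parse_xc_text(text):
    """Parse sparse multi-label text (assumed layout) into per-sample records."""
    lines = text.strip().splitlines()
    # Header: number of points, number of features, number of labels.
    num_points, num_feats, num_labels = map(int, lines[0].split())
    samples = []
    for line in lines[1:1 + num_points]:
        # Labels come first; a sample without labels starts with a space.
        labels_part, _, feats_part = line.partition(" ")
        labels = [int(tok) for tok in labels_part.split(",") if tok]
        feats = {}
        for pair in feats_part.split():
            idx, val = pair.split(":")
            feats[int(idx)] = float(val)
        samples.append((feats, labels))
    return samples, (num_points, num_feats, num_labels)


toy_text = "3 5 4\n0,2 1:0.5 3:1.0\n1 0:2.0\n 2:1.5 4:0.5\n"
toy_samples, toy_spec = parse_xc_text(toy_text)
# Average number of non-zero features and of labels per sample.
toy_avg_feats = sum(len(f) for f, _ in toy_samples) / len(toy_samples)
toy_avg_labels = sum(len(l) for _, l in toy_samples) / len(toy_samples)
print(toy_spec, toy_avg_feats, toy_avg_labels)
```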

In [3]:
dataset_name = 'Bibtex'
data_dir = 'data/' + dataset_name + '/' + dataset_name + '_data.txt'

In [4]:
data, spec_tup = sleec.load_file(data_dir, dataset_name)
feats_labels_list, avg_feats, avg_labels = sleec.load_data(data)
print(avg_feats, avg_labels)

Dataset=Bibtex, num_ents=7395, num_feats=1836, num_labels=159
68.66071670047329 2.401893171061528


### Train/dev/test splits
Split the data into train, dev, and test sets with a ratio of 60:20:20.

In [5]:
X, Y = sleec.gen_data(feats_labels_list, spec_tup)
X_train, X_test, Y_train, Y_test = sleec.split_data(X, Y, test_size=0.4)
X_dev, X_test, Y_dev, Y_test = sleec.split_data(X_test, Y_test, test_size=0.5)
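The two-stage split works because holding out 40% first and then halving the held-out part leaves 60% for training and 20% each for dev and test. The sketch below reproduces that arithmetic with scikit-learn's `train_test_split` on toy arrays. Whether <code>sleec.split_data</code> wraps this function is an assumption, and the `*_demo` names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, so the 60:20:20 ratio is easy to read off.
X_demo = np.arange(200).reshape(100, 2)
Y_demo = np.arange(100).reshape(100, 1)

# Stage 1: hold out 40% of the samples.
X_tr, X_rest, Y_tr, Y_rest = train_test_split(
    X_demo, Y_demo, test_size=0.4, random_state=0)
# Stage 2: split the held-out 40% evenly into dev and test.
X_dv, X_te, Y_dv, Y_te = train_test_split(
    X_rest, Y_rest, test_size=0.5, random_state=0)

print(len(X_tr), len(X_dv), len(X_te))
```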

### Find *k*-nearest neighbors index $\Omega$
- $(i,j) \in \Omega \iff j \in \mathcal{N}_i$, where $\mathcal{N}_i = \arg\max_{S,\ |S| \leq \alpha \cdot n}\sum_{j \in S}\mathbf{y}_i^T \mathbf{y}_j$.<br>
- $k=\alpha\cdot n$ means that the number of nearest neighbors is proportional to the size of the training data. The proportion $\alpha$ varies across datasets.<br>
    - The numbers of neighbors preserved here differ from those used in the original paper.
- Note that this *k*-nn is based on training label data, i.e., <code>Y_train</code>.<br>
- Implemented by [sklearn.neighbors.NearestNeighbors](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html).<br>
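To make the definition of $\mathcal{N}_i$ concrete, the sketch below picks, for each training sample, the $k=\lfloor\alpha\cdot n\rfloor$ samples with the largest label inner product. It uses a dense NumPy computation rather than `NearestNeighbors`, and `label_knn` is a made-up name, so treat it as an illustration of the selection rule and not as the behavior of <code>sleec.knn</code> or <code>sleec.mydist</code>.

```python
import numpy as np


def label_knn(Y, alpha):
    """For each row of Y, return the k = int(alpha * n) rows with the largest inner product."""
    n = Y.shape[0]
    k = max(1, int(alpha * n))
    G = Y @ Y.T  # G[i, j] = y_i^T y_j, the label-space similarity
    # Sort by decreasing similarity; a stable sort breaks ties by index.
    order = np.argsort(-G, axis=1, kind="stable")
    indices = order[:, :k]
    similarities = np.take_along_axis(G, indices, axis=1)
    return similarities, indices


Y_toy = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
sims_toy, idx_toy = label_knn(Y_toy, alpha=0.5)
print(idx_toy)
```

Note that each sample can appear among its own neighbors, because $\mathbf{y}_i^T\mathbf{y}_i$ is usually the largest inner product in its row.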

In [6]:
alpha=0.3

In [7]:
distances, indices = sleec.knn(Y_train, sleec.mydist, alpha=alpha)

k: 1331


### Singular Value Projection (SVP)
Learn label vector embedding <code>Z</code>:
- Learn the embedded label vectors from the top eigenvalues and eigenvectors (a truncated eigendecomposition).
    - This step is slow; the run below used a Google Cloud n1-standard-16 instance (16 vCPUs, 60 GB memory).
    - The observation matrix is $\mathbf{G}=\mathbf{Y}\mathbf{Y}^T \in \mathbb{R}^{n\times n}$, where $n$ is the number of training samples.
    - Because $\mathbf{G}$ grows quadratically with $n$, the training samples need to be pre-clustered.
        - In the paper, the number of clusters is chosen empirically as $\lfloor\frac{N_\text{Train}}{6000}\rfloor$, which balances accuracy and computational efficiency.
    - Choice of embedding size: 100 for small datasets, 50 for large datasets.
    - Ideally, the SVP algorithm should run for a high percentage of <code>max_iteration</code>. Early convergence with large differences between iterations may well indicate an improper choice of the learning rate <code>eta</code>.
        - For a smaller choice of $\alpha$ in __*k*-nn__ above, the learning rate should be smaller as well.
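The bullets above can be illustrated with a small dense sketch of singular value projection: take a gradient step on the observed entries of $\mathbf{G}$, then project back onto rank-limited positive semidefinite matrices with a truncated eigendecomposition. The function `svp_sketch`, its symmetrization step, and its stopping rule are assumptions made for this example; <code>sleec.SVP</code> may differ in all of them.

```python
import numpy as np


def svp_sketch(G, mask, rank, eta=0.2, max_iteration=200, tol=1e-6):
    """Rank-limited PSD approximation of G on the entries where mask is True."""
    n = G.shape[0]
    M = np.zeros((n, n))
    U, lmd = np.zeros((n, rank)), np.zeros(rank)
    prev_loss = np.inf
    for _ in range(max_iteration):
        # Gradient step restricted to the observed entries (the set Omega).
        residual = np.where(mask, G - M, 0.0)
        M_hat = M + eta * residual
        # Assumption: symmetrize, since a k-nn mask need not be symmetric.
        M_hat = (M_hat + M_hat.T) / 2.0
        # Projection: keep the top `rank` eigenpairs, clipping negative eigenvalues.
        w, V = np.linalg.eigh(M_hat)
        top = np.argsort(w)[::-1][:rank]
        lmd = np.clip(w[top], 0.0, None)
        U = V[:, top]
        M = (U * lmd) @ U.T
        loss = np.linalg.norm(np.where(mask, G - M, 0.0))
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return U, lmd


Y_small = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
G_small = Y_small @ Y_small.T
full_mask = np.ones_like(G_small, dtype=bool)
U_small, lmd_small = svp_sketch(G_small, full_mask, rank=2, eta=1.0)
# One possible embedding: scale each eigenvector by the root of its eigenvalue.
Z_small = U_small * np.sqrt(lmd_small)
```

With every entry observed and a step size of 1, the first projection already recovers the rank-2 matrix, which makes the toy case easy to check. With a sparse k-nn mask and a smaller <code>eta</code>, many more iterations are needed.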

In [8]:
U, Lmd = sleec.SVP(Y_train, 100, indices, eta=0.2, max_iteration=200)

iter: 0 ; loss: 447.0109468896578
iter: 1 ; loss: 360.46257774330763
iter: 2 ; loss: 291.67946136725305
iter: 3 ; loss: 237.19610930224286
iter: 4 ; loss: 194.2460010924912
iter: 5 ; loss: 160.6152027492054
iter: 6 ; loss: 134.52097565715118
iter: 7 ; loss: 114.51085968979237
iter: 8 ; loss: 99.3812571121566
iter: 9 ; loss: 88.11754972583913
iter: 10 ; loss: 79.85754923717768
iter: 11 ; loss: 73.87519075130191
iter: 12 ; loss: 69.57515910825988
iter: 13 ; loss: 66.4875471893749
iter: 14 ; loss: 64.25626458649006
iter: 15 ; loss: 62.621289275583685
iter: 16 ; loss: 61.39837244650075
iter: 17 ; loss: 60.45984077719891
iter: 18 ; loss: 59.718564839681406
iter: 19 ; loss: 59.115635221360606
iter: 20 ; loss: 58.611396321594555
iter: 21 ; loss: 58.17915640145919
iter: 22 ; loss: 57.80087934133445
iter: 23 ; loss: 57.46427927492752
iter: 24 ; loss: 57.16088085614975
iter: 25 ; loss: 56.884732628318424
iter: 26 ; loss: 56.63155732278845
iter: 27 ; loss: 56.39819248947103
iter: 28 ; loss: 56.18

In [9]:
temp_filename = 'temp_data/' + dataset_name + '_SVP_' + str(alpha)

In [10]:
sleec.save_temp(temp_filename, U, Lmd, X_train, Y_train, X_dev, Y_dev, X_test, Y_test)

In [11]:
U, Lmd, X_train, Y_train, X_dev, Y_dev, X_test, Y_test = sleec.load_temp(temp_filename)

### ADMM
Optimizing $\min_{\mathbf{V}\in \mathbb{R}^{\hat{L} \times n}}\Vert\mathbf{Z} - \mathbf{XV}\Vert^2_F + \lambda \Vert\mathbf{V}\Vert^2_F + \mu \Vert\mathbf{XV}\Vert_1$.
- Learn regressors <code>V</code>. <br>
- Ideally, the ADMM algorithm should run for a high percentage of <code>max_iteration</code>. Early convergence with large differences between iterations may well indicate an improper choice of parameters.
    - <code>rho</code>: learning step
    - <code>lmd</code>: $l_2$ prior regularization parameter of $\mathbf{V}$
    - <code>mu</code>: $l_1$ prior regularization parameter of $\mathbf{XV}$ (should be small since $\mathbf{XV}$ could be relatively large)
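One common way to apply ADMM to this objective is to introduce an auxiliary variable $\mathbf{W}=\mathbf{XV}$, so that the $l_1$ term becomes a soft-thresholding step and the update of $\mathbf{V}$ becomes a linear solve. The sketch below follows that scaled-dual scheme. It assumes $\mathbf{X}$ has shape $n\times d$ and $\mathbf{V}$ has shape $d\times\hat{L}$ so that $\mathbf{XV}$ is defined, and it treats <code>rho</code> as the augmented-Lagrangian penalty. The names `admm_sketch` and `soft_threshold` are made up here, and none of this is taken from the <code>sleec</code> source.

```python
import numpy as np


def soft_threshold(A, tau):
    """Elementwise proximal operator of tau * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)


def admm_sketch(X, Z, lmd=0.1, mu=0.01, rho=1.0, max_iteration=200, tol=1e-8):
    """Minimize ||Z - XV||_F^2 + lmd * ||V||_F^2 + mu * ||XV||_1 over V."""
    d = X.shape[1]
    XtX = X.T @ X
    XtZ = X.T @ Z
    # Normal-equation matrix of the V-update; it stays fixed across iterations.
    A = (2.0 + rho) * XtX + 2.0 * lmd * np.eye(d)
    W = np.zeros_like(Z, dtype=float)    # auxiliary variable standing in for XV
    Lam = np.zeros_like(Z, dtype=float)  # scaled dual variable
    V = np.zeros((d, Z.shape[1]))
    for _ in range(max_iteration):
        V_new = np.linalg.solve(A, 2.0 * XtZ + rho * X.T @ (W - Lam))
        XV = X @ V_new
        W = soft_threshold(XV + Lam, mu / rho)
        Lam = Lam + XV - W
        converged = np.linalg.norm(V_new - V) < tol
        V = V_new
        if converged:
            break
    return V


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 5))
Z_toy = rng.normal(size=(30, 3))
# With mu = 0 the problem reduces to ridge regression, which has a closed form.
V_admm = admm_sketch(X_toy, Z_toy, lmd=0.5, mu=0.0)
V_ridge = np.linalg.solve(X_toy.T @ X_toy + 0.5 * np.eye(5), X_toy.T @ Z_toy)
```

Comparing against the ridge solution with <code>mu=0</code> is a cheap sanity check before turning the $l_1$ term on.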

In [12]:
Z = sleec.get_init_Z(U, Lmd) # Z is used for ADMM

### Evaluation metrics
- Precision at k (P@K)
- Hamming loss, Jaccard similarity
- Precision, recall, and F1 curves over the number of important labels chosen.
- See function <code>eval_all()</code> in [evaluation.py](evaluation.py) for details.
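As a reference for the first two metrics, here is a small NumPy sketch using the usual definitions: P@k is the fraction of the k highest-scored labels that are relevant, averaged over samples, and Hamming loss is the fraction of label entries predicted incorrectly. The function names are illustrative, and <code>eval_all()</code> may compute these differently.

```python
import numpy as np


def precision_at_k(scores, Y_true, k):
    """Mean fraction of the k top-scored labels that are relevant."""
    top_k = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    hits = np.take_along_axis(Y_true, top_k, axis=1)
    return float(hits.sum(axis=1).mean() / k)


def hamming_loss(Y_pred, Y_true):
    """Fraction of label entries where the prediction and the truth disagree."""
    return float(np.mean(Y_pred != Y_true))


scores_toy = np.array([[0.9, 0.1, 0.8],
                       [0.2, 0.7, 0.1]])
Y_true_toy = np.array([[1, 0, 0],
                       [0, 1, 1]])
Y_pred_toy = np.array([[1, 0, 0],
                       [0, 1, 0]])
p_at_1 = precision_at_k(scores_toy, Y_true_toy, 1)
p_at_2 = precision_at_k(scores_toy, Y_true_toy, 2)
h_loss = hamming_loss(Y_pred_toy, Y_true_toy)
```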

### Experiments with different parameters
This step tunes the model on the <code>dev</code> set and evaluates it on the <code>test</code> set. <br>
Parameters include:
- The $l_2$ regularization $\lambda$ in ADMM; default values = <code>np.arange(0.1, 1.1, 0.1)</code>
- The $l_1$ regularization $\mu$ in ADMM; default values = <code>np.arange(0.01, 0.06, 0.01)</code>
- The number of nearest neighbors $nn$ used to predict labels; default values = <code>[5, 10, 15, 20, 25]</code>
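The tuning loop amounts to a grid search over these three lists, keeping the combination with the best dev-set score. The sketch below shows only that selection logic: `grid_search` and `toy_dev_score` are hypothetical stand-ins, and the toy score replaces the real train-and-evaluate step that <code>sleec.experiment</code> performs.

```python
import itertools

import numpy as np


def grid_search(score_fn, lmd_values, mu_values, nn_values):
    """Return the parameter combination with the highest score."""
    best = None
    for lmd, mu, nn in itertools.product(lmd_values, mu_values, nn_values):
        score = score_fn(lmd, mu, nn)
        if best is None or score > best["score"]:
            best = {"lmd": float(lmd), "mu": float(mu),
                    "nn": int(nn), "score": float(score)}
    return best


def toy_dev_score(lmd, mu, nn):
    """Hypothetical dev score that peaks at lmd=0.5, mu=0.03, nn=15."""
    return -abs(lmd - 0.5) - abs(mu - 0.03) - abs(nn - 15) / 100.0


best_toy = grid_search(toy_dev_score,
                       np.arange(0.1, 1.1, 0.1),
                       np.arange(0.01, 0.06, 0.01),
                       [5, 10, 15, 20, 25])
print(best_toy)
```

In a real run the score function would train the regressors for the given `lmd` and `mu`, predict with `nn` neighbors, and return a dev-set metric such as P@1; the test set is touched only once, for the selected combination.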

In [11]:
specs = sleec.experiment(X_train, Y_train, X_dev, Y_dev, X_test, Y_test, Z,
                     k_list=[1, 3, 5])


Specs: lmd=0.1, mu=0.01
#### nn: 5
P@1: 0.6267748478701826
P@3: 0.3786342123056109
P@5: 0.26409736308316095
Hamming_loss: 0.03384489775090216
Jaccard similarity: 0.9661551022490976
P@1: 0.6139283299526708
P@3: 0.34798287130944117
P@5: 0.24759972954698747
Hamming_loss: 0.03421060464958005
Jaccard similarity: 0.9657893953504205
#### nn: 10
P@1: 0.6348884381338742
P@3: 0.3869731800766276
P@5: 0.2757268424611188
Hamming_loss: 0.04999978737970988
Jaccard similarity: 0.9500002126202887
P@1: 0.623394185260311
P@3: 0.36646382691007284
P@5: 0.2578769438809969
Hamming_loss: 0.05096083109018908
Jaccard similarity: 0.9490391689098094
#### nn: 15
P@1: 0.6315077755240027
P@3: 0.3896777101645253
P@5: 0.28100067613251833
Hamming_loss: 0.06433889973252381
Jaccard similarity: 0.9356611002674765
P@1: 0.6220419202163624
P@3: 0.36984448951994475
P@5: 0.2630155510480016
Hamming_loss: 0.06499377022550518
Jaccard similarity: 0.9350062297744952
#### nn: 20
P@1: 0.6281271129141311
P@3: 0.39057922019382457
P@5:

In [16]:
import analysis
spec_fn = 'results/spec_list_sleec_' + dataset_name + '_'+ str(alpha) + '.txt'
analysis.save_specs(spec_fn, specs)