## Explore the vector space

We have built a Word2Vec model from the adjacency matrix of document similarities for a set of about 7000 NeurIPS papers published from 1987 to 2017. Here we use the gensim API to explore the resulting vector space.

In [1]:
import gensim
import os

In [2]:
DATA_DIR = "../data"
MODEL_FILE = os.path.join(DATA_DIR, "w2v-nips-papers.model")
ID_TITLE_MAPPING_FILE = os.path.join(DATA_DIR, "NeurIPS_id-title.csv")

### Mapping from paper ID to Title (Year)

Note that the paper IDs provided in the Kaggle dataset are 1-based, whereas the paper IDs used in our model are 0-based (each one is just the column number in the similarity matrix), so we need to subtract 1 when building the mapping.

In [3]:
pid2title = {}
# the context manager closes the file even if a line fails to parse
with open(ID_TITLE_MAPPING_FILE, "r") as fitm:
    for line in fitm:
        pid, title = line.strip().split('\t')
        pid = int(pid) - 1      # pids provided are 1-based
        pid2title[pid] = title
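
The 1-based to 0-based correction can be checked in isolation. The following is a minimal sketch, not the notebook's own loader: `load_pid2title` and the two sample rows are hypothetical, and the rows are assumed to have the same `<pid><TAB><title>` layout as the mapping file.

```python
def load_pid2title(lines, sep="\t"):
    """Build {0-based pid: title} from lines of '<1-based pid><sep><title>'."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # split only on the first separator so a title containing one stays whole
        pid, title = line.split(sep, 1)
        mapping[int(pid) - 1] = title  # shift to the 0-based ids used by the model
    return mapping

# hypothetical rows, made up for illustration only
sample_rows = [
    "1\tFirst hypothetical title (1987)\n",
    "2\tSecond hypothetical title (1988)\n",
]
sample_mapping = load_pid2title(sample_rows)
print(sample_mapping)
```

The first row in the file ends up under key 0, which matches the column index of the similarity matrix.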

### Load Model

In [4]:
model = gensim.models.Word2Vec.load(MODEL_FILE).wv

### Show papers in Word2Vec space

In [5]:
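# elements of the embedding (keys are paper ids stored as strings)
# note: `vocab` is the gensim 3.x attribute; gensim 4+ replaced it with `key_to_index`
pids = model.vocab.keys()
sample_pids = [int(p) for p in pids][:10]
for pid in sample_pids:
    print("{:4d} {:s}".format(pid, pid2title[pid]))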

   0 Self-Organization of Associative Database and Its Applications (1987)
4255 Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance (2011)
4686 Large Scale Distributed Deep Networks (2012)
2868 Beyond Gaussian Processes: On the Distributions of Infinite Networks (2005)
2240 Approximate Linear Programming for Average-Cost Dynamic Programming (2002)
7077 Hash Embeddings for Efficient Word Representations (2017)
3287 Density Estimation under Independent Similarly Distributed Sampling Assumptions (2007)
5897 Efficient Output Kernel Learning for Multiple Tasks (2015)
4009 Learning Efficient Markov Networks (2010)
5622 Spike Frequency Adaptation Implements Anticipative Tracking in Continuous Attractor Neural Networks (2014)


### Find papers similar to given paper

Here we look up the papers most similar to a given paper, in this case:

    4429 A concave regularization technique for sparse mixture models (2011)


In [6]:
# list the papers most similar to the source paper (most_similar returns the top 10 by default)
source_paper_id = 4429
for rpid, score in model.most_similar(str(source_paper_id)):
    print("{:.3f} {:s} {:s}".format(score, rpid, pid2title[int(rpid)]))

0.718 2550 An Auditory Paradigm for Brain-Computer Interfaces (2004)
0.683 5014 Similarity Component Analysis (2013)
0.670 3878 Statistical Consistency of Top-k Ranking (2009)
0.660 2785 Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games (2005)
0.646 3084 Blind Motion Deblurring Using Image Statistics (2006)
0.643 5231 Dependent nonparametric trees for dynamic hierarchical clustering (2014)
0.636 4276 Trace Lasso: a trace norm regularization for correlated designs (2011)
0.629 2813 Cue Integration for Figure/Ground Labeling (2005)
0.624 1774 Reinforcement Learning for Spoken Dialogue Systems (1999)
0.624 3800 Sparse and Locally Constant Gaussian Graphical Models (2009)
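
The scores above are cosine similarities between the source paper's vector and every other vector in the model. As a rough illustration of that ranking, here is a minimal NumPy sketch on a made-up 2-D matrix; `cosine_top_k` and `toy` are hypothetical names, and this is not gensim's actual implementation.

```python
import numpy as np

def cosine_top_k(vectors, query_index, k=3):
    """Return (row index, cosine score) for the k rows most similar to the query row."""
    # scale every row to unit length so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # never return the query itself
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

# made-up vectors, for illustration only
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
top_two = cosine_top_k(toy, query_index=0, k=2)
print(top_two)
```

Row 1 points in nearly the same direction as row 0 and ranks first; row 2 is orthogonal to it and scores 0.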


### Arithmetic in Paper Vector Space

As with word analogies, we can form paper analogies using vector arithmetic. Consider the following papers.

    A: 1180    Representing Face Images for Emotion Classification (1996)
    B: 5209    Transfer Learning in a Transductive Setting (2013)
    C: 3488    Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization (2008)

We want to find a paper D that satisfies the analogy:

    A : B :: C : D
    
This translates to the following vector arithmetic.

    D = B - A + C
    

In [7]:
# D = B - A + C; keep only the id of the top match
paper_d = model.most_similar(
    positive=["5209", "3488"],
    negative=["1180"],
    topn=1)[0][0]

print("{:s} {:s}".format(paper_d, pid2title[int(paper_d)]))

5359 Content-based recommendations with Poisson factorization (2014)
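
The arithmetic `D = B - A + C` can be sketched directly in NumPy. This is a simplified illustration on made-up 2-D vectors, assuming unit-normalized inputs and cosine ranking with the three input papers excluded, which is roughly how `most_similar` treats `positive` and `negative` keys. The `analogy` function and `toy` matrix are hypothetical.

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Return the row index closest (by cosine) to b - a + c, excluding a, b and c."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    target = unit[b] - unit[a] + unit[c]
    target = target / np.linalg.norm(target)
    scores = unit @ target
    scores[[a, b, c]] = -np.inf  # the inputs themselves are not valid answers
    return int(np.argmax(scores))

# made-up vectors, for illustration only: rows 0, 1, 2 play the roles of A, B, C
toy = np.array([
    [1.0, 0.0],    # A
    [1.0, 1.0],    # B
    [0.0, 1.0],    # C
    [-0.2, 1.0],   # candidate near B - A + C
    [1.0, -1.0],   # candidate pointing away from it
])
best = analogy(toy, a=0, b=1, c=2)
print(best)
```

Excluding the input rows matters: without it, the nearest neighbour of the combined vector is often one of the inputs themselves.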


### Paper vector shape

In [8]:
vec = model[str(source_paper_id)]
print(vec)
print(vec.shape)

[ 0.13583909 -0.02244393  0.3873859   0.29312187 -0.16611952  0.0433773
 -0.2423136   0.06728296  0.25277796  0.05704262  0.12595896  0.25350544
 -0.08501874 -0.00789746  0.41689166  0.11127666  0.02860924  0.083965
 -0.09773003 -0.3235104  -0.00899219  0.4631019   0.21359956 -0.17719519
 -0.1465802  -0.12468671 -0.11025383 -0.1728496   0.44396612 -0.11851183
  0.11029281 -0.03216375 -0.00953399  0.3021255  -0.119446    0.18065514
 -0.05294707  0.05189841 -0.03704946 -0.11981184 -0.05654713 -0.1881413
 -0.05596744  0.19265322 -0.05588282 -0.01161718  0.25358528  0.06337409
 -0.00299477  0.06828507  0.10308459  0.28453144 -0.02787677  0.12925784
 -0.07739409  0.06806321 -0.2513676  -0.12105264  0.17838612 -0.01458598
  0.18330823 -0.05671765  0.09071547 -0.05309341 -0.1006922   0.08662736
  0.02447449 -0.17677718  0.02643329 -0.27073658 -0.16924134 -0.2692922
  0.06955224  0.15653731 -0.1708236  -0.08607574 -0.16008934  0.12662053
  0.29479867 -0.36931068 -0.23695946  0.10333052 -0.0584