# Recommendations via Dimensionality Reduction

All the content discovery approaches explored in the previous notebooks can be used for content recommendation. Here we explore yet another approach: instead of taking a single article as input, we consider the case where we know that a user has read a set of articles and is looking for recommendations on what to read next.

Since we have already extracted the authors, orgs, and keywords for each article, we can construct bipartite graphs between authors and articles, orgs and articles, and keywords and articles, which give us the basis for a recommender.

In [1]:
from sklearn.decomposition import NMF
import joblib
import json
import numpy as np
import os
import requests
import urllib.parse

In [2]:
DATA_DIR = "../data"
MODEL_DIR = "../models"

SOLR_URL = "http://localhost:8983/solr/nips2index"
FEATURES_DUMP_FILE = os.path.join(DATA_DIR, "comb-features.tsv")

NMF_MODEL_FILE = os.path.join(MODEL_DIR, "recommender-nmf.pkl")

PAPERS_METADATA = os.path.join(DATA_DIR, "papers_metadata.tsv")

## Extract features from index

In [3]:
query_string = "*:*"
field_list = "id,keywords,authors,orgs"
cursor_mark = "*"
num_docs, num_keywords = 0, 0
doc_keyword_pairs = []

fdump = open(FEATURES_DUMP_FILE, "w")
all_keywords, all_authors, all_orgs = set(), set(), set()

while True:
    if num_docs % 1000 == 0:
        print("{:d} documents ({:d} keywords, {:d} authors, {:d} orgs) retrieved"
              .format(num_docs, len(all_keywords), len(all_authors), len(all_orgs)))
    payload = {
        "q": query_string,
        "fl": field_list,
        "sort": "id asc",
        "rows": 100,
        "cursorMark": cursor_mark
    }
    params = urllib.parse.urlencode(payload, quote_via=urllib.parse.quote_plus)
    search_url = SOLR_URL + "/select?" + params
    resp = requests.get(search_url)
    resp_json = json.loads(resp.text)
    docs = resp_json["response"]["docs"]
    
    docs_retrieved = 0
    for doc in docs:
        doc_id = int(doc["id"])
        keywords, authors, orgs = ["NA"], ["NA"], ["NA"]
        if "keywords" in doc.keys():
            keywords = doc["keywords"]
            all_keywords.update(keywords)
        if "authors" in doc.keys():
            authors = doc["authors"]
            all_authors.update(authors)
        if "orgs" in doc.keys():
            orgs = doc["orgs"]
            all_orgs.update(orgs)
        fdump.write("{:d}\t{:s}\t{:s}\t{:s}\n"
                    .format(doc_id, "|".join(keywords), "|".join(authors), 
                            "|".join(orgs)))
        num_docs += 1
        docs_retrieved += 1
    if docs_retrieved == 0:
        break

    # for next batch of ${rows} rows
    cursor_mark = resp_json["nextCursorMark"]

print("{:d} documents ({:d} keywords, {:d} authors, {:d} orgs) retrieved, COMPLETE"
      .format(num_docs, len(all_keywords), len(all_authors), len(all_orgs)))
fdump.close()

0 documents (0 keywords, 0 authors, 0 orgs) retrieved
1000 documents (1628 keywords, 1347 authors, 159 orgs) retrieved
2000 documents (1756 keywords, 2601 authors, 214 orgs) retrieved
3000 documents (1814 keywords, 3948 authors, 269 orgs) retrieved
4000 documents (1833 keywords, 5210 authors, 311 orgs) retrieved
5000 documents (1842 keywords, 6537 authors, 350 orgs) retrieved
6000 documents (1847 keywords, 7983 authors, 385 orgs) retrieved
7000 documents (1847 keywords, 9517 authors, 420 orgs) retrieved
7238 documents (1847 keywords, 9719 authors, 426 orgs) retrieved, COMPLETE


## Build sparse feature vector for documents

The feature vector for each document is a sparse binary vector of size 11992 (1847 keywords + 9719 authors + 426 orgs). An entry is 1 if the corresponding item occurs in the document and 0 otherwise.
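
As a minimal sketch of this encoding, the toy example below builds one such vector from two small vocabularies. The vocabularies, the document, and the helper name `multi_hot` are hypothetical and only illustrate the idea; the notebook's own cells below do the same thing at full scale.

```python
import numpy as np

# Hypothetical toy vocabularies (item -> column index), for illustration only.
keyword_vocab = {"neural network": 0, "kernel": 1, "speech": 2}
author_vocab = {"A. Author": 0, "B. Author": 1}

def multi_hot(items, item2idx):
    """Return a 0/1 vector with a 1 at the index of every item present."""
    vec = np.zeros(len(item2idx))
    for item in items:
        vec[item2idx[item]] = 1
    return vec

# A hypothetical document with two keywords and one author.
doc_keywords = ["kernel", "speech"]
doc_authors = ["B. Author"]

# Concatenate the per-type blocks: keywords first, then authors.
doc_vec = np.concatenate([multi_hot(doc_keywords, keyword_vocab),
                          multi_hot(doc_authors, author_vocab)])
print(doc_vec)  # [0. 1. 1. 0. 1.]
```

Each feature type gets its own block of columns, so the same string appearing as both a keyword and an org would still map to two distinct positions.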

In [4]:
def build_lookup_table(item_set):
    item2idx = {}
    for idx, item in enumerate(item_set):
        item2idx[item] = idx
    return item2idx

keyword2idx = build_lookup_table(all_keywords)
author2idx = build_lookup_table(all_authors)
org2idx = build_lookup_table(all_orgs)
print(len(keyword2idx), len(author2idx), len(org2idx))

1847 9719 426


In [5]:
def build_feature_vector(items, item2idx):
    vec = np.zeros((len(item2idx)))
    if items == "NA":
        return vec
    for item in items.split("|"):
        idx = item2idx[item]
        vec[idx] = 1
    return vec


Xk = np.zeros((num_docs, len(keyword2idx)))
Xa = np.zeros((num_docs, len(author2idx)))
Xo = np.zeros((num_docs, len(org2idx)))

fdump = open(FEATURES_DUMP_FILE, "r")
for line in fdump:
    doc_id, keywords, authors, orgs = line.strip().split("\t")
    doc_id = int(doc_id)
    Xk[doc_id] = build_feature_vector(keywords, keyword2idx)
    Xa[doc_id] = build_feature_vector(authors, author2idx)
    Xo[doc_id] = build_feature_vector(orgs, org2idx)
fdump.close()    

X = np.concatenate((Xk, Xa, Xo), axis=1)
print(X.shape)
print(X)

(7238, 11992)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## Reduce dimensionality

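We reduce each sparse feature vector to a lower-dimensional dense vector, which effectively maps the original vector into a new "taste" vector space. Topic modeling has the same effect. We use non-negative matrix factorization (NMF), which approximates the document-feature matrix X as the product of two non-negative matrices: W (documents by components) and H (components by features).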
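
To make the shapes concrete, here is a small sketch that factorizes a hypothetical 4x6 binary matrix into two components. The matrix `X_toy` and the choice of `n_components=2` and `max_iter=1000` are assumptions made for illustration; the notebook itself uses 150 components on the full matrix.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical binary document-feature matrix: 4 documents, 6 features.
X_toy = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
], dtype=float)

# Factorize X_toy ~ W_toy @ H_toy with non-negative factors.
toy_model = NMF(n_components=2, init="random", solver="cd",
                random_state=42, max_iter=1000)
W_toy = toy_model.fit_transform(X_toy)   # dense document vectors
H_toy = toy_model.components_            # component-to-feature loadings

print(W_toy.shape, H_toy.shape)          # (4, 2) (2, 6)
# Reconstruction error (Frobenius norm) of the low-rank approximation.
print(np.linalg.norm(X_toy - W_toy @ H_toy))
```

Each row of `W_toy` is the reduced representation of one document, which is what the similarity computations in the later sections operate on.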

In [6]:
if os.path.exists(NMF_MODEL_FILE):
    print("model already generated, loading")
    model = joblib.load(NMF_MODEL_FILE)
    W = model.transform(X)
    H = model.components_
else:    
    model = NMF(n_components=150, init='random', solver="cd", 
                verbose=True, random_state=42)
    W = model.fit_transform(X)
    H = model.components_
    joblib.dump(model, NMF_MODEL_FILE)
    
print(W.shape, H.shape)

model already generated, loading
violation: 1.0
violation: 0.2411207712867099
violation: 0.0225518954481444
violation: 0.00395945567371017
violation: 0.0004979448419219516
violation: 8.176770536033433e-05
Converged at iteration 6
(7238, 150) (150, 11992)


## Similar Documents

In [7]:
sim = np.matmul(W, np.transpose(W))
print(sim.shape)

(7238, 7238)


In [8]:
def similar_docs(filename, sim, topn):
    doc_id = int(filename.split(".")[0])
    row = sim[doc_id, :]
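    # sort by descending similarity; the source document itself may appear
    # in this list and is skipped by the caller when printing
    target_docs = np.argsort(-row)[0:topn].tolist()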
    scores = row[target_docs].tolist()
    target_filenames = ["{:d}.txt".format(x) for x in target_docs]
    return target_filenames, scores
    

filename2title = {}
with open(PAPERS_METADATA, "r") as f:
    for line in f:
        if line.startswith("#"):
            continue
        cols = line.strip().split("\t")
        filename2title["{:s}.txt".format(cols[0])] = cols[2]

source_filename = "1032.txt"
top_n = 10
target_filenames, scores = similar_docs(source_filename, sim, top_n)
print("Source: {:s}".format(filename2title[source_filename]))
print("--- top {:d} similar docs ---".format(top_n))
for target_filename, score in zip(target_filenames, scores):
    if target_filename == source_filename:
        continue
    print("({:.5f}) {:s}".format(score, filename2title[target_filename]))

Source: Forward-backward retraining of recurrent neural networks
--- top 10 similar docs ---
(0.05010) Context-Dependent Multiple Distribution Phonetic Modeling with MLPs
(0.04715) Is Learning The n-th Thing Any Easier Than Learning The First?
(0.04123) Learning Statistically Neutral Tasks without Expert Guidance
(0.04110) Combining Visual and Acoustic Speech Signals with a Neural Network Improves Intelligibility
(0.04087) The Ni1000: High Speed Parallel VLSI for Implementing Multilayer Perceptrons
(0.04038) Subset Selection and Summarization in Sequential Data
(0.04003) Back Propagation is Sensitive to Initial Conditions
(0.03939) Semi-Supervised Multitask Learning
(0.03862) SoundNet: Learning Sound Representations from Unlabeled Video


## Suggesting Documents based on Read Collection

We consider an arbitrary set of documents that we know a user has read, liked, or otherwise marked, and we want to recommend other documents that they may like.

To do this, we compute the average of the sparse feature vectors of these documents, convert it to a dense feature vector using the NMF model, and then find the documents whose dense vectors are most similar to it.
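
The steps above can be sketched on toy data as follows. The random matrix `X_toy`, the ids in `read_ids`, and the step that masks out already-read documents are assumptions added for illustration; the notebook cell below does not exclude the source collection from its results.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical binary document-feature matrix: 20 documents, 12 features.
rng = np.random.default_rng(0)
X_toy = (rng.random((20, 12)) < 0.3).astype(float)

toy_model = NMF(n_components=3, init="random", solver="cd",
                random_state=42, max_iter=1000)
W_toy = toy_model.fit_transform(X_toy)

# Hypothetical ids of documents the user has read.
read_ids = [2, 5, 11]

# 1. Average the sparse feature vectors of the read documents.
mean_vec = X_toy[read_ids].mean(axis=0, keepdims=True)   # shape (1, 12)
# 2. Project the average into the dense space learned by NMF.
y_toy = toy_model.transform(mean_vec)                    # shape (1, 3)
# 3. Score every document by its dot product with the projected vector.
doc_scores = (W_toy @ y_toy.T).ravel()
# Optional extra step (not in the notebook): hide already-read documents.
doc_scores[read_ids] = -np.inf
top_ids = np.argsort(-doc_scores)[:5]
print(top_ids)
```

Using `X_toy.shape[1]`-sized vectors throughout, rather than a hard-coded width, keeps the sketch valid if the vocabulary size changes.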

In [9]:
collection_size = np.random.randint(3, high=10, size=1)[0]
# high is exclusive, so this draws valid row indices 0 .. num_docs-1
collection_ids = np.random.randint(0, high=num_docs, size=collection_size)

feat_vec = np.zeros((1, X.shape[1]))
for collection_id in collection_ids:
    feat_vec += X[collection_id, :]
feat_vec /= collection_size
y = model.transform(feat_vec)
doc_sims = np.matmul(W, np.transpose(y)).squeeze(axis=1)
target_ids = np.argsort(-doc_sims)[0:top_n]
scores = doc_sims[target_ids]

print("--- Source collection ---")
for collection_id in collection_ids:
    print("{:s}".format(filename2title["{:d}.txt".format(collection_id)]))
print("--- Recommendations ---")
for target_id, score in zip(target_ids, scores):
    print("({:.5f}) {:s}".format(score, filename2title["{:d}.txt".format(target_id)]))

violation: 1.0
violation: 0.23129634545431624
violation: 0.03209572604136983
violation: 0.007400997221153011
violation: 0.0012999049199094925
violation: 0.0001959522250959198
violation: 4.179248920879007e-05
Converged at iteration 7
--- Source collection ---
A Generic Approach for Identification of Event Related Brain Potentials via a Competitive Neural Network Structure
Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
Learning Trajectory Preferences for  Manipulators via Iterative Improvement
Statistical Modeling of Cell Assemblies Activities in Associative Cortex of Behaving Monkeys
Learning to Traverse Image Manifolds
--- Recommendations ---
(0.06628) Fast Second Order Stochastic Backpropagation for Variational Inference
(0.06128) Scalable Model Selection for Belief Networks
(0.05793) Large Margin Discriminant Dimensionality Reduction in Prediction Space
(0.05643) Efficient Globally Convergent Stochastic Optimization for Canonical Correlation Analy