# Publication recommendation system

*Lukas Vlcek*

In this notebook, we read a previously downloaded dataset of Wikipedia publication references into a pandas dataframe.
We then build a directed bipartite graph of the relations between publications and Wikipedia pages using the core module of the recommendation system (stored in the project `src` directory). Finally, we test the system on two example publications, evaluate the relevance of the recommended literature, and discuss further possibilities to improve the model.

## 1. Read dataset linking Wikipedia articles with publications

In [1]:
import os
import sys
import numpy as np
import pandas as pd
import networkx as nx
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
%load_ext autoreload
%autoreload 2

### Read Wikipedia references from a TSV file

In [3]:
base_path = '../data/raw'
processed_path = '../data/processed'

In [4]:
# Read the TSV data and parse the 'timestamp' column as datetimes
df = pd.read_csv(os.path.join(base_path, 'enwiki.tsv'), sep='\t', parse_dates=['timestamp'])

# pandas reads the literal page title "NaN" as a missing value; restore it as the string 'NaN'
df.page_title = df.page_title.fillna("NaN")

df.head(5)

| Unnamed: 0 | page_id | page_title | rev_id | timestamp | type | id |
|---|---|---|---|---|---|---|
| 0 | 2867096 | Mu Aquilae | 503137751 | 2012-07-19 16:08:41 | doi | 10.1051/0004-6361:20078357 |
| 1 | 2867096 | Mu Aquilae | 508363722 | 2012-08-20 22:56:21 | arxiv | astro-ph/0604502 |
| 2 | 2867096 | Mu Aquilae | 508363722 | 2012-08-20 22:56:21 | arxiv | astro-ph/0003329 |
| 3 | 2867096 | Mu Aquilae | 508363722 | 2012-08-20 22:56:21 | arxiv | 0708.1752 |
| 4 | 2867096 | Mu Aquilae | 503137751 | 2012-07-19 16:08:41 | doi | 10.1051/0004-6361:20064946 |


Add a reference to a new book (ISBN 9780937381502, cited on the 'Lager' page) to the dataframe.

In [5]:
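# Use a Timestamp so the 'timestamp' column keeps its datetime dtype after the rows are combined
df_new = pd.DataFrame({'page_id': [19555312], 'page_title': ['Lager'], 'rev_id': [0],
                       'timestamp': [pd.Timestamp('2015-02-09 17:01:25')], 'type': ['isbn'], 'id': ['9780937381502']})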

In [6]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([df, df_new], ignore_index=True)

## 2. Create a directed bipartite graph of references from wikipedia pages to publications

**Import the `recomm` module containing functions for reading data from Wikipedia and building the graph-based recommendation system.**

In [98]:
sys.path.append('../src')
from recomm.graph_rank import GraphRank

**Create a GraphRank object - a graph-based model for publication recommendation.**

In [24]:
gr = GraphRank()

In [25]:
gr.build_graph(df, 'page_title', 'page_id', 'type', 'id')
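
The internals of `build_graph` live in the `recomm` module and are not shown in this notebook. As a rough illustration only, a directed bipartite graph of page-to-publication references could be assembled as sketched below. The sketch uses plain dictionaries instead of networkx so that it runs without dependencies. The function name `build_reference_graph` and the duplicated sample row are assumptions made for the example.

```python
from collections import defaultdict

def build_reference_graph(rows):
    """Build a directed bipartite graph from (page_title, pub_type, pub_id) rows.

    Hypothetical helper, not the GraphRank implementation.
    Returns two adjacency maps: page -> publications and publication -> pages.
    """
    page_to_pubs = defaultdict(set)
    pub_to_pages = defaultdict(set)
    for page_title, pub_type, pub_id in rows:
        pub = (pub_type, pub_id)  # publication node, e.g. ('isbn', '9780937381502')
        page_to_pubs[page_title].add(pub)
        pub_to_pages[pub].add(page_title)
    return dict(page_to_pubs), dict(pub_to_pages)

# Illustrative rows in the spirit of the dataframe above
rows = [
    ("Mu Aquilae", "doi", "10.1051/0004-6361:20078357"),
    ("Mu Aquilae", "arxiv", "astro-ph/0604502"),
    ("Lager", "isbn", "9780937381502"),
    ("Lager", "isbn", "9780937381502"),  # a repeated reference collapses into one edge
]
page_to_pubs, pub_to_pages = build_reference_graph(rows)
print(len(page_to_pubs["Mu Aquilae"]))  # 2
```

Keeping both directions of the adjacency makes it cheap to walk from a publication to the pages citing it and from those pages to their other references.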

## 3. Tests of the recommendation system

### 3.1 Publications related to "Designing Great Beers: The Ultimate Guide to Brewing Classic Beer Styles" by Ray Daniels

In [44]:
gr.find_most_relevant(('isbn','9780937381502'), 13)

Original publication: ('isbn', '9780937381502') 
Title: Designing Great Beers: The Ultimate Guide to Brewing Classic ... 


3 pages referring to the publication:
 ['Mash ingredients', 'India pale ale', 'Lager'] 


Number of categories for level 2 publications: 3
Rank: 1 
Citations: 6
ID: ('isbn', '0195367138')
Source: https://books.google.com/books?isbn=0195367138
Title: The Oxford Companion to Beer 

Rank: 2 
Citations: 2
ID: ('isbn', '9781466881952')
Source: https://books.google.com/books?isbn=9781466881952
Title: The Encyclopedia of Beer: The Beer Lover's Bible - A ... 

Rank: 3 
Citations: 2
ID: ('isbn', '9780299188948')
Source: https://books.google.com/books?isbn=9780299188948
Title: The Best Breweries and Brewpubs of Illinois: Searching for ... 

Rank: 4 
Citations: 2
ID: ('isbn', '9780865715561')
Source: https://books.google.com/books?isbn=9780865715561
Title: Fermenting Revolution: How to Drink Beer and Save the World 

Rank: 5 
Citations: 1
ID: ('isbn', '9780937381694')
Source

**Observation:** The system has identified publications highly relevant to the topic. The highest-ranking publications can be considered reference sources, while the lower-ranking publications are generally more specific but still interesting suggestions.
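
How `find_most_relevant` scores candidates is defined in the `recomm` module and is not reproduced here. One simple scheme in the spirit of a two-step walk (from the publication to the pages citing it, then to the other publications those pages cite) is to count co-citing pages. The sketch below is an assumption for illustration: `rank_by_cocitation` and the toy data are hypothetical, and the "Citations" values printed above may be computed differently.

```python
from collections import Counter

def rank_by_cocitation(page_to_pubs, target, top_n=5):
    """Rank publications by how many pages cite them together with `target`."""
    counts = Counter()
    for pubs in page_to_pubs.values():
        if target in pubs:
            # every other publication on a page citing the target gets one vote
            counts.update(p for p in pubs if p != target)
    # sort by descending count; ties are broken by publication ID for determinism
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Toy data with made-up publication IDs
page_to_pubs = {
    "Page A": {"pub-x", "pub-1", "pub-2"},
    "Page B": {"pub-x", "pub-1"},
    "Page C": {"pub-2", "pub-3"},  # does not cite pub-x, so it is ignored
}
ranking = rank_by_cocitation(page_to_pubs, "pub-x")
print(ranking)  # [('pub-1', 2), ('pub-2', 1)]
```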

### 3.2 Publications related to the article "Learning the parts of objects by non-negative matrix factorization" by Lee and Seung, Nature (1999)

In [100]:
gr.find_most_relevant(('doi','10.1038/44565'), 13)

Original publication: ('doi', '10.1038/44565') 
Title: Learning the parts of objects by non-negative matrix factorization 


2 pages referring to the publication:
 ['Non-negative matrix factorization', 'Dimensionality reduction'] 


Number of categories for level 2 publications: 5
Rank: 1 
Citations: 10
ID: ('pmid', '29234465')
Source: https://www.ncbi.nlm.nih.gov/pubmed/29234465
Title: Ten quick tips for machine learning in computational biology. 

Rank: 2 
Citations: 8
ID: ('arxiv', '1207.4197')
Source: https://arxiv.org/abs/1207.4197
Title: Detection and Characterization of Exoplanets and Disks using Projections  on Karhunen-Loeve Eigenimages 

Rank: 3 
Citations: 6
ID: ('doi', '10.1086/510127')
Source: https://doi.org/10.1086/510127
Title: K-Corrections and Filter Transformations in the Ultraviolet, Optical, and Near-Infrared 

Rank: 4 
Citations: 6
ID: ('doi', '10.3847/1538-4357/aaa1f2')
Source: https://doi.org/10.3847/1538-4357/aaa1f2
Title: Non-negative Matrix Factorization: Rob

The search identifies relevant publications: a mix of general review-type articles, technical papers describing new algorithms, and field-specific research papers. As a whole, they provide useful suggestions for further reading, which I have followed.

**Save the graph in a JSON file for future reuse.**

In [94]:
from networkx.readwrite import json_graph
import json

In [96]:
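def int64_to_int(o):
    """
    Convert np.int64 to a Python int; json.dumps cannot serialize NumPy integers.
    """
    if isinstance(o, np.int64):
        return int(o)
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")

# Serialize the graph in node-link format, converting NumPy integers on the way
json_string = json.dumps(json_graph.node_link_data(gr.G), default=int64_to_int)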

In [97]:
with open(os.path.join(processed_path, 'graph.json'), 'w') as fo:
    # json_string is already serialized; json.dump would encode it a second time
    fo.write(json_string)
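
To reuse the saved graph later, the file has to be decoded back into a node-link dictionary. A small loader, sketched below with the standard library only, can accept both layouts that may end up on disk: the dictionary written directly, or a JSON string that wraps it (which is what results when an already serialized string is passed to `json.dump`). The helper name `load_node_link`, the simplified dictionary, and the temporary files are assumptions made for the example.

```python
import json
import os
import tempfile

def load_node_link(path):
    """Return the node-link dictionary stored at `path` (hypothetical helper)."""
    with open(path) as fi:
        data = json.load(fi)
    # A double-encoded file decodes to a str first; decode once more in that case
    if isinstance(data, str):
        data = json.loads(data)
    return data

# Simplified stand-in for the output of json_graph.node_link_data(gr.G)
node_link = {"directed": True, "nodes": [{"id": "Lager"}], "links": []}

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.json")
    wrapped = os.path.join(tmp, "wrapped.json")
    with open(plain, "w") as fo:
        json.dump(node_link, fo)              # dictionary written directly
    with open(wrapped, "w") as fo:
        json.dump(json.dumps(node_link), fo)  # string wrapped in a second encoding
    loaded_plain = load_node_link(plain)
    loaded_wrapped = load_node_link(wrapped)

print(loaded_plain == loaded_wrapped)  # True
```

The resulting dictionary can then be passed to `json_graph.node_link_graph` to rebuild the networkx graph.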

## 4. Conclusions and general comments

We have built the core of a recommendation system based on semantic closeness, expressed and implemented as a graph of relations. At present, the code (stored in the `src` subdirectory) includes two main parts: (I) functionality for scraping Wikipedia and Google search results and (II) building a graph of relevant semantic relations.

In its current form, the model can serve as the basis of a system for recommending significant related literature, which could be used in systematic research on a given subject, especially in meta-research comparing the results of different studies. The most likely customer would therefore be a research institution or a technology company. If augmented with additional resources, such as shopping history, the system could be adapted as part of a more general recommendation system.

The system can be further improved by adjusting the relative ranking weights based on the path lengths between the original and recommended publications and by including higher-level categories. To prevent diverging from the topic, only categories connected to two or more lower-level categories would be included.
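
As a minimal sketch of what path-length weighting might look like, the example below gives full weight to candidates reached through a shared page (a path of length 2) and a reduced weight to candidates reached only through a shared category (a path of length 4). The function `weighted_scores`, the weight values, and the toy pages and categories are all assumptions for illustration. The filter on categories with two or more lower-level categories is not implemented here.

```python
from collections import Counter

def weighted_scores(page_to_pubs, page_to_cats, target, w_page=1.0, w_cat=0.25):
    """Score candidate publications with weights that shrink with path length."""
    # Pages citing the target, and the categories those pages belong to
    source_pages = {p for p, pubs in page_to_pubs.items() if target in pubs}
    source_cats = set()
    for page in source_pages:
        source_cats |= page_to_cats.get(page, set())

    scores = Counter()
    for page, pubs in page_to_pubs.items():
        if page in source_pages:
            weight = w_page   # target -> page -> candidate
        elif page_to_cats.get(page, set()) & source_cats:
            weight = w_cat    # target -> page -> category -> page -> candidate
        else:
            continue          # page is unrelated to the target
        for pub in pubs:
            if pub != target:
                scores[pub] += weight
    return scores.most_common()

# Toy data with made-up names
page_to_pubs = {"Page 1": {"target", "A"}, "Page 2": {"A", "B"}, "Page 3": {"C"}}
page_to_cats = {"Page 1": {"Cat X"}, "Page 2": {"Cat X"}, "Page 3": {"Cat Y"}}
scores = weighted_scores(page_to_pubs, page_to_cats, "target")
print(scores)  # [('A', 1.25), ('B', 0.25)]
```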

One of the main limitations of the current system stems from the incomplete dataset of Wikipedia references. The next step would therefore be to update the data through additional web scraping. In addition, more complete information about publications could be stored as a new table in a local relational database, which would help connect records of the same publication that carry different identification numbers. As a result, the search would be faster and the additional information retrieved about the publications more complete.
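
One possible layout for such a database is sketched below with SQLite from the Python standard library. The table names, column names, and the sample record are hypothetical; the point is only that a separate identifier table lets a DOI, an arXiv ID, or an ISBN resolve to the same publication key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demonstration
conn.executescript("""
CREATE TABLE publication (
    pub_key INTEGER PRIMARY KEY,
    title   TEXT
);
CREATE TABLE identifier (
    id_type  TEXT NOT NULL,      -- 'doi', 'arxiv', 'isbn', 'pmid', ...
    id_value TEXT NOT NULL,
    pub_key  INTEGER NOT NULL REFERENCES publication(pub_key),
    PRIMARY KEY (id_type, id_value)
);
""")

# Made-up record: one publication known under two identifiers
conn.execute("INSERT INTO publication VALUES (1, 'Example paper')")
conn.executemany(
    "INSERT INTO identifier VALUES (?, ?, ?)",
    [("doi", "10.0000/example", 1), ("arxiv", "0000.00000", 1)],
)

def resolve(conn, id_type, id_value):
    """Map any known identifier to its publication key, or None if unknown."""
    row = conn.execute(
        "SELECT pub_key FROM identifier WHERE id_type = ? AND id_value = ?",
        (id_type, id_value),
    ).fetchone()
    return row[0] if row else None

same = resolve(conn, "doi", "10.0000/example") == resolve(conn, "arxiv", "0000.00000")
print(same)  # True
```

With this layout, graph nodes could be keyed by `pub_key` so that references to the same work through different identifiers merge into a single node.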

The system can be further adapted to work with other resources by keeping the core semantic graph functionality (part II) while replacing the Wikipedia web scraping (part I) with, or combining it with, other sources such as Web of Science or Google Scholar.