# Understanding the Community: Social Discovery of Artificial Intelligence Researchers and Literatures

Artificial Intelligence (AI) is one of the most influential subjects in computer science. We analyze scholars’ activities based on the literature published in top-tier conferences to gain insights into this field.

In this report, we start with a brief introduction to the dataset and data processing. Then we present a temporal analysis of AI topic popularity. In section 3, we discuss the spatial distribution of AI research. Next, in section 4, we run a community-finding algorithm on the coauthorship and citation graphs, which identifies important research communities and centers within each graph. Finally, we build a simple paper recommendation system based on content similarity and citation relationships.

Note: The code cannot be run directly from this Python notebook because it is too long to be included in full. The complete code is provided in a separate folder.

## Data Selection and Processing
For this project, our data mainly come from two public datasets, DBLP and ArnetMiner. In addition, we scraped geographical information from Google Maps. In this section, we discuss how we parsed and used these sources.

### Data Selection
DBLP is a well-maintained computer science bibliography dataset. It contains 3,583,224 publications and 1,824,011 authors[1]. Because of this broad coverage, we use DBLP to analyze topic trends. We also gathered and formatted its coauthorship information for the community-finding analysis. However, DBLP has two obvious drawbacks: its author information is not well maintained, and it lacks paper citation data. Both would greatly hinder our further analysis, so we turned to a second dataset, ArnetMiner.

The other public dataset we use is ArnetMiner[2], which supplements DBLP with well-maintained metadata, including citations and author information. ArnetMiner contains 2,092,356 publications, 1,712,433 authors and 8,024,869 citation relationships[2]. We therefore used the ArnetMiner dataset for community finding based on citation relationships. For the analysis of authors’ geographical distribution, we also scraped the geographical coordinates of the authors’ affiliations to locate each author.

### Data Processing
The DBLP dataset is distributed as a single huge XML file that contains all paper records and metadata. Loading the whole file into memory is impractically slow, if it is possible at all, so the file has to be parsed as a stream. We use SAX, an event-driven parser, to parse the DBLP dataset.

In [None]:
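import xml.sax

# ELEMENT, PAPER, CONFERENCE, Paper and Conference are helper definitions that are
# not shown in this cell; they are assumed to live in the separate code folder.
class DBLPContentHandler(xml.sax.ContentHandler):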
    """
    This is a SAX XML parser for dblp.
    Reads the dblp.xml file and produces four output files. Each file is tab-separated
    """
    cur_element = -1
    ancestor = -1
    paper = None
    conf = None
    line = 0
    errors = 0
    author = ""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        DBLPContentHandler.inproc_file = open('inproc.txt', 'w')
        DBLPContentHandler.cite_file = open('cite.txt', 'w')
        DBLPContentHandler.conf_file = open('conf.txt', "w")
        DBLPContentHandler.author_file = open("author.txt", "w")

    def startElement(self, name, attrs):
        if name == "inproceedings":
            self.ancestor = ELEMENT.INPROCEEDING
            self.cur_element = PAPER.INPROCEEDING
            self.paper = Paper()
            self.paper.key = attrs.getValue("key")
        elif name == "proceedings":
            self.ancestor = ELEMENT.PROCEEDING
            self.cur_element = CONFERENCE.PROCEEDING
            self.conf = Conference()
            self.conf.key = attrs.getValue("key")
        elif name == "author" and self.ancestor == ELEMENT.INPROCEEDING:
            self.author = ""

        if self.ancestor == ELEMENT.INPROCEEDING:
            self.cur_element = PAPER.get_element(name)
        elif self.ancestor == ELEMENT.PROCEEDING:
            self.cur_element = CONFERENCE.get_element(name)
        elif self.ancestor == -1:
            self.ancestor = ELEMENT.OTHER
            self.cur_element = ELEMENT.OTHER
        else:
            self.cur_element = ELEMENT.OTHER

        self.line += 1

    def endElement(self, name):
        if name == "author" and self.ancestor == ELEMENT.INPROCEEDING:
            self.paper.authors.append(self.author)

        if ELEMENT.get_element(name) == ELEMENT.INPROCEEDING:
            self.ancestor = -1
            try:
                if self.paper.title == "" or self.paper.conference == "" or self.paper.year == 0:
                    print ("error in parsing " + self.paper.key)
                    print self.paper.title
                    print self.paper.conference
                    print self.paper.year
                    self.errors += 1
                    return
                
                # filter only important AI conference paper
                keywords = ['AAAI', 'CVPR', 'ICCV', 'ECCV', 'ICML', 'IJCAI', 'NIPS','ACL', 
                            'COLT', 'EMNLP', 'ECAI', 'ICRA', 'ICCBR', 'COLING', 'KR', 'UAI', 
                            'PPSN', 'ACCV', 'CoNLL', 'ICPR', 'BMVC', 'IROS', 'ACML', 
                            'SIGMOD Conference', 'SIGMOD', 'KDD', 'SIGKDD', 'SIGIR', 'VLDB', 
                            'ICDE', 'CIKM', 'PODS', 'PKDD', 'ECML/PKDD', 'ICDM', 'SDM', 'ICDT', 
                            'CIDR', 'WSDM', 'ECIR', 'PAKDD']
                for keyword in keywords:
                    # if keyword in self.paper.title.lower() or keyword in self.paper.conference.lower():
                    if keyword.lower() in self.paper.conference.lower().split(" "):
                        self.write_paper(self.paper)
                        for t in self.paper.authors:
                            self.write_author(t, self.paper)
                        for c in self.paper.citations:
                            if c != "...":
                                self.write_citation(c, self.paper)
                        return

            except ValueError:
                print "error"

        elif ELEMENT.get_element(name) == ELEMENT.PROCEEDING:
            self.ancestor = -1
            try:
                if self.conf.name == "":
                    self.conf.name = self.conf.detail
                if self.conf.key == "" or self.conf.name == "" or self.conf.detail == "":
                    print "no conference: line ", self.line
                    return
                self.write_conf(self.conf)
            except ValueError:
                print "error"

    def write_conf(self, conf):
        DBLPContentHandler.conf_file.write(conf.key + "\t")
        DBLPContentHandler.conf_file.write(conf.name + "\t")
        DBLPContentHandler.conf_file.write(conf.detail + "\n")

    def write_paper(self, paper):
        # print paper.toString()
        DBLPContentHandler.inproc_file.write(paper.title + "\t")
        DBLPContentHandler.inproc_file.write(str(paper.year) + "\t")
        DBLPContentHandler.inproc_file.write(paper.conference + "\t")
        DBLPContentHandler.inproc_file.write(paper.key + "\n")

    def write_author(self, t, paper):
        DBLPContentHandler.author_file.write(t + "\t")
        DBLPContentHandler.author_file.write(paper.key + "\n")

    def write_citation(self, c, paper):
        DBLPContentHandler.cite_file.write(paper.key + "\t")
        DBLPContentHandler.cite_file.write(c + "\n")

    def characters(self, content):
        content = content.encode('utf-8').replace('\\', '\\\\').replace('\n', "").replace('\r', "")
        if self.ancestor == ELEMENT.INPROCEEDING:
            if self.cur_element == PAPER.AUTHOR:
                self.author += content
            elif self.cur_element == PAPER.CITE:
                if len(content) == 0:
                    return
                self.paper.citations.append(content)
            elif self.cur_element == PAPER.CONFERENCE:
                self.paper.conference += content
            elif self.cur_element == PAPER.TITLE:
                self.paper.title += content
            elif self.cur_element == PAPER.YEAR:
                if len(content) == 0:
                    return
                try:
                    self.paper.year = int(content)
                except ValueError:
                    print "s" + content + "s"
                    print float(content)
        elif self.ancestor == ELEMENT.PROCEEDING:
            if self.cur_element == CONFERENCE.CONFNAME:
                self.conf.name = content
            elif self.cur_element == CONFERENCE.CONFDETAIL:
                self.conf.detail = content

In [None]:
# call function to parse dblp
source = open("dblp.xml")
xml.sax.parse(source, DBLPContentHandler())

From the raw DBLP dataset, we extract the publication title, year, authors and venue. We then filter the parsed data, keeping only papers from 30 hand-picked top-tier Artificial Intelligence conferences, e.g. ICML, NIPS, CVPR and SIGKDD. The filtered result is stored in 3 `.csv` files, which include 144,051 papers and 429,247 authors from 30 conferences.

### ArnetMiner Data Processing
We used an open-source parser to parse and format the ArnetMiner dataset[3]. It does most of the parsing work for us: it converts the raw dataset into multiple `.csv` files, filters publications by a range of years and generates a formatted citation graph. Starting from the formatted and filtered publication data, we further selected the publications related to Artificial Intelligence based on their titles, and obtained each author’s location from the affiliation data through the Google Maps API.

```shell
$ python aminer.py ParseAminerNetworkDataToCSV --local-scheduler
```
This parses the raw dataset into formatted `.csv` files.

Then run
```shell
$ python filtering.py FilterAllCSVRecordsToYearRange --start <start_year> --end <end_year> --local-scheduler
```
which filters the AI-related literature by year.

Finally, run
```shell
$ python build_graphs.py BuildAllGraphData --start <start_year> --end <end_year> --local-scheduler
```
to generate the formatted citation graph.

## Temporal Analysis of Artificial Intelligence Research

Artificial Intelligence is a broad research area that encompasses many topics, from learning methods, robotics, vision and language to problem solving, philosophy and even cognitive science. One interesting question is how AI research has evolved over time. In this section, we present the results of our temporal analysis.

First, we studied the number of AI conference papers in DBLP each year. AI research has clearly gained popularity since 1980, with a large jump after 2000.
<img src="picture/temporal/number.png" width="500" height="500">

Then, we investigate how trending research topics change over time. By counting bigram frequencies in paper titles (ignoring stop words), we identify the top 50 hot research topics of each year. We select 1985, 2000 and 2015 as three representative years and generate a word cloud of the hottest research topics for each. The size of a keyword is proportional to how frequently it appears in paper titles.
<img src="picture/temporal/1985.png" width="500" height="500">

In 1985, Expert Systems (logic programming, AI programs...), Natural Language (information retrieval) and Database Systems were clearly the three most influential subjects of the year. None of today’s common machine learning methods appear here.
<img src="picture/temporal/2000.png" width="500" height="500">


Turning to 2000, we see many hot topics related to Vision (robots, object recognition) and Data Mining. Machine learning methods such as SVMs, Bayesian networks, hidden Markov models and neural networks are now part of the hot-topic list.
<img src="picture/temporal/2015.png" width="500" height="500">

Finally, let’s look at 2015. The most trending topics are Neural Networks, Social Media and Big Data. Advanced machine learning techniques such as deep learning, CNNs and recurrent neural networks begin to appear among the hot topics. It is also worth noting that, with the extremely large volumes of data generated today, research topics such as big data, dictionary learning and matrix factorization have become much more popular.
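
The bigram counting behind these word clouds can be sketched roughly as follows. This is a minimal Python 3 illustration rather than the code used for the figures: the tiny `STOP_WORDS` set, the helper names `title_bigrams` and `top_bigrams`, and the sample titles are all assumptions made for the example, and a fuller stop-word list would be needed in practice.

```python
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def title_bigrams(title):
    """Return the bigrams of a title after lowercasing and dropping stop words."""
    words = [w for w in title.lower().split() if w not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(words, words[1:])]


def top_bigrams(papers, year, k=50):
    """Count bigrams over all titles published in `year`; return the k most frequent."""
    counts = Counter()
    for title, paper_year in papers:
        if paper_year == year:
            counts.update(title_bigrams(title))
    return counts.most_common(k)


# Hypothetical sample records: (title, year).
papers = [
    ("An Expert System for Medical Diagnosis", 1985),
    ("Building an Expert System in Prolog", 1985),
    ("Deep Learning for Image Recognition", 2015),
]
print(top_bigrams(papers, 1985, k=3))
```

Note that stop words are removed before the bigrams are formed, so a bigram may join two words that were separated by a stop word in the original title.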

Based on the popularity analysis above, we selected several topics and examined how their popularity changed over time. As examples, three topics that were among the most popular in AI at some point are discussed below.

Expert systems were extremely popular around 1985. However, in the 1990s and beyond, the term "expert system" mostly dropped from the IT lexicon. One common explanation is that "expert systems failed": the IT world moved on because expert systems did not deliver on their overhyped promise[4].
<img src="picture/temporal/expertsystem.png" width="500" height="500">

Data mining is a newer field that gained popularity around the 1990s and has remained popular ever since.
<img src="picture/temporal/mining.png" width="500" height="500">

Convolutional neural networks have an interesting history. The data show that convolutional networks have been applied since the early 1990s, yet they did not become very popular until around 2010, when they proved extremely effective in vision research. Accordingly, we observe a sharp surge around 2010.
<img src="picture/temporal/cnn.png" width="500" height="500">
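
A per-topic popularity curve of this kind can be computed by counting, for each year, the titles that mention the topic. The sketch below is a hedged Python 3 illustration: the function name `topic_trend` and the sample records are hypothetical, and since the report does not state whether the figures plot raw counts or shares, the sketch returns both.

```python
from collections import defaultdict


def topic_trend(papers, phrase):
    """Map each year to (number of titles containing `phrase`, share of that year's titles)."""
    total = defaultdict(int)
    hits = defaultdict(int)
    phrase = phrase.lower()
    for title, year in papers:
        total[year] += 1
        if phrase in title.lower():
            hits[year] += 1
    return {year: (hits[year], hits[year] / total[year]) for year in sorted(total)}


# Hypothetical sample records: (title, year).
sample = [
    ("An Expert System for Circuit Design", 1985),
    ("A Logic Programming Approach to Planning", 1985),
    ("Data Mining of Web Logs", 2000),
]
print(topic_trend(sample, "expert system"))
```

Reporting the share alongside the count helps separate a topic's own growth from the overall growth in the number of papers per year.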


## Spatial Analysis of Artificial Intelligence Research Community

Another question we want to answer is how the popularity of AI research differs from place to place, so we next investigate the geographic distribution of AI research. We use the ArnetMiner dataset, which contains authors’ affiliations. Using the Google Maps API, we obtain the latitude and longitude of each author’s affiliation. We then use the Map Data website (http://www.mapsdata.co.uk/about) to create a heat map of AI research.

In [None]:
import geocoder
import requests

# short example of how to get coordinate of an address
g = geocoder.google("Carnegie Mellon University")
print (str(g.latlng[0]) + "\t" + str(g.latlng[1]) + "\n")

Let’s take a look at the current distribution of AI research around the world. The figure shows the number of researchers located in each area. North America, Europe and Asia are the three largest AI research communities. We also find that AI research has grown rapidly in recent years in China, India, Iran, Singapore and Taiwan.

<img src="picture/geo/world.png" width="800" height="800">

The figures below show the detailed distribution within different continents. From these figures we can identify large clusters around Spain, France, Germany, Italy, China, Japan and South Korea. There are also smaller clusters of authors in Canada, India, Brazil, Iran and Turkey.

<img src="picture/geo/combination.png" width="900" height="900">

From the figures, we observe a strong correlation between the popularity of AI research and economic prosperity. Most of the authors are located in developed countries or in developing countries with high research investment, such as China and India. Unfortunately, there is currently very little research in North Africa and the Middle East.

## Community Finding within Coauthorship and Citation Network

To fully understand a research field, it is important to look at the relationships between researchers. Coauthorship and citation networks provide an easy way to study these relationships. In this section, we first perform a simple statistical analysis of the network to identify the most-cited authors. Next, we implement the fast unfolding algorithm ("Fast unfolding of communities in large networks"[6]), an efficient algorithm for extracting network structure, to identify the research communities within the graph. We visualize the citation and coauthor graphs using the open-source software Gephi. Through the visualization, we are able to identify important research communities, research fields and the leading researchers within each field.

First, by simply counting degrees on the graph, we identified the most-cited authors in recent years (2010-2016). Chih-Jen Lin and Chih-Chung Chang rank at the top because of their work on LIBSVM in 2011. Jiawei Han’s work in data mining has had a profound influence on the field, which ranks him third.

<img src="picture/graph/table.png" width="600">
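
The degree count amounts to tallying how often each author appears as the target of a citation edge. A minimal Python 3 sketch, assuming the citation graph is available as a list of `(citing, cited)` pairs (the function name `most_cited` and the placeholder author labels are hypothetical):

```python
from collections import Counter


def most_cited(citations, k=10):
    """Return the k nodes with the highest in-degree from (citing, cited) pairs."""
    in_degree = Counter(cited for _citing, cited in citations)
    return in_degree.most_common(k)


# Hypothetical edges between placeholder authors.
citations = [
    ("author_a", "author_b"),
    ("author_c", "author_b"),
    ("author_a", "author_c"),
]
print(most_cited(citations, k=2))
```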

Next, we performed community finding on the coauthorship and citation networks, using the fast unfolding algorithm to extract the community structure. Here we briefly introduce the algorithm.

The fast unfolding algorithm is based on modularity optimization and offers a good tradeoff between runtime and accuracy. We first build an undirected graph by adding an edge for each pair of people who coauthored the same paper, and then run the algorithm on this graph. The algorithm has two passes. In the first pass, each node $i$ is moved to the adjacent cluster $C$ that yields the highest modularity gain:

$$\Delta Q= \left[\frac{\sum_{in} + k_{i,in}}{2m} - \left(\frac{\sum_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\sum_{in}}{2m} - \left(\frac{\sum_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right]$$

Following [6], $\sum_{in}$ is the sum of the weights of the links inside $C$, $\sum_{tot}$ is the sum of the weights of the links incident to nodes in $C$, $k_i$ is the sum of the weights of the links incident to node $i$, $k_{i,in}$ is the sum of the weights of the links from $i$ to nodes in $C$, and $m$ is the sum of the weights of all links in the network.

In the second pass, each cluster is merged into a single node: all connections inside a cluster become one weighted self-loop, and the connections between two clusters are merged into one weighted edge. The first pass is then run again on this compacted graph, and the two passes are repeated until modularity no longer improves. This identifies communities in a hierarchical way. Fast unfolding has proved to be an efficient way to partition a network into communities[6].

For efficiency, the algorithm above is implemented in C++, so it is not included in this report. For further detail, please refer to `code/cluster_pipeline/`.
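
Since the C++ implementation is not reproduced here, the first pass can be illustrated with a small Python 3 sketch. It is a deliberate simplification under stated assumptions, not the project's implementation: it re-evaluates the modularity $Q$ from scratch for every candidate move instead of using an incremental gain formula, it handles only unweighted graphs, and it omits the second (aggregation) pass. The function names `modularity` and `first_pass` and the toy graph are made up for this example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Q = sum over communities c of L_c/m - (D_c/(2m))^2 (unweighted, undirected graph).

    L_c is the number of edges inside c, D_c the total degree of its nodes.
    """
    m = float(len(edges))
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def first_pass(edges):
    """Greedy local moving: put each node in the neighbouring community that raises Q most."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    community = {n: n for n in nodes}  # every node starts in its own community
    improved = True
    while improved:
        improved = False
        for node in nodes:
            original = community[node]
            best_c, best_q = original, modularity(edges, community)
            for c in sorted({community[n] for n in neighbours[node]}):
                community[node] = c  # tentatively move the node
                q = modularity(edges, community)
                if q > best_q + 1e-12:  # accept only strict improvements
                    best_c, best_q = c, q
            community[node] = best_c
            if best_c != original:
                improved = True
    return community


# Toy graph (an assumption for illustration): two triangles joined by one edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_partition = first_pass(toy_edges)
print(toy_partition, modularity(toy_edges, toy_partition))
```

Recomputing $Q$ for every tentative move is far too slow for graphs of the size used in this report; the point of the incremental $\Delta Q$ expression is precisely to avoid that recomputation.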


## Coauthorship Network
The coauthorship network provides rich information about research collaboration. By running the fast unfolding community-finding algorithm on the graph, we can group authors by their collaboration relationships and thereby identify important research groups. Our coauthor network is generated by parsing the DBLP dataset ourselves. The initial network contains 540k authors. We first filtered out authors with fewer than 50 publications, which left only 3,220 nodes. Our algorithm was able to detect 73 communities among them. We selected the top 100 authors and visualized their communities using Gephi. Each color represents a research community in which research collaboration is intense.

![co-author graph](picture/graph/coauthor.png)

As we can see, the network splits into many communities, many of which are groups with special characteristics. At the top we have a large "data mining" group led by Jiawei Han and Philip S. Yu. The group also contains researchers outside the US, probably because of their rich connections to Asian institutes. On the left, we have a "computer vision" group led by Thomas S. Huang and Shuicheng Yan. On the right, we have a "Database & Graph Mining" group led by Christos Faloutsos and Raghu Ramakrishnan. We also see some small groups at the bottom, such as the "AI robotics" group of Sebastian Thrun and Wolfram Burgard and the "learning & vision" group of Eric Xing, Fei-Fei Li and Michael Jordan.

From the graph, we can make several important observations. First, data mining and computer vision are two of the most active research fields, with the largest research communities. Second, there is increasing cross-regional collaboration between institutes. For example, Jiawei Han is a professor at UIUC, yet his coauthor network includes people from China, Singapore and the UK. Third, although the network is split into different communities, they are not isolated from each other. Many cross-community collaborations can be found within the network, which demonstrates that researchers are working closely with each other.
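
The graph construction used in this section (one undirected edge per pair of coauthors, restricted to prolific authors) could be sketched as follows in Python 3. The input format of `(author, paper_key)` pairs mirrors the tab-separated author file written by the DBLP parser, but the function name `coauthor_edges`, the use of joint-paper counts as edge weights, and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def coauthor_edges(author_paper_pairs, min_papers=50):
    """Build weighted undirected coauthor edges among authors with >= min_papers papers."""
    authors_of = defaultdict(set)  # paper key -> set of authors
    n_papers = Counter()           # author -> number of distinct papers
    for author, paper_key in author_paper_pairs:
        if author not in authors_of[paper_key]:
            authors_of[paper_key].add(author)
            n_papers[author] += 1
    edges = Counter()
    for authors in authors_of.values():
        kept = sorted(a for a in authors if n_papers[a] >= min_papers)
        for a, b in combinations(kept, 2):
            edges[(a, b)] += 1     # assumed weight: number of joint papers
    return edges


# Hypothetical sample with a lowered threshold so the toy data passes the filter.
pairs = [("A", "p1"), ("B", "p1"), ("A", "p2"), ("B", "p2"), ("C", "p2")]
print(coauthor_edges(pairs, min_papers=2))
```

Sorting the authors of each paper before pairing them gives every undirected edge a single canonical key, so `("A", "B")` and `("B", "A")` are never counted separately.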

## Citation Network
Besides coauthorship, we also study the clusters within the citation graph. The coauthor network allows us to identify research communities, while the citation network allows us to identify research fields and the leading researchers within each field. We use the ArnetMiner dataset for citation information; it contains 28,526 researchers and 191,128 citation links for AI research from 2010-2016.

Running the fast unfolding algorithm partitions the graph into 180 communities. The following figure visualizes the top 100 most-cited authors. On the left, the blue and pink clusters correspond to machine learning and computer vision research. On the right, several small green and orange clusters correspond to data mining. At the bottom, the dark red nodes represent NLP research communities.

![citation graph](picture/graph/citation.png)

We can draw several conclusions from this graph. The citation counts show that, from 2010-2016, computer vision research received more citations than other areas, which indicates the popularity of the field. Within computer vision, the citation analysis suggests that L. Van Gool, Li Fei-Fei and P.H.S. Torr are important researchers. (The size of a node indicates its PageRank score.) In data mining, the leading researchers are Jiawei Han and Nitesh V. Chawla.
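
For readers unfamiliar with the PageRank score that determines node size, the idea can be sketched with a plain power iteration in Python 3. This is only an illustration of the general technique, not the computation performed inside Gephi: the function name `pagerank`, the damping factor of 0.85 (a conventional choice), the fixed iteration count and the sample edges are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on directed (citing, cited) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical citation edges between placeholder authors.
sample_edges = [("a", "c"), ("b", "c"), ("c", "d")]
scores = pagerank(sample_edges)
print(sorted(scores, key=scores.get, reverse=True))
```

Unlike a raw citation count, this score also rewards being cited by authors who are themselves highly ranked.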

## Artificial Intelligence Literature Recommendation System

From our own experience of doing research and writing papers on unfamiliar topics, we often miss important papers when no one is there to guide us. In this scenario, a system that recommends related papers worth reading would be extremely helpful. Therefore, last but not least, we designed and implemented a simple literature recommendation system. Given a research paper the user is focusing on, the system recommends the most related and influential papers.


The recommendation problem is modeled with Bayes’ rule:
$$P(l_i|l_q) = \frac{P(l_q|l_i)\cdot P(l_i)}{P(l_q)}$$
where $l_i$ and $l_q$ denote the candidate literature and the query literature, respectively. Since $P(l_q)$ is constant across all $l_i$, the equation reduces to:
$$P(l_i|l_q) \propto P(l_q|l_i)\cdot P(l_i)$$
To solve the problem, we define the likelihood $P(l_q|l_i)$ to be the topic similarity between $l_i$ and $l_q$, and the prior $P(l_i)$ to be the number of appearances of $l_i$ near $l_q$ in the citation graph $G_c$ (to speed up the calculation).
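
Read literally, the proportionality above says that candidates can be ranked by the product of a likelihood score and a prior score. A minimal Python 3 sketch of that ranking step follows; the function name `rank_candidates`, the dictionary inputs and the numbers are hypothetical, and the sketch follows the formula rather than the exact score combination used in the implementation later in this report.

```python
def rank_candidates(likelihood, prior, k=5):
    """Rank candidates by likelihood * prior, which is proportional to the posterior."""
    candidates = set(likelihood) | set(prior)
    scores = {c: likelihood.get(c, 0.0) * prior.get(c, 0.0) for c in candidates}
    # Sort by descending score; ties are broken by candidate id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]


# Hypothetical scores for three candidate papers.
likelihood = {"p1": 0.9, "p2": 0.5, "p3": 0.2}
prior = {"p1": 0.1, "p2": 1.0, "p3": 0.5}
print(rank_candidates(likelihood, prior, k=2))
```

One consequence of the pure product is that a candidate missing from either score gets a posterior of zero, so an implementation may prefer a small default value instead.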

In [3]:
from nltk.corpus import stopwords
import sys
from collections import Counter
import config
import os
import pandas as pd
import pickle
import string
from Queue import Queue

def save_obj(obj, name):
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


def load_obj(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

For topic similarity, we calculate the similarity of the two publications’ titles and of their abstracts separately.


The title similarity is defined as the number of bigrams shared by the two titles. Here, we use the k-gram index method[5]: we first build a hash table whose keys are the bigrams that appear in any title and whose values are the lists of indexes of the titles containing each bigram. Given a query title, the similarity calculation then reduces to merging the lists found for the bigrams in the query title. The title similarity score $S_t$ of a candidate is the number of times its title appears in these lists.

In [4]:
def bigram_add(bigram_dict, key, value):
    if key not in bigram_dict:
        bigram_dict[key] = set()
    bigram_dict[key].add(value)


def bigram_index(titles):
    bigram = dict()

    for key, title in titles.iteritems():
        stop_words = set(stopwords.words('english'))
        try:
            words = [w.lower() for w in title.split(' ') if w not in stop_words]

            if len(words) < 2:
                bigram_add(bigram, words[0], key)
                continue
            for i in range(len(words) - 1):
                bi = words[i: i + 2]
                bigram_add(bigram, ' '.join(bi), key)
        except:
            print 'exception:', key, title
            continue
    return bigram


def bigram_search(bigram_dict, title):
    stop_words = set(stopwords.words('english'))
    words = [w.lower() for w in title.split(' ') if w not in stop_words]

    bi_list = list()
    for i in range(len(words) - 1):
        bi = ' '.join(words[i: i + 2])
        if bi in bigram_dict:
            bi_list += list(bigram_dict[bi])
    bi_cnt = Counter(bi_list)
    retval = dict()
    for w in bi_cnt:
        if bi_cnt[w] >= 2:
            retval[w] = bi_cnt[w]

    # normalization (max() would raise ValueError if no candidate matched)
    if not retval:
        return retval
    max_val = max(retval.values())
    for w in retval:
        retval[w] = float(retval[w]) / max_val
    return retval


def build_bigram_index():
    paper_df = pd.read_csv(os.path.join(config.base_csv_dir, 'paper.csv'))
    title_dict = dict()
    for index, row in paper_df.iterrows():
        title_dict[int(row['id'])] = row['title']

    print type(title_dict)
    print title_dict.keys()[0], title_dict[title_dict.keys()[0]]

    bigram_dict = bigram_index(title_dict)
    save_obj(bigram_dict, 'bigram_idx')

    
def load_bigram_dict():
    if not os.path.exists('./bigram_idx.pkl'):
        build_bigram_index()
    return load_obj('bigram_idx')


def load_paper_dict():
    if not os.path.exists('./paper_dict.pkl'):
        paper_dict = dict()
        paper_df = pd.read_csv(os.path.join(config.base_csv_dir, 'paper.csv'))
        for _, row in paper_df.iterrows():
            pid = int(row['id'])
            paper_dict[pid] = dict()
            paper_dict[pid]['title'] = row['title']
            paper_dict[pid]['abstract'] = row['abstract']
        save_obj(paper_dict, 'paper_dict')
    return load_obj('paper_dict')


def load_paper_inv_dict():
    if not os.path.exists('./paper_inv_dict.pkl'):
        paper_inv_dict = dict()
        paper_df = pd.read_csv(os.path.join(config.base_csv_dir, 'paper.csv'))
        for _, row in paper_df.iterrows():
            pid = int(row['id'])
            title = row['title']
            paper_inv_dict[title] = pid
        save_obj(paper_inv_dict, 'paper_inv_dict')
    return load_obj('paper_inv_dict')

The abstract similarity is calculated with the smoothed retrieval model used by the Indri search engine. For each candidate’s abstract $d$, we calculate the score $S_a$ against the query abstract $q$ using the following equation:
$$ S_a(l) = P(q|d) = \sum_{i} p(q_i|d) = \sum_{i} (1-\lambda )\frac{tf_{q_i,d} + \mu p_{MLE}(q_i|C)}{length(d) + \mu} + \lambda p_{MLE}(q_i|C) $$
where $\lambda$ and $\mu$ are smoothing parameters, both set to 0.1 here; $tf_{q_i,d}$ is the term frequency of the query term $q_i$ in the document $d$; and $p_{MLE}(q_i|C)$ is the relative frequency of $q_i$ in the corpus $C$.

Then the likelihood $P(l_q|l_i)$ is calculated as:
$$P(l_q|l_i) = 0.5 \cdot (normalized(S_t(l)) + normalized(S_a(l)) )$$
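
The averaging step can be illustrated with a short Python 3 sketch. The text does not spell out what $normalized(\cdot)$ means, so dividing each score by the maximum over all candidates is an assumption here (it matches the normalization used in the title-similarity code above); the function names and sample scores are likewise hypothetical.

```python
def max_normalize(scores):
    """Scale scores into [0, 1] by dividing by the largest value (assumed normalization)."""
    top = max(scores.values()) if scores else 0.0
    if top <= 0:
        return {c: 0.0 for c in scores}
    return {c: v / float(top) for c, v in scores.items()}


def combined_likelihood(title_scores, abstract_scores):
    """Average the normalized title and abstract similarity of every candidate."""
    t = max_normalize(title_scores)
    a = max_normalize(abstract_scores)
    return {c: 0.5 * (t.get(c, 0.0) + a.get(c, 0.0)) for c in set(t) | set(a)}


# Hypothetical raw scores for two candidates.
title_scores = {"p1": 2, "p2": 4}
abstract_scores = {"p1": 0.3, "p2": 0.1}
print(combined_likelihood(title_scores, abstract_scores))
```

Normalizing first keeps either signal from dominating merely because its raw scale is larger.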

In [5]:
class abstract_similarity:
    lam = 0.1
    mu = 0.1

    def __init__(self, path=os.path.join(config.filtered_dir, 'paper-2010-2016.csv')):
        file = open(path)
        stop = set(stopwords.words('english'))
        stop.add("using")
        stop.add("based")

        self.dict = {}
        self.total = 0
        for line in file:
            line = line.rstrip().lower()
            line = line[line.find('"'):]
            title = line.lower().rstrip().translate(None, string.punctuation)
            words = title.split(" ")
            for i in range(0, len(words)):
                word = words[i]
                if words[i] in stop :
                    continue
                if word in self.dict:
                    self.dict[word] += 1
                else:
                    self.dict[word] = 1
                self.total += 1
        file.close()

    def getScore(self, abstract1, abstract2):
        '''
        :param abstract1: input
        :param abstract2: another doc that needs to be compared
        :return: similarity score
        '''
        counter2 = {}
        stop = set(stopwords.words('english'))
        words1 = abstract1.rstrip().lower().rstrip().translate(None, string.punctuation).split(" ")
        words2 = abstract2.rstrip().lower().rstrip().translate(None, string.punctuation).split(" ")
        stop.add("using")
        stop.add("based")
        doclen = 0
        for word in words2:
            if word in stop:
                continue
            doclen += 1
            if word in counter2:
                counter2[word] += 1
            else:
                counter2[word] = 1

        score = 0.0
        for word in words1:
            tf = 0.0
            if word in stop:
                continue
            if word in counter2:
                tf = counter2[word]
            if word in self.dict:
                ctf = self.dict[word]
            else:
                ctf = 0.0
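            # float() avoids Python 2 integer division, which would zero out the lambda-weighted corpus term
            p_corpus = ctf / float(self.total)
            score += (1-self.lam) * ((tf + self.mu * p_corpus) / (doclen + self.mu)) + self.lam * p_corpus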
        return score

For the reference score $P(l_i)$, we simply count how many times each paper is reached within distance 3 of the query paper in the citation graph. This score captures a paper’s popularity within the small cluster of papers around the query paper.

In [6]:
def build_ref_dict():
    print 'building'
    ref_df = pd.read_csv(os.path.join(config.base_csv_dir, 'refs.csv'))
    ref_dict = dict()
    for _, row in ref_df.iterrows():
        pid = int(row['paper_id'])
        if pid not in ref_dict:
            ref_dict[pid] = list()
        ref_dict[pid].append(int(row['ref_id']))
    save_obj(ref_dict, 'ref_idx')


def load_ref_dict():
    if not os.path.exists('ref_idx.pkl'):
        build_ref_dict()

    return load_obj('ref_idx')


def graph_count(ref_idx, paper_id, niter=3):
    vertices = list()
    queue = Queue()
    queue.put(paper_id)
    for i in range(niter):
        n = queue.qsize()
        for j in range(n):
            pid = queue.get()
            try:
                citations = ref_idx[pid]
                for cite in citations:
                    queue.put(cite)
                    vertices.append(cite)
            except:
                continue

    retval = dict()
    vertices_cnt = Counter(vertices)
    for pid in vertices_cnt:
        retval[pid] = vertices_cnt[pid]

    # normalization (max() would raise ValueError if the paper has no indexed references)
    if not retval:
        return retval
    max_val = max(retval.values())
    for w in retval:
        retval[w] = float(retval[w]) / max_val
    return retval

The system then returns the five highest-scoring candidates as the recommended papers.

In [7]:
import title_similarity
import ref_score
import similarity
import operator

class recommendation:
    def __init__(self):
        # set up citation reference score index
        self.ref_dict = ref_score.load_ref_dict()
        print 'index loading done'

        # set up title bigram feature
        self.bigram_dict = title_similarity.load_bigram_dict()
        print 'bigram load done'

        self.paper_dict = title_similarity.load_paper_dict()
        self.paper_inv_dict = title_similarity.load_paper_inv_dict()
        print 'paper load done'

        # set up abstract similarity feature
        self.simi_eva = similarity.abstract_similarity()
        print "abstract index done"


    def get(self, paper_title):
        if paper_title not in self.paper_inv_dict:
            return list()
        pid = self.paper_inv_dict[paper_title]
        title = self.paper_dict[pid]['title']
        abstract = self.paper_dict[pid]['abstract']

        # calculate title bigram feature
        idx_list = title_similarity.bigram_search(self.bigram_dict, title)

        # calculate citation feature
        cnt = ref_score.graph_count(self.ref_dict, pid, 3)

        # calculate prior
        candidates = {}
        for k, v in idx_list.iteritems():
            if k in candidates:
                candidates[k] += v
            else:
                candidates[k] = v

        for k, v in cnt.iteritems():
            if k in candidates:
                candidates[k] += v
            else:
                candidates[k] = v

        # calculate total score
        for id, prior in candidates.iteritems():
            if id in self.paper_dict and type(self.paper_dict[id]['abstract']) == str:
                score = prior * self.simi_eva.getScore(abstract1=abstract, abstract2=self.paper_dict[id]['abstract'])
            else:
                score = prior * 0.5
            candidates[id] = score

        sort_score = sorted(candidates.items(), key=operator.itemgetter(1), reverse=True)

        retval = list()
        for item in sort_score:
            try:
                title = self.paper_dict[item[0]]['title']
                retval.append(title)
            except KeyError:
                continue

        return retval[1:6]

In [8]:
re = recommendation()
print re.get("New unsupervised clustering algorithm for large datasets")

building
index loading done
<type 'dict'>
1747627 Sequential feature selection for classification
exception: 1446891 nan
exception: 1309608 nan
bigram load done
paper load done
abstract index done
['K-AP Clustering Algorithm for Large Scale Dataset', 'A New Clustering Algorithm of Large Datasets with O(N) Computational Complexity', 'Fast discovery of association rules', 'An Efficient Density Based Clustering Algorithm for Large Databases', 'An effective hash-based algorithm for mining association rules']


For the query above, our implementation returns:
```python
['K-AP Clustering Algorithm for Large Scale Dataset', 'A New Clustering Algorithm of Large Datasets with O(N) Computational Complexity', 'Fast discovery of association rules', 'An Efficient Density Based Clustering Algorithm for Large Databases', 'An effective hash-based algorithm for mining association rules']
``` 

## References
[1] http://dblp.uni-trier.de/

[2] https://aminer.org/billboard/aminernetwork

[3] https://github.com/macks22/dblp

[4] https://en.wikipedia.org/wiki/Expert_system

[5] http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-spelling-correction-1.html

[6] Blondel, Vincent D., et al. "Fast unfolding of communities in large networks." Journal of statistical mechanics: theory and experiment 2008.10 (2008): P10008.
