# Understanding the Community: Social Discovery of Artificial Intelligence Researchers and Literatures

Artificial Intelligence is one of the most influential subjects in computer science. We analyze scholars' activities based on the literature published in top-tier conferences to gain insights into this field.

In this report, we start with a brief introduction to the dataset and data processing. Then we present a temporal analysis of AI topic popularity. In section 3, we discuss the spatial distribution of AI research. Next, in section 4, we apply a community-finding algorithm to the coauthorship and citation graphs to identify important research communities and centers within each graph. Finally, we build a simple paper recommendation system based on content similarity and citation relationships.

Note: The code cannot be run directly from this Python notebook because it is too long to include in full. We have included a separate folder for the code.

## Data Selection and Processing
For this project, our data mainly came from two public datasets, DBLP and ArnetMiner. In addition, we scraped geographical information from Google Maps. In this section, we discuss how we parsed and used them.

### Data Selection
DBLP is a well-maintained computer science bibliography dataset. It contains 3,583,224 publications and 1,824,011 authors[1]. Because of this broad coverage, we use DBLP to analyze topic trends. We also gathered and formatted its co-authorship information for the community-finding analysis. However, DBLP has two obvious drawbacks: it lacks well-maintained author information, and it has no paper citation data. Both are needed for our further analysis, so we turned to a second dataset, ArnetMiner.

The other public dataset we use is ArnetMiner[2], which supplements DBLP with well-maintained metadata, including citations and author information. ArnetMiner contains 2,092,356 publications, 1,712,433 authors and 8,024,869 citation relationships[2]. We therefore used the ArnetMiner dataset for community finding based on citation relationships. For the analysis of authors' geographical distribution, we also scraped the geographical coordinates of the authors' affiliations to locate each author.

### Data Processing
The DBLP dataset is distributed as one huge XML file that contains all paper records and metadata. Loading the whole file into memory is impractically slow, so the dataset has to be parsed as a stream. We use SAX, an event-driven parser, to parse the DBLP dataset.

In [None]:
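import xml.sax

# ELEMENT, PAPER, CONFERENCE, Paper and Conference are helper definitions
# that are not shown in this notebook.
class DBLPContentHandler(xml.sax.ContentHandler):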
    """
    This is a SAX XML parser for dblp.
    Reads the dblp.xml file and produces four output files. Each file is tab-separated
    """
    cur_element = -1
    ancestor = -1
    paper = None
    conf = None
    line = 0
    errors = 0
    author = ""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        DBLPContentHandler.inproc_file = open('inproc.txt', 'w')
        DBLPContentHandler.cite_file = open('cite.txt', 'w')
        DBLPContentHandler.conf_file = open('conf.txt', "w")
        DBLPContentHandler.author_file = open("author.txt", "w")

    def startElement(self, name, attrs):
        if name == "inproceedings":
            self.ancestor = ELEMENT.INPROCEEDING
            self.cur_element = PAPER.INPROCEEDING
            self.paper = Paper()
            self.paper.key = attrs.getValue("key")
        elif name == "proceedings":
            self.ancestor = ELEMENT.PROCEEDING
            self.cur_element = CONFERENCE.PROCEEDING
            self.conf = Conference()
            self.conf.key = attrs.getValue("key")
        elif name == "author" and self.ancestor == ELEMENT.INPROCEEDING:
            self.author = ""

        if self.ancestor == ELEMENT.INPROCEEDING:
            self.cur_element = PAPER.get_element(name)
        elif self.ancestor == ELEMENT.PROCEEDING:
            self.cur_element = CONFERENCE.get_element(name)
        elif self.ancestor == -1:
            self.ancestor = ELEMENT.OTHER
            self.cur_element = ELEMENT.OTHER
        else:
            self.cur_element = ELEMENT.OTHER

        self.line += 1

    def endElement(self, name):
        if name == "author" and self.ancestor == ELEMENT.INPROCEEDING:
            self.paper.authors.append(self.author)

        if ELEMENT.get_element(name) == ELEMENT.INPROCEEDING:
            self.ancestor = -1
            try:
                if self.paper.title == "" or self.paper.conference == "" or self.paper.year == 0:
                    print ("error in parsing " + self.paper.key)
                    print self.paper.title
                    print self.paper.conference
                    print self.paper.year
                    self.errors += 1
                    return
                
                # filter only important AI conference paper
                keywords = ['AAAI', 'CVPR', 'ICCV', 'ECCV', 'ICML', 'IJCAI', 'NIPS', 'ACL', 'COLT', 'EMNLP', 'ECAI', 'ICRA', 'ICCBR', 'COLING', 'KR', 'UAI', 'PPSN', 'ACCV', 'CoNLL', 'ICPR', 'BMVC', 'IROS', 'ACML', 'SIGMOD Conference', 'SIGMOD', 'KDD', 'SIGKDD', 'SIGIR', 'VLDB', 'ICDE', 'CIKM', 'PODS', 'PKDD', 'ECML/PKDD', 'ICDM', 'SDM', 'ICDT', 'CIDR', 'WSDM', 'ECIR', 'PAKDD']
                for keyword in keywords:
                    # if keyword in self.paper.title.lower() or keyword in self.paper.conference.lower():
                    if keyword.lower() in self.paper.conference.lower().split(" "):
                        self.write_paper(self.paper)
                        for t in self.paper.authors:
                            self.write_author(t, self.paper)
                        for c in self.paper.citations:
                            if c != "...":
                                self.write_citation(c, self.paper)
                        return

            except ValueError:
                print "error"

        elif ELEMENT.get_element(name) == ELEMENT.PROCEEDING:
            self.ancestor = -1
            try:
                if self.conf.name == "":
                    self.conf.name = self.conf.detail
                if self.conf.key == "" or self.conf.name == "" or self.conf.detail == "":
                    print "no conference: line ", self.line
                    return
                self.write_conf(self.conf)
            except ValueError:
                print "error"

    def write_conf(self, conf):
        DBLPContentHandler.conf_file.write(conf.key + "\t")
        DBLPContentHandler.conf_file.write(conf.name + "\t")
        DBLPContentHandler.conf_file.write(conf.detail + "\n")

    def write_paper(self, paper):
        # print paper.toString()
        DBLPContentHandler.inproc_file.write(paper.title + "\t")
        DBLPContentHandler.inproc_file.write(str(paper.year) + "\t")
        DBLPContentHandler.inproc_file.write(paper.conference + "\t")
        DBLPContentHandler.inproc_file.write(paper.key + "\n")

    def write_author(self, t, paper):
        DBLPContentHandler.author_file.write(t + "\t")
        DBLPContentHandler.author_file.write(paper.key + "\n")

    def write_citation(self, c, paper):
        DBLPContentHandler.cite_file.write(paper.key + "\t")
        DBLPContentHandler.cite_file.write(c + "\n")

    def characters(self, content):
        content = content.encode('utf-8').replace('\\', '\\\\').replace('\n', "").replace('\r', "")
        if self.ancestor == ELEMENT.INPROCEEDING:
            if self.cur_element == PAPER.AUTHOR:
                self.author += content
            elif self.cur_element == PAPER.CITE:
                if len(content) == 0:
                    return
                self.paper.citations.append(content)
            elif self.cur_element == PAPER.CONFERENCE:
                self.paper.conference += content
            elif self.cur_element == PAPER.TITLE:
                self.paper.title += content
            elif self.cur_element == PAPER.YEAR:
                if len(content) == 0:
                    return
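                try:
                    self.paper.year = int(content)
                except ValueError:
                    # Report the unparseable year text; converting it again
                    # with float() here could raise a second, uncaught ValueError.
                    print ("invalid year: " + repr(content))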
        elif self.ancestor == ELEMENT.PROCEEDING:
            if self.cur_element == CONFERENCE.CONFNAME:
                self.conf.name = content
            elif self.cur_element == CONFERENCE.CONFDETAIL:
                self.conf.detail = content

In [None]:
# to parse dblp; the with-block closes the input file once parsing ends
with open("dblp.xml") as source:
    xml.sax.parse(source, DBLPContentHandler())

From the raw DBLP dataset, we extract the publication title, year, authors and venue. We then filter the parsed data, keeping only papers from 30 hand-picked top-tier Artificial Intelligence conferences, e.g. ICML, NIPS, CVPR and SIGKDD. The filtered result is stored in 3 `.csv` files, which include 144,051 papers and 429,247 authors from 30 conferences.
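The venue filter can be illustrated in isolation. The following is a minimal sketch, assuming the same token-based matching as the handler above; the name `is_ai_venue` and the shortened keyword set `AI_VENUES` are hypothetical and only serve the example.

```python
# Hypothetical, shortened subset of the venue keywords used in the handler.
AI_VENUES = {"aaai", "cvpr", "icml", "nips", "kdd", "sigmod"}


def is_ai_venue(booktitle, venues=AI_VENUES):
    """Return True if any whitespace-separated token of the venue is a keyword."""
    tokens = booktitle.lower().split()
    return any(token in venues for token in tokens)


print(is_ai_venue("ICML"))               # exact venue name
print(is_ai_venue("SIGMOD Conference"))  # matched through the token "sigmod"
print(is_ai_venue("ICMLA"))              # different venue, no token matches
```

Because matching is done per token, a keyword that itself contains a space (such as `'SIGMOD Conference'`) can never match on its own; the single-word keyword `'SIGMOD'` is what catches that venue. Token matching also avoids false positives that substring matching would produce, such as `ICMLA` for `ICML`.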

### ArnetMiner Data Processing
We used an open-source parser to parse and format the ArnetMiner dataset[3]. It does most of the parsing work for us: it converts the raw dataset into multiple `.csv` files (as we did for DBLP), filters publications by year range and generates a formatted citation graph. On top of the parser's formatted and filtered output, we further selected the publications related to Artificial Intelligence based on their titles, and filled in each author's location from the affiliation data through the Google Maps API.

First, run
```shell
python aminer.py ParseAminerNetworkDataToCSV --local-scheduler
```
to parse the raw dataset into formatted `.csv` files.

Then run
```shell
python filtering.py FilterAllCSVRecordsToYearRange --start <start_year> --end <end_year> --local-scheduler
```
to filter the literature by year.

Finally, run
```shell
python build_graphs.py BuildAllGraphData --start <start_year> --end <end_year> --local-scheduler
```
to generate the formatted citation graph.

## Temporal Analysis of Artificial Intelligence Research

Artificial Intelligence is a broad research area that encompasses many topics, from learning methods, robotics, vision and language to problem solving, philosophy and even cognitive science. One interesting question is how AI research evolves over time. In this section, we present the results of our temporal analysis.

First, we studied the number of AI conference papers in DBLP for each year. AI research has clearly gained popularity since 1980, with a large jump after 2000.

``` TODO: 
insert image <AI Conference Paper Published for Each Year>
```
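The per-year paper count can be sketched as follows. This is a minimal illustration, assuming the tab-separated record layout written by `write_paper` in the parser (title, year, conference, key); the function name `papers_per_year` and the sample records are made up for the example.

```python
from collections import Counter


def papers_per_year(lines):
    """Count papers per year from records of the form title<TAB>year<TAB>conference<TAB>key."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed records
        try:
            counts[int(fields[1])] += 1
        except ValueError:
            continue  # skip records whose year is not an integer
    return counts


# Made-up sample records in the same layout as inproc.txt.
sample = [
    "Paper A\t1985\tAAAI\tconf/aaai/A85\n",
    "Paper B\t2000\tICML\tconf/icml/B00\n",
    "Paper C\t2000\tCVPR\tconf/cvpr/C00\n",
    "broken record without tabs\n",
]
counts = papers_per_year(sample)
for year in sorted(counts):
    print("%d\t%d" % (year, counts[year]))
```

On the real data, the same function could be fed the open `inproc.txt` file object instead of the sample list, since it only needs an iterable of lines.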
Next, we investigate how trending research topics change over time. By counting bigram frequencies in paper titles (ignoring stopwords), we identify the top 50 hot research topics of each year. We select 1985, 2000 and 2015 as three representative years and show word clouds of their hottest research topics below. The size of each keyword is proportional to how often it appears in paper titles.
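The bigram counting step can be sketched as below. This is an illustration under stated assumptions, not the project's actual code: the stopword list `STOPWORDS` is a small made-up sample (the report does not say which list was used), and the names `title_bigrams` and `top_bigrams` are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the list used in the project).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with", "using"}


def title_bigrams(title, stopwords=STOPWORDS):
    """Lowercase a title, drop stopwords, and return its adjacent word pairs."""
    words = [w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in stopwords]
    return list(zip(words, words[1:]))


def top_bigrams(titles, k=50):
    """Return the k most frequent bigrams over all titles as (bigram, count) pairs."""
    counts = Counter()
    for title in titles:
        counts.update(title_bigrams(title))
    return counts.most_common(k)


titles = [
    "Deep Neural Network Training",
    "A Neural Network for Vision",
    "Neural Network Pruning",
]
print(top_bigrams(titles, k=1))
```

One design choice to be aware of: stopwords are removed before pairing, so two words separated only by a stopword (for example "network for vision") are treated as adjacent. Running the per-year analysis would mean grouping titles by year first and calling `top_bigrams` on each group.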

``` TODO:
insert image <word cloud 1985>
```
In 1985, Expert Systems (logic programming, ai programs...), Natural Language (information retrieval) and Database Systems were clearly the three most influential subjects. None of today's common machine learning methods appear here.
``` TODO:
insert image <word cloud 2000>
```
By 2000, much of the hot research is related to Vision (Robots, Object Recognition) and Data Mining. Machine learning methods such as SVM, Bayesian Net, Hidden Markov and Neural Net are now among the hot topics.
``` TODO:
insert image <word cloud 2015>
```
Finally, consider 2015. The most trending topics are Neural Network, Social Media and Big Data. Advanced machine learning techniques such as deep learning, CNN and recurrent neural networks begin to appear among the hot topics. It is also worth noting that, with the extremely large volume of data generated today, research topics such as big data, dictionary learning and matrix factorization have become much more popular.



## Artificial Intelligence Literature Recommendation System

From our own experience of doing research and writing papers on unfamiliar topics, we often miss important papers when no one is guiding us. In this scenario, a system that recommends related papers worth reading would be extremely helpful. Therefore, last but not least, we designed and implemented a simple literature recommendation system. Given a research paper the user is focusing on, the system recommends the most related and influential papers.


We model the recommendation problem with Bayes' rule:
$$P(l_i|l_q) = \frac{P(l_q|l_i)\cdot P(l_i)}{P(l_q)}$$
where $l_i$ and $l_q$ denote the candidate literature and the query literature, respectively. Since $P(l_q)$ is constant across all $l_i$, the equation reduces to:
$$P(l_i|l_q) \propto P(l_q|l_i)\cdot P(l_i)$$
In practice, we define the likelihood $P(l_q|l_i)$ to be the topic similarity between $l_i$ and $l_q$, and the prior $P(l_i)$ to be the number of appearances of $l_i$ near $l_q$ in the citation graph $G_c$ (restricting the count to this neighborhood speeds up the calculation).


For topic similarity, we calculate the similarity of the two publications' titles and abstracts separately. The title similarity is defined as the number of bigrams that appear in both titles. We calculate the abstract similarity using Indri, a popular search engine.
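The title similarity can be sketched in a few lines. This is a minimal illustration, assuming that each shared bigram is counted once and that no stopwords are removed (the report does not specify either); the names `bigram_set` and `title_similarity` are hypothetical. The Indri-based abstract similarity is not shown.

```python
import re


def bigram_set(title):
    """Return the set of adjacent lowercase word pairs in a title."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return set(zip(words, words[1:]))


def title_similarity(title_a, title_b):
    """Number of distinct bigrams that appear in both titles."""
    return len(bigram_set(title_a) & bigram_set(title_b))


print(title_similarity("Deep Neural Network Training",
                       "Training a Deep Neural Network"))
print(title_similarity("Graph Mining", "Image Segmentation"))
```

In the first call, the two titles share the bigrams ("deep", "neural") and ("neural", "network"); in the second, they share none.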


For the reference score, we simply count the appearances of the papers that are within distance 2 of the query paper in the citation graph. This score captures a paper's popularity within the small cluster of papers around the query paper.
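One possible reading of this reference score is sketched below. It assumes that distance is measured ignoring citation direction, and that a paper's score is the number of nearby papers that cite it; both are assumptions, since the text leaves these details open. The graph is a plain dictionary mapping each paper to the papers it cites, and the names `neighborhood` and `reference_scores` are hypothetical.

```python
from collections import Counter, deque


def neighborhood(cites, query, max_dist=2):
    """Papers within max_dist hops of the query, ignoring citation direction."""
    # Build an undirected adjacency map from the directed citation lists.
    adj = {}
    for src, targets in cites.items():
        for dst in targets:
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)
    # Breadth-first search that stops expanding at max_dist.
    dist = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


def reference_scores(cites, query, max_dist=2):
    """For each nearby paper, count how many nearby papers cite it."""
    nearby = neighborhood(cites, query, max_dist)
    scores = Counter()
    for src in nearby:
        for dst in cites.get(src, ()):
            if dst in nearby and dst != query:
                scores[dst] += 1
    return scores


# Made-up citation graph: "Q" is the query paper.
cites = {"Q": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["D"], "E": ["D"]}
scores = reference_scores(cites, "Q")
print(scores.most_common())
```

In this toy graph, `C` is cited by both `A` and `B` and so receives the highest score, while `D` is three hops from `Q` and is not scored at all.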


Finally, we define the score of a candidate paper as:
$$S = S_{topic} + S_{reference}$$
The system then returns the 5 highest-scoring candidates as the recommended papers.
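The final ranking step can be sketched as follows, assuming the topic and reference scores have already been computed per candidate and are on comparable scales (the report does not describe any normalization). The function name `recommend`, the dictionary layout and the sample scores are made up for the example.

```python
def recommend(topic_scores, reference_scores, k=5):
    """Rank candidates by S = S_topic + S_reference and return the top k."""
    candidates = set(topic_scores) | set(reference_scores)
    total = {
        c: topic_scores.get(c, 0) + reference_scores.get(c, 0)
        for c in candidates
    }
    # Sort by descending score; ties are broken by paper id for determinism.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up scores for four candidate papers.
topic = {"A": 2, "B": 0, "C": 1}
reference = {"A": 1, "C": 2, "D": 1}
print(recommend(topic, reference, k=2))
```

A candidate that is missing from one of the two score tables is treated as having a score of 0 for that component, so papers found only through the citation graph or only through topic similarity can still be recommended.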





## References
[1] http://dblp.uni-trier.de/

[2] https://aminer.org/billboard/aminernetwork

[3] https://github.com/macks22/dblp