# Understanding the Community: Social Discovery of Artificial Intelligence Researchers and Literatures

Artificial Intelligence is one of the most influential subjects in computer science. In this project, we analyze scholars’ activities based on literature published in top-tier conferences to gain insights into the field.

In this project, we take advantage of public computer science literature datasets to analyze salient features of the Artificial Intelligence research area. Specifically, we analyze the popularity of different AI topics, the spatial distribution of AI researchers, and the communities of AI researchers. Finally, a literature recommendation system is designed and implemented to help researchers do their jobs better. This report starts by introducing the datasets we used; the analyses listed above are then discussed in turn.

## Data Selection and Processing
For this project, our data mainly came from two public datasets, DBLP and ArnetMiner. In addition, we scraped geographical information from Google Maps. In this section, we discuss how we parsed and used these sources.

### Data Selection
DBLP is a well-maintained computer science bibliography dataset. It contains 3,583,224 publications and 1,824,011 authors[1]. Although DBLP covers a huge number of computer science publications, many records lack well-maintained metadata, e.g. author affiliations and paper citations. Because of its wide coverage, we use DBLP to analyze topic trends. Using recent publications whose author lists are well maintained, we also gathered and formatted co-authorship information from DBLP for the community-finding analysis. Its two obvious drawbacks are that author information is not thoroughly maintained and that paper citation data is missing.

The other public dataset we use is ArnetMiner[2], which supplements DBLP and has well-maintained publication metadata, including citations and author metadata. The version we use has 2,092,356 publications, 1,712,433 authors and 8,024,869 citation relationships[2]. Thus, the ArnetMiner dataset is used for community finding based on citation relationships. For the analysis of authors’ geographical distribution, we also scraped the geographical coordinates of the authors’ affiliations to locate each author.
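As a rough illustration of the affiliation lookup, the sketch below builds a request for the Google Geocoding web service and reads the first coordinate pair out of its JSON response. That this particular Google Maps service was the one used is an assumption, and the function names and the `api_key` parameter are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Geocoding web service is the Google Maps API that was used.
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(affiliation, api_key):
    """Build the request URL for one affiliation string."""
    query = urlencode({"address": affiliation, "key": api_key})
    return GEOCODE_URL + "?" + query


def extract_lat_lng(payload):
    """Return (lat, lng) of the first match, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(affiliation, api_key):
    """Look up one affiliation; needs network access and a valid API key."""
    with urlopen(build_geocode_url(affiliation, api_key)) as response:
        return extract_lat_lng(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes it easy to cache responses and to re-run the analysis without repeating the lookups.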

### Data Processing
The DBLP dataset is distributed as one huge XML file that contains all paper records and their metadata. Loading the whole file into memory is impractically slow, if it is possible at all, so the dataset has to be parsed as a stream. We use SAX, an event-driven parser, to parse the DBLP dataset.

# TODO: add DBLP parsing code
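A minimal sketch of what such a streaming parser could look like with Python's built-in `xml.sax` module is given below. It is not the project's actual parser: the class name `DBLPHandler` is hypothetical, and it assumes that conference papers appear as `<inproceedings>` elements with `<author>`, `<title>`, `<year>` and `<booktitle>` children. Resolving the character entities declared in the DBLP DTD is not covered here.

```python
import io
import xml.sax


class DBLPHandler(xml.sax.ContentHandler):
    """Collect title, year, authors and venue of each conference paper."""

    FIELDS = {"title", "year", "author", "booktitle"}

    def __init__(self):
        super().__init__()
        self.papers = []     # finished records
        self._paper = None   # record being built, None outside a paper
        self._field = None   # field currently being read
        self._chunks = []    # text pieces of the current field

    def startElement(self, name, attrs):
        if name == "inproceedings":
            self._paper = {"key": attrs.get("key"), "authors": []}
        elif self._paper is not None and name in self.FIELDS:
            self._field = name
            self._chunks = []

    def characters(self, content):
        # SAX may deliver the text of one element in several pieces.
        if self._field is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if self._paper is None:
            return
        if name == self._field:
            text = "".join(self._chunks).strip()
            if name == "author":
                self._paper["authors"].append(text)
            else:
                self._paper[name] = text
            self._field = None
        elif name == "inproceedings":
            self.papers.append(self._paper)
            self._paper = None


# Tiny made-up sample; the real input would be the DBLP XML dump.
sample = b"""<dblp>
  <inproceedings key="conf/x/A00">
    <author>Ada Example</author>
    <author>Bob Example</author>
    <title>A Sample Title.</title>
    <year>2000</year>
    <booktitle>ICML</booktitle>
  </inproceedings>
  <article key="journals/x/B01"><title>Ignored.</title></article>
</dblp>"""

handler = DBLPHandler()
xml.sax.parse(io.BytesIO(sample), handler)
```

Because the handler only keeps the record currently being built, memory use stays small no matter how large the input file is. For the full dump, each finished record would be written out immediately instead of being appended to `papers`.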

From the raw DBLP dataset, we extract each publication's title, year, authors and venue. We then filter the parsed data, keeping only papers from 30 hand-picked top-tier Artificial Intelligence conferences, e.g. ICML, NIPS, CVPR and SIGKDD. The filtered result is stored in 3 `.csv` files, which include 144,051 papers and 429,247 authors from 30 conferences.
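The venue filter and CSV export could look roughly like the sketch below. The venue set is only a small illustrative subset, the exact venue strings in the data are an assumption, and the column names and function names are hypothetical.

```python
import csv
import io

# Illustrative subset only; the full list of 30 conferences is not given here.
AI_VENUES = {"ICML", "CVPR", "SIGKDD"}
PAPER_COLUMNS = ["id", "title", "year", "venue"]  # hypothetical column names


def filter_by_venue(papers, venues=AI_VENUES):
    """Keep only the records whose venue is in the allowed set."""
    return [p for p in papers if p.get("venue") in venues]


def write_papers_csv(papers, out):
    """Write the selected columns of each record to an open text file."""
    writer = csv.DictWriter(out, fieldnames=PAPER_COLUMNS)
    writer.writeheader()
    for paper in papers:
        writer.writerow({column: paper[column] for column in PAPER_COLUMNS})


# Made-up records for demonstration.
records = [
    {"id": 1, "title": "Paper A", "year": 2000, "venue": "ICML"},
    {"id": 2, "title": "Paper B", "year": 2001, "venue": "Some Workshop"},
]
buffer = io.StringIO()
write_papers_csv(filter_by_venue(records), buffer)
```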

### ArnetMiner Data Processing
We used an open-source parser to parse and format the ArnetMiner dataset[3]. It did most of the parsing work for us: it converts the raw dataset into multiple `.csv` files, as we did for DBLP, filters publications by a range of years, and generates a formatted citation graph. On the parser's formatted and filtered output, we further selected the publications related to Artificial Intelligence based on their titles, and filled in each author’s location from the affiliation data through the Google Maps API.

```shell
python aminer.py ParseAminerNetworkDataToCSV --local-scheduler
```
This parses the raw dataset into formatted `.csv` files.

Then filter the records to a range of years:
```shell
python filtering.py FilterAllCSVRecordsToYearRange --start <start_year> --end <end_year> --local-scheduler
```

Finally, generate the formatted citation graph:
```shell
python build_graphs.py BuildAllGraphData --start <start_year> --end <end_year> --local-scheduler
```
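The title-based selection of AI-related publications described above can be sketched as a simple keyword match. The keyword list below is hypothetical, since the report does not state which terms were used.

```python
import re

# Hypothetical keywords; the actual list used for filtering is not given.
AI_KEYWORDS = ("artificial intelligence", "machine learning", "neural network")

# One case-insensitive pattern that matches any of the keywords.
AI_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in AI_KEYWORDS), re.IGNORECASE
)


def is_ai_title(title):
    """Return True if the title mentions at least one AI keyword."""
    return AI_PATTERN.search(title) is not None
```

A plain substring match like this is easy to audit, but it will miss AI papers whose titles use none of the listed terms.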

## Temporal Analysis of Artificial Intelligence Research

Artificial Intelligence is a broad research area that encompasses many topics, from learning methods, robotics, vision and language to problem solving, philosophy and even cognitive science. One interesting question is how AI research has evolved over time. In this section, we present the results of our temporal analysis.

First, we studied the number of AI conference papers in DBLP for each year. AI research has clearly gained popularity since 1980, with a huge jump after 2000.

``` TODO: 
insert image <AI Conference Paper Published for Each Year>
```
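The per-year paper count is a simple aggregation. A minimal sketch, assuming each parsed record is a dictionary with a `year` field (the function name is hypothetical):

```python
from collections import Counter


def papers_per_year(papers):
    """Return (year, count) pairs sorted by year."""
    counts = Counter(int(paper["year"]) for paper in papers)
    return sorted(counts.items())
```

The resulting pairs can be passed directly to a plotting library to draw the yearly trend.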
Next, we investigate how trending research topics change over time. By counting bigram frequencies in paper titles (ignoring stopwords), we identify the top 50 hot research topics of each year. Here 1985, 2000 and 2015 are selected as three representative years, and word clouds of their hottest research topics are shown below. The size of each keyword is proportional to its frequency in paper titles.
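The bigram counting can be sketched as follows. The stopword list is a small hypothetical one, and removing stopwords before pairing adjacent words is one possible reading of "ignoring stopwords"; the project may have handled this differently.

```python
import re
from collections import Counter

# Small hypothetical stopword list; a fuller list would be used in practice.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "using", "with"}


def title_bigrams(title):
    """Lowercase the title, drop stopwords, and pair adjacent words."""
    words = [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS]
    return list(zip(words, words[1:]))


def top_bigrams(titles, n=50):
    """Return the n most frequent bigrams over all titles."""
    counts = Counter()
    for title in titles:
        counts.update(title_bigrams(title))
    return counts.most_common(n)
```

Running `top_bigrams` on the titles of a single year gives the frequency table from which that year's word cloud can be drawn.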

``` TODO:
insert image <word cloud 1985>
```
In 1985, Expert Systems (logic programming, AI programs...), Natural Language (information retrieval) and Database Systems were clearly the three most influential subjects. None of today's common machine learning methods appear here.
``` TODO:
insert image <word cloud 2000>
```
Turning to 2000, we see many hot research topics related to Vision (robots, object recognition) and Data Mining. Machine learning methods such as SVM, Bayesian Net, Hidden Markov and Neural Net are now part of the hot topic list.
``` TODO:
insert image <word cloud 2015>
```
Finally, let’s look at 2015. The most trending topics are Neural Network, Social Media and Big Data. Advanced machine learning techniques such as deep learning, CNN and recurrent neural networks began to appear among the hot topics. It is also worth noting that, with the extremely large volumes of data generated today, research topics such as big data, dictionary learning and matrix factorization have become much more popular.



## References
[1] http://dblp.uni-trier.de/

[2] https://aminer.org/billboard/aminernetwork

[3] https://github.com/macks22/dblp