## Introduction

This tutorial introduces some basic methods for processing publication data. The data comes from [DBLP](https://en.wikipedia.org/wiki/DBLP), a computer science bibliography website. In July 2016, DBLP listed more than 3.4 million journal articles, conference papers, and other computer science publications, up from about 14,000 in 1995.

We want to use this data to analyze co-author relationships between authors and reference relationships between publications. First, we preprocess the data and load it into a graph database, which is well suited to storing and analyzing relationships. Later, we could analyze questions such as which fields are most popular each year and which authors are most important in a given field.

### Tutorial content

In this tutorial, we show how to do some basic data processing and analysis in Python. We will use libraries such as [xml](https://docs.python.org/2/library/xml.html), [neo4j-driver](https://neo4j.com/developer/python/), [numpy](http://www.numpy.org/), [scipy](https://www.scipy.org/), [scikit-learn](http://scikit-learn.org/stable/) and others.

As mentioned in the introduction, we use data from DBLP. The data is provided in XML format, which is convenient to process.

## Installing the libraries

Before getting started, you'll need to install the libraries that we will use. We assume that you have already installed `python` and `pip`.

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is with `pip`:

    $ pip install -U scikit-learn
    
Another important library we need is the Neo4j Python driver, which can be installed the same way:

    $ pip install neo4j-driver

If you haven't installed numpy and scipy, check their websites for installation instructions for your platform. After everything is installed, make sure the following imports work for you:

In [1]:
import numpy as np
from xml.dom import minidom
from neo4j.v1 import GraphDatabase, basic_auth

## Loading data and preprocessing

Now that we've installed and loaded the libraries, let's load our publication data. The data is provided in XML format, in which each element has a tag and may have attributes.

Because the full DBLP dataset is too large to work with comfortably, we downloaded it and extracted a subset. You can download the data file `prepared_dblp.xml` from this link: https://drive.google.com/open?id=0B4I8jOssDdRDOGlHNzNrMXhvNkk.
If you are interested in the full dataset, you can also download `dblp.xml.gz` and `dblp.dtd` from the DBLP website: http://dblp.uni-trier.de/xml/. Unzip `dblp.xml.gz` to get `dblp.xml`, the file that contains the data we want to extract and analyze. The file `dblp.dtd` describes the data structure; look into it to see how the data is stored in the XML file.

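One way to cut a large XML file down to a manageable subset is to stream it rather than load it whole. The following is a minimal sketch, not the procedure used to build `prepared_dblp.xml`: it uses `xml.etree.ElementTree.iterparse` from the standard library, and the function name `take_first_articles` and the inline sample records are hypothetical. Note that the real `dblp.xml` relies on character entities declared in `dblp.dtd`, which this simple parser may not resolve, so extra handling would be needed there.

```python
import io
import xml.etree.ElementTree as ET


def take_first_articles(source, limit):
    """Stream an XML file and return the first `limit` <article> records."""
    records = []
    # "end" events fire once an element and all of its children have been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        records.append({
            "title": elem.findtext("title", default=""),
            "year": elem.findtext("year", default=""),
            "authors": [a.text for a in elem.findall("author")],
        })
        elem.clear()  # release the children of the finished record
        if len(records) >= limit:
            break
    return records


# Hypothetical sample data, only for illustration.
SAMPLE = b"""<dblp>
<article><author>A. Author</author><title>First paper.</title><year>1996</year></article>
<article><author>B. Author</author><author>C. Author</author><title>Second paper.</title><year>1997</year></article>
<article><author>D. Author</author><title>Third paper.</title><year>1998</year></article>
</dblp>"""

subset = take_first_articles(io.BytesIO(SAMPLE), limit=2)
print([record["title"] for record in subset])
```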

### XML Parsing
Next, move the file `prepared_dblp.xml` to the same folder as this notebook, and you can then load the data using the xml library.
The code is in the following cell. With the help of `getElementsByTagName`, we can get all elements that have a given tag name as a list. The details of the xml library are in this [link](https://docs.python.org/2/library/xml.dom.html#module-xml.dom). The main functions we use in this tutorial are `getElementsByTagName`, which comes from the library, and `getValueFromLine`, a helper defined in the cell.

For example, in this tutorial we only care about elements with 'article' as the tag name, so we use 'article' to extract all the article elements. We then implement two functions, `get_count_of_origin_xml` and `get_article_detail`:
`get_count_of_origin_xml`: returns the total article count in the original XML.
`get_article_detail`: takes an article element from the XML and returns its title, year, url, pages, volume, journal and ee information. It relies on the helper function `getValueFromLine`.

In [12]:
def get_count_of_origin_xml(xmldoc):
    articlelist = xmldoc.getElementsByTagName('article')
    return len(articlelist)

def get_article_detail(element):
    title = getValueFromLine(element, "title")
    year = getValueFromLine(element, "year")
    url = getValueFromLine(element, "url")
    pages = getValueFromLine(element, "pages")
    volume = getValueFromLine(element, "volume")
    journal = getValueFromLine(element, "journal")
    ee = getValueFromLine(element, "ee")
    return title, year, url, pages, volume, journal, ee

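def getValueFromLine(s, key):
    # Return the text of the first <key> element inside s, or "" if there is none.
    nodes = s.getElementsByTagName(key)
    if len(nodes) == 0:
        return ""
    # Strip double quotes and backslashes so the value can be embedded in a Cypher string literal.
    return nodes[0].childNodes[0].nodeValue.replace('"', '').replace('\\', '')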


In [13]:
# AUTOLAB_COMMENT_START
xmldoc = minidom.parse('prepared_dblp.xml')

count = get_count_of_origin_xml(xmldoc)
print count

articles = xmldoc.getElementsByTagName('article')
detail = get_article_detail(articles[0])
print detail
# AUTOLAB_COMMENT_STOP

5015
(u'Parallel Integer Sorting and Simulation Amongst CRCW Models.', u'1996', u'db/journals/acta/acta33.html#Saxena96', u'607-619', u'33', u'Acta Inf.', u'http://dx.doi.org/10.1007/BF03036466')

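To see what these helpers operate on without opening the full file, the sketch below parses a small inline record with `minidom.parseString`. The field values are taken from the output above; the author name is a placeholder, and `first_text` is a hypothetical helper that mirrors `getValueFromLine` without the quote stripping and that also returns an empty string for an empty element.

```python
from xml.dom import minidom

# Field values copied from the printed record above; the author is a placeholder.
SNIPPET = """<dblp>
<article>
<author>Example Author</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
</article>
</dblp>"""


def first_text(element, tag):
    """Return the text of the first <tag> descendant, or "" if it is absent or empty."""
    nodes = element.getElementsByTagName(tag)
    if not nodes or not nodes[0].childNodes:
        return ""
    return nodes[0].childNodes[0].nodeValue


sample_doc = minidom.parseString(SNIPPET)
sample_articles = sample_doc.getElementsByTagName("article")
first = sample_articles[0]

print(len(sample_articles))
print(first_text(first, "year"))
# The snippet has no <ee> element, so the helper falls back to "".
print(first_text(first, "ee") == "")
```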

### Neo4j connection

Next we need to set up the Neo4j database. You can download and install Neo4j from here: https://neo4j.com/download/
We recommend the community edition, which is powerful enough to handle our data.
After installing Neo4j, run the Neo4j application and select an empty folder as the storage location for the database. Click the "start" button, and after a few seconds the database is running. By default, you can access the database at http://localhost:7474/browser/. The first time you connect, the system will ask you to set the user name and password. Once everything is set up, you can connect to the database with the code in the next cell.

If you get the error `ProtocolError: Server certificate does not match known certificate for 'localhost'; check details in file '/Users/[your name]/.neo4j/known_hosts'`, please delete the file `/Users/[your name]/.neo4j/known_hosts` and run the code again.

In [14]:
def get_neo4j_session(user_name, password):
    driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth(user_name, password))
    session = driver.session()
    return session

neo4j_user_name = "neo4j"  # Use your user name of Neo4j
neo4j_password = "123"     # Use your password of Neo4j

neo4j_session = get_neo4j_session(neo4j_user_name, neo4j_password)
print neo4j_session

<neo4j.v1.session.Session object at 0x123637a50>


If you see the session printed out, the connection has been established successfully. The next step is to extract the information from the XML file and insert the data into the Neo4j database.

### Neo4j Cypher
After parsing the data from the XML file and connecting to the Neo4j database, the next step is to store all the data in the database. A Cypher tutorial is here: https://neo4j.com/developer/cypher-query-language/. Cypher is a powerful language for managing a Neo4j database, but in this tutorial we mainly use its `CREATE` and `MATCH` clauses. To learn and practice Cypher, we can first try some examples that use it to manage data in the Neo4j database.


In [45]:
def reset(session):
    cypher = 'MATCH (n) DETACH DELETE n'
    session.run(cypher)

def insert_author(session, author_name):
    cypher = 'CREATE (a:Author {name:\"' + author_name + '\"})'
    session.run(cypher)

def insert_paper(session, title, year, url, pages, volume, journal, ee):
    cypher = 'CREATE (p:Paper {title:\"' + title +\
    '\", year:\"' + year +\
    '\", url:\"' + url +\
    '\", pages:\"' + pages +\
    '\", volume:\"' + volume +\
    '\", journal:\"' + journal +\
    '\", ee:\"' + ee +\
    '\"})'
    session.run(cypher)
    
def get_author_existed(session, author_name):
    cypher = 'MATCH (a:Author) WHERE a.name = \"' + author_name + '\" RETURN count(a) as count'
    result = session.run(cypher)
    count = False
    for record in result:
        count = record["count"] != 0
    return count

def get_paper_detail(session, paper_title):
    cypher = 'MATCH (p:Paper) WHERE p.title = \"' + paper_title + '\" RETURN p'
    result = session.run(cypher)
    detail = None  # stays None when no paper has this title
    for record in result:
        detail = record
    return detail

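The helpers above build Cypher by string concatenation, which breaks when a value contains a double quote. An alternative is to pass values as query parameters, since `session.run` accepts a parameter dictionary as its second argument. The sketch below assumes a Neo4j version that supports the `$name` parameter syntax; `insert_author_with_params` and `RecordingSession` are hypothetical names, and `RecordingSession` is only a stand-in that lets the example run without a database.

```python
class RecordingSession(object):
    """Stand-in for a Neo4j session that records calls instead of executing them."""

    def __init__(self):
        self.calls = []

    def run(self, cypher, parameters=None):
        self.calls.append((cypher, parameters))


def insert_author_with_params(session, author_name):
    # The value travels separately from the query text, so quotes need no escaping.
    session.run('CREATE (a:Author {name: $name})', {'name': author_name})


fake_session = RecordingSession()
insert_author_with_params(fake_session, 'Author "Quoted" Name')
print(fake_session.calls[0])
```

With a real connection, the same function could be called with `neo4j_session` in place of the stand-in.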

In [47]:
# AUTOLAB_COMMENT_START

# First clean the data in database
reset(neo4j_session)

print get_author_existed(neo4j_session, "Richard Lee")
insert_author(neo4j_session, "Richard Lee")
print get_author_existed(neo4j_session, "Richard Lee")


insert_paper(neo4j_session, 'title', 'year', 'url', 'pages', 'volume', 'journal', 'ee')
detail = get_paper_detail(neo4j_session, 'title')
print detail

# AUTOLAB_COMMENT_STOP

False
True
<Record p=<Node id=45 labels=set([u'Paper']) properties={u'title': u'title', u'url': u'url', u'journal': u'journal', u'volume': u'volume', u'ee': u'ee', u'year': u'year', u'pages': u'pages'}>>


### Merge all things together!

OK! Now we know how to parse the data from the XML file and how to insert data into the Neo4j database. Let's put the two together: load 3000 valid article elements from `prepared_dblp.xml`, extract the useful data from them, and store it in the Neo4j database.

In [48]:
def load_3000_into_neo4j(session, article_list, v = True):
    session.run('MATCH (n) DETACH DELETE n')
    i = 0
    for s in article_list:
        if i == 3000:
            break
        try:
            title, year, url, pages, volume, journal, ee = get_article_detail(s)

            session.run('CREATE (p:Paper {title:\"' + title +
                        '\", year:\"' + year +
                        '\", url:\"' + url +
                        '\", pages:\"' + pages +
                        '\", volume:\"' + volume +
                        '\", journal:\"' + journal +
                        '\", ee:\"' + ee +
                        '\"})')

            authors = s.getElementsByTagName('author')
            for author in authors:
                authorName = author.childNodes[0].nodeValue
                result = session.run('MATCH (a:Author) WHERE a.name = \"' + authorName + '\" RETURN count(a) as count')
                for record in result:
                    if record["count"] == 0:
                        session.run('CREATE (a:Author {name:\"' + authorName + '\"})')
                session.run('Match (p:Paper {title:\"' + title +
                            '\"}), (a:Author {name:\"' + authorName +
                            '\"}) CREATE (a)-[:AUTHORS]->(p)')
            i += 1
            if v and i % 100 == 0:
                print i
        except Exception:
            # Skip articles that cannot be parsed or inserted; they do not count toward the 3000.
            pass

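The loader above issues several queries per article. As a possible alternative, not the approach used in this tutorial, the sketch below builds one parameterized statement per article: `UNWIND` iterates over the author list and `MERGE` creates an `Author` node only if no node with that name exists yet. The function name `build_article_statement` is hypothetical, the `$param` syntax assumes a Neo4j version that supports it, and the statement has not been run against the tutorial's dataset.

```python
ARTICLE_CYPHER = (
    'CREATE (p:Paper {title: $title, year: $year, url: $url, pages: $pages, '
    'volume: $volume, journal: $journal, ee: $ee}) '
    'WITH p '
    'UNWIND $authors AS author_name '
    'MERGE (a:Author {name: author_name}) '
    'CREATE (a)-[:AUTHORS]->(p)'
)

FIELDS = ('title', 'year', 'url', 'pages', 'volume', 'journal', 'ee')


def build_article_statement(detail, author_names):
    """Pair the Cypher text with a parameter dict for one article.

    `detail` is the 7-tuple returned by get_article_detail.
    """
    params = dict(zip(FIELDS, detail))
    params['authors'] = list(author_names)
    return ARTICLE_CYPHER, params


# Placeholder values, only for illustration.
cypher, params = build_article_statement(
    ('title', 'year', 'url', 'pages', 'volume', 'journal', 'ee'),
    ['First Author', 'Second Author'],
)
print(params['authors'])
```

The resulting pair could then be passed to `session.run(cypher, params)`.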

In [49]:
# AUTOLAB_COMMENT_START

# First clean the data in database
reset(neo4j_session)

# Load the article data
load_3000_into_neo4j(neo4j_session, articles)
result = neo4j_session.run('MATCH (a:Paper) RETURN count(a) as count')
for record in result:
    print "Papers in neo4j:", record['count']
    
# AUTOLAB_COMMENT_STOP

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
Papers in neo4j: 3000


That's all for preprocessing and loading the data into a graph database. Next, we can start some analysis based on the data and its relationships.

## Analysis on data
Since we have already imported the data into the database, we can run some simple queries to learn about it, especially about the relationships between paper and author nodes. These relationships are what interest us most in this tutorial, which is why we picked Neo4j: it is good at storing both data and the relationships between data items.
As set up earlier, there are currently two kinds of nodes in the database, `Author` and `Paper`, and the relationship between them is named `AUTHORS`, meaning that a certain author `AUTHORS` a certain paper. Based on this, we can do some simple analysis, such as finding the most popular authors, measured by the number of papers they have authored.

In [71]:
def get_most_k_popular_author(session, k):
    cypher = 'MATCH (a:Author)-[:AUTHORS]->(p:Paper) RETURN a.name as name, COUNT(p) as count ORDER BY COUNT(p) DESC LIMIT ' + str(int(k))
    result = session.run(cypher)
    authors = []
    for record in result:
        authors.append((record['name'], record['count']))
    return authors


In [72]:
# AUTOLAB_COMMENT_START
authors = get_most_k_popular_author(neo4j_session, 10)
print authors
# AUTOLAB_COMMENT_STOP

[(u'Grzegorz Rozenberg', 25), (u'Joost Engelfriet', 16), (u'Melvyn L. Smith', 15), (u'Lyndon N. Smith', 15), (u'Zhenbo Deng', 14), (u'Weiming Shen', 13), (u'Kary Frauml;mling', 13), (u'Walter Vogler', 12), (u'Wim H. Hesselink', 11), (u'Ladjel Bellatreche', 11)]

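To make clear what the Cypher query computes, the same aggregation can be written in plain Python over a list of (author, paper) pairs, one pair per `AUTHORS` relationship. This is only an illustrative sketch: the function name `top_k_authors` and the sample pairs are hypothetical, and ties are broken alphabetically here, whereas the Cypher query does not specify a tie order.

```python
from collections import Counter


def top_k_authors(author_paper_pairs, k):
    """Return the k authors with the most papers as (name, count) tuples."""
    counts = Counter(author for author, _paper in author_paper_pairs)
    # Sort by descending count, then by name so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical relationships, only for illustration.
pairs = [
    ('Author A', 'Paper 1'),
    ('Author A', 'Paper 2'),
    ('Author B', 'Paper 2'),
    ('Author C', 'Paper 3'),
    ('Author A', 'Paper 3'),
]
print(top_k_authors(pairs, 2))
```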


## Summary and Improvement
Now we know how to parse data from XML and how to import it into a graph database, where we can make better use of node and relationship information in our analysis.
In the future, we may use this to do deeper analysis, such as disambiguating authors who share the same name, or identifying the main research focus in a given period.