# Introduction

In this problem, you will get started with parsing, storing, and querying raw XML data. The data is DBLP publication metadata (attached in the handout), which we use to analyse co-authorship and other kinds of relationships between scientists and authors.

Because we are mainly interested in relationships, a relational database such as MySQL would work, but searches across multi-level relationships get complicated: many tables have to be joined to answer a single query. To make such searches faster and cheaper, we introduce the graph database here.

A graph database differs from a relational database: its schemaless design suits the analysis of large connected datasets, and its focus shifts from the data items themselves to the connections between them. Data relationships drive many of today's intelligent applications, and a graph database is built to exploit those connections.

To get started with graph databases, we use __Neo4j__. Its data model is as general as that of an RDBMS, it suits agile development, and it is fast at querying connected data. To query the graph we use __Cypher__, a pattern-matching query language for graphs.

Scoring for this section: 5pts for parsing the XML data, 10pts for storing the metadata in Neo4j, and 50pts for querying the database with 5 different queries.

## Q0 XML Parsing (5pt)

The raw data we provide is around 3000 papers crawled from the DBLP dataset. We provide the XML file together with a DTD file that defines the XML schema. An example of the format is shown below:

```
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
...
</dblp>
```

To simplify the data structure used in this project and keep only the valuable information, we only need to parse `title`, `year`, `authors`, `pages`, and `volume`, and store them in a dictionary that is returned for each article.

Note: one article may have multiple authors, so the value for the key `authors` should be a list of all the author names.

We will use the `xml.dom` package for Python to parse the XML file.
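As a minimal sketch of the target output, the snippet below parses a shortened copy of the sample article from an in-memory string and builds the dictionary described above. The names `SAMPLE`, `text_of`, and `sample_record` are illustrative assumptions, not part of the handout.

```python
from xml.dom import minidom

# Shortened copy of the sample article shown above (illustrative only).
SAMPLE = """<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
</article>
</dblp>"""


def text_of(node, tag):
    """Return the text of the first <tag> element under node, or '' if absent or empty."""
    found = node.getElementsByTagName(tag)
    if found and found[0].firstChild is not None:
        return found[0].firstChild.nodeValue or ''
    return ''


doc = minidom.parseString(SAMPLE)
article = doc.getElementsByTagName('article')[0]

sample_record = {
    'title': text_of(article, 'title'),
    'pages': text_of(article, 'pages'),
    'year': text_of(article, 'year'),
    'volume': text_of(article, 'volume'),
    # one list entry per <author> element
    'authors': [a.firstChild.nodeValue
                for a in article.getElementsByTagName('author')],
}
print(sample_record)
```

All values are kept as strings, including `year` and `volume`, which matches how the fields are stored later.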

In [8]:
from xml.dom import minidom

In [9]:
def parse_xml(item):
    """
    Return the dictionary of parsed raw data

    Args:
        item (Element): one single article element parsed from the xml dataset

    Returns:
        record (dict): A dictionary of parsed raw data.
    """
    def first_text(tag):
        # text of the first <tag> element, or '' when it is missing or empty
        nodes = item.getElementsByTagName(tag)
        if nodes and nodes[0].firstChild is not None:
            return nodes[0].firstChild.nodeValue or ''
        return ''

    record = {}
    # normalise quotes and drop backslashes so the title is safe inside a quoted query string
    record['title'] = first_text('title').replace('"', '\'').replace('\\', '')
    record['pages'] = first_text('pages')
    record['year'] = first_text('year')
    record['volume'] = first_text('volume')
    # an article may have several authors; skip empty <author/> elements
    record['authors'] = [a.firstChild.nodeValue
                         for a in item.getElementsByTagName('author')
                         if a.firstChild is not None]
    return record

## Q1 Store to Neo4j (10pt)

Neo4j has 4 major components:

- __Nodes__: entities, e.g. an article or an author
- __Relationships__: connect entities and structure the domain, e.g. connect an author to an article
- __Properties__: attributes and metadata, e.g. the title, pages, etc. of one article
- __Labels__: group nodes by role, e.g. this assignment has two roles (Article and Author)

To store the whole network, we need to work on three parts:
1. create article entities with their attributes
2. create author entities with their attributes
3. create relationships between articles and authors
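The three parts can be sketched as Cypher statements paired with parameter dictionaries. The helper `build_statements` below is hypothetical (it is not part of the handout) and only builds the statements without contacting a database. It assumes `$name`-style query parameters, which Neo4j 3.0 and later accept, and it uses `MERGE` so that an author who appears on several articles maps to a single node.

```python
def build_statements(record):
    """Return (cypher, params) pairs for one parsed article (hypothetical helper)."""
    statements = [(
        # part 1: the article node with its attributes
        "CREATE (:Article {title: $title, pages: $pages, "
        "year: $year, volume: $volume})",
        {k: record[k] for k in ("title", "pages", "year", "volume")},
    )]
    for name in record["authors"]:
        # part 2: one Author node per distinct name (MERGE creates it only if missing)
        statements.append(("MERGE (:Author {name: $name})", {"name": name}))
        # part 3: the WRITE relationship from the author to the article
        statements.append((
            "MATCH (a:Author {name: $name}), (p:Article {title: $title}) "
            "CREATE (a)-[:WRITE]->(p)",
            {"name": name, "title": record["title"]},
        ))
    return statements


# Made-up record shaped like the output of parse_xml.
example = {"title": "T", "pages": "1-2", "year": "1996", "volume": "33",
           "authors": ["A", "B"]}
for cypher, params in build_statements(example):
    print(cypher)
    print(params)
```

Each pair could then be passed to a driver session as `session.run(cypher, params)`. The author node has to exist before the `MATCH` in part 3 runs, otherwise no relationship is created.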

To use Neo4j, first download it to your own machine from https://neo4j.com/product/
After installing, open Neo4j and choose a folder to store the database, then click start to open the UI. On the default page you will need to reset the __username__ and __password__; keep them in mind because we will use them later.

The database is now set up; all that is left is to use the API to import the data.

In [10]:
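# Requires the Neo4j Python driver (e.g. `pip install neo4j`).
# Driver 1.x exposed these names under neo4j.v1; newer releases import from neo4j.
from neo4j import GraphDatabase, basic_auth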

In [None]:
def get_driver():
    # connect to neo4j
    driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("neo4j", "1234"))
    return driver

def store_neo4j(driver, xml_path):
    '''
    Store all of the metadata into neo4j, using parse_xml function to parse the metadata
    
    Args:
        driver: Neo4j driver returned by get_driver()
        xml_path (String): dblp xml file path
    '''
    xmldoc = minidom.parse(xml_path)
    # get all of the articles
    xmllist = xmldoc.getElementsByTagName('article')
    session = driver.session()
    
    for i in xmllist:
        res = parse_xml(i)
        try:
            # store article entity; query parameters avoid quoting problems
            session.run(
                'create (:Article {title: $title, pages: $pages, '
                'year: $year, volume: $volume})',
                {'title': res['title'], 'pages': res['pages'],
                 'year': res['year'], 'volume': res['volume']})
            # store each author once (merge) and link the author to the article
            for a in res['authors']:
                session.run(
                    'match (p:Article {title: $title}) '
                    'merge (a:Author {name: $name}) '
                    'create (a)-[:WRITE]->(p)',
                    {'title': res['title'], 'name': a})
        except Exception as e:
            # report the failure instead of silently skipping the article
            print('skipped article: %s' % e)
    session.close()

In [None]:
## Test
driver = get_driver()
store_neo4j(driver, 'dblp.xml')

## Q2 Query the Graph Database (50pt)

Now that the DBLP data is loaded into Neo4j, we can run different kinds of queries on it. Since we stored the relationships between authors and articles along with their metadata, our queries mostly focus on relationships between entities, filtered by attributes.

The query language is Cypher, a declarative, SQL-inspired language for describing patterns in graphs visually using an ASCII-art syntax. A detailed tutorial introducing Cypher is available at https://neo4j.com/developer/cypher-query-language/

The query results depend on the metadata imported into Neo4j, so make sure the storage step has completed correctly. Do not run queries before __store_neo4j__ has finished.


### Query 1 (10pt)
Given an author name A, list all of his/her co-authors. This is a simple warm-up query on a single field: using the author name and the relationship type __WRITE__, you can easily find the result.

Note: A person B may have co-authored multiple papers with A, but B counts as only one co-author.
Let's suppose A is "Walter Vogler".

The result should contain all of the distinct co-author nodes.
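One way to sanity-check the Neo4j result is to compute the same answer in plain Python over the parsed records. The sketch below assumes `records` is a list of dictionaries shaped like the output of `parse_xml`; the function name `coauthors_of` and the sample data are made up for illustration.

```python
def coauthors_of(records, name):
    """Distinct co-authors of `name` across records shaped like parse_xml output."""
    found = set()
    for record in records:
        if name in record["authors"]:
            # everyone else on the same article is a co-author
            found.update(a for a in record["authors"] if a != name)
    return sorted(found)


# Made-up records; only the 'authors' field matters here.
records = [
    {"title": "P1", "authors": ["A", "B"]},
    {"title": "P2", "authors": ["A", "B", "C"]},
    {"title": "P3", "authors": ["D"]},
]
print(coauthors_of(records, "A"))  # ['B', 'C']
```

Using a set mirrors the `distinct` requirement: B shares two papers with A but is listed once.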

### Query 2 (10pt)
Given an author name, list all of his/her publications with their detailed publication information. This query matches on entity attributes; returning a list of nodes is enough.

Note: There are no duplicate author names. Let's suppose A is "Walter Vogler".

### Query 3 (10pt)
Given two author names, find out whether they have ever co-authored any papers and, if so, return the details. This query shows the benefit of a graph database: using the __WRITE__ relationship you can find the result with a one-line query, whereas in a relational database you would have to write nested queries and join tables, which is costly.

Note: The author names are "Lars Jenner" and "Walter Vogler"

### Query 4 (10pt)
Given some keywords, list all publications whose title contains some or all of the keywords. In a relational database the __LIKE__ keyword does this; in a graph database you have to find an alternative.

Note: For example, the keywords are "Pattern" and "Step"
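The intended matching rule (a title qualifies if it contains at least one keyword) can be stated in plain Python as a reference for checking the database result. This is a sketch under the assumption that matching is case-sensitive and literal; `titles_matching` and the sample titles are hypothetical.

```python
def titles_matching(records, keywords):
    """Titles that contain at least one of the keywords (case-sensitive substring match)."""
    return [r["title"] for r in records
            if any(k in r["title"] for k in keywords)]


# Made-up titles for illustration.
records = [
    {"title": "Pattern Matching in Graphs"},
    {"title": "Step Semantics for Nets"},
    {"title": "Parallel Integer Sorting"},
]
print(titles_matching(records, ["Pattern", "Step"]))
# ['Pattern Matching in Graphs', 'Step Semantics for Nets']
```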

### Query 5 (10pt)
Find all the authors who have published 20 or more articles in this DBLP dataset. Query all of the authors and their published articles, and use the aggregate function count to total each author's articles and check whether the total meets the requirement.

Note: The result should return all of the distinct author nodes.
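The counting step can be cross-checked outside the database with `collections.Counter`. The sketch below assumes `records` is a list of dictionaries shaped like the output of `parse_xml`; `prolific_authors` is a hypothetical name, and the threshold is a parameter so the tiny sample can use 2 instead of 20.

```python
from collections import Counter


def prolific_authors(records, minimum=20):
    """Authors who appear on at least `minimum` articles."""
    counts = Counter()
    for record in records:
        # set() counts an article once per author even if a name is repeated
        counts.update(set(record["authors"]))
    return sorted(name for name, n in counts.items() if n >= minimum)


# Made-up records; only the 'authors' field matters here.
records = [
    {"title": "P1", "authors": ["A", "B"]},
    {"title": "P2", "authors": ["A", "C"]},
    {"title": "P3", "authors": ["A", "B"]},
]
print(prolific_authors(records, minimum=2))  # ['A', 'B']
```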

### Return results
All five queries should return their result set as a list containing all of the result nodes,
e.g.
```
[<Node id=139 labels=set([u'Author']) properties={u'name': u'Christian Stahl'}>, <Node id=229 labels=set([u'Author']) properties={u'name': u'Christian Stahl'}>, <Node id=319 labels=set([u'Author']) properties={u'name': u'Christian Stahl'}>, <Node id=910 labels=set([u'Author']) properties={u'name': u'Christian Stahl'}>, <Node id=2402 labels=set([u'Author']) properties={u'name': u'Christian Stahl'}>, ...]
```

In [None]:
def query1(driver, name):
    '''
    Args:
        driver: A driver is used to connect to a Neo4j server. 
                It provides sessions that are used to execute statements and retrieve results.
        name: String of specified author name.
    Returns: List of result nodes.
    '''
    session = driver.session()
    # $name is a query parameter, so the given author is excluded rather than a hard-coded one
    result = session.run("""
        match (author: Author {name: $name})-[:WRITE]->(p: Article)<-[:WRITE]-(coauthor: Author)
        where coauthor.name <> $name
        return distinct coauthor
    """, {'name': name})
    res = []
    for record in result:
        res.append(record['coauthor'])
    session.close()
    return res

def query2(driver, name):
    '''
    Args:
        driver: A driver is used to connect to a Neo4j server. 
                It provides sessions that are used to execute statements and retrieve results.
        name: String of specified author name.
    Returns: List of result nodes.
    '''
    session = driver.session()
    # pass the name as a query parameter so names containing quotes do not break the query
    result = session.run("""
        match (author: Author {name: $name})-[:WRITE]->(p: Article) return p
    """, {'name': name})
    res = []
    for record in result:
        res.append(record['p'])
    session.close()
    return res

def query3(driver, name1, name2):
    '''
    Args:
        driver: A driver is used to connect to a Neo4j server. 
                It provides sessions that are used to execute statements and retrieve results.
        name1: String of specified first author name.
        name2: String of specified second author name.
    Returns: List of result nodes.
    '''
    session = driver.session()
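    # both names are passed as query parameters instead of being formatted into the string
    result = session.run("""
        match (a: Author {name: $name1})-[:WRITE]->
        (p: Article)
        <-[:WRITE]-(b: Author {name: $name2})
        return p
    """, {'name1': name1, 'name2': name2})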
    res = []
    for record in result:
        res.append(record['p'])
    session.close()
    return res

def query4(driver, keywords):
    '''
    Args:
        driver: A driver is used to connect to a Neo4j server. 
                It provides sessions that are used to execute statements and retrieve results.
        keywords: list of keywords.
    Returns: List of result nodes.
    '''
    session = driver.session()
    # literal, case-sensitive substring match on the title;
    # the keyword list is passed as a query parameter, so an empty list simply returns nothing
    result = session.run("""
        match (p: Article)
        where any(k in $keywords where p.title contains k)
        return p
    """, {'keywords': keywords})
    res = []
    for record in result:
        res.append(record['p'])
    session.close()
    return res

def query5(driver):
    '''
    Args:
        driver: A driver is used to connect to a Neo4j server. 
                It provides sessions that are used to execute statements and retrieve results.
    Returns: List of result nodes.
    '''
    session = driver.session()
    result = session.run("""
        match (a:Author)-[:WRITE]->(p) with a, count(DISTINCT p) as c where c >= 20
        return a
    """)
    res = []
    for record in result:
        res.append(record['a'])
    session.close()
    return res

In [None]:
print(query1(driver, "Walter Vogler"))
print(query2(driver, "Walter Vogler"))
print(query3(driver, "Walter Vogler", "Alex Kondratyev"))
print(query4(driver, ["Pattern", "Step"]))
print(query5(driver))