<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#7a96ea; border:0' role="tab" aria-controls="home"><center><font color='white'>Quick Navigation</font></center></h3>

* [1. Data Cleaning](#0)
* [2. Exploring Importance of Nodes in XML Tree](#1)
* [3. Summary](#2)

In [86]:
import pandas as pd
import xml.etree.ElementTree as ET

from collections import Counter
from collections import defaultdict
import os

<a id="0"></a>
<h2 style='background:#7a96ea; border:0; color:white'><center><font color = 'white'>1. Data Cleaning</font></center></h2>

From our EDA, we found that some of the XML files have a depth greater than 3. This implies that within a node 3 tag (inside an `article`, for example) there is yet another node to be found. We will investigate the importance of these sub-nodes and perform some manual cleaning as we go along on the following files:

1. Douglas Leslie Maskell.xml
2. Kwoh Chee Keong.xml
3. Lam Kwok Yan.xml
4. Li Fang Flora.xml
5. Li Mo.xml
6. Liu Weichen.xml
7. Liu Zhiwei.xml
8. Lu Shijian.xml
9. Luo Jun.xml
10. Mohamed M. Sabry  .xml
11. Ong Yew Soon.xml
12. Quek Hiok Chai.xml
13. Sourav S Bhowmick.xml
14. Sun Aixin.xml
15. Sun Chengzheng.xml
16. Tang Xueyan.xml
17. Thambipillai Srikanthan.xml
18. Wen Yonggang.xml
19. Zhang Hanwang.xml
20. Zhang Tianwei.xml
21. Zhao Jun.xml
22. Zheng Jianmin.xml
23. He Ying.xml
24. Arvind Easwaran.xml
25. Anupam Chattopadhyay.xml


In [193]:
# This finds how deep the tree goes for every XML file.
# Ideally they should all stop at a depth of 3.
# Manual cleaning is done to ensure that no important information is lost should
# the XML tree go beyond depth 3.
nodes = {}

for f in os.listdir('../xml/'):
    # There is a folder named "Problematic". Ignore it.
    if f == "Problematic":
        continue
    nodes[f] = defaultdict(set)
    # Dig how deep the tree goes. No need for recursion as we know
    # from EDA it won't go beyond depth 4.
    tree = ET.parse(os.path.join('../xml/', f))
    root = tree.getroot()
    for node1 in root:
        # Record every node 1 tag, but only descend into <r> (publication records).
        nodes[f]['node1'].add(node1.tag)
        if node1.tag != "r":
            continue
        for node2 in node1:
            nodes[f]['node2'].add(node2.tag)
            for node3 in node2:
                nodes[f]['node3'].add(node3.tag)
                for node4 in node3:
                    nodes[f]['node4'].add(node4.tag)
                    for node5 in node4:
                        nodes[f]['node5'].add(node5.tag)
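The nested loops above hard-code the maximum depth. As a rough alternative sketch (not the approach used in this notebook), a small recursive helper can collect the tags at every depth without knowing the depth in advance. The helper name `tags_by_depth` and the sample document are made up for illustration, and unlike the cell above this version descends into every node 1 tag, not only `r`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def tags_by_depth(elem, depth=1, found=None):
    """Collect the set of tags seen at each depth below `elem` (hypothetical helper)."""
    if found is None:
        found = defaultdict(set)
    for child in elem:
        found[depth].add(child.tag)
        tags_by_depth(child, depth + 1, found)
    return found


# Tiny made-up document shaped like a DBLP person export.
sample = ET.fromstring(
    '<dblpperson name="A" pid="1" n="1">'
    '<r><article key="k1"><title>On <i>X</i></title><year>2020</year></article></r>'
    '</dblpperson>'
)
depths = tags_by_depth(sample)
max_depth = max(depths)  # deepest level at which any tag was found
```

Any file whose `max_depth` exceeds 3 would then be a candidate for manual inspection.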


We cleaned the XML files after discovering that tags beyond node 3 were used purely for formatting and carry no relevant information. These tags were removed.
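The cleaning here was done manually. For reference, a programmatic version could look roughly like the sketch below. It assumes the deeper tags are inline formatting such as `<i>` or `<sub>` inside a `title` (the exact tags are an assumption, as is the helper name `flatten_children`), and it keeps their text while dropping the tags themselves.

```python
import xml.etree.ElementTree as ET


def flatten_children(elem):
    """Replace an element's sub-tags with their plain text, in place (hypothetical helper)."""
    text = "".join(elem.itertext())  # text of elem and all descendants, in document order
    for child in list(elem):
        elem.remove(child)
    elem.text = text


# Made-up node 2 entry whose title contains formatting tags.
entry = ET.fromstring(
    '<article key="k1"><title>H<sub>2</sub>O in <i>vitro</i>.</title></article>'
)
for field in entry:      # node 3 elements
    if len(field):       # has children, i.e. the tree goes to depth 4 or more
        flatten_children(field)

title_text = entry.find("title").text
```

Keeping the text matters because dropping a formatting tag together with its content would silently change the title.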

[Back to Top](#top)
<a id="1"></a>
<h2 style='background:#7a96ea; border:0; color:white'><center><font color = 'white'>2. Exploring Importance of Nodes in XML Tree</font></center></h2>

Before we begin exploring, we need to make sure that the `key` of every node 2 entry is unique within each XML file.

Understandably, once we stack the XML files, some rows may share the same key as a result of collaboration between NTU professors.

In [178]:
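# Compare the number of unique keys with the count `n` declared in the root node.
# A file name is printed only on a mismatch. From the results we know the keys are all unique.
for f in os.listdir('../xml/'):
    try:
        tree = ET.parse(os.path.join('../xml/', f))
        root = tree.getroot()
        n_elements = int(root.attrib['n'])
        n_found = set()
        for node1 in root:
            if node1.tag == "r":
                for node2 in node1:
                    n_found.add(node2.attrib['key'])
        if n_elements != len(n_found):
            print(f)
    except (ET.ParseError, OSError, KeyError, ValueError):
        # Skip anything that cannot be parsed (e.g. the "Problematic" folder).
        continue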
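The check above looks at one file at a time. Once files are stacked, the same key can legitimately appear in several files when faculty members co-author a paper. A minimal sketch of how such shared keys could be located is shown below; the file names, keys and the `files_by_key` mapping are invented for illustration, and the XML is supplied as strings so the example does not depend on the `../xml/` folder.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two made-up person files that share one co-authored paper.
files = {
    "Prof A.xml": (
        '<dblpperson name="Prof A" pid="1" n="2">'
        '<r><article key="j/x1"/></r><r><inproceedings key="c/y1"/></r>'
        '</dblpperson>'
    ),
    "Prof B.xml": (
        '<dblpperson name="Prof B" pid="2" n="1">'
        '<r><inproceedings key="c/y1"/></r>'
        '</dblpperson>'
    ),
}

# Map every publication key to the set of files it appears in.
files_by_key = defaultdict(set)
for fname, xml_text in files.items():
    root = ET.fromstring(xml_text)
    for pub in root.findall("./r/*"):  # node 2 entries under each <r>
        files_by_key[pub.attrib["key"]].add(fname)

# Keys found in more than one file correspond to collaborations.
shared = {k: sorted(v) for k, v in files_by_key.items() if len(v) > 1}
```

These shared keys are the rows that would need de-duplicating after stacking, and they are also the raw signal for co-authorship between faculty members.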

Having confirmed that the keys in each XML file are unique, we next investigate whether it is worth considering node types that do not appear in all XML files (e.g. `proceedings`, `incollection`). The project requirement states:
> Here we measure research collaboration as
co-authorship among faculty members in **scientific papers/articles**

In [179]:
depth1 = []
depth2 = []
depth3 = []
for k in nodes.keys():
    depth1 += (list(nodes[k]['node1']))
    depth2 += (list(nodes[k]['node2']))
    depth3 += (list(nodes[k]['node3']))
    
depth1 = Counter(depth1)
depth2 = Counter(depth2)
depth3 = Counter(depth3)

print("Let's look for nodes which do not appear in all xml files. From our EDA we know that the\
 following professors have very few publications:\n1. Tan Kheng Leong\n2. Loke Yuan Ren\n3.\
 Oh Hong Lye\n4. Tay Kian Boon (0 results found on DBLP)\nAs such,\
 any nodes with >81 occurrences are OK.")

print("\nNode 1:")
for k, v in depth1.items():
    print("{}: {}".format(k,v))
print()

print("Node 2:")
for k, v in depth2.items():
    print("{}: {}".format(k,v))
print()

print("Node 3:")
for k, v in depth3.items():
    print("{}: {}".format(k,v))

Let's look for nodes which do not appear in all xml files. From our EDA we know that the following professors have very few publications:
1. Tan Kheng Leong
2. Loke Yuan Ren
3. Oh Hong Lye
4. Tay Kian Boon (0 results found on DBLP)
As such, any nodes with >81 occurrences are OK.

Node 1:
r: 84
person: 84
coauthors: 84
homonyms: 16

Node 2:
inproceedings: 83
article: 82
proceedings: 34
incollection: 31
book: 12
phdthesis: 10

Node 3:
author: 84
ee: 84
journal: 82
number: 82
booktitle: 83
crossref: 83
url: 84
pages: 84
year: 84
volume: 82
title: 84
publisher: 37
series: 33
isbn: 35
editor: 36
school: 10
note: 15
cite: 3
cdrom: 4


In [197]:
# find XML files with less frequent node 2 tags.
check = False
for k in nodes.keys():
    if 'book' in nodes[k]['node2']:
        check = True
        print("XML Files with <book>\t\t", k)
        
    if 'phdthesis' in nodes[k]['node2']:
        check = True
        print("XML Files with <phdthesis>\t", k)
    if check:
        print()
        check = False

XML Files with <book>		 Arijit Khan.xml

XML Files with <phdthesis>	 Boyang Li.xml

XML Files with <phdthesis>	 Cai Wentong.xml

XML Files with <book>		 Dusit Niyato.xml

XML Files with <phdthesis>	 Erik Cambria.xml

XML Files with <phdthesis>	 Kong Wai-Kin Adams.xml

XML Files with <phdthesis>	 Lam Kwok Yan.xml

XML Files with <book>		 Lin Weisi.xml

XML Files with <book>		 Loy Chen Change.xml
XML Files with <phdthesis>	 Loy Chen Change.xml

XML Files with <phdthesis>	 Mahardhika Pratama.xml

XML Files with <book>		 Ng Wee Keong.xml

XML Files with <book>		 Ong Yew Soon.xml
XML Files with <phdthesis>	 Ong Yew Soon.xml

XML Files with <phdthesis>	 Quek Hiok Chai.xml

XML Files with <book>		 Sourav S Bhowmick.xml

XML Files with <book>		 Sun Aixin.xml

XML Files with <book>		 Sun Chengzheng.xml

XML Files with <book>		 Tan Ah Hwee.xml

XML Files with <book>		 Wen Yonggang.xml
XML Files with <phdthesis>	 Wen Yonggang.xml

XML Files with <book>		 Yu Han.xml



In [169]:
# Find XML files with less frequent node 3 tags.
check = False
for k in nodes.keys():
    if 'school' in nodes[k]['node3']:
        check = True
        print("XML Files with <school>\t\t", k)
        
    if 'note' in nodes[k]['node3']:
        check = True
        print("XML Files with <note>\t\t", k)
        
    if 'cite' in nodes[k]['node3']:
        check = True
        print("XML Files with <cite>\t\t", k)
        
    if 'cdrom' in nodes[k]['node3']:
        check = True
        print("XML Files with <cdrom>\t\t", k)

    if check:
        print()
        check = False

XML Files with <school>		 Boyang Li.xml
XML Files with <note>		 Boyang Li.xml

XML Files with <school>		 Cai Wentong.xml
XML Files with <note>		 Cai Wentong.xml

XML Files with <school>		 Erik Cambria.xml
XML Files with <note>		 Erik Cambria.xml

XML Files with <school>		 Kong Wai-Kin Adams.xml
XML Files with <note>		 Kong Wai-Kin Adams.xml

XML Files with <school>		 Lam Kwok Yan.xml
XML Files with <note>		 Lam Kwok Yan.xml

XML Files with <school>		 Loy Chen Change.xml
XML Files with <note>		 Loy Chen Change.xml

XML Files with <school>		 Mahardhika Pratama.xml
XML Files with <note>		 Mahardhika Pratama.xml

XML Files with <note>		 Miao Chunyan.xml

XML Files with <cite>		 Ng Wee Keong.xml
XML Files with <cdrom>		 Ng Wee Keong.xml

XML Files with <school>		 Ong Yew Soon.xml
XML Files with <note>		 Ong Yew Soon.xml

XML Files with <school>		 Quek Hiok Chai.xml
XML Files with <note>		 Quek Hiok Chai.xml

XML Files with <note>		 Shen Zhiqi.xml

XML Files with <cite>		 Sourav S Bhowmick.x

### First, we'll handle node 1.
* `r`: Every publication record is wrapped in an `r` tag.
* `person`: This information is already covered in the root node. Discard.
* `coauthors`: Lists every coauthor the author has worked with, but only by name and pid. It includes neither the number of papers co-authored nor the year, so it is not useful for our use case.
* `homonyms`: Other authors who share the same name as the faculty member in question. Not needed for our purposes.

### Next, we'll handle node 2. Based on our read-up, the following can be explained:

**`mdate` vs `year`: `mdate` refers to the date the record was last updated, and `year` refers to the year of publication.**

* [`Proceedings`](https://tex.stackexchange.com/questions/516802/what-are-differences-amongst-conference-proceedings-and-inproceedings#:~:text=You%20use%20proceedings%20when%20you,not%20to%20a%20single%20article.&text=inproceedings%20on%20the%20other%20hand,single%20written%2Dup%20conference%20talk.) and `Inproceedings` are related. `Proceedings` refers to the whole collection of articles from a conference, while `Inproceedings` refers to a single article within such a collection.

* [`Incollection`](https://www.bibtex.com/t/template-incollection/) refers to an article in a collection.

* `Book`, `Article` and `PhD Thesis` are self-explanatory.

To conclude, every node 2 tag will be treated as an article.
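As an illustrative sketch of this decision, every node 2 entry can be collected into one list regardless of its tag, keeping the tag only as a `type` field. The set name `PUBLICATION_TAGS`, the sample keys and the dates below are invented for the example. It also shows `mdate` (an attribute of the entry) sitting next to `year` (a child tag).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The six node 2 tags observed in the XML files.
PUBLICATION_TAGS = {
    "article", "inproceedings", "proceedings",
    "incollection", "book", "phdthesis",
}

# Made-up person file with three different publication types.
sample = ET.fromstring(
    '<dblpperson name="Prof A" pid="00/1" n="3">'
    '<r><article key="journals/x/A19" mdate="2021-05-01"><year>2019</year></article></r>'
    '<r><inproceedings key="conf/y/A20" mdate="2020-11-30"><year>2020</year></inproceedings></r>'
    '<r><phdthesis key="phd/A10" mdate="2017-02-14"><year>2010</year></phdthesis></r>'
    '<coauthors n="0"/>'
    '</dblpperson>'
)

# One row per publication; the tag is kept only as a label.
rows = [
    {
        "key": pub.attrib["key"],
        "type": pub.tag,
        "mdate": pub.attrib.get("mdate"),    # last update of the record
        "year": int(pub.findtext("year")),   # year of publication
    }
    for pub in sample.findall("./r/*")
    if pub.tag in PUBLICATION_TAGS
]
type_counts = Counter(row["type"] for row in rows)
```

Keeping `type` as a column leaves the option open to filter out, say, `phdthesis` rows later without re-parsing the files.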

### Finally, we will handle node 3. We will determine which tags are necessary.

Necessary Tags:
1. `author`: Authors of an `article` or `inproceedings`.
5. `booktitle`: The conference the paper was published or presented in.
9. `year`: The year of publication. It is noted that EVERY node 2 entry has a year tag.
11. `title`: The title of the paper. We will include this in the dataframe for robustness when checking that the keys are consistent across different XML files. It is noted that EVERY node 2 entry has a title tag.
15. `editor`: Same as `author`, but for other publication types. Mutually exclusive with `author`.
18. `cite`: Citations. Present in only 3/85 XML files, which is too few to be helpful, although it would have been nice for comparing how often other papers are cited. We will include it anyway. (`cite` contains either a key to an inproceedings/article etc. or '...')

Unnecessary Tags:

2. `ee`: The link to the published paper.
3. `journal`: The journal that the paper is published in.
4. `number`: The number related to the journal.
6. `crossref`: Superset of the inproceeding/proceeding key value (i.e. where the overall conference was).
8. `pages`: Which pages of the journal etc. the paper was published in.
10. `volume`: Which volume of the journal etc. the paper was published in.
12. `publisher`: Superset of a conference.
13. `series`: Similar to volume.
14. `isbn`: Additional info for proceedings - not present in all.
16. `school`: Corresponds to PhD thesis entries.
17. `note`: Not very useful. Occurs rarely, and is mostly (but not always) used for notes on schools.
19. `cdrom`: Self-explanatory. Not helpful.




[Back to Top](#top)
<a id="2"></a>
<h2 style='background:#7a96ea; border:0; color:white'><center><font color = 'white'>3. Summary</font></center></h2>

### We will take the following from the XML files.
* [x] Denotes a many-to-one relationship.

### **Root Node:**

* Take the attributes `name`, `pid` and `n` from the root node (`dblpperson`)

### **Node 1:**
* N.A.

### **Node 2:**

Take the `key` attribute of any of the following tags:
* `proceedings`
* `inproceedings`
* `incollection`
* `article`
* `book`
* `phdthesis`

### **Node 3:**

Take the key attribute and the text from the following:
1. [x] Author (key = pid)
2. [x] Editor (key = pid)

Take text from the following:

3. Booktitle 
4. Year
5. Title: the title of the paper. Exists in every entry and, like the key of a node 2 tag, identifies the paper.
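Putting the summary together, the extraction could be sketched roughly as follows. This is an illustration under assumptions, not the final pipeline: the function name `extract_records`, the output field names and the sample XML are all made up, and the sample assumes `author` and `editor` carry a `pid` attribute as stated above. The resulting list of dicts could be handed to `pd.DataFrame` afterwards.

```python
import xml.etree.ElementTree as ET

PUBLICATION_TAGS = {
    "article", "inproceedings", "proceedings",
    "incollection", "book", "phdthesis",
}


def extract_records(root):
    """Return (person attributes, list of publication dicts) for one person file."""
    person = {k: root.attrib.get(k) for k in ("name", "pid", "n")}
    records = []
    for pub in root.findall("./r/*"):            # node 2 entries
        if pub.tag not in PUBLICATION_TAGS:
            continue
        title = pub.find("title")
        records.append({
            "faculty_pid": person["pid"],
            "key": pub.attrib.get("key"),
            "type": pub.tag,
            # Many-to-one fields: keep (pid, name) pairs.
            "authors": [(a.attrib.get("pid"), a.text) for a in pub.findall("author")],
            "editors": [(e.attrib.get("pid"), e.text) for e in pub.findall("editor")],
            "booktitle": pub.findtext("booktitle"),
            "year": pub.findtext("year"),
            "title": "".join(title.itertext()) if title is not None else None,
        })
    return person, records


# Made-up person file with one paper and one edited proceedings volume.
sample = ET.fromstring(
    '<dblpperson name="Prof A" pid="00/1" n="2">'
    '<r><inproceedings key="conf/y/AB20">'
    '<author pid="00/1">Prof A</author><author pid="00/2">Prof B</author>'
    '<title>A Made-Up Title.</title><booktitle>CONF</booktitle><year>2020</year>'
    '</inproceedings></r>'
    '<r><proceedings key="conf/y/2020">'
    '<editor pid="00/1">Prof A</editor>'
    '<title>Proceedings of CONF.</title><booktitle>CONF</booktitle><year>2020</year>'
    '</proceedings></r>'
    '<coauthors n="1"/>'
    '</dblpperson>'
)
person, records = extract_records(sample)
```

Storing authors and editors as lists of `(pid, name)` pairs keeps one row per publication; exploding them into one row per author would be a later step if an edge list is needed.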