# Author Name Disambiguation Experiment

+ Nickname on **biendata.com**: **Dracarys**
+ Rank: 4 (by 2020/02/22)

## Summary Report

---
### introduction
This project focuses on author name disambiguation in academic publications. The goal is to distinguish authors who share the same name by using attributes of their published papers, such as title, co-authors, organization, and keywords.

### method
The workflow of this project is straightforward: select suitable paper attributes to construct feature vectors, then cluster them.  

 + **data analysis and feature selection**: 
In the given data set of papers, several attributes are provided as strings: title, co-authors and their organizations, keywords, abstract, venue, publication year, etc. Any of these properties may provide useful information for separating authors with the same name. In this project, a subset of these attributes is selected to construct the feature vectors. 
 
 + **feature vectorization**: 
In this experiment, the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weighting scheme is used to convert the features into numerical vectors. TF-IDF is widely used to highlight the terms that are distinctive of each document in a collection, which fits this project well: each paper's feature string plays the role of a document. 

 + **clustering**: 
Because the number of distinct authors sharing the same name is not known in advance, the unsupervised clustering method [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) is used. This algorithm does not require the user to set the number of clusters, which makes it suitable for this experiment.
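
The vectorization and clustering steps can be sketched on toy data as follows. The feature strings below are invented for illustration, and `eps` is left at the scikit-learn default; only `metric="cosine"` and `min_samples=3` are taken from the code later in this notebook.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy feature strings (invented): three papers for each of two people
# who share the name "li_wei".
feature_strings = [
    "li_wei tsinghua university graph neural network",
    "li_wei tsinghua university graph neural network embedding",
    "li_wei tsinghua university graph network",
    "li_wei peking hospital protein folding",
    "li_wei peking hospital protein folding structure",
    "li_wei peking hospital protein structure",
]

# One TF-IDF row per paper; the shared name gets a low weight because
# it occurs in every feature string.
tfidf = TfidfVectorizer().fit_transform(feature_strings)

# DBSCAN infers the number of clusters from the density of the data.
labels = DBSCAN(metric="cosine", min_samples=3).fit_predict(tfidf)

print(tfidf.shape)
print(labels)
```

Papers that share co-authors, organizations, and vocabulary end up close in cosine distance and receive the same label, while the two groups stay apart.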

### experiment

Data preprocessing is necessary for this project. Some attributes are missing, and some are not in English. Author names appear in different formats: **"-", "_", "." or " "** are all used as separators. Organization names sometimes appear abbreviated and sometimes in full. All of these problems need to be addressed, so several preprocessing techniques are used to clean the data. Most of the data-cleaning functions come from these [shared notebooks](https://biendata.com/models/index_category/97/). 
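
As a minimal sketch of the name clean-up described above, the helpers below map the different separators to one canonical form and treat reversed name order as the same name. `normalize_name` and `same_author_name` are hypothetical names used only for this illustration; the functions actually used are listed in the utils cell of the Code section.

```python
import re


def normalize_name(name: str) -> str:
    """Hypothetical helper: lower-case a name and unify its separators."""
    name = name.lower().strip()
    name = name.replace("-", "")          # "xiao-ming" -> "xiaoming"
    name = re.sub(r"[\s._]+", "_", name)  # spaces, dots, underscores -> "_"
    return name.strip("_")


def same_author_name(a: str, b: str) -> bool:
    """Treat 'given family' and 'family given' orderings as the same name."""
    parts_a = normalize_name(a).split("_")
    parts_b = normalize_name(b).split("_")
    return parts_a == parts_b or parts_a == parts_b[::-1]


variants = ["Xiao-Ming Li", "xiaoming_li", "Xiaoming.Li", "Li Xiaoming"]
print([normalize_name(v) for v in variants])
# ['xiaoming_li', 'xiaoming_li', 'xiaoming_li', 'li_xiaoming']
print(all(same_author_name(variants[0], v) for v in variants))
# True
```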

The cleaned attributes (co-author, organization, title, keywords) are then concatenated into one long feature string per paper. All of these feature strings are added to a list, and the list is then converted to numerical feature vectors. The feature vectors are fed into DBSCAN to generate the clustering results.
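
The concatenation step might look roughly like the sketch below. The record layout (an `authors` list with `name` and `org`, plus `title`, `abstract`, `venue`, and `keywords`) follows the fields accessed in the code later in this notebook; `build_feature_string` and the sample paper are hypothetical and only illustrate the idea.

```python
def build_feature_string(paper, fields=("authors", "title", "keywords")):
    """Hypothetical helper: join the selected attributes of one paper."""
    parts = []
    if "authors" in fields:
        for author in paper.get("authors") or []:
            parts.append(author.get("name") or "")
            parts.append(author.get("org") or "")
    for field in ("title", "abstract", "venue"):
        if field in fields:
            parts.append(paper.get(field) or "")  # tolerate missing or None
    if "keywords" in fields:
        parts.extend(paper.get("keywords") or [])
    return " ".join(p.strip().lower() for p in parts if p and p.strip())


# Invented sample record with a missing abstract and a missing org.
paper = {
    "id": "p1",
    "title": "Graph Embedding",
    "authors": [
        {"name": "Li Wei", "org": "Tsinghua University"},
        {"name": "Ann Smith"},
    ],
    "keywords": ["graph", "embedding"],
    "abstract": None,
}

print(build_feature_string(paper))
# li wei tsinghua university ann smith graph embedding graph embedding
```

Passing a different `fields` tuple is one way to try the attribute combinations mentioned in the result section.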

### result
In the experiment we tried different combinations of the paper attributes (title, co-authors and their organizations, keywords, abstract, venue, publication year), and the final scores were around 0.25 to 0.3. The best combination was co-authors, their organizations, and the abstract. 

### summary
Due to limited knowledge of NLP, the results of this experiment are not very good overall. New methods remain to be tested.

### reference:
[shared notebook](https://biendata.com/models/category/3000/L_notebook/)



## Code

### environment setup

In [66]:
import numpy as np
import json
import re
import string
from collections import defaultdict
from pprint import pprint
from tqdm import tqdm




import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer





### global variables

In [26]:
TRAIN_AUTHOR = "./data/train/train_author.json"
TRAIN_PUB = "./data/train/train_pub.json"
SNA_VALID_AUTHOR_RAW = "./data/sna_data/sna_valid_author_raw.json"
SNA_VALID_EXAMPLE = "./data/sna_data/sna_valid_example_evaluation_scratch.json"
SNA_VALID_PUB = "./data/sna_data/sna_valid_pub.json"


# NOTE: load_json is defined in the "utils" cell below; run that cell first.
train_author_data = load_json(TRAIN_AUTHOR)
train_pub_data = load_json(TRAIN_PUB)
test_author_data = load_json(SNA_VALID_AUTHOR_RAW)
test_pub_data = load_json(SNA_VALID_PUB)
example_output = load_json(SNA_VALID_EXAMPLE)

### utils

In [68]:
def load_json(file_path):
    with open(file_path, "r") as f:
        return json.load(f)

def preprocess_name(name):   
    name = name.lower().replace(' ', '_')
    name = name.replace('.', '_')
    name = name.replace('-', '')
    name = re.sub(r"_{2,}", "_", name) 
    return name


def preproces_org(org):
    if org != "":
        org = org.replace('Sch.', 'School')
        org = org.replace('Dept.', 'Department')
        org = org.replace('Coll.', 'College')
        org = org.replace('Inst.', 'Institute')
        org = org.replace('Univ.', 'University')
        org = org.replace('Lab ', 'Laboratory ')
        org = org.replace('Lab.', 'Laboratory')
        org = org.replace('Natl.', 'National')
        org = org.replace('Comp.', 'Computer')
        org = org.replace('Sci.', 'Science')
        org = org.replace('Tech.', 'Technology')
        org = org.replace('Technol.', 'Technology')
        org = org.replace('Elec.', 'Electronic')
        org = org.replace('Engr.', 'Engineering')
        org = org.replace('Aca.', 'Academy')
        org = org.replace('Syst.', 'Systems')
        org = org.replace('Eng.', 'Engineering')
        org = org.replace('Res.', 'Research')
        org = org.replace('Appl.', 'Applied')
        org = org.replace('Chem.', 'Chemistry')
        org = org.replace('Prep.', 'Petrochemical')
        org = org.replace('Phys.', 'Physics')
        org = org.replace('Mech.', 'Mechanics')
        org = org.replace('Mat.', 'Material')
        org = org.replace('Cent.', 'Center')
        org = org.replace('Ctr.', 'Center')
        org = org.replace('Behav.', 'Behavior')
        org = org.replace('Atom.', 'Atomic')
        #org = org.split(';')[0]
        org = ' '.join(org.split(';'))
        org = org.lower()
    return org


def remove_stopwords(content):
    # Drop English stop words and single-character tokens, then lemmatize the rest.
    use_stopwords = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    content = [lemmatizer.lemmatize(word) for word in content.split()
                   if word not in use_stopwords and len(word) > 1]
    return ' '.join(content)

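def remove_non_english_words(content, english_words=None):
    # ENGLISH_WORDS was never defined in this notebook. As an assumption, fall back
    # to NLTK's "words" corpus (requires nltk.download('words')).
    if english_words is None:
        english_words = set(w.lower() for w in nltk.corpus.words.words())
    return " ".join(word for word in nltk.wordpunct_tokenize(content)
         if word.lower() in english_words or not word.isalpha())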


def remove_non_printable_words(content):
    result = ""
    printable_set = set(string.printable)
    for c in content:
        if c in printable_set:
            result += c
    
    return result

    
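# Replace separators and punctuation (ASCII and full-width) with single spaces.
def etl(content):
    # Raw string avoids invalid-escape warnings; the pattern itself is unchanged.
    content = re.sub(r"[\s+\.\!\/,;$%^*(+\"\')]+|[+——()?【】“”！，。？、~@#￥%……&*（）]+", " ", content)
    content = re.sub(r" {2,}", " ", content)
    return content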

def get_org(co_authors, author_name):
    for au in co_authors:
        name = preprocess_name(au['name'])
        name = name.split('_')
        if ('_'.join(name) == author_name or '_'.join(name[::-1]) == author_name) and 'org' in au:
            return au['org']
    return ''
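
The abbreviation expansion in `preproces_org` can also be written in a table-driven style. The sketch below is an alternative for illustration, not the code used in the experiment: `ORG_ABBREVIATIONS` and `expand_org` are hypothetical names, the table holds only a few of the abbreviations listed above, and, unlike the original, repeated whitespace is collapsed.

```python
# A few of the abbreviations handled above; extend the table as needed.
ORG_ABBREVIATIONS = {
    "Univ.": "University",
    "Dept.": "Department",
    "Inst.": "Institute",
    "Lab.": "Laboratory",
    "Comp.": "Computer",
    "Sci.": "Science",
    "Technol.": "Technology",
    "Tech.": "Technology",
}


def expand_org(org: str) -> str:
    """Hypothetical helper: expand abbreviations, drop ';', lower-case."""
    for short, full in ORG_ABBREVIATIONS.items():
        org = org.replace(short, full)
    # Treat ';' as a separator and collapse repeated whitespace.
    return " ".join(org.replace(";", " ").split()).lower()


print(expand_org("Dept. of Comp. Sci.; Tsinghua Univ."))
# department of computer science tsinghua university
```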
    

### disambiguation: TF-IDF 

In [69]:
def disambiguate():
    result_dict = {}
    
    for author in tqdm(test_author_data.keys()):
        #print(f"{author}, ", end = "")
        
        feature_string_list = []
        papers = test_author_data[author]
        if len(papers) == 0:
            result_dict[author] = []
            continue
        
        #print("paper count: ", len(papers))
        paper_dict = {}
        for paper in papers:
           
            feature_list = []
            
            for k in test_pub_data[paper]["authors"]:
                if "name" in k:
                    feature_list.append(preprocess_name(k["name"]))
                
                if "org" in k:
                    feature_list.append(preproces_org(k["org"]))                    

                    
            # Optional features, disabled in this run:
            # if test_pub_data[paper].get("venue"):
            #     feature_list.append(test_pub_data[paper]["venue"])
            # if test_pub_data[paper].get("title"):
            #     feature_list.append(test_pub_data[paper]["title"].lower())
            
            if "abstract" in test_pub_data[paper]:
                feature_list.append(test_pub_data[paper]["abstract"].lower())            
            
            # Optional feature, disabled in this run:
            # if "keywords" in test_pub_data[paper]:
            #     feature_list += test_pub_data[paper]["keywords"]
            
            feature_string = etl(" ".join(feature_list)).lower()
            #print(feature_string)
            feature_string = remove_stopwords(feature_string)
            #print("before:\n", feature_string)
            feature_string = remove_non_printable_words(feature_string)
            
            feature_string_list.append(feature_string)
            
        
        #print("feature string list length:", len(feature_string_list))
        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(feature_string_list)
       
        clf = DBSCAN(metric="cosine", min_samples = 3)
        clf.fit(tfidf)  # cluster labels are read from clf.labels_ below
      
        
        for label, paper in zip(clf.labels_, papers):
            if str(label) not in paper_dict:
                paper_dict[str(label)] = [test_pub_data[paper]["id"]]
            else:
                paper_dict[str(label)].append(test_pub_data[paper]["id"])
            
        #pprint(paper_dict)
        result_dict[author] = list(paper_dict.values())
    
    with open("./result/result_0223.json", "w", encoding='utf-8') as f:
        json.dump(result_dict, f, indent=4)
    print("result saved.")
                
               
                    
                
disambiguate()

100%|██████████| 50/50 [01:52<00:00,  2.25s/it]

result saved.
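
One detail of the grouping loop in `disambiguate()` is worth noting: scikit-learn's DBSCAN assigns the label `-1` to noise points, so all papers labelled as noise end up together in a single group. The sketch below shows one possible variation that keeps each noise paper as its own group. It was not used to produce the score reported above, and whether it helps is untested; `group_by_label` and the sample labels are hypothetical.

```python
from collections import defaultdict


def group_by_label(labels, paper_ids, split_noise=True):
    """Group paper ids by DBSCAN label; optionally keep noise points apart."""
    clusters = defaultdict(list)
    singletons = []
    for label, paper_id in zip(labels, paper_ids):
        if split_noise and label == -1:
            singletons.append([paper_id])  # each noise paper forms its own group
        else:
            clusters[label].append(paper_id)
    return list(clusters.values()) + singletons


# Invented labels: two clusters (0 and 1) and two noise points (-1).
labels = [0, 0, -1, 1, -1, 1]
paper_ids = ["p1", "p2", "p3", "p4", "p5", "p6"]

print(group_by_label(labels, paper_ids))
# [['p1', 'p2'], ['p4', 'p6'], ['p3'], ['p5']]
print(group_by_label(labels, paper_ids, split_noise=False))
# [['p1', 'p2'], ['p3', 'p5'], ['p4', 'p6']]
```

With `split_noise=False` the function reproduces the grouping behaviour of the loop above.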



