# Not On Doras

This notebook details the method used to find publications that are missing from Doras. To find them, I used the Levenshtein-based similarity ratio from the fuzzywuzzy package together with regular expressions. Most researchers use the default threshold of 90, although I had to set specific thresholds for certain researchers. Some rogue publications scored above the threshold but were still dissimilar from their nearest Doras title, so I hardcoded them into a CSV file called mismatches_not_on_doras. This may be a problem for future iterations. To try to catch the rogue publications, I also tried other methods, namely the cosine similarity ratio using word n-grams, character n-grams and 'char_wb' (which creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space). However, none provided a clear threshold that separated the rogue publications from the similar ones, and in some cases the cosine similarity ratio did not even find the nearest title.


**Example:** The Google Scholar title was "Computer Vision for Lifelogging: Characterizing Everyday Activities Based on Visual Semantics", and the nearest title on Doras was actually "Computer Vision for Lifelogging". However, the cosine similarity ratio suggested "Characterizing everyday activities from visual lifelogs based on enhancing concept representation." as the nearest title. In the end, I abandoned the cosine similarity ratio.
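
To make the cosine comparison concrete, here is a minimal standard-library sketch of a cosine similarity over character n-gram counts. It is an illustration only, not the code used in the experiments: the names `char_ngrams` and `cosine_similarity` are hypothetical, raw counts are assumed instead of any weighting scheme, and the one-space padding only approximates the 'char_wb' behaviour described above.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams word by word, padding each word with one space."""
    counts = Counter()
    for word in text.lower().split():
        padded = " " + word + " "
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def cosine_similarity(a, b, n=3):
    """Cosine of the angle between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


scholar_title = "Computer Vision for Lifelogging: Characterizing Everyday Activities Based on Visual Semantics"
candidates = [
    "Computer Vision for Lifelogging",
    "Characterizing everyday activities from visual lifelogs based on enhancing concept representation.",
]
# Print the score of each Doras candidate against the Google Scholar title
for candidate in candidates:
    print(round(cosine_similarity(scholar_title, candidate), 3), candidate)
```

Because the cosine measure rewards shared n-grams across the whole string, a long candidate that shares many words with the query can outscore a short candidate that is an exact prefix of it.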

This notebook contains **three** key components:

1. Creating methods that help identify the missing and rogue publications, and reconciling the two versions of Doras (the faculty page and a researcher's page), as some publications appear on a researcher's page but not on the faculty page and vice versa. I also dropped redundant columns.

2. Performing the splitting.

3. Creating a pandas dataframe (detailed below) for the missing Doras and rogue Doras publications.



In [1]:
import pandas as pd
from fuzzywuzzy import process
import re
import numpy as np
import string

In [2]:
researchers = pd.read_csv("../data/SOC_Researchers_with_doras_names.csv",encoding = "ISO-8859-1")

In [3]:
# a dataframe containing publications that are in a researcher's google scholar profile but not on doras
not_on_doras = pd.DataFrame(columns=['Research name','Publication Title','Author List','Conf/Journal Details','Citation count','Year'])

# Creating functions that will find the missing Doras publications and reconcile the two versions of Doras

**add_mis_matches** helps build a dataframe that contains the mismatched/rogue publications.

**check_years** checks whether a year is present in both the Google Scholar publication title and the nearest publication title on Doras.

**equal_year** tests whether the year in a Google Scholar publication title and the year in the nearest publication title on Doras are the same.

**check_ncir** tests whether the number in a publication title containing "NTCIR" and the number in the nearest publication title on Doras are the same. Note that it is written in the format NTCIR-19 (for example), so I could not simply use the method above.

**check_sr** tests whether the year in a publication title containing "shared task (SR" and the year in the nearest publication title on Doras are the same. Note that the year is in the format SR'19 or SR’19, so I could not simply use the method above.

**check_wmt** checks whether a Google Scholar publication title and the nearest publication title on Doras both contain "wmt" followed by a two-digit year.

**equal_wmt** tests whether the year in a publication title containing "wmt" and the year in the nearest publication title on Doras are the same. Note that the year is in the format wmt18 (for example), so I could not simply use equal_year.

**reconcile_doras** reconciles the two versions of Doras (the faculty page and a researcher's page), as some publications appear on a researcher's page but not on the faculty page and vice versa. It also drops redundant columns.

In [4]:
list_mis_match_title = []
list_nearest_title  = []
list_score = []
list_researcher = []
list_year = []
# method that assists in the development of a dataframe that contains  mismatched/rogue publications 
def add_mis_matches(research, title,near,score,year):
    list_researcher.append(research)
    list_mis_match_title.append(title)
    list_nearest_title.append(near)
    list_score.append(score)
    list_year.append(year)
#checks if there is a year present in the publication title and the nearest publication title in doras
def check_years(title,near_title):
    try:
        year1 = re.findall("[2][0][0-9]{2}",title)[0]
        year2 = re.findall("[2][0][0-9]{2}",near_title)[0]
    except IndexError:
        return False
    return True
# tests if the year in a publication's title and the year in the nearest publication title in doras are the same
def equal_year(title,near_title):
    # same pattern as check_years, so both functions pick out the same year
    year1 = re.findall("[2][0][0-9]{2}",title)[0]
    year2 = re.findall("[2][0][0-9]{2}",near_title)[0]
    return year1 == year2
# another method testing whether the "NTCIR-NN" token in a publication's title and the one in the nearest publication title in doras are the same
def check_ncir(title,near_title):
    year1 = re.findall("NTCIR-[0-9]{2}",title)[0]
    year2 = re.findall("NTCIR-[0-9]{2}",near_title)[0]
    return year1 == year2
# another method testing whether the year in a publication's title containing "shared task (SR" and the year in the nearest publication title in doras are the same
def check_sr(title, near_title):
    # [’'] accepts either apostrophe style; the "|" inside the original character class was matched literally
    year1 = re.findall("(SR[’'][0-9]{2})",title)[0][-2:]
    year2 = re.findall("(SR[’'][0-9]{2})",near_title)[0][-2:]
    return year1 == year2
# checks whether a  publication title  contains wmt and the nearest publication title in doras also contain wmt
def check_wmt(title, near_title):
    try:
        year1 = re.findall("wmt[0-9]{2}",title.lower())[0][-2:]
        year2 = re.findall("wmt[0-9]{2}",near_title.lower())[0][-2:]
    except IndexError:
        return False
    return True
# another method testing whether the year in a publication's title containing "wmt" and the year in the nearest publication title in doras are the same
def equal_wmt(title,near_title):
    year1 = re.findall("wmt[0-9]{2}",title.lower())[0][-2:]
    year2 = re.findall("wmt[0-9]{2}",near_title.lower())[0][-2:]
    return year1 == year2
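
The behaviour of these patterns is easiest to see on a few sample titles. The sketch below is a hypothetical consolidation, not part of the notebook's pipeline: `PATTERNS`, `first_match` and `same_edition` are names introduced here for illustration, and the titles are made up. Unlike the helpers above, it returns `None` instead of raising `IndexError` when a token is absent.

```python
import re

# Patterns mirroring the helpers above (assumed equivalents, written as raw strings)
PATTERNS = {
    "year": r"20[0-9]{2}",
    "ntcir": r"NTCIR-[0-9]{2}",
    "sr": r"SR[’'][0-9]{2}",
    "wmt": r"wmt[0-9]{2}",
}


def first_match(kind, title):
    """Return the first token of the given kind in title, or None if absent."""
    text = title.lower() if kind == "wmt" else title
    found = re.findall(PATTERNS[kind], text)
    return found[0] if found else None


def same_edition(kind, title, near_title):
    """True/False when both titles carry a token; None when either lacks one."""
    a, b = first_match(kind, title), first_match(kind, near_title)
    if a is None or b is None:
        return None
    if kind == "sr":
        return a[-2:] == b[-2:]  # compare digits only, so SR'19 equals SR’19
    return a == b


print(same_edition("year", "Overview of a task 2018", "Overview of a task 2019"))  # False
print(same_edition("sr", "The shared task (SR'19)", "The shared task (SR’19)"))    # True
print(same_edition("ntcir", "Overview of NTCIR-13", "Overview of the task"))       # None
```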
    
    
    
    

In [5]:
# reconciling the doras profile version and the doras faculty page, as there are publications that do not appear on both!
def reconcile_doras(value,doras_name,doras_df,doras_soc_df):
    # na=False treats rows with a missing author list as non-matching instead of producing NaN in the mask
    soc = doras_soc_df[doras_soc_df["Author List"].str.contains(doras_name, na=False)]

    soc = soc[[
     'Publication Title',
     'Author List',
     'Conf/Journal Details',
     'Year',
     'ISBN',
     'ISSN',
     'Item Type',
     'Event Type',
     'Refereed',
     'Date of Award',
     'Supervisor(s)',
     'Uncontrolled Keywords',
     'Subject',
     'DCU Faculties and Centres',
     'Use License',
     'ID Code',
     'Deposited On',
     'Published in',
     'Publisher',
     'Official URL',
     'Copyright Information',
     'Funders',
     'Additional Information',
     'Tweets',
     'Mendeley Readers']]

    soc.insert(0, 'Research name', value)


    doras_df = doras_df[['Research name', 'Publication Title', 'Author List',
       'Conf/Journal Details', 'Year', 'ISBN', 'ISSN', 'Item Type',
       'Event Type', 'Refereed', 'Date of Award', 'Supervisor(s)',
       'Uncontrolled Keywords', 'Subject', 'DCU Faculties and Centres',
       'Use License', 'ID Code', 'Deposited On', 'Published in', 'Publisher',
       'Official URL', 'Copyright Information', 'Funders',
       'Additional Information', 'Tweets', 'Mendeley Readers']]

    final = pd.concat([soc,doras_df] ,ignore_index=True)
    final["Publication Title"] = final["Publication Title"].apply(lambda x: " ".join(str(x).splitlines())) 



    final.drop_duplicates(inplace=True, ignore_index=True)
    return final 
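
The title clean-up in `reconcile_doras` matters because `drop_duplicates` only removes rows that are identical in every column. The plain-Python sketch below illustrates the idea on made-up records; `normalise_title` and the sample rows are hypothetical, and it assumes that the same publication can appear with a line break in its title in one source but not in the other.

```python
def normalise_title(title):
    """Collapse a title that spans several lines into one line, as reconcile_doras does."""
    return " ".join(str(title).splitlines())


# Hypothetical records: (title, year) from the faculty page and from a researcher's page
faculty_rows = [("Deep learning for\nlifelogging", 2019)]
profile_rows = [("Deep learning for lifelogging", 2019), ("Another paper", 2020)]

seen = set()
merged = []
for title, year in faculty_rows + profile_rows:
    row = (normalise_title(title), year)
    if row not in seen:  # keep only the first copy of each identical row
        seen.add(row)
        merged.append(row)

print(merged)  # [('Deep learning for lifelogging', 2019), ('Another paper', 2020)]
```

Without the normalisation step, the first two rows would differ by a newline and both would survive de-duplication. Rows that differ in any other column are still kept as separate records.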

    

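Through empirical analysis I found the best-suited similarity threshold for particular researchers; those not in the dictionary default to 90. The process.extractOne function of the fuzzywuzzy package automatically converts all strings to lowercase before comparing them. Note that process.extractOne scores with fuzz.WRatio by default, a weighted combination of several ratio variants built on the Levenshtein ratio, rather than the plain ratio.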

In [6]:

# researcher-specific thresholds; anyone not listed here uses the default of 90
dic = {
    "Rob Brennan": 92,
    "Annalina Caputo": 89,
    "Long Cheng": 89,
    "Jennifer Foster": 89,
    "Jane Kernan": 89,
    "Alistair Sutherland": 88,
    "Andy Way": 88,
    "Murat YILMAZ": 87,
    "Paul M. Clarke": 89,
    "Gareth Jones": 87,
}

In [7]:
#doras_soc_df is a dataframe containing the publications listed on the faculty page
doras_soc_df =  pd.read_csv("../data/Doras SOC/doras_soc.csv")
for index, value in researchers['Researcher'].items():
    value = value.strip()# value is the researchers name
    doras_df = ""
    scholar_df = ""
    filename = "_".join(value.split(" "))
    doras_name = "" # the name of the researcher in doras format
    # need try and except as not every researcher has a doras page
    try:
        doras_df1 = pd.read_csv("../data/Doras publications/{}.csv".format(filename))
        doras_name = researchers.iloc[index,4].strip()
        doras_df = reconcile_doras(value,doras_name,doras_df1,doras_soc_df)
        doras_df = doras_df[doras_df["Conf/Journal Details"].str.contains("preprint", na=False) == False].reset_index(drop=True)# removing preprints
        
    except FileNotFoundError:
        pass
    # need try and except as not every researcher has a google scholar profile
    try:
        scholar_df = pd.read_csv("../data/Google Scholar Publications/{}.csv".format(filename))
        scholar_df = scholar_df[scholar_df["Conf/Journal Details"].str.contains("preprint", na=False) == False].reset_index(drop=True)# removing preprints
    except FileNotFoundError:
        pass
    # testing if we have both profiles(doras and google scholar) present
    if len(doras_df) > 0 and len(scholar_df) > 0:
        score = dic.get(value, 90)# researcher-specific threshold, defaulting to 90
        choices1 = doras_df["Publication Title"].values.tolist()# adding all of the doras titles to a list
        for index, title in scholar_df["Publication Title"].items():
            # extractOne returns (nearest title, ratio score) for a list of choices, so one call gives both
            nearest_title, distance = process.extractOne(str(title.strip()),choices1)
            if distance >= score and distance < 99: # regular expression methods used to find missing publications above the threshold 
                if check_years(title,nearest_title) == True:
                    if equal_year(title,nearest_title) == False:
                        not_on_doras = pd.concat([not_on_doras,scholar_df.iloc[[index]]],ignore_index=True)
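                elif "NTCIR" in title and "NTCIR" in nearest_title:
                    # bug fix: call check_ncir; the original compared the function object itself to False, so this branch never fired
                    try:
                        ncir_match = check_ncir(title,nearest_title)
                    except IndexError:# one of the titles has no "NTCIR-NN" token, so there is nothing to compare
                        ncir_match = True
                    if ncir_match == False:
                        not_on_doras = pd.concat([not_on_doras,scholar_df.iloc[[index]]],ignore_index=True)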
                        
                elif "shared task (SR" in title and "shared task (SR" in nearest_title:
                    if check_sr(title,nearest_title) == False:
                        not_on_doras = pd.concat([not_on_doras,scholar_df.iloc[[index]]],ignore_index=True)
                elif check_wmt(title,nearest_title) == True:
                    if equal_wmt(title,nearest_title) == False:
                        not_on_doras = pd.concat([not_on_doras,scholar_df.iloc[[index]]],ignore_index=True)
                        
                        
                else:# ones I could not catch and were ambiguous; put them in the mismatch dataframe
                    if title in (
                        "Universal dependencies 1.1",
                        "Evaluation of coordination techniques in synchronous collaborative information retrieval",
                        "Oracle-based training for phrase-based statistical machine translation",
                        "Machine translation",
                        "Results of the wmt18 metrics shared task: Both characters and embeddings achieve good performance",
                        "Results of the wmt16 metrics shared task",
                        "Interaction and engagement for information research and learning with lifelogging devices",
                        "Multimedia for personal health and health care",
                        "Streamrule: a nonmonotonic stream reasoning system for the semantic web",
                    ):
                        add_mis_matches(value,title,nearest_title,distance,scholar_df["Year"].loc[index])
                   
            
            # adding google scholar publication titles that do not appear in doras
            elif distance < score:# reuse the score computed above instead of calling extractOne again
                not_on_doras = pd.concat([not_on_doras,scholar_df.iloc[[index]]],ignore_index=True)
    # if a researcher has no doras profile, adding all of their google scholar publications to not_on_doras
    elif len(scholar_df) > 0 and len(doras_df)  == 0:
        not_on_doras = pd.concat([not_on_doras,scholar_df],ignore_index=True)
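
The loop above sorts every Google Scholar title into one of three outcomes based on its best score. The sketch below restates that decision rule in isolation. It is an approximation under stated assumptions: `difflib.SequenceMatcher` from the standard library stands in for the fuzzywuzzy scorer (so the scores will differ from the notebook's), and `best_match`, `classify`, `THRESHOLDS` and the sample titles are hypothetical names and data introduced for illustration.

```python
from difflib import SequenceMatcher

DEFAULT_THRESHOLD = 90
THRESHOLDS = {"Rob Brennan": 92, "Andy Way": 88}  # subset of the per-researcher thresholds


def best_match(title, choices):
    """Return (nearest title, score 0-100) using difflib as a stand-in scorer."""
    scored = [
        (choice, round(100 * SequenceMatcher(None, title.lower(), choice.lower()).ratio()))
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])


def classify(researcher, title, choices):
    """Label a Google Scholar title as 'missing', 'needs check' or 'present'."""
    threshold = THRESHOLDS.get(researcher, DEFAULT_THRESHOLD)
    nearest, score = best_match(title, choices)
    if score < threshold:
        return "missing", nearest, score      # goes straight to not_on_doras
    if score < 99:
        return "needs check", nearest, score  # year/NTCIR/SR/WMT rules, else manual review
    return "present", nearest, score          # treated as already on Doras


doras_titles = ["Results of the WMT16 metrics shared task", "Machine translation evaluation"]
print(classify("Andy Way", "Results of the wmt16 metrics shared task", doras_titles))
print(classify("Andy Way", "A completely unrelated title", doras_titles))
```

The middle band is where near-identical titles that differ only in a year or edition number end up, which is why the regular-expression checks are applied only there.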
        
                

## Mismatch dataframe
| Column name | Description |
|---:|:---|
| Research name | The researcher's name. |
| Mismatch Title | The title of the Google Scholar paper. |
| Nearest Title | The title of the nearest Doras paper. |
| Score | The ratio score. |
| Year | Year of the Google Scholar paper. |


## Not on Doras dataframe
| Column name | Description |
|---:|:---|
| Research name | The researcher's name. |
| Publication Title | The title of a paper. |
| Author List | The authors of a paper. |
| Conf/Journal Details | Details about a publication. |
| Citation count | The citation count garnered by a particular paper. |
| Year | Year of a paper. |

In [12]:
d = {'Research name':list_researcher,'Mismatch Title':list_mis_match_title,'Nearest Title':list_nearest_title,'Score':list_score,"Year":list_year,"Filename":None,"Ignore Y/N":False}
mis_matches_df = pd.DataFrame(data=d)
not_on_doras["Filename"] = None
not_on_doras["Ignore Y/N"] = False

In [13]:

mis_matches_df.to_csv("../data/Missing Publications/mismatches_not_on_doras.csv", index = None, header=True)

In [14]:
not_on_doras.to_csv("../data/Missing Publications/not_on_doras.csv", index = None, header=True)