# Author enrichment
Development notebook for code looking at conference author enrichment.

In [1]:
import jsonlines
import networkx as nx
import pandas as pd
from infomap import Infomap

## Read in data

In [2]:
with jsonlines.open('../data/wos_files/core_collection_destol_or_anhydro_with_authors_FILTERED_24Jan2023.jsonl') as reader:
    data = [obj for obj in reader]

In [3]:
graph = nx.read_graphml('../data/citation_network/core_collection_destol_or_anhydro_FILTERED_classified_network_06Jan2023_MANUALLY_VERIFIED.graphml')

In [11]:
attendees_2016 = pd.read_excel('../data/conference_data/2016_DesWork_Delegate_List.xlsx')
attendees_2016 = attendees_2016.rename(columns={'First name': 'First_name'})
attendees_2016.head()

| | Surname | First_name | Affiliation | Country |
|---|---|---|---|---|
| 0 | Barak | Simon | Ben-Gurion University of the Negev | Israel |
| 1 | Bar-Eyal | Leeat | The Hebrew University of Jerusalem | Israel |
| 2 | Bentley | Joanne | University of Cape Town | South Africa |
| 3 | Bharuth | Vishal | University of KwaZulu Natal | South Africa |
| 4 | Borner | Andreas | Leibniz Institute of Plant Genetics & Crop Pla... | Germany |


## Clustering algorithms

Choosing the right clustering algorithm for an empirical task is difficult because none of the stand-alone quality metrics that can be calculated on empirical (real-world) networks reliably reflects information retrieval capability, as discussed in [Emmons et al. 2016](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0159161). The authors found that conductance was the metric that correlated best with information retrieval, and that the Infomap clustering algorithm had the highest conductance on a real-world co-authorship dataset. That network is relatively similar to the kind of network we're working with, so we'll start with the `infomap` package for Python.
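
To make the conductance metric concrete, here is a minimal sketch that computes it for one cluster of a toy graph. It assumes the common graph-theoretic definition (cut edges divided by the smaller of the two side volumes), under which lower values mean a better-separated cluster; the paper may use a different convention, so "highest" should be read against its own definition. The sketch also treats edges as undirected, whereas our citation network is directed. The function name and toy graph are illustrative only; networkx also provides a `conductance` function that could be applied to `graph`.

```python
def conductance(edges, cluster):
    """Conductance of `cluster` in an undirected graph given as (u, v) pairs.

    Assumed definition: cut(S, rest) / min(vol(S), vol(rest)).
    """
    cluster = set(cluster)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        u_in, v_in = u in cluster, v in cluster
        if u_in != v_in:
            cut += 1                        # edge crosses the cluster boundary
        vol_in += u_in + v_in               # edge endpoints inside the cluster
        vol_out += (not u_in) + (not v_in)  # edge endpoints outside the cluster
    smaller = min(vol_in, vol_out)
    if smaller == 0:
        raise ValueError("conductance is undefined when one side has no edges")
    return cut / smaller


# Toy graph: two triangles joined by a single bridge edge (2, 3)
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(conductance(toy_edges, {0, 1, 2}))  # one triangle: 1/7, about 0.143
print(conductance(toy_edges, {0, 1}))     # splits a triangle: 0.5
```

The triangle is the better cluster under this definition because only one of its seven edge endpoints' worth of volume leaves the group.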

Some salient initial details about `infomap` from [the docs](https://www.mapequation.org/infomap/):
* Infomap handles both unweighted and weighted, undirected and directed links.
* Infomap clusters tightly interconnected nodes into modules (two-level clustering) or the optimal number of nested modules (multi-level clustering).
    * I'll perform two-level clustering here to begin with.

In [5]:
# Build network in Infomap
im = Infomap(clu=True, seed=1234)
mapping = im.add_networkx_graph(graph)

In [6]:
# Run the clustering algorithm
im.run()

  Infomap v2.7.1 starts at 2024-01-24 16:02:40
  -> Input network: 
  -> No file output!
  -> Configuration: clu
                    seed = 1234
  OpenMP 201511 detected with 1 threads...
  -> Ordinary network input, using the Map Equation for first order network flows
Calculating global network flow using flow model 'directed'... 

In [7]:
# Get output
cluster_output = im.get_dataframe(columns=["node_id", "module_id", "name"])
cluster_output.head()


  -> Using unrecorded teleportation to links. (Normalizing ranks after 0 power iterations with error -3.03607572e-05) 
  -> PageRank calculation done in 58 iterations.

  => Sum node flow: 1, sum link flow: 1
Build internal network with 5564 nodes and 41995 links...
  -> One-level codelength: 9.77797916

Trial 1/1 starting at 2024-01-24 16:02:40
Two-level compression: 18% 2.3% 0.0174579947% 0.0505075094% 0.00388751037% 
Partitioned to codelength 1.9057605 + 5.9298387 = 7.8355992 in 574 (167 non-trivial) modules.
Super-level compression: 0.548838812% to codelength 7.79259439 in 434 top modules.

Recursive sub-structure compression: 10.5084948% 0.104446176% 0.00254699959% 0% 

| | node_id | module_id | name |
|---|---|---|---|
| 0 | 114 | 1 | WOS:000166537400004 |
| 1 | 460 | 1 | WOS:000085277800001 |
| 2 | 107 | 1 | WOS:000072812400006 |
| 3 | 515 | 1 | WOS:000184821900010 |
| 4 | 108 | 1 | WOS:000234240100013 |


In [8]:
print(f'There are {len(cluster_output.module_id.unique())} unique clusters in the network.')

There are 434 unique clusters in the network.


In [9]:
# Write out to pajek format to use online visualization tool
im.write_pajek('../data/citation_network/citenet.pajek')

#### NOTE
I am not entirely sure how to interpret the output of this clustering algorithm, but I'm going to proceed on the assumption that `module_id` identifies the top-level clusters. This interpretation still needs to be confirmed.
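
One way to probe this interpretation is to summarize the module sizes and compare them with the Infomap log. The log reports 574 modules (167 non-trivial) after the two-level step and 434 top modules after super-level compression; the 434 unique `module_id` values counted above match the latter, which is consistent with `module_id` being the top-level assignment. The log also shows super-level and recursive sub-structure compression steps, which suggests a multi-level run; the `infomap` docs describe a `two_level` option, and whether it should be passed here is something to confirm. The helper below is a hypothetical sketch, not part of `infomap`; it only counts how many modules are singletons.

```python
from collections import Counter


def summarize_modules(module_ids):
    """Summarize cluster sizes from an iterable of per-node module ids.

    Hypothetical helper for sanity-checking clustering output.
    """
    sizes = Counter(module_ids)
    return {
        "n_modules": len(sizes),
        # A module with a single node is "trivial" in the sense used here
        "n_nontrivial": sum(1 for size in sizes.values() if size > 1),
        "largest": sizes.most_common(3),
    }


# Toy example: three modules, one of which is a singleton
print(summarize_modules([1, 1, 1, 2, 2, 3]))
```

In the notebook this could be called as `summarize_modules(cluster_output.module_id)`, and the counts compared against the numbers in the log.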

## Author name normalization
Before we can go about calculating author enrichment, we need to figure out how to match the names provided by DesWorks to the names in the WoS dataset. There are a few ways that names can be formatted in WoS, so we'll write a function to check all conference attendees' names in all three formats.
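
As a rough sketch of what checking a name in each format could look like, the helpers below build the candidate strings for one attendee and compare them case-insensitively with a WoS author record. The helper names are hypothetical, and the truncation of the surname to eight characters for the oldest format is taken from this notebook's `find_author_papers`, not from WoS documentation. The edge cases listed in that function's TODO (internal punctuation, Chinese names) are not handled.

```python
def name_variants(surname, first_name):
    """Lowercased candidate strings for one attendee in each WoS name format."""
    surname = surname.strip().lower()
    first_name = first_name.strip().lower()
    initials = "".join(part[0] for part in first_name.split())
    return {
        f"{surname[:8]}, {initials}",  # oldest format, space-separated (assumed)
        f"{surname[:8]}.{initials}",   # oldest format, dot-separated (assumed)
        f"{surname}, {initials}",      # initials only, also the wos_standard form
        f"{surname}, {first_name}",    # full name
    }


def matches(author, variants):
    """True if the author's full_name or wos_standard is one of the variants."""
    candidates = (author.get("full_name", ""), author.get("wos_standard", ""))
    return any(c.strip().lower() in variants for c in candidates)


# Author record copied from the data[100] example in this notebook
author = {"full_name": "Schill, Ralph O.", "wos_standard": "Schill, RO"}
print(matches(author, name_variants("Schill", "Ralph O.")))  # True
print(matches(author, name_variants("Neumann", "Simon")))    # False
```

Lowercasing both sides matters here because WoS stores names in mixed or upper case (for example `LARSON, DW`), while the attendee list may be normalized differently.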

In [19]:
# Bucket papers by publication year; year tests replace the slower `paper not in list` checks
char_11_names = [paper for paper in data if 1964 <= int(paper['year']) <= 1975]
pre_2006_names = [paper for paper in data if int(paper['year']) < 2006 and not (1964 <= int(paper['year']) <= 1975)]
modern_names = [paper for paper in data if int(paper['year']) >= 2006]

In [20]:
print(f'{len(char_11_names)} papers are from between 1964 - 1975, {len(pre_2006_names)} are from the period where only '
     f'first initial was recorded, and {len(modern_names)} are from the era where full name is recorded. Remember that '
     'it is possible that some older entries have been updated since 2006 and may contain full names.')

10 papers are from between 1964 - 1975, 1799 are from the period where only first initial was recorded, and 4154 are from the era where full name is recorded. Remember that it is possible that some older entries have been updated since 2006 and may contain full names.


In [27]:
pre_2006_names[0]

{'UID': 'WOS:000165982500024',
 'title': 'Genetic analysis of seed-soluble oligosaccharides in relation to seed storability of Arabidopsis',
 'year': '2000',
 'authors': [{'seq_no': '1',
   'role': 'author',
   'display_name': 'Bentsink, L',
   'full_name': 'Bentsink, L',
   'wos_standard': 'Bentsink, L',
   'first_name': 'L',
   'last_name': 'Bentsink'},
  {'seq_no': '2',
   'role': 'author',
   'display_name': 'Alonso-Blanco, C',
   'full_name': 'Alonso-Blanco, C',
   'wos_standard': 'Alonso-Blanco, C',
   'first_name': 'C',
   'last_name': 'Alonso-Blanco'},
  {'seq_no': '3',
   'role': 'author',
   'display_name': 'Vreugdenhil, D',
   'full_name': 'Vreugdenhil, D',
   'wos_standard': 'Vreugdenhil, D',
   'first_name': 'D',
   'last_name': 'Vreugdenhil'},
  {'seq_no': '4',
   'role': 'author',
   'display_name': 'Tesnier, K',
   'full_name': 'Tesnier, K',
   'wos_standard': 'Tesnier, K',
   'first_name': 'K',
   'last_name': 'Tesnier'},
  {'seq_no': '5',
   'role': 'author',
   'disp

In [28]:
char_11_names[8]

{'UID': 'WOS:A1971L065600025',
 'title': 'DESICCATION TOLERANCE IN 3 RACES OF AMBYSTOMA-TIGRINUM IN NORTH DAKOTA',
 'year': '1971',
 'authors': [{'seq_no': '1',
   'role': 'author',
   'display_name': 'LARSON, DW',
   'full_name': 'LARSON, DW',
   'wos_standard': 'LARSON, DW',
   'first_name': 'DW',
   'last_name': 'LARSON'}],
 'references': [{'UID': 'WOS:A1958WR62800010',
   'year': '1958',
   'title': 'VITAL LIMITS AND RATES OF DESICCATION IN SALAMANDERS'},
  {'UID': 'WOS:A1971L065600025.9', 'year': '1943'},
  {'UID': 'WOS:A1958WV57700015',
   'year': '1958',
   'title': 'COMPARISON OF DEHYDRATION AND HYDRATION OF 2 GENERA OF FROGS (HELEIOPORUS AND NEOBATRACHUS) THAT LIVE IN AREAS OF VARYING ARIDITY'},
  {'UID': 'WOS:A1955WR61600014',
   'year': '1955',
   'title': 'THE RELATIONSHIP OF WATER ECONOMY TO TERRESTRIALISM IN AMPHIBIANS'},
  {'UID': 'WOS:A1968B735000024',
   'year': '1968',
   'title': 'OCCURRENCE OF NEOTENIC SALAMANDERS AMBYSTOMA TIGRINUM DIABOLI DUNN IN DEVILS LAKE NORTH

In [14]:
data[100]

{'UID': 'WOS:000267158400009',
 'title': 'DNA damage in storage cells of anhydrobiotic tardigrades',
 'year': '2009',
 'authors': [{'seq_no': '1',
   'role': 'author',
   'addr_no': '1',
   'display_name': 'Neumann, Simon',
   'full_name': 'Neumann, Simon',
   'wos_standard': 'Neumann, S',
   'first_name': 'Simon',
   'last_name': 'Neumann'},
  {'seq_no': '2',
   'role': 'author',
   'addr_no': '1',
   'display_name': 'Reuner, Andy',
   'full_name': 'Reuner, Andy',
   'wos_standard': 'Reuner, A',
   'first_name': 'Andy',
   'last_name': 'Reuner'},
  {'seq_no': '3',
   'role': 'author',
   'addr_no': '1',
   'display_name': 'Bruemmer, Franz',
   'full_name': 'Bruemmer, Franz',
   'wos_standard': 'Brummer, F',
   'first_name': 'Franz',
   'last_name': 'Bruemmer'},
  {'seq_no': '4',
   'role': 'author',
   'reprint': 'Y',
   'addr_no': '1',
   'display_name': 'Schill, Ralph O.',
   'full_name': 'Schill, Ralph O.',
   'wos_standard': 'Schill, RO',
   'first_name': 'Ralph O.',
   'last_name

In [80]:
def find_author_papers(attendees, dataset):
    """
    Find papers that were authored by conference attendees.
    
    parameters:
        attendees, df: columns are Surname, First_name, Affiliation, Country
        dataset, list of dict: WoS papers with author and affiliation data
    
    returns:
        conference_authors, dict: keys are DesWorks recorded author names in WOS standard, values are WOS UIDs
    """
    ## TODO edge cases
    # Names with internal punctuation (e.g. O'Neill, Farooq-E-Azam)
    # Chinese names
    # Papers between 64-75 for authors with last names more than 8 characters
    
    # Process attendees to strip trailing whitespace and to lowercase
    for col in attendees.columns:
        attendees[col] = attendees[col].str.strip().str.lower()
    
    # Check for conference authors
    conference_authors = {f'{surname}, {"".join([n[0] for n in first_name.split()])}' for surname, first_name in zip(attendees['Surname'], attendees['First_name'])}
    for paper in dataset:
        # Check only surname first
        surnames = []
        for a in paper['authors']:
            try:
                surnames.append(a['last_name'].lower())
            except KeyError:
                # Dot 11-character names don't have a last name and we want to include them
                # Other authors with no last name are mostly organizations or repeats
                try:
                    if len(a['wos_standard'].split('.')) == 2:
                        surnames.append(a['wos_standard'].split('.')[0].lower())
                        ## TODO: this is a problem if the last name is longer than 8 characters
                except KeyError:
                    continue
        for name in surnames:
            # If surname is present, confirm with wos standard (including first name) name
            if name in attendees.Surname.tolist():
                # Get the rows with this surname; several attendees may share a surname
                possible_authors = attendees[attendees['Surname'] == name]
                print(f'Paper {paper["UID"]} contains {len(possible_authors)} possible authors with the surname {name}')
                # Process first names to initials
                possible_first_names = possible_authors.First_name.tolist()
                possible_initials = ["".join([n[0] for n in fn.split()]) for fn in possible_first_names]
                # Get possible oldest name format
                char_11s_space = [f'{name[:8]}, {initials}' for initials in possible_initials]
                char_11s_period = [f'{name[:8]}.{initials}' for initials in possible_initials]
                # Get first name as initial name format
                pre_2006s_and_WOS_standard = [f'{name}, {initials}' for initials in possible_initials]
                # Get full name
                full_names = [f'{name}, {first_name}' for first_name in possible_first_names]
                # Combine for all possibilities
                to_check = char_11s_space + char_11s_period + pre_2006s_and_WOS_standard + full_names
                # Now check all names against paper authors
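                for author in paper['authors']:
                    # Attendee names were lowercased above, so lowercase the WoS names before comparing
                    full_name = author.get('full_name', '').lower()
                    wos_standard = author.get('wos_standard', '').lower()
                    if (full_name in to_check) or (wos_standard in to_check):
                        print(author)
                        ## TODO debug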
                        
                

In [81]:
find_author_papers(attendees_2016, data)

Paper WOS:000281350300006 contains 1 possible authors with the surname hincha
Paper WOS:000276490900017 contains 1 possible authors with the surname lin
Paper WOS:000208165500004 contains 1 possible authors with the surname jin
Paper WOS:000277902300031 contains 1 possible authors with the surname farrant
Paper WOS:000537458800001 contains 1 possible authors with the surname fu
Paper WOS:000803565400001 contains 1 possible authors with the surname jin
Paper WOS:000791175600001 contains 1 possible authors with the surname oliver
Paper WOS:000224794200009 contains 1 possible authors with the surname hincha
Paper WOS:000225678200003 contains 1 possible authors with the surname hincha
Paper WOS:000249142800068 contains 1 possible authors with the surname pammenter
Paper WOS:000600247600001 contains 1 possible authors with the surname oliver
Paper WOS:000509003600002 contains 1 possible authors with the surname verhoeven
Paper WOS:000285069400036 contains 1 possible authors with the surname

Paper WOS:000244118800006 contains 1 possible authors with the surname farrant
Paper WOS:A1992JG87900011 contains 1 possible authors with the surname farrant
Paper WOS:A1992JG87900011 contains 1 possible authors with the surname pammenter
Paper WOS:000448142200021 contains 1 possible authors with the surname williams
Paper WOS:000448142200021 contains 1 possible authors with the surname mundree
Paper WOS:000424760300015 contains 1 possible authors with the surname varghese
Paper WOS:000424760300015 contains 1 possible authors with the surname pammenter
Paper WOS:000424350000006 contains 1 possible authors with the surname moore
Paper WOS:000072858400018 contains 1 possible authors with the surname kranner
Paper WOS:A1997WW05300011 contains 1 possible authors with the surname hoekstra
Paper WOS:A1997WW05300010 contains 1 possible authors with the surname golovina
Paper WOS:A1997WW05300010 contains 1 possible authors with the surname hoekstra
Paper WOS:A1997XH01700012 contains 1 possible

Paper WOS:000297605800010 contains 1 possible authors with the surname lin
Paper WOS:000397409300009 contains 1 possible authors with the surname fu
Paper WOS:000397409300009 contains 1 possible authors with the surname fu
Paper WOS:000605988300001 contains 1 possible authors with the surname marks
Paper WOS:000605988300001 contains 1 possible authors with the surname farrant
Paper WOS:000274412500010 contains 1 possible authors with the surname lin
Paper WOS:000274412500010 contains 1 possible authors with the surname hoekstra
Paper WOS:000275571400005 contains 1 possible authors with the surname lin
Paper WOS:000275571400005 contains 1 possible authors with the surname hoekstra
Paper WOS:000284979100003 contains 1 possible authors with the surname rafudeen
Paper WOS:000284979100003 contains 1 possible authors with the surname farrant
Paper WOS:000290845500008 contains 1 possible authors with the surname mycock
Paper WOS:000290845500008 contains 1 possible authors with the surname pam

We can common-sense check our results by looking at the authors' Google Scholar pages.

## Author enrichment
Now we can calculate what percentage of authors in 