## Part 1: Network Construction
Nodes represent authors of academic papers.

An edge between nodes A and B indicates that the two authors have co-written at least one paper. The graph is undirected.

The weight of an edge is the number of papers the two authors have written together.

In [61]:
import pandas as pd
from itertools import combinations # to create unique co-author pairs from the list of authors for each paper
from itertools import chain # to flatten the list of lists of co-author pairs efficiently
from collections import Counter # to count each co-author pair
import networkx as nx
import json
from networkx.readwrite import json_graph
import numpy as np
from scipy import stats
import ast


##### Create a weighted edgelist

In [62]:
# Read the data
df_papers = pd.read_csv('IC2S2_combined_papers.csv')
df_authors = pd.read_csv('IC2S2_combined_authors.csv')

In [63]:
# Parse the author_ids column: each cell holds a list stored as a string
df_papers['author_ids'] = df_papers['author_ids'].apply(ast.literal_eval) # convert each string to a list
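As a minimal illustration of what this conversion does, assuming each cell of `author_ids` holds a list written out as a Python-literal string (the sample value below is made up for the example):

```python
import ast

# Hypothetical cell value: a list of author IDs stored as a string in the CSV
raw = "['https://openalex.org/A1', 'https://openalex.org/A2']"

# literal_eval only evaluates Python literals (lists, strings, numbers, ...),
# so it does not run arbitrary code the way eval() would
parsed = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(len(parsed), "author ids")
```

If a cell is not a valid literal (for example an empty or missing value), `ast.literal_eval` raises an error, so such rows would need to be handled before applying it.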

In [65]:
# Get the co-author pairs for each paper (one row per paper)
coauthor_pairs = df_papers['author_ids'].apply(lambda x: list(combinations(x, 2))) # all pairs of authors on each paper

# Flatten the per-paper lists into a single iterable of pairs (chain avoids building an intermediate list)
flattened_pairs = chain.from_iterable(coauthor_pairs)

# Count how many papers each co-author pair shares
coauthor_count = Counter(flattened_pairs)

# Make the weighted edgelist: (author_a, author_b, number_of_joint_papers)
edgelist = [(a, b, count) for (a, b), count in coauthor_count.items()]

In [66]:
edgelist[:5]

[('https://openalex.org/A5087421071', 'https://openalex.org/A5077795637', 3),
 ('https://openalex.org/A5087421071', 'https://openalex.org/A5082742221', 5),
 ('https://openalex.org/A5077795637', 'https://openalex.org/A5082742221', 4),
 ('https://openalex.org/A5003697141', 'https://openalex.org/A5041252321', 1),
 ('https://openalex.org/A5003697141', 'https://openalex.org/A5070114879', 1)]
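
One caveat about the pair counting: `combinations` keeps the order in which authors are listed, and `Counter` treats `(a, b)` and `(b, a)` as different keys. If two authors appear in a different order on different papers, their joint papers are split across two entries, and an undirected `nx.Graph` then keeps only the weight of whichever entry is added last. Whether this affects the data here depends on whether author order is consistent across papers, which is not checked above. The sketch below uses made-up paper lists to show the effect and one possible remedy, sorting the IDs before pairing:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: the same two authors listed in a different order on two papers
papers = [["A2", "A1"], ["A1", "A2", "A3"]]

# Order-sensitive counting: ("A2", "A1") and ("A1", "A2") become separate keys
naive = Counter(pair for authors in papers for pair in combinations(authors, 2))

# Order-insensitive counting: sort (and de-duplicate) the authors before pairing
canonical = Counter(
    pair for authors in papers for pair in combinations(sorted(set(authors)), 2)
)

print("naive:    ", dict(naive))
print("canonical:", dict(canonical))
```

With sorted IDs every pair has a single canonical key, so the count for A1 and A2 is 2 rather than two separate counts of 1.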

##### Graph construction

In [67]:
Graph = nx.Graph()
Graph.add_weighted_edges_from(edgelist)

##### Node attributes

In [68]:
# First add author attributes: display_name, country 

for index, row in df_authors.iterrows():
    author_id = row['id']
    Graph.add_node(author_id, display_name=row['display_name'], country = row['country_code'])

# Note: Graph.add_nodes_from(df_authors['id'], display_name=df_authors['display_name'], country=df_authors['country_code']) is NOT equivalent:
# keyword attributes are applied identically to every node, so each node would receive the whole column instead of its own value

# Get citation count from df_papers
author_citation_counts = df_papers.explode('author_ids').groupby('author_ids')['cited_by_count'].sum() # explode to get one author per row, groupby author and sum citations

# Add citation count as an attribute to the nodes
for author_id, citation_count in author_citation_counts.items(): # (author_citation_counts is a Series where index is author_id and value is citation count)
    Graph.nodes[author_id]['citation'] = citation_count

# Get first publication year for each author from df_papers
first_pub_year = df_papers.explode('author_ids').groupby('author_ids')['publication_year'].min() # explode to get one author per row, groupby author and get min publication year

# Add first publication year as an attribute to the nodes
for author_id, year in first_pub_year.items(): # (first_pub_year is a Series where index is author_id and value is first publication year)
    Graph.nodes[author_id]['first_pub_year'] = year
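
As a sketch of how per-node attributes could be attached without an explicit loop over `Graph.nodes[...]`, the example below uses two NetworkX calls: `add_nodes_from` with `(node, attribute_dict)` tuples, and `nx.set_node_attributes` with a `{node: value}` mapping. The author records and citation counts are made up for illustration and stand in for `df_authors` and `author_citation_counts`.

```python
import networkx as nx

# Hypothetical author records standing in for rows of df_authors
authors = [
    {"id": "A1", "display_name": "Author One", "country_code": "DK"},
    {"id": "A2", "display_name": "Author Two", "country_code": "SE"},
]

G = nx.Graph()

# Option 1: pass (node, attribute_dict) tuples so each node gets its own values
G.add_nodes_from(
    (a["id"], {"display_name": a["display_name"], "country": a["country_code"]})
    for a in authors
)

# Option 2: set one attribute for many nodes from a {node: value} mapping
citations = {"A1": 10, "A2": 3}  # made-up citation counts
nx.set_node_attributes(G, citations, name="citation")

print(dict(G.nodes(data=True)))
```

With the dataframes above, the mapping for option 2 could presumably come from `author_citation_counts.to_dict()`. Note that `set_node_attributes` skips keys that are not already nodes in the graph, whereas assigning through `Graph.nodes[author_id]` raises a `KeyError` for them.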

In [69]:
# Save the graph as a json file
graph_data = json_graph.node_link_data(Graph, edges="links") # pass edges explicitly: "links" keeps the current key name (the default changes to "edges" in NetworkX 3.6)
with open("network.json", "w") as f:
    json.dump(graph_data, f, indent = 4) # indent = 4 to make the json file more readable


## Part 2: Preliminary Network Analysis


##### Network Metrics

In [70]:
num_nodes = Graph.number_of_nodes()
num_edges = Graph.number_of_edges()

print(f"Total number of authors is {num_nodes} and total number of collaborations is {num_edges}.")

Total number of authors is 28735 and total number of collaborations is 414671.


In [71]:
max_possible_edges = num_nodes * (num_nodes - 1) / 2 # n choose 2 is the max possible number of edges for an undirected graph, where n is the number of nodes
density = num_edges/max_possible_edges
print(f"Density of the network is {density}")

Density of the network is 0.0010044454847290415


Would you say that the network is sparse? Justify your answer.

Yes. The density is about 0.001, meaning that only about 0.1% of all possible author pairs are actually connected. The number of links is far below the maximum possible number, so the network is very sparse.
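
The density can also be read as the average degree divided by the largest possible degree, n - 1. The short check below reuses the node and edge counts printed above. It is only a sanity check on the arithmetic, not part of the original analysis.

```python
import math

num_nodes = 28735   # number of authors reported above
num_edges = 414671  # number of collaborations reported above

# Each edge contributes 1 to the degree of two nodes
avg_degree = 2 * num_edges / num_nodes

# Same formula as in the cell above: edges divided by n choose 2
density = num_edges / (num_nodes * (num_nodes - 1) / 2)

# Density equals the average degree divided by (n - 1)
assert math.isclose(density, avg_degree / (num_nodes - 1))

print(f"average degree = {avg_degree:.1f}")
print(f"density        = {density:.6f}")
```

In other words, a typical author is linked to a few dozen of the 28734 other authors, which is why the density is so small.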

Is the network fully connected (i.e., is there a direct or indirect path between every pair of nodes within the network), or is it disconnected?

The network is disconnected. As the outputs below show, it consists of 292 connected components, 226 of which are isolated nodes (authors with no co-authors in the network).

In [72]:
# Find number of connected components
num_cc = nx.number_connected_components(Graph)
print(f"Number of connected components in the network is {num_cc}.")

Number of connected components in the network is 292.


In [73]:
# Find number of isolated nodes
num_isolated = len(list(nx.isolates(Graph)))
print(f"Number of isolated nodes in the network is {num_isolated}.")

Number of isolated nodes in the network is 226.
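
An isolated node is itself a connected component of size 1, so the 226 isolated nodes account for 226 of the 292 components, leaving 66 components with two or more authors. The sketch below shows this relationship on a made-up toy graph, using a plain breadth-first search rather than the NetworkX functions used above.

```python
from collections import deque

# Hypothetical toy graph as an adjacency mapping; "E" has no neighbours
adjacency = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"F"},
    "E": set(),
    "F": {"D"},
}

def connected_components(adj):
    """Return a list of node sets, one per connected component (breadth-first search)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

components = connected_components(adjacency)
isolated = [c for c in components if len(c) == 1]

print(f"{len(components)} components, {len(isolated)} of them isolated nodes")
```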


Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why?

The findings are in line with our expectations. We did not expect a dense graph, since in practice each author can only collaborate with a small fraction of the other authors in the network. The presence of isolated nodes also makes sense, since authors can write papers alone. It is nevertheless interesting that only 226 of the 28735 authors (about 0.8%) have no co-authors in the network.

##### Degree Analysis

In [74]:
# Compute the average, median, mode, minimum, and maximum degree of the nodes

# Get the degrees of all the nodes
degrees = [degree for node, degree in Graph.degree()]

avg_deg = np.mean(degrees)
med_deg = np.median(degrees)
mode_deg = stats.mode(degrees, keepdims=True)[0][0] # keepdims=True to get the mode as an array
min_deg = min(degrees)
max_deg = max(degrees)

print(f"Average degree of the nodes is {round(avg_deg, 1)}.")
print(f"Median degree of the nodes is {med_deg}.")
print(f"Mode degree of the nodes is {mode_deg}.")
print(f"Minimum degree of the nodes is {min_deg}.")
print(f"Maximum degree of the nodes is {max_deg}.")


Average degree of the nodes is 28.9.
Median degree of the nodes is 11.0.
Mode degree of the nodes is 4.
Minimum degree of the nodes is 0.
Maximum degree of the nodes is 605.


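On average, each author in the network has collaborated with about 29 other authors. The median of 11 indicates that at least half of the authors have 11 or fewer co-authors. The mean is well above the median, so the degree distribution is right-skewed: a small number of authors with very many co-authors (up to 605) pull the average up.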

In [75]:
# Compute the average, median, mode, minimum, and maximum of node strength

# Get the strength of all the nodes (for each node, strength is the sum of the weights of the edges incident to that node)
strengths = [strength for node, strength in Graph.degree(weight='weight')]
avg_str = np.mean(strengths)
med_str = np.median(strengths)
mode_str = stats.mode(strengths, keepdims=True)[0][0] # keepdims=True to get the mode as an array
min_str = min(strengths)
max_str = max(strengths)

print(f"Average strength of the nodes is {round(avg_str, 1)}.")
print(f"Median strength of the nodes is {med_str}.")
print(f"Mode strength of the nodes is {mode_str}.")
print(f"Minimum strength of the nodes is {min_str}.")
print(f"Maximum strength of the nodes is {max_str}.")

Average strength of the nodes is 46.3.
Median strength of the nodes is 15.0.
Mode strength of the nodes is 4.
Minimum strength of the nodes is 0.
Maximum strength of the nodes is 2627.


On average, an author has a total collaboration weight (strength) of 46.3. This does not mean that the average author has co-authored 46.3 papers. A single paper with n authors adds 1 to the weight of each of an author's n - 1 edges on that paper, so it contributes n - 1 to that author's strength, and papers with many authors are counted many times.
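
To make the difference between strength and paper count concrete, here is a small sketch on made-up data. It computes degree, strength, and the number of papers per author directly from a list of author lists, without NetworkX. The author IDs are placeholders.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: one four-author paper and one two-author paper
papers = [["A1", "A2", "A3", "A4"], ["A1", "A2"]]

# Edge weight = number of papers a pair of authors has written together
weights = Counter(
    pair for authors in papers for pair in combinations(sorted(authors), 2)
)

degree, strength = Counter(), Counter()
for (a, b), w in weights.items():
    for node in (a, b):
        degree[node] += 1    # one more distinct co-author
        strength[node] += w  # add the number of papers shared with that co-author

# Number of papers each author appears on
papers_written = Counter(author for authors in papers for author in authors)

for author in sorted(papers_written):
    print(author, degree[author], strength[author], papers_written[author])
```

Here A3 has written a single paper but has strength 3, because that paper has three other authors.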

Degree and strength are related, and comparing them characterizes an author's collaboration pattern:

- High degree, high strength indicates highly collaborative researchers with many co-authors and frequent collaborations.
- Low degree, high strength indicates authors who work repeatedly with a small set of collaborators.
- High degree, low strength indicates authors who have many co-authors but only a few papers per collaboration.
- Low degree, low strength indicates authors with few collaborators and few joint papers.

##### Top Authors

In [76]:
sorted_by_deg = sorted(Graph.degree(), key=lambda x: x[1], reverse=True)
top_5_deg = sorted_by_deg[:5]

for node, degree in top_5_deg:
    print(Graph.nodes[node]['display_name'])

Anna Dreber
Magnus Johannesson
Simon A. Levin
Yan Wang
Lyle Ungar


What role do these nodes play in the network?

These authors have the highest degree, i.e. the largest number of distinct co-authors in the network, so they act as hubs. This suggests that they may be senior researchers or mentors to other researchers, or that they take part in papers with many authors.

Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons?

Anna Dreber is an economist, a field that overlaps with the themes of Computational Social Science.

Magnus Johannesson is known for his research in the field of experimental economics, which aligns with the themes of Computational Social Science.

Simon A. Levin is an ecologist, but he has worked on many economics and psychology papers, which is a possible reason why he is a top collaborator in this network.

Yan Wang is a computer scientist, which aligns with the themes of Computational Social Science.

Lyle Ungar is also a computer scientist, which again aligns with the themes of Computational Social Science.
