## Research Areas Analysis Using Graph Data and Large Language Models

### Donato Riccio

![](image-1.png)

# Graph construction

The following code loads the dataset from a CSV file and builds a directed citation graph with NetworkX.

In [1]:
import pandas as pd
import networkx as nx
from tqdm import tqdm

# Initialize a directed graph
G = nx.DiGraph()

# Number of rows read per chunk
chunk_size = 10000

# Stream the CSV in chunks to avoid loading the entire dataset into memory along with the graph.
# The hard-coded row count only sizes the progress bar; ceiling division keeps `total` an integer.
for chunk in tqdm(pd.read_csv('data/dblp_ml.csv', chunksize=chunk_size), total=-(-2941588 // chunk_size)):
    
    # Process each row in the chunk
    for idx, row in chunk.iterrows():
        paper_id = row['id']
        references = str(row['references']).split(';') if pd.notna(row['references']) else []

        # Add paper node
        G.add_node(paper_id, type='paper')

        # Add reference edges (directed)
        for ref in references:
            if ref and ref.lower() != 'nan':
                ref = int(ref)  # Ensure the reference ID is an integer
                G.add_edge(paper_id, ref, relationship='cites')  # Directed edge
    

303it [03:52,  1.30it/s]                              
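
The reference-parsing step in the loop above can be isolated into a small helper, which makes the edge cases (missing values, empty tokens, the literal string `nan`) easier to check. This is a minimal sketch, not part of the original notebook: `parse_references` is a hypothetical name, and it assumes reference IDs are integers separated by semicolons, as the loop implies.

```python
import math


def parse_references(value):
    """Turn a semicolon-separated reference field into a list of integer IDs.

    Hypothetical helper mirroring the parsing in the loop above. It assumes
    missing values arrive as None or float NaN, which is how pandas
    represents empty CSV cells.
    """
    # Missing field: None or NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    refs = []
    for token in str(value).split(';'):
        token = token.strip()
        # Skip empty tokens and the literal string 'nan'
        if token and token.lower() != 'nan':
            refs.append(int(token))
    return refs


# Toy rows (invented): (paper_id, raw references field)
rows = [(1, '2;3'), (2, '3;'), (3, float('nan'))]
edges = [(pid, ref) for pid, raw in rows for ref in parse_references(raw)]
print(edges)  # [(1, 2), (1, 3), (2, 3)]
```

Each resulting pair corresponds to one directed `cites` edge from the citing paper to the referenced paper.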


# Running Infomap to partition the graph into communities

The algorithm was chosen for two reasons:
- It is suited to directed graphs
- It is very time-efficient, O(n log n)

In [2]:
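import pandas as pd
from infomap import Infomap
from tqdm import tqdm

# Initialize Infomap.
# Note: with no arguments, Infomap uses an undirected flow model (see the log below);
# pass the "--directed" flag, e.g. Infomap("--directed"), to take citation direction into account.
im = Infomap()

# Add edges to the Infomap object
for u, v in tqdm(G.edges()):
    im.addLink(u, v)

# Run Infomap
im.run()

# Retrieve the communities (mapping from node ID to module ID)
communities = im.getModules()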
# Prepare a DataFrame to store the results
processed_papers_df = pd.DataFrame(
    [(node, communities[node]) for node in communities],
    columns=['node_id', 'community']
)

# add community to the graph
for row in tqdm(processed_papers_df.itertuples(), total=len(processed_papers_df)):
    G.nodes[row.node_id]['community'] = row.community

print(f"Number of communities detected: {im.numTopModules()}")

100%|██████████| 34170947/34170947 [02:57<00:00, 192204.36it/s]


  Infomap v2.7.1 starts at 2024-06-12 18:44:03
  -> Input network: 
  -> No file output!
  -> Ordinary network input, using the Map Equation for first order network flows
Calculating global network flow using flow model 'undirected'... 
  -> Using undirected links.
  => Sum node flow: 1, sum link flow: 1
Build internal network with 2880605 nodes and 34170947 links...
  -> One-level codelength: 20.4467932

Trial 1/1 starting at 2024-06-12 18:46:36
Two-level compression: 20% 3.7% 0.0486511818% 0.0481414097% 0.00442233328% 0.00835782759% 
Partitioned to codelength 4.68737479 + 11.0540549 = 15.74142964 in 24029 (24028 non-trivial) modules.
Super-level compression: 2.48449717% to codelength 15.35033427 in 1808 top modules.

Recursive sub-structure compression: 14.6931561% 0.0781793798% 5.21006099e-07% 0% . Found 5 levels with codelength 15.21972421

=> Trial 1/1 finished in 461.651345s with codelength 15.2197242


Summary after 1 trial
Best end modular solution in 5 levels:
Per level number

100%|██████████| 2880605/2880605 [00:28<00:00, 101496.14it/s]

Number of communities detected: 1808
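
Once each node has a module ID, a quick sanity check is to look at how many nodes fall into each community. The snippet below is a minimal sketch on toy data: `toy_communities` is invented for illustration and merely stands in for the node-to-module mapping used above.

```python
from collections import Counter

# Toy stand-in for the {node_id: module_id} mapping (values are invented)
toy_communities = {101: 1, 102: 1, 103: 2, 104: 1, 105: 3, 106: 2}

# Count how many nodes fall into each module
community_sizes = Counter(toy_communities.values())

# List the largest communities first
for module_id, size in community_sizes.most_common():
    print(f"community {module_id}: {size} nodes")
```

On the real data, the same count could also be obtained from the `community` column of `processed_papers_df`.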





# Calculating HITS scores

HITS provides two scores for each node: a hub score and an authority score. In a citation network, a high authority score indicates a paper cited by many strong hubs (typically a highly cited, influential paper in its field), whereas a high hub score indicates a paper that cites many high-authority papers. The two scores reinforce each other, and together they distinguish papers that are sources of information from papers that point readers to those sources.
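
To make the mutual reinforcement concrete, here is a minimal pure-Python sketch of the HITS power iteration on a toy citation graph. It is an illustration only: the notebook itself relies on `nx.hits`, and the function name `hits_power_iteration`, the fixed iteration count, and the toy edges are assumptions made for this example.

```python
def hits_power_iteration(edges, n_iter=50):
    """Plain power-iteration HITS on a list of (citing, cited) edges.

    Illustrative sketch only; scores are normalized to sum to 1
    after every update.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(n_iter):
        # Authority: sum of the hub scores of the papers citing this node
        auth = {n: 0.0 for n in nodes}
        for citing, cited in edges:
            auth[cited] += hub[citing]
        total = sum(auth.values())
        auth = {n: s / total for n, s in auth.items()}
        # Hub: sum of the authority scores of the papers this node cites
        hub = {n: 0.0 for n in nodes}
        for citing, cited in edges:
            hub[citing] += auth[cited]
        total = sum(hub.values())
        hub = {n: s / total for n, s in hub.items()}
    return hub, auth


# Toy citation graph (invented): 'survey' cites three papers, 'A' is cited most often
toy_edges = [('survey', 'A'), ('survey', 'B'), ('survey', 'C'),
             ('B', 'A'), ('C', 'A')]
hub, auth = hits_power_iteration(toy_edges)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # survey A
```

The survey-style paper ends up as the strongest hub because it cites every authority, while the most-cited paper ends up as the strongest authority.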

In [None]:
import networkx as nx
import pandas as pd
import pickle
from tqdm import tqdm

# Calculate HITS scores; nx.hits returns a (hubs, authorities) pair of dicts keyed by node
hubs, authorities = nx.hits(G)

# Rows of (node_id, hub_score, authority_score) for the HITS DataFrame
hits_data = []

# Update the node attributes with the HITS scores
for node in tqdm(G.nodes):
    hits_data.append((node, hubs[node], authorities[node]))
    G.nodes[node]['hub_score'] = hubs[node]
    G.nodes[node]['authority_score'] = authorities[node]

hits_df = pd.DataFrame(hits_data, columns=['node_id', 'hub_score', 'authority_score'])

# Join the HITS scores with the community assignments in processed_papers_df
processed_papers_df = pd.merge(hits_df, processed_papers_df, on='node_id')



processed_papers_df.to_csv('data/processed_papers_df.csv', index=False)

# Save the updated graph to a pickle file for faster loading
with open('data/graphs/dblp_ml_graph.pkl', 'wb') as f:
    pickle.dump(G, f)
