## Research Areas Analysis Using Graph Data and Large Language Models

### Donato Riccio

![](image-1.png)

## Converting JSON to CSV
The first step in building the dataset is converting the DBLP JSON dump to CSV, which is easier to work with when constructing the graph. We only need the id, references, and keyword fields for this step.


In [1]:
# Code from: https://www.kaggle.com/code/shreyasbhatk/csv-conversion-all-fields

import ijson
import time
import csv
import numpy as np
from tqdm import tqdm
from decimal import Decimal

start = time.process_time()

count = 0  # number of records written so far

with open('data/dblp.v12.json', "rb") as f, open("data/dblp.csv", "w", newline="") as csvfile:
    fieldnames = ['id', 'references', 'keyword']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for i, element in enumerate(ijson.items(f, "item")):
        paper = {}
        paper['id'] = element['id']

        references = element.get('references')
        if references:
            paper['references'] = ';'.join(str(int(r)) for r in references)
        else:
            paper['references'] = np.nan

        # 'fos' (field of study) entries hold a name and a weight 'w'
        keyword = []
        for field in element.get('fos') or []:
            if isinstance(field['w'], (int, float, Decimal)):
                keyword.append(str(field['name']))
            else:
                # Keep a 'nan' placeholder when the weight is not numeric
                keyword.append(str(np.nan))

        paper['keyword'] = ';'.join(keyword)

  
        count += 1
        writer.writerow(paper)

        if count % 100000 == 0:
            print(f"{count}:{round((time.process_time() - start), 2)}s ", end="")


100000:4.12s 200000:8.16s 300000:12.95s 400000:18.38s 500000:23.82s 600000:29.24s 700000:35.02s 800000:41.09s 900000:47.07s 1000000:53.0s 1100000:59.03s 1200000:64.96s 1300000:71.05s 1400000:77.05s 1500000:83.06s 1600000:88.89s 1700000:94.84s 1800000:100.7s 1900000:106.7s 2000000:112.67s 2100000:118.89s 2200000:125.19s 2300000:131.28s 2400000:137.45s 2500000:143.49s 2600000:149.73s 2700000:155.91s 2800000:161.98s 2900000:168.06s 3000000:173.87s 3100000:179.52s 3200000:184.07s 3300000:189.62s 3400000:195.65s 3500000:201.0s 3600000:206.73s 3700000:212.46s 3800000:218.28s 3900000:224.29s 4000000:230.19s 4100000:236.13s 4200000:241.75s 4300000:247.18s 4400000:252.33s 4500000:256.31s 4600000:259.92s 4700000:265.31s 4800000:270.49s 
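To make the output format concrete, the sketch below applies the same kind of per-record flattening to a single made-up record and writes it to an in-memory buffer instead of a file. The record contents and the helper name `record_to_row` are illustrative assumptions, not taken from the dataset. For brevity it skips the weight check and writes an empty cell rather than `nan` when a field is missing.

```python
import csv
import io

def record_to_row(element):
    """Flatten one DBLP-style record into the three CSV columns (hypothetical helper)."""
    references = element.get("references") or []
    fos = element.get("fos") or []
    return {
        "id": element["id"],
        # Multi-valued fields are packed into a single cell, separated by ';'
        "references": ";".join(str(int(r)) for r in references),
        "keyword": ";".join(str(f["name"]) for f in fos),
    }

# Made-up record that mimics the fields used in the conversion
sample = {
    "id": 1,
    "references": [2, 3],
    "fos": [{"name": "Machine learning", "w": 0.5}, {"name": "Graph", "w": 0.4}],
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "references", "keyword"])
writer.writeheader()
writer.writerow(record_to_row(sample))
csv_text = buffer.getvalue()
print(csv_text)
```

Packing lists into one semicolon-separated cell keeps the CSV flat, at the cost of having to split the cell again when the references are read back.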

## Extracting Machine Learning Papers and Their References from DBLP Dataset

We aim to extract papers related to machine learning (ML) and similar research areas from the DBLP dataset, along with their references. The process involves several steps:

1. **Filter ML-related Papers**: We begin by filtering the papers in the DBLP dataset whose keywords mention statistics, machine learning, deep learning, artificial intelligence, or neural networks.

2. **Identify Missing References**: For each selected ML paper, we extract the references. Since ML papers can cite papers outside the filtered subset, those cited papers must be added as well; otherwise the graph would contain edges pointing to papers that are absent from the dataset.

3. **Iteratively Add Missing Papers**: We add the missing referenced papers to our dataset, then scan the newly added papers for references of their own. This repeats until no more missing papers are found, so every reference is included even if it was not initially matched by an ML-related keyword.

4. **Save the Final Dataset**: The final dataset, which now includes all ML-related papers and their references, is saved as a CSV file for further analysis.
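
The steps above amount to a keyword filter followed by a closure over the reference links. Here is a minimal sketch on a toy corpus, using plain dictionaries and sets instead of DataFrames. The function name `close_over_references` and the four example papers are hypothetical.

```python
def close_over_references(seed_ids, references_by_id):
    """Return seed_ids plus every paper reachable by following references."""
    selected = set(seed_ids)
    frontier = set(seed_ids)  # papers whose references have not been scanned yet
    while frontier:
        cited = set()
        for paper_id in frontier:
            cited.update(references_by_id.get(paper_id, []))
        frontier = cited - selected  # keep only papers not seen in an earlier round
        selected |= frontier
    return selected

# Hypothetical toy corpus: id -> (keywords, referenced ids)
papers = {
    1: ("machine learning", [2]),
    2: ("databases", [3]),
    3: ("operating systems", []),
    4: ("computer graphics", [3]),
}

# Step 1: keyword filter (case-insensitive substring match)
keywords = ("machine learning", "deep learning")
seeds = [pid for pid, (kw, _) in papers.items()
         if any(k in kw.lower() for k in keywords)]

# Steps 2-3: follow references until no new papers appear
refs = {pid: cited for pid, (_, cited) in papers.items()}
result = close_over_references(seeds, refs)
print(sorted(result))  # [1, 2, 3]
```

Paper 4 is left out because nothing in the selected set cites it. Each round scans only the papers added in the previous round, so the loop ends once a round adds nothing new.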


In [2]:
import pandas as pd
df_all = pd.read_csv('data/dblp.csv')

df_ml = df_all[df_all['keyword'].str.contains('statistics|machine learning|deep learning|artificial intelligence|neural networks', case=False, na=False)]

from tqdm import tqdm

missing_paper_n = 99999999  # sentinel so the loop body runs at least once
selected_paper_ids = df_ml['id'].values

df_added = df_ml  # papers whose references still need to be scanned

while missing_paper_n > 0:
    references = []
    missing_paper_ids = []

    for i in tqdm(df_added.iterrows()):
        row = i[1]
        for ref in str(row['references']).split(';'):
            if ref == 'nan' or not ref or ref == 'None':
                continue
            ref = int(ref)
            references.append(ref)
 
    references = list(set(references))

    missing_paper_ids = list(set(references) - set(selected_paper_ids))
    selected_paper_ids = list(set(selected_paper_ids).union(set(references)))
    missing_paper_n = len(missing_paper_ids)
    # add the missing papers to the dataframe
    df_ml = pd.concat([df_ml, df_all[df_all['id'].isin(missing_paper_ids)]])
    df_added = df_all[df_all['id'].isin(missing_paper_ids)]

    print('Missing paper ids:', len(missing_paper_ids))
    print('Added paper ids:', len(df_added))


len(df_ml)

df_ml.to_csv('data/dblp_ml.csv')

1283069it [00:16, 76414.51it/s]


Missing paper ids: 829088
Added paper ids: 829088


829088it [00:11, 72839.25it/s]


Missing paper ids: 594257
Added paper ids: 594257


594257it [00:07, 75440.91it/s]


Missing paper ids: 227768
Added paper ids: 227768


227768it [00:03, 74131.26it/s]


Missing paper ids: 67088
Added paper ids: 67088


67088it [00:00, 73957.94it/s]


Missing paper ids: 18977
Added paper ids: 18977


18977it [00:00, 75035.83it/s]


Missing paper ids: 5139
Added paper ids: 5139


5139it [00:00, 72674.00it/s]


Missing paper ids: 1563
Added paper ids: 1563


1563it [00:00, 71090.67it/s]


Missing paper ids: 527
Added paper ids: 527


527it [00:00, 7690.29it/s]


Missing paper ids: 169
Added paper ids: 169


169it [00:00, 63629.93it/s]


Missing paper ids: 71
Added paper ids: 71


71it [00:00, 55808.77it/s]


Missing paper ids: 40
Added paper ids: 40


40it [00:00, 61840.09it/s]


Missing paper ids: 20
Added paper ids: 20


20it [00:00, 49490.31it/s]


Missing paper ids: 2
Added paper ids: 2


2it [00:00, 12905.55it/s]


Missing paper ids: 2
Added paper ids: 2


2it [00:00, 13336.42it/s]


Missing paper ids: 0
Added paper ids: 0


## Extracting Additional Information
For each selected paper, we extract additional details: the paper’s ID, title, year, authors, and venue name.
We then save this information to a new CSV file for later analysis.
The graph itself is already memory-heavy, so it is better to keep this information separate rather than attach it to the Graph object. A join on the node ID retrieves the additional information when needed.
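
As a rough illustration of that lookup, the sketch below indexes a small in-memory stand-in for `data/papers.csv` by paper ID and resolves a node ID to its metadata. The two rows and the helper name `describe` are made up for the example, and a plain dictionary lookup stands in for the join.

```python
import csv
import io

# Hypothetical stand-in for data/papers.csv (same column names as the real file)
papers_csv = io.StringIO(
    "id,title,year,authors,abstract,venue_name\n"
    "1,Paper A,2015,Alice;Bob,,Venue X\n"
    "2,Paper B,2017,Carol,,Venue Y\n"
)

# Index the metadata by paper id so node ids can be resolved on demand
metadata = {int(row["id"]): row for row in csv.DictReader(papers_csv)}

def describe(node_id):
    """Return (title, year, authors) for a graph node, or None if the id is unknown."""
    row = metadata.get(node_id)
    if row is None:
        return None
    return row["title"], int(row["year"]), row["authors"].split(";")

print(describe(1))
```

With this layout the graph nodes only need to carry the integer ID. Titles, authors, and venues stay in the side table until they are needed.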




In [3]:
import ijson
import time
import csv
import numpy as np
from tqdm import tqdm
import pandas as pd

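# Use a set for constant-time membership tests; `in` on a NumPy array scans the whole array on every lookup
id_ml_papers = set(df_ml['id'].tolist())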


start = time.process_time()

count = 0  # number of records read so far

with open('data/dblp.v12.json', "rb") as f, open("data/papers.csv", "w", newline="") as csvfile:
    fieldnames = ['id', 'title', 'year', 'authors', 'abstract', 'venue_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for i, element in enumerate(ijson.items(f, "item")):
        paper = {}
        count += 1
        if element['id'] not in id_ml_papers:
            continue
        paper['id'] = element['id']
        paper['title'] = element['title']

        year = element.get('year')
        paper['year'] = year if year else np.nan

        author = element.get('authors')
        if author:
            author_names = [str(a.get('name', np.nan)) for a in author]
            paper['authors'] = ';'.join(author_names)
        else:
            paper['authors'] = np.nan

        venue = element.get('venue', {}).get('raw')
        paper['venue_name'] = venue if venue else np.nan

        writer.writerow(paper)

        if count % 1000000 == 0:
            print(f"{count}:{round((time.process_time() - start), 2)}s ", end="")

print(f"Finished processing {count} papers.")

2000000:1335.73s Finished processing 4894081 papers.
