# Prepare the CORA Dataset

This is the full CORA dataset, version 1.0.

The commented-out command below links to the "baby" CORA dataset. It is not used here, and should be the same as the included `.tsv` files from https://relational.fit.cvut.cz/dataset/CORA.

In [1]:
#!wget http://www.cs.umd.edu/~sen/lbc-proj/data/cora.tgz

## Download Raw Data

Download the full CORA dataset version 1.0 from the original source.

In [2]:
from pathlib import Path

!wget -N http://people.cs.umass.edu/~mccallum/data/cora-classify.tar.gz
!tar --skip-old-files -zxf cora-classify.tar.gz
CORA_PATH = Path('cora')

--2023-02-05 15:35:57--  http://people.cs.umass.edu/~mccallum/data/cora-classify.tar.gz
Resolving people.cs.umass.edu (people.cs.umass.edu)... 128.119.240.99
Connecting to people.cs.umass.edu (people.cs.umass.edu)|128.119.240.99|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://people.cs.umass.edu/~mccallum/data/cora-classify.tar.gz [following]
--2023-02-05 15:35:57--  https://people.cs.umass.edu/~mccallum/data/cora-classify.tar.gz
Connecting to people.cs.umass.edu (people.cs.umass.edu)|128.119.240.99|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 264768650 (253M) [application/x-gzip]
Saving to: ‘cora-classify.tar.gz’


2023-02-05 15:36:04 (33.9 MB/s) - ‘cora-classify.tar.gz’ saved [264768650/264768650]



## Process `papers`

Find all of the tagged fields (such as `author` or `title`) that occur in at least one reference descriptor.

In [3]:
import re

# Collect every tag name that appears as a closing tag, e.g. </author>
tag_set = set()

with open(CORA_PATH / "papers", 'r') as file:
    for line in file:
        # Use a regular expression to find all closing tags in the line
        tags = re.findall(r'</(.*?)>', line)
        tag_set |= set(tags)

print("Description fields in at least one paper's descriptor:")
print(tag_set)

# Make a reasonable ordering
all_tags = ['author', 'title', 'type', 'institution', 'booktitle', 'publisher', 'editor', 'address', 'journal', 'volume', 'pages', 'month', 'year', 'note']

assert set(all_tags) == set(tag_set)

Description fields in at least one paper's descriptor:
{'type', 'editor', 'author', 'booktitle', 'address', 'institution', 'journal', 'pages', 'year', 'volume', 'note', 'month', 'title', 'publisher'}


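Process the entries of `papers` one line at a time. Lines that do not split into exactly three tab-separated parts are skipped. The check that would skip repeated ids is commented out, so the same id can appear on several rows, one per filename.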

In [4]:
%%time
import pandas as pd
import string

# Initialize an empty dataframe
df = pd.DataFrame(columns=['id', 'filename', 'reference'] + all_tags) 

missing_val = ""

last_id = None
with open(CORA_PATH / "papers", 'r') as file:
    # Loop through each line in the file
    for i, line in enumerate(file):
        parts = line.strip().split("\t")
        
        # Many of the entries only have 2 parts. Ignore these
        if len(parts) != 3:
            continue
        
        id = parts[0]
        # Skip repeated ids
        # if id == last_id:
        #     continue
        last_id = id
        
        filename = parts[1]
        
        # the first group matches a citation, e.g [B & G] if present
        try:
            m = re.match(r'\[(.+)\] (.+)', parts[2])
        except IndexError as e:
            print(f"Error {e} on entry {i}. Parts {parts}")
            raise e

        if m is None:
            reference = missing_val
            tagged_list = parts[2]
        else:
            reference = m.group(1)
            tagged_list = m.group(2)
        
        try:
            #details = parts[2].split("]")[1].strip().split("<")[1:]
            details = tagged_list.split("<")[1:]
            details = [x.split(">") for x in details]
            details = {x[0]: x[1].strip() for x in details}
        except IndexError as e:
            print(f"Error {e} on entry {i}. Parts {parts}")
            raise e
        
        row = {'id': id, 'filename': filename, 'reference': reference}
        
        # Fill every known tag, leaving the outer loop's line counter `i` untouched
        for tag in all_tags:
            row[tag] = details.get(tag, missing_val).rstrip(".,")
            
        # additional clean up for year field
        row['year'] = row['year'].strip("[()];:").rstrip(string.ascii_letters).rstrip("(),.")
        
        # Add the values to the dataframe (DataFrame.append was removed in pandas 2.0)
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

#df.set_index('id', inplace=True)
# Print the dataframe
display(df)

Unnamed: 0,id,filename,reference,author,title,type,institution,booktitle,publisher,editor,address,journal,volume,pages,month,year,note
0,2,http:##dimacs.rutgers.edu#techps#1994#94-07.ps,Gar,M.R. Garey & D.S. Johnson,Computers and Intractibility: A Guide to the T...,,,,Freeman,,New York,,,,,1979,
1,16,http:##www.cs.wisc.edu#~fischer#ftp#pub#tech-r...,DeWitt90,"D. DeWitt, P. Futtersack, D. Maier, F. Velez","""A Study of Three Alternative Workstation-Serv...",,,Proceedings of the 16th International Conferec...,,,"Brisbane, Australia",,,,August,1990,
2,18,ftp:##ftp.cs.purdue.edu#pub#hosking#papers#oop...,Hoski93a,"A. Hosking, J. E. B. Moss","""Object Fault Handling for Persistent Programm...",,,Proceedings of the 16th International Conferec...,,,,,,pp. 288-303,,1993,
3,18,ftp:##ftp.cs.umass.edu#pub#osl#papers#oopsla93...,Hoski93a,"A. Hosking, J. E. B. Moss","""Object Fault Handling for Persistent Programm...",,,Proceedings of the 16th International Conferec...,,,,,,pp. 288-303,,1993,
4,18,http:##cobar.cs.umass.edu#pubfiles#ds7.ps.gz,Hoski93a,"A. Hosking, J. E. B. Moss","""Object Fault Handling for Persistent Programm...",,,Proceedings of the 16th International Conferec...,,,,,,pp. 288-303,,1993,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31359,1102114,http:##wwwpub.utdallas.edu#~herve#abdi.josa.ps,29,"Valentin, D. and Abdi, H","""Can a linear autoassociator recognize faces f...",,,,,,,Journal of the Optical Society of America A 13,,717-724,,1996,
31360,1102216,www.cs.bilkent.edu.tr#~oulusoy#jss2.ps.Z,,"Cetintemel U., Zimmermann J., Ulusoy O., and B...",OBJECTIVE: A Benchmark for Object-Oriented Act...,Technical Report BU-CEIS-9610,"Bilkent University, Ankara, Turkey",,,,,,,,,1996,
31361,1102254,http:##www.ri.cmu.edu#afs#cs#user#kseymore#htm...,3,S. F. Chen et al,Topic Adaptation for Language Modeling Using U...,,,in Proc. ICASSP\'98,,,,,Vol. 2,pp. 681-684,May 12-15,1998,
31362,1102262,http:##www.cs.jhu.edu#~junwu#topic-lm.ps,11,S. Khudanpur and J. Wu,"""A Maximum Entropy Language Model to Integrate...",,,Proceedings of ICASSP\'99,,,,,,pp. 553-556,,,


CPU times: user 4min 5s, sys: 538 ms, total: 4min 6s
Wall time: 4min 6s
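
To make the parsing step above easier to follow, here is a minimal sketch that applies the same regular expression and split logic to a single reference field. The helper name `parse_reference` and the sample string are made up for illustration; the exact spacing of real entries in `papers` may differ.

```python
import re

def parse_reference(field, missing_val=""):
    """Split '[key] <tag> value </tag> ...' into (key, {tag: value}).

    Hypothetical helper mirroring the loop body above.
    """
    m = re.match(r'\[(.+)\] (.+)', field)
    if m is None:
        key, tagged = missing_val, field
    else:
        key, tagged = m.group(1), m.group(2)
    # "<author> X </author>" splits into ["author", " X "] and ["/author", " "]
    pieces = [p.split(">") for p in tagged.split("<")[1:]]
    details = {p[0]: p[1].strip() for p in pieces}
    return key, details

# Made-up sample in the style of the `papers` file
sample = "[Gar] <author> M.R. Garey & D.S. Johnson, </author> <year> 1979. </year>"
key, details = parse_reference(sample)
print(key, details["author"].rstrip(".,"), details["year"].rstrip(".,"))
```

Closing tags such as `/author` also end up as keys with empty values. That is harmless here because values are only looked up by the names in `all_tags`.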


In [5]:
years = df['year']
print(f'There are {len(years)} entries with unique paper ids.')
W = [y for y in years if y == ""]
print(f'There are {len(W)} entries missing the publication year.')

def is_valid(year):
    return year.isdigit() and int(year) <= 2023

print(f'There are {sum(years.apply(is_valid))} valid entries.')

bad = [y for y in years if y != "" and not is_valid(y)]
assert len(years) == len(W) + len(bad) + sum(years.apply(is_valid))
print(f'There are {len(bad)} entries with badly formed/invalid year fields:')
print(bad)

# Replace bad years with None, and convert id to numeric
papers_df = (
    df
        .assign(year = lambda df: df.year.apply(lambda x: x if is_valid(x) else None))
        .assign(id = lambda df: df.id.apply(pd.to_numeric))
)
#papers_df = df.loc[df.year.apply(is_valid)].assign(year = lambda df: df.year.apply(pd.to_numeric))

There are 31364 entries with unique paper ids.
There are 4595 entries missing the publication year.
There are 26719 valid entries.
There are 50 entries with badly formed/invalid year fields:
['207216', '207216', '207216', '207216', '207216', '207216', '207216', '207216', '1988/89', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1996, 1997', '1991, 1991', '1996. 1996', '19(1996', '19(1996', '19(1996', '191203', '203212', '1994, 1994', 'Oct.1994', '1994), 1994', '2034', '1997, 1997', '1997, 1997', '1987. ftp://ftp.cs.ruu.nl/pub/RUU/CS/techreps/CS-1986/1986-16.ps', '1994, pp.1901-1905', '1994, pp.1901-1905', '1995. ftp://cse.ogi.edu/pub/tech-reports/1995/95-010.ps', '1995, 1995', '1995, 1995', '1995, 1995', '1998?', '1998?', '807-815,1998', '807-815,1998']
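
The year clean-up and the validity check can be tried on a few strings in isolation. The sketch below reuses the same strip chain as the processing loop; the function name `clean_year` and the `latest` parameter are introduced here only for illustration.

```python
import string

def clean_year(raw):
    # Same chain as in the processing loop: strip brackets and punctuation
    # from both ends, then trailing letters, then trailing punctuation
    return raw.strip("[()];:").rstrip(string.ascii_letters).rstrip("(),.")

def is_valid(year, latest=2023):
    # A year is accepted if it is all digits and not later than `latest`
    return year.isdigit() and int(year) <= latest

for raw in ["(1996)", "1993a", "1994).", "1998?", "1988/89", "2034"]:
    cleaned = clean_year(raw)
    print(repr(raw), "->", repr(cleaned), is_valid(cleaned))
```

Strings such as `1998?` and `1988/89` pass through the clean-up unchanged and fail `is_valid`, which is why they show up in the list of badly formed years.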


In [6]:
papers_df

Unnamed: 0,id,filename,reference,author,title,type,institution,booktitle,publisher,editor,address,journal,volume,pages,month,year,note
0,2,http:##dimacs.rutgers.edu#techps#1994#94-07.ps,Gar,M.R. Garey & D.S. Johnson,Computers and Intractibility: A Guide to the T...,,,,Freeman,,New York,,,,,1979,
1,16,http:##www.cs.wisc.edu#~fischer#ftp#pub#tech-r...,DeWitt90,"D. DeWitt, P. Futtersack, D. Maier, F. Velez","""A Study of Three Alternative Workstation-Serv...",,,Proceedings of the 16th International Conferec...,,,"Brisbane, Australia",,,,August,1990,
2,18,ftp:##ftp.cs.purdue.edu#pub#hosking#papers#oop...,Hoski93a,"A. Hosking, J. E. B. Moss","""Object Fault Handling for Persistent Programm...",,,Proceedings of the 16th International Conferec...,,,,,,pp. 288-303,,1993,
3,18,ftp:##ftp.cs.umass.edu#pub#osl#papers#oopsla93...,Hoski93a,"A. Hosking, J. E. B. Moss","""Object Fault Handling for Persistent Programm...",,,Proceedings of the 16th International Conferec...,,,,,,pp. 288-303,,1993,
4,18,http:##cobar.cs.umass.edu#pubfiles#ds7.ps.gz,Hoski93a,"A. Hosking, J. E. B. Moss","""Object Fault Handling for Persistent Programm...",,,Proceedings of the 16th International Conferec...,,,,,,pp. 288-303,,1993,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31359,1102114,http:##wwwpub.utdallas.edu#~herve#abdi.josa.ps,29,"Valentin, D. and Abdi, H","""Can a linear autoassociator recognize faces f...",,,,,,,Journal of the Optical Society of America A 13,,717-724,,1996,
31360,1102216,www.cs.bilkent.edu.tr#~oulusoy#jss2.ps.Z,,"Cetintemel U., Zimmermann J., Ulusoy O., and B...",OBJECTIVE: A Benchmark for Object-Oriented Act...,Technical Report BU-CEIS-9610,"Bilkent University, Ankara, Turkey",,,,,,,,,1996,
31361,1102254,http:##www.ri.cmu.edu#afs#cs#user#kseymore#htm...,3,S. F. Chen et al,Topic Adaptation for Language Modeling Using U...,,,in Proc. ICASSP\'98,,,,,Vol. 2,pp. 681-684,May 12-15,1998,
31362,1102262,http:##www.cs.jhu.edu#~junwu#topic-lm.ps,11,S. Khudanpur and J. Wu,"""A Maximum Entropy Language Model to Integrate...",,,Proceedings of ICASSP\'99,,,,,,pp. 553-556,,,


In [7]:
import plotly.express as px
import plotly.io as pio
# fixes blank plots in jupyter lab
pio.renderers.default = "iframe"

fig = px.histogram(papers_df.assign(year = lambda df: df.year.apply(pd.to_numeric)), x="year", title="Publication Year")
fig.show()

## Process `citations`

In [8]:
citations_df = pd.read_csv(CORA_PATH / 'citations', header=None, names=['referring_id', 'cited_id'], sep='\t')
citations_df

Unnamed: 0,referring_id,cited_id
0,172005,0
1,172005,1
2,172005,2
3,172005,3
4,172005,4
...,...,...
714261,1102288,1102284
714262,1102288,37258
714263,1102288,66922
714264,1102288,1102301


## Process `classifications`

In [9]:
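# skipfooter is only supported by the python parser engine, so request it explicitly
classifications_df = pd.read_csv(CORA_PATH / 'classifications', sep='\t', header=None, names=['filename', 'full_classification'], skipfooter=1, engine='python')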
classifications_df





Unnamed: 0,filename,full_classification
0,http:##www.isi.edu#sims#papers#94-sims-agents.ps,/Information_Retrieval/Retrieval/
1,http:##www.cis.ohio-state.edu#~ren#tois.ps,/Information_Retrieval/Retrieval/
2,ftp:##ftp.cs.umass.edu#pub#techrept#techreport...,/Information_Retrieval/Retrieval/
3,http:##www.cs.cmu.edu#afs#cs#user#alex#docs#id...,/Information_Retrieval/Retrieval/
4,http:##www.ri.cmu.edu#afs#cs#user#alex#docs#id...,/Information_Retrieval/Retrieval/
...,...,...
30782,http:##zen.efs.mq.edu.au:80#~akozek#GAMBL.ps,/Artificial_Intelligence/Machine_Learning/Theory/
30783,http:##zen.efs.mq.edu.au:80#~akozek#NoLoEss.ps,/Artificial_Intelligence/Machine_Learning/Prob...
30784,http:##zen.efs.mq.edu.au:80#~akozek#mdkl.ps,/Artificial_Intelligence/Machine_Learning/Prob...
30785,http:##zen.efs.mq.edu.au:80#~akozek#nwsl.ps,/Artificial_Intelligence/Machine_Learning/Prob...


In [10]:
classes = classifications_df.loc[classifications_df.filename == 'keywords'].full_classification
print(f"There are {len(set(classes))} hierarchical classes:\n")
set(classes)

There are 64 hierarchical classes:



{'/Artificial_Intelligence/Agents/',
 '/Artificial_Intelligence/Data_Mining/',
 '/Artificial_Intelligence/Expert_Systems/',
 '/Artificial_Intelligence/Games_and_Search/',
 '/Artificial_Intelligence/Knowledge_Representation/',
 '/Artificial_Intelligence/Machine_Learning/Case-Based/',
 '/Artificial_Intelligence/Machine_Learning/Genetic_Algorithms/',
 '/Artificial_Intelligence/Machine_Learning/Neural_Networks/',
 '/Artificial_Intelligence/Machine_Learning/Probabilistic_Methods/',
 '/Artificial_Intelligence/Machine_Learning/Reinforcement_Learning/',
 '/Artificial_Intelligence/Machine_Learning/Rule_Learning/',
 '/Artificial_Intelligence/Machine_Learning/Theory/',
 '/Artificial_Intelligence/NLP/',
 '/Artificial_Intelligence/Planning/',
 '/Artificial_Intelligence/Robotics/',
 '/Artificial_Intelligence/Speech/',
 '/Artificial_Intelligence/Theorem_Proving/',
 '/Artificial_Intelligence/Vision_and_Pattern_Recognition/',
 '/Data_Structures__Algorithms_and_Theory/Computational_Complexity/',
 '/Data

In [11]:
top_level_classes = {re.match(r'^\/([^\/]*)', full).group(1) for full in set(classes)}
top_level_classes

{'Artificial_Intelligence',
 'Data_Structures__Algorithms_and_Theory',
 'Databases',
 'Encryption_and_Compression',
 'Hardware_and_Architecture',
 'Human_Computer_Interaction',
 'Information_Retrieval',
 'Networking',
 'Operating_Systems',
 'Programming'}

In [12]:
classifications_df = classifications_df.assign(top_level_class = lambda df: df.full_classification.apply(lambda full: re.match(r'^\/([^\/]*)', full).group(1)))
classifications_df

Unnamed: 0,filename,full_classification,top_level_class
0,http:##www.isi.edu#sims#papers#94-sims-agents.ps,/Information_Retrieval/Retrieval/,Information_Retrieval
1,http:##www.cis.ohio-state.edu#~ren#tois.ps,/Information_Retrieval/Retrieval/,Information_Retrieval
2,ftp:##ftp.cs.umass.edu#pub#techrept#techreport...,/Information_Retrieval/Retrieval/,Information_Retrieval
3,http:##www.cs.cmu.edu#afs#cs#user#alex#docs#id...,/Information_Retrieval/Retrieval/,Information_Retrieval
4,http:##www.ri.cmu.edu#afs#cs#user#alex#docs#id...,/Information_Retrieval/Retrieval/,Information_Retrieval
...,...,...,...
30782,http:##zen.efs.mq.edu.au:80#~akozek#GAMBL.ps,/Artificial_Intelligence/Machine_Learning/Theory/,Artificial_Intelligence
30783,http:##zen.efs.mq.edu.au:80#~akozek#NoLoEss.ps,/Artificial_Intelligence/Machine_Learning/Prob...,Artificial_Intelligence
30784,http:##zen.efs.mq.edu.au:80#~akozek#mdkl.ps,/Artificial_Intelligence/Machine_Learning/Prob...,Artificial_Intelligence
30785,http:##zen.efs.mq.edu.au:80#~akozek#nwsl.ps,/Artificial_Intelligence/Machine_Learning/Prob...,Artificial_Intelligence
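
The regular expression used above captures everything between the leading `/` and the next `/`. A small sketch of the same extraction, wrapped in a hypothetical helper `top_level` that returns `None` instead of raising when a string does not start with `/`:

```python
import re

def top_level(full_classification):
    # Capture the first path component after the leading slash
    m = re.match(r'^/([^/]*)', full_classification)
    return m.group(1) if m else None

print(top_level('/Artificial_Intelligence/Machine_Learning/Theory/'))
print(top_level('/Databases/Object_Oriented/'))
print(top_level('not-a-class-path'))
```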


## Create Temporal Graph

### Add publication years to citation table

In [13]:
citations_dates = (
    citations_df
        .merge(papers_df, left_on='referring_id', right_index=True)[['referring_id', 'cited_id', 'year']]
        .rename(columns={'year': 'referring_year'})
        .merge(papers_df, left_on='cited_id', right_index=True)[['referring_id', 'cited_id', 'referring_year', 'year']]
        .rename(columns={'year': 'cited_year'})
        .reset_index(drop=True)
)
display(citations_dates)
citations_dates.to_csv("citations_with_dates.csv", index=False)

Unnamed: 0,referring_id,cited_id,referring_year,cited_year
0,9351,12,1995,1990
1,18212,12,1996,1990
2,741,12,1993,1990
3,22,12,1991,1990
4,19535,12,1996,1990
...,...,...,...,...
46436,9003,2830,1994,1994
46437,9003,9009,1994,1996
46438,1925,26272,1993,
46439,21468,12712,1994,
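
Note that `papers_df` still has its default integer index (the `set_index('id')` call earlier is commented out), so `right_index=True` matches citation ids against row positions rather than against the `id` column. The sketch below, on made-up data, contrasts that behaviour with a lookup keyed on `id`. The frame and variable names are hypothetical, and keeping the first year per id is an assumption about how duplicates should be resolved.

```python
import pandas as pd

# Toy stand-ins for papers_df and citations_df (made-up values)
papers = pd.DataFrame({
    "id": [10, 10, 20, 30],
    "year": ["1990", "1990", "1995", None],
})
citations = pd.DataFrame({
    "referring_id": [20, 30, 1],
    "cited_id": [10, 20, 10],
})

# right_index=True compares referring_id with the row labels 0..3,
# so only referring_id == 1 finds a partner (row 1, which is paper id 10)
by_position = citations.merge(papers, left_on="referring_id", right_index=True)

# Keyed on id instead: one year per id, then map both citation columns
year_by_id = papers.drop_duplicates("id").set_index("id")["year"]
by_id = citations.assign(
    referring_year=citations["referring_id"].map(year_by_id),
    cited_year=citations["cited_id"].map(year_by_id),
)
print(by_position)
print(by_id)
```

With the id-keyed lookup every citation row is kept, and ids that are absent from the papers table simply get a missing year.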


### Classify by paper id

Associate classifications with paper ids. Not all filenames have been classified, and ids with no classified filename are dropped by the inner merge.

In [20]:
m = papers_df.merge(classifications_df, how='inner', on='filename')[['id', 'filename', 'full_classification', 'top_level_class']]
m

Unnamed: 0,id,filename,full_classification,top_level_class
0,2,http:##dimacs.rutgers.edu#techps#1994#94-07.ps,/Artificial_Intelligence/Knowledge_Representat...,Artificial_Intelligence
1,16,http:##www.cs.wisc.edu#~fischer#ftp#pub#tech-r...,/Databases/Object_Oriented/,Databases
2,18,http:##cobar.cs.umass.edu#pubfiles#ds7.ps.gz,/Databases/Object_Oriented/,Databases
3,20,http:##www.pmg.lcs.mit.edu#papers#dist-mgmt.ps.gz,/Databases/Object_Oriented/,Databases
4,20,http:##www.pmg.lcs.mit.edu#papers#thor.ps.gz,/Databases/Object_Oriented/,Databases
...,...,...,...,...
17374,1100161,http:##dimacs.rutgers.edu#techps#1993#93-48.ps,/Data_Structures__Algorithms_and_Theory/Comput...,Data_Structures__Algorithms_and_Theory
17375,1100792,ftp:##ftp.cs.umass.edu#pub#ccs#spring#robot_rt...,/Operating_Systems/Realtime/,Operating_Systems
17376,1100866,http:##www.csl.sri.com#~bruno#publis#safefm_ts...,/Artificial_Intelligence/Theorem_Proving/,Artificial_Intelligence
17377,1101196,http:##www.cs.umd.edu#users#traum#Papers#agenc...,/Artificial_Intelligence/Planning/,Artificial_Intelligence
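
To see how many ids an inner merge on `filename` would drop, one option is a left merge with `indicator=True`. This is a sketch on made-up data, not part of the original pipeline; the frame names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for papers_df and classifications_df (made-up values)
papers = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "filename": ["a.ps", "b.ps", "c.ps", "d.ps"],
})
labels = pd.DataFrame({
    "filename": ["a.ps", "c.ps"],
    "top_level_class": ["Databases", "Networking"],
})

# A left merge keeps every paper row; `_merge` records whether a label was found
merged = papers.merge(labels, how="left", on="filename", indicator=True)

classified_ids = set(merged.loc[merged["_merge"] == "both", "id"].tolist())
dropped_ids = set(papers["id"].tolist()) - classified_ids
print("classified:", classified_ids, "dropped:", dropped_ids)
```

An id counts as classified as soon as any one of its filenames has a label, which matches the effect of the inner merge followed by picking one row per id.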


There are repeated ids since the same paper may have multiple filenames. Pick one for each id.

In [21]:
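# Keep the first row per id, indexed by id. Since pandas 2.0, groupby(...).nth(0)
# no longer sets `id` as the index, so select and index explicitly.
classifications = m.drop_duplicates(subset='id').set_index('id').sort_index()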
display(classifications)
classifications.to_csv('classifications.csv')

Unnamed: 0_level_0,filename,full_classification,top_level_class
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,http:##dimacs.rutgers.edu#techps#1994#94-07.ps,/Artificial_Intelligence/Knowledge_Representat...,Artificial_Intelligence
16,http:##www.cs.wisc.edu#~fischer#ftp#pub#tech-r...,/Databases/Object_Oriented/,Databases
18,http:##cobar.cs.umass.edu#pubfiles#ds7.ps.gz,/Databases/Object_Oriented/,Databases
20,http:##www.pmg.lcs.mit.edu#papers#dist-mgmt.ps.gz,/Databases/Object_Oriented/,Databases
22,http:##www.pmg.lcs.mit.edu#papers#osdi94-opplo...,/Databases/Object_Oriented/,Databases
...,...,...,...
1100161,http:##dimacs.rutgers.edu#techps#1993#93-48.ps,/Data_Structures__Algorithms_and_Theory/Comput...,Data_Structures__Algorithms_and_Theory
1100792,ftp:##ftp.cs.umass.edu#pub#ccs#spring#robot_rt...,/Operating_Systems/Realtime/,Operating_Systems
1100866,http:##www.csl.sri.com#~bruno#publis#safefm_ts...,/Artificial_Intelligence/Theorem_Proving/,Artificial_Intelligence
1101196,http:##www.cs.umd.edu#users#traum#Papers#agenc...,/Artificial_Intelligence/Planning/,Artificial_Intelligence
