# Prepare the CORA Dataset

This is the full CORA dataset, version 1.0.

Below is a link to the "baby" CORA dataset. It is not used here and should match the included `.tsv` files from https://relational.fit.cvut.cz/dataset/CORA.

In [1]:
#!wget http://www.cs.umd.edu/~sen/lbc-proj/data/cora.tgz

## Download Raw Data

Download the full CORA dataset version 1.0 from the original source.

In [2]:
from pathlib import Path

!wget -N http://people.cs.umass.edu/~mccallum/data/cora-classify.tar.gz
!tar --skip-old-files -zxf cora-classify.tar.gz
CORA_PATH = Path('cora')

--2023-02-04 14:41:04--  http://people.cs.umass.edu/~mccallum/data/cora-classify.tar.gz
Resolving people.cs.umass.edu (people.cs.umass.edu)... 128.119.240.99
Connecting to people.cs.umass.edu (people.cs.umass.edu)|128.119.240.99|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://people.cs.umass.edu/~mccallum/data/cora-classify.tar.gz [following]
--2023-02-04 14:41:05--  https://people.cs.umass.edu/~mccallum/data/cora-classify.tar.gz
Connecting to people.cs.umass.edu (people.cs.umass.edu)|128.119.240.99|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 264768650 (253M) [application/x-gzip]
Saving to: ‘cora-classify.tar.gz’


2023-02-04 14:41:23 (14.2 MB/s) - ‘cora-classify.tar.gz’ saved [264768650/264768650]



## Process `papers`

Find every field tag that appears in at least one paper's reference descriptor.

In [3]:
import re

# Collect every closing tag name that appears anywhere in the file
tag_set = set()

with open(CORA_PATH / "papers", 'r') as file:
    for line in file:
        # Use a regular expression to find all closing tags in the line
        tags = re.findall(r'</(.*?)>', line)
        tag_set |= set(tags)

print("Description fields in at least one paper's descriptor:")
print(tag_set)

# Make a reasonable ordering
all_tags = ['author', 'title', 'type', 'institution', 'booktitle', 'publisher', 'editor', 'address', 'journal', 'volume', 'pages', 'month', 'year', 'note']

assert set(all_tags) == set(tag_set)

Description fields in at least one paper's descriptor:
{'address', 'year', 'publisher', 'institution', 'editor', 'journal', 'note', 'volume', 'type', 'booktitle', 'month', 'pages', 'title', 'author'}
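
As a small illustration of what the regular expression above picks up, here is a minimal sketch run on a single made-up descriptor. The sample line is an assumption about the `<tag> value </tag>` layout that the pattern implies; it is not copied from the `papers` file.

```python
import re

# Hypothetical descriptor in the "<tag> value </tag>" layout (not a real entry)
line = "[Gar] <author> M.R. Garey </author> <title> Some Title </title> <year> 1979 </year>"

# The pattern captures the name inside each closing tag, e.g. "</author>" -> "author"
closing_tags = re.findall(r'</(.*?)>', line)
print(closing_tags)
```

Because only closing tags are matched, each field is counted once per occurrence, and taking the union over all lines gives the full set of field names.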


Process the entries of `papers` one line at a time. Lines that do not have exactly three tab-separated fields are skipped, as are consecutive lines that repeat the previous id.

In [33]:
%%time
import pandas as pd
import string

# Collect one dict per paper and build the dataframe once at the end.
# DataFrame.append was removed in pandas 2.0, and appending row by row is slow.
rows = []

missing_val = ""

last_id = None
with open(CORA_PATH / "papers", 'r') as file:
    # Loop through each line in the file
    for i, line in enumerate(file):
        parts = line.strip().split("\t")

        # Many of the entries only have 2 parts. Ignore these
        if len(parts) != 3:
            continue

        paper_id = parts[0]
        # Skip consecutive lines that repeat the previous id
        if paper_id == last_id:
            continue
        last_id = paper_id

        filename = parts[1]

        # The first group matches a citation key, e.g. [B & G], if present
        m = re.match(r'\[(.+)\] (.+)', parts[2])

        if m is None:
            reference = missing_val
            tagged_list = parts[2]
        else:
            reference = m.group(1)
            tagged_list = m.group(2)

        try:
            # Split "<tag> value </tag> ..." into {tag: value} pairs
            details = tagged_list.split("<")[1:]
            details = [x.split(">") for x in details]
            details = {x[0]: x[1].strip() for x in details}
        except IndexError as e:
            print(f"Error {e} on entry {i}. Parts {parts}")
            raise e

        row = {'id': paper_id, 'filename': filename, 'reference': reference}

        for tag in all_tags:
            row[tag] = details.get(tag, missing_val).rstrip(".,")

        # Additional clean up for the year field
        row['year'] = row['year'].strip("[()];:").rstrip(string.ascii_letters).rstrip("(),.")

        rows.append(row)

# Build the dataframe from the collected rows
df = pd.DataFrame(rows, columns=['id', 'filename', 'reference'] + all_tags)
df.set_index('id', inplace=True)
# Print the dataframe
display(df)

id,filename,reference,author,title,type,institution,booktitle,publisher,editor,address,journal,volume,pages,month,year,note
2,http:##dimacs.rutgers.edu#techps#1994#94-07.ps,Gar,M.R. Garey & D.S. Johnson,Computers and Intractibility: A Guide to the T...,,,,Freeman,,New York,,,,,1979,
16,http:##www.cs.wisc.edu#~fischer#ftp#pub#tech-r...,DeWitt90,"D. DeWitt, P. Futtersack, D. Maier, F. Velez","""A Study of Three Alternative Workstation-Serv...",,,Proceedings of the 16th International Conferec...,,,"Brisbane, Australia",,,,August,1990,
18,ftp:##ftp.cs.purdue.edu#pub#hosking#papers#oop...,Hoski93a,"A. Hosking, J. E. B. Moss","""Object Fault Handling for Persistent Programm...",,,Proceedings of the 16th International Conferec...,,,,,,pp. 288-303,,1993,
20,http:##www.pmg.lcs.mit.edu#papers#dist-mgmt.ps.gz,Liskov93,"Liskov B., Day M., Shrira L",Distributed Object Management in Thor,,,Distributed Object Management,,In M. Tamer Ozsu and Umesh Dayal and Patrick V...,"San Mateo, California",,,,,1993,
22,http:##www.pmg.lcs.mit.edu#papers#osdi94-opplo...,Otoole94,"J. O\'Toole, L. Shrira","""Opportunistic Log: Efficient Installation Rea...",,,USENIX Symposium on Operating Systems Design a...,,,,,,pp. 39-48,November,1994,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1102114,http:##wwwpub.utdallas.edu#~herve#abdi.josa.ps,29,"Valentin, D. and Abdi, H","""Can a linear autoassociator recognize faces f...",,,,,,,Journal of the Optical Society of America A 13,,717-724,,1996,
1102216,www.cs.bilkent.edu.tr#~oulusoy#jss2.ps.Z,,"Cetintemel U., Zimmermann J., Ulusoy O., and B...",OBJECTIVE: A Benchmark for Object-Oriented Act...,Technical Report BU-CEIS-9610,"Bilkent University, Ankara, Turkey",,,,,,,,,1996,
1102254,http:##www.ri.cmu.edu#afs#cs#user#kseymore#htm...,3,S. F. Chen et al,Topic Adaptation for Language Modeling Using U...,,,in Proc. ICASSP\'98,,,,,Vol. 2,pp. 681-684,May 12-15,1998,
1102262,http:##www.cs.jhu.edu#~junwu#topic-lm.ps,11,S. Khudanpur and J. Wu,"""A Maximum Entropy Language Model to Integrate...",,,Proceedings of ICASSP\'99,,,,,,pp. 553-556,,,


CPU times: user 2min 57s, sys: 247 ms, total: 2min 57s
Wall time: 3min 48s
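
To make the per-line parsing easier to follow, here is a minimal sketch that applies the same steps to one line. The function name `parse_entry`, the sample line, and its filename are hypothetical and only serve as illustration. The sketch assumes that `tags` includes `'year'`.

```python
import re
import string

def parse_entry(line, tags):
    """Parse one tab-separated line into a dict, or return None if it is malformed."""
    parts = line.strip().split("\t")
    if len(parts) != 3:
        return None
    paper_id, filename, descriptor = parts

    # Optional leading citation key in square brackets, e.g. "[Gar] <author> ..."
    m = re.match(r'\[(.+)\] (.+)', descriptor)
    reference, tagged = (m.group(1), m.group(2)) if m else ("", descriptor)

    # Splitting on "<" yields chunks such as "author> value " and "/author> ";
    # closing-tag chunks end up under keys like "/author" and are never looked up.
    details = {}
    for chunk in tagged.split("<")[1:]:
        name, _, value = chunk.partition(">")
        details[name] = value.strip()

    row = {"id": paper_id, "filename": filename, "reference": reference}
    for tag in tags:
        row[tag] = details.get(tag, "").rstrip(".,")

    # Same year clean-up as in the cell above (assumes "year" is one of the tags)
    row["year"] = row["year"].strip("[()];:").rstrip(string.ascii_letters).rstrip("(),.")
    return row

# Made-up sample line, not taken from the dataset
sample = ("2\thttp:##example.org#paper.ps\t[Gar] <author> M.R. Garey & D.S. Johnson. </author> "
          "<title> Computers and Intractibility, </title> <year> (1979). </year>")
row = parse_entry(sample, ["author", "title", "year"])
print(row)
```

Trailing punctuation is stripped from every field, and the year additionally loses surrounding brackets and any trailing letters, so `(1979).` becomes `1979`.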


In [44]:
years = df['year']
print(f'There are {len(years)} entries with unique paper ids.')
W = [y for y in years if y == ""]
print(f'There are {len(W)} entries missing the publication year.')

def is_valid(year):
    return year.isdigit() and int(year) <= 2023

print(f'There are {sum(years.apply(is_valid))} valid entries.')

bad = [y for y in years if y != "" and not is_valid(y)]
assert len(years) == len(W) + len(bad) + sum(years.apply(is_valid))
print(f'There are {len(bad)} entries with badly formed/invalid year fields:')
print(bad)

# Make the index (id) and year entries numeric, keeping only rows with a valid year
df.index = pd.to_numeric(df.index)
papers_df = df.loc[df.year.apply(is_valid)].assign(year=lambda d: pd.to_numeric(d.year))

There are 19396 entries with unique paper ids.
There are 2957 entries missing the publication year.
There are 16420 valid entries.
There are 19 entries with badly formed/invalid year fields:
['207216', '1988/89', '1996, 1997', '1991, 1991', '1996. 1996', '19(1996', '191203', '203212', '1994, 1994', 'Oct.1994', '1994), 1994', '2034', '1997, 1997', '1987. ftp://ftp.cs.ruu.nl/pub/RUU/CS/techreps/CS-1986/1986-16.ps', '1994, pp.1901-1905', '1995. ftp://cse.ogi.edu/pub/tech-reports/1995/95-010.ps', '1995, 1995', '1998?', '807-815,1998']
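
The three-way split into missing, valid, and badly formed years can be checked on a handful of values. This is a minimal sketch on a toy list; the values are chosen for illustration (some mirror the malformed entries printed above), and the `latest` parameter is an added assumption that defaults to the 2023 cutoff used in the cell.

```python
def is_valid(year, latest=2023):
    """A year is valid if it is all digits and not later than `latest`."""
    return year.isdigit() and int(year) <= latest

# Toy values for illustration only
sample = ["1979", "", "1988/89", "2034", "1998?", "1996"]

missing = [y for y in sample if y == ""]
valid = [y for y in sample if is_valid(y)]
bad = [y for y in sample if y != "" and not is_valid(y)]

# Every entry falls into exactly one of the three groups
assert len(sample) == len(missing) + len(valid) + len(bad)
print(missing, valid, bad)
```

Note that the check has no lower bound, so a short digit string such as `13` would also count as valid.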


In [6]:
import plotly.express as px
import plotly.io as pio
# fixes blank plots in jupyter lab
pio.renderers.default = "iframe"

fig = px.histogram(papers_df, x="year", title="Publication Year")
fig.show()

## Process `citations`

In [7]:
citations_df = pd.read_csv(CORA_PATH / 'citations', header=None, names=['referring_id', 'cited_id'], sep='\t')
citations_df

,referring_id,cited_id
0,172005,0
1,172005,1
2,172005,2
3,172005,3
4,172005,4
...,...,...
714261,1102288,1102284
714262,1102288,37258
714263,1102288,66922
714264,1102288,1102301


## Process `classifications`

In [26]:
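# skipfooter is only supported by the Python parser engine, so request it explicitly
classifications_df = pd.read_csv(CORA_PATH / 'classifications', sep='\t', header=None, names=['filename', 'full_classification'], skipfooter=1, engine='python')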
classifications_df





,filename,full_classification
0,http:##www.isi.edu#sims#papers#94-sims-agents.ps,/Information_Retrieval/Retrieval/
1,http:##www.cis.ohio-state.edu#~ren#tois.ps,/Information_Retrieval/Retrieval/
2,ftp:##ftp.cs.umass.edu#pub#techrept#techreport...,/Information_Retrieval/Retrieval/
3,http:##www.cs.cmu.edu#afs#cs#user#alex#docs#id...,/Information_Retrieval/Retrieval/
4,http:##www.ri.cmu.edu#afs#cs#user#alex#docs#id...,/Information_Retrieval/Retrieval/
...,...,...
30782,http:##zen.efs.mq.edu.au:80#~akozek#GAMBL.ps,/Artificial_Intelligence/Machine_Learning/Theory/
30783,http:##zen.efs.mq.edu.au:80#~akozek#NoLoEss.ps,/Artificial_Intelligence/Machine_Learning/Prob...
30784,http:##zen.efs.mq.edu.au:80#~akozek#mdkl.ps,/Artificial_Intelligence/Machine_Learning/Prob...
30785,http:##zen.efs.mq.edu.au:80#~akozek#nwsl.ps,/Artificial_Intelligence/Machine_Learning/Prob...


In [27]:
classes = classifications_df.loc[classifications_df.filename == 'keywords'].full_classification
print(f"There are {len(set(classes))} hierarchical classes:\n")
set(classes)

There are 64 hierarchical classes:



{'/Artificial_Intelligence/Agents/',
 '/Artificial_Intelligence/Data_Mining/',
 '/Artificial_Intelligence/Expert_Systems/',
 '/Artificial_Intelligence/Games_and_Search/',
 '/Artificial_Intelligence/Knowledge_Representation/',
 '/Artificial_Intelligence/Machine_Learning/Case-Based/',
 '/Artificial_Intelligence/Machine_Learning/Genetic_Algorithms/',
 '/Artificial_Intelligence/Machine_Learning/Neural_Networks/',
 '/Artificial_Intelligence/Machine_Learning/Probabilistic_Methods/',
 '/Artificial_Intelligence/Machine_Learning/Reinforcement_Learning/',
 '/Artificial_Intelligence/Machine_Learning/Rule_Learning/',
 '/Artificial_Intelligence/Machine_Learning/Theory/',
 '/Artificial_Intelligence/NLP/',
 '/Artificial_Intelligence/Planning/',
 '/Artificial_Intelligence/Robotics/',
 '/Artificial_Intelligence/Speech/',
 '/Artificial_Intelligence/Theorem_Proving/',
 '/Artificial_Intelligence/Vision_and_Pattern_Recognition/',
 '/Data_Structures__Algorithms_and_Theory/Computational_Complexity/',
 '/Data

In [28]:
top_level_classes = {re.match(r'^\/([^\/]*)', full).group(1) for full in set(classes)}
top_level_classes

{'Artificial_Intelligence',
 'Data_Structures__Algorithms_and_Theory',
 'Databases',
 'Encryption_and_Compression',
 'Hardware_and_Architecture',
 'Human_Computer_Interaction',
 'Information_Retrieval',
 'Networking',
 'Operating_Systems',
 'Programming'}
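
For reference, a minimal sketch of how the top-level class is pulled out of one hierarchical path. The regular expression is equivalent to the one in the cell above (the backslashes before `/` are optional in Python), and the `split` variant is an alternative shown only for comparison.

```python
import re

full = "/Artificial_Intelligence/Machine_Learning/Theory/"

# Capture everything between the leading "/" and the next "/"
top = re.match(r'^/([^/]*)', full).group(1)

# Alternative: strip the outer slashes and split into all hierarchy levels
levels = full.strip("/").split("/")

print(top)
print(levels)
```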

In [32]:
classifications_df = classifications_df.assign(top_level_class = lambda df: df.full_classification.apply(lambda full: re.match(r'^\/([^\/]*)', full).group(1)))
classifications_df

,filename,full_classification,top_level_class
0,http:##www.isi.edu#sims#papers#94-sims-agents.ps,/Information_Retrieval/Retrieval/,Information_Retrieval
1,http:##www.cis.ohio-state.edu#~ren#tois.ps,/Information_Retrieval/Retrieval/,Information_Retrieval
2,ftp:##ftp.cs.umass.edu#pub#techrept#techreport...,/Information_Retrieval/Retrieval/,Information_Retrieval
3,http:##www.cs.cmu.edu#afs#cs#user#alex#docs#id...,/Information_Retrieval/Retrieval/,Information_Retrieval
4,http:##www.ri.cmu.edu#afs#cs#user#alex#docs#id...,/Information_Retrieval/Retrieval/,Information_Retrieval
...,...,...,...
30782,http:##zen.efs.mq.edu.au:80#~akozek#GAMBL.ps,/Artificial_Intelligence/Machine_Learning/Theory/,Artificial_Intelligence
30783,http:##zen.efs.mq.edu.au:80#~akozek#NoLoEss.ps,/Artificial_Intelligence/Machine_Learning/Prob...,Artificial_Intelligence
30784,http:##zen.efs.mq.edu.au:80#~akozek#mdkl.ps,/Artificial_Intelligence/Machine_Learning/Prob...,Artificial_Intelligence
30785,http:##zen.efs.mq.edu.au:80#~akozek#nwsl.ps,/Artificial_Intelligence/Machine_Learning/Prob...,Artificial_Intelligence


## Create Temporal Graph

### Add publication years to citation table

In [56]:
m = (
    citations_df
        .merge(papers_df, left_on='referring_id', right_index=True)[['referring_id', 'cited_id', 'year']]
        .rename(columns={'year': 'referring_year'})
        .merge(papers_df, left_on='cited_id', right_index=True)[['referring_id', 'cited_id', 'referring_year', 'year']]
        .rename(columns={'year': 'cited_year'})
)
m

,referring_id,cited_id,referring_year,cited_year
2,172005,2,1993,1979
168,201,2,1995,1979
275871,31083,2,1996,1979
4284,102884,2,1992,1979
4620,213,2,1992,1979
...,...,...,...,...
713941,1099528,1101626,1995,1993
713742,1101624,1101657,1991,1992
714069,1102114,1101996,1996,1995
714201,1102216,1102216,1996,1996
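
The same lookup can be tried on a toy example. This is a minimal sketch, assuming small stand-in frames named `papers` and `citations`; the ids and years echo a few rows above, and the id `999` is invented to show a citation whose target has no valid year. It uses `DataFrame.join` on a named Series instead of the two merges, which is a swapped-in variant rather than the method used in the cell.

```python
import pandas as pd

# Toy stand-ins for papers_df (indexed by id) and citations_df
papers = pd.DataFrame(
    {"year": [1979, 1995, 1993]},
    index=pd.Index([2, 201, 172005], name="id"),
)
citations = pd.DataFrame(
    {"referring_id": [172005, 201, 201], "cited_id": [2, 2, 999]}
)

years = papers["year"]
temporal = (
    citations
    # Look up the year of the citing paper by its id
    .join(years.rename("referring_year"), on="referring_id", how="inner")
    # Look up the year of the cited paper; rows without a match are dropped
    .join(years.rename("cited_year"), on="cited_id", how="inner")
)
print(temporal)
```

Because both joins are inner joins, only citations where both papers have a valid publication year survive, which is why the result has fewer rows than the citation table.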



In [39]:
papers_df.dtypes

filename       object
reference      object
author         object
title          object
type           object
institution    object
booktitle      object
publisher      object
editor         object
address        object
journal        object
volume         object
pages          object
month          object
year            int64
note           object
dtype: object

In [45]:
papers_df.index

Int64Index([      2,      16,      18,      20,      22,      25,      26,
                 35,      40,      51,
            ...
            1101293, 1101600, 1101624, 1101626, 1101657, 1101996, 1102114,
            1102216, 1102254, 1102288],
           dtype='int64', name='id', length=16420)