# Part 3: Formatting Papers for Neo4j Admin Import

This notebook formats the paper nodes and cites relationships into CSV files for Neo4j admin import.

__Note: The runtime of this notebook depends heavily on the environment in which it runs. It takes a few hours to complete on my instance with 64 cores and 976 GB of memory.__

### Chunking Methodology
This notebook splits the papers into chunks to avoid out-of-memory errors when formatting data. The `chunk_size` variable sets the number of papers brought into memory at once. Lower it if the kernel shuts down because it runs out of memory.
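
The chunk boundaries can be sketched as a small generator. This is a minimal illustration, not the loop used further down: `chunk_bounds` is a hypothetical helper, and unlike the notebook's loop it clamps the final bound explicitly instead of relying on slicing to stop at the end of the array.

```python
def chunk_bounds(total_n, chunk_size):
    """Yield (start, stop) index pairs that cover range(total_n) in order."""
    for start in range(0, total_n, chunk_size):
        # Clamp the final chunk so it never runs past the end of the data.
        yield start, min(start + chunk_size, total_n)


# Small numbers for illustration; the notebook uses chunk_size = 20_000_000.
bounds = list(chunk_bounds(total_n=50, chunk_size=20))
print(bounds)  # [(0, 20), (20, 40), (40, 50)]
```

A smaller `chunk_size` produces more, smaller chunks and therefore more output files per run.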

### Reducing Dimensionality with PCA
The 128-component PCA object saved in Part 2 is used here to reduce each encoding vector from 768 to 128 dimensions. This trade-off gives up a small amount of the variance in the original encodings in exchange for shorter vectors that need fewer resources in later steps. It may be worth exploring higher dimensionality in the future.
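
To put a rough number on the saving, the raw size of one chunk of vectors can be estimated before and after reduction. This is a back-of-the-envelope sketch: `vector_bytes` is a hypothetical helper, and the 4 bytes per value is an assumption for illustration, not a statement about how the dataset or Neo4j stores the values.

```python
def vector_bytes(n_vectors, n_dims, bytes_per_value):
    """Raw size in bytes of n_vectors dense vectors with n_dims values each."""
    return n_vectors * n_dims * bytes_per_value


chunk = 20_000_000    # papers per chunk, as set by chunk_size in this notebook
bytes_per_value = 4   # assumption: 32-bit floats, for illustration only

full = vector_bytes(chunk, 768, bytes_per_value)
reduced = vector_bytes(chunk, 128, bytes_per_value)

print(f"768 dims: {full / 1e9:.2f} GB")     # 61.44 GB
print(f"128 dims: {reduced / 1e9:.2f} GB")  # 10.24 GB
print(f"ratio: {full / reduced:.0f}x")      # 6x
```

The ratio is 768 / 128 = 6 regardless of the value width chosen.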

In [1]:
from ogb.lsc import MAG240MDataset
import numpy as np
import os.path as osp
import pandas as pd
import dask.dataframe as dd
from joblib import load

ROOT_DATA_DIR = '/data'
pca_model_file = f'{ROOT_DATA_DIR}/paper-feat-pca128.joblib'
chunk_size = 20_000_000

## Prepare Paper Nodes

In [2]:
def load_and_pre_format_paper_data(dataset, from_ind, to_inx):
    feat_in_memory = dataset.paper_feat[from_ind:to_inx]
    feat_cols = [f'paper_encoding_{i}' for i in range(768)]
    paper_df = pd.DataFrame(feat_in_memory, columns = feat_cols)
    
    paper_df['ogb_index'] = paper_df.index + from_ind
    
    paper_df['paper_subject'] = dataset.all_paper_label[from_ind:to_inx] 
    paper_df['paper_subject'] = paper_df['paper_subject'].fillna(-2)
    
    paper_df['paper_year'] = dataset.all_paper_year[from_ind:to_inx] 
    
    split_dict = dataset.get_idx_split()
    paper_df["split_segment"] = 'REMAINDER'
    paper_df.loc[paper_df.ogb_index.isin(split_dict['train']), 'split_segment'] = 'TRAIN'
    paper_df.loc[paper_df.ogb_index.isin(split_dict['valid']), 'split_segment'] = 'VALIDATE'
    paper_df.loc[paper_df.ogb_index.isin(split_dict['test-dev']), 'split_segment'] = 'TEST_DEV'
    paper_df.loc[paper_df.ogb_index.isin(split_dict['test-challenge']), 'split_segment'] = 'TEST_CHALLENGE'
    
    paper_df['subject_status'] = "ERROR"
    paper_df.loc[paper_df.paper_subject > -1,'subject_status'] = "KNOWN"
    paper_df.loc[paper_df.paper_subject == -1,'subject_status'] = "HIDDEN"
    paper_df.loc[paper_df.paper_subject == -2,'subject_status'] = "UNKNOWN"
    
    return paper_df
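
The status rule at the end of `load_and_pre_format_paper_data` can be read as a plain function of one label value. A minimal sketch, where `subject_status` is a hypothetical helper that the notebook does not use:

```python
import math


def subject_status(label):
    """Map one raw paper label to the status string used in this notebook."""
    # Missing labels (NaN) are filled with -2 before the status is assigned.
    if isinstance(label, float) and math.isnan(label):
        label = -2
    if label > -1:
        return "KNOWN"
    if label == -1:
        return "HIDDEN"
    if label == -2:
        return "UNKNOWN"
    return "ERROR"  # any other value is unexpected


statuses = [subject_status(x) for x in (5.0, 0.0, -1.0, float("nan"), -7.0)]
print(statuses)
# ['KNOWN', 'KNOWN', 'HIDDEN', 'UNKNOWN', 'ERROR']
```

Any row still marked `ERROR` after formatting would point to a label value outside the expected range.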

In [3]:
def reduce_paper_data(paper_df, pca_model_object = pca_model_file):
    feat_cols = [f'paper_encoding_{i}' for i in range(768)]
    feat_128_cols = ['paper_128_encoding_' + str(x) for x in range(128)]
    pca128 = load(pca_model_object)
    res_df = pd.DataFrame(pca128.transform(paper_df[feat_cols]), columns = feat_128_cols)
    res_df = pd.concat([paper_df[["ogb_index", "split_segment", "subject_status", "paper_year", 
                                           "paper_subject"]], res_df], axis=1)
    return res_df

In [4]:
# Future note: rename split_segment -> splitSegment and subject_status -> subjectStatus
# for consistent naming and to work with the later parts, namely export.
def post_format_paper_data(reduced_paper_df, npartitions=2000):
    feat_128_cols = ['paper_128_encoding_' + str(x) for x in range(128)]
    reduced_paper_df.rename(
        columns={"ogb_index":"ogbIndex:ID", "split_segment":"split_segment:string", 
                 "subject_status":"subject_status:string", "paper_year":"year:int", 
                 "paper_subject":"subject:int"}, inplace=True)
    paper_ddf = dd.from_pandas(reduced_paper_df, npartitions=npartitions)
    paper_ddf = paper_ddf.astype({'subject:int':'int32'})
    paper_ddf["encoding:float[]"] = \
    paper_ddf.apply(lambda x:";".join(['%0.5f' % i for i in x[feat_128_cols]]), axis=1,meta=("str"))
    paper_ddf = paper_ddf.drop(columns=feat_128_cols)
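    # No compute() here: its result would be discarded, and to_csv in the main loop
    # already triggers the computation when the chunk is written.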
    return paper_ddf
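
The `encoding:float[]` column holds each 128-value vector as a single string. Array values in an admin import CSV are separated by `;` by default, which is why the lambda above joins on that character. A minimal sketch of the same formatting on a plain list, with `format_encoding` as a hypothetical helper:

```python
def format_encoding(values, precision=5):
    """Join floats into one ';'-separated string with fixed decimal places."""
    return ";".join(f"{v:.{precision}f}" for v in values)


print(format_encoding([0.123456789, -1.5, 2.0]))
# 0.12346;-1.50000;2.00000
```

Five decimal places keep the files smaller than full-precision output at the cost of some rounding.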

In [5]:
dataset = MAG240MDataset(root = ROOT_DATA_DIR)
total_n = dataset.num_papers
done_n = 0
count = 0

while done_n < total_n:
    to_n = done_n + chunk_size
    print("pre-formatting for chunk " + str(done_n) + " to " + str(to_n) + "...")
    paper_df = load_and_pre_format_paper_data(dataset, done_n, to_n)
    print("pca reduction...")
    reduced_paper_df = reduce_paper_data(paper_df)
    print("post-formatting...")
    paper_ddf = post_format_paper_data(reduced_paper_df)
    print("writing chunk to files " + f'{ROOT_DATA_DIR}/demo-load/paper-c{count}-*.csv')
    paper_ddf = paper_ddf.repartition(npartitions=100)
    paper_ddf.to_csv(f'{ROOT_DATA_DIR}/demo-load/paper-c{count}-*.csv', 
                          header_first_partition_only=True, index=False, compute_kwargs={'scheduler': 'processes'})
    count += 1
    done_n = to_n
    print(f"finished {round(100*done_n/total_n,2)} of the data, iterated count to {count}")
    print("========================================")
    print("========================================")

pre-formatting for chunk 0 to 20000000...
pca reduction...
post-formatting...
writing chunk to files /data/demo-load/paper-c0-*.csv
finished 16.43 of the data, iterated count to 1
pre-formatting for chunk 20000000 to 40000000...
pca reduction...
post-formatting...
writing chunk to files /data/demo-load/paper-c1-*.csv
finished 32.85 of the data, iterated count to 2
pre-formatting for chunk 40000000 to 60000000...
pca reduction...
post-formatting...
writing chunk to files /data/demo-load/paper-c2-*.csv
finished 49.28 of the data, iterated count to 3
pre-formatting for chunk 60000000 to 80000000...
pca reduction...
post-formatting...
writing chunk to files /data/demo-load/paper-c3-*.csv
finished 65.71 of the data, iterated count to 4
pre-formatting for chunk 80000000 to 100000000...
pca reduction...
post-formatting...
writing chunk to files /data/demo-load/paper-c4-*.csv
finished 82.13 of the data, iterated count to 5
pre-formatting for chunk 100000000 to 120000000...
pca reduction...
pos

## Prepare Cite Relationships

In [6]:
cites_edge_ddf = dd.from_pandas(pd.DataFrame(dataset.edge_index('paper', 'paper').T, 
                                columns = [":START_ID",":END_ID"]), npartitions=100)
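
The transpose in the cell above turns a two-row index array into one relationship per row. A minimal sketch using plain lists in place of the array, where the three edges are made-up illustration values:

```python
import csv
import io

# Stand-in for the (2, num_edges) array: row 0 = start node, row 1 = end node.
edge_index = [[0, 0, 1],
              [1, 2, 2]]

# Transpose so each row is one (start, end) pair.
rows = list(zip(*edge_index))

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([":START_ID", ":END_ID"])  # header columns used for admin import
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each data line pairs one start index with one end index, matching the `ogbIndex:ID` values written for the paper nodes.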


In [7]:
cites_edge_ddf.to_csv(f'{ROOT_DATA_DIR}/demo-load/cited-*.csv', 
                      header_first_partition_only=True, index=False, scheduler="processes")


['/data/demo-load/cited-00.csv',
 '/data/demo-load/cited-01.csv',
 '/data/demo-load/cited-02.csv',
 '/data/demo-load/cited-03.csv',
 '/data/demo-load/cited-04.csv',
 '/data/demo-load/cited-05.csv',
 '/data/demo-load/cited-06.csv',
 '/data/demo-load/cited-07.csv',
 '/data/demo-load/cited-08.csv',
 '/data/demo-load/cited-09.csv',
 '/data/demo-load/cited-10.csv',
 '/data/demo-load/cited-11.csv',
 '/data/demo-load/cited-12.csv',
 '/data/demo-load/cited-13.csv',
 '/data/demo-load/cited-14.csv',
 '/data/demo-load/cited-15.csv',
 '/data/demo-load/cited-16.csv',
 '/data/demo-load/cited-17.csv',
 '/data/demo-load/cited-18.csv',
 '/data/demo-load/cited-19.csv',
 '/data/demo-load/cited-20.csv',
 '/data/demo-load/cited-21.csv',
 '/data/demo-load/cited-22.csv',
 '/data/demo-load/cited-23.csv',
 '/data/demo-load/cited-24.csv',
 '/data/demo-load/cited-25.csv',
 '/data/demo-load/cited-26.csv',
 '/data/demo-load/cited-27.csv',
 '/data/demo-load/cited-28.csv',
 '/data/demo-load/cited-29.csv',
 '/data/de

### Removing Extra Headers for Papers

Unfortunately, a small manual step is needed here because I didn't get the headers quite right when writing the paper CSVs. For admin import, only one paper CSV should have a header: the first file of the first chunk. However, the chunking logic above writes a header to the first file of every chunk. As a result, we must remove the header from the first file of each chunk except the initial one.

This is easy to do in a terminal. Go to the directory with the CSVs and run `sed` as shown below on the first file of each chunk except `paper-c0-00.csv`. For a chunk size of 20 million, the commands are:

```bash
  sed -i '1d' paper-c1-00.csv 
  sed -i '1d' paper-c2-00.csv 
  sed -i '1d' paper-c3-00.csv 
  sed -i '1d' paper-c4-00.csv 
  sed -i '1d' paper-c5-00.csv 
  sed -i '1d' paper-c6-00.csv 
```
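
The same cleanup can be scripted instead of typed by hand. A minimal sketch, assuming the file naming used above (`paper-c{chunk}-00.csv` as the first file of each chunk). `strip_chunk_headers` is a hypothetical helper, and it rewrites files in place, so it is worth trying on a copy first.

```python
import shutil
import tempfile
from pathlib import Path


def strip_chunk_headers(csv_dir, n_chunks):
    """Drop line 1 from the first file of every chunk except chunk 0."""
    stripped = []
    for chunk in range(1, n_chunks):
        path = Path(csv_dir) / f"paper-c{chunk}-00.csv"
        if not path.exists():
            continue
        # Stream to a temporary file so large CSVs are never held in memory.
        with path.open("r", newline="") as src, tempfile.NamedTemporaryFile(
            "w", newline="", dir=csv_dir, delete=False
        ) as dst:
            src.readline()                # skip the header line
            shutil.copyfileobj(src, dst)  # copy the remaining lines unchanged
        Path(dst.name).replace(path)
        stripped.append(path.name)
    return stripped


# Demo on throwaway files: two chunks with one file each.
with tempfile.TemporaryDirectory() as tmp:
    for chunk in range(2):
        (Path(tmp) / f"paper-c{chunk}-00.csv").write_text("id,year\n1,2020\n")
    print(strip_chunk_headers(tmp, n_chunks=2))  # ['paper-c1-00.csv']
```

For the seven chunks covered by the `sed` commands above, the call would be `strip_chunk_headers(f'{ROOT_DATA_DIR}/demo-load', n_chunks=7)`.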