Script for preprocessing contribution and research method data for graph database. 
rdf files come from 2 folders: education and psychology. files consist of doi header and xml of appended research method entities 

contribution data for prototype is from powerBI datatbase made available by shared access. 

final result is csv file where each row contains a research method and unique set of contribution data (author, article, journal, institute, ringgold id, etc.)

In [1]:
import xml.etree.ElementTree as ET
import simplejson as json
import glob
import itertools 
import csv
import re
import math
import pandas as pd
from tqdm import tqdm

In [3]:
rdf_files = glob.glob("{}/*.rdf".format("/to/path/Psychology"))
master_contribution = pd.read_csv('/to/path/master_contribution_data.csv')

**GENERATE {SRM:[DOIs]}**

In [4]:
srm_dict = {}
for rdf_filename in tqdm(rdf_files):
    with open(rdf_filename, "r") as rdf_file:
        rdf_str = rdf_file.read()
        xml_idx = rdf_str.rfind("<?xml version=")
        
        if (xml_idx > 0):
            rdf_str = rdf_str[:xml_idx]

        root = ET.fromstring(rdf_str)
        doi = root.attrib.get("id")

        entity_root = root.find('entities')
        for entity_element in entity_root.findall('entity'):
            #entity_id = entity_element.get("id")
            entity_val = entity_element.get("value")
            srm_dict.setdefault(entity_val,[]).append(doi)

100%|██████████| 123913/123913 [06:40<00:00, 309.04it/s]


**FILTER CONTRIBUTION DATA**

In [8]:
srm_doi_list = []
for key,val in srm_dict.items():
    for doi in val:
        srm_doi_list.append(doi)

contrib_doi_list = master_contribution['DOI'].tolist()
filter_doi_list = list(set(contrib_doi_list).intersection(srm_doi_list))

filter_contribution = pd.DataFrame(columns = master_contribution.columns.values.tolist())

for i in tqdm(range (0, master_contribution.shape[0])):
    if master_contribution['DOI'].iloc[i] in filter_doi_list:
        filter_contribution = filter_contribution.append(master_contribution.iloc[i])
filter_contribution = filter_contribution.reset_index().drop(columns='index')

100%|██████████| 30000/30000 [00:11<00:00, 2500.42it/s]


**FILTER {SRM:[DOIs]} BY FILTERED CONTRIBUTION DATA**

In [9]:
filter_dict = {}
for key,val in tqdm(srm_dict.items()):
    for doi in val:
        if doi in filter_doi_list:
            filter_dict.setdefault(key,[]).append(doi) 

100%|██████████| 648/648 [00:27<00:00, 23.39it/s] 


**GENERATE {SRM:[CONTRIBUTION DATA]}**
such that each row corresponds to srm key and an element in value array

In [11]:
final_dict = {}
for key,val in tqdm(filter_dict.items()):
    for doi in val:
        for i in range(0, filter_contribution.shape[0]):
            if filter_contribution['DOI'].loc[i] == doi:
                data= filter_contribution.loc[i].values.tolist()
                final_dict.setdefault(key,[]).append(data) 

100%|██████████| 518/518 [15:56<00:00, 12.57it/s]


**GENERATE COMPLETE SRM/CONTRIBUTION DATAFRAME**

In [17]:
srm_list = []
data_list = []
for key,val in final_dict.items():
    key_list = [key]*len(val)
    srm_list.extend(key_list)
    for data in val:
        data_list.append(data)
        
srm_df = pd.DataFrame(srm_list,columns=['research method'])
df_columns = master_contribution.columns.values
data_df = pd.DataFrame(data_list,columns=df_columns)
master_df = srm_df.join(data_df)

In [25]:
master_df.to_csv('/to/path/SRM.csv')