# Web of Science Import to Neo4j Notes

Web of Science / Clarivate (https://www.webofscience.com/) requires a manual download of search results.
A Web of Science account is required to search for literature metadata and download these search results.

Web of Science provides a number of options to search for articles of interest.
A 'Topic' search is recommended for comprehensive results.

Once the search results are returned, export them by clicking the "Export" tab
and then selecting 'Tab delimited file'.

At a minimum, select the 'Author', 'Title', 'Source' and 'Abstract' attributes.
This Jupyter Notebook will import all attributes if they are selected. 
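
The minimum set of attributes can be checked before a file is imported. The sketch below is an illustration only: `missing_columns` is a hypothetical helper, and the header names are assumed to match the column names used by the import query in this notebook (`Authors`, `Article Title`, `Source Title`, `Abstract`). Adjust them if your export uses different headers.

```python
import csv
import io

# Assumed header names, taken from the columns referenced by the import query
REQUIRED_COLUMNS = ("Authors", "Article Title", "Source Title", "Abstract")

def missing_columns(handle, required=REQUIRED_COLUMNS, delimiter="\t"):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])
    # Strip whitespace and a possible byte order mark from each header name
    present = {name.strip().lstrip("\ufeff") for name in header}
    return [name for name in required if name not in present]

# Example with an in-memory file that lacks the Abstract column
sample = io.StringIO("Authors\tArticle Title\tSource Title\n")
print(missing_columns(sample))  # ['Abstract']
```

For a real export, pass an open file handle instead of the in-memory sample.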

Only 1000 results can be downloaded at a time. If more than 1000 results are returned, multiple downloads
will be required.
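
To plan the downloads, the record ranges can be worked out ahead of time. This is a minimal sketch, assuming the export dialog accepts a first and last record number; `export_ranges` is a hypothetical helper, not part of Web of Science or Neo4j.

```python
def export_ranges(total_results, batch_size=1000):
    """Return (first, last) record numbers, 1-based and inclusive, for each download."""
    if total_results < 0 or batch_size < 1:
        raise ValueError("total_results must be >= 0 and batch_size must be >= 1")
    return [
        (start, min(start + batch_size - 1, total_results))
        for start in range(1, total_results + 1, batch_size)
    ]

# A search returning 2350 results needs three downloads
print(export_ranges(2350))  # [(1, 1000), (1001, 2000), (2001, 2350)]
```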

Once the spreadsheet files are downloaded, place them in your Neo4j Import directory.
The code cells below will automatically import every spreadsheet found in that directory.
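
Copying the downloads into the Import directory can also be scripted. The sketch below is one possible approach under stated assumptions: `stage_exports` is a hypothetical helper, the `*.txt` pattern assumes the exports were saved with a `.txt` extension, and the demonstration uses temporary directories in place of your real download and Import folders.

```python
import shutil
import tempfile
from pathlib import Path

def stage_exports(download_dir, import_dir, pattern="*.txt"):
    """Copy files matching pattern from download_dir into import_dir.

    Returns the names of the copied files in sorted order.
    """
    import_path = Path(import_dir)
    import_path.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(Path(download_dir).glob(pattern)):
        if source.is_file():
            shutil.copy2(source, import_path / source.name)
            copied.append(source.name)
    return copied

# Demonstration with temporary directories standing in for the real folders
with tempfile.TemporaryDirectory() as downloads, tempfile.TemporaryDirectory() as neo4j_import:
    Path(downloads, "savedrecs.txt").write_text("Authors\tArticle Title\n")
    print(stage_exports(downloads, neo4j_import))  # ['savedrecs.txt']
```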



In [None]:
# Download and install Python packages needed for this Jupyter Notebook

!pip install neo4j

In [None]:
# This block imports the Python packages needed for this Jupyter Notebook

# Note: 'ast', 'json' and 'os' are part of the Python Standard Library
# and are included with every Python installation, so only 'neo4j'
# needs to be installed separately

import json 
from neo4j import GraphDatabase
import os
import ast

In [None]:
# This block connects the Jupyter Notebook to your Neo4j Database
# Note: Your Neo4j Database must be running and accepting connections
# Note: This example is for connecting to a local instance of Neo4j
# More information on connecting to Neo4j from Python can be found at
# https://neo4j.com/docs/python-manual/current/connect/

uri = 'bolt://localhost:7687'
username = 'neo4j'
password = 'password'
driver = GraphDatabase.driver(uri, auth=(username, password))
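
The connection settings above are hard-coded for a local instance. As an alternative, they can be read from environment variables so that the password does not live in the notebook. This is a sketch only: the variable names `NEO4J_URI`, `NEO4J_USERNAME` and `NEO4J_PASSWORD` and the helper `neo4j_settings` are assumptions made for illustration, with defaults matching the values used above.

```python
import os

def neo4j_settings(env=os.environ):
    """Return (uri, username, password) from the environment, with local defaults."""
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    username = env.get("NEO4J_USERNAME", "neo4j")
    password = env.get("NEO4J_PASSWORD", "password")
    return uri, username, password

uri, username, password = neo4j_settings()

# With the neo4j package installed and the database running:
# driver = GraphDatabase.driver(uri, auth=(username, password))
# driver.verify_connectivity()  # raises an exception if the database cannot be reached
```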

In [None]:
# This block creates indexes on the following properties to greatly speed up data import and data queries
# Statements that are commented out refer to node types not present in the imported data set
    
#driver.execute_query('CREATE INDEX Institutions IF NOT EXISTS FOR \
#    (i:Institutions) ON (i.id)')
#driver.execute_query('CREATE INDEX Concept IF NOT EXISTS FOR \
#    (i:Concept) ON (i.id)')
record, summary, keys =  driver.execute_query('CREATE INDEX Work_ID IF NOT EXISTS FOR \
        (i:Work) ON (i.id)')
record, summary, keys =  driver.execute_query('CREATE INDEX Author IF NOT EXISTS FOR \
        (i:Author) ON (i.id)')

In [None]:
# This block imports each spreadsheet downloaded from Web of Science and creates
# Work nodes with the Web of Science data fields added as properties  

# Look up the import directory of the running Neo4j instance
with driver.session() as session:
    neodir = session.run('CALL dbms.listConfig() YIELD name, value \
        WHERE name = \'server.directories.import\' RETURN value').values()
    path = neodir[0][0]
    #print(path)

    os.chdir(path)

# List the files in the import directory in a stable (sorted) order
directory_list = sorted(os.listdir(path))

# Imports each downloaded spreadsheet and creates Work nodes
for file in directory_list:
    # Skip hidden files (for example .DS_Store) so they are never passed to LOAD CSV
    if file.startswith('.'):
        continue

    work_node_creation = "LOAD CSV WITH HEADERS FROM 'file:///" + file + \
        "' AS line WITH line WHERE line.`Authors` IS NOT NULL \
        MERGE (w:Work {id: line.`UT (Unique ID)`}) \
        SET w.source = \'WebOfScience\', \
        w.display_name = coalesce(line.`Article Title`, \'\'), \
        w.cited_by_count = \
        coalesce(line.`Times Cited, All Databases`, \'\'), \
        w.doi = coalesce(line.`DOI`, \'\'), \
        w.publication_date = coalesce(line.`Publication Date`, \'\'), \
        w.publication_year = coalesce(line.`Publication Year`, \'\'), \
        w.type = coalesce(line.`Document Type`, \'\'), \
        w.abstract = coalesce(line.`Abstract`, \'\'), \
        w.authorships = coalesce(split(line.`Authors`, \';\'), \'\'), \
        w.ISBN = coalesce(line.`ISBN`, \'\'), \
        w.ISSN = coalesce(line.`ISSN`, \'\'), \
        w.SourceTitle = coalesce(line.`Source Title`, \'\'), \
        w.ConferenceTitle = coalesce(line.`Conference Title`, \'\')"
    # Uncomment the statement below to print the Cypher command being executed
    #print(work_node_creation)

    print("Processing file: " + str(file))
    record, summary, keys = driver.execute_query(work_node_creation)
    # This print statement provides information on
    # labels_added, relationships_created, nodes_created and properties_set
    print(summary.counters)
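
The import loop builds the `file:///` URL by concatenating the raw file name. File names that contain spaces, parentheses or quotes (for example a repeated download saved as `savedrecs (1).txt`) can break the URL or the surrounding Cypher string. One way to guard against this, sketched here with a hypothetical helper `import_file_url`, is to percent-encode the name with the standard library. This assumes Neo4j decodes percent-encoded characters in file URLs.

```python
from urllib.parse import quote

def import_file_url(filename):
    """Build a percent-encoded file:/// URL for a file in the Neo4j import directory."""
    return "file:///" + quote(filename)

print(import_file_url("savedrecs (1).txt"))  # file:///savedrecs%20%281%29.txt
```

The result could then replace the `'file:///' + file` concatenation in the query string.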


In [None]:
# This block retrieves the authorships property imported earlier and
# creates an Author Node for each unique author.
# Author names are assigned to display_name and a unique identifier
# (UUID) is created
# Note: apoc.periodic.iterate requires the APOC plugin to be installed in Neo4j
# Note: ON CREATE SET assigns the UUID only when an Author node is first created,
# so an author who appears on several Works keeps the same id

author_node_creation = "CALL apoc.periodic.iterate(\"MATCH (w:Work) RETURN w\",\"WITH \
    w.authorships AS authors,w \
    UNWIND authors AS author \
    MERGE (a:Author {display_name: trim(author)}) \
    ON CREATE SET a.source = 'Web of Science', \
    a.id = randomUUID() \
    MERGE (a)-[:WROTE]->(w)\",\
    {batchSize:200, parallel:false})"


# Uncomment the statement below to print the Cypher syntax executed within Neo4j
#print(author_node_creation)

record, summary, keys = driver.execute_query(author_node_creation)

# This print statement provides information on the transaction batches
# (the 'batch' column returned by apoc.periodic.iterate)
print(record[0]['batch'])

print("Author import complete")
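
The author step splits each `Authors` string on `;`, trims whitespace, and merges on the trimmed name so that each author is created once. The same logic can be sketched in plain Python to preview which Author nodes to expect. `unique_authors` is a hypothetical helper for illustration. Unlike the Cypher above, it skips empty names.

```python
def unique_authors(author_strings):
    """Split each ';'-separated author string, trim whitespace, and deduplicate.

    The order of first appearance is preserved.
    """
    seen = {}
    for authors in author_strings:
        for name in authors.split(";"):
            name = name.strip()
            if name:
                seen.setdefault(name, None)
    return list(seen)

print(unique_authors(["Smith, J; Doe, A", "Doe, A;Lee, K"]))
# ['Smith, J', 'Doe, A', 'Lee, K']
```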