## Neo4j Code to load data, nodes, properties and relationships

This notebook imports data into a Neo4j graph database and runs queries against it.

## 1- Install neo4j Desktop

Here are the steps to install and set up Neo4j Desktop on an Ubuntu laptop.
Similar instructions are available for macOS, Windows, and other Linux distributions.

### --------------neo4j instructions------------------
***Neo4j desktop is most likely all that is needed for the system to operate, but these are the steps used in the test system.***

### [ ] how to install neo4j desktop on ubuntu:

First install fuse (pick the steps that work for your version of ubuntu)

https://github.com/AppImage/AppImageKit/wiki/FUSE


For Ubuntu 22.04 or later, run the following in a terminal:

sudo add-apt-repository universe

sudo apt install libfuse2


Download appimage from https://neo4j.com/download/


In a terminal, from the folder where the file was downloaded, run the following (the version number in the file name may need to be changed):

chmod a+x neo4j-desktop-1.0.3-x86_64.AppImage; 

./neo4j-desktop-1.0.3-x86_64.AppImage



### -------------------------neo4j configuration/file paths------------------

Once Neo4j is installed, start Neo4j Desktop by typing the following in a terminal:


XXXXXXX


neo4j desktop may open as part of the installation. 

Keep track of the following information.

The following settings will need to be entered in the neo4j_graph_etl notebook


bolt: 7687 → 7689

http: 7474 → 7475


also keep track of


username: neo4j

password: XXXXX


They will also need to be entered in the notebook


In the Neo4j Desktop app, from the "home" or "startup" screen under example projects, click add, then click local DBMS.

Enter the name of the server (ex: SBIR)


A new server with that name will show up under the example project. Click on the three dots to the right of the server name.

Click open folder, then click import. A new window will open showing the contents of the import folder.

The path to that folder should be listed at the top of the window. Copy or write down that path and save it.

This path is needed to import CSV files into Neo4j: the notebook must know the path to the Neo4j import folder.


ex: /home/laben/.config/Neo4j Desktop/Application/relate-data/dbmss/dbms-6959f6d4-5ba3-45e0-a8db-7a248c595605/import


## 2 - Set Variables

After completing the neo4j installation, update the variables in the notebook to reference the specific values for the system the notebook is running on.

## 3 - Load in base level csv files and flatten them for import into neo4j

Create additional csv files after flattening some of the cells in the original csv files.

## 4 - Copy Data files to neo4j data directory

Import the csv files from the input file directory into the neo4j import directory.

## 5 - Create nodes and edges in neo4j

Run Cypher queries to create nodes and edges

## 6 - Test Queries (Optional)

Run some test queries to verify data is correctly setup in neo4j

## 7 - Conduct Queries to answer questions posed for Final Project

These are the queries the team proposed to answer for the DSE 203 Final Project.

In [1]:
## install 
#pip install py2neo

In [2]:
import pandas as pd
from pandas import json_normalize
import ast
from ast import literal_eval
from py2neo import Graph
import spacy
import subprocess


## 2 - Set Variables

In [3]:
# Set the variables for the specific user/machine that is running the notebook

# Uncomment the lines that match your setup and comment out the others (the trailing tags mark each user's settings)
#neo4j_import_dir = "/Users/prakhar/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-c1b4f9c3-baa6-45d6-a13c-31ec4ba4c393/import" #SF
neo4j_import_dir = "/home/laben/.config/Neo4j Desktop/Application/relate-data/dbmss/dbms-6959f6d4-5ba3-45e0-a8db-7a248c595605/import" #LF
#neo4j_import_dir = "/Users/sagarjogadhenu/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-f6ef41f5-ff23-4598-9ef0-df4c332f4744/import" #SJ

#neo4j connection settings
neo4j_url = "bolt://localhost:7689" #LF
#neo4j_url = "bolt://localhost:7690" #PS
#neo4j_url = "bolt://localhost:7687" #SJ
neo4j_username = "neo4j"
neo4j_password = "Welcome19" #LF
#neo4j_password = "Welcome19#" #PS
#neo4j_password = "dse203sbir" #SJ

# File names of the data sets to load and preprocess for Neo4j.
# Additional csv files are derived from these files.
# The actual csv files loaded into Neo4j are created in step 3
filename_sbir = './preprocessed_files/sbir_1k_sample.csv'
filename_patents = './preprocessed_files/patents.json'
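Instead of commenting and uncommenting lines for each user, the settings could be kept in a table keyed by user. The sketch below is one possible arrangement, not what this notebook does. The `USER_SETTINGS` table, the `select_settings` helper, the placeholder import paths, and the `NEO4J_PASSWORD` environment variable are all hypothetical names introduced for illustration; only the two bolt ports are taken from the cell above.

```python
import os

# Hypothetical per-user settings table; the import paths are placeholders.
USER_SETTINGS = {
    "LF": {"import_dir": "/path/to/lf/import", "url": "bolt://localhost:7689"},
    "SJ": {"import_dir": "/path/to/sj/import", "url": "bolt://localhost:7687"},
}


def select_settings(user, settings=USER_SETTINGS, env=os.environ):
    """Return the connection settings for one user.

    The password is read from the (assumed) NEO4J_PASSWORD environment
    variable so that it does not have to be stored in the notebook.
    """
    if user not in settings:
        raise KeyError(f"No settings defined for user {user!r}")
    chosen = dict(settings[user])  # copy so the shared table is not mutated
    chosen["username"] = "neo4j"
    chosen["password"] = env.get("NEO4J_PASSWORD", "")
    return chosen


# Example: pick the LF settings with a made-up password.
cfg = select_settings("LF", env={"NEO4J_PASSWORD": "example"})
print(cfg["url"])
```

With this layout, switching machines means changing one string (the user tag) rather than editing several commented lines.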


In [4]:
# Connect to your Neo4j database
graph = Graph(neo4j_url, auth=(neo4j_username, neo4j_password))

## 3 - load in base level csv files and flatten them for import into neo4j

### Load the SBIR sample data

In [5]:
# Read the SBIR CSV file into a pandas DataFrame
df_sbir = pd.read_csv(filename_sbir)
#df_sbir.rename(columns={'Agency Tracking Number': 'id_Agency Tracking Number'}, inplace=True)

df_sbir['Award Amount'] = pd.to_numeric(df_sbir['Award Amount'].str.replace(',', ''), errors='coerce')


df_sbir['key'] = df_sbir.id # PS Change 12/08 12 PM
df_sbir.head(5)

Unnamed: 0,id,Company,Award Title,Agency,Branch,Phase,Program,Agency Tracking Number,Contract,Proposal Award Date,...,Contact Email,PI Name,PI Title,PI Phone,PI Email,RI Name,RI POC Name,RI POC Phone,abstract_entities,key
0,16068,"EXPEDITION TECHNOLOGY, INC.",Fast Recovery Of Signal Estimates using Neural...,Department of Defense,Navy,Phase I,SBIR,N192-128-0403,N68335-19-C-0742,10/21/2019,...,marc.harlacher@exptechinc.com,Mike Tinston,Chief Scientist,(571) 246-8479,mike.tinston@exptechinc.com,,,,"['condition', 'signal density', 'signal mingle...",16068
1,187914,"Apa Optics, Inc.",EXTRAVEHICULAR MOBILITY UNIT HELMET MOUNTED DI...,National Aeronautics and Space Administration,,Phase II,SBIR,6818,,,...,,David E Stoltzmann,,() -,,,,,"['OPTICAL MECHANICAL DESIGNS', 'opto-mechanica...",187914
2,97687,"RADIATION MONITORING DEVICES, INC.","An Efficient, Solid State Detector for Nuclear...",Department of Energy,,Phase II,SBIR,80498S06-I,DE-FG02-06ER84430,,...,GEntine@rmdinc.com,Michael Squillante,Dr,(617) 668-6808,MSquillante@rmdinc.com,,,,"['signal-to-noise ratio', 'Computed tomography...",97687
3,38328,"QUASAR FEDERAL SYSTEMS, INC.",Miniature Oriented Tri-Axial Fluxgate Magnetom...,Department of Defense,Navy,Phase I,SBIR,N172-116-0564,N68335-17-C-0621,08/30/2017,...,twrightson@quasarfs.com,Yongming Zhang,CEO,(858) 412-1737,yongming@quasarfs.com,,,,"['long range sigint', 'magnetic signature', 'e...",38328
4,82571,"INVOTEK, INC.",Prosody enhanced TTS for Dysarthric Speakers,Department of Health and Human Services,National Institutes of Health,Phase I,SBIR,DC010085,1R43DC010085-01,,...,tjakobs@invotek.org,THOMAS JAKOBS,,(479) 632-4166,TJAKOBS@INVOTEK.ORG,,,,"['feasibility project', 'control prosody compu...",82571


In [6]:
selected_columns1 = ['key',
                     'Agency Tracking Number', 'Company', 'Branch', 'Award Title', 'Agency', 'Company Website', 'Contract', 'Proposal Award Date', 'Contract End Date', 'Solicitation Number', 'Solicitation Year',
                     'Solicitation Close Date', 'Proposal Receipt Date', 'Date of Notification', 'Topic Code', 'Award Year', 'Award Amount',
                     'PI Name', 'PI Title', 'PI Phone', 'PI Email', 'RI Name', 'RI POC Name', 'RI POC Phone','Address1','Address2', 'City', 'State', 'Zip',
                     'HUBZone Owned', 'Socially and Economically Disadvantaged','Women Owned'
                    ]

selected_columns2 = ['key','Abstract','abstract_entities']   ##need to remove/update  #PS update 12/08/1:45 PM
# Remove leading/trailing whitespaces
selected_columns1 = [col.strip() for col in selected_columns1]
selected_columns2 = [col.strip() for col in selected_columns2]

df_sbirs = df_sbir[selected_columns1]

df_sbirent = df_sbir[selected_columns2]


df_sbirs.to_csv('./preprocessed_files/neo4j_sbir_main.csv', index=False)

In [7]:
#Explode the 'abstract_entities' column  (need to update)
#This converts each list of entities into individual entities
#that are placed in separate rows of a csv file, which is then imported into neo4j
#df_source_sbir = df_sbirent

df_entities_sbir = pd.DataFrame(columns=['key','entities'])
#df_sbirent['abstract_entities']
#df_exploded_sbir = df_sbirent.explode('abstract_entities')
#df_exploded_sbir = df_source_sbir
#Display the resulting DataFrame
#display(df_sbirent)
#display(df_sbirent['abstract_entities'].iloc[0])
#test = df_sbirent['abstract_entities'].iloc[0].replace(''','').replace('[','').replace(']','').split(',')
#print(test)
#print(len(test))

for i in range(0, len(df_sbirent)):
    temp_list = df_sbirent['abstract_entities'].iloc[i].replace('\'','').replace('[','').replace(']','').split(',')

    for x in range(0, len(temp_list)):
        #print(temp_list[x])
        df_entities_sbir.loc[len(df_entities_sbir.index)] = [df_sbirent['key'].iloc[i], temp_list[x].strip()]
        #df_sbirent.loc[len(df_entities_sbir.index)] = [df_sbirent['key'].iloc[i], temp_list[x]]

df_sbir_abstract_entities = pd.merge(df_sbirent, df_entities_sbir, on='key', how='inner') 
display(df_sbir_abstract_entities)
df_sbir_abstract_entities.to_csv('./preprocessed_files/neo4j_sbir_abstract_entities.csv', index=False)

Unnamed: 0,key,Abstract,abstract_entities,entities
0,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",condition
1,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",signal density
2,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",signal mingle
3,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",signal intermodulation product desire
4,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",Bandwidth ( ibw ) receiver wide bit-depth reduce
...,...,...,...,...
20899,171725,We propose to provide a set of general design ...,"['system configuration', 'central distribution...",expert design engineering
20900,171725,We propose to provide a set of general design ...,"['system configuration', 'central distribution...",sample transfer
20901,171725,We propose to provide a set of general design ...,"['system configuration', 'central distribution...",efficiency
20902,171725,We propose to provide a set of general design ...,"['system configuration', 'central distribution...",integration
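The loop above splits the stringified list on commas, which also splits any entity that itself contains a comma, and it appends one row at a time. A possible alternative, sketched here on toy data, parses each string with `ast.literal_eval` and lets `DataFrame.explode` create the rows. It assumes every cell is a well-formed Python list literal; the `explode_entities` helper and the toy values are illustrative and not part of the notebook.

```python
import ast

import pandas as pd

# Toy stand-in for df_sbirent; the values are illustrative only.
df_toy = pd.DataFrame({
    "key": [1, 2],
    "abstract_entities": ["['signal density', 'noise, thermal']", "['efficiency']"],
})


def explode_entities(df, column="abstract_entities"):
    """Return one row per entity, keeping the original columns."""
    out = df.copy()
    # Parse the string representation back into a real Python list
    out["entities"] = out[column].apply(ast.literal_eval)
    # One row per list element; the other columns are repeated
    out = out.explode("entities", ignore_index=True)
    out["entities"] = out["entities"].str.strip()
    return out


df_long = explode_entities(df_toy)
print(df_long[["key", "entities"]])
```

Note that the entity `'noise, thermal'` stays in one piece here, whereas a plain split on `,` would turn it into two rows.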


### Load the Patent sample data

In [8]:
# Read in the patent data
df_patent = pd.read_json(filename_patents)
df_patent['key'] = df_patent['doc-number']  # PS Change 12/08 12 PM
#df_patent.head(5)

# Display the DataFrame
display(df_patent)

Unnamed: 0,country,doc-number,date,application-reference,title,assignee,inventors,abstract,claims,abstract_entities,claim_entities,key
0,US,20230227158,20230720,"{'country': 'US', 'doc-number': 18123405, 'dat...",UNMANNED AERIAL SYSTEM AND METHOD FOR CONTACT ...,"{'orgname': 'Beirobotics LLC', 'city': 'Richmo...","[{'last-name': 'Beiro', 'first-name': 'Michael...",\nA system for performing work on electrical p...,a power line tool adapted to perch on an energ...,"[splice electrical power line, power line tool...","[power line tool, method, couple unmanned, sla...",20230227158
1,US,20230229051,20230720,"{'country': 'US', 'doc-number': 17577538, 'dat...",METHOD AND DEVICE FOR CONTROLLING STATES OF DY...,"{'orgname': 'FURCIFER INC.', 'city': 'FREMONT'...","[{'last-name': 'WANG', 'first-name': 'JIAN', '...",\nThe disclosure relates generally to a method...,selecting a desired optical state of the elect...,"[method, optical state base, switch, power, di...","[controller, optical state base, switch, signa...",20230229051
2,US,20230229087,20230720,"{'country': 'US', 'doc-number': 18098167, 'dat...",UV-CURABLE QUANTUM DOT FORMULATIONS,"{'orgname': 'Nanosys, Inc.', 'city': 'Milpitas...","[{'last-name': 'IPPEN', 'first-name': 'Christi...",\nProvided are patterned films comprising nano...,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n,"[electroluminescent device, method pattern, uv...",[],20230229087
3,US,20230230714,20230720,"{'country': 'US', 'doc-number': 18010358, 'dat...",CONTROL DRUM CONTROLLER FOR NUCLEAR REACTOR SY...,"{'orgname': 'Ultra Safe Nuclear Corporation', ...","[{'last-name': 'Chaleff', 'first-name': 'Ethan...",\nA nuclear reactor system includes a nuclear ...,a pressure vessel;\na nuclear reactor core dis...,"[moderator element, nuclear reactor core dispo...","[internal control drum, longitudinally pressur...",20230230714
4,US,20230230802,20230720,"{'country': 'US', 'doc-number': 18178664, 'dat...",Ultra High Purity Conditions for Atomic Scale ...,"{'orgname': 'Kurt J. Lesker Company', 'city': ...","[{'last-name': 'Rayner, JR.', 'first-name': 'G...",\nAn apparatus for atomic scale processing is ...,a reactor having inner and outer surfaces;\nwh...,"[apparatus, internal volume reactor, substrate...","[base pressure reactor, backfille process, pur...",20230230802
...,...,...,...,...,...,...,...,...,...,...,...,...
880,US,20230226141,20230720,"{'country': 'US', 'doc-number': 18157933, 'dat...",ANGIOTENSIN II ALONE OR IN COMBINATION FOR THE...,{'orgname': 'The George Washington University ...,"[{'last-name': 'CHAWLA', 'first-name': 'Lakhmi...","\nThe present invention relates, inter alia, t...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,"[method, invention, effective, treatment, outp...",[],20230226141
881,US,20230226142,20230720,"{'country': 'US', 'doc-number': 18166956, 'dat...",IONIC SELF-ASSEMBLING PEPTIDES,"{'orgname': '3-D MATRIX, LTD.', 'city': 'Tokyo...","[{'last-name': 'Gil', 'first-name': 'Eun Seok'...",\nProvided herein are ionic self-assembling pe...,\n\na) an N-terminal functional group selected...,"[ionic self-assembling peptide, method, pharma...","[administer solution, pharmaceutical compositi...",20230226142
882,US,20230226143,20230720,"{'country': 'US', 'doc-number': 18011718, 'dat...",PHARMACEUTICAL COMPOSITION COMPRISING A COMBIN...,"{'orgname': 'OCVIRK, Soren', 'city': 'Kranzber...","[{'last-name': 'Ocvirk', 'first-name': 'Sören'...",\nA pharmaceutical composition includes a comb...,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n,"[pharmaceutical composition, gastrointestinal ...",[],20230226143
883,US,20230226144,20230720,"{'country': 'US', 'doc-number': 18186639, 'dat...",TOPICAL ANTIBIOTIC,"{'orgname': 'Concept Matrix Solutions', 'city'...","[{'last-name': 'LaRosa', 'first-name': 'Tony',...",\nProvided herein is a topical antibiotic comp...,"(a) at least one of bacitracin, bacitracin zin...","[skin surface, antibiotic agent, method, compo...","[sodium hydroxide, cetearyl alcohol, water, et...",20230226144


In [9]:
selected_columns3 = ['key','doc-number', 'country',  'application-reference', 'title', 'assignee', 'inventors']
selected_columns4 = ['key', 'abstract','abstract_entities']  ##need to remove/update  #PS update 12/08/1:45 PM

#selected_columns4 = ['key', 'abstract_entities']  ## need to add

selected_columns3 = [col.strip() for col in selected_columns3]
selected_columns4 = [col.strip() for col in selected_columns4]

df_patents = df_patent[selected_columns3]
df_patents_main = df_patents[['key','doc-number', 'country', 'title']]
df_patentsent= df_patent[selected_columns4]

#df_patentsent.rename(columns={'abstract': 'abstract_entities'}, inplace=True)
display(df_patents_main)
df_patents_main.to_csv('./preprocessed_files/neo4j_patents_main.csv', index=False)



Unnamed: 0,key,doc-number,country,title
0,20230227158,20230227158,US,UNMANNED AERIAL SYSTEM AND METHOD FOR CONTACT ...
1,20230229051,20230229051,US,METHOD AND DEVICE FOR CONTROLLING STATES OF DY...
2,20230229087,20230229087,US,UV-CURABLE QUANTUM DOT FORMULATIONS
3,20230230714,20230230714,US,CONTROL DRUM CONTROLLER FOR NUCLEAR REACTOR SY...
4,20230230802,20230230802,US,Ultra High Purity Conditions for Atomic Scale ...
...,...,...,...,...
880,20230226141,20230226141,US,ANGIOTENSIN II ALONE OR IN COMBINATION FOR THE...
881,20230226142,20230226142,US,IONIC SELF-ASSEMBLING PEPTIDES
882,20230226143,20230226143,US,PHARMACEUTICAL COMPOSITION COMPRISING A COMBIN...
883,20230226144,20230226144,US,TOPICAL ANTIBIOTIC


In [10]:

df_source_patent = df_patentsent

# Convert the string representation to a list
#(need to update)
#df_source_patent['abstract_entities'] = df_source_patent['abstract_entities'].apply(ast.literal_eval)

# Explode the 'abstract_entities' column
#(need to update)
df_exploded_patents = df_source_patent.explode('abstract_entities') #PS update 12/08/1:45 PM


# Display the resulting DataFrame
display(df_exploded_patents)

df_exploded_patents.to_csv('./preprocessed_files/neo4j_patents_abstract.csv', index=False)

Unnamed: 0,key,abstract,abstract_entities
0,20230227158,\nA system for performing work on electrical p...,splice electrical power line
0,20230227158,\nA system for performing work on electrical p...,power line tool
0,20230227158,\nA system for performing work on electrical p...,electrical power line
0,20230227158,\nA system for performing work on electrical p...,attachment point support
0,20230227158,\nA system for performing work on electrical p...,system
...,...,...,...
883,20230226144,\nProvided herein is a topical antibiotic comp...,composition
884,20230226145,\nThe disclosure relates to a pharmaceutical c...,method
884,20230226145,\nThe disclosure relates to a pharmaceutical c...,autoimmune disease
884,20230226145,\nThe disclosure relates to a pharmaceutical c...,disclosure


In [11]:
# Functions used to navigate json file and extract information

# This function helps deal with errors when looking for values in the json file
def safe_eval(value):
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError):
        return value

# If there is only 1 level of nesting, this function will extract the data
def flatten_nested_columns(df, column_name):
    # Work on a copy so the caller's DataFrame (often a slice) is left untouched
    df = df.copy()
    # Use safe_eval to handle malformed data
    df[column_name] = df[column_name].apply(safe_eval)
    
    # Flatten the nested dictionary
    df_flat = pd.concat([df.drop(column_name, axis=1),
                        json_normalize(df[column_name])], axis=1)
    
    return df_flat

# If there are 2 levels of nesting, this function will extract the data
def flatten_nested_nested_columns(df, column_name):
    # Work on a copy so the caller's DataFrame (often a slice) is left untouched
    df = df.copy()
    df[column_name] = df[column_name].apply(safe_eval)
    
    # Collect the flattened result of each row, then concatenate once at the end
    frames = []
    
    # Iterate over each row
    for _, row in df.iterrows():
        # Flatten the nested column using json_normalize for each row
        df_row_flat = pd.json_normalize(row[column_name]).reset_index(drop=True)
        
        # Add the key to the flattened DataFrame
        df_row_flat['key'] = row['key']
        
        frames.append(df_row_flat)
    
    # Concatenate the results for all rows
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
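For the two-level case, the per-row loop can also be expressed with `explode` followed by a single `json_normalize` call. The sketch below is an alternative under the assumption that each cell already holds a list of dictionaries (as `pd.read_json` produces for the `inventors` column), so no `safe_eval` step is applied. The helper name `flatten_list_of_dicts` and the toy data are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_patents[['key', 'inventors']]; values are illustrative.
df_toy = pd.DataFrame({
    "key": [10, 20],
    "inventors": [
        [{"last-name": "A", "city": "X"}, {"last-name": "B", "city": "Y"}],
        [{"last-name": "C", "city": "Z"}],
    ],
})


def flatten_list_of_dicts(df, column_name):
    """Return one row per dictionary, with the parent key attached."""
    # One row per list element; 'key' is repeated for each element
    exploded = df[["key", column_name]].explode(column_name, ignore_index=True)
    # Empty lists explode to NaN; drop them before normalizing
    exploded = exploded.dropna(subset=[column_name])
    # Turn the dictionaries into columns in a single call
    flat = pd.json_normalize(exploded[column_name].tolist())
    flat["key"] = exploded["key"].to_numpy()
    return flat


flat = flatten_list_of_dicts(df_toy, "inventors")
print(flat)
```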

In [12]:
# 1st Flattening, which is used to generate the patent application reference node and associated properties
# This is the root node for patents

# Create a DataFrame
df = df_patents[['key','application-reference']]

# Flatten the specified nested column
df_flat = flatten_nested_columns(df, 'application-reference')

# Display the flattened DataFrame
print(df_flat)

df_flat.to_csv('./preprocessed_files/neo4j_patents_application_reference.csv', index=False)

             key country  doc-number      date
0    20230227158      US    18123405  20230320
1    20230229051      US    17577538  20220118
2    20230229087      US    18098167  20230118
3    20230230714      US    18010358  20220817
4    20230230802      US    18178664  20230306
..           ...     ...         ...       ...
880  20230226141      US    18157933  20230123
881  20230226142      US    18166956  20230209
882  20230226143      US    18011718  20200630
883  20230226144      US    18186639  20230320
884  20230226145      US    17379987  20210719

[885 rows x 4 columns]




In [13]:
# 2nd Flattening, which is used to extract the assignee node and its associated properties
df = df_patents[['key','assignee']]

# Flatten the specified nested column
df_flat = flatten_nested_columns(df, 'assignee')

# Display the flattened DataFrame
print(df_flat)


df_flat.to_csv('./preprocessed_files/neo4j_patents_assignee.csv', index=False)

             key                                            orgname  \
0    20230227158                                    Beirobotics LLC   
1    20230229051                                      FURCIFER INC.   
2    20230229087                                      Nanosys, Inc.   
3    20230230714                     Ultra Safe Nuclear Corporation   
4    20230230802                             Kurt J. Lesker Company   
..           ...                                                ...   
880  20230226141  The George Washington University a Congression...   
881  20230226142                                   3-D MATRIX, LTD.   
882  20230226143                                      OCVIRK, Soren   
883  20230226144                           Concept Matrix Solutions   
884  20230226145                                     BIOINCEPT, LLC   

                city state country  
0           Richmond    VA      US  
1            FREMONT    CA      US  
2           Milpitas    CA      US  



In [14]:
# 3rd Flattening, which is used to extract the inventor nodes and their associated properties

# Extract specific columns
df = df_patents[['key', 'inventors']].dropna()

# Flatten the specified nested column
df_flat = flatten_nested_nested_columns(df, 'inventors')

# Display the flattened DataFrame
print(df_flat)


df_flat.to_csv('./preprocessed_files/neo4j_patents_inventors.csv', index=False)


        last-name       first-name                city state country  \
0           Beiro  Michael Kenneth            Richmond    VA      US   
1     Corbin, III      Alvin Leroy             Dillwyn    VA      US   
2           Coble   Chase Hamilton  North Chesterfield    VA      US   
3           Schul     David Carson            Richmond    VA      US   
4            WANG             JIAN             FREMONT    CA      US   
...           ...              ...                 ...   ...     ...   
2750       Ocvirk            Sören           Kranzberg   NaN      DE   
2751       LaRosa             Tony      Woodland Hills    CA      US   
2752     Davidson           Robert      Woodland Hills    CA      US   
2753         Reid            David     Woodland Hillls    CA      US   
2754       Barnea         Eytan R.            New York    NY      US   

              key  
0     20230227158  
1     20230227158  
2     20230227158  
3     20230227158  
4     20230229051  
...           .

## 4 - Copy Data files to neo4j data directory

In [15]:
# delete any csv files from the neo4j import directory if required
#!rm '/Users/prakhar/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-c1b4f9c3-baa6-45d6-a13c-31ec4ba4c393/import'/*.csv
#!ls '/Users/prakhar/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-c1b4f9c3-baa6-45d6-a13c-31ec4ba4c393/import'

In [16]:
#subprocess.run(["ls", "-l"])
# Copy each preprocessed csv file into the neo4j import directory
files_to_copy = [
    "neo4j_sbir_main.csv",
    "neo4j_patents_main.csv",
    "neo4j_patents_abstract.csv",
    "neo4j_patents_application_reference.csv",
    "neo4j_patents_assignee.csv",
    "neo4j_patents_inventors.csv",
    "llama_similarity.csv",
    "neo4j_sbir_abstract_entities.csv",
]
for name in files_to_copy:
    output = subprocess.run(["cp", "./preprocessed_files/" + name, neo4j_import_dir], shell=False, capture_output=True)
    print(output.stderr)

# List the import directory to confirm the files arrived
output = subprocess.run(["ls", neo4j_import_dir], shell=False, capture_output=True)
print(output.stderr)
print(output.stdout)


b''
b''
b''
b''
b''
b''
b''
b''
b''
b'llama_similarity.csv\nneo4j_patents_abstract.csv\nneo4j_patents_application_reference.csv\nneo4j_patents_assignee.csv\nneo4j_patents_inventors.csv\nneo4j_patents_main.csv\nneo4j_sbir_abstract.csv\nneo4j_sbir_abstract_entities.csv\nneo4j_sbir_main.csv\n'
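The `cp` and `ls` commands are only available on Unix-like systems. A portable variant could use `shutil` and `pathlib` from the standard library instead. The sketch below is a suggestion rather than the notebook's method; `copy_csv_files` is a hypothetical helper, and the demonstration runs against temporary directories so it does not touch a real import folder.

```python
import shutil
import tempfile
from pathlib import Path


def copy_csv_files(source_dir, target_dir, names):
    """Copy the named files and return the names that were not found."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    missing = []
    for name in names:
        src = source_dir / name
        if src.is_file():
            shutil.copy2(src, target_dir / name)
        else:
            missing.append(name)  # report instead of failing silently
    return missing


# Demonstration with temporary directories standing in for
# ./preprocessed_files and the neo4j import directory.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "neo4j_sbir_main.csv").write_text("key\n1\n")
    missing = copy_csv_files(src, dst, ["neo4j_sbir_main.csv", "not_there.csv"])
    copied = sorted(p.name for p in Path(dst).iterdir())

print("missing:", missing)
print("copied:", copied)
```

Returning the list of missing files makes a failed copy visible, whereas an empty `b''` on stderr only shows that `cp` did not complain.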


## 5 - Create nodes and edges in neo4j

In [17]:
# Define the Cypher query to load data and define nodes and edges from SBIR
cypher_query_sbir1 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_sbir_main.csv' AS row

// Create Location nodes
MERGE (location:Location {
  key: COALESCE(toInteger(row.key), 0),
  city: COALESCE(row.City, 'Unknown'),
  state: COALESCE(row.State, 'Unknown'),
  zip: COALESCE(row.Zip, 'Unknown'),
  address1: COALESCE(row.Address1, 'Unknown'),
  address2: COALESCE(row.Address2, 'Unknown')
})

// Create Agency nodes
MERGE (agency:Agency {
  key: COALESCE(toInteger(row.key), 0),
  name: COALESCE(row.Agency, 'Unknown'),
  branch: COALESCE(row.Branch, 'Unknown'),
  website: COALESCE(row.`Company Website`, 'Unknown')
})

// Create SBIR award nodes
MERGE (sbir_award:Sbir_Award {
  key: COALESCE(toInteger(row.key), 0),
  topic_code: COALESCE(row.`Topic Code`, 'Unknown'),
  award_title: COALESCE(row.`Award Title`, 'Unknown'),
  award_year: COALESCE(row.`Award Year`, 'Unknown'),
  award_amount: TOFLOAT(COALESCE(row.`Award Amount`, '0.0'))

})

// Create Principal nodes
MERGE (principal:Principal {
  key: COALESCE(toInteger(row.key), 0),
  company: COALESCE(row.Company, 'Unknown'),
  pi_name: COALESCE(row.`PI Name`, 'Unknown'),
  pi_title: COALESCE(row.`PI Title`, 'Unknown'),
  pi_phone: COALESCE(row.`PI Phone`, 'Unknown'),
  pi_email: COALESCE(row.`PI Email`, 'Unknown'),
  hubzone_owned: COALESCE(row.`HUBZone Owned`, 'Unknown'),
  socially_economically_disadvantaged: COALESCE(row.`Socially and Economically Disadvantaged`, 'Unknown'),
  women_owned: COALESCE(row.`Women Owned`, 'Unknown')
})

// Create relationships
CREATE (agency)-[:OVERSEES]->(sbir_award)
CREATE (sbir_award)-[:AWARDED_TO]->(principal)
CREATE (principal)-[:WORKSIN]->(location)
"""
#granted to relationship rather belongs to

# Execute the Cypher query
graph.run(cypher_query_sbir1)

# Define the Cypher query to load data from the second CSV file and create nodes
cypher_query_sbir2 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_sbir_abstract_entities.csv' AS row

// Create Abstract nodes
MERGE (sbir_abstract:Sbir_Abstract {
  key: COALESCE(toInteger(row.key), 0),
  Abstract: COALESCE(row.Abstract, 'Unknown')
})


// Create Entities nodes
MERGE (entities:Entities {
  entities: COALESCE(row.entities, 'Unknown')
})

// Create relationships
CREATE (sbir_abstract)-[:HOLDS]->(entities)
"""

# Execute the Cypher query
graph.run(cypher_query_sbir2)

# Create relationships between nodes from the second file and existing nodes from the first file
cypher_query_sbir3 = """
MATCH (sbir_award:Sbir_Award), (principal:Principal), (sbir_abstract:Sbir_Abstract)
WHERE sbir_award.key = principal.key AND principal.key = sbir_abstract.key
CREATE (sbir_award)-[:CONTAINS]->(sbir_abstract)
CREATE (principal)-[:RESEARCHED]->(sbir_abstract)
"""

# Execute the Cypher query
graph.run(cypher_query_sbir3)
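Because the relationships above are written with `CREATE`, running the cell a second time adds a second copy of every relationship. One way to make a load re-runnable, sketched here for just two of the node types, is to `MERGE` each node on its key only, set the remaining properties with `ON CREATE SET`, and `MERGE` the relationship as well. This is an illustrative variant and not the query the notebook ran; `cypher_query_sbir_rerunnable` and `run_if_connected` are hypothetical names.

```python
# Sketch only: part of the SBIR load rewritten so it can be re-run safely.
cypher_query_sbir_rerunnable = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_sbir_main.csv' AS row

// Identify each node by key alone, then fill in properties on first creation
MERGE (agency:Agency {key: COALESCE(toInteger(row.key), 0)})
  ON CREATE SET agency.name = COALESCE(row.Agency, 'Unknown'),
                agency.branch = COALESCE(row.Branch, 'Unknown')

MERGE (sbir_award:Sbir_Award {key: COALESCE(toInteger(row.key), 0)})
  ON CREATE SET sbir_award.award_title = COALESCE(row.`Award Title`, 'Unknown')

// Merging the relationship keeps at most one OVERSEES edge per pair
MERGE (agency)-[:OVERSEES]->(sbir_award)
"""


def run_if_connected(graph, query):
    """Run the query when a py2neo Graph is supplied; otherwise do nothing."""
    if graph is None:
        return None
    return graph.run(query)


# With a live connection this would be: run_if_connected(graph, cypher_query_sbir_rerunnable)
result = run_if_connected(None, cypher_query_sbir_rerunnable)
```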

In [18]:
# Define the Cypher query to load data and define nodes and edges from Patents
cypher_query_patents1 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_main.csv' AS row

// Create Title nodes
MERGE (title_pt:Title_PT {
  key: COALESCE(toInteger(row.key), 0),
  doc_number: COALESCE(row.`doc-number`, 'Unknown'),
  title: COALESCE(row.title, 'Unknown'),
  country: COALESCE(row.country, 'Unknown')
})

"""
# Execute the Cypher query
graph.run(cypher_query_patents1)


# Define the Cypher query to load data from the second CSV file and create nodes
cypher_query_patents2 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_application_reference.csv' AS row

// Create application ref nodes
MERGE (application_ref_pt:Application_Ref_PT {
  key: COALESCE(toInteger(row.key), 0),
  doc_number: COALESCE(row.`doc-number`, 'Unknown'),
  country: COALESCE(row.country, 'Unknown'),
  date: COALESCE(row.date, 'Unknown')
})
"""

# Execute the Cypher query
graph.run(cypher_query_patents2)


# Define the Cypher query to load data from the assignee CSV file and create nodes
cypher_query_patents3 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_assignee.csv' AS row

// Create assignee nodes
MERGE (assignee_pt:Assignee_PT {
  key: COALESCE(toInteger(row.key), 0),
  orgname: COALESCE(row.orgname, 'Unknown'),
  city: COALESCE(row.city, 'Unknown'),
  state: COALESCE(row.state, 'Unknown'),
  country: COALESCE(row.country, 'Unknown')
});
"""

# Execute the Cypher query
graph.run(cypher_query_patents3)

# Define the Cypher query to load data from the inventors CSV file and create nodes
cypher_query_patents4 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_inventors.csv' AS row

// Create inventor nodes
MERGE (inventors_pt:Inventors_PT {
  key: COALESCE(toInteger(row.key), 0),
  first_name: COALESCE(row.`first-name`, 'Unknown'),
  last_name: COALESCE(row.`last-name`, 'Unknown'),
  city: COALESCE(row.city, 'Unknown'),
  state: COALESCE(row.state, 'Unknown'),
  country: COALESCE(row.country, 'Unknown') 
})

"""

# Execute the Cypher query
graph.run(cypher_query_patents4)


# Define the Cypher query to load data from the patent abstract CSV file and create nodes
cypher_query_patents5 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_abstract.csv' AS row

// Create Abstract nodes
MERGE (patent_abstract:Patent_Abstract {
  key: COALESCE(toInteger(row.key), 0),
  Abstract: COALESCE(row.abstract, 'Unknown')
})


// Create Entities nodes
MERGE (entities:Entities {
  entities: COALESCE(row.abstract_entities, 'Unknown')
})

// Create relationships
CREATE (patent_abstract)-[:HOLDS_PT]->(entities)
"""

# Execute the Cypher query
graph.run(cypher_query_patents5)

# Create relationships between the title, assignee, inventor, application reference, and abstract nodes that share the same key
cypher_query_patents6 = """
MATCH (title_pt:Title_PT) , (assignee_pt:Assignee_PT), (inventors_pt:Inventors_PT), (applications_ref_pt:Application_Ref_PT), (patent_abstract:Patent_Abstract)
WHERE title_pt.key = assignee_pt.key AND  assignee_pt.key = inventors_pt.key AND inventors_pt.key = applications_ref_pt.key AND applications_ref_pt.key = patent_abstract.key
CREATE (title_pt)-[:ASSIGN_PT]->(assignee_pt)
CREATE (title_pt)-[:CREATED_BY_PT]->(inventors_pt)
CREATE (inventors_pt)-[:RESEARCHED_PT]->(patent_abstract)
CREATE (applications_ref_pt)-[:CONTAINS_PT]->(patent_abstract)
CREATE (patent_abstract)-[:ASSIGNED_PT]->(assignee_pt)
CREATE (title_pt)-[:HAS_PT]->(patent_abstract)
"""

# Execute the Cypher query
graph.run(cypher_query_patents6)
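
The linking query above matches five node patterns at once and keeps the combinations whose keys agree. Cypher runs each `CREATE` once per matched row, so a patent with several inventors produces several identical `ASSIGN_PT` relationships. The sketch below simulates that row count in plain Python for a simplified case with three node types. The sample records and the function name are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical sample: one patent (key 1) with one assignee and three inventors.
titles = [{"id": "T1", "key": 1}]
assignees = [{"id": "A1", "key": 1}]
inventors = [
    {"id": "I1", "key": 1},
    {"id": "I2", "key": 1},
    {"id": "I3", "key": 1},
]


def count_assign_pt_creates(titles, assignees, inventors):
    """Count how often CREATE (title)-[:ASSIGN_PT]->(assignee) would run.

    Every combination of nodes sharing a key is one matched row, and
    CREATE is executed once per row.
    """
    created = Counter()
    for title, assignee, inventor in product(titles, assignees, inventors):
        if title["key"] == assignee["key"] == inventor["key"]:
            created[(title["id"], assignee["id"])] += 1
    return created


creates = count_assign_pt_creates(titles, assignees, inventors)
print(creates)  # Counter({('T1', 'A1'): 3})
```

If duplicate relationships are not wanted, one option is to use `MERGE` instead of `CREATE` for each relationship, which reuses an existing relationship between the same pair of nodes.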

In [19]:
# Create relationships based on similarity
cypher_query7 = """
LOAD CSV WITH HEADERS FROM 'file:///llama_similarity.csv' AS row
MATCH (sbir_abstract:Sbir_Abstract {key: toInteger(row.sbir_id)})
MATCH (patent_abstract:Patent_Abstract {key: toInteger(row.patent_id)})
MERGE (sbir_abstract)-[:SIMILARITY_SCORE {score: COALESCE(toFloat(row.score), 0.0)}]->(patent_abstract);
"""

# Execute the Cypher query
graph.run(cypher_query7)
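
In the query above, `COALESCE(toFloat(row.score), 0.0)` stores a score of 0.0 whenever the CSV value is missing or not numeric, which makes those rows look like genuine zero-similarity pairs. A rough pre-check in Python, assuming the same column names (`sbir_id`, `patent_id`, `score`), can count how many rows would fall back to the default. The function names and the sample data are hypothetical.

```python
import csv
import io


def to_float_or_none(value):
    """Roughly mimic Cypher's toFloat(): None when the text is not a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarize_similarity_rows(csv_text):
    """Return (total rows, rows whose score would become the 0.0 default)."""
    total = 0
    defaulted = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        if to_float_or_none(row.get("score")) is None:
            defaulted += 1
    return total, defaulted


# Hypothetical sample with the column names used by the query above.
sample = "sbir_id,patent_id,score\n1,10,0.83\n2,11,\n3,12,n/a\n"
print(summarize_similarity_rows(sample))  # (3, 2)
```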

## 6 - Test Queries (Optional)

In [20]:
# Count all nodes
cypher_query_check = """
MATCH (n)
RETURN count(n)
"""
graph.run(cypher_query_check)

count(n)
26947


In [21]:
# Find the principals whose location is New York
cypher_query_check = """
MATCH (principal:Principal)-[:WORKSIN]->(location:Location)
WHERE location.city = 'New York'
RETURN principal.pi_name;
"""
graph.run(cypher_query_check)

principal.pi_name
Andrew Chepaitis
Yan Ivnitskiy
Harper Langston


## 7 - Conduct Queries to answer questions posed for Final Project

In [22]:
#1. What states receive the most SBIR awards in a variety of technology areas?
cypher_FP_query1 = """
MATCH (sbir_award:Sbir_Award)-[:AWARDED_TO]->(principal:Principal)-[:WORKSIN]->(location:Location)
MATCH (sbir_award:Sbir_Award)-[:CONTAINS]->(sbir_abstract)-[:HOLDS]->(e:Entities)
WHERE toLower(e.entities) = "communication"
RETURN location.state, COUNT(DISTINCT sbir_award.key) AS sbir_awards
ORDER BY sbir_awards DESC LIMIT 10;
"""
graph.run(cypher_FP_query1)

location.state,sbir_awards
CA,6
VA,3
MD,2
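
The query above hard-codes the technology area "communication". A sketch of the same query with the topic passed as a Cypher parameter is shown below. It assumes `graph` is a py2neo `Graph`, whose `run` method accepts keyword parameters and returns a cursor with a `data()` method. The names `TOPIC_QUERY` and `awards_by_state` are hypothetical.

```python
TOPIC_QUERY = """
MATCH (sbir_award:Sbir_Award)-[:AWARDED_TO]->(principal:Principal)-[:WORKSIN]->(location:Location)
MATCH (sbir_award)-[:CONTAINS]->(sbir_abstract)-[:HOLDS]->(e:Entities)
WHERE toLower(e.entities) = toLower($topic)
RETURN location.state AS state, COUNT(DISTINCT sbir_award.key) AS sbir_awards
ORDER BY sbir_awards DESC LIMIT 10
"""


def awards_by_state(graph, topic):
    """Run the topic query and return the rows as a list of dicts.

    Assumes `graph` behaves like a py2neo Graph: run() takes the query
    plus keyword parameters and returns a cursor exposing data().
    """
    return graph.run(TOPIC_QUERY, topic=topic).data()


# Example call, using the graph connection created earlier in the notebook:
# rows = awards_by_state(graph, "communication")
```

Passing the topic as `$topic` avoids editing the query text for each technology area and keeps quoting issues out of the Cypher string.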


In [23]:
#2. What states are leading research in the area of technology x (ex: Photonics)?
cypher_FP_query2 = """
CALL {
    MATCH (title_pt)-[:ASSIGN_PT]->(assignee_pt)
    MATCH (title_pt)-[:CREATED_BY_PT]->(inventors_pt)-[:RESEARCHED_PT]->(Patent_Abstract)-[:HOLDS_PT]->(e:Entities)
    WHERE toLower(e.entities) = 'communication'
    RETURN assignee_pt.state as states, COUNT(DISTINCT title_pt.key) AS SBIR_PATENT_COUNT
    UNION
    MATCH (title_pt)-[:ASSIGN_PT]->(assignee_pt)
    MATCH (title_pt)-[:CREATED_BY_PT]->(inventors_pt)-[:RESEARCHED_PT]->(Patent_Abstract)-[:HOLDS_PT]->(e:Entities)
    WHERE toLower(e.entities) = 'communication'
    RETURN inventors_pt.state as states,COUNT(DISTINCT title_pt.key) AS SBIR_PATENT_COUNT
    UNION
    MATCH (sbir_award:Sbir_Award)-[:AWARDED_TO]->(principal:Principal)-[:WORKSIN]->(location:Location)
    MATCH (sbir_award:Sbir_Award)-[:CONTAINS]->(sbir_abstract)-[:HOLDS]->(e:Entities)
    WHERE toLower(e.entities) = "communication"
    RETURN location.state as states, COUNT(DISTINCT sbir_award.key) AS SBIR_PATENT_COUNT
}
RETURN states, SBIR_PATENT_COUNT
ORDER BY SBIR_PATENT_COUNT DESC
"""
graph.run(cypher_FP_query2)

states,SBIR_PATENT_COUNT
CA,6
VA,3
PA,2
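
Because the three branches are combined with `UNION`, the same state can appear in more than one row (once per branch), and identical rows are collapsed into one. A minimal sketch of summing the rows per state in Python follows. The rows are hypothetical and only shaped like the query output. Note that a patent counted under both its assignee state and its inventor state would be counted twice by this sum.

```python
from collections import Counter


def total_by_state(rows):
    """Sum the per-branch counts so that each state appears once."""
    totals = Counter()
    for row in rows:
        totals[row["states"]] += row["SBIR_PATENT_COUNT"]
    # most_common() returns (state, total) pairs sorted by total, descending.
    return totals.most_common()


# Hypothetical rows, shaped like the output of the query above.
rows = [
    {"states": "CA", "SBIR_PATENT_COUNT": 6},
    {"states": "VA", "SBIR_PATENT_COUNT": 3},
    {"states": "CA", "SBIR_PATENT_COUNT": 2},
    {"states": "PA", "SBIR_PATENT_COUNT": 2},
]
print(total_by_state(rows))  # [('CA', 8), ('VA', 3), ('PA', 2)]
```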


In [24]:
#3. What states are benefiting the most from SBIR?
cypher_FP_query3 = """
MATCH (sbir:Sbir_Award)-[:AWARDED_TO]->(principal:Principal)-[:WORKSIN]->(location:Location)
RETURN location.state, SUM(sbir.award_amount) AS Total_Award_Amount
ORDER BY Total_Award_Amount DESC
"""
graph.run(cypher_FP_query3)

location.state,Total_Award_Amount
CA,80252207.0
MA,33537820.0
VA,24627717.0


In [25]:
#4. What businesses receive the most SBIR awards? (ranked here by total award amount)
cypher_FP_query4 = """
MATCH (Sbir_Award)-[:AWARDED_TO]->(Principal)
RETURN Principal.company, sum(Sbir_Award.award_amount) AS total_award
ORDER BY total_award desc;
"""
graph.run(cypher_FP_query4)

Principal.company,total_award
Nanosys,5563666.0
CREARE LLC,4033594.0
Thermedical Inc.,3678980.0


In [26]:
#5. Did they receive any patents on the same topic?
cypher_FP_query5 = """
MATCH (assignee_pt:Assignee_PT)<-[]-(Patent_Abstract)-[]->(Entities)<-[]-(Sbir_Abstract)<-[]-(p:Principal)
WHERE toLower(assignee_pt.orgname) contains toLower(p.company)
RETURN distinct assignee_pt.orgname as patent_assignee, p.company as SBIR_awardee
order by patent_assignee;
"""
graph.run(cypher_FP_query5)

patent_assignee,SBIR_awardee
Andluca Technologies Inc.,Andluca Technologies Inc.
"Intuitive Surgical Operations, Inc.",INTUITIVE SURGICAL
Ultra Safe Nuclear Corporation,Ultra Safe Nuclear Corporation
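
The company match in the query above is a one-directional, case-insensitive substring test: the SBIR company name must appear inside the patent assignee name. The Python equivalent below illustrates which pairs match. The function name `names_match` is hypothetical, the first pair comes from the output above, and the last pair is a made-up example of a punctuation difference.

```python
def names_match(patent_assignee, sbir_company):
    """Python equivalent of toLower(orgname) CONTAINS toLower(company)."""
    return sbir_company.lower() in patent_assignee.lower()


# Pair taken from the query output: the shorter SBIR name is contained.
print(names_match("Intuitive Surgical Operations, Inc.", "INTUITIVE SURGICAL"))  # True

# The test is not symmetric: the longer name is not inside the shorter one.
print(names_match("INTUITIVE SURGICAL", "Intuitive Surgical Operations, Inc."))  # False

# Hypothetical pair: a comma in one spelling is enough to miss the match.
print(names_match("Example Labs, LLC", "EXAMPLE LABS LLC"))  # False
```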


In [27]:
#6. How many additional patents did the businesses generate 1 year after the SBIR award was completed?


In [28]:
#7. What technology areas are receiving the most interest from Government agencies and have patent filings?
cypher_FP_query7 = """
MATCH (agency:Agency)-[:OVERSEES]->(sbir_award:Sbir_Award)-[:CONTAINS]->(sbir_abstract:Sbir_Abstract)-[:HOLDS]->(entities:Entities)<-[:HOLDS_PT]-(patent_abstract:Patent_Abstract)
RETURN entities.entities AS INTERESTING_TECHNOLOGY, count(entities.entities) AS ENTITY_COUNT
ORDER BY ENTITY_COUNT DESC;
"""
graph.run(cypher_FP_query7)

INTERESTING_TECHNOLOGY,ENTITY_COUNT
method,7200
system,7139
patient,4032


In [33]:
#8. What companies are working in similar research areas in x (ex: AI)?
# As a graph (returns the entity and company nodes):
cypher_FP_query8a = """
CALL {
    MATCH (p:Principal)-[:RESEARCHED]->(sbir_abstract:Sbir_Abstract)-[:HOLDS]->(e:Entities)
    WHERE EXISTS {
        MATCH ()-[r:HOLDS|HOLDS_PT]->(e)
        WITH e, count(r) as rel_cnt
        WHERE rel_cnt > 1
    }
    RETURN e as entity, p as company
    ORDER BY entity 
    UNION
    MATCH (ap:Assignee_PT)<-[:ASSIGN_PT]-(title_pt:Title_PT)-[:HAS_PT]->(patent_abstract:Patent_Abstract)-[:HOLDS_PT]->(e:Entities)
    WHERE EXISTS {
        MATCH ()-[r:HOLDS|HOLDS_PT]->(e)
        WITH e, count(r) as rel_cnt
        WHERE rel_cnt > 1
    }
    RETURN e as entity, ap as company
    ORDER BY entity 
}
RETURN entity, company
ORDER BY entity 
"""
#graph.run(cypher_FP_query8a)


# As text (returns lowercase entity and company names):
cypher_FP_query8b = """
CALL {
    MATCH (p:Principal)-[:RESEARCHED]->(sbir_abstract:Sbir_Abstract)-[:HOLDS]->(e:Entities)
    WHERE EXISTS {
        MATCH ()-[r:HOLDS|HOLDS_PT]->(e)
        WITH e, count(r) as rel_cnt
        WHERE rel_cnt > 1
    }
    RETURN toLower(e.entities) as entity, toLower(p.company) as company
    ORDER BY entity 
    UNION
    MATCH (ap:Assignee_PT)<-[:ASSIGN_PT]-(title_pt:Title_PT)-[:HAS_PT]->(patent_abstract:Patent_Abstract)-[:HOLDS_PT]->(e:Entities)
    WHERE EXISTS {
        MATCH ()-[r:HOLDS|HOLDS_PT]->(e)
        WITH e, count(r) as rel_cnt
        WHERE rel_cnt > 1
    }
    RETURN toLower(e.entities) as entity, toLower(ap.orgname) as company
    ORDER BY entity 
}
RETURN entity, company
ORDER BY entity
"""
graph.run(cypher_FP_query8b)


entity,company
3-dimensional,"reactive innovations, llc"
3-dimensional,aerodyne research inc
3-dimensional,"synthecon, inc."
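
The text output lists one row per entity and company pair. To see which companies share a research area, the rows can be grouped by entity. The sketch below does this in Python using the three rows shown above. The function name is hypothetical, and keeping only entities with at least two distinct companies is an assumption that differs from the query's `rel_cnt > 1` filter, which counts relationships rather than companies.

```python
from collections import defaultdict


def companies_by_entity(rows):
    """Group companies by entity and keep entities shared by 2+ companies."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["entity"]].add(row["company"])
    return {
        entity: sorted(companies)
        for entity, companies in grouped.items()
        if len(companies) > 1
    }


# The first three rows of the output above.
rows = [
    {"entity": "3-dimensional", "company": "reactive innovations, llc"},
    {"entity": "3-dimensional", "company": "aerodyne research inc"},
    {"entity": "3-dimensional", "company": "synthecon, inc."},
]
print(companies_by_entity(rows))
```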
