## Neo4j Code to load data, nodes, properties and relationships

This notebook imports data into a Neo4j graph database and runs queries against it.

## 1 - Install Neo4j Desktop

Here are the steps to install and set up Neo4j on an Ubuntu laptop.
Similar instructions are available for macOS, Windows, and other Linux distributions.

### --------------neo4j instructions------------------
***Neo4j Desktop is most likely all that is needed for the system to operate, but these are the steps used on the test system.***

### How to install Neo4j Desktop on Ubuntu:

First install FUSE (pick the steps that match your version of Ubuntu):

https://github.com/AppImage/AppImageKit/wiki/FUSE


For Ubuntu 22.04 or later, type the following in a terminal:

sudo add-apt-repository universe

sudo apt install libfuse2


Download the AppImage from https://neo4j.com/download/


In a terminal, from the folder where the file was downloaded, run the following (you might need to change the file version number):

chmod a+x neo4j-desktop-1.0.3-x86_64.AppImage

./neo4j-desktop-1.0.3-x86_64.AppImage



### -------------------------neo4j configuration/file paths------------------

Once Neo4j is installed, run Neo4j Desktop by typing the following in a terminal:


XXXXXXX


Neo4j Desktop may also open as part of the installation.

Keep track of the following port settings. They will need to be entered in the neo4j_graph_etl notebook:


bolt: 7687 → 7689

http: 7474 → 7475


Also keep track of:


username: neo4j

password: XXXXX


These will also need to be entered in the notebook.


In the Neo4j Desktop app, from the "home" or "startup" screen under the example project, click Add, then click Local DBMS.

Enter the name of the server (e.g., SBIR).


A new server with that name will show up under the example project. Click the three dots to the right of the server name.

Click Open folder, then click Import. A new window will open showing the contents of the import folder.

The path to that folder should be listed at the top of the screen. Copy or write down that path and save it.

This path is needed to import CSV files into Neo4j; the notebook needs to know the path to the Neo4j import folder.


Example: /home/laben/.config/Neo4j Desktop/Application/relate-data/dbmss/dbms-6959f6d4-5ba3-45e0-a8db-7a248c595605/import


## 2 - Set Variables

After completing the neo4j installation, update the variables in the notebook to reference the specific values for the system the notebook is running on.
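
One way to avoid editing machine-specific values (and keeping passwords) directly in the notebook is to read them from environment variables. The following is a minimal sketch, not part of the original notebook; the variable names `NEO4J_BOLT_PORT`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`, and `NEO4J_IMPORT_DIR` and the helper `load_neo4j_settings` are assumptions made for illustration.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Collect connection settings from a mapping of environment variables.

    All variable names here are hypothetical; adjust them to your setup.
    """
    bolt_port = env.get("NEO4J_BOLT_PORT", "7687")  # 7687 is the default bolt port
    return {
        "url": f"bolt://localhost:{bolt_port}",
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
        "import_dir": env.get("NEO4J_IMPORT_DIR", ""),
    }


# Example with an explicit mapping instead of the real environment
settings = load_neo4j_settings({"NEO4J_BOLT_PORT": "7689", "NEO4J_PASSWORD": "example"})
print(settings["url"])
```

The resulting values could then be assigned to `neo4j_url`, `neo4j_username`, `neo4j_password`, and `neo4j_import_dir` in the settings cell.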

## 3 - Load in base level csv files and flatten them for import into neo4j

Create additional csv files after flattening some of the cells in the original csv files.

## 4 - Copy Data files to neo4j data directory

Copy the CSV files from the input file directory into the Neo4j import directory.

## 5 - Create nodes and edges in neo4j

Run Cypher queries to create nodes and edges

## 6 - Test Queries (Optional)

Run some test queries to verify that the data is set up correctly in Neo4j.

## 7 - Conduct Queries to answer questions posed for Final Project

These are the queries the team proposed to answer for the DSE 203 Final Project.

In [1]:
## install 
#pip install py2neo

In [2]:
import pandas as pd
from pandas import json_normalize
import ast
from ast import literal_eval
from py2neo import Graph
import spacy
import subprocess

## 2 - Set Variables

In [3]:
# Set variables to each specific user/machine that is running the notebook

#Username will be used to switch to different users settings as configured in this file
#neo4j_import_dir =  "/Users/prakhar/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-c1b4f9c3-baa6-45d6-a13c-31ec4ba4c393/import" #PS
neo4j_import_dir = "/Users/prakhar/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-c1b4f9c3-baa6-45d6-a13c-31ec4ba4c393/import"
#"/home/laben/.config/Neo4j Desktop/Application/relate-data/dbmss/dbms-6959f6d4-5ba3-45e0-a8db-7a248c595605/import" #LF
#neo4j_import_dir = "sagar"

#neo4j connection settings
#neo4j_url = "bolt://localhost:7689" #LF
neo4j_url = "bolt://localhost:7690" #PS
neo4j_username = "neo4j"
#neo4j_password = "Welcome19" LF
neo4j_password = "Welcome19#"
#neo4j_password = "Welcome19#" #PS
# Paths to the preprocessed SBIR CSV sample and the patents JSON file
#graph = Graph("bolt://localhost:7690", auth=("neo4j", "Welcome19#"))
filename_sbir = './preprocessed_files/sbir_1k_sample.csv'
filename_patents = './preprocessed_files/patents.json'


In [4]:
# Connect to your Neo4j database
graph = Graph(neo4j_url, auth=(neo4j_username, neo4j_password))

## 3 - load in base level csv files and flatten them for import into neo4j

### Load the SBIR sample data

In [5]:
# Read the SBIR CSV file into a pandas DataFrame
df_sbir = pd.read_csv(filename_sbir)
#df_sbir.rename(columns={'Agency Tracking Number': 'id_Agency Tracking Number'}, inplace=True)

df_sbir['Award Amount'] = pd.to_numeric(df_sbir['Award Amount'].str.replace(',', ''), errors='coerce')


df_sbir['key'] = df_sbir.id # PS Change 12/08 12 PM
df_sbir.head(5)

Unnamed: 0,id,Company,Award Title,Agency,Branch,Phase,Program,Agency Tracking Number,Contract,Proposal Award Date,...,Contact Email,PI Name,PI Title,PI Phone,PI Email,RI Name,RI POC Name,RI POC Phone,abstract_entities,key
0,16068,"EXPEDITION TECHNOLOGY, INC.",Fast Recovery Of Signal Estimates using Neural...,Department of Defense,Navy,Phase I,SBIR,N192-128-0403,N68335-19-C-0742,10/21/2019,...,marc.harlacher@exptechinc.com,Mike Tinston,Chief Scientist,(571) 246-8479,mike.tinston@exptechinc.com,,,,"['condition', 'signal density', 'signal mingle...",16068
1,187914,"Apa Optics, Inc.",EXTRAVEHICULAR MOBILITY UNIT HELMET MOUNTED DI...,National Aeronautics and Space Administration,,Phase II,SBIR,6818,,,...,,David E Stoltzmann,,() -,,,,,"['OPTICAL MECHANICAL DESIGNS', 'opto-mechanica...",187914
2,97687,"RADIATION MONITORING DEVICES, INC.","An Efficient, Solid State Detector for Nuclear...",Department of Energy,,Phase II,SBIR,80498S06-I,DE-FG02-06ER84430,,...,GEntine@rmdinc.com,Michael Squillante,Dr,(617) 668-6808,MSquillante@rmdinc.com,,,,"['signal-to-noise ratio', 'Computed tomography...",97687
3,38328,"QUASAR FEDERAL SYSTEMS, INC.",Miniature Oriented Tri-Axial Fluxgate Magnetom...,Department of Defense,Navy,Phase I,SBIR,N172-116-0564,N68335-17-C-0621,08/30/2017,...,twrightson@quasarfs.com,Yongming Zhang,CEO,(858) 412-1737,yongming@quasarfs.com,,,,"['long range sigint', 'magnetic signature', 'e...",38328
4,82571,"INVOTEK, INC.",Prosody enhanced TTS for Dysarthric Speakers,Department of Health and Human Services,National Institutes of Health,Phase I,SBIR,DC010085,1R43DC010085-01,,...,tjakobs@invotek.org,THOMAS JAKOBS,,(479) 632-4166,TJAKOBS@INVOTEK.ORG,,,,"['feasibility project', 'control prosody compu...",82571


In [6]:
df_sbir.columns

Index(['id', 'Company', 'Award Title', 'Agency', 'Branch', 'Phase', 'Program',
       'Agency Tracking Number', 'Contract', 'Proposal Award Date',
       'Contract End Date', 'Solicitation Number', 'Solicitation Year',
       'Solicitation Close Date', 'Proposal Receipt Date',
       'Date of Notification', 'Topic Code', 'Award Year', 'Award Amount',
       'Duns', 'HUBZone Owned', 'Socially and Economically Disadvantaged',
       'Women Owned', 'Number Employees', 'Company Website', 'Address1',
       'Address2', 'City', 'State', 'Zip', 'Abstract', 'Contact Name',
       'Contact Title', 'Contact Phone', 'Contact Email', 'PI Name',
       'PI Title', 'PI Phone', 'PI Email', 'RI Name', 'RI POC Name',
       'RI POC Phone', 'abstract_entities', 'key'],
      dtype='object')

In [7]:
selected_columns1 = ['key',
                     'Agency Tracking Number', 'Company', 'Branch', 'Award Title', 'Agency', 'Company Website', 'Contract', 'Proposal Award Date', 'Contract End Date', 'Solicitation Number', 'Solicitation Year',
                     'Solicitation Close Date', 'Proposal Receipt Date', 'Date of Notification', 'Topic Code', 'Award Year', 'Award Amount',
                     'PI Name', 'PI Title', 'PI Phone', 'PI Email', 'RI Name', 'RI POC Name', 'RI POC Phone','Address1','Address2', 'City', 'State', 'Zip',
                     'HUBZone Owned', 'Socially and Economically Disadvantaged','Women Owned'
                    ]

selected_columns2 = ['key','Abstract','abstract_entities']   ##need to remove/update  #PS update 12/08/1:45 PM
# Remove leading/trailing whitespaces
selected_columns1 = [col.strip() for col in selected_columns1]
selected_columns2 = [col.strip() for col in selected_columns2]

df_sbirs = df_sbir[selected_columns1]

df_sbirent = df_sbir[selected_columns2]


df_sbirs.to_csv('./preprocessed_files/neo4j_sbir_main.csv', index=False)

In [8]:
# Explode the 'abstract_entities' column (needs updating)
# This converts each list of entities into individual entities
# that can be placed into separate rows in a CSV file, which will be imported into Neo4j
#df_source_sbir = df_sbirent

df_entities_sbir = pd.DataFrame(columns=['key','entities'])
#df_sbirent['abstract_entities']
#df_exploded_sbir = df_sbirent.explode('abstract_entities')
#df_exploded_sbir = df_source_sbir
#Display the resulting DataFrame
#display(df_sbirent)
#display(df_sbirent['abstract_entities'].iloc[0])
#test = df_sbirent['abstract_entities'].iloc[0].replace('\'','').replace('[','').replace(']','').split(',')
#print(test)
#print(len(test))

for i in range(0, len(df_sbirent)):
    temp_list = df_sbirent['abstract_entities'].iloc[i].replace('\'','').replace('[','').replace(']','').split(',')

    for x in range(0, len(temp_list)):
        #print(temp_list[x])
        df_entities_sbir.loc[len(df_entities_sbir.index)] = [df_sbirent['key'].iloc[i], temp_list[x].strip()]
        #df_sbirent.loc[len(df_entities_sbir.index)] = [df_sbirent['key'].iloc[i], temp_list[x]]

df_sbir_abstract_entities = pd.merge(df_sbirent, df_entities_sbir, on='key', how='inner') 
display(df_sbir_abstract_entities)
df_sbir_abstract_entities.to_csv('./preprocessed_files/neo4j_sbir_abstract_entities.csv', index=False)

Unnamed: 0,key,Abstract,abstract_entities,entities
0,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",condition
1,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",signal density
2,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",signal mingle
3,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",signal intermodulation product desire
4,16068,Current automated RF signal acquisition and an...,"['condition', 'signal density', 'signal mingle...",Bandwidth ( ibw ) receiver wide bit-depth reduce
...,...,...,...,...
20899,171725,We propose to provide a set of general design ...,"['system configuration', 'central distribution...",expert design engineering
20900,171725,We propose to provide a set of general design ...,"['system configuration', 'central distribution...",sample transfer
20901,171725,We propose to provide a set of general design ...,"['system configuration', 'central distribution...",efficiency
20902,171725,We propose to provide a set of general design ...,"['system configuration', 'central distribution...",integration
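
The row-by-row loop in cell In [8] strips quotes and brackets and then splits on commas, which would also split any entity that itself contains a comma. A possible alternative, sketched here under the assumption that every `abstract_entities` cell is a valid Python list literal, parses the string with `literal_eval` and uses `DataFrame.explode`. The helper name `explode_entities` and the sample data are hypothetical.

```python
from ast import literal_eval

import pandas as pd


def explode_entities(df, column="abstract_entities"):
    """Turn a column of stringified lists into one row per entity."""
    out = df.copy()
    # Parse "['a', 'b']" into a real Python list
    out["entities"] = out[column].apply(literal_eval)
    # One row per list element; empty lists become NaN and are dropped
    out = out.explode("entities").dropna(subset=["entities"])
    out["entities"] = out["entities"].str.strip()
    return out.reset_index(drop=True)


# Made-up sample rows, including an entity that contains a comma
sample = pd.DataFrame(
    {
        "key": [1, 2],
        "abstract_entities": ["['signal density', 'noise, thermal']", "[]"],
    }
)
exploded = explode_entities(sample)
print(exploded[["key", "entities"]])
```

Unlike the loop, this sketch drops rows whose entity list is empty rather than emitting an empty entity string, so row counts may differ from the output shown above.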


### Load the Patent sample data

In [9]:
# Read in the patent data
df_patent = pd.read_json(filename_patents)
df_patent['key'] = df_patent['doc-number']  # PS Change 12/08 12 PM
#df_patent.head(5)

# Display the DataFrame
display(df_patent)

Unnamed: 0,country,doc-number,date,application-reference,title,assignee,inventors,abstract,claims,abstract_entities,claim_entities,key
0,US,20230225235,20230720,"{'country': 'US', 'doc-number': 17754513, 'dat...","AGRICULTURAL TRENCH DEPTH SYSTEMS, METHODS, AN...","{'orgname': 'Precision Planting LLC', 'city': ...","[{'last-name': 'Sloneker', 'first-name': 'Dill...",\nA row unit (10) of an agricultural planter w...,a row unit frame;\na furrow opening disc rotat...,"[depth, depth adjustment body, electric motor,...","[depth, detect, depth adjustment body, adjustm...",20230225235
1,US,20230225236,20230720,"{'country': 'US', 'doc-number': 18007883, 'dat...",Agricultural Attachment for Cultivating Row Crops,{'orgname': 'Amazonen-Werke H. Dreyer SE & Co....,"[{'last-name': 'RESCH', 'first-name': 'Rainer'...",\nThe invention relates to an agricultural att...,"a row-detection device adapted to detect, duri...","[row-detection device design, cultivation proc...","[automatic, cultivation process, plurality soi...",20230225236
2,US,20230225237,20230720,"{'country': 'US', 'doc-number': 18121636, 'dat...",TRAVEL LINE CREATION SYSTEM FOR AGRICULTURAL M...,"{'orgname': 'KUBOTA CORPORATION', 'city': 'Osa...","[{'last-name': 'MORIMOTO', 'first-name': 'Taka...",\nA travel line creation system for an agricul...,a position acquirer to acquire position measur...,"[point agricultural machine, point field, meas...","[plurality work, working device, angle criteri...",20230225237
3,US,20230225238,20230720,"{'country': 'US', 'doc-number': 18187398, 'dat...",AGRICULTURAL HARVESTING MACHINE WITH PRE-EMERG...,"{'orgname': 'Deere & Company', 'city': 'Moline...","[{'last-name': 'BLANK', 'first-name': 'Sebasti...",\nAn agricultural harvesting machine includes ...,crop processing functionality configured to en...,"[seed treatment, seed, control signal, crop pr...","[imaging sensor, imaging, harvesting operation...",20230225238
4,US,20230225239,20230720,"{'country': 'US', 'doc-number': 18190358, 'dat...","DETECTION OF PLANT DISEASES WITH MULTI-STAGE, ...","{'orgname': 'CLIMATE LLC', 'city': 'Saint Loui...","[{'last-name': 'Guan', 'first-name': 'Wei', 'c...",\nA computer system is provided comprising a c...,a classification model management server compu...,"[digital model region, image user device, comp...","[model construction instruction, resize photo ...",20230225239
...,...,...,...,...,...,...,...,...,...,...,...,...
904,US,20230226141,20230720,"{'country': 'US', 'doc-number': 18157933, 'dat...",ANGIOTENSIN II ALONE OR IN COMBINATION FOR THE...,{'orgname': 'The George Washington University ...,"[{'last-name': 'CHAWLA', 'first-name': 'Lakhmi...","\nThe present invention relates, inter alia, t...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,"[treatment, output shock, invention, effective...",[],20230226141
906,US,20230226142,20230720,"{'country': 'US', 'doc-number': 18166956, 'dat...",IONIC SELF-ASSEMBLING PEPTIDES,"{'orgname': '3-D MATRIX, LTD.', 'city': 'Tokyo...","[{'last-name': 'Gil', 'first-name': 'Eun Seok'...",\nProvided herein are ionic self-assembling pe...,\n\na) an N-terminal functional group selected...,"[method, pharmaceutical composition, ionic sel...","[lesion site, area treatment, n-terminal funct...",20230226142
908,US,20230226143,20230720,"{'country': 'US', 'doc-number': 18011718, 'dat...",PHARMACEUTICAL COMPOSITION COMPRISING A COMBIN...,"{'orgname': 'OCVIRK, Soren', 'city': 'Kranzber...","[{'last-name': 'Ocvirk', 'first-name': 'Sören'...",\nA pharmaceutical composition includes a comb...,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n,"[short-chain c2 c5 fatty acid, pharmaceutical ...",[],20230226143
909,US,20230226144,20230720,"{'country': 'US', 'doc-number': 18186639, 'dat...",TOPICAL ANTIBIOTIC,"{'orgname': 'Concept Matrix Solutions', 'city'...","[{'last-name': 'LaRosa', 'first-name': 'Tony',...",\nProvided herein is a topical antibiotic comp...,"(a) at least one of bacitracin, bacitracin zin...","[skin surface, method, antibiotic agent, compo...","[sodium hydroxide, magnesium stearate, ceteary...",20230226144


In [10]:
selected_columns3 = ['key','doc-number', 'country',  'application-reference', 'title', 'assignee', 'inventors']
selected_columns4 = ['key', 'abstract','abstract_entities']  ##need to remove/update  #PS update 12/08/1:45 PM

#selected_columns4 = ['key', 'abstract_entities']  ## need to add

selected_columns3 = [col.strip() for col in selected_columns3]
selected_columns4 = [col.strip() for col in selected_columns4]

df_patents = df_patent[selected_columns3]
df_patents_main = df_patents[['key','doc-number', 'country', 'title']]
df_patentsent= df_patent[selected_columns4]

#df_patentsent.rename(columns={'abstract': 'abstract_entities'}, inplace=True)
display(df_patents_main)
df_patents_main.to_csv('./preprocessed_files/neo4j_patents_main.csv', index=False)



Unnamed: 0,key,doc-number,country,title
0,20230225235,20230225235,US,"AGRICULTURAL TRENCH DEPTH SYSTEMS, METHODS, AN..."
1,20230225236,20230225236,US,Agricultural Attachment for Cultivating Row Crops
2,20230225237,20230225237,US,TRAVEL LINE CREATION SYSTEM FOR AGRICULTURAL M...
3,20230225238,20230225238,US,AGRICULTURAL HARVESTING MACHINE WITH PRE-EMERG...
4,20230225239,20230225239,US,"DETECTION OF PLANT DISEASES WITH MULTI-STAGE, ..."
...,...,...,...,...
904,20230226141,20230226141,US,ANGIOTENSIN II ALONE OR IN COMBINATION FOR THE...
906,20230226142,20230226142,US,IONIC SELF-ASSEMBLING PEPTIDES
908,20230226143,20230226143,US,PHARMACEUTICAL COMPOSITION COMPRISING A COMBIN...
909,20230226144,20230226144,US,TOPICAL ANTIBIOTIC


In [11]:

df_source_patent = df_patentsent

# Convert the string representation to a list
#(need to update)
#df_source_patent['abstract_entities'] = df_source_patent['abstract_entities'].apply(ast.literal_eval)

# Explode the 'abstract_entities' column
#(need to update)
df_exploded_patents = df_source_patent.explode('abstract_entities') #PS update 12/08/1:45 PM


# Display the resulting DataFrame
display(df_exploded_patents)

df_exploded_patents.to_csv('./preprocessed_files/neo4j_patents_abstract.csv', index=False)

Unnamed: 0,key,abstract,abstract_entities
0,20230225235,\nA row unit (10) of an agricultural planter w...,depth
0,20230225235,\nA row unit (10) of an agricultural planter w...,depth adjustment body
0,20230225235,\nA row unit (10) of an agricultural planter w...,electric motor
0,20230225235,\nA row unit (10) of an agricultural planter w...,adjustment body
0,20230225235,\nA row unit (10) of an agricultural planter w...,rotation shaft
...,...,...,...
909,20230226144,\nProvided herein is a topical antibiotic comp...,composition
910,20230226145,\nThe disclosure relates to a pharmaceutical c...,method
910,20230226145,\nThe disclosure relates to a pharmaceutical c...,pharmaceutical composition
910,20230226145,\nThe disclosure relates to a pharmaceutical c...,autoimmune disease


In [12]:
# Functions used to navigate json file and extract information

# Parse a string as a Python literal; if it cannot be parsed, return the value unchanged
def safe_eval(value):
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError):
        return value

# If there is only 1 level of nesting, this function will extract the data
def flatten_nested_columns(df, column_name):
    # Use safe_eval to handle malformed data
    df[column_name] = df[column_name].apply(safe_eval)
    
    # Flatten the nested dictionary
    df_flat = pd.concat([df.drop(column_name, axis=1),
                        json_normalize(df[column_name])], axis=1)
    
    return df_flat

# If there are 2 levels of nesting, this function will extract the data
def flatten_nested_nested_columns(df, column_name):
    df[column_name] = df[column_name].apply(safe_eval)
    
    # Create an empty DataFrame to store the flattened results
    df_flat = pd.DataFrame()
    
    # Iterate over each row
    for _, row in df.iterrows():
        # Flatten the nested column using json_normalize for each row
        df_row_flat = pd.json_normalize(row[column_name]).reset_index(drop=True)
        
        # Add the row's key to the flattened DataFrame
        df_row_flat['key'] = row['key']
        
        # Concatenate the results for each row
        df_flat = pd.concat([df_flat, df_row_flat], ignore_index=True)
    
    return df_flat

In [13]:
# 1st Flattening, which is used to generate the patent application reference node and associated properties
# This is the root node for patents

# Create a DataFrame
df = df_patents[['key','application-reference']]

# Flatten the specified nested column
df_flat = flatten_nested_columns(df, 'application-reference')

# Display the flattened DataFrame
print(df_flat)

df_flat.to_csv('./preprocessed_files/neo4j_patents_application_reference.csv', index=False)

              key country  doc-number      date
0    2.023023e+10      US  17754513.0  20200922
1    2.023023e+10      US  18007883.0  20210525
2    2.023023e+10      US  18121636.0  20230315
3    2.023023e+10      US  18187398.0  20230321
4    2.023023e+10      US  18190358.0  20230327
..            ...     ...         ...       ...
852           NaN      US  18046930.0  20221016
862           NaN      US  18115752.0  20230228
871           NaN      US  18007713.0  20210528
875           NaN      US  18166956.0  20230209
877           NaN      US  18186639.0  20230320

[903 rows x 4 columns]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_name] = df[column_name].apply(safe_eval)
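
The `NaN` values in the `key` column of the output above look consistent with an index mismatch: the patent DataFrame has gaps in its index (for example 904, 906, 908), while the flattened frame appears to be numbered from 0, so `pd.concat(..., axis=1)` cannot line the rows up. The sketch below shows one way to avoid both this and the `SettingWithCopyWarning`; it assumes the nested column already holds dictionaries, and the function name `flatten_nested_column_aligned` and the sample data are hypothetical.

```python
import pandas as pd


def flatten_nested_column_aligned(df, column_name):
    """Flatten a column of dicts while keeping rows aligned with their keys."""
    # Work on a copy with a fresh 0..n-1 index so both frames share the same index
    work = df.reset_index(drop=True).copy()
    nested = pd.json_normalize(work[column_name].tolist())
    return pd.concat([work.drop(columns=column_name), nested], axis=1)


# Made-up sample with a non-contiguous index, similar to the patent data
sample = pd.DataFrame(
    {
        "key": [101, 102, 103],
        "assignee": [
            {"orgname": "A", "city": "X"},
            {"orgname": "B", "city": "Y"},
            {"orgname": "C", "city": "Z"},
        ],
    },
    index=[0, 2, 5],
)
flat = flatten_nested_column_aligned(sample, "assignee")
print(flat)
```

Because no `NaN` is introduced, the `key` column also keeps its integer type instead of being shown as a float.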


In [14]:
# 2nd Flattening, which is used to extract the assignee node and its associated properties
# Select the key and the nested assignee data
df = df_patents[['key','assignee']]

# Flatten the specified nested column
df_flat = flatten_nested_columns(df, 'assignee')

# Display the flattened DataFrame
print(df_flat)


df_flat.to_csv('./preprocessed_files/neo4j_patents_assignee.csv', index=False)

              key                               orgname  \
0    2.023023e+10                Precision Planting LLC   
1    2.023023e+10  Amazonen-Werke H. Dreyer SE & Co. KG   
2    2.023023e+10                    KUBOTA CORPORATION   
3    2.023023e+10                       Deere & Company   
4    2.023023e+10                           CLIMATE LLC   
..            ...                                   ...   
852           NaN               UNIVERSITY OF ROCHESTER   
862           NaN                    Probiotical S.p.A.   
871           NaN                 Black Cat Bio Limited   
875           NaN                      3-D MATRIX, LTD.   
877           NaN              Concept Matrix Solutions   

                         city state country  
0                     Tremont    IL      US  
1                   Hasbergen   NaN      DE  
2                   Osaka-shi   NaN      JP  
3                      Moline    IL      US  
4                 Saint Louis    MO      US  
..             

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_name] = df[column_name].apply(safe_eval)


In [15]:
# 3rd Flattening, which is used to extract the inventor nodes (a list of inventors per patent)

# Extract specific columns
df = df_patents[['key', 'inventors']].dropna()

# Flatten the specified nested column
df_flat = flatten_nested_nested_columns(df, 'inventors')

# Display the flattened DataFrame
print(df_flat)


df_flat.to_csv('./preprocessed_files/neo4j_patents_inventors.csv', index=False)


     last-name first-name             city state country          key
0     Sloneker     Dillon          Danvers    IL      US  20230225235
1        Hodel  Jeremy J.           Morton    IL      US  20230225235
2      Schlipf     Ben L.              NaN   NaN      US  20230225235
3        RESCH     Rainer       Hagen a TW   NaN      DE  20230225236
4       MAHLER        Tom              NaN   NaN      US  20230225236
...        ...        ...              ...   ...     ...          ...
2733    Ocvirk      Sören        Kranzberg   NaN      DE  20230226143
2734    LaRosa       Tony   Woodland Hills    CA      US  20230226144
2735  Davidson     Robert   Woodland Hills    CA      US  20230226144
2736      Reid      David  Woodland Hillls    CA      US  20230226144
2737    Barnea   Eytan R.         New York    NY      US  20230226145

[2738 rows x 6 columns]


## 4 - Copy Data files to neo4j data directory

In [16]:
# Delete any CSV files from the Neo4j import directory if required
#!rm '/Users/prakhar/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-c1b4f9c3-baa6-45d6-a13c-31ec4ba4c393/import'/*.csv
#!ls '/Users/prakhar/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-c1b4f9c3-baa6-45d6-a13c-31ec4ba4c393/import'

In [17]:
#subprocess.run(["ls", "-l"])
# Copy each preprocessed CSV into the Neo4j import directory and print any cp errors
csv_files = [
    "neo4j_sbir_main.csv",
    "neo4j_patents_main.csv",
    "neo4j_patents_abstract.csv",
    "neo4j_patents_application_reference.csv",
    "neo4j_patents_assignee.csv",
    "neo4j_patents_inventors.csv",
    "similarity.csv",
    "neo4j_sbir_abstract_entities.csv",
]
for csv_file in csv_files:
    output = subprocess.run(["cp", "./preprocessed_files/" + csv_file, neo4j_import_dir], shell=False, capture_output=True)
    print(output.stderr)

# List the import directory to confirm the files arrived
output = subprocess.run(["ls", neo4j_import_dir], shell=False, capture_output=True)
print(output.stderr)
print(output.stdout)


b''
b''
b''
b''
b''
b''
b''
b''
b''
b'2\nneo4j_patents_abstract.csv\nneo4j_patents_application_reference.csv\nneo4j_patents_assignee.csv\nneo4j_patents_inventors.csv\nneo4j_patents_main.csv\nneo4j_sbir_abstract_entities.csv\nneo4j_sbir_main.csv\nsimilarity.csv\n'
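
The copy step relies on the `cp` and `ls` commands, which are not available on every platform. As a rough alternative, the sketch below uses `shutil` and `pathlib` from the standard library and reports which files were missing. The helper `copy_csv_files` is hypothetical, and the demonstration uses temporary directories rather than the real import folder.

```python
import shutil
import tempfile
from pathlib import Path


def copy_csv_files(source_dir, import_dir, names):
    """Copy the named files into import_dir; return (copied, missing) name lists."""
    copied, missing = [], []
    for name in names:
        src = Path(source_dir) / name
        if src.is_file():
            shutil.copy(src, Path(import_dir) / name)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing


# Demonstration with temporary directories and one placeholder file
with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    (Path(src_dir) / "neo4j_sbir_main.csv").write_text("key\n1\n")
    copied, missing = copy_csv_files(
        src_dir, dst_dir, ["neo4j_sbir_main.csv", "similarity.csv"]
    )
    copied_exists = (Path(dst_dir) / "neo4j_sbir_main.csv").is_file()

print("copied:", copied)
print("missing:", missing)
```

In the notebook, `source_dir` would be `./preprocessed_files` and `import_dir` would be `neo4j_import_dir`.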


## 5 - Create nodes and edges in neo4j

In [18]:
# Define the Cypher query to load data and define nodes and edges from SBIR
cypher_query_sbir1 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_sbir_main.csv' AS row

// Create Location nodes
MERGE (location:Location {
  key: COALESCE(toInteger(row.key), 0),
  city: COALESCE(row.City, 'Unknown'),
  state: COALESCE(row.State, 'Unknown'),
  zip: COALESCE(row.Zip, 'Unknown'),
  address1: COALESCE(row.Address1, 'Unknown'),
  address2: COALESCE(row.Address2, 'Unknown')
})

// Create Agency nodes
MERGE (agency:Agency {
  key: COALESCE(toInteger(row.key), 0),
  name: COALESCE(row.Agency, 'Unknown'),
  branch: COALESCE(row.Branch, 'Unknown'),
  website: COALESCE(row.`Company Website`, 'Unknown')
})

// Create Award nodes
MERGE (award:AWARD {
  key: COALESCE(toInteger(row.key), 0),
  topic_code: COALESCE(row.`Topic Code`, 'Unknown'),
  award_title: COALESCE(row.`Award Title`, 'Unknown'),
  award_year: COALESCE(row.`Award Year`, 'Unknown'),
  award_amount: TOFLOAT(COALESCE(row.`Award Amount`, '0.0'))

})

// Create Principal nodes
MERGE (principal:Principal {
  key: COALESCE(toInteger(row.key), 0),
  company: COALESCE(row.Company, 'Unknown'),
  pi_name: COALESCE(row.`PI Name`, 'Unknown'),
  pi_title: COALESCE(row.`PI Title`, 'Unknown'),
  pi_phone: COALESCE(row.`PI Phone`, 'Unknown'),
  pi_email: COALESCE(row.`PI Email`, 'Unknown'),
  hubzone_owned: COALESCE(row.`HUBZone Owned`, 'Unknown'),
  socially_economically_disadvantaged: COALESCE(row.`Socially and Economically Disadvantaged`, 'Unknown'),
  women_owned: COALESCE(row.`Women Owned`, 'Unknown')
})

// Create relationships
CREATE (agency)-[:OVERSEES]->(award)
CREATE (award)-[:BELONGSTO]->(principal)
CREATE (principal)-[:WORKSIN]->(location)
"""
# Note: consider a GRANTED_TO relationship rather than BELONGSTO

# Execute the Cypher query
graph.run(cypher_query_sbir1)

# Define the Cypher query to load data from the second CSV file and create nodes
cypher_query_sbir2 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_sbir_abstract_entities.csv' AS row

// Create Abstract nodes
MERGE (sbir_abstract:Sbir_Abstract {
  key: COALESCE(toInteger(row.key), 0),
  Abstract: COALESCE(row.Abstract, 'Unknown')
})


// Create Entities nodes
MERGE (entities:Entities {
  entities: COALESCE(row.entities, 'Unknown')
})

// Create relationships
CREATE (sbir_abstract)-[:HOLDS]->(entities)
"""

# Execute the Cypher query
graph.run(cypher_query_sbir2)

# Create relationships between nodes from the second file and existing nodes from the first file
cypher_query_sbir3 = """
MATCH (award:AWARD), (principal:Principal), (sbir_abstract:Sbir_Abstract)
WHERE award.key = principal.key AND principal.key = sbir_abstract.key
CREATE (award)-[:CONTAINS]->(sbir_abstract)
CREATE (principal)-[:RESEARCHED]->(sbir_abstract)
"""

# Execute the Cypher query
graph.run(cypher_query_sbir3)
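
In Cypher, `MERGE` with a full property map only matches an existing node when every listed property is equal, so a single differing value creates a new node. A common alternative is to merge on the identifying property alone and set the remaining properties afterwards. The helper below is a hedged sketch of how such a query string could be generated. `build_merge_query` is a hypothetical name, and because it interpolates labels and column names directly, it is only suitable for trusted, hard-coded inputs.

```python
def build_merge_query(csv_name, label, properties):
    """Build a LOAD CSV query that merges on `key` and then sets other properties.

    `properties` maps node property names to CSV column names.
    """
    set_lines = ",\n  ".join(
        f"n.{prop} = COALESCE(row.`{column}`, 'Unknown')"
        for prop, column in properties.items()
    )
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_name}' AS row\n"
        f"MERGE (n:{label} {{key: COALESCE(toInteger(row.key), 0)}})\n"
        f"SET {set_lines}"
    )


query = build_merge_query(
    "neo4j_patents_main.csv",
    "Title_PT",
    {"title": "title", "country": "country"},
)
print(query)
# The string could then be passed to graph.run(query) on a live connection.
```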

In [19]:
# Define the Cypher query to load data and define nodes and edges from Patents
cypher_query_patents1 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_main.csv' AS row

// Create Title nodes
MERGE (title_pt:Title_PT {
  key: COALESCE(toInteger(row.key), 0),
  doc_number: COALESCE(row.`doc-number`, 'Unknown'),
  title: COALESCE(row.title, 'Unknown'),
  country: COALESCE(row.country, 'Unknown')
})

"""
# Execute the Cypher query
graph.run(cypher_query_patents1)


# Define the Cypher query to load data from the second CSV file and create nodes
cypher_query_patents2 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_application_reference.csv' AS row

// Create application reference nodes
MERGE (application_ref_pt:Application_Ref_PT {
  key: COALESCE(toInteger(row.key), 0),
  doc_number: COALESCE(row.`doc-number`, 'Unknown'),
  country: COALESCE(row.country, 'Unknown'),
  date: COALESCE(row.date, 'Unknown')
})
"""

# Execute the Cypher query
graph.run(cypher_query_patents2)


# Define the Cypher query to load data from the assignee CSV file and create nodes
cypher_query_patents3 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_assignee.csv' AS row

// Create assignee nodes
MERGE (assignee_pt:Assignee_PT {
  key: COALESCE(toInteger(row.key), 0),
  orgname: COALESCE(row.orgname, 'Unknown'),
  city: COALESCE(row.city, 'Unknown'),
  state: COALESCE(row.state, 'Unknown'),
  country: COALESCE(row.country, 'Unknown')
});
"""

# Execute the Cypher query
graph.run(cypher_query_patents3)

# Define the Cypher query to load data from the inventors CSV file and create nodes
cypher_query_patents4 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_inventors.csv' AS row

// Create inventor nodes
MERGE (inventors_pt:Inventors_PT {
  key: COALESCE(toInteger(row.key), 0),
  first_name: COALESCE(row.`first-name`, 'Unknown'),
  last_name: COALESCE(row.`last-name`, 'Unknown'),
  city: COALESCE(row.city, 'Unknown'),
  state: COALESCE(row.state, 'Unknown'),
  country: COALESCE(row.country, 'Unknown') 
})

"""

# Execute the Cypher query
graph.run(cypher_query_patents4)


# Define the Cypher query to load data from the patent abstract CSV file and create nodes
cypher_query_patents5 = """
LOAD CSV WITH HEADERS FROM 'file:///neo4j_patents_abstract.csv' AS row

// Create Abstract nodes
MERGE (patent_abstract:Patent_Abstract {
  key: COALESCE(toInteger(row.key), 0),
  Abstract: COALESCE(row.abstract, 'Unknown')
})


// Create Entities nodes
MERGE (entities:Entities {
  entities: COALESCE(row.abstract_entities, 'Unknown')
})

// Create relationships
CREATE (patent_abstract)-[:HOLDS_PT]->(entities)
"""

# Execute the Cypher query
graph.run(cypher_query_patents5)

# Create relationships between the patent nodes that share the same key
# (title, assignee, inventors, application reference, and abstract)
cypher_query_patents6 = """
MATCH (title_pt:Title_PT), (assignee_pt:Assignee_PT), (inventors_pt:Inventors_PT), (applications_ref_pt:Application_Ref_PT), (patent_abstract:Patent_Abstract)
WHERE title_pt.key = assignee_pt.key AND assignee_pt.key = inventors_pt.key AND inventors_pt.key = applications_ref_pt.key AND applications_ref_pt.key = patent_abstract.key
CREATE (title_pt)-[:ASSIGN]->(assignee_pt)
CREATE (title_pt)-[:INVENTOR]->(inventors_pt)
CREATE (inventors_pt)-[:RESEARCHED]->(patent_abstract)
CREATE (applications_ref_pt)-[:CONTAINS]->(patent_abstract)
"""

# Execute the Cypher query
graph.run(cypher_query_patents6)
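
The relationship query above compares `key` across five node labels, and the earlier `MERGE` statements also look nodes up by their properties. Without an index, each of those lookups scans every node with that label. The sketch below builds one index statement per label. It assumes a Neo4j version that supports `CREATE INDEX ... IF NOT EXISTS`, the index names are made up for illustration, and `graph` is assumed to be the connection object used throughout this notebook. Indexes like these would normally be created before the load queries run.

```python
# Labels taken from the queries above; each node carries a `key` property.
PATENT_LABELS = [
    "Title_PT",
    "Assignee_PT",
    "Inventors_PT",
    "Application_Ref_PT",
    "Patent_Abstract",
]


def key_index_statements(labels, prop="key"):
    """Build one CREATE INDEX statement per label (index names are illustrative)."""
    statements = []
    for label in labels:
        index_name = f"{label.lower()}_{prop}_idx"
        statements.append(
            f"CREATE INDEX {index_name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"
        )
    return statements


index_statements = key_index_statements(PATENT_LABELS)
for statement in index_statements:
    print(statement)
    # graph.run(statement)  # uncomment to execute against the notebook's connection
```

A plain index is used rather than a uniqueness constraint because several nodes can share one key, for example multiple inventors on the same patent.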

In [20]:
# Create relationships based on similarity
cypher_query7 = """
LOAD CSV WITH HEADERS FROM 'file:///similarity.csv' AS row
MATCH (sbir_abstract:Sbir_Abstract {key: toInteger(row.sbir_id)})
MATCH (patent_abstract:Patent_Abstract {key: toInteger(row.patent_id)})
MERGE (sbir_abstract)-[:SIMILARITY_SCORE {score: COALESCE(toFloat(row.sim_score), 0.0)}]->(patent_abstract);
"""

# Execute the Cypher query
graph.run(cypher_query7)
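
The similarity load above fails quietly on bad input: a row whose `sbir_id` or `patent_id` is not an integer matches no node and is skipped, and an unparseable `sim_score` is stored as `0.0` by the `COALESCE`. A quick check of the CSV before loading can surface such rows. The sketch below is one way to do that with the standard library. The function name and the sample rows are hypothetical, and Python's `int()`/`float()` only approximate Cypher's `toInteger()`/`toFloat()`.

```python
import csv
import io

REQUIRED_COLUMNS = ("sbir_id", "patent_id", "sim_score")


def find_bad_rows(csv_file):
    """Return (line_number, reason) pairs for rows the load query would skip or default."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column in ("sbir_id", "patent_id"):
            try:
                int(row[column])
            except (TypeError, ValueError):
                problems.append((line_number, f"{column} is not an integer"))
        try:
            float(row["sim_score"])
        except (TypeError, ValueError):
            problems.append((line_number, "sim_score is not a number"))
    return problems


# Hypothetical sample standing in for similarity.csv
sample = io.StringIO("sbir_id,patent_id,sim_score\n1,10,0.91\n2,abc,0.40\n3,30,\n")
problems = find_bad_rows(sample)
print(problems)
```

For the real file, pass an open handle instead of the sample, for example `with open(path, newline="") as f: find_bad_rows(f)`, where `path` points to `similarity.csv` in the import folder.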

## 6 - Test Queries (Optional)

In [21]:
# Count all nodes
cypher_query_check = """
MATCH (n)
RETURN count(n)
"""
graph.run(cypher_query_check)

count(n)
26920


In [22]:
# Filter principals by location (city)
cypher_query_check = """
MATCH (principal:Principal)-[:WORKSIN]->(location:Location)
WHERE location.city = 'New York'
RETURN principal.pi_name;
"""
graph.run(cypher_query_check)

principal.pi_name
Andrew Chepaitis
Yan Ivnitskiy
Harper Langston


In [23]:
# List the label of each source node of a SIMILARITY_SCORE relationship
cypher_query_check = """
MATCH (a)-[:SIMILARITY_SCORE]->(b)
RETURN labels(a)[0] AS labelName;
"""

graph.run(cypher_query_check)

labelName
Sbir_Abstract
Sbir_Abstract
Sbir_Abstract


## 7 - Conduct Queries to answer questions posed for Final Project

In [24]:
# 1. Which states receive the most SBIR awards in a variety of technology areas? (example area: 'communication')
cypher_FP_query1 = """
MATCH (award:AWARD)-[:BELONGSTO]->(principal:Principal)-[:WORKSIN]->(location:Location)
MATCH (award:AWARD)-[:CONTAINS]->(abstract_entities)
WHERE abstract_entities.abstract_entities CONTAINS 'communication'
RETURN location.state, COUNT(DISTINCT award.key) AS sbir_awards
ORDER BY sbir_awards DESC LIMIT 10;
"""
graph.run(cypher_FP_query1)
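
The question covers a variety of technology areas, while the query above hard-codes `'communication'`. One way to reuse it is to pass the keyword as a Cypher parameter. The sketch below assumes `graph` is a py2neo `Graph`, whose `run` method accepts a `parameters` dictionary and returns a cursor with a `data()` method. The function name, the `state` alias, and the keyword list in the usage comment are illustrative.

```python
TOP_STATES_QUERY = """
MATCH (award:AWARD)-[:BELONGSTO]->(principal:Principal)-[:WORKSIN]->(location:Location)
MATCH (award)-[:CONTAINS]->(abstract_entities)
WHERE abstract_entities.abstract_entities CONTAINS $keyword
RETURN location.state AS state, COUNT(DISTINCT award.key) AS sbir_awards
ORDER BY sbir_awards DESC LIMIT $limit
"""


def top_states(graph, keyword, limit=10):
    """Run the query for one technology keyword and return the rows as a list of dicts."""
    cursor = graph.run(
        TOP_STATES_QUERY, parameters={"keyword": keyword, "limit": limit}
    )
    return cursor.data()


# Usage against the notebook's connection (keywords are examples only):
# for keyword in ["communication", "sensor", "software"]:
#     print(keyword, top_states(graph, keyword))
```

Parameters keep the query text constant, so the same string serves every keyword and no quoting of the keyword is needed.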

In [25]:
# Test: inspect the schema (node labels and relationship types) of the loaded graph
cypher_schema_query = """CALL db.schema.visualization()
"""
graph.run(cypher_schema_query)

nodes,relationships
"[(_-20:AWARD {constraints: [], indexes: [], name: 'AWARD'}), (_-17:Agency {constraints: [], indexes: [], name: 'Agency'}), (_-23:Application_Ref_PT {constraints: [], indexes: [], name: 'Application_Ref_PT'}), (_-19:Assignee_PT {constraints: [], indexes: [], name: 'Assignee_PT'}), (_-21:Inventors_PT {constraints: [], indexes: [], name: 'Inventors_PT'}), (_-25:Patent_Abstract {constraints: [], indexes: [], name: 'Patent_Abstract'}), (_-18:Principal {constraints: [], indexes: [], name: 'Principal'}), (_-24:Sbir_Abstract {constraints: [], indexes: [], name: 'Sbir_Abstract'}), (_-16:Location {constraints: [], indexes: [], name: 'Location'}), (_-22:Title_PT {constraints: [], indexes: [], name: 'Title_PT'}), (_-26:Entities {constraints: [], indexes: [], name: 'Entities'})]","[(_-22)-[:INVENTOR {name: 'INVENTOR'}]->(_-21), (_-18)-[:WORKSIN {name: 'WORKSIN'}]->(_-16), (_-24)-[:SIMILARITY_SCORE {name: 'SIMILARITY_SCORE'}]->(_-25), (_-20)-[:BELONGSTO {name: 'BELONGSTO'}]->(_-18), (_-17)-[:OVERSEES {name: 'OVERSEES'}]->(_-20), (_-25)-[:HOLDS_PT {name: 'HOLDS_PT'}]->(_-26), (_-22)-[:ASSIGN {name: 'ASSIGN'}]->(_-19), (_-20)-[:CONTAINS {name: 'CONTAINS'}]->(_-24), (_-23)-[:CONTAINS {name: 'CONTAINS'}]->(_-25), (_-23)-[:CONTAINS {name: 'CONTAINS'}]->(_-24), (_-20)-[:CONTAINS {name: 'CONTAINS'}]->(_-25), (_-21)-[:RESEARCHED {name: 'RESEARCHED'}]->(_-25), (_-18)-[:RESEARCHED {name: 'RESEARCHED'}]->(_-25), (_-18)-[:RESEARCHED {name: 'RESEARCHED'}]->(_-24), (_-21)-[:RESEARCHED {name: 'RESEARCHED'}]->(_-24), (_-24)-[:HOLDS {name: 'HOLDS'}]->(_-26)]"


In [26]:
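# Pairs with a similarity score above 0.8 (the schema above shows SIMILARITY_SCORE between Sbir_Abstract and Patent_Abstract; adjust the labels if no rows are returned)
cypher_FP_query2 = """
MATCH (ae1:Abstract_Entities)-[r:SIMILARITY_SCORE]->(ae2:Abstract_Entities_PT)
WHERE r.score > 0.8
RETURN ae1.abstract_entities, ae2.abstract_entities, r.score;
"""
graph.run(cypher_FP_query2)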

In [None]:
cypher_FP_query3 = """
MATCH (ae1:Patent_Abstract)-[r]->(ae2:Entities)<-[r1]-(ae3:Sbir_Abstract)
WHERE ae2.entities = "design"
RETURN ae1, ae2, ae3, r, r1 LIMIT 10;
"""
graph.run(cypher_FP_query3)