# **Converting the Structured NVIDIA Form 10K (FY 2024) into a Knowledge Graph**



*What is 10K form data?*

A 10K form is a comprehensive report that public companies file annually. It provides detailed information about their financial performance, business operations, and strategic direction, and it is organized into standard sections.

In this Colab file, we analyze **NVIDIA’s 10K form data** and transform it into a **knowledge graph** using the **Neo4j database**.  

### **Overview of NVIDIA’s 10K Form**  
NVIDIA’s **10K form** follows a structured format, comprising several key sections:  

- **Business Overview** – Outlines the company’s core operations, market segments, and strategic initiatives.  
- **Financial Data** – Contains detailed financial statements, including the balance sheet, income statement, and cash flow statement.  
- **Management Discussion and Analysis (MD&A)** – Provides insights into the company’s financial condition, operational performance, and future outlook.  
- **Risk Factors** – Identifies potential challenges and risks that may affect the company’s business and financial stability.  
- **Notes to Financial Statements** – Offers additional details and context related to the financial data presented.  

Given its well-structured nature, the **10K form** is highly suitable for conversion into a **knowledge graph**.

### **Vector Search vs. Knowledge Graphs**

**Limitations of Vector Search**

1. Context Limitation: Vector search focuses on similarity but often misses the nuanced context and relationships between data points.

  • Example: A query about NVIDIA’s financial health might retrieve related financial figures but miss contextual links, such as the strategic initiatives influencing those figures.

2. Semantic Gaps: Embeddings capture semantic meaning to an extent but can fail to distinguish between subtly different entities or concepts.

  • Example: Differentiating between NVIDIA’s “GeForce GPUs” for gaming and “A100 GPUs” for AI workloads might be challenging.

3. Static Representations: An embedding is fixed once it is computed, so it does not reflect new relationships until the affected text is re-embedded (or the embedding model itself is retrained).

  • Example: Changes in NVIDIA’s business strategy reflected in a new 10K form require reprocessing to update vector representations.

**Advantages of Knowledge Graphs**

1. Enhanced Context: Knowledge graphs inherently capture relationships and context between entities, providing richer and more accurate retrieval results.

  • Example: A query about financial health not only retrieves financial figures but also links to strategic initiatives, market conditions, and risk factors influencing those figures.

2. Disambiguation: By explicitly defining entities and their relationships, knowledge graphs reduce ambiguity and improve the precision of retrieved information.

  • Example: Clearly distinguishing between different GPU products and their respective market segments or use cases.

3. Dynamic and Scalable: Knowledge graphs can dynamically incorporate new data and relationships without requiring a complete overhaul of the existing structure.

  • Example: Integrating updates from the latest 10K form seamlessly into the existing graph structure.
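
The contrast above can be made concrete with a toy sketch. Everything in it is made up for illustration: the three-dimensional vectors stand in for real embeddings, and the `TARGETS_SEGMENT` relationship name is hypothetical rather than part of the graph built in this notebook. The point is only that two similar products can score almost identically under cosine similarity, while an explicit relationship keeps them apart.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (real embeddings have hundreds of dimensions).
toy_vectors = {
    "GeForce GPU": [0.9, 0.1, 0.3],
    "A100 GPU": [0.8, 0.2, 0.4],
    "Supply chain risk": [0.1, 0.9, 0.2],
}
query = [0.85, 0.15, 0.35]

# Vector search: rank entries by similarity to the query vector.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query, toy_vectors[name]),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {cosine_similarity(query, toy_vectors[name]):.3f}")

# Graph lookup: relationships are stored explicitly (hypothetical relationship name).
toy_graph = {
    ("GeForce GPU", "TARGETS_SEGMENT"): ["Gaming"],
    ("A100 GPU", "TARGETS_SEGMENT"): ["Data Center"],
}


def neighbors(node, relationship):
    """Return the nodes reachable from `node` over `relationship`."""
    return toy_graph.get((node, relationship), [])


print(neighbors("GeForce GPU", "TARGETS_SEGMENT"))
print(neighbors("A100 GPU", "TARGETS_SEGMENT"))
```

In this toy data both GPU entries score above 0.99 against the query, so similarity alone barely separates them; the graph lookup returns a different segment for each.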





We retrieved the data through SEC API calls and stored it in JSON format.









In [1]:
!pip install python-dotenv
import json



In [2]:
with open('nvda_10k_2023 (1).json', 'r') as f:
    filing_10k = json.load(f)

In [3]:
with open('nvda_10k_2023 (1).json', 'r') as f:
    content = f.read()
    print(content)  # Print the file content to check if it looks like valid JSON


{
    "name": "NVIDIA CORP",
    "ticker": "NVDA",
    "cik": "1045810",
    "cusip": "67066G104",
    "exchange": "NASDAQ",
    "isDelisted": false,
    "category": "Domestic Common Stock",
    "sector": "Technology",
    "industry": "Semiconductors",
    "sic": "3674",
    "sicSector": "Manufacturing",
    "sicIndustry": "Semiconductors & Related Devices",
    "famaSector": "",
    "famaIndustry": "Electronic Equipment",
    "currency": "USD",
    "location": "California; U.S.A",
    "id": "4a73b69083f93d38e05e0b76219875c9",
    "url": "https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm",
    "item1": " Item 1. Business \n\nOur Company \n\nNVIDIA pioneered accelerated computing to help solve the most challenging computational problems. NVIDIA is now a full-stack computing infrastructure company with data-center-scale offerings that are reshaping industry. \n\nOur full-stack includes the foundational CUDA programming model that runs on all NVIDIA GPUs

In [4]:
import os

file_path = "nvda_10k_2023 (1).json"
file_size = os.path.getsize(file_path)  # reuse file_path instead of a second hard-coded path

print(f"File size: {file_size} bytes")


In [5]:
print(f"File size: {file_size / 1024:.2f} KB")  # Convert to KB
print(f"File size: {file_size / (1024 * 1024):.2f} MB")  # Convert to MB


File size: 344760 bytes
File size: 336.68 KB
File size: 0.33 MB


**name**: The company name as recorded in the filing data (NVIDIA CORP).

**ticker**: The stock symbol used to trade on an exchange (NVDA for NVIDIA on NASDAQ).

**cik** (Central Index Key): A unique identifier assigned by the SEC (Securities and Exchange Commission) to track company filings (1045810 for NVIDIA).

**cusip** (Committee on Uniform Securities Identification Procedures): A 9-character code that uniquely identifies a security (67066G104 for NVIDIA stock).

**exchange**: The stock exchange where the company’s shares are listed (NASDAQ for NVIDIA).

**isDelisted**: Indicates whether the company's stock has been delisted (removed from trading). A value of false means NVIDIA is still actively traded.

**category**: Describes the type of stock; Domestic Common Stock means it is the common stock of a U.S. company.

**sector**: The broad industry category the company belongs to (Technology for NVIDIA).

**industry**: A more specific classification within the sector (Semiconductors, meaning NVIDIA produces chips and processors).

**sic** (Standard Industrial Classification): A 4-digit code categorizing businesses for U.S. government reporting (3674 for Semiconductors & Related Devices).

**sicSector**: The general industry sector based on SIC classification (Manufacturing for NVIDIA).

**sicIndustry**: The detailed industry category under SIC (Semiconductors & Related Devices).

**famaSector**: A sector classification based on the Fama-French industry taxonomy (empty in this case).

**famaIndustry**: A more granular industry classification under Fama-French, here Electronic Equipment.

**currency**: The reporting currency for the company's financials (USD, the U.S. dollar).

**location**: The headquarters or primary location of the company (California; U.S.A).


In [6]:
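# Sections to extract: item1 = Business, item1A = Risk Factors, item7 = MD&A,
# item7A = Quantitative and Qualitative Disclosures About Market Risk,
# item15 = Exhibits and Financial Statement Schedules
extract_section_list = ['item1', 'item1A', 'item7', 'item7A', 'item15']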
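
Before chunking, it can help to confirm that every requested section key is actually present in the loaded JSON, since a missing key would raise a `KeyError` later on. The helper below is a minimal sketch of such a check; `missing_sections` is a hypothetical name, and the sample dictionary is a made-up stand-in for the real filing.

```python
def missing_sections(filing, sections):
    """Return the section keys that are absent from, or empty in, the filing dict."""
    return [key for key in sections if not filing.get(key)]


# Made-up stand-in for the loaded filing; the real one comes from json.load(...).
sample_filing = {
    "item1": "Item 1. Business ...",
    "item1A": "",  # present but empty
    "item7": "Item 7. ...",
}
sections = ["item1", "item1A", "item7", "item7A", "item15"]

print(missing_sections(sample_filing, sections))
```

With the real data, the equivalent call would be `missing_sections(filing_10k, extract_section_list)`, and an empty list means all sections are available.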


In [7]:
!pip install langchain
!pip install langchain_community
!pip install transformers
!pip install langchain_huggingface


In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceHub
from dotenv import load_dotenv



In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap = 200,
    length_function = len,
    is_separator_regex = False
)
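
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators such as paragraph and line breaks and then merges pieces up to the size limit, so real chunk boundaries will differ); it only illustrates how a 200-character overlap makes consecutive 2000-character chunks share text. The function name is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where neighbors share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 5000-character text, used only for illustration.
text = "".join(str(i % 10) for i in range(5000))
chunks = sliding_window_chunks(text, chunk_size=2000, chunk_overlap=200)

print([len(chunk) for chunk in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # the tail of one chunk repeats at the head of the next
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.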

In [10]:
embeddings = HuggingFaceEmbeddings(
    model_name="mixedbread-ai/mxbai-embed-large-v1"
)



In [11]:
def split_form10k_data_from_file(file_as_object):
    """Split each selected 10K section into chunks and attach filing metadata and an embedding."""
    chunks_with_metadata = []  # accumulates one record per chunk
    form_id = file_as_object['id']  # filing identifier, taken from the JSON 'id' field
    for item in extract_section_list:  # pull these keys from the JSON
        print(f'Processing {item}')
        item_text = file_as_object[item]  # full text of the section
        item_text_chunks = text_splitter.split_text(item_text)  # split the text into chunks
        for chunk_seq_id, chunk in enumerate(item_text_chunks):
            # construct a record with metadata and the chunk text
            chunks_with_metadata.append({
                'text': chunk,
                # metadata from looping...
                'f10kItem': item,
                'chunkSeqId': chunk_seq_id,
                # constructed metadata...
                'formId': form_id,
                'chunkId': f'{form_id}-{item}-chunk{chunk_seq_id:04d}',
                # metadata from file...
                'name': file_as_object['name'],
                'cik': file_as_object['cik'],
                'cusip': file_as_object['cusip'],
                'source': file_as_object['url'],
                # one embedding call per chunk
                'textEmbedding': embeddings.embed_query(chunk)
            })
        print(f'\tSplit into {len(item_text_chunks)} chunks')
    return chunks_with_metadata
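
The `chunkId` built above follows the pattern `<formId>-<item>-chunk<NNNN>`, with the sequence number zero-padded to four digits. The sketch below isolates that format and shows one way to parse it back; both helper names are hypothetical, and the parser assumes that neither the item name nor the `chunkNNNN` suffix contains a hyphen.

```python
def make_chunk_id(form_id, item, seq):
    """Build an id such as '<form_id>-item1-chunk0000' (sequence zero-padded to 4 digits)."""
    return f"{form_id}-{item}-chunk{seq:04d}"


def parse_chunk_id(chunk_id):
    """Split a chunk id back into (form_id, item, sequence number)."""
    form_id, item, chunk_part = chunk_id.rsplit("-", 2)
    return form_id, item, int(chunk_part[len("chunk"):])


example_id = make_chunk_id("4a73b69083f93d38e05e0b76219875c9", "item1A", 7)
print(example_id)
print(parse_chunk_id(example_id))
```

Zero-padding keeps the ids in the right order when they are sorted as strings, at least up to 9999 chunks per section.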


In [12]:
filing_10k_chunks = split_form10k_data_from_file(filing_10k)

# Save the embeddings to a JSON file
#with open('filing_10k_embeddings.json', 'w') as f:
    #json.dump(filing_10k_chunks, f)

print(f'Total {len(filing_10k_chunks)} chunks')




Processing item1
	Split into 32 chunks
Processing item1A
	Split into 75 chunks
Processing item7
	Split into 25 chunks
Processing item7A
	Split into 2 chunks
Processing item15
	Split into 72 chunks
Total 206 chunks
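
As a quick sanity check, the per-section counts printed above should add up to the reported total. The sketch below tallies chunks by their `f10kItem` field; `chunks_per_section` is a hypothetical helper, and the stand-in records are generated from the counts in the output above rather than from the real `filing_10k_chunks` list.

```python
from collections import Counter


def chunks_per_section(chunks):
    """Count chunk records per 10K section using their 'f10kItem' field."""
    return Counter(chunk["f10kItem"] for chunk in chunks)


# Stand-in records that mirror the counts reported in the output above.
reported = {"item1": 32, "item1A": 75, "item7": 25, "item7A": 2, "item15": 72}
stand_in_chunks = [{"f10kItem": item} for item, n in reported.items() for _ in range(n)]

counts = chunks_per_section(stand_in_chunks)
print(dict(counts))
print(sum(counts.values()))  # 32 + 75 + 25 + 2 + 72 = 206
```

On the real data, `chunks_per_section(filing_10k_chunks)` gives the same breakdown without relying on the printed log.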


In [13]:
!pip install neo4j



In [14]:
NEO4J_URI="neo4j+s://55206846.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
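NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "<your-neo4j-password>")  # credential redacted: supply it via the environment or a .env file instead of hard-coding it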

In [15]:
import os
os.environ["NEO4J_URI"] = NEO4J_URI
os.environ["NEO4J_USERNAME"] = NEO4J_USERNAME
os.environ["NEO4J_PASSWORD"] = NEO4J_PASSWORD

In [16]:
from langchain_community.graphs import Neo4jGraph
kg = Neo4jGraph(
    url= NEO4J_URI,
    username= NEO4J_USERNAME,
    password= NEO4J_PASSWORD,
)



In [17]:
index_name = "unique_chunk"

kg.query(f"""
CREATE CONSTRAINT {index_name} IF NOT EXISTS
    FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
""")


[]

In [18]:
# ✅ Cypher Query to Merge Chunks
# MERGE matches on chunkId; the other properties are set only when the node is first created.
merge_chunk_node_query = """
MERGE (mergedChunk:Chunk {chunkId: COALESCE($chunkParam.chunkId, "UNKNOWN")})
ON CREATE SET
    mergedChunk.name = $chunkParam.name,
    mergedChunk.formId = $chunkParam.formId,
    mergedChunk.cik = $chunkParam.cik,
    mergedChunk.cusip = $chunkParam.cusip,
    mergedChunk.source = $chunkParam.source,
    mergedChunk.f10kItem = $chunkParam.f10kItem,
    mergedChunk.chunkSeqId = $chunkParam.chunkSeqId,
    mergedChunk.text = $chunkParam.text

WITH mergedChunk, COALESCE($chunkParam.textEmbedding, [0.0]) AS vector
CALL db.create.setNodeVectorProperty(mergedChunk, "textEmbedding", vector)
RETURN mergedChunk
"""

In [19]:
# ✅ Upload Chunks to Neo4j
node_count = 0
for chunk in filing_10k_chunks:
    chunk_id = chunk.get("chunkId")
    text_embedding = chunk.get("textEmbedding")

    print(f"Uploading Chunk: {chunk_id}, Embedding Type: {type(text_embedding)}")

    if not chunk_id or chunk_id in [None, "null", "NaN"]:
        print(f"⚠️ Skipping invalid chunk: {chunk}")
        continue  # Skip invalid chunks

    if not isinstance(text_embedding, list):
        print(f"⚠️ Warning: textEmbedding is not a list for chunk {chunk_id}")
        continue  # Skip if embedding is not a list

    kg.query(merge_chunk_node_query, params={"chunkParam": chunk})
    node_count += 1

print(f"✅ Successfully created {node_count} nodes in Neo4j!")


Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0000, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0001, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0002, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0003, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0004, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0005, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0006, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0007, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0008, Embedding Type: <class 'list'>
Uploading Chunk: 4a73b69083f93d38e05e0b76219875c9-item1-chunk0009, Embedding Type: <class 'list'>
Uploading Chunk: 4a7
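
Because the upload uses `MERGE` with `ON CREATE SET`, running the loop a second time should not duplicate nodes or overwrite their text; only the embedding is written on every run, since the `setNodeVectorProperty` call sits outside the `ON CREATE` clause. The following is a plain-Python simulation of that behavior under those assumptions, with a dictionary standing in for the database. It is an illustration of the semantics, not Neo4j code, and the ids are made up.

```python
def merge_chunk(store, chunk):
    """Simulate MERGE ... ON CREATE SET ... followed by an unconditional embedding write."""
    chunk_id = chunk.get("chunkId")
    if chunk_id is None:  # mirrors COALESCE($chunkParam.chunkId, "UNKNOWN")
        chunk_id = "UNKNOWN"
    node = store.get(chunk_id)
    if node is None:
        # ON CREATE: copy the scalar properties only when the node is new.
        node = {key: value for key, value in chunk.items() if key != "textEmbedding"}
        node["chunkId"] = chunk_id
        store[chunk_id] = node
    # The vector property is set on every call, matched or created.
    embedding = chunk.get("textEmbedding")
    node["textEmbedding"] = embedding if embedding is not None else [0.0]
    return node


store = {}
merge_chunk(store, {"chunkId": "form-item1-chunk0000", "text": "first upload", "textEmbedding": [0.1, 0.2]})
merge_chunk(store, {"chunkId": "form-item1-chunk0000", "text": "second upload", "textEmbedding": [0.3, 0.4]})

print(len(store))                                      # 1
print(store["form-item1-chunk0000"]["text"])           # first upload
print(store["form-item1-chunk0000"]["textEmbedding"])  # [0.3, 0.4]
```

One consequence worth noting: if a chunk's text changes, re-running the upload would refresh the embedding but leave the stored `text` untouched, so the two could drift apart unless the node is deleted first or the query also uses `ON MATCH SET`.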

In [20]:
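# The index must target the `Chunk` label used by the MERGE query; a label of `chunkId`
# matches no nodes, so nothing would be indexed. The SHOW INDEXES output recorded in the
# next cell comes from a run that used `chunkId`; drop that index first
# (DROP INDEX chunk_text_embedding_index), because IF NOT EXISTS would otherwise keep it.
kg.query("""
    CREATE VECTOR INDEX chunk_text_embedding_index IF NOT EXISTS
    FOR (c:Chunk) ON (c.textEmbedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 1024,
            `vector.similarity_function`: 'cosine'
        }
    }
""")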


[]
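
The `vector.dimensions` setting has to agree with the length of the stored embeddings. A quick local check before relying on the index can catch mismatches, including the one-element `[0.0]` fallback that the merge query writes when an embedding is missing. The sketch below assumes chunk records shaped like the ones built earlier; the helper names and the toy records are hypothetical.

```python
from collections import Counter

INDEX_DIMENSIONS = 1024  # the value declared in the index definition above


def dimension_report(chunks):
    """Count how many embeddings exist for each vector length."""
    return Counter(len(chunk["textEmbedding"]) for chunk in chunks)


def mismatched_chunk_ids(chunks, expected=INDEX_DIMENSIONS):
    """Return the ids of chunks whose embedding length differs from the index dimension."""
    return [chunk["chunkId"] for chunk in chunks if len(chunk["textEmbedding"]) != expected]


# Toy records for illustration only.
toy_chunks = [
    {"chunkId": "toy-chunk0000", "textEmbedding": [0.0] * 1024},
    {"chunkId": "toy-chunk0001", "textEmbedding": [0.0]},  # e.g. the [0.0] fallback
]

print(dict(dimension_report(toy_chunks)))
print(mismatched_chunk_ids(toy_chunks))
```

Running `mismatched_chunk_ids(filing_10k_chunks)` on the real data should return an empty list if every chunk was embedded with a model that produces 1024-dimensional vectors.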

In [21]:
kg.query("""
SHOW INDEXES yield *
where name CONTAINS 'chunk_text_embedding_index'
return name, createStatement
""" )

[{'name': 'chunk_text_embedding_index',
  'createStatement': "CREATE VECTOR INDEX `chunk_text_embedding_index` FOR (n:`chunkId`) ON (n.`textEmbedding`) OPTIONS {indexConfig: {`vector.dimensions`: 1024,`vector.hnsw.ef_construction`: 100,`vector.hnsw.m`: 16,`vector.quantization.enabled`: true,`vector.similarity_function`: 'COSINE'}}"}]

In [22]:
cypher = """
 MATCH (anyChunk:Chunk)
 WITH anyChunk LIMIT 1
 RETURN anyChunk {.name,.source, .formId, .cik,.cusip} AS formInfo
"""
form_info = kg.query(cypher)

form_info

[{'formInfo': {'cik': '1045810',
   'source': 'https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm',
   'formId': '4a73b69083f93d38e05e0b76219875c9',
   'name': 'NVIDIA CORP',
   'cusip': '67066G104'}}]

In [23]:
# Extract the `formId` from the `form_info` result
form_id_param = form_info[0]['formInfo']['formId']

# Execute the query with the corrected parameters
cypher = """
MATCH (from_same_form:Chunk)
  WHERE from_same_form.formId = $formIdParam
RETURN from_same_form {.formId, .f10kItem, .chunkId, .chunkSeqId} AS chunkInfo
"""

kg.query(cypher, params={'formIdParam': form_id_param})


[{'chunkInfo': {'formId': '4a73b69083f93d38e05e0b76219875c9',
   'f10kItem': 'item1',
   'chunkId': '4a73b69083f93d38e05e0b76219875c9-item1-chunk0000',
   'chunkSeqId': 0}},
 {'chunkInfo': {'formId': '4a73b69083f93d38e05e0b76219875c9',
   'f10kItem': 'item1',
   'chunkId': '4a73b69083f93d38e05e0b76219875c9-item1-chunk0001',
   'chunkSeqId': 1}},
 {'chunkInfo': {'formId': '4a73b69083f93d38e05e0b76219875c9',
   'f10kItem': 'item1',
   'chunkId': '4a73b69083f93d38e05e0b76219875c9-item1-chunk0002',
   'chunkSeqId': 2}},
 {'chunkInfo': {'formId': '4a73b69083f93d38e05e0b76219875c9',
   'f10kItem': 'item1',
   'chunkId': '4a73b69083f93d38e05e0b76219875c9-item1-chunk0003',
   'chunkSeqId': 3}},
 {'chunkInfo': {'formId': '4a73b69083f93d38e05e0b76219875c9',
   'f10kItem': 'item1',
   'chunkId': '4a73b69083f93d38e05e0b76219875c9-item1-chunk0004',
   'chunkSeqId': 4}},
 {'chunkInfo': {'formId': '4a73b69083f93d38e05e0b76219875c9',
   'f10kItem': 'item1',
   'chunkId': '4a73b69083f93d38e05e0b7621987

In [24]:
cypher = """
MATCH (from_same_section:Chunk)
WHERE from_same_section.formId = $formIdParam
  AND from_same_section.f10kItem = $f10kItemParam
WITH from_same_section
  ORDER BY from_same_section.chunkSeqId ASC
WITH collect(from_same_section) as section_chunk_list
CALL apoc.nodes.link(
  section_chunk_list,
  "NEXT",
  {avoidDuplicates: true}
)
RETURN size(section_chunk_list)
"""

for section in extract_section_list:
    kg.query(cypher, params={
        'formIdParam': form_info[0]['formInfo']['formId'],  # Access the first element and then the formId
        'f10kItemParam': section
    })
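
What the query does, conceptually, is sort the chunks of one section by `chunkSeqId` and connect each chunk to its successor with a `NEXT` relationship. The plain-Python sketch below mimics that pairing on made-up records; it is an illustration of the idea, not a reimplementation of `apoc.nodes.link`, and `link_in_sequence` is a hypothetical name.

```python
def link_in_sequence(chunks):
    """Return (from_id, 'NEXT', to_id) triples linking chunks in chunkSeqId order."""
    ordered = sorted(chunks, key=lambda chunk: chunk["chunkSeqId"])
    return [
        (current["chunkId"], "NEXT", following["chunkId"])
        for current, following in zip(ordered, ordered[1:])
    ]


# Made-up records, deliberately out of order.
toy_section = [
    {"chunkId": "form-item7A-chunk0001", "chunkSeqId": 1},
    {"chunkId": "form-item7A-chunk0000", "chunkSeqId": 0},
    {"chunkId": "form-item7A-chunk0002", "chunkSeqId": 2},
]

for triple in link_in_sequence(toy_section):
    print(triple)
```

A section with n chunks yields n - 1 links, so the five sections above (32, 75, 25, 2, and 72 chunks) should produce 31 + 74 + 24 + 1 + 71 = 201 `NEXT` relationships in total.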



In [25]:
cypher = """
  MATCH (first:Chunk), (f:Form)
  WHERE first.formId = f.formId
    AND first.chunkSeqId = 0
  WITH first, f
    MERGE (f)-[r:SECTION {f10kItem: first.f10kItem}]->(first)
  RETURN count(r)
"""
kg.query(cypher)

[{'count(r)': 0}]
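
The count of 0 is expected at this point: the query matches `(f:Form)` nodes, but the cells above only create `Chunk` nodes, so there is nothing to connect. Below is a minimal sketch of one way to create the missing node, assuming a `Form` label that carries the same metadata as the chunks. The query text, the `formInfoParam` parameter name, and the `form_params` helper are illustrative assumptions, not part of the original notebook; the sample values are copied from the `form_info` output shown earlier.

```python
# Assumed Cypher for creating a single Form node from the chunk metadata.
merge_form_node_query = """
MERGE (f:Form {formId: $formInfoParam.formId})
  ON CREATE SET
    f.name = $formInfoParam.name,
    f.source = $formInfoParam.source,
    f.cik = $formInfoParam.cik,
    f.cusip = $formInfoParam.cusip
"""


def form_params(form_info_rows):
    """Build the query parameters from the rows returned by the formInfo query."""
    return {"formInfoParam": form_info_rows[0]["formInfo"]}


# Same shape as the `form_info` result printed earlier in the notebook.
form_info = [{
    "formInfo": {
        "cik": "1045810",
        "source": "https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm",
        "formId": "4a73b69083f93d38e05e0b76219875c9",
        "name": "NVIDIA CORP",
        "cusip": "67066G104",
    }
}]

params = form_params(form_info)
print(params["formInfoParam"]["formId"])

# Against the live database (requires the `kg` connection from the cells above):
# kg.query(merge_form_node_query, params=params)
```

After the `Form` node exists, re-running the `SECTION` query above should return one relationship per section whose first chunk (`chunkSeqId = 0`) is in the graph.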

In [26]:
kg.refresh_schema()
print(kg.schema)

Node properties:
Chunk {chunkId: STRING, name: STRING, formId: STRING, cik: STRING, cusip: STRING, source: STRING, f10kItem: STRING, chunkSeqId: INTEGER, text: STRING, textEmbedding: LIST}
Relationship properties:

The relationships:
(:Chunk)-[:NEXT]->(:Chunk)
