In [2]:
!pip install wikipedia



#### Limitations of vector RAG
1. **Themes and relationships** - Document embeddings capture semantic meaning but struggle to capture themes and relationships between entities across the document corpus.
2. **Scalability** - As the database grows, retrieval can become less efficient because the computational load increases with the size of the search space.
3. **Diverse data** - Structured and heterogeneous data are harder to embed.
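
To make the scalability point concrete, here is a minimal sketch of exhaustive (brute-force) vector retrieval. It is an illustration only: the three-dimensional vectors are made-up toy values, the helper names `cosine_similarity` and `brute_force_search` are hypothetical, and production vector stores typically use approximate nearest-neighbour indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, corpus_vecs, k=2):
    """Score every stored vector against the query and return the top-k indices.

    One similarity computation per stored vector, so the work grows
    linearly with the number of vectors in the store.
    """
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy "embeddings" (made-up values, not real model output)
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]

top = brute_force_search(query, corpus, k=2)
print(top)
```

Note that each result is an isolated chunk index: nothing in the ranking says how the retrieved chunks relate to one another, which is the gap described in the first limitation.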

In [4]:
import time
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import TokenTextSplitter

# Function to load Wikipedia data with retry mechanism
def load_wikipedia_data(query, retries=3, delay=5):
    for attempt in range(retries):
        try:
            loader = WikipediaLoader(query=query)
            raw_documents = loader.load()
            return raw_documents
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < retries - 1:
                time.sleep(delay)
            else:
                raise

# Load Wikipedia data with retry mechanism
query = "Large language model"
raw_documents = load_wikipedia_data(query)

# Split the first three documents into 100-token chunks with a 20-token overlap
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=20)
documents = text_splitter.split_documents(raw_documents[:3])

# Print the first chunk
print(documents[0])

page_content='A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.
The largest and most capable LLMs are artificial neural networks built with a decoder-only transformer-based architecture, enabling efficient processing and generation of large-scale text data. Modern models can be fine-tun' metadata={'title': 'Large language model', 'summary': 'A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.\nThe largest and most capable LLMs are artificial neural networks built with a
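
The effect of `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window. This is a simplified stand-in, not the library's code: `TokenTextSplitter` counts tokens produced by a tokenizer, whereas this sketch assumes whitespace-separated words as tokens, and `split_tokens` is a hypothetical helper name.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the window reached the end of the text
        start += step
    return chunks


# Assumption: whitespace words stand in for tokenizer tokens
tokens = "a large language model is a type of computational model".split()
chunks = split_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence cut at a chunk boundary still appears with some of its context in the next chunk.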

In [2]:
import time
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import TokenTextSplitter

# Load Wikipedia data directly (no retry mechanism)
query = "Large language model"
raw_documents = WikipediaLoader(query=query).load()

# Split the documents
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=20)
documents = text_splitter.split_documents(raw_documents[:3])

# Print the first document
print(documents[0])

page_content='A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.
The largest and most capable LLMs are artificial neural networks built with a decoder-only transformer-based architecture, enabling efficient processing and generation of large-scale text data. Modern models can be fine-tun' metadata={'title': 'Large language model', 'summary': 'A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.\nThe largest and most capable LLMs are artificial neural networks built with a

#### Document to graph

In [3]:
import os

from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

# Read the API key from the environment rather than hard-coding it
openai_api_key = os.getenv("OPENAI_API_KEY")

# temperature=0 keeps the extraction as repeatable as possible
llm = ChatOpenAI(api_key=openai_api_key, temperature=0, model_name="gpt-4o-mini")
llm_transformer = LLMGraphTransformer(llm=llm)

# Extract nodes and relationships from each chunk
graph_documents = llm_transformer.convert_to_graph_documents(documents)
print(graph_documents)

[GraphDocument(nodes=[Node(id='Large Language Model', type='Concept'), Node(id='Natural Language Processing', type='Concept'), Node(id='Language Generation', type='Concept'), Node(id='Artificial Neural Networks', type='Concept'), Node(id='Decoder-Only Transformer-Based Architecture', type='Concept'), Node(id='Text Data', type='Concept')], relationships=[Relationship(source=Node(id='Large Language Model', type='Concept'), target=Node(id='Natural Language Processing', type='Concept'), type='DESIGNED_FOR'), Relationship(source=Node(id='Large Language Model', type='Concept'), target=Node(id='Language Generation', type='Concept'), type='DESIGNED_FOR'), Relationship(source=Node(id='Large Language Model', type='Concept'), target=Node(id='Artificial Neural Networks', type='Concept'), type='IS_A'), Relationship(source=Node(id='Artificial Neural Networks', type='Concept'), target=Node(id='Decoder-Only Transformer-Based Architecture', type='Concept'), type='BUILT_WITH'), Relationship(source=Node(
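
Once nodes and relationships are extracted, multi-hop questions become graph walks. The sketch below rebuilds the first four relationships visible in the printed output and follows them two hops out from `Large Language Model`. The `Node` and `Relationship` namedtuples are hypothetical stand-ins that only mirror the fields shown above (they are not the LangChain classes), and `neighbors_within` is an illustrative helper.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-ins mirroring the fields visible in the printed output
Node = namedtuple("Node", ["id", "type"])
Relationship = namedtuple("Relationship", ["source", "target", "type"])

llm_node = Node("Large Language Model", "Concept")
nlp_node = Node("Natural Language Processing", "Concept")
gen_node = Node("Language Generation", "Concept")
ann_node = Node("Artificial Neural Networks", "Concept")
arch_node = Node("Decoder-Only Transformer-Based Architecture", "Concept")

# The first four relationships from the output above
relationships = [
    Relationship(llm_node, nlp_node, "DESIGNED_FOR"),
    Relationship(llm_node, gen_node, "DESIGNED_FOR"),
    Relationship(llm_node, ann_node, "IS_A"),
    Relationship(ann_node, arch_node, "BUILT_WITH"),
]

# Index outgoing edges by source node id
adjacency = defaultdict(list)
for rel in relationships:
    adjacency[rel.source.id].append((rel.type, rel.target.id))


def neighbors_within(start, hops):
    """Breadth-first walk returning (source, relation, target) triples within `hops` edges."""
    triples, frontier, seen = [], [start], {start}
    for _ in range(hops):
        next_frontier = []
        for node_id in frontier:
            for rel_type, target_id in adjacency.get(node_id, []):
                triples.append((node_id, rel_type, target_id))
                if target_id not in seen:
                    seen.add(target_id)
                    next_frontier.append(target_id)
        frontier = next_frontier
    return triples


triples = neighbors_within("Large Language Model", hops=2)
for triple in triples:
    print(triple)
```

The second hop surfaces the `BUILT_WITH` edge, a connection between entities that a similarity search over isolated chunks would not expose on its own.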