<a href="https://colab.research.google.com/github/rjhalliday/python-llm/blob/main/langchain_wikipedia_llm_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple RAG using Wikipedia data retrieval

Retrieval-Augmented Generation (RAG) is an architectural approach that can improve the effectiveness of large language model (LLM) applications by supplying them with custom data. LLMs are trained on publicly available data, but much data, such as commercial data, is not publicly available on the internet.

A common use case for RAG is an organisation that wants to use its internal proprietary documentation, databases and other information repositories to build an LLM application that can answer questions specific to that organisation. For example, an aircraft manufacturer might want an internal chatbot that employees, or perhaps customers, could query for information about its aircraft.

In this example I retrieve data from Wikipedia to build a RAG pipeline. This is a more advanced version of my original [single document RAG, which you can find here](https://github.com/rjhalliday/python-llm/blob/main/langchain_simple_rag_with_gemini.ipynb).

The steps are:
1. Search Wikipedia using the query "Artificial Intelligence". You can see a simple example of [Wikipedia retrieval in Python here](https://github.com/rjhalliday/python-examples/blob/main/python_wikipedia_api_data_retrieval.ipynb).
2. The results of this search are split into chunks and used to create a vector store.
3. The RAG pipeline is then queried with the question "What is Artificial Intelligence?"
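
To make the three steps concrete before any external services are involved, here is a minimal, self-contained sketch of the same idea. It is not the notebook's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector store, and the helper names (`embed`, `retrieve`, `build_prompt`) are hypothetical. The final prompt is what would be handed to an LLM.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text):
    """Toy 'embedding': a sparse word-count vector (stand-in for a real model)."""
    return Counter(tokenize(text))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(question, store, k=2):
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Step 1 (stand-in): text chunks that would normally come from Wikipedia
chunks = [
    "Artificial intelligence is intelligence exhibited by machines.",
    "The Eiffel Tower is a landmark in Paris.",
    "Machine learning is a subfield of artificial intelligence.",
]

# Step 2: build the "vector store" as (chunk, vector) pairs
store = [(chunk, embed(chunk)) for chunk in chunks]

# Step 3: retrieve the most relevant chunks and build the prompt for the LLM
question = "What is artificial intelligence?"
top_chunks = retrieve(question, store, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The unrelated Eiffel Tower chunk scores lowest and is left out of the prompt, which is the point of the retrieval step: only relevant context reaches the model.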



In [1]:
# Wikipedia has its own markup (wikitext), which we need to strip out
!pip -q install mwparserfromhell


In [None]:
!pip install -qU langchain-google-genai mwparserfromhell


In [2]:
!pip install -qU langchain-community


In [5]:
!pip install -qU chromadb



In [None]:
import os

from google.colab import userdata

# Read the Gemini API key from Colab's secret store and expose it to the
# Google client libraries through the environment.
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY





In [None]:
import requests
import mwparserfromhell
from langchain.chains import RetrievalQA
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

# Initialize the chat model using ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

# Initialize embeddings.
# GooglePalmEmbeddings is deprecated, so GoogleGenerativeAIEmbeddings is used instead.
# The model name below is an assumption; change it if it is not available to your API key.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

def search_wikipedia(query):
    """Return the titles of Wikipedia pages that match the search query."""
    search_url = "https://en.wikipedia.org/w/api.php"
    # Pass the parameters separately so that requests URL-encodes the query
    params = {"action": "query", "list": "search", "srsearch": query, "format": "json"}
    try:
        response = requests.get(search_url, params=params, timeout=10)
        response.raise_for_status()  # Raise an error for HTTP issues
        data = response.json()
        page_titles = [result['title'] for result in data['query']['search']]
        return page_titles
    except requests.RequestException as e:
        print(f"Error during Wikipedia search: {e}")
        return []

def fetch_wikipedia_content(title):
    """Return the raw wikitext of the latest revision of a page, or "" on failure."""
    content_url = "https://en.wikipedia.org/w/api.php"
    # Pass the parameters separately so that titles containing spaces or "&" are URL-encoded
    params = {"action": "query", "prop": "revisions", "titles": title, "rvprop": "content", "format": "json"}
    try:
        response = requests.get(content_url, params=params, timeout=10)
        response.raise_for_status()  # Raise an error for HTTP issues
        data = response.json()
        pages = data['query']['pages']
        page = next(iter(pages.values()))
        if 'revisions' in page:
            content = page['revisions'][0]['*']
            return content
        else:
            print(f"No content found for title: {title}")
            return ""
    except requests.RequestException as e:
        print(f"Error fetching content for {title}: {e}")
        return ""

def clean_wikipedia_content(content):
    """Strip wikitext markup (templates, links, formatting) and return plain text."""
    wikicode = mwparserfromhell.parse(content)
    text = wikicode.strip_code()
    return text

def create_documents(query):
    """Search Wikipedia and return one Document per page that has content."""
    titles = search_wikipedia(query)
    documents = []
    for title in titles:
        print(f"Fetching content for title: {title}")
        content = fetch_wikipedia_content(title)
        if not content:
            continue  # Skip pages that returned no content
        cleaned_content = clean_wikipedia_content(content)
        # Create Document objects from the cleaned content
        documents.append(Document(page_content=cleaned_content, metadata={"title": title}))
    return documents

# Create documents based on a query
query = "Artificial Intelligence"
documents = create_documents(query)

# Text splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Creating a Vector Store
db = Chroma.from_documents(texts, embeddings)

# Retriever setup
retriever = db.as_retriever()

# QA Chain Setup
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

# Execute a query (invoke replaces the deprecated direct call on the chain)
query = "What is Artificial Intelligence?"
result = qa_chain.invoke({"query": query})
print(result["result"])


Fetching content for title: Artificial intelligence
Fetching content for title: Generative artificial intelligence
Fetching content for title: A.I. Artificial Intelligence
Fetching content for title: Artificial general intelligence
Fetching content for title: Applications of artificial intelligence
Fetching content for title: Ethics of artificial intelligence
Fetching content for title: History of artificial intelligence
Fetching content for title: Artificial intelligence in healthcare
Fetching content for title: Hallucination (artificial intelligence)
Fetching content for title: Timeline of artificial intelligence



Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called AIs. 
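
The notebook splits each page into 500-character chunks with `chunk_overlap=0`. As a rough illustration of what those two parameters control, the sketch below uses a plain fixed-size sliding window. This is a simplification and not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraphs and sentences), and `split_with_overlap` is a hypothetical helper written for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Split text into fixed-size chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij" * 3  # 30 characters of sample text

# No overlap: chunks tile the text exactly, as with chunk_overlap=0 in the notebook
no_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=0)

# With overlap: each chunk repeats the last 4 characters of the previous one
with_overlap = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)

print(len(no_overlap), no_overlap)
print(len(with_overlap), with_overlap)
```

With no overlap, a sentence that straddles a chunk boundary is cut in two and neither half may match the question well. A small overlap produces more chunks, and so more embedding calls, but keeps some context on both sides of each boundary, which may be worth trying if the retrieved passages look truncated.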

