# Project Goal
#### To implement a Retrieval-Augmented Generation (RAG) system using FAISS for document retrieval and OpenAI's GPT models for generating responses, focusing on AlUla's cultural and architectural features based on provided text data.



## Project Workflow


### 1. Importing Libraries
Import the necessary libraries for text processing, embeddings, FAISS, and OpenAI.

In [1]:
import ipywidgets as widgets
from IPython.display import display, Markdown
from dotenv import load_dotenv
from openai import OpenAI
import os
import PyPDF2
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

load_dotenv()

True

### 2. OpenAI Initialization
Initialize the OpenAI client and define the model to use.

In [2]:
model_gpt = "gpt-4o-mini"

client = OpenAI(
    # This is the default and can be omitted
    api_key=os.getenv("OPENAI_API_KEY"),
)


In [6]:
def zero_shot_prompt(question):
    """Ask the model directly, without any retrieved context (baseline)."""
    response = client.chat.completions.create(
        model=model_gpt,
        messages=[
            {'role': 'user', 'content': question}
        ]
    )
    return response.choices[0].message.content


question = 'What are the architectural features of the Maraya building?'

In [7]:
zero_shot_prompt(question)

"Maraya is a notable architectural landmark located in AlUla, Saudi Arabia. Here are some of its distinguishing architectural features:\n\n1. **Reflective Facade**: Maraya is renowned for its highly reflective glass facade that mirrors the surrounding desert landscape, creating an intriguing visual effect and blending the structure with its environment.\n\n2. **Sculptural Design**: The building is designed in a way that appears to emerge from the ground, with a sculptural form that captures the essence of the natural landscape around it.\n\n3. **Natural Light Integration**: The architecture emphasizes natural light, utilizing large windows and openings that allow for ample daylight to penetrate the interior spaces.\n\n4. **Sustainability**: Incorporating sustainable design principles, Maraya aims to minimize its environmental impact through energy-efficient systems.\n\n5. **Cultural Influence**: The design reflects a blend of modern aesthetics with local cultural elements, resonating w

### 3. Preprocess Text:

Load the PDF travel guides and extract their text, keeping one chunk per page.

In [2]:
import os
from PyPDF2 import PdfReader

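def extract_text_from_pdfs(pdf_paths):
    """Return the text of every page in the given PDFs, one string per page."""
    text_chunks = []
    for pdf_path in pdf_paths:
        with open(pdf_path, 'rb') as file:
            reader = PdfReader(file)
            for page in reader.pages:
                text = page.extract_text()
                # Skip pages with no extractable text (e.g. image-only pages)
                if text and text.strip():
                    text_chunks.append(text)
    return text_chunks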

# Example usage
pdf_files = [
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/Abha-Explore_Aseer_s_flavours_and_history_in_Abha_07f5641003.pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/al-ahsa-guidebook.pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/ALULA_OVG_EN_260122.pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/aseer-city-guide-en.pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/REDSEA_Factsheet_2024.pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/jeddah-guidebook.pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/riyadh-guidebook (2).pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/saudi-series-travel-english.pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/TCWKeepDreaming_047.pdf",
    "/Users/fatimaessa/Downloads/RAG Project/Saudi Travel Guides/Tourist-guide-For-Al-Bahah-region-V00.pdf"

]

# Extract text from all PDFs
chunks = extract_text_from_pdfs(pdf_files)

# Print the first 10 chunks and the total number of chunks
print(chunks[:10], len(chunks))


["Explore Aseer's flavours & history in Abha\ndiscoversaudi.saAbha | Group 9 hours\n", "Embark on an exciting adventure and explore unique flavours in Abha. Start your tour on a sweet note by gaining some insight into the production process and different types of honey, which is an integral part of Aseer's culture. The honey farm features a small museum that allows guests to take a closer look at the tools, equipment and beekeeper's outfit that are used for honey extraction from the hives. Gain more insight into the life cycle of honey bees and check out the apiary, also called a 'Bee yard', where the insects are kept. At this farm, guests can also sample over 15 different types of honey and learn about their distinguishing characteristics, along with Saudi coffee and treats. Learn about Rijal Almaa's history, marvel at the stunning panoramic views and the traditional decor of the farm and take lots of photos to help you remember this fantastic place forever. Enjoy a refreshing cup of 

In [None]:
# import PyPDF2
# def extract_text_from_pdf(pdf_path) -> list:
#     with open(pdf_path, 'rb') as file:
#         reader = PyPDF2.PdfReader(file)
#         text = []
#         for page in reader.pages:
#             text.append(page.extract_text())
#     return text


In [20]:
# # Example usage
# chunks = extract_text_from_pdf("/Users/taifabdullah/Desktop/Alula/TCWKeepDreaming_047.pdf")
# print(chunks[:1], len(chunks))  # Print the first page's text and the number of pages
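
The extraction step above keeps each PDF page as one chunk and leaves the raw line breaks in place. As a rough sketch of what a cleaning and splitting step could look like, the helpers below collapse whitespace and cut text into overlapping character windows. `clean_text`, `split_into_chunks`, and the default sizes are hypothetical names and values chosen for illustration; they are not part of this notebook's pipeline.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so text cut at a
    boundary is repeated at the start of the next chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Sample taken from one of the retrieved pages shown later in this notebook
raw = "Al-Bustan Restaurant is one of the best restaurants offering \nLebanese cuisine in Al-Baha."
cleaned = clean_text(raw)
print(cleaned)
print(split_into_chunks(cleaned, chunk_size=40, overlap=10))
```

Character windows are the simplest option; splitting on sentence or paragraph boundaries would likely give cleaner chunks but needs more logic.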

### 4. RAG (Embeddings, Retrieval and Generation)
Embed the preprocessed text chunks and store the embeddings in FAISS, then retrieve the chunks closest to a query and pass them to the model as context.

In [3]:
import faiss
import numpy as np
from openai import OpenAI
import pickle


class SimpleRAG:
    def __init__(self, max_tokens=1000):
        self.client = OpenAI()
        self.index = faiss.IndexFlatL2(1536)  # OpenAI embedding dimension
        self.texts = []
        self.max_tokens = max_tokens

    def add_documents(self, documents):
        """Add documents to the vector store"""
        for doc in documents:
            embedding = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=doc
            ).data[0].embedding
            # FAISS expects float32 vectors
            self.index.add(np.array([embedding], dtype="float32"))
            self.texts.append(doc)

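        # Persist the index and texts so they can be reloaded without re-embedding
        with open('vectors.pkl', 'wb') as f:
            pickle.dump(self.index, f)
        with open('texts.pkl', 'wb') as f:
            pickle.dump(self.texts, f)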

    def load_documents(self):
        """Reload the index and texts saved by add_documents"""
        with open('vectors.pkl', 'rb') as f:
            self.index = pickle.load(f)
        with open('texts.pkl', 'rb') as f:
            self.texts = pickle.load(f)

    def retrieve(self, query, k=3):
        """Retrieve k most relevant documents"""
        query_embedding = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=query
        ).data[0].embedding
        
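        # D holds the distances, I the row indices of the nearest vectors
        D, I = self.index.search(np.array([query_embedding], dtype="float32"), k=k)
        # FAISS pads the result with -1 when the index holds fewer than k vectors
        return [self.texts[i] for i in I[0] if i != -1]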

    def generate_prompt(self, query, relevant_docs):
        """Create prompt for the LLM"""
        context = "\n".join(relevant_docs)
        prompt = f"""Use the following pieces of context to answer the question. 
        If you cannot find the answer in the context, say "I don't have enough information to answer this question."

        Context:
        {context}

        Question: {query}
        
        Answer:"""
        return prompt

    def query(self, question):
        """Full RAG pipeline"""
        # 1. Retrieve relevant documents
        relevant_docs = self.retrieve(question)
        
        # 2. Generate prompt with context
        prompt = self.generate_prompt(question, relevant_docs)
        
        # 3. Get answer from LLM
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=self.max_tokens
        )
        
        return response.choices[0].message.content
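
`generate_prompt` joins every retrieved chunk into the context, and a chunk here is a whole PDF page, so the prompt can grow large. One possible way to bound it is sketched below: add chunks in rank order until a character budget is reached. `build_context`, `max_chars`, and the separator are hypothetical, and a character count is only a rough stand-in for the model's token limit.

```python
def build_context(docs, max_chars=6000, separator="\n---\n"):
    """Join retrieved docs in rank order until the character budget is used up."""
    selected = []
    used = 0
    for doc in docs:
        # The separator only costs characters between two docs
        cost = len(doc) + (len(separator) if selected else 0)
        if used + cost > max_chars:
            break
        selected.append(doc)
        used += cost
    return separator.join(selected)


# Toy documents: the third one no longer fits into a budget of 10 characters
print(build_context(["aaaa", "bbbb", "cccc"], max_chars=10, separator="|"))  # aaaa|bbbb
```

If the top-ranked chunk alone exceeds the budget, this sketch returns an empty context; truncating that chunk instead would be a reasonable alternative.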



In [4]:
simple_rag = SimpleRAG()
simple_rag.add_documents(chunks)
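
`add_documents` sends one embeddings request per chunk. The embeddings endpoint also accepts a list of strings as `input`, so chunks can be sent in batches to cut down the number of requests. The sketch below shows only the batching logic; `batched`, `embed_all`, and `fake_embed_batch` are hypothetical helpers, and the fake embedder stands in for the API so the example runs without a key.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(texts, embed_batch, batch_size=64):
    """Embed texts batch by batch and return the vectors in the original order.

    embed_batch is any callable that maps a list of strings to a list of
    vectors, for example a thin wrapper around client.embeddings.create.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in embedder (an assumption for this sketch): one number per text
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


vectors = embed_all(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

With the real client, `embed_batch` could be a function that calls `client.embeddings.create(model="text-embedding-3-small", input=batch)` and returns `[item.embedding for item in response.data]`. The batch size of 64 is an arbitrary choice, not a documented limit.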

In [8]:
# simple_rag = SimpleRAG()
# simple_rag.load_documents()

In [5]:
simple_rag.retrieve("best food in albahah")

[' Al-Bustan\nRestaurant\nAl-Bustan Restaurant is one of the best restaurants offering \nLebanese cuisine in Al-Baha. The restaurant is known for its wide \nand diverse selection of delicious Lebanese dishes.\nClick here\n Al Baha CityTourist guide For Al Bahah region\n 47',
 ' 8OZ Café\nIt has modern decoration and offers a variety of cold and hot \ndrinks in addition to western desserts.\nClick here\n Al Baha CityTourist guide For Al Bahah region\n 45',
 'Visit Saudi Road trips 2022  275 UNMISSABLE \nTHINGS TO DO\nCamp in a forest\nSprawling forests are the norm in Al Baha making this city a favourite among nature lovers or those looking to catch a break from urban life. \nThe Khairah Forest \nPark has some of the best night views under the starry Arabian sky. Visit the Raghadan Forest next, a family-friendly camping spot that is closely located to Al Baha’s city centre. The winding road near the forest will lead you to a spot loved by locals for its beautiful vistas that offer a bre

### 5. Querying

In [6]:
simple_rag.query('what is the best time for star gazing in alula?')


'The best time for star-gazing in AlUla is around each new moon.'

In [7]:
simple_rag.query('what are the facilites in habitas alula?')

'The facilities in Habitas AlUla include a spa with in-and-outdoor treatments, a dramatic infinity pool, the Desert X Pavilion showcasing local and international artists, an onsite Middle-Eastern restaurant, and a facility for cooking classes led by a local chef. It also includes a yoga deck, wellness and fitness centres, and a swimming pool.'

In [8]:
# List of questions to build a travel guide
questions = [
    # Jeddah Guidebook
    "What are the must-visit historical sites in Jeddah?",
    "What are the best beaches in Jeddah for water sports and relaxation?",
    "Can you recommend unique accommodations in Jeddah, including chalets and luxury hotels?",
    "What are the top dining spots in Jeddah to try traditional Hijazi cuisine?",
    "What are the best scuba diving and snorkeling spots near Jeddah?",

    # Red Sea Factsheet
    "What luxury resorts are currently open at the Red Sea destination?",
    "What sustainable initiatives are implemented at the Red Sea International Airport?",
    "What are the main features of the Ummahat Islands resorts?",
    "What activities are available for adventure seekers at the Red Sea resorts?",
    "How does the Red Sea project contribute to regenerative tourism?"
]

# Function to query the RAG system
def query_rag_system(questions):
    for question in questions:
        print(f"Question: {question}")
        try:
            # Assuming `simple_rag.query` is already defined
            response = simple_rag.query(question)
            print(f"Answer: {response}\n")
        except Exception as e:
            print(f"Error processing question '{question}': {e}\n")

# Execute the query for all questions
query_rag_system(questions)


Question: What are the must-visit historical sites in Jeddah?
Answer: The must-visit historical sites in Jeddah include the Nasseef House, the Matbouli House Museum, and the Tayebat Museum. Other notable locations to visit include Our Days of Bliss Magad and the Al Balad district, a UNESCO World Heritage Site filled with unique architecture dating from the 16th to the early 20th centuries.

Question: What are the best beaches in Jeddah for water sports and relaxation?
Answer: The context mentions that Jeddah has many prime beach locations on the Red Sea that are suitable for water sports and relaxation. While it doesn't mention specific beaches, it does recommend considering Jeddah's private beach clubs such as Oia or Indigo for activities like jet skiing and paddleboarding. It also mentions the Jeddah Waterfront, also known as the Jeddah Corniche, which features beaches, parks, play areas for kids, and dedicated sports and fishing areas.

Question: Can you recommend unique accommodati
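
The loop above prints each answer as it arrives. Since the questions are meant to build a travel guide, it may be more convenient to collect the answers and render them afterwards. A minimal sketch, assuming hypothetical helpers `collect_answers` and `to_markdown`; `stub_answer` replaces `simple_rag.query` so the example runs offline.

```python
def collect_answers(questions, answer_fn):
    """Return one record per question, keeping errors instead of raising."""
    records = []
    for question in questions:
        try:
            records.append({"question": question, "answer": answer_fn(question), "error": None})
        except Exception as exc:
            records.append({"question": question, "answer": None, "error": str(exc)})
    return records


def to_markdown(records):
    """Render the answered questions as Markdown sections, skipping failed ones."""
    sections = [f"## {r['question']}\n\n{r['answer']}" for r in records if r["error"] is None]
    return "\n\n".join(sections)


# Stand-in for simple_rag.query (an assumption for this sketch)
def stub_answer(question):
    if "fail" in question:
        raise RuntimeError("simulated API error")
    return "stub answer"


records = collect_answers(["Q1?", "please fail"], stub_answer)
print(to_markdown(records))
```

In the notebook, `collect_answers(questions, simple_rag.query)` would play the same role as `query_rag_system(questions)`, with the failed questions still available in `records` for a retry.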

### 6. Querying with UI

In [9]:
import ipywidgets as widgets
from IPython.display import display, Markdown

# Just a visualization function to make it easier to interact with the API
def create_ui(process_function):
    """Create a simple UI that works with any processing function."""
    # Create widgets
    input_box = widgets.Textarea(
        description='Input:',
        layout=widgets.Layout(width='600px', height='100px')
    )
    submit_button = widgets.Button(description='Submit')
    output_area = widgets.Output()
    response_area = widgets.HTML(value='<h3>Chat</h3>')

    def on_submit(b):
        with output_area:
            output_area.clear_output()
            response = process_function(input_box.value)
            response = (
                "**Question:** " + input_box.value + "\n\n**Answer:**\n" + response
            )
            input_box.value = ""  # Clear input box after submission
            display(Markdown(response))  # Display formatted response

    submit_button.on_click(on_submit)
    
    # Layout
    ui = widgets.VBox([
        input_box,
        submit_button,
        response_area,
        output_area
    ])
    display(ui)

In [10]:
create_ui(simple_rag.query)

VBox(children=(Textarea(value='', description='Input:', layout=Layout(height='100px', width='600px')), Button(…

### 7. Evaluation
Evaluate how well the retrieved documents match the query using cosine similarity.

In [13]:
import numpy as np
import pandas as pd


def get_top_similar_documents(vectors, query_vector, top_k=4):
    """Return the indices of the top_k vectors most similar to query_vector."""
    def get_cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Compute similarities
    similarities = [get_cosine_similarity(doc, query_vector) for doc in vectors]
    similarities_series = pd.Series(similarities, name="sims")
    # Get indices of the top-k documents
    top_indices = similarities_series.sort_values(ascending=False).iloc[:top_k].index.tolist()
    return top_indices
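
The FAISS index in section 4 ranks chunks by L2 distance (`IndexFlatL2`), while this function ranks by cosine similarity. For unit-length vectors the two are linked by ‖a − b‖² = 2 − 2·cos(a, b), so both produce the same ordering. The pure-Python sketch below checks this on toy vectors. It assumes the embeddings are normalized to length 1, which OpenAI documents for its embedding models but which is worth verifying on the stored vectors.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 2-d "embeddings", normalized to unit length
docs = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
query = normalize([2.0, 1.0])

# Highest cosine first versus smallest distance first
by_cosine = sorted(range(len(docs)), key=lambda i: cosine(docs[i], query), reverse=True)
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(docs[i], query))
print(by_cosine, by_l2)  # [1, 0, 2] [1, 0, 2]
```

If the stored vectors were not normalized, the two rankings could differ, and an inner-product index over normalized vectors would be the closer match to this evaluation.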