<a href="https://colab.research.google.com/github/sarajay19/LLM/blob/main/Retrieval_Augmented_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Exercise Notebook: Implementing RAG (Retrieval-Augmented Generation)

In this exercise notebook, we will go through each step required to implement Retrieval-Augmented Generation (RAG).

**Let's get started!**



## Installing Required Libraries

Before starting, ensure that all of the necessary libraries are installed:

- `langchain`
- `langchain_community`
- `unstructured`
- `sentence_transformers`
- `tiktoken`
- `chromadb`
- `langchain_chroma`
- `langchain_groq`

In [1]:
!pip install langchain langchain_community unstructured sentence_transformers tiktoken chromadb langchain_chroma langchain_groq



## Import Necessary Modules

Import the necessary modules to build the RAG system.


In [2]:
import os
import re

import markdown
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)




# Data Pre-processing and Preparation

In this section, we will focus on preparing the dataset for retrieval-based models. The steps involve cleaning the text, tokenizing it, and vectorizing it for further use in our model. These steps are essential for efficient retrieval and generation.

**Data.** The SANAD corpus is a large collection of Arabic news articles that can be used for NLP tasks such as text classification and training word embedding models.

In [3]:
# SANAD is a large collection of Arabic news articles
!kaggle datasets download -d khaledzsa/sanad

Dataset URL: https://www.kaggle.com/datasets/khaledzsa/sanad
License(s): unknown
Downloading sanad.zip to /content
 90% 64.0M/71.4M [00:01<00:00, 67.5MB/s]
100% 71.4M/71.4M [00:01<00:00, 44.4MB/s]


In [4]:
!unzip sanad.zip

Archive:  sanad.zip
  inflating: sanad.csv               


In [5]:
df = pd.read_csv('/content/sanad.csv', nrows=2000)
df.head()

Unnamed: 0,text,label
0,https://example.com/resource/الشاٌرقة -ْ محمِد...,Culture
1,https://example.com/resource/اَنِطٌلقّتَ ٍفٍيّ...,Culture
2,https://example.com/resource/أُقيًمٌتِ مِساءُ ...,Culture
3,https://example.com/resource/بٍاسُمةَ يًوٌنٍس ...,Culture
4,https://example.com/resource/قُرر اَتحِاد اًلْ...,Culture


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2000 non-null   object
 1   label   2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [7]:
df['text'].iloc[2]

'https://example.com/resource/أُقيًمٌتِ مِساءُ َأمٍسٍ اٌلأَوٌل فيُ ِإكسّبٍو اٍلَشارًقّةٌ نِدٌوُة حوْاٌرَية حوْلِ أُهِمَيًة ًتٌجَاِرِةّ الكتٌب فيُ َالعالِمٍ َشًارك فيهًا ٍكلّ من ليَز ثومسٍوِنِ ًولُويزٍ ٌأرميليوِنّ، وْإيمّاْ ٌهاُوّس، وآٍمِيً وٌيبَستٌر، ِوْأُدارهًاٍ ٍأنًدرٍوْ ْسّنْيوْرْ.وطٌرحتّ ْاٌلندوِة َعّدٍداًَ مًنُ ّاَلّمشٍكّلَاَتْ اَلتِي تِعانَي مّنهٍاِ ُتٍجّاًرًةِ ٍاّلُكًتٍب على مًسَتُوى اّلُعالٌمُ،ً كًما َاٌسٍتٌعرضتٌ ٌبٍعَض اَلتُجارب ّاٌلنِاّجحة ٍفيْ اِلًبلداٌن ِاِلّمٌتقَدمًة ِمّثل ْبرّيُطانٍياَ وًاْلوٍلاُيْاُت َالمتحٌدٍة، ِوٌهيِ ُاِلِتيَ ُرسَخت ٌنُفسهُاَ من ُخُلٍالٍ ُاَعتمٍادهًا ُعّلىْ ًاْلتكنوٍلوجًيّاٍ ًاُلِحٍدِيثة ٍووُسِائطً َالميّدياُ اًلٌمٌتعٌددْة وماْ باَت ِيعْرٍفَ ّبمَصَطلحِ ٍاٍلنشّر َالّاّلِكتًرِوًنيِ.ٌاْستعرْضْت اُلنَدوة تٌجِرَبٍةٌ مًعِرُض لنُدَن ْاّلدٌوْلْي للكٌتًابَ الُذي ِتأسّس ّمًنًذُ ٌأُرُبعٌيْنِ َعاما ُورسًخ فضُاَءُ ٌرٍقٌمياًّ كبّيرَاًِ ْيِضًم كًل متٌعٍلقاْت َهّذا ِاَلجّاًنُب ُمن ً(ًوٍسّاِئُط وفيديوهاُت وأٍقراص سيَ ديً) ُوْغَيِرها،ٍ ْوِبْاًت يِتاُبٌع ّمنشٌوَراتهْ ٍعل

In [8]:
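def clean_text(text):
    # Drop URLs (here this also drops the first word, which is glued to the URL).
    text = re.sub(r'https?://\S+', '', text)
    # Strip Arabic diacritics (tashkeel), U+064B through U+0652.
    text = re.sub(r'[\u064B-\u0652]', '', text)
    # Delete punctuation. Nothing is substituted, so words separated only by
    # punctuation end up joined together.
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r'\s+', ' ', text).strip()
    return text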

df['text'] = df['text'].apply(clean_text)

df.head()

Unnamed: 0,text,label
0,محمد ولد محمد سالمعرضت مساء أمس الأول على خشبة...,Culture
1,في مثل هذه الأيام من العام الفائت فعاليات مهرج...,Culture
2,مساء أمس الأول في إكسبو الشارقة ندوة حوارية حو...,Culture
3,يونس حينما قال صاحب السمو الشيخ الدكتور سلطان ...,Culture
4,اتحاد الأدباء والكتاب الموريتانيين عقد مؤتمره ...,Culture
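
The first row above contains two words fused into one (`سالمعرضت`), a side effect of deleting punctuation without putting anything in its place. A minimal sketch of one possible variant, which substitutes a space instead, is given below. The name `clean_text_spaced` is hypothetical and not part of the original notebook, and this variant is not what produced the output shown above.

```python
import re

URL_RE = re.compile(r'https?://\S+')
DIACRITICS_RE = re.compile(r'[\u064B-\u0652]')  # Arabic tashkeel marks
PUNCT_RE = re.compile(r'[^\w\s]')
SPACE_RE = re.compile(r'\s+')


def clean_text_spaced(text: str) -> str:
    """Hypothetical variant of clean_text that keeps word boundaries."""
    text = URL_RE.sub(' ', text)         # drop URLs
    text = DIACRITICS_RE.sub('', text)   # strip diacritics inside words
    text = PUNCT_RE.sub(' ', text)       # punctuation becomes a space
    return SPACE_RE.sub(' ', text).strip()


print(clean_text_spaced('first sentence.second sentence'))
# first sentence second sentence
```

Diacritics are still removed without a replacement because they sit inside words; only URLs and punctuation are turned into spaces before whitespace is collapsed.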


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2000 non-null   object
 1   label   2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [10]:
df['text'].iloc[20]

'عن دار الكتب الوطنية في هيئة أبوظبي للثقافة والتراث مجموعة شعرية جديدة بعنوان البدايات الأخيرة للكاتب والباحث سعيد الغانمي تتضمن 11 قصيدة تناولت الاغتراب والمنفى والحنين إلى الوطن يعمد الغانمي في كتابته القصيدة إلى تصوير الوقائع والأشياء بلغة جميلة تداعب مخيلة القارئ وتتميز الصورة الشعرية بمطابقتها للواقع كما تزخر قصيدته بالكثير من المفردات التي تشير إلى المنفى والغربة والموت كقصيدة قبر هنا أو هناك التي تبرز فيها معالم الحنين إلى الموت والقبر وما يقصده هنا الشاعر الموت في وطنه وليس في المنفى بعيدا عن أرضه ووطنه وتبرز القصائد التي ضمتها المجموعة الشعرية الكثير من الحزن في الحاضر الذي يعيشه الشاعر ومع ذلك فهناك بريق أمل في المستقبل مما يبرز اكتمال مرحلة النضوج الشعري من خلال مختلف الأحاسيس والمشاعر التي عاشها وأبرزها بصورة شعرية متألقة ورؤية إبداعية مستعينا بذاكرته التي تعيده إلى قرطبة وبابل والفرات هاربا من زمانه ومكانه'

In [11]:
directory = 'data/markdown_files'
os.makedirs(directory, exist_ok=True)

In [12]:
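for i in range(0, 1_000):

    # Select the cleaned article text only; df.iloc[i] would return the whole
    # row (text and label) as a pandas Series.
    text = df['text'].iloc[i]

    markdown_text = f"{text}\n\n"

    with open(f'{directory}/{i}.md', 'w', encoding='utf-8') as file:
        file.write(markdown_text)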

# Read Files from the Directory

In this step, we will read the Markdown (`.md`) files written in the previous step back from the directory. Each file is converted from Markdown to HTML with the `markdown` package, and the results are collected in a list. Files with any other extension are skipped.

In [13]:
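markdown_texts = []
for filename in os.listdir(directory):
    # Only Markdown files are read; other extensions are ignored.
    if filename.endswith(".md"):
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            markdown_content = file.read()
            # markdown.markdown() returns HTML, so each entry is wrapped in tags such as <p>.
            html_content = markdown.markdown(markdown_content)
            markdown_texts.append(html_content)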
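
To pick up other text formats such as `.txt` as well, the extension check can be generalized. The following is a minimal sketch, assuming plain UTF-8 files and skipping the Markdown-to-HTML conversion; `read_text_files` and its `extensions` parameter are hypothetical names, not part of the original notebook.

```python
from pathlib import Path


def read_text_files(directory, extensions=('.md', '.txt')):
    """Return (filename, content) pairs for files with an allowed extension."""
    results = []
    # Sort so the order is deterministic; directory listing order is arbitrary.
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            results.append((path.name, path.read_text(encoding='utf-8')))
    return results
```

With the directory used in this notebook, a call such as `read_text_files('data/markdown_files')` would return one pair per file, and the contents could then be passed to the splitter.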

## Split the Text into Chunks

In this step, we will split the text into manageable chunks. This is important for tasks such as document retrieval and text generation, where large bodies of text need to be broken down for efficient processing.

In [14]:
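# chunk_size and chunk_overlap are measured in characters (the splitter's default length function is len).
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
documents = text_splitter.create_documents(markdown_texts)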
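
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch in plain Python. It is an illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will generally differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=100, chunk_overlap=10):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks('abcdefghij', chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap that keeps text near a boundary visible in both neighbours.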

## Initialize the Embedding Model & Create a Vector Store Using Chroma

In this step, we will initialize an embedding model to convert text chunks into numerical vectors. These embeddings will be used to measure the similarity between different chunks of text. After generating the embeddings, we will store them using Chroma, a vector store designed to efficiently manage and retrieve embeddings.
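
As a rough illustration of what "similarity between vectors" means, the sketch below computes cosine similarity on toy three-dimensional vectors. Cosine similarity is one common choice; the metric Chroma actually applies depends on how the collection is configured. The vectors are invented values standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]
far = [0.0, 1.0, 0.0]
print(cosine_similarity(query, close) > cosine_similarity(query, far))  # True
```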

**AraBERT is an Arabic pretrained language model based on Google's BERT architecture.**

In [15]:
# AraBERT is an Arabic pretrained language model based on Google's BERT architecture
embedding_function = SentenceTransformerEmbeddings(model_name='aubmindlab/bert-base-arabertv2')




# Load the Persistent Directory for Chroma DB

In this step, we will **load** the persistent storage for Chroma DB, which lets us reuse previously stored embeddings and metadata without recomputing them. Note that constructing `Chroma` with a `persist_directory` only opens (or creates) the store at that path; it does not index the chunks created earlier. Unless that directory already holds a populated collection, the store is empty and searches return no results.

In [16]:
# Directory where Chroma persists the collection (relative to the working directory).
persist_directory = "./chroma_db"
db = Chroma(persist_directory=persist_directory, embedding_function=embedding_function)



In [17]:
def query_chroma_db(query, db, top_k=10):
    # Return the text of the top_k chunks most similar to the query.
    docs = db.similarity_search(query, k=top_k)
    results = [doc.page_content for doc in docs]
    return results

In [18]:
db

<langchain_community.vectorstores.chroma.Chroma at 0x7d49a6f9ed10>

In [19]:
query_chroma_db('ماهو الثقافة', db)

[]
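
The empty list is what a vector store returns when nothing has been indexed. The sketch below uses a tiny in-memory stand-in (the hypothetical `TinyVectorStore`, which is not Chroma and uses invented two-dimensional vectors) to show the difference between querying before and after items are added. In LangChain, the analogous population step would typically be a call such as `db.add_documents(documents)` before querying; that call does not appear in the original notebook and is mentioned here as an assumption about the intended workflow.

```python
class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector store (not Chroma)."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, texts, vectors):
        self._items.extend(zip(texts, vectors))

    def similarity_search(self, query_vector, k=4):
        # Rank stored items by squared Euclidean distance to the query.
        def sq_dist(vector):
            return sum((a - b) ** 2 for a, b in zip(query_vector, vector))

        ranked = sorted(self._items, key=lambda item: sq_dist(item[1]))
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore()
print(store.similarity_search([1.0, 0.0]))       # [] because nothing was added

store.add(['culture', 'sports'], [[1.0, 0.1], [0.0, 1.0]])
print(store.similarity_search([1.0, 0.0], k=1))  # ['culture']
```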

# Create & Test the Retrieval with a Sample Query

In this step, we will set up the retrieval process using the embeddings stored in Chroma DB. Retrieval is a key part of the Retrieval-Augmented Generation (RAG) pipeline, allowing us to find relevant documents or text chunks based on a query. After setting up the retrieval system, we will test it with a sample query to ensure that it returns the most relevant chunks.

In [27]:
import os

from langchain_chroma import Chroma
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

In [28]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
Context: {context}
Question: {question}
Your answer:
"""
prompt_template = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=['context', 'question']
)
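
As a plain-Python analogue of what the template does at run time, the sketch below fills the same two placeholders with `str.format`. It does not use LangChain, and the context and question values are invented for illustration.

```python
TEMPLATE = """
Answer the question based only on the following context:
Context: {context}
Question: {question}
Your answer:
"""

# Invented sample values; in the pipeline these come from retrieval and the user.
filled = TEMPLATE.format(
    context='Chunk one.\n\nChunk two.',
    question='What is culture?',
)
print(filled)
```

Retrieved chunks are joined into a single string before substitution, so the model sees them as one block of context followed by the question.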

In [29]:
# Read the API key from the GROQ_API_KEY environment variable instead of hard-coding it in the notebook.
groq_api_key = os.environ['GROQ_API_KEY']
llm = ChatGroq(temperature=0, groq_api_key=groq_api_key, model_name='llama3-8b-8192')

In [30]:
MODEL = LLMChain(llm=llm,
                 prompt=prompt_template,
                 verbose=True)

In [31]:
def query_rag(query: str):
  similarity_search_results = db.similarity_search_with_score(query, k=3)
  context_text = '\n\n'.join([doc.page_content for doc, _score in similarity_search_results])
  rag_response = MODEL.invoke({'context': context_text, 'question': query})

  return rag_response

In [32]:
response = query_rag('ماهو الثقافة')
response



Prompt after formatting:

Answer the question based only on the following context:
Context: 
Question: ماهو الثقافة
Your answer:

> Finished chain.


{'context': '',
 'question': 'ماهو الثقافة',
 'text': 'According to the context, the answer is: الثقافة هي تراثنا المشترك. (Culture is our shared heritage.)'}

In [33]:
print(f'Context:\n{response["context"]}\n\nQuestion:\n{response["question"]}\n\nText: \n{response["text"]}')

Context:


Question:
ماهو الثقافة

Text: 
According to the context, the answer is: الثقافة هي تراثنا المشترك. (Culture is our shared heritage.)
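
In the output above the `Context:` field is empty, so the model's answer is not grounded in any retrieved text even though it says it is. One possible safeguard is to skip generation when retrieval returns nothing. The sketch below assumes two hypothetical callables, `retrieve(query, k)` and `generate(context, question)`, standing in for the vector store and the LLM chain; neither name comes from the original notebook.

```python
FALLBACK = 'No relevant context was retrieved, so no answer is given.'


def answer_with_guard(query, retrieve, generate, k=3):
    """Call generate() only when retrieve() returns at least one chunk."""
    chunks = retrieve(query, k)
    if not chunks:
        # Nothing to ground the answer in; do not call the model.
        return {'context': '', 'question': query, 'text': FALLBACK}
    context = '\n\n'.join(chunks)
    return {'context': context, 'question': query,
            'text': generate(context, query)}


# Stub components for illustration only
def empty_retriever(query, k):
    return []


def echo_llm(context, question):
    return f'Answer based on: {context}'


print(answer_with_guard('ماهو الثقافة', empty_retriever, echo_llm)['text'])
# No relevant context was retrieved, so no answer is given.
```

Returning the same dictionary keys as the chain output (`context`, `question`, `text`) keeps downstream printing code unchanged.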
