<center><h2>Gen AI: Business Project - Real Estate Assistant Using RAG</h2></center>

#### 6.2: RAG-Based Technical Architecture
- Technical architecture
  - Documents -> Splits (Chunk1, Chunk2, etc.) -> Vector DB (vector database) -> Retrieval (chunks relevant to the question, e.g. Chunk2 and Chunk4) -> Prompt (Question: answer based on the text below, Chunk2 and Chunk4) -> LLM -> Answer
- Tools used
  - Documents (JSONLoader, UnstructuredURLLoader) -> Splits (CharacterTextSplitter, RecursiveCharacterTextSplitter) -> Vector DB (FAISS, Chroma) -> Retrieval (RetrievalQAWithSourcesChain) -> Prompt -> LLM -> Answer
- Generally, in Phase 2, a company builds a data ingestion pipeline
  - Multiple source systems -> Web scraper (cron-based) -> Embedding (many ways to generate; depends on the user's choice) -> Chroma DB
- Final architecture
  - ChatGPT-like React-based UI (where people ask questions) -> Embedding -> Vector DB -> Prompt (question along with data chunks) -> LLM -> ChatGPT-like React-based UI (where the answer is shown)
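
The flow above (chunks -> retrieval -> prompt) can be sketched in plain Python. This is only a toy illustration: the word-count `embed` function stands in for a real embedding model, the in-memory list stands in for a vector DB, the sample chunks are made up for the example, and the final LLM call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, context_chunks):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return f"Answer the question based on the text below.\n\n{context}\n\nQuestion: {question}"

# Made-up sample chunks, for illustration only.
chunks = [
    "Mortgage rates tend to move with long-term Treasury yields.",
    "The property has three bedrooms and a renovated kitchen.",
    "The Federal Reserve sets the federal funds rate, which influences mortgage rates.",
    "The neighborhood has two parks and a public library.",
]
question = "How does the Federal Reserve affect mortgage rates?"
top_chunks = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string is what would be sent to the LLM
```

In a real pipeline, `embed` and `retrieve` would be replaced by an embedding model and a vector DB query, while the prompt-building step stays essentially the same.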

#### 6.3: Document Loaders

In [8]:
# UnstructuredURLLoader
from langchain_community.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(
    urls=["https://www.cnbc.com/2024/12/21/how-the-federal-reserves-rate-policy-affects-mortgages.html"]
)
docs = loader.load()
len(docs)

1

In [9]:
docs[0].metadata

{'source': 'https://www.cnbc.com/2024/12/21/how-the-federal-reserves-rate-policy-affects-mortgages.html'}

In [10]:
docs[0].page_content

'Access Denied\n\nYou don\'t have permission to access "http://www.cnbc.com/2024/12/21/how-the-federal-reserves-rate-policy-affects-mortgages.html" on this server.\n\nReference #18.65ad4d68.1754539958.447612c\n\nhttps://errors.edgesuite.net/18.65ad4d68.1754539958.447612c'
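
Note that the loader returned one document, but its `page_content` is an error page rather than the article. A minimal sketch of a guard that flags such documents before they reach the splitter is shown below; the marker list and the `looks_blocked` helper are assumptions for illustration, not part of LangChain.

```python
# Hypothetical phrases that suggest the site blocked the request.
BLOCK_MARKERS = ("access denied", "you don't have permission", "403 forbidden")

def looks_blocked(page_content):
    """Return True if the start of the page text looks like an error page."""
    head = page_content[:500].lower()
    return any(marker in head for marker in BLOCK_MARKERS)

# Sample text modelled on the output above, with a placeholder URL.
sample = 'Access Denied\n\nYou don\'t have permission to access "http://example.com/article.html" on this server.'
print(looks_blocked(sample))
print(looks_blocked("Mortgage rates rose this week."))
```

Documents flagged this way could be dropped or re-fetched by another method instead of being embedded as if they were real content.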

In [11]:
# CSVLoader
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    file_path="../data/patient_records.csv"
)
docs = loader.load()
len(docs)

In [12]:
docs[0]

Document(metadata={'source': '../data/patient_records.csv', 'row': 0}, page_content='patient_id: PT-1000\nsymptoms: Dizziness, irregular heartbeat, fatigue\ndiagnosis: Atrial Fibrillation\ntreatment: Blood thinners, beta-blockers\ndoctor_notes: Irregular heartbeat detected; cardiology referral made.')

In [13]:
docs[0].metadata

{'source': '../data/patient_records.csv', 'row': 0}

In [14]:
print(docs[0].page_content)

patient_id: PT-1000
symptoms: Dizziness, irregular heartbeat, fatigue
diagnosis: Atrial Fibrillation
treatment: Blood thinners, beta-blockers
doctor_notes: Irregular heartbeat detected; cardiology referral made.
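
As the output shows, each CSV row becomes one document whose text is a list of `column: value` lines, with the source and row index in the metadata. The sketch below approximates that behavior with the standard `csv` module; it is an illustration of the output format, not CSVLoader's actual implementation, and the in-memory sample reuses only a few columns from the row above.

```python
import csv
import io

# Small in-memory sample so the sketch runs without the CSV file.
csv_text = """patient_id,symptoms,diagnosis
PT-1000,"Dizziness, irregular heartbeat, fatigue",Atrial Fibrillation
"""

def rows_to_documents(file_obj, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    docs = []
    for i, row in enumerate(csv.DictReader(file_obj)):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs

docs = rows_to_documents(io.StringIO(csv_text), source="in-memory sample")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```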


#### 6.4: Text Splitter in Langchain

In [16]:
text = '''Hometown Cha-Cha-Cha is a 2021 South Korean romantic comedy drama television series starring Shin Min-a, Kim Seon-ho and Lee Sang-yi. It is a remake of 2004 South Korean film Mr. Handy, Mr. Hong.[6] It aired on tvN from August 28 to October 17, 2021, every Saturday and Sunday at 21:00 (KST).[7][8] It is also available for streaming on Netflix.[9]

The series was a commercial hit and became one of the highest-rated dramas in Korean cable television history.[10][11] It ranked first place during its entire run for eight weeks, and the last episode achieved 12.665% nationwide rating, with over 3.2 million views.[12] It also became one of Netflix's most-watched non-English television shows, and one of its longest-running hits as it spent 16 weeks in global top ten rankings.
'''

##### Manual approach of splitting the text into chunks

In [17]:
text[0:100]

'Hometown Cha-Cha-Cha is a 2021 South Korean romantic comedy drama television series starring Shin Mi'

In [18]:
# We want whole words, and we want to do this for the entire text; maybe we can use Python's split function
words = text.split(" ")
len(words)

126

In [19]:
chunks = []

s = ""
for word in words:
    s += word + " "
    if len(s)>200:
        chunks.append(s)
        s = ""
        
chunks.append(s)

In [20]:
chunks[:2]

['Hometown Cha-Cha-Cha is a 2021 South Korean romantic comedy drama television series starring Shin Min-a, Kim Seon-ho and Lee Sang-yi. It is a remake of 2004 South Korean film Mr. Handy, Mr. Hong.[6] It ',
 'aired on tvN from August 28 to October 17, 2021, every Saturday and Sunday at 21:00 (KST).[7][8] It is also available for streaming on Netflix.[9]\n\nThe series was a commercial hit and became one of the ']

Splitting data into chunks can be done in native Python, but it is a tedious process. The loop above also appends each word before checking the length, so every chunk ends up slightly longer than 200 characters. In addition, you may need to experiment with different delimiters iteratively to ensure that each chunk does not exceed the token limit of the LLM you are using.
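
For comparison, here is one way the manual approach could be tightened so that chunks stay within the limit. It is a sketch, not the notebook's original code: `chunk_words` is a hypothetical helper, and it splits on any whitespace (so newlines are collapsed), unlike `text.split(" ")`.

```python
def chunk_words(text, max_chars=200):
    """Group whole words into chunks of at most max_chars characters.

    A single word longer than max_chars still becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            # Adding this word would exceed the limit: close the current chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "word " * 100
chunks = chunk_words(sample, max_chars=50)
print(len(chunks), max(len(c) for c in chunks))
```

The key difference is that the length check happens before the word is added, and the trailing chunk is appended only when it is non-empty.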

LangChain provides a better way through its text splitter classes.

##### CharacterTextSplitter

In [21]:
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size=200,
    chunk_overlap=0
)

In [22]:
chunks = splitter.split_text(text)
len(chunks)

Created a chunk of size 348, which is longer than the specified 200


2

In [23]:
len(chunks[0]), len(chunks[1])

(348, 429)

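As you can see, although we specified a chunk size of 200, the resulting chunks are larger than 200 characters. CharacterTextSplitter splits only on the given separator (\n here) and never breaks a piece any further, and each paragraph in our text is a single line.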
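
The behavior can be reproduced with a small sketch that splits on a single separator and then merges pieces up to the size limit. This is a simplified illustration under that assumption, not CharacterTextSplitter's actual code, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then merge neighbouring pieces up to chunk_size.

    A piece that is already longer than chunk_size is kept whole,
    because there is no other separator to break it with.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

long_line = "a" * 30
text_sample = long_line + "\n" + "short one" + "\n" + "short two"
chunks = split_on_separator(text_sample, "\n", chunk_size=20)
print([len(c) for c in chunks])
```

The 30-character line survives as an oversized chunk, while the two short lines are merged into one chunk that fits the limit.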

Another LangChain class splits the text recursively based on a list of separators: RecursiveCharacterTextSplitter. Let's see how it works.

In [24]:
print(text)

Hometown Cha-Cha-Cha is a 2021 South Korean romantic comedy drama television series starring Shin Min-a, Kim Seon-ho and Lee Sang-yi. It is a remake of 2004 South Korean film Mr. Handy, Mr. Hong.[6] It aired on tvN from August 28 to October 17, 2021, every Saturday and Sunday at 21:00 (KST).[7][8] It is also available for streaming on Netflix.[9]

The series was a commercial hit and became one of the highest-rated dramas in Korean cable television history.[10][11] It ranked first place during its entire run for eight weeks, and the last episode achieved 12.665% nationwide rating, with over 3.2 million views.[12] It also became one of Netflix's most-watched non-English television shows, and one of its longest-running hits as it spent 16 weeks in global top ten rankings.



##### RecursiveCharacterTextSplitter

In [25]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " "],  # separators tried in order (the library default is ["\n\n", "\n", " ", ""])
    chunk_size = 200,  # maximum size of each chunk
    chunk_overlap = 30,  # overlap between consecutive chunks, to preserve context
    length_function = len  # function used to measure size; len counts characters, but any token counter can be passed
)
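
To illustrate what "any token counter" could look like, here is a rough sketch of a custom length function. `approx_token_len` is a hypothetical helper that counts words and punctuation marks; it only approximates what a real LLM tokenizer would report.

```python
import re

def approx_token_len(text):
    """Rough token count: each word and each punctuation mark counts as one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))

sentence = "It aired on tvN from August 28 to October 17, 2021."
print(len(sentence), approx_token_len(sentence))
```

If a function like this were passed as `length_function`, `chunk_size` and `chunk_overlap` would be measured in these approximate tokens rather than in characters.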

In [26]:
chunks = r_splitter.split_text(text)

for chunk in chunks:
    print(len(chunk))

198
178
192
196
91


In [27]:
print(chunks[0])

Hometown Cha-Cha-Cha is a 2021 South Korean romantic comedy drama television series starring Shin Min-a, Kim Seon-ho and Lee Sang-yi. It is a remake of 2004 South Korean film Mr. Handy, Mr. Hong.[6]


In [28]:
print(chunks[1])

film Mr. Handy, Mr. Hong.[6] It aired on tvN from August 28 to October 17, 2021, every Saturday and Sunday at 21:00 (KST).[7][8] It is also available for streaming on Netflix.[9]


As shown above, the second chunk begins with the last few characters of the first chunk; this overlap comes from chunk_overlap.
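
The size of the overlap can be checked with a small helper that finds the longest text shared between the end of one chunk and the start of the next. `shared_overlap` is a hypothetical function written for this check; the two strings are the end of the first chunk and the start of the second, copied from the outputs above.

```python
def shared_overlap(prev_chunk, next_chunk):
    """Return the longest suffix of prev_chunk that is also a prefix of next_chunk."""
    max_len = min(len(prev_chunk), len(next_chunk))
    for size in range(max_len, 0, -1):
        if prev_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""

end_of_first = "It is a remake of 2004 South Korean film Mr. Handy, Mr. Hong.[6]"
start_of_second = "film Mr. Handy, Mr. Hong.[6] It aired on tvN from August 28"
overlap = shared_overlap(end_of_first, start_of_second)
print(repr(overlap), len(overlap))
```

The shared text is 28 characters, just under the chunk_overlap of 30. This is consistent with the splitter carrying over only whole space-separated pieces rather than cutting a word in half.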

##### Let's understand how exactly it formed these chunks

In [29]:
first_split = text.split("\n\n")[0]
first_split

'Hometown Cha-Cha-Cha is a 2021 South Korean romantic comedy drama television series starring Shin Min-a, Kim Seon-ho and Lee Sang-yi. It is a remake of 2004 South Korean film Mr. Handy, Mr. Hong.[6] It aired on tvN from August 28 to October 17, 2021, every Saturday and Sunday at 21:00 (KST).[7][8] It is also available for streaming on Netflix.[9]'

In [30]:
len(first_split)

348

The recursive text splitter uses a list of separators, in our case separators = ["\n\n", "\n", " "].

It first splits the text on \n\n. If a resulting piece is larger than the chunk_size parameter (200 in our case), it splits that piece again using the next separator, which is \n.

In [31]:
second_split = first_split.split("\n")
second_split

['Hometown Cha-Cha-Cha is a 2021 South Korean romantic comedy drama television series starring Shin Min-a, Kim Seon-ho and Lee Sang-yi. It is a remake of 2004 South Korean film Mr. Handy, Mr. Hong.[6] It aired on tvN from August 28 to October 17, 2021, every Saturday and Sunday at 21:00 (KST).[7][8] It is also available for streaming on Netflix.[9]']

In [32]:
len(second_split[0])

348

The piece contains no \n, so it is still 348 characters long. The splitter therefore moves on to the next separator, which is a space.

In [33]:
third_split = second_split[0].split(" ")
len(third_split)

59

In [34]:
third_split[0]

'Hometown'
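
Once the piece has been broken into words, the words are merged back into chunks that stay within chunk_size. The whole procedure can be sketched as below. This is a simplified version written for illustration: it ignores chunk_overlap and is not LangChain's actual implementation, and `recursive_split` and the sample text are hypothetical.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified recursive splitting without overlap."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to hard cuts every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks

sample = ("First paragraph with several words in it.\n\n"
          "Second paragraph, also short.\n\n"
          "Third paragraph is deliberately much longer so that it has to be broken on spaces.")
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=50)
for c in chunks:
    print(len(c), repr(c))
```

The first two paragraphs fit as they are, while the third is passed down through "\n" (which does nothing) to " ", where its words are regrouped into chunks under the limit.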

#### Lessons learned
- RAG is a popular approach for building Gen AI applications.
- LangChain abstracts low-level processing tasks, making it fast to build Gen AI apps.
- LangChain changes rapidly (sometimes in backward-incompatible ways), which causes frustration in the community.
- A good AI engineer or data scientist needs the skill of working out the correct syntax and resolving the errors they encounter.
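
One practical way to cope with modules moving between releases is to try several import paths in order. The helper below is a generic sketch, not something LangChain provides; it is demonstrated with standard-library names so that it runs anywhere, and `json_v2_hypothetical` is a made-up module name standing in for a relocated module.

```python
import importlib

def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate path
    raise ImportError(f"None of these modules could be imported: {module_names}")

# The first name does not exist, so the helper falls back to the second.
mod = import_first_available(["json_v2_hypothetical", "json"])
print(mod.__name__)
```

For the splitters used in this notebook, the candidate list might be `["langchain_text_splitters", "langchain.text_splitter"]`, assuming the older path is still present in the installed version.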