# Chroma + RAG
- Load the PDF, split it into chunks, embed each chunk, then store the embeddings in a retrieval system (Chroma)
- Embed the query
- The retrieval system finds the most relevant chunks (nearest neighbours of the query embedding among the PDF text embeddings)
- Pass the query and the retrieved chunks to the LLM
- The LLM synthesizes an answer from that data
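
The steps above can be sketched end to end without any external library. This is only a toy illustration: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, and the three chunks are made-up examples, not text from the PDF.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, n_results=2):
    """Return the n_results chunks whose vectors are most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:n_results]


# Made-up chunks, for illustration only.
toy_chunks = [
    "the ev charger is an ac charging station",
    "the battery inverter stores pv energy",
    "heat pumps can be controlled via a radio-controlled socket",
]

toy_query = "which ev charger is supported"
top_chunks = retrieve(toy_query, toy_chunks, n_results=1)

# The query plus the retrieved chunks is what gets handed to the LLM.
prompt = f"Question: {toy_query}\nRelevant info: {' '.join(top_chunks)}"
print(prompt)
```

In the notebook itself, the word counts are replaced by a sentence-transformer model and the brute-force ranking by a Chroma collection.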

***
## Install packages

In [2]:
%pip install -qqq langchain==0.1.0
%pip install -qqq sentence-transformers==2.2.2
%pip install -qqq openai==1.7.1 umap==0.1.1

# chromadb is pinned to 0.3.29 (instead of 0.4.22) to avoid:
# RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0.
%pip install -qqq chromadb==0.3.29 #0.4.22

%pip install -qqq pypdf==3.17.4
%pip install -qqq git+https://github.com/tkra90/magic.git

Note: you may need to restart the kernel to use updated packages.


In [1]:
from tinymagic.nlp import helpers as nlp_helper

In [4]:
from pypdf import PdfReader
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter, 
    SentenceTransformersTokenTextSplitter
)

***
## Read pdf document

In [5]:
pdf_file = './LLMs/SI-HoMan-PL-en-53.pdf'

# extract the text of each page, strip surrounding whitespace, and remove '..' dot pairs
pdf_readr = PdfReader(pdf_file)
pdfs = [page.extract_text().strip().replace('..','') for page in pdf_readr.pages]

# filter empty pages
filtered_pdf = [pg for pg in pdfs if pg]

len(pdfs), len(filtered_pdf)

(64, 63)
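
One caveat with `.replace('..', '')`: it removes dots in pairs, so a run with an odd number of dots leaves a single dot behind. A minimal sketch of a regex-based alternative follows. `strip_dot_leaders` is a hypothetical helper, not part of the notebook, and the sample line is made up.

```python
import re


def strip_dot_leaders(text):
    """Remove every run of two or more consecutive dots."""
    return re.sub(r"\.{2,}", "", text)


sample = "1 Introduction ..... 5"

pairwise = sample.replace("..", "")   # removes two pairs, one dot survives
regexed = strip_dot_leaders(sample)   # removes the whole run of five dots

print(repr(pairwise))
print(repr(regexed))
```

Single dots, such as sentence endings and version numbers like `5.3`, are left untouched by both variants.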

In [6]:
print(filtered_pdf[0])

Planning Guidelines
SMA SMART HOME
The System Solution for Greater Independence
SI-HoMan-PL-en-53 | Version 5.3 ENGLISH


In [8]:
print(nlp_helper.wrap_text(filtered_pdf[0], 50, '\n\n'))

Planning Guidelines
SMA SMART HOME
The System

Solution for Greater
Independence
SI-HoMan-PL-en-53 | Version 5.3
ENGLISH


***
## Split text by Characters 

In [9]:
# split on "\n\n" first; if a piece is still longer than 1000 characters, fall back to "\n", then ". ", " ", and finally ""
char_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ",  " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)

merged_pages = '\n\n'.join(filtered_pdf)
split_text = char_splitter.split_text(merged_pages)

print(f'Nr. of chunks: {len(split_text)}')

Nr. of chunks: 149


In [10]:
print(nlp_helper.wrap_text(split_text[0])[:250])

Planning Guidelines
SMA SMART HOME
The System
Solution for Greater
Independence
SI-HoMan-PL-en-53 | Version 5.3
ENGLISH
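
To make the fallback through the separator list concrete, here is a simplified sketch of recursive splitting. It illustrates the idea only and is not LangChain's actual implementation. `recursive_split` and the toy text are assumptions made for this example.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: hard-cut into fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    # "" means "split between every character" (str.split rejects an empty separator).
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Still too long: retry this piece with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


toy_text = "aaa bbb\n\nccc ddd eee\n\nfff"
toy_split = recursive_split(toy_text, ["\n\n", " ", ""], chunk_size=8)
print(toy_split)
```

The middle paragraph is longer than 8 characters, so it alone is split again on spaces, while the short paragraphs stay intact.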


***
## Split Tokens
- Re-split the character chunks by token count so that each chunk fits into the embedding model

In [11]:
token_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=0, 
    tokens_per_chunk=256
)
# because the embedding model's context window is 256 tokens

In [12]:
split_txt_by_token = []
for text in split_text:
    split_txt_by_token += token_splitter.split_text(text)
    
len(split_txt_by_token)

153

In [13]:
print(nlp_helper.wrap_text(split_txt_by_token[0]))

planning guidelines sma smart home the system
solution for greater independence si - homan - pl
- en - 53 | version 5. 3 english


In [14]:
split_txt_by_token[0]

'planning guidelines sma smart home the system solution for greater independence si - homan - pl - en - 53 | version 5. 3 english'
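
The chunk count grows from 149 to 153 because any character chunk longer than the token window is cut into several token chunks. A minimal sketch of that windowing follows. It assumes whitespace-separated words as a stand-in for the model's real sub-word tokens, and `split_by_tokens` is a hypothetical helper, not the LangChain class.

```python
def split_by_tokens(text, tokens_per_chunk, chunk_overlap=0):
    """Split text into windows of at most tokens_per_chunk tokens.

    Assumption: whitespace-separated words stand in for real sub-word
    tokens, so the counts only approximate a real tokenizer.
    """
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    tokens = text.split()
    step = tokens_per_chunk - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(" ".join(tokens[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reached the end of the text
    return windows


toy_tokens_text = "one two three four five six seven"
print(split_by_tokens(toy_tokens_text, tokens_per_chunk=3))
print(split_by_tokens(toy_tokens_text, tokens_per_chunk=3, chunk_overlap=1))
```

A text that already fits into one window comes back as a single chunk, which is why most of the character chunks pass through unchanged.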

***
## Sentence Transformer 
- <a href="https://arxiv.org/pdf/1908.10084.pdf">paper</a>
- provides state-of-the-art sentence, text, and image embeddings; can compute sentence/text embeddings for more than 100 languages
- is based on the BERT transformer architecture
- embeds each token individually, then pools the token embeddings to produce a single dense vector per sentence/chunk
- embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning
- useful for semantic textual similarity, semantic search, and paraphrase mining
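
The pooling and comparison steps can be illustrated with plain Python. This is a sketch under assumptions: mean pooling is used as one common pooling strategy, and the 2-dimensional token vectors are made up (real models produce, for example, 384 dimensions).

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token embeddings for three short "sentences".
sentence_a = mean_pool([[1.0, 0.0], [1.0, 2.0]])    # pooled: [1.0, 1.0]
sentence_b = mean_pool([[2.0, 2.0], [4.0, 4.0]])    # pooled: [3.0, 3.0]
sentence_c = mean_pool([[1.0, -1.0], [3.0, -3.0]])  # pooled: [2.0, -2.0]

print(cosine_similarity(sentence_a, sentence_b))  # same direction, close to 1
print(cosine_similarity(sentence_a, sentence_c))  # orthogonal, close to 0
```

Cosine similarity ignores vector length, so `sentence_a` and `sentence_b` count as equally "similar in meaning" even though their magnitudes differ.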

In [16]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

In [17]:
embedding_fun = SentenceTransformerEmbeddingFunction(model_name = 'all-MiniLM-L6-v2')

embedded_chunk = embedding_fun([split_txt_by_token[0]])
len(embedded_chunk[0])

384

In [18]:
embedded_chunk[0][:15]

[-0.008290236815810204,
 0.04230229929089546,
 0.03675457835197449,
 -0.04941593483090401,
 0.06758478283882141,
 0.03065255470573902,
 -0.013549256138503551,
 0.009989805519580841,
 -0.12992864847183228,
 0.027583245187997818,
 0.02715235762298107,
 0.009702799841761589,
 0.0440477691590786,
 -0.05999058485031128,
 0.09309713542461395]

In [19]:
# add docs to Chroma
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection(
    "sma_smart_home", 
    embedding_function=embedding_fun
)

indices = [str(i) for i, _ in enumerate(split_txt_by_token)]

chroma_collection.add(ids=indices, documents=split_txt_by_token)
chroma_collection.count()

153

In [31]:
query = "What kind of EV chargers does the system support from SMA?" # evc7, evc22
# query = "Is it possible to connect a hybrid inverter and an EV charger to the Home Manager?"
# query = "What are the supported batteries by the SMA HOME?"
# query = "Does the system support connecting heat pumps?"

results = chroma_collection.query(query_texts=[query], n_results=5)
resulting_docs = results['documents'][0]

for doc in resulting_docs:
    print(nlp_helper.wrap_text(doc))
    print('\n')

• a desk lamp with an energy requirement of e. g.
20 wh can only consume a very small portion of
the pv energy. • toasters and kettles are only
switched on when they are required. toast and hot
water are required promptly. • an electric cooker
is switched on when the user wishes to cook. the
food is to be prepared promptly and not simply
whenever sufficient pv energy is available for
operation of the electric cooker. 5. 2 sma ev
charger in the energy management system the ev
charger is an ac charging station that is
designed for unidirectional charging of a
vehicle. the sma ev charger along with the sunny
home manager 2. 0 makes an intelligent charging
station for the sma energy system home. if the ev
charger is operated without the sunny home
manager 2. 0, the modes for intelligent charging
are not available.


6 kwp = total po wer −1 kwline conductor neutr al
conductor figure 19 : the sma ev charger uses the
single - phase pv generation for faster charging
of the electric vehicle ( b

In [37]:
import openai
from openai import OpenAI

# key1 is assumed to hold the OpenAI API key; it is not defined in the cells shown here
openai_client = OpenAI(api_key=key1)

In [47]:
def run_rag(query, resulting_documents, model="gpt-3.5-turbo"):
    """Answer `query` with the LLM, using only the retrieved chunks as context."""
    # merge the retrieved chunks into one context string
    info = "\n".join(resulting_documents)
    msg = [
        {"role": "system",
         "content":"You are a helpful solar systems engineer. Your client asks you questions about a certain product from a catalogue. "
         "You will be shown client's question, and the relevant info. Answer using only the relevant info that is given to you."
        },
        {"role": "user",
         "content": f"Question: {query}, \n  Relevant info: {info}"
        }
    ]

    # single chat-completion call; return the text of the first choice
    response = openai_client.chat.completions.create(messages=msg, model=model)
    return response.choices[0].message.content

res = run_rag(query, resulting_docs)

In [49]:
print(nlp_helper.wrap_text(res))

The system supports EV chargers from SMA.
Specifically, it mentions the SMA EV charger,
which is an AC charging station designed for
unidirectional charging of a vehicle. It can be
used with the Sunny Home Manager 2.0 to make an
intelligent charging station for the SMA energy
system home. The modes for intelligent charging
are only available when the EV charger is
operated with the Sunny Home Manager 2.0. In
multi-EVC operation mode, the system supports the
connection of a maximum of 3 SMA EV chargers.
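
The prompt assembly can also be separated from the API call, which makes it possible to inspect exactly what the model receives without needing an API key. A minimal sketch follows. `build_messages` is a hypothetical helper, and the shortened system prompt and sample chunks are placeholders, not the ones used above.

```python
def build_messages(query, documents):
    """Assemble chat messages from a question and the retrieved chunks."""
    info = "\n".join(documents)
    return [
        {
            "role": "system",
            "content": (
                "You are a helpful solar systems engineer. "
                "Answer using only the relevant info that is given to you."
            ),
        },
        {"role": "user", "content": f"Question: {query}\nRelevant info: {info}"},
    ]


messages = build_messages(
    "Does the system support EV chargers?",
    ["chunk one", "chunk two"],  # placeholder chunks
)
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` argument of a chat-completion call, so it could be passed to `openai_client.chat.completions.create` unchanged.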
