# GenAI API 

## 1. LangChain

### 1. Data Ingestion

In [1]:
# Imports
## Data ingestion
import bs4
from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader, ArxivLoader, WikipediaLoader

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
text_documents = TextLoader('speech.txt').load()
docs_pdf = PyPDFLoader('attention.pdf').load()
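loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    # Parse only the post title, header, and body; skip the rest of the page markup.
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(
        class_=("post-title", "post-content", "post-header")
    )),
)
docs = loader.load()  # load the page; `docs` is the input to the splitter cells below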
docs_ark = ArxivLoader(query="1706.03762", load_max_docs=2).load()
docs_wk = WikipediaLoader(query="Generative AI", load_max_docs=2).load()

In [1]:
# text_documents
# docs_pdf
# len(docs_ark)
# print(docs_wk)

### 2. Data Transformation

**Text Splitting from Documents: RecursiveCharacterTextSplitter**

This text splitter is the recommended one for generic text. It is parameterized by a list of separator strings and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. The effect is to keep paragraphs together as long as possible, then lines (roughly sentences), then words, since those tend to be the most strongly semantically related pieces of text.

- **How the text is split:** by list of characters.
- **How the chunk size is measured:** by number of characters.
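
The strategy described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, size is measured in characters, and `chunk_overlap` is ignored.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left (or the empty separator): fall back to a hard cut.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to merge the piece into the chunk being built.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: close the current chunk first.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7)
print(chunks)
```

The first paragraph is too long for one chunk, so it is split again on spaces, while the second paragraph fits and stays whole.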

In [12]:
# Import 
## Data transformation/Chunk
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter
from langchain_text_splitters import HTMLHeaderTextSplitter

In [13]:
# Split the loaded Document objects into chunks of at most 500 characters, overlapping by 50.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_documents = text_splitter.split_documents(docs)

# Split a raw string instead of Document objects.
with open("speech.txt") as f:
    speech = f.read()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text = text_splitter.create_documents([speech])

In [15]:
# print(final_documents[0])
# print(text[0])

In [4]:
# CharacterTextSplitter: splits on a single separator only
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
final_documents = text_splitter.split_documents(text_documents)

**How to split by HTML header**
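
The idea behind header-based splitting can be sketched with the standard library alone. This is an illustrative stand-in, not the `HTMLHeaderTextSplitter` imported earlier: the class name `HeaderSectionParser` and its output format (a list of dicts) are assumptions made for this example.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Group page text under the most recent header tag (hypothetical helper)."""

    def __init__(self, headers=("h1", "h2")):
        super().__init__()
        self.headers = set(headers)
        self.sections = []        # finished sections: {"header": ..., "text": ...}
        self._in_header = False   # True while inside a header tag
        self._header_buf = []     # text fragments of the header being read
        self._current = {"header": None, "text": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            # A new header closes the section collected so far.
            self._flush()
            self._in_header = True
            self._header_buf = []

    def handle_endtag(self, tag):
        if tag in self.headers and self._in_header:
            self._in_header = False
            title = "".join(self._header_buf).strip()
            self._current = {"header": title, "text": []}

    def handle_data(self, data):
        if self._in_header:
            self._header_buf.append(data)
        elif data.strip():
            self._current["text"].append(data.strip())

    def _flush(self):
        # Store the current section only if it collected any body text.
        if self._current["text"]:
            self.sections.append({
                "header": self._current["header"],
                "text": " ".join(self._current["text"]),
            })
        self._current = {"header": self._current["header"], "text": []}

    def close(self):
        super().close()
        self._flush()


html = "<h1>Intro</h1><p>Hello world.</p><h2>Details</h2><p>More text.</p>"
parser = HeaderSectionParser()
parser.feed(html)
parser.close()
print(parser.sections)
```

Each section keeps its header as metadata, so a chunk retrieved later still carries the heading it appeared under.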

**How to split JSON data**

### 3. Embeddings

In [None]:
# Imports
import os
from dotenv import load_dotenv

# OpenAI embedding
from langchain_openai import OpenAIEmbeddings

# Ollama embedding
from langchain_community.embeddings import OllamaEmbeddings

# Hugging Face embedding: imported in its own cell below

# load_dotenv() copies the variables defined in .env into os.environ.
load_dotenv()
# Fail early with a clear message if a key is missing.
for key in ("OPENAI_API_KEY", "HF_TOKEN"):
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set; add it to your .env file")

**1. OpenAI Embedding**

In [None]:
# text-embedding-3-large returns 3072 dimensions by default; request 1024 instead.
embeddings_1024 = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)

# Example
text = "This is a tutorial on OpenAI embeddings"
query_result = embeddings_1024.embed_query(text)

**2. Ollama Embedding**

In [None]:
embeddings_ollama = OllamaEmbeddings(model="gemma:2b")  # if no model is given, it uses llama2

# Embed a batch of documents; the result is one vector per input string.
r1 = embeddings_ollama.embed_documents([
    "Alpha is the first letter of the Greek alphabet",
    "Beta is the second letter of the Greek alphabet",
])
r1[1]

In [None]:
embeddings_ollama.embed_query("What is the second letter of the Greek alphabet?")

**3. Hugging Face Embedding**

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text, "This is not a test document."])
doc_result[0]

### 4. VectorStore

In [6]:
# Import
from langchain_chroma import Chroma
from langchain_community.vectorstores import FAISS

**1. FAISS**

In [None]:
# Vectorstore: index the speech chunks produced in the Data Transformation section
db = FAISS.from_documents(final_documents, embeddings)

### Querying
query = "How does the speaker describe the desired outcome of the war?"
docs = db.similarity_search(query)
docs[0].page_content

**As a Retriever**

We can also convert the vector store into a retriever. This gives it the common retriever interface, so it can be passed to other LangChain components, most of which work with retrievers rather than with vector stores directly.
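
The point of the conversion is a uniform interface: whatever the backing store, callers only need `invoke(query)`. A toy sketch of that idea follows. The classes `ToyVectorStore` and `ToyRetriever` are hypothetical stand-ins written for this example, and the word-overlap scoring is a placeholder for real embedding similarity.

```python
class ToyRetriever:
    """Wraps a store so callers only need invoke(query)."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query):
        return self.store.similarity_search(query, k=self.k)


class ToyVectorStore:
    """Hypothetical store that ranks texts by the number of words shared with the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda t: len(query_words & set(t.lower().split())),
            reverse=True,  # more shared words first
        )
        return ranked[:k]

    def as_retriever(self, k=4):
        return ToyRetriever(self, k)


store = ToyVectorStore([
    "alpha is the first letter",
    "beta is the second letter",
    "cats sleep",
])
toy_retriever = store.as_retriever(k=1)
result = toy_retriever.invoke("second letter")
print(result)
```

Code written against `invoke` keeps working if the store behind the retriever is swapped for a different one.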

In [None]:
retriever = db.as_retriever()
docs = retriever.invoke(query)
docs[0].page_content

**Similarity Search with Score**

FAISS also offers some store-specific methods. One of them is `similarity_search_with_score`, which returns each document together with the distance between it and the query. The returned score is an L2 distance, so a lower score means a closer match.
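
To make "lower is better" concrete, here is a small plain-Python sketch of ranking by L2 distance. The helper names `l2_distance` and `rank_by_l2` and the two-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions. The ordering is the same whether a library reports the distance itself or its square.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_l2(query_vec, doc_vecs):
    """Return (name, distance) pairs sorted from closest to farthest."""
    scored = [(name, l2_distance(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])


query_vec = [1.0, 0.0]
doc_vecs = {
    "far": [4.0, 4.0],
    "exact": [1.0, 0.0],
    "near": [0.0, 1.0],
}
ranked = rank_by_l2(query_vec, doc_vecs)
print(ranked)
```

The identical vector scores 0 and comes first, which mirrors how the first entry returned by a distance-based search is the best match.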

In [None]:
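docs_and_score = db.similarity_search_with_score(query)

### Saving and loading
db.save_local("faiss_index")

# load_local unpickles stored data, so enable allow_dangerous_deserialization
# only for index files you created yourself.
new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
docs = new_db.similarity_search(query)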

**2. Chroma**

In [None]:
# Build a Chroma store from the speech chunks and persist it to disk.
vectordb = Chroma.from_documents(documents=final_documents, embedding=embeddings, persist_directory="./chroma_db")
# Load from disk
db2 = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
docs = db2.similarity_search(query)
print(docs[0].page_content)

## Similarity search with score
docs_and_score = vectordb.similarity_search_with_score(query)

### Retriever option
retriever = vectordb.as_retriever()
retriever.invoke(query)[0].page_content