# Automate Tech Blog

In this notebook, we'll take a brief look at Retrieval Augmented Generation (RAG) using LangChain.

We'll also be using the LangChain Expression Language (LCEL) to build our solutions. LCEL is a production-ready style of building and prototyping chains. With automatic async support and built-in parallelization, LCEL helps you get ready for production with very little developer-side lift!

### Cycle of LLM Chat Model

- Understanding the business statement
- Creating the data collection
- Creating the vector store
- Adding the collection to the vector store
- Using RAG to retrieve data from the database

# Business statement
- When developing a technical product, writing technical blog posts helps users get started: they can follow a post and see how to set things up. The problem is: how can we ensure the posts reach the community?
- In open source, a common practice is to retrieve data from web pages that combine code and prose, and convert it to text.
- How can we leverage open source tools to do outreach with the community?
- One way to spread the word and connect with the community is through social media, such as Twitter and LinkedIn.
- With the rise of Artificial Intelligence, we can automate the process of generating blog posts that are enticing and encourage users to try a tech product.


# Our Domain

 - The domain I've selected today is NumPy, a fairly niche topic.
 - Source - https://numpy.org/doc/

# Model

 Model - this allows us to specify our model.

In [2]:
from langchain.chat_models import ChatOpenAI

# Prompt Template

Since we need to pass user-defined questions to our RAG chain, we want to set up a simple template.

In [5]:
from langchain.prompts import ChatPromptTemplate

# Output Parser

If we look at our LLM, we'll notice that its outputs are Message objects. We can convert the response into a str by chaining a StrOutputParser at the end.

In [6]:
from langchain.schema.output_parser import StrOutputParser

# Create Embeddings using Qdrant Vectorstore
Qdrant is a vector database and vector similarity search engine.

 - First we want to create a Qdrant vector store and seed it with some data.

In [9]:
from langchain_community.vectorstores import Qdrant
from qdrant_client import models, QdrantClient
import qdrant_client
import os

In [10]:
# Create the Qdrant client.

# Both lookups raise a KeyError if the variable is not set in the environment.
os.environ['QDRANT_HOST']
os.environ["QDRANT_API_KEY"]

client = qdrant_client.QdrantClient(
    os.getenv("QDRANT_HOST"),
    api_key=os.getenv("QDRANT_API_KEY")
)

In [11]:
# read the file
file="numpy.txt"
data=""

with open(file,'r') as f:
    data = f.read()

### We'll use the naive solution of the CharacterTextSplitter first, which will simply split our text and measure chunk length by number of characters.

In [12]:
from langchain.text_splitter import CharacterTextSplitter

In [13]:
# Split the text into chunks.
def get_chunks(text):
    """Return a list of character-based chunks of `text`."""
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=700,
        # Consecutive chunks share up to 100 characters, so context
        # at a chunk boundary is not lost.
        chunk_overlap=100,
        length_function=len,
    )

    chunks = text_splitter.split_text(text)
    return chunks

In [14]:
# get the chunks for the data
texts=get_chunks(data) 

In [15]:
len(texts)

7

In [13]:
len(data)

3606
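
To see how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of fixed-window character chunking in plain Python. It is a simplification, not LangChain's implementation: `CharacterTextSplitter` first splits on the separator and then merges the pieces, so its chunk boundaries and chunk count can differ from what this sketch produces. The function name `chunk_by_characters` and the synthetic sample text are assumptions made for illustration.

```python
def chunk_by_characters(text, chunk_size=700, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts `step` characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic text with the same length as the file read above (assumption for the demo).
sample = "".join(chr(97 + i % 26) for i in range(3606))
chunks = chunk_by_characters(sample)

# The tail of one chunk is repeated at the head of the next one.
shared = chunks[0][-100:] == chunks[1][:100]
print(len(chunks), shared)
```

With these settings each chunk starts 600 characters after the previous one, because 100 of the 700 characters are shared with the preceding chunk.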

In [None]:
# Create a new collection and name it.

vectors_config=models.VectorParams(
    # The vector size depends on the embedding model.
    # We are using OpenAI embeddings, whose size is 1536.
    size=1536,
    distance=models.Distance.COSINE)

client.create_collection(
    collection_name="numpy",
    vectors_config=vectors_config,
)
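
The collection above is configured with `Distance.COSINE` and a fixed vector `size`. As a rough illustration of what that metric measures, and of why every stored vector must have the same number of dimensions, here is a small pure-Python sketch. It is not Qdrant's implementation; the function name `cosine_similarity` and the short example vectors are assumptions chosen for readability.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score close to 1, regardless of length.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the metric only depends on direction, two chunks about the same topic can score highly even when one is much longer than the other.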

# Embedding Model & API Keys

Now that we've chunked our documents, we'll need to vectorize them and move them into a vector store, a place that associates vectors with text chunks.

We'll be using OpenAI embeddings, which require an OpenAI API key.

In [17]:
from langchain.embeddings.openai import OpenAIEmbeddings

In [18]:
# If we use a different embedding model, the collection's vector size must change to match.

# Raises a KeyError if the API key is not set in the environment.
os.environ["OPENAI_API_KEY"]

embeddings = OpenAIEmbeddings()

vector_store = Qdrant(
    client=client,
    collection_name="numpy",
    embeddings=embeddings,
)


In [19]:
# add chunks to vector store
vector_store.add_texts(texts)

['9784ad6af04d4298a5cd1f0ec0d0ba26',
 'b4e1afd4bca44f47961c46b9f88a48ef',
 'ba781bdaf9a64445ba554c529be18eb0',
 'f7291fa80d034ed29a0802f7bb9e71b3',
 '8a93bb6e9fc64103911eb0655274593b',
 '55c7ac3ec4ed49e6a88412fa5fe17f59',
 '9466a568b2274e8ab93c6dae8a9b0213']

# Retrieval Augmented Generation with LangChain - Simple Implementation
We've built a fully-fledged knowledge base. We'll now implement a simple RAG chain that uses it to improve the model's answers.

# Retriever

Now that we have a vector store, we'll need to connect it to a retriever. Luckily, this is a straightforward process with LangChain!

In [20]:
retriever = vector_store.as_retriever()
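
Conceptually, a retriever takes a query string, embeds it, and returns the stored chunks whose vectors are most similar. The sketch below mimics that idea with an in-memory store in plain Python. Everything in it is hypothetical and for illustration only: `toy_embed` is a word-count stand-in for a real embedding model, `ToyVectorStore` and its `search` method are not LangChain or Qdrant classes, and the vocabulary and texts are made up.

```python
import math

VOCABULARY = ["array", "mask", "matrix", "vector", "network"]  # hypothetical toy vocabulary


def toy_embed(text):
    """Stand-in for an embedding model: count vocabulary words, then scale to unit length."""
    words = text.lower().split()
    counts = [float(words.count(term)) for term in VOCABULARY]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


class ToyVectorStore:
    """Keeps (vector, text) pairs in memory and ranks them against a query."""

    def __init__(self):
        self._entries = []

    def add_texts(self, texts):
        for text in texts:
            self._entries.append((toy_embed(text), text))

    def search(self, query, k=2):
        query_vector = toy_embed(query)
        # For unit-length vectors the dot product equals the cosine similarity.
        scored = [
            (sum(q * v for q, v in zip(query_vector, vector)), text)
            for vector, text in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add_texts([
    "a mask marks invalid entries of an array",
    "a matrix times a vector gives a vector",
    "a network is trained in batches",
])
top = store.search("what does a mask do to an array", k=1)
print(top)
```

A real embedding model captures meaning rather than exact word matches, but the ranking step works on the same principle.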

In [21]:
from langchain.schema.runnable import RunnablePassthrough

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()
     



# With LCEL - Building a chain has never been easier!

In [22]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
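
To build some intuition for how the `|` operator wires these pieces together, here is a toy re-creation in plain Python. It is only a sketch of the composition idea, not how LangChain implements runnables. The `Step` class, the fake retriever, and the fake model are all hypothetical stand-ins, so no API key or network access is needed to run it.

```python
class Step:
    """Wrap a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Run this step first, then feed its result into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


def fake_retriever(question):
    # Hypothetical retrieved context; a real retriever would query the vector store.
    return "Masked arrays combine an ndarray with a mask."


# Build the prompt inputs: retrieved context plus the question passed through unchanged.
gather_inputs = Step(lambda question: {"context": fake_retriever(question), "question": question})
# Fill a prompt template with those inputs.
fill_prompt = Step(lambda inputs: "Context: {context}\nQuestion: {question}".format(**inputs))
# Fake model: returns a message-like dict that echoes the last line of the prompt.
fake_model = Step(lambda prompt_text: {"content": "echo: " + prompt_text.splitlines()[-1]})
# Output parser: pull the plain string out of the message.
parse_output = Step(lambda message: message["content"])

toy_chain = gather_inputs | fill_prompt | fake_model | parse_output
answer = toy_chain.invoke("What is a masked array?")
print(answer)
```

Each stage only needs to accept the previous stage's output, which is what makes it easy to swap in a different prompt, model, or parser.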

### Summary to generate a blog about

This blog is a tutorial about using NumPy to solve static equilibrium problems in three-dimensional space. Readers will learn how to represent points, vectors, and moments with NumPy, find the normal of vectors, and use NumPy for matrix calculations. The tutorial applies Newton's second law to simple examples of force vectors and introduces more complex cases involving reaction forces and moments. The post also discusses the use of NumPy functions in more varied problems, including kinetic problems and different dimensions.

Here are a few questions we can ask our model to generate a response:

 - How to train a simple feed-forward neural network from scratch using NumPy to classify handwritten MNIST digits?
 - What is the concept of masked arrays in NumPy and their usefulness in handling missing or invalid data?
 - Linear algebra on n-dimensional arrays
 - Using NumPy to solve static equilibrium problems in three-dimensional space


In [23]:
chain.invoke("how to train a simple feed-forward neural network from scratch using NumPy to classify handwritten MNIST digits?")

'To train a simple feed-forward neural network from scratch using NumPy to classify handwritten MNIST digits, you need to follow the steps mentioned in the context. \n\n1. Initialize the weights.\n2. Define activation functions and their derivatives.\n3. Implement functions for forward pass and backward pass.\n4. Train the model in batches using Stochastic Gradient Descent (SGD) and update the weights.\n5. Test the model on the validation set.\n6. Predict on the test data and print the accuracy.\n\nBy following these steps, you can build and train a simple feed-forward neural network using NumPy to classify handwritten MNIST digits.'

In [24]:
for chunk in chain.stream("what is concept of masked arrays in NumPy and their usefulness in handling missing or invalid data?"):
    print(chunk, end="", flush=True) 

The concept of masked arrays in NumPy is the combination of a standard ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not. Masked arrays are useful in handling missing or invalid data by allowing the user to easily identify and manipulate these values.

In [25]:
for chunk in chain.stream("Linear algebra on n-dimensional arrays"):
    print(chunk, end="", flush=True) 

Linear algebra on n-dimensional arrays refers to the use of linear algebra operations, such as matrix multiplication and vector addition, on arrays with multiple dimensions. This is particularly relevant in the context of machine learning and deep learning models, which often require large amounts of data for optimal performance. As the amount of data increases, performing operations on individual scalars becomes inefficient, and vectorized or matrix operations are needed to compute efficiently. Linear algebra provides the necessary tools and techniques for manipulating and analyzing n-dimensional arrays.

# Conclusion