# Summary of Steps

1. Load the environment variables
2. Load the PDF document
3. Split the document into chunks
4. Convert the chunks to embeddings and store them in a vector database
5. Explore similarity search and the retriever
6. Chain with LCEL, with input and output formatters


## Step 1 - Load the environment variables

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]
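
Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing, which can be confusing if the `.env` file was not found. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    value = os.environ.get(name)
    if not value:
        # Covers both "not set" and "set to an empty string".
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in the shell."
        )
    return value
```

It could be called as `openai_api_key = require_env("OPENAI_API_KEY")` right after `load_dotenv(find_dotenv())`.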


## Step 2 - Load the PDF document

In [3]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./data/A_Convolution_Neural_Network_for_Plant_Species_Identification_report.pdf"

pdf_loader = PyPDFLoader(file_path)

docs = pdf_loader.load()

# Test print to check the content
print(len(docs))
print(docs[0].page_content[0:100])
print(docs[0].metadata)

5
 A Convolution Neural Network for Plant Species 
Identification  
Shiang Jin, Chin   
Khoury College
{'source': './data/A_Convolution_Neural_Network_for_Plant_Species_Identification_report.pdf', 'page': 0}


## Step 3 - Split the document into chunks

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

splits = text_splitter.split_documents(docs)
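
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraphs and line breaks, which this sketch ignores. The function name `chunk_text` and the sample string are made up for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration only; it does not look for natural separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.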

In [6]:
# Inspect the type, length, and first element of splits
print(type(splits))
print(len(splits))
print(splits[0])

<class 'list'>
28
page_content='A Convolution Neural Network for Plant Species 
Identification  
Shiang Jin, Chin   
Khoury College of Computer Science  
Northeastern University  
Seattle, WA  
chin.shi@northeastern.edu     
Abstract . In this paper , I describe the training of machine 
learning models based on convolution neural network 
architecture for the image classification task of identifying plant 
species based on the leaves.  Throughout the paper, I document 
the various experiments with different model architectures and 
their corresponding results.  The best -performing model uses a 
pre-trained ResNet -50 model fine-tuned to classify plant species 
based on leaves. The model performs better and computes faster 
on the Leafsnap dataset compared to the recognition system 
developed before.  
Keywords —convolution neural network , ResNet -50, Leafsnap 
database   
I. INTRODUCTION  
In this work, I describe the experiments carried out to train 
a machine learning model based o

## Step 4 - Convert the chunks into embeddings and store them in a vector DB

In [9]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
embeddings_model = OpenAIEmbeddings()

# Embed the chunks and persist them in a Chroma vector store in one call
vector_db = Chroma.from_documents(documents=splits, embedding=embeddings_model, persist_directory="./data/test_wiki_db")

## Step 5-1 - Explore similarity search of Vector_db

`similarity_search` embeds the question and returns the stored chunks whose embeddings are closest to it, which makes it a quick way to retrieve the information most relevant to the question.
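
Conceptually, a similarity search scores every stored vector against the query vector and keeps the best matches. The sketch below illustrates that idea in plain Python with made-up three-dimensional vectors. Ranking by cosine similarity is an assumption made for this example; the metric a vector store actually uses depends on its configuration, and real embeddings have far more dimensions. The names `cosine_similarity`, `top_k`, `toy_store`, and `toy_query` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          stored: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings.
toy_store = [
    ("chunk about CNN training", [0.9, 0.1, 0.0]),
    ("chunk about the dataset", [0.1, 0.9, 0.1]),
    ("chunk about references", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.3, 0.0]

results = top_k(toy_query, toy_store, k=2)
print(results)
# ['chunk about CNN training', 'chunk about the dataset']
```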

In [11]:
question = "How is convolution neural network used by the author?"
response = vector_db.similarity_search(question)
print(response[0].page_content)

A Convolution Neural Network for Plant Species 
Identification  
Shiang Jin, Chin   
Khoury College of Computer Science  
Northeastern University  
Seattle, WA  
chin.shi@northeastern.edu     
Abstract . In this paper , I describe the training of machine 
learning models based on convolution neural network 
architecture for the image classification task of identifying plant 
species based on the leaves.  Throughout the paper, I document 
the various experiments with different model architectures and 
their corresponding results.  The best -performing model uses a 
pre-trained ResNet -50 model fine-tuned to classify plant species 
based on leaves. The model performs better and computes faster 
on the Leafsnap dataset compared to the recognition system 
developed before.  
Keywords —convolution neural network , ResNet -50, Leafsnap 
database   
I. INTRODUCTION  
In this work, I describe the experiments carried out to train 
a machine learning model based on a Convolution Neural


## Step 5-2 - Explore Vector_db as a retriever
A retriever wraps the vector store behind a standard `invoke` interface, so it can be chained with other components to build more complex applications.

In [12]:
retriever = vector_db.as_retriever()
response = retriever.invoke(question)
print(response)

[Document(metadata={'page': 0, 'source': './data/A_Convolution_Neural_Network_for_Plant_Species_Identification_report.pdf'}, page_content='A Convolution Neural Network for Plant Species \nIdentification  \nShiang Jin, Chin   \nKhoury College of Computer Science  \nNortheastern University  \nSeattle, WA  \nchin.shi@northeastern.edu     \nAbstract . In this paper , I describe the training of machine \nlearning models based on convolution neural network \narchitecture for the image classification task of identifying plant \nspecies based on the leaves.  Throughout the paper, I document \nthe various experiments with different model architectures and \ntheir corresponding results.  The best -performing model uses a \npre-trained ResNet -50 model fine-tuned to classify plant species \nbased on leaves. The model performs better and computes faster \non the Leafsnap dataset compared to the recognition system \ndeveloped before.  \nKeywords —convolution neural network , ResNet -50, Leafsnap \n

## Step 6 - Chain with LCEL, input and output formatters

In [14]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Helper function that joins the retrieved documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Prompt that tells the model to answer using only the retrieved context
prompt = ChatPromptTemplate.from_template(
    """
    Answer the question based only on the context provided.
    Context: {context}
    Question: {question}
    """
)
model = ChatOpenAI(model="gpt-4o-mini")
# Piping the retriever into format_docs fills {context} with the text
# of the chunks retrieved for the question
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
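
The `|` syntax above can look like magic at first. The toy model below is a sketch of the general idea only, not LangChain's actual implementation: each step wraps a function, `|` composes steps left to right, and a dict on the left fans the same input out to every branch. `Step`, `as_step`, and the fake components are hypothetical names invented for this illustration.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # self | other: run self first, then feed its output to other.
        nxt = as_step(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def __ror__(self, other):
        # Lets a plain dict or function appear on the left of `|`.
        return as_step(other) | self


def as_step(obj):
    """Coerce a Step, dict of steps, or plain callable into a Step."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        # Fan-out: every branch receives the same input value.
        branches = {key: as_step(val) for key, val in obj.items()}
        return Step(lambda value: {k: b.invoke(value) for k, b in branches.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"Cannot turn {type(obj).__name__} into a Step")


def join_docs(docs):
    return " + ".join(docs)


# Fake components standing in for the retriever, passthrough, and prompt.
fake_retriever = Step(lambda query: ["doc one", "doc two"])
passthrough = Step(lambda value: value)
fake_prompt = Step(lambda d: f"Context: {d['context']} | Question: {d['question']}")

toy_chain = {"context": fake_retriever | join_docs, "question": passthrough} | fake_prompt
result = toy_chain.invoke("What is this?")
print(result)
# Context: doc one + doc two | Question: What is this?
```

In this toy model the question reaches both branches of the dict: one branch turns it into joined document text, the other passes it through unchanged, and the prompt step receives both under the keys it expects.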

In [15]:
# Test the response
response = chain.invoke(question)

In [16]:
print(response)

The author uses a Convolution Neural Network (CNN) to classify plant species based on images of their leaves. The CNN automatically identifies relevant features and filters needed for feature extraction without human supervision. The author describes various experiments with different model architectures, ultimately finding that a fine-tuned pre-trained ResNet-50 model performs best for this task, achieving higher accuracy and faster computation on the Leafsnap dataset compared to a previous four-step process used by the Leafsnap system. The CNN model optimizes the classification process and offers fixed computation time once trained.
