## Generate code from the methods section of ML research papers using generative AI and LangChain

Data scientists often consult research papers because they may contain solutions to the problems at hand. However, implementing the steps a paper describes can require a lot of prior knowledge, especially for students. Generative AI can provide a head start by drafting an implementation of the described methods in the reader's programming language of choice. This is often a good stepping stone toward understanding the methodology used to solve the problem.

In [156]:
# import LangChain modules
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings 

# support for dataset retrieval with Hugging Face
from datasets import load_dataset

# CassIO, engine powering Astra DB integration in LangChain
import cassio

from PyPDF2 import PdfReader

from langchain.text_splitter import CharacterTextSplitter

In [157]:
# using OpenAI's model
import os
import openai

# create a .env file that contains your personal OpenAI API key, Astra DB token, and Astra DB ID (obtained from DataStax)

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
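
Before connecting to any service, it can help to confirm that every secret the notebook expects was actually loaded. The following is a minimal sketch: the helper name `missing_env_vars` is hypothetical, and the three variable names are the ones this notebook reads from the environment.

```python
import os

# Names of the secrets this notebook reads from the environment.
REQUIRED_VARS = ("OPENAI_API_KEY", "ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Prints an empty list when everything in the .env file was loaded.
print(missing_env_vars())
```

Failing early with a list of missing names is usually easier to debug than a `KeyError` raised in the middle of a later cell.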

In [158]:
# initialize connection to db
cassio.init(token=os.environ['ASTRA_DB_APPLICATION_TOKEN'], database_id=os.environ['ASTRA_DB_ID'])


In [159]:
# extract raw text from pages that mention the methods section - this limits the embeddings stored in the vector db

def extract_method_from_research(pdf_path):
    """Extracts the methods section from a PDF using a keyword heuristic"""

    pdfreader = PdfReader(pdf_path)

    raw_txt = ''

    for page in pdfreader.pages:
        # fall back to an empty string for pages with no extractable text
        content = page.extract_text() or ''
        lowered = content.lower()

        # keep every page that mentions "methods"; stop at the first page that
        # mentions "results" without mentioning "methods" - not fool-proof
        if "methods" in lowered:
            raw_txt += content
        elif "results" in lowered:
            break

    return raw_txt

# the keyword check is only a heuristic, so inspect raw_txt before relying on it
raw_txt = extract_method_from_research('cv_paper.pdf')
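
The check above keeps only pages that contain the word "methods". An alternative heuristic, sketched below on plain strings so it runs without a PDF, starts collecting at the first page that mentions "methods" and keeps going until a later page mentions "results". The function name `collect_between_keywords` and the sample pages are made up for illustration, and this is still a keyword heuristic rather than real section detection.

```python
def collect_between_keywords(pages, start="methods", stop="results"):
    """Join page texts from the first page mentioning `start` up to,
    but not including, the first later page mentioning `stop`."""
    collected = []
    started = False
    for text in pages:
        lowered = text.lower()
        if not started:
            if start in lowered:
                started = True
                collected.append(text)
        elif stop in lowered:
            break
        else:
            collected.append(text)
    return "\n".join(collected)


# Made-up page texts standing in for page.extract_text() output.
sample_pages = [
    "1 Introduction ...",
    "3 Methods: dense SIFT descriptors ...",
    "sparse coding continues here ...",
    "4 Results ...",
]
print(collect_between_keywords(sample_pages))
```

One trade-off of this variant is that a page where the methods text ends and the results begin is dropped entirely, so the tail of the section can be lost.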

In [160]:
# Create the LangChain embedding and LLM objects

llm = OpenAI()
embedding = OpenAIEmbeddings()

In [161]:
# create langchain vector store backed by Astra DB

astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="pdf_code",
    session=None,
    keyspace=None,
)

In [162]:
# split the raw text into overlapping chunks; the chunks are converted to vectors when added to the store
text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.split_text(raw_txt)
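
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter in plain Python. It is not LangChain's implementation (`CharacterTextSplitter` splits on the separator first and then merges the pieces); the function name `sliding_chunks` is hypothetical, and the sizes are shrunk so the overlap is easy to see.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters, where each window
    repeats the last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.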

In [163]:
texts

['Yuet al. EURASIP Journal on Image and Video Processing 2013, 2013 :52 Page 2 of 10\nhttp://jivp.eurasipjournals.com/content/2013/1/52\ncollection, we selected sequences and species to keep the\ndata balanced. Then, we manually cropped animals from\nall the frames to generate a dataset with 7, 196 images over\n18 different vertebrate species.\n2 Related work\nMost related works are camera-based studies of wildlife\nthat use image analysis to identify individual animals of\nselect species with unique coat patterns (e.g., spots or\nstripes). Bolger et al. [10] applied software to help identify\nindividual animals based on coat patterns for subsequent\nphotographic mark-recapture analysis. The data they used\nwas image based, which is a cost-effective, non-invasive',
 'individual animals based on coat patterns for subsequent\nphotographic mark-recapture analysis. The data they used\nwas image based, which is a cost-effective, non-invasive\nway to study population. The method they used wa

In [164]:
# load the text chunks into the vector store - each chunk is converted to an embedding

astra_vector_store.add_texts(texts)

print(f"Inserted {len(texts)} chunks.")

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 10 chunks.
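
At query time a vector store compares the embedding of the question with the stored chunk embeddings and returns the closest chunks. The sketch below shows that idea with cosine similarity, one common similarity measure, on tiny hand-written vectors. The vectors, the texts, and the `top_k` helper are invented for illustration; the real store uses OpenAI embeddings and Astra DB's own vector search rather than this loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k stored texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs; real embeddings have many more dimensions.
store = [
    ("SIFT key point extraction", [0.9, 0.1, 0.0]),
    ("camera trap deployment", [0.1, 0.9, 0.0]),
    ("sparse coding dictionary", [0.7, 0.0, 0.7]),
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```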


### Run query to obtain a detailed summary of the methods

In [250]:
# Initialize the OpenAI LLM with temperature 0.9 to give the model freedom for creativity.
# Setting max_tokens to -1 lets the response use the remaining context window, which prevents cut-off responses.
# Remove max_tokens to avoid incurring heavy costs from your API.
llm = OpenAI(temperature=0.9, max_tokens=-1)

query_text = """This text corresponds to the methods section of a research paper. Please provide a detailed summary of all the methods being utilized \
chronologically without missing any steps. It is important that the summary is detailed and summarizes each of the methods \
being outlined independently. It is also important that the summary outlines the methods in such a way that they are reproducible."""

summary = astra_vector_index.query(query_text, llm=llm).strip()
print(f"Summary: {summary}")

Summary: 1. The researchers are using a cost-effective and non-invasive method to study population, which involves identifying individual animals based on their coat patterns for subsequent photographic mark-recapture analysis.
2. The data used for the study is image-based.
3. The method used for identifying individual animals is the SIFT key points extraction and matching.
4. This method is specifically focused on identifying strongly marked texture species.
5. The researchers acknowledge that identifying species from remote camera images is a major challenge that has not been addressed.
6. In the community of computer vision, there are various methods for recognizing general objects.
7. One of the most successful methods is Yang's work, which uses ScSPM (Spatial pyramid matching).
8. The ScSPM is applied in the researchers' pattern extraction and classification program.
9. The algorithm first extracts local feature descriptors densely.
10. Two kinds of local descriptors, SIFT and cLB

In [251]:
from langchain.prompts import ChatPromptTemplate # the prompt template
from langchain.chains import LLMChain # LLMChain

In [252]:
template_string = """The text \
that is delimited by triple backticks \
is a summary of the methods utilized in a Machine Learning Paper. \
Provide a step-by-step guide in English on how to code these methods \
in the programming language Python, but limit your responses to the feature extraction, \
do not provide any detail related to training and model building. Each step must be detailed enough \
that anyone should be able to reproduce the methods and must not omit any details or steps required. If for whatever reason, the methods being outlined \
do not offer any codable steps, then respond with 'No codable steps found'. \
summary: ```{summary}```
"""

# prompt template that takes the summary as input

In [253]:
guide_template = ChatPromptTemplate.from_template(template_string)

guide_template.messages[0].prompt

PromptTemplate(input_variables=['summary'], template="The text that is delimited by triple backticks is a summary of the methods utilized in a Machine Learning Paper. Provide a step-by-step guide in English on how to code these methods in the programming language Python, but limit your responses to the feature extraction, do not provide any detail related to training and model building. Each step must be detailed enough that anyone should be able to reproduce the methods and must not omit any details or steps required. If for whatever reason, the methods being outlined do not offer any codable steps, then respond with 'No codable steps found'. summary: ```{summary}```\n")

In [254]:
# confirm the input variable of the template
guide_template.messages[0].prompt.input_variables

['summary']
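
By default, LangChain prompt templates use the same `{name}` placeholder style as Python's `str.format`, so the idea behind `input_variables` can be reproduced with the standard library. This is only a sketch of the concept, not LangChain's code; `placeholder_names` is a hypothetical helper and the template string is a shortened stand-in for the one above.

```python
from string import Formatter


def placeholder_names(template):
    """Return the named fields in a str.format-style template, in order of appearance."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


# Shortened stand-in for the notebook's template string.
template = "Provide a step-by-step guide for this summary: {summary}"

print(placeholder_names(template))
print(template.format(summary="extract SIFT key points"))
```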

In [255]:
# will generate prompt from template string above
chain = LLMChain(llm=llm, prompt=guide_template)
print(chain.run(summary))


Step 1: Prepare the images
- Obtain a dataset of images of the population being studied
- Resize the images to a standard size
- Pre-process the images for feature extraction (e.g. convert to grayscale)

Step 2: Extract SIFT key points
- Use OpenCV library to extract SIFT (Scale-Invariant Feature Transform) key points from the images
- These key points represent distinctive areas of an image that can be used for identification
- Save the extracted key points for each image

Step 3: Match key points
- Use a matching algorithm (e.g. FLANN) to match the key points between different images
- This allows for identification of individual animals based on their unique coat patterns

Step 4: Identify strongly marked texture species
- Use a filtering method to select only images with strongly marked textures
- This will help improve the accuracy of identification

Step 5: Address the challenge of identifying species from remote camera images
- Research and understand various methods for recogn
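
The prompt asks the model to reply with 'No codable steps found' when nothing can be coded, so it is worth checking for that reply before using the guide any further. A minimal sketch follows; `has_codable_steps` is a hypothetical helper, and the comparison is deliberately loose because the model may add punctuation or change capitalization.

```python
SENTINEL = "no codable steps found"


def has_codable_steps(guide):
    """Return False when the guide is empty or contains the sentinel phrase (hypothetical helper)."""
    text = guide.strip().lower()
    return bool(text) and SENTINEL not in text


print(has_codable_steps("Step 1: Prepare the images"))  # True
print(has_codable_steps("No codable steps found."))     # False
```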

In [256]:
from langchain.chains import SimpleSequentialChain

# prompt template 2
code_prompt = ChatPromptTemplate.from_template(
    """The text \
       that is delimited by triple backticks \
       outlines a step-by-step guide to implement methods outlined in a Machine Learning research paper in Python. \
       Please follow each step and write runnable code in Python. The code must work!
        ```{guide}```"""
)

# Chain 2
chain_two = LLMChain(llm=llm, prompt=code_prompt)

In [257]:
simple_code_chain = SimpleSequentialChain(chains=[chain, chain_two],
                                             verbose=False)
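
`SimpleSequentialChain` feeds the single output of each chain into the next chain as its only input. The plain-Python sketch below mimics that flow with two stand-in functions; `run_sequential`, `fake_guide_step`, and `fake_code_step` are placeholders invented here, and no LLM is called.

```python
def run_sequential(steps, first_input):
    """Pass a value through each step in order, feeding every output to the next step."""
    value = first_input
    for step in steps:
        value = step(value)
    return value


def fake_guide_step(summary):
    # Stand-in for the first chain: summary -> step-by-step guide.
    return f"Step 1: implement '{summary}'"


def fake_code_step(guide):
    # Stand-in for the second chain: guide -> code as text.
    return f"# {guide}\npass"


result = run_sequential([fake_guide_step, fake_code_step], "extract SIFT key points")
print(result)
```

Because each step only sees the previous step's output, anything the second prompt needs from the original summary has to survive in the guide text.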

In [258]:
print(simple_code_chain.run(summary))



#Import necessary libraries
import cv2
import numpy as np

#Load the image data
img = cv2.imread('coat_pattern.jpg')

#Convert image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

#Use the SIFT algorithm to extract and match key points
sift = cv2.SIFT.create()
keypoints = sift.detect(gray, None)

#Apply a filter to SIFT results
strong_keypoints = [keypoint for keypoint in keypoints if keypoint.response > 100]

#Implement the ScSPM algorithm
yang = cv2.ScSPM.create()
keypoints = yang.detect(gray, None)

#Extract local feature descriptors densely
sift_desc = sift.compute(gray, strong_keypoints)
clbp_desc = clbp.compute(gray, strong_keypoints)

#Combine SIFT and cLBP descriptors
descriptors = np.hstack((sift_desc, clbp_desc))

#Learn a dictionary using weighted sparse coding
dictionary = cv2.ml.ml.SparseCoder_create()
dictionary.setCodeBook(descriptors)

#Use max pooling with SPM to construct a global image feature
global_feature = dictionary.project(descriptors)

#Classify 
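
Generated code like the output above should be checked before it is run; for example, it uses `clbp` without ever defining it. One cheap first check, sketched here with the standard-library `ast` module, is to parse the text and list names that are read but never assigned or imported. `undefined_names` is a hypothetical helper, it ignores scoping rules inside functions and classes, and passing this check does not show that a library call actually exists.

```python
import ast
import builtins


def undefined_names(source):
    """List names that are read but never bound anywhere in `source` (rough, scope-unaware check)."""
    tree = ast.parse(source)  # raises SyntaxError if the text is not valid Python
    bound = set(dir(builtins))
    used = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.append(node.id)
            else:
                bound.add(node.id)
    return sorted({name for name in used if name not in bound})


# Shortened, adapted excerpt of the generated output; it is only parsed,
# never executed, so cv2 does not need to be installed.
generated = """
import cv2
gray = cv2.imread('coat_pattern.jpg')
sift = cv2.SIFT.create()
clbp_desc = clbp.compute(gray, [])
"""
print(undefined_names(generated))  # ['clbp']
```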