# Simple RAG Pipeline for STEM OPT Document
This notebook demonstrates a simple Retrieval-Augmented Generation (RAG) workflow using LangChain, FAISS, and LLMs to answer questions about the STEM OPT extension process. Each section is annotated for clarity.

In [35]:
from bs4 import BeautifulSoup

from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_ollama import OllamaLLM
from langchain_openai import ChatOpenAI

from dotenv import load_dotenv
import os


## Load and Parse the STEM OPT HTML Document
Read the USCIS STEM OPT extension HTML file and extract its content using BeautifulSoup for further processing.

In [36]:

with open("../data/documents/Optional Practical Training Extension for STEM Students (STEM OPT) _ USCIS.html", "r", encoding="utf-8") as f:
    html = f.read()

# Parse the HTML into a searchable tree; the text is extracted in the next cells
soup = BeautifulSoup(html, "html.parser")

### Extract Relevant Content
Identify and extract the main content panels from the HTML using their CSS class.

In [37]:
panels = soup.find_all(class_="accordion__panel")

In [38]:
texts = [panel.get_text(separator="\n", strip=True) for panel in panels]
combined_text = "\n\n".join(texts)
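
Text pulled from HTML often carries non-breaking spaces, which appear as `\xa0` in the printed paragraph list further down. A minimal cleanup sketch follows, assuming a hypothetical helper named `normalize_whitespace` that is not part of the original notebook. It could be applied to `combined_text` before chunking.

```python
import re

def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse runs of spaces or tabs.

    Hypothetical helper for illustration. Newlines are left untouched so
    the paragraph separators used for chunking survive.
    """
    text = text.replace("\xa0", " ")
    return re.sub(r"[ \t]+", " ", text)

print(normalize_whitespace("a valid period\xa0of post-completion OPT"))
```

Whether this step is worth adding depends on the embedding model; it mainly keeps the chunks and the final prompt easier to read.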

### Chunk the Extracted Text
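Split the combined text into smaller chunks so that each piece fits within the input limits of the embedding model and the LLM. The helper below greedily packs paragraphs into chunks. Note that it compares `max_tokens` against character counts (`len`), so the limit is a character budget that only approximates tokens.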

In [39]:
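def chunk_text(text, max_tokens=500, separator="\n\n"):
    """Greedily pack paragraphs into chunks.

    Note: despite its name, `max_tokens` is compared against character
    counts (`len`), so it acts as a character budget, not a token budget.
    """
    # Split by paragraphs (double newlines)
    paragraphs = text.split(separator)
    print(paragraphs)  # debug output: shows the paragraphs before packing

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        # Keep appending while the running chunk stays under the budget
        if len(current_chunk) + len(para) < max_tokens:
            current_chunk += para + separator
        else:
            # Budget exceeded: close the current chunk and start a new one.
            # Caveat: if the very first paragraph is already over budget,
            # this appends an empty string as the first chunk.
            chunks.append(current_chunk.strip())
            current_chunk = para + separator

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks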

In [40]:
chunks = chunk_text(combined_text, max_tokens=200)  # adjust size as needed

['To qualify for the 24-month extension, you must:\nHave been granted OPT and currently be in a valid period\xa0of post-completion OPT;\nHave earned a bachelor’s, master’s, or doctoral degree from a school that is accredited by a U.S. Department of Education-recognized accrediting agency and is certified by\nthe Student and Exchange Visitor Program (SEVP)\nat the time you submit your STEM OPT extension application.\nNOTE: Previously obtained STEM degrees\n: If you are an F-1 student participating in a 12-month period of post-completion OPT based on a non-STEM degree, you may be eligible to use a previous STEM degree from a U.S. institution of higher education to apply for a STEM OPT extension. You must have received both degrees from currently accredited and SEVP-certified institutions, and cannot have already received a STEM OPT extension based on this previous degree. The practical training opportunity also must be directly related to the previously obtained STEM degree.\nFor example

In [41]:
print(f"Created {len(chunks)} chunks.")

Created 5 chunks.
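
Two behaviors of `chunk_text` are worth knowing: its limit counts characters rather than tokens, and if the first paragraph is already longer than the limit, it emits an empty string as the first chunk. The variant below is a sketch of one way to avoid the empty chunk, not the notebook's method. The name `chunk_paragraphs` and the parameter `max_chars` are assumptions introduced here.

```python
def chunk_paragraphs(text, max_chars=500, separator="\n\n"):
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is kept whole as its own chunk.
    Empty paragraphs are skipped, so no empty chunk is ever returned.
    """
    chunks = []
    current = ""
    for para in text.split(separator):
        para = para.strip()
        if not para:
            continue
        # Length the chunk would have if this paragraph were added to it
        candidate = f"{current}{separator}{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

sample = "short intro\n\n" + "x" * 30 + "\n\nshort outro"
print(chunk_paragraphs(sample, max_chars=20))
```

Swapping this in for `chunk_text` would likely change the number of chunks reported above, so the recorded outputs in this notebook would need to be regenerated.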


### Create Document Objects
Convert each text chunk into a LangChain `Document` object for downstream processing.

In [42]:
documents = [Document(page_content=chunk) for chunk in chunks]

### Embed Documents and Build Vector Store
Generate embeddings for each document chunk and store them in a FAISS vector database for efficient retrieval.

In [43]:
embedding_model = HuggingFaceEmbeddings(model_name="multi-qa-MiniLM-L6-cos-v1")

In [44]:
vectorstore = FAISS.from_documents(documents, embedding_model)

### Retrieve Relevant Chunks for a Query
Set up a retriever from the vector store and use it to fetch the most relevant document chunks for a sample user query.

In [45]:
retriever = vectorstore.as_retriever()
query = "what's the process for applying OPT?"


In [46]:
retrieved_docs = retriever.invoke(query)

In [47]:
print(f"{len(retrieved_docs)} docs retrieved")

4 docs retrieved
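
Conceptually, retrieval ranks the stored chunk embeddings by how close they are to the query embedding and returns the best matches. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity. It is not how FAISS is implemented, and the vectors, `cosine_similarity`, and `top_k` are made up for illustration; real embeddings come from the embedding model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Toy "embeddings": chunks 0 and 1 point roughly the same way as the query.
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.05, 0.0]

print(top_k(toy_query, toy_chunks, k=2))  # prints [0, 1]
```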


### Prepare Retrieved Content for LLM
Combine the retrieved document chunks into a single context string to be used as input for the language model.

In [48]:
combined_docs_text = "\n\n".join([doc.page_content for doc in retrieved_docs])

### Construct the Prompt for the LLM
Create a prompt template that provides the retrieved context and user query to the language model for answer generation.

In [49]:
prompt_template = """
You are an expert assistant. Use the context below to answer the user's question.
Do NOT include any internal thoughts or explanations.

Context:
{documents}

User Question:
{query}

Answer:
"""

In [50]:
prompt = PromptTemplate(
    input_variables=["documents", "query"],
    template=prompt_template
)
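
To see what the model actually receives, it can help to render the filled-in prompt before sending it. The sketch below uses plain `str.format` as a stand-in for the substitution that `PromptTemplate` performs; the shortened template and the helper name `render_prompt` are assumptions for illustration.

```python
# Same placeholders as the notebook's template, shortened for the example.
template = "Context:\n{documents}\n\nUser Question:\n{query}\n\nAnswer:"

def render_prompt(template, chunks, query):
    """Join chunks with blank lines and fill the two placeholders."""
    documents = "\n\n".join(chunks)
    return template.format(documents=documents, query=query)

rendered = render_prompt(template, ["chunk A", "chunk B"], "What is STEM OPT?")
print(rendered)
```

Printing the rendered prompt is a quick way to check that the retrieved context is present and not excessively long.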

### Set Up LLMs and Chains
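Instantiate a local model served by Ollama and an online model hosted by Together. The online model is reached through `ChatOpenAI` pointed at Together's OpenAI-compatible endpoint, with the API key read from a `.env` file.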

In [51]:
llama3 = OllamaLLM(model="llama3")

In [52]:
load_dotenv()

api_key = os.getenv("TOGETHER_API_KEY")

online_llm = ChatOpenAI(
    model="meta-llama/Llama-Vision-Free",
    api_key=api_key,
    base_url="https://api.together.xyz/v1",
    temperature=0
)
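
`os.getenv` returns `None` when the variable is missing, which tends to surface later as a less obvious authentication error. A small guard, sketched here with a hypothetical helper named `require_env`, fails early with a clearer message. The variable `EXAMPLE_KEY` exists only for the demonstration.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value

# Demonstration with a throwaway variable.
os.environ["EXAMPLE_KEY"] = "dummy-value"
print(require_env("EXAMPLE_KEY"))
```

In this notebook the call would be `api_key = require_env("TOGETHER_API_KEY")`, placed after `load_dotenv()`.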

### Generate Answers Using LLMs
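Pipe the prompt into each LLM to form a chain, then invoke it with the retrieved context and the user query. Only the online chain is run here; the local Llama 3 chain is left commented out because, per the notes in the cells, it took about 10 minutes compared with about 2 seconds for the online model.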

In [53]:
online_llm_chain = prompt | online_llm


# The online model took about 2 s to generate the answer
online_llm_summary = online_llm_chain.invoke(
    {
        "documents": combined_docs_text,
        "query": query
    }
).content

In [54]:
# llama3_chain = prompt | llama3

# # Local model: this took about 10 min to generate the answer
# llama3_summary = llama3_chain.invoke(
#     {
#         "documents": combined_docs_text,
#         "query": query
#     }
# )
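
The timings quoted in the comments above can be measured rather than estimated. A minimal sketch, assuming a hypothetical helper named `timed`, wraps any call with `time.perf_counter`:

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Demonstration with a cheap built-in call.
result, seconds = timed(sum, [1, 2, 3])
print(f"result={result}, took {seconds:.4f} s")
```

Applied to this notebook, the call would look like `timed(online_llm_chain.invoke, {"documents": combined_docs_text, "query": query})`.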

In [55]:
# no_rag_summary = online_llm.invoke(query).content

### Output and Compare Results
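Display the generated answer. Only the online model's answer is printed; the print statements for the no-RAG baseline and the local Llama 3 model are commented out, matching the cells above.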

In [56]:
# print(no_rag_summary)

In [57]:
print(online_llm_summary)

To apply for an OPT (Optional Practical Training) extension, you must properly file:

1. Form I-765 with:
	* The correct application fee
	* Your employer's name as listed in E-Verify
	* Your employer's E-Verify Company Identification Number or valid E-Verify Client Company Identification Number
2. Form I-20, Certificate of Eligibility for Nonimmigrant Student Status, endorsed by your DSO within the last 60 days
3. A copy of your STEM degree

Additionally, if you file your STEM OPT extension application on time and your OPT period expires while your extension application is pending, you will automatically receive an 180-day extension of your employment authorization.


In [58]:
# print(llama3_summary)