# Simple RAG Pipeline for STEM OPT Document
This notebook demonstrates a simple Retrieval-Augmented Generation (RAG) workflow using LangChain, FAISS, and LLMs to answer questions about the STEM OPT extension process. Each section is annotated for clarity.

In [None]:
from bs4 import BeautifulSoup
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms.ollama import Ollama
from langchain.chat_models import ChatOpenAI


## Load and Parse the STEM OPT HTML Document
Read the USCIS STEM OPT extension HTML file and extract its content using BeautifulSoup for further processing.

In [2]:

with open("../data/documents/Optional Practical Training Extension for STEM Students (STEM OPT) _ USCIS.html", "r", encoding="utf-8") as f:
    html = f.read()

# Parse HTML and extract text
soup = BeautifulSoup(html, "html.parser")

### Extract Relevant Content
Identify and extract the main content panels from the HTML using their CSS class.

In [3]:
panels = soup.find_all(class_="accordion__panel")

In [5]:
texts = [panel.get_text(separator="\n", strip=True) for panel in panels]
combined_text = "\n\n".join(texts)
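
As a minimal sketch of what the two cells above do, the same calls can be tried on a small inline HTML string. The markup below is made up for illustration and is not taken from the USCIS page; only the `accordion__panel` class name comes from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two panels we want and one unrelated block.
sample_html = """
<div class="accordion__panel"><h3>Eligibility</h3><p>Hold a STEM degree.</p></div>
<div class="sidebar"><p>Unrelated navigation</p></div>
<div class="accordion__panel"><p>File Form I-765.</p></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only elements carrying the target CSS class.
sample_panels = sample_soup.find_all(class_="accordion__panel")

# Join the text fragments of each panel with newlines, trimming whitespace.
sample_texts = [p.get_text(separator="\n", strip=True) for p in sample_panels]
print(sample_texts)
```

The sidebar block is dropped because it lacks the class, which is why filtering by class is a cheap way to skip navigation and footer text before chunking.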

### Chunk the Extracted Text
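Split the combined text into smaller chunks so that each piece stays within the input limits of embedding models and LLMs. Note that `chunk_text` measures chunk size in characters (via `len`), not tokens, so `max_tokens` is only a rough proxy for a token budget.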

In [None]:
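def chunk_text(text, max_tokens=500, separator="\n\n"):
    """Greedily pack paragraphs into chunks.

    Despite its name, max_tokens is compared against character counts,
    so it acts as a character budget rather than a true token limit.
    """
    # Split by paragraphs (double newlines by default)
    paragraphs = text.split(separator)
    print(paragraphs)  # debug output: inspect the paragraphs before chunking

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        if len(current_chunk) + len(para) < max_tokens:
            current_chunk += para + separator
        else:
            # Flush the current chunk, skipping it if it is still empty
            # (this happens when the very first paragraph exceeds the budget)
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = para + separator

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks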

In [7]:
chunks = chunk_text(combined_text, max_tokens=200)  # adjust size as needed
print(f"Created {len(chunks)} chunks.")

['To qualify for the 24-month extension, you must:\nHave been granted OPT and currently be in a valid period\xa0of post-completion OPT;\nHave earned a bachelor’s, master’s, or doctoral degree from a school that is accredited by a U.S. Department of Education-recognized accrediting agency and is certified by\nthe Student and Exchange Visitor Program (SEVP)\nat the time you submit your STEM OPT extension application.\nNOTE: Previously obtained STEM degrees\n: If you are an F-1 student participating in a 12-month period of post-completion OPT based on a non-STEM degree, you may be eligible to use a previous STEM degree from a U.S. institution of higher education to apply for a STEM OPT extension. You must have received both degrees from currently accredited and SEVP-certified institutions, and cannot have already received a STEM OPT extension based on this previous degree. The practical training opportunity also must be directly related to the previously obtained STEM degree.\nFor example
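
The chunker above sizes chunks by character count. As a hedged alternative, the sketch below budgets by whitespace-separated words, which is a somewhat closer (though still approximate) stand-in for tokens. The name `chunk_by_words` and its `max_words` parameter are hypothetical and not part of the notebook.

```python
def chunk_by_words(text, max_words=50, separator="\n\n"):
    """Greedily pack paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole, not split.
    """
    chunks, current, current_len = [], [], 0
    for para in text.split(separator):
        n_words = len(para.split())
        if n_words == 0:
            continue  # skip empty paragraphs
        # Flush the running chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(separator.join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
print(chunk_by_words(sample, max_words=5))
# → ['one two three\n\nfour five', 'six seven eight nine']
```

For an exact token budget, the counting step would have to use the tokenizer of the embedding model in question; word counts are only an approximation.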

### Create Document Objects
Convert each text chunk into a LangChain `Document` object for downstream processing.

In [8]:
documents = [Document(page_content=chunk) for chunk in chunks]

### Embed Documents and Build Vector Store
Generate embeddings for each document chunk and store them in a FAISS vector database for efficient retrieval.

In [None]:
embedding_model = HuggingFaceEmbeddings(model_name="multi-qa-MiniLM-L6-cos-v1")



In [9]:
vectorstore = FAISS.from_documents(documents, embedding_model)
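
Conceptually, the vector store maps each chunk to a vector and later looks up the vectors closest to a query vector. The sketch below illustrates that idea with a brute-force cosine-similarity search in plain Python. The three-dimensional vectors are invented for illustration, and the helper names are hypothetical; real embeddings have hundreds of dimensions, and the metric FAISS actually uses depends on how the index is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy stand-ins for chunk embeddings and a query embedding (made-up values).
toy_doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query_vec = [0.9, 0.1, 0.0]

print(top_k(toy_query_vec, toy_doc_vecs, k=2))
# → [0, 2]
```

A dedicated index such as FAISS exists because this linear scan becomes slow once the number of chunks grows large.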

### Retrieve Relevant Chunks for a Query
Set up a retriever from the vector store and use it to fetch the most relevant document chunks for a sample user query.

In [10]:
retriever = vectorstore.as_retriever()
query = "what's the process for applying OPT?"


In [11]:
retrieved_docs = retriever.get_relevant_documents(query)



In [24]:
retrieved_docs

[Document(id='00e8f54f-ded3-407f-96bf-51c63b19c042', metadata={}, page_content='Student Reporting Responsibilities\nIf you receive a STEM OPT extension, you must:\nReport changes to the following information to your DSO within 10 days of the change, specifically:\nYour legal name;\nYour residential or mailing address;\nYour email address;\nYour employer’s name; and\nYour employer’s address.\nReport to your DSO every 6 months to confirm the information listed above, even if none of your information has changed.\nFor more information, please refer to the\nUSCIS Policy Manual\nand the\nDHS STEM OPT Hub\n.\nUnemployment during the OPT Period\nYou may be unemployed during your OPT period for a limited number of days.\nIf you received…\nYou may be unemployed for…\nFor a total of…(during the OPT period)\nInitial post-completion OPT only\nUp to 90 days\n90 days\n24-month extension\nAn additional 60 days\n150 days'),
 Document(id='fd28ee4c-39ec-4fb4-a3e1-422137bd0530', metadata={}, page_content

### Prepare Retrieved Content for LLM
Combine the retrieved document chunks into a single context string to be used as input for the language model.

In [12]:
combined_docs_text = "\n\n".join([doc.page_content for doc in retrieved_docs])

In [13]:
combined_docs_text

'Student Reporting Responsibilities\nIf you receive a STEM OPT extension, you must:\nReport changes to the following information to your DSO within 10 days of the change, specifically:\nYour legal name;\nYour residential or mailing address;\nYour email address;\nYour employer’s name; and\nYour employer’s address.\nReport to your DSO every 6 months to confirm the information listed above, even if none of your information has changed.\nFor more information, please refer to the\nUSCIS Policy Manual\nand the\nDHS STEM OPT Hub\n.\nUnemployment during the OPT Period\nYou may be unemployed during your OPT period for a limited number of days.\nIf you received…\nYou may be unemployed for…\nFor a total of…(during the OPT period)\nInitial post-completion OPT only\nUp to 90 days\n90 days\n24-month extension\nAn additional 60 days\n150 days\n\nTo qualify for the 24-month extension, you must:\nHave been granted OPT and currently be in a valid period\xa0of post-completion OPT;\nHave earned a bachelor
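
Joining every retrieved chunk can produce a context longer than the model accepts. One possible safeguard, sketched here under the assumption that a simple character budget is good enough, is to add passages in rank order and stop before the budget is exceeded. `build_context` and `max_chars` are hypothetical names, not part of the notebook.

```python
def build_context(passages, max_chars=2000, separator="\n\n"):
    """Join passages in rank order, stopping before max_chars is exceeded."""
    selected, used = [], 0
    for text in passages:
        # The separator only costs characters between passages, not before the first.
        extra = len(text) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(text)
        used += extra
    return separator.join(selected)


passages = ["a" * 10, "b" * 10, "c" * 10]
context = build_context(passages, max_chars=25)
print(len(context))
# → 22
```

With the notebook's objects, the input would be `[doc.page_content for doc in retrieved_docs]`.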

### Construct the Prompt for the LLM
Create a prompt template that provides the retrieved context and user query to the language model for answer generation.

In [26]:
prompt_template = """
You are an expert assistant. Use the context below to answer the user's question.
Do NOT include any internal thoughts or explanations.

Context:
{documents}

User Question:
{query}

Answer:
"""

In [27]:
prompt = PromptTemplate(
    input_variables=["documents", "query"],
    template=prompt_template
)
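
To see what the model will actually receive, it can help to fill the placeholders by hand. The sketch below uses plain `str.format` on a shortened copy of the template, which mirrors the `{documents}` and `{query}` substitution; the context and question strings are made up for illustration.

```python
# Shortened copy of the notebook's template, with the same two placeholders.
template = (
    "Use the context below to answer the user's question.\n\n"
    "Context:\n{documents}\n\n"
    "User Question:\n{query}\n\n"
    "Answer:\n"
)

# Hypothetical inputs standing in for the retrieved context and the user query.
filled = template.format(
    documents="Report changes to your DSO within 10 days.",
    query="How quickly must I report an address change?",
)
print(filled)
```

Printing the filled prompt before sending it is a quick way to catch an empty context or a misspelled placeholder name.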

### Set Up LLMs and Chains
Instantiate two language models: a local one (Llama 3 served by Ollama) and a hosted one (accessed through Together's OpenAI-compatible API). The chains that use them are built in the next section.

In [31]:
llama3 = Ollama(model="llama3")



In [32]:
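import os

# Read the Together API key from the environment instead of hardcoding it.
# TOGETHER_API_KEY is an assumed variable name; use whichever name you export.
api_key = os.environ["TOGETHER_API_KEY"]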

online_llm = ChatOpenAI(
    model="meta-llama/Llama-Vision-Free",
    openai_api_key=api_key,
    openai_api_base="https://api.together.xyz/v1",
    temperature=0
)

### Generate Answers Using LLMs
Run both the online and local LLM chains to generate answers to the user query based on the retrieved context.

In [None]:
online_llm_chain = LLMChain(llm=online_llm, prompt=prompt)

# The online model took about 2 s to generate its answer
online_llm_summary = online_llm_chain.run(documents=combined_docs_text, query=query)

In [21]:
llama3_chain = LLMChain(llm=llama3, prompt=prompt)

# The local model took about 7 min to generate its answer
llama3_summary = llama3_chain.run(documents=combined_docs_text, query=query)
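
The timings noted in the comments above can be measured rather than estimated. The following is a minimal sketch of a timing wrapper built on `time.perf_counter`; `timed` is a hypothetical helper, and `fake_llm` is a stand-in for a chain's `run` method so that the example runs without any model.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_llm(documents, query):
    """Stand-in for an LLM call; it only echoes the query."""
    return f"Answer to: {query}"


answer, seconds = timed(fake_llm, documents="...", query="example question")
print(f"{answer!r} took {seconds:.4f} s")
```

In the notebook, the same wrapper could be applied as `timed(llama3_chain.run, documents=combined_docs_text, query=query)`.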

### Output and Compare Results
Display the answers generated by both the online and local LLMs for comparison.

In [33]:
print(online_llm_summary)

To apply for an OPT (Optional Practical Training) extension, you must properly file:

1. Form I-765 with:
	* The correct application fee
	* Your employer's name as listed in E-Verify
	* Your employer's E-Verify Company Identification Number or valid E-Verify Client Company Identification Number
2. Form I-20, Certificate of Eligibility for Nonimmigrant Student Status, endorsed by your DSO within the last 60 days
3. A copy of your STEM degree

Additionally, if you file your STEM OPT extension application on time and your OPT period expires while your extension application is pending, you will automatically receive an 180-day extension of your employment authorization.


In [23]:
print(llama3_summary)

To apply for an OPT (Optional Practical Training), you must properly file:

1. Form I-765 with:
	* The correct application fee;
	* Your employer's name as listed in E-Verify, and
	* Your employer's E-Verification Company Identification Number or valid E-Verification Client Company Identification Number
2. Form I-20, Certificate of Eligibility for Nonimmigrant Student Status, endorsed by your DSO (Designated School Official) within the last 60 days; and
3. A copy of your STEM degree.

Note that if you file your OPT extension application on time and your OPT period expires while your application is pending, USCIS will automatically extend your employment authorization for 180 days. This automatic 180-day extension ceases once USCIS adjudicates your OPT extension application.
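
The two answers are close to each other, so a word-level diff makes the differences easier to spot than reading them side by side. The sketch below uses the standard library's `difflib` on two short strings adapted from the outputs above; it is one possible comparison, not a quality metric.

```python
import difflib

# Short excerpts adapted from the two model outputs.
answer_a = "Form I-20 endorsed by your DSO within the last 60 days"
answer_b = "Form I-20 endorsed by your DSO (Designated School Official) within the last 60 days"

words_a, words_b = answer_a.split(), answer_b.split()
matcher = difflib.SequenceMatcher(None, words_a, words_b)
print(f"word-level similarity: {matcher.ratio():.2f}")

# List only the spans where the two answers differ.
for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
    if tag != "equal":
        print(tag, words_a[a_start:a_end], "->", words_b[b_start:b_end])
```

With the notebook's variables, `answer_a` and `answer_b` would be `online_llm_summary` and `llama3_summary`.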
