# Simple RAG Pipeline for STEM OPT Document
This notebook demonstrates a simple Retrieval-Augmented Generation (RAG) workflow using LangChain, FAISS, and LLMs to answer questions about the STEM OPT extension process. Each section is annotated for clarity.

In [2]:
from bs4 import BeautifulSoup

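# Document, FAISS, and PromptTemplate are imported from their current packages;
# the older langchain.* paths are deprecated. The unused LLMChain import is dropped
# because the chains below are built with the `prompt | llm` syntax.
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate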
from langchain_ollama import OllamaLLM
from langchain_openai import ChatOpenAI

from dotenv import load_dotenv
import os


## Load and Parse the STEM OPT HTML Document
Read the USCIS STEM OPT extension HTML file and extract its content using BeautifulSoup for further processing.

In [3]:

with open("../data/documents/Optional Practical Training Extension for STEM Students (STEM OPT) _ USCIS.html", "r", encoding="utf-8") as f:
    html = f.read()

# Parse HTML and extract text
soup = BeautifulSoup(html, "html.parser")

### Extract Relevant Content
Identify and extract the main content panels from the HTML using their CSS class.

In [4]:
panels = soup.find_all(class_="accordion__panel")

In [5]:
texts = [panel.get_text(separator="\n", strip=True) for panel in panels]
combined_text = "\n\n".join(texts)
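
If the page layout changes and no element carries the `accordion__panel` class, `texts` comes back empty and every later step silently works on an empty string. A minimal guard, sketched below, could catch that early. `join_panel_texts` is a hypothetical helper for illustration, not part of the notebook or of BeautifulSoup.

```python
def join_panel_texts(texts, separator="\n\n"):
    """Join non-empty panel texts and fail loudly if nothing was extracted.

    Hypothetical helper: `texts` is assumed to be a list of strings such as
    the output of panel.get_text(...) for each panel.
    """
    non_empty = [t.strip() for t in texts if t and t.strip()]
    if not non_empty:
        raise ValueError(
            "No panel text extracted; check the CSS class passed to find_all()."
        )
    return separator.join(non_empty)


# Toy input standing in for the extracted panel texts
sample_texts = ["Eligibility", "", "  How to apply  "]
sample_combined = join_panel_texts(sample_texts)
print(repr(sample_combined))
```

In the notebook, the same check could wrap the existing `"\n\n".join(texts)` call.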

### Chunk the Extracted Text
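Split the combined text into smaller chunks so that each one fits within the input limits of the embedding model and the LLM. Note that `chunk_text` measures length in characters, so `max_tokens` is only a rough proxy for a true token limit.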

In [6]:
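def chunk_text(text, max_tokens=500, separator="\n\n"):
    """Greedily pack paragraphs into chunks of roughly max_tokens characters."""
    # Length is measured in characters here, not in model tokens.
    paragraphs = text.split(separator)

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        if len(current_chunk) + len(para) < max_tokens:
            current_chunk += para + separator
        else:
            # Flush the current chunk; skip it if it is still empty, which
            # happens when the very first paragraph exceeds the limit.
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = para + separator

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks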
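
Because `chunk_text` compares character counts, its `max_tokens` argument only approximates a token budget. The sketch below is one possible alternative that counts whitespace-separated words instead, which is still only a rough proxy for model tokens. The name `chunk_by_words` and the `max_words` parameter are illustrative assumptions, not part of the notebook.

```python
def chunk_by_words(text, max_words=100, separator="\n\n"):
    """Pack paragraphs into chunks of at most max_words words where possible.

    A paragraph longer than max_words is kept whole in its own chunk.
    """
    chunks, current, count = [], [], 0
    for para in text.split(separator):
        n_words = len(para.split())
        if n_words == 0:
            continue  # skip empty paragraphs
        if current and count + n_words > max_words:
            # Adding this paragraph would exceed the budget: flush first
            chunks.append(separator.join(current))
            current, count = [], 0
        current.append(para.strip())
        count += n_words
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
word_chunks = chunk_by_words(sample, max_words=5)
print(word_chunks)
```

For an exact budget, the tokenizer of the embedding model would have to do the counting; this sketch deliberately avoids that dependency.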

In [None]:
chunks = chunk_text(combined_text, max_tokens=200)  # adjust size as needed

In [8]:
print(f"Created {len(chunks)} chunks.")

Created 5 chunks.


### Create Document Objects
Convert each text chunk into a LangChain `Document` object for downstream processing.

In [9]:
documents = [Document(page_content=chunk) for chunk in chunks]

### Embed Documents and Build Vector Store
Generate embeddings for each document chunk and store them in a FAISS vector database for efficient retrieval.

In [10]:
embedding_model = HuggingFaceEmbeddings(model_name="multi-qa-MiniLM-L6-cos-v1")

In [11]:
vectorstore = FAISS.from_documents(documents, embedding_model)
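
To make the idea of vector retrieval concrete, here is a toy sketch in plain Python: each document is a small hand-made vector, and the query is matched against them by cosine similarity. This is an illustration only. The vectors are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and FAISS uses its own index structures and may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Made-up 2-D "embeddings" for three chunks and one query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]
best = top_k(query_vec, doc_vecs, k=2)
print(best)
```

Real embeddings have hundreds of dimensions, but the ranking step follows the same principle.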

### Retrieve Relevant Chunks for a Query
Set up a retriever from the vector store and use it to fetch the most relevant document chunks for a sample user query. With no arguments, `as_retriever()` runs a similarity search and returns the top four chunks.

In [12]:
retriever = vectorstore.as_retriever()
query = "what's the process for applying OPT?"


In [13]:
retrieved_docs = retriever.invoke(query)

In [14]:
print(f"{len(retrieved_docs)} docs retrieved")

4 docs retrieved


### Prepare Retrieved Content for LLM
Combine the retrieved document chunks into a single context string to be used as input for the language model.

In [15]:
combined_docs_text = "\n\n".join([doc.page_content for doc in retrieved_docs])
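
Joining every retrieved chunk works for a small document, but the context can outgrow the model's input limit as the corpus grows. The sketch below shows one way to cap the context size and number each chunk so an answer can be traced back to its source. `build_context` and `max_chars` are hypothetical names, and the budget is counted in characters as a rough stand-in for tokens.

```python
def build_context(chunks, max_chars=2000, separator="\n\n"):
    """Join numbered chunks in rank order until the character budget is used up."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        cost = len(entry) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break  # stop at the first chunk that does not fit
        parts.append(entry)
        used += cost
    return separator.join(parts)


# Plain strings standing in for doc.page_content of each retrieved document
sample_chunks = ["aaa", "bbb", "ccc"]
full_context = build_context(sample_chunks)
short_context = build_context(sample_chunks, max_chars=15)
print(full_context)
```

In the notebook, the input would be `[doc.page_content for doc in retrieved_docs]`.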

### Construct the Prompt for the LLM
Create a prompt template that provides the retrieved context and user query to the language model for answer generation.

In [16]:
prompt_template = """
You are an expert assistant. Use the context below to answer the user's question.
Do NOT include any internal thoughts or explanations.

Context:
{documents}

User Question:
{query}

Answer:
"""

In [17]:
prompt = PromptTemplate(
    input_variables=["documents", "query"],
    template=prompt_template
)
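
Before calling a model, it can help to look at the fully rendered prompt. The sketch below fills the same two placeholders with plain `str.format`, using made-up values, so it runs without LangChain installed. The template text is copied from the cell above; the sample context and question are illustrative.

```python
TEMPLATE = """
You are an expert assistant. Use the context below to answer the user's question.
Do NOT include any internal thoughts or explanations.

Context:
{documents}

User Question:
{query}

Answer:
"""

# Made-up values standing in for combined_docs_text and query
rendered = TEMPLATE.format(
    documents="Form I-765 must be filed with the correct fee.",
    query="How do I apply?",
)
print(rendered)
```

With the `PromptTemplate` object from the notebook, `prompt.format(documents=..., query=...)` should produce the equivalent string.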

### Set Up LLMs and Chains
Instantiate a local model served by Ollama and an online model served through Together's OpenAI-compatible API. The Together API key is read from a `.env` file; the chains themselves are built in the next section.

In [18]:
llama3 = OllamaLLM(model="llama3")

In [19]:
load_dotenv()

api_key = os.getenv("TOGETHER_API_KEY")

online_llm = ChatOpenAI(
    model="meta-llama/Llama-Vision-Free",
    api_key=api_key,
    base_url="https://api.together.xyz/v1",
    temperature=0
)
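
`os.getenv` returns `None` when the variable is not set, so a missing key would only surface later as an authentication error from the API. A small guard like the one sketched below could fail earlier with a clearer message. `require_env` is a hypothetical helper, not part of `python-dotenv` or LangChain.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Example usage (commented out so the sketch runs without a real key):
# api_key = require_env("TOGETHER_API_KEY")
```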

### Generate Answers Using LLMs
Run both the online and local LLM chains to generate answers to the user query based on the retrieved context. The last cell sends the bare query to the online model without any context, as a no-RAG baseline.

In [25]:
online_llm_chain = prompt | online_llm


# Online model: this took about 2 s to generate the summary
online_llm_summary = online_llm_chain.invoke(
    {
        "documents": combined_docs_text,
        "query": query
    }
).content

In [28]:
llama3_chain = prompt | llama3

# Local model: this took about 10 min to generate the summary
llama3_summary = llama3_chain.invoke(
    {
        "documents": combined_docs_text,
        "query": query
    }
)

In [33]:
no_rag_summary = online_llm.invoke(query).content
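
The timings noted in the comments above were recorded by hand. A small wrapper such as the one sketched below could measure them directly. `timed` is a hypothetical helper, and `fake_invoke` is a stand-in for a chain's `invoke` method so the example runs without any model.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_invoke(inputs):
    """Stand-in for chain.invoke; returns a canned string."""
    return f"answer to: {inputs['query']}"


answer, seconds = timed(fake_invoke, {"query": "example"})
print(f"{seconds:.3f} s -> {answer}")
```

In the notebook, the call would look like `timed(llama3_chain.invoke, {"documents": combined_docs_text, "query": query})`.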

### Output and Compare Results
Display the no-RAG baseline answer first, followed by the RAG answers from the online and local LLMs, for comparison.

In [34]:
print(no_rag_summary)

OPT (Optional Practical Training) is a type of work authorization for international students in the United States. Here's a step-by-step guide to the OPT application process:

**Eligibility:**

Before applying for OPT, make sure you meet the eligibility criteria:

1. You must be a F-1 student in good standing.
2. You must have completed at least one academic year of study.
3. You must have a valid I-20 form from your school.

**Preparation:**

1. **Check your eligibility**: Verify that you meet the eligibility criteria.
2. **Gather required documents**: You'll need:
	* A valid passport
	* A copy of your I-20 form
	* A copy of your F-1 visa (if applicable)
	* Proof of completion of at least one academic year of study (e.g., transcript, diploma)
	* Proof of employment (if you have a job offer)
3. **Choose your OPT start date**: You can choose one of two start dates:
	* Post-completion OPT: Start date is 90 days after your program completion date.
	* Pre-completion OPT: Start date is the 

In [30]:
print(online_llm_summary)

To apply for an OPT (Optional Practical Training) extension, you must properly file:

1. Form I-765 with:
	* The correct application fee
	* Your employer's name as listed in E-Verify
	* Your employer's E-Verify Company Identification Number or valid E-Verify Client Company Identification Number
2. Form I-20, Certificate of Eligibility for Nonimmigrant Student Status, endorsed by your DSO within the last 60 days
3. A copy of your STEM degree

Additionally, if you file your STEM OPT extension application on time and your OPT period expires while your extension application is pending, you will automatically receive an 180-day extension of your employment authorization.


In [31]:
print(llama3_summary)

To apply for an OPT (Optional Practical Training), you must properly file:

1. Form I-765 with:
	* The correct application fee;
	* Your employer’s name as listed in E-Verify, and
	* Your employer’s E-Verify Company Identification Number or valid E-Verify Client Company Identification Number
2. Form I-20, Certificate of Eligibility for Nonimmigrant Student Status, endorsed by your DSO (Designated School Official) within the last 60 days;
3. A copy of your STEM degree.

Additionally, if you file your OPT extension application on time and your OPT period expires while your extension application is pending, USCIS will automatically extend your employment authorization for 180 days. This automatic 180-day extension ceases once USCIS adjudicates your OPT extension application.
