# 🎓 SMU International Student Assistant

**Project:** RAG Assistant for F1 International Students at Southern Methodist University (SMU)

**Author:** Vee Huynh

**Date:** February 2026

**GitHub:** https://github.com/veehuynh311/SMU-International-Student-Assistant

---

## 📋 Problem Statement

**Domain:** F1 Visa Navigation for International Students at Southern Methodist University (SMU)

**Target User:** F1 international students at SMU, especially new arrivals and those seeking employment authorization (internships, post-graduation jobs).

**Problem:** International students face a maze of complex regulations across multiple areas of life: maintaining F1 status, work authorization (CPT/OPT/STEM OPT), filing taxes, obtaining an SSN, and getting a Texas driver's license. Information is scattered across government agencies (USCIS, ICE, DHS, IRS, SSA, Texas DPS) and university resources. A single mistake can jeopardize visa status, with consequences as serious as deportation. This RAG assistant consolidates 12 official documents into a single conversational interface that gives source-cited answers to questions such as "How do I apply for CPT?", "Will full-time CPT affect my OPT eligibility?", "How do I get an SSN?", or "What documents do I need for a Texas driver's license?"

---

## 📦 Step 1: Install Required Packages

In [None]:
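# Install packages for document loading and text processing
# (langchain-text-splitters ships RecursiveCharacterTextSplitter in current LangChain releases)
!pip install pypdf beautifulsoup4 lxml langchain langchain-text-splitters -q
print("✅ Packages installed!")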

## 📁 Step 2: Upload My Documents

Upload the documents I collected (PDF and HTML files).

**Documents collected for this project (12 total):**

| # | Source | Topic | Type | Filename |
|---|--------|-------|------|----------|
| 1 | SMU ISSS | New Student Information | HTML | `smu_new_student_info.html` |
| 2 | SMU ISSS | Current Student Information | HTML | `smu_current_student_info.html` |
| 3 | SMU ISSS | US Living (Tax, SSN, DL) | HTML | `smu_us_living.html` |
| 4 | USCIS | OPT for F-1 Students | HTML | `uscis_opt.html` |
| 5 | USCIS | STEM OPT Extension | HTML | `uscis_stem_opt.html` |
| 6 | DHS | CPT Guide | HTML | `dhs_cpt_guide.html` |
| 7 | ICE | Practical Training | HTML | `ice_practical_training.html` |
| 8 | IRS | Form 8843 Instructions | HTML | `irs_form_8843.html` |
| 9 | Sprintax | F1 Tax Guide | HTML | `sprintax_tax_guide.html` |
| 10 | DHS | Obtaining SSN | HTML | `dhs_ssn_guide.html` |
| 11 | SSA | SSN for International Students | PDF | `ssa_international_students_ssn.pdf` |
| 12 | Texas DPS | Driver License Checklist | PDF | `texas_dps_dl_checklist.pdf` |
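
Because the upload is manual, it can help to confirm afterwards that all 12 files arrived under the expected names. The sketch below is one possible check, not part of the original notebook: `find_missing_documents` is a hypothetical helper, and the filenames are copied from the table above.

```python
from pathlib import Path

# Filenames copied from the document table above
EXPECTED_FILES = [
    "smu_new_student_info.html",
    "smu_current_student_info.html",
    "smu_us_living.html",
    "uscis_opt.html",
    "uscis_stem_opt.html",
    "dhs_cpt_guide.html",
    "ice_practical_training.html",
    "irs_form_8843.html",
    "sprintax_tax_guide.html",
    "dhs_ssn_guide.html",
    "ssa_international_students_ssn.pdf",
    "texas_dps_dl_checklist.pdf",
]


def find_missing_documents(folder, expected=EXPECTED_FILES):
    """Return the expected filenames that are not present in `folder`.

    Hypothetical helper: a missing folder is treated as "everything missing".
    """
    folder_path = Path(folder)
    if not folder_path.is_dir():
        return list(expected)
    present = {p.name for p in folder_path.iterdir() if p.is_file()}
    return [name for name in expected if name not in present]
```

Once the upload cell has run, `find_missing_documents("documents")` should return an empty list; any names it does return point to files that were skipped or renamed.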

In [None]:
from google.colab import files
import os

# Create documents folder
os.makedirs("documents", exist_ok=True)

print("📁 Upload documents (PDF, HTML files):")
print("   Target: 12 documents\n")

# Upload files
uploaded = files.upload()

# Move to documents folder
for filename in uploaded.keys():
    os.rename(filename, f"documents/{filename}")
    print(f"   ✅ Saved: documents/{filename}")

print(f"\n📊 Total documents uploaded: {len(uploaded)}")

## 📂 Step 3: List All Documents

In [None]:
import os

doc_files = os.listdir("documents")
print(f"📂 Documents in folder ({len(doc_files)} files):\n")

for i, filename in enumerate(doc_files, 1):
    filepath = f"documents/{filename}"
    size_kb = os.path.getsize(filepath) / 1024
    extension = filename.split('.')[-1].upper()
    print(f"   {i}. [{extension}] {filename} ({size_kb:.1f} KB)")

---

# 📖 PART 1: Data Ingestion

---

## 🔧 Step 4: Define Document Loaders

Each file type gets its own loader function:
- **PDF**: Use `pypdf` to extract text with page tracking
- **HTML**: Use `BeautifulSoup` to parse and extract text
- **TXT**: Simple file read

In [None]:
from pypdf import PdfReader
from bs4 import BeautifulSoup
import re

def load_pdf(filepath):
    """
    Load a PDF file and extract text from all pages.
    Returns tuple: (full_text, pages_list)
    where pages_list contains (page_num, start_char, end_char) for metadata.
    """
    reader = PdfReader(filepath)
    full_text = ""
    pages_list = []  # List of (page_num, start_char, end_char)

    for page_num, page in enumerate(reader.pages, start=1):
        page_text = page.extract_text()
        if page_text:
            start_char = len(full_text)
            full_text += page_text + "\n"
            end_char = len(full_text)
            pages_list.append((page_num, start_char, end_char))

    return full_text, pages_list


def load_html(filepath):
    """
    Load an HTML file and extract text content.
    Returns tuple: (text, None) - no page tracking for HTML.
    """
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
        html_content = f.read()

    soup = BeautifulSoup(html_content, 'lxml')

    # Remove non-content elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()

    text = soup.get_text(separator='\n')
    return text, None


def load_txt(filepath):
    """
    Load a plain text file.
    Returns tuple: (text, None)
    """
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
        return f.read(), None


def load_document(filepath):
    """
    Load a document based on its file extension.
    Returns tuple: (text, page_info)
    """
    extension = filepath.lower().split('.')[-1]

    if extension == 'pdf':
        return load_pdf(filepath)
    elif extension in ['html', 'htm']:
        return load_html(filepath)
    elif extension == 'txt':
        return load_txt(filepath)
    else:
        print(f"⚠️ Unknown file type: {extension}")
        return "", None


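def get_doc_type(filename):
    """
    Determine document type/category based on filename.
    Checks run in order and the first match wins, so 'dhs_ssn_guide.html'
    is categorized as 'immigration' rather than 'ssn'.
    """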
    filename_lower = filename.lower()
    if 'smu' in filename_lower:
        return 'university'
    elif 'uscis' in filename_lower or 'dhs' in filename_lower or 'ice' in filename_lower:
        return 'immigration'
    elif 'irs' in filename_lower or 'tax' in filename_lower or 'sprintax' in filename_lower:
        return 'tax'
    elif 'ssa' in filename_lower or 'ssn' in filename_lower:
        return 'ssn'
    elif 'dps' in filename_lower or 'driver' in filename_lower:
        return 'driver_license'
    else:
        return 'general'

print("✅ Document loader functions defined!")
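
The category rules in `get_doc_type` depend on the order of the checks, which is easy to overlook. As a rough illustration, the snippet below repeats that logic in standalone form (so it runs on its own) and applies it to the 12 filenames from Step 2.

```python
from collections import defaultdict


def get_doc_type(filename):
    """Standalone copy of the notebook's first-match category rules."""
    name = filename.lower()
    if 'smu' in name:
        return 'university'
    elif 'uscis' in name or 'dhs' in name or 'ice' in name:
        return 'immigration'
    elif 'irs' in name or 'tax' in name or 'sprintax' in name:
        return 'tax'
    elif 'ssa' in name or 'ssn' in name:
        return 'ssn'
    elif 'dps' in name or 'driver' in name:
        return 'driver_license'
    return 'general'


filenames = [
    "smu_new_student_info.html", "smu_current_student_info.html",
    "smu_us_living.html", "uscis_opt.html", "uscis_stem_opt.html",
    "dhs_cpt_guide.html", "ice_practical_training.html",
    "irs_form_8843.html", "sprintax_tax_guide.html", "dhs_ssn_guide.html",
    "ssa_international_students_ssn.pdf", "texas_dps_dl_checklist.pdf",
]

# Group filenames by the category they receive
by_type = defaultdict(list)
for filename in filenames:
    by_type[get_doc_type(filename)].append(filename)

for doc_type in sorted(by_type):
    print(f"{doc_type} ({len(by_type[doc_type])}): {by_type[doc_type]}")
```

Two things stand out when this runs: `dhs_ssn_guide.html` lands in `immigration` because the `dhs` check comes before the `ssn` check, and because matching is by substring, an unrelated name such as `notice.txt` would also be labeled `immigration` (it contains `ice`).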

## 🧹 Step 5: Define Text Cleaning Function

Clean up common issues:
- Extra whitespace and newlines
- Page-number markers from headers/footers (e.g., "Page 3 of 12")

In [None]:
import re

def clean_text(text):
    """
    Clean extracted text by collapsing whitespace and removing page markers.
    The result is a single line of space-separated text.
    """
    # Collapse runs of blank lines into a single blank line
    text = re.sub(r'\n\s*\n', '\n\n', text)

    # Replace multiple spaces with single space
    text = re.sub(r' +', ' ', text)

    # Remove leading/trailing whitespace from each line
    lines = [line.strip() for line in text.split('\n')]

    # Remove empty lines
    lines = [line for line in lines if line]

    text = '\n'.join(lines)

    # Remove "Page X of Y" markers left over from headers/footers
    text = re.sub(r'Page \d+ of \d+', '', text)

    # Final pass: collapse all remaining whitespace, including newlines,
    # into single spaces (so the splitter's "\n\n" and "\n" separators
    # will not match on cleaned text)
    text = re.sub(r'\s+', ' ', text)

    return text.strip()

print("✅ Text cleaning function defined!")
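
To make the effect of the cleaning steps concrete, here is a small worked example. It uses a standalone copy of `clean_text` (so it runs on its own) and a made-up input string; the sample text is illustrative only, not taken from the collected documents.

```python
import re


def clean_text(text):
    """Standalone copy of the notebook's cleaning steps."""
    text = re.sub(r'\n\s*\n', '\n\n', text)              # collapse blank-line runs
    text = re.sub(r' +', ' ', text)                       # collapse repeated spaces
    lines = [line.strip() for line in text.split('\n')]   # trim each line
    lines = [line for line in lines if line]              # drop empty lines
    text = '\n'.join(lines)
    text = re.sub(r'Page \d+ of \d+', '', text)           # remove page markers
    text = re.sub(r'\s+', ' ', text)                      # flatten all whitespace
    return text.strip()


# Made-up sample with extra spaces, blank lines, and a page marker
raw = "  F-1 students must  report\n\n\naddress changes.\nPage 2 of 5\nWithin 10 days.  \n"
cleaned = clean_text(raw)
print(cleaned)
# F-1 students must report address changes. Within 10 days.
```

Note that the output contains no newlines at all: the final whitespace pass flattens paragraph breaks too, which matters later when the splitter looks for `"\n\n"` and `"\n"` separators.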

## 📖 Step 6: Load and Clean All Documents

In [None]:
import os

# Store documents with metadata
documents = []

print("📖 Loading and cleaning documents...\n")
print("-" * 60)

for filename in sorted(os.listdir("documents")):
    filepath = f"documents/{filename}"

    try:
        # Load the document (returns text and page info for PDFs)
        raw_text, page_info = load_document(filepath)

        # Clean the text
        cleaned_text = clean_text(raw_text)

        # Determine document type
        doc_type = get_doc_type(filename)
        file_type = filename.split('.')[-1].upper()

        # Store document info with enhanced metadata
        doc = {
            "filename": filename,
            "filepath": filepath,
            "source": os.path.splitext(filename)[0].replace("_", " "),  # Readable name without the extension
            "doc_type": doc_type,        # Category: immigration, tax, ssn, etc.
            "file_type": file_type,      # PDF, HTML, TXT
            "page_info": page_info,      # For PDFs: list of (page_num, start, end)
            "raw_length": len(raw_text),
            "cleaned_length": len(cleaned_text),
            "text": cleaned_text
        }
        documents.append(doc)

        reduction = (1 - len(cleaned_text)/len(raw_text)) * 100 if len(raw_text) > 0 else 0
        page_str = f", {len(page_info)} pages" if page_info else ""
        print(f"✅ {filename} [{doc_type}]")
        print(f"   Raw: {len(raw_text):,} chars → Cleaned: {len(cleaned_text):,} chars ({reduction:.0f}% reduction){page_str}")

    except Exception as e:
        print(f"❌ Error loading {filename}: {e}")

print("-" * 60)
print(f"\n📊 Successfully loaded {len(documents)} documents!")

## 👀 Step 7: Display Sample Cleaned Text

Preview the first five documents to verify that extraction worked correctly.

In [None]:
print("=" * 70)
print("📄 SAMPLE CLEANED TEXT (FIRST 5 DOCUMENTS)")
print("=" * 70)

for i, doc in enumerate(documents[:5], 1):  # Show first 5 docs
    print(f"\n{'─' * 70}")
    print(f"📄 Document {i}: {doc['filename']}")
    print(f"   Type: {doc['doc_type']} | Format: {doc['file_type']}")
    print(f"   Total length: {doc['cleaned_length']:,} characters")
    print(f"{'─' * 70}")

    # Show first 800 characters as preview
    preview = doc['text'][:800]
    if len(doc['text']) > 800:
        preview += "\n\n[... truncated ...]"

    print(preview)

print("\n" + "=" * 70)
print("✅ Session 2 Complete: Text extraction working!")
print("=" * 70)

---

# ✂️ PART 2: Chunking

---

## Why Chunking?

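Whole documents are often too long to pass to an LLM alongside a question, and too broad for accurate retrieval: a question about CPT should surface the relevant paragraph, not an entire web page. We therefore split each document into smaller, focused chunks.

**Settings (target range: 500-800 characters per chunk):**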
- `chunk_size`: 600 characters
- `chunk_overlap`: 100 characters (to preserve context at boundaries)
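
To show what these two settings mean in practice, here is a deliberately simplified sliding-window chunker. It is only a sketch of the size/overlap arithmetic, not the `RecursiveCharacterTextSplitter` used in this notebook (which also tries to break at separators); `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=600, chunk_overlap=100):
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Simplified illustration only: it cuts at exact character offsets and
    ignores sentence or paragraph boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    stride = chunk_size - chunk_overlap  # 600 - 100 = 500 new characters per chunk
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += stride
    return chunks


# A 1,500-character sample yields windows starting at 0, 500, and 1000
sample_text = "".join(str(i % 10) for i in range(1500))
sample_chunks = sliding_window_chunks(sample_text)
print([len(c) for c in sample_chunks])
# [600, 600, 500]

# The last 100 characters of one chunk reappear at the start of the next
print(sample_chunks[0][-100:] == sample_chunks[1][:100])
# True
```

With a 100-character overlap, a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk, at the cost of storing some text twice.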

## ✂️ Step 8: Define Chunking Function

In [None]:
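try:
    # Current LangChain releases ship the splitters in a separate package
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    # Fallback for older LangChain releases
    from langchain.text_splitter import RecursiveCharacterTextSplitter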

# Initialize the text splitter with specified settings
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,           # Target chunk size in characters (500-800 range)
    chunk_overlap=100,        # Overlap between chunks
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Priority order for splitting
)

print("✅ Text splitter configured!")
print("   • chunk_size: 600 characters")
print("   • chunk_overlap: 100 characters")

## ✂️ Step 9: Chunk All Documents with Enhanced Metadata

**Metadata includes:**
- Document name (source)
- Page/section (for PDFs)
- Document type category

In [None]:
def get_page_for_position(position, page_info):
    """
    Given a character position in text, return the page number.
    For PDFs only - returns None for HTML/TXT.
    """
    if not page_info:
        return None
    for page_num, start, end in page_info:
        if start <= position < end:
            return page_num
    return page_info[-1][0] if page_info else None  # Default to last page


# Store all chunks with metadata
all_chunks = []

print("✂️ Chunking documents with enhanced metadata...\n")
print("-" * 60)

for doc in documents:
    # Split the document text into chunks
    chunks = text_splitter.split_text(doc['text'])

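    # Track position in the cleaned text for page mapping
    current_pos = 0

    # Add metadata to each chunk
    for i, chunk_text in enumerate(chunks):
        # Find chunk position in cleaned text (approximate)
        chunk_start = doc['text'].find(chunk_text[:50], current_pos)
        if chunk_start == -1:
            chunk_start = current_pos

        # page_info offsets refer to the raw text, so scale the cleaned-text
        # position back to an approximate raw-text position before the lookup
        raw_pos = int(chunk_start * doc['raw_length'] / max(doc['cleaned_length'], 1))

        # Get page number for this chunk (PDFs only)
        page_num = get_page_for_position(raw_pos, doc['page_info'])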

        chunk = {
            "text": chunk_text,
            "metadata": {
                "source": doc['source'],           # Document name
                "filename": doc['filename'],
                "doc_type": doc['doc_type'],       # Category (immigration, tax, etc.)
                "file_type": doc['file_type'],     # PDF, HTML
                "page": page_num,                  # Page number (PDFs only)
                "chunk_id": i,
                "total_chunks": len(chunks)
            }
        }
        all_chunks.append(chunk)
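        # Step back by the overlap, but never before this chunk's start
        # (a negative start would make str.find search from the end of the text)
        current_pos = max(chunk_start, chunk_start + len(chunk_text) - 100)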

    page_str = " (pages tracked)" if doc['page_info'] else ""
    print(f"✅ {doc['filename']}")
    print(f"   {doc['cleaned_length']:,} chars → {len(chunks)} chunks{page_str}")

print("-" * 60)
print(f"\n📊 Total chunks created: {len(all_chunks)}")
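
The page lookup is easier to follow with a tiny worked example. The sketch below mirrors the offset bookkeeping in `load_pdf` without needing pypdf: `build_page_index` is a hypothetical stand-in that takes page texts as plain strings, and `get_page_for_position` is a standalone copy of the function above.

```python
def build_page_index(page_texts):
    """Concatenate page texts and record (page_num, start_char, end_char).

    Hypothetical stand-in for the loop inside load_pdf; pages with no
    extractable text are skipped, as in the notebook.
    """
    full_text = ""
    page_info = []
    for page_num, page_text in enumerate(page_texts, start=1):
        if page_text:
            start_char = len(full_text)
            full_text += page_text + "\n"
            page_info.append((page_num, start_char, len(full_text)))
    return full_text, page_info


def get_page_for_position(position, page_info):
    """Standalone copy of the notebook's lookup."""
    if not page_info:
        return None
    for page_num, start, end in page_info:
        if start <= position < end:
            return page_num
    return page_info[-1][0]  # past the end: default to the last page


# Three pages, the second of which has no extractable text
full_text, page_info = build_page_index(["Page one text.", "", "Page three."])
print(page_info)
# [(1, 0, 15), (3, 15, 27)]

print(get_page_for_position(full_text.find("three"), page_info))
# 3
```

These offsets index the raw concatenated text. Positions measured in the cleaned text are smaller, because cleaning removes characters, so a lookup with an unadjusted cleaned-text position can report an earlier page than the true one.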

## 📊 Step 10: Compute and Log Statistics

In [None]:
import statistics

# Calculate statistics
chunk_lengths = [len(chunk['text']) for chunk in all_chunks]

total_chunks = len(all_chunks)
avg_length = statistics.mean(chunk_lengths)
min_length = min(chunk_lengths)
max_length = max(chunk_lengths)
std_length = statistics.stdev(chunk_lengths) if len(chunk_lengths) > 1 else 0

print("=" * 60)
print("📊 CHUNKING STATISTICS")
print("=" * 60)
print(f"\n📈 Overall Statistics:")
print(f"   • Total documents: {len(documents)}")
print(f"   • Total chunks: {total_chunks}")
print(f"   • Average chunk length: {avg_length:.0f} characters")
print(f"   • Min chunk length: {min_length} characters")
print(f"   • Max chunk length: {max_length} characters")
print(f"   • Std deviation: {std_length:.0f} characters")

print(f"\n📄 Chunks per Document:")
print("-" * 60)

# Count chunks per document
chunks_per_doc = {}
for chunk in all_chunks:
    filename = chunk['metadata']['filename']
    chunks_per_doc[filename] = chunks_per_doc.get(filename, 0) + 1

for filename, count in chunks_per_doc.items():
    print(f"   • {filename}: {count} chunks")

# Count by document type
print(f"\n📂 Chunks by Document Type:")
print("-" * 60)
chunks_per_type = {}
for chunk in all_chunks:
    doc_type = chunk['metadata']['doc_type']
    chunks_per_type[doc_type] = chunks_per_type.get(doc_type, 0) + 1

for doc_type, count in sorted(chunks_per_type.items()):
    print(f"   • {doc_type}: {count} chunks")

print("\n" + "=" * 60)
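
The per-document and per-type tallies above can also be written with `collections.Counter`, which some readers may find more compact. This is an optional alternative sketched on made-up data: `sample_chunks` is a hypothetical three-item stand-in for `all_chunks`, not real output from the notebook.

```python
import statistics
from collections import Counter

# Hypothetical miniature stand-in for all_chunks (same shape, fake text)
sample_chunks = [
    {"text": "a" * 600, "metadata": {"filename": "uscis_opt.html", "doc_type": "immigration"}},
    {"text": "b" * 450, "metadata": {"filename": "uscis_opt.html", "doc_type": "immigration"}},
    {"text": "c" * 300, "metadata": {"filename": "irs_form_8843.html", "doc_type": "tax"}},
]

# Counter tallies each key in a single pass
chunks_per_doc = Counter(c["metadata"]["filename"] for c in sample_chunks)
chunks_per_type = Counter(c["metadata"]["doc_type"] for c in sample_chunks)

lengths = [len(c["text"]) for c in sample_chunks]
summary = {
    "total_chunks": len(lengths),
    "avg_length": statistics.mean(lengths),
    "min_length": min(lengths),
    "max_length": max(lengths),
}

print(dict(chunks_per_doc))
# {'uscis_opt.html': 2, 'irs_form_8843.html': 1}
print(dict(chunks_per_type))
# {'immigration': 2, 'tax': 1}
print(summary)
# {'total_chunks': 3, 'avg_length': 450, 'min_length': 300, 'max_length': 600}
```

`Counter` is a `dict` subclass, so the results can be passed to `json.dump` the same way as the hand-built dictionaries.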

## 👀 Step 11: Print 3-5 Sample Chunks with Metadata

In [None]:
print("=" * 70)
print("📝 SAMPLE CHUNKS WITH METADATA (5 examples)")
print("=" * 70)

# Select up to 5 evenly spaced sample chunks (the set removes duplicate
# indices when there are only a few chunks)
sample_indices = sorted({0, len(all_chunks)//4, len(all_chunks)//2, 3*len(all_chunks)//4, len(all_chunks)-1})

for idx in sample_indices:
    chunk = all_chunks[idx]
    meta = chunk['metadata']
    print(f"\n{'─' * 70}")
    print(f"📌 Chunk #{idx}")
    print(f"   Source: {meta['source']}")
    print(f"   Type: {meta['doc_type']} | Format: {meta['file_type']}")
    page_str = f" | Page: {meta['page']}" if meta['page'] else ""
    print(f"   Chunk: {meta['chunk_id'] + 1} of {meta['total_chunks']}{page_str}")
    print(f"   Length: {len(chunk['text'])} characters")
    print(f"{'─' * 70}")
    print(chunk['text'][:500])
    if len(chunk['text']) > 500:
        print("\n[... truncated ...]")

print("\n" + "=" * 70)
print("✅ Session 3 Complete: Chunking with enhanced metadata working!")
print("=" * 70)

## 💾 Step 12: Save Processed Data

In [None]:
import json

# Save chunks to JSON for next session
with open('chunks.json', 'w') as f:
    json.dump(all_chunks, f, indent=2)

print("💾 Chunks saved to 'chunks.json'")

# Save statistics
stats = {
    "total_documents": len(documents),
    "total_chunks": total_chunks,
    "avg_chunk_length": round(avg_length, 2),
    "min_chunk_length": min_length,
    "max_chunk_length": max_length,
    "chunk_size_setting": 600,
    "chunk_overlap_setting": 100,
    "chunks_per_document": chunks_per_doc,
    "chunks_per_type": chunks_per_type
}

with open('chunking_stats.json', 'w') as f:
    json.dump(stats, f, indent=2)

print("💾 Statistics saved to 'chunking_stats.json'")

# Download files (relies on `files` from google.colab, imported in Step 2)
files.download('chunks.json')
files.download('chunking_stats.json')
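
Since `chunks.json` is the hand-off to the next session, it may be worth checking that the file reloads with the structure written above. The sketch below is one way to do that; `load_chunks` and `REQUIRED_METADATA_KEYS` are hypothetical names, the key set is taken from the metadata built in Step 9, and the example chunk is made-up data.

```python
import json
import os
import tempfile

# Metadata keys written for every chunk in Step 9
REQUIRED_METADATA_KEYS = {
    "source", "filename", "doc_type", "file_type",
    "page", "chunk_id", "total_chunks",
}


def load_chunks(path):
    """Load chunks from JSON and check each one has text and full metadata.

    Hypothetical helper; raises ValueError on the first malformed chunk.
    """
    with open(path, "r", encoding="utf-8") as f:
        chunks = json.load(f)
    for i, chunk in enumerate(chunks):
        missing = REQUIRED_METADATA_KEYS - set(chunk.get("metadata", {}))
        if "text" not in chunk or missing:
            raise ValueError(f"Chunk {i} is malformed; missing metadata: {sorted(missing)}")
    return chunks


# Round-trip demo with one made-up chunk, written to a temporary folder
example = [{
    "text": "Sample chunk text.",
    "metadata": {
        "source": "uscis opt", "filename": "uscis_opt.html",
        "doc_type": "immigration", "file_type": "HTML",
        "page": None, "chunk_id": 0, "total_chunks": 1,
    },
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunks.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example, f, indent=2)
    reloaded = load_chunks(path)

print(reloaded == example)
# True
```

A `page` value of `None` (HTML chunks) is stored as JSON `null` and comes back as `None`, so downstream code can keep the same `if meta['page']` check.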

---

## ✅ Summary: Sessions 2 & 3 Complete!

### Session 2 (Data Ingestion):
- ✅ Loaded 12 documents (PDF, HTML)
- ✅ Extracted and cleaned text
- ✅ Printed sample cleaned text

### Session 3 (Chunking):
- ✅ Implemented chunking (chunk_size=600, overlap=100)
- ✅ **Enhanced metadata:** doc name, page (PDFs), doc_type category
- ✅ Computed statistics: total chunks, avg length, chunks per doc/type
- ✅ Printed 3-5 sample chunks with metadata
- ✅ Saved data for next session

### Topics Covered by Documents:
| Topic | Sources | Doc Type |
|-------|--------|----------|
| F1 Status | SMU ISSS (3 docs) | university |
| CPT | DHS, ICE | immigration |
| OPT | USCIS, ICE | immigration |
| STEM OPT | USCIS | immigration |
| Taxes | IRS, Sprintax | tax |
| SSN | DHS, SSA, SMU | immigration, ssn, university |
| Texas Driver's License | Texas DPS, SMU | driver_license, university |

### Next Steps (Session 4):
- Generate embeddings using sentence-transformers (all-MiniLM-L6-v2)
- Build vector store with FAISS
- Implement `retrieve_top_k(query)` function
- Write 10-15 test questions and inspect retrieval results
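
As a rough preview of the retrieval interface, the sketch below shows what a `retrieve_top_k` function could look like. Everything here is an assumption for illustration: the `(query, chunks, k)` signature is hypothetical, the sample chunks are made-up sentences rather than text from the collected documents, and the scoring uses simple word-count cosine similarity as a stand-in for the sentence-transformers embeddings and FAISS index planned for Session 4.

```python
import math
import re
from collections import Counter


def _bag_of_words(text):
    """Lowercase word counts; a placeholder for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, chunks, k=3):
    """Return the k highest-scoring (score, chunk) pairs for the query."""
    query_vec = _bag_of_words(query)
    scored = [(_cosine(query_vec, _bag_of_words(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort by score only
    return scored[:k]


# Made-up sample chunks in the same shape as all_chunks
sample_chunks = [
    {"text": "CPT requires authorization from your DSO before you start work.",
     "metadata": {"filename": "dhs_cpt_guide.html"}},
    {"text": "Form 8843 must be filed by nonresident students each tax year.",
     "metadata": {"filename": "irs_form_8843.html"}},
    {"text": "Bring your passport and visa documents to the Texas DPS office to get a driver license.",
     "metadata": {"filename": "texas_dps_dl_checklist.pdf"}},
]

results = retrieve_top_k("How do I apply for CPT authorization?", sample_chunks, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk['metadata']['filename']}")
```

Swapping in real embeddings would change only `_bag_of_words` and `_cosine`; returning the metadata with each hit is what lets the assistant cite its sources.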

---