# Travel Visa Assistant #

Welcome! This notebook builds a simple and practical English-only travel visa assistant that answers visa-related questions about countries like Canada, the US, or the UK using real data scraped from government websites.

For this project, we'll set up a Retrieval-Augmented Generation (RAG) system using webpages from official government sites that provide visitor visa information. The scraped visa content from these trusted sources will form the knowledge base. This content will be indexed using vector embeddings and retrieved dynamically to give relevant context to an LLM (GPT-4), which will then answer user questions accurately and conversationally.

## Prerequisites ## 

You need an OpenAI account to create an OpenAI API key.

## Setup Instructions ##

### Step 1. Set Up Environment (Mac/Linux) ###

```
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

### Step 2. Get OpenAI API Key ###

1. Go to: <https://platform.openai.com/account/api-keys>

2. Log in with your OpenAI account

3. Click “Create new secret key”

4. Copy and store the key securely

Note: GPT-4 is usage-billed. You can set hard limits on usage in your [billing dashboard](https://platform.openai.com/settings/organization/billing/overview).

### Step 3. Create .env File with API Keys ###

Create a .env file in your root project directory and add your API key like this:

```
OPENAI_API_KEY=your-openai-key
```
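
A missing or empty key usually only surfaces later as an authentication error, so it can help to check for it right after loading the .env file. The following is a minimal sketch; the helper name `require_env` is hypothetical and not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Intended usage, after load_dotenv() has run:
# api_key = require_env("OPENAI_API_KEY")
```

Failing early this way keeps the error message close to its cause instead of deep inside an API call.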

### Step 4. Launch the Notebook ###

```
jupyter notebook
```

Open and run travel_assistant.ipynb.

## Import Required Libraries ##

In [1]:
import os
from dotenv import load_dotenv
from huggingface_hub import login

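# Read variables from the .env file into the process environment
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

# Optional: not set in the .env file above, so this is None unless you add it
HF_TOKEN = os.getenv("HF_TOKEN")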

We’ll be using LangChain for our RAG pipeline, FAISS for vector search, BeautifulSoup for scraping, and Gradio for a quick user interface.

In [2]:
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from deep_translator import GoogleTranslator
from bs4 import BeautifulSoup
import requests
import gradio as gr
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings


USER_AGENT environment variable not set, consider setting it to identify your requests.


## Define Visa Websites ## 

We’ll index a small set of visa-related government URLs to create our knowledge base. These official pages will serve as the data sources from which we extract content. The cleaned and relevant text from these pages will be turned into vector embeddings and stored in a FAISS index. This will allow our assistant to dynamically retrieve and feed context to an LLM (GPT-4), enabling it to answer user questions with real and current information.

In [3]:
TRAVEL_URLS = {
    "canada": "https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html",
    "usa": "https://travel.state.gov/content/travel/en/us-visas/tourism-visit.html",
    "uk": "https://www.gov.uk/standard-visitor",
    # "australia": "https://immi.homeaffairs.gov.au/visas/getting-a-visa/visa-finder/visit",
    # "japan": "https://www.mofa.go.jp/j_info/visit/visa/index.html",
    # "germany": "https://www.germany.info/us-en/service/visa-entry",
    # "south_korea": "https://www.visa.go.kr/openPage.do?MENU_ID=10101",
    # "new_zealand": "https://www.immigration.govt.nz/new-zealand-visas",
    # "india": "https://indianvisaonline.gov.in/",
    # "singapore": "https://www.ica.gov.sg/enter-transit-depart/entering-singapore"
}


## Load Data ##

For this project, instead of using a built-in LangChain loader like ```WebBaseLoader```, we are going to build a custom scraper using BeautifulSoup. This gives us full control to follow internal links and limit how deep we go (to avoid crawling unrelated content). The scraper collects visible paragraph text from a set of visa-related government URLs and their subpages. Each page will be converted into a LangChain Document object, which stores both the text and some useful metadata like the URL and depth level.

At the end of the scraping process, we'll print a sample document to inspect its structure and content.

In [4]:
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
from collections import deque

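def scrape_data(start_url, country, max_depth=1, max_pages=10):
    """Breadth-first crawl from start_url, returning one Document per page with text."""
    visited = set()
    to_visit = deque([(start_url, 0)])  # queue of (url, depth) pairs
    documents = []

    # Stop when the queue is empty or the page budget is used up
    while to_visit and len(visited) < max_pages:
        current_url, level = to_visit.popleft()
        # Skip pages already fetched or deeper than the allowed depth
        if current_url in visited or level > max_depth:
            continue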

        try:
            response = requests.get(current_url, timeout=10)
            soup = BeautifulSoup(response.text, "html.parser")
            paragraphs = soup.find_all("p")
            text = "\n".join(p.get_text(strip=True) for p in paragraphs)

            if text.strip():
                documents.append(
                    Document(
                        page_content=text,
                        metadata={"source": country, "url": current_url, "level": level}
                    )
                )
                print(f"Visiting: {current_url} (Level {level}) — {len(visited)+1}")


            visited.add(current_url)

            # Queue links that stay on the same host as the starting URL
            base_netloc = urlparse(start_url).netloc
            for a_tag in soup.find_all("a", href=True):
                href = a_tag["href"]
                full_url = urljoin(current_url, href)  # resolve relative links
                if urlparse(full_url).netloc == base_netloc and full_url not in visited:
                    to_visit.append((full_url, level + 1))

        except Exception as e:
            print(f"Failed to visit {current_url}: {e}")

    return documents
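
One thing to watch for with a crawler like this: links that differ only by an in-page anchor (for example `#wb-cont`) point at the same page, but compare as different strings, so the page can be fetched and stored more than once. A possible remedy, sketched here as an assumption and not as part of the scraper above, is to strip the fragment before checking `visited`. The helper name `normalize_url` is hypothetical.

```python
from urllib.parse import urldefrag


def normalize_url(url: str) -> str:
    """Drop the #fragment so in-page anchors map to the same page URL."""
    clean_url, _fragment = urldefrag(url)
    return clean_url


urls = [
    "https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html",
    "https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html#wb-cont",
    "https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html#wb-info",
]

# All three collapse to a single page once fragments are removed
unique = {normalize_url(u) for u in urls}
print(len(unique))  # 1
```

Applying such a helper to `full_url` before it is queued would keep the page budget for genuinely different pages.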


## Scrape & Load Documents ##

We'll load at most 50 pages per starting URL, following links up to 2 levels deep. Scraping live pages keeps the data current, and the two caps keep it manageable.

In [5]:
MAX_DEPTH = 2
MAX_PAGES = 50

all_documents = []

for country, url in TRAVEL_URLS.items():
    print(f"Getting Data from {country.upper()} (up to level {MAX_DEPTH})...")
    docs = scrape_data(url, country, max_depth=MAX_DEPTH, max_pages=MAX_PAGES)
    print(f"{len(docs)} documents scraped for {country}")
    all_documents.extend(docs)

print(f"\nTotal documents: {len(all_documents)}")


Getting Data from CANADA (up to level 2)...
Visiting: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html (Level 0) — 1
Visiting: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html#wb-cont (Level 1) — 2
Visiting: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html#wb-info (Level 1) — 3
Visiting: https://www.canada.ca/fr/immigration-refugies-citoyennete/services/visiter-canada.html (Level 1) — 4
Visiting: https://www.canada.ca/en.html (Level 1) — 5
Visiting: https://www.canada.ca/en/services/jobs.html (Level 1) — 6
Visiting: https://www.canada.ca/en/services/immigration-citizenship.html (Level 1) — 7
Visiting: https://www.canada.ca/en/services/business.html (Level 1) — 8
Visiting: https://www.canada.ca/en/services/benefits.html (Level 1) — 9
Visiting: https://www.canada.ca/en/services/health.html (Level 1) — 10
Visiting: https://www.canada.ca/en/services/taxes.html (Level 1) — 11
Visiting

In [6]:
## Show one sample
print("\nSample Document:")
print("Source:", all_documents[0].metadata['source'])
print("URL:", all_documents[0].metadata['url'])
print("Level:", all_documents[0].metadata['level'])
print("\nContent preview:\n", all_documents[0].page_content[:1000])  


Sample Document:
Source: canada
URL: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html
Level: 0

Content preview:
 Find out what document you need to travel, visit family and friends, do business, or transit through Canada, and how to extend your stay.
What a visitor visa is, who is eligible and how to apply
What an eTA is, eligibility and how to apply online
How to extend your stay in Canada
How to get a new visa from inside Canada
Who is eligible and how to apply for a super visa that lets you stay with family in Canada for 5
                    years at a time
Find out about the travel documents you need and what to bring to Canada as a business visitor
What it means to transit through Canada and which documents you need
Travel tips, what happens at the border, and prohibited or restricted goods
Answer a few questions to see different ways you might be able to come to Canada


In [7]:
## Show first 3 documents
for i, doc in enumerate(all_documents[:3]):
    print("Country:", doc.metadata['source'])
    print("URL:", doc.metadata['url'])
    print("Level:", doc.metadata['level'])
    print("Data Content:\n", doc.page_content[:500], "\n...")


Country: canada
URL: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html
Level: 0
Data Content:
 Find out what document you need to travel, visit family and friends, do business, or transit through Canada, and how to extend your stay.
What a visitor visa is, who is eligible and how to apply
What an eTA is, eligibility and how to apply online
How to extend your stay in Canada
How to get a new visa from inside Canada
Who is eligible and how to apply for a super visa that lets you stay with family in Canada for 5
                    years at a time
Find out about the travel documents you need 
...
Country: canada
URL: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html#wb-cont
Level: 1
Data Content:
 Find out what document you need to travel, visit family and friends, do business, or transit through Canada, and how to extend your stay.
What a visitor visa is, who is eligible and how to apply
What an eTA is, eligibility and 

## Chunk Text ##

We need to split the scraped content into smaller, more manageable pieces called "chunks." LangChain’s ```RecursiveCharacterTextSplitter``` does this by attempting to break large blocks of text at logical boundaries using a prioritized set of characters:

- "\n\n" - two newline characters (a paragraph break)
- "\n" - one newline character
- " " - a space
- "" - the empty string, which splits between individual characters

It starts with the first option, "\n\n", and moves down the list only if the resulting pieces are still longer than the chunk size. This way, splits happen at the most natural boundary available.
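
To make the fallback order concrete, here is a simplified pure-Python sketch of the idea. It is an illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, and overlap is left out entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitter: tries coarse separators first, no overlap."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to merge this piece into the chunk being built
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, chunk_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits within the limit and stays whole, while the second is too long and falls through to the space separator.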

For this project, we'll use a chunk size of 500 characters with a 50-character overlap. This gives the model enough context to form meaningful answers without overwhelming it. Larger chunks (around 1000 characters) produced noisy, confused answers in earlier testing, so we settled on this smaller size.

In [8]:
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunked_docs = splitter.split_documents(all_documents)
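
To see what a 50-character overlap means in terms of positions, here is a rough sketch using fixed-size windows. This only illustrates the arithmetic: the actual splitter works on separator-delimited pieces, so its overlaps will not line up this neatly. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # 450 with the values used here
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence cut at a boundary is still seen whole in at least one chunk.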

## Embeddings and Vectorstore ##

Choosing the right embedding model is critical to how well the assistant can understand and retrieve relevant information. Since this project uses text scraped from visa-related government websites, we'll use OpenAI’s ```text-embedding-ada-002``` model through the ```OpenAIEmbeddings``` interface from the ```langchain_openai``` package.

This model is optimized for semantic similarity and works well in question-answering tasks like this one. Although there are open-source alternatives like GloVe, Word2Vec, FastText, or transformer-based models like BERT and RoBERTa, OpenAI’s model was chosen for its ease of integration and high-quality embeddings.

That said, using OpenAI’s embedding API incurs usage costs. If you're looking for a free, local alternative, you can explore ```HuggingFaceEmbeddings``` with sentence-transformers, though retrieval quality depends on the model you pick, and embedding speed depends on your hardware.

Here’s how the embeddings and FAISS index were set up:

In [9]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=api_key)
vectorstore = FAISS.from_documents(chunked_docs, embeddings)
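
Conceptually, the vector store answers a query by embedding it and returning the stored chunks whose vectors lie closest to the query vector. The brute-force sketch below shows that idea with made-up 3-dimensional vectors; the vectors, texts, and function names are all hypothetical, real embeddings have far more dimensions, and FAISS uses optimized index structures instead of a plain sort.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the texts of the k stored vectors nearest to the query."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _vec in ranked[:k]]


# Toy index of (text, vector) pairs with invented values
toy_index = [
    ("visitor visa eligibility", [0.9, 0.1, 0.0]),
    ("tax filing deadlines", [0.0, 0.2, 0.9]),
    ("eTA application steps", [0.8, 0.3, 0.1]),
]

query = [1.0, 0.2, 0.0]
print(top_k(query, toy_index, k=2))
# ['visitor visa eligibility', 'eTA application steps']
```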


## Set Up Model and Retriever ##

The next step is to turn our FAISS vectorstore into a retriever. This retriever will fetch relevant chunks of visa content to provide context for our generative model. The context helps the LLM generate more accurate and grounded responses.

In this project, we'll use OpenAI’s gpt-4 model through LangChain. It was selected for its strong performance in question-answering tasks, especially when combined with relevant context. Alternatives like local models were not chosen due to the memory and compute requirements.

Here’s how the retrieval and model setup looks:

In [10]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model_name="gpt-4", openai_api_key=api_key)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)
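
By default, `RetrievalQA.from_chain_type` uses the "stuff" chain type, which places all retrieved chunks into a single prompt ahead of the question. The sketch below shows roughly what that assembly amounts to. The function name `build_stuffed_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Join retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Stand-in chunks; in the notebook these would come from the retriever
retrieved = [
    "What a visitor visa is, who is eligible and how to apply",
    "What an eTA is, eligibility and how to apply online",
]
print(build_stuffed_prompt("Who can apply for a visitor visa?", retrieved))
```

Because every retrieved chunk goes into the same prompt, the chunk size and the number of retrieved chunks together determine how much context the model receives.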



## Ask Question ##

In [11]:
test_question = "Who can apply for visitor visa in canada?"
answer = qa_chain.invoke(test_question)
print("Answer:", answer['result'])


Answer: Anyone who meets the requirements needed to travel to Canada can apply for a visitor visa. This includes individuals who want to visit Canada for business, tourism, or to visit family and friends. However, permanent residents of Canada must have a nonimmigrant visa. You may also need to give biometrics with your application. Additionally, up to 2 adults can be identified in your visa application.


In [12]:
test_question = "What is the visa requiremnts for the Citizens of Canada traveling to the United States?"
answer = qa_chain.invoke(test_question)
print("Answer:", answer['result'])


Answer: Citizens of Canada traveling to the United States generally do not require a nonimmigrant visa, except for certain travel purposes. However, Canadian citizens who are inadmissible to the United States under United States immigration law or have previously violated the terms of their immigration status in the United States may need a visa. Permanent residents (landed immigrants) of Canada must have a nonimmigrant visa.


In [13]:
test_question = "How to visit as an Academic in UK?"
answer = qa_chain.invoke(test_question)
print("Answer:", answer['result'])


Answer: As an academic, you can visit the UK for up to 6 months to do research. If you're doing a distance learning course, your course can last longer than 6 months as most of your study will happen outside of the UK. You can also visit to do certain activities as part of your course. Before visiting, you must prove certain conditions, although these conditions are not specified in the provided context. If you are an academic, but not a senior doctor or dentist, you must also prove you're visiting to do research or a formal exchange. If you're a senior doctor or dentist, you must also prove you're visiting to do research, clinical practice, a formal exchange or to teach. In some cases, you may need to get an Academic Technology Approval Scheme (ATAS) certificate before you start your course or research, especially if you're researching certain subjects at postgraduate level or above.


## Add a Gradio Interface ##

Once everything is set up and tested, it's time to make the assistant interactive. Here, we'll use Gradio to quickly create a user-friendly web interface that lets anyone ask visa-related questions in a textbox and receive an answer from the model in real time. This is the final step in turning your RAG pipeline into a usable application.

In [14]:
import gradio as gr

def travel_assistant(user_input):
    try:
        response = qa_chain.invoke(user_input)
        return response['result']
    except Exception as e:
        return f"Error: {e}"


gr.Interface(
    fn=travel_assistant,
    inputs=gr.Textbox(lines=2, placeholder="Ask your visa related question"),
    outputs=gr.Textbox(label="Answer"),
    title="Visa Assistant",
    description="Ask travel-related visa questions"
).launch(inline=True)


Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




✅ That’s it! You now have a working English-only visa assistant that scrapes real data, finds relevant context, and answers your questions with GPT-4.