# Multilingual Travel Visa Assistant 

In this project, we're building a Multilingual Travel Visa Assistant that answers visa-related questions for countries such as Canada, the US, and the UK, using content scraped from official government websites each time the notebook is run. Users can ask questions in their native language and receive answers in the same language. The assistant combines a custom web scraper, a vector search engine, a translation layer, and a language model into an interactive Retrieval-Augmented Generation (RAG) system.

The scraped visitor visa pages form the knowledge base. Their content is indexed as vector embeddings and retrieved at query time to give an LLM (GPT-4) relevant context, so it can answer user questions accurately and conversationally.

## Prerequisites ##

You need an OpenAI account to create an OpenAI API key.

## Setup Instructions ##

### Step 1. Set Up Environment (Mac/Linux) ###

```
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

### Step 2. Get OpenAI API Key ###

1. Go to: <https://platform.openai.com/account/api-keys>

2. Log in with your OpenAI account

3. Click “Create new secret key”

4. Copy and store the key securely

Note: GPT-4 is usage-billed. You can set hard limits on usage in your [billing dashboard](https://platform.openai.com/settings/organization/billing/overview).

### Step 3. Create .env File with API Keys ###

Create a .env file in your root project directory and add your API key like this:

```
OPENAI_API_KEY=your-openai-key
```

### Step 4. Launch the Notebook ###

```
jupyter notebook
```

Open and run `multilingual_visa_assistant.ipynb`.

## Import Required Libraries ##

We'll use LangChain for the RAG pipeline, FAISS for vector search, BeautifulSoup for scraping, langdetect for language detection, deep-translator for translation, and Gradio for a quick user interface.

In [1]:
import os
from dotenv import load_dotenv
import requests
import gradio as gr
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque
from langdetect import detect
from deep_translator import GoogleTranslator
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

In [2]:
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
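
If the key is missing, `os.getenv` quietly returns `None`, and the problem only surfaces later when the first OpenAI call is made. A minimal sketch of a fail-fast check follows. The helper name `require_env` and the demo variable `DEMO_API_KEY` are hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file and re-run load_dotenv()."
        )
    return value


# Demo with a throwaway variable so a real key is never touched or printed.
os.environ["DEMO_API_KEY"] = "demo-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same idea would be `OPENAI_API_KEY = require_env("OPENAI_API_KEY")` right after `load_dotenv()`.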

## Define Visa Websites ## 

We’ll index a small set of visa-related government URLs to create our knowledge base. These official pages will serve as the data sources from which we extract content. The cleaned and relevant text from these pages will be turned into vector embeddings and stored in a FAISS index. This will allow our assistant to dynamically retrieve and feed context to an LLM (GPT-4), enabling it to answer user questions with real and current information.

In [3]:
TRAVEL_URLS = {
    "canada": "https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html",
    "usa": "https://travel.state.gov/content/travel/en/us-visas/tourism-visit.html",
    "uk": "https://www.gov.uk/standard-visitor",
    # "australia": "https://immi.homeaffairs.gov.au/visas/getting-a-visa/visa-finder/visit",
    # "japan": "https://www.mofa.go.jp/j_info/visit/visa/index.html",
    # "germany": "https://www.germany.info/us-en/service/visa-entry",
    # "south_korea": "https://www.visa.go.kr/openPage.do?MENU_ID=10101",
    # "new_zealand": "https://www.immigration.govt.nz/new-zealand-visas",
    # "india": "https://indianvisaonline.gov.in/",
    # "singapore": "https://www.ica.gov.sg/enter-transit-depart/entering-singapore"
}

## Load Data ##

For this project, instead of using a built-in LangChain loader like ```WebBaseLoader```, we are going to build a custom scraper using BeautifulSoup. This gives us full control to follow internal links and limit how deep we go (to avoid crawling unrelated content). The scraper collects visible paragraph text from a set of visa-related government URLs and their subpages. Each page will be converted into a LangChain Document object, which stores both the text and some useful metadata like the URL and depth level.

At the end of the scraping process, we'll print a sample document to inspect its structure and content.

In [4]:
# requests, BeautifulSoup, urljoin, urlparse, deque, and Document are imported in the first cell.

def scrape_data(start_url, country, max_depth=1, max_pages=10):
    """Breadth-first crawl of same-domain pages; returns one Document per page that has paragraph text."""
    visited = set()
    to_visit = deque([(start_url, 0)])  # queue of (url, depth) pairs
    documents = []
    base_netloc = urlparse(start_url).netloc  # only follow links on the starting domain

    while to_visit and len(visited) < max_pages:
        current_url, level = to_visit.popleft()
        if current_url in visited or level > max_depth:
            continue

        try:
            response = requests.get(current_url, timeout=10)
            response.raise_for_status()  # treat HTTP errors (404, 500, ...) as failures
            soup = BeautifulSoup(response.text, "html.parser")

            # Keep only the text found inside <p> tags.
            paragraphs = soup.find_all("p")
            text = "\n".join(p.get_text(strip=True) for p in paragraphs)

            if text.strip():
                documents.append(
                    Document(
                        page_content=text,
                        metadata={"source": country, "url": current_url, "level": level}
                    )
                )
                print(f"Visiting: {current_url} (Level {level}) — {len(visited)+1}")

            visited.add(current_url)

            # Queue same-domain links one level deeper.
            for a_tag in soup.find_all("a", href=True):
                full_url = urljoin(current_url, a_tag["href"])
                if urlparse(full_url).netloc == base_netloc and full_url not in visited:
                    to_visit.append((full_url, level + 1))

        except Exception as e:
            print(f"Failed to visit {current_url}: {e}")

    return documents


## Scrape & Load Documents ##

We'll load at most 50 pages per starting URL, following links up to 2 levels deep. This keeps the data current but manageable.

In [5]:
MAX_DEPTH = 2
MAX_PAGES = 50

all_documents = []

for country, url in TRAVEL_URLS.items():
    print(f"Getting Data from {country.upper()}")
    docs = scrape_data(url, country, max_depth=MAX_DEPTH, max_pages=MAX_PAGES)
    print(f"{len(docs)} documents scraped for {country}")
    all_documents.extend(docs)

print(f"\nTotal documents: {len(all_documents)}")


Getting Data from CANADA
Visiting: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html (Level 0) — 1
Visiting: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html#wb-cont (Level 1) — 2
Visiting: https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html#wb-info (Level 1) — 3
Visiting: https://www.canada.ca/fr/immigration-refugies-citoyennete/services/visiter-canada.html (Level 1) — 4
Visiting: https://www.canada.ca/en.html (Level 1) — 5
Visiting: https://www.canada.ca/en/services/jobs.html (Level 1) — 6
Visiting: https://www.canada.ca/en/services/immigration-citizenship.html (Level 1) — 7
Visiting: https://www.canada.ca/en/services/business.html (Level 1) — 8
Visiting: https://www.canada.ca/en/services/benefits.html (Level 1) — 9
Visiting: https://www.canada.ca/en/services/health.html (Level 1) — 10
Visiting: https://www.canada.ca/en/services/taxes.html (Level 1) — 11
Visiting: https://www.canad
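
The log above shows two kinds of wasted fetches: fragment links such as `#wb-cont`, which point back to a page that was already scraped, and site-wide navigation pages (jobs, taxes, health) that have nothing to do with visas. One way to reduce both is to normalize and filter each link before queuing it. The sketch below is an illustration under assumptions, not part of the original notebook: `normalize_link` and its `path_prefix` rule are hypothetical, and a suitable prefix would have to be chosen per site.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(current_url: str, href: str, base_netloc: str,
                   path_prefix: str = "/") -> Optional[str]:
    """Resolve href against current_url, drop any #fragment, and keep only
    same-domain http(s) links whose path starts with path_prefix."""
    full_url, _fragment = urldefrag(urljoin(current_url, href))
    parsed = urlparse(full_url)
    if parsed.scheme not in ("http", "https"):
        return None  # skips mailto:, javascript:, tel:, ...
    if parsed.netloc != base_netloc:
        return None  # skips external sites
    if not parsed.path.startswith(path_prefix):
        return None  # skips unrelated sections of the same site
    return full_url


start = "https://www.canada.ca/en/immigration-refugees-citizenship/services/visit-canada.html"
prefix = "/en/immigration-refugees-citizenship/"  # assumed prefix for this example

print(normalize_link(start, "#wb-cont", "www.canada.ca", prefix))
print(normalize_link(start, "/en/services/taxes.html", "www.canada.ca", prefix))
```

The first call returns the start URL itself, which the crawler would then skip because it is already in `visited`. The second returns `None` because the path falls outside the assumed prefix.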

## Chunk Text ##

We need to split the scraped content into smaller, more manageable pieces called "chunks." LangChain’s ```RecursiveCharacterTextSplitter``` does this by attempting to break large blocks of text at logical boundaries using a prioritized set of characters:

- "\n\n" - a blank line (paragraph break)
- "\n" - a single newline
- " " - a space
- "" - the empty string, which splits between individual characters

It starts with the first option, "\n\n" and moves through the list if the resulting pieces are still too long. This method ensures that splits happen in the most natural way possible.
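
To make the fallback order concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it exists only to show how the splitter moves to a finer separator when a piece is still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, preferring coarse separators."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long, so retry with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, chunk_size=10))
# -> ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits in one chunk, so it is kept whole. The second is too long, so the sketch falls back to splitting on spaces.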

For this project, we'll use a chunk size of 500 characters with a 50-character overlap. This provides the model with enough context to form meaningful answers without overwhelming it. Larger chunks (like 1000 characters) led to noisy and confused results in earlier testing, so we settled on this size.
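
The effect of the overlap can be illustrated with a plain fixed-width window. This is a simplified sketch, assuming cuts at exact character positions (the real splitter prefers separator boundaries), and `sliding_chunks` is a hypothetical helper name.

```python
def sliding_chunks(text, chunk_size=500, overlap=50):
    """Cut text into fixed-width windows where each window repeats the last
    `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 1400-character dummy text: windows start at 0, 450, and 900.
dummy = "".join(chr(97 + i % 26) for i in range(1400))
pieces = sliding_chunks(dummy)
print(len(pieces))                         # 3
print(pieces[0][-50:] == pieces[1][:50])   # True
```

The shared 50 characters mean a sentence that straddles a boundary still appears intact in at least one chunk.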

In [6]:
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(all_documents)

## Embeddings and Vectorstore ##

Choosing the right embedding model is critical to how well the assistant can understand and retrieve relevant information. Since this project uses text scraped from visa-related government websites, we opted for Hugging Face’s ```sentence-transformers/all-MiniLM-L6-v2``` model via ```HuggingFaceEmbeddings``` to create vector representations of the content.

This model is fast, lightweight, and freely available, which makes it ideal for projects that need to run locally without incurring usage costs. We could have used OpenAI's ```text-embedding-ada-002``` via ```OpenAIEmbeddings```, but that requires API access and is billed per use. Open-source alternatives like GloVe, Word2Vec, or BERT could also be considered, but MiniLM offers a strong tradeoff between speed and quality for retrieval tasks.

In contrast, the English-only Travel Visa Assistant project uses OpenAIEmbeddings (```text-embedding-ada-002```) for vectorization. That choice was made to prioritize semantic quality and simplicity of integration in a mono-language setting where fewer translation dependencies exist. However, for the multilingual assistant, we prioritized cost efficiency and local execution due to additional resource usage introduced by translation and detection layers.

Here’s how the embeddings and FAISS index were set up:

In [7]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
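
Conceptually, the index stores one vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The toy sketch below illustrates that lookup with hand-made 2-dimensional vectors instead of the model's 384-dimensional ones. It is not FAISS: `top_k` is a hypothetical helper, and Euclidean distance is used here only for illustration, since the metric FAISS applies depends on how the index is configured.

```python
import math


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in sorted(distances)[:k]]


# Made-up vectors standing in for three embedded chunks.
doc_vecs = [
    [1.0, 0.0],   # chunk 0
    [0.9, 0.1],   # chunk 1, similar to chunk 0
    [0.0, 1.0],   # chunk 2, unrelated
]
query_vec = [1.0, 0.05]

print(top_k(query_vec, doc_vecs, k=2))  # [0, 1]
```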

## Set Up Model and Retriever ##

The next step is to turn our FAISS vectorstore into a retriever. This retriever will fetch relevant chunks of visa content to provide context for our generative model. The context helps the LLM generate more accurate and grounded responses.

In this project, we'll use OpenAI’s gpt-4 model through LangChain. It was selected for its strong performance in question-answering tasks, especially when combined with relevant context. Alternatives like local models were not chosen due to the memory and compute requirements.

Here’s how the retrieval and model setup looks:

In [8]:
llm = ChatOpenAI(model_name="gpt-4", openai_api_key=OPENAI_API_KEY)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
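
With `from_chain_type` left at its default, the chain uses the "stuff" strategy: the retrieved chunks are concatenated into a single prompt together with the question. The sketch below shows the general shape of that step. The wording of the template is an assumption for illustration and is not LangChain's actual prompt, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk text, made up for the example.
retrieved = [
    "Most travellers need a visitor visa or an eTA to travel to Canada.",
    "You can apply online or on paper.",
]
print(build_prompt("Who can apply for a visitor visa in Canada?", retrieved))
```

Because everything is placed in one prompt, the number and size of retrieved chunks directly affect how much context the model sees, which is one reason the chunk size chosen earlier matters.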

## Ask Question in English ##

In [9]:
test_question = "Who can apply for a visitor visa in Canada?"

try:
    result = qa_chain.invoke(test_question)
    print("Answer:", result['result'])
except Exception as e:
    print("Error:", e)


Answer: The context does not provide specific details on who can apply for a visitor visa in Canada. However, it does mention that you can only apply for a visitor visa from inside Canada if you meet all of the given criteria, which are not specified in the provided context. If you're outside Canada, you must follow the process to apply for a visa from outside Canada and there may be different requirements you need to meet.


## Translation Function ##

We use ```langdetect``` to automatically detect the language of the user’s question, and ```deep-translator``` (GoogleTranslator) to handle language translation. This allows users to interact with the assistant in a wide range of languages, while the system processes everything internally in English.

The ```translate()``` function wraps a single translation call between a source and a target language. Language detection and the round trip (question into English, answer back into the original language) are handled by the chatbot logic in the next section.

This is what the function looks like:

In [10]:
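def translate(text, source_lang="auto", target_lang="en"):
    """Translate text with GoogleTranslator; fall back to the untranslated text on failure."""
    try:
        return GoogleTranslator(source=source_lang, target=target_lang).translate(text)
    except Exception as e:
        # Log the failure and return the input unchanged so no error marker leaks into the query or answer.
        print(f"[Translation error] {e}")
        return text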
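
Translation goes over the network, so a call can fail for transient reasons. One option is to retry the raw translator call a few times before giving up. The sketch below is a generic pattern under assumptions: `with_retries` is a hypothetical helper, and `flaky_translate` is a stand-in that simulates two failures so the example runs without any network access.

```python
import time


def with_retries(func, attempts=3, delay_seconds=1.0):
    """Call func() up to `attempts` times, sleeping between failures.
    The last error is re-raised if every attempt fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


# Stand-in for a translator call that fails twice and then succeeds.
calls = {"count": 0}

def flaky_translate():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "Who can apply for a visitor visa?"


result = with_retries(flaky_translate, attempts=3, delay_seconds=0)
print(result, calls["count"])
```

In the notebook, the wrapped callable would be the `GoogleTranslator(...).translate(text)` call itself, placed inside the existing `try` block so the fallback still applies after the final attempt.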

## Multilingual Chatbot Logic ##

Here's how we integrate translation into the assistant. We use ```langdetect``` to automatically detect the language of the user's question and ```deep-translator``` (GoogleTranslator) to translate the question to English before sending it to GPT-4. The answer is then translated back into the user's original language.

This setup provides a clean abstraction for multilingual support while minimizing dependencies and keeping the translation logic modular.

In [11]:
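def detect_language(text):
    """Return the detected language code, defaulting to English if detection fails."""
    try:
        return detect(text)
    except Exception:  # e.g. empty input or text with no detectable language features
        return "en"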

def multilingual_travel_bot(user_input):
    try:
        lang = detect_language(user_input)
        query_en = translate(user_input, source_lang=lang, target_lang="en") if lang != "en" else user_input
        answer_en = qa_chain.invoke(query_en)['result']
        final_answer = translate(answer_en, source_lang="en", target_lang=lang) if lang != "en" else answer_en
        return final_answer
    except Exception as e:
        return f"Error: {e}"
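
Because the bot calls OpenAI and Google Translate, every test run costs time and money. A possible way to check the routing logic on its own is to pass the three steps in as functions and substitute stubs. This is a sketch of that pattern, not the notebook's code: `answer_in_user_language` and the `fake_*` stubs are hypothetical.

```python
def answer_in_user_language(user_input, detect_fn, translate_fn, answer_fn):
    """Detect the language, answer in English, and translate back when needed."""
    lang = detect_fn(user_input)
    if lang == "en":
        return answer_fn(user_input)  # no translation round trip for English
    query_en = translate_fn(user_input, lang, "en")
    answer_en = answer_fn(query_en)
    return translate_fn(answer_en, "en", lang)


# Stubs that make each step visible in the output.
def fake_detect(text):
    return "es" if text.startswith("¿") else "en"

def fake_translate(text, source_lang, target_lang):
    return f"[{source_lang}->{target_lang}] {text}"

def fake_answer(question):
    return f"answer to: {question}"


print(answer_in_user_language("¿Hola?", fake_detect, fake_translate, fake_answer))
print(answer_in_user_language("Hello?", fake_detect, fake_translate, fake_answer))
```

The Spanish input passes through both translation steps, while the English input goes straight to the answering function.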

## Test Translation ##

In [12]:
def test_multilingual(user_input):
    try:
        detected_lang = detect_language(user_input)
        print(f"\nOriginal Question in {detected_lang}: {user_input}")

        if detected_lang != "en":
            translated_en = translate(user_input, source_lang=detected_lang, target_lang="en")
            print(f"\nTranslated to English: {translated_en}")
        else:
            translated_en = user_input
            print("Already in English")

       
        answer_en = qa_chain.invoke(translated_en)['result']
        print(f"\nAnswer in English: {answer_en}")

        if detected_lang != "en":
            translated_back = translate(answer_en, source_lang="en", target_lang=detected_lang)
            print(f"\nAnswer translated back to {detected_lang}: {translated_back}")
        else:
            print("\nUser asked in English; no translation needed.")

    except Exception as e:
        print(f"Error: {e}")


In [13]:
test_multilingual("¿Quién puede solicitar una visa de visitante para Canadá?")


Original Question in es: ¿Quién puede solicitar una visa de visitante para Canadá?

Translated to English: Who can request a visitor visa for Canada?

Answer in English: Anyone who meets the requirements for travel to Canada can request a visitor visa. This can include individuals who wish to visit, work, or study in the country. However, the specific requirements may vary based on the individual's home country and other factors. To determine eligibility, one would need to answer a few questions on the official Government of Canada website.

Answer translated back to es: Cualquier persona que cumpla con los requisitos para viajar a Canadá puede solicitar una visa de visitante. Esto puede incluir personas que deseen visitar, trabajar o estudiar en el país. Sin embargo, los requisitos específicos pueden variar según el país de origen del individuo y otros factores. Para determinar la elegibilidad, uno necesitaría responder algunas preguntas en el sitio web oficial del Gobierno de Canadá

In [14]:
test_multilingual("Quel sont les exigences de visa pour les citoyens canadiens voyageant aux États-Unis ?")


Original Question in fr: Quel sont les exigences de visa pour les citoyens canadiens voyageant aux États-Unis ?

Translated to English: What are the visa requirements for Canadian citizens traveling to the United States?

Answer in English: Canadian citizens generally do not require visas to enter the United States for visit, tourism, and temporary business travel purposes. However, they do require nonimmigrant visas for specific purposes such as acting as foreign government officials, officials and employees of international organizations, NATO officials, representatives, and employees, treaty traders, treaty investors, or if they are the spouse or child of an Australian Treaty Alien coming to the United States solely to perform. Additionally, Canadian citizens who are inadmissible to the United States under United States immigration law or have previously violated the terms of their immigration status in the United States, have to apply for a visa.

Answer translated back to fr: Les

In [15]:
test_multilingual("यूके में एक शैक्षणिक के रूप में कैसे यात्रा करें?")


Original Question in hi: यूके में एक शैक्षणिक के रूप में कैसे यात्रा करें?

Translated to English: How to travel as an educational in the UK?

Answer in English: If you're planning to travel to the UK for educational purposes, you can do so as a Standard Visitor. You are allowed to study courses up to 6 months. If you're planning to study or research certain subjects at postgraduate level or above, you may need to get an Academic Technology Approval Scheme (ATAS) certificate before you start your course or research. Please note that the accredited UK institution cannot be an academy or state-funded school.

Answer translated back to hi: यदि आप शैक्षिक उद्देश्यों के लिए यूके की यात्रा करने की योजना बना रहे हैं, तो आप एक मानक आगंतुक के रूप में ऐसा कर सकते हैं। आपको 6 महीने तक के पाठ्यक्रमों का अध्ययन करने की अनुमति है। यदि आप स्नातकोत्तर स्तर या उससे अधिक पर कुछ विषयों का अध्ययन या शोध करने की योजना बना रहे हैं, तो आपको अपना पाठ्यक्रम या अनुसंधान शुरू करने से पहले एक शैक्षणिक प्रौद्योगि

## Add a Gradio Interface ##

Once everything is set up and tested, it's time to make the assistant interactive. Here, we'll use Gradio to quickly create a user-friendly web interface that lets anyone ask visa-related questions in a textbox and receive an answer from the model in real time. This is the final step in turning your RAG pipeline into a usable application.

In [16]:
gr.Interface(
    fn=multilingual_travel_bot,
    inputs=gr.Textbox(lines=2, placeholder="Ask your visa question in any language"),
    outputs=gr.Textbox(label="Answer"),
    title="Multilingual Travel Visa Assistant",
    description="Ask travel visa questions in any language"
).launch(share=True)


Running on local URL:  http://127.0.0.1:7860


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Running on public URL: https://456f5ee2b4a47bea8e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
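
The `huggingface/tokenizers` warning in the output above is harmless, and the message itself names the fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the line is placed in the first notebook cell so it runs before the embedding model is used:

```python
import os

# Set before the embedding model is first used to silence the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```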


