# Watershed Navigator: A RAG-Based AI Assistant for Environmental Analysis

## Introduction
Watershed Navigator is a locally hosted AI assistant that uses Retrieval-Augmented Generation (RAG) to answer environmental questions from relevant document context. Developed as an experimental integration for LimnoTech, the project explores practical ways AI could enhance LimnoTech's environmental solutions.

**Real-World Goal:** Explore how AI could be useful within a company such as LimnoTech, and gain experience ahead of my internship with them in the summer of 2025.

**Technical Goal:** Understand how to locally host an LLM and learn the process of RAG.


---

## Dataset Description

The dataset consists of watershed-related PDF documents relevant to LimnoTech’s operations.


- **Source & Types of Documents:**  
  EPA reports, SEMCOG documents, internal papers, slide decks

- **Number and Volume:**  
  ~20 PDFs

- **Rationale for Selection:**  
  These documents relate to the work LimnoTech does and may help address clients' concerns

### Dataset Strengths and Limitations
- **Strengths:**  
  Authoritative sources, specifically related to LimnoTech

- **Limitations:**  
  Narrow dataset, not researched in depth (time constraints)

--- 

## Technical Approach

Watershed Navigator employs a RAG approach combining semantic retrieval and generative AI.

### Technologies Used
- **UI**: Streamlit  
- **Embeddings**: SentenceTransformers MiniLM  
- **Retrieval**: Cosine similarity search  
- **Generative Model**: TinyLLaMA (hosted locally via Ollama)
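As a rough illustration of what "hosted locally via Ollama" can look like in code, the sketch below sends a prompt to Ollama's HTTP generate endpoint using only the standard library. It assumes Ollama is running on its default port and that the model was pulled under the tag `tinyllama`. The helper names `build_payload` and `generate` are hypothetical, and the project itself may have used a different client.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port, and the model
# was pulled under the tag "tinyllama".
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "tinyllama") -> bytes:
    """Encode a non-streaming generation request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(prompt: str, model: str = "tinyllama", timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("What is a watershed?")` would return the model's answer as a plain string.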

### Technological Workflow
- **Document Ingestion**: Loading, text chunking, embedding generation
- **Retrieval**: Query embedding, cosine similarity scoring, thresholding
- **Answer Generation**: Prompt construction/engineering, local generation (TinyLLaMA)
- **UI/UX**: Flow of interaction, irrelevant query handling
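The retrieval step in this workflow can be sketched in a few lines. The example below scores each chunk against the query with cosine similarity, drops anything below a threshold, and returns the best matches. Toy 3-dimensional vectors stand in for real MiniLM embeddings, and the `top_k=3` and `threshold=0.3` defaults are illustrative assumptions, not the values used in the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunk_vecs, chunks, top_k=3, threshold=0.3):
    """Return up to top_k (score, chunk) pairs scoring at or above threshold."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunks)
    ]
    relevant = [pair for pair in scored if pair[0] >= threshold]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return relevant[:top_k]


# Toy vectors standing in for real embeddings (illustrative only).
chunks = ["nutrient loading in rivers", "stormwater runoff controls", "office parking policy"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]

hits = retrieve([1.0, 0.0, 0.0], chunk_vecs, chunks)
off_topic = retrieve([0.0, 0.0, -1.0], chunk_vecs, chunks)
```

When nothing clears the threshold, as with `off_topic` here, `retrieve` returns an empty list. The UI can use that as the signal to treat the query as irrelevant instead of passing weak context to the model.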

---

## Modeling Setup, Validation & Improvement

- **Inputs**: User-entered questions that relate to watersheds and environmental issues
- **Outputs**: Contextually accurate, informative answers generated by TinyLLaMA from the retrieved context supplied in the prompt

### Validation
- **Evaluation Metrics**: Qualitative manual evaluation/inspection, similarity scores

### Attempted Improvements
- Embedding model adjustments, prompt engineering changes, chunk-size tuning
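To make chunk-size tuning concrete, here is a minimal sketch of a chunker with a configurable size and overlap. It assumes word-based chunks with overlapping windows. The project does not state how its chunks were defined, so `chunk_text` and its defaults are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Ten placeholder words make the effect of the two settings easy to see.
sample = " ".join(f"w{i}" for i in range(10))
small = chunk_text(sample, chunk_size=4, overlap=1)
large = chunk_text(sample, chunk_size=8, overlap=2)
```

Smaller chunks give the retriever more precise targets but less surrounding context per hit. Larger chunks carry more context but dilute the embedding, which is the trade-off that tuning explores.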

### Outcomes of Adjustments
- Better accuracy, improved contextual coherence, reduced irrelevant outputs and hallucinations

---

## Alternative Approaches

### Hugging Face API Integration

- Integrated Hugging Face's inference API to leverage cloud-based generative models as an alternative to local hosting.
- Achieved initially promising results with improved model capabilities, but encountered API rate-limit constraints

### Experimentation with Larger Models

- Experimented with hosting larger LLMs locally, but quickly ran out of disk space
- Larger models produced marginally higher-quality responses but required significantly greater computational resources

### Vector Database

- Attempted to use ChromaDB for embedding retrieval because of its ease of initial setup
- It proved hard to integrate and get working properly
- FAISS significantly improved retrieval speed, especially as the dataset grew, providing quicker responses and improved scalability.

### Prompt Engineering

- Tested different prompt formats and lengths to optimize generation
- Short, concise prompts improved response coherence but occasionally lacked necessary details.
- Extensive context-rich prompts improved accuracy for complex questions but sometimes confused the model
- Ultimately settled on a prompt that balances brevity with enough context
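One way to balance short and context-rich prompts is to cap how much retrieved text goes into the prompt. The sketch below assumes a simple character budget and an instruction-first template. The wording of the template, the `build_prompt` name, and the `max_context_chars` default are all illustrative, not the prompt the project settled on.

```python
def build_prompt(question, context_chunks, max_context_chars=1500):
    """Assemble a grounded prompt, trimming context to a fixed character budget."""
    selected, used = [], 0
    for chunk in context_chunks:
        if used + len(chunk) > max_context_chars:
            break  # stop before the context grows past the budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected) if selected else "(no relevant context found)"
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# With a 40-character budget only the first chunk fits.
prompt = build_prompt(
    "What drives nutrient loading?",
    ["Chunk A about nutrient loading.", "Chunk B about runoff."],
    max_context_chars=40,
)
```

Raising or lowering `max_context_chars` is a single knob for moving between the concise and context-rich ends of the range described above.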

---

## Future Direction



### Expand Document Dataset
- Add more documents to the dataset
- Add more qualified information from LimnoTech themselves

### Use Larger Model
- Experiment with quantized versions of larger open-source models, and switch to calling an API instead of hosting the model locally

### Add Evaluation Framework
- Introduce a structured evaluation protocol using domain-specific benchmarks or user surveys to measure relevance, factuality, and clarity.
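A structured protocol could start as simply as averaging reviewer scores per criterion. The sketch below assumes each answer is rated from 1 to 5 on relevance, factuality, and clarity. The rating scale, the `summarize_ratings` helper, and the sample ratings are placeholders for illustration, not results from this project.

```python
from statistics import mean

CRITERIA = ("relevance", "factuality", "clarity")


def summarize_ratings(ratings):
    """Average 1-5 scores per criterion across a list of rating dicts."""
    if not ratings:
        return {}
    return {name: mean(r[name] for r in ratings) for name in CRITERIA}


# Placeholder ratings for two hypothetical questions (not real results).
ratings = [
    {"question": "q1", "relevance": 4, "factuality": 3, "clarity": 5},
    {"question": "q2", "relevance": 2, "factuality": 3, "clarity": 4},
]
summary = summarize_ratings(ratings)
```

Tracking these averages across changes to the embedding model, chunk size, or prompt would turn the current manual inspection into a repeatable comparison.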

### User Feedback
- Allow end-users to rate answers or flag information as incorrect to improve the model's performance over time


--- 

## Conclusion

To summarize, this AI assistant demonstrates one way to integrate AI into an environmental consulting business. By combining local document retrieval with a lightweight generative model, the assistant can answer specialized questions using retrieved context. It highlights how RAG can ground a small model's answers in domain-specific documents.

This project laid the groundwork for future integration within LimnoTech and helped me understand local LLM hosting and the RAG process before heading into my internship. It can serve as a blueprint for AI-powered applications and integration within environmental consulting work.

