# Project Planning

When building full-stack applications, I begin by mapping out a comprehensive overview of each feature to be developed. This holistic, forward-thinking approach helps me anticipate challenges and avoid costly refactoring or rewrites that can result from incomplete planning—a lesson learned from past experience.

For this project, I closely followed the provided outline and specifications, shaping my plan around the following core goals:
- use the right tool for the job
- keep solutions as simple as possible
- prioritize developer experience, maintainability, and scalability

### Main Phases
1. LLM & Ollama Setup
2. Database Setup
3. Backend Setup
4. Frontend Setup
5. Testing and Deployment

### Integration with Cursor

To accelerate development and streamline problem-solving, I leveraged various AI agents through [Cursor](https://www.cursor.com/) and [ChatGPT/Claude](https://t3.chat/). Whenever I encountered challenges—such as refactoring, setting up boilerplate, or making architectural decisions—I consulted these tools. This approach not only helped me overcome obstacles efficiently but also enhanced my skills as a developer, as I used AI as a collaborative assistant rather than a crutch.

# Deployment via Docker

Right from the start, I planned to use Docker because it makes running and sharing apps much easier. I already had some experience with Docker, but I knew that working with several services at once and setting up their network connections could be tricky. Before this project, I hadn’t set up a Docker network in so much detail.

To make things easier, I gave each service a fixed IP address and wrote these into the `docker-compose.yml` file. Doing this early on made it much simpler to connect everything together later. When it was time to link the services, it all worked smoothly.
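A fixed-address plan like this can be sanity-checked with a few lines of Python before it goes into `docker-compose.yml`. The sketch below is only illustrative: the subnet `172.28.0.0/16` and every service name and address other than the database's `172.28.0.30` are assumptions, not values taken from the real compose file.

```python
import ipaddress

# Assumed subnet for the compose network; the real value lives in docker-compose.yml.
SUBNET = ipaddress.ip_network("172.28.0.0/16")

# Hypothetical address plan. Only the database address appears in this write-up.
SERVICE_IPS = {
    "ollama": "172.28.0.10",
    "backend": "172.28.0.20",
    "database": "172.28.0.30",
}


def validate_plan(plan: dict[str, str], subnet: ipaddress.IPv4Network) -> list[str]:
    """Return a list of problems: addresses outside the subnet or used twice."""
    problems = []
    seen: dict[str, str] = {}
    for service, raw in plan.items():
        address = ipaddress.ip_address(raw)
        if address not in subnet:
            problems.append(f"{service}: {raw} is outside {subnet}")
        if raw in seen:
            problems.append(f"{service}: {raw} is already used by {seen[raw]}")
        seen.setdefault(raw, service)
    return problems


if __name__ == "__main__":
    print(validate_plan(SERVICE_IPS, SUBNET) or "address plan looks consistent")
```

Catching a duplicate or out-of-range address here is cheaper than debugging a container that fails to join the network.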

I also chose to use `docker compose` instead of running each container with `docker run` commands. I find `docker compose` much easier to work with, especially when you need to set up different settings or restart services often.

Giving each service a clear, descriptive name also made it easier to reference them in Docker commands.

Overall, planning ahead with Docker saved me time and made the whole process go more smoothly.

# Base LLM: Gemma 3

At first, it wasn’t clear what the main use of the chat app would be, so I didn’t have a specific goal when picking the language model. Because of this, I chose a model that is good at general knowledge, reasoning, and coding. I also wanted a model that is fast and doesn’t take up too much space, since this makes development and testing easier. I avoided models that focus only on reasoning, since they can be slower to start, and speed was more important for this project.

To help me decide, I checked benchmark leaderboards like [WebDevArena](https://web.lmarena.ai/leaderboard) and [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#). After looking at the options, I picked `gemma3:1b`, which is a 1-billion-parameter model. It’s new, performs well, and is small enough to run easily on most computers. I also like Google’s `Gemini 2.5 Pro` model, so I thought its open-weights sibling would be a good fit too.

You can read more about Gemma 3 in this [blog post](https://blog.google/technology/developers/gemma-3/).

Later, I added the option to load more than one model in the app, so I could compare them easily via the UI. I included `llama3.2:1b` and `qwen2.5-coder:1.5b` because they are similar in size and performance.

# LLM Provider: Ollama

For running the models, I chose Ollama right away. I already use Ollama for my own projects, so I know how it works and how to set it up. It’s also popular in the developer community, so there are lots of guides and help available on places like Stack Overflow and GitHub. Ollama works on many operating systems and is easy to run with Docker.

I followed this guide to set up Ollama with Docker: [Ollama Official Docker Image](https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image).

Since I already had Ollama running on my computer, I made sure to save the `.ollama` data in a separate volume and used a different port for this project. This way, it wouldn’t interfere with my other work.

Additionally, I set the `OLLAMA_KEEP_ALIVE=24h` environment variable in the compose setup to keep models loaded in memory instead of having them unloaded after a period of inactivity. This was useful when testing different models at different times. Fortunately, my GPU had enough VRAM to hold all of the tested models at the same time.

# Postgres Database

The project instructions said to use Postgres, so I used this SQL database to handle the app’s data. Even if it hadn’t been required, I would have picked Postgres anyway because I’m familiar with it. Postgres is popular, has lots of resources and plugins, and is widely used in real-world projects.

To connect my app to Postgres, I used the [langchain-postgres](https://python.langchain.com/docs/integrations/memory/postgres_chat_message_history/) library. This library made it easy to set up the database tables for storing chat message history.

However, `langchain-postgres` only saves the chat messages. I wanted more control, so I added my own table to keep track of different chat sessions. This lets me store things like session titles and link them to each user by their username. Later, I also added some sample data using a `database/seed.sql` file, so the table wouldn’t be empty during initial testing.

Here’s the code I used to create the sessions table. It uses `psycopg` to run raw SQL:

```python
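from psycopg import Connection


def create_db_sessions_table(conn: Connection) -> None:
    """Create the table that tracks chat sessions, if it does not exist yet."""
    with conn.cursor() as cur:
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS db_sessions (
                id UUID PRIMARY KEY,
                username VARCHAR(255) NOT NULL,
                title VARCHAR(255) NOT NULL,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
            """
        )
        # Commit so the table is persisted when autocommit is off.
        conn.commit()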
```
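As a rough illustration of how this table might be used, the sketch below inserts a session and lists a user's sessions. The function names `insert_session` and `list_sessions` are hypothetical and not part of the project code. The connection is typed loosely so the example stays self-contained, and it is assumed to behave like a `psycopg` (v3) connection.

```python
import uuid
from typing import Any


def insert_session(conn: Any, username: str, title: str) -> uuid.UUID:
    """Insert one row into db_sessions and return its generated id.

    Hypothetical helper: `conn` is assumed to behave like a psycopg (v3) connection.
    """
    session_id = uuid.uuid4()
    with conn.cursor() as cur:
        # Placeholders let the driver handle quoting, which avoids SQL injection.
        cur.execute(
            "INSERT INTO db_sessions (id, username, title) VALUES (%s, %s, %s)",
            (session_id, username, title),
        )
    conn.commit()
    return session_id


def list_sessions(conn: Any, username: str) -> list[tuple]:
    """Return (id, title, created_at) rows for one user, newest first."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, title, created_at FROM db_sessions "
            "WHERE username = %s ORDER BY created_at DESC",
            (username,),
        )
        return cur.fetchall()
```

Generating the UUID in Python rather than in SQL means the caller knows the session id immediately, which is convenient when the same id is then used as the chat history key.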

### Postgres Deployment via Docker

Setting up Postgres with Docker can be a bit tricky because you need to be clear about things like the database username and password. Just like with Ollama, I made sure to use a unique port and a separate data volume for this project. This way, it wouldn’t interfere with any other Postgres instances I have running on my computer.

I followed this guide for setting up Postgres with Docker: [How to use the Postgres Docker Official Image](https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/).

Here’s the part of my `docker-compose.yml` file that sets up the database:

```yaml
database:
  image: postgres:16
  restart: always
  container_name: bd_database
  ports:
    - "${POSTGRES_PORT:-5432}:5432"
  environment:
    POSTGRES_USER: ${POSTGRES_USER}
    POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    POSTGRES_DB: ${POSTGRES_DB}
  networks:
    bd_network:
      ipv4_address: "172.28.0.30"
  volumes:
    - bd_pgdata:/var/lib/postgresql/data
    - ./database/seed.sql:/docker-entrypoint-initdb.d/seed.sql
```

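> Note that I mounted the seed script into the container's `/docker-entrypoint-initdb.d/` directory so that it runs when the database is first initialized. The official image only runs these scripts when the data directory is empty, so the seed is not reapplied on later restarts.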

# Backend Setup

Given the tools and libraries the project called for, `Python` was the clear choice for the backend service. I already use Python a lot and have experience with both small and large projects. For this one, I kept things simple and used as few modules and functions as possible while still following good coding practices.

However, I decided to use [uv](https://docs.astral.sh/uv/) to set up the backend rather than something like pip. This Rust-based package and project manager offers a very good developer experience for managing the Python environment and interpreter, as well as keeping dependencies in sync. Essentially, it is a `pnpm` for Python!

# Langchain Backend

When working with LLMs, I have mostly used services that follow the [OpenAI API Spec](https://openai.com/index/openai-api/), since they have many SDKs and plenty of resources on how to use them. I hadn't used Langchain before this project, but after some research and hands-on experience, I appreciate how much orchestration it handles for you and how simple the setup is, as well as the community libraries that make bootstrapping applications much quicker and easier.

For this, I had three main problems I needed to solve.

### Langchain Integration with Ollama
Fortunately, there is already a package for this: [langchain-ollama](https://python.langchain.com/docs/integrations/llms/ollama/). I used the provided boilerplate to interface Langchain with Ollama:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

template = """Question: {question}

Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)

model = OllamaLLM(model="llama3.1")

chain = prompt | model

chain.invoke({"question": "What is LangChain?"})
```

From there, setting up the system prompts and the chat workflows became easy, with `langchain-postgres` storing the history in a way that distinguishes each `HumanMessage` from each `AIMessage`.

### Streaming Chat Completions

Streaming responses from Ollama via Langchain was one of my first hurdles in the project. I hadn't done this before, so I did not know how to set up streaming. I understood streaming from a network point of view, but not how to do it in Python. This was also one of the areas where ChatGPT faltered: it did not set streaming up correctly, and there were many conflicting approaches across the different Langchain libraries that have evolved over the years. Fortunately, I found [this GitHub issue](https://github.com/langchain-ai/langchain/issues/13333) from 2023 that taught me how to implement streaming:

```python
    model_with_streaming = OllamaLLM(
        model=request.model,
        base_url=os.getenv("OLLAMA_BASE_URL"),
        streaming=True,
    )

    full_response = ""

    async def stream_response():
        nonlocal full_response
        async for token in model_with_streaming.astream(messages):
            full_response += token
            yield token

```
It turns out you need to enable streaming explicitly on the `OllamaLLM` class and use the `astream` method. For a while, I was stuck trying to implement older approaches such as `AsyncIteratorCallbackHandler` or `StreamingStdOutCallbackHandler`.
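The core pattern, forwarding each token while also accumulating the full reply, can be tried without Ollama or Langchain at all. The sketch below uses only the standard library. `fake_astream` is a hypothetical stand-in for `model.astream(...)`, and `collect_stream` is an illustrative name, not project code.

```python
import asyncio
from typing import AsyncIterator


async def fake_astream(text: str) -> AsyncIterator[str]:
    """Stand-in for `model.astream(...)`: yields one word-sized chunk at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop, like real I/O
        yield word + " "


async def collect_stream(source: AsyncIterator[str]) -> tuple[list[str], str]:
    """Consume a token stream, forwarding each token while accumulating the full text."""
    forwarded: list[str] = []
    full_response = ""
    async for token in source:
        forwarded.append(token)  # in the real endpoint this step is `yield token`
        full_response += token   # kept so the complete reply can be stored later
    return forwarded, full_response


if __name__ == "__main__":
    tokens, full = asyncio.run(collect_stream(fake_astream("streaming works as expected")))
    print(tokens)
    print(repr(full))
```

Because the source is an async iterator, the event loop stays free to serve other requests while the model is still generating.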

# FastAPI Backend

For managing the backend, I decided to expose all of these features via a REST HTTP server through [FastAPI](https://fastapi.tiangolo.com/). I like it because setting up a web server with it is very easy, there are plenty of references, and the batteries are already included.

I only had to set up the protocol through which the frontend would later interact with the backend for the different app operations:

```python
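import os

from fastapi.responses import StreamingResponse
from langchain_core.messages import HumanMessage
from langchain_ollama.llms import OllamaLLM
from langchain_postgres import PostgresChatMessageHistory
from pydantic import BaseModel

# `app`, `table_name`, `sync_connection`, `generate_message_id`, and `messages`
# are defined in code that is not shown in this excerpt.


class ChatRequest(BaseModel):
    name: str = "User"
    session_id: str
    content: str
    model: str = "gemma3:1b"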

@app.post("/stream")
async def chat(request: ChatRequest):
    ## input validation & session handling

    ## chat completion via langchain
    chat_history = PostgresChatMessageHistory(
        table_name, request.session_id, sync_connection=sync_connection
    )

    new_usr_msg = HumanMessage(
        content=request.content, id=generate_message_id(), name=request.name
    )
    
    model_with_streaming = OllamaLLM(
        model=request.model,
        base_url=os.getenv("OLLAMA_BASE_URL"),
        streaming=True,
    )

    # RESPONSE STREAMING
    full_response = ""

    async def stream_response():
        nonlocal full_response
        async for token in model_with_streaming.astream(messages):
            full_response += token
            yield token

    ## streaming response
    return StreamingResponse(
        stream_response(), media_type="text/plain; charset=utf-8"
    )

```
Fortunately, FastAPI ships with `StreamingResponse`, a built-in helper for streamed responses, which made the setup much easier to work with.
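On the consuming side, a client has to decode the `text/plain; charset=utf-8` body incrementally, because a multi-byte character can be split across two network chunks. The following is a minimal standard-library sketch of that idea. `decode_chunks` and `stream_chat` are hypothetical names, and the base URL passed to `stream_chat` is whatever host and port the backend happens to run on.

```python
import codecs
import json
import urllib.request
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes]) -> Iterator[str]:
    """Decode UTF-8 byte chunks incrementally.

    A multi-byte character can be split across two chunks, so each chunk
    cannot safely be decoded on its own.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if the stream ended mid-character
    if tail:
        yield tail


def stream_chat(base_url: str, payload: dict, chunk_size: int = 64) -> Iterator[str]:
    """POST to the /stream endpoint and yield text as it arrives (hypothetical client)."""
    request = urllib.request.Request(
        f"{base_url}/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # read1() returns the bytes currently available (up to chunk_size);
        # iter(callable, sentinel) stops once it returns b"" at the end of the body.
        yield from decode_chunks(iter(lambda: response.read1(chunk_size), b""))
```

The payload would carry the same fields as `ChatRequest` (`name`, `session_id`, `content`, `model`), and each yielded string can be printed or appended to the UI as soon as it arrives.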

### Background Task for Database Operation post-Streaming
To sync the message state with the database while still returning the streamed chat completions to the client as soon as possible, I needed to store the full response after streaming finished. I did this with FastAPI's `BackgroundTasks`:
```python
async def store_messages():
    new_ai_msg = AIMessage(
        content=full_response, id=generate_message_id(), name="Assistant"
    )
    chat_history.add_messages([new_usr_msg, new_ai_msg])

background_tasks.add_task(store_messages)
```
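This works because FastAPI runs background tasks only after the response has been sent, so by the time `store_messages` executes, `stream_response` has already filled in `full_response`. The ordering can be mimicked with the standard library alone. In the sketch below, `send_then_run` and `demo` are hypothetical names, and the two lists stand in for the HTTP connection and the chat history table.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def send_then_run(
    body: AsyncIterator[str],
    send: Callable[[str], None],
    background: Callable[[], Awaitable[None]],
) -> None:
    """Mimic the ordering: send every chunk first, then run the background task."""
    async for chunk in body:
        send(chunk)
    await background()


async def demo() -> tuple[list[str], list[str]]:
    sent: list[str] = []    # stands in for the HTTP connection
    stored: list[str] = []  # stands in for the chat history table
    full_response = ""

    async def stream_response() -> AsyncIterator[str]:
        nonlocal full_response
        for token in ["Hello", ", ", "world"]:
            full_response += token
            yield token

    async def store_messages() -> None:
        # Runs only after the stream is exhausted, so full_response is complete.
        stored.append(full_response)

    await send_then_run(stream_response(), sent.append, store_messages)
    return sent, stored


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The client sees tokens as they are produced, and the database write never delays the first byte of the response.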


# Testing the API via Insomnia

Before building the frontend, I used [Insomnia](https://insomnia.rest/) to manually test the FastAPI endpoints and the Ollama integration. This allowed me to quickly verify that the backend was working as expected and that the chat completions were being streamed correctly.

Here are some examples which I exported as `curl` commands:

Listing the models available in Ollama to confirm they were installed correctly:
```bash
curl --request GET \
  --url 'http://localhost:11435/api/tags' \
  --header 'Content-Type: application/json' \
  --header 'User-Agent: insomnia/11.2.0'
```
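The same check can be scripted in Python so it is repeatable. This is a sketch under assumptions: it relies on the `/api/tags` response having the shape `{"models": [{"name": ...}, ...]}`, and `missing_models` and `fetch_tags` are hypothetical helper names. The model names and port are the ones mentioned in this write-up.

```python
import json
import urllib.request

# Models named in this write-up; the port matches the curl command above.
EXPECTED_MODELS = {"gemma3:1b", "llama3.2:1b", "qwen2.5-coder:1.5b"}
OLLAMA_URL = "http://localhost:11435"


def missing_models(tags_payload: dict, expected: set[str]) -> set[str]:
    """Return the expected model names that are absent from an /api/tags response."""
    available = {entry["name"] for entry in tags_payload.get("models", [])}
    return expected - available


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """GET /api/tags and parse the JSON body (requires a running Ollama instance)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)
```

Calling `missing_models(fetch_tags(), EXPECTED_MODELS)` against a running instance returns an empty set when every expected model is present.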




# React Frontend

# Testing

# Deployment

# Documentation