# Data Formulator with Ollama: Complete Setup and Usage Guide

This notebook provides a comprehensive guide to setting up and using Data Formulator with Ollama for free, local data analysis and visualization.

## Table of Contents
1. [Overview](#Overview)
2. [Prerequisites](#prerequisites)
3. [Installation](#installation)
4. [Configuration](#configuration)
5. [Running Data Formulator in Jupyter](#running)
6. [Example Analysis](#example)
7. [Troubleshooting](#troubleshooting)
8. [Best Practices](#best-practices)

---

<h2 id="Overview">1. Overview</h2>

**Data Formulator** is a tool that bridges natural language and data visualization: you describe what you want to analyze, and it generates appropriate visualizations automatically.

**Ollama** is a local AI inference engine that lets you run large language models on your machine for free.

### System Requirements
- **RAM**: 8GB minimum, 16GB recommended
- **Storage**: 10GB+ free space
- **OS**: macOS, Linux, or Windows
- **Python**: 3.10+
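
The Python and storage requirements above can be checked from a cell. This is a minimal sketch using only the standard library; the function name `check_requirements` and its thresholds are illustrative, and RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys

def check_requirements(path=".", min_python=(3, 10), min_free_gb=10):
    """Return {requirement: bool} for the Python version and free disk space at `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python": sys.version_info[:2] >= min_python,
        "disk": free_gb >= min_free_gb,
    }

for name, ok in check_requirements().items():
    print(f"{name}: {'OK' if ok else 'below requirement'}")
```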

---

<h2 id="prerequisites">2. Prerequisites</h2>


Before we begin, let's check if you have the necessary tools installed.

In [1]:
import sys
import subprocess
import os
from pathlib import Path

print(f"Python version: {sys.version}")
print(f"Current working directory: {os.getcwd()}")

# Check if we're in a conda environment
if 'CONDA_DEFAULT_ENV' in os.environ:
    print(f"Conda environment: {os.environ['CONDA_DEFAULT_ENV']}")
else:
    print("Not in a conda environment")

Python version: 3.10.16 (main, Dec 11 2024, 10:24:41) [Clang 14.0.6 ]
Current working directory: /Users/soniacq/Urban/OSCUR-experiments
Conda environment: profilers


<h2 id="installation">3. Installation</h2>

### Step 3.1: Install Ollama

First, let's check if Ollama is already installed:

In [2]:
# Check if Ollama is installed
try:
    result = subprocess.run(['ollama', '--version'], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"✅ Ollama is installed: {result.stdout.strip()}")
        ollama_installed = True
    else:
        print("❌ Ollama not found")
        ollama_installed = False
except FileNotFoundError:
    print("❌ Ollama not found")
    ollama_installed = False

if not ollama_installed:
    print("\nTo install Ollama:")
    print("Linux: curl -fsSL https://ollama.com/install.sh | sh")
    print("macOS/Windows: Download from https://ollama.com/download")
    print("\nAfter installation, restart this notebook.")

✅ Ollama is installed: ollama version is 0.5.7


### Step 3.2: Install Data Formulator

In [3]:
# Install Data Formulator
# !pip install data_formulator

### Step 3.3: Install Additional Dependencies for Jupyter Integration

In [38]:
# Install packages for Jupyter integration
!pip install requests pandas matplotlib seaborn

<h2 id="configuration">4. Configuration</h2>

### Step 4.1: Download and Set Up Ollama Models

In [5]:
# Check available Ollama models
if ollama_installed:
    try:
        result = subprocess.run(['ollama', 'list'], capture_output=True, text=True)
        print("Current Ollama models:")
        print(result.stdout)
        
        # Check if we have any suitable models
        if "llama" in result.stdout.lower():
            print("✅ Suitable models found!")
        else:
            print("⚠️  No suitable models found. Let's download one.")
            
    except Exception as e:
        print(f"Error checking models: {e}")
else:
    print("Please install Ollama first (see Step 3.1)")

Current Ollama models:
NAME                      ID              SIZE      MODIFIED     
llama3.2-vision:latest    085a1fdae525    7.9 GB    4 months ago    
llama3.2-vision:11b       085a1fdae525    7.9 GB    4 months ago    

✅ Suitable models found!


### Step 4.2: Download a Model (if needed)

If you don't have a suitable model, uncomment and run the appropriate command below:

In [6]:
# Uncomment ONE of these based on your system capabilities:

# Lightweight option (1.3GB) - good for testing
# !ollama pull llama3.2:1b

# Balanced option (2GB) - recommended
# !ollama pull llama3.2:3b

# Larger option (7.9GB) - best performance if you have space
# !ollama pull llama3.2-vision:11b

print("If you uncommented a pull command above, the download may take several minutes.")
print("Progress is shown in this cell's output.")

If you uncommented a pull command above, the download may take several minutes.
Progress is shown in this cell's output.
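
To decide whether a pull is needed at all, the text printed by `ollama list` can be parsed for exact model names instead of searching for the substring "llama". The sketch below assumes the column layout shown in the Step 4.1 output (a header row, then one model per line with the name in the first column); `installed_models` and `needs_pull` are hypothetical helper names.

```python
def installed_models(list_output):
    """Parse the text printed by `ollama list` into a list of model names."""
    lines = [line for line in list_output.splitlines() if line.strip()]
    # Skip the header row (NAME, ID, SIZE, MODIFIED); the name is the first column.
    return [line.split()[0] for line in lines[1:]]

def needs_pull(list_output, wanted):
    """True if `wanted` (for example 'llama3.2:3b') is not among the installed models."""
    return wanted not in installed_models(list_output)

# Sample text copied from the Step 4.1 output.
sample = """NAME                      ID              SIZE      MODIFIED
llama3.2-vision:latest    085a1fdae525    7.9 GB    4 months ago
llama3.2-vision:11b       085a1fdae525    7.9 GB    4 months ago
"""
print(installed_models(sample))
print(needs_pull(sample, "llama3.2:3b"))
```

In the notebook, `result.stdout` from the Step 4.1 cell could be passed in place of `sample`.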


### Step 4.3: Test Ollama API

In [7]:
import requests
import json

def check_ollama_server():
    """Check if Ollama server is running and get available models"""
    try:
        response = requests.get('http://localhost:11434/api/tags', timeout=5)
        if response.status_code == 200:
            models = response.json()['models']
            print("✅ Ollama server is running!")
            print(f"Available models: {len(models)}")
            for model in models:
                print(f"  - {model['name']} ({model['size']/1e9:.1f}GB)")
            return True, models
        else:
            print(f"❌ Ollama server responded with status {response.status_code}")
            return False, []
    except requests.exceptions.ConnectionError:
        print("❌ Ollama server is not running")
        print("Start it with: ollama serve")
        return False, []
    except Exception as e:
        print(f"❌ Error checking Ollama server: {e}")
        return False, []

ollama_running, available_models = check_ollama_server()

✅ Ollama server is running!
Available models: 2
  - llama3.2-vision:latest (7.9GB)
  - llama3.2-vision:11b (7.9GB)


If the Ollama server is not running, start it from a terminal:

```bash
ollama serve
```

Keep that terminal open - the server needs to keep running.
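
After starting the server, it can take a moment before it accepts connections. A minimal sketch of a readiness check, assuming the default Ollama port 11434 used throughout this notebook; `wait_for_port` is a hypothetical helper that only tests whether the TCP port is open, not whether the API itself is healthy.

```python
import socket
import time

def wait_for_port(host="localhost", port=11434, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not listening yet.
            time.sleep(interval)
    return False

if wait_for_port(timeout=1.0):
    print("Port 11434 is accepting connections")
else:
    print("Nothing is listening on port 11434 yet")
```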

---

<h2 id="running">5. Running Data Formulator in Jupyter</h2>

### Step 5.1: Helper Functions

In [8]:
import threading
import time
import subprocess
from IPython.display import IFrame, display, HTML
import requests

class DataFormulatorManager:
    def __init__(self, port=5001):
        self.port = port
        self.process = None
        self.thread = None
        
    def start(self):
        """Start Data Formulator in background"""
        def run_server():
            try:
                self.process = subprocess.Popen(
                    ['data_formulator', '--port', str(self.port)],
                    stdout=subprocess.PIPE,
                    stderr=subprocess.PIPE,
                    text=True
                )
                self.process.wait()
            except Exception as e:
                print(f"Error starting Data Formulator: {e}")
        
        self.thread = threading.Thread(target=run_server, daemon=True)
        self.thread.start()
        
        print(f"Starting Data Formulator on port {self.port}...")
        
        # Wait for server to start
        for i in range(30):  # Wait up to 30 seconds
            try:
                response = requests.get(f'http://localhost:{self.port}', timeout=1)
                if response.status_code == 200:
                    print(f"✅ Data Formulator is ready at http://localhost:{self.port}")
                    return True
            except requests.exceptions.RequestException:
                pass  # server not accepting connections yet; retry
            time.sleep(1)
            if i % 5 == 0:
                print(f"Still waiting... ({i}/30 seconds)")
        
        print("❌ Data Formulator failed to start within 30 seconds")
        return False
    
    def display(self, width=1200, height=800):
        """Display Data Formulator in an iframe"""
        return IFrame(f'http://localhost:{self.port}', width=width, height=height)
    
    def stop(self):
        """Stop the Data Formulator process"""
        if self.process:
            self.process.terminate()
            print("Data Formulator stopped")

# Create manager instance
df_manager = DataFormulatorManager()
print("Data Formulator manager created")

Data Formulator manager created


### Step 5.2: Start Data Formulator

In [26]:
# Start Data Formulator
if ollama_running and available_models:
    success = df_manager.start()
    if success:
        print("\n🎉 Ready to configure Data Formulator!")
        print("\nModel Configuration Settings:")
        print(f"Provider: ollama")
        print(f"Model: {available_models[0]['name']}")
        print(f"API Base: http://localhost:11434")
        print(f"API Key: (leave empty)")
    else:
        print("Failed to start Data Formulator. Check the troubleshooting section.")
else:
    print("Please ensure Ollama is running and has models available first.")

Starting Data Formulator on port 5001...
Still waiting... (0/30 seconds)
✅ Data Formulator is ready at http://localhost:5001

🎉 Ready to configure Data Formulator!

Model Configuration Settings:
Provider: ollama
Model: llama3.2-vision:latest
API Base: http://localhost:11434
API Key: (leave empty)


### Step 5.3: Display Data Formulator Interface

In [27]:
# Display Data Formulator embedded in the notebook
try:
    display(df_manager.display(width=1200, height=800))
except Exception as e:
    print(f"Error displaying Data Formulator: {e}")
    print(f"You can access it directly at: http://localhost:{df_manager.port}")

### Step 5.4: Stop Data Formulator
If you started Data Formulator with the manager above, call `df_manager.stop()`. If it was started some other way, or the manager no longer holds a reference to the process, find and kill whatever is listening on port 5001:



🔍 Find the process:


In [28]:
!lsof -i :5001


COMMAND     PID    USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
python3.1 20650 soniacq    6u  IPv4 0xe7f3176def2eb69d      0t0  TCP *:commplex-link (LISTEN)


This will return something like:
```
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python3  12345 user   10u  IPv4 123456      0t0  TCP *:5001 (LISTEN)
```

🛑 Kill the process:
Use the PID (process ID) from the output (e.g., 12345 above):

In [30]:
# Uncomment the following line to kill the process:
#!kill 12345
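
When the notebook still holds the `subprocess.Popen` object (as `df_manager.process` does), the process can be stopped without looking up a PID. The sketch below is one possible approach rather than part of Data Formulator: `stop_process` is a hypothetical helper that asks the process to terminate and force-kills it only if it has not exited after a grace period. A short-lived Python child stands in for the server.

```python
import subprocess
import sys

def stop_process(proc, grace=5.0):
    """Ask `proc` to terminate; force-kill it if it is still alive after `grace` seconds."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()

# Demonstration with a harmless child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_process(child)
print("child exited:", child.poll() is not None)
```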

---

<h2 id="example">6. Example Analysis</h2>

Let's create some sample data and walk through a complete analysis workflow.

### Step 6.1: Create Sample Dataset

In [31]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create sample sales data
np.random.seed(42)
n_records = 1000

# Generate sample data
data = {
    'date': [datetime(2024, 1, 1) + timedelta(days=x) for x in range(n_records)],
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home & Garden'], n_records),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'sales_amount': np.random.exponential(100, n_records) + np.random.normal(50, 20, n_records),
    'customer_age': np.random.normal(40, 15, n_records).astype(int),
    'customer_satisfaction': np.random.normal(4.0, 0.8, n_records)
}

# Create DataFrame
df_sample = pd.DataFrame(data)
df_sample['sales_amount'] = np.maximum(df_sample['sales_amount'], 10)  # Ensure positive values
df_sample['customer_age'] = np.clip(df_sample['customer_age'], 18, 80)  # Reasonable age range
df_sample['customer_satisfaction'] = np.clip(df_sample['customer_satisfaction'], 1, 5)  # 1-5 scale

print(f"Created sample dataset with {len(df_sample)} records")
print("\nDataset preview:")
display(df_sample.head())

print("\nDataset info:")
print(df_sample.info())

Created sample dataset with 1000 records

Dataset preview:


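|   | date | product_category | region | sales_amount | customer_age | customer_satisfaction |
|---|------|------------------|--------|--------------|--------------|-----------------------|
| 0 | 2024-01-01 | Books | South | 52.913376 | 68 | 4.277368 |
| 1 | 2024-01-02 | Home & Garden | East | 111.529378 | 45 | 5.0 |
| 2 | 2024-01-03 | Electronics | North | 251.784602 | 26 | 2.527938 |
| 3 | 2024-01-04 | Books | North | 189.108088 | 48 | 3.974176 |
| 4 | 2024-01-05 | Books | North | 232.551075 | 18 | 4.512434 |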



Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   date                   1000 non-null   datetime64[ns]
 1   product_category       1000 non-null   object        
 2   region                 1000 non-null   object        
 3   sales_amount           1000 non-null   float64       
 4   customer_age           1000 non-null   int64         
 5   customer_satisfaction  1000 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 47.0+ KB
None


### Step 6.2: Save Sample Data

In [32]:
# Save the sample data for use in Data Formulator
sample_data_path = 'sample_sales_data.csv'
df_sample.to_csv(sample_data_path, index=False)

print(f"✅ Sample data saved to: {sample_data_path}")
print(f"📍 Full path: {os.path.abspath(sample_data_path)}")
print("\nYou can now upload this CSV file to Data Formulator above!")

✅ Sample data saved to: sample_sales_data.csv
📍 Full path: /Users/soniacq/Urban/OSCUR-experiments/sample_sales_data.csv

You can now upload this CSV file to Data Formulator above!


### Step 6.3: Analysis Examples

Here are some example questions you can ask Data Formulator once you've uploaded the sample data:

#### Basic Questions:
- "Show me sales by product category"
- "What's the trend of sales over time?"
- "Compare sales across different regions"

#### Advanced Questions (Ollama couldn't perform these):
- "Create a scatter plot of customer age vs sales amount, colored by region"
- "Show me the correlation between customer satisfaction and sales amount"
- "What's the average sales by month and product category?"

#### Complex Analysis (Ollama couldn't perform these):
- "Group customers by age brackets (18-30, 31-50, 51+) and show average satisfaction by category"
- "Create a heatmap showing sales performance by region and product category"
- "Show seasonal patterns in sales data with a moving average"

### Step 6.4: Configuration Reminder

**Don't forget to configure your model in Data Formulator above:**

1. Click "Select Model" in the Data Formulator interface
2. Add a new model with these settings:
   - **Provider**: `ollama`
   - **Model**: Use the model name from your available models (check the output above)
   - **API Base**: `http://localhost:11434`
   - **API Key**: Leave empty
3. Save the configuration
4. Upload your CSV file
5. Start asking questions!
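
The values for step 2 can be assembled from the model list returned by Ollama's `/api/tags` endpoint (the `available_models` variable from Step 4.3). This is a minimal sketch: `model_settings` is a hypothetical helper, and the dictionary keys simply mirror the field labels listed above rather than any Data Formulator API.

```python
def model_settings(models, preferred=None, api_base="http://localhost:11434"):
    """Build the values for the model dialog from a list of {'name': ...} dicts."""
    names = [m["name"] for m in models]
    if not names:
        raise ValueError("no Ollama models available; pull one first")
    # Use the preferred model if it is installed, otherwise fall back to the first one.
    model = preferred if preferred in names else names[0]
    return {
        "Provider": "ollama",
        "Model": model,
        "API Base": api_base,
        "API Key": "",  # left empty for a local Ollama server
    }

# Example input shaped like the Step 4.3 result (only the 'name' key is used here).
example = [{"name": "llama3.2-vision:latest"}, {"name": "llama3.2-vision:11b"}]
for field, value in model_settings(example, preferred="llama3.2-vision:11b").items():
    print(f"{field}: {value!r}")
```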

---

<h2 id="troubleshooting">7. Troubleshooting</h2>

### Common Issues and Solutions

#### Issue 1: "Address already in use" - Port 5000 or 5001

In [33]:
def check_port_usage(port):
    """Check what's using a specific port"""
    try:
        result = subprocess.run(['lsof', '-i', f':{port}'], capture_output=True, text=True)
        if result.stdout:
            print(f"Port {port} is in use:")
            print(result.stdout)
        else:
            print(f"Port {port} appears to be free")
    except FileNotFoundError:
        print(f"lsof command not available (Windows users: use netstat -an | findstr :{port})")

# Check common ports
check_port_usage(5000)
check_port_usage(5001)

Port 5000 is in use:
COMMAND   PID    USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
ControlCe 709 soniacq    7u  IPv4 0xe7f3176dec06d435      0t0  TCP *:commplex-main (LISTEN)
ControlCe 709 soniacq    8u  IPv6 0xe7f31772b66d4fa5      0t0  TCP *:commplex-main (LISTEN)

Port 5001 is in use:
COMMAND     PID    USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
python3.1 20650 soniacq    6u  IPv4 0xe7f3176def2eb69d      0t0  TCP *:commplex-link (LISTEN)



**Solution**: Use a different port:

In [34]:
# Try a different port
df_manager_alt = DataFormulatorManager(port=8080)
print("Alternative Data Formulator manager created on port 8080")
# Uncomment to start:
# df_manager_alt.start()

Alternative Data Formulator manager created on port 8080
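
Rather than guessing a port such as 8080, the operating system can be asked for one that is currently unused. A minimal sketch with the standard `socket` module; `find_free_port` is a hypothetical helper, and another program could in principle claim the port between this check and the server start.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]

port = find_free_port()
print(f"Free port: {port}")
# Possible use with the manager defined in Step 5.1:
# df_manager_alt = DataFormulatorManager(port=find_free_port())
```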


#### Issue 2: Ollama Connection Problems

In [35]:
def diagnose_ollama():
    """Comprehensive Ollama diagnostics"""
    print("🔍 Ollama Diagnostics")
    print("=" * 50)
    
    # Check if Ollama is installed
    try:
        result = subprocess.run(['ollama', '--version'], capture_output=True, text=True)
        print(f"✅ Ollama version: {result.stdout.strip()}")
    except FileNotFoundError:
        print("❌ Ollama not found. Install from https://ollama.com")
        return
    
    # Check if server is running
    try:
        response = requests.get('http://localhost:11434/api/tags', timeout=5)
        if response.status_code == 200:
            print("✅ Ollama server is running")
            models = response.json()['models']
            print(f"✅ {len(models)} models available")
            for model in models:
                print(f"   - {model['name']}")
        else:
            print(f"⚠️  Ollama server returned status {response.status_code}")
    except requests.exceptions.ConnectionError:
        print("❌ Ollama server not running")
        print("   Start with: ollama serve")
    except Exception as e:
        print(f"❌ Error: {e}")
    
    # Check processes
    try:
        result = subprocess.run(['pgrep', '-f', 'ollama'], capture_output=True, text=True)
        if result.stdout:
            print(f"✅ Ollama processes running: {result.stdout.strip()}")
        else:
            print("❌ No Ollama processes found")
    except FileNotFoundError:
        print("Unable to check processes (pgrep not available)")

diagnose_ollama()

🔍 Ollama Diagnostics
✅ Ollama version: ollama version is 0.5.7
✅ Ollama server is running
✅ 2 models available
   - llama3.2-vision:latest
   - llama3.2-vision:11b
✅ Ollama processes running: 1074
2635


#### Issue 3: Model Loading Errors

In [36]:
def test_model_inference(model_name):
    """Test if a model can perform inference"""
    try:
        payload = {
            "model": model_name,
            "prompt": "Hello, world!",
            "stream": False
        }
        
        response = requests.post(
            'http://localhost:11434/api/generate',
            json=payload,
            timeout=30
        )
        
        if response.status_code == 200:
            result = response.json()
            print(f"✅ Model {model_name} is working")
            print(f"   Response: {result.get('response', 'No response')[:100]}...")
            return True
        else:
            print(f"❌ Model {model_name} returned status {response.status_code}")
            return False
            
    except Exception as e:
        print(f"❌ Error testing model {model_name}: {e}")
        return False

# Test available models
if available_models:
    print("Testing model inference...")
    for model in available_models[:2]:  # Test first 2 models
        test_model_inference(model['name'])
else:
    print("No models available to test")

Testing model inference...
✅ Model llama3.2-vision:latest is working
   Response: Hello there! It's nice to meet you. Is there something I can help you with or would you like to chat...
✅ Model llama3.2-vision:11b is working
   Response: Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?...


#### Issue 4: Data Formulator Won't Start

In [37]:
# Check Data Formulator installation
try:
    result = subprocess.run(['data_formulator', '--help'], capture_output=True, text=True)
    if result.returncode == 0:
        print("✅ Data Formulator is installed correctly")
    else:
        print("⚠️  Data Formulator installation might have issues")
        print(result.stderr)
except FileNotFoundError:
    print("❌ Data Formulator not found")
    print("   Install with: pip install data_formulator")
except Exception as e:
    print(f"❌ Error checking Data Formulator: {e}")

# Alternative: Start manually
print("\nManual start command:")
print("data_formulator --port 5001")

✅ Data Formulator is installed correctly

Manual start command:
data_formulator --port 5001
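
The manager in Step 5.1 pipes the server's stdout and stderr but never reads them, so a startup error stays invisible. The sketch below shows one way to surface it: start the command, wait briefly, and return the captured stderr if the process has already exited. `launch_and_check` is a hypothetical helper, and a Python one-liner that exits with an error stands in for a failing `data_formulator` command.

```python
import subprocess
import sys

def launch_and_check(cmd, settle=2.0):
    """Start `cmd`. Return (proc, '') if it is still running after `settle` seconds,
    or (None, stderr_text) if it exited before then."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    try:
        _, err = proc.communicate(timeout=settle)
    except subprocess.TimeoutExpired:
        # Still running, which we treat as a successful start. Note that later
        # stderr output goes into a pipe that nobody reads in this sketch.
        return proc, ""
    return None, err

# Stand-in for a command that fails at startup: exits with code 1 and a message.
failing = [sys.executable, "-c", "import sys; sys.exit('port already in use')"]
proc, err = launch_and_check(failing, settle=10.0)
if proc is None:
    print("Process exited early. stderr:")
    print(err.strip())
```

With the real command, `cmd` would be `['data_formulator', '--port', '5001']` as in the manual start command above.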


---

<h2 id="best-practices">8. Best Practices</h2>

- **Use a dedicated environment**: Keep your Data Formulator setup isolated to avoid conflicts.
- **Monitor resource usage**: Keep an eye on RAM and CPU usage, especially with larger models.
- **Backup your data**: Always keep copies of important datasets.

Happy analyzing! 🎉