<a href="https://colab.research.google.com/github/abhijithneilabraham/AI_experiments/blob/master/ollama_datatune_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's start by installing Ollama and starting its server.

In [None]:
!apt-get update && apt-get install -y wget
!wget https://ollama.ai/install.sh -O install_ollama.sh
!chmod +x install_ollama.sh
!OLLAMA_USE_SYSTEM_CA_CERTS=1 ./install_ollama.sh

In [None]:
!ollama serve > /var/log/ollama.log 2>&1 &
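
Because the server is started in the background, the next cell can run before it is ready to accept requests. Here is a minimal sketch of a readiness check, assuming Ollama listens on its default address `http://localhost:11434`. The helper name `wait_for_ollama` is made up for this example and is not part of Ollama or Datatune.

```python
import time
import urllib.error
import urllib.request


def wait_for_ollama(base_url="http://localhost:11434", timeout=30.0, interval=0.5):
    """Poll base_url until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is not accepting connections yet; wait and retry.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_ollama()` right after the cell above should return `True` once the server responds. If it returns `False`, `/var/log/ollama.log` is the first place to look.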



We will use the Qwen2.5-VL 7B model (qwen2.5vl:7b), which we pull to the local machine with Ollama.

In [5]:
!ollama pull qwen2.5vl:7b


Now let's install the latest PyPI release of Datatune using pip.

In [None]:
!pip install datatune

Let's import the necessary libraries.

In [None]:
import pandas as pd
import dask.dataframe as dd
import datatune as dt
from datatune.llm.llm import Ollama


In [8]:
# Step 1: Create a simple dataset with customer information
data = {
    'customer_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'name': ['John Smith', 'Maria Garcia', 'Wei Zhang', 'Sarah Johnson', 'Ahmed Hassan',
             'Priya Patel', 'Carlos Rodriguez', 'Emma Wilson', 'Hiroshi Tanaka', 'Fatima Ali'],
    'email': ['john.s@example.com', 'maria.g@example.com', 'wei.z@example.com', 'sarah.j@example.com',
              'ahmed.h@example.com', 'priya.p@example.com', 'carlos.r@example.com', 'emma.w@example.com',
              'hiroshi.t@example.com', 'fatima.a@example.com'],
    'address': ['123 Main St, New York, USA', '456 Maple Ave, Toronto, Canada',
                '789 Beijing Road, Beijing, China', '101 Oxford St, London, UK',
                '202 Pyramid Road, Cairo, Egypt', '303 Gandhi St, Mumbai, India',
                '404 Reforma Ave, Mexico City, Mexico', '505 King St, Sydney, Australia',
                '606 Sakura St, Tokyo, Japan', '707 Desert Road, Dubai, UAE']
}
df = pd.DataFrame(data)



Datatune currently uses Dask DataFrames for partitioning, so this step converts the pandas DataFrame into a Dask DataFrame.

In [9]:

# Convert to dask dataframe for Datatune
df = dd.from_pandas(df, npartitions=2)
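
To get a feel for what `npartitions=2` means, the sketch below splits the ten customer IDs into two contiguous chunks in plain Python. It is a conceptual illustration only: Dask chooses its own division points, and the helper `split_into_partitions` is hypothetical.

```python
def split_into_partitions(rows, npartitions):
    """Split rows into npartitions contiguous chunks of near-equal size."""
    if npartitions < 1:
        raise ValueError("npartitions must be at least 1")
    base, extra = divmod(len(rows), npartitions)
    partitions, start = [], 0
    for i in range(npartitions):
        # The first `extra` chunks each take one additional row.
        size = base + (1 if i < extra else 0)
        partitions.append(rows[start:start + size])
        start += size
    return partitions


customer_ids = list(range(101, 111))
for part in split_into_partitions(customer_ids, 2):
    print(part)
```

Each chunk can then be processed independently of the others.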

In [10]:
# Make sure Ollama is running on your machine before running the following line.
# If no model is specified, the gemma3:4b model is loaded by default.
llm = Ollama(model_name="qwen2.5vl:7b")
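
Before relying on the model, it can help to confirm that it has actually been pulled. The sketch below parses a response body in the shape returned by Ollama's `GET /api/tags` endpoint. The sample payload is illustrative, and `model_is_available` is a hypothetical helper, not part of Datatune.

```python
import json


def model_is_available(tags_json, model_name):
    """Return True if model_name appears in an /api/tags response body."""
    payload = json.loads(tags_json)
    names = {entry.get("name") for entry in payload.get("models", [])}
    return model_name in names


# Example payload in the shape of an /api/tags response (values are illustrative).
sample = '{"models": [{"name": "qwen2.5vl:7b"}, {"name": "gemma3:4b"}]}'
print(model_is_available(sample, "qwen2.5vl:7b"))
print(model_is_available(sample, "llama3:8b"))
```

Against a live server, the body could be fetched with `urllib.request.urlopen("http://localhost:11434/api/tags").read()`, assuming the default address.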


In [None]:
# Use Map to extract country and continent.
# Note: the variable name `map` shadows Python's built-in map() in this notebook.
map = dt.Map(
    prompt="Extract the country and continent from the address field",
    output_fields=["country", "continent"]
)(llm, df)
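
Conceptually, a map step adds the requested output fields to every row. The sketch below mimics that shape with a hard-coded lookup standing in for the LLM. `fake_llm_extract` and `map_rows` are hypothetical names and are not part of Datatune, whose internals may work quite differently.

```python
# Stand-in for the model: a tiny lookup keyed on the last address component.
COUNTRY_TO_CONTINENT = {
    "USA": "North America",
    "Canada": "North America",
    "China": "Asia",
    "UK": "Europe",
}


def fake_llm_extract(address):
    """Return the fields a model would be asked to produce for one address."""
    country = address.rsplit(",", 1)[-1].strip()
    return {"country": country, "continent": COUNTRY_TO_CONTINENT.get(country)}


def map_rows(rows, output_fields):
    """Return copies of rows extended with the requested output fields."""
    mapped = []
    for row in rows:
        extracted = fake_llm_extract(row["address"])
        mapped.append({**row, **{field: extracted[field] for field in output_fields}})
    return mapped


rows = [
    {"customer_id": 101, "address": "123 Main St, New York, USA"},
    {"customer_id": 104, "address": "101 Oxford St, London, UK"},
]
for row in map_rows(rows, ["country", "continent"]):
    print(row)
```

The original rows are left untouched and each result row gains exactly the fields named in `output_fields`, which mirrors the role of the `output_fields` argument above.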

In [12]:
mapped_df = map.compute()  # Trigger Dask's lazy evaluation and materialize the result

The finalize function removes any additional metadata that an operation adds to the DataFrame.

In [None]:
final_mapped = dt.finalize(mapped_df) # clean metadata
print(final_mapped)

   customer_id              name                  email  \
0          101        John Smith     john.s@example.com   
1          102      Maria Garcia    maria.g@example.com   
2          103         Wei Zhang      wei.z@example.com   
3          104     Sarah Johnson    sarah.j@example.com   
4          105      Ahmed Hassan    ahmed.h@example.com   
5          106       Priya Patel    priya.p@example.com   
6          107  Carlos Rodriguez   carlos.r@example.com   
7          108       Emma Wilson     emma.w@example.com   
8          109    Hiroshi Tanaka  hiroshi.t@example.com   
9          110        Fatima Ali   fatima.a@example.com   

                                address    country      continent  
0            123 Main St, New York, USA        USA  North America  
1        456 Maple Ave, Toronto, Canada     Canada  North America  
2      789 Beijing Road, Beijing, China      China           Asia  
3             101 Oxford St, London, UK         UK         Europe  
4        2

In [None]:
# Now let's filter the dataset to keep only people whose name starts with a specific letter

filtered = dt.Filter(
    prompt="Keep only people whose name starts with J"
)(llm, map)


In [None]:
filtered_df = filtered.compute()

In [None]:
# After filtering, finalize removes both the metadata and the rows that were filtered out.

final_filtered = dt.finalize(filtered_df)
print(final_filtered)

   customer_id        name               email                     address  \
0          101  John Smith  john.s@example.com  123 Main St, New York, USA   

  country      continent  
0     USA  North America  
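
One way to picture what finalize does after a filter: rows carry a bookkeeping flag until the end, and the final step drops the flagged rows and then the flag itself. The sketch below uses a made-up `_deleted` key and a hypothetical `finalize_rows` helper. Datatune's actual metadata columns may be named and structured differently.

```python
def finalize_rows(rows, flag="_deleted"):
    """Drop rows whose flag is truthy, then strip the flag from the rest."""
    kept = [row for row in rows if not row.get(flag, False)]
    return [{k: v for k, v in row.items() if k != flag} for row in kept]


rows = [
    {"customer_id": 101, "name": "John Smith", "_deleted": False},
    {"customer_id": 102, "name": "Maria Garcia", "_deleted": True},
]
print(finalize_rows(rows))
```

Only the row for John Smith survives, and it no longer carries the bookkeeping key.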
