## Using the ollama Python package to generate embeddings

First, install the package with `pip install ollama`, then import it.

In [2]:
import ollama

You can test an embedding model by calling `ollama.embeddings(model=..., prompt=...)`. Download the model first by running `ollama pull <model>` in a terminal.

In [8]:
ollama.embeddings(model='nomic-embed-text', prompt='The sky is blue because of rayleigh scattering')

EmbeddingsResponse(embedding=[0.5889952182769775, 0.400834858417511, -3.303218126296997, -0.525968074798584, 0.7489901781082153, 1.5185997486114502, -0.1251041144132614, 0.39591342210769653, 0.06778016686439514, -1.1088330745697021, 0.6926167011260986, 1.2775923013687134, 1.146063208580017, 1.089024543762207, 0.2504419982433319, 0.2928600311279297, 0.1518256962299347, -0.6344521045684814, -0.2100622057914734, -0.1958126723766327, -1.7958611249923706, -0.6291590332984924, 0.03886444866657257, -0.6687489748001099, 1.26125967502594, 1.2771027088165283, -0.15987950563430786, -0.0024411454796791077, -0.29727184772491455, -0.4807409644126892, 1.2050529718399048, -0.6383835077285767, -0.5400329828262329, -1.0354485511779785, 0.6314492225646973, -1.208990454673767, 0.6834062337875366, -0.058553166687488556, -0.19721460342407227, 0.12762127816677094, -0.014400124549865723, -0.5544140934944153, 0.3516940772533417, 0.04494372010231018, 0.597441554069519, -0.9552484154701233, 0.5079353451728821, 1
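The raw response prints hundreds of floats, which buries the one thing usually worth checking: the length of the vector. Below is a minimal sketch of a summary helper. `summarize_embedding` is a hypothetical name, and the short list is a stand-in for the response's `embedding` field so that the snippet runs without an Ollama server.

```python
def summarize_embedding(vector, preview=5):
    """Return the length of an embedding and its first few values, rounded."""
    return {
        "dim": len(vector),
        "head": [round(v, 4) for v in vector[:preview]],
    }

# Stand-in for the response's embedding field (assumption: a real vector
# from nomic-embed-text is far longer than these six values).
fake_embedding = [
    0.5889952182769775, 0.400834858417511, -3.303218126296997,
    -0.525968074798584, 0.7489901781082153, 1.5185997486114502,
]

summary = summarize_embedding(fake_embedding)
print(summary)
```

With a live server, the same helper could be given the `embedding` list from the response instead of the stand-in.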

Below is code modified from [this article](https://ollama.com/blog/embedding-models) to generate embeddings from a list of texts. 

In [29]:
documents = [
  "Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old",
]

import numpy as np

embedding_dict = {}

# Embed each document and store the result keyed by the document text.
# response["embeddings"] is a list of vectors, so each array has shape (1, dim).
for d in documents:
  response = ollama.embed(model="nomic-embed-text", input=d)
  embeddings = np.array(response["embeddings"], dtype=float)
  embedding_dict[d] = embeddings

In [30]:
for key in embedding_dict.keys():
    print(key)

Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels
Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands
Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall
Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight
Llamas are vegetarians and have very efficient digestive systems
Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old


In [22]:
type(embedding_dict[documents[0]])

numpy.ndarray
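The loop above makes one `ollama.embed` call per document, and each stored array keeps a leading axis of length 1 because the `embeddings` field is a list of vectors even for a single input. The sketch below builds the same kind of dictionary from one batched call, assuming `ollama.embed` accepts a list of strings for `input`. So that it runs offline, the embedding function is passed in as a parameter; `build_embedding_dict` and `fake_embed` are hypothetical names, and `fake_embed` only imitates the shape of a real response.

```python
import numpy as np

def build_embedding_dict(docs, embed_fn, model="nomic-embed-text"):
    """Map each document to a 1-D embedding using a single batched call.

    embed_fn is expected to behave like ollama.embed: it takes model= and
    input= (a list of strings) and returns a mapping whose "embeddings"
    entry holds one vector per input, in order.
    """
    response = embed_fn(model=model, input=list(docs))
    vectors = np.asarray(response["embeddings"], dtype=float)  # (n_docs, dim)
    if vectors.shape[0] != len(docs):
        raise ValueError("expected one embedding per document")
    return {doc: vec for doc, vec in zip(docs, vectors)}

def fake_embed(model, input):
    """Hypothetical stand-in for ollama.embed so the sketch runs offline."""
    return {"embeddings": [[float(len(text)), 1.0, 0.0] for text in input]}

docs = ["Llamas are vegetarians", "Llamas weigh between 280 and 450 pounds"]
embedding_dict = build_embedding_dict(docs, fake_embed)
print({doc: vec.shape for doc, vec in embedding_dict.items()})
```

With a running server, `ollama.embed` would be passed in place of `fake_embed`. Each value is then a 1-D vector, so no `np.squeeze` is needed afterwards.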

## Test with sample data

This section works out how to use ollama with text stored in a pandas DataFrame.

In [3]:
import pandas as pd
import os

synth_df = pd.read_csv(os.path.join(os.getcwd(), "../data/synthetic_data.csv"), index_col = 0)

synth_df

| Unnamed: 0 | id | category | text |
|---|---|---|---|
| 0 | 1 | Product Description | Experience unparalleled sound quality with the... |
| 1 | 2 | Movie Synopsis | In a world ravaged by climate change, a group ... |
| 2 | 3 | News Article | The city council approved the new public trans... |
| 3 | 4 | Recipe | Preheat the oven to 375°F. Mix flour, sugar, a... |
| 4 | 5 | Travel Guide | Discover the hidden gems of Kyoto, from tranqu... |
| 5 | 6 | Scientific Abstract | This study investigates the effects of micropl... |
| 6 | 7 | Book Review | An evocative tale of love and loss, 'The Silen... |
| 7 | 8 | Job Posting | Looking for a skilled software engineer profic... |
| 8 | 9 | User Manual | To reset your device, hold the power button fo... |
| 9 | 10 | Historical Event | The Berlin Wall, constructed in 1961, symboliz... |


In [4]:
texts = synth_df["text"].tolist()
metadata = synth_df[["id", "category"]].to_dict(orient = "records")

In [5]:
texts

['Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts.',
 'In a world ravaged by climate change, a group of unlikely heroes embarks on a perilous journey to save humanity from extinction.',
 'The city council approved the new public transportation plan yesterday, aiming to reduce traffic congestion and lower carbon emissions by 2030.',
 'Preheat the oven to 375°F. Mix flour, sugar, and eggs in a bowl, then fold in fresh blueberries. Bake for 25 minutes or until golden brown.',
 'Discover the hidden gems of Kyoto, from tranquil temples to bustling markets, and experience authentic Japanese culture like never before.',
 'This study investigates the effects of microplastic pollution on marine ecosystems, revealing significant impacts on coral reef health and biodiversity.',
 "An evocative tale of love and loss, 'The Silent Horizon' beautifully captures the complexities 

In [6]:
metadata

[{'id': 1, 'category': 'Product Description'},
 {'id': 2, 'category': 'Movie Synopsis'},
 {'id': 3, 'category': 'News Article'},
 {'id': 4, 'category': 'Recipe'},
 {'id': 5, 'category': 'Travel Guide'},
 {'id': 6, 'category': 'Scientific Abstract'},
 {'id': 7, 'category': 'Book Review'},
 {'id': 8, 'category': 'Job Posting'},
 {'id': 9, 'category': 'User Manual'},
 {'id': 10, 'category': 'Historical Event'},
 {'id': 11, 'category': 'Customer Review'},
 {'id': 12, 'category': 'Health & Fitness'},
 {'id': 13, 'category': 'Legal Document'},
 {'id': 14, 'category': 'E-commerce FAQ'},
 {'id': 15, 'category': 'Educational Content'}]

Generate a dictionary of embeddings keyed by text, plus a list of the vectors in the same order as `texts`.

In [8]:
import numpy as np

text_embeddings = {}
embeddings = []

for doc in texts:
    response = ollama.embed(model="nomic-embed-text", input=doc)
    # "embeddings" holds one vector per input; drop the leading axis: (1, 768) -> (768,)
    embedding = np.squeeze(response["embeddings"], axis=0)
    text_embeddings[doc] = embedding
    embeddings.append(embedding)

In [54]:
np.array(embeddings[0]).shape

(768,)

In [55]:
print(np.array(embeddings).shape)

(15, 768)


Test semantic search and check that the array dimensions line up for `cosine_similarity`, which expects two 2-D inputs.

In [78]:
from sklearn.metrics.pairwise import cosine_similarity

query = "wireless earbuds with good battery life" 
query_response = ollama.embed(model = "nomic-embed-text", input = query)
query_embedding = np.array(query_response["embeddings"], dtype = float)

# Shape (1, 15): one row of scores for the single query
similarities = cosine_similarity(X=query_embedding, Y=np.array(embeddings))

# Indices of the three highest scores, best first
top_indices = np.argsort(similarities[0])[::-1][:3]

# Build new result dicts so the entries in `metadata` are not modified in place
results = [{**metadata[idx], "text": texts[idx]} for idx in top_indices]

results

[{'id': 1,
  'category': 'Product Description',
  'text': 'Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts.'},
 {'id': 9,
  'category': 'User Manual',
  'text': 'To reset your device, hold the power button for 10 seconds until the LED indicator flashes. Release the button and wait for the system reboot.'},
 {'id': 12,
  'category': 'Health & Fitness',
  'text': 'Regular cardio workouts not only improve heart health but also boost mental clarity and reduce stress levels.'}]
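`cosine_similarity` works on two 2-D arrays, which is why the query stays a (1, 768) array while the document embeddings stack to (15, 768). The sketch below reproduces the same ranking step in plain numpy on toy 3-dimensional vectors, to make the shapes at each step visible. `top_k_cosine` is a hypothetical helper, not part of sklearn or ollama, and the toy vectors are made up for illustration.

```python
import numpy as np

def top_k_cosine(query, matrix, k=3):
    """Return (indices, scores) of the k rows of matrix most similar to query."""
    query = np.asarray(query, dtype=float).reshape(1, -1)   # (1, dim)
    matrix = np.asarray(matrix, dtype=float)                # (n, dim)
    # Normalize rows to unit length so the dot product equals cosine similarity.
    q_unit = query / np.linalg.norm(query, axis=1, keepdims=True)
    m_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = (q_unit @ m_unit.T)[0]                         # (n,)
    order = np.argsort(scores)[::-1][:k]                    # best first
    return order, scores[order]

# Made-up vectors standing in for real embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
indices, scores = top_k_cosine([1.0, 0.05, 0.0], toy_embeddings, k=2)
print(indices, scores)
```

The sketch does not guard against all-zero vectors, which would cause a division by zero during normalization; sklearn's `cosine_similarity` remains the safer choice for real data.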

In [76]:
type(texts)

list

In [77]:
type(query)

str

## Try Ollama class

Try the entire document workflow with the `OllamaEmbedder` class.

In [None]:
from searchlite.document import Document
from searchlite.embedders.Ollama import OllamaEmbedder

In [5]:
embedder = OllamaEmbedder(model_name = "nomic-embed-text")

In [6]:
print(repr(embedder))

Ollama Embedder object. Chosen model: nomic-embed-text. 
Ollama server status: Running
