<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/Intro_to_Semantic_Search/Semantic_Search_Ollama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search with Ollama

In [None]:
# install Ollama and the Python dependencies
!sudo apt-get install -y pciutils
!curl -fsSL https://ollama.ai/install.sh | sh
!pip install ollama chromadb

In [4]:
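# import necessary python libraries
import os
import threading
import subprocess
import requests
import json

# named start_ollama_server so the later `import ollama` does not overwrite it
def start_ollama_server():
    # listen on all interfaces and accept requests from any origin
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    # Popen returns immediately; the server keeps running in the background
    subprocess.Popen(["ollama", "serve"])

In [5]:
# start Ollama
ollama_thread = threading.Thread(target=start_ollama_server)
ollama_thread.start()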
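
Because `ollama serve` is launched in the background, the next cell can run before the server is ready to accept requests. A minimal sketch of a readiness check follows. The helper name `wait_for_server` and its default values are illustrative assumptions, not part of the original notebook, and it assumes the server answers a plain GET on its root URL with HTTP 200.

```python
import time
import urllib.request

# Bypass any proxy settings, since the server runs on the local machine.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass.

    Hypothetical helper: the default URL assumes the port set in OLLAMA_HOST.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with _opener.open(url, timeout=2) as reply:
                if reply.status == 200:
                    return True
        except OSError:
            pass  # connection refused or HTTP error: server not ready yet
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` right after starting the thread returns `True` once the server responds, or `False` if it never comes up within the timeout.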

In [None]:
# download the embedding model
!ollama pull mxbai-embed-large

## Set up documents

For this mini-example, we will set up 54 short 'documents' containing information about Google.

Next, we will compute an embedding for each document and store it in a vector database.

In [7]:
# set up documents (only 54, so results will be limited)
import ollama
import chromadb

documents = [
    "Google was founded in 1998 by Larry Page and Sergey Brin.",
    "It started as a search engine but has since expanded into various services.",
    "Google's mission is to organize the world's information and make it universally accessible.",
    "The company is known for its innovative products, including Google Maps and Google Drive.",
    "Google has made significant investments in artificial intelligence and machine learning.",
    "The Google Chrome browser is one of the most popular web browsers worldwide.",
    "Google's parent company is Alphabet Inc., established in 2015.",
    "The company is also involved in initiatives like Google Cloud and autonomous vehicles.",
    "Google Search uses complex algorithms to deliver relevant results.",
    "The Google Play Store offers millions of apps for Android devices.",
    "YouTube, owned by Google, is the largest video-sharing platform globally.",
    "Google Ads provides businesses with tools for online advertising.",
    "Google Analytics helps website owners track and analyze web traffic.",
    "The company has a significant focus on user privacy and data security.",
    "Google's headquarters, known as the Googleplex, is located in Mountain View, California.",
    "Google has a global presence, with offices in numerous countries.",
    "The company launched its own smartphone line under the Pixel brand.",
    "Google Assistant is an AI-powered virtual assistant available on many devices.",
    "Gmail, Google's email service, has over 1.5 billion users worldwide.",
    "Google Translate can translate text between multiple languages in real time.",
    "The company regularly updates its search algorithms to improve accuracy.",
    "Google Photos offers unlimited photo storage and smart organization features.",
    "The Google Nest line includes smart home devices that integrate with Google Assistant.",
    "Google has been a leader in the development of self-driving car technology.",
    "The company invests heavily in renewable energy and sustainability initiatives.",
    "Google Docs allows users to create and edit documents collaboratively online.",
    "Google Scholar is a free search engine for scholarly literature.",
    "The company promotes diversity and inclusion within its workforce.",
    "Google’s research division has contributed to advancements in quantum computing.",
    "The Chrome OS powers devices like Chromebooks, designed for cloud computing.",
    "Google's Art Project allows users to explore artworks from museums worldwide.",
    "The company frequently hosts events like Google I/O to showcase new technologies.",
    "Google Fiber provides high-speed internet service in select U.S. cities.",
    "The Android operating system, developed by Google, powers billions of devices.",
    "Google's mobile-first indexing prioritizes mobile-friendly websites in search results.",
    "The company has faced scrutiny over antitrust issues and data practices.",
    "Google Earth enables users to explore 3D representations of the planet.",
    "Google Keep is a note-taking service that syncs across devices.",
    "The company has launched various initiatives to support education and research.",
    "Google's algorithms use over 200 factors to rank search results.",
    "The Google Chrome Web Store offers extensions to enhance browser functionality.",
    "Google has acquired numerous companies to expand its product offerings.",
    "The company emphasizes innovation through its '20% time' policy for employees.",
    "Google Maps provides real-time traffic updates and navigation assistance.",
    "The Google News service aggregates news articles from various sources.",
    "Google has a dedicated team for addressing security vulnerabilities.",
    "The company conducts extensive research in fields like natural language processing.",
    "Google’s advertising revenue constitutes a significant portion of its income.",
    "The Google API allows developers to integrate Google services into their applications.",
    "The company actively participates in open-source projects.",
    "Google's Doodle is a special logo displayed on its homepage to celebrate events.",
    "The company has implemented measures to combat misinformation online.",
    "Google's headquarters features unique workspaces and recreational areas for employees.",
    "The company has launched various health initiatives using AI and technology."]

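# create an in-memory Chroma client and a collection to hold the documents
# (get_or_create_collection lets this cell be re-run without an "already exists" error)
client = chromadb.Client()
collection = client.get_or_create_collection(name="docs")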

# store each document in a vector embedding database
for i, d in enumerate(documents):
  response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
  embedding = response["embedding"]
  collection.add(
    ids=[str(i)],
    embeddings=[embedding],
    documents=[d]
  )
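
The loop above writes one record per `collection.add` call. Since `add` takes lists for `ids`, `embeddings`, and `documents`, the records can also be collected first and written in a single call. The sketch below shows one way to do that. `build_records` and `fake_embed` are hypothetical names introduced for illustration, and `fake_embed` is a stand-in so the example runs without a model server.

```python
def build_records(documents, embed):
    """Return keyword arguments for a single `collection.add(**records)` call.

    `embed` is any callable that maps a string to a list of floats.
    """
    return {
        "ids": [str(i) for i in range(len(documents))],
        "embeddings": [embed(d) for d in documents],
        "documents": list(documents),
    }


# Stand-in embedder (assumption): two crude features instead of a real model.
def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]


records = build_records(["Google Maps", "Google Drive stores files."], fake_embed)
print(records["ids"])  # ['0', '1']
```

In the notebook, `embed` could be `lambda d: ollama.embeddings(model="mxbai-embed-large", prompt=d)["embedding"]`, followed by `collection.add(**records)`.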

## Querying

Next, we will query our documents to answer a user-provided question. In the code block below, you can change the prompt to query the documents differently.

The result will be the five documents whose embeddings are closest to the prompt's embedding, ordered from most to least similar.
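
To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor search over toy vectors using cosine similarity. The function names and the three-dimensional vectors are illustrative assumptions. Chroma uses its own index and configured distance metric, so this shows the idea rather than its implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings": the first two point in nearly the same direction as the query.
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_docs))  # [0, 1]
```

Real embeddings from `mxbai-embed-large` have many more dimensions, but the ranking principle is the same: documents whose vectors lie closest to the query vector are returned first.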

In [8]:
# an example prompt
prompt = "Which country is google in?"

# generate an embedding for the prompt and retrieve the five most relevant docs
response = ollama.embeddings(
  prompt=prompt,
  model="mxbai-embed-large"
)

results = collection.query(
  query_embeddings=[response["embedding"]],
  n_results=5
)
# get the documents returned for the first (and only) query
data = results['documents'][0]

In [9]:
data

['Google has a global presence, with offices in numerous countries.',
 "Google's headquarters, known as the Googleplex, is located in Mountain View, California.",
 'The company is known for its innovative products, including Google Maps and Google Drive.',
 "Google's mission is to organize the world's information and make it universally accessible.",
 'Google was founded in 1998 by Larry Page and Sergey Brin.']
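
The list above shows only the document text. To judge how close each match is, it can help to look at the distances as well. The sketch below pairs each document with its distance. It assumes the query result is a dictionary with parallel `"documents"` and `"distances"` lists, one inner list per query, which matches how `results['documents'][0]` is indexed above. Whether distances are included by default may depend on the chromadb version. `ranked_hits` is a hypothetical helper, and the sample values are placeholders, not output from the notebook.

```python
def ranked_hits(results, query_index=0):
    """Pair each returned document with its distance for one query."""
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    return list(zip(dists, docs))


# Placeholder result in the assumed shape; the distances are made up for illustration.
sample = {
    "ids": [["15", "14"]],
    "documents": [["doc A", "doc B"]],
    "distances": [[0.25, 0.5]],
}

for dist, doc in ranked_hits(sample):
    print(f"{dist:.2f}  {doc}")
```

In the notebook, `ranked_hits(results)` would be called on the dictionary returned by `collection.query`. Smaller distances indicate closer matches.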