<a href="https://colab.research.google.com/github/vrangayyan6/GenAI/blob/main/ollama_deepseek_r1_google_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ollama DeepSeek-R1 grounded with Google search

This notebook grounds DeepSeek-R1-Distill-Qwen-14B (deepseek-r1:14b), served by Ollama on the free Colab tier, with content fetched from Google search results.

Refer to
- https://ollama.com/library/deepseek-r1
- https://dev.to/0xkoji/2-ways-to-run-ollama-on-google-colab-free-tier-3i4

In [1]:
!curl https://ollama.ai/install.sh | sh

!echo 'debconf debconf/frontend select Noninteractive' | sudo debconf-set-selections
!sudo apt-get update && sudo apt-get install -y cuda-drivers

import os

# Set LD_LIBRARY_PATH so the system NVIDIA libraries in /usr/lib64-nvidia are picked up
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13269    0 13269    0     0  35031      0 --:--:-- --:--:-- --:--:-- 35010
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:5 http

In [2]:
# start server
!nohup ollama serve &

nohup: appending output to 'nohup.out'
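Because `ollama serve` is started in the background, the next cell can run before the server is ready. The sketch below polls the address printed by the installer (127.0.0.1:11434) until it answers. `server_is_up` and `wait_for_server` are hypothetical helpers, not part of Ollama, and the sketch assumes the server root replies with HTTP 200 once it is running.

```python
import time
import urllib.error
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # address reported by the install script


def server_is_up(url=OLLAMA_URL, timeout=2):
    """Return True if an HTTP GET to the server root succeeds (assumed 200 when ready)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(probe=server_is_up, attempts=30, delay=1.0):
    """Call `probe` until it returns True or `attempts` are used up."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_server()` in a cell before `ollama pull` would block until the server responds, or return `False` after roughly 30 seconds.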


# Pull the DeepSeek-R1 model

- DeepSeek-R1-Distill-Qwen-1.5B
 - deepseek-r1:1.5b
- DeepSeek-R1-Distill-Qwen-7B
 - deepseek-r1:7b
- DeepSeek-R1-Distill-Llama-8B
 - deepseek-r1:8b
- DeepSeek-R1-Distill-Qwen-14B
 - deepseek-r1:14b (used in this notebook; faster than the 32B model)
- DeepSeek-R1-Distill-Qwen-32B
 - deepseek-r1:32b (slow on the T4 GPU in free Colab)

In [3]:
# pull deepseek-r1:14b (the 32b variant is slow on the free Colab T4 GPU)
!ollama pull deepseek-r1:14b

pulling manifest
pulling 6e9f90f02bb3...   0% ▕▏    0 B/9.0 GB
pulling 6e9f90f02bb3...   0% ▕▏ 1.4 MB/9.0 GB
pulling 6e9f90f02bb3...   1% ▕▏  56 MB/9.0 GB
pulling 6e9f90f02bb3...   1% ▕▏ 111 MB/9.0 GB
pulling 6e9f90f02bb3...   1% ▕▏ 134 MB/9.0 GB

# Ground with content from Google search results

In [4]:
!pip install -q ollama googlesearch-python requests beautifulsoup4

In [5]:
import os
from googlesearch import search
import requests
from bs4 import BeautifulSoup
import ollama
import time

In [6]:
# prompt: number of tokens in grounded_prompt

import re

def count_tokens(text):
  """Approximates the token count of a text with a regex; the model's own tokenizer will give a different number."""
  tokens = re.findall(r'\b\w+\b|[^\w\s]', text)  # Matches whole words and individual punctuation characters
  return len(tokens)
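To make the counting rule concrete, here is a small worked example. It repeats the same regex so that it runs on its own; the sample sentence is only an illustration. Each word and each punctuation mark counts as one token, so the result approximates, rather than matches, what the model's tokenizer would report.

```python
import re


def count_tokens(text):
    """Approximate token count: whole words plus individual punctuation marks."""
    return len(re.findall(r'\b\w+\b|[^\w\s]', text))


sample = "Hello, world!"
print(re.findall(r'\b\w+\b|[^\w\s]', sample))  # ['Hello', ',', 'world', '!']
print(count_tokens(sample))                     # 4
```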


In [10]:
# get Google search results
def get_search_results(query, num_results=10):
    results = []
    for j in search(query, num_results=num_results, sleep_interval=2):
        results.append(j)
        time.sleep(2)
    return results

# get webpage content of the Google search results
def get_webpage_content(url):
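    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # Treat HTTP error pages as failed fetches
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.get_text()[:10000]  # Keep only the first 10,000 characters
    except requests.RequestException:
        return ""  # Skip pages that time out or fail to load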

# generate response with webpage content of the Google search results
def generate_grounded_content(prompt):
    # Get search results
    search_results = get_search_results(prompt)

    # Fetch content from search results
    search_contents = [f"Source {i+1}: {get_webpage_content(url)}" for i, url in enumerate(search_results)]
    # contents_tokens_count = 0
    # search_contents = []
    # for i, url in enumerate(search_results):
    #   while (contents_tokens_count < 15000 or url != ""):
    #     search_contents.append(f"Source {i+1}: {get_webpage_content(url)}")
    #     contents_tokens_count = count_tokens(search_contents)
    #     print("contents_tokens_count: ", contents_tokens_count)

    # Combine prompt with search contents
    grounded_prompt = f"""
    Based on the following information, please answer the question or respond to the prompt:
    Question/Prompt: {prompt}

    Information from search:
    {' '.join(search_contents)}

    Please provide a response that incorporates information from these sources, and include citations in the format [Source X] where X is the source number.
    """

    # Generate a streamed response from the local Ollama model
    response = ollama.chat(model='deepseek-r1:14b', messages=[{'role': 'user', 'content': grounded_prompt}], stream=True)

    return response, search_results, grounded_prompt
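The commented-out loop inside `generate_grounded_content` suggests capping the combined sources at about 15,000 tokens. The sketch below is one way that idea might be written. `select_within_budget` is a hypothetical helper, not part of the notebook, and it assumes the regex-based `count_tokens` estimate is close enough for budgeting.

```python
import re


def count_tokens(text):
    """Same regex-based approximation used earlier in the notebook."""
    return len(re.findall(r'\b\w+\b|[^\w\s]', text))


def select_within_budget(pages, max_tokens=15000):
    """Label non-empty pages as 'Source N' and keep them until the budget is reached."""
    selected, used = [], 0
    for i, page in enumerate(pages, start=1):
        if not page.strip():
            continue  # failed fetch; numbering stays aligned with the URL list
        entry = f"Source {i}: {page}"
        cost = count_tokens(entry)
        if used + cost > max_tokens:
            break  # adding this page would exceed the budget
        selected.append(entry)
        used += cost
    return selected, used
```

In the notebook, the list comprehension that builds `search_contents` could be replaced by something like `search_contents, used = select_within_budget([get_webpage_content(u) for u in search_results])`. Source numbers still match the printed URL list because skipped pages keep their index.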

In [8]:
user_prompt = """
You are a Python expert specializing in implementing Retrieval-Augmented Generation (RAG) with cutting-edge AI models and tools. Write a Python script to achieve the following:

1. Objective: Build a Retrieval-Augmented Generation (RAG) system using the Google Gemini 1.5 Flash model, Chroma as the vector database, and Streamlit for the user interface. The system should enable users to input a query, retrieve relevant context from a document database using Chroma, and generate a context-aware response using the Google Gemini 1.5 Flash model.

2. Requirements:
   - Document Ingestion with Chroma:
     - Use Chroma to store and manage a set of documents.
     - Read PDF files from a specified folder, extract text from the PDFs, and embed the content using a suitable text embedding model compatible with Chroma.
   - Query Workflow:
     - When a user inputs a query through the Streamlit interface, retrieve the top-k most relevant documents from Chroma.
   - Integration with Google Gemini 1.5 Flash:
     - Use the retrieved documents as context to generate a response from the Google Gemini 1.5 Flash model.
   - Streamlit Interface:
     - Create an intuitive web interface with:
       - A file upload feature for PDFs, which will automatically update the Chroma database with the newly added content.
       - A text input box for user queries.
       - A display area for both the retrieved documents and the generated response.
   - Modularity:
     - Structure the code with clear modular functions, such as:
       - Extracting text from PDFs.
       - Embedding and storing documents in Chroma.
       - Querying Chroma for relevant documents.
       - Generating responses using Google Gemini 1.5 Flash.
       - Streamlit app setup and interaction.

3. Assumptions:
   - Google Gemini 1.5 Flash API access is available and properly configured.
   - Chroma library is installed and accessible.
   - Streamlit and a PDF parsing library like PyPDF2 or pdfplumber are installed and set up.

4. Additional Considerations:
   - Include error handling for cases where no relevant documents are found.
   - Provide comments to explain the purpose of each function and important lines of code.
   - Ensure the code is compatible with Python 3.8+.
   - Ensure uploaded PDFs are processed dynamically without requiring a server restart.

Please generate the Python code for the complete implementation.
"""

In [11]:
# %%time


# response = ollama.chat(model='deepseek-r1:14b', messages=[{'role': 'user', 'content': user_prompt}], stream=True,)
# print(response['message']['content'])

response, sources, grounded_prompt = generate_grounded_content(user_prompt)

for chunk in response:
  print(chunk['message']['content'], end='', flush=True)

print("\nSources:")
# display(Markdown("Sources:"))
for i, source in enumerate(sources, 1):
    print(f"[Source {i}] {source}")
    # display(Markdown(f"[Source {i}] {source}"))

<think>
Alright, I need to help the user by providing a detailed explanation of Retrieval-Augmented Generation (RAG) based on the two sources they provided. Let me start by understanding what RAG entails.

First, from Source 1, RAG combines retrieval methods with language models to enhance generation tasks. It uses external documents to improve accuracy and context. So I should mention how it retrieves relevant information and integrates it into the model's responses.

Looking at Source 2, there's a detailed process flow. The steps include document loading, splitting, embedding, storing in a vector database, retrieval, and generation. Each of these steps is crucial, so I need to explain them clearly, perhaps with examples to make it relatable.

I should also discuss the augmentation phase where external data is incorporated into prompts. This makes the responses more accurate by leveraging both internal knowledge and retrieved context.

It's important to highlight the benefits of RAG—i
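The streamed output above opens with a `<think>` block, which holds the model's reasoning before its answer. The sketch below separates the two once the full text has been collected. `split_reasoning` is a hypothetical helper, and it assumes the reasoning is closed by a matching `</think>` tag, which the truncated output here does not show.

```python
import re


def split_reasoning(text):
    """Return (reasoning, answer) from a completed response string."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()  # no reasoning block found
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

With `stream=True`, the chunks would first be joined, for example `full = ''.join(chunk['message']['content'] for chunk in response)`, and then passed to `split_reasoning(full)`.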

In [12]:
print("\nPrompt:")
print(grounded_prompt)
token_count = count_tokens(grounded_prompt)
print(f"The 'grounded_prompt' contains {token_count} tokens.")


Prompt:

    Based on the following information, please answer the question or respond to the prompt:
    Question/Prompt: 
You are a Python expert specializing in implementing Retrieval-Augmented Generation (RAG) with cutting-edge AI models and tools. Write a Python script to achieve the following:

1. Objective: Build a Retrieval-Augmented Generation (RAG) system using the Google Gemini 1.5 Flash model, Chroma as the vector database, and Streamlit for the user interface. The system should enable users to input a query, retrieve relevant context from a document database using Chroma, and generate a context-aware response using the Google Gemini 1.5 Flash model.

2. Requirements:
   - Document Ingestion with Chroma:
     - Use Chroma to store and manage a set of documents.
     - Read PDF files from a specified folder, extract text from the PDFs, and embed the content using a suitable text embedding model compatible with Chroma.
   - Query Workflow:
     - When a user inputs a query t