## Objectives

1. Use SerpAPI to query the web
2. Extract content from the top 3-5 pages using newspaper3k
3. Summarize the relevant information with a local LLM for use as prompt context
4. Store recent search results with timestamps for reuse

### Dependencies

- google-search-results (provides the `serpapi` module imported below), newspaper3k, readability-lxml, beautifulsoup4, lxml, requests, sentence-transformers, transformers

## Code

### Library Import

In [37]:
#Importing libraries
import os
import json
import requests
import sqlite3
from serpapi import GoogleSearch
from newspaper import Article
from bs4 import BeautifulSoup
from readability import Document
from datetime import datetime, timedelta
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

### Extracting Top URLs from Google

We use SerpAPI to run a Google search for the query and collect the top result URLs (five by default). The function is tested in the cell that follows it.

In [38]:
#API key from SerpAPI, read from the environment so the secret stays out of the notebook
SERPAPI_KEY = os.environ.get("SERPAPI_KEY", "")

def search_google_serpapi(query,api_key,num_results=5):
    """
    Uses SerpAPI to perform a Google search and return top result URLs.
    Parameters:
    - query (str): the search query
    - api_key (str): the SerpAPI key
    - num_results (int): # of URLs to return (default is 5)
    Returns:
    - List[str]: List of top URLs from organic search results.
    """
    
    #Setting parameters for the SerpAPI request
    params={
        "engine":"google",      #use Google engine
        "q":query,              #search query
        "api_key":api_key,      #SerpAPI key
        "num":num_results       #number of results to fetch
    }

    #Making the search request using the SerpAPI client
    search=GoogleSearch(params)
    results=search.get_dict()

    #Checking if organic results are present in the response
    if "organic_results" not in results:
        print("No search results returned.")
        return []

    #List to store the top URLs
    urls=[]

    #Looping through the top organic results and collecting URLs
    for result in results["organic_results"][:num_results]:
        url=result.get("link")
        if url:
            urls.append(url)

    return urls

In [39]:
#Example search
query="Best practices for Microsoft Intune device compliance"
top_urls=search_google_serpapi(query,SERPAPI_KEY,3)
print(top_urls)

['https://learn.microsoft.com/en-us/intune/intune-service/fundamentals/deployment-plan-compliance-policies', 'https://learn.microsoft.com/en-us/intune/intune-service/protect/device-compliance-get-started']
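
The link-collection step can also be exercised without spending API calls. The sketch below uses a hypothetical helper, `collect_links`, that works on any dict shaped like the response handled above. The `sample` payload is invented for illustration, and the de-duplication is an addition that the original function does not perform.

```python
def collect_links(results, num_results=5):
    """Return up to num_results unique links from a SerpAPI-style response dict."""
    links = []
    seen = set()
    # A missing "organic_results" key is treated the same as an empty result list.
    for result in results.get("organic_results", []):
        if len(links) >= num_results:
            break
        link = result.get("link")
        # Skip entries without a link and links that were already collected.
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links


# Invented payload for illustration; real responses carry many more fields.
sample = {
    "organic_results": [
        {"position": 1, "link": "https://example.com/a"},
        {"position": 2, "title": "entry without a link"},
        {"position": 3, "link": "https://example.com/a"},
        {"position": 4, "link": "https://example.com/b"},
    ]
}

print(collect_links(sample, num_results=3))
```

With this sample, only two links come back even though three were requested, because one entry has no link and another is a duplicate.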


### Extracting Content from URLs

We design a function that first tries to extract the article text with `newspaper3k`. If that fails or yields no text, it falls back to readability and BeautifulSoup. It returns the cleaned text, or an empty string if neither method works. The function is tested below on the top URLs found in the previous example.

In [40]:
def extract_page_text(url):
    """
    Extracts readable text content from a given URL.
    Attempts to use newspaper3k first, then falls back to readability and BeautifulSoup
    Parameters:
    - url (str): the URL of the page to extract
    Returns:
    - str: extracted article/page text (empty string if failed)
    """
    try:
        #Attempting to use newspaper3k first
        article=Article(url)
        article.download()
        article.parse()
        if article.text.strip():
            return article.text.strip()
    except Exception as e:
        print(f"[newspaper3k] failed for {url}: {e}")

    try:
        #Attempting fallback which is to use readability-lxml and BeautifulSoup
        response=requests.get(url,timeout=10)
        response.raise_for_status()  #Treating HTTP error statuses as failures
        doc=Document(response.text)
        html=doc.summary()  #Extracting main content HTML
        soup=BeautifulSoup(html,"html.parser")
        text=soup.get_text(separator="\n")
        return text.strip()
    except Exception as e:
        print(f"[fallback] failed for {url}: {e}")

    return ""  #Returning empty string if all methods fail

In [41]:
#Example: extracting text from each of the top search results in the previous example
for url in top_urls:
    print(f"Extracting from: {url}")
    text=extract_page_text(url)
    print(text[:1000])  #Previewing first 1000 characters

Extracting from: https://learn.microsoft.com/en-us/intune/intune-service/fundamentals/deployment-plan-compliance-policies
Access to this page requires authorization. You can try changing directories .

Access to this page requires authorization. You can try signing in or changing directories .

Step 3 – Plan for compliance policies

Previously, you set up your Intune subscription and created app protection policies. Next, plan for and configure device compliance settings and policies to help protect organizational data by requiring devices to meet requirements that you set.

If you’re not yet familiar with compliance policies, see Compliance overview.

This article applies to:

Android Enterprise (Fully Managed, and Personally owned work profiles)

Android Open-Source Project (AOSP)

iOS/iPadOS

Linux

macOS

Windows

You deploy compliance policies to groups of devices or users. When deployed to users, any device the user signs into must then meet the policies requirements. Some common
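
The try-one-extractor-then-the-next pattern in `extract_page_text` generalizes to any ordered list of extractors. The following is a minimal sketch of that pattern, using hypothetical stand-in extractors instead of real network calls; `first_nonempty` and the three stand-ins are illustrative names, not part of the notebook.

```python
def first_nonempty(extractors, url):
    """Try each (name, extractor) pair in order; return the first non-empty text."""
    for name, extractor in extractors:
        try:
            text = extractor(url).strip()
        except Exception as exc:
            # A failing extractor is logged and skipped, not fatal.
            print(f"[{name}] failed for {url}: {exc}")
            continue
        if text:
            return text
    # Every extractor failed or produced only whitespace.
    return ""


# Hypothetical stand-ins for the newspaper3k and readability-based extractors.
def broken_extractor(url):
    raise ValueError("download failed")


def empty_extractor(url):
    return "   "


def working_extractor(url):
    return f"  Main text of {url}  "


extractors = [
    ("primary", broken_extractor),
    ("secondary", empty_extractor),
    ("fallback", working_extractor),
]
print(first_nonempty(extractors, "https://example.com/page"))
```

Keeping the extractors in a list makes it easy to add or reorder methods without touching the control flow.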

### Summarizing Top Results

For summarization, we have two options:

1. LLM-based summarization is best for long documents where nuance, tone, or a goal-aware prompt matters. Pros: it produces high-quality, human-like summaries, instruction-tuned models can follow instructions, and it is easy to use through Hugging Face or OpenAI APIs. Cons: local models usually run best on a GPU (or need a cloud service), it may be slower, and there is a risk of hallucination.
2. Embedding-based ranking (vector similarity) is best for finding the most relevant parts of documents. It is fast, scalable, and well suited to chunk retrieval. However, it does not produce a real summary or rewrite text, and it needs good chunking and query crafting.

We will use a hybrid approach:

- Use embeddings to filter or rank which chunks or articles are relevant
- Then use an LLM to summarize the filtered/retrieved content into a clean paragraph and tailor it
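
The embedding step tends to work better on passages than on whole pages. The following is a minimal sketch of paragraph-based chunking, assuming blank lines separate paragraphs. The function name `chunk_text` and the character budget are illustrative choices, not part of the original notebook.

```python
def chunk_text(text, max_chars=500):
    """Group consecutive paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it starts a new (possibly oversized) chunk.
            current = candidate
        else:
            # Adding the paragraph would exceed the budget: close the current chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


page = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 40
for chunk in chunk_text(page, max_chars=40):
    print(repr(chunk))
```

The resulting chunks could then be passed to the ranking function in place of whole-page texts.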

In [42]:
#Loading embedding model (same one used during preprocessing of .docx files)
embedding_model=SentenceTransformer('all-MiniLM-L6-v2')

def rank_chunks_by_similarity(chunks,query,top_k=3):
    """
    Returns top_k (default is 3) chunks most semantically similar to the query
    """
    query_embedding=embedding_model.encode(query,convert_to_tensor=True)
    chunk_embeddings=embedding_model.encode(chunks, convert_to_tensor=True)

    #Computing similarity scores
    similarities=util.pytorch_cos_sim(query_embedding,chunk_embeddings)[0]
    #Capping k so topk does not fail when fewer than top_k chunks are available
    top_indices=similarities.topk(k=min(top_k,len(chunks))).indices

    #Returning top ranked chunks
    return [chunks[i] for i in top_indices]

#Loading summarization pipeline (do this only once)
summarizer=pipeline("summarization",model="facebook/bart-large-cnn")

def summarize_text_with_llm(text,style_prompt="Summarize in plain English."):
    """
    Uses a language model to summarize the text
    """
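    #Prepending the style prompt to the extracted text
    #Note: facebook/bart-large-cnn is not instruction-tuned, so the prompt is treated as ordinary input text
    input_text = f"{style_prompt}\n\n{text}"
    
    #Truncating to 1024 characters (a rough guard; the model's actual input limit is counted in tokens)
    input_text=input_text[:1024]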

    summary=summarizer(input_text,max_length=150,min_length=40,do_sample=False)
    return summary[0]['summary_text']

Device set to use cpu
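
For intuition, the ranking in `rank_chunks_by_similarity` amounts to scoring each chunk by cosine similarity against the query and keeping the best `top_k`. The dependency-free sketch below uses hand-made 2-D vectors standing in for real embeddings; the vectors and the helper names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_indices(query_vec, chunk_vecs, top_k=3):
    """Indices of the most similar vectors, best first; top_k is capped at the list length."""
    scores = [cosine_similarity(query_vec, vec) for vec in chunk_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:min(top_k, len(order))]


# Toy "embeddings": the second vector points almost the same way as the query.
query_vec = [1.0, 0.0]
chunk_vecs = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7]]
print(top_k_indices(query_vec, chunk_vecs, top_k=2))
```

Capping `top_k` at the number of available vectors keeps the selection safe when a search returns fewer pages than expected.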


In [43]:
#Example workflow
#1. Search and extract content
urls=search_google_serpapi(query,SERPAPI_KEY)
chunks=[extract_page_text(url) for url in urls]

#2. Rank by similarity to query
top_chunks=rank_chunks_by_similarity(chunks, query)

#3. Summarize best content
summary = summarize_text_with_llm("\n\n".join(top_chunks),style_prompt="Summarize in plain English.")

print(summary)

Use compliance policies to set rules for devices you manage with Intune. Conditional Access can enforce Microsoft Entra access controls based on a devices current compliance status to help ensure that only devices that are compliant are permitted to access corporate resources.
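
`extract_page_text` returns an empty string when every method fails, so the `chunks` list in the workflow above can contain blanks. One way to handle that, sketched here with a hypothetical helper `pair_nonempty`, is to pair each URL with its text and drop the empty ones before ranking. The URLs and texts below are invented for illustration.

```python
def pair_nonempty(urls, texts):
    """Pair each URL with its extracted text, skipping pages that yielded no text."""
    return [(url, text) for url, text in zip(urls, texts) if text.strip()]


# Invented inputs: the second and third pages produced no usable text.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
texts = ["Compliance policy overview.", "", "   "]

pairs = pair_nonempty(urls, texts)
print(pairs)
```

Keeping the URL next to its text also makes it possible to report which source a ranked chunk came from.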


### Search Result Cache

We implement a SQLite cache with a 30-day expiry, which gives us a persistent, queryable store that survives notebook restarts. Each row holds the query, the result URLs (serialized as JSON), the summary, and a timestamp.

In [44]:
#Creating or connecting to the SQLite database
conn=sqlite3.connect("search_cache.db")
cursor=conn.cursor()

#Creating the cache table if it doesn't exist
cursor.execute("""
CREATE TABLE IF NOT EXISTS search_cache (
    query TEXT PRIMARY KEY,
    urls TEXT,
    summary TEXT,
    timestamp TEXT
)
""")
conn.commit()

def is_cache_valid(timestamp_str,expiry_days=30):
    """
    Check if cached timestamp is within expiry period
    """
    timestamp=datetime.fromisoformat(timestamp_str)
    return datetime.utcnow() - timestamp <= timedelta(days=expiry_days)

def get_cached_result_sqlite(query):
    """
    Return cached result if valid; otherwise None
    """
    cursor.execute("SELECT urls, summary, timestamp FROM search_cache WHERE query = ?", (query,))
    row=cursor.fetchone()
    if row:
        urls, summary, timestamp=row
        if is_cache_valid(timestamp):
            return {
                "urls":json.loads(urls),
                "summary":summary,
                "timestamp":timestamp
            }
        else:
            print("Cache expired. Deleting old entry...")
            cursor.execute("DELETE FROM search_cache WHERE query = ?", (query,))
            conn.commit()
    return None

def store_result_in_cache_sqlite(query, urls, summary):
    """
    Store new query results with timestamp
    """
    timestamp=datetime.utcnow().isoformat()
    cursor.execute("""
        INSERT OR REPLACE INTO search_cache (query, urls, summary, timestamp)
        VALUES (?, ?, ?, ?)
    """, (query, json.dumps(urls), summary, timestamp))
    conn.commit()
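
Because `query` is the primary key, two queries that differ only in case or spacing produce separate cache rows. A possible refinement, not part of the original code, is to normalize the key before lookup and storage. `normalize_query` below is a hypothetical helper.

```python
def normalize_query(query):
    """Lowercase the query and collapse runs of whitespace to single spaces."""
    return " ".join(query.lower().split())


print(normalize_query("  Best practices   for Microsoft Intune\tdevice compliance "))
```

Whether lowercasing is appropriate depends on the queries involved; for case-sensitive searches, collapsing whitespace alone may be the safer choice.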

In [45]:
#Example usage
#1. Check cache
cached=get_cached_result_sqlite(query)
if cached:
    print("Loaded from cache:")
    print(json.dumps(cached,indent=2))
else:
    print("Fetching fresh data...")
    urls=search_google_serpapi(query,SERPAPI_KEY)
    chunks=[extract_page_text(url) for url in urls]
    top_chunks=rank_chunks_by_similarity(chunks, query)
    summary=summarize_text_with_llm("\n\n".join(top_chunks))

    #2. Store in cache
    store_result_in_cache_sqlite(query,urls,summary)
    print("Stored in cache.")
    print(summary)

Loaded from cache:
{
  "urls": [
    "https://learn.microsoft.com/en-us/intune/intune-service/fundamentals/deployment-plan-compliance-policies",
    "https://learn.microsoft.com/en-us/intune/intune-service/protect/device-compliance-get-started",
    "https://www.goworkwize.com/blog/microsoft-intune-best-practices",
    "https://www.reddit.com/r/Intune/comments/1dmozfw/compliance_policies_whats_your_approach/"
  ],
  "summary": "Use compliance policies to set rules for devices you manage with Intune. Conditional Access can enforce Microsoft Entra access controls based on a devices current compliance status to help ensure that only devices that are compliant are permitted to access corporate resources.",
  "timestamp": "2025-08-01T04:50:17.424034"
}
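
The expiry behavior can be checked without waiting 30 days by using an in-memory database and back-dating a row. The sketch below re-creates the table schema in a throwaway `:memory:` connection. The helper names `store` and `lookup` and the sample rows are illustrative, and the sketch uses timezone-aware timestamps via `datetime.now(timezone.utc)`, which is a substitution for the `datetime.utcnow()` calls in the notebook.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone

# Throwaway in-memory database with the same columns as the notebook's table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_cache (query TEXT PRIMARY KEY, urls TEXT, summary TEXT, timestamp TEXT)"
)


def store(query, urls, summary, when):
    """Insert or overwrite a cache row stamped with the given datetime."""
    conn.execute(
        "INSERT OR REPLACE INTO search_cache VALUES (?, ?, ?, ?)",
        (query, json.dumps(urls), summary, when.isoformat()),
    )
    conn.commit()


def lookup(query, expiry_days=30):
    """Return the cached row as a dict, or None if it is missing or expired."""
    row = conn.execute(
        "SELECT urls, summary, timestamp FROM search_cache WHERE query = ?", (query,)
    ).fetchone()
    if row is None:
        return None
    urls, summary, timestamp = row
    age = datetime.now(timezone.utc) - datetime.fromisoformat(timestamp)
    if age > timedelta(days=expiry_days):
        return None
    return {"urls": json.loads(urls), "summary": summary, "timestamp": timestamp}


now = datetime.now(timezone.utc)
store("fresh query", ["https://example.com/a"], "A fresh summary.", now)
# Back-dating this row by 31 days puts it past the 30-day expiry.
store("stale query", ["https://example.com/b"], "An old summary.", now - timedelta(days=31))

print(lookup("fresh query")["summary"])
print(lookup("stale query"))
```

Note that timezone-aware and naive timestamps cannot be subtracted from each other, so a cache file written with `utcnow()` would need its rows rewritten before switching to this style.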
