News summarization with LangChain agents and Vertex AI PaLM text models

Overview

In real-world business scenarios, it is common to integrate Large Language Models (LLMs) with external sources of knowledge or applications. In the paper ReAct: Synergizing Reasoning and Acting in Language Models (https://arxiv.org/pdf/2210.03629.pdf), researchers from Google Research and Princeton University introduced ReAct, a paradigm that combines reasoning and acting with LLMs by allowing models to interact with external environments to gather additional information.

LangChain (https://python.langchain.com/docs/get_started/introduction) is a framework for building applications powered by language models. It includes extensive support for agents based on the ReAct concepts. LangChain agents use tools to interact with external systems. A tool is a component that performs a specific task, such as retrieving data from an external search engine. LangChain includes a number of predefined tools, such as tools for interacting with Google Search, Wikipedia, arXiv, SQL databases, and many more. You can also define your own tools.
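The reason-then-act loop behind such agents can be sketched in plain Python. The snippet below is a toy illustration only, not LangChain's implementation: scripted_model stands in for the LLM, and lookup_capital, TOOLS, and run_agent are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical tool: a real agent would call a search engine or database here.
def lookup_capital(country: str) -> str:
    capitals = {"France": "Paris", "Japan": "Tokyo"}
    return capitals.get(country, "unknown")


# Registry mapping tool names to callables (an assumption of this sketch).
TOOLS: Dict[str, Callable[[str], str]] = {"lookup_capital": lookup_capital}


def scripted_model(question: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for an LLM: returns (action, action_input).

    With no observations yet it chooses a tool; once an observation
    is available it finishes with an answer.
    """
    if not observations:
        return "lookup_capital", question
    return "finish", f"The capital is {observations[-1]}."


def run_agent(question: str, max_steps: int = 5) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        # Reason: ask the (scripted) model what to do next.
        action, action_input = scripted_model(question, observations)
        if action == "finish":
            return action_input
        # Act: run the chosen tool and feed the result back as an observation.
        observations.append(TOOLS[action](action_input))
    return "No answer within the step limit."


print(run_agent("France"))
```

In a real agent the model produces the action as text, and the framework parses that text before dispatching to the tool.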

This notebook illustrates how to use LangChain agents with Vertex AI PaLM text models and custom tools. You will build an agent that can help you discover the most popular Google Search terms and analyze news articles related to those terms. The Google Trends dataset is hosted on Google BigQuery as part of the Google Cloud Datasets initiative.

The GDELT Project, which is supported by Google Jigsaw, monitors the world's broadcast, print and web news from nearly every corner of every country in over 100 languages. The GDELT database is free to use and accessible via a variety of interfaces, including Google BigQuery and the REST API. In this notebook, we will be using the REST API.

The notebook is structured as follows:

1. You will begin by installing the necessary packages and configuring the GCP environment.
2. Next, you will define and test the custom LangChain tools around the Google Trends dataset and the GDELT API.
3. Finally, you will experiment with using the tools with a few different types of LangChain agents.

Install pre-requisites

Install the following Python packages:

In [None]:
! pip install -U google-cloud-aiplatform
! pip install -U langchain
! pip install -U python-dateutil
! pip install -U newspaper3k

DO NOT FORGET TO RESTART THE RUNTIME before continuing.

Configure Google Cloud Environment Settings

Set the following constants to reflect your GCP environment:

1. PROJECT_ID: Your Google Cloud Project ID
2. REGION: The region to use for Vertex AI

In [None]:
PROJECT_ID = '<YOUR PROJECT ID HERE>'
REGION = 'us-central1'

Initialize Vertex SDK

In [None]:
import vertexai

vertexai.init(project=PROJECT_ID, location=REGION)

Define custom LangChain tools

LangChain tools allow LangChain agents to communicate with other systems. For more information on how to build and use them, please refer to the Tools getting started documentation.

This section of the notebook defines two custom tools:

1. The Google Trends dataset tool allows an agent to retrieve a list of top-ranked search keywords on a given date. The tool retrieves this information from the Google Trends BigQuery dataset.
2. The GDELT tool allows an agent to retrieve news articles that best match a set of keywords, on a given date, and with the given tone. The tool uses the GDELT API to retrieve the articles' metadata and content.

There are a few options for implementing LangChain tools, including using the Tool dataclass, subclassing the BaseTool class, or using the @tool decorator. Because their logic is relatively complex, both tools are implemented by subclassing the LangChain BaseTool class.

Google Trends dataset tool

The tool extracts a list of the most popular search terms on a specific date within the last 30 days. The tool is a wrapper around the Google Trends BigQuery dataset. It expects to receive a JSON object as input, with the following format:

{
    "date": "05-16-2023"
}

Google Trends only stores the top search terms for the previous 30 days, so the date you provide must fall within the range [current_date - 30, current_date - 1]. For any date outside this range, the tool returns an empty list of keywords.
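The range check can be sketched with the standard library alone. This is a minimal illustration that follows the range as stated in the text; is_in_trends_window is a hypothetical helper and not part of the tool defined in this notebook.

```python
from datetime import date, timedelta


def is_in_trends_window(d: date, today: date) -> bool:
    """True if d lies in [today - 30 days, today - 1 day], both ends inclusive."""
    return today - timedelta(days=30) <= d <= today - timedelta(days=1)


today = date(2023, 6, 15)  # fixed date so the example is reproducible
print(is_in_trends_window(date(2023, 6, 14), today))  # yesterday
print(is_in_trends_window(date(2023, 5, 16), today))  # exactly 30 days back
print(is_in_trends_window(today, today))              # today itself
```

Passing today as an argument, rather than calling date.today() inside the function, keeps the check easy to test.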

The tool can handle a variety of date formats to account for the different ways that an LLM might return dates.
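To illustrate what normalizing dates means here, the sketch below tries a handful of candidate formats with the standard library's datetime.strptime. It is a simplified stand-in: the tool itself relies on dateutil's more flexible parser, and normalize_date and CANDIDATE_FORMATS are hypothetical names used only for this example.

```python
from datetime import datetime
from typing import Optional

# Formats an LLM might plausibly emit (an assumption, not an exhaustive list).
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m-%d-%Y", "%m/%d/%Y", "%Y/%m/%d")


def normalize_date(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return None


print(normalize_date("05-16-2023"))
print(normalize_date("2023/05/16"))
print(normalize_date("not a date"))
```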

In [None]:
import json

from datetime import date, timedelta, time, datetime
from typing import Any, Dict, List, Optional, ClassVar, Tuple
from google.cloud import bigquery
from dateutil.parser import parse as parse_date
from langchain.tools import BaseTool

class QueryGoogleTrendsDatasetTool(BaseTool):
    # Annotated so that the overrides are valid pydantic fields.
    name: str = "query_google_trends"
    description: str = """Useful for when you need to find top search terms on a given date.
    Input is a JSON object that has the field date.
    The date must be in the following format: YYYY-MM-DD.
    """

    client: Any
    project_id: str
    location: str = 'US'

    def _retrieve_top_terms(self, refresh_date: str):
        """Retrieves top terms from BigQuery for a given date."""

        # refresh_date has already been normalized to YYYY-MM-DD by _parse_date.
        query = f"""SELECT term, rank FROM `bigquery-public-data.google_trends.top_terms`
             WHERE refresh_date = '{refresh_date}'
             GROUP BY 1,2
             ORDER BY rank ASC
             """
        query_job = self.client.query(
            query=query,
            location=self.location
        )
        df = query_job.to_dataframe()
        return df

    def _parse_date(self, json_params_str: str):
        """Retrieves a date from the JSON input parameters
        and normalizes it to the format required by BigQuery."""

        params = json.loads(json_params_str)

        if 'date' in params:
            try:
                dt = parse_date(params['date'])
                dt = dt.date()
            except (ValueError, OverflowError, TypeError):
                # Unparseable date: fall back to today, which is rejected below.
                dt = date.today()
        else:
            dt = date.today()

        # Accept only dates in [today - 30, today - 1]; otherwise return "".
        if dt >= date.today() or dt < date.today() - timedelta(days=30):
            dt_str = ""
        else:
            dt_str = dt.strftime('%Y-%m-%d')

        return dt_str

    def _run(self, json_params_str: str):
        """Return top search terms as a JSON list"""

        refresh_date = self._parse_date(json_params_str)
        terms = json.dumps([])
        if refresh_date:
            df = self._retrieve_top_terms(refresh_date)
            if not df.empty:
                # Take the top-ranked term and split it into individual keywords.
                terms = df.loc[0].values[0]
                terms = json.dumps(terms.split(' '))

        return terms

    async def _arun(self, json_params_str: str):
        raise NotImplementedError("This tool does not support async")

Running a quick test

In [None]:
google_trends_tool = QueryGoogleTrendsDatasetTool(
    project_id=PROJECT_ID,
    client=bigquery.Client(project=PROJECT_ID)
)
date_str = (date.today() - timedelta(days=10)).strftime('%Y-%m-%d')
input_params = f'{{"date": "{date_str}"}}'
google_trends_tool.run(input_params)
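The f-string above works, but escaping braces by hand is easy to get wrong. As an alternative sketch, json.dumps can build the same input string. build_trends_input is a hypothetical helper and not part of the notebook, and the fixed date only makes the example reproducible.

```python
import json
from datetime import date, timedelta


def build_trends_input(days_ago: int, today: date) -> str:
    """Serialize the tool input for a date `days_ago` days before `today`."""
    query_date = today - timedelta(days=days_ago)
    return json.dumps({"date": query_date.strftime("%Y-%m-%d")})


input_params = build_trends_input(10, date(2023, 6, 15))
print(input_params)
```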

GDELT Search tool

The GDELT Search tool enables an agent to obtain information about articles that match a set of keywords. The tool takes a JSON object as input, with the following format:

{
    "date": "05-16-2023",
    "keywords": ["Real"]
}