News summarization with PaLM API

Overview

This notebook illustrates how to use Vertex AI PaLM text models for news summarization. You will discover the most popular Google Search terms and summarize news articles related to those terms. A system like this could be useful in a variety of business settings, including marketing and political analysis.

Trending search terms are retrieved from the Google Trends dataset, and news articles from the GDELT database. The Google Trends dataset contains the top 25 overall and top 25 rising queries from Google Trends over the past 30 days. The dataset is hosted on Google BigQuery as part of the Google Cloud Datasets initiative.

The GDELT Project, which is supported by Google Jigsaw, monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages. The GDELT database is free to use and accessible via a variety of interfaces, including Google BigQuery and the REST API. In this notebook, we will be using the REST API.

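For orientation, a request to GDELT can be sketched as below. This is a minimal illustration, not the notebook's retriever: it assumes the GDELT DOC 2.0 endpoint at `https://api.gdeltproject.org/api/v2/doc/doc` and its `query`, `mode`, `format`, `maxrecords`, and `timespan` parameters, and the helper name `build_gdelt_url` is hypothetical. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint of the GDELT DOC 2.0 API.
GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_gdelt_url(term: str, max_records: int = 10, timespan: str = "1d") -> str:
    """Build an article-list request URL for a search term (hypothetical helper)."""
    params = {
        "query": f'"{term}"',      # quote the term so it is searched as a phrase
        "mode": "artlist",         # ask for a list of matching articles
        "format": "json",
        "maxrecords": max_records,
        "timespan": timespan,      # assumed look-back window, e.g. one day
    }
    return f"{GDELT_DOC_API}?{urlencode(params)}"


url = build_gdelt_url("world cup", max_records=5)
print(url)
```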
This notebook is organized as follows:

1. We will begin by installing the necessary packages and configuring the GCP environment.
2. We will query the Google Trends dataset to retrieve the top search terms.
3. We will query the GDELT API to retrieve news related to the top search terms.
4. Finally, we will summarize these news articles.

Install pre-requisites

Install the following Python packages:

In [None]:
! pip install -U google-cloud-aiplatform
! pip install -U python-dateutil
! pip install -U newspaper3k

Do not forget to restart the runtime before continuing.

Configure Google Cloud environment settings

Set the following constants to reflect your GCP environment:

1. PROJECT_ID: Your Google Cloud Project ID.
2. REGION: The region to use for Vertex AI.

In [None]:
PROJECT_ID = '<YOUR PROJECT ID HERE>'
REGION = 'us-central1'

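Forgetting to replace the placeholder project ID tends to surface later as a less obvious authentication or permission error. A small sanity check can catch it early. The sketch below is an optional addition, and the function name `check_config` is hypothetical; it only assumes that an unedited placeholder starts with `<`, as the one above does.

```python
def check_config(project_id: str, region: str) -> None:
    """Raise ValueError if the project ID is still a placeholder or the region is empty."""
    if not project_id or project_id.startswith("<"):
        raise ValueError("Set PROJECT_ID to your Google Cloud project ID.")
    if not region:
        raise ValueError("Set REGION to a Vertex AI region, for example 'us-central1'.")


# A filled-in configuration passes silently ("my-sample-project" is a made-up ID).
check_config("my-sample-project", "us-central1")
```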
Initialize the SDK and import some modules.

In [None]:
import logging
import os
import requests
import vertexai

from newspaper import Article
from newspaper import ArticleException

from dateutil.parser import parse as parse_date
from datetime import date, timedelta, datetime
from google.cloud import bigquery
from vertexai.preview.language_models import TextGenerationModel
from typing import Any, Dict, List

logging.basicConfig(level=logging.INFO)
# Use the constants defined above (PROJECT_INFO was undefined).
vertexai.init(project=PROJECT_ID, location=REGION)

bq_client = bigquery.Client(project=PROJECT_ID)
llm = TextGenerationModel.from_pretrained("text-bison@001")


Google Trends lookup tool

Returns the top (rank 1) search term(s) for a given date.

In [None]:
class GoogleTrends:
    """Get trending search terms from the Google Trends BigQuery dataset.

    Useful for when you need to find top search terms on a given date.
    Input is a JSON object that has the field date.
    """

    def __init__(self, project_id: str, bq_client: Any):
        self.project_id = project_id
        self.bq_client = bq_client

    def run(self, json_params: Dict) -> List[str]:
        """Return the words of the top-ranked term, or [] if none is available."""
        refresh_date = self._parse_date(json_params)
        if not refresh_date:
            return []

        df = self._query_top_terms(refresh_date)
        if df.empty:
            # No rows were returned for this date.
            return []

        # The first row holds the rank 1 term; split it into individual words.
        return df.loc[0].values[0].split(' ')

    def _query_top_terms(self, refresh_date: str):
        """Retrieve top terms from Google Trends."""
        # refresh_date is a YYYY-MM-DD string produced by _parse_date.
        query = f"""
            SELECT term, rank FROM `bigquery-public-data.google_trends.top_terms`
            WHERE refresh_date = '{refresh_date}'
            GROUP BY 1, 2
            ORDER BY rank ASC
        """
        query_job = self.bq_client.query(query, location='US')
        return query_job.to_dataframe()

    def _parse_date(self, json_params: Dict) -> str:
        """Return the date as YYYY-MM-DD, or '' if it is outside the past 30 days."""
        try:
            dt = parse_date(json_params['date']).date()
        except (KeyError, TypeError, ValueError, OverflowError):
            # Missing or unparseable date: fall back to today,
            # which the range check below then rejects.
            dt = date.today()

        # Only dates strictly between 30 days ago and today are accepted.
        if dt >= date.today() or dt <= date.today() - timedelta(days=30):
            return ""
        return dt.strftime('%Y-%m-%d')

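The date rule used by the tool (a date is accepted only if it lies strictly between 30 days ago and today) can be checked in isolation. The sketch below restates that rule as a standalone function; the name `in_trends_window` and the `today` parameter are hypothetical additions that make the rule testable against a fixed reference date.

```python
from datetime import date, timedelta
from typing import Optional


def in_trends_window(day: date, today: Optional[date] = None, window_days: int = 30) -> bool:
    """Return True if `day` is strictly after (today - window_days) and strictly before today."""
    today = today or date.today()
    return today - timedelta(days=window_days) < day < today


reference = date(2023, 12, 1)
print(in_trends_window(date(2023, 11, 24), today=reference))  # True
print(in_trends_window(date(2023, 12, 1), today=reference))   # False: today itself is excluded
print(in_trends_window(date(2023, 11, 1), today=reference))   # False: exactly 30 days back
```

A date such as `date.today() - timedelta(days=2)` always falls inside this window, although whether the dataset already holds rows for it is a separate question.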

In [None]:
# The Google Trends dataset in BigQuery only stores data from the past month.
# Change the date below to a valid one (within the past 30 days, excluding today).

google_trends_tool = GoogleTrends(project_id=PROJECT_ID, bq_client=bq_client)
google_trends_tool.run({'date': '11-24-2023'})

GDELT Retriever 