## Introduction

Enterprise knowledge bases are common today and can hold a large number of documents across many categories. Retrieving relevant content for a user query is a challenging task. Methods such as PageRank, developed and refined by Google, made it possible to retrieve information accurately at the document level, but the user still has to open the document and search it for the relevant passage. Recent advances in foundation models, such as those developed by OpenAI, ease this step: “semantic search” methods use encoded representations of text called “embeddings” to find the relevant passage, which the model can then summarize and present to the user concisely.

This notebook introduces the use case and walks you through using Azure Cognitive Search to retrieve relevant documents, then using GPT-3 to locate the relevant part of a document and summarize it according to the prompt (the instruction given to the model). It aims to demonstrate how to adapt Azure OpenAI’s GPT-3 capabilities to your summarization case and how to set up and evaluate summarization results. The method is customizable and can be applied to many different datasets.

## Use Case

This use case consists of three sections:
- Document search
- Document Zone search
- Text summarization

Document search is the process of retrieving the relevant document from a corpus based on the query.
Document zone search is the process of finding the relevant part of the document returned by document search.
Text summarization is the process of creating summaries from large volumes of text while preserving the significant information and content value.
This use case can help subject matter experts find relevant information in a large document corpus.
Example: In the drug discovery process, scientists in the pharmaceutical industry read a corpus of documents to find specific information about concepts, experiment results, and so on. This use case lets them ask questions of the document corpus and get back a succinct answer, which expedites the drug discovery process.
 
Benefits of the solution:
1. Shortens reading time
2. Improves the effectiveness of searching for information
3. Removes bias from human summarization techniques
4. Increases bandwidth for humans to focus on more in-depth analysis 


Document summarization can be applied to any subject matter (legal, financial, journalistic, medical, academic, etc.) that involves long documents. This notebook focuses on journalism: we will walk through news articles. For more domain-specific topics, fine-tuning the GPT-3 model would likely work better than relying on zero-shot or few-shot prompting alone.


## CNN daily mail dataset
For this walkthrough, we will be using the CNN/Daily Mail dataset. This is a common dataset used for text summarization and question answering tasks. Its human-written abstractive summary bullets come from news stories on the CNN and Daily Mail websites.


## Data Description
The relevant schema for our work today consists of:

- id: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
- article: a string containing the body of the news article
- highlights: a string containing the highlight of the article as written by the article author
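
As a quick illustration of the `id` field, the sketch below computes a hexadecimal SHA-1 digest of a URL with Python's `hashlib`. The helper name `url_to_id` and the URL are made up for illustration, and any URL normalization applied when the dataset was built is not described here, so treat this only as an assumption about how such an id could be derived.

```python
import hashlib

def url_to_id(url: str) -> str:
    """Return the 40-character hexadecimal SHA-1 digest of a URL string."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

# Hypothetical URL, used only for illustration
example_id = url_to_id("https://example.com/news/story.html")
print(len(example_id), example_id)
```

The digest is always 40 hexadecimal characters, which matches the shape of the `id` values shown in the dataframe preview.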


## Import python modules

In [1]:
import os, json, requests, sys, re
from pprint import pprint
import pandas as pd
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient 
from azure.search.documents import SearchClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
    SemanticSettings
)


import openai
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity

from dotenv import load_dotenv
load_dotenv()

True

In [2]:
openai.api_key = os.environ['OPENAI_API_KEY']
openai.api_base = os.environ['OPENAI_API_BASE']
openai.api_type = os.environ['OPENAI_API_TYPE']
openai.api_version = os.environ['OPENAI_API_VERSION']

text_model = os.environ['TEXT_DAVINCI_NAME']
chat_model = os.environ['CHAT_MODEL_NAME']
embedding_model=os.environ['EMBEDDING_MODEL_NAME']

In [4]:
# read the CNN dailymail dataset in pandas dataframe
df = pd.read_csv('data/structured/cnn_dailymail_data.csv') #path to CNN daily mail dataset
df.head()

| | id | article | highlights |
|---|---|---|---|
| 0 | 92c514c913c0bdfe25341af9fd72b29db544099b | Ever noticed how plane seats appear to be gett... | Experts question if packed out planes are put... |
| 1 | 2003841c7dc0e7c5b1a248f9cd536d727f27a45a | A drunk teenage boy had to be rescued by secur... | Drunk teenage boy climbed into lion enclosure ... |
| 2 | 91b7d2311527f5c2b63a65ca98d21d9c92485149 | Dougie Freedman is on the verge of agreeing a ... | Nottingham Forest are close to extending Dougi... |
| 3 | caabf9cbdf96eb1410295a673e953d304391bfbb | Liverpool target Neto is also wanted by PSG an... | Fiorentina goalkeeper Neto has been linked wit... |
| 4 | 3da746a7d9afcaa659088c8366ef6347fe6b53ea | Bruce Jenner will break his silence in a two-h... | Tell-all interview with the reality TV star, 6... |


In [5]:
# Create an SDK client
service_endpoint = os.getenv("AZURE_COGNITIVE_SEARCH_ENDPOINT")   
key = os.getenv("AZURE_COGNITIVE_SEARCH_KEY")
credential = AzureKeyCredential(key)

index_name = "news-index"

index_client = SearchIndexClient(
    endpoint=service_endpoint, credential=credential)
index_client

<azure.search.documents.indexes._search_index_client.SearchIndexClient at 0x1afa831c0d0>

In [7]:
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="highlights", type=SearchFieldDataType.String,
                searchable=True, retrievable=True),
    SearchableField(name="article", type=SearchFieldDataType.String,
                filterable=True, searchable=True, retrievable=True),
]

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        #title_field=SemanticField(field_name=""), # title field is not present in the dataset. We can use OpenAI to generate title
        #prioritized_keywords_fields=[SemanticField(field_name="")], # keywords are not present in the dataset. We can use OpenAI to generate keywords
        prioritized_content_fields=[SemanticField(field_name="article"), SemanticField(field_name="highlights")]
    )
)


# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields, semantic_settings=semantic_settings)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')

 news-index created


In [10]:
documents = df.to_dict('records')
documents[0]

{'id': '92c514c913c0bdfe25341af9fd72b29db544099b',
 'article': "Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk. They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger. More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans. 'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee.\xa0'It is time that the DOT and FAA take a stand for humane treatment of passengers.' But could crow

In [11]:
len(documents)

11490

In [12]:
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(documents)  
print(f"Uploaded and Indexed {len(result)} documents") 

Uploaded and Indexed 11490 documents
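
The cell above uploads all records in one call. If a single request were rejected for being too large, one option would be to upload in smaller batches. The sketch below shows a generic batching helper on toy data; the name `batched` and the `batch_size` of 1000 are assumptions chosen for illustration, not values taken from this notebook.

```python
from typing import Iterator, List

def batched(items: List[dict], batch_size: int = 1000) -> Iterator[List[dict]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy stand-in for the `documents` list built from the dataframe
toy_documents = [{"id": str(i)} for i in range(2500)]
batch_sizes = [len(batch) for batch in batched(toy_documents)]
print(batch_sizes)  # [1000, 1000, 500]
```

In the notebook, each batch could then be passed to `search_client.upload_documents(batch)` inside the loop.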


## Section 1: Leveraging Cognitive Search to extract the relevant article based on the query

## Creating Cognitive Search Index using CNN Dailymail dataset
<img src="../Demo/images/AzureCogSearchIndex.png" alt="Alternative text" />

In [13]:
search_term = "Stock market news"
count = 2
results = search_client.search(search_text=search_term, top=count, include_total_count=True)
for result in results:
    print(result)
    print("\n\n")

{'highlights': 'Shares have performed nearly twice as well under Tories, new figures show .\r\nStock market returns averaged 16 per cent per year under Conservatives .\r\nReturns hovered around nine per cent under Labour and current Coalition .', 'article': "Stock market returns have averaged 16 per cent a year under David Cameron's Conservatives party, compared to just under nine per cent under Labour . Shares have performed nearly twice as well under Conservative governments than under Labour over the last 45 years, according to figures published today. Stock market returns have averaged 16 per cent a year under the Tories compared with just under 9 per cent under Labour and just over 9 per cent under the current Coalition. The figures, in an analysis for the Daily Mail by investment company Hargreaves Lansdown, are a further boost for David Cameron and George Osborne ahead of the general election on May 7. Ed Miliband is struggling to win over business men and women as well as inves

In [14]:
#Extracting relevant article based on the query. eg: Clinton Democratic Nomination
results = search_client.search(search_text="Clinton Democratic nomination", include_total_count=True)
document = next(results)['article']

In [15]:
document

'Apple founder Steve Jobs\' widow Laurene has told of her admiration for Democratic White House front-runner Hillary Clinton. Ms Jobs, 51, called former First Lady Hillary a \'revolutionary\' woman, and added that it\'s not just because she\'s a woman - but \'the type of woman she is\'. Speaking to Time 100, Ms Jobs said: \'Hillary Clinton is not familiar. She is revolutionary. Not radical, but revolutionary: The distinction is crucial. She is one of America’s greatest modern creations. Laurene Jobs, pictured, widow of Apple\'s Steve, has strongly backed Hillary Clinton for president . Laurene Jobs said that Hillary Clinton, right, has \'judgment and wisdom\' based on her public service . \'Her decades in our public life must not blind us to the fact that she represents new realities and possibilities. Indeed, those same decades have conferred upon her what newness usually lacks: judgment, and even wisdom. \'It matters, of course, that Hillary is a woman. But what matters more is what 

In [16]:
#length of article extracted from Azure Cognitive search
len(document) 

6675

## Section 2: Document Zone Search
Document Zone: Azure OpenAI Embedding API
Now that we have narrowed our search to a single document from our knowledge base using Azure Cognitive Search, we can dive deeper into that document and refine our initial query to a more specific section or "zone" of the article.

To do this, we will utilize the Azure OpenAI Embeddings API.

## Embeddings Overview
An embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.

Different Azure OpenAI embedding models are designed for particular tasks. Similarity embeddings are good at capturing semantic similarity between two or more pieces of text. Text search embeddings help measure whether long documents are relevant to a short query. Code search embeddings are useful for embedding code snippets and natural language search queries.

Embeddings make it easier to do machine learning on large inputs representing words by capturing semantic similarities in a vector space. We can therefore use embeddings to determine whether two text chunks are semantically related or similar, and to obtain a score that quantifies that similarity.

## Cosine Similarity
A previously used approach to matching similar documents was to count the number of words they have in common. This is flawed because, as document size increases, the number of common words tends to increase even if the topics differ. Cosine similarity is therefore a better approach.

Mathematically, cosine similarity measures the cosine of the angle between two vectors projected in a multi-dimensional space. This is beneficial because if two documents are far apart by Euclidean distance because of size, they could still have a smaller angle between them and therefore higher cosine similarity.

The Azure OpenAI embeddings rely on cosine similarity to compute similarity between documents and a query.
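
To make the contrast with Euclidean distance concrete, here is a minimal sketch using NumPy. Cosine similarity is computed as the dot product of two vectors divided by the product of their norms. The three-dimensional vectors are toy values standing in for real embeddings, and `cosine_sim` is a hypothetical helper written for this example rather than the function imported in the notebook.

```python
import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (assumed values, not real embeddings)
short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc               # same direction, much larger magnitude
other_doc = np.array([0.0, 1.0, 3.0])   # different direction, similar magnitude

# Euclidean distance treats the scaled copy as far away...
euclid_long = float(np.linalg.norm(short_doc - long_doc))
euclid_other = float(np.linalg.norm(short_doc - other_doc))

# ...while cosine similarity only looks at the angle between the vectors
cos_long = cosine_sim(short_doc, long_doc)
cos_other = cosine_sim(short_doc, other_doc)

print(f"Euclidean: long={euclid_long:.2f}, other={euclid_other:.2f}")
print(f"Cosine:    long={cos_long:.2f}, other={cos_other:.2f}")
```

Here the scaled copy is farther away by Euclidean distance but has the higher cosine similarity, which is the behavior described above for documents that differ mainly in size.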

## Setting up Azure OpenAI service and using deployed models

In [17]:
#Defining helper functions
#Splits text after sentences ending in a period. Combines n sentences per chunk.
def splitter(n, s):
    pieces = s.split(". ")
    list_out = [" ".join(pieces[i:i+n]) for i in range(0, len(pieces), n)]
    return list_out

# Perform light data cleaning (removing redundant whitespace and cleaning up punctuation)
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,","",s)
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.replace("\n", "")
    s = s.strip()    
    return s
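
Because `splitter` splits on `". "` and rejoins with a space, the sentence-ending periods are dropped from each chunk. A possible alternative, sketched below, splits on the whitespace that follows `.`, `!` or `?` so the punctuation is kept. The function name `split_keep_punctuation` is hypothetical, and the simple rule will also split after abbreviations such as "U.S.", so it is only a starting point.

```python
import re

def split_keep_punctuation(n: int, text: str) -> list:
    """Group sentences into chunks of n, keeping the sentence-ending punctuation."""
    # The lookbehind keeps '.', '!' or '?' attached to the preceding sentence
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + n]) for i in range(0, len(sentences), n)]

sample = "First sentence. Second one! Third? Fourth."
chunks = split_keep_punctuation(2, sample)
print(chunks)  # ['First sentence. Second one!', 'Third? Fourth.']
```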

In [18]:
document_chunks = splitter(10, normalize_text(document)) #splitting extracted document into chunks of 10 sentences
document_chunks

["Apple founder Steve Jobs' widow Laurene has told of her admiration for Democratic White House front-runner Hillary Clinton Ms Jobs, 51, called former First Lady Hillary a 'revolutionary' woman, and added that it's not just because she's a woman - but 'the type of woman she is' Speaking to Time 100, Ms Jobs said: 'Hillary Clinton is not familiar She is revolutionary Not radical, but revolutionary: The distinction is crucial She is one of America’s greatest modern creations Laurene Jobs, pictured, widow of Apple's Steve, has strongly backed Hillary Clinton for president  Laurene Jobs said that Hillary Clinton, right, has 'judgment and wisdom' based on her public service  'Her decades in our public life must not blind us to the fact that she represents new realities and possibilities Indeed, those same decades have conferred upon her what newness usually lacks: judgment, and even wisdom",
 "'It matters, of course, that Hillary is a woman But what matters more is what kind of woman she i

In [28]:
from openai.error import RateLimitError
from time import sleep


# Note: this definition overrides the get_embedding imported from openai.embeddings_utils
def get_embedding(text: str, engine: str = "text-embedding-ada-002"):
    # Retry on rate limiting, pausing two seconds between attempts
    while True:
        try:
            embedding = openai.Embedding.create(input=[text], engine=engine)["data"][0]["embedding"]
            break
        except RateLimitError:
            sleep(2)
    return np.array(embedding).astype(np.float32)

def get_completion(prompt, model="gpt-35-turbo", temperature=0):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        engine=model,
        messages=messages,
        temperature=temperature, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]
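
The retry loop in `get_embedding` waits a fixed two seconds between attempts. A common alternative is capped exponential backoff with a limit on the number of attempts. The sketch below illustrates the idea without any network call: `TransientError`, `retry_with_backoff` and `flaky_call` are hypothetical stand-ins, and the delay values are assumptions.

```python
import time

class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit error raised by an API client."""

def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(min(max_delay, base_delay * 2 ** attempt))

# Demo: a function that fails twice before succeeding
calls = {"n": 0}

def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"

delays = []  # record the requested delays instead of actually sleeping
result = retry_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` would apply the delays.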


In [22]:
embed_df = pd.DataFrame(document_chunks, columns = ["chunks"]) #dataframe with document chunks

#Create an embedding vector for each chunk that will capture the semantic meaning and overall topic of that chunk
embed_df['embeddings'] = embed_df["chunks"].apply(lambda x : get_embedding(x, engine = embedding_model))

embed_df 


| | chunks | embeddings |
|---|---|---|
| 0 | Apple founder Steve Jobs' widow Laurene has to... | [-0.017156884, -0.008087487, -0.020686451, 0.0... |
| 1 | 'It matters, of course, that Hillary is a woma... | [-0.029816845, -0.022468744, 0.007905178, 0.00... |
| 2 | Bird himself is a frequent participant in Iowa... | [-0.040256746, -0.02904699, 0.0066080606, -0.0... |
| 3 | Price was executive director of the Iowa Democ... | [-0.042181276, -0.010906834, -0.00850583, -0.0... |
| 4 | And planting party insiders in place of typica... | [-0.030688679, -0.013891704, 0.011889042, -0.0... |
| 5 | 'I was driving the Vice President when he was ... | [-0.045601957, -0.031421017, -0.0022003786, -0... |


In [25]:
# search through the document for the text segments most similar to the query
# return the top_n most similar chunks based on cosine similarity
def search_docs(df, user_query, top_n=3):
    embedding = get_embedding(
        user_query,
        engine=embedding_model,
    )
    df["similarities"] = df['embeddings'].apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .reset_index(drop=True)
        .head(top_n)
    )
    return res

In [26]:
document_specific_query = "trouble so far in clinton campaign" 
res = search_docs(embed_df, document_specific_query, top_n=2) #finding top 2 results based on similarity 
res

| | chunks | embeddings | similarities |
|---|---|---|---|
| 0 | And planting party insiders in place of typica... | [-0.030688679, -0.013891704, 0.011889042, -0.0... | 0.843564 |
| 1 | 'It matters, of course, that Hillary is a woma... | [-0.029816845, -0.022468744, 0.007905178, 0.00... | 0.831203 |
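
The ranking step can be tried offline with toy data. The sketch below scores each chunk against a query vector with cosine similarity and keeps the top results, working on a copy so the input dataframe is not modified. The chunk texts, the three-dimensional vectors, and the name `rank_chunks` are all assumptions for illustration; real embeddings have far more dimensions.

```python
import numpy as np
import pandas as pd

def rank_chunks(df: pd.DataFrame, query_vec, top_n: int = 2) -> pd.DataFrame:
    """Return the top_n rows of df ranked by cosine similarity to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    scored = df.copy()  # leave the caller's dataframe untouched
    scored["similarities"] = scored["embeddings"].apply(
        lambda v: float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
    )
    return (
        scored.sort_values("similarities", ascending=False)
        .reset_index(drop=True)
        .head(top_n)
    )

# Toy chunks with made-up 3-dimensional "embeddings"
toy_df = pd.DataFrame({
    "chunks": ["campaign trouble", "sports result", "campaign launch"],
    "embeddings": [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.7, 0.6, 0.1]],
})
top = rank_chunks(toy_df, [1.0, 0.0, 0.0])
print(top[["chunks", "similarities"]])
```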


## Section 3: Summarizer

This section covers the end-to-end flow of using GPT-3 models for summarization tasks.
The Azure OpenAI service exposes the model through a generative completion call, which uses natural language instructions to identify the task being asked and the skill required. Crafting those instructions is known as prompt engineering. With this approach, the first part of the prompt contains natural language instructions and/or examples of the desired task. The model then completes the task by predicting the most probable next text. This technique is known as "in-context" learning.

There are three main approaches for adapting the model to a task: zero-shot, few-shot, and fine-tuning. Zero-shot and few-shot are forms of in-context learning, while fine-tuning updates the model weights. The approaches vary in the amount of task-specific data given to the model:

**Zero-shot**: In this case, no examples are provided to the model and only the task request is provided. 

**Few-shot**: In this case, a user includes several examples in the call prompt that demonstrate the expected answer format and content. 

**Fine-tuning**: Fine-tuning lets you tailor models to your own datasets. This customization step lets you get more out of the service:
-	With enough data (500 examples or more), traditional optimization techniques with backpropagation re-adjust the weights of the model, which enables higher-quality results than zero-shot or few-shot alone.
-	A customized model improves on the few-shot approach by training the model weights on your specific prompts and structure. This lets you achieve better results on a wider range of tasks without needing to provide examples in the prompt. The result is less text sent and fewer tokens processed per call.
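
The difference between zero-shot and few-shot prompting comes down to what is placed in the prompt text. The sketch below assembles both variants with a small helper; `build_prompt` is a hypothetical function, and the example sentences are made up for illustration rather than taken from the dataset.

```python
def build_prompt(instruction: str, text: str, examples=None) -> str:
    """Assemble a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    parts = [instruction.strip(), ""]
    # Each example is a (text, summary) pair shown to the model before the real input
    for example_text, example_summary in (examples or []):
        parts += [f"Text:\n{example_text}", f"Summary:\n{example_summary}", ""]
    # The real input comes last, with an open "Summary:" slot for the model to fill
    parts += [f"Text:\n{text}", "Summary:\n"]
    return "\n".join(parts)

instruction = "Summarize the text in one sentence."
article = "The council approved the new budget on Monday."

zero_shot = build_prompt(instruction, article)
few_shot = build_prompt(
    instruction,
    article,
    examples=[("The team won the final 3-1 after extra time.", "The team won the final.")],
)
print(few_shot)
```

Both prompts end with an open `Summary:` slot; the few-shot version simply spends extra tokens on worked examples before it.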


In [27]:
'''Designing a prompt that will show and tell GPT-3 how to proceed. 
+ Providing an instruction to summarize the text about the general topic (prefix)
+ Providing quality data for the chunks to summarize and specifically mentioning they are the text provided (context + context primer)
+ Providing a space for GPT-3 to fill in the summary to follow the format (suffix)
'''

# result_1 corresponding to the top chunk from Section 2. result_2 corresponding to the second to top chunk from section 2. 
# change index for desired chunk
result_1 = res.chunks[0]
result_2 = res.chunks[1]
prompt_i = 'Summarize the content about the Clinton campaign given the text provided.\n\nText:\n' + normalize_text(result_1) + '\n\nText:\n' + normalize_text(result_2) + '\n\nSummary:\n'
print(prompt_i)

Summarize the content about the Clinton campaign given the text provided.

Text:
And planting party insiders in place of typical Iowans won't go over well in the Hawkeye State, where pressing the flesh and collecting caucus votes is a quadrennial full-contact sport ASTROTURF: Setting up faux events for news cameras is nothing new in politics, but Iowans take presidential contests seriously and could punish Clinton for the deception THE FIXER: Bird said Troy Prince (left, pictured with VP Joe Biden), who was executive director of the Iowa Democratic Party until he left last month to help Clinton's statewide political effort, recruited him and others to attend the 'spontaneous' coffee meeting Clinton's campaign has already taken heat for depicting at least three people in her campaign launch video as 'everyday' Americans who were actually partisans with political connections One was even a former campaign manager for Wendy Davis, the Texas Democrat who mounted a failed bid for Texas gove

In [30]:
get_completion(prompt_i, model=chat_model) # default temperature is set to 0

"The Clinton campaign has been criticized for staging events and using party insiders instead of typical Iowans to meet with Hillary Clinton. The recent coffee shop visit in LeClaire, Iowa was orchestrated from beginning to end, with one of the men pictured being a Democratic Party insider who chauffeured Vice President Joe Biden around Davenport. Iowans take presidential contests seriously and could punish Clinton for the deception. Clinton's campaign has already taken heat for depicting at least three people in her campaign launch video as 'everyday' Americans who were actually partisans with political connections."

In [29]:
get_completion(prompt_i, model=chat_model, temperature=0.5)

'The Clinton campaign has been criticized for using party insiders and staging events to create a false impression of grassroots support in Iowa. The recent coffee shop meeting with three young Iowans was orchestrated by a campaign staffer who drove them to the event and vetted them beforehand. One of the participants, Austin Bird, is a Democratic Party insider who chauffeured Vice President Joe Biden around Davenport, Iowa. Iowans take presidential contests seriously and could punish Clinton for this deception. Clinton\'s campaign has already faced criticism for depicting at least three people in her campaign launch video as "everyday" Americans who were actually partisans with political connections.'
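
The two calls above differ only in `temperature`. As a rough intuition for why the outputs differ, the sketch below shows a common way temperature is applied: next-token scores (logits) are divided by the temperature before a softmax, so lower values concentrate probability on the top candidate and a value of 0 is treated as always picking it. The logits are toy numbers and `temperature_probs` is a hypothetical helper; this is not a description of the service's internal sampling.

```python
import numpy as np

def temperature_probs(logits, temperature: float) -> np.ndarray:
    """Convert logits to probabilities after temperature scaling."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Treat 0 as greedy decoding: all probability on the highest logit
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

toy_logits = [2.0, 1.0, 0.5]  # assumed scores for three candidate tokens
greedy = temperature_probs(toy_logits, 0)
mild = temperature_probs(toy_logits, 0.5)
neutral = temperature_probs(toy_logits, 1.0)

for name, probs in [("T=0", greedy), ("T=0.5", mild), ("T=1.0", neutral)]:
    print(name, np.round(probs, 3))
```

With these toy logits the top candidate's probability shrinks as temperature rises, which leaves more room for alternative wordings.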