#Create Meeting Minutes using OpenAI's GPT-3 from a Microsoft Teams or Zoom Meeting Transcript (Transformers)

This is my attempt to replicate the upcoming Microsoft Teams Premium feature that creates meeting notes using AI.

##1. Prerequisites

The following are prerequisites for this tutorial:

- Python Package: openai
- Python Package: torch and transformers

###1.1. Python Packages

####1.1.1. Install Python Packages
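
The packages listed above can typically be installed with `pip install openai torch transformers`. The `openai.Completion.create` calls used later in this notebook belong to the legacy (pre-1.0) interface of the `openai` package, so pinning `openai<1` may be needed on a fresh environment; treat that pin as an assumption about your setup. The sketch below only reports which required packages are not importable. The helper name `missing_packages` is illustrative and not part of the original notebook.

```python
import importlib.util

# Packages this notebook imports (taken from the prerequisites list).
REQUIRED = ["openai", "torch", "transformers"]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # Assumption: pip is the package manager in use.
    print("Missing packages. Try: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```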

##2. Code

Colab code to wrap long lines in the text output.

In [4]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

###2.1. Import Python Packages

In [5]:
import platform
import os

import openai

import re
from os.path import splitext, exists

import torch
import transformers  # needed so that transformers.__version__ resolves below
from transformers import AutoTokenizer

print('Python: ', platform.python_version())
print('re: ', re.__version__)
print('torch: ', torch.__version__)
print('transformers:', transformers.__version__)

Python:  3.10.11
re:  2.2.1
torch:  2.0.1+cpu

###2.2. Mount Storage - Google Drive

In [None]:
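# Only needed when running on Google Colab and reading files from Google Drive.
# Outside Colab the google.colab module is not installed, so keep these lines
# commented out to avoid a ModuleNotFoundError.
#from google.colab import drive
#drive.mount('/content/drive')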

###2.3. Clean the Meeting Transcript from either Microsoft Teams or Zoom, encoded as a WEBVTT file.

The meeting transcript is encoded as follows:

    WEBVTT
    
    03951482-18bc-403b-9a4f-9d2699587f03/65-1
    00:00:08.885 --> 00:00:13.589
    transcription making sure that
    the transcription does work. Yep

This is not usually a problem with ChatGPT, but the OpenAI GPT-3 API charges per token, so we want to minimize the number of tokens sent to it. We need to remove every line that is not part of the spoken transcript, such as the header, cue identifiers, and timestamps.

The following two functions clean up a .vtt file and write the cleaned text to a file with the same name and a .txt extension (a numeric suffix is added if that file already exists).

In [32]:
def clean_webvtt(filepath: str) -> str:
    """Clean up the content of a subtitle file (vtt) to a string

    Args:
        filepath (str): path to vtt file

    Returns:
        str: clean content
    """
    # read file content
    with open(filepath, "r", encoding="utf-8") as fp:
        content = fp.read()

    # remove empty lines and the "WEBVTT" header (guard against an empty file)
    lines = [line.strip() for line in content.split("\n") if line.strip()]
    if lines and lines[0].upper() == "WEBVTT":
        lines = lines[1:]

    # remove numeric cue indexes
    lines = [line for line in lines if not line.isdigit()]

    # remove cue identifiers such as
    # 03951482-18bc-403b-9a4f-9d2699587f03/65-1
    pattern = r'[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}/\d+-\d'
    lines = [line for line in lines if not re.match(pattern, line)]

    # remove timestamp lines such as 00:00:08.885 --> 00:00:13.589
    # (the dots are escaped so they match a literal ".")
    pattern = r"^\d{2}:\d{2}:\d{2}\.\d{3}.*\d{2}:\d{2}:\d{2}\.\d{3}$"
    lines = [line for line in lines if not re.match(pattern, line)]

    content = " ".join(lines)

    # remove duplicate spaces
    pattern = r"\s+"
    content = re.sub(pattern, r" ", content)

    # add space after punctuation marks if it doesn't exist
    pattern = r"([\.!?])(\w)"
    content = re.sub(pattern, r"\1 \2", content)

    return content


def vtt_to_clean_file(file_in: str, file_out=None, **kwargs) -> str:
    """Save clean content of a subtitle file to text file

    Args:
        file_in (str): path to vtt file
        file_out (None, optional): path to text file
        **kwargs (optional): arguments for other parameters
            - no_message (bool): do not show message of result.
                                 Default is False

    Returns:
        str: path to text file
    """
    # set default values
    no_message = kwargs.get("no_message", False)
    if not file_out:
        filename = splitext(file_in)[0]
        file_out = "%s.txt" % filename
        i = 0
        while exists(file_out):
            i += 1
            file_out = "%s_%s.txt" % (filename, i)

    content = clean_webvtt(file_in)
    with open(file_out, "w+", encoding="utf-8") as fp:
        fp.write(content)
    if not no_message:
        print("clean content is written to file: %s" % file_out)

    return file_out

In [31]:
filepath = "follow up.vtt"

vtt_to_clean_file(filepath)

clean content is written to file: follow up_2.txt


'follow up_2.txt'
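
To make the cleaning steps concrete, here is a minimal sketch that applies the same kinds of filters to the sample cue shown earlier, working on an in-memory string instead of a file. The function name `clean_webvtt_text` and the constant names are illustrative; the notebook itself uses `clean_webvtt`.

```python
import re

SAMPLE_VTT = """WEBVTT

03951482-18bc-403b-9a4f-9d2699587f03/65-1
00:00:08.885 --> 00:00:13.589
transcription making sure that
the transcription does work. Yep
"""

# Cue identifier and timestamp patterns, modeled on those in clean_webvtt.
CUE_ID = re.compile(r"[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}/\d+-\d")
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3}.*\d{2}:\d{2}:\d{2}\.\d{3}$")


def clean_webvtt_text(content: str) -> str:
    """Strip header, cue ids, indexes, and timestamps from WebVTT text."""
    lines = [line.strip() for line in content.split("\n") if line.strip()]
    if lines and lines[0].upper() == "WEBVTT":
        lines = lines[1:]
    kept = [
        line for line in lines
        if not line.isdigit()
        and not CUE_ID.match(line)
        and not TIMESTAMP.match(line)
    ]
    # Join the spoken lines, collapse whitespace, and add a space after
    # sentence punctuation when it is missing.
    text = re.sub(r"\s+", " ", " ".join(kept))
    return re.sub(r"([.!?])(\w)", r"\1 \2", text)


print(clean_webvtt_text(SAMPLE_VTT))
# transcription making sure that the transcription does work. Yep
```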

###2.4. Count the Number of Tokens

OpenAI's GPT-3 models can handle only roughly 4,000 tokens per request (4,097 for `text-davinci-003`, the model used below), and that limit covers both the request (i.e., the prompt) and the response. We will count how many tokens are in this meeting transcript.
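
The budget arithmetic can be sketched as follows. The constant `CONTEXT_LIMIT` is a conservative round figure chosen for illustration (the exact limit depends on the model), and the 20-token allowance for the instruction text is likewise an assumption.

```python
# Assumption: a conservative round figure; the exact limit depends on the model.
CONTEXT_LIMIT = 4000


def fits_in_context(prompt_tokens, max_completion_tokens, limit=CONTEXT_LIMIT):
    """True if the prompt plus the requested completion fits in one request."""
    return prompt_tokens + max_completion_tokens <= limit


# The whole 5,959-token transcript plus a 500-token response does not fit.
print(fits_in_context(5959, 500))

# A 2,000-token chunk plus a short instruction (assumed ~20 tokens) does.
print(fits_in_context(2000 + 20, 500))
```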

In [66]:
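def count_tokens(filename):
    """Return the number of GPT-2 tokens in a text file."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # the cleaned transcript was written as UTF-8, so read it the same way
    with open(filename, 'r', encoding="utf-8") as f:
        text = f.read()

    # encode() returns a list of token ids, so its length is the token count
    return len(tokenizer.encode(text))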

In [67]:
filename = "Acts Retirement-Life.txt"
token_count = count_tokens(filename)
print(f"Number of tokens: {token_count}")

Token indices sequence length is longer than the specified maximum sequence length for this model (5959 > 1024). Running this sequence through the model will result in indexing errors


Number of tokens: 5959


###2.5. Break up the Meeting Transcript into chunks of 2000 tokens with an overlap of 100 tokens

We will break up the meeting transcript into chunks of 2,000 tokens, with consecutive chunks overlapping by 100 tokens, so that information at a chunk boundary is not lost.

In [68]:
def break_up_file_to_chunks(filename, chunk_size=2000, overlap=100):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # the cleaned transcript was written as UTF-8, so read it the same way
    with open(filename, 'r', encoding="utf-8") as f:
        text = f.read()

    tokens = tokenizer.encode(text)
    num_tokens = len(tokens)

    chunks = []
    for i in range(0, num_tokens, chunk_size - overlap):
        chunk = tokens[i:i + chunk_size]
        chunks.append(chunk)

    return chunks

In [69]:
filename = "Acts Retirement-Life.txt"

chunks = break_up_file_to_chunks(filename)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} tokens")

Token indices sequence length is longer than the specified maximum sequence length for this model (5959 > 1024). Running this sequence through the model will result in indexing errors


Chunk 0: 2000 tokens
Chunk 1: 2000 tokens
Chunk 2: 2000 tokens
Chunk 3: 259 tokens
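
The chunk sizes above follow directly from the step size of `chunk_size - overlap` = 1,900 tokens: chunks start at tokens 0, 1900, 3800, and 5700, and the last one holds the remaining 259 tokens. The sketch below reproduces that arithmetic on a stand-in list of integers (no tokenizer needed). It also differs from `break_up_file_to_chunks` in one respect, which is a suggested variation rather than the notebook's behavior: it stops as soon as a chunk reaches the end of the list, so it never emits a final chunk made up only of tokens the previous chunk already covered. The name `chunk_tokens` is illustrative.

```python
def chunk_tokens(tokens, chunk_size=2000, overlap=100):
    """Split a token list into overlapping chunks.

    Stops once a chunk reaches the end of the list, so no trailing chunk
    consisting only of already-covered overlap tokens is produced.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in for the 5,959 token ids reported above.
print([len(c) for c in chunk_tokens(list(range(5959)))])
# [2000, 2000, 2000, 259]

# With 3,850 tokens, a plain range() loop would add a third chunk of 50
# tokens that the second chunk already contains; this version stops at two.
print([len(c) for c in chunk_tokens(list(range(3850)))])
# [2000, 1950]
```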


###2.6. Validate the breakup of the Meeting Transcript into chunks of 2000 tokens with an overlap of 100 tokens

In [70]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [71]:
print(tokenizer.decode(chunks[0][-100:]))

 But Open AI just like you just spoke about, right, these use cases are are extremely pivotal in market right now where they are changing the whole experience design of users, you know in your industry like prescribers, subscribers you know to that capacity. So conversational AI, so that's where your natural language processing will play a big role and. You know, whether you're doing it through an automated fashion, whether you're doing it through voice, whether you're documenting it or you are


In [72]:
print(tokenizer.decode(chunks[1][:100]))

 But Open AI just like you just spoke about, right, these use cases are are extremely pivotal in market right now where they are changing the whole experience design of users, you know in your industry like prescribers, subscribers you know to that capacity. So conversational AI, so that's where your natural language processing will play a big role and. You know, whether you're doing it through an automated fashion, whether you're doing it through voice, whether you're documenting it or you are


In [73]:
if tokenizer.decode(chunks[0][-100:]) == tokenizer.decode(chunks[1][:100]):
    print('Overlap is Good')
else:
    print('Overlap is Not Good')

Overlap is Good
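
The check above compares only the first pair of chunks. A sketch of the same idea applied to every adjacent pair, comparing token ids directly instead of decoded text, might look like this. The function name `overlaps_are_consistent` and the stand-in data are illustrative.

```python
def overlaps_are_consistent(chunks, chunk_size=2000, overlap=100):
    """Check that each chunk begins with the tail of the previous chunk."""
    step = chunk_size - overlap
    for prev, nxt in zip(chunks, chunks[1:]):
        shared = prev[step:]  # tokens the next chunk is expected to repeat
        if nxt[:len(shared)] != shared:
            return False
    return True


# Stand-in token ids, chunked the same way as break_up_file_to_chunks.
demo_tokens = list(range(5959))
demo_chunks = [demo_tokens[i:i + 2000] for i in range(0, len(demo_tokens), 1900)]
print(overlaps_are_consistent(demo_chunks))
# True
```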


###2.7. Set OpenAI API Key

Please note that OpenAI's API service is not free, unlike the ChatGPT demo. You will need to sign up for an account with OpenAI to get an API key, which requires payment information.

Set an environment variable called "OPENAI_API_KEY" and assign it your secret API key from OpenAI (https://beta.openai.com/account/api-keys).

In [2]:
import os
os.environ["OPENAI_API_KEY"] = "API KEY"

In [75]:
openai.api_key = os.getenv("OPENAI_API_KEY")
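
If the environment variable is set outside the notebook (for example in the shell), the key never has to appear in a cell. A minimal sketch of reading it with a clear error when it is missing follows; the helper name `load_api_key` is illustrative.

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key


# Usage in the notebook (left commented so the sketch runs without a key):
# openai.api_key = load_api_key()
```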

###2.8. Summarize the Meeting Transcript

####2.8.1. Summarize the Meeting Transcript one chunk at a time.

In [76]:
filename = "Acts Retirement-Life.txt"

prompt_response = []
prompt_tokens = []

chunks = break_up_file_to_chunks(filename)
for chunk in chunks:
    prompt_request = "Summarize this meeting transcript: " + tokenizer.decode(chunk)

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=.5,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
    )

    prompt_response.append(response["choices"][0]["text"])
    prompt_tokens.append(response["usage"]["total_tokens"])

Token indices sequence length is longer than the specified maximum sequence length for this model (5959 > 1024). Running this sequence through the model will result in indexing errors


In [77]:
prompt_response

[", you know, pushing it out to the other people, that's where the whole piece comes into play.\n\nSam and Richard discussed the capabilities of Sam's company, which includes a network of 50,000 contractors with expertise in AI, cloud, data, and app development. They discussed the upcoming changes to Microsoft's partnership program, the company's experience in AI, and how their services could help PDHI with automation, natural language processing, voice recognition, and data fabric.",
 " captive team model is that you are basically saying, hey, you know, I need you to do this and then I need you to do this and then I need you to do this. So the whole idea of the captive model is that you are able to basically get the same team to work on different type of projects. So if you're saying that I'm looking at the Salesforce cloud, I'm looking at the Microsoft cloud, I'm looking at the Google cloud, and I'm looking at the AWS cloud. I'm able to basically get the same team to work on all thos

In [78]:
prompt_tokens

[2104, 2252, 2156, 326]
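
Several of the responses shown above start with what looks like a continuation of the transcript before the summary begins. One plausible cause is that each chunk ends mid-sentence, so the completion model first finishes the sentence. A prompt layout that delimits the transcript and ends with an explicit cue might reduce this; the sketch below is a suggestion, not what the notebook does, and `build_prompt` and its parameters are illustrative names.

```python
def build_prompt(instruction, transcript, answer_cue="Summary:"):
    """Wrap the transcript in delimiters and end the prompt with a cue."""
    return (
        f"{instruction}\n\n"
        f'Transcript:\n"""\n{transcript.strip()}\n"""\n\n'
        f"{answer_cue}"
    )


print(build_prompt("Summarize this meeting transcript.", "the transcription does work. Yep"))
```

The same helper could be reused for the action item prompts by passing a different `instruction` and `answer_cue`.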

In [79]:
# total tokens used across all chunk requests
total = sum(prompt_tokens)

print("Sum of all elements in given list: ", total)

Sum of all elements in given list:  6838
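
Because the API charges per token, the total can be turned into a rough cost estimate. The rate below is only an illustrative value, not taken from this notebook; check OpenAI's current pricing for the model you use. The function name `estimate_cost` is likewise illustrative.

```python
def estimate_cost(total_tokens, price_per_1k_tokens):
    """Estimate the charge for a number of tokens at a per-1,000-token rate."""
    return total_tokens / 1000 * price_per_1k_tokens


usage = [2104, 2252, 2156, 326]   # per-chunk totals reported above
ASSUMED_RATE = 0.02               # assumption: USD per 1,000 tokens

cost = estimate_cost(sum(usage), ASSUMED_RATE)
print(f"{sum(usage)} tokens, roughly ${cost:.4f} at the assumed rate")
```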


####2.8.2. Consolidate the Meeting Transcript Summaries.

In [80]:
prompt_request = "Consolidate these meeting summaries: " + str(prompt_response)

response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_request,
        temperature=.5,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

In [81]:
meeting_summary = response["choices"][0]["text"]
print(meeting_summary)



Sam and Richard discussed the capabilities of Sam's company, which includes a network of 50,000 contractors with expertise in AI, cloud, data, and app development. They discussed the upcoming changes to Microsoft's partnership program, the company's experience in AI, and how their services could help PDHI with automation, natural language processing, voice recognition, and data fabric. The use cases discussed included conversational AI, summarizing, and creating bullet points, as well as Microsoft's Dataverse, Power BI, and other tools. The speaker then discussed the different models of working with their company, including the captive unit, the competent studio, and the core flexi model. The core flexi model was highlighted as the most successful model, as it allows for a team extension and cross education. The meeting concluded with a plan to have an in-person meeting on either the 24th or 26th of July at 11:30. The meeting will include Peter and Chris and will involve a discussion
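
Note that `str(prompt_response)` sends the Python list representation to the model, including brackets, quotes, and escaped `\n` sequences, all of which consume tokens. A possible alternative, sketched here with illustrative names, joins the stripped summaries into plain numbered text.

```python
def build_consolidation_prompt(summaries,
                               instruction="Consolidate these meeting summaries:"):
    """Join non-empty summaries into a plain-text, numbered prompt."""
    cleaned = [s.strip() for s in summaries if s.strip()]
    numbered = "\n\n".join(
        f"Summary {i}:\n{text}" for i, text in enumerate(cleaned, start=1)
    )
    return f"{instruction}\n\n{numbered}"


# Stand-in summaries; in the notebook this would be prompt_response.
print(build_consolidation_prompt(["\n\nFirst part of the meeting.", " Second part. "]))
```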

###2.9. Get Action Items from Meeting Transcript

####2.9.1. Get Action Items from the Meeting Transcript one chunk at a time.

In [82]:
filename = "follow up.txt"

action_response = []
action_tokens = []

chunks = break_up_file_to_chunks(filename)
for chunk in chunks:
    prompt_request = "Provide a list of important action items with a due date from the provided meeting transcript text: " + tokenizer.decode(chunk)

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=.5,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
    )

    action_response.append(response["choices"][0]["text"])
    action_tokens.append(response["usage"]["total_tokens"])

Token indices sequence length is longer than the specified maximum sequence length for this model (18637 > 1024). Running this sequence through the model will result in indexing errors


In [83]:
print(action_response)

['af3-f945f3d9f2f2-0\n\nAction Items: \n1. Schedule call with Peter to designate tasks and resources - Due Date: TBD \n2. Look at website and research organization - Due Date: ASAP \n3. Move more north - Due Date: TBD', "107abc90-6 then if it's something else, then it can go to a person.\n\nAction Items:\n\n1. Source contractors to help organization - Due Date: N/A\n2. Become a service based partner with Microsoft - Due Date: July \n3. Spearhead AI related efforts - Due Date: N/A\n4. Make ways to do more automation - Due Date: N/A\n5. Implement AI ticketing system - Due Date: N/A", " think that's what it's called.\n\nAction Items: \n1. Create a ticketing system to respond to common questions and issues with a due date of April 1st. \n2. Research and create training for AI with a due date of April 15th. \n3. Explore natural language processing and voice recognition with a due date of April 30th. \n4. Investigate Salesforce Health Cloud versus Microsoft Health with a due date of May 15th

####2.9.2. Consolidate the Meeting Transcript Action Items.

In [84]:
prompt_request = "Consolidate these meeting action items, but exclude action items with Due Date of Immediately: " + str(action_response)

response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_request,
        temperature=.5,
        max_tokens=500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

# use a separate name so the per-chunk action_tokens list is not overwritten
consolidated_action_tokens = response["usage"]["total_tokens"]

In [85]:
meeting_action_items = response["choices"][0]["text"]
print(meeting_action_items)



Action Items: 
1. Schedule call with Peter to designate tasks and resources - Due Date: TBD 
2. Research and create training for AI - Due Date: April 15th 
3. Explore natural language processing and voice recognition - Due Date: April 30th 
4. Investigate Salesforce Health Cloud vs. Microsoft Health - Due Date: May 15th 
5. Pull Apple health data using different vendors - Due Date: May 30th 
6. Analyze voice to text narrative notes - Due Date: June 15th 
7. Research Azure ML OPS and create a pipeline for scaling out tenants - Due Date: 2 weeks 
8. Investigate Open AI use cases and create an experience design - Due Date: 3 weeks 
9. Develop a natural language processing system for conversational AI - Due Date: 4 weeks 
10. Create a workflow for automated decision making - Due Date: 5 weeks 
11. Research Microsoft Dataverse, Data OPS, Power BI, and Orca tools - Due Date: 1 week 
12. Invest in language models for data fabric aspect - Due Date: 2 weeks 
13. Research Microsoft's Build wit

In [86]:
response["usage"]["prompt_tokens"]

1533

In [87]:
response["usage"]["completion_tokens"]

500

In [88]:
response["usage"]["total_tokens"]

2033
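
The completion used exactly 500 tokens, which equals the `max_tokens` value in the request, and the printed list stops mid-word at item 13. That suggests the response was cut off at the token limit. A sketch of detecting this programmatically follows. The `finish_reason` value in the stand-in response is an assumption (the notebook does not print it), and the helper names are illustrative.

```python
def was_truncated(response):
    """True if the first choice reports stopping at the token limit."""
    return response["choices"][0].get("finish_reason") == "length"


def hit_token_limit(response, max_tokens):
    """True if the completion used all of the tokens it was allowed."""
    return response["usage"]["completion_tokens"] >= max_tokens


# Stand-in shaped like the response above; finish_reason is assumed.
example = {
    "choices": [{"text": "13. Research Microsoft's Build wit",
                 "finish_reason": "length"}],
    "usage": {"prompt_tokens": 1533, "completion_tokens": 500,
              "total_tokens": 2033},
}

if was_truncated(example) or hit_token_limit(example, max_tokens=500):
    print("Response was likely cut off; consider raising max_tokens.")
```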