<a href="https://colab.research.google.com/github/sungkim11/ai-playground/blob/main/Create_Meeting_Minutes_Using_AI_Workbook_Asynchronous.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Create Meeting Minutes Using OpenAI's GPT-3 from a Microsoft Teams or Zoom Meeting Transcript (Asynchronous)

This is my attempt to replicate the upcoming Microsoft Teams Premium feature that creates meeting notes using AI.

##1. Prerequisites

The following are prerequisites for this tutorial:

- Python Package: openai
- Python Package: torch and transformers

- Python Package: asyncio
- Python Package: nest-asyncio
- Python Package: openai-async


###1.1. Python Packages

####1.1.1. Install Python Packages

In [43]:
%%writefile requirements.txt
openai
openai-async
asyncio
nest-asyncio
torch==1.13.1+cu116
transformers==4.26.1

Overwriting requirements.txt


In [44]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting asyncio
  Downloading asyncio-3.4.3-py3-none-any.whl (101 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m101.8/101.8 KB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: asyncio
Successfully installed asyncio-3.4.3


##2. Code

Colab code to prettify text output.

In [1]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

###2.1. Import Python Packages

In [2]:
import platform
import os

import openai
import openai_async

import asyncio
import nest_asyncio
nest_asyncio.apply()

import re
from os.path import splitext, exists

import torch
import transformers
from transformers import AutoTokenizer

print('Python: ', platform.python_version())
print('re: ', re.__version__)
print('torch: ', torch.__version__)
print('transformers: ', transformers.__version__)

Python:  3.8.10
re:  2.2.1
torch:  1.13.1+cu116
transformers:  4.26.1


###2.2. Mount Storage - Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


###2.3. Clean the Meeting Transcript from either Microsoft Teams or Zoom, encoded as a WEBVTT file.

The meeting transcript is encoded as follows:

    WEBVTT
    
    03951482-18bc-403b-9a4f-9d2699587f03/65-1
    00:00:08.885 --> 00:00:13.589
    transcription making sure that
    the transcription does work. Yep

This is usually not a problem with ChatGPT, but the OpenAI GPT-3 API charges per token, so we want to minimize the number of tokens we send to it. We therefore remove every line that is not part of the spoken transcript: the WEBVTT header, the cue identifiers, and the timestamps.

The two functions below clean up a .vtt file and write the cleaned text to a file with the same name and a .txt extension (a numeric suffix is appended if that file already exists).

In [4]:
def clean_webvtt(filepath: str) -> str:
    """Clean up the content of a subtitle file (vtt) to a string

    Args:
        filepath (str): path to vtt file

    Returns:
        str: clean content
    """
    # read file content
    with open(filepath, "r", encoding="utf-8") as fp:
        content = fp.read()

    # remove header & empty lines (guard against an empty file)
    lines = [line.strip() for line in content.split("\n") if line.strip()]
    lines = lines[1:] if lines and lines[0].upper() == "WEBVTT" else lines

    # remove numeric cue indexes
    lines = [line for line in lines if not line.isdigit()]

    # remove cue identifiers of the form <GUID>/<number>-<number>
    pattern = r'[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}\/\d+-\d'
    lines = [line for line in lines if not re.match(pattern, line)]

    # remove timestamp lines (dots escaped so they only match a literal ".")
    pattern = r"^\d{2}:\d{2}:\d{2}\.\d{3}.*\d{2}:\d{2}:\d{2}\.\d{3}$"
    lines = [line for line in lines if not re.match(pattern, line)]

    content = " ".join(lines)

    # remove duplicate spaces
    pattern = r"\s+"
    content = re.sub(pattern, r" ", content)

    # add space after punctuation marks if it doesn't exist
    pattern = r"([\.!?])(\w)"
    content = re.sub(pattern, r"\1 \2", content)

    return content


def vtt_to_clean_file(file_in: str, file_out=None, **kwargs) -> str:
    """Save clean content of a subtitle file to text file

    Args:
        file_in (str): path to vtt file
        file_out (None, optional): path to text file
        **kwargs (optional): arguments for other parameters
            - no_message (bool): do not show message of result.
                                 Default is False

    Returns:
        str: path to text file
    """
    # set default values
    no_message = kwargs.get("no_message", False)
    if not file_out:
        filename = splitext(file_in)[0]
        file_out = "%s.txt" % filename
        i = 0
        while exists(file_out):
            i += 1
            file_out = "%s_%s.txt" % (filename, i)

    content = clean_webvtt(file_in)
    with open(file_out, "w+", encoding="utf-8") as fp:
        fp.write(content)
    if not no_message:
        print("clean content is written to file: %s" % file_out)

    return file_out

In [5]:
filepath = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.vtt"

vtt_to_clean_file(filepath)

clean content is written to file: /content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting_5.txt


'/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting_5.txt'
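
To see what the cleaning step removes without touching Google Drive, here is a minimal sketch that applies the same filters (header, cue index, cue identifier, timestamp) to the sample cue shown earlier. The helper name `strip_vtt_metadata` and the inline `sample` string are illustrative assumptions and are not part of the functions above.

```python
import re

# Hypothetical helper: the same filtering idea as clean_webvtt, applied to a string.
CUE_ID = re.compile(r"^[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}/\d+-\d+$")
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")


def strip_vtt_metadata(vtt_text: str) -> str:
    """Keep only the spoken text from a WEBVTT transcript."""
    kept = []
    for line in vtt_text.splitlines():
        line = line.strip()
        if not line or line.upper() == "WEBVTT":
            continue  # blank line or file header
        if line.isdigit() or CUE_ID.match(line) or TIMESTAMP.match(line):
            continue  # cue index, cue identifier, or timestamp
        kept.append(line)
    return " ".join(kept)


sample = """WEBVTT

03951482-18bc-403b-9a4f-9d2699587f03/65-1
00:00:08.885 --> 00:00:13.589
transcription making sure that
the transcription does work. Yep
"""

print(strip_vtt_metadata(sample))
# transcription making sure that the transcription does work. Yep
```

Only the two spoken lines survive, which is what keeps the token count (and therefore the API cost) down.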

###2.4. Count the Number of Tokens

OpenAI's GPT-3 models can handle only a limited number of tokens per request (4,097 for text-davinci-003, the model used below), and that limit covers both the request (i.e., the prompt) and the response. We will now count how many tokens this meeting transcript contains.

In [6]:
def count_tokens(filename):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()

    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
    num_tokens = input_ids.shape[1]
    return num_tokens

In [7]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"
token_count = count_tokens(filename)
print(f"Number of tokens: {token_count}")

Token indices sequence length is longer than the specified maximum sequence length for this model (17537 > 1024). Running this sequence through the model will result in indexing errors


Number of tokens: 17537
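
Because the limit covers the prompt and the response together, the check that matters is whether the prompt tokens plus the tokens reserved for the answer fit in one request. The sketch below illustrates that arithmetic; the function name `fits_in_context` is hypothetical, and the default of 4,097 is an assumption based on the context length the API reports for text-davinci-003, so pass a different `context_limit` for other models.

```python
CONTEXT_LIMIT = 4097  # assumed context length for text-davinci-003


def fits_in_context(prompt_tokens: int, max_tokens: int,
                    context_limit: int = CONTEXT_LIMIT) -> bool:
    """Return True if the prompt plus the reserved completion fit in one request."""
    return prompt_tokens + max_tokens <= context_limit


# The whole transcript in a single request is far too large.
print(fits_in_context(prompt_tokens=17537, max_tokens=1000))  # False

# A 2,000-token chunk with 1,000 tokens reserved for the answer fits.
print(fits_in_context(prompt_tokens=2000, max_tokens=1000))   # True
```

This is why the transcript has to be split before it is sent to the API.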


###2.5. Break up the Meeting Transcript into chunks of 2000 tokens with an overlap of 100 tokens

We break the meeting transcript into chunks of 2,000 tokens, with each chunk overlapping the previous one by 100 tokens, so that information at a chunk boundary is not lost.

In [8]:
def break_up_file_to_chunks(filename, chunk_size=2000, overlap=100):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()

    tokens = tokenizer.encode(text)
    num_tokens = len(tokens)
    
    chunks = []
    for i in range(0, num_tokens, chunk_size - overlap):
        chunk = tokens[i:i + chunk_size]
        chunks.append(chunk)
    
    return chunks

In [9]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"

chunks = break_up_file_to_chunks(filename)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} tokens")

Token indices sequence length is longer than the specified maximum sequence length for this model (17537 > 1024). Running this sequence through the model will result in indexing errors


Chunk 0: 2000 tokens
Chunk 1: 2000 tokens
Chunk 2: 2000 tokens
Chunk 3: 2000 tokens
Chunk 4: 2000 tokens
Chunk 5: 2000 tokens
Chunk 6: 2000 tokens
Chunk 7: 2000 tokens
Chunk 8: 2000 tokens
Chunk 9: 437 tokens
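
The chunk counts above follow directly from the window arithmetic: each chunk starts `chunk_size - overlap` = 1,900 tokens after the previous one. Here is a small sketch of the same sliding window on plain integers, so it runs without a tokenizer. The name `chunk_with_overlap` and the stand-in token list are assumptions for illustration; the argument check is an addition that the function above does not have.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[int], chunk_size: int = 2000,
                       overlap: int = 100) -> List[Sequence[int]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


# Stand-in for real token ids: 17,537 integers, the transcript's token count.
fake_tokens = list(range(17537))
demo_chunks = chunk_with_overlap(fake_tokens)

print(len(demo_chunks))      # 10
print(len(demo_chunks[-1]))  # 437
```

The last chunk starts at token 17,100 and therefore holds the remaining 437 tokens, matching the output above.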


###2.6. Validate that the Meeting Transcript was broken up into chunks of 2000 tokens with an overlap of 100 tokens

In [10]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [11]:
print(tokenizer.decode(chunks[0][-100:]))

 So you you can look across all of these open textbooks at once, too. So we will work with you on getting any materials that you've created into an instance of manifold. If they're just ancillary materials that you can download, we'll put them up for download. If you're creating an entirely new open textbook, then we'll help you in getting this into manifold in a really cool way. We even have some training resources in here, including a champions welcome training and kickoff training


In [12]:
print(tokenizer.decode(chunks[1][:100]))

 So you you can look across all of these open textbooks at once, too. So we will work with you on getting any materials that you've created into an instance of manifold. If they're just ancillary materials that you can download, we'll put them up for download. If you're creating an entirely new open textbook, then we'll help you in getting this into manifold in a really cool way. We even have some training resources in here, including a champions welcome training and kickoff training


In [13]:
if tokenizer.decode(chunks[0][-100:]) == tokenizer.decode(chunks[1][:100]):
    print('Overlap is Good')
else:
    print('Overlap is Not Good')

Overlap is Good
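
The check above compares only the first two chunks. A sketch that extends it to every adjacent pair might look like the following; `overlaps_are_consistent` is a hypothetical helper, and it compares token ids rather than decoded text. It is demonstrated on stand-in integer tokens so it runs on its own.

```python
def overlaps_are_consistent(chunks, chunk_size=2000, overlap=100):
    """Check that every chunk begins with the tail of the chunk before it."""
    step = chunk_size - overlap
    for prev, curr in zip(chunks, chunks[1:]):
        tail = list(prev[step:])  # tokens of prev that curr should repeat
        if tail != list(curr[:len(tail)]):
            return False
    return True


# Stand-in data: 5,000 integer "tokens" split with the same window arithmetic.
tokens = list(range(5000))
good = [tokens[i:i + 2000] for i in range(0, len(tokens), 1900)]
print(overlaps_are_consistent(good))  # True

# Corrupt the first token of the second chunk to show a failing case.
bad = [good[0], [-1] + good[1][1:], good[2]]
print(overlaps_are_consistent(bad))   # False
```

In the notebook, calling `overlaps_are_consistent(chunks)` on the real chunks would perform the same check across all ten of them.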


###2.7. Set OpenAI API Key

Please note that OpenAI's API service is not free, unlike the ChatGPT demo. You will need to sign up for the service to get an API key, which requires payment information.

Set an environment variable called “OPENAI_API_KEY” and assign it your secret API key from OpenAI (https://beta.openai.com/account/api-keys).

In [14]:
os.environ["OPENAI_API_KEY"] = 'openai api key'

In [15]:
openai.api_key = os.getenv("OPENAI_API_KEY")
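
`os.getenv` silently returns `None` when the variable is missing, and the failure then only surfaces on the first API call. One possible guard is sketched below; `load_api_key` is a hypothetical helper and not part of the openai package.

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before calling the API")
    return key


# Demonstrated with a plain dict so no real key is needed.
print(load_api_key({"OPENAI_API_KEY": "sk-example"}))  # sk-example
```

In the notebook this could be used as `openai.api_key = load_api_key()`. To avoid saving the key in the notebook itself, one option is to fill the environment variable from `getpass.getpass("OpenAI API key: ")` instead of a string literal.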

###2.8. Summarize the Meeting Transcript

####2.8.1. Summarize the Meeting Transcript one chunk at a time.

In [38]:
async def summarize_meeting(prompt, timeout, max_tokens):
    
    #timeout = 30
    temperature = 0.5
    #max_tokens = 1000
    top_p = 1
    frequency_penalty = 0
    presence_penalty = 0
    
    # Call the OpenAI GPT-3 API
    response = await openai_async.complete(
        openai.api_key,
        timeout=timeout,
        payload={
            "model": "text-davinci-003",
            "prompt": prompt,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "top_p": top_p,
            "frequency_penalty": frequency_penalty,
            "presence_penalty": presence_penalty
        },
    )

    # Return the full response object; callers extract the generated text with response.json()
    return response

In [22]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"

prompt_response = []
prompt_tokens = []

chunks = break_up_file_to_chunks(filename)

for chunk in chunks:
    prompt_request = "Summarize this meeting transcript: " + tokenizer.decode(chunk)
    
    # Each call blocks until its response arrives, so chunks are processed one at a time
    loop = asyncio.get_event_loop()
    response = loop.run_until_complete(summarize_meeting(prompt=prompt_request, timeout=30, max_tokens=1000))
    
    result = response.json()  # parse the JSON body once
    prompt_response.append(result["choices"][0]["text"].strip())
    prompt_tokens.append(result["usage"]["total_tokens"])

Token indices sequence length is longer than the specified maximum sequence length for this model (17537 > 1024). Running this sequence through the model will result in indexing errors
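
Although `summarize_meeting` is a coroutine, the loop above waits for each response before sending the next request. If the requests should overlap, `asyncio.gather` is one way to do that. The sketch below uses a stand-in coroutine, `fake_summarize`, so that it runs without an API key; both it and `summarize_all` are hypothetical names, and the concurrency cap of 3 is an arbitrary assumption meant to stay clear of rate limits.

```python
import asyncio


async def fake_summarize(prompt: str) -> str:
    """Stand-in for summarize_meeting: waits briefly, then returns a label."""
    await asyncio.sleep(0.1)  # simulates network latency
    return f"summary of: {prompt}"


async def summarize_all(prompts, max_concurrent=3):
    """Run one request per prompt, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(prompt):
        async with semaphore:
            return await fake_summarize(prompt)

    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*(limited(p) for p in prompts))


prompts = [f"chunk {i}" for i in range(5)]
results = asyncio.run(summarize_all(prompts))
print(results[0])  # summary of: chunk 0
```

To try this against the real API, `fake_summarize` would be replaced with a call to `summarize_meeting`. Because `gather` preserves input order, the summaries still line up with the chunks.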


In [23]:
prompt_response

[". So I think that's it for the ALG website.\n\nThis meeting was a kickoff for grant recipients awarded around 22 grants. The Affordable Learning Georgia website was shared as a hub for grantees to access resources including open educational resources, grant procedures, reporting deadlines, and grantee listserv. The two main repositories for OER created from the grants were discussed, and the features of the new Manifold platform were shared. Training resources were also highlighted.",
 'can turn on their cameras.\n\nThis meeting was about introducing the new ALG website, which will have a mega menu with sections on grants, resources, and news, as well as a data page. The website will also link to the Galileo Open Learning Materials repository, which contains old OER, and the Manifold repository, which is more interactive. The ALG Tracking Spreadsheet was also discussed, which is used to track data on grants such as the total award, the project lead, the course name and number, and th

In [24]:
prompt_tokens

[2101, 2147, 2189, 2268, 2388, 2202, 2112, 2166, 2238, 508]

In [25]:
total = sum(prompt_tokens)

print("Sum of all elements in given list: ", total)

Sum of all elements in given list:  20319
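
Since the API bills per token, the total can be turned into a rough cost estimate. The sketch below assumes a rate of $0.02 per 1,000 tokens for text-davinci-003; treat that figure as an assumption and check OpenAI's current pricing. `estimate_cost` is a hypothetical helper, and the `usage` list repeats the per-chunk totals shown above.

```python
def estimate_cost(token_counts, usd_per_1k_tokens=0.02):
    """Estimate the API charge for a list of per-request token totals."""
    return sum(token_counts) / 1000 * usd_per_1k_tokens


# Per-chunk totals reported by the API for the summarization pass.
usage = [2101, 2147, 2189, 2268, 2388, 2202, 2112, 2166, 2238, 508]
print(f"{sum(usage)} tokens, about ${estimate_cost(usage):.2f}")
# 20319 tokens, about $0.41
```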


####2.8.2. Consolidate the Meeting Transcript Summaries.

In [27]:
prompt_request = "Consolidate these meeting summaries: " + str(prompt_response)

loop = asyncio.get_event_loop()
response = loop.run_until_complete(summarize_meeting(prompt = prompt_request, timeout=45, max_tokens = 1000))

In [28]:
print(response.json()["choices"][0]["text"].strip())

This meeting discussed the Affordable Learning Georgia (ALG) website, which is a hub for grantees to access resources including open educational resources, grant procedures, reporting deadlines, and grantee listserv. The two main repositories for OER created from the grants were discussed, and the features of the new Manifold platform were shared. Training resources were also highlighted. Additionally, ALG project teams from the University of West Georgia, Clayton State, University of North Georgia, and Georgia Gwinnett College discussed their projects which involve transforming a textbook to an open stacks textbook and creating activities and simulations to engage students. These projects are for Spanish, microbiology, introductory physics, and anthropology courses respectively, and are all expected to be completed by the fall of 2021. The team from Kennesaw State University is focused on creating a math primer for a senior level fluid mechanics course in order to help students get up

In [30]:
print(response.json()["usage"]["total_tokens"])

2262
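
Passing `str(prompt_response)` sends the Python representation of the list, including brackets, quotes, and escaped `\n` sequences, all of which count as tokens. A possible alternative is to join the summaries into a numbered list first. This is a sketch only: `build_consolidation_prompt` is a hypothetical helper, and whether it changes the quality of the consolidated summary has not been tested here.

```python
def build_consolidation_prompt(summaries,
                               instruction="Consolidate these meeting summaries:"):
    """Join per-chunk summaries into one prompt as a numbered list."""
    numbered = [f"{i}. {text.strip()}" for i, text in enumerate(summaries, start=1)]
    return instruction + "\n\n" + "\n\n".join(numbered)


demo = ["First part covered the website.", "Second part covered deadlines.\n"]
print(build_consolidation_prompt(demo))
```

In the notebook, `build_consolidation_prompt(prompt_response)` would take the place of the string concatenation above.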


###2.9. Get Action Items from Meeting Transcript

####2.9.1. Get Action Items from the Meeting Transcript one chunk at a time.

In [31]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"

action_response = []
action_tokens = []

chunks = break_up_file_to_chunks(filename)

for chunk in chunks:
    prompt_request = "Provide a list of action items with a due date from the provided meeting transcript text: " + tokenizer.decode(chunk)
    
    # Each call blocks until its response arrives, so chunks are processed one at a time
    loop = asyncio.get_event_loop()
    response = loop.run_until_complete(summarize_meeting(prompt=prompt_request, timeout=30, max_tokens=1000))
    
    result = response.json()  # parse the JSON body once
    action_response.append(result["choices"][0]["text"].strip())
    action_tokens.append(result["usage"]["total_tokens"])

Token indices sequence length is longer than the specified maximum sequence length for this model (17537 > 1024). Running this sequence through the model will result in indexing errors
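
Indexing straight into `response.json()["choices"]` raises a `KeyError` if a request fails, for example on a rate limit or an invalid key. A more defensive way to read the payload is sketched below. It assumes that failed requests return a JSON body with a top-level `"error"` object containing a `"message"` field; `extract_completion` is a hypothetical helper.

```python
def extract_completion(payload: dict) -> str:
    """Return the first completion's text, or raise with the API's error message."""
    if "error" in payload:
        message = payload["error"].get("message", "unknown error")
        raise RuntimeError(f"OpenAI API error: {message}")
    choices = payload.get("choices") or []
    if not choices:
        raise RuntimeError("Response contained no choices")
    return choices[0]["text"].strip()


# A payload shaped like the successful responses used in this notebook.
ok = {"choices": [{"text": "\n\nAction Items:\n1. Submit a report"}],
      "usage": {"total_tokens": 42}}
print(extract_completion(ok))
```

Inside the loop, `extract_completion(response.json())` could stand in for the direct indexing, so that a failed chunk produces a readable error.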


In [32]:
print(action_response)

[". So if you've just joined us, you can go in here and see what the kickoff training is all about. Action Items:\n\n1. Introduce yourself (due immediately)\n2. Bookmark the ALG grants page (due immediately)\n3. Submit a report using the templates and forms (due December 19th)\n4. Join the ALG grants listserv (due immediately)\n5. Convert any materials into an instance of Manifold (due date TBD)", ', please turn your cameras on so we can see each other. \n\nAction Items: \n1. Work with grant recipients to get any materials created into an instance of Manifold by February 15th \n2. Provide training resources for grant recipients, including Champions Welcome Training and Kickoff Training by February 15th \n3. Create entries in Manifold and Galileo Open Learning Materials to make resources maximally discoverable by February 15th \n4. Create a new ALG website by February 15th \n5. Create a new Data Page by February 15th \n6. Create a new ALG Events page by February 15th \n7. Create a new w

In [33]:
action_tokens

[2119, 2223, 2289, 2266, 2486, 2117, 2223, 2216, 2213, 534]

In [34]:
total = sum(action_tokens)

print("Sum of all elements in given list: ", total)

Sum of all elements in given list:  20686


####2.9.2. Consolidate the Meeting Transcript Action Items.

In [39]:
prompt_request = "Consolidate these meeting action items, but exclude action items with Due Date of Immediately: " + str(action_response)

loop = asyncio.get_event_loop()
response = loop.run_until_complete(summarize_meeting(prompt = prompt_request, timeout=45, max_tokens = 1000))

In [40]:
print(response.json()["choices"][0]["text"].strip())

Consolidated Action Items: 
1. Introduce yourself (due immediately)
2. Bookmark the ALG grants page (due immediately)
3. Submit a report using the templates and forms (due December 19th)
4. Join the ALG grants listserv (due immediately)
5. Work with grant recipients to get any materials created into an instance of Manifold by February 15th 
6. Provide training resources for grant recipients, including Champions Welcome Training and Kickoff Training by February 15th 
7. Create entries in Manifold and Galileo Open Learning Materials to make resources maximally discoverable by February 15th 
8. Create a new ALG website by February 15th 
9. Create a new Data Page by February 15th 
10. Create a new ALG Events page by February 15th 
11. Create a new way to announce news and events by February 15th 
12. Create a link to Manifold from Galileo Open Learning Materials by February 15th 
13. Create a new ALG Tracking spreadsheet by February 15th 
14. Contact grant recipients yearly for the Sustain

In [41]:
print(response.json()["usage"]["total_tokens"])

3137
