<a href="https://colab.research.google.com/github/sungkim11/ai-playground/blob/main/Create_Meeting_Minutes_Using_AI_Workbook_Tranformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Create Meeting Minutes Using OpenAI's GPT-3 from a Microsoft Teams or Zoom Meeting Transcript (Transformers)

This is my attempt to replicate the upcoming Microsoft Teams Premium feature that creates meeting notes using AI.

##1. Prerequisites

The following are prerequisites for this tutorial:

- Python Package: openai
- Python Package: torch and transformers

###1.1. Python Packages

####1.1.1. Install Python Packages

In [None]:
%%writefile requirements.txt
openai
torch==1.13.1+cu116
transformers==4.26.1

Overwriting requirements.txt


In [None]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


##2. Code

Colab code to prettify text output by wrapping long lines.

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

###2.1. Import Python Packages

In [None]:
import platform
import os

import openai

import re
from os.path import splitext, exists

import torch
import transformers
from transformers import AutoTokenizer

print('Python: ', platform.python_version())
print('re: ', re.__version__)
print('torch: ', torch.__version__)
print('transformers: ', transformers.__version__)

Python:  3.8.10
re:  2.2.1
torch:  1.13.1+cu116
transformers:  4.26.1


###2.2. Mount Storage - Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


###2.3. Clean the Meeting Transcript from either Microsoft Teams or Zoom, encoded as a WEBVTT file.

The meeting transcript is encoded as follows:

    WEBVTT
    
    03951482-18bc-403b-9a4f-9d2699587f03/65-1
    00:00:08.885 --> 00:00:13.589
    transcription making sure that
    the transcription does work. Yep

The extra lines are not usually a problem with ChatGPT, but the OpenAI GPT-3 API charges per token, so we want to minimize the number of tokens we send to it. We need to remove every line that is not part of the spoken transcript (the header, cue identifiers, and timestamps).

These two functions clean up a .vtt file and write the clean text to a file with the same name and a .txt extension.

In [None]:
def clean_webvtt(filepath: str) -> str:
    """Clean up the content of a subtitle file (vtt) to a string

    Args:
        filepath (str): path to vtt file

    Returns:
        str: clean content
    """
    # read file content
    with open(filepath, "r", encoding="utf-8") as fp:
        content = fp.read()

    # remove header & empty lines
    lines = [line.strip() for line in content.split("\n") if line.strip()]
    lines = lines[1:] if lines and lines[0].upper() == "WEBVTT" else lines

    # remove numeric cue indexes
    lines = [line for line in lines if not line.isdigit()]

    # remove cue identifiers (e.g. "03951482-18bc-403b-9a4f-9d2699587f03/65-1")
    pattern = r'[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}\/\d+-\d'
    lines = [line for line in lines if not re.match(pattern, line)]

    # remove timestamps (e.g. "00:00:08.885 --> 00:00:13.589")
    pattern = r"^\d{2}:\d{2}:\d{2}\.\d{3}.*\d{2}:\d{2}:\d{2}\.\d{3}$"
    lines = [line for line in lines if not re.match(pattern, line)]

    content = " ".join(lines)

    # remove duplicate spaces
    pattern = r"\s+"
    content = re.sub(pattern, r" ", content)

    # add space after punctuation marks if it doesn't exist
    pattern = r"([\.!?])(\w)"
    content = re.sub(pattern, r"\1 \2", content)

    return content


def vtt_to_clean_file(file_in: str, file_out=None, **kwargs) -> str:
    """Save clean content of a subtitle file to text file

    Args:
        file_in (str): path to vtt file
        file_out (None, optional): path to text file
        **kwargs (optional): arguments for other parameters
            - no_message (bool): do not show message of result.
                                 Default is False

    Returns:
        str: path to text file
    """
    # set default values
    no_message = kwargs.get("no_message", False)
    if not file_out:
        filename = splitext(file_in)[0]
        file_out = "%s.txt" % filename
        i = 0
        while exists(file_out):
            i += 1
            file_out = "%s_%s.txt" % (filename, i)

    content = clean_webvtt(file_in)
    with open(file_out, "w+", encoding="utf-8") as fp:
        fp.write(content)
    if not no_message:
        print("clean content is written to file: %s" % file_out)

    return file_out
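
To see what the cleaning does without touching a file, the same filtering steps can be tried on the sample cue shown earlier. This is only a minimal sketch: `clean_webvtt_text` is a hypothetical in-memory helper, not one of the two functions above, and it assumes Teams-style cue identifiers.

```python
import re

# Hypothetical in-memory variant of the cleaning steps, for illustration only.
CUE_ID = re.compile(r"[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}/\d+-\d")
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")


def clean_webvtt_text(content: str) -> str:
    """Keep only the spoken text from WEBVTT content held in a string."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    kept = [
        line for line in lines
        if line.upper() != "WEBVTT"        # header line
        and not line.isdigit()             # numeric cue indexes
        and not CUE_ID.match(line)         # GUID-style cue identifiers
        and not TIMESTAMP.match(line)      # "start --> end" timing lines
    ]
    # Join the remaining lines and collapse repeated whitespace.
    return re.sub(r"\s+", " ", " ".join(kept))


sample = """WEBVTT

03951482-18bc-403b-9a4f-9d2699587f03/65-1
00:00:08.885 --> 00:00:13.589
transcription making sure that
the transcription does work. Yep
"""

print(clean_webvtt_text(sample))
```

Only the two spoken lines survive, joined into one sentence, which is the text that ends up being tokenized and billed.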

In [None]:
filepath = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.vtt"

vtt_to_clean_file(filepath)

clean content is written to file: /content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting_2.txt


'/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting_2.txt'

###2.4. Count the Number of Tokens

OpenAI's GPT-3 models can handle only about 4,000 tokens per request (4,097 for text-davinci-003, the model used below), and that limit covers both the request (i.e., the prompt) and the response. We will count how many tokens are in this meeting transcript.
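
Because the prompt and the response share one limit, the room left for the prompt is the limit minus the tokens reserved for the response. The sketch below illustrates that arithmetic. The helper names are hypothetical, and the limit is passed in as a parameter because it depends on the model; the value of 4,000 is an assumed round number, not an exact figure.

```python
def max_prompt_tokens(context_limit: int, max_completion_tokens: int) -> int:
    """Tokens left for the prompt once room for the response is reserved."""
    if max_completion_tokens >= context_limit:
        raise ValueError("max_completion_tokens must be smaller than context_limit")
    return context_limit - max_completion_tokens


def fits_in_one_request(num_prompt_tokens: int, context_limit: int,
                        max_completion_tokens: int) -> bool:
    """True if the prompt plus the reserved response stay within the limit."""
    return num_prompt_tokens <= max_prompt_tokens(context_limit, max_completion_tokens)


# ASSUMPTION: a round limit of 4,000 tokens with 500 reserved for the response.
CONTEXT_LIMIT = 4000
print(max_prompt_tokens(CONTEXT_LIMIT, 500))
print(fits_in_one_request(17537, CONTEXT_LIMIT, 500))  # a long transcript
print(fits_in_one_request(2000, CONTEXT_LIMIT, 500))   # a single 2,000-token chunk
```

A transcript that does not fit has to be split into smaller pieces before it is sent.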

In [None]:
def count_tokens(filename):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    with open(filename, 'r') as f:
        text = f.read()

    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
    num_tokens = input_ids.shape[1]
    return num_tokens

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"
token_count = count_tokens(filename)
print(f"Number of tokens: {token_count}")

Token indices sequence length is longer than the specified maximum sequence length for this model (17537 > 1024). Running this sequence through the model will result in indexing errors


Number of tokens: 17537


###2.5. Break up the Meeting Transcript into chunks of 2000 tokens with an overlap of 100 tokens

We will break up the meeting transcript into chunks of 2,000 tokens, with each chunk overlapping the previous one by 100 tokens, so that no information is lost at the chunk boundaries.

In [None]:
def break_up_file_to_chunks(filename, chunk_size=2000, overlap=100):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    with open(filename, 'r') as f:
        text = f.read()

    tokens = tokenizer.encode(text)
    num_tokens = len(tokens)
    
    chunks = []
    for i in range(0, num_tokens, chunk_size - overlap):
        chunk = tokens[i:i + chunk_size]
        chunks.append(chunk)
    
    return chunks

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"

chunks = break_up_file_to_chunks(filename)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} tokens")

Token indices sequence length is longer than the specified maximum sequence length for this model (17537 > 1024). Running this sequence through the model will result in indexing errors


Chunk 0: 2000 tokens
Chunk 1: 2000 tokens
Chunk 2: 2000 tokens
Chunk 3: 2000 tokens
Chunk 4: 2000 tokens
Chunk 5: 2000 tokens
Chunk 6: 2000 tokens
Chunk 7: 2000 tokens
Chunk 8: 2000 tokens
Chunk 9: 437 tokens
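
The chunk sizes above follow directly from the step size: each chunk starts `chunk_size - overlap` (1,900) tokens after the previous one. The sketch below reproduces the boundaries from the token count alone, without a tokenizer; `chunk_bounds` is a hypothetical helper written for this illustration.

```python
def chunk_bounds(num_tokens: int, chunk_size: int = 2000, overlap: int = 100):
    """Return (start, end) token offsets for each chunk, with end exclusive."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, num_tokens))
            for start in range(0, num_tokens, step)]


bounds = chunk_bounds(17537)  # token count reported for this transcript
for i, (start, end) in enumerate(bounds):
    print(f"Chunk {i}: tokens {start}-{end - 1} ({end - start} tokens)")
```

With 17,537 tokens this gives nine full chunks and a final chunk of 437 tokens that starts at offset 17,100.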


###2.6. Validate that the Meeting Transcript was broken up into chunks of 2000 tokens with an overlap of 100 tokens

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
print(tokenizer.decode(chunks[0][-100:]))

 So you you can look across all of these open textbooks at once, too. So we will work with you on getting any materials that you've created into an instance of manifold. If they're just ancillary materials that you can download, we'll put them up for download. If you're creating an entirely new open textbook, then we'll help you in getting this into manifold in a really cool way. We even have some training resources in here, including a champions welcome training and kickoff training


In [None]:
print(tokenizer.decode(chunks[1][:100]))

 So you you can look across all of these open textbooks at once, too. So we will work with you on getting any materials that you've created into an instance of manifold. If they're just ancillary materials that you can download, we'll put them up for download. If you're creating an entirely new open textbook, then we'll help you in getting this into manifold in a really cool way. We even have some training resources in here, including a champions welcome training and kickoff training


In [None]:
if tokenizer.decode(chunks[0][-100:]) == tokenizer.decode(chunks[1][:100]):
    print('Overlap is Good')
else:
    print('Overlap is Not Good')

Overlap is Good
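
The check above compares only the first two chunks. A sketch of the same idea applied to every pair of neighbouring chunks follows. It compares token IDs directly rather than decoded text, and it uses plain integers as stand-in tokens so it runs without a tokenizer; `overlaps_are_consistent` is a hypothetical helper.

```python
def overlaps_are_consistent(chunks, chunk_size=2000, overlap=100):
    """Check that each chunk begins with the tokens the previous chunk
    holds from offset (chunk_size - overlap) onward."""
    step = chunk_size - overlap
    return all(previous[step:step + overlap] == current[:overlap]
               for previous, current in zip(chunks, chunks[1:]))


# Stand-in token IDs (plain integers) instead of a real tokenized transcript.
tokens = list(range(5000))
demo_chunks = [tokens[i:i + 2000] for i in range(0, len(tokens), 1900)]
print(overlaps_are_consistent(demo_chunks))
```

In the notebook, the same call could be made with the `chunks` list returned by `break_up_file_to_chunks`.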


###2.7. Set OpenAI API Key

Please note that OpenAI's API service is not free, unlike the ChatGPT demo. You will need to sign up with OpenAI to get an API key, which requires payment information.

Set an environment variable called “OPENAI_API_KEY” and assign it your secret API key from OpenAI (https://beta.openai.com/account/api-keys).

In [None]:
os.environ["OPENAI_API_KEY"] = ''

In [None]:
openai.api_key = os.getenv("OPENAI_API_KEY")
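
An empty or missing key only shows up later as an authentication error on the first API call. A small guard can fail earlier with a clearer message. This is a sketch, and `load_api_key` is a hypothetical helper rather than part of the `openai` package.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty")
    return key


# Possible usage in this notebook (left commented out so the cell has no side effects):
# openai.api_key = load_api_key()
```

Reading the key from the environment also keeps it out of the notebook when the file is shared.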

###2.8. Summarize the Meeting Transcript

####2.8.1. Summarize the Meeting Transcript one chunk at a time.

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"

prompt_response = []
prompt_tokens = []

chunks = break_up_file_to_chunks(filename)
for i, chunk in enumerate(chunks):
    prompt_request = "Summarize this meeting transcript: " + tokenizer.decode(chunks[i])

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=.5,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
    )
    
    prompt_response.append(response["choices"][0]["text"])
    prompt_tokens.append(response["usage"]["total_tokens"])

Token indices sequence length is longer than the specified maximum sequence length for this model (17537 > 1024). Running this sequence through the model will result in indexing errors


In [None]:
prompt_response

[". So if you are a new champion and you're trying to figure out what this is all about, you can go into the champions welcome training and get a more detailed look at what this is all about.\n\nThis meeting was a kickoff for the recipients of 22 grants. The ALG website was presented, which is the hub for everything related to the grant procedure, including deadlines, reporting guidelines, and online training. ALG also has two homes for OER, Galileo open learning materials and Manifold, which is an interactive platform for open materials. The meeting ended with a Champions Welcome Training and Kickoff Training.",
 ' can see me.\n\nThis meeting discussed the new ALG website, which will include a mega menu, a grants page, an information center page, a news and events page, and a data page. The ALG website will also link to the repository Galileo Open Learning Materials and Manifold, which makes materials interactive. They also discussed a tracking spreadsheet for all grants, which includ

In [None]:
prompt_tokens

[2133, 2163, 2154, 2245, 2237, 2191, 2176, 2192, 2213, 516]

In [None]:
total = sum(prompt_tokens)

print("Sum of all elements in given list: ", total)

Sum of all elements in given list:  20220
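
Since the API charges per token, the total can be turned into a rough cost estimate. The sketch below takes the price as a parameter. The rate of 0.02 dollars per 1,000 tokens is an assumed placeholder, so check OpenAI's pricing page for the current rate of the model you use. `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(token_counts, price_per_1k_tokens):
    """Estimate the charge in dollars for a list of per-request token totals."""
    return sum(token_counts) / 1000 * price_per_1k_tokens


# Per-chunk totals copied from the output above.
example_token_totals = [2133, 2163, 2154, 2245, 2237, 2191, 2176, 2192, 2213, 516]

# ASSUMPTION: 0.02 dollars per 1,000 tokens is a placeholder rate, not a quoted price.
print(f"Estimated cost: ${estimate_cost(example_token_totals, 0.02):.2f}")
```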


####2.8.2. Consolidate the Meeting Transcript Summaries.

In [None]:
prompt_request = "Consolidate these meeting summaries: " + str(prompt_response)

response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_request,
        temperature=.5,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
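
Note that `str(prompt_response)` sends the Python list representation, including brackets, quotes, and escaped `\n` sequences, all of which count as tokens. One alternative is to join the summaries into plain text first. The sketch below shows that approach; it is not what the cell above does, and `build_consolidation_prompt` is a hypothetical helper.

```python
def build_consolidation_prompt(summaries,
                               instruction="Consolidate these meeting summaries:"):
    """Join chunk summaries into one plain-text prompt, one numbered block per chunk."""
    cleaned = [s.strip() for s in summaries if s.strip()]  # drop empty responses
    blocks = [f"Summary {i}: {text}" for i, text in enumerate(cleaned, start=1)]
    return instruction + "\n\n" + "\n\n".join(blocks)


# Made-up summaries, used only to show the resulting layout.
example = [" First part of the meeting.\n", "\n\nSecond part of the meeting. "]
print(build_consolidation_prompt(example))
```

Numbering the blocks also makes it easier to trace a consolidated statement back to the chunk it came from.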

In [None]:
meeting_summary = response["choices"][0]["text"]
print(meeting_summary)



This meeting was a kickoff for grants awarded to participants, discussing the Affordable Learning Georgia (ALG) website, and the two repositories, Galileo Open Learning Materials and Manifold, used to store and access open educational resources created through the grants. Participants were introduced to the grant procedures, deadlines, and templates, as well as the ALG grantees listserv. The importance of the Affordable Materials Grants page was emphasized, and the analytics of the repositories, such as most popular downloads and global usage, was discussed. The ALG website will have a mega menu, an About Us section, a grants overview, an Information Center page, and a page for creating resources. It will also have a news and events page, a data page, and a tracking spreadsheet for data. Manifold will be used for interactive materials, and there will be training resources and a student success workshop. The website is expected to be released by the end of January or early February. T

###2.9. Get Action Items from Meeting Transcript

####2.9.1. Get Action Items from the Meeting Transcript one chunk at a time.

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"

action_response = []
action_tokens = []

chunks = break_up_file_to_chunks(filename)
for i, chunk in enumerate(chunks):
    prompt_request = "Provide a list of action items with a due date from the provided meeting transcript text: " + tokenizer.decode(chunks[i])

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=.5,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
    )
    
    action_response.append(response["choices"][0]["text"])
    action_tokens.append(response["usage"]["total_tokens"])

Token indices sequence length is longer than the specified maximum sequence length for this model (17537 > 1024). Running this sequence through the model will result in indexing errors


ServiceUnavailableError: ignored
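
The `ServiceUnavailableError` above is a transient server-side failure, and it stops the loop partway through the chunks. One common way to cope is to retry the call after a short, growing delay. The sketch below is generic and does not import `openai`. `call_with_retries` is a hypothetical helper, and the demo uses `ConnectionError` as a stand-in for the API error. With the pre-1.0 `openai` package used here, the exception to retry on would presumably be `openai.error.ServiceUnavailableError`.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception wait and retry, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in request that fails twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


# sleep is replaced with a no-op so the demo does not actually wait.
print(call_with_retries(flaky_request, retry_on=(ConnectionError,), sleep=lambda s: None))
```

In the loop above, the `openai.Completion.create(...)` call could be wrapped in a `lambda` and passed as `fn`.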

In [None]:
print(action_response)

[". So if you can't make it to one of our kickoff meetings, you can always watch this one. Action Items:\n\n1. Introduce yourself to the ALG listserv - Due Date: Immediately \n2. Bookmark the ALG Grants page - Due Date: Immediately \n3. Submit a report using the ALG templates and forms - Due Date: December 19th \n4. Convert static files into Manifold texts - Due Date: As needed \n5. Watch the ALG Champions Welcome Training and Kickoff Training - Due Date: Immediately", ', can you please turn on your camera?\n\nAction Items:\n-Put any ancillary materials up for download (Due Date: ASAP)\n-Help create an entirely new open textbook and get it into Manifold (Due Date: ASAP)\n-Champions Welcome Training and Kickoff Training (Due Date: ASAP)\n-Create Student Success Workshop (Due Date: ASAP)\n-Work with Web Person and Marketing Coordinator to create new ALG website (Due Date: End of January/Beginning of February)\n-Create Accessibility Guides (Due Date: ASAP)\n-Create Open Licensing Resource

####2.9.2. Consolidate the Meeting Transcript Action Items.

In [None]:
prompt_request = "Consolidate these meeting action items, but exclude action items with Due Date of Immediately: " + str(action_response)

response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_request,
        temperature=.5,
        max_tokens=500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

# use a separate name so the per-chunk action_tokens list is not overwritten
consolidated_action_tokens = response["usage"]["total_tokens"]

KeyboardInterrupt: ignored

In [None]:
meeting_action_items = response["choices"][0]["text"]
print(meeting_action_items)



Consolidated Action Items:
-Put any ancillary materials up for download (Due Date: ASAP)
-Help create an entirely new open textbook and get it into Manifold (Due Date: ASAP)
-Champions Welcome Training and Kickoff Training (Due Date: ASAP)
-Create Student Success Workshop (Due Date: ASAP)
-Work with Web Person and Marketing Coordinator to create new ALG website (Due Date: End of January/Beginning of February)
-Create Accessibility Guides (Due Date: ASAP)
-Create Open Licensing Resources (Due Date: ASAP)
-Create a new way to announce news and events (Due Date: ASAP)
-Create a Data Page (Due Date: End of January/Beginning of February)
-Create ALG Tracking Spreadsheet (Due Date: ASAP)
-Contact Project Leads Yearly for Sustainability Survey (Due Date: Ongoing)
-Begin hiring process for Program Manager (January)
-Create demonstration videos for laboratory techniques (by February)
-Develop group work activities (by March)
-Redevelop laboratory component (by April)
-Create interactive activ

In [None]:
response["usage"]["prompt_tokens"]

2030

In [None]:
response["usage"]["completion_tokens"]

500

In [None]:
response["usage"]["total_tokens"]

2530
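
Here `completion_tokens` equals the `max_tokens` value of 500, which suggests the consolidated list was cut off at the limit; the output above does stop mid-word. The Completions response also reports a `finish_reason` for each choice: `"length"` when the limit was hit and `"stop"` otherwise. The sketch below runs the check on a hand-written dictionary shaped like the response, not on a real API result. `was_truncated` is a hypothetical helper, and the `finish_reason` value in the example is an assumption.

```python
def was_truncated(response) -> bool:
    """True if the first choice stopped because it hit the max_tokens limit."""
    return response["choices"][0].get("finish_reason") == "length"


# Hand-written stand-in for an API response. The usage numbers mirror the cells
# above; the finish_reason value is ASSUMED for illustration.
fake_response = {
    "choices": [{"text": "-Create interactive activ", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 2030, "completion_tokens": 500, "total_tokens": 2530},
}

if was_truncated(fake_response):
    print("Output hit max_tokens; consider raising it or sending fewer items per request.")
```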