<a href="https://colab.research.google.com/github/sungkim11/ai-playground/blob/main/Create_Meeting_Minutes_Using_AI_Workbook_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Create Meeting Minutes Using OpenAI's GPT-3 from a Microsoft Teams or Zoom Meeting Transcript (NLTK)

This is my attempt to replicate the upcoming Microsoft Teams Premium feature that creates meeting notes using AI.

##1. Prerequisites

The following are prerequisites for this tutorial:

- Python Package: openai
- Python Package: nltk (Natural Language Toolkit)

###1.1. Python Packages

####1.1.1. Install Python Packages

In [None]:
%%writefile requirements.txt
openai<1.0
nltk

Overwriting requirements.txt


In [None]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


##2. Code

The following Colab snippet wraps long lines in cell output so they stay readable.

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

###2.1. Import Python Packages

In [None]:
import platform
import os

import openai

import re
from os.path import splitext, exists

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

print('Python: ', platform.python_version())
print('re: ', re.__version__)
print('nltk: ', nltk.__version__)

Python:  3.8.10
re:  2.2.1
nltk:  3.7


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


###2.2. Mount Storage - Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


###2.3. Clean the Meeting Transcript from either Microsoft Teams or Zoom, encoded as a WebVTT file

The meeting transcript is encoded as follows:

    WEBVTT
    
    03951482-18bc-403b-9a4f-9d2699587f03/65-1
    00:00:08.885 --> 00:00:13.589
    transcription making sure that
    the transcription does work. Yep

This is usually not a problem with ChatGPT, but the OpenAI GPT-3 API charges per token, so we want to minimize the number of tokens we send to it. We therefore need to remove every line that is not part of the spoken transcript.

These two functions clean up a .vtt file and write the cleaned text to a file with the same name and a .txt extension (a numeric suffix is added if that file already exists).

In [None]:
def clean_webvtt(filepath: str) -> str:
    """Clean up the content of a subtitle file (vtt) to a string

    Args:
        filepath (str): path to vtt file

    Returns:
        str: clean content
    """
    # read file content
    with open(filepath, "r", encoding="utf-8") as fp:
        content = fp.read()

    # remove empty lines, then the WEBVTT header if present
    lines = [line.strip() for line in content.split("\n") if line.strip()]
    if lines and lines[0].upper() == "WEBVTT":
        lines = lines[1:]

    # remove numeric cue indexes
    lines = [line for line in lines if not line.isdigit()]

    # remove cue identifiers of the form <uuid>/<number>-<digit>
    pattern = r'[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}\/\d+-\d'
    lines = [line for line in lines if not re.match(pattern, line)]

    # remove timestamp lines such as "00:00:08.885 --> 00:00:13.589"
    pattern = r"^\d{2}:\d{2}:\d{2}\.\d{3}.*\d{2}:\d{2}:\d{2}\.\d{3}$"
    lines = [line for line in lines if not re.match(pattern, line)]

    content = " ".join(lines)

    # remove duplicate spaces
    pattern = r"\s+"
    content = re.sub(pattern, r" ", content)

    # add space after punctuation marks if it doesn't exist
    pattern = r"([\.!?])(\w)"
    content = re.sub(pattern, r"\1 \2", content)

    return content


def vtt_to_clean_file(file_in: str, file_out=None, **kwargs) -> str:
    """Save clean content of a subtitle file to text file

    Args:
        file_in (str): path to vtt file
        file_out (None, optional): path to text file
        **kwargs (optional): arguments for other parameters
            - no_message (bool): do not show message of result.
                                 Default is False

    Returns:
        str: path to text file
    """
    # set default values
    no_message = kwargs.get("no_message", False)
    if not file_out:
        filename = splitext(file_in)[0]
        file_out = "%s.txt" % filename
        i = 0
        while exists(file_out):
            i += 1
            file_out = "%s_%s.txt" % (filename, i)

    content = clean_webvtt(file_in)
    with open(file_out, "w+", encoding="utf-8") as fp:
        fp.write(content)
    if not no_message:
        print("clean content is written to file: %s" % file_out)

    return file_out

In [None]:
filepath = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.vtt"

vtt_to_clean_file(filepath)

clean content is written to file: /content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt


'/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt'
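
To make it easier to see which lines the cleaning step drops, here is a minimal in-memory sketch applied to the sample cue shown earlier. The helper name `keep_spoken_lines` and the two compiled patterns are illustrative assumptions, not part of the notebook's functions; unlike `clean_webvtt`, this sketch skips a `WEBVTT` line wherever it appears and does not normalize spacing after punctuation.

```python
import re

# Sample taken from the transcript excerpt shown earlier in this section.
SAMPLE_VTT = """WEBVTT

03951482-18bc-403b-9a4f-9d2699587f03/65-1
00:00:08.885 --> 00:00:13.589
transcription making sure that
the transcription does work. Yep
"""

# Cue identifier: a UUID followed by "/<number>-<digit>" (assumed from the sample).
CUE_ID = re.compile(r"[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}/\d+-\d")
# Timestamp line: "HH:MM:SS.mmm --> HH:MM:SS.mmm".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")


def keep_spoken_lines(vtt_text):
    """Return only the spoken text of a WebVTT string, joined by spaces."""
    kept = []
    for line in vtt_text.splitlines():
        line = line.strip()
        if not line or line.upper() == "WEBVTT":
            continue  # blank line or file header
        if line.isdigit() or CUE_ID.match(line) or TIMESTAMP.match(line):
            continue  # cue index, cue identifier, or timestamp
        kept.append(line)
    return " ".join(kept)


cleaned = keep_spoken_lines(SAMPLE_VTT)
print(cleaned)
# → transcription making sure that the transcription does work. Yep
```

Only the two spoken lines survive; the header, the cue identifier, and the timestamp are all dropped before anything is sent to the API.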

###2.4. Count the Number of Tokens

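OpenAI's GPT-3 models can handle only a limited number of tokens per request (about 4,000 for text-davinci-003), and that limit covers both the request (i.e., the prompt) and the response. Here we count how many tokens are in the meeting transcript. Note that NLTK's word_tokenize produces word-level tokens, which only approximate the byte-pair-encoded tokens that the API actually counts.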

In [None]:
def count_tokens(filename):
    # the cleaned file was written as UTF-8, so read it back the same way
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()

    tokens = word_tokenize(text)
    num_tokens = len(tokens)
    return num_tokens

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"
token_count = count_tokens(filename)
print(f"Number of tokens: {token_count}")

Number of tokens: 17045


###2.5. Break up the Meeting Transcript into chunks of 2000 tokens with an overlap of 100 tokens

We break the meeting transcript into chunks of 2,000 tokens, with each chunk overlapping the previous one by 100 tokens, so that no information is lost at the chunk boundaries.
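
The chunk count can be worked out before splitting anything. The following is a small sketch of that arithmetic, assuming each new chunk starts `chunk_size - overlap_size` tokens after the previous one and the final chunk holds whatever remains; `expected_chunks` is a hypothetical helper introduced here for illustration.

```python
import math


def expected_chunks(num_tokens, chunk_size=2000, overlap_size=100):
    """Return (number_of_chunks, size_of_last_chunk) for the overlap scheme."""
    if num_tokens <= chunk_size:
        return 1, num_tokens
    step = chunk_size - overlap_size  # how far each chunk start advances
    # number of chunks needed after the first one
    extra = math.ceil((num_tokens - chunk_size) / step)
    last = num_tokens - extra * step  # tokens left for the final chunk
    return extra + 1, last


n_chunks, last_size = expected_chunks(17045)
print(n_chunks, last_size)
# → 9 1845
```

For the 17,045 tokens counted in the previous section, this gives nine chunks with a final chunk of 1,845 tokens.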

In [None]:
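def break_up_file(tokens, chunk_size, overlap_size):
    # the overlap must be smaller than the chunk, or the start never advances
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    step = chunk_size - overlap_size
    start = 0
    # iterate instead of recursing so long transcripts cannot hit the recursion limit
    while len(tokens) - start > chunk_size:
        yield tokens[start:start + chunk_size]
        start += step
    # the final chunk holds whatever remains
    yield tokens[start:]

def break_up_file_to_chunks(filename, chunk_size=2000, overlap_size=100):
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    tokens = word_tokenize(text)
    return list(break_up_file(tokens, chunk_size, overlap_size))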

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"

chunks = break_up_file_to_chunks(filename)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} tokens")

Chunk 0: 2000 tokens
Chunk 1: 2000 tokens
Chunk 2: 2000 tokens
Chunk 3: 2000 tokens
Chunk 4: 2000 tokens
Chunk 5: 2000 tokens
Chunk 6: 2000 tokens
Chunk 7: 2000 tokens
Chunk 8: 1845 tokens


###2.6. Validate the Breakup of the Meeting Transcript into chunks of 2000 tokens with an overlap of 100 tokens

In [None]:
print(chunks[0])

['Hi', 'everyone', 'I', 'am', 'about', 'to', 'start', 'the', 'recording', 'and', 'the', 'transcription', 'making', 'sure', 'that', 'the', 'transcription', 'does', 'work', '.', 'Yep', 'it', 'does', '.', 'Excellent', '.', 'OK', ',', 'well', ',', 'welcome', 'to', 'the', 'kickoff', 'meeting', 'and', 'congratulations', 'to', 'everybody', 'in', 'here', 'for', 'being', 'a', 'awarded', 'around', '22', 'grants', '.', 'Round', '22', 'was', 'a', 'heck', 'of', 'an', 'application', 'round', '.', 'There', 'were', 'many', 'good', 'applications', '.', 'We', 'had', 'to', 'turn', 'down', 'quite', 'a', 'few', 'and', 'we', 'really', 'hope', 'that', 'those', 'who', 'were', 'turned', 'down', 'will', 'revise', 'and', 'resubmit', 'because', 'there', 'were', 'just', 'so', 'many', 'great', 'ideas', 'and', "y'all", 'were', 'at', 'the', 'top', '.', 'Of', 'the', 'of', 'the', 'list', ',', 'so', 'it', "'s", 'it', "'s", 'excellent', 'to', 'see', 'you', 'all', 'here', '.', 'And', 'today', 'we', "'re", 'just', 'going',

In [None]:
print(chunks[0][-100:])

['put', 'them', 'up', 'for', 'download', '.', 'If', 'you', "'re", 'creating', 'an', 'entirely', 'new', 'open', 'textbook', ',', 'then', 'we', "'ll", 'help', 'you', 'in', 'getting', 'this', 'into', 'manifold', 'in', 'a', 'really', 'cool', 'way', '.', 'We', 'even', 'have', 'some', 'training', 'resources', 'in', 'here', ',', 'including', 'a', 'champions', 'welcome', 'training', 'and', 'kickoff', 'training', ',', 'and', 'all', 'of', 'this', '.', 'These', 'are', 'entries', 'here', 'and', '.', 'You', 'know', ',', 'a', 'few', 'folks', 'from', 'Kennesaw', ',', 'including', 'a', 'few', 'that', 'are', 'here', 'today', ',', 'created', 'a', 'student', 'success', 'workshop', 'that', 'we', 'host', 'right', 'here', 'on', 'Open', 'ALG', 'as', 'well', '.', 'So', 'it', "'s", 'it', "'s", 'exciting']


In [None]:
print(chunks[1])

['put', 'them', 'up', 'for', 'download', '.', 'If', 'you', "'re", 'creating', 'an', 'entirely', 'new', 'open', 'textbook', ',', 'then', 'we', "'ll", 'help', 'you', 'in', 'getting', 'this', 'into', 'manifold', 'in', 'a', 'really', 'cool', 'way', '.', 'We', 'even', 'have', 'some', 'training', 'resources', 'in', 'here', ',', 'including', 'a', 'champions', 'welcome', 'training', 'and', 'kickoff', 'training', ',', 'and', 'all', 'of', 'this', '.', 'These', 'are', 'entries', 'here', 'and', '.', 'You', 'know', ',', 'a', 'few', 'folks', 'from', 'Kennesaw', ',', 'including', 'a', 'few', 'that', 'are', 'here', 'today', ',', 'created', 'a', 'student', 'success', 'workshop', 'that', 'we', 'host', 'right', 'here', 'on', 'Open', 'ALG', 'as', 'well', '.', 'So', 'it', "'s", 'it', "'s", 'exciting', ',', 'it', "'s", 'a', 'new', 'home', 'for', 'these', 'resources', '.', 'It', 'does', "n't", 'link', 'up', 'to', 'our', 'library', 'discovery', 'system', 'the', 'way', 'that', 'Galileo', 'open', 'learning', 'm

In [None]:
if chunks[0][-100:] == chunks[1][:100]:
    print('Overlap is Good')
else:
    print('Overlap is Not Good')

Overlap is Good


###2.7. Set OpenAI API Key

Please note that, unlike the ChatGPT demo, OpenAI's API service is not free. You will need to sign up with OpenAI to get an API key, which requires payment information.

Set an environment variable called “OPENAI_API_KEY” and assign it a secret API key from OpenAI (https://beta.openai.com/account/api-keys).

In [None]:
os.environ["OPENAI_API_KEY"] = 'paste your openai api key here'

In [None]:
openai.api_key = os.getenv("OPENAI_API_KEY")

###2.8. Convert the NLTK Tokenized Text to Non-Tokenized Text

We need to convert the NLTK-tokenized text back into plain text, since the OpenAI GPT-3 API does not handle tokenized text very well.

In [None]:
def convert_to_prompt_text(tokenized_text):
    prompt_text = " ".join(tokenized_text)
    prompt_text = prompt_text.replace(" 's", "'s")
    return prompt_text
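
The function above re-attaches only `'s`, so tokens such as `'re`, `n't`, and punctuation still carry a leading space. The following is a sketch of a slightly broader regex-based approach, assuming the contraction and punctuation patterns listed below cover the common cases; `detokenize_sketch` is a hypothetical name, and this is not how the notebook's outputs were produced.

```python
import re


def detokenize_sketch(tokens):
    """Join word tokens into text, re-attaching contractions and punctuation."""
    text = " ".join(tokens)
    # re-attach contraction pieces such as 's, 're, 've, 'll, 'd, 'm and n't
    text = re.sub(r" (n't|'(?:s|re|ve|ll|d|m))\b", r"\1", text)
    # remove the space that the join put before closing punctuation
    text = re.sub(r" ([.,!?;:])", r"\1", text)
    return text


sample = ['So', 'it', "'s", 'exciting', ',', 'it', 'does', "n't", 'link', 'up', '.']
print(detokenize_sketch(sample))
# → So it's exciting, it doesn't link up.
```

Fewer stray spaces means text that looks closer to ordinary prose, though quotation marks and brackets are not handled here.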

###2.9. Summarize the Meeting Transcript

####2.9.1. Summarize the Meeting Transcript one chunk at a time.

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"
prompt_response = []

chunks = break_up_file_to_chunks(filename)
for chunk in chunks:
    prompt_request = "Summarize this meeting transcript: " + convert_to_prompt_text(chunk)

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=.5,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
    )
    
    prompt_response.append(response["choices"][0]["text"])

In [None]:
prompt_response

[' to see all the different ways that you can use this repository .\n\nThis meeting was a kickoff meeting for the 22 grants that were awarded. It went over the Affordable Learning Georgia (ALG) website, which is the hub for everything that ALG offers. It also discussed the grant procedure and the ALG grants equivalence of the syllabus. There was also a discussion on the two big homes for OER created through the grants, Galileo open learning materials and Manifold. Finally, there was a discussion on the training resources available on Open ALG.',
 ' around and talking about ALG and all of the cool stuff that we have going on .\n\nThis meeting discussed the resources available to help create open textbooks, including training resources and entries on Open ALG. They discussed the new ALG website that they have been working on with feedback from users and the new mega menu. They also discussed the resources available on manifold, which is connected to the library discovery system, and the 
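
Each summary in the output above begins with a fragment that reads like a continuation of the transcript, which suggests the model first finished the last sentence of the prompt before summarizing. One possible mitigation, offered as an untested sketch rather than a verified fix, is to fence the transcript in delimiters and end the prompt with an explicit cue. The helper name `build_summary_prompt` and the wording of the prompt are assumptions.

```python
def build_summary_prompt(transcript_text):
    """Wrap a transcript chunk in delimiters and end with a 'Summary:' cue."""
    return (
        "Summarize the meeting transcript between the triple quotes.\n\n"
        f'"""\n{transcript_text.strip()}\n"""\n\n'
        "Summary:"
    )


example_prompt = build_summary_prompt("  Hi everyone, welcome to the kickoff meeting ")
print(example_prompt)
```

The returned string could be passed as `prompt` in place of the concatenated string used in the loop above.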

####2.9.2. Consolidate the Meeting Transcript Summaries.

In [None]:
prompt_request = "Consolidate these meeting summaries: " + str(prompt_response)

response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_request,
        temperature=.5,
        max_tokens=1000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
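
Passing `str(prompt_response)` sends the Python list representation, including brackets, quotes, and escaped `\n` sequences, to the model. A minimal sketch of an alternative is to join the summaries into a numbered list first; `build_consolidation_prompt` is a hypothetical helper, and whether it improves the result on this transcript has not been checked.

```python
def build_consolidation_prompt(summaries):
    """Format chunk summaries as a numbered list under a single instruction."""
    numbered = "\n".join(
        f"{i}. {summary.strip()}" for i, summary in enumerate(summaries, start=1)
    )
    return "Consolidate these meeting summaries:\n\n" + numbered


demo_prompt = build_consolidation_prompt([" first summary.\n", "second summary "])
print(demo_prompt)
```

Stripping each summary also removes the leading blank lines that the completions tend to start with.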

In [None]:
meeting_summary = response["choices"][0]["text"]
print(meeting_summary)



This meeting was a gathering of representatives from various universities to discuss their ALG grants. Each team is transforming a textbook into an open stacks book and developing instructional videos, interactive simulations, laboratory activities, and new grading systems. The teams are from the University of West Georgia, Clayton State University, University of North Georgia, Georgia Gwinnett College, and Georgia Gwinnett College. The projects range from transforming a Spanish textbook, a microbiology course, an introductory physics course for biology students, a survey of chemistry course for allied health majors, and an introduction to anthropology textbook. They also discussed the new ALG website that they have been working on with feedback from users and the new mega menu, the resources available on manifold, which is connected to the library discovery system, and the Galileo Open Learning Materials repository, the ALG tracking spreadsheet, where they track data such as the gra

###2.10. Get Action Items from Meeting Transcript

####2.10.1. Get Action Items from the Meeting Transcript one chunk at a time.

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/minutes/data/Round_22_Online_Kickoff_Meeting.txt"

action_response = []

chunks = break_up_file_to_chunks(filename)
for chunk in chunks:
    prompt_request = "Provide a list of action items with a due date from the provided meeting transcript text: " + convert_to_prompt_text(chunk)

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=.5,
            max_tokens=500,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
    )
    
    action_response.append(response["choices"][0]["text"])

In [None]:
print(action_response)

[" to see all of this stuff here . And then if you want to see what other folks are doing , you can go over to the project page . This is where all the grants that we 've awarded since 2016 , all the way up to the current ones , all of them are listed here . You can see the status of them . You can see the titles and you can even see if they have a project page or not . So if you 're looking for something specific , you can go right here and search for it . And if you 're looking for something in the same discipline , you can go to the same discipline and see what other projects are out there . So it's really exciting to see what folks are doing . So that's a quick overview of the ALG website . I 'm going to go back to the slides here and I 'm going to go to the next section .\n\nAction Item 1: Introduce participants and have everyone share a greeting - Due Date: Immediately \nAction Item 2: Go through ALG website - Due Date: Immediately \nAction Item 3: Go over grant procedures - Due 

####2.10.2. Consolidate the Meeting Transcript Action Items.

In [None]:
prompt_request = "Consolidate these meeting action items: " + str(action_response)

response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt_request,
        temperature=.5,
        max_tokens=500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

In [None]:
meeting_action_items = response["choices"][0]["text"]
print(meeting_action_items)



Action Items: 
1. Introduce participants and have everyone share a greeting - Due Date: Immediately
2. Go through ALG website - Due Date: Immediately
3. Go over grant procedures - Due Date: Immediately
4. Go over OER repositories - Due Date: Immediately
5. Go over grant deadlines and reporting guidelines - Due Date: Immediately
6. Go over Manifold repository - Due Date: Immediately
7. Go over project page - Due Date: Immediately
8. Submit report by December 19th - Due Date: December 19th
9. Put resources up for download - Due Date: ASAP
10. Create entries in Open ALG and Manifold - Due Date: ASAP
11. Create a new ALG website - Due Date: End of January/Beginning of February
12. Create a mega menu - Due Date: End of January/Beginning of February
13. Create a new Data Page - Due Date: End of January/Beginning of February
14. Create a new ALG Events Page - Due Date: End of January/Beginning of February
15. Create an Accessibility Guide - Due Date: End of January/Beginning of February
16.