### Summarize Talk Notebook

**Input:** Audio file containing speech.<br>
**Output:** A brief summary of what is said in the talk.<br>

---
Raúl Arrabales Moreno - raul.arrabales@gmail.com (Jun. 2023)


This notebook uses:
- OpenAI Whisper for transcription (ASR, speech-to-text).
    - https://github.com/openai/whisper
- OpenAI Davinci (`text-davinci-003`) for text summarization.
    - https://platform.openai.com/docs/models/gpt-3-5

## Setup

### Input file specification

In [21]:
# A poor-quality, low-volume audio file
# of me performing a demo of querying datasets using natural language
# See https://www.youtube.com/watch?v=prbl_cM3iJk 
audio_file_path = "data/Quering_Datasets_NLU.mp3"

# Where to save the transcription
text_file_path = "data/QD_NLU_transcription.txt"

### Libs

In [25]:
# OS interfacing
import os 

# for counting tokens
import tiktoken

# OpenAI client (the calls below use the pre-1.0 SDK interface:
# openai.Audio and openai.Completion)
import openai 



### OpenAI API Credentials

In [16]:
# My OpenAI API key is in a file called 'ram_openai_apikey.txt', 
# in the same directory as this notebook
key_filename = 'ram_openai_apikey.txt'

In [17]:
def get_apikey_from_file(filename):
    """ Given a filename,
        return the API key stored in that file
    """
    try:
        with open(filename, 'r') as f:
            # It's assumed our file contains a single line,
            # with our API key
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)
        # Re-raise: returning None here would only fail later,
        # when the key is assigned to os.environ
        raise

In [18]:
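openai_api_key = get_apikey_from_file(key_filename)
os.environ["OPENAI_API_KEY"] = openai_api_key
# The pre-1.0 SDK reads the environment variable when it is imported,
# so also set the key on the already-imported module
openai.api_key = openai_api_key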
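A minimal sketch of a more defensive way to load the key, assuming you would rather use an `OPENAI_API_KEY` that is already set in the environment and only fall back to the file. The helper name `load_api_key` is hypothetical and not part of the OpenAI SDK.

```python
import os


def load_api_key(filename, env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or else from `filename`.

    Hypothetical helper: it raises instead of returning None, so a
    missing key is reported here rather than at the first API call.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # FileNotFoundError propagates to the caller if the file is missing
    with open(filename, "r") as f:
        key = f.read().strip()
    if not key:
        raise ValueError("'%s' is empty" % filename)
    return key
```

With this helper, the cell above could be reduced to `openai.api_key = load_api_key(key_filename)`.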

## Audio Transcription

In [19]:
# Read the audio file
audio_file = open(audio_file_path, "rb")

In [20]:
# Use Whisper to transcribe the speech into text
transcript = openai.Audio.transcribe("whisper-1", audio_file)
audio_file.close()  # release the file handle once the upload is done
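Before uploading a longer talk, it may be worth checking the file size. The sketch below assumes the 25 MB per-file upload limit that OpenAI documented for the transcription endpoint around the time this notebook was written; treat the constant as an assumption and check the current docs. `check_audio_size` is a hypothetical helper name.

```python
import os

# Assumed upload limit for the transcription endpoint (verify in the API docs)
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def check_audio_size(path, limit=MAX_UPLOAD_BYTES):
    """Return the file size in bytes, raising ValueError if it exceeds `limit`."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(
            "%s is %.1f MB; the assumed upload limit is %.1f MB"
            % (path, size / 2**20, limit / 2**20)
        )
    return size
```

Calling `check_audio_size(audio_file_path)` before the transcription cell fails fast locally instead of after the upload attempt.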

In [22]:
# Save the transcription to disk (for further use)
# An explicit encoding avoids platform-dependent defaults
with open(text_file_path, 'w', encoding='utf-8') as f:
    f.write(transcript.text)
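Since the transcription is saved for further use, a later run could read it back instead of calling the API again. This is a minimal sketch of that idea; `load_or_transcribe` and its `transcribe_fn` argument are hypothetical names, with `transcribe_fn` standing in for whatever performs the Whisper call.

```python
import os


def load_or_transcribe(text_path, transcribe_fn):
    """Return the cached transcription if present, else create and cache it.

    `transcribe_fn` is a hypothetical zero-argument callable that returns
    the transcription as a string (for example, a small wrapper around
    the Whisper call above).
    """
    if os.path.exists(text_path):
        with open(text_path, "r", encoding="utf-8") as f:
            return f.read()
    text = transcribe_fn()
    with open(text_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

The design choice here is to key the cache on the text file path only, so deleting `text_file_path` is the way to force a fresh transcription.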

In [23]:
# See the transcription
print(transcript.text)

Hello everyone, Raul speaking. This is a brief description of this project querying datasets using natural language. So the basic idea behind this small application is just to have a chat, like a chatbot, in order to have an interface for a dataset. So imagine that you are, let's say, a psychologist, that you don't have any expertise working with programming languages, you don't know Python, you don't know SQL, but you need to get some insights or some analytics from a dataset, from a data file. So let me show you the application, how it works. So if I jump into the application directly, the first thing is I'm able to browse, select a dataset, a CSV file. So if I open this one, basically this comes from an experiment I ran. It has something to do with a psychological experiment, I'm not going to explain that here, but the important thing is we have several rows, we have columns, we can identify things like gender, age, some kind of code. If I scroll here horizontally, I see some text i

In [33]:
# Check number of tokens
encoding = tiktoken.encoding_for_model("text-davinci-003") # Encoder for davinci model
len(encoding.encode(transcript.text))

750
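The token count matters because the prompt and the requested completion share one context window. The sketch below assumes a 4,097-token window for `text-davinci-003` (the figure listed in the model docs at the time; verify before relying on it). Both helper names are hypothetical. `split_by_tokens` shows one simple way a longer transcript could be cut into pieces that fit.

```python
def fits_context(prompt_tokens, max_completion_tokens, context_window=4097):
    """True if the prompt plus the requested completion fit in the window.

    The default window size is an assumption for text-davinci-003.
    """
    return prompt_tokens + max_completion_tokens <= context_window


def split_by_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most `chunk_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
```

For this talk, `fits_context(750, 240)` holds with plenty of room, even allowing a few extra tokens for the `Tl;dr` suffix. With tiktoken, `encoding.encode(text)` gives the token list to split and `encoding.decode(chunk)` turns each chunk back into text.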

## Summarizing the talk

In [41]:
# Ask davinci to summarize with "Tl;dr"
# See https://platform.openai.com/examples/default-tldr-summary 
response = openai.Completion.create(
  model="text-davinci-003",
  prompt=f'{transcript.text}\n\nTl;dr',
  temperature=1,
  max_tokens=240,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=1
)

In [42]:
print(response.choices[0].text)


This project is an application, which allows users to query datasets using natural language. The app allows users to select a CSV file and then use the chatbot to ask questions about the data set in the form of regular sentences. Under the hood, the bot translates these sentences into Python code and runs the queries on the dataset using Pandas. This lets non-programmers interact with data sets and get results without any programming knowledge. Examples are provided to explore this process, such as working out the list of different values in one column and their percentages in the dataset.
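The raw completion starts with blank lines and could be cut short if it hits `max_tokens`. A minimal sketch of a helper that tidies the text and flags truncation, assuming the response has the `choices[0].text` and `choices[0].finish_reason` fields of a Completion-style response. `extract_summary` is a hypothetical name.

```python
def extract_summary(response):
    """Return (summary_text, truncated) from a Completion-style response.

    Uses key access, which works for plain dicts; the pre-1.0 SDK
    response objects are assumed to support it as well.
    """
    choice = response["choices"][0]
    text = choice["text"].strip()
    # finish_reason == "length" means generation stopped at max_tokens
    truncated = choice.get("finish_reason") == "length"
    return text, truncated
```

If `truncated` comes back `True`, raising `max_tokens` (while staying inside the context window) is the usual remedy.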
