# **Audio to Text using OpenAI's Whisper API**

1. Familiarize yourself with the Whisper API: Learn how the Whisper API converts audio input to text.
2. Collect audio data: Collect a sample audio data set to test your application. The data set should include different types of audio input, such as speeches, songs, and conversations.
3. Convert audio input to text: Use the Whisper API to transcribe the audio. Test and measure the accuracy of each transcription against the original audio.
4. Send text output to the OpenAI Davinci model: Pass the transcribed text to the OpenAI Davinci model to generate a logical response. This step might involve natural language processing (NLP) techniques to analyze and understand the content of the text.
5. Convert text output to audio: After generating a response with the OpenAI Davinci model, convert the text back to audio. You might use text-to-speech (TTS) software or libraries for this conversion.
6. Document and share your work: Document your work, including any code, data sets, and methodologies you used, and share your findings in a GitHub repository.
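
Step 3 asks for the transcription accuracy to be measured. One common way to do that is the word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function below is a minimal sketch, not part of the original notebook; the name `word_error_rate` and the lowercase, whitespace-split tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 0.0 means the two transcripts match word for word; one substituted word in a four-word reference gives 0.25. Punctuation is kept attached to words here, so stripping it first may give a fairer score.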

In [29]:
# a GPU makes transcription faster; uncomment the next line to check which GPU is attached
#!nvidia-smi
# the notebook also runs on a CPU if no GPU is available

## Install openai

In [30]:
# install openai
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [31]:
#install latest release of whisper
!pip install -U openai-whisper

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [32]:
# alternatively, install whisper directly from the latest commit on GitHub
!pip install git+https://github.com/openai/whisper.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-nzu2s6d6
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-nzu2s6d6
  Resolved https://github.com/openai/whisper.git to commit 6dea21fd7f7253bfe450f1e2512a0fe47ee2d258
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


# 1. Importing whisper (medium) model

In [33]:
import openai
import whisper

model = whisper.load_model("medium")

In [34]:
# function to display audio
from IPython.display import Audio, display

def display_audio(media_file):
  display(Audio(media_file, autoplay=True))

In [35]:
# function that transcribes an audio file and prints the text
def audio2text(media_file):
  # fp16=False runs the model in full precision, which also works on a CPU
  result = model.transcribe(media_file, fp16=False)
  print(result["text"])
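
Printing the transcript is convenient for a demo, but later steps need the text as a value. The sketch below is a hypothetical variant, `transcribe_text`, that returns the transcript and enables half precision only when the model sits on a CUDA device. It assumes the loaded model exposes a `device` attribute and a `transcribe(path, fp16=...)` method, as the Whisper model used above does.

```python
def transcribe_text(model, media_file):
    """Return the transcript of media_file as a string instead of printing it."""
    # Assumption: the model has a `device` attribute; fall back to "cpu" if not.
    device = str(getattr(model, "device", "cpu"))
    # Use half precision only on CUDA devices; stay in full precision on CPU.
    result = model.transcribe(media_file, fp16=device.startswith("cuda"))
    # Transcripts in this notebook start with a space, so strip the whitespace.
    return result["text"].strip()
```

With the model loaded earlier, `text = transcribe_text(model, "audio.mp3")` would hold the transcript for reuse.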

# 2. Dataset consists of dialogues from movies, songs, speech and conversations

## Audio Clip 1 (Movie)

In [36]:
# download audio from given site
!wget -O audio.mp3 http://www.moviesoundclips.net/movies1/batmanbegins/bats.mp3

--2023-03-22 13:01:15--  http://www.moviesoundclips.net/movies1/batmanbegins/bats.mp3
Resolving www.moviesoundclips.net (www.moviesoundclips.net)... 198.54.115.219
Connecting to www.moviesoundclips.net (www.moviesoundclips.net)|198.54.115.219|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52128 (51K) [audio/mpeg]
Saving to: ‘audio.mp3’


2023-03-22 13:01:15 (1018 KB/s) - ‘audio.mp3’ saved [52128/52128]



In [37]:
display_audio("audio.mp3")
audio2text("audio.mp3")

 Bats are not pernil. Bats may be, but even a billionaire playboy, sir, three o'clock is pushing him.


## Audio Clip 2 (Audio from a person)

In [38]:
display_audio("sample-0.mp3")
audio2text("sample-0.mp3")

 It wasn't like I was asking for the code to a nuclear bunker or anything like that, but the amount of resistance I got from this


## Audio Clip 3 (Speech)

In [39]:
!wget -O audio1.mp3 http://www.moviesoundclips.net/movies1/pirates4/kingsmen.mp3

--2023-03-22 13:03:33--  http://www.moviesoundclips.net/movies1/pirates4/kingsmen.mp3
Resolving www.moviesoundclips.net (www.moviesoundclips.net)... 198.54.115.219
Connecting to www.moviesoundclips.net (www.moviesoundclips.net)|198.54.115.219|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174632 (171K) [audio/mpeg]
Saving to: ‘audio1.mp3’


2023-03-22 13:03:33 (1.58 MB/s) - ‘audio1.mp3’ saved [174632/174632]



In [40]:
display_audio("audio1.mp3")
audio2text("audio1.mp3")

 Gentlemen, I shall not ask any more of any man than what that man can deliver. But I do ask this. Are we not King's men? Aye. On the King's mission? Aye. I did not note any fear in the eyes of the Spanish as they passed us by. Are we not King's men? Aye. Aye.


## Audio Clip 4 (Conversations)

In [41]:
display_audio("conversation.mp3")
audio2text("conversation.mp3")

 What do you know, right? Everything. Haven't you heard between me and my brother? We know everything. What's the capital of Australia? It's what my brother knows.


# 3. Audio input to text using Whisper API

In [42]:
def audio2text_usingapi(file_path):
  openai.api_key = 'sk-***' # generate api key from openai account
  # open the file in binary mode; the with block closes it after the request
  with open(file_path, "rb") as f1:
    response = openai.Audio.transcribe("whisper-1", f1)
  return response['text']
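
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A possible alternative, sketched below, reads it from an environment variable. The helper names `load_api_key` and `audio2text_usingapi_env` and the variable name `OPENAI_API_KEY` are assumptions for this example; the transcription call itself mirrors the one used in `audio2text_usingapi`.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

def audio2text_usingapi_env(file_path):
    """Same request as audio2text_usingapi, with the key taken from the environment."""
    import openai  # imported here so load_api_key works without the package
    openai.api_key = load_api_key()
    with open(file_path, "rb") as audio_file:
        response = openai.Audio.transcribe("whisper-1", audio_file)
    return response["text"]
```

In Colab the variable could be set for the session with `os.environ["OPENAI_API_KEY"] = ...` in a cell that is not committed to the repository.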

In [57]:
text_f = audio2text_usingapi("/content/sample-0.mp3")
print(text_f)

It wasn't like I was asking for the code to a nuclear bunker or anything like that, but the amount of resistance I got from this


## Audio Clip 5 (French song)

In [44]:
display_audio("audio_french.mp3")

In [45]:
audio2text_usingapi("audio_french.mp3")

"Il s'est barré beau malin d'ennui A la recherche d'une nouvelle lueur Ça m'rappelle quand on prenait le train"

## Audio Clip 6 (Speech in other language)

In [46]:
display_audio("audio_lang.mp3")

In [47]:
audio2text_usingapi("audio_lang.mp3")

'Ерлан Тұрғымбаев – Қазақстан Республикасының үшкестер министрі, Батыс Қазақстан Қызылорда областарының жергілікті атқарушу орғандары екі жыл бойынша жоспарланған ұшыраларды ұрындама егеледі.'

# 4. Complete the transcribed text using the Davinci model

In [48]:
import os
import openai

openai.api_key = "sk-***" # generate api key from openai account

response = openai.Completion.create(
  model="text-davinci-003", # best performing davinci model
  prompt=text_f,
  temperature=0.7,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)

In [49]:
text_completion = response.choices[0].text

In [50]:
print(text_completion)

 person was astounding! They just kept saying that they weren't allowed to give it out and I had to respect their wishes. I can understand respecting people's privacy, but this person seemed to be taking it to an extreme. In the end, I had to just accept that I wasn't going to be getting the answer I wanted and move on.


In [51]:
text_final = text_f+'...'+text_completion

In [52]:
# the transcript followed by the completion generated by the davinci model
print(text_final)

It wasn't like I was asking for the code to a nuclear bunker or anything like that, but the amount of resistance I got from this... person was astounding! They just kept saying that they weren't allowed to give it out and I had to respect their wishes. I can understand respecting people's privacy, but this person seemed to be taking it to an extreme. In the end, I had to just accept that I wasn't going to be getting the answer I wanted and move on.
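
Joining with `'...'` leaves a visible seam ("from this... person") because the completion already starts with a space. If a seamless sentence is preferred, the two parts can be joined on a single space instead. This is a small sketch, and `join_completion` is a hypothetical helper, not part of the notebook.

```python
def join_completion(prompt_text, completion_text):
    """Join a prompt and its completion with exactly one space between them."""
    # Trim whitespace at the seam only; the rest of both strings is untouched.
    return f"{prompt_text.rstrip()} {completion_text.lstrip()}"

joined = join_completion("resistance I got from this", " person was astounding!")
print(joined)
```

Whether to keep the `'...'` marker is a design choice: it shows where the transcript ends and the generated text begins, at the cost of an odd pause when the result is read aloud.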


In [53]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk import word_tokenize, pos_tag
tokens = word_tokenize(text_final)
print(pos_tag(tokens))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('It', 'PRP'), ('was', 'VBD'), ("n't", 'RB'), ('like', 'IN'), ('I', 'PRP'), ('was', 'VBD'), ('asking', 'VBG'), ('for', 'IN'), ('the', 'DT'), ('code', 'NN'), ('to', 'TO'), ('a', 'DT'), ('nuclear', 'JJ'), ('bunker', 'NN'), ('or', 'CC'), ('anything', 'NN'), ('like', 'IN'), ('that', 'DT'), (',', ','), ('but', 'CC'), ('the', 'DT'), ('amount', 'NN'), ('of', 'IN'), ('resistance', 'NN'), ('I', 'PRP'), ('got', 'VBD'), ('from', 'IN'), ('this', 'DT'), ('...', ':'), ('person', 'NN'), ('was', 'VBD'), ('astounding', 'VBG'), ('!', '.'), ('They', 'PRP'), ('just', 'RB'), ('kept', 'VBD'), ('saying', 'VBG'), ('that', 'IN'), ('they', 'PRP'), ('were', 'VBD'), ("n't", 'RB'), ('allowed', 'VBN'), ('to', 'TO'), ('give', 'VB'), ('it', 'PRP'), ('out', 'IN'), ('and', 'CC'), ('I', 'PRP'), ('had', 'VBD'), ('to', 'TO'), ('respect', 'VB'), ('their', 'PRP$'), ('wishes', 'NNS'), ('.', '.'), ('I', 'PRP'), ('can', 'MD'), ('understand', 'VB'), ('respecting', 'VBG'), ('people', 'NNS'), ("'s", 'POS'), ('privacy', 'NN'), ('
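
The raw list of `(token, tag)` pairs is hard to read at a glance. A frequency count per tag gives a quicker summary of the text. The sketch below uses only the standard library and works on any list shaped like the `pos_tag` output above; `tag_counts` is a name chosen for this example, and the sample is the first five pairs from the output.

```python
from collections import Counter

def tag_counts(tagged_tokens):
    """Count how often each part-of-speech tag occurs in (token, tag) pairs."""
    return Counter(tag for _token, tag in tagged_tokens)

# First five pairs from the tagger output above.
sample = [("It", "PRP"), ("was", "VBD"), ("n't", "RB"), ("like", "IN"), ("I", "PRP")]
counts = tag_counts(sample)
print(counts.most_common())
```

In the notebook, `tag_counts(pos_tag(tokens))` would summarize the whole completed text.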

# 5. Convert Text output to Audio

In [54]:
!pip install gTTS

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gTTS
  Downloading gTTS-2.3.1-py3-none-any.whl (28 kB)
Installing collected packages: gTTS
Successfully installed gTTS-2.3.1


In [55]:
from gtts import gTTS

language = 'en'

# slow=False reads the text at normal speed
obj1 = gTTS(text=text_final, lang=language, slow=False)

# save the converted audio as an MP3 file
obj1.save("packt.mp3")

In [56]:
display_audio("packt.mp3")
audio2text("packt.mp3")

 It wasn't like I was asking for the code to a nuclear bunker or anything like that. But the amount of resistance I got from this person was astounding. They just kept saying that they weren't allowed to give it out and I had to respect their wishes. I can understand respecting people's privacy, but this person seemed to be taking it to an extreme. In the end, I had to just accept that I wasn't going to be getting the answer I wanted and move on.
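
The round trip (text to speech and back to text) reproduces the words, but the punctuation differs from `text_final`, so a plain string comparison would report a mismatch. The sketch below compares the two texts after lowercasing and removing punctuation. The helper names `normalize` and `word_similarity` are assumptions for this example, and `difflib.SequenceMatcher` is just one reasonable way to score the overlap.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    # Keep word characters, apostrophes and whitespace; blank out the rest.
    cleaned = re.sub(r"[^\w'\s]", " ", text.lower())
    return cleaned.split()

def word_similarity(text_a, text_b):
    """Similarity between two texts, from 0.0 to 1.0, compared word by word."""
    return SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()

score = word_similarity(
    "anything like that, but the amount of resistance I got from this... person",
    " anything like that. But the amount of resistance I got from this person",
)
print(score)
```

A score of 1.0 means the two texts contain the same words in the same order once case and punctuation are ignored.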
