# Part 1: Tool Install and Setup
The first part of the pipeline is getting our tools installed.
The tools we will attempt to install are:

- Whisper
- FFmpeg
- my-voice-analysis
- A version of Llama 2, if possible

In [3]:
!pip install git+https://github.com/openai/whisper.git
!pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg

[sudo] password for rudito: 


In [None]:
!pip install my-voice-analysis

Collecting my-voice-analysis
  Downloading my_voice_analysis-0.7-py3-none-any.whl (16 kB)
Collecting praat-parselmouth>=0.3.2 (from my-voice-analysis)
  Downloading praat_parselmouth-0.4.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (10.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.7/10.7 MB[0m [31m27.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: praat-parselmouth, my-voice-analysis
Successfully installed my-voice-analysis-0.7 praat-parselmouth-0.4.3


Before my-voice-analysis is imported and used, make sure that https://github.com/Shahabks/my-voice-analysis/blob/master/myspsolution.praat has been downloaded manually and placed in the working directory.
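
A quick check along these lines can confirm the file is in place before the import cell runs. This is a minimal sketch: `praat_script_present` is a hypothetical helper name, and it assumes the working directory is the one the notebook is run from.

```python
import os

def praat_script_present(directory="."):
    """Return True if myspsolution.praat exists in the given directory."""
    return os.path.isfile(os.path.join(directory, "myspsolution.praat"))

# Assumption: the notebook's current directory is the working directory.
if not praat_script_present():
    print("myspsolution.praat not found in", os.getcwd())
```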

In [5]:
mysp=__import__("my-voice-analysis")


In [6]:
import IPython.display
import os
import sys
import re

# Part 2: Audio Extraction and Basic Stat Collection
Now that our tools are installed, we can prepare the audio. For this notebook, we will assume that the audio first needs to be extracted from a video. Then, we will use my-voice-analysis to collect some of the statistics that are easy to acquire.

In [None]:
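import subprocess

# Placeholders: replace each value before running.
video = "INSERT FILE HERE"        # video path without extension (".mp4"/".wav" are appended below)
output_name = "INSERT FILE HERE"  # JSON file that will hold the collected stats
video_name = "INSERT FILE HERE"   # audio name passed to the my-voice-analysis functions
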
def convert_webm_to_wav(webm_file, wav_file):
    command = [
        "ffmpeg",
        "-i", webm_file,
        "-vn",  # Disable video recording
        "-acodec", "pcm_s16le",  # Set audio codec to PCM 16-bit little-endian
        "-ar", "44100",  # Set audio sample rate to 44100 Hz
        "-ac", "2",  # Set number of audio channels to 2 (stereo)
        wav_file
    ]
    subprocess.run(command, check=True)
def convert_mp4_to_wav(mp4_file, wav_file):
    try:
        command = [
            "ffmpeg",
            "-i", mp4_file,
            "-vn",  # Disable video recording
            "-acodec", "pcm_s16le",  # Set audio codec to PCM 16-bit little-endian
            "-ar", "44100",  # Set audio sample rate to 44100 Hz
            "-ac", "2",  # Set number of audio channels to 2 (stereo)
            wav_file
        ]
        subprocess.run(command, check=True)
        print("Conversion successful!")
    except subprocess.CalledProcessError as e:
        print("Error:", e)


convert_mp4_to_wav(video + ".mp4", video + ".wav")
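
Since the two converters above differ only in the input file, they could be folded into one function. The following is a sketch, not the notebook's method: `build_wav_command` and `convert_to_wav` are hypothetical names, and the `-y` flag (overwrite the output without asking) is an addition that the original commands do not use.

```python
import shutil
import subprocess

def build_wav_command(input_file, wav_file):
    """Build the ffmpeg argument list used above, for any input container."""
    return [
        "ffmpeg",
        "-y",                    # assumption: overwrite wav_file if it already exists
        "-i", input_file,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # PCM 16-bit little-endian
        "-ar", "44100",          # 44100 Hz sample rate
        "-ac", "2",              # stereo
        wav_file,
    ]

def convert_to_wav(input_file, wav_file):
    """Run ffmpeg and report success as a bool instead of raising."""
    if shutil.which("ffmpeg") is None:
        print("ffmpeg was not found on PATH")
        return False
    try:
        subprocess.run(build_wav_command(input_file, wav_file), check=True)
    except subprocess.CalledProcessError as e:
        print("Error:", e)
        return False
    return True
```

Keeping the command builder separate from the function that runs it makes the argument list easy to inspect without invoking ffmpeg.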

In [181]:
mysp.mysptotal(video_name, "INSERT FILE HERE")

[]
                           0
number_ of_syllables    2579
number_of_pauses         411
rate_of_speech             3
articulation_rate          5
speaking_duration      489.1
original_duration      864.9
balance                  0.6
f0_mean               238.27
f0_std                 67.33
f0_median              227.5
f0_min                    70
f0_max                   424
f0_quantile25            195
f0_quan75                284


In [182]:
from io import StringIO
old_stdout = sys.stdout
sys.stdout = my_buffer = StringIO()

mysp.myspbala(video_name, "/home/rudito/Code/Cao_Research/JagCoach/")

output = my_buffer.getvalue()
sys.stdout = old_stdout

In [183]:
print(output)

[]
balance= 0.6 # ratio (speaking duration)/(original duration)



In [184]:
# str.strip() removes a set of characters, not a prefix, so extract the number with a regex instead
balance_str = re.search(r"balance=\s*(-?\d+(?:\.\d+)?)", output).group(1)
balance = float(balance_str)
print(balance_str)

0.6
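
The capture-and-parse pattern used here repeats for each statistic below, so it could be wrapped in a helper. This is a sketch under assumptions: `capture_stat` is a hypothetical name, it uses `contextlib.redirect_stdout` in place of manually swapping `sys.stdout`, and it assumes each function prints a line of the form `label= value # comment`, as the outputs shown in this notebook do.

```python
import contextlib
import io
import re

def capture_stat(func, label, *args):
    """Call func(*args), capture what it prints, and return the number after 'label='."""
    buffer = io.StringIO()
    # redirect_stdout restores sys.stdout even if func raises
    with contextlib.redirect_stdout(buffer):
        func(*args)
    match = re.search(re.escape(label) + r"=\s*(-?\d+(?:\.\d+)?)", buffer.getvalue())
    if match is None:
        raise ValueError(f"no '{label}=' line found in output")
    return float(match.group(1))
```

With this helper, a call such as `capture_stat(mysp.myspbala, "balance", video_name, path)` would replace the three cells above, where `path` stands for the directory argument.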


In [185]:
old_stdout = sys.stdout
sys.stdout = my_buffer = StringIO()

mysp.myspatc(video_name, "INSERT FILE HERE")

output2 = my_buffer.getvalue()
sys.stdout = old_stdout

In [186]:
print(output2)

[]
articulation_rate= 5 # syllables/sec speaking duration



In [187]:
# Extract the number after "articulation_rate=" rather than stripping a character set
rate_str = re.search(r"articulation_rate=\s*(-?\d+(?:\.\d+)?)", output2).group(1)
rate = float(rate_str)
print(rate)

5.0


In [188]:
old_stdout = sys.stdout
sys.stdout = my_buffer = StringIO()

mysp.myspf0sd(video_name, "INSERT FILE HERE")

output3 = my_buffer.getvalue()
sys.stdout = old_stdout

In [189]:
print(output3)

[]
f0_SD= 67.33 # Hz global standard deviation of fundamental frequency distribution



In [190]:
# The old strip() character set contained "0", so a value such as 100 would have been cut to 1
stdd_str = re.search(r"f0_SD=\s*(-?\d+(?:\.\d+)?)", output3).group(1)
stdd = float(stdd_str)
print(stdd_str)

67.33 



In [191]:
import json
raw_stats = {
    # Despite the key name, this value is (speaking duration)/(original duration), as printed by myspbala
    "Speech to Noise Ratio:": balance_str,
    "Speech Rate (Syllables per Sec):": int(rate),
    "Speech Fundamental Frequency STD Deviation:": stdd_str.strip()
}
with open(output_name, "w") as outfile:
    json.dump(raw_stats, outfile)

# Part 3: Whisper Analysis
With the audio extracted and the basic statistics recorded, we can move on to transcribing the audio with Whisper.

In [1]:
import whisper
import torch

In [2]:
model = whisper.load_model("medium.en")
result = model.transcribe("INSERT FILE HERE")

In [None]:
# Start a new paragraph after each sentence, keeping the period that split(". ") would drop
formatted_text = result["text"].strip().replace(". ", ".\n\n")
print(formatted_text)
with open("INSERT FILE HERE", "w") as f:
    f.write(formatted_text)
torch.cuda.empty_cache()
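
Besides the full text, the dictionary returned by `transcribe` also carries a `segments` list whose entries include `start`, `end`, and `text` fields. As a sketch of how timestamps could be kept alongside the transcript (the helper name `format_segments` is hypothetical, and the sample dictionary below is made-up data standing in for a real `result`):

```python
def format_segments(result):
    """Render each Whisper segment as '[start - end] text', one per line."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"[{seg['start']:7.2f} - {seg['end']:7.2f}] {seg['text'].strip()}")
    return "\n".join(lines)

# Made-up stand-in for the dictionary returned by model.transcribe(...)
sample = {"segments": [{"start": 0.0, "end": 2.5, "text": " Hello everyone."}]}
print(format_segments(sample))
```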
