<a href="https://colab.research.google.com/github/Themanwhosoldtheworldd/Project-Moriarty/blob/main/Project_Moriarty.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Moriarty**
The project's current goal is to record speech from the default system microphone and transcribe it to text with Whisper. A second model (unitaryai/detoxify) then classifies the transcribed text against the labels **toxic**, **severe_toxic**, **obscene**, **threat**, **insult**, and **identity_attack**.


# **Prerequisites**

In [None]:
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
!pip install git+https://github.com/openai/whisper.git 
!apt install -y libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
!pip install pyaudio
!pip install numpy
!pip install ffmpeg-python
!pip install transformers
!pip install detoxify
!pip install pandas

In [None]:
from IPython.display import HTML, Audio, display
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import ffmpeg
import scipy
import whisper
import os
import torch
from detoxify import Detoxify
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


# **Voice Interface**

The code below is required if you are using Colab. It creates an interface that lets the browser access the microphone.

In [None]:
AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };            
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {            
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data); 
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);


function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving the recording... pls wait!"
  }
}

// https://stackoverflow.com/a/951057
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  resolve(base64data.toString())

});

}
});
      
</script>
"""

The code below converts the user's recording to WAV data so that it can be saved as a ".wav" file and fed to Whisper.

In [None]:
def get_audio():
  display(HTML(AUDIO_HTML))
  data = eval_js("data")
  binary = b64decode(data.split(',')[1])
  
  process = (ffmpeg
    .input('pipe:0')
    .output('pipe:1', format='wav')
    .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
  )
  output, err = process.communicate(input=binary)
  
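  # The RIFF chunk size in the WAV header is the total length minus the
  # 8 bytes taken by the "RIFF" tag and the size field itself.
  riff_chunk_size = len(output) - 8
  # Break up the chunk size into four little-endian bytes, held in b.
  q = riff_chunk_size
  b = []
  for i in range(4):
      q, r = divmod(q, 256)
      b.append(r)

  # Replace bytes 4:8 of the ffmpeg output with the actual size of the RIFF chunk.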
  riff = output[:4] + bytes(b) + output[8:]

  sr, audio = wav_read(io.BytesIO(riff))

  return audio, sr
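
The header fix in `get_audio` can be checked without ffmpeg or a microphone. The sketch below is an illustration, not part of the notebook: it builds a short WAV file in memory with the standard-library `wave` module, overwrites the size field with a placeholder (assumed here to stand in for a header whose size was never filled in), and then repairs it. The helper name `patch_riff_size` is hypothetical.

```python
import io
import struct
import wave


def patch_riff_size(wav_bytes: bytes) -> bytes:
    """Rewrite bytes 4:8 of a WAV file with the real RIFF chunk size."""
    # The RIFF chunk size excludes the 4-byte "RIFF" tag and the
    # 4-byte size field itself, hence the "- 8".
    riff_chunk_size = len(wav_bytes) - 8
    # "<I" packs an unsigned 32-bit integer in little-endian order,
    # the same byte sequence the divmod loop in get_audio produces.
    return wav_bytes[:4] + struct.pack("<I", riff_chunk_size) + wav_bytes[8:]


# Build a small valid WAV file in memory: 100 frames of 16-bit mono silence.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(16000)
    writer.writeframes(b"\x00\x00" * 100)
good = buffer.getvalue()

# Simulate a header whose size field holds a placeholder value.
broken = good[:4] + b"\xff\xff\xff\xff" + good[8:]

fixed = patch_riff_size(broken)
print(fixed == good)
```

`struct.pack("<I", n)` and `n.to_bytes(4, "little")` are interchangeable here; either could replace the manual loop if a shorter form is preferred.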

In [None]:
audio, sr = get_audio()

In [None]:
scipy.io.wavfile.write('recording.wav', sr, audio) 

# **Whisper for audio transcription**

A CUDA-enabled GPU is strongly recommended here; without one, transcription is very slow. If you are running on the CPU, use Whisper's "tiny" model.

In [None]:
# Use the GPU when one is available; otherwise fall back to the CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")

## Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed. 


|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
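
The table can be turned into a small lookup for choosing a model. The sketch below is an illustration only: `choose_whisper_model` is a hypothetical helper, the VRAM figures are the approximate values from the table, and the rule "use `tiny` on the CPU" is this notebook's advice rather than anything enforced by Whisper.

```python
from typing import Optional

# (multilingual model name, approximate VRAM in GB), smallest to largest,
# copied from the table above.
WHISPER_MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def choose_whisper_model(vram_gb: Optional[float], english_only: bool = False) -> str:
    """Pick the largest model whose approximate VRAM need fits in vram_gb.

    Pass vram_gb=None when running on the CPU; "tiny" is returned then.
    """
    name = "tiny"
    if vram_gb is not None:
        for candidate, required_gb in WHISPER_MODELS:
            if required_gb <= vram_gb:
                name = candidate
    # "large" has no English-only variant, so it keeps its plain name.
    if english_only and name != "large":
        name += ".en"
    return name


print(choose_whisper_model(None))
print(choose_whisper_model(6))
print(choose_whisper_model(16, english_only=True))
print(choose_whisper_model(2, english_only=True))
```

The returned name can be passed to `whisper.load_model` in place of a hard-coded model name.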

In [None]:
model = whisper.load_model("medium", device=DEVICE)
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

In [None]:
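# load the recording and pad/trim it to fit 30 seconds
audio = whisper.load_audio("recording.wav")
audio = whisper.pad_or_trim(audio)
# compute the log-Mel spectrogram and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)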
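
`whisper.pad_or_trim` fits the audio to the 30-second window that the decoder works on, so anything beyond the first 30 seconds of a recording is dropped. The plain-Python sketch below only imitates that behaviour on a list of samples to make the limit visible; it assumes Whisper's 16 kHz sample rate and is not the library's implementation. `pad_or_trim_sketch` is a hypothetical name.

```python
SAMPLE_RATE = 16_000          # assumed: Whisper resamples audio to 16 kHz
WINDOW_SECONDS = 30           # assumed: length of the window Whisper decodes
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


def pad_or_trim_sketch(samples, length=WINDOW_SAMPLES):
    """Return exactly `length` samples: cut the tail or pad with zeros."""
    if len(samples) > length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (SAMPLE_RATE * 5)    # 5 seconds of audio
long_clip = [0.1] * (SAMPLE_RATE * 45)    # 45 seconds of audio

padded = pad_or_trim_sketch(short_clip)
trimmed = pad_or_trim_sketch(long_clip)

print(len(padded), len(trimmed))
print("seconds lost from the long clip:", (len(long_clip) - len(trimmed)) / SAMPLE_RATE)
```

For recordings longer than 30 seconds, `model.transcribe("recording.wav")` is an alternative: it works through the file in successive 30-second windows and returns a dictionary whose `"text"` entry holds the full transcript.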

In [None]:
# decode the audio (half precision is only used on the GPU)
options = whisper.DecodingOptions(fp16=(DEVICE == "cuda"))
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

# **Detoxify Text-Classification Model**

In [None]:

# Each Detoxify model takes in either a string or a list of strings.
# For a single string it returns a dict mapping each label to a score
# between 0 and 1.
results = Detoxify('original').predict(result.text)

# One-row DataFrame: the columns are the labels, the row holds the scores.
df = pd.DataFrame(results, index=["score"])
print(df.round(5))

plt.figure(figsize=(10, 8))
sns.barplot(data=df, palette=sns.color_palette("pastel"))
plt.title('Stats')
plt.xlabel('Label')
plt.ylabel('Score (0 to 1)')
plt.show()
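
The scores are easier to act on once they are compared with a threshold. The sketch below is illustrative: `flag_labels` is a hypothetical helper, the 0.5 cut-off is an arbitrary assumption rather than a value recommended by Detoxify, and the sample scores are made up, not model output. The label names follow the list in the introduction; the keys Detoxify actually returns may be spelled differently.

```python
def flag_labels(scores, threshold=0.5):
    """Return the labels whose score is at or above threshold, highest first."""
    flagged = [(label, score) for label, score in scores.items() if score >= threshold]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in flagged]


# Made-up scores in the same shape as the `results` dictionary
# (label -> score between 0 and 1).
sample_scores = {
    "toxic": 0.91,
    "severe_toxic": 0.12,
    "obscene": 0.64,
    "threat": 0.03,
    "insult": 0.77,
    "identity_attack": 0.05,
}

print(flag_labels(sample_scores))
print(flag_labels(sample_scores, threshold=0.95))
```

In the notebook this would be called as `flag_labels(results)` after the prediction cell has run.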


# **Credits**
Athanasiadou Christina, https://www.linkedin.com/in/christina-athanasiadou-37288a246

Chatzigeorgiou Spiros, https://www.linkedin.com/in/spiros-chatzigeorgiou-797148201/


**Technologies used:**

https://github.com/openai/whisper

https://github.com/unitaryai/detoxify