<h2>VOICE TO VOICE RESPONSE</h2>
    <h3>1. VOICE TO TEXT</h3>
    <h3>2. TEXT TO TEXT</h3>
    <h3>3. TEXT TO VOICE</h3>
    
 <h4>Installation of dependencies and models</h4>

In [1]:
!pip install torch torchaudio transformers jiwer pydub
!pip install git+https://github.com/openai/whisper.git
!pip install webrtcvad


Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-dgsk8ssc
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-dgsk8ssc
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


<h2>1. VOICE TO TEXT</h2>
<h3>Importing the installed libraries and defining an audio transcriber class for the Whisper model</h3>

In [2]:
import webrtcvad
import numpy as np
from pydub import AudioSegment
import whisper
import os

class Transcriber:
    def __init__(self, model_name="base.en", vad_aggressiveness=1):
        self.model = whisper.load_model(model_name)
        self.vad = webrtcvad.Vad()
        # VAD aggressiveness: 0 (least) to 3 (most aggressive at filtering out non-speech)
        self.vad.set_mode(vad_aggressiveness)

    def read_audio(self, file_path):
        audio = AudioSegment.from_file(file_path)
        # webrtcvad expects mono 16-bit PCM; 16 kHz also matches Whisper's input rate
        audio = audio.set_channels(1).set_frame_rate(16000).set_sample_width(2)
        audio_data = np.array(audio.get_array_of_samples(), dtype=np.int16)

        return audio_data, 16000

    def apply_vad(self, audio, sample_rate):
        frame_duration = 30  # in ms
        frame_size = int(sample_rate * frame_duration / 1000)  
        segments = []

        for start in range(0, len(audio), frame_size):
            stop = min(start + frame_size, len(audio))
            frame = audio[start:stop]

            if len(frame) < frame_size:
                frame = np.pad(frame, (0, frame_size - len(frame)), 'constant')
            elif len(frame) > frame_size:
                frame = frame[:frame_size]

            if self.vad.is_speech(frame.tobytes(), sample_rate):
                segments.append(frame)

        if segments:
            detected_audio = np.concatenate(segments)
            detected_audio = detected_audio.astype(np.float32) / 32768.0
            return detected_audio
        else:
            return None  

    def transcribe(self, audio_file):
        audio, sample_rate = self.read_audio(audio_file)

        detected_audio = self.apply_vad(audio, sample_rate)

        if detected_audio is not None:
            result = self.model.transcribe(detected_audio, language="en")
            return result['text']
        else:
            return "No speech detected."
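
The framing arithmetic in `apply_vad` can be checked in isolation. The sketch below assumes the constraints that webrtcvad documents (16-bit mono PCM, sample rates of 8, 16, 32 or 48 kHz, frames of 10, 20 or 30 ms); the helper names `frame_bytes` and `split_frames` are hypothetical and not part of the class above.

```python
VALID_RATES = (8000, 16000, 32000, 48000)  # sample rates webrtcvad accepts
VALID_FRAME_MS = (10, 20, 30)              # frame durations webrtcvad accepts
BYTES_PER_SAMPLE = 2                       # 16-bit PCM


def frame_bytes(sample_rate, frame_ms):
    """Return the size in bytes of one VAD frame (hypothetical helper)."""
    if sample_rate not in VALID_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError(f"unsupported frame duration: {frame_ms} ms")
    samples = sample_rate * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE


def split_frames(pcm, sample_rate=16000, frame_ms=30):
    """Split raw PCM bytes into fixed-size frames, zero-padding the last one."""
    size = frame_bytes(sample_rate, frame_ms)
    frames = []
    for start in range(0, len(pcm), size):
        frame = pcm[start:start + size]
        if len(frame) < size:
            # Pad the trailing partial frame with silence, as apply_vad does
            frame += b"\x00" * (size - len(frame))
        frames.append(frame)
    return frames
```

At 16 kHz a 30 ms frame is 480 samples, or 960 bytes, which is the length `frame.tobytes()` has in the loop above.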


<h3>Converting voice to text and saving the transcription to a folder</h3>

In [4]:
transcriber = Transcriber(model_name="base.en", vad_aggressiveness=2)

transcription = transcriber.transcribe("/kaggle/input/voices/84-121123-0010.wav")

print("Transcription:", transcription)

output_folder = "/kaggle/working/transcriptions"
os.makedirs(output_folder, exist_ok=True)
output_file = os.path.join(output_folder, "transcription.txt")
with open(output_file, "w") as f:
    f.write(transcription)

print(f"Transcription saved to: {output_file}")

Transcription:  Nautier looked upon morale with one of those melancholy smiles which had so often made Valentine happy and thus fixed his attention.
Transcription saved to: /kaggle/working/transcriptions/transcription.txt
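
`jiwer` is installed in the first cell but is not used anywhere in this notebook. If the intent was to score the transcription against a reference transcript, word error rate (WER) is the usual metric, and `jiwer.wer(reference, hypothesis)` would be the library route. The dependency-free sketch below shows what that number means; the function name `word_error_rate` and the example strings are illustrative assumptions, not data from this notebook.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings only: one substituted word out of three
print(word_error_rate("the cat sat", "the bat sat"))
```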


<h2>2. TEXT TO TEXT</h2>
<h3>Importing the required transformers classes, loading the llama-7b model from huggyllama, and generating a response to the transcription saved in the previous step</h3>

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)

with open(output_file, "r") as file:
    transcription = file.read()

response = generator(transcription, max_length=150, num_return_sequences=1)
response_text = response[0]['generated_text']
print(f"LLM Response:\n{response_text}")

response_dir = "/kaggle/working/response"
os.makedirs(response_dir, exist_ok=True)
response_file = os.path.join(response_dir, "llm_response.txt")

with open(response_file, "w") as file:
    file.write(response_text)

print(f"LLM Response saved to {response_file}")


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


LLM Response:
 Nautier looked upon morale with one of those melancholy smiles which had so often made Valentine happy and thus fixed his attention.
"I am not a man of the world," he said, "and I have no experience of the world. I have never been in a town, and I have never been in a country house. I have never been in a drawing-room, and I have never been in a ball-room. I have never been in a theatre, and I have never been in a church. I have never been in a shop, and I have never been in a tavern. I have never been in a railway carriage, and I have never been in a boat. I have
LLM Response saved to /kaggle/working/response/llm_response.txt
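
The response above begins with the transcription itself, because the pipeline's `generated_text` includes the prompt by default. If only the model's continuation should be saved and spoken, the prompt can be trimmed afterwards. This is a minimal sketch with a hypothetical helper name, `strip_prompt`; passing `return_full_text=False` to the generator call should have a similar effect.

```python
def strip_prompt(prompt, generated_text):
    """Return only the continuation when the output echoes the prompt."""
    if generated_text.startswith(prompt):
        # Drop the echoed prompt and the whitespace that follows it
        return generated_text[len(prompt):].lstrip()
    # The prompt was not echoed verbatim, so keep the text as it is
    return generated_text.strip()
```

In the cell above this would be called as `strip_prompt(transcription, response_text)` before writing `llm_response.txt`.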


<h2>3. TEXT TO VOICE</h2>
<h3>Using `espeak-ng` for voice generation, with a function that takes user-supplied pitch, speed, and voice (used to select gender) parameters</h3>

In [2]:
!sudo apt-get install espeak-ng -y

Reading package lists... Done
Building dependency tree       
Reading state information... Done
espeak-ng is already the newest version (1.50+dfsg-6ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded.


In [10]:
import os
import subprocess
from IPython.display import Audio, display

def text_to_audio_espeak(text, output_file, pitch=70, speed=150, voice='en-us'):
    # Pass arguments as a list so the text needs no shell quoting
    command = ['espeak-ng', '-p', str(pitch), '-s', str(speed), '-v', voice, text, '--stdout']
    
    with open(output_file, 'wb') as audio_file:
        subprocess.run(command, stdout=audio_file, check=True)

    print(f"Audio saved to {output_file}")
    
    display(Audio(filename=output_file, autoplay=True))
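
Because `pitch` and `speed` come from the user, it can help to validate them before invoking `espeak-ng`. The sketch below only builds the argument list, so it can be checked without the binary installed. It assumes the documented `-p` range of 0 to 99; `build_espeak_command` is a hypothetical helper, not part of the function above.

```python
def build_espeak_command(text, pitch=70, speed=150, voice="en-us"):
    """Validate user parameters and return the espeak-ng argument list."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if not 0 <= pitch <= 99:
        # espeak-ng documents -p as a value from 0 to 99
        raise ValueError("pitch must be between 0 and 99")
    if speed <= 0:
        raise ValueError("speed (words per minute) must be positive")
    # Same flags and ordering as text_to_audio_espeak above
    return ["espeak-ng", "-p", str(pitch), "-s", str(speed),
            "-v", voice, text, "--stdout"]
```

The returned list can be passed straight to `subprocess.run(..., stdout=audio_file, check=True)`.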


<h1>Final conversion to voice, taking input from the user and displaying the audio</h1>

In [30]:
import subprocess
from IPython.display import Audio, display
import os


with open('/kaggle/working/response/llm_response.txt', 'r') as file:
    text = file.read()
gender = input("enter gender:").strip().lower()
os.makedirs("voice", exist_ok=True)

# Pick the espeak-ng voice and the output file name from the requested gender
if gender == "male":
    text_to_audio_espeak(text, 'voice/output_male.wav', pitch=70, speed=150, voice='en-us')
elif gender == "female":
    text_to_audio_espeak(text, 'voice/output_female.wav', pitch=70, speed=150, voice='en-us+f2')
else:
    raise ValueError("Invalid gender specified. Choose 'male' or 'female'.")


enter gender: female


Audio saved to voice/output_female.wav


In [3]:
import webrtcvad
import numpy as np
from pydub import AudioSegment
import whisper
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, TextGenerationPipeline
from IPython.display import Audio, display
import subprocess

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)

class Transcriber:
    def __init__(self, model_name="base.en", vad_aggressiveness=1):
        self.model = whisper.load_model(model_name)
        self.vad = webrtcvad.Vad()
        self.vad.set_mode(vad_aggressiveness)  

    def read_audio(self, file_path):
        audio = AudioSegment.from_file(file_path)
        audio = audio.set_channels(1).set_frame_rate(16000)
        audio_data = np.array(audio.get_array_of_samples(), dtype=np.int16)
        return audio_data, 16000  

    def apply_vad(self, audio, sample_rate):
        frame_duration = 30  
        frame_size = int(sample_rate * frame_duration / 1000)  
        segments = []

        for start in range(0, len(audio), frame_size):
            stop = min(start + frame_size, len(audio))
            frame = audio[start:stop]

            if len(frame) < frame_size:
                frame = np.pad(frame, (0, frame_size - len(frame)), 'constant')
            elif len(frame) > frame_size:
                frame = frame[:frame_size]

            if self.vad.is_speech(frame.tobytes(), sample_rate):
                segments.append(frame)

        if segments:
            detected_audio = np.concatenate(segments)
            detected_audio = detected_audio.astype(np.float32) / 32768.0
            return detected_audio
        else:
            return None  

    def transcribe(self, audio_file):
        audio, sample_rate = self.read_audio(audio_file)
        detected_audio = self.apply_vad(audio, sample_rate)

        if detected_audio is not None:
            result = self.model.transcribe(detected_audio, language="en")
            return result['text']
        else:
            return "No speech detected."

def text_to_text(transcription_file, response_file, max_length=150, num_return_sequences=1):
    with open(transcription_file, "r") as file:
        transcription = file.read()

    response = generator(transcription, max_length=max_length, num_return_sequences=num_return_sequences)
    with open(response_file, "w") as file:
        file.write(response[0]['generated_text'])
    print(f"LLM Response saved to {response_file}")

def text_to_audio_espeak(text, output_file, pitch=70, speed=150, voice='en-us'):
    command = ['espeak-ng', '-p', str(pitch), '-s', str(speed), '-v', voice, text, '--stdout']
    with open(output_file, 'wb') as audio_file:
        subprocess.run(command, stdout=audio_file, check=True)
    print(f"Audio saved to {output_file}")
    display(Audio(filename=output_file, autoplay=True))

def voice(response_file, speed, pitch, gender):
    with open(response_file, 'r') as file:
        text = file.read()
    
    os.makedirs("voice", exist_ok=True)
    
    if gender == "male":
        output_file = 'voice/output_male.wav'
        voice_option = 'en-us'
    elif gender == "female":
        output_file = 'voice/output_female.wav'
        voice_option = 'en-us+f2'
    else:
        raise ValueError("Invalid gender specified. Choose 'male' or 'female'.")
    
    text_to_audio_espeak(text, output_file, pitch=pitch, speed=speed, voice=voice_option)

def final_pipeline(audio_file, gender, speed=150, pitch=70):
    transcriber = Transcriber()
    transcription = transcriber.transcribe(audio_file)
    
    transcription_file = 'transcription.txt'
    with open(transcription_file, 'w') as file:
        file.write(transcription)

    response_file = 'response.txt'
    text_to_text(transcription_file, response_file)
    
    voice(response_file, speed, pitch, gender)


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
final_pipeline("/kaggle/input/voices/84-121123-0010.wav", 'male', 150, 70)

  checkpoint = torch.load(fp, map_location=device)
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


LLM Response saved to response.txt
Audio saved to voice/output_male.wav
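
One edge case in `final_pipeline`: when the VAD finds no speech, `Transcriber.transcribe` returns the string "No speech detected.", and that string is then passed to the LLM as if it were a prompt. The sketch below shows one way to stop early instead. The three stages are passed in as callables so the control flow can be exercised without loading any model; `run_pipeline` and its parameter names are hypothetical.

```python
NO_SPEECH = "No speech detected."  # sentinel returned by Transcriber.transcribe


def run_pipeline(audio_file, transcribe, generate, speak):
    """Run voice -> text -> text -> voice, skipping the LLM when nothing was said."""
    transcription = transcribe(audio_file)
    if not transcription.strip() or transcription == NO_SPEECH:
        # Nothing usable was transcribed, so do not prompt the LLM
        return None
    response = generate(transcription)
    speak(response)
    return response


# Stand-in stages for illustration; real ones would wrap Whisper, the LLM and espeak-ng
spoken = []
print(run_pipeline("silence.wav", lambda path: NO_SPEECH, str.upper, spoken.append))
print(run_pipeline("speech.wav", lambda path: "hello", str.upper, spoken.append))
```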
