# AI ML Assignment 2: Voice Cloning

Create a voice cloning model that can generate a synthetic voice that sounds like a specific person. The model should be able to generate speech from text input, and it should be able to reproduce the unique vocal characteristics of the target speaker.

# Solution:

1) To accomplish this task, I used BARK, a text-to-audio model.
2) BARK converts text to audio and lets you choose the voice the speech is rendered in.
3) BARK does not support custom voice cloning, so the target speaker's voice is approximated with the closest of its built-in speaker presets.
4) The model can also produce nonverbal sounds such as laughing, sighing, and crying.
5) BARK provides 100+ speaker presets across its supported languages.
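Presets are selected by name. As a rough sketch, assuming the `v2/<language>_speaker_<index>` naming pattern seen in `v2/en_speaker_6` and ten presets (indices 0 to 9) per language, the candidate names for one language can be listed as follows. `preset_names` is a hypothetical helper, not part of BARK, and the count of ten is an assumption to check against BARK's speaker library.

```python
def preset_names(lang="en", count=10):
    """Build candidate preset names such as 'v2/en_speaker_6'.

    The naming pattern and the default count are assumptions.
    """
    return [f"v2/{lang}_speaker_{i}" for i in range(count)]


names = preset_names("en")
print(names[6])
```

Listening to each candidate and keeping the one closest to the target speaker is the manual step that stands in for real cloning here.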

# Hardware Requirements

1. BARK runs on both CPU and GPU.
2. The full version of BARK requires about 12 GB of VRAM.
3. For GPUs with limited VRAM (e.g., 8 GB), a smaller version of the BARK model is available.
4. The smaller version can also run on GPUs with less than 4 GB of VRAM, or on the CPU alone.


Note: Running the full version of BARK on GPUs with limited VRAM (less than 4GB) is possible, but the model needs to handle memory constraints during inference. In such cases, the GPU may need to swap data between VRAM and system RAM, which can lead to slower processing times.
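The guidance above can be turned into a small lookup. This is only a sketch: `suggest_bark_env` is a hypothetical helper, and the 12 GB and 8 GB thresholds are taken from the figures in this section rather than from limits enforced by BARK.

```python
def suggest_bark_env(vram_gb):
    """Map available VRAM (in GB) to BARK environment flags.

    Thresholds follow the figures quoted in this README and are
    assumptions, not limits enforced by BARK.
    """
    if vram_gb >= 12:
        # Enough memory for the full models, all resident on the GPU.
        return {"SUNO_USE_SMALL_MODELS": "False", "SUNO_OFFLOAD_CPU": "False"}
    if vram_gb >= 8:
        # Small models fit on the GPU without offloading.
        return {"SUNO_USE_SMALL_MODELS": "True", "SUNO_OFFLOAD_CPU": "False"}
    # Little or no VRAM: small models plus offloading of idle models.
    return {"SUNO_USE_SMALL_MODELS": "True", "SUNO_OFFLOAD_CPU": "True"}


print(suggest_bark_env(6))
```

The returned dictionary can be passed to `os.environ.update(...)` before `bark` is imported.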


# Code Implementation

Step 1)

Install the BARK model:
```
pip install git+https://github.com/suno-ai/bark.git
```

Step 2)
By default, BARK runs on the GPU.
When I ran the code on my local machine, however, it did not use the GPU. To make sure it runs on the GPU (for faster results), follow the steps below and check whether the GPU is available.

Note: Users with 12 GB of VRAM or more can run the full version of the BARK model.
Users with less than 12 GB of VRAM can skip this step.

To make use of the GPU, complete the steps below:
```
pip uninstall torch torchvision
```
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117 
```

Run the code given below to check if GPU is available.
```
import torch
print(torch.cuda.is_available())  # prints True when a CUDA GPU is available
```
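If the check is needed inside a script rather than a notebook cell, a slightly more defensive variant may help. The sketch below uses a hypothetical helper, `describe_device`, which also copes with PyTorch not being installed; it only reports the device and does not change how BARK picks one.

```python
def describe_device():
    """Return 'cuda' when PyTorch can see a CUDA GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used from Python.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_device())
```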

Step 3)

Next, import the necessary modules and packages for using BARK:

```
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio, display
import nltk 
import numpy as np
```

Explanation

---
1. ```SAMPLE_RATE ```
* It is a constant that holds the audio sampling rate used by the BARK model.
* The sampling rate is the number of audio samples per second and affects the quality of the synthesized audio. BARK generates audio at 24 kHz, so `SAMPLE_RATE` is 24000.
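To make the relationship between samples and seconds concrete, here is a small arithmetic sketch. It assumes a 24 kHz rate and defines its own `SAMPLE_RATE` as a stand-in so that it runs without BARK installed; the two helper functions are hypothetical.

```python
SAMPLE_RATE = 24_000  # stand-in for bark.SAMPLE_RATE (assumed 24 kHz)


def seconds_to_samples(seconds, rate=SAMPLE_RATE):
    """Number of samples needed to cover a duration."""
    return int(seconds * rate)


def samples_to_seconds(n_samples, rate=SAMPLE_RATE):
    """Duration in seconds of an audio array with n_samples entries."""
    return n_samples / rate


print(seconds_to_samples(0.25))    # 6000 samples in a quarter second
print(samples_to_seconds(48_000))  # 2.0 seconds
```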

<br>

2. ```generate_audio(text_prompt, history_prompt=None)```

* `generate_audio` is a function from the bark library that converts the given `text_prompt` into audio and returns it as a NumPy array.

* `text_prompt` is the text that you want to convert into speech.
* `history_prompt` is optional. It supplies a voice context for the generation; in this project it is the name of a speaker preset (for example `"v2/en_speaker_6"`), which sets the voice and tone of the generated speech.

<br>

3. ```preload_models()```
* It is a function from the bark library that downloads (on first use) and loads all the models BARK needs.
* Calling it before `generate_audio` moves the download and loading time up front. If it is skipped, the models are loaded on the first call to `generate_audio`.

<br>

4. ```IPython.display.Audio```
* It is a class from IPython, not part of the BARK model itself.
* It displays audio data as an interactive audio player in a Colab or Jupyter notebook.

<br>

5. ```from scipy.io.wavfile import write as write_wav``` (OPTIONAL)
* The `write_wav` function writes audio data to a WAV (Waveform Audio File Format) file.
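With SciPy, saving the result is a single call of the form `write_wav("out.wav", SAMPLE_RATE, audio_array)`. For readers who want to see what such a file contains, the sketch below writes a mono 16-bit WAV with Python's standard `wave` module instead of SciPy. The helper `write_pcm16_wav`, the file name, the 24 kHz rate, and the test tone that stands in for BARK's output are all assumptions for illustration.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # stand-in for bark.SAMPLE_RATE (assumed 24 kHz)


def write_pcm16_wav(path, samples, rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))


# Half a second of a 440 Hz tone stands in for the generated speech.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]
write_pcm16_wav("demo.wav", tone)
```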

<br>

6. ```import nltk```
* We will use it to split the script into sentences.

---
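`nltk.sent_tokenize` relies on NLTK's Punkt tokenizer data, which has to be downloaded once before first use. To illustrate what the splitting step produces without that dependency, here is a naive regular-expression stand-in. `naive_sentences` is a hypothetical helper and is much cruder than NLTK: it will, for example, split wrongly after abbreviations such as "Dr.".

```python
import re


def naive_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(naive_sentences("Hey Raj, John here. Thank you so much! See you soon."))
```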
<br>

Step 4) Environment Variables
```
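import os

# Set these before importing bark; the flags are read when the library is loaded.
# SUNO_OFFLOAD_CPU:
#   "True"  = keep idle models in system RAM and move each one to the GPU only while it is in use (saves VRAM)
#   "False" = keep all models on the compute device
# SUNO_USE_SMALL_MODELS:
#   "True"  = load the smaller versions of the models
os.environ["SUNO_OFFLOAD_CPU"] = "False"
os.environ["SUNO_USE_SMALL_MODELS"] = "True"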
```
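Environment variables are plain strings, so a misspelled value such as `"Flase"` raises no error. The sketch below shows one way such flags are commonly parsed, assuming a case-insensitive match against a small set of true-like strings. `env_flag` and `TRUE_STRINGS` are hypothetical and only approximate how BARK may read its settings.

```python
import os

TRUE_STRINGS = {"true", "1", "t"}  # assumed set of accepted "true" spellings


def env_flag(name, default="False"):
    """Read an environment variable as a boolean flag."""
    return os.environ.get(name, default).lower() in TRUE_STRINGS


os.environ["SUNO_OFFLOAD_CPU"] = "Flase"  # typo: silently treated as False
os.environ["SUNO_USE_SMALL_MODELS"] = "True"

print(env_flag("SUNO_OFFLOAD_CPU"))
print(env_flag("SUNO_USE_SMALL_MODELS"))
```

Under this parsing rule, a typo quietly disables the flag rather than failing, so it is worth double-checking the spelling of both values.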
<br>


Step 5) Rest of the code



```
import os

# Set the flags before importing bark.
os.environ["SUNO_OFFLOAD_CPU"] = "True"
os.environ["SUNO_USE_SMALL_MODELS"] = "True"

from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio
import nltk
import numpy as np

# Optional: download and load the models up front.
# preload_models()

script = """
Hey Raj,
John here from the Recruitment Team at OpeninApp. 
Thank you so much for expressing interest and applying for the role of AI Engineer Intern.
As part of the next steps in our screening process, please complete the technical assessment attached in the PDF below and share a screen recording explaining your code.
You can choose 1 of the 2 assignments below. Deadlines are mentioned within each file. We look forward to hearing from you once the assessment is completed.
""".replace("\n", " ").strip()

# Split the script into a list of individual sentences.
# sent_tokenize needs NLTK's Punkt data; if it is missing, download it once, e.g. nltk.download("punkt").
sentences = nltk.sent_tokenize(script)

# Select the voice preset.
SPEAKER = "v2/en_speaker_6"

# A quarter second of silence, inserted after each sentence as a pause.
silence = np.zeros(int(0.25 * SAMPLE_RATE))

# Generate each sentence and append it, followed by the pause, to the list of audio pieces.
pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]

# Join the pieces and play the result in the notebook.
Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
```

Output:
```
100%|██████████| 313/313 [00:02<00:00, 108.74it/s]
100%|██████████| 16/16 [00:08<00:00,  1.97it/s]
100%|██████████| 426/426 [00:03<00:00, 108.99it/s]
100%|██████████| 22/22 [00:11<00:00,  1.86it/s]
100%|██████████| 608/608 [00:05<00:00, 113.48it/s]
100%|██████████| 31/31 [00:15<00:00,  2.04it/s]
100%|██████████| 299/299 [00:02<00:00, 114.89it/s]
100%|██████████| 15/15 [00:07<00:00,  2.04it/s]
100%|██████████| 142/142 [00:01<00:00, 114.43it/s]
100%|██████████| 8/8 [00:03<00:00,  2.17it/s]
100%|██████████| 290/290 [00:02<00:00, 111.15it/s]
100%|██████████| 15/15 [00:07<00:00,  2.03it/s]
```
