# Install All Required Libraries
Description:

This cell installs all Python libraries needed for the pipeline:

librosa → audio processing + bio-markers

transformers & torch → DistilRoBERTa emotion model

sentencepiece → tokenizer support

gTTS → AI voice output

soundfile → audio loading

whisper → open-source speech-to-text (installed in its own cell below, not by this one)

In [None]:
!pip install librosa transformers torch sentencepiece gtts
!pip install soundfile

Collecting gtts
  Downloading gTTS-2.5.4-py3-none-any.whl.metadata (4.1 kB)
Collecting click<8.2,>=7.1 (from gtts)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Downloading gTTS-2.5.4-py3-none-any.whl (29 kB)
Downloading click-8.1.8-py3-none-any.whl (98 kB)
Installing collected packages: click, gtts
  Attempting uninstall: click
    Found existing installation: click 8.3.1
    Uninstalling click-8.3.1:
      Successfully uninstalled click-8.3.1
Successfully installed click-8.1.8 gtts-2.5.4
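
After installation it can help to confirm which versions pip actually resolved, for instance because the log above shows click being downgraded from 8.3.1 to 8.1.8 to satisfy gTTS. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install command:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names as passed to pip in the cell above
for name, ver in installed_versions(
    ["librosa", "transformers", "torch", "sentencepiece", "gTTS", "soundfile"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```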


# Import All Libraries
Description:

Imports required modules into Python.
These are used for audio processing, ML models, TTS, and audio playback.

In [None]:
import librosa
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from gtts import gTTS
import soundfile as sf
from IPython.display import Audio, display

# Upload Audio File
Description:

Allows you to upload an audio file (.wav, .mp3) from your device.
Colab will load the file and play it so you can confirm it uploaded correctly.

In [None]:
from google.colab import files

uploaded = files.upload()

audio_path = list(uploaded.keys())[0]
print("Uploaded:", audio_path)
display(Audio(audio_path))


Saving user_audio.wav to user_audio.wav
Uploaded: user_audio.wav
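
files.upload() returns a dict keyed by filename, and taking the first key assumes that exactly one audio file was chosen. A small guard can make that assumption explicit. The sketch below is illustrative only: `pick_audio_file` and the accepted-extension set are assumptions based on the formats named in the description.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3"}  # formats named in the description above

def pick_audio_file(filenames):
    """Return the first filename with a supported audio extension."""
    names = list(filenames)
    for name in names:
        if Path(name).suffix.lower() in ALLOWED_EXTENSIONS:
            return name
    raise ValueError(f"No .wav or .mp3 file among: {names}")

# In the notebook this would be: audio_path = pick_audio_file(uploaded.keys())
print(pick_audio_file(["notes.txt", "user_audio.wav"]))
```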


# Install and Load Whisper for Speech-to-Text
Description:

Installs open-source Whisper

Loads the small model for fast GPU inference

Converts your audio into text

Prints the transcribed text

In [None]:
!pip install git+https://github.com/openai/whisper.git
import whisper

# load small/medium/large depending on GPU
model = whisper.load_model("small")

result = model.transcribe(audio_path)
text = result["text"]

print("TRANSCRIBED TEXT:\n", text)


Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-jlhqy9uo
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-jlhqy9uo
  Resolved https://github.com/openai/whisper.git to commit c0d2f624c09dc18e709e37c2ad90c039a4eb72a2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: openai-whisper
  Building wheel for openai-whisper (pyproject.toml) ... done
  Created wheel for openai-whisper: filename=openai_whisper-20250625-py3-none-any.whl size=803979 sha256=3791cc7fe65777cf5716796a7a11f5f5608d4cd86fd37d82bb1a3c4f20ed7370
  Stored in directory: /tmp/pip-ephem-wheel-cache-ybxo74dm/wheels/c3/03/25/5e0ba78bc27a3a089f137c9f1d92fdfce16d06996c071a016c
Successfully built openai-whisper
Installing collec

100%|███████████████████████████████████████| 461M/461M [00:09<00:00, 51.3MiB/s]


TRANSCRIBED TEXT:
  Hello, my name is Sheikh Zain and I am the student of BS Computer Science in Sarkozy University of 2025 Batch.
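
Besides `text`, the dictionary returned by `transcribe` carries a `segments` list whose entries have `start` and `end` times in seconds. That makes a words-per-second estimate possible directly from the transcript. A hedged sketch; `words_per_second` is a hypothetical helper, and the sample dictionary is made up for illustration rather than taken from the run above:

```python
def words_per_second(result):
    """Estimate speaking rate from a Whisper-style result dict."""
    segments = result.get("segments", [])
    # Count only time covered by segments, so silence between them is ignored
    spoken = sum(seg["end"] - seg["start"] for seg in segments)
    words = sum(len(seg["text"].split()) for seg in segments)
    return words / spoken if spoken > 0 else 0.0

# Made-up example with the same keys Whisper uses
sample = {
    "text": " Hello there. How are you today?",
    "segments": [
        {"start": 0.0, "end": 1.0, "text": " Hello there."},
        {"start": 1.5, "end": 3.5, "text": " How are you today?"},
    ],
}
print(words_per_second(sample))
```

In the notebook, the call would be `words_per_second(result)` on the dictionary returned by `model.transcribe`.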


# Extract Bio-Markers From Voice
Description:

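This cell defines and runs a function that extracts non-linguistic (acoustic) features often associated with emotional state. They are rough indicators, not diagnoses:

Pitch → low or flat pitch is associated with low mood; raised pitch with anxiety

Energy → low energy is associated with sadness

Speech rate → fast speech is associated with anxiety; slow speech with low mood

MFCC → spectral-envelope summary, commonly used as an input feature for emotion classifiers

Jitter/Shimmer → instability in pitch and amplitude, which can rise under stress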

In [None]:
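def extract_biomarkers(audio_path):
    # Load as mono and resample to 16 kHz so every feature shares one sample rate
    y, sr = librosa.load(audio_path, sr=16000)

    # Pitch (coarse proxy, Hz): mean over every non-zero piptrack bin,
    # i.e. all pitch candidates rather than one f0 value per frame
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
    candidates = pitches[pitches > 0]
    pitch = np.mean(candidates) if candidates.size else 0.0

    # Energy: mean squared amplitude
    energy = np.mean(y ** 2)

    # Speech rate (approx): beat-tracker tempo converted to beats per second.
    # Newer librosa versions return tempo as a 1-element array, so take the scalar.
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    speech_rate = float(np.atleast_1d(tempo)[0]) / 60

    # MFCC (13 coefficients), averaged over time
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_mean = np.mean(mfcc, axis=1)

    # Jitter (proxy): mean frame-to-frame change of piptrack values in the
    # frequency bins that hold pitch candidates
    voiced_idx = np.where(pitches > 0)[0]
    if len(voiced_idx) > 1:
        diffs = np.abs(np.diff(pitches[voiced_idx]))
        jitter = np.mean(diffs)
    else:
        jitter = 0

    # Shimmer (proxy): mean sample-to-sample change in absolute amplitude
    amp = np.abs(y)
    shimmer = np.mean(np.abs(np.diff(amp)))

    return {
        "pitch": float(pitch),
        "energy": float(energy),
        "speech_rate": float(speech_rate),
        "mfcc_mean": mfcc_mean.tolist(),
        "jitter": float(jitter),
        "shimmer": float(shimmer)
    }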

bio = extract_biomarkers(audio_path)
bio



{'pitch': 932.8074340820312,
 'energy': 0.00035945637500844896,
 'speech_rate': 2.8409090909090913,
 'mfcc_mean': [-435.3916015625,
  154.78094482421875,
  -25.2030029296875,
  9.387840270996094,
  9.780198097229004,
  7.9838762283325195,
  -13.282122611999512,
  -6.1007184982299805,
  -5.765298843383789,
  -3.474027633666992,
  -9.137146949768066,
  -6.787515163421631,
  -7.439647674560547],
 'jitter': 63.821990966796875,
 'shimmer': 0.0022441460750997066}
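
Jitter is commonly defined on a single fundamental-frequency (f0) track: the mean absolute difference between consecutive pitch periods, divided by the mean period. The sketch below shows that definition in plain Python. It is not what `extract_biomarkers` computes; `relative_jitter` is a hypothetical helper, and the f0 values would have to come from a per-frame pitch tracker (for example `librosa.pyin`, an assumption that the original cell does not use). Frame-level f0 is coarser than true cycle-to-cycle measurement, so the result is an approximation.

```python
def relative_jitter(f0_hz):
    """Mean absolute change between consecutive periods, relative to the mean period."""
    # Keep voiced frames only; unvoiced frames are assumed to be 0, None, or NaN
    periods = [1.0 / f for f in f0_hz if f and f == f and f > 0]
    if len(periods) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_period = sum(periods) / len(periods)
    return (sum(diffs) / len(diffs)) / mean_period

print(relative_jitter([200.0, 200.0, 200.0]))  # steady pitch
print(relative_jitter([200.0, 210.0, 195.0]))  # fluctuating pitch
```

Because the value is a ratio rather than a difference in Hz, a threshold chosen for it would not depend on the speaker's overall pitch.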

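# Load DistilRoBERTa Emotion Classifier
Description:

Loads j-hartmann/emotion-english-distilroberta-base, a pre-trained Hugging Face model that scores English text against seven emotion labels.
It runs on the Whisper-generated transcript; with top_k=None the pipeline returns a score for every label.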

In [None]:
model_name = "j-hartmann/emotion-english-distilroberta-base"

classifier = pipeline(
    "text-classification",
    model=model_name,
    tokenizer=model_name,
    top_k=None
)

emotion_output = classifier(text)
emotion_output


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0


[[{'label': 'neutral', 'score': 0.7312163710594177},
  {'label': 'surprise', 'score': 0.1210232824087143},
  {'label': 'joy', 'score': 0.11274832487106323},
  {'label': 'fear', 'score': 0.015405284240841866},
  {'label': 'sadness', 'score': 0.011397955939173698},
  {'label': 'anger', 'score': 0.00499421963468194},
  {'label': 'disgust', 'score': 0.0032145255245268345}]]
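
With `top_k=None` and a single input string, the pipeline output shown above is a list containing one inner list of label/score dictionaries. Code that consumes it needs to unwrap that outer list first. A minimal sketch, with `top_emotion` as a hypothetical helper and the scores copied (rounded) from the output above:

```python
def top_emotion(pipeline_output):
    """Return (label, score) of the highest-scoring emotion for the first input."""
    first = pipeline_output[0]
    # Nested form [[{...}, ...]] -> use the inner list; flat form [{...}, ...] -> use as is
    scores = first if isinstance(first, list) else pipeline_output
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]

sample_output = [[
    {"label": "neutral", "score": 0.7312},
    {"label": "surprise", "score": 0.1210},
    {"label": "joy", "score": 0.1127},
    {"label": "fear", "score": 0.0154},
    {"label": "sadness", "score": 0.0114},
    {"label": "anger", "score": 0.0050},
    {"label": "disgust", "score": 0.0032},
]]
print(top_emotion(sample_output))  # ('neutral', 0.7312)
```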

# Fuse Bio-Markers + Text Emotion
Description:

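This cell combines voice bio-markers with the text-based emotion predictions using simple rules.

Text scores for sadness, fear, and anger seed the sadness, anxiety, and stress scores

Bio-marker thresholds add fixed increments on top

The result is the final sadness, anxiety, and stress scores (not capped at 1)

This is your Fusion Model (v1)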

In [None]:
def fuse_emotion(text_emotion, bio):
    sadness_score = 0
    anxiety_score = 0
    stress_score = 0

    # --- Normalize emotion input ---
    # Case 1: Single dict → convert to list
    if isinstance(text_emotion, dict):
        text_emotion = [text_emotion]

    # Case 2: Nested pipeline output ([[{...}, ...]] with top_k=None) or
    # (label, score) pairs → convert to a flat list of dicts
    normalized = []
    for emo in text_emotion:
        if isinstance(emo, dict):  # normal case
            normalized.append(emo)
        elif isinstance(emo, (list, tuple)) and len(emo) == 2 and isinstance(emo[0], str):
            normalized.append({"label": emo[0], "score": emo[1]})  # ["sadness", 0.8]
        elif isinstance(emo, (list, tuple)):  # one input's list of label dicts
            normalized.extend(e for e in emo if isinstance(e, dict))
    text_emotion = normalized

    # --- Emotion fusion ---
    for emo in text_emotion:
        label = emo["label"].lower()
        score = emo["score"]

        if label == "sadness":
            sadness_score += score
        if label == "fear":
            anxiety_score += score
        if label == "anger":
            stress_score += score

    # Bio-metric signals
    if bio.get("pitch", 0) < 120:
        sadness_score += 0.2
    if bio.get("energy", 1) < 0.005:
        sadness_score += 0.2

    if bio.get("speech_rate", 0) > 4:
        anxiety_score += 0.3

    if bio.get("jitter", 0) > 5:
        stress_score += 0.3

    return {
        "sadness": float(sadness_score),
        "anxiety": float(anxiety_score),
        "stress": float(stress_score)
    }


fusion = fuse_emotion(emotion_output, bio)
fusion


{'sadness': 0.2, 'anxiety': 0.0, 'stress': 0.3}
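
The bio-marker part of the fusion is a set of fixed threshold rules, so its increments can be checked by hand. The sketch below restates the four rules as a table and applies them to the bio-marker values printed earlier (rounded). `BIO_RULES` and `bio_increments` are hypothetical names introduced for illustration; the thresholds and increments are the ones in `fuse_emotion`.

```python
# (feature, test, target score, increment), mirroring the thresholds in fuse_emotion
BIO_RULES = [
    ("pitch",       lambda v: v < 120,   "sadness", 0.2),
    ("energy",      lambda v: v < 0.005, "sadness", 0.2),
    ("speech_rate", lambda v: v > 4,     "anxiety", 0.3),
    ("jitter",      lambda v: v > 5,     "stress",  0.3),
]

def bio_increments(bio):
    """Sum the increments of every rule that fires; assumes all four features are present."""
    totals = {"sadness": 0.0, "anxiety": 0.0, "stress": 0.0}
    for feature, fires, target, increment in BIO_RULES:
        if fires(bio[feature]):
            totals[target] += increment
    return totals

bio_sample = {"pitch": 932.8, "energy": 0.00036, "speech_rate": 2.84, "jitter": 63.8}
print(bio_increments(bio_sample))  # {'sadness': 0.2, 'anxiety': 0.0, 'stress': 0.3}
```

Only the energy and jitter rules fire for this recording, which accounts for the 0.2 and 0.3 in the fusion output above.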

# Generate Voice Response Using gTTS
Description:

Creates an audio response that includes both the detected emotions and the user’s spoken text.

In [None]:
response = f"I detected {fusion}. You said: {text}"

tts = gTTS(text=response, lang='en')
tts.save("response.mp3")
display(Audio("response.mp3"))
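
Interpolating the `fusion` dictionary directly into the f-string passes its literal representation, including braces, quotes, and long decimals, to gTTS. One option is to turn the scores into a short phrase first. A minimal sketch; `describe_scores` is a hypothetical helper and the percent rounding is an arbitrary choice:

```python
def describe_scores(scores):
    """Turn {'sadness': 0.2, ...} into a short phrase suitable for text-to-speech."""
    parts = [f"{name} at {round(value * 100)} percent" for name, value in scores.items()]
    return ", ".join(parts)

fusion_sample = {"sadness": 0.2, "anxiety": 0.0, "stress": 0.3}
sample_response = f"I detected {describe_scores(fusion_sample)}. You said: hello."
print(sample_response)
```

In the notebook, the same helper could be applied to `fusion` before building `response`.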


# Final Output Summary
Description:

Prints everything clearly:

Whisper transcript

Extracted bio-markers

DistilRoBERTa text emotion

Final fusion scores

Plays final AI voice response


In [None]:
print("\n===== FINAL EMOTION ANALYSIS =====\n")
print("Text:", text)
print("\nBio-Markers:", bio)
print("\nText Emotion:", emotion_output)
print("\nFusion Model Emotion:", fusion)

display(Audio("response.mp3"))



===== FINAL EMOTION ANALYSIS =====

Text:  Hello, my name is Sheikh Zain and I am the student of BS Computer Science in Sarkozy University of 2025 Batch.

Bio-Markers: {'pitch': 932.8074340820312, 'energy': 0.00035945637500844896, 'speech_rate': 2.8409090909090913, 'mfcc_mean': [-435.3916015625, 154.78094482421875, -25.2030029296875, 9.387840270996094, 9.780198097229004, 7.9838762283325195, -13.282122611999512, -6.1007184982299805, -5.765298843383789, -3.474027633666992, -9.137146949768066, -6.787515163421631, -7.439647674560547], 'jitter': 63.821990966796875, 'shimmer': 0.0022441460750997066}

Text Emotion: [[{'label': 'neutral', 'score': 0.7312163710594177}, {'label': 'surprise', 'score': 0.1210232824087143}, {'label': 'joy', 'score': 0.11274832487106323}, {'label': 'fear', 'score': 0.015405284240841866}, {'label': 'sadness', 'score': 0.011397955939173698}, {'label': 'anger', 'score': 0.00499421963468194}, {'label': 'disgust', 'score': 0.0032145255245268345}]]

Fusion Model Emotion: