<a href="https://colab.research.google.com/github/soumyajoykundu/Applied-Machine-Learning-2025/blob/main/Assignments/Project/Code%20files/SonicShield_Transfer_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### **AML Project Work**

**Team AudioSentinels**
1. Chandranath Bhattacharya -- MDS202318
2. Salokya Deb -- MDS202341
3. Soumyajoy Kundu -- MDS202349

**$$\text{SonicShield: AI-Powered Guardian Against DeepFake Speech}$$**


This notebook is the third and final part of our work, in which we explore a transformer-based model. It covers the following:
1. Transfer Learning using DistilBERT
2. Developing Apps
  * Streamlit
  * Flask

*Note*: Both apps serve the model fine-tuned on our dataset.

Link to the data : [Kaggle](https://www.kaggle.com/datasets/birdy654/deep-voice-deepfake-voice-recognition/data)

In [None]:
!pip install librosa pandas numpy scikit-learn torch transformers datasets joblib wandb


In [None]:
!pip install --upgrade transformers



### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import librosa
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import Trainer, TrainingArguments, AutoModel
import wandb

import warnings
warnings.filterwarnings("ignore")

### Initialising Weights & Biases

In [None]:
# Initialize W&B (authenticate with `wandb login` or the WANDB_API_KEY environment
# variable rather than keeping the API key in the notebook)
wandb.init(project="audio_classification", name="audio_classifier_run")

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: soumyajoykundu (soumyajoykundu-chennai-mathematical-institute) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


### Pipeline

#### 1. Extracting Features from an Audio File

In [None]:
def extract_features(file_path, sr=22050, n_mfcc=20):
    """
    Extract audio features matching the CSV format from a .wav file.
    Args:
        file_path (str): Path to the audio file.
        sr (int): Sample rate.
        n_mfcc (int): Number of MFCC coefficients.
    Returns:
        np.array: Feature vector with shape (26,) [chroma_stft, rms, spectral_centroid, ...].
    """
    try:
        # Load audio
        audio, _ = librosa.load(file_path, sr=sr)

        # Extract features
        chroma = np.mean(librosa.feature.chroma_stft(y=audio, sr=sr))
        rms = np.mean(librosa.feature.rms(y=audio))
        spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=audio, sr=sr))
        spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=audio, sr=sr))
        rolloff = np.mean(librosa.feature.spectral_rolloff(y=audio, sr=sr))
        zero_crossing_rate = np.mean(librosa.feature.zero_crossing_rate(y=audio))

        # Extract MFCCs
        mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
        mfcc_means = np.mean(mfccs, axis=1)

        # Combine features
        features = np.array([chroma, rms, spectral_centroid, spectral_bandwidth, rolloff,
                            zero_crossing_rate] + mfcc_means.tolist())
        return features
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return None
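Because the scaler and the model consume this vector by position rather than by name, the order produced by `extract_features` has to match the column order of the training CSV. The minimal sketch below checks that; the helper names are hypothetical, and the MFCC column names `mfcc1` to `mfcc20` are an assumption about the CSV header (the docstring only names the first few columns).

```python
# Hypothetical helpers for checking feature order; the MFCC column names are assumed.
BASE_FEATURES = [
    "chroma_stft", "rms", "spectral_centroid",
    "spectral_bandwidth", "rolloff", "zero_crossing_rate",
]


def feature_names(n_mfcc=20):
    """Names of the entries returned by extract_features, in extraction order."""
    return BASE_FEATURES + [f"mfcc{i}" for i in range(1, n_mfcc + 1)]


def check_column_order(csv_columns, n_mfcc=20, label_col="LABEL"):
    """Raise ValueError if the CSV feature columns differ from the extraction order."""
    expected = feature_names(n_mfcc)
    actual = [c for c in csv_columns if c != label_col]
    if actual != expected:
        diffs = [(i, a, e) for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
        raise ValueError(
            f"expected {len(expected)} columns, got {len(actual)}; "
            f"first differences: {diffs[:3]}"
        )
    return True
```

Calling `check_column_order(pd.read_csv(csv_path).columns)` once before training would catch a reordered or renamed column early.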

In [None]:
class AudioFeaturesDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return {
            "input_ids": torch.tensor(self.features[idx], dtype=torch.float),
            "labels": torch.tensor(self.labels[idx], dtype=torch.long)
        }
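The key `input_ids` is reused here for continuous feature vectors simply because `AudioClassifier.forward` takes an argument of that name; the `Trainer` passes each batch dictionary to the model as keyword arguments. The NumPy sketch below (the `iter_batches` helper is hypothetical) mimics the stacked batches the model receives, including the smaller final batch.

```python
import numpy as np


def iter_batches(features, labels, batch_size=32):
    """Yield batch dicts shaped like stacked items of AudioFeaturesDataset."""
    features = np.asarray(features, dtype=np.float32)  # mirrors dtype=torch.float
    labels = np.asarray(labels, dtype=np.int64)         # mirrors dtype=torch.long
    for start in range(0, len(features), batch_size):
        yield {
            "input_ids": features[start:start + batch_size],  # (B, 26)
            "labels": labels[start:start + batch_size],        # (B,)
        }
```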

#### 2. DistilBERT

In [None]:
# Transformer-based classifier
class AudioClassifier(nn.Module):
    def __init__(self, input_dim=26, num_labels=2):
        super(AudioClassifier, self).__init__()

        # Load DistilBERT as a lightweight transformer
        self.transformer = AutoModel.from_pretrained("distilbert-base-uncased")

        # Freeze transformer layers to prevent overfitting
        for param in self.transformer.parameters():
            param.requires_grad = False

        # Project input features to transformer hidden size
        self.projection = nn.Linear(input_dim, self.transformer.config.hidden_size)

        # Classifier head
        self.classifier = nn.Sequential(
            nn.Linear(self.transformer.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_labels)
        )

    def forward(self, input_ids, labels=None):
        # Project features
        projected = self.projection(input_ids)
        # Pass through transformer
        transformer_outputs = self.transformer(inputs_embeds=projected.unsqueeze(1))
        pooled_output = transformer_outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)

        return {"loss": loss, "logits": logits} if loss is not None else {"logits": logits}
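Tracing shapes through `forward` shows what the transformer actually sees: each clip becomes a sequence of length one, so self-attention has a single key and its weights are trivially 1, and only the projection and the classifier head are updated. The sketch below replaces the frozen DistilBERT with an identity stand-in (an assumption made purely to show shapes), omits biases and dropout, and counts trainable parameters assuming DistilBERT's hidden size of 768.

```python
import numpy as np

HIDDEN = 768  # hidden size of distilbert-base-uncased


def linear_params(n_in, n_out):
    """Parameter count of nn.Linear(n_in, n_out): weights plus bias."""
    return n_in * n_out + n_out


def trainable_params(input_dim=26, hidden=HIDDEN, head=256, num_labels=2):
    """Parameters left trainable once the transformer is frozen."""
    return (linear_params(input_dim, hidden)    # self.projection
            + linear_params(hidden, head)       # first Linear of the classifier
            + linear_params(head, num_labels))  # final Linear of the classifier


def forward_shapes(batch_size=4, input_dim=26, hidden=HIDDEN, num_labels=2, seed=0):
    """Follow tensor shapes through forward(); the transformer is an identity stand-in."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch_size, input_dim))                  # (B, 26)
    projected = x @ rng.standard_normal((input_dim, hidden))          # (B, 768)
    seq = projected[:, None, :]                                       # unsqueeze(1) -> (B, 1, 768)
    last_hidden_state = seq                                           # stand-in for frozen DistilBERT
    pooled = last_hidden_state[:, 0, :]                               # (B, 768)
    h = np.maximum(pooled @ rng.standard_normal((hidden, 256)), 0.0)  # Linear + ReLU
    logits = h @ rng.standard_normal((256, num_labels))               # (B, 2)
    return seq.shape, pooled.shape, logits.shape
```

With these defaults 218,114 parameters are trained, against roughly 66M frozen DistilBERT weights.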

In [None]:
def load_and_preprocess_data(csv_path):
    df = pd.read_csv(csv_path)

    # Features and labels
    feature_columns = [col for col in df.columns if col != "LABEL"]
    X = df[feature_columns].values
    y = df["LABEL"].values

    # Encode labels (LabelEncoder sorts classes alphabetically: FAKE=0, REAL=1)
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)

    # Split first so validation rows do not influence the scaler statistics
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Standardize features using statistics from the training split only
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)

    return X_train, X_val, y_train, y_val, scaler, label_encoder
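`StandardScaler` maps each feature to z = (x - mean) / std using the population standard deviation, and `LabelEncoder` assigns integer codes in sorted class order, which is why FAKE becomes 0 and REAL becomes 1. Below is a NumPy sketch of both steps, with the statistics fitted on the training rows only so the validation rows stay unseen; the function names are hypothetical.

```python
import numpy as np


def fit_standardizer(X_train):
    """Per-feature mean and population std (ddof=0) from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # like StandardScaler, leave constant features unscaled
    return mean, std


def standardize(X, mean, std):
    """Apply previously fitted statistics to any split."""
    return (X - mean) / std


def encode_labels(y):
    """Sorted class array plus integer codes, mirroring LabelEncoder's ordering."""
    classes = np.unique(y)  # np.unique returns sorted values
    return classes, np.searchsorted(classes, y)
```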

#### 3. Fine-tuning

In [None]:
def fine_tune_model(X_train, X_val, y_train, y_val):
    # Create datasets
    train_dataset = AudioFeaturesDataset(X_train, y_train)
    val_dataset = AudioFeaturesDataset(X_val, y_val)

    # Initialize model
    model = AudioClassifier(input_dim=26, num_labels=2)

    # Training arguments
    training_args = TrainingArguments(
        output_dir="./audio_classifier",
        run_name="audio_classifier_run",  # Added to resolve wandb warning
        num_train_epochs=5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        warmup_steps=100,
        weight_decay=0.01,
        logging_dir="./logs",
        logging_steps=10,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
    )

    # Compute metrics
    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        accuracy = (preds == labels).mean()
        return {"accuracy": accuracy}

    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics,
    )

    # Train
    trainer.train()

    # Save model
    trainer.save_model("./audio_classifier_best")
    torch.save(model.state_dict(), "./audio_classifier_best/pytorch_model.bin")

    return model, trainer
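The logged learning rate, which ends at 0.0 after 1475 global steps, is consistent with the `Trainer` defaults, assuming `learning_rate` (5e-5), `lr_scheduler_type` ("linear") and `gradient_accumulation_steps` (1) are left unchanged, as in the cell above: a linear warmup over `warmup_steps=100`, then a linear decay to zero. 1475 steps over 5 epochs is 295 batches of 32 per epoch, which any training set of 9,409 to 9,440 rows would produce. A stdlib sketch of that schedule (function names are hypothetical):

```python
import math


def linear_warmup_lr(step, total_steps, warmup_steps=100, peak_lr=5e-5):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def total_training_steps(n_train, batch_size=32, epochs=5):
    """Optimizer steps when each epoch keeps its final, possibly smaller, batch."""
    return math.ceil(n_train / batch_size) * epochs
```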

#### 4. Classifying a Test Audio File

In [None]:
def classify_audio(file_path, model, scaler, label_encoder, device="cuda" if torch.cuda.is_available() else "cpu"):
    """
    Classify an audio file as FAKE or REAL.
    Args:
        file_path (str): Path to the .wav file.
        model: Trained AudioClassifier model.
        scaler: Fitted StandardScaler.
        label_encoder: Fitted LabelEncoder.
        device (str): Device for inference.
    Returns:
        str: Predicted label ("FAKE" or "REAL").
    """
    # Extract features
    features = extract_features(file_path)
    if features is None:
        return "Error: Could not extract features."

    # Standardize features
    features = scaler.transform([features])

    # Convert to tensor
    features_tensor = torch.tensor(features, dtype=torch.float).to(device)

    # Set model to evaluation mode
    model.eval()
    model.to(device)

    # Predict
    with torch.no_grad():
        outputs = model(input_ids=features_tensor)
        logits = outputs["logits"]
        pred = torch.argmax(logits, dim=1).cpu().numpy()[0]

    # Decode label
    label = label_encoder.inverse_transform([pred])[0]
    return label
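`classify_audio` keeps only the arg-max of the logits. A softmax over the logits gives class probabilities with the same arg-max index, and `inverse_transform` simply looks that index up in the encoder's sorted `classes_` array. A NumPy sketch of that decoding step, with the class array `("FAKE", "REAL")` assumed from the label-encoding comment (the helper name is hypothetical):

```python
import numpy as np


def decode_logits(logits, classes=("FAKE", "REAL")):
    """Return (label, probability) per row using a numerically stable softmax."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # avoid overflow in exp
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    idx = probs.argmax(axis=1)                             # same index as logits.argmax(axis=1)
    return [(classes[i], float(probs[row, i])) for row, i in enumerate(idx)]
```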

### `main` Execution

In [None]:
# Main execution
if __name__ == "__main__":
    # Path to CSV
    csv_path = "/content/data-balanced.csv"

    # Load and preprocess data
    X_train, X_val, y_train, y_val, scaler, label_encoder = load_and_preprocess_data(csv_path)

    # Fine-tune model
    model, trainer = fine_tune_model(X_train, X_val, y_train, y_val)

    # Example inference
    audio_file = "/content/biden-to-ryan.wav"  # Replace with your audio file path
    prediction = classify_audio(audio_file, model, scaler, label_encoder)
    print(f"Prediction for {audio_file}: {prediction}")

    # Save scaler and label encoder (joblib is already imported at the top)
    joblib.dump(scaler, "./audio_classifier_best/scaler.pkl")
    joblib.dump(label_encoder, "./audio_classifier_best/label_encoder.pkl")

    # Finish W&B run
    wandb.finish()


| Epoch | Training Loss | Validation Loss | Validation Accuracy |
|---|---|---|---|
| 1 | 0.4721 | 0.390516 | 0.845925 |
| 2 | 0.3589 | 0.214888 | 0.924024 |
| 3 | 0.2948 | 0.162619 | 0.940577 |
| 4 | 0.2711 | 0.144404 | 0.949491 |
| 5 | 0.2672 | 0.138844 | 0.952886 |


Error processing /content/biden-to-ryan.wav: [Errno 2] No such file or directory: '/content/biden-to-ryan.wav'
Prediction for /content/biden-to-ryan.wav: Error: Could not extract features.


*[W&B run history sparklines: eval/accuracy, eval/loss, eval/runtime, eval/samples_per_second, eval/steps_per_second, train/epoch, train/global_step, train/grad_norm, train/learning_rate, train/loss]*

| Metric | Value |
|---|---|
| eval/accuracy | 0.95289 |
| eval/loss | 0.13884 |
| eval/runtime | 7.9445 |
| eval/samples_per_second | 296.558 |
| eval/steps_per_second | 9.315 |
| total_flos | 0.0 |
| train/epoch | 5.0 |
| train/global_step | 1475.0 |
| train/grad_norm | 1.91823 |
| train/learning_rate | 0.0 |
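The summary figures are consistent with the 80/20 split used earlier. Evaluation ran for 7.9445 s at 296.558 samples/s, which is about 2,356 validation clips; at a batch size of 32 that is 74 evaluation batches, matching 9.315 steps/s over the same runtime. On the training side, 1475 global steps over 5 epochs is 295 batches per epoch. A quick stdlib check of that arithmetic, using only the logged values:

```python
import math

# Values copied from the W&B run summary above
eval_runtime = 7.9445
eval_samples_per_second = 296.558
eval_steps_per_second = 9.315
global_steps, epochs, batch_size = 1475, 5, 32

n_val = round(eval_runtime * eval_samples_per_second)     # validation clips
eval_steps = round(eval_runtime * eval_steps_per_second)  # evaluation batches
steps_per_epoch = global_steps // epochs                  # training batches per epoch

# Training-set sizes that give exactly steps_per_epoch batches of batch_size
n_train_range = ((steps_per_epoch - 1) * batch_size + 1, steps_per_epoch * batch_size)
```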


### Developing Apps

#### Streamlit App

In [None]:
!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.7-py3-none-any.whl.metadata (9.4 kB)
Downloading pyngrok-7.2.7-py3-none-any.whl (23 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.7


In [None]:
import os
from pyngrok import ngrok

# Read the ngrok auth token from an environment variable instead of hard-coding it
ngrok.set_auth_token(os.environ["NGROK_AUTHTOKEN"])

# Set up a tunnel to the port Streamlit will listen on
public_url = ngrok.connect("http://localhost:5000")
print("Streamlit app is live at:", public_url)

# Run Streamlit in the background on the tunnelled port (Streamlit's default is 8501).
# Requires `pip install streamlit` and an app.py in the working directory.
!streamlit run app.py --server.port 5000 &

Streamlit app is live at: NgrokTunnel: "https://5b53-35-233-155-118.ngrok-free.app" -> "http://localhost:5000"
/bin/bash: line 1: streamlit: command not found


#### Flask App

In [None]:
from flask import Flask, request, render_template_string
from pyngrok import ngrok
import torch
import joblib
import librosa
import numpy as np
import os

# AudioClassifier and extract_features must already be defined (see the cells above)
from torch import nn

# Setup
app = Flask(__name__)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and preprocessing tools
model_path = "./audio_classifier_best"
scaler = joblib.load(f"{model_path}/scaler.pkl")
label_encoder = joblib.load(f"{model_path}/label_encoder.pkl")

model = AudioClassifier(input_dim=26, num_labels=2).to(device)
model.load_state_dict(torch.load(f"{model_path}/pytorch_model.bin", map_location=device))
model.eval()


# UI Template
HTML_TEMPLATE = '''
<!DOCTYPE html>
<html>
<head>
    <title>SonicShield</title>
    <style>
        body {
            font-family: 'Segoe UI', sans-serif;
            margin: 0;
            padding: 0;
            background: url("{{ url_for('static', filename='bg.png') }}") no-repeat center center fixed;
            background-size: cover;
            display: flex;
            justify-content: center;
            align-items: center;
            height: 100vh;
            color: #333;
        }

        .form-container {
            background-color: rgba(255, 255, 255, 0.9);
            padding: 30px;
            border-radius: 20px;
            text-align: center;
            box-shadow: 0 8px 30px rgba(0, 0, 0, 0.2);
            width: 90%;
            max-width: 450px;
        }

        .form-container h2 {
            font-size: 26px;
            margin-bottom: 20px;
        }

        .form-container h2::before {
            content: "🎧 ";
            font-size: 28px;
        }

        input[type="file"] {
            margin: 15px 0;
            padding: 10px;
            width: 90%;
            border: 2px dashed #aaa;
            border-radius: 10px;
            background-color: #fafafa;
        }

        input[type="submit"] {
            background-color: #5a67d8;
            color: white;
            border: none;
            padding: 12px 25px;
            border-radius: 10px;
            cursor: pointer;
            font-size: 16px;
            transition: background-color 0.3s ease;
        }

        input[type="submit"]:hover {
            background-color: #434190;
        }

        .result {
            margin-top: 25px;
            padding: 15px;
            background-color: #f0f4f8;
            border-radius: 12px;
            font-size: 20px;
            font-weight: bold;
            color: #2d3748;
        }

    </style>
</head>
<body>
    <div class="form-container">
        <h2>SonicShield: Upload Audio File</h2>
        <form method="post" enctype="multipart/form-data">
            <input type="file" name="audio" accept=".wav" required><br>
            <input type="submit" value="Predict">
        </form>
        {% if result %}
        <div class="result">
            Prediction: {{ result }}
        </div>
        {% endif %}
    </div>
</body>
</html>
'''


@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        if 'audio' not in request.files:
            return render_template_string(HTML_TEMPLATE, result="No file uploaded")
        file = request.files['audio']
        if file.filename == '':
            return render_template_string(HTML_TEMPLATE, result="No file selected")
        if file and file.filename.lower().endswith('.wav'):
            # Keep only the base name so a crafted filename cannot write outside ./
            filepath = os.path.join('./', os.path.basename(file.filename))
            file.save(filepath)
            features = extract_features(filepath)
            os.remove(filepath)
            if features is None:
                return render_template_string(HTML_TEMPLATE, result="Failed to extract features")
            features_scaled = scaler.transform([features])
            tensor_input = torch.tensor(features_scaled, dtype=torch.float).to(device)
            with torch.no_grad():
                outputs = model(input_ids=tensor_input)
                logits = outputs["logits"]
                pred = torch.argmax(logits, dim=1).cpu().numpy()[0]
            label = label_encoder.inverse_transform([pred])[0]
            return render_template_string(HTML_TEMPLATE, result=label)
        else:
            return render_template_string(HTML_TEMPLATE, result="Only .wav files are supported")
    return render_template_string(HTML_TEMPLATE)

# For environments like Google Colab or local dev with tunneling
public_url = ngrok.connect(5001)
print(f"Public URL: {public_url}")
app.run(port=5001)

Public URL: NgrokTunnel: "https://3087-35-233-155-118.ngrok-free.app" -> "http://localhost:5001"
 * Serving Flask app '__main__'
 * Debug mode: off
 * Running on http://127.0.0.1:5001
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [07/May/2025 11:00:22] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [07/May/2025 11:00:22] "GET /static/bg.png HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [07/May/2025 11:00:23] "GET /favicon.ico HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [07/May/2025 11:57:36] "POST / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [07/May/2025 11:57:36] "GET /static/bg.png HTTP/1.1" 404 -
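The upload handler above decides purely on the `.wav` extension, so a renamed non-audio file reaches `librosa` and only surfaces as "Failed to extract features" because `extract_features` catches the exception. A stricter check could also inspect the first bytes: a standard WAV file starts with the ASCII tag `RIFF`, a 4-byte size field, and then `WAVE`. A stdlib sketch, where the `looks_like_wav` helper is hypothetical and could be called on the uploaded file's stream (for example `file.stream`) before saving:

```python
import io


def looks_like_wav(stream):
    """Return True if a binary file-like object starts with a RIFF/WAVE header.

    The stream position is restored afterwards so the file can still be saved.
    """
    start = stream.tell()
    header = stream.read(12)
    stream.seek(start)
    return len(header) == 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE"


# In-memory bytes standing in for uploaded files
wav_sample = io.BytesIO(b"RIFF" + (36).to_bytes(4, "little") + b"WAVEfmt ")
mp3_sample = io.BytesIO(b"ID3\x03\x00\x00\x00\x00\x00\x00\x00\x00")  # ID3-tagged MP3 start
```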


---

Thank You :)