<a href="https://colab.research.google.com/github/karthik-palaniappan/Mini-Project-Speech-Emotion-Classification/blob/main/T7_Copy_of_M2_NB_MiniProject_3_Emotion_Classification_from_Speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Speech Emotion Classification

## Problem Statement

Build a model to recognize emotion from speech using ensemble learning.

## Learning Objectives

At the end of the mini-project, you will be able to:

* extract the features from audio data
* implement ML classification algorithms individually and as Ensembles, to classify emotions
* record the voice sample and test it with trained model

## Dataset

**TESS Dataset**

The first dataset chosen for this mini-project is the [TESS](https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=doi:10.5683/SP2/E8H2MF) (Toronto emotional speech set) dataset. It contains 2800 files (200 words x 7 emotions x 2 actresses). A set of 200 target words was spoken in the carrier phrase "Say the word _____" by two actresses, and each set was recorded in seven different emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). Both actresses spoke English as their first language, were university educated, and had musical training. Audiometric testing indicated that both actresses had thresholds within the normal range.

**Ravdess Dataset**

The second dataset chosen for this mini-project is [Ravdess](https://zenodo.org/record/1188976#.YLczy4XivIU) (The Ryerson Audio-Visual Database of Emotional Speech and Song). This dataset contains 1440 files: 60 trials per actor x 24 actors = 1440. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

**File naming convention**

Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:

**Filename identifiers**

* Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
* Vocal channel (01 = speech, 02 = song).
* Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
* Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
* Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
* Repetition (01 = 1st repetition, 02 = 2nd repetition).
* Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

Filename example: `03-01-06-01-02-01-12.wav`

    - Audio-only - 03
    - Speech - 01
    - Fearful - 06
    - Normal intensity - 01
    - Statement "dogs" - 02
    - 1st Repetition - 01
    - 12th Actor - 12 (female, as the actor ID number is even)
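
The naming convention above can be decoded mechanically. The following is a minimal sketch, assuming the original RAVDESS naming with exactly seven hyphen-separated numeric parts (the copy of the data used later in this notebook appends an emotion suffix such as `_angry`, which this sketch does not handle). The function name `parse_ravdess_name` and the field names are hypothetical, chosen only for illustration.

```python
from pathlib import Path

# Field names in the order they appear in a RAVDESS filename (names are illustrative).
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")

EMOTIONS = {1: "neutral", 2: "calm", 3: "happy", 4: "sad",
            5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}


def parse_ravdess_name(path):
    """Split a RAVDESS filename into its seven numeric identifiers."""
    stem = Path(path).stem                      # drop the directory and ".wav"
    codes = [int(part) for part in stem.split("-")]
    if len(codes) != len(FIELDS):
        raise ValueError(f"expected 7 identifiers, got {len(codes)}: {path}")
    info = dict(zip(FIELDS, codes))
    info["emotion_name"] = EMOTIONS[info["emotion"]]
    # Odd actor IDs are male, even actor IDs are female.
    info["actor_gender"] = "female" if info["actor"] % 2 == 0 else "male"
    return info


print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))
```

For the example filename this yields the emotion `fearful`, statement code 2 and a female actor, matching the breakdown listed above.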

## Information

**Speech Emotion Recognition (SER)** is the task of recognizing the emotion from speech, irrespective of the semantics. Humans perform this task efficiently as a natural part of speech communication; doing it automatically with programmable devices, however, is still a field of active research.

Studies of automatic emotion recognition systems aim to create efficient, real-time methods of detecting the emotions of mobile phone users, call center operators and customers, car drivers, pilots, and many other human-machine communication users. Adding emotions to machines is an important aspect of making machines appear and act in a human-like manner.

Let's gain familiarity with some of the audio-based features that are commonly used for SER.

**Mel scale**: The mel scale (the name derives from the word *melody*) is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The scale is anchored to ordinary frequency measurement by assigning a perceptual pitch of 1000 mels to a 1000 Hz tone, 40 dB above the listener's threshold. Above about 500 Hz, increasingly large frequency intervals are judged by listeners to produce equal pitch increments. See [this article](https://towardsdatascience.com/learning-from-audio-the-mel-scale-mel-spectrograms-and-mel-frequency-cepstral-coefficients-f5752b6324a8) for more detailed information.
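
To make the mel scale concrete, here is a small sketch that converts between Hz and mels. It assumes one commonly used formula, m = 2595 · log10(1 + f / 700); other variants exist (Librosa's default conversion, for instance, uses a different piecewise definition), so the exact numbers are illustrative. The helper names `hz_to_mel` and `mel_to_hz` are defined here and are not library calls.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (2595 * log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


for f in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

With this formula a 1000 Hz tone maps to roughly 1000 mels, and equal steps in Hz produce progressively smaller steps in mels as the frequency rises.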

**Pitch**: how high or low a sound is perceived to be. It depends on frequency: a higher frequency is heard as a higher pitch.

**Frequency**: the rate at which a sound wave vibrates, measured in cycles per second (Hz).

**Chroma**: a representation of audio in which the spectrum is projected onto 12 bins, one for each of the 12 distinct semitones (or chroma). It is computed by summing the log-frequency magnitude spectrum across octaves.

**Fourier transform**: converts a signal from the time domain to the frequency domain. The time domain shows how the signal changes over time; the frequency domain shows how much of the signal lies within each frequency band over a range of frequencies.
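
The move from time domain to frequency domain can be illustrated with a toy signal. The sketch below applies the discrete Fourier transform directly from its definition, using only the standard library; it is meant for illustration, and in practice a fast implementation such as `numpy.fft.rfft` would normally be used. The sample rate, signal length and 8 Hz tone are arbitrary toy values.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0 .. N/2 using the direct O(N^2) DFT definition."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(total))
    return mags


sample_rate = 64   # samples per second (toy value)
n_samples = 64     # one second of signal, so bin k corresponds to k Hz
tone_hz = 8
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
          for t in range(n_samples)]

mags = dft_magnitudes(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / n_samples
print(f"strongest frequency: {peak_hz:.1f} Hz")
```

The time-domain list `signal` only shows an oscillating amplitude, while the magnitude spectrum has a single dominant peak at the tone's frequency.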

**Librosa**

[Librosa](https://librosa.org/doc/latest/index.html) is a Python package built for speech and audio analytics. It provides modular functions that simplify working with audio data and support a wide range of applications, such as identifying personal characteristics from individuals' voice samples and detecting emotions from audio samples.

For further details on the Librosa package, see [this paper](https://conference.scipy.org/proceedings/scipy2015/pdfs/brian_mcfee.pdf).


### **Kaggle Competition**

Please refer to the [Kaggle Competition Document](https://drive.google.com/file/d/1kGJfpEC9dayjApciCYZr04NWT7XWkRhV/view?usp=sharing) and join the Kaggle Competition using the hyperlink given in that document under '*Kaggle* Competition site'.


## Grading = 10 Points

In [2]:
#@title Download the datasets and install packages
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Ravdess_Tess.zip
!unzip -qq Ravdess_Tess.zip
# Install packages
!pip -qq install librosa soundfile
!pip -qq install wavio
print("Datasets downloaded successfully!")

Datasets downloaded successfully!


### Import Necessary Packages

In [3]:
import librosa
import librosa.display
import soundfile
import os, glob, pickle
import numpy as np
import pandas as pd
import IPython.display as ipd
from matplotlib import pyplot as plt
from datetime import datetime
from IPython.display import Javascript
from google.colab import output
from base64 import b64decode
import warnings
warnings.filterwarnings('ignore')
# sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import tree
from sklearn.ensemble import VotingClassifier

### Work-Flow

* Load the TESS audio data and extract features and labels

* Load the Ravdess audio data and extract features

* Combine both the audio dataset features

* Train and test the model with TESS + Ravdess Data

* Record the team audio samples and add them to TESS + Ravdess data

* Train and test the model with TESS + Ravdess + Team Recorded (combined) data

* Test each of the models with live audio sample recording.

### Load the Tess data and Ravdess data audio files (1 point)

Hint: `glob.glob`

In [4]:
# YOUR CODE HERE

TESS_files = glob.glob("Tess/*/*.wav")
print("TESS:",len(TESS_files))

RAVDESS_files = glob.glob("ravdess/*/*.wav")
print("RAVDESS:",len(RAVDESS_files))

TESS: 2679
RAVDESS: 1168


#### Play the sample audio

In [5]:
# YOUR CODE HERE
i = 0
print(TESS_files[i])
ipd.Audio(TESS_files[i])

Tess/YAF_neutral/YAF_hit_neutral.wav


In [30]:
# Creating a pandas dataframe with the necessary columns

Modality = {1: "full-AV", 2: "video-only", 3: "audio-only"}
VocalChannel = {1: "speech", 2: "song"}
Emotion = {1: "neutral", 2: "calm", 3: "happy", 4: "sad", 5: "angry", 6: "fearful", 7: "disgust", 8: "surprised"}
EmotionalIntensity = {1: "normal", 2: "strong", 3: "Unknown"} # NOTE: There is no strong intensity for the 'neutral' emotion.
Statement = {1: "Kids are talking by the door", 2: "Dogs are sitting by the door", 3: "Say the word"}
Repetition = {1: "1st repetition", 2: "2nd repetition"}
Actor = np.arange(1,27) #Actors 1 - 24 from RavDESS and 25 and 26 from TESS
ActorGender = {1: "Female", 2: "Male"}
ActorAge = {1: "Young", 2: "Old", 3: "Unknown"}
DataBase = {1: "RAVDESS", 2: "TESS"}

data_columns = ["Modality", "VocalChannel", "Emotion", "EmotionalIntensity", "Statement", "Repetition", "Actor", "ActorGender", "ActorAge", "DataBase", "FilePath"]

dtype_dict = {
    "Modality": 'category',
    'VocalChannel' : 'category',
    'Emotion' : 'category',
    'EmotionalIntensity' : 'category',
    'Statement' : 'str',
    'Repetition' : 'category',
    'Actor' : 'category',
    'ActorGender' : 'category',
    'ActorAge' : 'category',
    'DataBase' : 'category',
    'FilePath' : 'str'}

df = pd.DataFrame(columns=data_columns)
df = df.astype(dtype=dtype_dict)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Modality            0 non-null      category
 1   VocalChannel        0 non-null      category
 2   Emotion             0 non-null      category
 3   EmotionalIntensity  0 non-null      category
 4   Statement           0 non-null      object  
 5   Repetition          0 non-null      category
 6   Actor               0 non-null      category
 7   ActorGender         0 non-null      category
 8   ActorAge            0 non-null      category
 9   DataBase            0 non-null      category
 10  FilePath            0 non-null      object  
dtypes: category(9), object(2)
memory usage: 972.0+ bytes


In [18]:
import re

def TESS_name_decoder(TESSname):
  """Decode a TESS path such as 'Tess/YAF_neutral/YAF_hit_neutral.wav' into a metadata dict."""

  # Speaker prefix: YAF or OAF
  actor = re.search(r'(?<=Tess/)[YO]AF', TESSname).group(0)
  # Target word: the first letters-only token enclosed by underscores
  word  = re.search(r'(?<=_)[a-zA-Z]+(?=_)', TESSname).group(0)
  # Emotion label: the token just before the '.wav' extension
  emotion  = re.search(r'(?<=_)[a-zA-Z]+(?=\.wav)', TESSname).group(0)

  Mod = 3                                  # audio-only
  VC = 1                                   # speech
  Em = emotion                             # label taken verbatim from the file name
  EI = 3                                   # TESS does not encode intensity
  St = Statement[3] + " " + word           # carrier phrase plus target word
  Rp = 1
  actor_id = 25 if actor == 'YAF' else 26  # 25 for YAF and 26 for OAF
  AG = 1                                   # both TESS speakers are female
  AA = 1 if actor == 'YAF' else 2          # 1 for young and 2 for old
  DB = 2                                   # TESS
  FilePath = TESSname

  return {"Modality": Modality[Mod],
          "VocalChannel": VocalChannel[VC],
          "Emotion": Em,
          "EmotionalIntensity": EmotionalIntensity[EI],
          "Statement": St,
          "Repetition": Repetition[Rp],
          "Actor": actor_id,
          "ActorGender": ActorGender[AG],
          "ActorAge": ActorAge[AA],
          "DataBase": DataBase[DB],
          "FilePath": FilePath}

TESS_name_decoder(TESS_files[0])

{'Modality': 'audio-only',
 'VocalChannel': 'speech',
 'Emotion': 'neutral',
 'EmotionalIntensity': 'Unknown',
 'Statement': 'Say the word hit',
 'Repetition': '1st repetition',
 'Actor': 25,
 'ActorGender': 'Female',
 'ActorAge': 'Young',
 'DataBase': 'TESS',
 'FilePath': 'Tess/YAF_neutral/YAF_hit_neutral.wav'}

In [29]:
import re

def RAVDESS_name_decoder(RAVDESSname):
  """Decode a path such as 'ravdess/Actor_06/03-01-05-02-01-01-06_angry.wav' into a metadata dict."""

  # Emotion label: the text between '<digit>_' and '.wav'
  emotion  = re.search(r'(?<=[0-9]_).*?(?=\.wav)', RAVDESSname).group(0)
  # Seven numeric identifiers: the text between '<digit>/' and the next '_'
  envmt  = re.search(r'(?<=[0-9]/).*?(?=_)', RAVDESSname).group(0).split('-')

  Mod = int(envmt[0])
  VC = int(envmt[1])
  Em = emotion                        # label taken from the file-name suffix
  EI = int(envmt[3])
  St = Statement[int(envmt[4])]
  Rp = int(envmt[5])
  actor_id = int(envmt[6])            # RAVDESS actors are numbered 1 to 24
  AG = 1 if actor_id % 2 == 0 else 2  # even IDs are female, odd IDs are male
  AA = 3                              # age is not encoded in RAVDESS
  DB = 1                              # RAVDESS
  FilePath = RAVDESSname

  return {"Modality": Modality[Mod],
          "VocalChannel": VocalChannel[VC],
          "Emotion": Em,
          "EmotionalIntensity": EmotionalIntensity[EI],
          "Statement": St,
          "Repetition": Repetition[Rp],
          "Actor": actor_id,
          "ActorGender": ActorGender[AG],
          "ActorAge": ActorAge[AA],
          "DataBase": DataBase[DB],
          "FilePath": FilePath}

print(RAVDESS_files[0])
RAVDESS_name_decoder(RAVDESS_files[0])

#ipd.Audio(RAVDESS_files[0])

ravdess/Actor_06/03-01-05-02-01-01-06_angry.wav


{'Modality': 'audio-only',
 'VocalChannel': 'speech',
 'Emotion': 'angry',
 'EmotionalIntensity': 'strong',
 'Statement': 'Kids are talking by the door',
 'Repetition': '1st repetition',
 'Actor': 6,
 'ActorGender': 'Female',
 'ActorAge': 'Unknown',
 'DataBase': 'RAVDESS',
 'FilePath': 'ravdess/Actor_06/03-01-05-02-01-01-06_angry.wav'}

In [10]:
print(len(TESS_files))

2679


In [31]:
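# DataFrame.append was removed in pandas 2.0; decode all rows first, then concatenate once
tess_rows = pd.DataFrame([TESS_name_decoder(t) for t in TESS_files])
df = pd.concat([df, tess_rows], ignore_index=True)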


In [27]:
df.head()

Unnamed: 0,Modality,VocalChannel,Emotion,EmotionalIntensity,Statement,Repetition,Actor,ActorGender,ActorAge,DataBase,FilePath
0,audio-only,speech,neutral,Unknown,Say the word hit,1st repetition,25,Female,Young,TESS,Tess/YAF_neutral/YAF_hit_neutral.wav
1,audio-only,speech,neutral,Unknown,Say the word cool,1st repetition,25,Female,Young,TESS,Tess/YAF_neutral/YAF_cool_neutral.wav
2,audio-only,speech,neutral,Unknown,Say the word long,1st repetition,25,Female,Young,TESS,Tess/YAF_neutral/YAF_long_neutral.wav
3,audio-only,speech,neutral,Unknown,Say the word perch,1st repetition,25,Female,Young,TESS,Tess/YAF_neutral/YAF_perch_neutral.wav
4,audio-only,speech,neutral,Unknown,Say the word moon,1st repetition,25,Female,Young,TESS,Tess/YAF_neutral/YAF_moon_neutral.wav


In [32]:
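# DataFrame.append was removed in pandas 2.0; decode all rows first, then concatenate once
ravdess_rows = pd.DataFrame([RAVDESS_name_decoder(r) for r in RAVDESS_files])
df = pd.concat([df, ravdess_rows], ignore_index=True)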

In [33]:
df.tail()

Unnamed: 0,Modality,VocalChannel,Emotion,EmotionalIntensity,Statement,Repetition,Actor,ActorGender,ActorAge,DataBase,FilePath
3842,audio-only,speech,angry,strong,Dogs are sitting by the door,1st repetition,9,Male,Unknown,RAVDESS,ravdess/Actor_09/03-01-05-02-02-01-09_angry.wav
3843,audio-only,speech,sad,strong,Dogs are sitting by the door,2nd repetition,9,Male,Unknown,RAVDESS,ravdess/Actor_09/03-01-04-02-02-02-09_sad.wav
3844,audio-only,speech,surprised,strong,Dogs are sitting by the door,2nd repetition,9,Male,Unknown,RAVDESS,ravdess/Actor_09/03-01-08-02-02-02-09_surprise...
3845,audio-only,speech,sad,normal,Dogs are sitting by the door,1st repetition,9,Male,Unknown,RAVDESS,ravdess/Actor_09/03-01-04-01-02-01-09_sad.wav
3846,audio-only,speech,disgust,strong,Kids are talking by the door,2nd repetition,9,Male,Unknown,RAVDESS,ravdess/Actor_09/03-01-07-02-01-02-09_disgust.wav


In [34]:
print(df.shape)
print(len(RAVDESS_files), len(TESS_files), len(RAVDESS_files)+len(TESS_files))

(3847, 11)
1168 2679 3847


### Data Exploration and Visualization (1 point)

#### Visualize the distribution of all the labels

In [None]:
# YOUR CODE HERE

#### Visualize sample audio signal using librosa

In [None]:
# YOUR CODE HERE

### Feature extraction (2 points)

Read one WAV file at a time using `Librosa`. The `librosa.load()` function returns the audio time series as an array (1-dimensional for mono, 2-dimensional for stereo) together with the sampling rate. Each element of the array is the amplitude of the sound wave at one sample, so the length of the array equals the sampling rate multiplied by the duration of the clip in seconds. Refer to the supplementary notebook ('Audio feature extraction').

To learn more about the features Librosa can compute, explore the [feature extraction documentation](https://librosa.org/doc/latest/feature.html).

In [None]:
# YOUR CODE HERE

#### Create a dictionary or a function to encode the emotions

In [None]:
# YOUR CODE HERE
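
A minimal sketch of such an encoding is shown below. It reuses the RAVDESS emotion codes from the filename convention and maps emotion names to integers, which is the direction the live-recording cell at the end of this notebook expects from a dictionary called `emotions`. The `ALIASES` table and the helper names are hypothetical: the alternative spellings are guesses at labels that may occur in file names and should be checked against the actual data.

```python
# Integer codes follow the RAVDESS emotion identifiers.
emotions = {"neutral": 1, "calm": 2, "happy": 3, "sad": 4,
            "angry": 5, "fearful": 6, "disgust": 7, "surprised": 8}

# Hypothetical aliases for alternative spellings; verify against the real labels.
ALIASES = {"fear": "fearful", "ps": "surprised", "surprise": "surprised",
           "anger": "angry", "happiness": "happy", "sadness": "sad"}


def encode_emotion(label):
    """Map an emotion name (or a known alias) to its integer code."""
    name = label.strip().lower()
    name = ALIASES.get(name, name)
    return emotions[name]          # raises KeyError for unknown labels


def decode_emotion(code):
    """Map an integer code back to the emotion name."""
    return {v: k for k, v in emotions.items()}[code]


print(encode_emotion("Fear"), decode_emotion(8))
```

Raising `KeyError` on an unknown label is deliberate here, so that unexpected spellings surface early instead of being silently dropped.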

#### TESS data feature extraction

In [None]:
# YOUR CODE HERE

#### Ravdess data feature extraction

In [None]:
# YOUR CODE HERE

#### Save the features

It is advisable to save the extracted features in a DataFrame and keep it, so that the feature extraction step does not have to be repeated on every run.

* Make a DataFrame with features and labels

* Write the DataFrame to a `.csv` file and save it offline.
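
The two steps above could look like the following sketch. The column names (`f0`, `f1`, ... and `label`), the file name `features_demo.csv` and the helper names are assumptions for illustration, and random numbers stand in for real extracted features.

```python
import numpy as np
import pandas as pd


def save_features(features, labels, csv_path):
    """Write a (n_samples, n_features) array plus labels to a CSV file."""
    columns = [f"f{i}" for i in range(features.shape[1])]
    df_feat = pd.DataFrame(features, columns=columns)
    df_feat["label"] = labels
    df_feat.to_csv(csv_path, index=False)
    return df_feat


def load_features(csv_path):
    """Read the CSV back into a feature matrix X and a label vector y."""
    df_feat = pd.read_csv(csv_path)
    X = df_feat.drop(columns="label").to_numpy()
    y = df_feat["label"].to_numpy()
    return X, y


# Toy data standing in for real extracted features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))
y_demo = np.array([1, 3, 3, 5, 5, 8])

save_features(X_demo, y_demo, "features_demo.csv")
X_back, y_back = load_features("features_demo.csv")
print(X_back.shape, y_back.shape)
```

In Colab the file lives in the session's temporary storage, so it still has to be downloaded or copied elsewhere to survive a runtime reset.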

In [None]:
# YOUR CODE HERE

#### Split the data into train and test

In [None]:
# YOUR CODE HERE

### Train the model with TESS + Ravdess data (2 points)

* Apply different ML algorithms (e.g. DecisionTree, RandomForest, etc.) and find the model with the best performance

In [None]:
# YOUR CODE HERE

#### Apply the voting classifier

In [None]:
# YOUR CODE HERE
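
For orientation, here is a sketch of a soft-voting ensemble built from three of the classifiers imported at the top of this notebook. The data comes from `make_classification` purely as a stand-in for the extracted audio features, and the choice of base models, hyperparameters and split ratio is an assumption, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the audio feature matrix and emotion labels.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# Stratify so every class keeps the same proportion in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),  # probabilities needed for soft voting
    ],
    voting="soft",  # average the predicted class probabilities
)
voter.fit(X_train, y_train)
y_pred = voter.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
```

With `voting="hard"` the ensemble would take a majority vote over predicted labels instead, in which case `SVC` would not need `probability=True`.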

### Train the model with TESS + Ravdess + Team recorded data (4 points)

* Record the audio samples (team data), extract features and combine with TESS + Ravdess data features
  - Record and gather all the team data samples with proper naming convention in separate folder

    **Hint:** Follow the supplementary notebook to record team data

  - Each team member must record 2 samples for each emotion (use sentences similar to those in the TESS data)

* Train the different ML algorithms and find the model with best performance

#### Load the team data

In [None]:
# YOUR CODE HERE

#### Extracting features of team data and combine with TESS + Ravdess

In [None]:
# YOUR CODE HERE

#### Train the different ML algorithms

In [None]:
# YOUR CODE HERE

#### Test the best working model with live audio recording

In [None]:
# choose the best working model and assign below
MODEL = None  # TODO: replace None with the best-performing trained model

In [None]:
#@title Speak the utterance and test
from IPython.display import Javascript, display
from google.colab import output
from base64 import b64decode

RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})
"""

if not os.path.exists('ModelTesting/'):
    os.mkdir("ModelTesting/")
def record(sec=3):
    print("Start speaking!")
    now = datetime.now()
    current_time = now.strftime("%Y-%m-%d_%H-%M-%S")
    display(Javascript(RECORD))
    s = output.eval_js('record(%d)' % (sec*1000))
    b = b64decode(s.split(',')[1])
    with open('ModelTesting/audio_'+current_time+'.wav','wb') as f:
        f.write(b)
    return 'ModelTesting/audio_'+current_time+'.wav'
test_i = record()
pred = MODEL.predict(extract_feature(test_i).reshape(1,-1))
idx_emotion = list(emotions.values()).index(pred[0])
print(list(emotions.keys())[idx_emotion])
ipd.Audio(test_i)

### Report Analysis

- Report the accuracy for 10 live samples using the model trained on TESS + Ravdess + Team data
- Discuss deep-learned audio features with the team mentor. Read a related article [here](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8805181).
