# 1 Author

**Student Name**: Sneha Gadade 

**Student ID**:  220798659

In [1]:
from google.colab import drive

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from numpy import dot
from numpy.linalg import norm

drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
BASE_PATH = '/content/drive/MyDrive/Data/MLEndLS/'

# 2 Problem formulation 



**Crime suspect identification using dimensionality reduction (unsupervised learning)**

Let us assume that a crime department on the Mile End Campus holds the MLEnd London Sounds dataset (the train split) as a database of all past criminals and their audio samples. A new crime has been committed on the Mile End Campus, and the only evidence the department has found is an audio recording (a test sample) of the criminal. With no other way to find the real criminal, the department decides to use the database of past criminals to produce an initial list of suspects.

The objective is to obtain the top 10 suspects from the MLEnd London Sounds dataset for a given audio sample of the real criminal. We will use the **Participant** column as the names of the criminals. We will split the dataset into train and test data. The train data represents the criminal database held by the crime department of the Mile End Campus, and the test data represents recordings of real criminals. At the end of the notebook we compare the suspects with the real criminals for audio samples from the test data.

Using the MLEnd London Sounds dataset, we will build a machine learning pipeline that takes an audio segment as input and identifies the participants in the saved data whose recordings are most similar to it.

In [4]:
MLEndLS_df = pd.read_csv(BASE_PATH + 'data/data.csv')

In [5]:
MLEndLS_df.head()

| Unnamed: 0 | file_id | feature_power | feature_pitch_mean | feature_pitch_std | feature_voiced_fr | feature_hist_0 | feature_hist_1 | feature_hist_2 | feature_hist_3 | feature_hist_4 | ... | feature_MFCC_14 | feature_MFCC_15 | feature_MFCC_16 | feature_MFCC_17 | feature_MFCC_18 | feature_MFCC_19 | area | spot | in_out | Participant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001.wav | 0.026341 | 86.945514 | 2.27842 | 0.016929 | 7e-06 | 4e-06 | 1e-06 | 6e-06 | 1.2e-05 | ... | -3.208296 | -0.176207 | 7.756495 | -8.273263 | 16.583691 | -8.927471 | british | street | outdoor | S151 |
| 1 | 0002.wav | 0.010308 | 193.033652 | 34.995743 | 0.131796 | 6e-06 | 1e-05 | 1.1e-05 | 1.6e-05 | 1.3e-05 | ... | -2.24141 | -6.907155 | 7.761998 | -5.321468 | -0.472251 | 3.177591 | kensington | dinosaur | indoor | S127 |
| 2 | 0003.wav | 0.005994 | 118.204789 | 13.032931 | 0.053735 | 1.2e-05 | 2.4e-05 | 1.9e-05 | 9e-06 | 1e-05 | ... | 1.31215 | 16.873102 | -15.021708 | 3.989271 | 0.975474 | -7.571914 | campus | square | outdoor | S18 |
| 3 | 0004.wav | 0.016374 | 127.450592 | 18.197021 | 0.105263 | 4e-06 | 1e-06 | 6e-06 | 1.3e-05 | 3.7e-05 | ... | -3.125193 | -3.332724 | -7.483416 | -2.683595 | -6.539318 | 1.335392 | kensington | hintze | indoor | S179 |
| 4 | 0005.wav | 0.002628 | 160.158646 | 25.790774 | 0.067073 | 6e-06 | 2e-05 | 4.4e-05 | 5.4e-05 | 5.9e-05 | ... | -3.904656 | -4.67898 | 0.059519 | -1.126336 | -2.390231 | -3.384028 | campus | square | outdoor | S176 |


In [6]:
is_multi = MLEndLS_df["Participant"].value_counts() > 1
MLEndLS_df = MLEndLS_df[MLEndLS_df["Participant"].isin(is_multi[is_multi].index)]

In [7]:
features = [column for column in MLEndLS_df.columns if column.startswith('feature_')]

# 3 Modelling and Machine Learning pipeline


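We split the dataset into train and test data, holding out 7% of the recordings as the test set and stratifying by participant. Participants with only one recording were removed above, since a stratified split needs at least two recordings per participant. The train data represents the criminal database held by the crime department of the Mile End Campus, and the test data represents recordings of real criminals. At the end of the notebook we compare the suspects with the real criminals for audio samples from the test data.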

In [8]:
train_df, test_df, participants_train, participants_test = train_test_split(
    MLEndLS_df,
    MLEndLS_df[["file_id", "Participant"]],
    test_size=0.07,
    stratify=MLEndLS_df["Participant"],
)
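
To illustrate what the `stratify` argument does, here is a minimal sketch on a toy table. The participant labels, file names, `test_size=0.2` and `random_state=0` are made up for the illustration and are not taken from the MLEnd data. scikit-learn rejects a stratified split when any class has a single member, which is consistent with dropping single-recording participants first.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real table: 4 hypothetical participants, 5 recordings each.
toy_df = pd.DataFrame({
    "file_id": [f"{i:04d}.wav" for i in range(1, 21)],
    "Participant": ["P1", "P2", "P3", "P4"] * 5,
})

# Hold out 20% of the rows; stratify keeps each participant's share the same in both parts.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["Participant"], random_state=0
)

# Count recordings per participant in each part.
train_counts = toy_train["Participant"].value_counts().to_dict()
test_counts = toy_test["Participant"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```

In this toy case every participant contributes four recordings to the train part and one to the test part.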

# 4 Transformation stage

We will use the features already extracted in the dataset (the `feature_` columns) for this problem. We will scale the features using `StandardScaler` from scikit-learn and obtain an embedding of each sample in the training data using dimensionality reduction (PCA with 20 components). For a test sample, we will apply the same fitted transformations to obtain its embedding and compare it against all embeddings in the train dataset using cosine similarity. The higher the cosine similarity for a participant, the higher the chance that the participant is a suspect.
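
As a small worked example of the similarity measure, cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends on direction rather than magnitude. The vectors below are made up for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([1.0, 2.0, 0.0])
same_direction = np.array([2.0, 4.0, 0.0])  # query scaled by 2
orthogonal = np.array([0.0, 0.0, 3.0])      # no shared direction with query
opposite = -query                           # points the other way

sim_same = cosine_similarity(query, same_direction)
sim_orthogonal = cosine_similarity(query, orthogonal)
sim_opposite = cosine_similarity(query, opposite)
print(sim_same, sim_orthogonal, sim_opposite)
```

The three values are approximately 1, 0 and -1, which is why a higher score is read as a closer match.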

In [9]:
class CriminalsPCA:
    """Ranks train recordings by cosine similarity to a test recording in PCA space."""

    def __init__(self, train_dataframe, test_dataframe, feature_names):
        self.trainDataframe = train_dataframe
        self.testDataframe = test_dataframe
        self.testDFfileIDs = self.testDataframe["file_id"].values.tolist()
        self.featureNames = feature_names
        # Fit the scaler on the train data only, then standardise the train features.
        self.scaleEstimator = StandardScaler()
        self.scaleEstimator.fit(train_dataframe[self.featureNames])
        self.X = self.scaleEstimator.transform(train_dataframe[self.featureNames])
        # Project the standardised features onto 20 principal components.
        self.pca = PCA(n_components=20)
        self.pca.fit(self.X)
        self.embeddings = self.pca.transform(self.X)

    def cosine_similarity(self, a, b):
        # Cosine of the angle between two 1-D vectors.
        return dot(a, b) / (norm(a) * norm(b))

    def get_similarities(self, input_vec, arr):
        # Similarity between input_vec and every row of arr.
        similarities = []
        for i in range(arr.shape[0]):
            similarities.append(self.cosine_similarity(input_vec, arr[i, :]))
        return similarities

    def identify_suspects(self, file_id):
        # Select the test row as a one-row DataFrame so the scaler sees the column names it was fitted on.
        row_index = self.testDFfileIDs.index(file_id)
        feature_vector = self.testDataframe[self.featureNames].iloc[[row_index]]
        similarSuspectsDF = pd.DataFrame()
        similarSuspectsDF["file_id"] = self.trainDataframe["file_id"].values.tolist()
        similarSuspectsDF["Participant"] = self.trainDataframe["Participant"].values.tolist()
        # Apply the transformations fitted on the train data to the test sample.
        featureVector_scaled = self.scaleEstimator.transform(feature_vector)
        # PCA returns shape (1, 20); take the single row so each similarity is a scalar.
        featureVector_embedding = self.pca.transform(featureVector_scaled)[0]
        similarities = self.get_similarities(featureVector_embedding, self.embeddings)
        similarSuspectsDF["similarity"] = similarities
        # Rank train recordings from most to least similar and keep the top 10.
        similarSuspectsDF = similarSuspectsDF.sort_values(by="similarity", ascending=False)
        suspects = similarSuspectsDF['Participant'].head(10).values.tolist()
        print("TOP 10 SUSPECTS ....")
        print(f"{suspects}")
        return suspects
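
The row-by-row loop in `get_similarities` could also be written as a single NumPy expression. The sketch below is one possible vectorised form, not the version used in this notebook. The function name `cosine_similarities` and the random toy embeddings are assumptions made for the illustration.

```python
import numpy as np

def cosine_similarities(query, embeddings):
    """Cosine similarity between one query vector and every row of a 2-D array."""
    query = np.asarray(query, dtype=float).ravel()
    embeddings = np.asarray(embeddings, dtype=float)
    # One matrix-vector product gives all dot products; divide by the norms.
    row_norms = np.linalg.norm(embeddings, axis=1)
    return embeddings @ query / (row_norms * np.linalg.norm(query))

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 20))  # stand-in for 5 train embeddings
toy_query = rng.normal(size=20)            # stand-in for one test embedding

vectorised = cosine_similarities(toy_query, toy_embeddings)
# Reference values computed one row at a time, as the class does.
looped = np.array([
    np.dot(toy_query, row) / (np.linalg.norm(toy_query) * np.linalg.norm(row))
    for row in toy_embeddings
])
print(np.allclose(vectorised, looped))
```

Both forms give the same scores. The vectorised one simply avoids the Python-level loop over train recordings.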

# 8 Results
In this section we will identify the top 10 suspects for audio files in the test dataset.

In [10]:
suspect_identifier = CriminalsPCA(train_dataframe=train_df, test_dataframe=test_df, feature_names=features)

In [11]:
criminal_voice_fileIDs = participants_test["file_id"].values.tolist()
criminal_voice_fileIDs[:10]

['0542.wav',
 '2383.wav',
 '1171.wav',
 '0840.wav',
 '0670.wav',
 '1809.wav',
 '0315.wav',
 '0330.wav',
 '0593.wav',
 '1467.wav']

In [16]:
criminal_voice_file_id = criminal_voice_fileIDs[1]
criminal_voice_file_id

'2383.wav'

In [17]:
print(f'Real Criminal is {participants_test["Participant"][participants_test["file_id"] == criminal_voice_file_id].values[0]}')
prime_suspects = suspect_identifier.identify_suspects(criminal_voice_file_id)

Real Criminal is S129
TOP 10 SUSPECTS ....
['S129', 'S61', 'S129', 'S46', 'S118', 'S78', 'S181', 'S118', 'S72', 'S78']
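
The list above ranks recordings rather than participants, so the same participant can appear more than once. If a list of distinct suspects is wanted, one option is to keep each participant's best-scoring recording before taking the top entries. The sketch below assumes a table with `Participant` and `similarity` columns like the one built inside `identify_suspects`. The helper name `top_unique_suspects`, the participant labels and the similarity values are made up for illustration.

```python
import pandas as pd

def top_unique_suspects(ranked_df, k=10):
    # Keep the highest similarity per participant, then take the k best participants.
    best = ranked_df.groupby("Participant")["similarity"].max()
    return best.sort_values(ascending=False).head(k).index.tolist()

# Toy ranking with repeated participants (hypothetical values).
toy_ranked = pd.DataFrame({
    "file_id": ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"],
    "Participant": ["P1", "P2", "P1", "P3", "P2"],
    "similarity": [0.91, 0.85, 0.80, 0.60, 0.40],
})

suspects = top_unique_suspects(toy_ranked, k=3)
print(suspects)  # ['P1', 'P2', 'P3']
```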




# 9 Conclusions
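For most of the samples in the test data, the real criminal appears among the top 10 suspects returned by the similarity ranking. In the example above, the real criminal S129 is ranked first. The list contains repeated participants because the ranking is per recording rather than per participant.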
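
One way to put a number on this observation would be a top-k hit rate: the fraction of test recordings whose real criminal appears among the first k suspects. The sketch below is a possible check, not something computed in this notebook. The helper name `top_k_hit_rate` and the toy labels are assumptions made for the illustration.

```python
def top_k_hit_rate(true_participants, ranked_suspects, k=10):
    # Count the test recordings whose true participant is among the first k suspects.
    hits = sum(
        truth in suspects[:k]
        for truth, suspects in zip(true_participants, ranked_suspects)
    )
    return hits / len(true_participants)

# Toy example: 4 test recordings, each with a ranked list of 3 suspects.
toy_truth = ["P1", "P2", "P3", "P4"]
toy_rankings = [
    ["P1", "P7", "P9"],  # hit
    ["P5", "P2", "P8"],  # hit
    ["P6", "P4", "P1"],  # miss
    ["P4", "P4", "P2"],  # hit
]

rate = top_k_hit_rate(toy_truth, toy_rankings, k=3)
print(rate)  # 0.75
```

In this notebook, the ranked lists could be gathered by calling `suspect_identifier.identify_suspects` for each entry of `criminal_voice_fileIDs`, with the true labels taken from `participants_test`.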