#### Denoising of sound data using Principal Component Analysis (PCA) with NumPy
- The dataset "data/sound.csv" contains two audio recordings, one per column. Principal component analysis (PCA) reduces the dataset from 2 features to 1 by projecting the standardized data onto the first principal component, the direction of greatest variance. Discarding the second component is intended to remove noise from the recording.
- The denoised dataset is stored in "data/output.csv" and saved as a .WAV file in "data/output.wav".
- The original dataset "data/sound.csv" is saved as a .WAV file in "data/sound_o.wav". You can listen to both "data/sound_o.wav" and "data/output.wav" to hear the difference between the original recording and the denoised version of that recording.

In [1]:
import numpy as np
from scipy.io import wavfile

In [2]:
samrate = 8000

In [3]:
# read .CSV dataset into array
txtData = np.genfromtxt('data/sound.csv', delimiter=',')
txtData.shape

(50000, 2)

In [4]:
# save array to .WAV file
# samrate is reused here as the amplitude scale factor for the int16 cast
scaledData = np.int16(txtData * samrate)
wavfile.write('data/sound_o.wav', samrate, scaledData)
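The cast above does not guard against scaled samples that fall outside the int16 range, and such values would not be stored correctly. A minimal sketch of a guarded conversion follows. The helper name `to_int16` is hypothetical and not part of the notebook, and it rounds to the nearest integer, whereas the plain `np.int16(...)` cast truncates.

```python
import numpy as np

def to_int16(samples, scale):
    """Scale float samples and convert to int16, clipping out-of-range values.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    scaled = np.asarray(samples, dtype=np.float64) * scale
    limits = np.iinfo(np.int16)  # -32768 .. 32767
    # round first, then clip so that the final cast cannot go out of range
    return np.clip(np.rint(scaled), limits.min, limits.max).astype(np.int16)

# small demonstration with made-up sample values
demo = to_int16(np.array([0.5, 10.0, -10.0]), 8000)
print(demo)
```

Clipping distorts loud samples, so this only protects against invalid values; choosing a smaller scale factor is the alternative if clipping occurs often.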

In [5]:
# read .WAV file into array
# The samples in sound.csv are unscaled, while the .WAV file stores them multiplied by samrate.
# To use the data read here instead of txtData, undo the scaling first: wavData = wavData / samrate
samrate, wavData = wavfile.read('data/sound_o.wav')
samrate, wavData.shape

#wavData = wavData / samrate
#wavData.shape

(8000, (50000, 2))

In [6]:
# save array to .CSV file
np.savetxt('data/sound_o.csv', txtData, delimiter=',')

In [7]:
# building a PCA model with NumPy
class PCA(object):
    def __init__(self, lr, epoch):
        # lr and epoch are stored but not used by run(); the decomposition below is closed-form
        self.lr = lr
        self.epoch = epoch

    def run(self, x, n_components):
        # standardize the data: x = (x - mean) / standard deviation,
        # so that each feature has mean 0 and standard deviation 1
        self.standardized_data = (x-x.mean(axis=0)) / x.std(axis=0)

        # covariance matrix
        self.covariance_matrix = np.cov(self.standardized_data.T)

        # eigenvalues and eigenvectors using eigendecomposition of covariance matrix
        self.eigenvalues, self.eigenvectors = np.linalg.eig(self.covariance_matrix)

        # principal components with larger eigenvalues capture more variance,
        # so sort the eigenvectors (columns) from highest to lowest eigenvalue
        self.high_to_low_eigenvalue = np.argsort(self.eigenvalues)[::-1]
        self.eigenvectors_sorted = self.eigenvectors[:,self.high_to_low_eigenvalue]

        # reduce the number of features to n_components by projecting the data onto the leading principal components
        self.processed_data = np.matmul(self.standardized_data, self.eigenvectors_sorted[:,:n_components])
        
        return self.processed_data
    
    def evaluation(self):
        print("Covariance matrix: \n",self.covariance_matrix,"\n")
        print("Eigenvalues: \n",self.eigenvalues, "\n")
        print("Eigenvectors: \n",self.eigenvectors,"\n")
        print("Eigenvectors_sorted: \n",self.eigenvectors_sorted, "\n")
        print("Shape of processed data: \n",self.processed_data.shape, "\n")
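As a sanity check on the eigendecomposition approach used in the class, the first principal component can also be obtained from a singular value decomposition of the standardized data. The sketch below uses synthetic two-channel data (an assumption, not the notebook's dataset) and `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix. The two directions should agree up to sign.

```python
import numpy as np

# synthetic stand-in data: one shared signal plus independent noise per channel
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
x = np.column_stack([signal + 0.3 * rng.normal(size=1000),
                     signal + 0.3 * rng.normal(size=1000)])

# same standardization as in PCA.run
z = (x - x.mean(axis=0)) / x.std(axis=0)

# route 1: eigendecomposition of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(z.T))   # eigenvalues come back in ascending order
pc_eigh = vecs[:, np.argmax(vals)]

# route 2: SVD of the standardized data; rows of vt are the principal directions
_, s, vt = np.linalg.svd(z, full_matrices=False)
pc_svd = vt[0]

# eigenvectors are only defined up to sign, so compare via the absolute dot product
agreement = abs(pc_eigh @ pc_svd)
print(agreement)
```

The squared singular values divided by `n - 1` match the covariance eigenvalues, which is why both routes rank the components identically.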

In [8]:
# initialize and run the model

pca1 = PCA(1,1)

processed_data = pca1.run(txtData, 1) # reducing the number of features from 2 to 1

pca1.evaluation()

Covariance matrix: 
 [[1.00002    0.00547066]
 [0.00547066 1.00002   ]] 

Eigenvalues: 
 [0.99454934 1.00549066] 

Eigenvectors: 
 [[-0.70710678 -0.70710678]
 [ 0.70710678 -0.70710678]] 

Eigenvectors_sorted: 
 [[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]] 

Shape of processed data: 
 (50000, 1) 
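
The eigenvalues printed above can be read as shares of the total variance. A short sketch that computes those shares from the printed values follows; the variable names are illustrative.

```python
import numpy as np

# eigenvalues copied from the printed output of pca1.evaluation()
eigenvalues = np.array([0.99454934, 1.00549066])

# fraction of the total variance carried by each component, largest first
explained_ratio = np.sort(eigenvalues)[::-1] / eigenvalues.sum()
print(explained_ratio)  # the first entry is roughly 0.503
```

In this run the first component carries only about half of the variance, consistent with the small off-diagonal covariance (about 0.005) between the two standardized channels. It is worth listening to "data/output.wav" to judge how much the projection actually changes the recording.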



In [9]:
# saving the data

# save array to WAV file
scaledData_processed = np.int16(processed_data * samrate)
wavfile.write('data/output.wav', samrate, scaledData_processed)

# save array to csv file
np.savetxt('data/output.csv', processed_data, delimiter=',')
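
The saved output is the one-dimensional projection in standardized units. If a two-channel result in the original units is preferred, a common PCA step is to map the projection back through the retained eigenvectors and undo the standardization. The sketch below is an assumption about how that could look and is not part of the notebook; `pca_reconstruct` is a hypothetical helper, shown on synthetic data.

```python
import numpy as np

def pca_reconstruct(x, n_components):
    """Project x onto its leading principal components and map back.

    Hypothetical helper for illustration; returns an array shaped like x.
    """
    mean, std = x.mean(axis=0), x.std(axis=0)
    z = (x - mean) / std
    vals, vecs = np.linalg.eigh(np.cov(z.T))
    order = np.argsort(vals)[::-1]           # highest eigenvalue first
    basis = vecs[:, order[:n_components]]    # retained principal directions
    scores = z @ basis                       # reduced representation
    # back to the original feature space, then undo the standardization
    return (scores @ basis.T) * std + mean

# synthetic stand-in for the two-channel recording
rng = np.random.default_rng(1)
demo_x = rng.normal(size=(200, 2)) * [2.0, 0.5] + [1.0, -3.0]
demo_rank1 = pca_reconstruct(demo_x, 1)
print(demo_rank1.shape)
```

Keeping both components reproduces the input exactly, so any difference in the rank-1 result comes only from the discarded component.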