# Building final Dataset using Ted Talks

The aim of this notebook is to prepare the dataset that a model is going to use, or to modify an existing dataset by changing its noise or incompletion percentage.

The data was found on a Mozilla page that lists semi-open datasets focused on audio files. Here is the main source: https://voice.mozilla.org/en/datasets.

![title](img/mozilla.jpg)

Here we can find the Ted Talks TEDLIUM v3.0. (License: CC-BY-NC-ND 3.0)

Once downloaded, let's see what the files look like:

# Studying the original Dataset

The downloaded folder has 221 GB of audio files as you can see:

![title](img/folder.jpg)

It has several folders with redundant files in different formats, mainly .sph and .stm.

It also comes with the files separated into train and test. The train folder has 2351 files and the test folder has 11. We'll see later whether this is an appropriate distribution of the data.




### First transformation: SPH to WAV

The SPH format is not very common, so we are going to convert all the files to WAV. WAV is one of the most common audio file formats and there are many Python libraries and functions that work with it, which makes this project a little easier.

There are several ways to do this conversion (Python functions, OS routines, specific programs, etc.). The easiest way we found was to run a script in the Windows CMD so that the OS and SoX (Sound eXchange) do all the work. This is the script:

![title](img/script.jpg)

Once the script has finished, all the files are in WAV format, ready to be read by this notebook :)
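The script itself is only shown as an image, so here is a minimal Python sketch of the same idea. It assumes that the sox executable is installed and available on the PATH; the helper names sox_command and convert_folder and the folder names are hypothetical and are not taken from the original script.

```python
import subprocess
from pathlib import Path


def sox_command(src, dst_dir):
    """Build the SoX call that converts one .sph file into a .wav file."""
    src = Path(src)
    dst = Path(dst_dir) / (src.stem + ".wav")
    # SoX normally infers the input and output formats from the extensions
    return ["sox", str(src), str(dst)]


def convert_folder(src_dir, dst_dir):
    """Convert every .sph file found in src_dir (hypothetical folders)."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.sph")):
        # check=True raises an error if SoX fails on a file
        subprocess.run(sox_command(src, dst_dir), check=True)
```

Calling convert_folder('sph', 'wav') would then convert a whole folder, one SoX call per file.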

# Functions, getting and preparing data

To load the files and work with them we are going to need the following imports:

In [1]:
# (Double check if I need all these)
import numpy as np
import pandas as pd
import os
import wave
from scipy.io import wavfile

# To play WAVs on this Notebook
import IPython.display as ipd

# For random noise
import random

# To save with the current date and time
import time

import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots
%matplotlib inline

# Try this to disable warnings
import warnings
warnings.filterwarnings('ignore')

# To see times
from tqdm import tqdm_notebook, tnrange

## Inputs

Due to memory limitations, the files are going to be loaded in two separate steps.

Let's locate and list all the WAVs:

In [2]:
def w_files():    
    
    # Load all the wavs (names) into a list

    path1 = 'D:/Datasets/Ted Talks TEDLIUM/TEDLIUM_release-3/FINAL/load1/'
    path2 = 'D:/Datasets/Ted Talks TEDLIUM/TEDLIUM_release-3/FINAL/load2/'

    WAV_list1 = os.listdir(path1)
    WAV_list2 = os.listdir(path2)

    WAV_list1 = pd.DataFrame(WAV_list1)
    WAV_list2 = pd.DataFrame(WAV_list2)

    print('First, ', len(WAV_list1), 'files are going to be loaded and these are the names:')
    display(WAV_list1)

    print('Then, ', len(WAV_list2), 'files are going to be loaded and these are the names:')
    display(WAV_list2)
    
    return path1, path2, WAV_list1, WAV_list2

The functions below cover two steps: reading the files and building the dataframe.

### Reading and loading all files

All data can be read directly from the WAV files, or from a CSV if that step has already been done.

If you load from the WAVs, each file is read with the function wavfile.read() and saved into a Pandas dataframe. This option might take more than 30 minutes.

If you load from a CSV, the data goes directly into the Pandas dataframe. This option takes about 1 minute: reading the CSV takes just a few seconds, and most of the time goes into converting the data, which is stored as strings, back into NumPy arrays.

### Building the Pandas dataframe

We take all the data and its metadata and load them into an empty dataframe.

In [3]:
def load_data(WAV_list, path):    
    
    print('\nReading audio files...\n')

    # Save data into another list
    data = [0]

    for i in (tnrange(len(WAV_list))):

        fname = path + WAV_list.iloc[i][0]

        if os.path.isfile(fname):
            # Read file
            Dato = wavfile.read(fname)
            #Dato = Dato.tolist()
            data = data + [Dato]

#             if i%50 == 0:
#                 print(len(WAV_list) - i, 'remaining')

        else:
            print("Something went wrong")

    data = pd.DataFrame([data])
    data = data.T
    data = data.iloc[1:].reset_index(drop=True)

#     if len(data) == len(WAV_list):
#         print('\nFinished successfully!')

    WAV_list['data'] = data

    # The heavy part: build the final dataframe

    # We first create the empty dataframe
    col_names =  ['name', 'sample_rate', 'original_length', 'final_length', 'data']
    df  = pd.DataFrame(columns = col_names)

    # We create empty lists to load the data and then save it into the dataframe
    all_data = []
    all_rates = []
    all_lengths = []

    for i in range (len(WAV_list)):
        all_data.append(WAV_list['data'][i][1])
        all_rates.append(WAV_list['data'][i][0])
        all_lengths.append(len(WAV_list['data'][i][1]))

    df['name'] = WAV_list[0]
    df['data'] = all_data
    df['sample_rate'] = all_rates
    df['original_length'] = all_lengths

    return df

In [4]:
def load_df(name):
    
    print('\nOpening CSV...\n')
    df = pd.read_csv(name, index_col=[0])
    df.index = pd.RangeIndex(len(df.index))
    
    print('Reading data...\n')
    # This for loop is the neat part: it turns each stored string back into a NumPy array
    for i in tnrange(df.shape[0]):
        #df.data[i] = np.fromstring(df.data[i].replace('[', r'').replace('\r\n', r''), sep=' ', dtype=np.int16)
        df.data[i] = np.fromstring(df.data[i].replace('[', r'').replace('\r\n', r''), sep=' ', dtype=np.float32)
        df.random_data[i] = np.fromstring(df.random_data[i].replace('[', r'').replace('\r\n', r''), sep=' ', dtype=np.float32)
        df.incompleted_data[i] = np.fromstring(df.incompleted_data[i].replace('[', r'').replace('\r\n', r''), sep=' ', dtype=np.float32)
        df.low_quality_data[i] = np.fromstring(df.low_quality_data[i].replace('[', r'').replace('\r\n', r''), sep=' ', dtype=np.float32)

    print('\nFinished successfully!')
    
    return df
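As a small illustration of the string-to-array step, the sketch below parses one stringified array on its own. The helper name parse_array is hypothetical, and the example string only assumes the usual NumPy text layout (brackets, spaces and line breaks); it is not a replacement for load_df.

```python
import numpy as np


def parse_array(text, dtype=np.float32):
    """Turn a string such as '[0.1 0.2\n 0.3]' back into a NumPy array."""
    # Drop the surrounding brackets, then split on any whitespace,
    # which also takes care of the line breaks NumPy inserts
    cleaned = text.strip().strip("[]")
    return np.array([float(value) for value in cleaned.split()], dtype=dtype)


example = "[ 0.125  0.089\n -0.026]"
print(parse_array(example))
```

Splitting on whitespace avoids having to replace each kind of line ending separately.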

## Preprocessing

We are going to prepare the data so that a Deep Learning model can read it.

We'll go through several steps that change the shape, length, size, etc. of this data.

NOTE: Most of these steps only have an effect if we load the data from the WAV files. If we already have a dataframe that has been through these steps, nothing will happen.

### Length

First, let's see the lengths of the original files.

In [5]:
def get_lenghts():
    
    min_len = df.original_length.min()
    max_len = df.original_length.max()
    avg_len = np.mean(df.original_length)

    print('Max Length: ', max_len, '\nAverage Length: ', int(avg_len), '\nMin Length: ', min_len);

### Rounding

Let's round each value of each audio file. This has to be done after the normalization, because normalization leaves the values with up to 18 decimals.

A listening survey was carried out to decide how many decimals to keep. It concluded that people cannot notice the difference between 8, 4 and 3 decimals of precision, so the values are rounded to 3 decimals.

In [6]:
def rounding(data):
    for x in range(0, data.size):
        data[x] = round(data[x], 3)
    return data
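The loop above rounds one sample at a time. As a sketch of a faster alternative, NumPy can round the whole array in a single call; the name rounding_vectorized is hypothetical and the sample values are made up for the example.

```python
import numpy as np


def rounding_vectorized(data, decimals=3):
    """Round every sample at once instead of looping in Python."""
    return np.round(data, decimals)


samples = np.array([0.123456, -0.98765, 0.5])
print(rounding_vectorized(samples))
```

Unlike the loop, this returns a new array and leaves the input untouched.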

### Chunks

We are going to cut all the files to the same length (10 seconds).

10 secs = 163333 frames. (At the 16000 Hz sample rate used in this notebook, 163333 frames are closer to 10.2 seconds, so the frame counts in this section are approximate.)

We are going to remove the first 27 seconds (the TED intro) and take the next 10 seconds, which contain actual speech.

NOTE: For a different final length, we just have to modify the values in the cells below.

Main parameters:

In [7]:
def generic_chunk(df, beg, f_len):
    
    new = df[['data']].copy()
    
    new['data'] = [col[beg:beg+f_len] for col in tqdm_notebook(new.data)]
    
    # In case some file was shorter than the length we set, the remaining empty part would be filled with 0's
    #df['data'] = [np.pad(col, (0, f_len-len(col)), 'constant') for col in df.data if len(df.data) < f_len]
    
    #df['final_length'] = [len(col) for col in df.data]
    
    return new
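The cut-and-pad idea can be sketched for a single array as follows. The helper name cut_segment is hypothetical; it slices the requested window and zero-pads it when the signal ends early, so every result has exactly the same length.

```python
import numpy as np


def cut_segment(samples, start, length):
    """Return samples[start:start+length], zero-padded to exactly `length`."""
    segment = np.asarray(samples)[start:start + length]
    missing = length - len(segment)
    # np.pad with (0, 0) leaves a full-length segment untouched
    return np.pad(segment, (0, missing), "constant")


audio = np.arange(10)
print(cut_segment(audio, 2, 4))   # fully inside the signal
print(cut_segment(audio, 8, 4))   # runs past the end, so it is padded
```

Because the slice can never be longer than the requested length, the padding amount is never negative.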

In [8]:
def chunk(df):
    
    # How much we want to cut from the beginning (441000 = 27 seconds):
    beg = 441000

    # How long we want the final length (163333 = 10 seconds)
    f_len = 163333
    
    df['data'] = [col[beg:beg+f_len] for col in df.data]
    
    # In case some file was shorter than the length we set, the remaining empty part is filled with 0's
    # (padding by 0 is a no-op, so every row is kept and short rows are completed)
    df['data'] = [np.pad(col, (0, f_len-len(col)), 'constant') for col in df.data]
    
    df['final_length'] = [len(col) for col in df.data]
    
    return df

To take the second segment for a bigger dataset.
From second 37 to 47

In [9]:
def chunk_2(df):
    
    # How much we want to cut from the beginning (604333 = 37 seconds):
    beg = 604333

    # How long we want the final length (163333 = 10 seconds)
    f_len = 163333
    
    df['data'] = [col[beg:beg+f_len] for col in df.data]
    
    # In case some file was shorter than the length we set, the remaining empty part is filled with 0's
    # (padding by 0 is a no-op, so every row is kept and short rows are completed)
    df['data'] = [np.pad(col, (0, f_len-len(col)), 'constant') for col in df.data]
    
    df['final_length'] = [len(col) for col in df.data]
    
    return df

To take the third segment for a bigger dataset.
From second 47 to 57

In [10]:
def chunk_3(df):
    
    # How much we want to cut from the beginning (767666 = 47 seconds):
    beg = 767666
    
    # How long we want the final length (163333 = 10 seconds)
    f_len = 163333
    
    df['data'] = [col[beg:beg+f_len] for col in df.data]
    
    # In case some file was shorter than the length we set, the remaining empty part is filled with 0's
    # (padding by 0 is a no-op, so every row is kept and short rows are completed)
    df['data'] = [np.pad(col, (0, f_len-len(col)), 'constant') for col in df.data]
    
    df['final_length'] = [len(col) for col in df.data]
    
    return df

### Chunking the DF to export

In [11]:
def preparing_df(df):
    if 'sample_rate' in df:
        df = df.drop(['sample_rate'], axis=1)
        
    if 'original_length' in df:
        df = df.drop(['original_length'], axis=1)
        
    if 'final_length' in df:
        df = df.drop(['final_length'], axis=1)
        
    return df

## Normalization

### (-1, 1)

This is the normalization used in the rest of the notebook.

Let's try one of the typical normalizations, between -1 and 1.

In [12]:
def norm_b(data):
    max_data = np.max(data)
    min_data = np.min(data)
    if abs(min_data) > max_data:
        max_data = abs(min_data)
    data = data / max_data
    return data
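A slightly more defensive variant of the same peak normalization is sketched below. The name peak_normalize is hypothetical. It converts to float first, so that int16 input cannot overflow when taking absolute values, and it returns silent (all-zero) signals unchanged instead of dividing by zero.

```python
import numpy as np


def peak_normalize(data):
    """Scale a signal so that its largest absolute value becomes 1."""
    # Work in float so that int16 input does not overflow in abs()
    data = np.asarray(data, dtype=np.float64)
    peak = np.max(np.abs(data)) if data.size else 0.0
    if peak == 0:
        # Silent (all-zero) or empty signal: nothing to scale
        return data
    return data / peak


raw = np.array([-32768, 16384, 0], dtype=np.int16)
print(peak_normalize(raw))
```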

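### (0, 1)
Check which normalization works best with audio files. Note that the function below scales the data to (0, 1) and then subtracts 0.5, so its output lies roughly between -0.5 and 0.5.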

In [13]:
def audio_norm(data):
    max_data = np.max(data)
    min_data = np.min(data)
    data = (data-min_data) / (max_data - min_data+1e-6)
    return data - 0.5

### Other
Another normalization, from scikit-learn, was also tried here.

(It was removed because it was not useful at all.)

## Generating the broken files

Here we are going to prepare the broken versions of the audio files, so that we have pairs of a normal-quality file and a broken file.

We want different kinds of broken files: low quality, noise, missing information, etc.

Different methods are going to be used for this.

We start by generating one single broken version per file. Once this works, we'll see if it's necessary to apply every breaking method to each file.

### Random Noise

This is the first method. We already know that in most real cases the incompletion is not going to be caused by random noise, but it is a good starting point for the new broken files.

In [14]:
def br_random(data, p_no):
    
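    mu = 0       # mean of the Gaussian noise
    sigma = 0.15 # standard deviation of the Gaussian noise
    
    # Boolean mask with the same size as the input: True marks the samples replaced by noise
    mask = np.random.choice([0, 1], size=data.size, p=[1-p_no, p_no])
    mask = np.array(mask, dtype=bool)
    
    sine = np.random.normal(mu, sigma, data.size)
    
    broken = data.copy()
    broken[mask] = sine[mask]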
    
    return broken
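For reproducible experiments, the same masking idea can be sketched with a seeded NumPy generator. The name add_random_noise and the seed parameter are assumptions made for this example; the function also returns the mask so that the replaced positions can be inspected.

```python
import numpy as np


def add_random_noise(data, p_no, sigma=0.15, seed=None):
    """Replace a fraction p_no of the samples with Gaussian noise."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=np.float64)
    # True marks the samples that get replaced
    mask = rng.random(data.size) < p_no
    noise = rng.normal(0.0, sigma, data.size)
    broken = data.copy()
    broken[mask] = noise[mask]
    return broken, mask


clean = np.full(10000, 0.5)
broken, mask = add_random_noise(clean, p_no=0.3, seed=0)
print(mask.mean())  # fraction of replaced samples, close to p_no
```

Passing the same seed twice gives the same broken file, which helps when comparing models.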

### Incomplete Noise

This is the second method, it will remove some values and place 0's instead.

In [15]:
def br_incompletion(data, p_in):
    
    # The positions where I want to place the 0's
    maska = np.random.choice([0, 1], size=data.size, p=[1-p_in, p_in])
    
    data = np.ma.array(data, mask=maska, fill_value=0)
    resultado = data.filled()

    return resultado
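The same effect can be sketched without a masked array, using a boolean mask and np.where. The name zero_out and the seed parameter are hypothetical and only serve this example.

```python
import numpy as np


def zero_out(data, p_in, seed=None):
    """Set a fraction p_in of the samples to 0 and keep the rest."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # True marks the samples that are dropped
    drop = rng.random(data.shape) < p_in
    return np.where(drop, 0, data)


signal = np.ones(10000)
incomplete = zero_out(signal, p_in=0.7, seed=1)
print((incomplete == 0).mean())  # fraction of zeroed samples, close to p_in
```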

### Low Quality

The third method consists of lowering the sampling rate from 16 kHz to 8 kHz by keeping one sample out of every two.

In [16]:
def br_low_q(data):
    data = data[1::2]
    return data

Let's try this downsampling with the Librosa library

In [17]:
# import librosa    
# y, s = librosa.load('test.wav', sr=8000)
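Keeping every second sample, as br_low_q does, applies no low-pass filter, so content above 4 kHz can alias into the result. As a hedged alternative sketch (a different technique from the slicing above, not what the notebook uses), scipy.signal.resample_poly filters before decimating. The name downsample_by_two and the 440 Hz test tone are assumptions for this example.

```python
import numpy as np
from scipy.signal import resample_poly


def downsample_by_two(data):
    """Halve the sample rate (for example 16 kHz to 8 kHz) with low-pass filtering."""
    # resample_poly applies an anti-aliasing FIR filter before keeping
    # one sample out of every two
    return resample_poly(np.asarray(data, dtype=np.float64), up=1, down=2)


t = np.arange(16000) / 16000          # one second at 16 kHz
tone = np.sin(2 * np.pi * 440 * t)    # 440 Hz test tone
low = downsample_by_two(tone)
print(tone.size, low.size)
```

The output has the same length as the sliced version, so it could be stored in the same column.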

## Outputs

### Save current work

Function to save the current Pandas dataframe into a CSV for future work.

First, we have to set the NumPy print threshold to a high value to make sure the arrays are not truncated and no data is lost. Right after writing the CSV we set the threshold back to the default parameters so that NumPy arrays print properly again.

NOTE: np.set_printoptions(threshold=np.inf) should also work instead of setting the threshold to a specific high value.

In [18]:
def save_work(df, name):
    
    print('\nExporting to CSV. This might take a while...\n')
    #np.set_printoptions(threshold=200000)
    np.set_printoptions(threshold=164000)
    df.to_csv(name+'.csv')
    np.set_printoptions(edgeitems=3,infstr='inf', linewidth=75, nanstr='nan', precision=8, suppress=False, threshold=1000, formatter=None)
    print('\nFinished successfully!')
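The set-then-restore pattern can also be sketched with the np.printoptions context manager, which puts the previous settings back automatically when the block ends. The sketch assumes the default threshold of 1000 is active beforehand; the array big is only an example.

```python
import sys
import numpy as np

big = np.arange(2000)

# Inside the block the threshold is raised, so str() prints every element
with np.printoptions(threshold=sys.maxsize):
    full_text = str(big)

# Outside the block the previous print options are back in place
short_text = str(big)

print(len(full_text), len(short_text))
```

With this pattern the defaults do not have to be written out by hand after the export.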

### Export group of the 4 files

Here we export groups of 4 files: original, random noise, incompleted and low quality.
The starting file is chosen at random, and the function receives how many groups are going to be exported.

In [19]:
def export_groups(df, num):
    
    if num !=0:
        
        # randint includes both ends, so the start is limited to keep i + r_fi inside the dataframe
        r_fi = random.randint(0, len(df) - num)

        for i in tnrange(0, num):
            wavfile.write(('outputs/Noise/'+'{}'.format(i))+'_original.wav', 16000, df['data'][i+r_fi])
            wavfile.write(('outputs/Noise/'+'{}'.format(i))+'_random.wav', 16000, df['random_data'][i+r_fi])
            wavfile.write(('outputs/Noise/'+'{}'.format(i))+'_incompl.wav', 16000, df['incompleted_data'][i+r_fi])
            wavfile.write(('outputs/Noise/'+'{}'.format(i))+'_low_q.wav', 8000, df['low_quality_data'][i+r_fi])
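Choosing the random starting row deserves some care, because random.randint includes both of its bounds. A minimal sketch of a safe choice follows; the helper name pick_start is hypothetical.

```python
import random


def pick_start(total, count, rng=random):
    """Pick a start index so that start .. start+count-1 are all valid rows."""
    if count > total:
        raise ValueError("cannot export more files than the dataframe holds")
    # randint includes both ends, so the largest valid start is total - count
    return rng.randint(0, total - count)


start = pick_start(total=10880, count=5)
print(start, start + 4)
```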

### Export to WAV files

We can also export some of the files back into WAV files but with the new lengths, shapes, etc. We are going to use the wavfile.write() function.

Lets export some of them:

In [20]:
def export_files():
    
    option = -1
    
    # Keep asking until a valid option is entered; every valid branch ends with break
    while True:

        print('- Enter 0 to export 10 random files')
        print('- Enter 1 to export the first 10 files')
        print('- Enter 2 to export one specific file')

        option = input()
        option = int(option)
        
        if option == 0:

            for i in range(10):
                # randint includes both ends, so the last valid row is len(df) - 1
                r_fi = random.randint(0, len(df) - 1)
                #wavfile.write(('Exported/test'+'{}'.format(r_fi))+'.wav', df['sample_rate'][r_fi], df['data'][r_fi])
                wavfile.write(('outputs/Exported/Random_exported'+'{}'.format(i))+'_'+'{}'.format(r_fi)+'.wav', 16000, df['data'][r_fi])
            break
            
        elif option == 1:

            for i in range(10):
                #wavfile.write(('Exported/test'+'{}'.format(i))+'.wav', df['sample_rate'][i], df['data'][i])
                wavfile.write(('outputs/Exported/10first_exported'+'{}'.format(i))+'_'+'{}'.format(i)+'.wav', 16000, df['data'][i])
            break
            
        elif option == 2:
            
            print('\nEnter the file id you want to export:')
            fia = input()
            fi = int(fia)

            wavfile.write(('outputs/Exported/Specific_exported'+'{}'.format(fi))+'_'+'{}'.format(fi)+'.wav', 16000, df['data'][fi])
            break
            
        else:
            print('Not a valid option, try again...\n\n')

            
    print('Exported successfully')
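The menu loop can be isolated into a small helper that also survives non-numeric input, which int() alone would reject with an error. This is only a sketch: the name ask_option and the read and write parameters are assumptions, added so the helper can be tried without typing.

```python
def ask_option(valid, read=input, write=print):
    """Keep asking until the answer is one of the integers in `valid`."""
    while True:
        answer = read().strip()
        # isdigit() rejects empty strings, signs and letters before int() runs
        if answer.isdigit() and int(answer) in valid:
            return int(answer)
        write("Not a valid option, try again...")


# Simulated answers instead of keyboard input
answers = iter(["x", "7", "2"])
print(ask_option({0, 1, 2}, read=lambda: next(answers)))
```

In the notebook it would be called as ask_option({0, 1, 2}) so that it reads from the keyboard.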

# MAIN

### Parameters and values

In [21]:
# Frame where we start recording for the df
beginning = 441000

# Audio final length (163333 = 10sec, 81666 = 5sec, 32666 = 2sec)
final_l = 32666 # 2 Seconds

# Sample rate of all the audio files
sample_rate = 16000

# Lengths of all audios once chunked 
final_length = 163333

# To chose if load from WAV's or from an existing DF
option = -1

# Percentage of noise
p_no = 0.3

# Percentage of incompletion
p_in = 0.7

### Load data

In [22]:
print('To check beginnings and lengths:')
print('First file:  ', beginning + (0*final_l), final_l)
print('Second file: ', beginning + (1*final_l), final_l)
print('Third file:  ', beginning + (2*final_l), final_l)
print('Fourth file: ', beginning + (3*final_l), final_l)

To check beginnings and lengths:
First file:   441000 32666
Second file:  473666 32666
Third file:   506332 32666
Fourth file:  538998 32666
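The start frames printed above follow a simple rule: each segment starts where the previous one ends. A small sketch of that arithmetic, with the hypothetical helper name segment_starts:

```python
def segment_starts(beginning, length, count):
    """Start frame of each consecutive, non-overlapping segment."""
    return [beginning + k * length for k in range(count)]


print(segment_starts(441000, 32666, 4))
```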


In [23]:
# Keep asking until a valid option is entered; every valid branch ends with break
while True:

    print('Do you want to load the data from the WAVs or from a CSV?')
    print('Enter 0 for WAVs or 1 for CSV:')
    option = input()
    option = int(option)

    if option == 0:
        
        path1, path2, WAV_list1, WAV_list2 = w_files()
        
        df1 = load_data(WAV_list1, path1)
        time.sleep(60)
        df2 = load_data(WAV_list2, path2)
        
        frames = [df1, df2]
        df = pd.concat(frames)
        
        # CALL FROM HERE ALL OTHER FUNCTIONS TO PREPARE DF
        #df = chunk(df)
        #df = chunk_2(df)
        #df = chunk_3(df)
        
        # The new chunks come here:
        s_df0 = generic_chunk(df, beginning + (0*final_l), final_l)
        s_df1 = generic_chunk(df, beginning + (1*final_l), final_l)
        s_df2 = generic_chunk(df, beginning + (2*final_l), final_l)
        s_df3 = generic_chunk(df, beginning + (3*final_l), final_l)
        
        frames = [s_df0, s_df1, s_df2, s_df3]
        df = pd.concat(frames)
        
        df.index = pd.RangeIndex(len(df.index))
        break
        
    elif option == 1:
        
        print('Enter the files name:')
        name = input()
        df = load_df('data/'+name+'.csv')
        break
        
    else:
        print('Not a valid option, try again...\n\n')

Do you want to load the data from the WAVs or from a CSV?
Enter 0 for WAVs or 1 for CSV:
1
Enter the files name:
df_all_data

Opening CSV...

Reading data...






Finished successfully!


### Check what's loaded

In [24]:
print('DFs shape: ', df.shape)
display(df.head())

DFs shape:  (10880, 4)


|   | data | random_data | incompleted_data | low_quality_data |
|---|---|---|---|---|
| 0 | [-0.031, -0.024, -0.017, -0.001, -0.026, -0.03... | [-0.031, -0.277, -0.017, -0.001, -0.026, -0.03... | [0.0, 0.0, 0.0, 0.0, 0.0, -0.038, -0.043, 0.0,... | [-0.024, -0.001, -0.038, -0.062, -0.066, -0.09... |
| 1 | [0.125, 0.089, 0.058, 0.027, -0.026, -0.068, -... | [0.125, 0.089, -0.132, -0.096, -0.026, -0.068,... | [0.0, 0.089, 0.0, 0.0, 0.0, -0.068, 0.0, 0.0, ... | [0.089, 0.027, -0.068, -0.113, -0.154, -0.127,... |
| 2 | [0.133, 0.154, 0.16, 0.192, 0.259, 0.295, 0.23... | [-0.06, -0.091, 0.16, 0.192, -0.103, 0.295, 0.... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.295, 0.0, 0.0, 0.0... | [0.154, 0.192, 0.295, 0.15, 0.052, -0.008, -0.... |
| 3 | [-0.009, 0.059, 0.085, 0.061, -0.031, -0.116, ... | [-0.009, 0.316, 0.085, -0.055, -0.031, -0.116,... | [0.0, 0.0, 0.0, 0.061, 0.0, 0.0, 0.0, 0.0, 0.0... | [0.059, 0.061, -0.116, -0.116, -0.004, 0.057, ... |
| 4 | [0.017, 0.028, 0.029, 0.027, 0.018, 0.021, 0.0... | [0.175, 0.028, 0.029, 0.027, 0.018, 0.021, -0.... | [0.0, 0.028, 0.0, 0.0, 0.018, 0.0, 0.0, 0.0, 0... | [0.028, 0.027, 0.021, 0.043, 0.065, 0.05, 0.03... |


In [25]:
print('If you are building a new Dataset, enter 0')
print('If you are checking an existing CSV or modifying rates, enter 1')
mods = input()
mods = int(mods)

if mods == 0:
    print('Nice, keep it goin')

elif mods == 1:
    print('\nThe current dataset is:')
    print('DFs shape: ', df.shape)
    display(df.head())

else:
    print('Wrong option')

If you are building a new Dataset, enter 0
If you are checking an existing CSV or modifying rates, enter 1
1

The current dataset is:
DFs shape:  (10880, 4)


|   | data | random_data | incompleted_data | low_quality_data |
|---|---|---|---|---|
| 0 | [-0.031, -0.024, -0.017, -0.001, -0.026, -0.03... | [-0.031, -0.277, -0.017, -0.001, -0.026, -0.03... | [0.0, 0.0, 0.0, 0.0, 0.0, -0.038, -0.043, 0.0,... | [-0.024, -0.001, -0.038, -0.062, -0.066, -0.09... |
| 1 | [0.125, 0.089, 0.058, 0.027, -0.026, -0.068, -... | [0.125, 0.089, -0.132, -0.096, -0.026, -0.068,... | [0.0, 0.089, 0.0, 0.0, 0.0, -0.068, 0.0, 0.0, ... | [0.089, 0.027, -0.068, -0.113, -0.154, -0.127,... |
| 2 | [0.133, 0.154, 0.16, 0.192, 0.259, 0.295, 0.23... | [-0.06, -0.091, 0.16, 0.192, -0.103, 0.295, 0.... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.295, 0.0, 0.0, 0.0... | [0.154, 0.192, 0.295, 0.15, 0.052, -0.008, -0.... |
| 3 | [-0.009, 0.059, 0.085, 0.061, -0.031, -0.116, ... | [-0.009, 0.316, 0.085, -0.055, -0.031, -0.116,... | [0.0, 0.0, 0.0, 0.061, 0.0, 0.0, 0.0, 0.0, 0.0... | [0.059, 0.061, -0.116, -0.116, -0.004, 0.057, ... |
| 4 | [0.017, 0.028, 0.029, 0.027, 0.018, 0.021, 0.0... | [0.175, 0.028, 0.029, 0.027, 0.018, 0.021, -0.... | [0.0, 0.028, 0.0, 0.0, 0.018, 0.0, 0.0, 0.0, 0... | [0.028, 0.027, 0.021, 0.043, 0.065, 0.05, 0.03... |


Let's trim every file to its first final_l samples

In [26]:
if mods == 0:
    for i in range(len(df)):
        df.data[i] = df.data[i][:final_l]

    print(df.data[0].size, df.data[10].size)

### Normalizing data

In [27]:
if mods == 0:
    for i in tnrange(len(df)):
        df.data[i] = norm_b(df.data[i])

### Round data

In [28]:
if mods == 0:
    for i in tnrange(len(df)):
        df.data[i] = rounding(df.data[i])

Let's export some files to check that everything has been loaded and normalized properly

In [29]:
if mods == 0:
    export_files()
    print('Original files exported in outputs/Exported')

### Generate the noise

In [30]:
if mods == 1:
    
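    # randint includes both ends, so the last valid row is len(df) - 1
    r_fi = random.randint(0, len(df) - 1)

    print('Here are some examples of the files:\n')
    # Inside an if block the players only show up when passed to display()
    print('Original:')
    display(ipd.Audio(df.data[r_fi], rate=16000))
    print('Random noise:')
    display(ipd.Audio(df.random_data[r_fi], rate=16000))
    print('Incompleted:')
    display(ipd.Audio(df.incompleted_data[r_fi], rate=16000))
    print('Low Quality:')
    display(ipd.Audio(df.low_quality_data[r_fi], rate=8000))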

Here are some examples of the files:

Original:
Random noise:
Incompleted:
Low Quality:


In [31]:
if mods == 1:
    print('Do you want to modify the noise and broken rates?')
    print('The current values are:')
    print('Percentage of noise:', p_no)
    print('Percentage of incompletion:', p_in)
    print('\nEnter 0 for NO or 1 for YES')
    change = input()
    change = int(change)

    if change == 0:
        print('\nThe current dataset is:')
        print('DFs shape: ', df.shape)
        display(df.head())

    elif change == 1:
        print('Enter the new Percentage of noise:')
        new_p_no = input()
        p_no = float(new_p_no)
        print('\nEnter the new Percentage of incompletion:')
        new_p_in = input()
        p_in = float(new_p_in)
        print('\nValues changed successfully!')

    else:
        print('Wrong option')

Do you want to modify the noise and broken rates?
The current values are:
Percentage of noise: 0.3
Percentage of incompletion: 0.7

Enter 0 for NO or 1 for YES
0

The current dataset is:
DFs shape:  (10880, 4)


|   | data | random_data | incompleted_data | low_quality_data |
|---|---|---|---|---|
| 0 | [-0.031, -0.024, -0.017, -0.001, -0.026, -0.03... | [-0.031, -0.277, -0.017, -0.001, -0.026, -0.03... | [0.0, 0.0, 0.0, 0.0, 0.0, -0.038, -0.043, 0.0,... | [-0.024, -0.001, -0.038, -0.062, -0.066, -0.09... |
| 1 | [0.125, 0.089, 0.058, 0.027, -0.026, -0.068, -... | [0.125, 0.089, -0.132, -0.096, -0.026, -0.068,... | [0.0, 0.089, 0.0, 0.0, 0.0, -0.068, 0.0, 0.0, ... | [0.089, 0.027, -0.068, -0.113, -0.154, -0.127,... |
| 2 | [0.133, 0.154, 0.16, 0.192, 0.259, 0.295, 0.23... | [-0.06, -0.091, 0.16, 0.192, -0.103, 0.295, 0.... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.295, 0.0, 0.0, 0.0... | [0.154, 0.192, 0.295, 0.15, 0.052, -0.008, -0.... |
| 3 | [-0.009, 0.059, 0.085, 0.061, -0.031, -0.116, ... | [-0.009, 0.316, 0.085, -0.055, -0.031, -0.116,... | [0.0, 0.0, 0.0, 0.061, 0.0, 0.0, 0.0, 0.0, 0.0... | [0.059, 0.061, -0.116, -0.116, -0.004, 0.057, ... |
| 4 | [0.017, 0.028, 0.029, 0.027, 0.018, 0.021, 0.0... | [0.175, 0.028, 0.029, 0.027, 0.018, 0.021, -0.... | [0.0, 0.028, 0.0, 0.0, 0.018, 0.0, 0.0, 0.0, 0... | [0.028, 0.027, 0.021, 0.043, 0.065, 0.05, 0.03... |


In [32]:
if (mods == 0 ) or (mods == 1 and change == 1):
    print('This is when the broken versions should be regenerated, otherwise not')

Creation of the random noise

In [33]:
if (mods == 0) or (mods == 1 and change == 1):
    
    if 'random_data' in df:
        df = df.drop(['random_data'], axis=1)

    broken_r = []

    for i in tnrange(len(df)):
        broken_r.append(br_random(df.data[i], p_no))

    df['random_data'] = broken_r

Creation of the incompleted files

In [34]:
if (mods == 0) or (mods == 1 and change == 1):

    if 'incompleted_data' in df:
        df = df.drop(['incompleted_data'], axis=1)

    broken_i = []

    for i in tnrange(len(df)):
        broken_i.append(br_incompletion(df.data[i], p_in))

    df['incompleted_data'] = broken_i

Creation of the low quality files

In [35]:
if (mods == 0) or (mods == 1 and change == 1):

    if 'low_quality_data' in df:
        df = df.drop(['low_quality_data'], axis=1)

    broken_l = []

    for i in tnrange(len(df)):
        broken_l.append(br_low_q(df.data[i]))

    df['low_quality_data'] = broken_l

### Round new data

Now the newly created data has to be rounded again, since the broken files might also have more than 3 decimals.

In [36]:
if (mods == 0) or (mods == 1 and change == 1):

    for i in tnrange(len(df)):
        df.random_data[i] = rounding(df.random_data[i])
        df.incompleted_data[i] = rounding(df.incompleted_data[i])

Let's export some of these files with their 3 broken versions to hear how they sound

In [37]:
print('Do you want to export some final files in WAV format?')
print('\nEnter how many different files you want to listen. 0 for none.')
exports = input()
exports = int(exports)

export_groups(df, exports)

print('Files exported in: outputs/Noise')

Do you want to export some final files in WAV format?

Enter how many different files you want to listen. 0 for none.
1




Files exported in: outputs/Noise


### Prepare DF to be exported

In [38]:
# Just in case...
df = preparing_df(df)

print('DFs shape: ', df.shape)
display(df.head())

DFs shape:  (10880, 4)


|   | data | random_data | incompleted_data | low_quality_data |
|---|---|---|---|---|
| 0 | [-0.031, -0.024, -0.017, -0.001, -0.026, -0.03... | [-0.031, -0.277, -0.017, -0.001, -0.026, -0.03... | [0.0, 0.0, 0.0, 0.0, 0.0, -0.038, -0.043, 0.0,... | [-0.024, -0.001, -0.038, -0.062, -0.066, -0.09... |
| 1 | [0.125, 0.089, 0.058, 0.027, -0.026, -0.068, -... | [0.125, 0.089, -0.132, -0.096, -0.026, -0.068,... | [0.0, 0.089, 0.0, 0.0, 0.0, -0.068, 0.0, 0.0, ... | [0.089, 0.027, -0.068, -0.113, -0.154, -0.127,... |
| 2 | [0.133, 0.154, 0.16, 0.192, 0.259, 0.295, 0.23... | [-0.06, -0.091, 0.16, 0.192, -0.103, 0.295, 0.... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.295, 0.0, 0.0, 0.0... | [0.154, 0.192, 0.295, 0.15, 0.052, -0.008, -0.... |
| 3 | [-0.009, 0.059, 0.085, 0.061, -0.031, -0.116, ... | [-0.009, 0.316, 0.085, -0.055, -0.031, -0.116,... | [0.0, 0.0, 0.0, 0.061, 0.0, 0.0, 0.0, 0.0, 0.0... | [0.059, 0.061, -0.116, -0.116, -0.004, 0.057, ... |
| 4 | [0.017, 0.028, 0.029, 0.027, 0.018, 0.021, 0.0... | [0.175, 0.028, 0.029, 0.027, 0.018, 0.021, -0.... | [0.0, 0.028, 0.0, 0.0, 0.018, 0.0, 0.0, 0.0, 0... | [0.028, 0.027, 0.021, 0.043, 0.065, 0.05, 0.03... |


Save into a csv

In [39]:
print('Do you want to save the DF into a CSV?')
print('Enter 0 to exit or 1 to save')
saave = input()
saave = int(saave)

if saave == 0:
    print('\nThis is the final DF')
    print('DFs shape: ', df.shape)
    display(df.head())

elif saave == 1:
    
    print('Enter the name for the CSV')
    print('Warning: if you enter an existing name, you will lose that CSV')
    name = input()
    # Pass the name that was typed in, not the literal string 'name'
    save_work(df, name)

else:
    print('Wrong option')

Do you want to save the DF into a CSV?
Enter 0 to exit or 1 to save
0

This is the final DF
DFs shape:  (10880, 4)


|   | data | random_data | incompleted_data | low_quality_data |
|---|---|---|---|---|
| 0 | [-0.031, -0.024, -0.017, -0.001, -0.026, -0.03... | [-0.031, -0.277, -0.017, -0.001, -0.026, -0.03... | [0.0, 0.0, 0.0, 0.0, 0.0, -0.038, -0.043, 0.0,... | [-0.024, -0.001, -0.038, -0.062, -0.066, -0.09... |
| 1 | [0.125, 0.089, 0.058, 0.027, -0.026, -0.068, -... | [0.125, 0.089, -0.132, -0.096, -0.026, -0.068,... | [0.0, 0.089, 0.0, 0.0, 0.0, -0.068, 0.0, 0.0, ... | [0.089, 0.027, -0.068, -0.113, -0.154, -0.127,... |
| 2 | [0.133, 0.154, 0.16, 0.192, 0.259, 0.295, 0.23... | [-0.06, -0.091, 0.16, 0.192, -0.103, 0.295, 0.... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.295, 0.0, 0.0, 0.0... | [0.154, 0.192, 0.295, 0.15, 0.052, -0.008, -0.... |
| 3 | [-0.009, 0.059, 0.085, 0.061, -0.031, -0.116, ... | [-0.009, 0.316, 0.085, -0.055, -0.031, -0.116,... | [0.0, 0.0, 0.0, 0.061, 0.0, 0.0, 0.0, 0.0, 0.0... | [0.059, 0.061, -0.116, -0.116, -0.004, 0.057, ... |
| 4 | [0.017, 0.028, 0.029, 0.027, 0.018, 0.021, 0.0... | [0.175, 0.028, 0.029, 0.027, 0.018, 0.021, -0.... | [0.0, 0.028, 0.0, 0.0, 0.018, 0.0, 0.0, 0.0, 0... | [0.028, 0.027, 0.021, 0.043, 0.065, 0.05, 0.03... |


# THE END

---

## Tests & Other

In [40]:
def escuchar_aqui():

    # Return the player so that the notebook renders it as the cell output
    return ipd.Audio(df.random_data[0], rate=16000)


In [41]:
escuchar_aqui()
ipd.Audio(df.random_data[0], rate=16000)

In [42]:
akg = names+'.csv'

NameError: name 'names' is not defined

In [None]:
akg

### Encoder and decoder for img

In [None]:
from PIL import Image

In [None]:
arch = df.incompleted_data[1000].copy()
arch  = arch[:81225]

print('Audio: ', arch)
print('Min:',np.amin(arch), ', Max:',np.amax(arch))

#### Encode

In [None]:
for x in range(0, arch.size):
    arch[x] = arch[x]*127.5 + 127.5
    
#arch = np.array(arch, dtype=np.int16)
arch = np.array(arch, dtype=np.uint8)

# Reshape for the image and reshape for the audio
arch  = np.reshape(arch, (-1, 285))
imag  = Image.fromarray(arch)
arch = np.reshape(arch, 81225)

print('Audio: ', arch)
print('Min:',np.amin(arch), ', Max:',np.amax(arch))

#### Decode

In [None]:
arch = np.array(arch, dtype=np.float64)

for x in range(0, arch.size):
    # Inverse of the encoding step, which used a scale and offset of 127.5
    arch[x] = (arch[x]-127.5)/127.5
    
arch = rounding(arch)

print('Audio: ', arch)
print('Min:',np.amin(arch), ', Max:',np.amax(arch))
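The encode and decode steps can be sketched as a pair of vectorized functions that use the same scale and offset in both directions. The names audio_to_bytes and bytes_to_audio are hypothetical, and the linear ramp is only a stand-in for a real audio clip with values in [-1, 1].

```python
import numpy as np


def audio_to_bytes(audio):
    """Map samples in [-1, 1] to integers in [0, 255]."""
    scaled = np.asarray(audio, dtype=np.float64) * 127.5 + 127.5
    # Round to the nearest integer and keep the result inside the uint8 range
    return np.clip(np.round(scaled), 0, 255).astype(np.uint8)


def bytes_to_audio(pixels):
    """Inverse mapping, using the same 127.5 scale and offset."""
    return (np.asarray(pixels, dtype=np.float64) - 127.5) / 127.5


audio = np.linspace(-1.0, 1.0, 285 * 285)      # 81225 samples
image = audio_to_bytes(audio).reshape(285, 285)
restored = bytes_to_audio(image.reshape(-1))
print(np.max(np.abs(restored - audio)))         # quantization error
```

The round trip is lossy: storing each sample in 8 bits limits the error to half a quantization step, which is 0.5 / 127.5.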

In [None]:
plt.plot(arch)

In [None]:
imag

In [None]:
wavfile.write('QUE_CERA.wav', 16000, arch)

### Useful things and functions

In [None]:
# To save things with the current time and date

print(int(time.time()))
# Dashes instead of colons in the time, because colons are not allowed in Windows file names
print('/data/df_'+time.strftime('%Y-%m-%d_%H-%M-%S'+'.csv', time.localtime()))

In [None]:
# To make a copy of a DF so that the original one is not modified by mistake

back = df.copy()

In [None]:
# To merge several DFs into a single one

frames = [df_final_1, df_final_2, df_final_3]
df_FINAL = pd.concat(frames)

In [None]:
# To reset the index of a DF

df_FINAL.index = pd.RangeIndex(len(df_FINAL.index))

In [None]:
# To delete columns from a DF by position (positions are zero-based)

df = df.drop(df.columns[[1, 2]], axis=1)

In [None]:
# To export from some files the original, random and incompleted version

for i in range(0, 3):
    wavfile.write(('Noise/original_'+'{}'.format(i))+'.wav', 16000, df['data'][i])
    wavfile.write(('Noise/random_'+'{}'.format(i))+'.wav', 16000, df['random_data'][i])
    wavfile.write(('Noise/incompl_'+'{}'.format(i))+'.wav', 16000, df['incompleted_data'][i])
    

---

### Some other scratch code I might need

In [None]:
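tests1 = df.data[4].copy()

# tmin and tmax (the value range of the noise) must be defined beforehand.
# A while loop is needed: reassigning x inside a for loop does not skip iterations.
x = 0
while x < tests1.size:
    ran = random.randint(0, 1000)
    if ran % 100 == 0:
        # Overwrite a burst of up to 10 samples with random values
        for y in range(0, 10):
            if x+y < tests1.size:
                tests1[x+y] = random.randint(tmin, tmax)
        x = x+10
    else:
        x = x+1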

In [None]:
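tests2 = df.data[nf].copy()

# A while loop is needed: reassigning x inside a for loop does not skip iterations.
x = 0
while x < tests2.size:
    ran = random.randint(0, 100)
    if ran % 20 == 0:
        # Silence a burst of up to 10000 samples
        for y in range(0, 10000):
            if x+y < tests2.size:
                tests2[x+y] = 0
        x = x+10000
    else:
        x = x+1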
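
The two cells above follow the same pattern: walk through the samples and, with some probability, overwrite a burst of them. A minimal sketch of that pattern as one function is shown here. `corrupt_in_bursts` and its parameters are hypothetical names, and it draws random numbers from NumPy's `default_rng` instead of the `random` module.

```python
import numpy as np

def corrupt_in_bursts(data, start_prob, burst_len, fill, seed=None):
    """Overwrite random bursts of `data` with values produced by `fill` (hypothetical helper).

    fill(n, rng) must return n replacement samples.
    """
    rng = np.random.default_rng(seed)
    out = data.copy()                      # never touch the original array
    x = 0
    while x < out.size:
        if rng.random() < start_prob:      # start a burst at this position
            end = min(x + burst_len, out.size)
            out[x:end] = fill(end - x, rng)
            x = end                        # skip past the burst
        else:
            x += 1
    return out

signal = np.linspace(-1.0, 1.0, 1000)
silenced = corrupt_in_bursts(signal, 0.01, 50, lambda n, rng: np.zeros(n), seed=0)
noisy = corrupt_in_bursts(signal, 0.01, 10, lambda n, rng: rng.uniform(-1.0, 1.0, n), seed=0)
```

Passing a different `fill` function switches between silencing samples and replacing them with noise without duplicating the loop.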

In [None]:
import librosa

In [None]:
y, s = librosa.load('D:/Datasets/Ted Talks TEDLIUM/TEDLIUM_release-3/FINAL/PRUEBAS/AaronOConnell_2011.wav', sr=8000) # Downsample to 8kHz

In [None]:
# Listen to this talk, it looks fun

temp['name'][100]

---

In [None]:
new = pd.DataFrame()


---

In [None]:
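# Parse each array back from its string form; both brackets must be stripped before parsing
for i in range(df.shape[0]):
    df.data[i] = np.fromstring(df.data[i].replace('[', r'').replace(']', r'').replace('\r\n', r''), sep=' ', dtype=np.float16)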

In [None]:
ttest

In [None]:
ttest = np.array(df.data[15], dtype=np.float16)

In [None]:
wavfile.write('ttest.wav', 16000, ttest)

In [None]:
def TESTES(data_o):
    
    # The positions where I want to place the 0's
    mask = np.zeros(data_o.size)
    
    print(mask)
    print(type(data_o), type( mask))
#     for x in range(0, data_o.size):
#         ran = random.randint(0, 10)
#         if ran % 2 == 0:
#             mask[x] = 0
            
    # mask must be passed by keyword: the second positional argument of np.ma.array is dtype
    data = np.ma.array(data_o, mask=mask, fill_value=1)
    resultado = data.filled()
    
    print(resultado)
    
    #return data.filled()
    return data

In [None]:
broken_i = []

for i in range(len(df_test)):
    broken_i.append(TESTES(df_test.data[i]))
    
df_test['incompleted_data'] = broken_i

In [None]:
df_test

In [None]:
arr =  np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

holas = np.ma.array(arr, mask=[0, 1, 0, 1, 0, 1, 0, 1, 0, 1], fill_value=99)
print(holas.filled())

In [None]:
arr =   np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
maska = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

holass = np.ma.array(arr, mask=maska, fill_value=99)
print(holass.filled())

In [None]:
rod = np.random.choice([0, 1], size=df.data[0].size, p=[0.6, 0.4])

In [None]:
if 'incompleted_data' in df:
    print('yes')

In [None]:
z = y.copy()
z[r] = x[r]

In [None]:
maska = np.random.choice([0, 1], size=df.data[3].size, p=[1-p_no, p_no])
maaask = np.array(maska, dtype=bool)

In [None]:
maaask

In [None]:
mu, sigma = 0, 0.2 # mean and standard deviation
# Same length as df.data[3], the array the boolean mask is built for
s = np.random.normal(mu, sigma, df.data[3].size)

In [None]:
s

In [None]:
franelas = df.data[3].copy()

In [None]:
franelas[maaask] = s[maaask]

In [None]:
mu, sigma = 0, 0.2 # mean and standard deviation

maska = np.random.choice([0, 1], size=df.data[3].size, p=[1-p_no, p_no])
maaask = np.array(maska, dtype=bool)

# The noise must have the same length as the mask and the data it is applied to
s = np.random.normal(mu, sigma, df.data[3].size)

franelas = df.data[3].copy()
franelas[maaask] = s[maaask]
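
The cell above can be packed into a function so that the mask, the noise and the data always have the same length. This is a sketch under assumptions: `replace_with_gaussian_noise` is a hypothetical name, `p_noise` plays the role of `p_no`, and the random numbers come from NumPy's `default_rng`.

```python
import numpy as np

def replace_with_gaussian_noise(data, p_noise, sigma=0.2, seed=None):
    """Replace a random fraction p_noise of the samples with N(0, sigma) noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(data.size) < p_noise      # True where a sample gets replaced
    noise = rng.normal(0.0, sigma, data.size)   # same length as the data
    out = data.copy()                           # keep the original untouched
    out[mask] = noise[mask]
    return out, mask

clean = np.zeros(10_000)
noisy, mask = replace_with_gaussian_noise(clean, p_noise=0.4, seed=1)
```

Returning the mask as well makes it possible to check afterwards which samples were replaced.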

In [None]:
franelas = df.data[3].copy()

# Round every sample to 8 decimals in one vectorized call
franelas = np.round(franelas, 8)
    
print(franelas)

wavfile.write('Rounding/test.wav', 16000, df.data[10])

In [None]:
wavfile.write('Rounding/test.wav', 16000, df.data[3])

In [None]:
df

In [None]:
plt.plot(df.data[0])

In [None]:
plt.plot(df.random_data[0])

In [None]:
plt.plot(df.incompleted_data[0])

In [None]:
plt.plot(df.low_quality_data[0])

In [None]:
df

---

Let's try some cool things to make an image out of a NumPy array

In [None]:
aud = df.data[5].copy()

In [None]:
fin = np.uint8(df.data[0]*255)

In [None]:
fin = fin[0:163216]

In [None]:
fin.shape

In [None]:
fin = np.reshape(fin, (-1, 404))

In [None]:
from PIL import Image
im = Image.fromarray(fin)

In [None]:
fin

In [None]:
from PIL import Image

fi = 10

aud  = df.data[fi].copy()
aud1 = df.random_data[fi].copy()
aud2 = df.incompleted_data[fi].copy()
aud3 = df.low_quality_data[fi].copy()

# TODO: change these transformations
fin  = np.uint8(aud*255)
fin1 = np.uint8(aud1*255)
fin2 = np.uint8(aud2*255)
fin3 = np.uint8(aud3*255)

fin  = fin[0:163216]
fin1 = fin1[0:163216]
fin2 = fin2[0:163216]
fin3 = fin3[0:81225]

fin  = np.reshape(fin, (-1, 404))
fin1 = np.reshape(fin1, (-1, 404))
fin2 = np.reshape(fin2, (-1, 404))
fin3 = np.reshape(fin3, (-1, 285))

im  = Image.fromarray(fin)
im1 = Image.fromarray(fin1)
im2 = Image.fromarray(fin2)
im3 = Image.fromarray(fin3)
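
The hard-coded lengths in the cell above are perfect squares (163216 = 404 × 404 and 81225 = 285 × 285), so each clip fits a square image. A small sketch that derives the crop from the array length instead of hard-coding it follows; `to_square_image` is a hypothetical helper, and it assumes Python 3.8+ for `math.isqrt`.

```python
import math
import numpy as np

def to_square_image(audio_u8):
    """Crop a 1-D array to the largest perfect-square length and reshape it (hypothetical helper)."""
    side = math.isqrt(audio_u8.size)            # largest integer whose square fits in the array
    return audio_u8[:side * side].reshape(side, side)

samples = np.zeros(163_500, dtype=np.uint8)     # stand-in for an encoded audio clip
img = to_square_image(samples)                  # shape (404, 404)
flat_again = img.reshape(-1)                    # back to a 1-D array of 163216 samples
```

The resulting 2-D `uint8` array can be passed to `Image.fromarray` as in the cell above.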

In [None]:
im

In [None]:
im1

In [None]:
im2

In [None]:
im3

In [None]:
# Keep it as a NumPy array: a plain list has no flatten() method
nuevo_array = np.asarray(im)

In [None]:
nuevo_array = nuevo_array.flatten()

In [None]:
nuevo_array = np.array(list(im.getdata()), dtype=np.int16)

In [None]:
#plt.plot(nuevo_array)
plt.plot(rans)

In [None]:
wavfile.write('from_img.wav', 16000, nuevo_array)

---

The other way to transform (by ranges)

In [None]:
rans = df.data[10].copy()

In [None]:
print(np.amin(rans), np.amax(rans))

In [None]:
rans.size

In [None]:
for x in range(0, rans.size):
    rans[x] = rans[x]*127.5 + 127.5

In [None]:
con = 0  # counts the samples that fall outside [-1, 1]

for x in range(0, rans.size):
    if rans[x] >= -1 and rans[x] <= 0:
        # [-1, 0] -> [0, 127]
        rans[x] = abs(rans[x])*127
        rans[x] = 127-rans[x]
        
    elif rans[x] > 0 and rans[x] <= 1:
        # (0, 1] -> (127, 254]
        rans[x] = rans[x]*127
        rans[x] = 127+rans[x]
    else:
        con += 1

In [None]:
rans  = rans[0:163216]  # 404*404 samples, so the reshape below divides evenly

rans  = np.reshape(rans, (-1, 404))

ims  = Image.fromarray(rans)

In [None]:
df

In [None]:
rans = abs(rans)

In [None]:
rans = rans[:162867]

In [None]:
abs_rans = np.reshape(abs(rans), (233, 233,  3))

In [None]:
abs_rans.shape

In [None]:
fig, ax = plt.subplots(figsize=(20, 20))
plt.imshow(abs_rans)
plt.axis('off')
plt.show()

In [None]:
abs_rans.shape

In [None]:
from matplotlib import pyplot as mp

mp.savefig('hoooooolaaa.png')