# GSheet to Wav2Vec2 Input
Rolando Coto-Solano (Rolando.A.Coto.Solano@dartmouth.edu)<br>
Dartmouth College. Last update: 20250601

The program takes two main inputs:

* `destinationSandbox`: The name of the sandbox you are using. The default sandboxes are `sandbox-user` and `all-wavs`, but you can use whichever name you specified during the installation.<br>
* `installationFolder`: The folder where the ASR sandboxes are contained. The default value is `202506-ood-asr`, but you should use the one you specified during the installation.<br>

The program also takes the following inputs:

* `useCodeSwitchedData`: The sandbox's spreadsheet has a column where you can manually specify if a row has code-switching. Here you can choose if you will include or ignore the rows that have been tagged as having code-switching. If you choose '0', the rows will be ignored during training. If you choose '1', the rows will be used.<br>
* `useDoutbfulData`: The sandbox's spreadsheet has a column where you can manually specify if there's something wrong with the row. Here you can choose to include or ignore those problematic rows. If you choose '0', the rows will be ignored during training. If you choose '1', the rows will be used.<br>
* `percentageTrainSet`: Percentage of the whole dataset to be used for the training of the model. The default is 80%.<br>
* `percentageValidSet`: Percentage of the dataset that will be used for the validation of the model during training. The default is 10%.<br>
* `percentageTestSet`: Percentage of the dataset that will be used for the final testing of the model. The default is 10%.<br>
* `maxWavDuration`: The maximum duration, in seconds, for a recording to be included in the Wav2Vec2 data. Long files can exhaust CUDA memory during Wav2Vec2 processing. The default is 15 seconds; this is the longest duration for which I can guarantee that Colab won't run out of memory.<br>
* `software`: One of {wav2vec2, ds}. The second option, "ds" (_DeepSpeech_), builds the files for DeepSpeech training.

The program takes the information from the Google Sheet and creates the necessary files for Wav2Vec2 training.

## (1) Fill out metadata

In [1]:
destinationSandbox = "sandbox-user"    # Please type sandbox-user or all-wavs
installationFolder = "202506-ood-asr"

useCodeSwitchedData = 0
useDoutbfulData = 0

percentageTrainSet = 80
percentageValidSet = 10
percentageTestSet = 10

maxWavDuration = 15    # Seconds. Longest duration allowed for a file in the dataset; long files can exhaust CUDA memory during Wav2Vec2 processing
software = "wav2vec2"   # wav2vec2 or ds. Other values (whisper, wavlm) are not handled by writeCSVFile below

urlSandbox = "https://docs.google.com/spreadsheets/d/1L2NkTb6LGHtxZAShYZ48ms4zJt5bRt17_e7-2hngPKQ/edit?usp=sharing"

filePrefix = "ood-wav2vec2"

pathAudioFilesInTraining = "/content/drive/MyDrive/"+installationFolder+"/" + destinationSandbox + "/wav/"  # Folder with split WAV files

In [2]:
if (percentageTrainSet + percentageValidSet + percentageTestSet) != 100:
  print("WARNING, YOUR TRAIN/VALIDATION/TEST PARTITION DOES NOT ADD UP TO 100 PERCENT. PLEASE VERIFY")
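The check above only prints a warning, so the rest of the notebook still runs with a partition that does not add up. A stricter variant could stop execution instead. The following is a minimal sketch, assuming a hypothetical helper named `checkPartition` that is not part of the original notebook:

```python
def checkPartition(train, valid, test):
    """Raise ValueError unless the three percentages are non-negative and add up to 100."""
    parts = {"train": train, "valid": valid, "test": test}
    negative = [name for name, value in parts.items() if value < 0]
    if negative:
        raise ValueError("Negative percentage for: " + ", ".join(negative))
    total = train + valid + test
    if total != 100:
        raise ValueError("Train/validation/test percentages add up to " + str(total) + ", not 100")
    return True

checkPartition(80, 10, 10)   # passes silently with the notebook defaults
```

Raising an exception halts the cell, which makes it harder to generate CSV files from a mistyped partition by accident.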

## (2) Prepare libraries and access to Google Drive

In [3]:
# Load other libraries
import pandas as pd
import random

# Load libraries for access to Google Spreadsheets
from google.colab import auth
auth.authenticate_user()
import gspread
from google.colab import drive

# It needs this permission to access the ASR spreadsheets in your GDrive
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)

In [4]:
# It needs this permission to read and write ASR files into your GDrive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [5]:
# Use this if you have to remount the drive
# You need to do this if you update the files in the Google Drive while
# you are executing this notebook.
drive.mount('/content/drive/',force_remount=True)
#gc = gspread.authorize(GoogleCredentials.get_application_default())
gc = gspread.authorize(creds)

Mounted at /content/drive/


In [6]:
#==============================================
# Read in sandbox information
#==============================================

savePath = "/content/drive/MyDrive/"+installationFolder+"/" + destinationSandbox + "/" # This is where the metadata files will be stored
gsheetURL = urlSandbox

### Support functions

In [7]:
def saveFile(string, path):
  # The with-statement closes the file even if the write fails
  with open(path, "w") as f:
    f.write(string)

## (3) Open Google Spreadsheet

In [8]:
# Open Google Spreadsheet

wb = gc.open_by_url(gsheetURL)
sheet = wb.worksheet('wav-metadata')
rows = sheet.get_all_values()

metadata = pd.DataFrame(rows)

# The first spreadsheet row holds the column names: use it as the header
# and drop it from the data
metadata.columns = metadata.iloc[0]
metadata = metadata.iloc[1:]

if (useCodeSwitchedData == 0):
  metadata = metadata[metadata['codeSwitch'] != "1"]

if (useDoutbfulData == 0):
  metadata = metadata[metadata['needsFurtherCheck'] != "1"]

metadata = metadata[metadata['transcript'] != ""]

# The sheet may store durations with a decimal comma (e.g. 3,52), so replace
# the comma with a point before converting the text to numbers
metadata['duration_seconds'] = pd.to_numeric(metadata['duration_seconds'].str.replace(',', '.'))
metadata = metadata[metadata['duration_seconds'] <= maxWavDuration]

print(metadata)

0                 wav_filename dataProcessedBy dateAdded speakerPrefix gender  \
1      BPT-RRBPKB1-bpt-001.wav            user  20250602           BPT      m   
2      BPT-RRBPKB1-bpt-002.wav            user  20250602           BPT      m   
3      BPT-RRBPKB1-bpt-003.wav            user  20250602           BPT      m   
4      BPT-RRBPKB1-bpt-004.wav            user  20250602           BPT      m   
5      BPT-RRBPKB1-bpt-005.wav            user  20250602           BPT      m   
..                         ...             ...       ...           ...    ...   
722  MSC-RRMSAvaikiP24-184.wav            user  20250602           MSC      f   
723  MSC-RRMSAvaikiP24-185.wav            user  20250602           MSC      f   
724  MSC-RRMSAvaikiP24-186.wav            user  20250602           MSC      f   
725  MSC-RRMSAvaikiP24-187.wav            user  20250602           MSC      f   
726  MSC-RRMSAvaikiP24-188.wav            user  20250602           MSC      f   

0   wav_filesize  duration_
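The filtering in this cell can be tried out without access to the spreadsheet. The sketch below applies the same four filters to a handful of made-up rows. The function name `filterMetadata`, its parameter names and the toy rows are assumptions for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy rows standing in for sheet.get_all_values(); every cell is a string
rows = [
    ["wav_filename", "transcript", "codeSwitch", "needsFurtherCheck", "duration_seconds"],
    ["a.wav", "text one",   "",  "",  "3,52"],   # decimal comma, kept
    ["b.wav", "text two",   "1", "",  "2.10"],   # tagged as code-switched
    ["c.wav", "",           "",  "",  "4.00"],   # empty transcript
    ["d.wav", "text four",  "",  "1", "5.00"],   # tagged as doubtful
    ["e.wav", "text five",  "",  "",  "21.7"],   # longer than the limit
]

def filterMetadata(rows, useCodeSwitched=0, useDoubtful=0, maxDuration=15):
    # First row is the header, the rest is data
    df = pd.DataFrame(rows[1:], columns=rows[0])
    if useCodeSwitched == 0:
        df = df[df["codeSwitch"] != "1"]
    if useDoubtful == 0:
        df = df[df["needsFurtherCheck"] != "1"]
    # Copy before adding a column so pandas does not warn about a slice
    df = df[df["transcript"] != ""].copy()
    df["duration_seconds"] = pd.to_numeric(
        df["duration_seconds"].str.replace(",", ".", regex=False))
    return df[df["duration_seconds"] <= maxDuration]

kept = filterMetadata(rows)
print(kept["wav_filename"].tolist())
```

With the default settings only `a.wav` survives. Setting both flags to 1 also keeps `b.wav` and `d.wav`, while the empty transcript and the overlong file are always dropped.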

## (4) Generate Wav2Vec2 files for the whole corpus

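`ood-wav2vec2-train.csv` : Training samples<br>
`ood-wav2vec2-valid.csv` : Validation samples<br>
`ood-wav2vec2-test.csv` : Testing samples

The file names start with the value of `filePrefix`. By default the split is 80-10-10. You can change it with the `percentageTrainSet`, `percentageValidSet` and `percentageTestSet` variables at the beginning of this notebook.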

In [9]:
longfilenames = []
sentences = []
listOfNumbers = []
filesizes = []
j = 0

for i, r in metadata.iterrows():
  longfilenames.append(pathAudioFilesInTraining + r['wav_filename'])
  sentences.append(r['transcript'])
  filesizes.append(r['wav_filesize'])
  listOfNumbers.append(j)
  j = j+1

random.shuffle(listOfNumbers)

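# Position in the shuffled list where the training partition ends
samplesUpToTrainPartition = int(round(len(listOfNumbers) * (percentageTrainSet/100),0))
# Position where the validation partition ends and the test partition begins
samplesUpToTestPartition = int(round(len(listOfNumbers) * ((percentageTrainSet + percentageValidSet)/100),0))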

trainPath = []
trainText = []
trainSize = []
validPath = []
validText = []
validSize = []
testPath = []
testText = []
testSize = []

counterTrain = 0
counterValid = 0
counterTest = 0

for i in range(0,samplesUpToTrainPartition):
  counterTrain = counterTrain+1
  trainPath.append(longfilenames[listOfNumbers[i]])
  trainText.append(sentences[listOfNumbers[i]])
  trainSize.append(filesizes[listOfNumbers[i]])

for i in range(samplesUpToTrainPartition,samplesUpToTestPartition):
  counterValid = counterValid + 1
  validPath.append(longfilenames[listOfNumbers[i]])
  validText.append(sentences[listOfNumbers[i]])
  validSize.append(filesizes[listOfNumbers[i]])

for i in range(samplesUpToTestPartition,len(listOfNumbers)):
  counterTest = counterTest + 1
  testPath.append(longfilenames[listOfNumbers[i]])
  testText.append(sentences[listOfNumbers[i]])
  testSize.append(filesizes[listOfNumbers[i]])

print("Training samples:   " + str(counterTrain))
print("Validation samples: " + str(counterValid))
print("Test samples:       " + str(counterTest))

Training samples:   580
Validation samples: 72
Test samples:       73
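These counts follow from the rounding in the cell above. With 725 usable rows, 725 × 0.8 = 580 marks the end of the training partition, and 725 × 0.9 = 652.5 marks the end of the validation partition. Python's `round()` resolves that tie to the even neighbour, 652, which leaves 72 validation and 73 test samples. The sketch below reproduces the arithmetic in a compact form. The function name `splitIndices` and the `seed` parameter are assumptions: the notebook does not seed the shuffle, so its partition changes on every run.

```python
import random

def splitIndices(numSamples, percentageTrain, percentageValid, seed=None):
    """Shuffle 0..numSamples-1 and cut the list into train/valid/test parts."""
    indices = list(range(numSamples))
    # A dedicated Random instance keeps the shuffle repeatable for a given seed
    random.Random(seed).shuffle(indices)
    endTrain = int(round(numSamples * (percentageTrain / 100), 0))
    endValid = int(round(numSamples * ((percentageTrain + percentageValid) / 100), 0))
    return indices[:endTrain], indices[endTrain:endValid], indices[endValid:]

train, valid, test = splitIndices(725, 80, 10, seed=42)
print(len(train), len(valid), len(test))
```

Fixing a seed is one way to get the same partition back when the files need to be regenerated, for example to compare models trained on identical data.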


In [10]:
#===========================================================================
# Write files as CSVs
#===========================================================================

import csv

def writeCSVFile(header, software, inPath, inSentences, inSizes, inFilename):

  # csv.writer quotes any field that contains a comma, a quotation mark or a
  # line break, so punctuation in a transcript cannot shift the columns
  with open(inFilename, "w", newline="") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(header.split(","))
    for i in range(0,len(inPath)):
      if (software == "wav2vec2"):
        writer.writerow([inPath[i], inSentences[i]])
      elif (software == "ds"):
        writer.writerow([inPath[i], inSizes[i], inSentences[i]])

header = ""

if (software == "wav2vec2"):
  header = "path,sentence"
elif (software == "ds"):
  header = "wav_filename,wav_filesize,transcript"

#filenameTrain = savePath + software+"-train.csv"
#filenameValid = savePath + software+"-valid.csv"
#filenameTest =  savePath + software+"-test.csv"

filenameTrain = savePath + filePrefix+"-train.csv"
filenameValid = savePath + filePrefix+"-valid.csv"
filenameTest =  savePath + filePrefix+"-test.csv"


print(filenameTrain)
print(filenameValid)
print(filenameTest)

writeCSVFile(header, software, trainPath, trainText, trainSize, filenameTrain)
writeCSVFile(header, software, validPath, validText, validSize, filenameValid)
writeCSVFile(header, software, testPath, testText, testSize, filenameTest)

/content/drive/MyDrive/202506-ood-asr/sandbox-user/ood-wav2vec2-train.csv
/content/drive/MyDrive/202506-ood-asr/sandbox-user/ood-wav2vec2-valid.csv
/content/drive/MyDrive/202506-ood-asr/sandbox-user/ood-wav2vec2-test.csv
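Before training, it can be worth confirming that every row in the generated files has the same number of fields as the header, since a transcript containing an unquoted comma would show up as a row with an extra field. The following is a minimal sketch of such a check. The function name `checkCSV` and the sample file contents are hypothetical, and the demonstration writes to a temporary file rather than to the Drive paths above.

```python
import csv
import os
import tempfile

def checkCSV(path):
    """Return (number of data rows, positions of rows whose field count
    differs from the header). The header counts as row 1."""
    badRows = []
    numRows = 0
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        for rowNumber, row in enumerate(reader, start=2):
            numRows += 1
            if len(row) != len(header):
                badRows.append(rowNumber)
    return numRows, badRows

# Demonstration on a made-up file: the second data row has an unquoted comma
sample = "path,sentence\n/x/a.wav,first sentence\n/x/b.wav,second, with a comma\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
result = checkCSV(tmp.name)
os.remove(tmp.name)
print(result)
```

On the sample file the function reports two data rows and flags row 3. On the real files, calling `checkCSV(filenameTrain)` and its validation and test counterparts should return an empty list of flagged rows.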
